### Levenshtein distance

In [1]:
#pip install python-Levenshtein

In [2]:
import Levenshtein

In [3]:
Sent_1 = 'Hello Wordl'
Sent_2 = "Hello World"

The Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.

In [4]:
Levenshtein.distance(Sent_1, Sent_2)

2
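
To make the definition concrete, here is a minimal pure-Python sketch of the usual dynamic-programming recurrence. The function name `levenshtein` is only illustrative, and this is not necessarily how the `Levenshtein` package computes the distance internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using the standard two-row dynamic-programming table."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]

print(levenshtein('Hello Wordl', 'Hello World'))
```

The swapped `d` and `l` cost two edits rather than one because plain Levenshtein distance has no transposition operation, so each of the two characters is counted as a separate substitution.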

In [5]:
Sent_3 = 'This is my sentence'
Sent_4 = 'This sentence is similar to my sentence'

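Levenshtein distance is a poor fit for comparing longer sentences: because it counts character edits, two sentences that convey similar information can still be far apart. Here the distance is 20 even though the second sentence contains every word of the first.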

In [6]:
Levenshtein.distance(Sent_3, Sent_4)

20
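
One way to see where the 20 comes from: the shorter sentence can be obtained from the longer one by deleting characters only, so the distance is simply the difference in length. The sketch below checks this with a hypothetical helper, `is_subsequence`, which is not part of any library used here.

```python
def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    # `ch in remaining` consumes the iterator up to the first match
    return all(ch in remaining for ch in short)

sent_3 = 'This is my sentence'
sent_4 = 'This sentence is similar to my sentence'

print(is_subsequence(sent_3, sent_4))
print(len(sent_4) - len(sent_3))
```

When one string is a subsequence of the other, the edit distance equals the length difference, since every edit is an insertion of a missing character.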

To measure similarity between longer sentences, we can use cosine similarity instead.

### Cosine Similarity

Cosine similarity measures how similar two non-zero vectors are by the cosine of the angle between them: the dot product of the vectors divided by the product of their lengths.
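
As a rough illustration of that definition, the following sketch computes the cosine directly in plain Python. The name `cosine` is an assumption for this example; in practice a library routine is usually preferable.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [1, 0]))  # identical direction
print(cosine([1, 0], [0, 1]))  # orthogonal vectors
print(cosine([1, 1], [2, 2]))  # same direction, different length
```

Because only the angle matters, scaling a vector does not change the result, which is why a longer sentence that repeats the same words can still score close to 1.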

In [7]:
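import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.corpus import stopwords

# Requires the NLTK stopword corpus; run nltk.download('stopwords') once if it is missing.
# Note: this rebinds the name `stopwords` from the corpus reader to a plain list of words.
stopwords = stopwords.words('english')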

In [8]:
sentences = [
    'This is a foo bar sentence!!',
    'This sentence is similar to a foo bar sentence.',
    'This is another string, but it is not quite similar to the previous ones.',
    'I am also just another string.'
]

In [9]:
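def clean_string(text):
    # Drop punctuation; iterating over a string yields single characters
    text = ''.join([char for char in text if char not in string.punctuation])
    text = text.lower()
    # Remove English stopwords such as 'this', 'is', 'a'
    text = ' '.join([word for word in text.split() if word not in stopwords])
    return text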

cleaned = list(map(clean_string, sentences))

cleaned

['foo bar sentence',
 'sentence similar foo bar sentence',
 'another string quite similar previous ones',
 'also another string']

Calculating count vectors: each row is a sentence and each column counts the occurrences of one vocabulary word.

In [10]:
count_vectors = CountVectorizer().fit_transform(cleaned).toarray()

count_vectors

array([[0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 2, 1, 0],
       [0, 1, 0, 0, 1, 1, 1, 0, 1, 1],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]])
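
The array above can be reproduced by hand, which may help show what each column means. This is a simplified sketch that assumes plain whitespace tokenization and an alphabetically sorted vocabulary; `CountVectorizer` applies its own tokenization rules, which happen to give the same result on these cleaned strings.

```python
from collections import Counter

cleaned = [
    'foo bar sentence',
    'sentence similar foo bar sentence',
    'another string quite similar previous ones',
    'also another string',
]

# Every distinct token, sorted alphabetically to fix the column order
vocabulary = sorted({token for doc in cleaned for token in doc.split()})

def count_vector(doc):
    """Number of times each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

manual_counts = [count_vector(doc) for doc in cleaned]

print(vocabulary)
for row in manual_counts:
    print(row)
```

The `2` in the second row corresponds to the word `sentence`, which appears twice in the second cleaned string.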

In [11]:
count_csim = cosine_similarity(count_vectors)

count_csim

array([[1.        , 0.87287156, 0.        , 0.        ],
       [0.87287156, 1.        , 0.15430335, 0.        ],
       [0.        , 0.15430335, 1.        , 0.47140452],
       [0.        , 0.        , 0.47140452, 1.        ]])

In [12]:
def cosine_sim_vectors(vec1, vec2):
    # cosine_similarity expects 2D arrays, so turn each 1D vector into a single row
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)

    return cosine_similarity(vec1, vec2)[0][0]

In [13]:
cosine_sim_vectors(count_vectors[0], count_vectors[1])

0.8728715609439696

In [14]:
cosine_similarity([count_vectors[0]], [count_vectors[1]])[0][0]

0.8728715609439696

Calculating TF-IDF weights, which down-weight words that appear in many of the sentences.

In [15]:
tfidf_vectors = TfidfTransformer().fit_transform(count_vectors).toarray()

tfidf_vectors

array([[0.        , 0.        , 0.57735027, 0.57735027, 0.        ,
        0.        , 0.        , 0.57735027, 0.        , 0.        ],
       [0.        , 0.        , 0.37796447, 0.37796447, 0.        ,
        0.        , 0.        , 0.75592895, 0.37796447, 0.        ],
       [0.        , 0.35745504, 0.        , 0.        , 0.4533864 ,
        0.4533864 , 0.4533864 , 0.        , 0.35745504, 0.35745504],
       [0.66767854, 0.52640543, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.52640543]])
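
To show where these weights come from, the sketch below recomputes them by hand. It assumes the `TfidfTransformer` defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the variable names are illustrative only.

```python
import math

count_vectors = [
    [0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 2, 1, 0],
    [0, 1, 0, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
]

n_docs = len(count_vectors)
n_terms = len(count_vectors[0])

# Document frequency: in how many sentences each word occurs at least once
df = [sum(1 for row in count_vectors if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]

manual_tfidf = [tfidf_row(row) for row in count_vectors]

for row in manual_tfidf:
    print([round(x, 8) for x in row])
```

Every word in the first two sentences occurs in exactly two sentences, so they all receive the same IDF weight. After normalization that common factor cancels, which is why the TF-IDF cosine similarity between those two sentences matches the count-based value.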

In [16]:
tfidf_csim = cosine_similarity(tfidf_vectors)

tfidf_csim

array([[1.        , 0.87287156, 0.        , 0.        ],
       [0.87287156, 1.        , 0.13510531, 0.        ],
       [0.        , 0.13510531, 1.        , 0.37633255],
       [0.        , 0.        , 0.37633255, 1.        ]])

In [17]:
cosine_similarity([tfidf_vectors[0]], [tfidf_vectors[1]])[0][0]

0.8728715609439697