## Similarity Calculation Metrics [in Python]

### (1)	Levenshtein Distance

### Definition

Metric for measuring the difference between two sequences. 

The Levenshtein distance between two words is the **minimum number of single-character edits** (i.e. insertions, deletions or substitutions) **required** to change one word into the other.
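To make the definition concrete, here is a minimal pure-Python sketch of the usual dynamic-programming approach. The function name `levenshtein` is chosen for illustration only and does not belong to any library.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] = distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

print(levenshtein('kitten', 'sitting'))  # 3
```

Only two rows of the table are kept in memory at a time, so the memory use grows with the length of `b` rather than with the product of both lengths.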

### Implementation

`pip install python-Levenshtein`

In [1]:
import Levenshtein

Levenshtein.distance('Levenst','Lefensd')

2

### (2)	Cosine Similarity

### Definition

Metric for measuring the similarity between two sentences.

The Cosine similarity is a metric for the similarity between two non-zero vectors of an inner product space. It measures the **cosine of the angle between them**.
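Written out for two vectors $A$ and $B$ with $n$ components (the symbols are chosen here for illustration), the definition reads:

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

For word-count vectors all components are non-negative, so the value lies between 0 and 1.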

### Implementation

1. Create a list of all sentences for which similarity should be calculated.

2. Construct a count matrix with dimensions m x n, where

   ... m is the number of sentences to be compared,

   ... n is the number of unique words appearing in all sentences, and

   ... each entry counts how often a word appears in a sentence (0 = not at all, 1 = once, 2 = twice, etc.)


3. Construct a cosine-similarity matrix with dimensions m x m, where

   ... m is the number of sentences and

   ... the values are the pairwise cosine similarities, i.e. how similar two sentences are (between 0 = not similar and 1 = exactly the same)
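The three steps can also be sketched by hand with the standard library only, which may help to see what the matrices contain. This is an illustrative sketch, not the approach used in the rest of this notebook: the names `docs`, `tokens`, `vocabulary`, `counts` and `cosine` are made up for this example, and the regular expression is an assumption chosen to mimic scikit-learn's default of ignoring one-letter words.

```python
import math
import re
from collections import Counter

# Step 1: list of sentences (already lowercased and without punctuation)
docs = ['this is a sentence', 'this is another sentence', 'is this one different']

# keep only words with at least two characters
tokens = [re.findall(r'\b\w\w+\b', doc) for doc in docs]

# Step 2: count matrix with m rows (sentences) and n columns (unique words)
vocabulary = sorted({word for doc in tokens for word in doc})
counts = [[Counter(doc)[word] for word in vocabulary] for doc in tokens]

# Step 3: cosine similarity matrix with m rows and m columns
def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

similarities = [[cosine(u, v) for v in counts] for u in counts]

print(vocabulary)
for row in similarities:
    print([round(value, 3) for value in row])
```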

List of sentences to be compared:

In [2]:
sentences = ['This is a sentence.',
            'This is another sentence.',
            'Is this one different?']

In [3]:
#lowercase and remove punctuation
import string

sentences = [s.lower().translate(str.maketrans('', '', string.punctuation)) for s in sentences]
print(sentences) #clean list

['this is a sentence', 'this is another sentence', 'is this one different']


Create the count matrix:

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer #for creating count vectors

count_vectorizer = CountVectorizer() #the default tokenizer drops one-letter words such as "a"
count_matrix = count_vectorizer.fit_transform(sentences) #sparse matrix, one row per sentence
count_matrix = pd.DataFrame(count_matrix.toarray(),
                  columns=count_vectorizer.get_feature_names_out(), #scikit-learn >= 1.0; older versions use get_feature_names()
                  index=['sent_1', 'sent_2', 'sent_3'])
count_matrix

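|        | another | different | is | one | sentence | this |
|--------|---------|-----------|----|-----|----------|------|
| sent_1 | 0       | 0         | 1  | 0   | 1        | 1    |
| sent_2 | 1       | 0         | 1  | 0   | 1        | 1    |
| sent_3 | 0       | 1         | 1  | 1   | 0        | 1    |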


Create cosine similarity matrix:

In [5]:
from sklearn.metrics.pairwise import cosine_similarity #for calculating cosine similarity

co_sim = pd.DataFrame(cosine_similarity(count_matrix, count_matrix))
co_sim

|   | 0        | 1        | 2       |
|---|----------|----------|---------|
| 0 | 1.0      | 0.866025 | 0.57735 |
| 1 | 0.866025 | 1.0      | 0.5     |
| 2 | 0.57735  | 0.5      | 1.0     |


### (3)	Soft Cosine Measure

### Definition

Metric for measuring the similarity between two sentences that also gives higher scores to words with similar meaning. For example, ‘President’ vs ‘Prime minister’, ‘Food’ vs ‘Dish’, ‘Hi’ vs ‘Hello’ should be considered similar.

Here, the words are converted into their respective word vectors, and the similarities between these vectors are then computed.
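As a rough illustration of the idea, the soft cosine replaces the plain dot product with one that is weighted by a word-to-word similarity matrix. The sketch below uses plain Python lists; the function name `soft_cosine`, the two-word vocabulary and the similarity value 0.8 are all made up for this example and do not come from a real embedding model.

```python
import math

def soft_cosine(a, b, S):
    """Soft cosine between count vectors a and b, given a word-similarity matrix S."""
    def inner(u, v):
        # generalized inner product: every pair of words (i, j) contributes S[i][j] * u[i] * v[j]
        return sum(S[i][j] * u[i] * v[j] for i in range(len(u)) for j in range(len(v)))
    return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))

# vocabulary: ['hi', 'hello']
identity = [[1.0, 0.0],
            [0.0, 1.0]]  # no word is similar to any other word
S = [[1.0, 0.8],
     [0.8, 1.0]]         # 0.8 is a hypothetical similarity between 'hi' and 'hello'

hi, hello = [1, 0], [0, 1]
print(soft_cosine(hi, hello, identity))  # 0.0, same as the plain cosine similarity
print(soft_cosine(hi, hello, S))         # 0.8, the related words now count
```

With the identity matrix the measure reduces to the ordinary cosine similarity, which is why sentences without shared words score 0 there but not under the soft cosine.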

<img src="https://www.machinelearningplus.com/wp-content/uploads/2018/10/soft-cosine.png" alt="Comparison between 3-dimensional cosine similarity and soft cosine measure" title="Cosine Similarity vs. Soft Cosine Measure" />

### Implementation

To compute soft cosines, we need to create

1. a dictionary that maps each word to a unique ID
2. the corpus, i.e. the word counts for each sentence
3. the similarity matrix

To get the word vectors, we need a word embedding model. Let's download the `FastText` model using gensim's downloader API.

In [None]:
import gensim
from gensim.matutils import softcossim #requires gensim 3.x; softcossim was removed in gensim 4.0
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess

In [10]:
#download the FastText model - about 960MB
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')

Prepare a dictionary: it automatically extracts all the unique words from the sentences and maps each one to an integer ID.

In [44]:
dictionary = corpora.Dictionary([simple_preprocess(doc) for doc in sentences])

Prepare the similarity matrix:

In [32]:
similarity_matrix = fasttext_model300.similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)

Convert the sentences into bag-of-words vectors:

In [38]:
def convert(sentences):
    #turn each sentence into a list of (word_id, count) tuples
    return [dictionary.doc2bow(simple_preprocess(s)) for s in sentences]

In [39]:
sent = convert(sentences)
print(sent)

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (1, 1), (2, 1), (3, 1)], [(0, 1), (2, 1), (4, 1), (5, 1)]]


Define a function that builds the soft cosine similarity matrix:

In [41]:
import pandas as pd

def create_soft_cossim_matrix(bow_sentences):
    #pairwise soft cosine similarity between all bag-of-words vectors, rounded to 2 decimals
    n = len(bow_sentences)
    cossim_mat = pd.DataFrame([[round(softcossim(bow_sentences[i], bow_sentences[j], similarity_matrix), 2) for j in range(n)] for i in range(n)])
    return cossim_mat

create_soft_cossim_matrix(sent)

|   | 0    | 1    | 2    |
|---|------|------|------|
| 0 | 1.0  | 0.94 | 0.78 |
| 1 | 0.94 | 1.0  | 0.84 |
| 2 | 0.78 | 0.84 | 1.0  |


## Analysis and comparison

#### The sentences we compared:

1. 'This is a sentence.'    
2. 'This is another sentence.'    
3. 'Is this one different?'

#### Metric comparison

Remark: normally, we would remove `Stopwords` from the sentences before analyzing them. Stopwords are words like "I", "me", "this", "a", "is", "for", etc., which say little about how similar two sentences really are. In our example, we did not remove them, which gives us a higher similarity for the last sentence.
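The effect of the remark can be checked with a small standard-library sketch. The stopword set below is hand-picked for this example only (real lists, such as the ones shipped with NLTK or scikit-learn, are much longer), and the helper names `similarity` and `STOPWORDS` are hypothetical.

```python
import math
import re
from collections import Counter

# tiny illustrative stopword list, not a real one
STOPWORDS = {'this', 'is', 'a'}

def similarity(doc_1, doc_2, stopwords=frozenset()):
    """Cosine similarity of two sentences based on word counts, ignoring the given stopwords."""
    # keep words with at least two characters that are not stopwords
    words_1 = Counter(w for w in re.findall(r'\b\w\w+\b', doc_1.lower()) if w not in stopwords)
    words_2 = Counter(w for w in re.findall(r'\b\w\w+\b', doc_2.lower()) if w not in stopwords)
    dot = sum(words_1[w] * words_2[w] for w in words_1)
    norm_1 = math.sqrt(sum(c * c for c in words_1.values()))
    norm_2 = math.sqrt(sum(c * c for c in words_2.values()))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0  # a sentence made only of stopwords has nothing left to compare
    return dot / (norm_1 * norm_2)

first, last = 'This is a sentence.', 'Is this one different?'
print(round(similarity(first, last), 3))             # stopwords kept
print(round(similarity(first, last, STOPWORDS), 3))  # stopwords removed
```

Once "this" and "is" are ignored, the first and the last sentence no longer share any word, so their similarity drops to zero.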

- Levenshtein distance is good for finding typos in words or short sentences

- Cosine similarity only counts words that match exactly:
    - for our example, this results in lower similarity scores for the three sentences
    - this metric fits the problem of plagiarism testing very well.

In [52]:
co_sim

|   | 0        | 1        | 2       |
|---|----------|----------|---------|
| 0 | 1.0      | 0.866025 | 0.57735 |
| 1 | 0.866025 | 1.0      | 0.5     |
| 2 | 0.57735  | 0.5      | 1.0     |


- Soft cosine measure takes into account the meaning of the words
    - here the similarity scores are much higher, since words with related meanings also contribute
    - this metric fits the problem of text categorization or clustering

In [51]:
create_soft_cossim_matrix(sent)

|   | 0    | 1    | 2    |
|---|------|------|------|
| 0 | 1.0  | 0.94 | 0.78 |
| 1 | 0.94 | 1.0  | 0.84 |
| 2 | 0.78 | 0.84 | 1.0  |
