# Cosine Similarity

Cosine similarity is a frequently used measure of how (dis)similar two documents (e.g., social media posts, news media articles, blogs) are.

Mathematically, we write: 


$$
\text { similarity }=\cos (\theta)=\frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}
$$
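
To build some intuition for the formula, here is a minimal sketch with two made-up two-dimensional vectors, chosen so that the angle is easy to verify by hand. The function name `cosine` and the vectors `a` and `b` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (lists of numbers)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two made-up vectors: one along the x-axis, one along the diagonal
a = [1, 0]
b = [1, 1]

cos_theta = cosine(a, b)
angle = math.degrees(math.acos(cos_theta))
print(f"cos(theta) = {cos_theta:.4f}, theta = {angle:.1f} degrees")
```

The diagonal vector sits at 45 degrees from the x-axis, so the similarity is about 0.71. Vectors pointing in the same direction score 1, and orthogonal vectors score 0.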


Next, we walk through an example application in Python, calculating the similarity between short strings.

In [1]:
import math
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import numpy as np

doc1 = "When I eat breakfast, I usually drink some tea".lower()
doc2 = "I like my tea with my breakfast".lower()
doc3 = "She likes cereal and coffee".lower()




Next, we need to transform the textual data to vector representations (that is, move from words to numbers). There are different ways to do this. Here, we apply `CountVectorizer`.
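
To make the step from words to numbers concrete, the following is a minimal sketch of count-based vectorization done by hand. It is a simplification and not how `CountVectorizer` is implemented: it assumes the documents are already tokenized and it does not remove stop words. The names `toy_docs`, `vocabulary`, and `toy_vectors` are made up for this illustration.

```python
from collections import Counter

# Toy, already-tokenized documents (made up for this illustration)
toy_docs = [
    ["tea", "breakfast", "tea"],
    ["coffee", "breakfast"],
]

# The vocabulary is the sorted set of all tokens; each token gets one column
vocabulary = sorted({token for doc in toy_docs for token in doc})

# Each document becomes a vector of token counts, one entry per vocabulary term
toy_vectors = []
for doc in toy_docs:
    counts = Counter(doc)
    toy_vectors.append([counts[term] for term in vocabulary])

print(vocabulary)   # column order of the vectors
print(toy_vectors)  # one count vector per document
```

Every document ends up with the same number of entries (one per vocabulary term), which is what makes the vectors comparable.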

In [2]:
vec = CountVectorizer(stop_words='english')
count_matrix = vec.fit_transform([doc1, doc2, doc3])

In the following code snippet, we transform the sparse output to a dense `DataFrame` **for educational purposes**. Specifically, this allows you to inspect what the data looks like. Please don't do this when you work with large data, as forcing a large dataset from a sparse to a dense format is very memory inefficient.
 

In [3]:
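# `toarray()` and `get_feature_names_out()` replace `.A` and `get_feature_names()`, which recent SciPy / scikit-learn releases no longer provide
print(pd.DataFrame(count_matrix.toarray(), columns=vec.get_feature_names_out()).to_string())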

   breakfast  cereal  coffee  drink  eat  like  likes  tea  usually
0          1       0       0      1    1     0      0    1        1
1          1       0       0      0    0     1      0    1        0
2          0       1       1      0    0     0      1    0        0


### 1. Calculate Cosine Similarity from scratch
That is, without the help of a ready-made similarity function.

First, we convert each row (= document) to a one-dimensional array (vector).

In [4]:
# Convert the sparse matrix to a dense array once; each row is one document
dense_counts = count_matrix.toarray()
doc1_vector, doc2_vector, doc3_vector = dense_counts.tolist()

print(f"The vector belonging to doc1: {doc1_vector}")
print(f"The vector belonging to doc2: {doc2_vector}")
print(f"The vector belonging to doc3: {doc3_vector}")

The vector belonging to doc1: [1, 0, 0, 1, 1, 0, 0, 1, 1]
The vector belonging to doc2: [1, 0, 0, 0, 0, 1, 0, 1, 0]
The vector belonging to doc3: [0, 1, 1, 0, 0, 0, 1, 0, 0]


Now, let's populate the formula.

1. Compute the numerator, i.e., the dot product of the two vectors:
$$
\sum_{i=1}^{n} A_{i} B_{i}
$$

In [5]:
dot_product = sum([num1 * num2 for num1, num2 in zip(doc1_vector, doc2_vector)])
print(dot_product)

2


2. Compute the denominator, i.e., the product of the lengths (Euclidean norms) of the two vectors:

$$
\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}
$$

In [6]:
import math
# Euclidean length (L2 norm) of each document vector
doc1_ = math.sqrt(sum([i**2 for i in doc1_vector]))
doc2_ = math.sqrt(sum([i**2 for i in doc2_vector]))

3. Finally, divide the numerator by the denominator:

In [7]:
cos_sim = dot_product / (doc1_ * doc2_)

print(f"We calculated cosine similarity between the following documents:\n---\n{doc1}\n---\n{doc2}\n---\nSimilarity is:\n\n\n{cos_sim}")

We calculated cosine similarity between the following documents:
---
when i eat breakfast, i usually drink some tea
---
i like my tea with my breakfast
---
Similarity is:


0.5163977794943222
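
As a sanity check, the same number can be worked out by hand from the vectors printed earlier. The vector of doc1 has five entries equal to 1, the vector of doc2 has three, and the two documents share two terms (`breakfast` and `tea`):

$$
\cos (\theta)=\frac{2}{\sqrt{5} \sqrt{3}}=\frac{2}{\sqrt{15}} \approx 0.516
$$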


### 2. Calculate Cosine Similarity using `sklearn`

We can also do this using `sklearn`'s `cosine_similarity`. Let's validate our results.

In [9]:
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity([doc1_vector, doc2_vector, doc3_vector]))

[[1.         0.51639778 0.        ]
 [0.51639778 1.         0.        ]
 [0.         0.         1.        ]]


<u>Question</u> 
<br>
<br>
<div class="alert-info">
What is the similarity score between doc1 and doc3? Does that make sense to you?
</div>

# Soft-Cosine Similarity

In [21]:
import gensim
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import WordEmbeddingSimilarityIndex
print(gensim.__version__)

from gensim.corpora import Dictionary
import numpy as np

4.1.2


### 1. Load a pre-trained embedding model

First, we need to load an embedding model. There are several pre-trained models available, in multiple languages. Let's try the `fasttext-wiki-news-subwords-300` model.

<div class="alert-danger">
Loading this model may take some time.
</div>

If the download fails, try turning off your VPN; it sometimes interferes with the download.

In [12]:
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')

### 2. Create a dictionary

We need a dictionary that maps words to IDs for the documents we are working with. Let's use `gensim`'s `Dictionary` mapper for this. First, however, we need to break our documents down into tokens that we can work with. Here, we use `gensim`'s `simple_preprocess`, but you can do this manually as well (e.g., using a tokenizer/stemmer/pruner of your own choice).



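`simple_preprocess` lowercases and tokenizes the text and, by default, discards tokens shorter than 2 or longer than 15 characters (which is why `i` is missing from the dictionary below). De-accenting is optional (`deacc=True`). See [here](https://tedboy.github.io/nlps/generated/generated/gensim.utils.simple_preprocess.html). It returns a `list` of tokens.

`corpora.Dictionary` constructs word<->id mappings (see [here](https://radimrehurek.com/gensim/corpora/dictionary.html)).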


In [13]:
dictionary = corpora.Dictionary([simple_preprocess(doc) for doc in [doc1, doc2, doc3]]) #initialize a Dictionary. This step assigns a token_id to each word

In [15]:
# inspect what is in the dataset
dictionary.doc2idx(['hi','cereal']) # `hi` is not in the dictionary (index -1), whereas `cereal` has index 11

[-1, 11]

In [16]:
for idx,w in dictionary.items():
    print(idx, w)

0 breakfast
1 drink
2 eat
3 some
4 tea
5 usually
6 when
7 like
8 my
9 with
10 and
11 cereal
12 coffee
13 likes
14 she


In [17]:
'digital' in dictionary.token2id

False

In [18]:
'coffee' in dictionary.token2id

True

In [19]:
bag_of_words_vectors = [ dictionary.doc2bow(simple_preprocess(doc)) for doc in [doc1, doc2, doc3]] # represent each document by (token_id, token_count) tuples

`doc2bow` converts a document into the bag-of-words (BoW) format, that is, a list of (token_id, token_count) tuples.
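
To illustrate what such a bag-of-words representation looks like, here is a minimal pure-Python sketch that mimics the idea behind `doc2bow`. It is not `gensim`'s implementation. The helper `to_bow` and the small `token2id` mapping are hypothetical; the IDs are copied from the dictionary listing above, and the token list is what one would expect for doc2 after preprocessing (an assumption, not captured output).

```python
from collections import Counter

# Hypothetical token -> id mapping, standing in for `dictionary.token2id`
token2id = {"breakfast": 0, "tea": 4, "like": 7, "my": 8, "with": 9}

def to_bow(tokens, token2id):
    """Return sorted (token_id, token_count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Assumed tokens for "i like my tea with my breakfast" (the one-letter "i" dropped)
tokens = ["like", "my", "tea", "with", "my", "breakfast"]
print(to_bow(tokens, token2id))
```

Note that `my` occurs twice, so its ID is paired with a count of 2, while word order is lost entirely.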

In [22]:
# This step also takes quite a while...
similarity_index = WordEmbeddingSimilarityIndex(fasttext_model300)
# Build the term similarity matrix from which the Soft Cosine Measure is computed
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary)

100%|██████████| 15/15 [00:02<00:00,  6.62it/s]
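
To see why a term similarity matrix matters, consider the textbook form of the soft cosine measure, in which a matrix `S` of term-to-term similarities is placed between the two document vectors: `x·S·y / sqrt((x·S·x)(y·S·y))`. The sketch below is a dense, pure-Python illustration under stated assumptions, not `gensim`'s sparse implementation. The three-term vocabulary and the similarity value of 0.5 between `tea` and `coffee` are invented, and `bilinear` and `soft_cosine` are hypothetical helper names.

```python
import math

def bilinear(x, S, y):
    """Compute x^T S y for a dense vector/matrix/vector given as lists."""
    return sum(x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def soft_cosine(x, y, S):
    """Soft cosine: the inner product through S, normalized by both vector lengths under S."""
    return bilinear(x, S, y) / math.sqrt(bilinear(x, S, x) * bilinear(y, S, y))

# Made-up vocabulary: index 0 = tea, 1 = coffee, 2 = cereal
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
S = [
    [1.0, 0.5, 0.0],  # invented: tea and coffee are somewhat similar
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

x = [1, 0, 0]  # a document containing only "tea"
y = [0, 1, 0]  # a document containing only "coffee"

print(soft_cosine(x, y, identity))  # identity matrix: reduces to ordinary cosine
print(soft_cosine(x, y, S))         # related terms now contribute
```

With the identity matrix the two documents share no terms and score 0.0; with the invented similarities they score 0.5, even though they have no word in common.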


### 3. Calculate soft cosine similarity

In [36]:
#between doc1 and doc2
scm_doc1_doc2 = similarity_matrix.inner_product(bag_of_words_vectors[0], bag_of_words_vectors[1], normalized=(True, True))

#between doc1 and doc3
scm_doc1_doc3 = similarity_matrix.inner_product(bag_of_words_vectors[0], bag_of_words_vectors[2], normalized=(True, True))

#between doc2 and doc3
scm_doc2_doc3 = similarity_matrix.inner_product(bag_of_words_vectors[1], bag_of_words_vectors[2], normalized=(True, True))

print(f"SCM between:\ndoc1 <-> doc2: {scm_doc1_doc2:.2f}\ndoc1 <-> doc3: {scm_doc1_doc3:.2f}\ndoc2 <-> doc3: {scm_doc2_doc3:.2f}")

SCM between:
doc1 <-> doc2: 0.29
doc1 <-> doc3: 0.15
doc2 <-> doc3: 0.28


Or, if you like, you can create a matrix (similar to the output of `sklearn`'s `cosine_similarity`):

In [37]:
# reference: https://www.machinelearningplus.com/nlp/cosine-similarity/
def create_soft_cossim_matrix(documents):
    len_array = np.arange(len(documents))
    xx, yy = np.meshgrid(len_array, len_array)
    cossim_mat = pd.DataFrame([[round(similarity_matrix.inner_product(documents[i],documents[j], normalized=(True, True)) ,2) for i, j in zip(x,y)] for y, x in zip(xx, yy)])
    return cossim_mat

df = create_soft_cossim_matrix(bag_of_words_vectors)
df.columns = ['doc1', 'doc2', 'doc3']
df.index = ['doc1', 'doc2', 'doc3']
df

      doc1  doc2  doc3
doc1  1.00  0.29  0.15
doc2  0.29  1.00  0.28
doc3  0.15  0.28  1.00


<u>Question</u> 
<br>
<br>
<div class="alert-info">
Inspect the soft-cosine results, and compare with the cosine results. What makes more sense?
</div>

<u>Question</u> 
<br>
<br>
<div class="alert-info">
Replace the `str` objects in `doc1`, `doc2`, and `doc3` with different sentences (that you can make up yourself). Do you expect high or low similarity? Run the cells, and inspect the results. Are the findings in line with what you expected?
</div>

<u>Question</u> 
<br>
<br>
<div class="alert-info">
Play around with different types of vectorizers (e.g., compare `CountVectorizer` and `TfidfVectorizer`). Does this influence the results, and how?
</div>

<u>Question</u> 
<br>
<br>
<div class="alert-info">
Finally, can you transform the output to cosine distance?
</div>