## Cosine Similarity

Cosine similarity is a widely used measure of how similar two documents (e.g., social media posts, news articles, blogs) are. It is the cosine of the angle $\theta$ between the vector representations $\mathbf{A}$ and $\mathbf{B}$ of the two documents.

Mathematically, we write: 


$$
\text { similarity }=\cos (\theta)=\frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}
$$


Next, we walk through an example in Python, calculating the similarity between two strings.

In [35]:
import math
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

doc1 = "Students that you like the digital society".lower()
doc2 = "Students of the minor communication in the digital society".lower()






Next, we need to transform the textual data into vector representations (that is, move from words to numbers). There are different ways to do this; here, we apply `CountVectorizer`, which counts how often each word occurs in each document.

In [37]:
vec = CountVectorizer(stop_words='english')
count_matrix = vec.fit_transform([doc1, doc2])
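
To make the step from words to numbers concrete, the following is a minimal plain-Python sketch of a bag-of-words count. It is not how `CountVectorizer` is implemented: the `STOP_WORDS` set and the `tokenize` helper are hypothetical simplifications (whitespace splitting and a hand-picked stop-word list), chosen only so that the two example documents produce the same vocabulary.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only;
# sklearn's built-in English list is much longer.
STOP_WORDS = {"that", "you", "the", "of", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


docs = [
    "Students that you like the digital society",
    "Students of the minor communication in the digital society",
]
tokenized = [tokenize(d) for d in docs]

# The vocabulary is the sorted set of all remaining tokens.
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One count vector per document, with one position per vocabulary term.
vectors = []
for toks in tokenized:
    counts = Counter(toks)  # missing terms count as 0
    vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary term, so two documents can be compared position by position.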



In the following code snippet, we convert the sparse output to a dense `DataFrame` **for educational purposes**. This lets you inspect what the data looks like. Avoid doing this with large datasets: converting a large sparse matrix to a dense format is very memory-inefficient.
 

In [45]:
# Rows are documents, columns are vocabulary terms
print(pd.DataFrame(count_matrix.toarray(), columns=vec.get_feature_names_out()).to_string())

   communication  digital  like  minor  society  students
0              0        1     1      0        1         1
1              1        1     0      1        1         1


### 1. Calculate Cosine Similarity from scratch
That is, without relying on a library's built-in similarity function.

First, we convert each row (= document) to a one-dimensional list (vector).

In [51]:
# Build the dense document-term table once; each row is one document
dense_df = pd.DataFrame(count_matrix.toarray(), columns=vec.get_feature_names_out())
doc1_vector = dense_df.iloc[0].to_list()
doc2_vector = dense_df.iloc[1].to_list()

print(f"The vector belonging to doc1: {doc1_vector}")
print(f"The vector belonging to doc2: {doc2_vector}")

The vector belonging to doc1: [0, 1, 1, 0, 1, 1]
The vector belonging to doc2: [1, 1, 0, 1, 1, 1]


Now, let's work through the formula step by step.


1. Compute the numerator of the formula, that is, the dot product of the two vectors:
$$
\sum_{i=1}^{n} A_{i} B_{i}
$$

In [53]:
dot_product = sum([num1 * num2 for num1, num2 in zip(doc1_vector, doc2_vector)])
print(dot_product)

3
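
As a check on this result, the sum can be written out term by term, taking $\mathbf{A}$ to be `doc1_vector` and $\mathbf{B}$ to be `doc2_vector`:

$$
\sum_{i=1}^{6} A_{i} B_{i} = 0 \cdot 1 + 1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 1 + 1 \cdot 1 = 3
$$

Because every count in this example is 0 or 1, the dot product simply counts the terms the two documents share: *digital*, *society*, and *students*.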


2. Compute the denominator of the formula, that is, the product of the lengths (Euclidean norms) of the two vectors:
    
$$
\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}
$$

In [54]:
import math
doc1_ = math.sqrt(sum( [i**2 for i in doc1_vector]) )
doc2_ = math.sqrt(sum( [i**2 for i in doc2_vector]) )
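
The cell above prints nothing, so it may help to work the two lengths out by hand, again with $\mathbf{A}$ as `doc1_vector` and $\mathbf{B}$ as `doc2_vector`:

$$
\|\mathbf{A}\| = \sqrt{0^{2}+1^{2}+1^{2}+0^{2}+1^{2}+1^{2}} = \sqrt{4} = 2, \qquad \|\mathbf{B}\| = \sqrt{1^{2}+1^{2}+0^{2}+1^{2}+1^{2}+1^{2}} = \sqrt{5} \approx 2.236
$$

$$
\|\mathbf{A}\|\|\mathbf{B}\| = 2 \sqrt{5} \approx 4.472
$$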

3. Finally, divide the dot product by the product of the lengths:

In [65]:
cos_sim = dot_product / (doc1_ * doc2_)
print(f"We calculated cosine similarity between the following documents:\n---\n{doc1}\n---\n{doc2}\n---\nSimilarity is:\n\n\n{cos_sim}")

We calculated cosine similarity between the following documents:
---
students that you like the digital society
---
students of the minor communication in the digital society
---
Similarity is:


0.6708203932499369
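
The three steps can also be collected into one small reusable function. This is a minimal sketch, not part of the original notebook: the name `cosine_sim` is hypothetical, and returning `0.0` for an all-zero vector is an assumption made here to avoid a division by zero.

```python
import math


def cosine_sim(vec_a, vec_b):
    """Cosine similarity of two equal-length numeric sequences."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    # Step 1: numerator (dot product)
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    # Step 2: denominator (product of the Euclidean lengths)
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: similarity with an all-zero vector is defined as 0.0
        return 0.0
    # Step 3: divide
    return dot / (norm_a * norm_b)


print(cosine_sim([0, 1, 1, 0, 1, 1], [1, 1, 0, 1, 1, 1]))
```

Called with the two document vectors from this example, it should reproduce the similarity printed above.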


### 2. Calculate Cosine Similarity using `sklearn`

We can also do this using `sklearn`'s `cosine_similarity`. Let's validate our results.

In [71]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([doc1_vector, doc2_vector])[0][1]

0.6708203932499369
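
The `[0][1]` index is needed because `cosine_similarity`, when given a single list of vectors, returns a matrix of similarities between every pair of rows. The plain-Python sketch below mimics that output shape to show where the value comes from; `pairwise_cosine` is a hypothetical helper, not sklearn's implementation, and it assumes no vector is all zeros.

```python
import math


def pairwise_cosine(vectors):
    """Return a square matrix whose [i][j] entry compares vectors i and j."""
    # Assumption: every vector has a non-zero length
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
         for v, norm_v in zip(vectors, norms)]
        for u, norm_u in zip(vectors, norms)
    ]


matrix = pairwise_cosine([[0, 1, 1, 0, 1, 1], [1, 1, 0, 1, 1, 1]])
for row in matrix:
    print([round(value, 4) for value in row])
```

The diagonal entries compare each document with itself and are 1 (up to floating-point rounding), while the off-diagonal entries `[0][1]` and `[1][0]` both hold the similarity between the two documents.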