Document similarity:

A common way to match similar documents is to count the words they have in common. This approach has an inherent flaw: as documents get longer, the number of common words tends to increase even when the documents are about different topics.
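The flaw can be seen in a small sketch. The two sentences and the helper name `shared_word_count` are made up for illustration and are not part of the original notebook; repeating each text simply stands in for a longer document on the same topic.

```python
from collections import Counter

def shared_word_count(text_a, text_b):
    """Count the words common to both texts, respecting repetitions."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Counter & Counter keeps the minimum count of every shared word
    return sum((counts_a & counts_b).values())

# Two made-up sentences about different topics
sports = "the team won the match in the final over"
politics = "the party won the vote in the final round"

short_overlap = shared_word_count(sports, politics)

# Simulate longer documents by repeating each text five times
long_sports = " ".join([sports] * 5)
long_politics = " ".join([politics] * 5)
long_overlap = shared_word_count(long_sports, long_politics)

print(short_overlap, long_overlap)
```

The overlap grows fivefold (from 6 to 30 shared words) although the topics are exactly as different as before.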

Cosine similarity helps overcome this fundamental flaw, which affects both the 'count-the-common-words' approach and the Euclidean distance approach.

What is Cosine Similarity and why is it advantageous?

Cosine similarity is a metric used to determine how similar two documents are, irrespective of their size.

Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. 

This metric measures orientation, not magnitude.
The two vectors in question are arrays containing the word counts of the two documents.
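Written out, for two count vectors A and B the metric is cos(theta) = (A . B) / (|A| |B|): the dot product divided by the product of the vector lengths. A minimal pure-Python sketch of that formula follows. The function name `cosine` and the toy vectors are assumptions for illustration; the notebook itself uses scikit-learn's `cosine_similarity` in the cells further down.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity close to 1
same_direction = cosine([1, 2, 0], [2, 4, 0])

# No words in common: similarity of 0
no_overlap = cosine([1, 0, 0], [0, 3, 0])

print(same_direction, no_overlap)
```

The sketch assumes that neither vector is all zeros; an empty document would cause a division by zero and needs separate handling.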

Cosine similarity is advantageous because two similar documents can be far apart by Euclidean distance purely because of their size (say, the word 'cricket' appears 50 times in one document and 10 times in the other) and still have a small angle between them.
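To make the 'cricket' example concrete, here is a sketch with a hypothetical two-word vocabulary. Only the counts 50 and 10 come from the text above; the remaining counts and all variable names are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical vocabulary: ["cricket", "election"]
long_sports = [50, 5]      # long document about cricket
short_sports = [10, 1]     # short document about cricket
short_politics = [1, 10]   # short document about an election

# Euclidean distance is dominated by document length
dist_same_topic = math.dist(long_sports, short_sports)
dist_other_topic = math.dist(short_sports, short_politics)

# Cosine similarity only looks at the direction of the vectors
cos_same_topic = cosine(long_sports, short_sports)
cos_other_topic = cosine(short_sports, short_politics)

print(dist_same_topic, dist_other_topic)
print(cos_same_topic, cos_other_topic)
```

By Euclidean distance the short cricket document looks closer to the election document than to the long cricket document, while cosine similarity ranks the two cricket documents as the closest pair. `math.dist` requires Python 3.8 or later.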

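The smaller the angle, the higher the similarity.

Two documents reach the maximum similarity of 1 when their word-count vectors point in the same direction, that is, when they use the same words in the same proportions.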

In [1]:
doc_trump = "Mr. Trump became president after winning the political election. \
Though he lost the support of some republican friends, Trump is friends with President Putin"

doc_election = "President Trump says Putin had no political interference in the election outcome. \
He says it was a witchhunt by political parties. \
He claimed President Putin is a friend who had nothing to do with the election"

doc_putin = "Post elections, Vladimir Putin became President of Russia. \
President Putin had served as the Prime Minister earlier in his political career"

In [19]:
documents = [doc_trump,doc_election,doc_putin]

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [21]:
count_vect = CountVectorizer(stop_words="english")
X = count_vect.fit_transform(documents)
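As a rough picture of what the vectorizer produces, the following pure-Python sketch builds count vectors by hand. It is an approximation, not scikit-learn's implementation: the function name `count_vectors`, the sample sentences and the tiny stop-word set are assumptions, and scikit-learn's built-in "english" stop-word list is much longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"the", "is", "of", "in", "and"}

def count_vectors(texts):
    """Return a sorted vocabulary and one row of word counts per text."""
    tokenized = []
    for text in texts:
        # Lowercase, then keep tokens of two or more word characters
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        tokenized.append([t for t in tokens if t not in STOP_WORDS])

    # One column per distinct word, in alphabetical order
    vocabulary = sorted({t for tokens in tokenized for t in tokens})

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocab, rows = count_vectors([
    "Putin is the president",
    "The president met the president",
])
print(vocab)
print(rows)
```

Each row plays the role of one row of `X`: a document represented purely by how often each vocabulary word occurs in it.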

In [25]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(X.toarray(), columns=count_vect.get_feature_names_out(),
                  index=['doc_trump', 'doc_election', 'doc_putin'])
df

,career,claimed,earlier,election,elections,friend,friends,interference,lost,minister,...,putin,republican,russia,says,served,support,trump,vladimir,winning,witchhunt
doc_trump,0,0,0,1,0,0,2,0,1,0,...,1,1,0,0,0,1,2,0,1,0
doc_election,0,1,0,2,0,1,0,1,0,0,...,2,0,0,2,0,0,1,0,0,1
doc_putin,1,0,1,0,1,0,0,0,0,1,...,2,0,1,0,1,0,0,1,0,0


In [26]:
from sklearn.metrics.pairwise import cosine_similarity

In [27]:
print(cosine_similarity(df))

[[1.         0.51639778 0.36893239]
 [0.51639778 1.         0.45360921]
 [0.36893239 0.45360921 1.        ]]
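Since the metric is the cosine of an angle, the printed values can be turned back into angles to see the "smaller angle, higher similarity" relationship directly. The sketch below copies the three off-diagonal values from the output above; the dictionary layout and variable names are assumptions for illustration.

```python
import math

# Off-diagonal values copied from the similarity matrix printed above
similarities = {
    ("doc_trump", "doc_election"): 0.51639778,
    ("doc_trump", "doc_putin"): 0.36893239,
    ("doc_election", "doc_putin"): 0.45360921,
}

# acos inverts the cosine; convert radians to degrees for readability
angles = {pair: math.degrees(math.acos(s)) for pair, s in similarities.items()}

# List the pairs from the smallest angle (most similar) to the largest
ranked = sorted(angles, key=angles.get)
for pair in ranked:
    print(pair, round(angles[pair], 1))
```

The pair with the highest similarity, `doc_trump` and `doc_election`, comes out with the smallest angle.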


In [None]:
# Get the document which is closest to document 0

In [66]:
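# Cosine similarity between document 0 and the rest, sorted from most to least similar.
# The +1 maps positions within df[1:] back to row positions in df.
print((cosine_similarity(df[0:1], df[1:]).flatten().argsort()[::-1] + 1)[:1])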

[1]
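The same lookup can be done for every document at once. The sketch below works on the similarity matrix printed earlier, typed in as plain lists; the helper name `nearest` is an assumption for illustration.

```python
names = ["doc_trump", "doc_election", "doc_putin"]

# Values copied from the cosine_similarity(df) output above
sim = [
    [1.0, 0.51639778, 0.36893239],
    [0.51639778, 1.0, 0.45360921],
    [0.36893239, 0.45360921, 1.0],
]

def nearest(i, sim):
    """Index of the document most similar to document i, excluding i itself."""
    candidates = [j for j in range(len(sim)) if j != i]
    return max(candidates, key=lambda j: sim[i][j])

closest = {names[i]: names[nearest(i, sim)] for i in range(len(names))}
print(closest)
```

Excluding index `i` matters because every document has similarity 1 with itself and would otherwise always be its own nearest neighbour.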


In [None]:
# Similarity between the words across the corpus,
# i.e. co-occurrence of words across all documents

In [68]:
sim_mat = cosine_similarity(df.T)
sim_mat = pd.DataFrame(sim_mat, columns= df.columns, index= df.columns)

In [69]:
sim_mat

,career,claimed,earlier,election,elections,friend,friends,interference,lost,minister,...,putin,republican,russia,says,served,support,trump,vladimir,winning,witchhunt
career,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.666667,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
claimed,0.0,1.0,0.0,0.894427,0.0,1.0,0.0,1.0,0.0,0.0,...,0.666667,0.0,0.0,1.0,0.0,0.0,0.447214,0.0,0.0,1.0
earlier,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.666667,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
election,0.0,0.894427,0.0,1.0,0.0,0.894427,0.447214,0.894427,0.447214,0.0,...,0.745356,0.447214,0.0,0.894427,0.0,0.447214,0.8,0.0,0.447214,0.894427
elections,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.666667,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
friend,0.0,1.0,0.0,0.894427,0.0,1.0,0.0,1.0,0.0,0.0,...,0.666667,0.0,0.0,1.0,0.0,0.0,0.447214,0.0,0.0,1.0
friends,0.0,0.0,0.0,0.447214,0.0,0.0,1.0,0.0,1.0,0.0,...,0.333333,1.0,0.0,0.0,0.0,1.0,0.894427,0.0,1.0,0.0
interference,0.0,1.0,0.0,0.894427,0.0,1.0,0.0,1.0,0.0,0.0,...,0.666667,0.0,0.0,1.0,0.0,0.0,0.447214,0.0,0.0,1.0
lost,0.0,0.0,0.0,0.447214,0.0,0.0,1.0,0.0,1.0,0.0,...,0.333333,1.0,0.0,0.0,0.0,1.0,0.894427,0.0,1.0,0.0
minister,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.666667,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
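A few entries of this word-by-word matrix can be checked by hand. After transposing, each word becomes a vector of its counts in the three documents, in the order doc_trump, doc_election, doc_putin. The counts below are read off the document-term table shown earlier; the helper `cosine` and the dictionary are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Counts per word in (doc_trump, doc_election, doc_putin)
word_counts = {
    "career": [0, 0, 1],
    "election": [1, 2, 0],
    "friends": [2, 0, 0],
    "putin": [1, 2, 2],
    "trump": [2, 1, 0],
}

election_trump = cosine(word_counts["election"], word_counts["trump"])
friends_trump = cosine(word_counts["friends"], word_counts["trump"])
career_putin = cosine(word_counts["career"], word_counts["putin"])

print(round(election_trump, 6), round(friends_trump, 6), round(career_putin, 6))
```

These agree with the corresponding cells of the table above (0.8, 0.894427 and 0.666667). With only three documents each word vector has just three components, so many word pairs end up with a similarity of exactly 0 or 1.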
