In [62]:
#DingDong

import string
import math
import pandas as pd
import numpy as np

corpus = ["this is the first document. The first document is a great story and it is not boring", 
          "This is the second document. The second document is also a great story but the first is better ", 
          "this was the third document",
          "Rag-beast document document document document"]

In [63]:
def build_vectors(corpus: list, searchstring: str):
    
    #Build word counters for textcorpus 
    counters = []
    
    for document in corpus:
        counter = {}
        #List comprehension:
        #Split the document into words, strip surrounding punctuation and lowercase each word.
        document = [word.strip(string.punctuation).lower() for word in document.split()]
        
        
        #Loop that counts the words of the document in the dictionary (counter).
        #Words are keys, numbers of occurrences are values
        for word in document:
            if word not in counter:
                counter[word] = 1
            else:
                counter[word] += 1  
        #Dictionary is appended to list
        counters.append(counter)
    
    #Build word counter for searchstring 
    searchstring_counter = {}
    searchstring = [word.strip(string.punctuation).lower() for word in searchstring.split()]
    
    #Same loop as above, applied to the searchstring
    for word in searchstring:
        if word not in searchstring_counter:
            searchstring_counter[word] = 1
        else:
            searchstring_counter[word] += 1
    
    #Set searchstring as last element in counters for performing calculations later
    counters.append(searchstring_counter)

    #Build the combined vocabulary
    #The union of the keys of all counters gives the set of unique words
    combined_dict = set().union(*counters)
    
    #Build vectors
    #One vector per counter: the count of each vocabulary word, 0 if the word is absent.
    #For binary counting, use '1 if word in counter else 0' instead of 'counter.get(word, 0)'.
    vector_list = []
    for counter in counters:
        vector = [counter.get(word, 0) for word in combined_dict]
        vector_list.append(vector)

    return counters, combined_dict, vector_list
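
The word-counting loops in `build_vectors` can also be expressed with `collections.Counter` from the standard library. The following is a minimal sketch, not the notebook's implementation: the helper name `count_words` is hypothetical, and unlike the original loop it drops tokens that become empty after punctuation is stripped (an assumption about the desired behaviour).

```python
import string
from collections import Counter

def count_words(text):
    """Count lowercase words in text, ignoring surrounding punctuation."""
    # Same normalisation as in build_vectors: split on whitespace,
    # strip leading/trailing punctuation, lowercase.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    # Assumption: tokens that consist only of punctuation are skipped.
    return Counter(w for w in words if w)

counts = count_words("Rag-beast document document document document")
print(counts["document"])   # 4
print(counts["rag-beast"])  # 1
```

A `Counter` returns 0 for missing words, so `counts[word]` could replace the explicit membership test when building vectors.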




In [64]:
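#Function for finding the dot product, taking the vector list as parameter.
#The last vector in the list is the search string; all others are corpus documents.
def dotproduct(vl):
    dp_dict = {}
    #enumerate gives the document number directly; vl.index() would return the
    #first match and mislabel documents that have identical vectors
    for number, vector in enumerate(vl[:-1], start=1):
        doc = 'Doc' + str(number)
        dot_product = sum(n1 * n2 for n1, n2 in zip(vector, vl[-1]))
        dp_dict[doc] = dot_product
    return dp_dict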


This function takes a list of vectors and computes the dot product between the search document vector and each of the corpus document vectors. 

$D$ = document vector with components $d_1, d_2, \dots, d_n$, 

$S$ = search document vector with components $s_1, s_2, \dots, s_n$,

Algebraic definition:

$$ D \cdot S = \displaystyle\sum_{i=1}^{n} d_i s_i = d_1 s_1 + d_2 s_2 + ... + d_n s_n $$


In [65]:
def euclideandistance(vl):
    ed_dict = {}
    #enumerate avoids vl.index(), which mislabels documents with identical vectors
    for number, vector in enumerate(vl[:-1], start=1):
        doc = 'Doc' + str(number)
        euclidean_distance = math.sqrt(sum(((n1 - n2)**2) for n1, n2 in zip(vector, vl[-1])))
        ed_dict[doc] = euclidean_distance
    return ed_dict

The Euclidean distance function takes the list of vectors and computes the distance between the endpoint of each corpus document vector and the endpoint of the search document vector. 


$d$ = components of a document vector, 

$s$ = components of the search document vector,

Mathematical definition:


$$ distance(d,s) = \sqrt{(d_1-s_1)^2+(d_2-s_2)^2+\dots+(d_n-s_n)^2} $$

In [66]:
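def cosinesimilarity(vl):
    cs_dict = {}
    #length (norm) of the search string vector, the last element of the list
    search_norm = math.sqrt(sum(n ** 2 for n in vl[-1]))
    for number, vector in enumerate(vl[:-1], start=1):
        doc = 'Doc' + str(number)
        dot_product = sum(n1 * n2 for n1, n2 in zip(vector, vl[-1]))
        vector_norm = math.sqrt(sum(n ** 2 for n in vector))
        cosine_similarity = dot_product / (vector_norm * search_norm)
        cs_dict[doc] = cosine_similarity
    return cs_dict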

Cosine similarity takes two vectors and computes the cosine of the angle between them. A value of 1 means that the vectors point in the same direction, i.e. they lie on top of each other, whereas a value of 0 means that the vectors are orthogonal.


$D$ = document vector 

$S$ = search document vector

Algebraic definition:

$$ \cos(\theta)=\frac{D \cdot S}{\| D \| \| S \|} =\frac{\displaystyle\sum_{i=1}^{n} d_i s_i}{\sqrt{\displaystyle\sum_{i=1}^{n} d_i^2}\sqrt{\displaystyle\sum_{i=1}^{n} s_i^2}}$$


In [67]:
#The following cells display the results
#Set the search string (the third document repeated five times)
counters, combined_dict, vector_list = build_vectors(corpus,'this was the third document, this was the third document, this was the third document, this was the third document, this was the third document')

#Visualize data
combined_dict_list = [combined_dict]

df_vectors = pd.DataFrame(vector_list)
df_counters = pd.DataFrame(counters)
df_combined = pd.DataFrame(combined_dict_list)

In [68]:
df_counters

Unnamed: 0,a,also,and,better,boring,but,document,first,great,is,it,not,rag-beast,second,story,the,third,this,was
0,1.0,,1.0,,1.0,,2,2.0,1.0,3.0,1.0,1.0,,,1.0,2.0,,1.0,
1,1.0,1.0,,1.0,,1.0,2,1.0,1.0,3.0,,,,2.0,1.0,3.0,,1.0,
2,,,,,,,1,,,,,,,,,1.0,1.0,1.0,1.0
3,,,,,,,4,,,,,,1.0,,,,,,
4,,,,,,,5,,,,,,,,,5.0,5.0,5.0,5.0


In [69]:
df_combined

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,great,a,story,the,this,better,second,rag-beast,but,it,not,was,and,first,third,document,also,is,boring


In [70]:
df_vectors

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,1,1,1,2,1,0,0,0,0,1,1,0,1,2,0,2,0,3,1
1,1,1,1,3,1,1,2,0,1,0,0,0,0,1,0,2,1,3,0
2,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1,1,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,4,0,0,0
4,0,0,0,5,5,0,0,0,0,0,0,5,0,0,5,5,0,0,0


In [71]:
df_dotproduct = pd.DataFrame([dotproduct(vector_list)])
df_dotproduct
#df_dotproduct.sort_values(by)

Unnamed: 0,Doc1,Doc2,Doc3,Doc4
0,25,30,25,20


In [72]:
df_euclideandistance = pd.DataFrame([euclideandistance(vector_list)])
df_euclideandistance

Unnamed: 0,Doc1,Doc2,Doc3,Doc4
0,10.198039,9.949874,8.944272,10.099505


In [73]:
df_cosinesimilarity = pd.DataFrame([cosinesimilarity(vector_list)])
df_cosinesimilarity

Unnamed: 0,Doc1,Doc2,Doc3,Doc4
0,0.415227,0.460179,1.0,0.433861


In [74]:
#validate  with numpy
doc1 = np.array(vector_list[0])
doc2 = np.array(vector_list[1])
doc3 = np.array(vector_list[2])
doc4 = np.array(vector_list[3])
sstr = np.array(vector_list[4])

#Doc 3 vs. search document
#The _np names avoid overwriting the functions dotproduct, euclideandistance and cosinesimilarity
dotproduct_np = np.dot(doc3, sstr)
euclideandistance_np = np.linalg.norm(doc3 - sstr)
cosinesimilarity_np = np.dot(doc3, sstr) / (np.linalg.norm(doc3) * np.linalg.norm(sstr))

print(euclideandistance_np)
print(dotproduct_np)
print(cosinesimilarity_np)

8.94427190999916
25
0.9999999999999999


## Discussions of formulas and measurements 

### Dot product
In text similarity, the dot product is a useful way to determine whether two documents have anything in common. With word-count vectors, a positive dot product means that the documents share at least one word. However, the product itself does not indicate in what way the documents are similar. For example, suppose document A consists of the same words as document B, in the same proportions, but document A has 10 times as many words. In this case the angle $\theta$ is 0° ($\cos(0°) = 1$), but the greater length of A's vector produces a high dot product, driven purely by the large number of word occurrences. The results above show a related effect: Doc2 has the highest dot product (30), although the search string was built by repeating Doc3 (25). Therefore, when using the dot product as a similarity measure, it is important to bring in additional measures in order to establish the type of similarity.


$$ A \cdot B =\| A \| \| B \| cos(\theta) $$

 
By this definition, a dot product of zero between two non-zero vectors implies that the vectors are orthogonal, meaning that $\theta$ is 90 degrees. Since $\cos(\theta)$ decreases as $\theta$ grows from 0 to 90 degrees, the larger the angle, the lower the similarity between the two documents.
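
The relationship between the dot product, vector length and angle can be checked with a few made-up vectors. This is a small sketch in plain Python; the vectors `doc_a`, `doc_b`, `query` and `orthogonal` are invented for illustration and are not taken from the corpus.

```python
import math

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

doc_a = [1, 2, 0, 1]
doc_b = [10 * x for x in doc_a]   # same words, ten times as many occurrences
query = [1, 1, 0, 0]
orthogonal = [0, 0, 3, 0]         # shares no words with the query

print(dot(doc_a, query))       # 3
print(dot(doc_b, query))       # 30: ten times larger, same direction
print(dot(orthogonal, query))  # 0: no shared words

# Recover the angle from A . B = |A| |B| cos(theta)
cos_theta = dot(doc_a, query) / (norm(doc_a) * norm(query))
theta = math.degrees(math.acos(cos_theta))
print(round(theta, 6))
```

The dot product grows tenfold from `doc_a` to `doc_b` even though both point in the same direction, which is why it cannot be read as a similarity grade on its own.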

### Euclidean distance
The Euclidean distance measures the length between the endpoints of the two vectors, which indicates how much the word counts of two documents differ. It is important to note that if one term occurs many times in a document while the remaining words match the search document, the distance goes up, which can give a false indication of dissimilarity. Therefore, when using the Euclidean measure, it is important to also look at the cosine similarity in order to assess whether a large distance reflects different content or merely different word counts. 

$$ \| q - p \| = \sqrt{(q - p) \cdot (q - p)} = \sqrt{\displaystyle\sum_{i=1}^{n} (q_i - p_i)^2} $$
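
The effect of differing counts can be reproduced with the non-zero components of Doc3 and the search string from the results above (five words, counted once and five times respectively). The sketch below also scales both vectors to unit length before measuring the distance; that normalisation step is an assumption added for illustration and is not something the notebook does.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unit(a):
    """Scale a non-zero vector to length 1."""
    length = math.sqrt(sum(x * x for x in a))
    return [x / length for x in a]

doc3 = [1, 1, 1, 1, 1]     # non-zero counts of Doc3
search = [5, 5, 5, 5, 5]   # same words, each repeated five times

raw = euclidean(doc3, search)
normalized = euclidean(unit(doc3), unit(search))

print(round(raw, 6))         # 8.944272, the Doc3 distance reported above
print(round(normalized, 6))  # 0.0
```

The raw distance is large only because the search string repeats every word; once length is factored out, the two vectors coincide.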
 
### Cosine similarity
This brings us to cosine similarity as a similarity measure, since it tells us how large the angle between two vectors is. We can then say how similar or how far apart two documents are. A cosine of 1 ($\theta = 0$) indicates that two documents contain the same words in the same proportions. It does not, however, indicate the difference in the number of word occurrences between the documents. 

$$ cos(\theta) = \frac{A \cdot B} {\| A \| \| B \|} $$
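
The formula divides by the vector lengths, so it is undefined when either vector is all zeros, for instance when a search string contains no words. The sketch below guards against that case; the function name `safe_cosine` is hypothetical, and returning 0.0 for a zero vector is a convention assumed here rather than part of the definition.

```python
import math

def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: an empty document is treated as similar to nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

proportional = safe_cosine([1, 2, 3], [2, 4, 6])  # same direction, about 1
orthogonal = safe_cosine([1, 0], [0, 1])          # no shared words, 0
empty = safe_cosine([0, 0], [1, 1])               # guarded case, 0.0

print(proportional, orthogonal, empty)
```

The first call also shows the limitation noted above: doubling every count leaves the cosine unchanged.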
 

### Assessment
On the basis of the above, it is clear that several measures need to be combined in order to grade similarity accurately. By computing Euclidean distance and cosine similarity individually and assessing the similarity on both measures, we obtain a proper indication of document similarities.  
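
As a rough illustration of comparing the measures, the sketch below ranks the four documents by each measure, using the result values printed earlier in this notebook. The dictionary and variable names are illustrative assumptions, and ranking is only one possible way to combine the measures.

```python
# Result values copied from the notebook outputs above
dot_product = {"Doc1": 25, "Doc2": 30, "Doc3": 25, "Doc4": 20}
euclidean = {"Doc1": 10.198039, "Doc2": 9.949874, "Doc3": 8.944272, "Doc4": 10.099505}
cosine = {"Doc1": 0.415227, "Doc2": 0.460179, "Doc3": 1.0, "Doc4": 0.433861}

# Higher is better for dot product and cosine, lower is better for distance
by_dot = sorted(dot_product, key=dot_product.get, reverse=True)
by_distance = sorted(euclidean, key=euclidean.get)
by_cosine = sorted(cosine, key=cosine.get, reverse=True)

print(by_dot)       # ['Doc2', 'Doc1', 'Doc3', 'Doc4']
print(by_distance)  # ['Doc3', 'Doc2', 'Doc4', 'Doc1']
print(by_cosine)    # ['Doc3', 'Doc2', 'Doc4', 'Doc1']
```

For this search string, Euclidean distance and cosine similarity agree on the ranking and both place Doc3 first, whereas the dot product favours Doc2.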
