###### The purpose of this notebook is to determine whether two documents are similar.
###### It covers two ways to do this: "Euclidean distance" and "cosine similarity".


#### Euclidean Distance 


![image](euclidean_distance.png)
To measure similarity with this method:
- find the frequency of each word in each document
- calculate the Euclidean distance between the two frequency vectors
- if the distance is below a chosen threshold, treat the two documents as similar

This method has a problem. Suppose doc1 contains 20,000 words and doc2 is a 2,000-word summary of doc1, and suppose the word "apple" occurs 2,000 times in doc1 and 200 times in doc2.

The Euclidean distance is large in this case, even though both vectors point in the same direction!
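
To make the problem concrete, here is a minimal sketch using tiny count vectors in the order (apple, banana). The first two vectors use the "apple" counts from the example above; `doc3` is a hypothetical unrelated document added only for contrast, and the helper name `euclidean` is illustrative rather than part of this notebook's pipeline.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Counts in the order (apple, banana).
doc1 = (2000, 0)   # full document
doc2 = (200, 0)    # summary of doc1
doc3 = (0, 200)    # hypothetical unrelated document (assumption, for contrast)

print(euclidean(doc1, doc2))  # 1800.0
print(euclidean(doc2, doc3))  # about 282.8
```

Under these assumed counts, the summary ends up much closer to the unrelated document than to the text it summarises, simply because the two short documents have similar lengths.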

#### Cosine Similarity

The above problem can be solved by using cosine similarity. 

![img](Cosine-Similarity-Formula-1.png)

Cosine similarity is the cosine of the angle between two vectors. In the "apple" example both document vectors point in the same direction, so the angle between them is close to zero and the cosine is close to 1.

- if the value is 1 or close to 1, the two documents are similar
- if the value is 0, the documents share no words and are not similar

Refer to the image below for a visual comparison.
![](The-difference-between-Euclidean-distance-and-cosine-similarity.png)
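
The two extreme cases can be checked with a small sketch. It assumes count vectors in the order (apple, banana), with 2000 and 200 occurrences of "apple" as in the earlier example; the banana-only vector is a made-up contrast case, and `cosine` is an illustrative helper, not the function defined later in this notebook.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two non-zero count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Counts in the order (apple, banana).
doc1 = (2000, 0)   # long document
doc2 = (200, 0)    # short document with the same word proportions
doc3 = (0, 200)    # hypothetical document with no words in common

print(cosine(doc1, doc2))  # 1.0
print(cosine(doc2, doc3))  # 0.0
```

The tenfold difference in length between `doc1` and `doc2` cancels out in the division by the two magnitudes, which is why the score stays at 1.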


In [2]:
# first step is to get some text samples
d1 = "The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫;pinyin: dà xióng māo),[4] \
also known as panda bear or simply panda, is a bear[5] native to south central China.[1]\
It is easily recognized by the large, distinctive black patches around its eyes, \
over the ears, and across its round body. The name 'giant panda' \
is sometimes used to distinguish it from the red panda. Though it belongs to the order Carnivora,\
the giant panda is a folivore, with bamboo shoots and leaves making up more than 99% of its diet.[6] Giant\
pandas in the wild will occasionally eat other grasses, wild tubers, or even meat in the form of birds, rodents, or carrion. \
In captivity, they may receive honey, eggs, fish, yams, shrub leaves, oranges, or bananas along with specially prepared food."



d2 = "The giant panda lives in a few mountain ranges in central China, mainly in Sichuan, but also in \
neighbouring Shaanxi and Gansu.[9] As a result of farming, deforestation, and other development, the \
giant panda has been driven out of the lowland areas where it once lived.\
The giant panda is a conservation-reliant vulnerable species.[10][11] A 2007 report showed 239 pandas living \
in captivity inside China and another 27 outside the country.[12] As of December 2014, 49 giant pandas lived in \
captivity outside China, living in 18 zoos in 13 different countries.[13] Wild population estimates vary; one \
estimate shows that there are about 1,590 individuals living in the wild,[12] while a 2006 study via DNA \
analysis estimated that this figure could be as high as 2,000 to 3,000.[14] Some reports also show that \
the number of giant pandas in the wild is on the rise.[15] In March 2015, conservation news site Mongabay \
stated that the wild giant panda population had increased by 268, or 16.8%, to 1,864.[16] In 2016, the IUCN \
reclassified the species from 'endangered' to 'vulnerable'."
# documents = [d1, d2, d3, d4]

# documents

In [3]:
# import libraries 
import re
import nltk
from nltk.corpus import stopwords
import math
import numpy as np

In [30]:
# preprocessing steps
# step 1: lower-case the document and extract its words, which also drops punctuation
def get_words_fron_text(document):
    document = document.lower()
    vectors = re.findall(r'\w+', document)
    return vectors

vectors = {}
vectors['v1'] = get_words_fron_text(d1)
vectors['v2'] = get_words_fron_text(d2)
print(vectors['v1'])


['the', 'giant', 'panda', 'ailuropoda', 'melanoleuca', 'chinese', '大熊猫', 'pinyin', 'dà', 'xióng', 'māo', '4', 'also', 'known', 'as', 'panda', 'bear', 'or', 'simply', 'panda', 'is', 'a', 'bear', '5', 'native', 'to', 'south', 'central', 'china', '1', 'it', 'is', 'easily', 'recognized', 'by', 'the', 'large', 'distinctive', 'black', 'patches', 'around', 'its', 'eyes', 'over', 'the', 'ears', 'and', 'across', 'its', 'round', 'body', 'the', 'name', 'giant', 'panda', 'is', 'sometimes', 'used', 'to', 'distinguish', 'it', 'from', 'the', 'red', 'panda', 'though', 'it', 'belongs', 'to', 'the', 'order', 'carnivora', 'the', 'giant', 'panda', 'is', 'a', 'folivore', 'with', 'bamboo', 'shoots', 'and', 'leaves', 'making', 'up', 'more', 'than', '99', 'of', 'its', 'diet', '6', 'giantpandas', 'in', 'the', 'wild', 'will', 'occasionally', 'eat', 'other', 'grasses', 'wild', 'tubers', 'or', 'even', 'meat', 'in', 'the', 'form', 'of', 'birds', 'rodents', 'or', 'carrion', 'in', 'captivity', 'they', 'may', 'receiv
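
The output above shows two side effects of the `\w+` pattern that are worth knowing. The following sketch uses a made-up sample sentence (not one of the notebook's documents) to show that, for Python 3 strings, `\w` matches Unicode word characters, and that punctuation inside a number splits it into separate tokens.

```python
import re

# Illustrative sample text, loosely modelled on the panda documents.
sample = "Chinese: 大熊猫; about 1,590 individuals.[12]"

tokens = re.findall(r'\w+', sample.lower())
print(tokens)
# ['chinese', '大熊猫', 'about', '1', '590', 'individuals', '12']
```

Citation markers such as `[12]` therefore end up as ordinary tokens, and a figure like 1,590 contributes the two tokens '1' and '590' to the word counts.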

In [31]:

# step 2: remove stop words
# (uses the NLTK stop-word list; run nltk.download('stopwords') once if it is missing)
def remove_stopwords(vector):
    stpwrds = set(stopwords.words('english'))
    # keep only the words that are not in the stop-word set
    return [word for word in vector if word not in stpwrds]

filtered_vectors = {}
filtered_vectors['v1'] = remove_stopwords(vectors['v1'])
filtered_vectors['v2'] = remove_stopwords(vectors['v2'])

print(filtered_vectors['v1'])

['giant', 'panda', 'ailuropoda', 'melanoleuca', 'chinese', '大熊猫', 'pinyin', 'dà', 'xióng', 'māo', '4', 'also', 'known', 'panda', 'bear', 'simply', 'panda', 'bear', '5', 'native', 'south', 'central', 'china', '1', 'easily', 'recognized', 'large', 'distinctive', 'black', 'patches', 'around', 'eyes', 'ears', 'across', 'round', 'body', 'name', 'giant', 'panda', 'sometimes', 'used', 'distinguish', 'red', 'panda', 'though', 'belongs', 'order', 'carnivora', 'giant', 'panda', 'folivore', 'bamboo', 'shoots', 'leaves', 'making', '99', 'diet', '6', 'giantpandas', 'wild', 'occasionally', 'eat', 'grasses', 'wild', 'tubers', 'even', 'meat', 'form', 'birds', 'rodents', 'carrion', 'captivity', 'may', 'receive', 'honey', 'eggs', 'fish', 'yams', 'shrub', 'leaves', 'oranges', 'bananas', 'along', 'specially', 'prepared', 'food']


In [58]:
# get bag of words: the set of unique words across all documents
def get_bag_of_words(filtered_vectors):
    bag_of_words = {word for words_lst in filtered_vectors.values() for word in words_lst}
    print('total words', len(bag_of_words))
    return bag_of_words


# now calculate frequency of words in each vector
def calculate_frequency(filtered_vector, bag_of_words):
    """
    Count how often each word occurs in one document and return the counts
    as a dictionary that has an entry for every word in bag_of_words.
    """
    word_frequency = {}
    for word in filtered_vector:
        word_frequency[word] = word_frequency.get(word, 0) + 1

    # make all vectors the same length: a word from the bag that does not
    # occur in this document gets a frequency of zero
    for word in bag_of_words:
        word_frequency.setdefault(word, 0)

    return word_frequency



In [59]:
bag_of_words = get_bag_of_words(filtered_vectors)
v1 = calculate_frequency(filtered_vectors['v1'], bag_of_words)
v2 = calculate_frequency(filtered_vectors['v2'], bag_of_words)

total words 160
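
To see what these padded frequency vectors look like, here is a small sketch on two toy word lists. It uses `collections.Counter` from the standard library as a compact alternative to the hand-written counting loop; the names `doc_a`, `doc_b`, and `to_vector` are illustrative and not part of this notebook.

```python
from collections import Counter

# Toy documents, already tokenised and without stop words.
doc_a = ["panda", "eats", "bamboo"]
doc_b = ["panda", "likes", "bamboo", "bamboo"]

# Shared vocabulary, sorted so the printed order is reproducible.
vocabulary = sorted(set(doc_a) | set(doc_b))

def to_vector(words, vocabulary):
    counts = Counter(words)
    # A Counter returns 0 for missing keys, so absent words are padded automatically.
    return {word: counts[word] for word in vocabulary}

print(to_vector(doc_a, vocabulary))
# {'bamboo': 1, 'eats': 1, 'likes': 0, 'panda': 1}
print(to_vector(doc_b, vocabulary))
# {'bamboo': 2, 'eats': 0, 'likes': 1, 'panda': 1}
```

Both dictionaries now have the same keys, which is what lets the distance and similarity functions walk over one vector and look up the same word in the other.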



Let's check similarity using both methods discussed above:

- euclidean distance 
- cosine similarity



##### Euclidean Distance

In [60]:
# function to calculate Euclidean distance
def calculate_euclidean_dist(v1, v2):
    # both vectors must contain the same words (see calculate_frequency)
    total_sum = 0
    for word in v1:
        total_sum += (v1[word] - v2[word])**2
    euclidean_distance = math.sqrt(total_sum)
    return euclidean_distance





In [61]:
v2

{'giant': 6,
 'panda': 4,
 'lives': 1,
 'mountain': 1,
 'ranges': 1,
 'central': 1,
 'china': 3,
 'mainly': 1,
 'sichuan': 1,
 'also': 2,
 'neighbouring': 1,
 'shaanxi': 1,
 'gansu': 1,
 '9': 1,
 'result': 1,
 'farming': 1,
 'deforestation': 1,
 'development': 1,
 'driven': 1,
 'lowland': 1,
 'areas': 1,
 'lived': 2,
 'conservation': 2,
 'reliant': 1,
 'vulnerable': 2,
 'species': 2,
 '10': 1,
 '11': 1,
 '2007': 1,
 'report': 1,
 'showed': 1,
 '239': 1,
 'pandas': 3,
 'living': 3,
 'captivity': 2,
 'inside': 1,
 'another': 1,
 '27': 1,
 'outside': 2,
 'country': 1,
 '12': 2,
 'december': 1,
 '2014': 1,
 '49': 1,
 '18': 1,
 'zoos': 1,
 '13': 2,
 'different': 1,
 'countries': 1,
 'wild': 4,
 'population': 2,
 'estimates': 1,
 'vary': 1,
 'one': 1,
 'estimate': 1,
 'shows': 1,
 '1': 2,
 '590': 1,
 'individuals': 1,
 '2006': 1,
 'study': 1,
 'via': 1,
 'dna': 1,
 'analysis': 1,
 'estimated': 1,
 'figure': 1,
 'could': 1,
 'high': 1,
 '2': 1,
 '000': 2,
 '3': 1,
 '14': 1,
 'reports': 1,
 's

##### Cosine Similarity

In [122]:
def calculate_cosine_similarity(v1, v2):
    dot_product = 0
    sqr_v1 = 0
    sqr_v2 = 0
    for word in v1:
        dot_product += v1[word] * v2[word]
        sqr_v1 += v1[word] * v1[word]
        sqr_v2 += v2[word] * v2[word]
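    magnitude_v1 = math.sqrt(sqr_v1)
    magnitude_v2 = math.sqrt(sqr_v2)

    # an empty document has zero magnitude; return 0.0 instead of dividing by zero
    if magnitude_v1 == 0 or magnitude_v2 == 0:
        return 0.0
    cosine_similarity = dot_product / (magnitude_v1 * magnitude_v2)
    return cosine_similarity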



In [130]:

print("Euclidean distance:", calculate_euclidean_dist(v1, v2))
print("cosine similarity:", calculate_cosine_similarity(v1, v2))

Euclidean distance: 15.0996688705415
cosine similarity: 0.35754847096709713


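- Here the cosine similarity between the two texts is about 0.36 and the Euclidean distance is about 15.1.
- Cosine similarity lies between 0 and 1 for word-count vectors, so 0.36 can be read directly as partial overlap between the two panda texts. The distance has no fixed scale, so we need to choose a threshold before we can claim similarity.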
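
The cosine similarity is not itself an angle, but it can be converted into one. The sketch below assumes the value printed above and uses an illustrative helper, `angle_in_degrees`, that is not part of the notebook.

```python
import math

def angle_in_degrees(cosine_similarity):
    """Convert a cosine similarity into the angle between the vectors, in degrees."""
    # Clamp to [-1, 1] to guard against tiny floating-point overshoot.
    clamped = max(-1.0, min(1.0, cosine_similarity))
    return math.degrees(math.acos(clamped))

# Value printed above for the two panda texts.
panda_angle = angle_in_degrees(0.35754847096709713)
print(panda_angle)  # roughly 69 degrees

# Reference points: identical direction and no shared words.
print(angle_in_degrees(1.0))  # 0.0
print(angle_in_degrees(0.0))  # about 90 degrees
```

An angle of roughly 69 degrees sits between the two reference points, which matches the reading of 0.36 as partial overlap.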


In [138]:
# text on rose
d3 = "The leaves are borne alternately on the stem. In most species they are 5 to 15 centimetres (2.0 to 5.9 in) long,\
pinnate, with (3–) 5–9 (–13) leaflets and basal stipules; the leaflets usually have a serrated margin, and often a few \
small prickles on the underside of the stem. Most roses are deciduous but a few (particularly from Southeast Asia) \
are evergreen or nearly so. \
\
The flowers of most species have five petals, with the exception of Rosa sericea,\
which usually has only four. Each petal is divided into two distinct lobes and is usually white or \
pink, though in a few species yellow or red. Beneath the petals are five sepals (or in the case of \
some Rosa sericea, four). These may be long enough to be visible when viewed from above and appear as\
green points alternating with the rounded petals. There are multiple superior ovaries that develop \
into achenes.[4] Roses are insect-pollinated in nature. \
\
The aggregate fruit of the rose is a berry-like structure called a rose hip. \
Many of the domestic cultivars do not produce hips, as the flowers are so tightly \
petalled that they do not provide access for pollination. The hips of most species are red,\
but a few (e.g. Rosa pimpinellifolia) have dark purple to black hips. Each hip comprises an outer \
fleshy layer, the hypanthium, which contains 5–160 (technically dry single-seeded fruits called achenes) \
embedded in a matrix of fine, but stiff, hairs. Rose hips of some species, especially the dog rose (Rosa canina) \
and rugosa rose (Rosa rugosa), are very rich in vitamin C, among the richest sources of any plant. \
The hips are eaten by fruit-eating birds such as thrushes and waxwings, which then disperse the seeds\
in their droppings. Some birds, particularly finches, also eat the seeds.\
\
The sharp growths along a rose stem, though commonly called  \
, are technically prickles, outgrowths of the epidermis (the outer layer of tissue of the stem), \
unlike true thorns, which are modified stems. Rose prickles are typically sickle-shaped hooks, \
which aid the rose in hanging onto other vegetation when growing over it. Some species such as Rosa \
rugosa and Rosa pimpinellifolia have densely packed straight prickles, probably an adaptation to \
reduce browsing by animals, but also possibly an adaptation to trap wind-blown sand and so reduce erosion \
and protect their roots (both of these species grow naturally on coastal sand dunes). Despite the presence \
of prickles, roses are frequently browsed by deer. A few species of roses have only vestigial prickles that have no points."

# text on tiger 
d4 = "The tiger (Panthera tigris) is the largest species among the Felidae and \
classified in the genus Panthera. It is most recognisable for its dark vertical \
stripes on reddish-orange fur with a lighter underside. It is an apex predator,  \
primarily preying on ungulates such as deer and wild boar. It is territorial and  \
generally a solitary but social predator, requiring large contiguous areas of  \
habitat, which support its requirements for prey and rearing of its offspring. \
Tiger cubs stay with their \
mother for about two years, before they become independent and leave their mother's home range to establish their own."

In [139]:
v3_words = get_words_fron_text(d3)
v4_words = get_words_fron_text(d4)
filtered_v3 = remove_stopwords(v3_words)
filtered_v4 = remove_stopwords(v4_words)
bow = get_bag_of_words({'v3': filtered_v3, 'v4': filtered_v4})
v3 = calculate_frequency(filtered_v3, bow)
v4 = calculate_frequency(filtered_v4, bow)


total words 226


In [140]:

print("Euclidean distance:", calculate_euclidean_dist(v3, v4))
print("cosine similarity:", calculate_cosine_similarity(v3, v4))

Euclidean distance: 24.43358344574123
cosine similarity: 0.06921162787499348


In [144]:
d5 = " this weather is quite beautiful and I would really like to go out and have some fun"
d6 = " no way, i am not reading any book just for the sake of marks!!!!! "

In [146]:
v5_words = get_words_fron_text(d5)
v6_words = get_words_fron_text(d6)
filtered_v5 = remove_stopwords(v5_words)
filtered_v6 = remove_stopwords(v6_words)
bow = get_bag_of_words({'v5': filtered_v5, 'v6': filtered_v6})
v5 = calculate_frequency(filtered_v5, bow)
v6 = calculate_frequency(filtered_v6, bow)

print("Euclidean distance:", calculate_euclidean_dist(v5, v6))
print("cosine similarity:", calculate_cosine_similarity(v5, v6))

total words 13
Euclidean distance: 3.605551275463989
cosine similarity: 0.0


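In the examples above, cosine similarity gives sensible results: about 0.36 for the two panda texts, about 0.07 for the rose and tiger texts, and 0.0 for the two unrelated sentences. A value close to 0 means the documents are not similar, and a value close to 1 means they are similar.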