### <u>Terms used in this project</u>

#### <u>Corpus:</u>
The collection of documents

#### <u>Vocabulary:</u>
The collection of all words present in a corpus

#### <u>Bag of words:</u>
In this model, a text (such as a sentence or a document) is represented as the bag of its words,
disregarding grammar and even word order but keeping the count of each word.
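
As a minimal sketch of this idea, the snippet below builds a bag of words with Python's `collections.Counter`. The helper name `bag_of_words` and the two sample sentences are made up for illustration and are not part of the project code.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping, ignoring word order."""
    return Counter(text.lower().split())

# two illustrative sentences with the same words in a different order
bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a["the"])    # how many times "the" occurs in the first sentence
print(bag_a == bag_b)  # word order is discarded, so the two bags compare equal
```

The two sentences mean different things, yet their bags are identical, which is exactly the information this model gives up.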

#### <u>Term Frequency (tf)</u>
The number of times a term occurs in a document, i.e. the raw count f(t,d)
tf(t,d) = f(t,d)
    where t: term
            d: document
            
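#### <u>Inverse Document Frequency (idf)</u>
A measure of how rare a term is across the corpus: the logarithm of the total number of documents divided by the number of documents that contain the term.
idf(t,D) = log(N/|{d &#8712; D:t &#8712; d}|)
    where N: total number of documents in the corpus
            D: the corpus
            t: term
            d: document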
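
The formula above can be sketched in a few lines of plain Python. The function name `idf` and the three toy documents are assumptions made for illustration; this is the unsmoothed formula, not the smoothed variant that scikit-learn applies by default.

```python
import math

def idf(term, docs):
    """Unsmoothed idf: log(N / number of documents containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

# toy corpus of three tokenized documents (illustrative only)
docs = [
    {"wine", "heart", "wisdom"},
    {"wine", "gold"},
    {"wine", "heart"},
]

print(idf("wine", docs))  # present in every document: log(3/3)
print(idf("gold", docs))  # present in one document: log(3/1)
```

A term that appears in no document would divide by zero here, which is one reason smoothed variants add 1 to the numerator and denominator.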
            
#### <u>Term Frequency-Inverse Document Frequency (tf-idf)</u>
tf-idf(t,d,D) = tf(t,d) * idf(t,D)
<i>Words that occur in only a few documents get a higher tf-idf in the documents where they appear.</i>
<i>Words that are common across all documents get a tf-idf close to zero.</i>
<i>Using the tf-idf measure we can create a vector representation (bag-of-words representation) of a document.</i>
<i>Each term in the vocabulary is replaced by its tf-idf score for that document, which gives the vector:</i>
<i>Doc -> {tf-idf(w1), tf-idf(w2), ..., tf-idf(wn)}</i>
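
To make the vector representation concrete, here is a small sketch that turns two toy documents into tf-idf vectors using the unsmoothed idf defined earlier. The function `tfidf_vectors` and the sample strings are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the sorted vocabulary and one tf-idf vector per document."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    vectors = [[c[t] * math.log(n_docs / df[t]) for t in vocab] for c in counts]
    return vocab, vectors

vocab, vectors = tfidf_vectors(["wine wine heart", "wine gold"])
print(vocab)    # vector positions follow this alphabetical order
print(vectors)  # "wine" occurs in both documents, so its weight is zero
```

Every document gets a vector of the same length (one entry per vocabulary term), which is what makes the documents comparable later on.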
            
#### <u>Cosine Similarity</u>
Cosine similarity is a metric that measures how similar two data objects are, irrespective of their size. Each data object is treated as a vector, and the similarity is the cosine of the angle between the two vectors:
<i>Cos(A, B) = (A . B) / (||A|| * ||B||)</i>

Text Matching:
<ul>
    <li>Vectors A and B are the <strong>tf-idf</strong> vectors of the two documents.</li>
    <li>The cosine similarity of two documents is in the range 0 to 1;
        the value is close to 1 when the documents are very similar, because the angle between their vectors is then close to zero and the cosine of zero is 1.
    </li>
    <li>tf-idf values are never negative, which is why the similarity cannot fall below 0.</li>
</ul>
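
The formula can be checked by hand with a short sketch in plain Python. The function name `cosine` and the example vectors are illustrative assumptions; the project itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # second vector is twice the first
no_overlap = cosine([1.0, 0.0], [0.0, 3.0])                # the vectors share no non-zero term

print(same_direction)
print(no_overlap)
```

The first pair differs only in length, so the similarity is (up to floating-point rounding) 1, which is the sense in which the metric ignores document size. The second pair shares no terms, so the similarity is 0.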

#### Imports

In [1]:
import nltk

In [2]:
# download nltk data

#nltk.download()

In [3]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import numpy as np
import string
import pandas as pd
import docx2txt

In [4]:
# define the corpus variable that will hold the documents
corpus = []

# read the files
doc1 = docx2txt.process("test_doc1.docx")
doc2 = docx2txt.process("test_doc2.docx")

In [5]:
# display documents
documents = [doc1, doc2]
pd.set_option('display.max_colwidth', 0)
pd.set_option('display.max_columns', 0)
docs_df = pd.DataFrame(documents, columns=['documents'])

In [6]:
docs_df

Unnamed: 0,documents
0,"ECCLESIASTES 2\n\n\n\n1I said in my heart, Come now, I will prove thee with mirth; therefore enjoy pleasure: and, behold, this also was vanity. 2I said of laughter, It is mad; and of mirth, What doeth it? 3I searched in my heart how to cheer my flesh with wine, my heart yet guiding me with wisdom, and how to lay hold on folly, till I might see what it was good for the sons of men that they should do under heaven all the days of their life. 4I made me great works; I builded me houses; I planted me vineyards; 5I made me gardens and parks, and I planted trees in them of all kinds of fruit; 6I made me pools of water, to water therefrom the forest where trees were reared; 7I bought men-servants and maid-servants, and had servants born in my house; also I had great possessions of herds and flocks, above all that were before me in Jerusalem; 8I gathered me also silver and gold, and the treasure of kings and of the provinces; I gat me men-singers and women-singers, and the delights of the sons of men, musical instruments, and that of all sorts. 9So I was great, and increased more than all that were before me in Jerusalem: also my wisdom remained with me. 10And whatsoever mine eyes desired I kept not from them; I withheld not my heart from any joy; for my heart rejoiced because of all my labor; and this was my portion from all my labor."
1,"ECCLESIASTES 2\n\n\n\n1I said to myself, “Have fun and enjoy yourself!” But this didn't make sense. 2Laughing and having fun is crazy. What good does it do? 3I wanted to find out what was best for us during the short time we have on this earth. So I decided to make myself happy with wine and find out what it means to be foolish, without really being foolish myself.\n\n4 I did some great things. I built houses and planted vineyards. 5I had flower gardens and orchards full of fruit trees. 6And I had pools where I could get water for the trees. 7 I owned slaves, and their sons and daughters became my slaves. I had more sheep and goats than anyone who had ever lived in Jerusalem. 8 Foreign rulers brought me silver, gold, and precious treasures. Men and women sang for me, and I had many wives who gave me great pleasure.\n\n 9 I was the most famous person who had ever lived in Jerusalem, and I was very wise. 10I got whatever I wanted and did whatever made me happy. But most of all, I enjoyed my work."


In [7]:
import re

# characters stripped from every token: punctuation and digits (e.g. verse numbers)
punctuation = re.compile(r'[-[\].?#!,:;()|0-9]')
sw = set(stopwords.words('english'))

def pre_cleaner(text):
    # strip punctuation and digits from each token, then drop tokens of one character or less
    tokens = [punctuation.sub("", item) for item in nltk.wordpunct_tokenize(text)]
    new_text = " ".join(token for token in tokens if len(token) > 1)
    # lowercase, then remove stopwords
    new_text = new_text.lower()
    new_text = ' '.join(word for word in new_text.split() if word not in sw)
    return new_text
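
To see what the punctuation pattern does to individual tokens, the sketch below applies it to a hand-written token list. The list is an assumption meant to resemble what a word/punctuation tokenizer produces for the verse-numbered text; it is not actual NLTK output, and stopword removal is left out.

```python
import re

# same pattern as in the cell above
punctuation = re.compile(r'[-[\].?#!,:;()|0-9]')

# hand-written tokens resembling the verse-numbered source text
tokens = ["1I", "said", "men", "-", "servants", ";", "10And"]

# strip punctuation and digits, drop tokens of one character or less, lowercase
cleaned = [punctuation.sub("", token) for token in tokens]
kept = [token.lower() for token in cleaned if len(token) > 1]

print(kept)
```

Verse numbers glued to a word ("1I", "10And") are stripped, and whatever remains is kept only if it is longer than one character, which is why the single letter "I" disappears.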

In [8]:
# append them to the corpus
corpus.append(doc1)
corpus.append(doc2)
corpus = list(map(pre_cleaner, corpus))

#### Vocabulary of the corpus

In [9]:
def createVocab(docList):
    # map each word to its total number of occurrences across the whole corpus
    vocab = {}
    for doc in docList:
        doc = doc.translate(str.maketrans('', '', string.punctuation))
        words = word_tokenize(doc.lower())
        for word in words:
            vocab[word] = vocab.get(word, 0) + 1
    return vocab
vocab = createVocab(corpus)

vocab

{'ecclesiastes': 2,
 'said': 3,
 'heart': 5,
 'come': 1,
 'prove': 1,
 'thee': 1,
 'mirth': 2,
 'therefore': 1,
 'enjoy': 2,
 'pleasure': 2,
 'behold': 1,
 'also': 4,
 'vanity': 1,
 'laughter': 1,
 'mad': 1,
 'doeth': 1,
 'searched': 1,
 'cheer': 1,
 'flesh': 1,
 'wine': 2,
 'yet': 1,
 'guiding': 1,
 'wisdom': 2,
 'lay': 1,
 'hold': 1,
 'folly': 1,
 'till': 1,
 'might': 1,
 'see': 1,
 'good': 2,
 'sons': 3,
 'men': 5,
 'heaven': 1,
 'days': 1,
 'life': 1,
 'made': 4,
 'great': 5,
 'works': 1,
 'builded': 1,
 'houses': 2,
 'planted': 3,
 'vineyards': 2,
 'gardens': 2,
 'parks': 1,
 'trees': 4,
 'kinds': 1,
 'fruit': 2,
 'pools': 2,
 'water': 3,
 'therefrom': 1,
 'forest': 1,
 'reared': 1,
 'bought': 1,
 'servants': 3,
 'maid': 1,
 'born': 1,
 'house': 1,
 'possessions': 1,
 'herds': 1,
 'flocks': 1,
 'jerusalem': 4,
 'gathered': 1,
 'silver': 2,
 'gold': 2,
 'treasure': 1,
 'kings': 1,
 'provinces': 1,
 'gat': 1,
 'singers': 2,
 'women': 2,
 'delights': 1,
 'musical': 1,
 'instruments':

#### Document term frequency matrix
<i>The matrix has a dimension of numberOfDocs by numberOfTermsInCorpus</i>

In [10]:
# one row per document, one column per vocabulary term (sorted alphabetically)
docsTFMat = np.zeros((len(corpus), len(vocab)))
docTermDf = pd.DataFrame(docsTFMat, columns=sorted(vocab.keys()))

for docCount, doc in enumerate(corpus):
    doc = doc.translate(str.maketrans('', '', string.punctuation))
    words = word_tokenize(doc.lower())
    for word in words:
        if word in vocab:
            # .loc avoids chained assignment, which pandas may apply to a temporary copy
            docTermDf.loc[docCount, word] += 1
docTermDf

Unnamed: 0,also,anyone,became,behold,best,born,bought,brought,builded,built,cheer,come,could,crazy,daughters,days,decided,delights,desired,doeth,earth,ecclesiastes,enjoy,enjoyed,ever,eyes,famous,find,flesh,flocks,flower,folly,foolish,foreign,forest,fruit,full,fun,gardens,gat,gathered,gave,get,goats,gold,good,got,great,guiding,happy,heart,heaven,herds,hold,house,houses,increased,instruments,jerusalem,joy,...,might,mine,mirth,musical,orchards,owned,parks,person,planted,pleasure,pools,portion,possessions,precious,prove,provinces,really,reared,rejoiced,remained,rulers,said,sang,searched,see,sense,servants,sheep,short,silver,singers,slaves,sons,sorts,thee,therefore,therefrom,things,till,time,treasure,treasures,trees,us,vanity,vineyards,wanted,water,whatever,whatsoever,wine,wisdom,wise,withheld,without,wives,women,work,works,yet
0,4.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,3.0,1.0,0.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,...,1.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,2.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,2.0,0.0,1.0,1.0,0.0,3.0,0.0,0.0,1.0,2.0,0.0,2.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,0.0,2.0,0.0,1.0,1.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0
1,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,0.0,1.0,2.0,0.0,0.0,1.0,0.0,2.0,1.0,0.0,1.0,1.0,2.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,2.0,1.0,0.0,1.0,2.0,1.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0


#### Computing idf for each word in the vocab

In [11]:
# smoothed idf, the same form scikit-learn uses with smooth_idf=True:
# ln((N + 1) / (1 + df)) + 1, where df is the number of documents containing the term
idfDict = {}
for column in docTermDf.columns:
    docFreq = (docTermDf[column] != 0).sum()
    idfDict[column] = np.log((len(corpus) + 1) / (1 + docFreq)) + 1

# Compute the tf-idf matrix
docsTfIdfMat = np.zeros((len(corpus), len(vocab)))
docTfIdfDf = pd.DataFrame(docsTfIdfMat, columns=sorted(vocab.keys()))

for docCount in range(len(corpus)):
    for key in idfDict:
        docTfIdfDf.loc[docCount, key] = docTermDf.loc[docCount, key] * idfDict[key]

docTfIdfDf

Unnamed: 0,also,anyone,became,behold,best,born,bought,brought,builded,built,cheer,come,could,crazy,daughters,days,decided,delights,desired,doeth,earth,ecclesiastes,enjoy,enjoyed,ever,eyes,famous,find,flesh,flocks,flower,folly,foolish,foreign,forest,fruit,full,fun,gardens,gat,gathered,gave,get,goats,gold,good,got,great,guiding,happy,heart,heaven,herds,hold,house,houses,increased,instruments,jerusalem,joy,...,might,mine,mirth,musical,orchards,owned,parks,person,planted,pleasure,pools,portion,possessions,precious,prove,provinces,really,reared,rejoiced,remained,rulers,said,sang,searched,see,sense,servants,sheep,short,silver,singers,slaves,sons,sorts,thee,therefore,therefrom,things,till,time,treasure,treasures,trees,us,vanity,vineyards,wanted,water,whatever,whatsoever,wine,wisdom,wise,withheld,without,wives,women,work,works,yet
0,5.62186,0.0,0.0,1.405465,0.0,1.405465,1.405465,0.0,1.405465,0.0,1.405465,1.405465,0.0,0.0,0.0,1.405465,0.0,1.405465,1.405465,1.405465,0.0,1.0,1.0,0.0,0.0,1.405465,0.0,0.0,1.405465,1.405465,0.0,1.405465,0.0,0.0,1.405465,1.0,0.0,0.0,1.0,1.405465,1.405465,0.0,0.0,0.0,1.0,1.0,0.0,3.0,1.405465,0.0,7.027326,1.405465,1.405465,1.405465,1.405465,1.0,1.405465,1.405465,2.0,1.405465,...,1.405465,1.405465,2.81093,1.405465,0.0,0.0,1.405465,0.0,2.0,1.0,1.0,1.405465,1.405465,0.0,1.405465,1.405465,0.0,1.405465,1.405465,1.405465,0.0,2.0,0.0,1.405465,1.405465,0.0,4.216395,0.0,0.0,1.0,2.81093,0.0,2.0,1.405465,1.405465,1.405465,1.405465,0.0,1.405465,0.0,1.405465,0.0,2.0,0.0,1.405465,1.0,0.0,2.0,0.0,1.405465,1.0,2.81093,0.0,1.405465,0.0,0.0,1.0,0.0,1.405465,1.405465
1,0.0,1.405465,1.405465,0.0,1.405465,0.0,0.0,1.405465,0.0,1.405465,0.0,0.0,1.405465,1.405465,1.405465,0.0,1.405465,0.0,0.0,0.0,1.405465,1.0,1.0,1.405465,2.81093,0.0,1.405465,2.81093,0.0,0.0,1.405465,0.0,2.81093,1.405465,0.0,1.0,1.405465,2.81093,1.0,0.0,0.0,1.405465,1.405465,1.405465,1.0,1.0,1.405465,2.0,0.0,2.81093,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,1.405465,1.405465,0.0,1.405465,1.0,1.0,1.0,0.0,0.0,1.405465,0.0,0.0,1.405465,0.0,0.0,0.0,1.405465,1.0,1.405465,0.0,0.0,1.405465,0.0,1.405465,1.405465,1.0,0.0,2.81093,1.0,0.0,0.0,0.0,0.0,1.405465,0.0,1.405465,0.0,1.405465,2.0,1.405465,0.0,1.0,2.81093,1.0,2.81093,0.0,1.0,0.0,1.405465,0.0,1.405465,1.405465,1.0,1.405465,0.0,0.0
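
The recurring values in the table follow directly from the smoothed idf used in the cell above, ln((N + 1) / (1 + df)) + 1 with N = 2 documents. The short check below reproduces them; the helper name `smoothed_idf` is introduced here for illustration only.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf: ln((N + 1) / (1 + df)) + 1."""
    return math.log((n_docs + 1) / (1 + doc_freq)) + 1

idf_one_doc = smoothed_idf(2, 1)    # term appears in only one of the two documents
idf_both_docs = smoothed_idf(2, 2)  # term appears in both documents

print(round(idf_one_doc, 6))        # 1.405465
print(round(idf_both_docs, 6))      # 1.0
print(round(4 * idf_one_doc, 6))    # 5.62186, the entry for 'also' (tf = 4) in document 0
print(round(5 * idf_one_doc, 6))    # 7.027326, the entry for 'heart' (tf = 5) in document 0
```

Because of the trailing + 1, a word shared by both documents keeps a weight equal to its raw count instead of dropping to zero, which is how this smoothed variant differs from the plain log(N/df) definition.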


## Using TfidfVectorizer

In [12]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
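# norm=None keeps the raw tf * idf products so they can be compared with the manual matrix above
vectorizer = TfidfVectorizer(analyzer='word', norm=None, use_idf=True, smooth_idf=True)
tfIdfMat = vectorizer.fit_transform(corpus)
# feature names are returned in column order, matching the columns of tfIdfMat
feature_names = vectorizer.get_feature_names_out()
docList = ['test_doc1', 'test_doc2']
skDocsTfIdfdf = pd.DataFrame(tfIdfMat.toarray(), index=docList, columns=feature_names)
skDocsTfIdfdf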

Unnamed: 0,also,anyone,became,behold,best,born,bought,brought,builded,built,cheer,come,could,crazy,daughters,days,decided,delights,desired,doeth,earth,ecclesiastes,enjoy,enjoyed,ever,eyes,famous,find,flesh,flocks,flower,folly,foolish,foreign,forest,fruit,full,fun,gardens,gat,gathered,gave,get,goats,gold,good,got,great,guiding,happy,heart,heaven,herds,hold,house,houses,increased,instruments,jerusalem,joy,...,might,mine,mirth,musical,orchards,owned,parks,person,planted,pleasure,pools,portion,possessions,precious,prove,provinces,really,reared,rejoiced,remained,rulers,said,sang,searched,see,sense,servants,sheep,short,silver,singers,slaves,sons,sorts,thee,therefore,therefrom,things,till,time,treasure,treasures,trees,us,vanity,vineyards,wanted,water,whatever,whatsoever,wine,wisdom,wise,withheld,without,wives,women,work,works,yet
test_doc1,5.62186,0.0,0.0,1.405465,0.0,1.405465,1.405465,0.0,1.405465,0.0,1.405465,1.405465,0.0,0.0,0.0,1.405465,0.0,1.405465,1.405465,1.405465,0.0,1.0,1.0,0.0,0.0,1.405465,0.0,0.0,1.405465,1.405465,0.0,1.405465,0.0,0.0,1.405465,1.0,0.0,0.0,1.0,1.405465,1.405465,0.0,0.0,0.0,1.0,1.0,0.0,3.0,1.405465,0.0,7.027326,1.405465,1.405465,1.405465,1.405465,1.0,1.405465,1.405465,2.0,1.405465,...,1.405465,1.405465,2.81093,1.405465,0.0,0.0,1.405465,0.0,2.0,1.0,1.0,1.405465,1.405465,0.0,1.405465,1.405465,0.0,1.405465,1.405465,1.405465,0.0,2.0,0.0,1.405465,1.405465,0.0,4.216395,0.0,0.0,1.0,2.81093,0.0,2.0,1.405465,1.405465,1.405465,1.405465,0.0,1.405465,0.0,1.405465,0.0,2.0,0.0,1.405465,1.0,0.0,2.0,0.0,1.405465,1.0,2.81093,0.0,1.405465,0.0,0.0,1.0,0.0,1.405465,1.405465
test_doc2,0.0,1.405465,1.405465,0.0,1.405465,0.0,0.0,1.405465,0.0,1.405465,0.0,0.0,1.405465,1.405465,1.405465,0.0,1.405465,0.0,0.0,0.0,1.405465,1.0,1.0,1.405465,2.81093,0.0,1.405465,2.81093,0.0,0.0,1.405465,0.0,2.81093,1.405465,0.0,1.0,1.405465,2.81093,1.0,0.0,0.0,1.405465,1.405465,1.405465,1.0,1.0,1.405465,2.0,0.0,2.81093,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,1.405465,1.405465,0.0,1.405465,1.0,1.0,1.0,0.0,0.0,1.405465,0.0,0.0,1.405465,0.0,0.0,0.0,1.405465,1.0,1.405465,0.0,0.0,1.405465,0.0,1.405465,1.405465,1.0,0.0,2.81093,1.0,0.0,0.0,0.0,0.0,1.405465,0.0,1.405465,0.0,1.405465,2.0,1.405465,0.0,1.0,2.81093,1.0,2.81093,0.0,1.0,0.0,1.405465,0.0,1.405465,1.405465,1.0,1.405465,0.0,0.0


#### Computing the cosine similarity

In [14]:
csim = cosine_similarity(tfIdfMat, tfIdfMat)
csimDf = pd.DataFrame(csim,index=sorted(docList),columns=sorted(docList))
csimDf

Unnamed: 0,test_doc1,test_doc2
test_doc1,1.0,0.172404
test_doc2,0.172404,1.0


### Exporting results to Excel sheets

In [15]:
transpoTfIdf = skDocsTfIdfdf.T
transpoTfIdf

Unnamed: 0,test_doc1,test_doc2
also,5.621860,0.000000
anyone,0.000000,1.405465
became,0.000000,1.405465
behold,1.405465,0.000000
best,0.000000,1.405465
...,...,...
wives,0.000000,1.405465
women,1.000000,1.000000
work,0.000000,1.405465
works,1.405465,0.000000


In [16]:
# export the tf-idf matrix
# transpoTfIdf.to_excel('TF IDF Result.xlsx')

In [17]:
# export the cosine similarity
# csimDf.to_excel('test_doc1 compared to test_doc2.xlsx')