Document Similarity



---

**Importing and downloading the required packages**

In [None]:
import urllib.request
import nltk
import string
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
from nltk import sent_tokenize,word_tokenize
from nltk.corpus import wordnet,stopwords
from nltk.tokenize import PunktSentenceTokenizer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


**Reading the 08 documents as text files**

Note: The text files given in the assignment were uploaded to my personal GitHub repository so that this Colab notebook can read them.

In [None]:
files=[] #Array declared to hold each text document

In [None]:
#READING DOCUMENT 01 AND APPENDING IT TO THE FILES[] ARRAY

url1 = "https://raw.githubusercontent.com/Yashithi98/Document-Similarity---python/main/Dataset/Text%20Files/doc%201.txt"
filename_doc1 = "doc1.txt"
urllib.request.urlretrieve(url1, filename_doc1)
doc1_object = open('doc1.txt', 'r') #File is opened as an object in 'read only' mode
doc1_text = doc1_object.read()    
  
files.append(doc1_text)             #File is read and added to the array

In [None]:
#READING DOCUMENT 02 AND APPENDING IT TO THE FILES[] ARRAY

url2 = "https://raw.githubusercontent.com/Yashithi98/Document-Similarity---python/main/Dataset/Text%20Files/doc%202.txt"
filename_doc2 = "doc2.txt"
urllib.request.urlretrieve(url2, filename_doc2)
doc2_object = open('doc2.txt', 'r')
doc2_text = doc2_object.read() 

files.append(doc2_text)

In [None]:
#READING DOCUMENT 03 AND APPENDING IT TO THE FILES[] ARRAY

url3 = "https://raw.githubusercontent.com/Yashithi98/Document-Similarity---python/main/Dataset/Text%20Files/doc%203.txt"
filename_doc3 = "doc3.txt"
urllib.request.urlretrieve(url3, filename_doc3)
doc3_object = open('doc3.txt', 'r')
doc3_text = doc3_object.read() 

files.append(doc3_text)

In [None]:
#READING DOCUMENT 04 AND APPENDING IT TO THE FILES[] ARRAY

url4 = "https://raw.githubusercontent.com/Yashithi98/Document-Similarity---python/main/Dataset/Text%20Files/doc%204.txt"
filename_doc4 = "doc4.txt"
urllib.request.urlretrieve(url4, filename_doc4)
doc4_object = open('doc4.txt', 'r')
doc4_text = doc4_object.read() 

files.append(doc4_text)

In [None]:
#READING DOCUMENT 05 AND APPENDING IT TO THE FILES[] ARRAY
url5 = "https://raw.githubusercontent.com/Yashithi98/Document-Similarity---python/main/Dataset/Text%20Files/doc%205.txt"
filename_doc5 = "doc5.txt"
urllib.request.urlretrieve(url5, filename_doc5)
doc5_object = open('doc5.txt', 'r')
doc5_text = doc5_object.read() 

files.append(doc5_text)

In [None]:
#READING DOCUMENT 06 AND APPENDING IT TO THE FILES[] ARRAY
url6 = "https://raw.githubusercontent.com/Yashithi98/Document-Similarity---python/main/Dataset/Text%20Files/doc%206.txt"
filename_doc6 = "doc6.txt"
urllib.request.urlretrieve(url6, filename_doc6)
doc6_object = open('doc6.txt', 'r')
doc6_text = doc6_object.read() 

files.append(doc6_text)

In [None]:
#READING DOCUMENT 07 AND APPENDING IT TO THE FILES[] ARRAY
url7 = "https://raw.githubusercontent.com/Yashithi98/Document-Similarity---python/main/Dataset/Text%20Files/doc%207.txt"
filename_doc7 = "doc7.txt"
urllib.request.urlretrieve(url7, filename_doc7)
doc7_object = open('doc7.txt', 'r')
doc7_text = doc7_object.read() 

files.append(doc7_text)

In [None]:
#READING DOCUMENT 08 AND APPENDING IT TO THE FILES[] ARRAY

url8 = "https://raw.githubusercontent.com/Yashithi98/Document-Similarity---python/main/Dataset/Text%20Files/doc%208.txt"
filename_doc8 = "doc8.txt"
urllib.request.urlretrieve(url8, filename_doc8)
doc8_object = open('doc8.txt', 'r')
doc8_text = doc8_object.read() 

files.append(doc8_text)
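
The eight cells above repeat the same download-and-read steps, so they could be collapsed into a single loop. The following is a minimal sketch, assuming the repository layout shown in the URLs above stays the same; `doc_url` and `read_documents` are hypothetical helper names that do not appear in the original notebook.

```python
import urllib.request

# Common prefix of the eight raw GitHub URLs used in the cells above
BASE_URL = ("https://raw.githubusercontent.com/Yashithi98/"
            "Document-Similarity---python/main/Dataset/Text%20Files/")

def doc_url(number):
    """Build the raw GitHub URL for document `number` (1 to 8)."""
    return BASE_URL + "doc%20" + str(number) + ".txt"

def read_documents(count=8):
    """Download each document and return the texts as a list of strings."""
    texts = []
    for number in range(1, count + 1):
        filename = "doc" + str(number) + ".txt"
        urllib.request.urlretrieve(doc_url(number), filename)
        # The with-block closes the file handle once the text has been read
        with open(filename, "r") as doc_object:
            texts.append(doc_object.read())
    return texts
```

Calling `files = read_documents()` would then fill the array in one step, and the `with` block also closes each file, which the individual cells do not do.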

**Defining the required parameters for preprocessing**

*   Snowball Stemmer - Used for stemming, which reduces inflected words to a common stem
*   Stop Words - Used to remove stop words such as 'the' and 'a' from the documents



In [None]:
snow_stemmer = SnowballStemmer(language='english')
stop_words=stopwords.words('english')

**Preprocessing**



1.   Eliminates stop words
2.   Eliminates punctuation tokens
3.   Converts all words to lowercase
4.   Stems each word



In [None]:
def preprocessing(word_doc):
  vec = []
  for word in word_doc:                     #Examines individual tokens within the document array
    if(word not in stop_words):             #Discards stop words (case-sensitive check, so 'The' is not matched by 'the')
      if(word not in string.punctuation):   #Discards punctuation tokens such as ',' or '.'
        word = word.lower()                 #Converts to lowercase
        word = snow_stemmer.stem(word)      #Performs stemming
        vec.append(word)
  return vec                                #Returns the preprocessed word array
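
One detail of the function above is worth illustrating: the stop word test runs before lowercasing, and NLTK's English stop word list is all lowercase, so a capitalized token such as 'The' is not filtered out. The sketch below compares the two orderings on a toy input. The name `filter_tokens`, the three-word stop list, and the sample tokens are hypothetical, and stemming is left out to keep the example free of dependencies.

```python
def filter_tokens(tokens, stop_words, lowercase_first):
    """Drop stop words, optionally lowercasing before the membership test."""
    kept = []
    for token in tokens:
        candidate = token.lower() if lowercase_first else token
        if candidate not in stop_words:
            kept.append(token.lower())
    return kept

# Tiny hypothetical stop list, standing in for stopwords.words('english')
demo_stop_words = ["the", "a", "in"]
demo_tokens = ["The", "storm", "hit", "the", "coast"]

case_sensitive = filter_tokens(demo_tokens, demo_stop_words, lowercase_first=False)
lower_first = filter_tokens(demo_tokens, demo_stop_words, lowercase_first=True)
print(case_sensitive)  # ['the', 'storm', 'hit', 'coast']
print(lower_first)     # ['storm', 'hit', 'coast']
```

Lowercasing first removes more stop words, but switching to that order would change the TF-IDF and similarity figures reported in this notebook.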

**Building the final preprocessed corpus, ready for the document similarity calculations**

In [None]:
def build_corpus(files):
  corpus = []
  for file in files:
    tokenized_words = word_tokenize(file)                 #Tokenizes the document into words
    preprocessed_words = preprocessing(tokenized_words)   #Passes these individual words through the preprocessing function
    corpus.append(' '.join(preprocessed_words))           #Rejoins the individual words into a single string
  
  return corpus

In [None]:
corpus = build_corpus(files)                              #Calls the function defined above

**Calculating TF-IDF for each word in each document**

In [None]:
vectorizer = TfidfVectorizer()
TFIDF_matrix = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()   #get_feature_names() was removed in scikit-learn 1.2
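
To make the numbers in the table below easier to interpret, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document row. The function name `tfidf_rows` and the toy documents are hypothetical, and the sketch takes pre-tokenized input rather than reproducing the vectorizer's own tokenization.

```python
import math

def tfidf_rows(tokenized_docs):
    """Return the vocabulary and one L2-normalized TF-IDF row per document."""
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    df = {term: sum(term in doc for doc in tokenized_docs) for term in vocabulary}
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    rows = []
    for doc in tokenized_docs:
        # Raw term count multiplied by the term's idf
        weights = [doc.count(term) * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

# Hypothetical toy corpus of already preprocessed tokens
toy_docs = [["storm", "coast", "storm"], ["storm", "restaur"], ["restaur", "open"]]
vocab, rows = tfidf_rows(toy_docs)
for doc_row in rows:
    print([round(w, 3) for w in doc_row])
```

In the third toy document, 'open' appears in one document and 'restaur' in two, so 'open' receives the larger weight even though both occur once. The same effect explains why terms confined to a single document stand out in the table.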



In [None]:
#Printing the results of the calculations
print("\nTF-IDF values for each term in their corresponding document\n\n")
print("\t\t Doc1 \t Doc2 \t Doc3 \t Doc4 \t Doc5 \t Doc6 \t Doc7 \t Doc8")
for idx, row in enumerate(TFIDF_matrix.toarray().transpose()):
	#One line per term: the term, then its rounded weight in each of the 8 documents
	print(terms[idx] + '\t\t' + '\t'.join(str(round(value,3)) for value in row) + '\t')


TF-IDF values for each term in their corresponding document


		 Doc1 	 Doc2 	 Doc3 	 Doc4 	 Doc5 	 Doc6 	 Doc7 	 Doc8
000		0.072	0.022	0.076	0.0	0.0	0.0	0.05	0.0	
10		0.082	0.026	0.0	0.0	0.0	0.0	0.028	0.0	
100		0.0	0.051	0.043	0.029	0.0	0.0	0.0	0.0	
11		0.0	0.0	0.0	0.058	0.043	0.043	0.0	0.0	
110		0.0	0.089	0.0	0.0	0.0	0.0	0.0	0.034	
115		0.0	0.0	0.0	0.0	0.0	0.0	0.079	0.0	
12		0.0	0.0	0.06	0.0	0.0	0.0	0.0	0.0	
125		0.0	0.0	0.06	0.0	0.0	0.0	0.0	0.0	
14		0.076	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
140		0.0	0.059	0.05	0.0	0.0	0.0	0.0	0.0	
15		0.032	0.0	0.05	0.0	0.0	0.0	0.0	0.0	
16		0.027	0.0	0.043	0.029	0.0	0.0	0.0	0.0	
17		0.0	0.026	0.0	0.0	0.0	0.043	0.028	0.0	
18		0.0	0.0	0.0	0.0	0.0	0.119	0.0	0.0	
1982		0.0	0.0	0.0	0.034	0.0	0.05	0.0	0.0	
1988		0.0	0.0	0.043	0.029	0.0	0.043	0.0	0.0	
20		0.055	0.0	0.0	0.029	0.0	0.086	0.0	0.0	
200		0.0	0.0	0.06	0.0	0.0	0.0	0.0	0.0	
22		0.0	0.0	0.0	0.034	0.05	0.0	0.0	0.0	
25		0.027	0.026	0.0	0.0	0.0	0.0	0.057	0.0	
250		0.0	0.03	0.0	0.034	0.0	0.0	0.0	0.0	
26		0.0	

**Calculating Cosine Similarity**

In [None]:
cos_sim_matrix = cosine_similarity(TFIDF_matrix )
print("\nCosine Similarity for each Document-Document relation (Rounded to 3 decimal points for convenience)\n")
print("\t\t Doc1 \t Doc2 \t Doc3 \t Doc4 \t Doc5 \t Doc6 \t Doc7 \t Doc8")
for idx, row in enumerate(cos_sim_matrix):
	#One line per document: its rounded similarity to each of the 8 documents
	print('Doc' + str(idx+1) + '\t\t' + '\t'.join(str(round(value,3)) for value in row) + '\t')


Cosine Similarity for each Document-Document relation (Rounded to 3 decimal points for convenience)

		 Doc1 	 Doc2 	 Doc3 	 Doc4 	 Doc5 	 Doc6 	 Doc7 	 Doc8
Doc1		1.0	0.077	0.049	0.106	0.075	0.087	0.083	0.371	
Doc2		0.077	1.0	0.518	0.132	0.071	0.129	0.715	0.066	
Doc3		0.049	0.518	1.0	0.09	0.049	0.084	0.446	0.037	
Doc4		0.106	0.132	0.09	1.0	0.481	0.547	0.137	0.11	
Doc5		0.075	0.071	0.049	0.481	1.0	0.246	0.102	0.073	
Doc6		0.087	0.129	0.084	0.547	0.246	1.0	0.141	0.09	
Doc7		0.083	0.715	0.446	0.137	0.102	0.141	1.0	0.079	
Doc8		0.371	0.066	0.037	0.11	0.073	0.09	0.079	1.0	
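
For reference, each entry of the matrix above is the dot product of two document vectors divided by the product of their lengths. A minimal sketch of that calculation for a single pair of vectors follows; the function name `cosine` is hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no direction to compare against
    return dot / (norm_u * norm_v)

print(round(cosine([1, 2], [2, 4]), 3))  # same direction
print(round(cosine([1, 0], [0, 1]), 3))  # no shared terms
```

Because a vector always points in the same direction as itself, the diagonal of the matrix is 1.0, and because the formula does not depend on the order of its arguments, the matrix is symmetric.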


**Cosine Similarity as a percentage**

In [None]:
cos_sim_matrix = cosine_similarity(TFIDF_matrix)
print("Cosine Similarity for each Document-Document relation as a percentage (Rounded to 1 decimal place for convenience)")
print("\t\t Doc1 \t Doc2 \t Doc3 \t Doc4 \t Doc5 \t Doc6 \t Doc7 \t Doc8")
for idx, row in enumerate(cos_sim_matrix):
	print('Doc' + str(idx+1) + '\t\t' + '\t'.join(str(round(value*100,1)) + '%' for value in row) + '\t')

Cosine Similarity for each Document-Document relation as a percentage (Rounded to 1 decimal place for convenience)
		 Doc1 	 Doc2 	 Doc3 	 Doc4 	 Doc5 	 Doc6 	 Doc7 	 Doc8
Doc1		100.0%	7.7%	4.9%	10.6%	7.5%	8.7%	8.3%	37.1%	
Doc2		7.7%	100.0%	51.8%	13.2%	7.1%	12.9%	71.5%	6.6%	
Doc3		4.9%	51.8%	100.0%	9.0%	4.9%	8.4%	44.6%	3.7%	
Doc4		10.6%	13.2%	9.0%	100.0%	48.1%	54.7%	13.7%	11.0%	
Doc5		7.5%	7.1%	4.9%	48.1%	100.0%	24.6%	10.2%	7.3%	
Doc6		8.7%	12.9%	8.4%	54.7%	24.6%	100.0%	14.1%	9.0%	
Doc7		8.3%	71.5%	44.6%	13.7%	10.2%	14.1%	100.0%	7.9%	
Doc8		37.1%	6.6%	3.7%	11.0%	7.3%	9.0%	7.9%	100.0%	
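
One way to read the matrix is to look up, for each document, the other document it resembles most. The sketch below does this with the rounded similarity values printed earlier, copied in by hand; `closest_other` is a hypothetical helper name, and in the notebook itself `cos_sim_matrix` could be passed in directly.

```python
# Rounded cosine similarities copied from the output above
similarity = [
    [1.0, 0.077, 0.049, 0.106, 0.075, 0.087, 0.083, 0.371],
    [0.077, 1.0, 0.518, 0.132, 0.071, 0.129, 0.715, 0.066],
    [0.049, 0.518, 1.0, 0.09, 0.049, 0.084, 0.446, 0.037],
    [0.106, 0.132, 0.09, 1.0, 0.481, 0.547, 0.137, 0.11],
    [0.075, 0.071, 0.049, 0.481, 1.0, 0.246, 0.102, 0.073],
    [0.087, 0.129, 0.084, 0.547, 0.246, 1.0, 0.141, 0.09],
    [0.083, 0.715, 0.446, 0.137, 0.102, 0.141, 1.0, 0.079],
    [0.371, 0.066, 0.037, 0.11, 0.073, 0.09, 0.079, 1.0],
]

def closest_other(matrix, index):
    """Index of the most similar document, ignoring the document itself."""
    best = None
    for other, score in enumerate(matrix[index]):
        if other == index:
            continue  # skip the diagonal, which is always 1.0
        if best is None or score > matrix[index][best]:
            best = other
    return best

for i in range(len(similarity)):
    j = closest_other(similarity, i)
    print("Doc" + str(i + 1) + " -> Doc" + str(j + 1) + " (" + str(similarity[i][j]) + ")")
```

On these values the documents fall into three groups: Doc1 and Doc8 point to each other, Doc2, Doc3 and Doc7 point within their own group, and so do Doc4, Doc5 and Doc6.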




**DETERMINING WHICH TITLE BEST SUITS EACH ARTICLE**

In [None]:
titles = ["Hurricane Gilbert Heads Toward Dominican Coast","IRA terrorist attack","McDonalds Opens First Restaurant in China"] #Defining all titles
titles = build_corpus(titles) #Preprocessing titles

**Function to calculate TF-IDF & Cosine Similarity for each title against each document**

In [None]:
def title_to_document_matching(index):                        
    docA_corpus = []
    for title in titles:
      docA_corpus.append(title)                               #Creating an array with the titles and the document
    docA_corpus.append(corpus[index])

    vectorizer = TfidfVectorizer()
    TFIDF_matrixA = vectorizer.fit_transform(docA_corpus)     #Calculating TF-IDF
    terms = vectorizer.get_feature_names_out()                #get_feature_names() was removed in scikit-learn 1.2
    cos_sim_matrixA = cosine_similarity(TFIDF_matrixA)        #Calculating Cosine Similarity

    print("\t\t Title 1 \t Title 2 \t Title 3")
    print( 'Percentage (%) ' + '\t' + str(round((cos_sim_matrixA[3][0])*100,1)) + '\t\t'  + str(round((cos_sim_matrixA[3][1])*100,1)) + '\t\t' + str(round((cos_sim_matrixA[3][2])*100,1))+"\n")

**Applying the above-defined function to all 08 documents**

In [None]:
for i in range(8):
  print("\nDocument 0"+str(i+1))
  title_to_document_matching(i)


Document 01
		 Title 1 	 Title 2 	 Title 3
Percentage (%) 	0.0		0.0		25.4


Document 02
		 Title 1 	 Title 2 	 Title 3
Percentage (%) 	24.0		0.0		0.0


Document 03
		 Title 1 	 Title 2 	 Title 3
Percentage (%) 	25.3		0.0		1.3


Document 04
		 Title 1 	 Title 2 	 Title 3
Percentage (%) 	0.0		14.1		0.8


Document 05
		 Title 1 	 Title 2 	 Title 3
Percentage (%) 	1.1		0.0		1.2


Document 06
		 Title 1 	 Title 2 	 Title 3
Percentage (%) 	1.1		6.8		0.0


Document 07
		 Title 1 	 Title 2 	 Title 3
Percentage (%) 	22.4		0.0		0.8


Document 08
		 Title 1 	 Title 2 	 Title 3
Percentage (%) 	0.0		1.3		33.6
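
The best-suited title for each document is simply the column with the highest percentage. A minimal sketch using the percentages printed above, copied in by hand; the names `title_scores` and `best_titles` are hypothetical.

```python
# Title-to-document percentages copied from the output above
# (one row per document, columns are Title 1, Title 2, Title 3)
title_scores = [
    [0.0, 0.0, 25.4],
    [24.0, 0.0, 0.0],
    [25.3, 0.0, 1.3],
    [0.0, 14.1, 0.8],
    [1.1, 0.0, 1.2],
    [1.1, 6.8, 0.0],
    [22.4, 0.0, 0.8],
    [0.0, 1.3, 33.6],
]

# 1-based number of the highest-scoring title for each document
best_titles = [row.index(max(row)) + 1 for row in title_scores]

for doc_number, title_number in enumerate(best_titles, start=1):
    print("Document 0" + str(doc_number) + " -> Title " + str(title_number))
```

Most documents have one clearly dominant title, but Document 05 scores 1.2 for Title 3 against 1.1 for Title 1, so the highest column on its own is a weak signal there.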



