# Homework 2

Introduction: The purpose of this homework will be to examine the similarity of a number of articles contained in the data directory. Specifically, the intent is to implement different document similarity techniques and then see how similar these documents are using said techniques. Finally, we would like to present the similarity in a table or heatmap.

Step 1:
In order to begin this exercise, we will first need to iterate over the data directory and read the content of each file. We will store the data from each file in a hash.

In [1]:
from os import listdir
from os.path import isfile, join
path_to_homework_2_data_directory = "/Users/teacher/repos/s19_ds_nlp/homework_solutions/homework_2/data"

article_hash = {} # this hash should serve to represent the content of the files in the data directory
# use the filename as the hash key and the value will be the text of the file
# thus you would be able to retrieve an individual documents text like: article_hash["article_1"]

# here we will get a list of the filenames of things contained in the data directory
files = [f for f in listdir(path_to_homework_2_data_directory) if isfile(join(path_to_homework_2_data_directory, f))]

# here you will iterate over all the files contained in the directory
for file in files:
    file_location = join(path_to_homework_2_data_directory, file)
    file_text = open(file_location, 'r')
    article_hash[file] = file_text.read()
    # article_hash[filename_variable] = the content of the file


Step 2:
Text processing. Now that you have the content of the files read into a hash you will be able to process them. Specifically, you should perhaps employ sentence segmentation, tokenization, and stemming to get a new representation of the document. Here you will want to build a sufficiently flexible approach so that you can try out several different pre-processing strategies to see how it affects your similarity scores.

We'll create a new hash that contains the processed text.

In [13]:
import spacy
nlp = spacy.load("en_core_web_sm")
# setup a new hash to store the results in
processed_article_hash = {}

# iterate through the keys, i.e. document ids, in the hash to pull out the stored text and process
for key in article_hash.keys():
    text_of_article = article_hash[key]
    tokens = nlp(text_of_article)
    token_lemmas = [token.lemma_ for token in tokens if token.is_alpha]
    
    
    processed_article_hash[key] = token_lemmas

Step 3:
Implement two similarity techniques. 
We would like to examine the Jacard Similarity and Cosine.

Jacardian Similarity: here we want to identify the set of words in two documents that overlap and then divide that by the count of unique words across both documents.

Cosine Similarity: Here we want to create vector representations for each document. Specifically, we want to come up with a vector that is based on the list of all words that occur across both documents. Then for each document we will create a vector that includes the counts of the number of time a word occurs in the document.

So if the document 1 is: "the ship sails at midnight" and document 2 is: "the crow flies at noon." We would be creating a vector like: [the, ship, sails, at, midnight, crow, flies, noon]. Then we would calculate the values of the vector for each document. For document 1: [1,1,1,1,1,0,0,0] and for document 2: [1,0,0,1,0,1,1,1]. With these two vectors we would simply take the dot product and that would provide the cosine similarity. 


In [27]:
def jacardian_distance(document_1_data, document_2_data):

    words_in_doc_1_not_in_doc_2 = set(document_1_data) - set(document_2_data)
    words_in_doc_2_not_in_doc_1 = set(document_2_data) - set(document_1_data)
    words_in_both_doc_1_and_doc_2 = set(document_1_data).union(set(document_2_data))
    
    jacardian = len(set(document_1_data).intersection(document_2_data))/(len(words_in_both_doc_1_and_doc_2))
    
    return jacardian

import numpy as np
def cosine_similarity(document_1_data, document_2_data):

    document_vector_word_index =     sorted(list(set(document_1_data).union(set(document_2_data))))
    document_1_vector = [document_1_data.count(word) for word in document_vector_word_index] # fill in the array with the frequency of the words in the document
    document_2_vector = [document_2_data.count(word) for word in document_vector_word_index] # fill in the array with the frequency of the words in the document
    
    return np.dot(document_1_vector, document_2_vector)/(np.linalg.norm(document_1_vector) * np.linalg.norm(document_2_vector)) 



Step 4:
Now that we have our two similarity measures, we want to examine each document relative to each other and calculate their similarity. 

So we will want to create two tables that show the document similarities using both techniques.

In [34]:
# create a variable to store your table data... you could use a hash or some other data structure. 
# We just want it to identify which document is being compared to which other document.

data_structure_for_jacard_similarity = {}
data_structure_for_cosine_similarity = {}

for doc_1_key in article_hash.keys():
    for doc_2_key in article_hash.keys():
        if doc_1_key not in data_structure_for_jacard_similarity.keys():
            data_structure_for_jacard_similarity[doc_1_key] = {}
        if doc_1_key not in data_structure_for_cosine_similarity.keys():
            data_structure_for_cosine_similarity[doc_1_key] = {}


        # we have the nested for loops as one way to compare each document to each other document
        doc_1_processed_text = article_hash[doc_1_key]
        doc_2_processed_text = article_hash[doc_2_key]
        data_structure_for_jacard_similarity[doc_1_key][doc_2_key] = jacardian_distance(doc_1_processed_text, doc_2_processed_text)
        data_structure_for_cosine_similarity[doc_1_key][doc_2_key] = cosine_similarity(doc_1_processed_text, doc_2_processed_text)


        
# finally, find some way to present this data back. Either as a straight table or a heatmap.
for doc_key in article_hash.keys():
    print("VALUES FOR:")
    print(doc_key)
    print(data_structure_for_jacard_similarity[doc_key])
    print(data_structure_for_cosine_similarity[doc_key])




VALUES FOR:
article_10
{'article_10': 1.0, 'article_11': 0.7619047619047619, 'article_3': 0.6944444444444444, 'article_4': 0.7808219178082192, 'article_5': 0.7361111111111112, 'article_2': 0.72, 'article_12': 0.7837837837837838, 'article_9': 0.75, 'article_7': 0.7837837837837838, 'article_1': 0.6883116883116883, 'article_6': 0.7777777777777778, 'article_8': 0.7536231884057971}
{'article_10': 1.0000000000000002, 'article_11': 0.9947009765583739, 'article_3': 0.9912794528428056, 'article_4': 0.988660431028577, 'article_5': 0.992271240801478, 'article_2': 0.9927117012038547, 'article_12': 0.9935793787694568, 'article_9': 0.9893410648010433, 'article_7': 0.9903740593563779, 'article_1': 0.9929176451935869, 'article_6': 0.9935983167439728, 'article_8': 0.9935447426962715}
VALUES FOR:
article_11
{'article_10': 0.7619047619047619, 'article_11': 1.0, 'article_3': 0.746268656716418, 'article_4': 0.6233766233766234, 'article_5': 0.6438356164383562, 'article_2': 0.6986301369863014, 'article_12': 

Step 5:
You should now have two different similarity mechanisms. What do your results suggest? From perusing the documents, do you think the suggested ones are similar or not? Does tokenization, stemming, stop word removal or anything else improve your results?

Write a brief description of your reactions to identifying these similar documents and what measures and pre-processing steps you think worked best.

Bonus: In the above similarity measures, you have only put tokens into the computed similarity. What if you added named entities? 

For the bonus try adding named entities as distinct features into the similarity calculation. Here you could use Spacy and just get the entities off each document. One thing to keep in mind is that entities are chunks of text and not tokens, so you will want to put the whole entity text in as a separate feature for similarity. Another thing to consider is that since there are multiple types of entities you might also include the entity type along with the entity when you do your similarity comparisons (you could just concatenate the entity type and the entity text as a single token into your similarity comparisons).

While this bonus question might seem tricky, it actually will allow for very unsophisticated implementations. I'll aim to discuss a similar technique in class.

<-- put your comments here -->