1. Find the **most similar sentences or documents in your dataset using word count, TF-IDF, and word-embeddings** as your vectorization techniques. If the computation is slow, **you may subsample** for only a few thousand rows. (2 pts)



* Our dataset contains submissions to the worldnews subreddit on Reddit. We obtained it from https://www.kaggle.com/rootuser/worldnews-on-reddit.

* The dataset contains the following columns:
    1. time_created - a Unix timestamp of the submission creation date
    2. date_created - the creation date in %Y-%m-%d format
    3. up_votes - how often the submission was upvoted
    4. down_votes - how often the submission was downvoted (always 0 in this dataset)
    5. title - the title of the submission
    6. over_18 - whether the submission is marked as mature content
    7. author - the Reddit username of the author
    8. subreddit - this is always 'worldnews'

* In our case, "title" is the only text field. To find the most similar titles, we apply word count, TF-IDF, and word embeddings as our vectorization techniques and compare titles by cosine similarity.

* Of the three techniques, we find word embeddings the most powerful. Rather than only measuring how many words two titles share, embeddings also capture the meaning of those words. For example, the titles **"australian rodent named 1st mammal go extinct due human-caused climate change"** and **"australian rodent first mammal become extinct due climate change  scientist say"** are almost the same. However, because they do not share exactly the same words, word count and TF-IDF do not fully capture their similarity: word count gives them 0.61 and TF-IDF gives them 0.45. In contrast, because word embeddings take word meaning into account, they give 0.94, which is closer to our expectation.
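
To make the word-count figure concrete, the following is a minimal sketch that recomputes the cosine similarity of the two example titles by hand using only the standard library. The helper names `count_vector` and `cosine` are ours, and the tokenizer is only an approximation of `CountVectorizer`'s default pattern, so treat it as an illustration rather than the notebook's exact pipeline.

```python
import math
import re
from collections import Counter

def count_vector(text):
    # Tokens of two or more word characters, similar to CountVectorizer's default pattern
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(u, v):
    # Dot product over shared tokens, divided by the product of the vector lengths
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

title_a = "australian rodent named 1st mammal go extinct due human-caused climate change"
title_b = "australian rodent first mammal become extinct due climate change  scientist say"

vec_a, vec_b = count_vector(title_a), count_vector(title_b)
shared = sorted(vec_a.keys() & vec_b.keys())
similarity = cosine(vec_a, vec_b)

print(shared)
print(round(similarity, 2))  # 7 shared tokens / sqrt(12 * 11), roughly 0.61
```

The two titles share 7 tokens out of 12 and 11, so near-synonyms such as "1st" and "first" contribute nothing to the score. That is the gap word embeddings are meant to close.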

In [1]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("reddit_worldnews_start_to_2016-11-22.csv")

In [4]:
data.head()

| | time_created | date_created | up_votes | down_votes | title | over_18 | author | subreddit |
|---|---|---|---|---|---|---|---|---|
| 0 | 1201232046 | 2008-01-25 | 3 | 0 | Scores killed in Pakistan clashes | False | polar | worldnews |
| 1 | 1201232075 | 2008-01-25 | 2 | 0 | Japan resumes refuelling mission | False | polar | worldnews |
| 2 | 1201232523 | 2008-01-25 | 3 | 0 | US presses Egypt on Gaza border | False | polar | worldnews |
| 3 | 1201233290 | 2008-01-25 | 1 | 0 | Jump-start economy: Give health care to all | False | fadi420 | worldnews |
| 4 | 1201274720 | 2008-01-25 | 4 | 0 | Council of Europe bashes EU&UN terror blacklist | False | mhermans | worldnews |


In [5]:
len(data["title"])

509236

## Preprocessing

In [6]:
small_sample = data.sample(n = 5000).reset_index(drop=True)

In [7]:
titles = small_sample["title"].values

In [8]:
# Preprocess the titles (lowercasing, tokenizing, removing stopwords, lemmatization, stripping punctuation)
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string

stop_words = set(stopwords.words('english'))

lemmatizer = WordNetLemmatizer()

def preprocessing(titles):
    filtered_titles = []
    for title in titles:
        title = title.lower()
        token_list = word_tokenize(title) # Tokenize
        filtered_token = [t for t in token_list if t not in stop_words] # Remove stopwords
        for i in range(len(filtered_token)):
            filtered_token[i] = lemmatizer.lemmatize(filtered_token[i]).strip(string.punctuation) # Lemmatization
        filtered_titles.append(" ".join(filtered_token))
    return filtered_titles
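
One side effect of the function above is worth noting: a token that consists only of punctuation becomes an empty string after `.strip(string.punctuation)`, which is why the processed titles further down contain double spaces. The sketch below illustrates this with a hand-written token list (an assumption standing in for the output of `word_tokenize`) and shows a hypothetical variant, `strip_filter_and_join`, that drops the empty tokens.

```python
import string

def strip_and_join(tokens):
    """Strip punctuation from each token, as the preprocessing cell does, then join."""
    return " ".join(t.strip(string.punctuation) for t in tokens)

def strip_filter_and_join(tokens):
    """Hypothetical variant that also drops tokens left empty after stripping."""
    stripped = (t.strip(string.punctuation) for t in tokens)
    return " ".join(t for t in stripped if t)

# Token list assumed for illustration; in the notebook it would come from word_tokenize.
tokens = ["japan", "pm", "abe", ":", "way", "peace", "treaty"]

with_gap = strip_and_join(tokens)
without_gap = strip_filter_and_join(tokens)

print(repr(with_gap))     # 'japan pm abe  way peace treaty'
print(repr(without_gap))  # 'japan pm abe way peace treaty'
```

The extra spaces do not affect the vectorizers used here, since they ignore whitespace, so this only matters for how the titles are displayed.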

## Word Count

In [9]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

filtered_corpus = preprocessing(titles)
vectorizer = CountVectorizer()

# vectorize the corpus and generate similarity matrix
vector = vectorizer.fit_transform(filtered_corpus)
similarity_matrix = sklearn.metrics.pairwise.cosine_similarity(vector.toarray())

In [11]:
similarity_matrix

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [12]:
# Keep only the strict upper triangle (k=1) of the similarity matrix: this excludes each title's similarity with itself and the mirrored (j, i) duplicate of every (i, j) pair
tri_upper_no_diag = np.triu(similarity_matrix, k=1)

In [13]:
tri_upper_no_diag

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [15]:
# Get the index of 50 most similar title pairs
top_similar = 50
top_similar_index = np.unravel_index(np.argsort(tri_upper_no_diag.ravel())[-top_similar:], tri_upper_no_diag.shape)

In [16]:
top_similar_index

(array([ 456, 2006, 1560, 2298, 3832, 2685,  963,  673, 3034,  840, 2969,
         685, 2392, 3340,  706,   65,   65, 2380, 2066,   65,  324,  461,
          64, 3496, 2006, 1084, 1590,  963, 1530, 3856, 1136,  456,  456,
        2380, 1797,  685, 2351, 2361,  784,  685, 3418,  781,  932,  371,
        3052,  255,  178, 1817, 1636, 1599]),
 array([1560, 4968, 1797, 3525, 3960, 3328, 1518,  815, 3606, 2090, 3990,
        2935, 4387, 3396, 3037, 4076, 1797, 3021, 2722,  456, 3396,  569,
        2210, 4909, 4392, 3212, 4393, 2498, 3212, 4544, 3788, 4076, 1797,
        3766, 4076, 1084, 2672, 4981, 1255, 3212, 3420, 3727, 4520, 2954,
        3084,  538, 4917, 3997, 3139, 3888]))
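
The pair selection above can also be sketched on a small, made-up matrix so the indexing is easier to follow. This variant uses `np.triu_indices` to score only the pairs with `i < j` instead of zeroing the lower triangle. The helper `top_k_pairs` and the 4x4 matrix are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def top_k_pairs(similarity, k):
    """Return (i, j, score) for the k most similar distinct pairs, best first."""
    n = similarity.shape[0]
    rows, cols = np.triu_indices(n, k=1)   # index pairs with i < j only
    scores = similarity[rows, cols]
    k = min(k, scores.size)
    best = np.argsort(scores)[::-1][:k]    # positions of the highest scores first
    return [(int(rows[b]), int(cols[b]), float(scores[b])) for b in best]

# Made-up symmetric similarity matrix for four documents
sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.4, 0.7],
    [0.9, 0.4, 1.0, 0.1],
    [0.0, 0.7, 0.1, 1.0],
])

pairs = top_k_pairs(sim, 2)
print(pairs)
```

Because only the `i < j` entries are ever ranked, the zeros that `np.triu` writes into the lower triangle never compete with real scores. This matters if similarities can be negative, as they can be with dense embeddings.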

In [17]:
# Print the top 50 most similar title pairs
for i in range(1, top_similar + 1):
    a_index = top_similar_index[0][-i]
    b_index = top_similar_index[1][-i]
    a_text = filtered_corpus[a_index]
    b_text = filtered_corpus[b_index]
    similarity = similarity_matrix[a_index,b_index]
    print(f"({similarity})\n{a_text}\n{b_text}\n\n")

(1.0000000000000002)
japan pm abe  way peace treaty russia coming sight
japan pm abe  way peace treaty russia coming sight


(0.9354143466934851)
russia say shelling embassy damascus act terrorism 
russia say shelling embassy damascus act terrorism  reuters


(0.8819171036881968)
nasa television broadcast return space station crew
nasa television broadcast sept 10 return space station crew


(0.8451542547285165)
arm cut reagan statue poland
vandal poland cut arm statue ronald reagan


(0.7559289460184543)
iran risk eu sanction missile test  france
france say eu could impose sanction iran missile test


(0.7142857142857141)
say 20,000 could infected ebola november 
ebola could eventually afflict 20,000  w.h.o  say


(0.6741998624632421)
israel  un human right council rapporteur denied entry israel
un fault israel human right


(0.6708203932499369)
russia  back foreign force syria
russia hit back u.s syria


(0.6666666666666669)
arizona police officer killed  suspect dead 
france stabbin

In [29]:
filtered_corpus[1599]

'japan pm abe  way peace treaty russia coming sight'

In [28]:
filtered_corpus[3888]

'japan pm abe  way peace treaty russia coming sight'

## TF-IDF

In [18]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

filtered_corpus = preprocessing(titles)
vectorizer = TfidfVectorizer(ngram_range=(1,2),
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             max_df = 0.4)

# vectorize the corpus and generate similarity matrix
vector = vectorizer.fit_transform(filtered_corpus)
similarity_matrix = sklearn.metrics.pairwise.cosine_similarity(vector.toarray())
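
To illustrate why TF-IDF scores differ from raw word-count scores, here is a minimal standard-library sketch of TF-IDF with smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which to our understanding mirrors scikit-learn's default weighting. It does not model the bigrams, `token_pattern`, or `max_df` settings used above, and the four short documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF and L2 normalisation over whitespace tokens."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(u, v):
    # Vectors are already unit length, so the dot product is the cosine
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Made-up documents: "russia" and "say" are common, "embassy" and "shelling" are rarer
docs = [
    "russia say embassy shelling",
    "russia say sanction talk",
    "nasa crew embassy shelling",
    "russia say peace treaty",
]
vecs = tfidf_vectors(docs)

common_overlap = cosine(vecs[0], vecs[1])  # shares the two common words
rare_overlap = cosine(vecs[0], vecs[2])    # shares the two rarer words
print(common_overlap, rare_overlap)
```

With raw counts both pairs would score 0.5, since each shares two of four words. Under TF-IDF the pair that shares the rarer words scores higher, and both fall below 0.5 because the unshared words are rare and therefore weighted heavily.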

In [30]:
# Keep only the strict upper triangle (k=1) of the similarity matrix: this excludes each title's similarity with itself and the mirrored (j, i) duplicate of every (i, j) pair
tri_upper_no_diag = np.triu(similarity_matrix, k=1)

In [31]:
# Get the index of 50 most similar title pairs
top_similar = 50
top_similar_index = np.unravel_index(np.argsort(tri_upper_no_diag.ravel())[-top_similar:], tri_upper_no_diag.shape)

In [32]:
# Print the top 50 most similar title pairs
for i in range(1, top_similar + 1):
    a_index = top_similar_index[0][-i]
    b_index = top_similar_index[1][-i]
    a_text = filtered_corpus[a_index]
    b_text = filtered_corpus[b_index]
    similarity = similarity_matrix[a_index,b_index]
    print(f"({similarity})\n{a_text}\n{b_text}\n\n")

(1.0000000000000002)
japan pm abe  way peace treaty russia coming sight
japan pm abe  way peace treaty russia coming sight


(0.9185643327276879)
russia say shelling embassy damascus act terrorism 
russia say shelling embassy damascus act terrorism  reuters


(0.8230718300079781)
nasa television broadcast return space station crew
nasa television broadcast sept 10 return space station crew


(0.5578039854147123)
school meal kill 22 india
poisoned school meal kill child peru


(0.5418691631395968)
iron dome intercept rocket ashdod
iaf hit gaza launching cell  iron dome intercept rocket


(0.5105659976659083)
bbc news  australia foreign minister kevin rudd resigns
australian foreign minister former prime minister kevin rudd resigns post potential leadership challenge


(0.4895749024307273)
bombing shooting kill 12 around afghanistan
afghan suicide bombing  shooting kill least 6


(0.4862373725062589)
u reportedly ban ukrainian party leader anti-semitism
anti-semitism


(0.456197794513300

## Word-embeddings

In [22]:
#import sys
#!{sys.executable} -m spacy download en_core_web_md

In [35]:
# Load spaCy's medium English model, which ships with word vectors
import en_core_web_md
import spacy
nlp = en_core_web_md.load()

In [36]:
# Vectorize the corpus (spaCy's doc.vector averages the token vectors by default) and generate the similarity matrix
vector = [nlp(i).vector for i in filtered_corpus]
similarity_matrix = sklearn.metrics.pairwise.cosine_similarity(vector)
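
The idea behind the document vectors can be sketched without spaCy: each title is represented by the mean of its word vectors, so titles with different but related words still end up close together. The 3-dimensional vectors below are invented purely for illustration (real `en_core_web_md` vectors have 300 dimensions), and `mean_vector` and `cosine` are hypothetical helpers.

```python
import math

# Hypothetical 3-dimensional "embeddings", invented purely for illustration
TOY_VECTORS = {
    "first":  (0.9, 0.1, 0.0),
    "1st":    (0.8, 0.2, 0.0),
    "mammal": (0.1, 0.9, 0.1),
    "animal": (0.3, 0.7, 0.2),
    "treaty": (0.0, 0.1, 0.9),
}

def mean_vector(tokens):
    """Average the vectors of the known tokens, component by component."""
    vecs = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vecs:
        raise ValueError("no known tokens to average")
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc_a = mean_vector("first mammal".split())
doc_b = mean_vector("1st animal".split())   # no token in common with doc_a
doc_c = mean_vector("treaty".split())

related = cosine(doc_a, doc_b)
unrelated = cosine(doc_a, doc_c)
print(related, unrelated)
```

Here `doc_a` and `doc_b` share no tokens, so their word-count similarity would be 0, yet their averaged vectors are nearly parallel. One consequence of averaging is that word order is ignored, which is consistent with the high scores for reordered titles in the output below.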

In [37]:
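# Keep only the strict upper triangle (k=1) of the similarity matrix: this excludes each title's similarity with itself and the mirrored (j, i) duplicate of every (i, j) pair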
tri_upper_no_diag = np.triu(similarity_matrix, k=1)

In [38]:
# Get the index of 50 most similar title pairs
top_similar = 50
top_similar_index = np.unravel_index(np.argsort(tri_upper_no_diag.ravel())[-top_similar:], tri_upper_no_diag.shape)

In [39]:
# Print the top 50 most similar title pairs
for i in range(1, top_similar + 1):
    a_index = top_similar_index[0][-i]
    b_index = top_similar_index[1][-i]
    a_text = filtered_corpus[a_index]
    b_text = filtered_corpus[b_index]
    similarity = similarity_matrix[a_index,b_index]
    print(f"({similarity})\n{a_text}\n{b_text}\n\n")

(0.9999997615814209)
japan pm abe  way peace treaty russia coming sight
japan pm abe  way peace treaty russia coming sight


(0.9776901006698608)
russia say shelling embassy damascus act terrorism 
russia say shelling embassy damascus act terrorism  reuters


(0.9674497246742249)
nasa television broadcast return space station crew
nasa television broadcast sept 10 return space station crew


(0.9529333710670471)
bombing shooting kill 12 around afghanistan
afghan suicide bombing  shooting kill least 6


(0.9432119727134705)
iran risk eu sanction missile test  france
france say eu could impose sanction iran missile test


(0.9422625303268433)
arm cut reagan statue poland
vandal poland cut arm statue ronald reagan


(0.9397093057632446)
arizona police officer killed  suspect dead 
france stabbing suspect  killed police officer


(0.9390401244163513)
pakistani court upholds blasphemy death sentence christian woman
pakistani christian woman sentenced death blasphemy file appeal


(0.9376088

In [40]:
# Collect the top pairs, most similar first, together with their original (unprocessed) titles
index1 = top_similar_index[0][::-1]
index2 = top_similar_index[1][::-1]
top50_similar_df = pd.DataFrame({
    'Index 1': index1,
    'Index 2': index2,
    'Title 1': small_sample['title'].iloc[index1].to_numpy(),
    'Title 2': small_sample['title'].iloc[index2].to_numpy(),
    'Similarity': similarity_matrix[index1, index2],
})

In [41]:
top50_similar_df.head()

| | Index 1 | Index 2 | Title 1 | Title 2 | Similarity |
|---|---|---|---|---|---|
| 0 | 1599 | 3888 | Japan PM Abe: way to peace treaty with Russia ... | Japan PM Abe: way to peace treaty with Russia ... | 1.0 |
| 1 | 1636 | 3139 | Russia says shelling of its embassy in Damascu... | Russia says shelling of its embassy in Damascu... | 0.97769 |
| 2 | 1817 | 3997 | NASA Television to Broadcast Return of Space S... | NASA Television to Broadcast Sept. 10 Return o... | 0.96745 |
| 3 | 1087 | 4200 | Bombings and shootings kill 12 around Afghanis... | Afghan suicide bombing, shooting kills at least 6 | 0.952933 |
| 4 | 255 | 538 | Iran risks EU sanctions over missile tests -... | France says EU could impose sanctions over Ira... | 0.943212 |
