1. Find the **most similar sentences or documents in your dataset using word count, TF-IDF, and word-embeddings** as your vectorization techniques. If the computation is slow, **you may subsample** for only a few thousand rows. (2 pts)



* Our dataset contains submissions to the worldnews subreddit on Reddit. We obtained it from https://www.kaggle.com/rootuser/worldnews-on-reddit.

* The dataset contains the following columns:
    1. time_created - a Unix timestamp of the submission creation date
    2. date_created - the creation date in %Y-%m-%d format
    3. up_votes - how often the submission was upvoted
    4. down_votes - how often the submission was downvoted (all 0s in this dataset)
    5. title - the title of the submission
    6. over_18 - whether the submission is marked as mature content
    7. author - the Reddit username of the author
    8. subreddit - always 'worldnews'

* In our case, "title" is the only text field. To find the most similar titles, we apply word count, TF-IDF, and word embeddings as our vectorization techniques.

* Comparing the three techniques, we find word embeddings to be the most powerful. Beyond measuring how many words two titles share, embeddings also capture what the words mean. For example, the titles **"australian rodent named 1st mammal go extinct due human-caused climate change"** and **"australian rodent first mammal become extinct due climate change  scientist say"** are almost the same. However, because they do not share exactly the same words, word count and TF-IDF capture their similarity poorly: word count gives 0.61 and TF-IDF gives 0.45. In contrast, because word embeddings take word meaning into account, they give 0.94, which is closer to our expectation.
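The word-count score for this pair can be reproduced by hand. The sketch below is a minimal, standard-library-only illustration and not the notebook's pipeline: the helper names `count_vector` and `cosine` are made up for this example, and the tokenizer simply mimics `CountVectorizer`'s default pattern (runs of two or more word characters), so "human-caused" splits into two tokens.

```python
import math
import re
from collections import Counter

def count_vector(text):
    # Same token rule as CountVectorizer's default pattern: runs of 2+ word characters
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors stored as Counters
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

title_a = "australian rodent named 1st mammal go extinct due human-caused climate change"
title_b = "australian rodent first mammal become extinct due climate change  scientist say"

similarity = cosine(count_vector(title_a), count_vector(title_b))
print(round(similarity, 2))  # 7 shared tokens / sqrt(12 * 11) -> 0.61
```

Only 7 of the 12 and 11 distinct tokens overlap, so near-synonyms such as "1st" and "first" contribute nothing to the score.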

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("reddit_worldnews_start_to_2016-11-22.csv")

In [3]:
data.head()

| | time_created | date_created | up_votes | down_votes | title | over_18 | author | subreddit |
|---|---|---|---|---|---|---|---|---|
| 0 | 1201232046 | 2008-01-25 | 3 | 0 | Scores killed in Pakistan clashes | False | polar | worldnews |
| 1 | 1201232075 | 2008-01-25 | 2 | 0 | Japan resumes refuelling mission | False | polar | worldnews |
| 2 | 1201232523 | 2008-01-25 | 3 | 0 | US presses Egypt on Gaza border | False | polar | worldnews |
| 3 | 1201233290 | 2008-01-25 | 1 | 0 | Jump-start economy: Give health care to all | False | fadi420 | worldnews |
| 4 | 1201274720 | 2008-01-25 | 4 | 0 | Council of Europe bashes EU&UN terror blacklist | False | mhermans | worldnews |


In [4]:
len(data["title"])

509236

## Preprocessing

In [4]:
small_sample = data.sample(n = 5000).reset_index(drop=True)

In [5]:
titles = small_sample["title"].values

In [6]:
# Preprocess the titles (tokenizing, lemmatization, removing stopwords)
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string

stop_words = set(stopwords.words('english'))

lemmatizer = WordNetLemmatizer()

def preprocessing(titles):
    """Lowercase, tokenize, remove stopwords, and lemmatize each title."""
    filtered_titles = []
    for title in titles:
        token_list = word_tokenize(title.lower())  # Tokenize
        filtered_token = [t for t in token_list if t not in stop_words]  # Remove stopwords
        # Lemmatize each token, then strip leading/trailing punctuation.
        # Punctuation-only tokens become empty strings, hence the double spaces in the output.
        filtered_token = [lemmatizer.lemmatize(t).strip(string.punctuation) for t in filtered_token]
        filtered_titles.append(" ".join(filtered_token))
    return filtered_titles

## Word Count

In [7]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

filtered_corpus = preprocessing(titles)
vectorizer = CountVectorizer()

# Vectorize the corpus and generate the pairwise cosine-similarity matrix
vector = vectorizer.fit_transform(filtered_corpus)
similarity_matrix = cosine_similarity(vector)  # accepts the sparse matrix directly

In [9]:
# Keep the upper similarity matrix without diagonal
tri_upper_no_diag = np.triu(similarity_matrix, k=1)

In [10]:
# Get the index of 50 most similar title pairs
top_similar = 50
top_similar_index = np.unravel_index(np.argsort(tri_upper_no_diag.ravel())[-top_similar:], tri_upper_no_diag.shape)
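The index extraction above is compact, so here is a small sketch of the same `np.triu` / `np.argsort` / `np.unravel_index` steps on a toy matrix. The 4x4 values are hypothetical, and the approach assumes similarities are non-negative, so that the zeroed-out lower triangle never outranks a real pair.

```python
import numpy as np

# Hypothetical 4x4 cosine-similarity matrix (symmetric, ones on the diagonal)
toy_similarity = np.array([
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.1],
    [0.4, 0.8, 0.1, 1.0],
])

# Zero out the diagonal and lower triangle so each pair appears only once
upper = np.triu(toy_similarity, k=1)

# argsort is ascending, so the last k flat indices point at the k largest entries
k = 2
flat_top = np.argsort(upper.ravel())[-k:]
rows, cols = np.unravel_index(flat_top, upper.shape)

# Reverse so the most similar pair comes first
top_pairs = [(int(r), int(c)) for r, c in zip(rows[::-1], cols[::-1])]
print(top_pairs)  # [(0, 2), (1, 3)]
```

Sorting all entries is fine for a 5000-row sample; for much larger matrices, `np.argpartition` may be a cheaper way to pick the top entries, though that is untested here.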

In [11]:
# Print the top 50 most similar title pairs
for i in range(1, top_similar + 1):
    a_index = top_similar_index[0][-i]
    b_index = top_similar_index[1][-i]
    a_text = filtered_corpus[a_index]
    b_text = filtered_corpus[b_index]
    similarity = similarity_matrix[a_index,b_index]
    print(f"({similarity})\n{a_text}\n{b_text}\n\n")

(0.9999999999999999)
landmine kill four u.n soldier  wound 15 northern mali
landmine kill four u.n soldier  wound 15 northern mali


(0.9999999999999998)
syria say brings u.s drone  sana news agency
syria say brings u.s drone  sana news agency


(0.9999999999999998)
russia say sinai plane crash caused terrorist attack
russia say sinai plane crash caused terrorist attack


(0.9999999999999997)
suicide attack saudi mosque kill 17 people
suicide attack saudi mosque kill 17 people 


(0.8660254037844388)
north korea fire missile  south korea say
north korea fire seven missile east sea  south korea say


(0.8660254037844388)
nelson mandela discharged hospital
nelson mandela discharged


(0.8571428571428569)
philippine tell u joint patrol south china sea
u  philippine launch joint south china sea patrol


(0.7715167498104596)
taliban gunman attack kabul presidential palace
taliban claim suicide attack kabul presidential palace


(0.75)
volcano erupts southwestern japan
sakurajima volcano jap

## TF-IDF

In [15]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

filtered_corpus = preprocessing(titles)
vectorizer = TfidfVectorizer(ngram_range=(1,2),
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             max_df = 0.4)

# Vectorize the corpus and generate the pairwise cosine-similarity matrix
vector = vectorizer.fit_transform(filtered_corpus)
similarity_matrix = cosine_similarity(vector)  # accepts the sparse matrix directly
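TF-IDF tends to score partially overlapping titles lower than plain word counts because the words two titles share are often common across the corpus, while the words they do not share are rare and weighted up. The sketch below illustrates this with the standard library only. The four-title corpus is hypothetical, and it uses unigrams with the smoothed IDF formula that scikit-learn applies by default, ignoring the `ngram_range`, `token_pattern`, and `max_df` settings used in the cell above.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of already-preprocessed titles
corpus = [
    "north korea fire missile",
    "north korea fire seven missile east sea",
    "north korea say talk",
    "south korea north border",
]
docs = [Counter(title.split()) for title in corpus]
n_docs = len(docs)

# Document frequency: number of titles that contain each term
df = Counter(term for doc in docs for term in doc)

def idf(term):
    # Smoothed IDF, as in scikit-learn with smooth_idf=True
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Weight each raw term count by the term's IDF
    return {term: count * idf(term) for term, count in doc.items()}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

count_sim = cosine(docs[0], docs[1])
tfidf_sim = cosine(tfidf(docs[0]), tfidf(docs[1]))
print(round(count_sim, 3), round(tfidf_sim, 3))  # 0.756 0.611
```

The rare words "seven", "east", and "sea" appear in only one title, so they receive the largest weights and pull the two vectors apart.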

In [17]:
# Keep the upper similarity matrix without diagonal
tri_upper_no_diag = np.triu(similarity_matrix, k=1)

In [18]:
# Get the index of 50 most similar title pairs
top_similar = 50
top_similar_index = np.unravel_index(np.argsort(tri_upper_no_diag.ravel())[-top_similar:], tri_upper_no_diag.shape)

In [19]:
# Print the top 50 most similar title pairs
for i in range(1, top_similar + 1):
    a_index = top_similar_index[0][-i]
    b_index = top_similar_index[1][-i]
    a_text = filtered_corpus[a_index]
    b_text = filtered_corpus[b_index]
    similarity = similarity_matrix[a_index,b_index]
    print(f"({similarity})\n{a_text}\n{b_text}\n\n")

(1.0000000000000004)
syria say brings u.s drone  sana news agency
syria say brings u.s drone  sana news agency


(1.0000000000000004)
russia say sinai plane crash caused terrorist attack
russia say sinai plane crash caused terrorist attack


(1.0)
landmine kill four u.n soldier  wound 15 northern mali
landmine kill four u.n soldier  wound 15 northern mali


(0.9999999999999998)
suicide attack saudi mosque kill 17 people
suicide attack saudi mosque kill 17 people 


(0.8592868091661006)
nelson mandela discharged hospital
nelson mandela discharged


(0.6284726270615565)
taliban gunman attack kabul presidential palace
taliban claim suicide attack kabul presidential palace


(0.5816074308323751)
north korea fire missile  south korea say
north korea fire seven missile east sea  south korea say


(0.5682770109148848)
6.6 magnitude earthquake hit chinese city ya
magnitude-4.9 earthquake hit macedonia


(0.5527852661558667)
toronto mayor rob ford win appeal keep job
toronto mayor rob ford file

## Word-embeddings

In [22]:
#import sys
#!{sys.executable} -m spacy download en_core_web_md

In [8]:
# load in spacy
import en_core_web_md
import spacy
from scipy.spatial.distance import cosine
nlp = en_core_web_md.load()  # call load() to get the pipeline object

In [9]:
# vectorize the corpus and generate similarity matrix
vector = [nlp(i).vector for i in filtered_corpus]
similarity_matrix = sklearn.metrics.pairwise.cosine_similarity(vector)
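By default, spaCy's `Doc.vector` is the average of the token vectors, which is why titles with different but related words can still end up close together. The sketch below illustrates the idea with made-up 3-dimensional vectors. The `toy_vectors` table and both helper functions are hypothetical and stand in for the real `en_core_web_md` vectors.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for illustration only
toy_vectors = {
    "first":   np.array([1.0, 0.1, 0.0]),
    "1st":     np.array([0.9, 0.2, 0.0]),
    "mammal":  np.array([0.0, 1.0, 0.0]),
    "extinct": np.array([0.0, 0.0, 1.0]),
}

def title_vector(tokens):
    # Average the token vectors to get one vector per title
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vec_a = title_vector(["1st", "mammal", "extinct"])
vec_b = title_vector(["first", "mammal", "extinct"])
embedding_sim = cosine(vec_a, vec_b)
print(round(embedding_sim, 3))
```

Because "1st" and "first" have nearly parallel vectors in this toy table, the averaged title vectors are almost identical, whereas a word-count comparison would treat them as unrelated tokens.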

In [12]:
# Keep the upper similarity matrix without diagonal
tri_upper_no_diag = np.triu(similarity_matrix, k=1)

In [13]:
# Get the index of 50 most similar title pairs
top_similar = 50
top_similar_index = np.unravel_index(np.argsort(tri_upper_no_diag.ravel())[-top_similar:], tri_upper_no_diag.shape)

In [14]:
# Print the top 50 most similar title pairs
for i in range(1, top_similar + 1):
    a_index = top_similar_index[0][-i]
    b_index = top_similar_index[1][-i]
    a_text = filtered_corpus[a_index]
    b_text = filtered_corpus[b_index]
    similarity = similarity_matrix[a_index,b_index]
    print(f"({similarity})\n{a_text}\n{b_text}\n\n")

(0.983873724937439)
vatican say arrest former archbishop paedophilia charge
vatican arrest former archbishop paedophilia charge


(0.9438580274581909)
afghan taliban announce start “ spring offensive ”
taliban announce start spring offensive afghanistan


(0.9359098672866821)
6 u.s troop killed helicopter crash afghanistan
afghanistan  helicopter crash kill 12 nato troop  2 civilian


(0.9342800974845886)
300 million child breathe highly toxic air  report unicef
2 billion child breathe toxic air worldwide  unicef say


(0.9317377805709839)
u.s fly burundian troop central african republic coordination france
france increase troop central african republic


(0.9299061894416809)
israeli settler stab palestinian jerusalem  attack school west bank
illegal israeli settler stab palestinian occupied west bank


(0.9286984801292419)
eurogroup vow strengthen bailout fund contain debt crisis  17 eurozone member said plan strengthen common bailout fund order stop debt crisis spreading italy  finan

In [37]:
index1=[top_similar_index[0][-i] for i in range(1, top_similar + 1)]
index2=[top_similar_index[1][-i] for i in range(1, top_similar + 1)]
similarities=[similarity_matrix[top_similar_index[0][-i],top_similar_index[1][-i]] for i in range(1, top_similar + 1)]
originaltitle1=[small_sample['title'][i] for i in index1]
originaltitle2=[small_sample['title'][i] for i in index2]
top50_similar_df=pd.DataFrame({'Index 1':index1,'Index 2':index2,'Title 1':originaltitle1,
                              'Title 2':originaltitle2,'Similarity':similarities})

In [38]:
top50_similar_df.head()

| | Index 1 | Index 2 | Title 1 | Title 2 | Similarity |
|---|---|---|---|---|---|
| 0 | 3355 | 3580 | Vatican says arrests former archbishop over pa... | Vatican arrests former archbishop on paedophil... | 0.983874 |
| 1 | 450 | 2875 | Afghan Taliban announce start of “spring offen... | Taliban announce start of spring offensive in ... | 0.943858 |
| 2 | 196 | 4812 | 6 U.S. troops killed in helicopter crash in Af... | Afghanistan: Helicopter crash kills 12 NATO tr... | 0.93591 |
| 3 | 240 | 4257 | 300 Million Children Breathe Highly Toxic Air,... | 2 billion children breathe toxic air worldwide... | 0.93428 |
| 4 | 2777 | 3125 | U.S. to fly Burundian Troops to Central Africa... | France Increases Its Troops In The Central Afr... | 0.931738 |
