# Word Embeddings (Word2Vec, Sent2Vec, and Doc2Vec)

## Due date

April 30, 2018

## Assignment description

In this assignment, you will implement a semantic search engine using the word2vec algorithm. You will use pre-trained word embeddings and build a search engine that can retrieve documents related to a given query based on semantic similarity.

### Objective

1. Familiarize yourself with the word2vec algorithm: Start by reading about the word2vec algorithm and its applications in NLP. You can use the resources provided in the course or search for additional materials online.

2. Choose a pre-trained word embedding model: There are many pre-trained word embedding models available online, such as Google's Word2Vec, Stanford's GloVe, and Facebook's fastText. Choose one that you find suitable for your task and download it. See the lecture notebooks for links to code that can be used to load the models.

3. Preprocess the data: Use the news dataset on which you performed Exploratory Data Analysis (EDA) in the previous assignment as the document collection for your search engine.

4. Map the documents to vectors: Use the pre-trained word embedding model to map the words in each document to vectors. You can do this by averaging the vectors of the individual words in each document or using a more sophisticated technique such as doc2vec.

5. Implement the search engine: Given a query, map it to a vector using the same technique you used for the documents. Then, retrieve the documents that are most similar to the query vector based on cosine similarity or another distance metric.

6. Write a brief summary of your algorithm and document its usage with some examples.
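
Steps 4 and 5 can be written compactly. As a sketch, assume a document $d$ is reduced to the tokens $w_1, \dots, w_n$ that have an embedding $\mathbf{e}(w_i)$ in the chosen model, and that the query vector $\mathbf{q}$ is built the same way. The symbols are introduced here for illustration only and do not come from the course materials.

```latex
\mathbf{v}_d = \frac{1}{n} \sum_{i=1}^{n} \mathbf{e}(w_i),
\qquad
\operatorname{sim}(\mathbf{q}, \mathbf{v}_d)
  = \frac{\mathbf{q} \cdot \mathbf{v}_d}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{v}_d \rVert}
```

The cosine distance is $1 - \operatorname{sim}(\mathbf{q}, \mathbf{v}_d)$, so ranking by descending similarity is the same as ranking by ascending distance.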

### Outcomes

The student will be able to:

1. Implement a semantic search engine using word embeddings.
2. Use pre-trained word embedding models.
3. Map documents to vectors using word embeddings.
4. Discover how cosine similarity can be used to cluster documents.

## Submission medium

Well-documented Jupyter notebook.

## Dataset

The dataset used in this assignment is the same as the one used in the EDA assignment. That is, the input for this assignment is the output you created in the EDA assignment. You can download the preprocessed dataset from the following link:

In [1]:
import pandas as pd
import numpy as np

data_source = 'https://raw.githubusercontent.com/JamesMTucker/DATA_340_NLP/master/Notebooks/data/news-2023-02-01.csv'

articles = pd.read_csv(data_source)

In [2]:
articles.head()

Unnamed: 0,source,title,text
0,politicususa,Prosecutors Pay Attention: Stormy Daniels Than...,Manhattan prosecutors are likely to notice tha...
1,politicususa,Investigators Push For Access To Trump Staff C...,Print\nInvestigators looking into Donald Trump...
2,politicususa,The End Is Near For George Santos As He Steps ...,The AP reported:\nRepublican Rep. George Santo...
3,politicususa,Rachel Maddow Cuts Trump To The Bone With Stor...,Rachel Maddow showed how Trump committed a cri...
4,vox,Alec Baldwin has been formally charged with in...,Candles are placed in front of a photo of cine...


### Dataset description

In [3]:
articles.describe()

Unnamed: 0,source,title,text
count,11587,11586,11419
unique,20,716,1062
top,politicususa,Nicolle Wallace Devastates Trump And Shows Why...,Contact Us\nThis material may not be published...
freq,720,127,698


## Preprocessing

Clean, deduplicate, and tokenize the documents. You should be able to repurpose your code from the EDA assignment to do this.

In [4]:
## YOUR CODE HERE
import spacy
import string
import warnings
from tqdm.auto import tqdm

tqdm.pandas()
warnings.filterwarnings('ignore')
NLP = spacy.load('en_core_web_sm')

In [17]:
# start by dropping duplicate rows, judged by source and text
articles.drop_duplicates(subset=['source', 'text'], inplace=True)

# drop the rows with missing (NaN) values
articles.dropna(inplace=True)

In [6]:
# Create the tokens

punctuation = list(string.punctuation)
# string.whitespace starts with the plain space character; skip it so that only
# tabs, newlines, and the other whitespace characters are filtered out
whitespace = list(string.whitespace)[1:]
undesired_characters = punctuation + whitespace
articles['tokens'] = articles['text'].progress_apply(
    lambda text: [token.text.lower() for token in NLP(text)
                  if token.text.lower() not in undesired_characters])

  0%|          | 0/1062 [00:00<?, ?it/s]

In [7]:
# each article is now tokenized
articles.head()

Unnamed: 0,source,title,text,tokens
0,politicususa,Prosecutors Pay Attention: Stormy Daniels Than...,Manhattan prosecutors are likely to notice tha...,"[manhattan, prosecutors, are, likely, to, noti..."
1,politicususa,Investigators Push For Access To Trump Staff C...,Print\nInvestigators looking into Donald Trump...,"[print, investigators, looking, into, donald, ..."
2,politicususa,The End Is Near For George Santos As He Steps ...,The AP reported:\nRepublican Rep. George Santo...,"[the, ap, reported, republican, rep., george, ..."
3,politicususa,Rachel Maddow Cuts Trump To The Bone With Stor...,Rachel Maddow showed how Trump committed a cri...,"[rachel, maddow, showed, how, trump, committed..."
4,vox,Alec Baldwin has been formally charged with in...,Candles are placed in front of a photo of cine...,"[candles, are, placed, in, front, of, a, photo..."


## Word embeddings

Load the pre-trained word embedding model; you can use the code provided in the lecture notebooks to do so. Then vectorize the documents with it, either by averaging the vectors of the individual words in each document or by using a more sophisticated technique such as doc2vec (see the spaCy and Gensim packages).
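
The averaging approach can be sketched independently of any particular model. The example below is a minimal illustration, not the course's reference solution: `average_vector` and `toy_embeddings` are hypothetical names, and a plain dictionary stands in for the pre-trained model (a Gensim `KeyedVectors` object should support the same `in` and `[]` operations, but treat that as an assumption to verify). Tokens without an embedding are skipped, and a document with no known tokens gets a zero vector instead of triggering a division by zero.

```python
import numpy as np

def average_vector(tokens, lookup, dim):
    """Mean of the vectors of the tokens found in `lookup`; zeros if none are found."""
    vectors = [lookup[token] for token in tokens if token in lookup]
    if not vectors:
        # No known tokens: return a zero vector instead of dividing by zero.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)

# Hypothetical 3-dimensional embeddings standing in for a real pre-trained model.
toy_embeddings = {
    "cat": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "dog": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

# The emoji has no embedding, so only "cat" and "dog" contribute to the mean.
doc_vector = average_vector(["cat", "dog", "🙂"], toy_embeddings, dim=3)
empty_vector = average_vector(["🙂"], toy_embeddings, dim=3)
```

Cosine similarity is undefined for a zero vector, so documents that end up with one are best filtered out before ranking.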

In [8]:
## YOUR CODE HERE
# start by loading the gensim pretrained model
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.downloader
from gensim import similarities

In [10]:
#loading the pretrained model
model = gensim.downloader.load('word2vec-google-news-300')

In [11]:
# I average the vectors of the words in each document. When I looked into the doc2vec method, the only resources I could
# find showed training a new model. None of the examples used a pre-trained embedding.

# Iterate through each word in each document, then average the vectors to get the document vector representation

# a list to hold the vector representations of each document
document_vectors = []

# iterate over the token list of each article
for document in articles['tokens']:

    # a list to hold the vectors for each word in the document
    document_word_vectors = []
    for word in document:
        # get_vector raises KeyError for tokens (e.g. emojis) that are not in the pre-trained embeddings
        try:
            document_word_vectors.append(model.get_vector(word))
        except KeyError:
            continue

    # average the word vectors of the document to get a single vector;
    # fall back to a zero vector when no token was in the vocabulary
    if document_word_vectors:
        document_vector = np.mean(document_word_vectors, axis=0)
    else:
        document_vector = np.zeros(model.vector_size, dtype=np.float32)
    document_vectors.append(document_vector)

# now I have the document vectors, can add them as a column to the table
articles['embedding_vectors'] = document_vectors

## Search engine

Write a search engine that can retrieve documents related to a given query based on semantic similarity. Given a query, map it to a vector using the same technique you used for the documents. Then, retrieve the documents that are most similar to the query vector based on cosine similarity or another distance metric.
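
One way to picture the retrieval step is as a single matrix operation: if every document vector is a row of a matrix, normalizing the rows and the query to unit length turns cosine similarity into a dot product. The sketch below assumes this setup. `top_k_cosine` and the toy `docs` matrix are hypothetical and only illustrate the idea.

```python
import numpy as np

def top_k_cosine(query_vector, doc_matrix, k=10):
    """Return (row positions, similarities) of the k rows most similar to the query."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_norm = np.linalg.norm(query_vector)
    # Replace zero norms with 1 so that zero vectors score 0 instead of producing NaN.
    unit_docs = doc_matrix / np.where(doc_norms == 0, 1.0, doc_norms)
    unit_query = query_vector / (query_norm if query_norm else 1.0)
    # For unit vectors, the dot product equals the cosine similarity.
    similarities = unit_docs @ unit_query
    order = np.argsort(-similarities)[:k]
    return order, similarities[order]

# Hypothetical 2-dimensional document vectors, one per row.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, scores = top_k_cosine(np.array([1.0, 0.1]), docs, k=2)
```

The returned positions are row positions in the matrix, not DataFrame index labels. If the matrix were built from a DataFrame column, for instance with `np.vstack(column.to_list())`, the matching rows would be selected with `.iloc` rather than `.loc`.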

In [12]:
from scipy import spatial
import operator

In [13]:
## YOUR CODE HERE

# function to tokenize the query and return its vector form

def transform_query(query):
    # reuse the spaCy pipeline (NLP) and the character filter (undesired_characters)
    # defined in the preprocessing cells, so queries are tokenized like the documents
    document = [token.text.lower() for token in NLP(query)
                if token.text.lower() not in undesired_characters]

    # a list to hold the vectors for each word in the query
    document_word_vectors = []
    for word in document:
        try:
            document_word_vectors.append(model.get_vector(word))
        except KeyError:
            # skip tokens that are not in the pre-trained embeddings
            continue

    # an average is undefined when no query token has an embedding
    if not document_word_vectors:
        raise ValueError('None of the query tokens are in the embedding vocabulary.')

    return np.mean(document_word_vectors, axis=0)

def getSimilarArticles(query):

    # transform the query into an embedding vector
    query_vector = transform_query(query)

    # dictionary mapping each article index to its similarity score
    similarity_scores = {}

    # go through each document vector in the corpus
    for i, corpus_vector in articles['embedding_vectors'].items():
        # cosine similarity is undefined for an all-zero vector, so skip such documents
        if not np.any(corpus_vector):
            continue

        # scipy returns the cosine distance; subtract it from one to get the cosine
        # similarity, so that larger scores mean more similar documents
        similarity_scores[i] = 1 - spatial.distance.cosine(query_vector, corpus_vector)

    # sort by similarity score in descending order and select the top 10
    top_ten = sorted(similarity_scores.items(), key=operator.itemgetter(1), reverse=True)[:10]
    top_ten_indices = [index for index, _ in top_ten]

    # return the top 10 articles as a data frame
    return articles.loc[top_ten_indices, ['source', 'title', 'text']]

## Explanation of the Algorithm

The algorithm takes in a query and transforms it. The transformation starts by tokenizing the query in the same way that I tokenized the documents in the corpus. Next, each token in the query is mapped to its embedding vector using the pre-trained model; tokens that are not in the model's vocabulary are skipped. The embedding vectors of the query tokens are then averaged to give one vector for the whole query. Once the query is vectorized, the algorithm calculates the cosine similarity (one minus the cosine distance) between the query vector and each document vector. These scores are saved in a dictionary, which is sorted in descending order. The top 10 results are retrieved, and their indices are used to pull the corresponding rows of the dataframe.

## 3 Examples of Queries

In [14]:
# first pull the title of the first document
query = articles['title'][0]

results = getSimilarArticles(query)
results

# the article the title came from is returned as the 7th result

Unnamed: 0,source,title,text
5708,msnbc,Video released of Trump deposition in New York...,Video released of Trump deposition in NY fraud...
9161,abcnews.go,Top Trump Organization executive to appear bef...,Top Trump Organization executive to appear bef...
10225,thehill,Trump Organization controller expected to appe...,Email\nAssociated Press/Seth Wenig\nA Trump Or...
8531,thehill,Cohen says he turned over cellphones to Manhat...,Cohen says he turned over cellphones to Manhat...
8408,thehill,Cohen says he turned over cell phones to Manha...,Cohen says he turned over cell phones to Manha...
65,vox,Donald Trump could be criminally charged in th...,President Donald Trump greets people at a New ...
0,politicususa,Prosecutors Pay Attention: Stormy Daniels Than...,Manhattan prosecutors are likely to notice tha...
8969,thegatewaypundit,Liberal Columbia Journalism Review Issues Scat...,ShareShareShare Email\nSEVEN YEARS after the l...
1552,thehill,Ethics concerns raised over business ties of S...,Ethics concerns raised over business ties of S...
10205,westernjournal,'RINO GLOBALIST': Donald Trump Goes Scorched E...,'RINO GLOBALIST': Donald Trump Goes Scorched E...


In [15]:
# try the title of the article at index 7
query = articles['title'][7]

results = getSimilarArticles(query)
results
# the article the title came from is returned as the 4th result

Unnamed: 0,source,title,text
7903,dailycaller,"‘We Were Really Dedicated’: Arian Foster, Marl...","February 01, 2023 10:36 AM ET\nFont Size:\nI w..."
6892,westernjournal,"CNN Calls in Bill Maher, But It Most Likely Wo...",Commentary\nCommentary\nHBO comedian Bill Mahe...
7427,westernjournal,"CNN Calls in Bill Maher, But It Most Likely Wo...",Commentary\nCommentary\nHBO comedian Bill Mahe...
7,vox,Sundance 2023: 17 movies to watch out for this...,Chiwetel Ejiofor and Emilia Clarke in The Pod ...
7428,westernjournal,Tom Brady Announces He's Retiring 'For Good' i...,"NFL quarterback Tom Brady, a cast member and p..."
9732,westernjournal,Tim Allen Says Pamela Anderson Has a 'Weird Me...,Tim Allen poses for a portrait at the 43rd Ann...
9892,westernjournal,Tim Allen Says Pamela Anderson Has a 'Weird Me...,Tim Allen poses for a portrait at the 43rd Ann...
9811,westernjournal,Tim Allen Says Pamela Anderson Has a 'Weird Me...,Tim Allen poses for a portrait at the 43rd Ann...
9560,thehill,Jon Stewart blasts media for playing Tyre Nich...,Jon Stewart blasts media for playing Tyre Nich...
5731,thehill,'Dr. Phil' to end after 21 seasons,Email\n( NewsNation ) — TV host and celebrity ...


In [20]:
# try a query that is not a title

results = getSimilarArticles('Donald Trump White House')
results

Unnamed: 0,source,title,text
8428,msnbc,FBI searches President Biden's Rehoboth home,Rev. Al: We will stand and fight in the name o...
8905,msnbc,FBI searches President Biden's Rehoboth home,Rev. Al: We will stand and fight in the name o...
5708,msnbc,Video released of Trump deposition in New York...,Video released of Trump deposition in NY fraud...
6110,msnbc,"Lawrence: Tyre Nichols' family, civil rights l...",Sen. Murphy: ‘Don’t negotiate’ with House GOP ...
7867,msnbc,FBI searches President Biden's Rehoboth home,Rev. Al: We will stand and fight in the name o...
7390,msnbc,Lawrence: Tyre Nichols’ family spoke in the sa...,Sen. Murphy: ‘Don’t negotiate’ with House GOP ...
7307,msnbc,FBI searches President Biden's Rehoboth home,Rev. Al: We will stand and fight in the name o...
9890,westernjournal,'RINO GLOBALIST': Donald Trump Goes Scorched E...,Florida Gov.-elect Ron DeSantis sits next to P...
7704,huffpost,Ron DeSantis Dismisses Trump's Criticism Of Hi...,Politics Donald Trump Ron DeSantis\nRon DeSant...
7627,msnbc,FBI searches President Biden's Rehoboth home,Rev. Al: We will stand and fight in the name o...


The results contain what look like duplicates, even though I dropped duplicates based on source and text. The texts may differ by a small amount, but I am not sure, so I test this on two of the results.

In [28]:
# the two texts are not equal, so these rows are not exact duplicates
results['text'][8428] == results['text'][8905] 

False
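
An equality check only says that the two texts differ, not where or by how much. A small standard-library sketch like the one below could help locate the difference. `first_difference`, `similarity_ratio`, and the two example strings are hypothetical stand-ins, not values taken from the dataset.

```python
import difflib

def first_difference(a, b, context=20):
    """Return the index where two strings first differ and a snippet of each around it."""
    limit = min(len(a), len(b))
    # If one string is a prefix of the other, the difference starts at `limit`.
    index = next((i for i in range(limit) if a[i] != b[i]), limit)
    start = max(index - context, 0)
    return index, a[start:index + context], b[start:index + context]

def similarity_ratio(a, b):
    """Rough similarity of two strings between 0 and 1; 1.0 means identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Hypothetical texts standing in for two rows of results['text'].
text_a = "Example article body. Updated at 6 PM."
text_b = "Example article body. Updated at 7 PM."

index, snippet_a, snippet_b = first_difference(text_a, text_b)
ratio = similarity_ratio(text_a, text_b)
```

A ratio close to 1 would suggest near-duplicates that an exact `drop_duplicates` cannot catch. Deduplicating on a normalized form of the text, or on the title, might then be worth trying.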

## Extra credit

Based on the results of your search engine, write a k-means clustering algorithm that clusters the documents into groups based on their semantic similarity, along with some topic words that describe each cluster. Some tips: look into k-means++, DBSCAN, and agglomerative clustering. For example, see this blog post: https://towardsdatascience.com/silhouette-method-better-than-elbow-method-to-find-optimal-clusters-378d62ff6891
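
As a possible starting point, the sketch below implements plain Lloyd's k-means on unit-normalized vectors, so that Euclidean distance orders pairs the same way cosine similarity does. It is a minimal illustration under stated assumptions, not a finished solution: `kmeans` and the toy `data` are hypothetical, the centroids are initialized from random rows rather than with k-means++, and the number of clusters is fixed by hand instead of being chosen with the silhouette method.

```python
import numpy as np

def kmeans(vectors, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on unit-normalized vectors; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Unit-normalize so that Euclidean distance ranks pairs like cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    X = vectors / np.where(norms == 0, 1.0, norms)
    # Initialize centroids from k distinct, randomly chosen rows (not k-means++).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical 2-dimensional vectors forming two well-separated groups.
data = np.array([[1.0, 0.1], [1.0, 0.0], [0.9, 0.1],
                 [0.1, 1.0], [0.0, 1.0], [0.1, 0.9]])
labels, centroids = kmeans(data, k=2)
```

For topic words, one option worth checking is to look up the words whose embeddings lie closest to each centroid, for example with Gensim's `similar_by_vector` on the loaded model.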

In [8]:
## YOUR CODE HERE