# [LEGALST-190] Lab 4/5: Word2Vec & PCA

In this lab, we will implement a Word2Vec algorithm on a sample of the UN General Debates dataset and conduct Principal Component Analysis of the resulting word embeddings.

*Estimated Time: 45 minutes*

### Table of Contents

[Overview](#context)<br>

[The Data](#data)<br>

0- [Pre-Processing](#section 0)<br>

1 - [Word2Vec](#section 1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1 - [Training](#subsection 1)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2 - [Embeddings](#subsection 2)

2- [PCA of Word Embeddings](#section 2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1 - [Plot Word Vectors Using PCA](#subsection 3)


__Dependencies:__

In [None]:
import pandas as pd
import numpy as np

!pip install nltk
import nltk
!pip install gensim
import gensim
import string

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Overview <a id='context'></a>

In this lab, we will turn individual words in the data set into vectors called "word embeddings". Word embedding attempts to identify semantic relationships between words by observing the contexts in which each word appears. Word2Vec is the most prominent word embedding algorithm, and it is the one we will practice using in today's lab.

Imagine that each word in a novel has its meaning determined by the ones that surround it in a limited window. For example, in Moby Dick's first sentence, “me” is paired on either side by “Call” and “Ishmael.” After observing the windows around every word in the novel (or many novels), the computer will notice a pattern in which “me” falls between pairs of words similar to those around “her,” “him,” or “them.” Of course, the computer has gone through a similar process for the words “Call” and “Ishmael,” for which “me” is reciprocally part of their contexts.  This chaining of signifiers to one another mirrors some of humanists' most sophisticated interpretative frameworks of language.

The two main flavors of Word2Vec are CBOW (Continuous Bag of Words) and Skip-Gram, which can be distinguished partly by their input and output during training. Skip-Gram takes a word of interest as its input (e.g. "me") and tries to learn how to predict its context words ("Call","Ishmael"). CBOW does the opposite, taking the context words ("Call","Ishmael") as a single input and trying to predict the word of interest ("me").

In general, CBOW is faster and does well with frequent words, while Skip-Gram potentially represents rare words better.
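
To make the difference in inputs and outputs concrete, here is a minimal sketch that builds the training examples each flavor would see for the Moby Dick sentence. The helper name `training_pairs` is hypothetical and written only for illustration; it is not part of gensim, and real Word2Vec training involves more than this pairing step.

```python
def training_pairs(tokens, window=1):
    """Return (cbow_examples, skipgram_examples) for a list of tokens."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target word.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if not context:
            continue
        # CBOW: all context words together are the input, the target is the output.
        cbow.append((context, target))
        # Skip-Gram: the target is the input, each context word is an output.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

cbow, skipgram = training_pairs(["call", "me", "ishmael"], window=1)
print(cbow[1])        # (['call', 'ishmael'], 'me')
print(skipgram[1:3])  # [('me', 'call'), ('me', 'ishmael')]
```
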

### Word2Vec Features
<ul>
<li>`size`: Number of dimensions for word embedding model</li>
<li>`window`: Number of context words to observe in each direction</li>
<li>`min_count`: Minimum frequency for words included in model</li>
<li>`sg` (Skip-Gram): '0' indicates CBOW model; '1' indicates Skip-Gram</li>
<li>`alpha`: Learning rate (initial); prevents model from over-correcting, enables finer tuning</li>
<li>`iter`: Number of passes (epochs) through the dataset</li>
<li>`batch_words`: Target size, in words, of the batches of examples passed to worker threads during training</li>
</ul>

For more detailed background on Word2Vec's mechanics, I suggest this  <a href="https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html">brief tutorial</a> by Google, especially the sections "Motivation," "Skip-Gram Model," and "Visualizing."

We will be using the default value for most of our parameters.

## The Data <a id='data'></a>

We will be working with the UN General Debates data set that contains data from years __2013 and up__. Run the following code block to load the `un-general-debates-2013up` csv into the notebook as `un`:

In [None]:
un = pd.read_csv('data/un-general-debates-2013up.csv')
un.head()

---

# Pre-Processing <a id='section 0'></a>

Word2Vec learns about the relationships among words by observing them in context. We'll need to tokenize the words in a sample of texts from the dataframe while retaining sentence boundaries. 

Just as we did in the Pre-Processing lab about a month ago (3/1), we'll use the Natural Language Toolkit (NLTK) to tokenize our sample text. Note that the text of each row in the `un` dataframe is in the form of a single string. Because we want to work with a larger sample than just one instance of a text, we will first __combine the first 300 strings of texts in the data frame__ to create one long string of text, which we will simply call `text`.

In [None]:
#Use a for loop to combine 300 strings of texts in the `un` dataframe into one string.
text = ''
for ...:
    ...

In [None]:
#SOLUTION
text = ''
for i in range(300):
    text += un['text'][i] + ' '

Next, get the individual sentences from `text`. Show the first three sentences.

*Hint*: the `sent_tokenize` method of nltk will be useful here.

In [None]:
# create sentence tokens
sents = ...
...

In [None]:
#SOLUTION
sents = nltk.sent_tokenize(text)
sents[:3]

Lowercase all words in each sentence, and make sure we only keep nonempty sentences:

*Hint*: You need only one list comprehension to accomplish this.

In [None]:
#filter out empty sentences
nonempty_sents = ...

In [None]:
#SOLUTION
nonempty_sents = [sentence.lower() for sentence in sents if sentence.strip()]

Now for each sentence in `nonempty_sents`, return a list of words without punctuation or stopwords. Set this to `tokens`. This should result in a list of lists, each list representing a sentence that has been tokenized. Show the first three tokenized sentences.

*Hint*: In the Pre-Processing Lab, we defined the function `rem_punc_stop` that does exactly this. How might we apply this to every sentence of `text`? 

In [None]:
#redefine rem_punc_stop here:

def rem_punc_stop(text):
    
    from string import punctuation
    from nltk.corpus import stopwords
    
    stop_words = stopwords.words("english") #a list of English stop words

    for ...:
        text = ... #get rid of all punctuation marks
        
    toks = ... #create a list of tokens from resulting punctuation-less text
    toks_reduced = ... #filter out stopwords from the list of tokens

    return ...


In [None]:
#SOLUTION

def rem_punc_stop(text):
    
    from string import punctuation
    from nltk.corpus import stopwords
    
    stop_words = set(stopwords.words("english")) #a set makes stopword lookups fast
    
    for char in punctuation:
        text = text.replace(char, "")
        
    toks = nltk.word_tokenize(text) #split the punctuation-less text into tokens
    toks_reduced = [x for x in toks if x.lower() not in stop_words]
    
    return toks_reduced 

Now that you have re-defined `rem_punc_stop`, you are ready to use this function to tokenize every sentence cleanly using the following code. This would create a list `tokens` whose elements are lists of tokens, with each list representing a single sentence. However, due to the limitations of DataHub, running this code will very likely crash your kernel. This will not be a problem once you begin working locally (i.e. working on notebooks on your own computer as opposed to DataHub), as you will have more memory and computing power to work with.

In [None]:
###**** DO NOT RUN THIS CODE BLOCK ***###
### tokens = [rem_punc_stop(sentence) for sentence in nonempty_sents] ###

Luckily, we have created a `Tokens.csv` for you that contains a table with a `texts` column containing cleanly tokenized sentences. Load this csv, and run the following code block to get the desired list format.

In [None]:
#load Tokens.csv
tokens_tbl = ...

In [None]:
#SOLUTION
tokens_tbl = pd.read_csv("data/Tokens.csv")

In [None]:
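#RUN THIS CELL!
import ast
#each row stores a tokenized sentence as a string; literal_eval safely parses it back into a list
tokens = [ast.literal_eval(row) for row in tokens_tbl['texts']]
tokens[:3]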

---

# Word2Vec<a id='section 1'></a>


## Training <a id='subsection 1'></a>

Phew! Now that we have pre-processed our text, we can use the `gensim` library to construct our word embeddings. We will use the Continuous Bag of Words model (CBOW), which predicts target words from their neighboring context words to learn word embeddings from raw text.

Read through the documentation of the `Word2Vec` class in gensim to understand how to implement the Word2Vec model. Then fill in the blanks so that we use a __Continuous Bag of Words__ model to create word embeddings of __size 100__ for words that appear in `text` __5 or more times__. Set the learning rate to .025, the number of iterations to 5, and `batch_words` to 10000.

In [None]:
#Run this code for documentation, or refer to the list above for parameter definitions
gensim.models.Word2Vec?

In [None]:
#Fill in the missing parameter values
model = ...

In [None]:
#SOLUTION
model = gensim.models.Word2Vec(tokens, size=100, window=5, \
                               min_count=5, sg=0, alpha=0.025, iter=5, batch_words=10000)

## Embeddings <a id='subsection 2'></a>

We can return the actual high-dimensional vector by simply indexing the model with the word as the key:

In [None]:
#Run this cell
print(model['assembly'])

Use the following empty cells to look at what the word embeddings look like for words you think may appear in the `text`! Keep in mind that even if a word shows up in `text` as seen above, a word vector will not be created unless it satisfies all conditions we inputted into the model above. Try words like `president` and `conference` to start! If you're curious, the cell directly below will return a list of words that have been turned into word vectors by the model above:

In [None]:
#Run this cell
words = list(model.wv.vocab)
print(words)
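
To see why some words from `text` never receive a vector, the sketch below mimics only the `min_count` rule on a toy corpus. The helper `vocabulary` and the toy sentences are hypothetical and made up for illustration; gensim's real vocabulary building has further options not shown here.

```python
from collections import Counter

def vocabulary(tokenized_sentences, min_count=5):
    """Keep only the words that appear at least `min_count` times."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Toy tokenized sentences (made up, not from the UN data).
toy_sentences = [["general", "assembly"], ["general", "debate"], ["general", "assembly"]]
vocab = vocabulary(toy_sentences, min_count=2)
print(sorted(vocab))  # ['assembly', 'general']
```

Here "debate" appears only once, so it falls below the threshold and would get no vector.
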

In [None]:
...

In [None]:
...

In [None]:
...

`gensim` comes with some handy methods to analyze word relationships. `similarity` gives us a number between -1 and 1 that measures how similar two words are. If this sounds like cosine similarity for words, you'd be right! It just takes the cosine similarity of the two high-dimensional word vectors.
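
As a reminder of what cosine similarity computes, here is a minimal sketch in plain Python. The function `cosine_similarity` and the short example vectors are illustrative assumptions, not gensim's internals; the vectors are made up and far shorter than real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # close to -1
print(same_direction, orthogonal, opposite)
```

Note that the score depends only on the direction of the vectors, not on their length.
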

In the following cell, find the similarity between the words `president` and `leadership`:

In [None]:
model.similarity(...,...)

In [None]:
#SOLUTION
model.similarity('president', 'leadership')

Now find the similarity between the words `different` and `leadership`.

In [None]:
model.similarity('different', 'leadership')

You should notice that the second similarity score is significantly lower than the first. Does this make sense?

Find the similarity score between other words that may have very strong or very weak relationships:

In [None]:
#Similarity 1

In [None]:
#Similarity 2

We can also find the cosine similarity between two sets of word vectors. Each set is represented by the mean of its word vectors:

In [None]:
#Similarity between the president/leadership cluster and the confident/experience cluster
model.n_similarity(['president','leadership'],['confident','experience'])

We can find words that don't belong with `doesnt_match`. It finds the mean vector of the words in the list and identifies the word whose vector is furthest from that mean. Out of the three words in the list `['president', 'violent', 'leadership']`, which is the furthest vector from the mean?

In [None]:
#Fill in the blanks
model.doesnt_match([..., ..., ...])

*YOUR ANSWER*

In [None]:
#SOLUTION
model.doesnt_match(['president', 'violent', 'leadership'])
# 'violent'
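
The idea behind `doesnt_match` can be sketched roughly as follows. The function `odd_one_out` and the 2-dimensional toy vectors are hypothetical, chosen by hand so that "violent" points in a different direction; this is an approximation of the idea, not gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose vector is least similar to the mean vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[k] for u in units.values()) / len(units) for k in range(dim)])
    # For unit vectors, the dot product with the mean is the cosine similarity.
    scores = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)

# Made-up toy vectors, not taken from the trained model.
toy = {"president": [0.9, 0.1], "leadership": [0.8, 0.2], "violent": [0.1, 0.9]}
print(odd_one_out(toy))  # violent
```
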

The most famous application of this vector math is solving semantic analogies. What happens if we take:

$$leadership - president + assembly = $$

In [None]:
model.most_similar(positive=['leadership', 'assembly'], negative=['president'])

__Question__: What does this equation mean, and what do these output vectors mean?

*YOUR ANSWER HERE*

__Answer__:

This works by adding and subtracting the values of each word vector, dimension by dimension, so the equation outputs a new vector with a value for each of the (in this case 100) attributes. The model then outputs the words in the corpus whose vectors most closely match this output vector. The most famous example is:

$$ King - Man + Woman = ...$$

__Question__: What do you think this would output, and why?

Your answer here:

__ANSWER__: Queen. We take the 'manliness' from `KING` and replace it with 'woman', so the word vector King, while retaining its high royalty attribute, no longer has a strong manliness attribute and now has a strong feminine attribute. 
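
The arithmetic can be tried on hand-made vectors. In the sketch below, the three dimensions are assumed to mean (royalty, masculine, feminine), and every number is invented for illustration; real embeddings have no such labeled dimensions. The helper `analogy` is hypothetical and skips the normalization that gensim applies, but like `most_similar` it excludes the input words from the candidates.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Invented toy vectors: (royalty, masculine, feminine).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "child": [0.1, 0.5, 0.5],
}

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy(["king", "woman"], ["man"], toy_vectors))  # queen
```
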

# Principal Component Analysis <a id='section 2'></a>

Next we will explore the word embeddings of our `text` visually with PCA (remember the EDA lab from 3/22?). 

We can retrieve __all__ of the vectors from a trained model as follows:

In [None]:
X = model[model.wv.vocab]

As we did in the EDA lab, we want to standardize X so that all features have the same scale. Do this by creating a StandardScaler(), then run its fit_transform method on X. You should recognize the syntax from the EDA lab.

In [None]:
#scale the data
X_std = ...

#look at the covariance matrix
...

In [None]:
#SOLUTION
# scale the data
X_std = StandardScaler().fit_transform(X)

# look at the covariance matrix
np.cov(X_std.T)
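
For intuition about what standardizing does, the sketch below applies the same formula by hand with NumPy: subtract each feature's mean and divide by its standard deviation. The random matrix `X_toy` is a made-up stand-in for the word vectors, and the variable names are illustrative.

```python
import numpy as np

# Made-up data: 200 rows, 3 features on very different scales.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[5.0, -2.0, 0.0], scale=[10.0, 0.5, 3.0], size=(200, 3))

# Standardize each column: zero mean, unit (population) standard deviation.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_manual.mean(axis=0).round(6))      # each feature mean is about 0
print(X_manual.std(axis=0).round(6))       # each feature std is about 1
print(np.cov(X_manual.T).diagonal())       # variances near 1 on the diagonal
```

The diagonal of the covariance matrix comes out slightly above 1 because `np.cov` divides by n - 1 while the standardization above divides by n.
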

We can then fit a projection method on the vectors, such as those offered in scikit-learn, and use matplotlib to plot the projection as a scatter plot, which we will do next.

### Plot Word Vectors Using PCA <a id='subsection 3'></a>

Recall that we can create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class. Construct a PCA object using the `PCA()` class of the scikit-learn library (setting n_components=2 so we can graph it in two dimensions) and use its fit_transform method on your standardized X to get Y_pca: the principal components.

In [None]:
# make a PCA
pca = ... #set n_components to 2 to graph in 2-D

# fit the standardized data
Y_pca = pca...

In [None]:
#SOLUTION
#make a PCA
pca = PCA(n_components=2) #set n_components to 2 to graph in 2-D

# fit the standardized data
Y_pca = pca.fit_transform(X_std)

The resulting projection can be plotted using matplotlib, pulling out the two dimensions as x and y coordinates. Create a scatter plot of the projected word embeddings, setting the __size of each scatter point to 5__ to avoid overcrowding.

In [None]:
#Create a scatter plot here

In [None]:
#SOLUTION
plt.scatter(Y_pca[:, 0], Y_pca[:, 1], s= 5);

__Question__: What does each point represent? What do the x and y axes represent?

*YOUR ANSWER HERE:*

__ANSWER__: Each point represents a word. The explanation for the axes is below as a part of the lab:

You might at this point still be confused about what the x- and y-axes represent. Because PCA builds each axis as a weighted combination of all the original features, chosen to capture as much of the variation in the data as possible, the x and y axes don't have an intuitive meaning on a human level. PCA's job is to reduce the number of features, and in this case it reduced the 100 features of each word vector to the 2 that best summarize the words we modeled. So, don't worry too much about what the coordinates of each word represent. We just want you to have a general visual understanding of word vectors and how they may be related to one another on a graph.
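
To see that each axis really is a weighted mix of all the original features, here is a rough sketch of PCA done by hand with NumPy's singular value decomposition. The random matrix `X_toy` is a made-up stand-in for 50 word vectors with 100 features each, and the variable names are illustrative; scikit-learn's `PCA` may flip the sign of an axis relative to this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 100))  # 50 hypothetical "words", 100 features each

# PCA: center the data, then take the top right-singular vectors as the axes.
X_centered = X_toy - X_toy.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt[:2]                 # each row holds one weight per original feature
Y_toy = X_centered @ components.T   # coordinates of each "word" on the two new axes

# Share of the total variance captured by each axis, largest first.
explained = (S ** 2) / (S ** 2).sum()
print(components.shape, Y_toy.shape)  # (2, 100) (50, 2)
print(explained[:2].sum())            # fraction of variance kept by the 2-D plot
```

Each of the two axes has 100 weights, one per original feature, which is why neither axis corresponds to a single human-readable attribute.
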

On that note, run the following cell. This will label each vector with its respective word. Can you figure out, in general, what the code is doing?

In [None]:
import random
rando = random.sample(list(model.wv.vocab), 15)

X1 = model[rando]
pca1 = PCA(n_components=2)
result = pca1.fit_transform(X1)
# create a scatter plot of the projection
plt.scatter(result[:, 0], result[:, 1])

for i, word in enumerate(rando):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()

*YOUR ANSWER HERE*

__ANSWER:__

We are randomly selecting 15 words from the model's vocabulary and retrieving their vectors. Then we are using PCA to reduce the dimension to 2 features (note that this cell does not standardize the values first). Then we are using matplotlib to create a scatter plot and, for each word, attaching the word as a label at its coordinates!

Great job! You've now completed the lab on word embeddings and visualizing embeddings with PCA!

---

## Bibliography

- Brownlee, Jason. (2017, October 6). How to Develop Word Embeddings in Python with Gensim. https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

- TensorFlow. (2018, March 29). Vector Representations of Words. https://www.tensorflow.org/tutorials/word2vec
- `rem_punc_stop` function borrowed from Tian Qin's notebook on pre-processing: https://github.com/ds-modules/LEGALST-190/blob/master/labs/3-1/3-1_preprocessing_text_student_version.ipynb
- PCA section adapted from materials by Keeley Takimoto: https://github.com/ds-modules/LEGALST-190/blob/master/labs/3-22/3-22_EDA.ipynb
- Word2Vec introduction & examples adapted from materials by Chris Hench: https://github.com/henchc/textxd-2017/blob/master/08-Word-Embeddings.ipynb



Notebook developed by: Keiko Kamei

Data Science Modules: http://data.berkeley.edu/education/modules