# [Computational Social Science]
## 5-4 word2vec - Student Version

In this lab we will introduce word embeddings via word2vec.

## Virtual Environment
Remember to always activate your virtual environment before you install packages or run a notebook! This helps prevent conflicts between dependencies across different projects and ensures that you are using the correct versions of packages. You should have created an Anaconda virtual environment in the `Anaconda Installation` lab. If you have not, or if you want to create a new virtual environment, follow the instructions in the `Anaconda Installation` lab. 

<br>

If you have already created a virtual environment, you can run the following command to activate it: 

<br>

`conda activate <virtual_env_name>`

<br>

For example, if your virtual environment is named CSS, run the following command. 

<br>

`conda activate CSS`

<br>

To deactivate your virtual environment after you are done working with the lab, run the following command. 

<br>

`conda deactivate`

<br>

In [None]:
# download and install new libraries -- UNCOMMENT AND RUN ONLY THE FIRST TIME
# ----------
#!pip install gensim
#!pip install tqdm
#!pip install adjustText
#!pip install multiprocessing - ONLY NECESSARY ON OLD VERSIONS OF PYTHON (BEFORE 2.6)

In [None]:
# load libraries
# ----------
import pandas as pd
import numpy as np
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

from tqdm import tqdm

import gensim
from gensim import models
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import utils
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelBinarizer
from sklearn.manifold import TSNE

import multiprocessing

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Data

<img src = "../../images/cfpb_logo.png"/>

We'll once again use the Consumer Financial Protection Bureau's [Consumer Complaint Database](https://www.consumerfinance.gov/data-research/consumer-complaints/). Picking up from where we left off last time, we'll focus on predicting whether a consumer complaint narrative is talking about a "mortgage" issue or a "student loan" issue.

In [None]:
# load the data
# ----------
cfpb = pd.read_csv("../../data/CFPB 2020 Complaints.csv")
cfpb = cfpb.dropna(subset = ['Consumer complaint narrative'])
cfpb = cfpb[(cfpb['Product']=='Mortgage') | (cfpb['Product'] == 'Student loan')]
cfpb = cfpb[:1000].reset_index(drop = True)

## Overview <a id='context'></a>

In this lab, we will be turning individual words in the data set into vectors, called "Word Embeddings". **Word embeddings** attempt to identify semantic relationships between words by observing the contexts in which each word appears. `Word2Vec` is one of the most prominent word embedding algorithms.

Imagine that each word in a novel has its meaning determined by the ones that surround it in a limited window. For example, in Moby Dick's first sentence, “me” is paired on either side by “Call” and “Ishmael” (`"Call me Ishmael"`). After observing the windows around every word in the novel (or many novels), the computer will notice a pattern in which “me” falls between pairs of words similar to those surrounding “her,” “him,” or “them.” Of course, the computer has gone through a similar process for the words “Call” and “Ishmael,” for which “me” is reciprocally part of their contexts. This chaining of signifiers to one another mirrors some of humanists' most sophisticated interpretative frameworks of language.

The two main flavors of `Word2Vec` are CBOW (Continuous Bag of Words) and Skip-Gram, which can be distinguished partly by their input and output during training. **Skip-Gram** takes a word of interest as its input (e.g., "me") and tries to learn to predict its context words ("Call", "Ishmael"). **CBOW** does the opposite, taking the context words ("Call", "Ishmael") as a single input and trying to predict the word of interest ("me").

In general, CBOW is faster and does well with frequent words, while Skip-Gram potentially represents rare words better.
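
To make the difference in inputs and outputs concrete, here is a minimal sketch of how training examples could be built from the "Call me Ishmael" sentence. The helper `context_windows` and the window size of 1 are assumptions made for illustration; this is not how gensim is implemented internally.

```python
def context_windows(tokens, window=1):
    """Yield (target, context_words) for every position in a token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

sentence = ["call", "me", "ishmael"]

# Skip-Gram: one (input word -> context word) example per neighbor
skipgram_pairs = [(target, ctx)
                  for target, context in context_windows(sentence)
                  for ctx in context]

# CBOW: one (context words -> target word) example per position
cbow_pairs = [(context, target)
              for target, context in context_windows(sentence)]

print(skipgram_pairs)
print(cbow_pairs)
```

Notice that Skip-Gram produces more, smaller examples from the same sentence, which is one intuition for why it tends to be slower to train.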

### Word2Vec Features

* `vector_size`: Number of dimensions for word embedding model (*formerly* `size`) 
* `window`: Number of context words to observe in each direction
* `min_count`: Minimum frequency for words included in model
* `sg` (Skip-Gram): '0' indicates CBOW model; '1' indicates Skip-Gram
* `alpha`: Learning rate (initial); prevents model from over-correcting, enables finer tuning
* `epochs`: Number of passes through dataset (*formerly* `iter`) 
* `batch_words`: Target size (in words) of the batches of examples passed to each worker thread
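
The `alpha` entry above is an *initial* learning rate because the rate shrinks as training proceeds. The sketch below assumes a linear decay from `alpha` down to a floor called `min_alpha`; the function name `linear_alpha` is hypothetical, and the default values 0.025 and 0.0001 are assumed to match gensim's defaults, so check the documentation for your version.

```python
def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a given fraction (0.0 to 1.0) of training is done."""
    return alpha - (alpha - min_alpha) * progress

for progress in (0.0, 0.5, 1.0):
    rate = linear_alpha(progress)
    print(f"{progress:.0%} of training done -> learning rate {rate:.5f}")
```

Large steps early on let the vectors move quickly toward a reasonable configuration, and smaller steps later allow finer tuning.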


For more detailed background on Word2Vec's mechanics, I suggest this  <a href="https://www.tensorflow.org/text/tutorials/word2vec">brief tutorial</a> by Google, especially the sections "Motivation," "Skip-Gram Model," and "Visualizing." There is also this [tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py) that might be helpful as well. 

We will be using the default value for most of our parameters.

## Preprocessing

First let's use our handy preprocessing function. Notice that this version will return a list of tokens (not a string), and we also added the `str.lower()` method.

In [None]:
# create preprocessing function - like we have in the past few labs
# ----------

def rem_punc_stop(text):
    stop_words = STOP_WORDS
    # add placeholder tokens to the stop words one at a time...
    # nlp.Defaults.stop_words.add("xx")
    # nlp.Defaults.stop_words.add("xxxx")
    # nlp.Defaults.stop_words.add("xxxxxxxx")
    
    # ...or all at once, using the in-place set union (|=) operator
    nlp.Defaults.stop_words |= {"xx", "xxxx","xxxxxxxx"}
    
    punc = set(punctuation)
    
    # drop punctuation characters
    punc_free = "".join([ch for ch in text if ch not in punc])
    
    # tokenize with spaCy
    doc = nlp(punc_free)
    
    # lowercase every token
    spacy_words = [token.text.lower() for token in doc]
    
    # drop stop words
    no_stop = [word for word in spacy_words if word not in stop_words]
    
    return no_stop

In [None]:
# apply our preprocessing function to the consumer complaint column in our original dataframe
# ----------
cfpb['tokens'] = cfpb['Consumer complaint narrative'].map(lambda x: rem_punc_stop(x))
cfpb['tokens']

## Model Training

Now that we have pre-processed our text, we can use the [`gensim`](https://radimrehurek.com/gensim/models/word2vec.html) library to construct our word embeddings. We will use the Continuous Bag of Words model (CBOW), which predicts target words from their neighboring context words to learn word embeddings from raw text.

Read through the documentation of the Word2Vec method in gensim to understand how to implement the Word2Vec model. Then fill in the blanks so that: we use a __Continuous Bag of Words__ model to create word embeddings of __vector_size 100__ for words that appear in the `text` __at least 5 times__. Set the learning rate to .025, epochs/number of iterations to 5, and pass batches of 10000 words to the worker threads.

**CHALLENGE:** Go ahead and annotate each line so you know what each one is doing. 

In [None]:
# apply CBOW Word2vec model to our cfpb data
# ----------
model = gensim.models.Word2Vec(cfpb['tokens'],   # specify data - sentences
                               vector_size=...,  # ...
                               window=...,       # ...
                               min_count=...,    # ...
                               sg=0,             # ...
                               alpha=...,        # ... 
                               epochs=...,       # ... 
                               seed=1,           # set random seed (same as random_state in sklearn )
                               batch_words=...,  # ...
                               workers = 1)      # ...

## Practice with Embeddings <a id='subsection 2'></a>

Now that we've trained the model, we can return the actual high-dimensional vector by simply indexing the model with the word as the key:

In [None]:
# return embeddings for specific word 
# ----------
print(model.wv.__getitem__(['account']))  # specify a key word here: "account"

**QUESTION:** Check out the shape of the vectors for 'account', what do you notice?

In [None]:
# get shape
# ----------
model.wv.__getitem__(['account'])... 

**ANSWER:** ...

**CHALLENGE:** Use the following empty cells to look at what the word embeddings look like for words you think may appear in the text, for example, `bank`. Keep in mind that even if a word shows up in the text as seen above, a word vector will not be created unless it satisfies all conditions we inputted into the model above. 

In [None]:
# word 1
# ----------
...

In [None]:
# word 2
# ----------
...

In [None]:
# word 3 
# ----------
...

If you're curious, the cell directly below will return a list of words that have been turned into word vectors by the model above:

In [None]:
# return a list of words for which we have calculations
# ----------
words = list(model.wv.index_to_key)
print(words[0:100])  # print the first 100 words

`gensim` comes with some handy methods to analyze word relationships. `similarity` will give us a number from -1 to 1 based on how similar two words are. If this sounds like cosine similarity for words, you'd be right! It just takes the cosine similarity of the high dimensional vectors we input. 
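
As a reminder of what cosine similarity computes, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration and are not taken from our trained model, and `cosine_similarity` is a helper written for this sketch rather than a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors, invented for this example
credit = [0.9, 0.1, 0.3]
debt = [0.8, 0.2, 0.4]
unrelated = [0.0, 1.0, 0.0]

print(cosine_similarity(credit, debt))       # vectors pointing in similar directions
print(cosine_similarity(credit, unrelated))  # vectors pointing in different directions
```

Because the measure depends only on the angle between the vectors, two words can be highly similar even if one vector is much longer than the other.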

In the following cell, find the similarity between the words `credit` and `debt`:

In [None]:
# similarity between credit and debt
# ----------
model.wv.similarity('credit', 
                    'debt')

We can also find the cosine similarity between two sets of word vectors. Each set is represented by the mean of its word vectors:

In [None]:
# similarity between credit/debt and loan/mortgage
# ----------
model.wv.n_similarity(['credit','debt'],
                      ['loan','mortgage'])
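
To see what "the mean of its words" looks like in practice, here is a rough sketch with made-up two-dimensional vectors. The helpers `mean_vector` and `cosine_similarity` and the dictionary `toy` are hypothetical and only meant to mirror the idea, not gensim's code.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors, invented for this example
toy = {"credit": [0.9, 0.1], "debt": [0.7, 0.3],
       "loan": [0.6, 0.4], "mortgage": [0.4, 0.6]}

set_a = mean_vector([toy["credit"], toy["debt"]])
set_b = mean_vector([toy["loan"], toy["mortgage"]])

set_similarity = cosine_similarity(set_a, set_b)
print(set_similarity)
```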

We can find words that don't belong with `doesnt_match`. It finds the mean vector of the words in the `list`, and identifies the furthest away. Out of the three words in the list `['credit', 'loan', 'student']`, which is the furthest vector from the mean?

In [None]:
# which doesn't belong? 1
# ----------
model.wv.doesnt_match(['credit', 'loan', 'student'])

In [None]:
# which doesn't belong? 2
# ----------
model.wv.doesnt_match(('pandemic', 'covid', 'bank'))
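
The "furthest from the mean" logic can be sketched in a few lines. This version assumes each vector is scaled to unit length before averaging; the vectors in `toy` are invented, and `odd_one_out` is a hypothetical helper, not part of gensim.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose vector is least similar to the mean of all vectors."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    mean = unit([sum(col) / len(units) for col in zip(*units.values())])
    similarity_to_mean = {word: sum(a * b for a, b in zip(vec, mean))
                          for word, vec in units.items()}
    return min(similarity_to_mean, key=similarity_to_mean.get)

# toy vectors, invented for this example
toy = {"credit": [0.9, 0.2], "loan": [0.8, 0.3], "student": [0.1, 0.9]}
print(odd_one_out(toy))
```

With only a handful of words the mean is pulled toward every member, including the outlier, so results on short lists should be read with some caution.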

The most famous application of this vector math is semantic analogies. What happens if we take:

$$\vec{house} - \vec{rent} + \vec{loan} = $$

In [None]:
# vector math
# ----------
model.wv.most_similar(positive=['house', 'loan'], 
                      negative=['rent'])
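
Under the hood, this kind of query adds the "positive" vectors, subtracts the "negative" ones, and looks for the nearest remaining word. The sketch below illustrates the idea with invented three-dimensional vectors; it works on raw vectors for simplicity, whereas gensim normalizes vectors first, so treat it as an approximation of the approach rather than a reimplementation.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors, invented for this example
toy = {
    "house": [0.9, 0.1, 0.0],
    "rent": [0.1, 0.9, 0.0],
    "loan": [0.1, 0.0, 0.9],
    "mortgage": [0.9, 0.0, 0.9],
    "student": [0.0, 0.2, 0.8],
    "bank": [0.3, 0.3, 0.3],
}

positive, negative = ["house", "loan"], ["rent"]

# house - rent + loan, computed dimension by dimension
query = [0.0, 0.0, 0.0]
for word in positive:
    query = [q + x for q, x in zip(query, toy[word])]
for word in negative:
    query = [q - x for q, x in zip(query, toy[word])]

# rank every word that was not part of the query
candidates = {word: cosine_similarity(query, vec)
              for word, vec in toy.items()
              if word not in positive + negative}
best = max(candidates, key=candidates.get)
print(best)
```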

Take a few minutes to try looking at some more vector similarity and differences. 

In [None]:
# more vector math
# ----------
...

## Principal Component Analysis <a id='section 2'></a>

Next we will explore the word embeddings of our `text` visually with PCA. We can retrieve __all__ of the vectors from a trained model as follows:

In [None]:
# retrieve vectors from trained model
# ----------
X = model.wv.__getitem__(model.wv.index_to_key)

As we do with non-text features, we want to standardize X so that all features have the same scale. Do this by creating a `StandardScaler()`, then run its `fit_transform` method on X. 

In [None]:
# scale the data
# ----------
X_std = StandardScaler().fit_transform(X)
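
For intuition, standardizing means subtracting each column's mean and dividing by its standard deviation. Here is a minimal pure-Python sketch on a made-up matrix; `standardize_columns` is a hypothetical helper that assumes no column is constant, and it uses the population standard deviation.

```python
import statistics

def standardize_columns(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # population standard deviation
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]

# two features on very different scales
X_toy = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
X_toy_std = standardize_columns(X_toy)
print(X_toy_std)
```

After rescaling, both columns carry the same values even though the raw numbers differed by two orders of magnitude, which keeps any single dimension from dominating the PCA.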

We can then train a projection method on the vectors, such as those methods offered in scikit-learn, then plot the projection as a scatter plot which we will do next.

### Plot Word Vectors Using PCA <a id='subsection 3'></a>

Recall that we can create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class. Construct a PCA object using the `PCA()` class of the scikit-learn library (setting n_components=2 so we can graph it in two dimensions) and use its fit_transform method on your standardized X to get Y_pca: the principal components.

In [None]:
# make a PCA
# ----------
pca = PCA(...)

# fit and transform the standardized data
# ----------
Y_pca = ...

The resulting projection can be plotted using `matplotlib`, pulling out the two dimensions as x and y coordinates. Create a scatter plot of the standardized word embeddings, setting the __size of each scatter point to 5__ to avoid overcrowding.

In [None]:
# visualize
# ----------
sns.scatterplot(x = ...,  # extract all the elements from the first column
                y = ...); # extract all the elements from the second column

__QUESTION__: What does each point represent? What do the x and y axes represent?

__ANSWER__: ...

You might at this point still be confused about what the x- and y-axes represent. Because PCA builds each component as a weighted combination of all the original features, chosen to capture as much of the variance in the data as possible, the x and y axes actually **don't have an intuitive meaning on a human level.** PCA's job is to reduce the dimension of the features, and in this case it reduces the 100 features of each word vector to just the 2 components that best describe the variation among the words we modeled. So, don't worry too much about what the coordinates of each word represent - we just want you to have a general and visual understanding of word vectors and how they may be related to one another on a graph.

On that note, run the following cell. This will label each vector with its respective word. 

**CHALLENGE:** Annotate each line to ensure clear understanding.

In [None]:
# recreate visualization with points
# ----------

# ANNOTATE EACH LINE

#
import random
from adjustText import adjust_text

#
random.seed(10)

#
rando = random.sample(list(model.wv.index_to_key), 25) 

#
X1 = model.wv.__getitem__(rando)


pca1 = PCA(n_components=2, 
           random_state=15)

#
result = pca1.fit_transform(X1)

#
result_df = pd.DataFrame(result,                   # 
                         columns = ['PC1', 'PC2'], # 
                         index = rando)            # 

#
sns.scatterplot(x = 'PC1',         # 
                y = 'PC2',         # 
                data = result_df)  # 

#
texts = []

#
for word in result_df.index:
    texts.append(plt.text(result_df.loc[word, 'PC1'], 
                          result_df.loc[word, 'PC2'], 
                          word, 
                          fontsize = 8))
    
#
adjust_text(texts, 
            force_text = (0.4,0.4),
            expand = (1.2,1),
            arrowprops = dict(arrowstyle = "-", 
                              color = 'black', 
                              lw = 0.5))

#
plt.show()

## t-SNE

Another popular unsupervised method for summarizing and visualizing word embeddings is [t-distributed stochastic neighbor embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding). We won't get into the details here, but the basic method involves:

1. Estimating the joint probability of the distance between each pair of points in the original high-dimensional space, assuming a Gaussian distribution.
2. Projecting the data into a low-dimensional space (here, 2 dimensions), and then estimating the joint probability of the distance between each pair of points there, assuming a Student's t-distribution.
3. Using gradient descent to move the low-dimensional points so that the second distribution becomes similar to the first one.

The basic idea behind this procedure is that we search for the lower-dimensional representation that comes closest to modeling the original distances in the higher-dimensional space.

In [None]:
# preprocessing
# ----------

# filter to include only those for which Word2Vec has a vector
vector_list = [model.wv.__getitem__(word) for word in words if word in model.wv.index_to_key]

# create a list of the words corresponding to these vectors
words_filtered = [word for word in words if word in model.wv.index_to_key]

# bind together both lists using zip
word_vec_zip = zip(words_filtered, vector_list)

# create a dictionary and save as a dataframe
word_vec_dict = dict(word_vec_zip)
word_vec_df = pd.DataFrame.from_dict(word_vec_dict, orient='index')
word_vec_df.shape

Here we initialize a t-SNE model using 2 components as the key hyperparameter. 

In [None]:
# create t-SNE visualization
# ----------

# initialize t-SNE
tsne = TSNE(n_components = 2,  # specify 2 components
            init = 'random',   # set initialization
            random_state = 10, # set seed
            perplexity = 100)  # set perplexity

# subset to only 400 rows to speed up training time
tsne_df = tsne.fit_transform(word_vec_df[:400])

# figure specifications
fig, ax = plt.subplots(figsize = (11.7, 8.27))
sns.scatterplot(x = tsne_df[:, 0], 
                y = tsne_df[:, 1], 
                alpha = 0.5)

# initialize empty list
texts = []

# create list of words
words_to_plot = list(np.arange(0, 400, 10))

# append words to list using loop
for word in words_to_plot:
    texts.append(plt.text(tsne_df[word, 0], 
                          tsne_df[word, 1], 
                          word_vec_df.index[word], 
                          fontsize = 10))
    
# adjust text to clearly see labels
adjust_text(texts, 
            force_text = (0.4,0.4),
            expand = (1.2,1),
            arrowprops = dict(arrowstyle = "-",
                              color = 'black', 
                              lw = 0.5))

# plot
plt.show()

### Challenge: t-SNE hyperparameters

Try playing with the hyperparameters to see if you can get a different looking plot. Why might this be a problem for interpretability or inference?

In [None]:
# recreate t-SNE visualization with 3 components
# ----------
tsne = TSNE(n_components = ..., # specify 3 components  
            init = 'random',    # set initialization 
            random_state = 10,  # set seed   
            perplexity = ...)   # you might have to lower the perplexity
   

# subset to only 400 rows to speed up training time
tsne_df = tsne.fit_transform(word_vec_df...)



# figure specifications
fig, ax = plt.subplots(figsize = (11.7, 8.27))
sns.scatterplot(x = tsne_df[:, 0], 
                y = tsne_df[:, 2], 
                alpha = 0.5)


# initialize empty list
texts = ...

# create list of words
words_to_plot = list(np.arange(0, 400, 10))

# append words to list using loop
for word in words_to_plot:
    texts.append(plt.text(tsne_df[word, 0], 
                          tsne_df[word, 2],  # same components as the scatter plot above
                          word_vec_df.index[word], 
                          fontsize = 10))
    
# adjust text to clearly see labels   
adjust_text(texts, 
            force_text = (0.4,0.4),
            expand = (1.2,1),
            arrowprops = dict(arrowstyle = "-",
                              color = 'black', 
                              lw = 0.5))

# plot 
plt.show()

## Averaging Word Embeddings

You'll notice that right now each token is represented by a 100-dimensional array. If we passed these directly to a classification algorithm our feature space would get very high dimensional quickly! A common practice to avoid this problem is to average the word embeddings somehow. You might have heard of variations of `word2vec` like `sent2vec` and `par2vec`, which create embeddings for sentences and paragraphs, respectively. We'll introduce a similar method below, but one straightforward way to do this without a fancy new model is to simply average the word embeddings at the document level.
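
Before working with the real model, here is a small sketch of the idea using a plain dictionary as a stand-in for the trained embeddings. The names `average_embedding`, `toy_vectors`, and `toy_doc` are made up for illustration, and the two-dimensional vectors are invented.

```python
def average_embedding(tokens, vectors):
    """Average the vectors of the tokens found in `vectors`; None if none are found."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]

# toy stand-in for a trained model's vocabulary
toy_vectors = {"loan": [1.0, 0.0], "payment": [0.0, 1.0], "late": [1.0, 1.0]}
toy_doc = ["loan", "payment", "late", "unseen"]  # "unseen" has no vector here

doc_vector = average_embedding(toy_doc, toy_vectors)
print(doc_vector)
```

However long the document is, the result has the same number of dimensions as a single word vector, which is what makes it convenient as a fixed-size input to a classifier.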

To get a sense of how this works, let's look at how many tokens we have in our first document:

In [None]:
# get length
# ----------
len(cfpb['tokens'][0])

223 - but remember that not all tokens will have vectors associated with them if they do not meet word2vec's criteria. Let's see how many we have that are in our model's vocabulary:

In [None]:
# get model's vocabulary
# ----------
doc = [word for word in cfpb['tokens'][0] if word in model.wv.index_to_key]
len(doc)

195! Looks like quite a few tokens didn't make it into the model. Let's look at a few that did: 

In [None]:
# look at the first five
# ----------
doc[0:5]

Let's look at the array for 'contacting'. Notice that it is represented by a 100-dimensional array. 

In [None]:
# word embeddings for "contacting"
# ----------
print(model.wv.__getitem__('contacting'))
print(model.wv.__getitem__('contacting').shape)

In [None]:
# find the mean
# ----------
np.mean(model.wv.__getitem__('contacting'))

Now let's grab the first element of each token's vector and take their mean:

In [None]:
# find the first element of each token's vector and find their mean
# ----------
# create empty list
first_vec = ...

# loop over each token's vector
for token in model.wv.__getitem__(doc):
    first_vec.append(token[0])
    
# calculate their mean
...(first_vec)

And then let's do this for every token and document in our corpus:

In [None]:
# create function to do this for every token and document in our corpus
# ----------
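def document_vector(word2vec_model, doc):
    # keep only the tokens that are in the model's vocabulary
    doc = [word for word in doc if word in word2vec_model.wv.index_to_key]
    # average the word vectors element-wise to get one vector per document
    return np.mean(word2vec_model.wv.__getitem__(doc), axis=0)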

In [None]:
# create an array for the size of the corpus
# ----------
# create empty list
empty_list_embeddings_means = []

# loop over each document
for puppy in cfpb['tokens']: # append the vector for each document
    empty_list_embeddings_means.append(document_vector(model, puppy))
    
# convert the list to array
doc_average_embeddings = np.array(empty_list_embeddings_means)

# print averages
doc_average_embeddings

Ultimately we get an array with `n` rows and 100 columns:

In [None]:
# find the shape 
doc_average_embeddings.shape

## Document averaged word embedding (doc2vec)

Document averaged word embeddings tend to perform well with downstream prediction tasks, but there are other options as well. Here, we'll take a look at [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html). This is also a very helpful [tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html).  

We're getting close to classifying, and this is a good point to do our train/test splits. While normally we recommend waiting to do splits until after all preprocessing is done, in this case it will be easier to do the split now because of the way the `TaggedDocument` class works. Let's take a look:

In [None]:
# preprocessing
# ----------

# initialize label binarizer
lb_style = LabelBinarizer()

# fit transform
y = cfpb['Product_binary'] = lb_style.fit_transform(cfpb["Product"])

# train/test split
train, test = train_test_split(cfpb,             # specify dataset
                               test_size=0.2,    # specify test size
                               random_state=42)  # set seed
# view
train.head()

After we do our train/test split, we apply the `TaggedDocument()` function to every document. This allows us to associate each document with the class that we want to predict later:

In [None]:
# apply tag to each train/test dataset
# ----------

# tag training dataset
cfpb_train_tagged = train.apply(lambda r: TaggedDocument(words=r['tokens'],
                                                         tags=[r.Product_binary]), 
                                axis=1)

# tag testing dataset
cfpb_test_tagged = test.apply(lambda r: TaggedDocument(words=r['tokens'], 
                                                       tags=[r.Product_binary]),
                              axis=1)

# view the first row (by position, since the split shuffles the index)
cfpb_train_tagged.iloc[0]

We're now ready to train our `doc2vec`! One of the key features of gensim is that it natively allows multicore processing - meaning it can take advantage of your CPU cores. Note that this is slightly different from TensorFlow, which we covered last semester and which also allows GPU acceleration. You can check how many CPU cores you have available: 

In [None]:
# count your cores for processing
# ----------
cores = multiprocessing.cpu_count()
cores

[Parallel processing](https://en.wikipedia.org/wiki/Parallel_computing) is an important topic in computational social science - it will be the key to speeding up lots of different operations. In general, we recommend that whenever you use parallel processing you reserve 1 CPU core for your computer's other functions (keeping your browser and other software running), and use the remaining cores for your task at hand. In this case, doc2vec's training process is an example of an [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) task, meaning that the different CPUs don't need to talk to each other to do their computations. They can work independently and combine the results at the end. In this case, we are working with very few observations (just 1000), but this is a useful technique to keep in mind for your projects and research! 
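
The "reserve one core" rule of thumb can be written in a couple of lines using only the standard library. The variable name `workers` is chosen here for illustration, and the `max(1, ...)` guard is an added assumption so that a single-core machine still gets one worker.

```python
import multiprocessing

cores = multiprocessing.cpu_count()

# leave one core free, but never drop below one worker
workers = max(1, cores - 1)

print(f"{cores} cores available -> use {workers} worker(s)")
```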

First we'll train the model (notice the use of the [`tqdm`](https://tqdm.github.io/) library for progress bars):

In [None]:
# train a Doc2Vec model 
# ----------
# initialize Doc2Vec
model_dbow = Doc2Vec(...,               # specify a distributed bag of words
                     ...,               # set word embedding to 300
                     ...,               # include 5 negative samples
                     hs=0,              # no hierarchical softmax; use negative sampling instead
                     min_count=2,       # ignores all words with a total frequency lower than this threshold.
                     ...,               # essentially turn off downsampling
                     seed = 1995,       # set seed for reproducibility
                     ...)               # set # of cores to 1 less than you have - results will not be fully reproducible unless this is 1

# apply to training data
model_dbow.build_vocab([x for x in tqdm(cfpb_train_tagged.values)])

We'll allow the model to train for 30 iterations (epochs - this is the same as our neural nets lab):

In [None]:
# train for 30 epochs
# ----------
# a single call to train() lets gensim decay the learning rate from alpha to min_alpha on its own
model_dbow.train(utils.shuffle([x for x in tqdm(cfpb_train_tagged.values)]), 
                 total_examples=len(cfpb_train_tagged.values), 
                 epochs=30)

Finally, we'll define a function that will grab the embeddings for each document:

In [None]:
# grab the embeddings
# ----------
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words)) for doc in sents])
    return targets, regressors

## Classification

And finally, let's use logistic regression to see how well our document embeddings do:

In [None]:
#
# Classification model
# ----------------------------------------


# initialize a logit model
# ----------

# split into training
y_train, X_train = vec_for_learning(model_dbow,     # specify datasets for split
                                    cfpb_train_tagged)

# split into testing
y_test, X_test = vec_for_learning(model_dbow,       # specify datasets for split
                                  cfpb_test_tagged)

# initialize model
logit_reg = ...()  # initialize logit model

# fit on training
logit_model = logit_reg.fit(...,   # fit to training data
                            ...)

# predict on test data
y_pred = logit_model.predict(...) # predict on test data


# confusion matrix
# ----------

# create confusion matrix 
cf_matrix = confusion_matrix(...,                 # actual
                             ...,                 # predictions
                             ...)                 # normalize

# create dataframe
df_cm = pd.DataFrame(...,    # specify matrix for calculations 
                     range(2),
                     range(2))

# rename indices
df_cm = df_cm.rename(index=str, columns={0: "Mortgage", 
                                         1: "Student loan"})

df_cm.index = ["Mortgage", "Student loan"]

# plot specifications
# ----------
plt.figure(figsize = (10,7))
sns.set(font_scale=1.4) # for label size
sns.heatmap(df_cm, 
           annot=True,
           annot_kws={"size": 16},
           fmt='g')

plt.title('Confusion Matrix')
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

Hmm, not as well as we might think! But this shouldn't be too surprising with a small dataset - word embeddings tend to work best when given lots of data.

## Loading Pre-Trained Embeddings

So far we have been working with embeddings trained on our particular corpus. However, this is not standard practice - as you saw above, word2vec works best when it has lots of data. The problem is that training state-of-the-art models requires intense computational resources. It also has a [large carbon footprint](https://arxiv.org/pdf/1906.02243.pdf). Luckily, we can use pre-trained models like Google News or Stanford's GloVe. Note that to run this next chunk of code, you need to have the [GoogleNews embeddings](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit) in your "data" directory in this repo.

In [None]:
# load pre-trained Google News model 
# ----------
googlenews_word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('../../data/GoogleNews-vectors-negative300.bin.gz', 
                                                                            binary = True) 

We can also tune the Google News embeddings to a domain-specific corpus. This may or may not be necessary depending on how specific or unique you think words in your corpus might be:

In [None]:
# retrain google on your own corpus
# ----------

# ! NOTE: THIS TAKES A VERY LONG TIME TO RUN - SO WE WON'T RUN IT HERE
# ! BUT WE CAN RE-RUN FROM HERE TO SEE HOW THIS MIGHT ULTIMATELY AFFECT
# ! OUR CLASSIFICATION MODEL



## specify your corpus
#your_corpus = cfpb['tokens']  # Load your corpus here, make sure it's tokenized
#
## initialize Word2Vec model with the same dimensions as the Google News vectors
#word2vec_model = gensim.models.Word2Vec(vector_size=300,    # word embedding size
#                                        window=5,           # window size
#                                        min_count=1,        # ignores words w/ frequency lower than this threshold
#                                        workers=cores - 1)  # how many cores will be used
#
## build vocabulary 
#word2vec_model.build_vocab(your_corpus)
#
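## initialize the vectors of words shared with Google News using the pre-trained vectors
#for word, idx in word2vec_model.wv.key_to_index.items():
#    if word in googlenews_word2vec_model.key_to_index:
#        word2vec_model.wv.vectors[idx] = googlenews_word2vec_model[word]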
#
## training model
#word2vec_model.train(your_corpus, 
#                     total_examples=len(your_corpus), 
#                     epochs=5, 
#                     compute_loss=True)

## Challenges

### Challenge Embeddings 

Repeat the exploration we did in the first part, where we trained our own word embeddings, this time using the Google News model. Do you notice any differences? Create document averaged word embeddings and predict our outcome ('Product_binary'). How does this compare to doc2vec?

In [None]:
# return embeddings for specific word -- "account"
# ----------
print(...)

**QUESTION:** Check out the shape of the vectors for 'account', what do you notice?

In [None]:
# get the shape of the embeddings - what does this tell us?
# ----------
googlenews_word2vec_model.__getitem__(['account']). ...

**ANSWER:** ...

Use the following empty cells to look at what the word embeddings look like for words you think may appear in the text! Keep in mind that even if a word shows up in the text as seen above, a word vector will not be created unless it satisfies all conditions we inputted into the model above. 

In [None]:
# word 1 - try the word "navient"
# ----------
...

**QUESTION:** Did this run as you expected? Why or why not?

**ANSWER:** ...

In [None]:
# word 2 - try the word "company"
# ----------
...

In [None]:
# word 3 - try the word "credit"
# ----------
...

In [None]:
# return a list of words for which we have calculations
# ----------
words = list(googlenews_word2vec_model. ...) # get the indices and keys
print(words...)                              # return the first 100 words

`gensim` comes with some handy methods to analyze word relationships. `similarity` gives us a number between -1 and 1 describing how similar two words are, with values closer to 1 meaning more similar. If this sounds like cosine similarity for words, you'd be right! It is simply the cosine similarity of the two high-dimensional word vectors.
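
To make the idea concrete, here is a minimal sketch of cosine similarity in plain Python so the arithmetic is visible. The `toy_vectors` values are made up for illustration and do not come from any trained model; the helper name `cosine_similarity` is ours, not a `gensim` function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical 3-dimensional "embeddings" (real models use hundreds of dimensions)
toy_vectors = {
    "credit": [0.9, 0.1, 0.3],
    "debt":   [0.8, 0.2, 0.4],
    "banana": [-0.5, 0.9, -0.2],
}

# vectors pointing in similar directions score close to 1
print(cosine_similarity(toy_vectors["credit"], toy_vectors["debt"]))

# vectors pointing in different directions score lower, possibly below 0
print(cosine_similarity(toy_vectors["credit"], toy_vectors["banana"]))
```

Because only the angle between the vectors matters, rescaling a vector does not change its similarity to any other vector.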

In the following cell, find the similarity between the words `credit` and `debt`. 

**QUESTION:** How does this compare to the model trained on our own small dataset?

In [None]:
# similarity between credit and debt and compare to above
# ----------
googlenews_word2vec_model...(...)

**ANSWER:** ...

We can also compute the cosine similarity between two sets of words. Each set is represented by the mean of its word vectors:

In [None]:
# similarity between credit/debt and loan/mortgage
# ----------
googlenews_word2vec_model.n_similarity(...)
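
As a rough sketch of the idea behind comparing two sets of words, the snippet below averages each set into a single vector and then takes the cosine similarity of the two averages. The vectors and the helper names (`mean_vector`, `cosine_similarity`) are hypothetical and only for illustration.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical 2-dimensional vectors, not taken from any model
toy_vectors = {
    "apple":  [0.9, 0.1],
    "pear":   [0.7, 0.3],
    "truck":  [0.1, 0.9],
    "tractor": [0.3, 0.7],
}

# one averaged vector per set of words
fruit = mean_vector([toy_vectors["apple"], toy_vectors["pear"]])
vehicles = mean_vector([toy_vectors["truck"], toy_vectors["tractor"]])

# a single similarity score between the two sets
print(cosine_similarity(fruit, vehicles))
```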

We can find words that don't belong with the `doesnt_match` method. It computes the mean vector of the words in the list and identifies the word furthest from that mean. Try it out: of the three words in the list `['credit', 'loan', 'student']`, which is furthest from the mean?

In [None]:
# words that don't match 1
# ----------
...

In [None]:
# words that don't match 2
# ----------
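
For intuition, here is a simplified sketch of the "odd one out" logic described above: normalize each vector, average them, and pick the word least aligned with that average. This is an approximation under our own assumptions, not `gensim`'s exact implementation, and the toy vectors and the name `odd_one_out` are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(word_vectors):
    """Return the word whose unit vector is least aligned with the mean direction."""
    units = {word: unit(vec) for word, vec in word_vectors.items()}
    n = len(units)
    # average of the unit vectors, rescaled to length 1
    mean = unit([sum(column) / n for column in zip(*units.values())])
    # dot product with the mean = cosine similarity, since both are unit length
    scores = {word: sum(a * b for a, b in zip(vec, mean)) for word, vec in units.items()}
    return min(scores, key=scores.get)

# hypothetical 2-dimensional vectors, not taken from any model
toy_vectors = {
    "apple": [1.0, 0.1],
    "pear":  [0.9, 0.2],
    "truck": [0.1, 1.0],
}

print(odd_one_out(toy_vectors))  # truck
```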

Let's look at how the Google News model did for this example above: 
$$\vec{house} - \vec{rent} + \vec{loan} = $$

In [None]:
# vector math with google
# ----------
googlenews_word2vec_model.most_similar(positive=..., 
                                       negative=...)
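
The vector arithmetic behind this kind of query can be sketched on toy data: add the "positive" vectors, subtract the "negative" ones, and rank the remaining words by cosine similarity to the result. This is a simplification (as far as we know, `gensim` also normalizes each input vector before combining them), and the vectors and the function name `analogy` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # the query words themselves are never returned as answers
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates,
                    key=lambda w: cosine_similarity(vectors[w], target),
                    reverse=True)
    return ranked[:topn]

# hypothetical 2-dimensional vectors: (royalty, gender)
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}

# king - man + woman
print(analogy(toy_vectors, positive=["king", "woman"], negative=["man"]))  # ['queen']
```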

### Challenge PCA

Next we will explore the word embeddings of our `text` visually with PCA. We can retrieve __all__ of the vectors from a trained model as follows:

In [None]:
# retrieve all vectors
# ----------

# ANNOTATE WHAT EACH LINE IS DOING

# 
model_words = [word for word in doc if word in model.wv.index_to_key and word in googlenews_word2vec_model.index_to_key]

# 
X = googlenews_word2vec_model.__getitem__(model_words)

#
print(X.shape) # what is the shape of this and what does it mean?
print(model_words)


As we do with non-text features, we want to standardize X so that all features have the same scale. Do this by creating a `StandardScaler()`, then run its `fit_transform` method on X. 

In [None]:
# scale the data
# ----------
X_std = ...
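
To see what standardizing does to the numbers, the sketch below rescales each column of a tiny made-up matrix to mean 0 and standard deviation 1 by hand. It uses the population standard deviation, which is our understanding of what `StandardScaler` uses; the function name `standardize_columns` and the data are hypothetical.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Rescale every column of a row-major matrix to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# two features on very different scales (made-up values)
toy_matrix = [
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
]

# after scaling, both columns contain the same values
for row in standardize_columns(toy_matrix):
    print(row)
```

This sketch assumes no column is constant; a constant column has a standard deviation of 0 and would cause a division by zero here.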

In [None]:
# fit a PCA 
# ----------
# initialize model and set parameters
pca = ... # set n_components to 2 

# fit and transform the standardized data
Y_pca = ...

In [None]:
# visualize
# ----------
sns.scatterplot(x = ...[:, 0], # extract all the elements from the first column
                y = ...[:, 1]) # extract all the elements from the second column

In [None]:
# recreate the visualization with 2 components
# ----------

# set random seed
random.seed(10)

# sample 25 words
rando_words = random.sample(model_words, 25) 

# get embeddings
X1 = model.wv.__getitem__(rando_words)

# initialize PCA
pca1 = PCA(n_components=2, 
          random_state=16)

# fit and transform
result = pca1.fit_transform(X1)

# create dataframe
result_df = pd.DataFrame(result,               
                         columns = ['PC1', 'PC2'], 
                         index = rando_words)  # label each row with its sampled word

# create scatterplot
sns.scatterplot(x = 'PC1',        # specify x-axis
                y = 'PC2',        # specify y-axis
                data = result_df) # specify dataset

# initialize empty list
texts = []

# append words to list - FOR SOME REASON, THIS LINE WILL NOT RUN. YOU GET THE SAME OUTPUT JUST W/O THE LABELS
#for word in result_df.index:
#    texts.append(plt.text(result_df.loc[word, 'PC1'], 
#                          result_df.loc[word, 'PC2'], 
#                          word, 
#                          fontsize = 8))

# plot text using adjust_text (because overlapping text is hard to read)
adjust_text(texts, 
            force_text = (0.4,0.4),
            expand = (1.2,1),
            arrowprops = dict(arrowstyle = "-", 
                              color = 'black', 
                              lw = 0.5))

plt.show()

## Averaging Word Embeddings

To represent a whole document with a single vector, we can average the word embeddings of its tokens. Let's do this for every document in our corpus:

In [None]:
# create function to iterate over every token and document in our corpus
# ----------
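def document_vector(model, doc):
    # keep only tokens in the model's vocabulary (key_to_index is a dict, so lookups are fast)
    doc = [word for word in doc if word in model.key_to_index]
    # average the remaining word vectors; assumes at least one token is in the vocabulary
    return np.mean(model.__getitem__(doc), axis=0)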
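
The same averaging idea can be sketched without a trained model. Below, a plain dictionary stands in for the embedding lookup, tokens missing from it are skipped, and a document with no known tokens returns `None`. The vectors and the function name `average_vector` are hypothetical.

```python
def average_vector(tokens, vectors):
    """Average the vectors of the tokens found in `vectors`; None if none are found."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]

# hypothetical 2-dimensional vectors, not taken from any model
toy_vectors = {
    "loan":    [1.0, 0.0],
    "payment": [0.0, 1.0],
}

# "navient" has no vector here, so it is skipped
print(average_vector(["loan", "payment", "navient"], toy_vectors))  # [0.5, 0.5]

# a document with no known tokens cannot be averaged
print(average_vector(["navient"], toy_vectors))  # None
```

Every document ends up with a vector of the same length regardless of how many tokens it contains, which is what lets us use the result as a feature matrix.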

In [None]:
# create an array for the size of the corpus
# ----------
# ANNOTATE EACH LINE

#
empty_list_embeddings_means = []

#
for doc in cfpb['tokens']: 
    empty_list_embeddings_means.append(document_vector(googlenews_word2vec_model, doc))

# 
doc_average_embeddings = np.array(empty_list_embeddings_means) 

# 
doc_average_embeddings

### Challenge: Classification

Let's run a classification model using the Google News trained model. 

**QUESTION:** How does this pre-trained model do compared to our first model fit to our own data? Why do you think this is?

In [None]:
# convert word embeddings into dataframe
# ----------
word2vec_features_df = pd.DataFrame(...)

In [None]:
#
# Classification model
# ----------------------------------------


# specify logit model
# ----------

# create label
y = ...  # subset the product binary column from cfpb dataframe

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(...,                # specify features
                                                    ...,                # specify labels
                                                    ... = ...,          # specify training split
                                                    ...=...,            # specify test split
                                                    random_state = 10)  # set seed
# initialize a model
logit_reg = LogisticRegression()

# fit on training
logit_model = logit_reg.fit(..., 
                            ....ravel())

# predict on test data
y_pred = logit_model...(...)

# confusion matrix
# ----------

# create confusion matrix 
cf_matrix = ....(...,              # actual
                 ...,              # predictions
                 ... = "true")     # normalize

# create dataframe
df_cm = pd.DataFrame(...,   # specify matrix for calculations
                     range(2),
                     range(2))

# rename indices
df_cm = df_cm.rename(index=str, columns={0: "Checking or savings account", 
                                         1: "Student loan"})

df_cm.index = ["Checking or savings account", "Student loan"]


# plot specifications
# ----------

plt.figure(figsize = (10,7))
sns.set(font_scale=1.4)  # for label size
sns.heatmap(df_cm, 
           annot=True,
           annot_kws={"size": 16},
           fmt='g')

plt.title('Google Document Averaged Embeddings')
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
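
When reading the heatmap, it helps to know what normalizing over the true labels means: each row is divided by the number of true examples in that class, so every row sums to 1 and the diagonal shows the share of each class predicted correctly. The sketch below works this out by hand on made-up labels; the function name `normalized_confusion` and the data are hypothetical.

```python
def normalized_confusion(y_true, y_pred, labels):
    """Confusion matrix with rows = true labels, each row divided by its total."""
    counts = {t: {p: 0 for p in labels} for t in labels}
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for t in labels:
        total = sum(counts[t].values())
        matrix.append([counts[t][p] / total if total else 0.0 for p in labels])
    return matrix

# made-up labels: four true 0s and two true 1s
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 0]

for row in normalized_confusion(y_true_toy, y_pred_toy, labels=[0, 1]):
    print(row)
# [0.75, 0.25]
# [0.5, 0.5]
```

Here 75% of the true 0s were predicted correctly, while only half of the true 1s were.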

**ANSWER:** ...

Uncomment the script code above to re-train the Google News model on your data and re-run to see how it performs in comparison. How do you think it will do?



# Discussion

You've now had a gentle introduction to word embeddings! There are a few major lessons here.

- Word embeddings are powerful and capture a lot of context that frequency-based featurizations do not. However, they aren't perfect! 
- As with any machine learning application, your choice of model and hyperparameters can matter quite a lot. In this case, some of our simpler featurizations and models actually did better than word embeddings, but this won't always be true. 
- It is also worth learning more about other embeddings like [GloVe](https://nlp.stanford.edu/projects/glove/), transformer-based models like [BERT](https://arxiv.org/abs/1810.04805), and deep learning approaches like [ELMo](https://arxiv.org/abs/1802.05365).

---
Notebook developed by Aniket Kesari. Some materials borrowed from [LS123: Data, Prediction, and Law](https://github.com/Akesari12/LS123_Data_Prediction_Law_Spring-2019/blob/master/labs/Word%20Embedding/LEGALST-190%20Word%20Embedding%20SOLUTIONS.ipynb). Modified by Prashant Sharma (2023) and annotated by Kasey Zapatka (2024).