<center><b>DIGHUM101</b></center>
<center>5-3: Word2vec</center>

---

# Learning objectives

- Create a word embeddings model using word2vec
- Use word2vec to find similarities between words
- Plot words using PCA


In [None]:
# Install new libraries if needed - NOTE this notebook assumes you have Gensim v4 or higher
# !pip install gensim

# In case you need to upgrade
# !pip install --upgrade gensim

In [None]:
# Import libraries

import gensim # word2vec model
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np 
import os
import pandas as pd
import re # regular expressions
import seaborn as sns
# Dimension reduction for plotting
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)

# Word2vec

The word2vec family of algorithms uses shallow neural networks to produce word embeddings: numeric vectors in which words that appear in similar contexts end up with similar values.

In [None]:
human_rights = pd.read_csv("../../Data/human_rights.csv")
print(human_rights.shape)
human_rights.head()

In [None]:
# First, store the documents we want to explore in a separate dataframe with just one column
w2v_df = pd.DataFrame({'Processed': human_rights["Text_processed"]})
w2v_df

In [None]:
# Turn the text of each row into a list
# We now have a list of lists - one for each document

split_rows = [row.split() for row in w2v_df['Processed']]

In [None]:
# How many tokens do we have?
tokens_flat = [token for sublist in split_rows for token in sublist]
len(tokens_flat)

In [None]:
# Training can run in parallel, so check how many CPU cores are available

import multiprocessing 
cpu_count = multiprocessing.cpu_count()
print(cpu_count)

# Check your installed version of Gensim! This notebook assumes you have v4 or higher
from importlib.metadata import version
version('gensim')

In [None]:
# Define the word2vec model

model = gensim.models.Word2Vec(split_rows,
                               vector_size=100, # length of each word vector
                               min_count=2, # ignore words that appear fewer than 2 times
                               workers=max(1, cpu_count - 1), # leave one CPU core free, but always use at least one
                               window=3, # number of words on each side of the target word treated as context
                               sg=1, # 1 = skip-gram, 0 = CBOW
                               seed=1) # fixes the random initialization; fully reproducible runs also need workers=1
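
To make the `window` and `sg` parameters more concrete, here is a simplified sketch of how skip-gram turns a tokenized document into (target, context) training pairs. The `skipgram_pairs` function and the toy sentence are made up for illustration and are not part of Gensim; a real implementation adds further steps (such as negative sampling) that are left out here.

```python
def skipgram_pairs(tokens, window=3):
    """Return (target, context) pairs for one tokenized document."""
    pairs = []
    for i, target in enumerate(tokens):
        # The context is up to `window` tokens on each side of the target
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Toy document (hypothetical, not taken from the human rights dataset)
toy_doc = "everyone has the right to life liberty and security".split()
pairs = skipgram_pairs(toy_doc, window=2)

print(len(pairs))
print(pairs[:6])
```

A larger window produces more pairs per target word and tends to capture broader, more topical relationships, while a smaller window emphasizes immediate neighbors.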

In [None]:
# Save the vocabulary - change .index_to_key to .vocab if running gensim <4
words = list(model.wv.index_to_key)

# Preview
words[0:10]
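
The vocabulary above is shorter than the full list of distinct tokens because of `min_count`, and by default Gensim orders it from most to least frequent. The sketch below imitates that filtering with `collections.Counter` on a made-up corpus; `toy_docs` is a hypothetical stand-in for `split_rows`, and this is an approximation of the idea rather than Gensim's actual code.

```python
from collections import Counter

# Toy list of lists, shaped like split_rows (hypothetical data)
toy_docs = [
    ["human", "rights", "law"],
    ["human", "rights", "council"],
    ["international", "law", "human"],
]
min_count = 2  # same threshold as in the model definition

# Count every token across all documents
counts = Counter(token for doc in toy_docs for token in doc)

# Keep words that appear at least min_count times, most frequent first
vocab = [word for word, n in counts.most_common() if n >= min_count]
print(vocab)
```

Words that fall below the threshold (here "council" and "international") never receive a vector, so looking them up in `model.wv` would raise a `KeyError`.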

In [None]:
# Inspect the vector for a word in the vocabulary

model.wv["human"]

Note that our word embeddings model is going to be far from perfect: we are training it on a very small dataset with "only" 65,000 tokens. For reference, the pretrained model that Google released in 2013 was trained on part of the Google News dataset, which contains about 100 billion words!

Let's have a look at some similarity scores between words we might expect to be related somehow.

In [None]:
# compare the vectors of two words in the vocabulary
model.wv.similarity("human", "rights")
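
The score returned by `similarity` is the cosine similarity between the two word vectors: close to 1 when they point in the same direction, and close to 0 when they are unrelated. Here is a minimal sketch of that calculation in plain Python; the three short vectors are invented for illustration and are far smaller than the 100-dimensional vectors in our model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values)
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, twice as long
vec_c = [0.0, 0.0, 3.0]  # perpendicular to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Because only the direction matters, `vec_b` counts as maximally similar to `vec_a` even though it is twice as long.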

In [None]:
model.wv.similarity("human", "law")

In [None]:
model.wv.similarity("justice", "law")

In [None]:
model.wv.similarity("war", "humanity")

Looks like it's picking up on words that could be bigrams as well:

In [None]:
model.wv.similarity("united", "nations")

In [None]:
# Get the 10 words most similar to a given word
men_words = model.wv.most_similar("men", topn=10)
print(type(men_words))
men_words

In [None]:
# Convert the list of (word, similarity) tuples into a dataframe
pd.DataFrame(men_words, columns=["Word", "Similarity"])

We can also look up the words whose vectors are closest to a given input word. Let's see which words are used in contexts similar to those of "women" in this human rights dataset.

In [None]:
model.wv.most_similar("women")

# Plot words with PCA

[Principal component analysis](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60) and related dimension-reduction algorithms are an excellent way to visualize multivariate data in a lower-dimensional space, such as a 2D scatterplot.

In [None]:
# Save the word2vec vocab vectors
features = [model.wv[word] for word in model.wv.index_to_key]

In [None]:
# Pre Gensim 4:
#features = model[model.wv.vocab]

In [None]:
# Define parameters of our PCA

# Keep only the first two principal components, which become the X and Y axes
pca = PCA(n_components = 2)
pca_out = pca.fit_transform(features)
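
Two components cannot preserve everything in a 100-dimensional space, so it helps to know how much variance they keep. The sketch below reproduces the core of PCA with NumPy (center the data, then take a singular value decomposition) on random stand-in vectors; `toy_vectors` is made up and is not our word2vec output. With the fitted scikit-learn object, the comparable numbers are available in `pca.explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(1)
# 50 made-up "word vectors" with 100 dimensions each (hypothetical data)
toy_vectors = rng.normal(size=(50, 100))

# Step 1: center each dimension on zero
centered = toy_vectors - toy_vectors.mean(axis=0)

# Step 2: the rows of Vt are the principal directions, strongest first
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Step 3: project onto the first two directions to get 2D coordinates
projected = centered @ Vt[:2].T

# Share of the total variance captured by each component
explained = (S ** 2) / np.sum(S ** 2)

print(projected.shape)
print(explained[:2].sum())
```

If the two components explain only a small share of the variance, distances in the scatterplot should be read with caution: words that look close in 2D may be far apart in the full embedding space.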

Let's plot this with the top words (just to keep things slightly uncluttered - visualizing word embeddings is a tricky job!)

In [None]:
plt.figure(figsize = (7,14))
sns.scatterplot(x=pca_out[:, 0], y=pca_out[:, 1])
words = list(model.wv.index_to_key)
# Annotate only the 40 most frequent words (index_to_key is sorted by frequency by default)
for i, word in enumerate(words[0:40]):
    plt.annotate(word, size = 8, xy = (pca_out[i, 0], pca_out[i, 1]))
plt.show()

## More resources, if you want to go further 

https://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.XuxYm2pKjOQ

https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial

https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92

https://www.datacamp.com/community/blog/spacy-cheatsheet

https://code.google.com/archive/p/word2vec/