# Data Science for Social Justice Workshop: Module 4

## Word Embeddings

In this notebook, we'll work with word embeddings using `gensim`.

The goal of word embedding models is to learn a **numerical representation** of a text corpus. We already did that to a certain extent when we did topic modeling. In this case, we're going to be more explicit about how we construct that numerical representation: for each word, we're going to find a **vector** of numbers to represent it. The actual numbers themselves won't be meaningful to us as humans. However, if successful, the vectors for each term should encode information about the meaning or concept the term represents, as well as the relationship between it and other terms in the vocabulary.

Word vector models are fully unsupervised: they learn all of these meanings and relationships without any advance knowledge. Unsupervised learning still requires specifying a suitable training task. We won't go into detail in this lesson, but you can roughly think of the task as predicting a word from the words around it. Read [this post](https://tomvannuenen.medium.com/analyzing-reddit-communities-with-python-part-6-word-embeddings-f92bba876d60) for a deeper introduction to word embeddings.

This notebook is designed to help you:

* Use `gensim`'s `word2vec` method to create word vectors for a corpus;
* Use these word vectors to reflect on implicit binaries and normativities in your data;
* Explore and visualize word vectors using K-means clustering and t-SNE.

## Data Preprocessing

As we will be considering the language biases in the next notebook, we will use the comments of our subreddit this time. The thinking behind this is that this data will be derived from more people, and include more evaluative statements (after all, comments on r/amitheasshole generally evaluate the original posts).

In [None]:
import os
import pandas as pd

In [None]:
# Change directory
# We include two ../ because we want to go two levels up in the file structure
os.chdir("../../data")

In [None]:
# Import dataset
df = pd.read_csv('aita_com_top.csv')
print(df.shape)
# Leave head() as the last expression so the notebook displays it
df.head(3)

Next, we remove comments that were removed or deleted, and additionally only take comments that are sufficiently long:

In [None]:
# Remove comments that are [removed] or [deleted]
df = df[~df['body'].isin(['[removed]', '[deleted]'])].dropna(subset=['body'])
# Remove comments less than 15 characters long
df = df[df['body'].str.len() >= 15]
len(df)

Now, we'll import `spacy` and `gensim` to do some preprocessing. We have functions written here for you to help streamline the process.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
from gensim.models.phrases import Phrases, Phraser

In [None]:
parsed = nlp('You are being a very bad dog, mister.')
print(parsed)

for token in parsed:
    print(token.lemma_)

In [None]:
def clean(token):
    """Helper function that specifies whether a token is:
        - punctuation
        - space
        - digit
    """
    return token.is_punct or token.is_space or token.is_digit

def line_read(df, text_col='body'):
    """
    Generator function to read in text from df and replace line breaks with spaces.
    """
    for text in df[text_col]:
        # Use a space so that words on either side of a line break are not glued together
        yield text.replace('\n', ' ')

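def preprocess(df, allowed_postags=None):
    """Preprocessing function to apply to a dataframe.

    Yields one list of lowercased lemmas per comment. If `allowed_postags`
    is given (e.g. ['NOUN', 'ADJ']), only tokens with those part-of-speech
    tags are kept; by default, all parts of speech are kept.
    """
    # The lemmatizer relies on the tagger, which in turn needs tok2vec,
    # so we only disable the components we don't use
    for parsed in nlp.pipe(line_read(df), batch_size=1000, disable=["parser", "ner"]):
        # Gather lowercased, lemmatized tokens
        tokens = [token.lemma_.lower()
                  for token in parsed
                  if not clean(token)
                  and (allowed_postags is None or token.pos_ in allowed_postags)]
        # Remove possessive markers left over from tokenization
        tokens = [lemma for lemma in tokens if lemma not in ["'s", "’s", "’"]]
        # Remove stop words
        tokens = [token for token in tokens if token not in spacy.lang.en.stop_words.STOP_WORDS]
        yield tokens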

We apply the `preprocess()` function to each comment in the dataframe, producing a `docs` output:

In [None]:
docs = [line for line in preprocess(df)]

Now, we create bi-grams. Bi-grams consist of pairs of words that appear commonly together (e.g., "New York"). `gensim` provides some functions to detect bi-grams that appear often enough that we should include them.
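
To make "often enough" more concrete, here is a minimal, standard-library sketch of how such a bigram score can be computed. It is modelled on our understanding of `gensim`'s default scorer, `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. The toy corpus, the `score_bigrams` and `merge_bigrams` helpers, the use of the number of distinct words as `vocab_size`, and the threshold value are all assumptions made for illustration; they are not part of the workshop code.

```python
from collections import Counter

# Toy corpus (made up for illustration): each document is a list of tokens
toy_docs = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "pizza"],
    ["the", "new", "car"],
    ["big", "car"],
]

def score_bigrams(docs, min_count=1):
    """Score each adjacent word pair; higher means a more 'phrase-like' pair."""
    word_counts = Counter(word for doc in docs for word in doc)
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    # Assumption: use the number of distinct words as the vocabulary size
    vocab_size = len(word_counts)
    return {
        (a, b): (count - min_count) * vocab_size / (word_counts[a] * word_counts[b])
        for (a, b), count in pair_counts.items()
    }

def merge_bigrams(doc, scores, threshold):
    """Join adjacent pairs whose score exceeds the threshold with an underscore."""
    merged, i = [], 0
    while i < len(doc):
        pair = tuple(doc[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0) > threshold:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged

toy_scores = score_bigrams(toy_docs, min_count=1)
print(toy_scores[("new", "york")])
print(merge_bigrams(["i", "love", "new", "york"], toy_scores, threshold=1.0))
```

In this sketch, a pair only scores highly if it occurs more often than `min_count` and its words rarely appear apart, which is why "new york" is merged while "the new" is not.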

In [None]:
# Create bigram model: pass docs into Phrases class
bigrams = Phrases(docs, min_count=20, threshold=300)
# Create a "frozen" bigram model using the Phraser class
bigram_phraser = Phraser(bigrams)
# Now, create bigrams 
docs_bigrams = [bigram_phraser[doc] for doc in docs]

In [None]:
bigrams[docs]

There's nothing stopping us from going further: we can create tri-grams or even $n$-grams. We'll make some tri-grams and build our word2vec model on top of them. A tri-gram can be constructed by simply looking for bi-grams in a bi-grams corpus.

In [None]:
trigrams = Phrases(bigrams[docs], min_count=20, threshold=100)  
trigram_phraser = Phraser(trigrams)
docs_trigrams = [trigram_phraser[doc] for doc in docs_bigrams]

Let's save the data to an external JSON file:

In [None]:
import json

with open('aita_com_top_lemmas.json', 'w') as write:
    json.dump(docs_trigrams, write)

In [None]:
# Opening the same file works as follows:
with open("aita_com_top_lemmas.json") as f:
    trigrams = json.load(f)

## Constructing a Word2Vec Model

Let's create our word embeddings model. 

While last week's LDA method was focused on finding topics in a collection of documents (or in our case, submissions), word embeddings models focus on individual words, and learning vector representations of these words.

The input to the model is a text corpus split into sentences; in word embeddings, there is no concept of "documents". The model's output is a set of vectors (one for each word) in N dimensions. Think of the dimensions of these vectors as "features" that capture latent meaning.

This model allows us to group the vectors of similar words together in vector space. We can then reduce the dimensionality to visualize the results in a way humans can understand (such as in a 2-dimensional space), or to perform linear algebra operations in order to find out to what extent words are related.

Word2Vec is one example of a word embeddings model. It learns by taking words and their contexts (e.g. sentences) into account, and can then try to predict other words. Given enough data, usage and contexts, word2vec can make accurate guesses about a word's meaning based on its appearances. Those guesses can be used to establish a word's association with other words (e.g. "Paris" is to "France" as "Berlin" is to "Germany"), or cluster documents and classify them by topic.
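
To make "words and their contexts" concrete, the following sketch lists the (target, context) pairs that a skip-gram style model would be asked to predict for one sentence. The `skipgram_pairs` helper and the toy sentence are hypothetical and only meant as illustration; actual Word2Vec training involves further details (such as downsampling frequent words) that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        # The context is every word at most `window` positions away from the target
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

toy_sentence = ["the", "dog", "barked", "loudly"]
for target, context in skipgram_pairs(toy_sentence, window=1):
    print(f"{target} -> {context}")
```

A larger `window` produces more pairs per word, so each word is related to a wider stretch of its surroundings.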

We now instantiate and train our Word2Vec model, using the parameters below.

In [None]:
from gensim.models import Word2Vec
import multiprocessing

In [None]:
# Count the number of cores you have at your disposal
cores = multiprocessing.cpu_count()
# Word vector dimensionality (how many features each word will be given)
n_features = 300
# Minimum word count to be taken into account
min_word_count = 10
# Number of threads to run in parallel (equal to your amount of cores)
n_workers = cores
# Context window size
window = 5
# Downsample setting for frequent words
downsampling = 1e-2
# Seed for the random number generator (to create reproducible results)
seed = 1 
# Skip-gram = 1, CBOW = 0
sg = 1
# Number of training epochs
epochs = 20

model = Word2Vec(
    sentences=trigrams,
    workers=n_workers,
    vector_size=n_features,
    min_count=min_word_count,
    window=window,
    sample=downsampling,
    seed=seed,
    sg=sg)

In [None]:
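# Passing `sentences` to the constructor above already ran an initial round of training;
# here we continue training for additional epochs
model.train(trigrams, total_examples=model.corpus_count, epochs=epochs)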

That was it! We have a Word Embeddings model now. Let's save it so that we don't have to train it again. Then, we'll reload the embeddings:

In [None]:
model.save('aita.emb')

In [None]:
model = Word2Vec.load('aita.emb')

How many terms are in our vocabulary? Whenever interacting with the word vector dictionary, we use the `wv` attribute:

In [None]:
len(model.wv)

Let's take a peek at the word vectors our model has learned. We can take a look at the individual words using the `index_to_key` attribute, and the word vectors themselves can be accessed with the `vectors` attribute:

In [None]:
model.wv.index_to_key[0]

In [None]:
model.wv.vectors[0]

Looking at it, this doesn't make a whole lot of sense to us: it's just a bunch of numbers. However, we can do semantic operations on these vectors, such as getting related terms.

### Word Similarity

With the information in our word embeddings model, we can try to find similarities between words that interest us (i.e. words that have a similar vector). Let's create a function that retrieves related terms to some input. We're going to use the `most_similar()` function in `gensim` as part of this helper function.

In [None]:
def get_related_terms(token, topn=20):
    """Look up the top N most similar terms to the token."""
    for word, similarity in model.wv.most_similar(positive=[token], topn=topn):
        print(f"{word}: {round(similarity, 3)}")

In [None]:
get_related_terms('asshole')

Here are some other terms. What else interests you?

In [None]:
get_related_terms('empathy')

In [None]:
get_related_terms('relationship')

In [None]:
get_related_terms('power')
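
The ranking that `most_similar()` returns is based on cosine similarity between word vectors. As a rough sketch of what that means, the example below ranks a few made-up two-dimensional vectors by hand. The `toy_vectors` values and the `cosine_similarity` and `rank_by_similarity` helpers are assumptions for illustration and are not taken from our trained model.

```python
import math

# Made-up 2-dimensional vectors; real embeddings have hundreds of dimensions
toy_vectors = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "car": [0.0, 1.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(term, vectors, topn=2):
    """Rank all other terms by their cosine similarity to `term`."""
    others = [
        (word, cosine_similarity(vectors[term], vector))
        for word, vector in vectors.items()
        if word != term
    ]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:topn]

for word, similarity in rank_by_similarity("cat", toy_vectors):
    print(f"{word}: {round(similarity, 3)}")
```

Cosine similarity only looks at the direction of the vectors, not their length, so two words can be highly similar even if one vector has much larger values than the other.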

### Word Algebra

One of the most famous usages of `word2vec` is via word analogies. For example:

`Paris : France :: Berlin : Germany`

Here, the analogy is between (Paris, France) and (Berlin, Germany), with "capital city" being the concept that connects them. We can abstract the "analogy" relationship to vector modeling. Let's pretend we're working with each of the vectors. Then, the analogy is

$$\mathbf{v}_{\text{France}} - \mathbf{v}_{\text{Paris}} \approx \mathbf{v}_{\text{Germany}} - \mathbf{v}_{\text{Berlin}}.$$

The vector difference here represents the notion of "capital city". Presumably, going from the Paris vector to the France vector (i.e., the vector difference) will be the same as going from the Berlin vector to the Germany vector, if that difference carries similar semantic meaning.

Let's test this directly. We'll do so by rewriting the above expression:

$$\mathbf{v}_{\text{France}} - \mathbf{v}_{\text{Paris}} + \mathbf{v}_{\text{Berlin}} \approx \mathbf{v}_{\text{Germany}}.$$

The core idea is that once words are represented as numerical vectors, you can do "math" with them. The mathematical procedure works as follows:

1. Provide a set of words or phrases you want to add or subtract.
2. Look up the vectors that represent those terms in the word vector model.
3. Add and subtract those vectors to produce a new, combined vector.
4. Look up the most similar vector(s) to this new, combined vector via cosine similarity.
5. Return the word(s) associated with the similar vector(s).
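
The steps above can be sketched by hand with a toy vocabulary. The three-dimensional vectors below are invented so that the analogy works out exactly, and `toy_word_algebra` is a hypothetical helper; as far as we know, `gensim` additionally normalizes the vectors before combining them, which this sketch leaves out.

```python
import math

# Invented vectors: think of the dimensions as "French", "German" and "is a country"
toy_vectors = {
    "paris":   [1.0, 0.0, 0.0],
    "france":  [1.0, 0.0, 1.0],
    "berlin":  [0.0, 1.0, 0.0],
    "germany": [0.0, 1.0, 1.0],
    "banana":  [0.5, 0.5, 0.2],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_word_algebra(add, subtract, vectors, topn=1):
    """Steps 1-5: look up, add and subtract vectors, then find the nearest words."""
    n_dims = len(next(iter(vectors.values())))
    combined = [0.0] * n_dims
    for word in add:
        combined = [c + x for c, x in zip(combined, vectors[word])]
    for word in subtract:
        combined = [c - x for c, x in zip(combined, vectors[word])]
    # Exclude the input words themselves from the candidate answers
    candidates = [
        (word, cosine_similarity(combined, vector))
        for word, vector in vectors.items()
        if word not in add and word not in subtract
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]

print(toy_word_algebra(add=["france", "berlin"], subtract=["paris"], vectors=toy_vectors))
```

With real embeddings the combined vector will rarely match any word exactly, which is why the procedure returns the most similar terms rather than a single "correct" answer.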

Let's try it out. We'll create a function that does this for us.

In [None]:
def word_algebra(add=[], subtract=[], topn=10):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    answers = model.wv.most_similar(positive=add, negative=subtract, topn=topn)
    
    for term, similarity in answers:
        print(term)

In [None]:
word_algebra(add=['men', 'dating'])

In [None]:
word_algebra(add=['women', 'dating'])

## K-means Clustering

One convenience of word embeddings is that we can cluster them using, for instance, K-Means clustering. 

K-Means clustering aims to partition N observations into K clusters in which each observation belongs to the cluster with the nearest mean (called the "cluster centre"), which serves as a prototype of the cluster.

Since our words are all represented as vectors, applying K-Means is easy to do since the clustering algorithm will simply look at differences between vectors (and centers).
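
For intuition, here is a minimal pure-Python sketch of the basic K-Means loop: assign each point to its nearest center, then move each center to the mean of its points. The toy points, the hand-picked starting centers and the `toy_kmeans` helper are assumptions for illustration; in the notebook we rely on scikit-learn's `KMeans`, which uses a more careful initialization (`k-means++`).

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_kmeans(points, initial_centers, max_iter=100):
    """Alternate between assigning points to centers and recomputing the centers."""
    centers = [list(center) for center in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the nearest center
        new_labels = [
            min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments no longer change, so we have converged
        labels = new_labels
        # Update step: move each center to the mean of the points assigned to it
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, labels

# Two visually obvious groups of 2-dimensional points (made up)
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centers, toy_labels = toy_kmeans(toy_points, initial_centers=[toy_points[0], toy_points[3]])
print(toy_labels)
print(toy_centers)
```

The same loop applies unchanged to 300-dimensional word vectors, since it only relies on distances between vectors and centers.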

In [None]:
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree
from sklearn.manifold import TSNE

def clustering_on_wordvecs(word_vectors, num_clusters):
    # Initialize a k-means object and use it to extract centroids
    kmeans_clustering = KMeans(n_clusters=num_clusters, init='k-means++')
    # Fit the model and get the cluster index of each word vector
    idx = kmeans_clustering.fit_predict(word_vectors)
    return kmeans_clustering.cluster_centers_, idx

In [None]:
Z = model.wv.vectors

In [None]:
centers, clusters = clustering_on_wordvecs(Z, 20)
centroid_map = dict(zip(model.wv.index_to_key, clusters))

Next, we get the words in each cluster that are closest to the cluster center. To do this, we initialize a KDTree on the word vectors and query it for the top K words nearest each cluster center. Using the index-to-word list, we then map each word vector back to its original word and add the results to a dataframe for easier printing.

In [None]:
import numpy as np

def get_top_words(index2word, k, centers, wordvecs):
    tree = KDTree(wordvecs)
    # For each cluster center, query the tree for its k closest word vectors
    closest_points = [tree.query(np.reshape(x, (1, -1)), k=k) for x in centers]
    # tree.query returns (distances, indices); keep only the indices
    closest_words_idxs = [x[1] for x in closest_points]
    # Look up the word for each index and collect the words per cluster in a dictionary
    closest_words = {}
    for i in range(len(closest_words_idxs)):
        closest_words['Cluster #' + str(i)] = [index2word[j] for j in closest_words_idxs[i][0]]
    # Create DataFrame from dictionary, with ranks starting at 1
    df = pd.DataFrame(closest_words)
    df.index = df.index + 1
    return df

Let’s get the top words and print the first 20 in each cluster:

In [None]:
import numpy as np

top_words = get_top_words(model.wv.index_to_key, 5000, centers, Z)

In [None]:
top_words[:20]

## t-SNE

The word embeddings learned by the model can be visualised by reducing the word vectors to 2 dimensions using t-SNE.

t-distributed Stochastic Neighbor Embedding, or t-SNE, is a dimensionality reduction technique that helps with visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a two- or three-dimensional representation, while keeping the relative distances between points as similar as possible in the high-dimensional and low-dimensional spaces.

Visualisations can be used to notice semantic and syntactic trends in the data.

In [None]:
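from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.manifold import TSNE

# Put the word vectors in a DataFrame indexed by word
tsne_input = pd.DataFrame(model.wv.vectors, index=model.wv.index_to_key)
# Drop stop words (ignoring any that are not in the vocabulary)
tsne_input = tsne_input.drop(list(STOP_WORDS), errors='ignore')
# Keep the first 5,000 remaining terms
tsne_input = tsne_input.head(5000)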

In [None]:
tsne_input

In [None]:
# Create some filepaths
tsne_filepath = 'tsne_model'
tsne_vectors_filepath = 'tsne_vectors.npy'

In [None]:
import pickle
import numpy as np

# Set to False to skip refitting and load the saved results instead
run_tsne = True

if run_tsne:

    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)

    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)

    np.save(tsne_vectors_filepath, tsne_vectors)

with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)

tsne_vectors = np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=['x_coord', 'y_coord'])

In [None]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource

# Render Bokeh plots inline in the notebook
output_notebook()


In [None]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings',
                   width=800,
                   height=800)

# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = '@index') )

# draw the words as circles on the plot
tsne_plot.circle('x_coord', 'y_coord', source=plot_data,
                 color='blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color='black')

# configure visual elements of the plot
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot)

## Reflection: The Hermeneutics of Word Embeddings

“In vector space, identities and differences change in nature. Similarity and belonging no longer rely on resemblance or a common genesis but on measures of proximity or distance, on flat loci that run as vectors through the space.” (Dourish 2018: 73-4)

As we've seen, word embeddings are essentially a set of vectors. We should reflect on this. What is vectorization? It is reducing linguistic complexity. Or rather, it produces a common space that juxtaposes and mixes complex localized realities. Anything can be turned into a vector operation, but what do we lose when doing so? 