# Data Science for Social Justice Workshop: Module 4

## Word Embeddings

In this notebook, we'll work with word embeddings using [`gensim`](https://radimrehurek.com/gensim/models/word2vec.html).

The goal of word embedding models is to learn **numerical representations** of text corpora. We already did that to a certain extent when we did topic modeling. In this case, we're going to be more explicit about how we construct that numerical representation: for each word, we're going to find a **vector** of numbers to represent it. The actual numbers themselves won't be meaningful to us as humans. However, if successful, the vectors for each term should encode information about the meaning or concept the term represents, as well as the relationship between it and other terms in the vocabulary.

Word vector models are fully **unsupervised**: they learn all of these meanings and relationships without any advance knowledge. Unsupervised learning requires the specification of a right task. We won't go into detail in this lesson, but you can roughly think of the task as predicting nearby words, given a specific word. Read [this post](https://tomvannuenen.medium.com/analyzing-reddit-communities-with-python-part-6-word-embeddings-f92bba876d60) for a deeper introduction to word embeddings.

This notebook is designed to help you:

* Use `gensim`'s [`word2vec`](https://radimrehurek.com/gensim/models/word2vec.html) method to create word vectors for a corpus;
* Use these word vectors to reflect on implicit binaries and normativities in your data;
* Visualize topic models using K-means clustering.

In [1]:
#Uncomment and run the code in this cell if you are running this notebook on DataHub or Binder
#!pip install spacy
#!pip install gensim
#!pip install pandas
#!pip install -U scikit-learn 
#!python -m spacy download en_core_web_sm

## Data Preprocessing

As we will be considering the language biases in the next notebook, we will use the comments of the AITA subreddit this time. The thinking behind this is that this data will be derived from more people, and include more evaluative statements (after all, comments on r/amitheasshole generally evaluate the original posts).

In [2]:
# Package imports
import os
import pandas as pd
import pickle
import sklearn

In [None]:
#check sklearn version
sklearn.__version__

If your sklearn version < 1.0, run the following line:

In [36]:
#!pip install -U scikit-learn 

In [3]:
# Change directory
# We include two ../ because we want to go two levels up in the file structure
os.chdir("../../data")

In [4]:
# Import dataset
df = pd.read_csv('aita_sub_top_sm.csv')
df.head(3)
print(df.shape)

(20000, 18)


Next, we remove comments that were removed or deleted, and additionally only take comments that are sufficiently long:

In [12]:
# Remove comments that are [removed] or [deleted]
df = df[~df['selftext'].isin(['[removed]', '[deleted]'])].dropna(subset=['selftext'])
# Remove comments less than 15 characters long
df = df[df['selftext'].str.len() >= 15]
len(df)


16307

Now, we'll import `spacy` and `gensim` to do some preprocessing. We have functions written here for you to help streamline the process.

In [13]:
import spacy
nlp = spacy.load('en_core_web_sm')
from gensim.models.phrases import Phrases, Phraser



In [14]:
def clean(token):
    """Helper function that specifies whether a token is:
        - punctuation
        - space
        - digit
    """
    return token.is_punct or token.is_space or token.is_digit

def line_read(df, text_col='selftext'):
    """Generator function to read in text from df and get rid of line breaks."""    
    for text in df[text_col]:
        yield text.replace('\n', '')

def preprocess(df, text_col='selftext', allowed_postags=['NOUN', 'ADJ']):
    """Preprocessing function to apply to a dataframe."""
    for parsed in nlp.pipe(line_read(df, text_col), batch_size=1000, disable=["tok2vec", "ner"]):
        # Gather lowercased, lemmatized tokens
        tokens = [token.lemma_.lower() if token.lemma_ != '-PRON-'
                  else token.lower_ 
                  for token in parsed if not clean(token)]
        # Remove specific lemmatizations, and words that are not nouns or adjectives
        tokens = [lemma
                  for lemma in tokens
                  if not lemma in ["'s",  "’s", "’"] and not lemma in allowed_postags]
        # Remove stop words
        tokens = [token for token in tokens if token not in spacy.lang.en.stop_words.STOP_WORDS]
        yield tokens

We apply the `preprocess()` function to each comment in the dataframe, producing a `docs` output. This call might take a while.

In [15]:
docs = [line for line in preprocess(df, text_col='selftext')]

Now, we create bigrams. Bigrams consist of pairs of words that appear commonly together (e.g., "New York"). `gensim` provides some functions to detect bigrams by finding words appear often enough together that we should include them.

In [18]:
# Create bigram model: pass docs into Phrases class
bigrams = Phrases(docs, min_count=20, threshold=300)
# Create a "frozen" bigram model using the Phraser class
bigram_phraser = Phraser(bigrams)
# Now, create bigrams 
docs_bigrams = [bigram_phraser[doc] for doc in docs]

There's nothing stopping us from going further: we can create trigrams or even $n$-grams. We'll make some trigrams and build our word2vec model on top of them. A trigram can be constructed by simply looking for bigrams in a bigrams corpus.

In [22]:
trigrams = Phrases(bigrams[docs], min_count=20, threshold=100)  
trigram_phraser = Phraser(trigrams)
docs_trigrams = [trigram_phraser[doc] for doc in docs_bigrams]

Let's save the data to an external JSON file:

In [6]:
import json

In [24]:
with open('aita_com_top_lemmas.json', 'w') as write:
    json.dump(docs_trigrams, write)

In [7]:
# Opening the same file works as follows:
with open("aita_com_top_lemmas.json") as f:
    trigrams = json.load(f)

## Constructing a Word2Vec Model

Let's create our word embeddings model. 

While last week's LDA method was focused on finding topics in a collection of documents (or in our case, submissions), word embeddings models focus on individual words, and learning vector representations of these words.

The input to the model is a text corpus split up into sentences – in word embeddings, there is no concept of "documents". The model's output is a set of "vectors" (one for each word) in N dimensions (a parameter of the model). Think of these vectors as "features", capturing latent meaning.

This model allows us to group the vectors of similar words together in vector space. We can then reduce the dimensionality to visualize the results in a way humans can understand (such as in a 2-dimensional space), or to perform linear algebra operations in order to find out to what extent words are related.

Word2Vec is one example of a word embeddings model. It learns by taking words and their contexts (e.g. sentences) into account, and can then try to predict other words. Given enough data, usage and contexts, word2vec can make accurate guesses about a word’s meaning based on its appearances. Those guesses can be used to establish a word’s association with other words (e.g. "Paris" is to "France" as “Berlin” is to “Germany”), or cluster documents and classify them by topic.

We now instantiate and train our Word2Vec model, using the parameters below.

In [8]:
from gensim.models import Word2Vec
import multiprocessing



In [9]:
# Count the number of cores you have at your disposal
cores = multiprocessing.cpu_count()
# Word vector dimensionality (how many features each word will be given)
n_features = 300
# Minimum word count to be taken into account
min_word_count = 10
# Number of threads to run in parallel (equal to your amount of cores)
n_workers = cores
# Context window size
window = 5
# Downsample setting for frequent words
downsampling = 1e-2
# Seed for the random number generator (to create reproducible results)
seed = 1 
# Skip-gram = 1, CBOW = 0
sg = 1
epochs = 20

model = Word2Vec(
    sentences=trigrams,
    workers=n_workers,
    vector_size=n_features,
    min_count=min_word_count,
    window=window,
    sample=downsampling,
    seed=seed,
    sg=sg)

KeyboardInterrupt: 

In [None]:
model.train(trigrams, total_examples=model.corpus_count, epochs=10)        

That was it! We have a Word Embeddings model now. Let's save it so that we don't have to train it again. Then, we can reload the embeddings so that we don't have to train it every single time:

In [29]:
model.save('aita.emb')

In [10]:
model = Word2Vec.load('aita.emb')

How many terms are in our vocabulary? Whenever interacting with the word vector dictionary, we use the `wv` attribute:

In [11]:
len(model.wv)

12295

Let's take a peek at the word vectors our model has learned. We can take a look at the individual words using the `index_to_key` attribute, and the word vectors themselves can be accessed with the `vectors` attribute:

In [12]:
model.wv.index_to_key[0]

'said'

In [13]:
model.wv.vectors[0]

array([-0.06102639,  0.19767517, -0.09929685, -0.01651838, -0.11792817,
       -0.12836273,  0.10424978,  0.35728016,  0.02528324, -0.13426565,
        0.12556353,  0.08988883,  0.12721542, -0.02497321, -0.13635816,
       -0.2221032 ,  0.01603243, -0.01290215, -0.10347308, -0.06935027,
        0.04374868, -0.0986355 ,  0.2338665 ,  0.24940981,  0.07474294,
        0.02770644, -0.08778619, -0.07598864, -0.06600711, -0.12008175,
       -0.10145769, -0.03904285,  0.20689853,  0.04596431,  0.11964171,
       -0.12708524,  0.05245314, -0.09969737,  0.00313492,  0.09943998,
       -0.00700299,  0.33826628, -0.00694402, -0.00755043,  0.0834844 ,
        0.10920402,  0.08264863,  0.11614382,  0.0428956 ,  0.19038281,
       -0.07969172,  0.03352622, -0.1976534 , -0.05535923, -0.0638773 ,
       -0.06511011,  0.11220443,  0.04737068,  0.1741564 ,  0.05661403,
       -0.06075669,  0.01387172, -0.09167124, -0.0693732 , -0.0759397 ,
        0.09873556, -0.17701876,  0.04444606, -0.19546069,  0.00

Looking at it won't make a whole lot of sense to us! It's just a bunch of numbers. However, we can do semantic operations on these vectors, such as getting related terms.

### Word Similarity

With the information in our word embeddings model, we can try to find similarities between words that interest us (i.e. words that have a similar vector). Let's create a function that retrieves related terms to some input. We're going to use the [`most_similar()`](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html) function in `gensim` as part of this helper function.

In [14]:
def get_most_similar_terms(model, token, topn=20):
    """Look up the top N most similar terms to the token."""
    for word, similarity in model.wv.most_similar(positive=[token], topn=topn):
        print(f"{word}: {round(similarity, 3)}")

In [15]:
get_most_similar_terms(model, 'asshole')

ahole: 0.57
ah: 0.545
aita: 0.541
ta: 0.534
amita: 0.51
aitah: 0.502
unwarranted: 0.495
jackass: 0.488
b\*tch: 0.475
aita?&#x200b;update: 0.472
ita: 0.47
arsehole: 0.463
i?edit: 0.46
second_guessing: 0.456
jerk: 0.454
aita??edit: 0.452
verdict: 0.451
misogynist: 0.451
aitaedit: 0.449
uncaring: 0.441


Here are some other terms. What else interests you?

In [16]:
get_most_similar_terms(model, 'empathy')

kindness: 0.477
compassion: 0.455
aita??edit: 0.447
overdramatic: 0.436
lacks: 0.435
validated: 0.429
❤_️: 0.428
empathize: 0.425
thankyou: 0.418
disregarding: 0.418
deprive: 0.417
her-: 0.414
insights: 0.413
empathetic: 0.412
abilities: 0.412
digress: 0.406
confide: 0.406
disregard: 0.398
sympathy: 0.389
disgraceful: 0.386


In [17]:
get_most_similar_terms(model, 'relationship')

coparent: 0.51
marriage: 0.503
co_parenting: 0.486
relationships: 0.474
animosity: 0.464
rocky: 0.459
unfaithful: 0.454
stepparent: 0.449
51f: 0.445
fwb: 0.442
dynamic: 0.437
bond: 0.43
intimacy: 0.43
friendship: 0.427
lc: 0.424
polyamorous: 0.419
coparenting: 0.417
romantically: 0.416
poly: 0.416
throwaway_accounti: 0.413


In [18]:
get_most_similar_terms(model, 'power')

veto: 0.339
authority: 0.313
merge: 0.3
tripping: 0.299
sustainable: 0.283
articulate: 0.282
skewed: 0.282
circumcised: 0.281
doormat: 0.28
unilaterally: 0.28
sinners: 0.278
cognizant: 0.277
steering: 0.275
lined: 0.274
exists: 0.274
electrical: 0.273
epic: 0.272
unreliable: 0.27
uproot: 0.27
advise: 0.27


In [19]:
get_most_similar_terms(model, 'man')

woman: 0.476
bloke: 0.426
mans: 0.376
father: 0.369
lacy: 0.361
intimidating: 0.358
nervously: 0.354
men: 0.351
ftm: 0.351
spun: 0.349
\[my: 0.348
intimidated: 0.348
guy: 0.348
flattered: 0.346
handsome: 0.346
soldier: 0.343
heterosexual: 0.343
angie: 0.343
cis: 0.339
marries: 0.339


In [20]:
get_most_similar_terms(model, 'woman')

man: 0.476
bloke: 0.407
guy: 0.391
girl: 0.384
lady: 0.376
seventeen: 0.369
ftm: 0.368
phobic: 0.361
handsome: 0.357
conventionally: 0.356
lia: 0.353
4ish: 0.352
opinionated: 0.346
men: 0.34
women: 0.339
tempered: 0.339
claude: 0.338
identifies: 0.337
iran: 0.336
isle: 0.332


### Word Analogies

One of the most famous usages of `word2vec` is via word analogies. For example:

`Paris : France :: Berlin : Germany`

Here, the analogy is between (Paris, France) and (Berlin, Germany), with "capital city" being the concept that connects them. We can abstract the "analogy" relationship to vector modeling. Let's pretend we're working with each of the vectors. Then, the analogy is

$$\mathbf{v}_{\text{France}} - \mathbf{v}_{\text{Paris}} \approx \mathbf{v}_{\text{Germany}} - \mathbf{v}_{\text{Berlin}}.$$

The vector difference here represents the notion of "capital city". Presumably, going from the Paris vector to the France vector (i.e., the vector difference) will be the same as going from the Berlin vector to the Germany vector, if that difference carries similar semantic meaning.

Let's test this directly. We'll do so by rewriting the above expression:

$$\mathbf{v}_{\text{France}} - \mathbf{v}_{\text{Paris}} + \mathbf{v}_{\text{Berlin}} \approx \mathbf{v}_{\text{Germany}}.$$

The core idea is that once words are represented as numerical vectors, you can do "math" with them. In `gensim`, this works with the `most_similar` function, which takes `positive` and `negative` arguments. In the above scenario, the positive terms would be Berlin and France, while the negative term is Paris. You can roughly think of this as: "What is the vector most similar to Berlin and France, but opposite Paris?"

We can't do this example in our corpus, because we don't have all these words represented. Another example we can do is perhaps t he most well known example:

`Man : King :: Woman : ?`

What does the function tell us is on the other side of the analogy? Remember, analogies are constructions by humans: they quite literally encode semantic relationships, and thus enforce norms. The fact that the word embedding learns this analogy implies that it has inherited norms practiced by humans, whether those norms are biased or not.

In [21]:
model.wv.most_similar(positive=['woman', 'king'], negative='man')


KeyError: "Key 'a' not present"

## Clustering Word Vectors

One convenience of word embeddings is that we can "cluster" them. We can find a group of word vectors that are close to each other, and call it a related "cluster". Since we expected word vectors that are semantically similar to be close to each other in space, we might expect the clusters to be semantically meaningful. The clustering algorithm we use is called **K-means clustering**. 

K-Means clustering aims to group $N$ observations into $K$ clusters (we choose $K$) in which each observation belongs to the cluster with the nearest mean (called the "cluster center"), which serves as a prototype of the cluster.

Since our words are all represented as vectors, applying K-means is easy to do since the clustering algorithm will simply look at differences between vectors (and centers).

A package called [`scikit-learn`](https://scikit-learn.org/stable/) provides us a couple algorithms that will be useful for this section: [`KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and [`KDTree`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html?highlight=kdtree#sklearn.neighbors.KDTree)

In [22]:
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree

Here is a helper function to perform the clustering:

In [23]:
def clustering_on_wordvecs(word_vecs, n_clusters):
    """Clusters a set of word vectors and returns the center of each cluster."""
    # Initalize a k-means object and use it to extract centroids
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++')
    cluster_ids = kmeans.fit_predict(word_vecs)
    return kmeans.cluster_centers_, cluster_ids

We can access the raw word vectors with the `vectors` attribute:

In [24]:
n_clusters = 20
centers, cluster_ids = clustering_on_wordvecs(model.wv.vectors, n_clusters)
centroid_map = dict(zip(model.wv.index_to_key, centers))

Next, we get words in each cluster that are closest to the cluster center. To do this, we initialize a data structure called a KDTree on the word vectors, and query it for the top $K$ words on each cluster center. We will use a helper function to print this into a convenient dataframe:

In [25]:
def get_top_words(model, n_closest, centers):
    """Get the words closest to each cluster center."""
    # Create KD Tree
    tree = KDTree(model.wv.vectors)
    # Use closest points for each cluster center to query the closest points to it
    closest_points = tree.query(centers, k=n_closest)[1]
    # Query word index for each position
    closest_words = {}
    for cluster_idx, cluster in enumerate(closest_points):
        closest_words[f'Cluster {cluster_idx + 1}'] = [model.wv.index_to_key[idx] for idx in cluster]
    # Create DataFrame from dictionary
    df = pd.DataFrame(closest_words)
    return df

Let’s get the top 50 words for each cluster:

In [26]:
get_top_words(model, 50, centers)

Unnamed: 0,Cluster 1,Cluster 2,Cluster 3,Cluster 4,Cluster 5,Cluster 6,Cluster 7,Cluster 8,Cluster 9,Cluster 10,Cluster 11,Cluster 12,Cluster 13,Cluster 14,Cluster 15,Cluster 16,Cluster 17,Cluster 18,Cluster 19,Cluster 20
0,😊,laurie,barked,scrubbing,steamed,amita,thirdly,economics,arab,i(28,canvas,like,time,sans,said,65k,alexia,festivities,looser,therapies
1,trolling,told,choking,scrubbed,frying,crybaby,speculation,conferences,catholics,i(28f,tide,wierd,10pm,speculation,like,500k,instrument,i(28f,neon,anemia
2,sooooo,m16,whip,toothpaste,macaroni,told,told,elite,i(28f,i(27,rubber,told,tuesdays,homeowners,told,aud,bells,14th,cardigan,uti
3,clarifications,i(28f,wolf,sponge,refried,caving,contemplating,emt,ultra,m16,luxurious,observation,winding,inspect,jab,6000,doubles,laurie,hairstyles,mri
4,p.s,seventeen,puddle,sinks,caramel,like,warrants,inconsistent,crosses,f35,vienna,think,decompress,radius,frown,25000,hype,newlyweds,pins,observation
5,inputs,said,shrieked,fountain,oatmeal,said,caution,thirdly,nigeria,f34,like,know,told,wandering,embarassing,bitcoin,celebrities,going,horns,deteriorating
6,here?update,wierd,tripping,rearranging,seasoned,fixated,cognizant,skillset,stumped,i(27f,ikea,laughable,like,wolf,know,revenue,dang,told,vibrant,prognosis
7,aita?&#x200b;update,granddaughters,chucked,sippy,burritos,dictator,bases,dang,norway,46f,persona,said,infants,umbrella,bummer,biweekly,stunts,autumn,unicorn,stunts
8,me?edit,delight,peculiar,perfumes,sausages,wierd,distressing,delaying,uneducated,51f,plushies,amita,said,ontario,prob,inconsistent,like,time,sleeveless,contracting
9,😅,carl,smacked,pees,hamburger,"""aita",pending,interns,philippines,f15,laurie,scummy,hyped,dock,prodding,airfare,gamecube,50th,wavy,moods


## Visualizing High Dimensional Spaces with $t$-SNE

The word embeddings we created are what's called a **high-dimensional representation** of the text. That is, we take a word in the corpus, and represent it using, in this case, 300 numbers. We can plot 3 numbers at a time - that's in 3 dimensions - but there's no way for humans to visualize something in a 300-dimensional space. 

So, **dimensionality reduction** is a big part of machine learning. How can we take vectors that are 300-dimensional, and visualize them in 2-dimensions, while keeping the structure between vectors the same? How can we reduce the dimensionality?

One of the most popular methods for dimensionality reduction is called $t$-SNE ($t$-Distributed Stochastic Neighbor Embedding). The details are not important, but using it in practice is a useful skill to learn (if you want to read more, [here](https://lvdmaaten.github.io/tsne/) is a good starting point). Roughly, it tries to keep the relative distances between points as closely as possible in both high-dimensional and low-dimensional space.

So, we'll use $t$-SNE to take all the word vectors, and obtain a **low dimensional representation**. We can then visualize it, which may reveal semantic and syntactic trends in the data.

A [$t$-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html?highlight=tsne#sklearn.manifold.TSNE) implementation is available via `scikit-learn`:

In [None]:
#if you get error in the line tsne=TSNE(), you can try to run the following line:
#!pip install -U scikit-learn 

In [27]:
from sklearn.manifold import TSNE

In [28]:
# Create some filepaths to save our model
tsne_path = 'tsne_model'
tsne_vectors_path = 'tsne_vectors.pkl'

In [29]:
tsne = TSNE(init='pca', learning_rate='auto')
tsne_vectors = tsne.fit_transform(model.wv.vectors)




We have our low dimensional representation. Now, let's store the 2 dimensions in a dataframe, with the word as the index:

In [30]:
# Store the t-SNE vectors
tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(model.wv.index_to_key),
                            columns=['x', 'y'])

In [31]:
with open(tsne_path, 'wb') as f:
    pickle.dump(tsne, f)

tsne_vectors.to_pickle(tsne_vectors_path)

Here's a convenient code block to load this data, to start from this point:

In [32]:
with open(tsne_path, 'rb') as f:
    tsne = pickle.load(f)
    
tsne_vectors = pd.read_pickle(tsne_vectors_path)

We're going to visualize the 2-dimensional space using a package called `bokeh`. This package is nice for this because it allows for some degree of interactivity: we can go over each point and dynamically get information about the word denoting that vector.

In [33]:
!pip install bokeh



In [34]:
import bokeh
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource

output_notebook()
bokeh.io.output_notebook()

In [35]:
# Add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# Create the plot and configure the title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings',
                   plot_width=800,
                   plot_height=800)

# Add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@index') )

# Draw the words as circles on the plot
tsne_plot.circle('x', 'y',
                 source=plot_data,
                 color='blue',
                 line_alpha=0.2,
                 fill_alpha=0.1,
                 size=10,
                 hover_line_color='black')

# Configure visual elements of the plot
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# Engage!
show(tsne_plot)

# Reflection: The Hermeneutics of Word Embeddings

“In vector space, identities and differences change in nature. Similarity and belonging no longer rely on resemblance or a common genesis but on measures of proximity or distance, on flat loci that run as vectors through the space.” (Dourish 2018: 73-4)

As we've seen, word embeddings are essentially a set of vectors. We should reflect on this. What is vectorization? It is reducing linguistic complexity. Or rather, it produces a common space that juxtaposes and mixes complex localized realities. Anything can be turned into a vector operation, but what do we lose when doing so? 