# Introduction to Word Embeddings

## Douglas Rice

*This tutorial was originally created by Burt Monroe for his prior work with the Essex Summer School. I've updated and modified it.*

In this notebook, we'll estimate our first word embedding model, then go through a series of analyses of the estimated embeddings. After completing this notebook, you should be familiar with:


1. Preparing a corpus for estimating word embeddings
2. Estimating a (static) word embedding model
3. Analyzing the output of a (static) word embedding model



# Data 

This notebook illustrates the estimation of embeddings on a corpus of Supreme Court oral arguments. The data are available via the excellent Cornell Conversational Analysis Toolkit (ConvoKit). You can read more about the data (and ConvoKit) [here](https://convokit.cornell.edu/documentation/supreme.html#).  The Oral Arguments corpus is described as:

>A collection of cases from the U.S. Supreme Court, along with transcripts of oral arguments. Contains approximately 1,700,000 utterances over 8,000 oral arguments transcripts from 7,700 cases.



In [1]:
!pip3 install convokit


## Prepping the Corpus

The first thing we need to do is to download the corpus. This will take a couple minutes, as this is a large corpus. Lawyers and judges like to talk a lot. The benefit of this additional text, though, is that we have significantly more information for validly estimating the word embeddings.


In [12]:
from convokit import Corpus, download
import pandas as pd

In [3]:
corpus = Corpus(filename=download("supreme-corpus"))

Downloading supreme-corpus to C:\Users\hermida\.convokit\downloads\supreme-corpus
Downloading supreme-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/supreme-corpus/supreme-corpus.zip (1255.8MB)... Done


We can see a bit of information on our corpus as follows. 

In [4]:
corpus.print_summary_stats()

Number of Speakers: 8979
Number of Utterances: 1700789
Number of Conversations: 7817


Let's look at the first utterance. This is the Chief Justice of the U.S. Supreme Court introducing the case and the first lawyer to speak before the Court.

In [5]:
for utt in corpus.iter_utterances():
    print(utt.text)
    break

Number 71, Lonnie Affronti versus United States of America.
Mr. Murphy.


Next, we need to begin to prepare the corpus for estimating word embeddings. To do so, we must first do some standard NLP tasks, segmenting the corpus by sentence and tokenizing the texts. We'll just use the nltk tokenizers to segment into sentences and tokens.

In [6]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hermida\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Let's look at how the tokenizer works for the first utterance.

In [7]:
for utt in corpus.iter_utterances():
    print( [word_tokenize(t) for t in sent_tokenize(utt.text)])
    break

[['Number', '71', ',', 'Lonnie', 'Affronti', 'versus', 'United', 'States', 'of', 'America', '.'], ['Mr.', 'Murphy', '.']]


Generate the sentence tokens, and the word tokens within them. This took ~ 11 minutes, given 1.7 million utterances.

In [13]:
sents = []
for utt in corpus.iter_utterances():
    sents.append([word_tokenize(t) for t in sent_tokenize(utt.text)])

In [14]:
len(sents)

1700789

In [18]:
sents[0]

[['Number',
  '71',
  ',',
  'Lonnie',
  'Affronti',
  'versus',
  'United',
  'States',
  'of',
  'America',
  '.'],
 ['Mr.', 'Murphy', '.']]

That's the first document/utterance, a list of lists (each sentence is a list of tokens). That means `sents` is organized as a list of lists of lists. Word2Vec wants a list of lists (the tokens by sentence, without distinguishing between the utterances in which they are used). So, we flatten the list (to a list of sentences, each a list of tokens).
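
A toy version of that nesting may make the flattening step easier to picture. The two utterances below are made up for illustration (they are not pulled from the corpus), and `itertools.chain.from_iterable` is used as one alternative to a nested comprehension; either approach should give the same result.

```python
from itertools import chain

# Two made-up utterances (not from the corpus): each is a list of sentences,
# and each sentence is a list of tokens.
toy_sents = [
    [["Number", "71", "."], ["Mr.", "Murphy", "."]],  # utterance 1: two sentences
    [["May", "it", "please", "the", "Court", "."]],    # utterance 2: one sentence
]

# Drop the utterance level, keeping one list of tokens per sentence.
toy_flat = list(chain.from_iterable(toy_sents))

print(len(toy_sents))  # number of utterances
print(len(toy_flat))   # number of sentences across all utterances
print(toy_flat[1])     # the second sentence, as a flat list of tokens
```

The utterance boundaries are gone after flattening, which is what we want here since the model only needs to know which tokens share a sentence.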

In [16]:
flat_sents_list = [sentence for utt in sents for sentence in utt] # for every utterance, loop over its sentences and add them to the list

In [17]:
flat_sents_list[0]

['Number',
 '71',
 ',',
 'Lonnie',
 'Affronti',
 'versus',
 'United',
 'States',
 'of',
 'America',
 '.']

In [None]:
len(flat_sents_list)

3880254

As you can see, we are closing in on 4 million sentences overall.

## Estimate word2vec Embeddings

First, we'll estimate word2vec embeddings. We'll use gensim, a well-known (and fast!) library for estimating embeddings and topic models.

In [None]:
import gensim
from gensim.models import Word2Vec

To estimate the word2vec model, we need to specify a number of parameters. The first argument we pass is the sentence object that we've just created (`flat_sents_list`). 

From there, we need to specify the dimensionality (or size) of the estimated vectors. I used the default dimensionality of 100. I set the context window at 5 (also the gensim default); we will play around with estimates based on other windows below. Note that we have retained all tokens to this point. In estimating the model, we can drop rare terms (which can enhance our computational speed): `min_count` sets the minimum corpus frequency a token needs in order to be kept. I set it at 5, which is gensim's default and will still probably be a bit noisy. According to the gensim docs, the random seed defaults to 1, but a fully reproducible run also requires a single worker thread (`workers=1`), which I think is all Google Colab will give anyway, and, in Python 3, a fixed `PYTHONHASHSEED`.
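
To see what a frequency threshold does to the vocabulary, here is a minimal sketch on made-up sentences. `surviving_vocab` is a hypothetical helper written for this illustration; gensim does its own pruning internally when it builds the vocabulary, and this is only meant to convey the idea.

```python
from collections import Counter

# Made-up tokenized sentences, for illustration only.
toy_sentences = [
    ["the", "court", "heard", "the", "case"],
    ["the", "court", "ruled"],
    ["counsel", "addressed", "the", "court"],
]

def surviving_vocab(sentences, min_count):
    """Return the tokens whose total corpus frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# With a threshold of 1 every token is kept; raising it drops the rare ones.
print(sorted(surviving_vocab(toy_sentences, min_count=1)))
print(sorted(surviving_vocab(toy_sentences, min_count=3)))
```

Rare tokens contribute few training examples, so their vectors tend to be poorly estimated; a higher threshold trades vocabulary coverage for less noise.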

You'll have to wait a bit again. This took approximately 11 minutes on a recent run. 

In [None]:
# `size` is the gensim 3.x argument name; gensim 4.x renamed it to `vector_size`.
model_w5 = Word2Vec(sentences=flat_sents_list, size=100, window=5, min_count=5, workers=1)
model_w5.save("w5_word2vec.model")

Now let's see what words are near each other; given the setting, let's start with a term likely to pop up.

In [None]:
model_w5.wv.most_similar("police")

[('policeman', 0.686346709728241),
 ('arresting', 0.6797668933868408),
 ('detective', 0.6674090623855591),
 ('commanding', 0.6645101308822632),
 ('Kallnischkies', 0.6578880548477173),
 ('sheriff', 0.6559174656867981),
 ('FBI', 0.635968804359436),
 ('plainclothes', 0.6321883201599121),
 ('officers', 0.6278793811798096),
 ('corrections', 0.6276471614837646)]

All of the most similar words are ones that we would expect to see in this context. One thing you might notice here is that we haven't done anything with capitalization; see how "FBI" is capitalized? The model treats differently cased forms as separate tokens, so it can capture subtle (and not-so-subtle) differences in how each form is used. So what happens if we capitalize "Police"?

In [None]:
model_w5.wv.most_similar("Police")

[('Fire', 0.8665454387664795),
 ('Ships', 0.8045356273651123),
 ('Detectives', 0.783249020576477),
 ('Correction', 0.7493676543235779),
 ('Commander', 0.7446268796920776),
 ('Corrections', 0.7420327067375183),
 ('Bookkeeping', 0.7297062873840332),
 ('Herder', 0.7293038368225098),
 ('Engineer', 0.7246467471122742),
 ('Staff', 0.7201530337333679)]

Much different! This is catching some related professions ("Detectives", "Corrections"), but also just seems to be capturing professions generally. 
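
The scores reported by `most_similar` are cosine similarities between word vectors. The sketch below computes the same quantity by hand on made-up three-dimensional vectors; the words and numbers are invented for illustration and are not taken from the fitted model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional vectors, purely for illustration.
toy_vectors = {
    "police":  np.array([1.0, 0.9, 0.0]),
    "sheriff": np.array([0.9, 1.0, 0.1]),
    "statute": np.array([0.0, 0.1, 1.0]),
}

# Rank every other word by its similarity to the query word.
query = toy_vectors["police"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "police"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Because cosine similarity depends only on direction, two words can score as near neighbors even when their vectors have very different lengths.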

As we discussed in class, the classic analogy is "man is to woman as king is to ____." Of course, at oral argument for the Supreme Court, we are exceedingly unlikely to see the terms "king" and "queen" used very often, which might limit the utility of the analogy. Here's a quick check. The basic idea is to calculate the vector `woman` + `king` - `man`, then look for the most similar vectors.

In [None]:
model_w5.wv.most_similar(positive=["woman","king"],negative=["man"])

[('German', 0.5547077655792236),
 ('Yugoslavia', 0.5521583557128906),
 ('mother', 0.5404218435287476),
 ('Mexican', 0.5387974977493286),
 ('native', 0.5348100662231445),
 ('husband', 0.5322563648223877),
 ('marrying', 0.5303377509117126),
 ('descent', 0.5277263522148132),
 ('YMCA', 0.5270720720291138),
 ('Netherlands', 0.5259039402008057)]

Yeah, that's not making much sense. Let's look at the most similar terms for "king" to see how it worked. 

In [None]:
model_w5.wv.most_similar("king")

[('crowd', 0.5560708045959473),
 ('parliament', 0.5532700419425964),
 ('crown', 0.5440690517425537),
 ('YMCA', 0.5377370715141296),
 ('Netherlands', 0.5282279253005981),
 ('helicopter', 0.5278602838516235),
 ('bartender', 0.5258022546768188),
 ('village', 0.517535388469696),
 ('disappearing', 0.515839695930481),
 ('Canadians', 0.5113198757171631)]
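
For contrast, the analogy arithmetic can be sketched on toy vectors where the geometry is clean by construction. Everything below is made up for illustration: the two axes, the five words, and the `analogy` helper are assumptions, and gensim's implementation differs in details (for example, it normalizes the input vectors before combining them).

```python
import numpy as np

# Made-up 2-dimensional vectors: axis 0 is roughly "royalty", axis 1 is roughly "gender".
toy = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "crowd": np.array([-1.0, 0.2]),
}

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative), skipping the inputs."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    scores = {}
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # the query words themselves are excluded from the results
        scores[word] = float(
            np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        )
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

result = analogy(positive=["woman", "king"], negative=["man"], vectors=toy)
print(result)
```

In the toy space "queen" comes out on top because the vectors were built that way. In the oral argument corpus, "king" has no such clean position, as its nearest neighbors above suggest, so the arithmetic has little to work with.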

# A Tangent on Bias

As we discussed in class, word embeddings have proven to be a useful tool for uncovering/revealing bias in large corpora. Here, we can see how well the U.S. Supreme Court fares. We'll look at occupations. 

In [None]:
model_w5.wv.most_similar(positive=["woman","occupation"],negative=['man'])

[('extraction', 0.6386083364486694),
 ('offspring', 0.6179507374763489),
 ('employment', 0.6173360347747803),
 ('impairment', 0.6090482473373413),
 ('infant', 0.5856988430023193),
 ('engagement', 0.5855349898338318),
 ('marriage', 0.5724813342094421),
 ('nationality', 0.5723633766174316),
 ('unborn', 0.5705928802490234),
 ('placement', 0.5701239109039307)]

In [None]:
model_w5.wv.most_similar(positive=["man","occupation"],negative=["woman"])

[('engagement', 0.6586723327636719),
 ('installation', 0.6567591428756714),
 ('occupancy', 0.6194782853126526),
 ('enterprise', 0.6189873218536377),
 ('outlet', 0.6179249286651611),
 ('activity', 0.6159071922302246),
 ('operator', 0.6034233570098877),
 ('aircraft', 0.5891773104667664),
 ('agent', 0.5782469511032104),
 ('operation', 0.574250340461731)]

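You can see from the above that, in the `woman` results, employment sits alongside terms tied to marriage and reproduction ("offspring", "infant", "marriage", "unborn"). Apart from the ambiguous "engagement", nothing comparable shows up in the `man` results, which lean toward business and operations.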

# Working with Estimated Embeddings

Once we've estimated the embeddings, there are a number of other options for analysis beyond the simple vector operations above. In this section, we'll look at how the estimated vectors cluster together into coherent themes. To do so, we load `numpy` (for extracting the vectors as arrays) and the `KMeans` class from scikit-learn (for fitting the clustering model).

In [None]:
import numpy as np
from sklearn.cluster import KMeans

In [None]:
wv_w5 = model_w5.wv

In [None]:
# extract the words & their vectors, as numpy arrays
# (gensim 3.x attribute names; gensim 4.x renamed index2word to index_to_key)
vectors_w5 = np.asarray(model_w5.wv.vectors)
labels_w5 = np.asarray(model_w5.wv.index2word)  # fixed-width numpy strings


We can check the dimensions of the embedding matrix that we've extracted. It has one row per token and one column per embedding dimension, so we have a 100-dimensional vector for each of 61,103 tokens.

In [None]:
vectors_w5.shape

(61103, 100)

With that in hand, we can fit a simple clustering model. We specify 20 clusters, but feel free to play around with that number.

In [None]:
kmeans_w5_20 = KMeans(n_clusters=20)
kmeans_w5_20.fit(vectors_w5)

KMeans(n_clusters=20)

In [None]:
kmeans_w5_20.labels_.shape

(61103,)

In [None]:
kmeans_w5_20.cluster_centers_.shape

(20, 100)

Note what we have estimated with KMeans. We have 20 cluster centers, each of 100 dimensions, the same number of dimensions that we have for each of our tokens. Because the centers live in the same space as the token vectors, we can look for the tokens that are most similar to any one of our cluster centers.

In [None]:
model_w5.wv.most_similar([kmeans_w5_20.cluster_centers_[1]])


[('attacked', 0.817511260509491),
 ('sanctioned', 0.7676904797554016),
 ('harmed', 0.76265549659729),
 ('condemned', 0.7622823119163513),
 ('misled', 0.7566522359848022),
 ('replaced', 0.7562367916107178),
 ('tested', 0.7554956674575806),
 ('criticized', 0.753906786441803),
 ('victimized', 0.7522069215774536),
 ('pursued', 0.7501586079597473)]
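
A cluster center can be treated like a word vector because, in k-means, each center is the mean of the vectors assigned to its cluster. The sketch below shows this on made-up two-dimensional points with hand-assigned cluster ids; nothing here is fitted, and the numbers are invented for illustration.

```python
import numpy as np

# Made-up 2-dimensional "embeddings" and hand-assigned cluster ids (not fitted).
toy_vectors = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.2, 0.8],
])
toy_assignments = np.array([0, 0, 1, 1])

# Each center is the mean of the vectors assigned to that cluster.
toy_centers = np.vstack([
    toy_vectors[toy_assignments == k].mean(axis=0) for k in range(2)
])

print(toy_centers.shape)  # (number of clusters, embedding dimension)
print(toy_centers)
```

A center need not coincide with any actual token, which is why we ask for its nearest tokens rather than looking it up directly.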

Now let's loop over each of the cluster centers and print the most similar tokens for each.

In [None]:
for k in range(20):
  print(model_w5.wv.most_similar([kmeans_w5_20.cluster_centers_[k]]))

[('Orville', 0.8814734220504761), ('Kresge', 0.8752458095550537), ('Frana', 0.8727689385414124), ('Followed', 0.8725795745849609), ('Ivanovich', 0.8725254535675049), ('Wash', 0.8718684911727905), ('Deegan', 0.8708628416061401), ('G.R.H', 0.8643906116485596), ('Y.', 0.8628705143928528), ('Shops', 0.8625250458717346)]
[('attacked', 0.817511260509491), ('sanctioned', 0.7676904797554016), ('harmed', 0.76265549659729), ('condemned', 0.7622823119163513), ('misled', 0.7566522359848022), ('replaced', 0.7562367916107178), ('tested', 0.7554956674575806), ('criticized', 0.753906786441803), ('victimized', 0.7522069215774536), ('pursued', 0.7501586079597473)]
[('130', 0.9330103397369385), ('45', 0.9282690286636353), ('25', 0.9267070293426514), ('75', 0.9163283109664917), ('240', 0.911709189414978), ('140', 0.9107958078384399), ('17', 0.9082223176956177), ('70', 0.9067625999450684), ('55', 0.902971625328064), ('15', 0.9015152454376221)]
[('1968', 0.9276836514472961), ('1956', 0.9260790348052979), ('
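
The nearest tokens to a center are only a summary of a cluster. To list the tokens actually assigned to a cluster, one option is to index the vocabulary array with the fitted cluster ids. The sketch below uses small made-up stand-ins for `labels_w5` and `kmeans_w5_20.labels_`, and `cluster_members` is a hypothetical helper.

```python
import numpy as np

# Stand-ins for labels_w5 (the vocabulary) and kmeans_w5_20.labels_ (one cluster id per token).
toy_words = np.asarray(["1968", "attacked", "1956", "harmed", "45"])
toy_cluster_ids = np.array([3, 1, 3, 1, 2])

def cluster_members(words, cluster_ids, k):
    """Return the words assigned to cluster k, in vocabulary order."""
    return words[cluster_ids == k].tolist()

for k in np.unique(toy_cluster_ids):
    print(k, cluster_members(toy_words, toy_cluster_ids, k))
```

With the real objects, the equivalent expression would be `labels_w5[kmeans_w5_20.labels_ == k]`, which relies on the rows of `vectors_w5` being in the same order as `labels_w5`.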

# Window Size

What's the effect of choices over window size? Let's play around and find out.
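
It may help to see first which tokens count as context under a given window. The sketch below enumerates (center, context) pairs for a made-up sentence; `context_pairs` is a hypothetical helper, and it treats `window` as a hard maximum distance on each side. The actual training procedure differs in its details, so take this only as a picture of what the parameter controls.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions of the center."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]

# A made-up sentence, for illustration only.
toy_sentence = ["the", "court", "will", "hear", "argument"]

narrow = list(context_pairs(toy_sentence, window=1))
wide = list(context_pairs(toy_sentence, window=2))

print(len(narrow), len(wide))
print([ctx for center, ctx in wide if center == "will"])
```

A narrow window only sees immediate neighbors, which tends to group words that are substitutable in the same slot; a wide window takes in more of the sentence, which tends to group words by topic.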

In [None]:
model_w1 = Word2Vec(sentences=flat_sents_list, size=100, window=1, min_count=5, workers=1)
model_w1.save("w1_word2vec.model")

In [None]:
vectors_w1 = np.asarray(model_w1.wv.vectors)
labels_w1 = np.asarray(model_w1.wv.index2word)

kmeans_w1_20 = KMeans(n_clusters=20)
kmeans_w1_20.fit(vectors_w1)

KMeans(n_clusters=20)

In [None]:
for k in range(20):
  print(model_w1.wv.most_similar([kmeans_w1_20.cluster_centers_[k]]))

[('185', 0.9471156001091003), ('272', 0.9401297569274902), ('255', 0.9379180073738098), ('227', 0.9356634020805359), ('186', 0.9354671239852905), ('352', 0.9337529540061951), ('167', 0.9334511160850525), ('398', 0.9321123957633972), ('199', 0.9266774654388428), ('683', 0.9257817268371582)]
[('supervisor', 0.7971924543380737), ('laborer', 0.7777252197265625), ('doctor', 0.7767788171768188), ('parent', 0.7759108543395996), ('customer', 0.7593592405319214), ('nonmember', 0.7578420639038086), ('student', 0.7573502063751221), ('nonprofessional', 0.7480571269989014), ('dissident', 0.7470308542251587), ('banker', 0.7456125020980835)]
[('deceived', 0.828521728515625), ('captured', 0.826597273349762), ('attacked', 0.8232210874557495), ('terminated', 0.8220384120941162), ('disclosed', 0.821386992931366), ('represented', 0.8101081252098083), ('revealed', 0.8093607425689697), ('admitted', 0.8074983358383179), ('evaluated', 0.8072065114974976), ('replaced', 0.8067431449890137)]
[('2004', 0.89474087

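Now a window of 30. Note that this run also lowers `min_count` to 2 and uses four workers, so it differs from the two models above in more than window size, and it will not be exactly reproducible.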

In [None]:
model_w30 = Word2Vec(sentences=flat_sents_list, size=100, window=30, min_count=2, workers=4)
model_w30.save("w30_word2vec.model")

In [None]:
vectors_w30 = np.asarray(model_w30.wv.vectors)
labels_w30 = np.asarray(model_w30.wv.index2word)

kmeans_w30_20 = KMeans(n_clusters=20)
kmeans_w30_20.fit(vectors_w30)

KMeans(n_clusters=20)

In [None]:
for k in range(20):
  print(model_w30.wv.most_similar([kmeans_w30_20.cluster_centers_[k]]))

[('Cannel', 0.8970688581466675), ('Bud', 0.8745160102844238), ('Et', 0.8718833923339844), ('Larrimer', 0.8678923845291138), ('Jochnowitz', 0.8659942150115967), ('Schmechel', 0.8649727702140808), ('Eggers', 0.8629851341247559), ('Rauch', 0.8628865480422974), ('Sperling', 0.8627452850341797), ('Lockton', 0.8580783605575562)]
[('kicked', 0.8909618258476257), ('threw', 0.8817209005355835), ('thrown', 0.8754757642745972), ('backed', 0.857831597328186), ('knocked', 0.8560508489608765), ('walked', 0.8446695804595947), ('throw', 0.842941164970398), ('turned', 0.8425027132034302), ('stepped', 0.8380867838859558), ('pull', 0.8378132581710815)]
[('Blackmun', 0.9822768568992615), ('Kennedy', 0.9806851744651794), ('Brennan', 0.9784543514251709), ('Gorsuch', 0.978076696395874), ('Sotomayor', 0.9751110076904297), ('Kagan', 0.9750866293907166), ('Alito', 0.9750633835792542), ('Harlan', 0.9749494791030884), ('Kavanaugh', 0.9743109941482544), ('Rehnquist', 0.9738765954971313)]
[('mechanisms', 0.75570046

The larger window size seems to form more coherent clusters of related terms. For instance, one cluster consists entirely of justices' names, something that neither of the smaller-window models picked up on in the output shown above.