# AKA: A Quick Introduction to Word2Vec with Python and Gensim
@Aniello De Santo, Feb 06, 2020

## Setting up and getting some data

In [1]:
# We need a resource for our data
import nltk

In [None]:
# If you are running this in Colab, also run this cell; otherwise you can skip it
import warnings
warnings.filterwarnings('ignore')

In [2]:
# We are going to use the brown corpus
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package brown to /Users/Ani/nltk_data...
[nltk_data]   Package brown is already up-to-date!


Let's start by printing a few sentences out of the "brown" corpus, to get an idea of what the data looks like.

In [3]:
brown_sent = brown.sents()
print(brown_sent[:3])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'to', 'investigate', 'reports', 'of', 'possible', '``', 'irregularities', "''", 'in', 'the', 'hard-fought', 'primary', 'which', 'was', 'won', 'by', 'Mayor-nominate', 'Ivan', 'Allen', 'Jr.', '.']]


## Building the Model

We don't want to build the whole model from scratch, so we will use the Gensim library instead.

In [4]:
from gensim.models import Word2Vec

We can now build an instance of the model!

In [5]:
# This is the whole model for the brown corpus (it might take a few minutes)!
brown_model = Word2Vec(brown_sent)

Let's look at an example!

In [6]:
test1 = brown_model.wv.most_similar('blue')
print("Most similar to 'blue':\n", test1[:3])

Most similar to 'blue':
 [('red', 0.9619253873825073), ('gray', 0.9619019031524658), ('green', 0.9586248397827148)]
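
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of what that ranking involves, here is a minimal pure-Python sketch. The three-dimensional vectors and the helper names (`cosine_similarity`, `toy_most_similar`) are made up for this example and are not part of Gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "blue": [0.9, 0.1, 0.0],
    "red": [0.8, 0.2, 0.1],
    "cup": [0.0, 0.3, 0.9],
}

def toy_most_similar(word, vectors, topn=3):
    """Rank every other word by its cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(toy_most_similar("blue", toy_vectors))
```

In this toy space "red" ranks above "cup" because its vector points in nearly the same direction as the vector for "blue".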


## Refining the model

Word2Vec takes a broad range of parameters. In our example above, we only chose where to get our sentences from, and we used the *default* settings for the rest. Let's now look at the most relevant ones (you can find the full list here: https://radimrehurek.com/gensim/models/word2vec.html):

- **size**: The dimensionality of our embeddings (i.e. the length of each word vector). This parameter is called `vector_size` in Gensim 4.0 and later.
- **window**: The maximum distance between the target word and the words that count as its context. The size of the window affects the type of similarity captured in the embeddings.
- **negative**: The number of negative samples (incorrect training-pair instances) that are drawn for each correct training pair.
- **sg**: Training algorithm -- 1 for skip-gram; otherwise CBOW.
- **min_count**: Ignores all words with total frequency lower than this.
- **iter**: Number of iterations (epochs) over the corpus. This parameter is called `epochs` in Gensim 4.0 and later.
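
To make **window**, **min_count** and **sg** a bit more concrete, here is a simplified sketch of how skip-gram (target, context) training pairs could be collected from tokenized sentences. The function `skipgram_pairs` is a hypothetical helper written for this example. Gensim's real training loop differs in its details (for instance, it can randomly shrink the window for each target), and negative sampling is not shown.

```python
from collections import Counter

def skipgram_pairs(sentences, window=2, min_count=1):
    """Return (target, context) pairs, in the spirit of skip-gram (sg=1)."""
    counts = Counter(word for sent in sentences for word in sent)
    pairs = []
    for sent in sentences:
        # min_count: drop words whose total frequency is too low
        kept = [w for w in sent if counts[w] >= min_count]
        for i, target in enumerate(kept):
            # window: only words at most `window` positions away are contexts
            start = max(0, i - window)
            stop = min(len(kept), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((target, kept[j]))
    return pairs

toy_sentences = [["the", "jury", "said", "the", "election"]]
print(skipgram_pairs(toy_sentences, window=1))
print(skipgram_pairs(toy_sentences, window=1, min_count=2))
```

With `window=1` each word is paired only with its immediate neighbours. Raising `min_count` to 2 removes every word except "the" before any pairs are built.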

So let's now train our model by explicitly setting some of these parameters!

In [7]:
# This is the whole model (it's going to take a few minutes!)
brown_model = Word2Vec(brown_sent, size = 300, window = 5, negative = 5, sg = 1, min_count = 5, iter = 10)
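
The call above uses the Gensim 3.x keyword names `size` and `iter`. Gensim 4.0 and later expect `vector_size` and `epochs` instead. If the same notebook has to run on both, one option is a small helper that picks the names from the version string. `word2vec_kwargs` below is a hypothetical helper, not part of Gensim, and it assumes the version string starts with the major version number.

```python
def word2vec_kwargs(gensim_version, dim=300, n_epochs=10):
    """Build Word2Vec keyword arguments for the given Gensim version string."""
    major = int(gensim_version.split(".")[0])
    if major >= 4:
        # Gensim 4.0+ parameter names
        renamed = {"vector_size": dim, "epochs": n_epochs}
    else:
        # Gensim 3.x parameter names, as used in this notebook
        renamed = {"size": dim, "iter": n_epochs}
    shared = {"window": 5, "negative": 5, "sg": 1, "min_count": 5}
    return {**shared, **renamed}

print(word2vec_kwargs("3.8.3"))
print(word2vec_kwargs("4.3.2"))

# Intended use (not run here):
#   import gensim
#   brown_model = Word2Vec(brown_sent, **word2vec_kwargs(gensim.__version__))
```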

In [9]:
# We can do the same test as before
test = brown_model.wv.most_similar('blue')
print("Most similar to 'blue':\n", test[:3])

Most similar to 'blue':
 [('silk', 0.8611888885498047), ('gray', 0.8563375473022461), ('pink', 0.850502610206604)]


## Evaluating the Model

We are going to rely on our own **human intuitions** to decide how well the model is doing!

In [11]:
sim = brown_model.wv.similarity("cup", "water")
print("How similar is 'cup' to 'water':\n", sim)

sim = brown_model.wv.similarity("cup", "book")
print("How similar is 'cup' to 'book':\n", sim)

How similar is 'cup' to 'water':
 0.5454486
How similar is 'cup' to 'book':
 0.56064796


In [15]:
brown_test = brown_model.wv.most_similar('child')
print("Most similar to 'child':\n", brown_test[:3])

Most similar to 'child':
 [('autistic', 0.6856845617294312), ('patient', 0.666350245475769), ('fantasy', 0.6656447649002075)]


We can do more complex comparisons, but some results will be less intuitive than others!

In [16]:
brown_test = brown_model.wv.most_similar(positive = ['child'], negative = ['person'])
print("Most similar to 'child' but dissimilar to 'person':\n", brown_test[:3])

Most similar to 'child' but dissimilar to 'person':
 [('health', 0.2888645529747009), ('high', 0.2569887638092041), ('children', 0.24687400460243225)]


### Let's try a few more interesting tests.

Which word is a mismatch in the sequence?

In [19]:
mismatch = brown_model.wv.doesnt_match(['teacher','professor','doctor','red','athlete','runner'])
print(mismatch)

red
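
Roughly speaking, `doesnt_match` normalizes the vectors of the given words, averages them, and returns the word whose vector is least similar to that average. The sketch below mimics this idea with made-up two-dimensional vectors. The names `unit` and `toy_doesnt_match` are illustrative and not Gensim functions.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_doesnt_match(words, vectors):
    """Return the word farthest (by cosine) from the mean of all the words."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(normed[words[0]])
    mean = unit([sum(normed[w][k] for w in words) / len(words) for k in range(dim)])
    # Vectors are unit length, so the dot product is the cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Hypothetical embeddings: three "profession-like" vectors and one outlier.
toy_vectors = {
    "teacher": [0.9, 0.1],
    "professor": [0.8, 0.2],
    "doctor": [0.85, 0.15],
    "red": [0.1, 0.9],
}
print(toy_doesnt_match(["teacher", "professor", "doctor", "red"], toy_vectors))
```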


Maybe not **just** semantic relations?

In [20]:
mismatch = brown_model.wv.doesnt_match(['running','swimming','singing','paper','reading','booking','catch'])
print(mismatch)

paper


In [21]:
compare = brown_model.wv.similarity('walk','walked') 
print("The similarity between 'walk' and 'walked':\n", compare)

compare = brown_model.wv.similarity('look','looked') 
print("The similarity between 'look' and 'looked':\n", compare)

compare = brown_model.wv.similarity('look','walk') 
print("The similarity between 'look' and 'walk':\n", compare)

The similarity between 'walk' and 'walked':
 0.6708297
The similarity between 'look' and 'looked':
 0.6468835
The similarity between 'look' and 'walk':
 0.5292294


## The choice of training data

As with the other parameters that we looked at, the **choice of training data** (our corpus) is essential in driving model performance.
For example, consider a very famous test case for Word2Vec: is the model able to derive the fact that "woman" is to "queen" what "man" is to "king"?

We can represent this question algebraically as:

$$\text{vector}(\text{woman}) + \text{vector}(\text{king}) - \text{vector}(\text{man}) = \text{vector}(\text{queen})$$
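
The arithmetic above can be tried by hand on toy data. The sketch below assumes a made-up two-dimensional space whose axes loosely stand for "royalty" and "gender". Like `most_similar`, it normalizes the input vectors, combines them, and ranks the remaining words by cosine similarity while skipping the input words. The vectors and the helper `toy_analogy` are invented for illustration.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    used = set(positive) | set(negative)
    scores = [
        (w, sum(a * b for a, b in zip(query, unit(v))))
        for w, v in vectors.items()
        if w not in used  # the input words themselves are excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings: first axis ~ royalty, second axis ~ gender.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.0],
}
print(toy_analogy(["woman", "king"], ["man"], toy_vectors))
```

In this hand-built space "queen" comes out on top by construction. Whether a trained model reproduces the pattern depends on its data.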

In [23]:
test = brown_model.wv.most_similar(positive=['woman','king'], negative=['man'], topn=1)
print(test)

[('mourning', 0.7187150716781616)]


We got a *weird* result!

However, keep in mind that the Brown corpus is fairly small (about 1M words) and fairly old. What would happen if we used a bigger, more recent corpus?

### Working with a pretrained model

Luckily, NLTK includes a pruned sample of a pre-trained model that was trained on about 100 billion words from the Google News dataset. The full model is available from https://code.google.com/p/word2vec/ (about 3 GB).

In [24]:
# we need to get the data
from nltk.data import find
nltk.download('word2vec_sample')

# we are going to use a pruned set
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /Users/Ani/nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


In [25]:
# This time we are **not** training the model from scratch; we are just loading it (it will still take a bit)!
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)
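
With `binary=False`, the file is read in the plain-text word2vec format: a header line with the vocabulary size and the vector dimensionality, followed by one line per word containing the word and its space-separated vector values. The minimal parser below sketches that layout on a tiny invented sample. `parse_word2vec_text` is a hypothetical helper and no substitute for `load_word2vec_format`.

```python
def parse_word2vec_text(text):
    """Parse the plain-text word2vec format into a {word: vector} dict."""
    lines = text.strip().splitlines()
    # Header: "<vocabulary size> <vector dimensionality>"
    vocab_size, dim = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("vocabulary size does not match the header")
    return vectors

# Tiny made-up sample in the same layout (not real Google News vectors).
sample = "2 3\nblue 0.1 0.2 0.3\nred 0.4 0.5 0.6\n"
print(parse_word2vec_text(sample))
```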

Let's do a sanity check!

In [28]:
model.most_similar("blue")[:3]

[('red', 0.7225173115730286),
 ('purple', 0.7134224772453308),
 ('white', 0.6606029272079468)]

Let's try our example once more!

In [29]:
model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)

[('queen', 0.7118192911148071)]

In [30]:
model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)

[('France', 0.7884092330932617)]

We can do more! Let's track **semantic shifts** (e.g. historical changes in meaning).

In [32]:
change1 = brown_model.wv.most_similar('gay')
print("Most similar to 'gay' in the brown corpus:\n", change1[:5])

Most similar to 'gay' in the brown corpus:
 [('awfully', 0.7962538003921509), ('wonderfully', 0.7953144907951355), ('passionate', 0.79194575548172), ('ballad', 0.788561224937439), ('lonely', 0.7879823446273804)]


In [33]:
change2 = model.most_similar('gay')
print("Most similar to 'gay' in Google News:\n", change2[:5])

Most similar to 'gay' in Google News:
 [('homosexual', 0.8145634531974792), ('homosexuals', 0.7562745809555054), ('lesbians', 0.7516927719116211), ('queer', 0.6972684264183044), ('Gay', 0.6740463376045227)]
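
One simple way to put a number on such a shift is to measure how much the two neighbour lists overlap. The sketch below computes the Jaccard overlap of the two top-5 lists printed above. This is only a rough heuristic chosen for illustration, not an established measure from this notebook, and `jaccard` is a helper defined here.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two lists."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

# Top-5 neighbours of 'gay', copied from the two outputs above.
brown_neighbors = ["awfully", "wonderfully", "passionate", "ballad", "lonely"]
news_neighbors = ["homosexual", "homosexuals", "lesbians", "queer", "Gay"]

print(jaccard(brown_neighbors, news_neighbors))
```

The two lists share no words at all, so the overlap is zero. With such short lists and corpora of very different sizes, the number should be read as a hint rather than as evidence.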


## Biases

Relying on frequency patterns in human-generated data to make inferences has some problems...

In [34]:
compare1 = model.similarity('she','engineer')
print("The similarity between 'she' and 'engineer':\n", compare1)

compare2 = model.similarity('he','engineer')
print("The similarity between 'he' and 'engineer':\n", compare2)

The similarity between 'she' and 'engineer':
 0.0032564793
The similarity between 'he' and 'engineer':
 0.107617


In [35]:
compare1 = model.similarity('woman','nurse')
print("The similarity between 'woman' and 'nurse':\n", compare1)

compare2 = model.similarity('man','nurse')
print("The similarity between 'man' and 'nurse':\n", compare2)

The similarity between 'woman' and 'nurse':
 0.44135568
The similarity between 'man' and 'nurse':
 0.25472283


In [38]:
compare1 = model.similarity('black','criminal') 
print("The similarity between 'black' and 'criminal':\n", compare1)

compare2 = model.similarity('white','criminal') 
print("The similarity between 'white' and 'criminal':\n", compare2)

The similarity between 'black' and 'criminal':
 0.08380781
The similarity between 'white' and 'criminal':
 0.04107798
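
To compare these pairs at a glance, we can look at the difference between the two similarities for each target word. The sketch below reuses the values printed above. The helper `similarity_gap` and the layout of the dictionary are choices made for this example, and a raw difference like this is only a crude indicator of bias.

```python
def similarity_gap(sim_a, sim_b):
    """Difference between two similarity scores for the same target word."""
    return sim_a - sim_b

# (word_a, word_b, target) -> (similarity(word_a, target), similarity(word_b, target))
# Values copied from the outputs above.
reported = {
    ("she", "he", "engineer"): (0.0032564793, 0.107617),
    ("woman", "man", "nurse"): (0.44135568, 0.25472283),
    ("black", "white", "criminal"): (0.08380781, 0.04107798),
}

for (word_a, word_b, target), (sim_a, sim_b) in reported.items():
    gap = similarity_gap(sim_a, sim_b)
    print(f"{target}: sim({word_a}) - sim({word_b}) = {gap:+.3f}")
```

A positive gap means the target is closer to the first word of the pair, and a negative gap means it is closer to the second.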
