# AKA: A Quick Introduction to Word2Vec with Python and Gensim
@Aniello De Santo, Feb 06, 2020

## Setting up and getting some data

In [None]:
# We need a resource for our data
import nltk

In [None]:
# If you are running this in Colab, also run this cell; otherwise you can skip it
import warnings
warnings.filterwarnings('ignore')

In [None]:
# We are going to use the brown corpus
nltk.download('brown')
from nltk.corpus import brown

Let's start by printing a few sentences from the Brown corpus to get an idea of what the data looks like.

In [None]:
brown_sent = brown.sents()
print(brown_sent[:3])

## Building the Model

We don't want to build the whole model from scratch, so we will use the Gensim library instead.

In [None]:
from gensim.models import Word2Vec

We can now build an instance of the model!

In [None]:
# This is the whole model for the brown corpus (it might take a few minutes)!
brown_model = Word2Vec(brown_sent)

Let's look at an example!

In [None]:
test1 = brown_model.wv.most_similar('blue')
print("Most similar to 'blue':\n", test1[:3])

## Refining the model

Word2Vec takes a broad range of parameters. In our example above, we only chose where to get our sentences from and used the *default* settings for everything else. Let's now look at the most relevant ones (you can find the full list here: https://radimrehurek.com/gensim/models/word2vec.html):

- **size**: The dimensionality of our embeddings (i.e. the length of each word vector). In Gensim 4.0 and later this parameter is named **vector_size**.
- **window**: The maximum distance between the target word and the words counted as its context. The size of the window affects the type of similarity captured in the embeddings.
- **negative**: The number of negative samples (incorrect training-pair instances) that are drawn for each positive (observed) training pair.
- **sg**: Training algorithm -- 1 for skip-gram; otherwise CBOW.
- **min_count**: Ignores all words with total frequency lower than this.
- **iter**: Number of iterations (epochs) over the corpus. In Gensim 4.0 and later this parameter is named **epochs**.
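
To make **window** and **sg** a bit more concrete, here is a minimal pure-Python sketch of how (target, context) pairs could be enumerated from one tokenized sentence. The helper `skipgram_pairs` is hypothetical and not part of Gensim, and the real training loop does more than this (for example, it draws the negative samples described above).

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token and each neighbour
    at most `window` positions away (hypothetical helper, not a Gensim API)."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Toy sentence, invented for illustration
sentence = ["the", "cup", "is", "blue"]
narrow_pairs = skipgram_pairs(sentence, window=1)
wide_pairs = skipgram_pairs(sentence, window=3)

print(narrow_pairs)
print(len(narrow_pairs), len(wide_pairs))
```

With skip-gram (sg = 1), each pair is a training example in which the target is used to predict the context word; CBOW instead predicts the target from its combined context. A wider window produces more, and more distant, pairs.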

So let's now train our model by explicitly setting some of these parameters!

In [None]:
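# This is the whole model (it's going to take a few minutes!)
# Parameter names below are for Gensim >= 4.0; on Gensim 3.x use size= and iter= instead of vector_size= and epochs=
brown_model = Word2Vec(brown_sent, vector_size=300, window=5, negative=5, sg=1, min_count=5, epochs=10)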

In [None]:
# We can do the same test as before
test = brown_model.wv.most_similar('blue')
print("Most similar to 'blue':\n", test[:10])

## Evaluating the Model

We are going to rely on our own **human intuitions** to decide how well the model is doing!

In [None]:
sim = brown_model.wv.similarity("cup", "water")
print("How similar is 'cup' to 'water':\n", sim)

sim = brown_model.wv.similarity("cup", "book")
print("How similar is 'cup' to 'book':\n", sim)
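
The number returned by `similarity` is the cosine similarity between the two word vectors. As a minimal sketch of that computation, the snippet below uses invented 3-dimensional toy vectors (not values from the trained model) and a hypothetical helper `cosine_similarity`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors, for illustration only
toy = {
    "cup":   [1.0, 2.0, 0.0],
    "water": [2.0, 4.0, 1.0],
    "book":  [0.0, 1.0, 3.0],
}

sim_cup_water = cosine_similarity(toy["cup"], toy["water"])
sim_cup_book = cosine_similarity(toy["cup"], toy["book"])
print(sim_cup_water, sim_cup_book)
```

Values close to 1 mean the vectors point in nearly the same direction; values near 0 mean they are close to orthogonal.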

In [None]:
brown_test = brown_model.wv.most_similar('child')
print("Most similar to 'child':\n", brown_test[:3])

We can do more complex comparisons, but some results will be less intuitive than others!

In [None]:
brown_test = brown_model.wv.most_similar(positive = ['child'], negative = ['person'])
print("Most similar to 'child' but dissimilar to 'person':\n", brown_test[:3])

### Let's try a few more interesting tests.

Which word is a mismatch in the sequence?

In [None]:
mismatch = brown_model.wv.doesnt_match(['teacher','professor','doctor','red','athlete','runner'])
print(mismatch)
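
One way to picture what `doesnt_match` does: normalise each word vector to unit length, average them, and return the word whose vector is least similar to that average. The sketch below follows that idea on invented 2-dimensional toy vectors; `odd_one_out` is a hypothetical helper, and the details of Gensim's own implementation may differ.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """vectors: dict mapping word -> list of floats (hypothetical helper)."""
    units = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(units.values())))
    # Mean of the unit vectors: the "central" direction of the group
    mean = [sum(u[k] for u in units.values()) / len(units) for k in range(dim)]
    # The word with the smallest dot product against the mean is the least central
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Invented toy vectors, for illustration only
toy = {
    "teacher":   [0.9, 0.1],
    "professor": [0.8, 0.2],
    "doctor":    [0.7, 0.3],
    "red":       [0.1, 0.9],
}

odd = odd_one_out(toy)
print(odd)
```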

Maybe the model captures more than **just** semantic relations?

In [None]:
mismatch = brown_model.wv.doesnt_match(['running','swimming','singing','paper','reading','booking','catch'])
print(mismatch)

In [None]:
compare = brown_model.wv.similarity('walk','walked') 
print("The similarity between 'walk' and 'walked':\n", compare)

compare = brown_model.wv.similarity('look','looked') 
print("The similarity between 'look' and 'looked':\n", compare)

compare = brown_model.wv.similarity('look','walk') 
print("The similarity between 'look' and 'walk':\n", compare)

## The choice of training data

As with the other parameters that we looked at, the **choice of training data** (our corpus) is essential in driving model performance.
For example, consider a very famous test case for Word2Vec: is the model able to derive the fact that "woman" is to "queen" what "man" is to "king"?

We can represent this question algebraically as:

$$\text{vector}(\text{woman}) + \text{vector}(\text{king}) - \text{vector}(\text{man}) \approx \text{vector}(\text{queen})$$
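
Before trying this on a real model, here is a minimal sketch of the arithmetic on invented 3-dimensional toy vectors, whose axes loosely stand for "royalty", "female" and "fruit". The helper `analogy` is hypothetical: it adds and subtracts the raw vectors and picks the nearest remaining word by cosine similarity, leaving out the query words themselves (Gensim's `most_similar` also leaves the input words out of its results).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative) (hypothetical helper)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    # Exclude the query words, otherwise one of them would usually win
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Invented toy vectors: [royalty, female, fruit]
toy = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.9, 0.9, 0.0],
    "man":   [0.1, 0.1, 0.0],
    "woman": [0.1, 0.9, 0.0],
    "apple": [0.1, 0.2, 0.9],
}

answer = analogy(toy, positive=["woman", "king"], negative=["man"])
print(answer)
```

The toy vectors were built so that the analogy works; whether a trained model gets it right depends on what it learned from its corpus.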

In [None]:
test = brown_model.wv.most_similar(positive=['woman','king'], negative=['man'], topn=1)
print(test)

We got a *weird* result!

However, keep in mind that the Brown corpus is fairly small (about 1M words) and fairly old (its texts date from 1961). What would happen if we used a bigger, more recent corpus?

### Working with a pretrained model

Luckily, NLTK includes a pruned sample of a pre-trained model that was trained on about 100 billion words from the Google News dataset. The full model is available from https://code.google.com/p/word2vec/ (about 3 GB).

In [None]:
# we need to get the data
from nltk.data import find
nltk.download('word2vec_sample')

# we are going to use a pruned set
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

In [None]:
# This time we are **not** training it from scratch; we are just loading it in (it is still going to take a bit)!
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)
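
With `binary=False`, the file is expected to be in the plain-text word2vec format: a header line with the vocabulary size and the vector dimensionality, followed by one line per word containing the word and its vector components separated by spaces. The sketch below parses a tiny invented example of that layout; `parse_word2vec_text` is a hypothetical helper for illustration, not how Gensim actually loads the file.

```python
import io

def parse_word2vec_text(handle):
    """Parse plain-text word2vec data from a file-like object (hypothetical helper)."""
    # Header line: "<vocabulary size> <vector dimensionality>"
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors

# Tiny invented file contents, for illustration only
toy_file = io.StringIO("2 3\nblue 0.1 0.2 0.3\nred 0.4 0.5 0.6\n")
toy_size, toy_dim, toy_vectors = parse_word2vec_text(toy_file)
print(toy_size, toy_dim, toy_vectors["blue"])
```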

Let's do a sanity check!

In [None]:
model.most_similar("blue")[:3]

Let's try our example once more!

In [None]:
model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)

In [None]:
model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)

We can do more! Let's track **semantic shifts** (e.g. historical changes in meaning).

In [None]:
change1 = brown_model.wv.most_similar('gay')
print("Most similar to 'gay' in the brown corpus:\n", change1[:5])

In [None]:
change2 = model.most_similar('gay')
print("Most similar to 'gay' in Google News:\n", change2[:5])

## Biases

Relying on frequency patterns in human-generated data to make inferences has some problems...

In [None]:
compare1 = model.similarity('she','engineer')
print("The similarity between 'she' and 'engineer':\n", compare1)

compare2 = model.similarity('he','engineer')
print("The similarity between 'he' and 'engineer':\n", compare2)

In [None]:
compare1 = model.similarity('woman','nurse')
print("The similarity between 'woman' and 'nurse':\n", compare1)

compare2 = model.similarity('man','nurse')
print("The similarity between 'man' and 'nurse':\n", compare2)

In [None]:
compare1 = model.similarity('black','criminal') 
print("The similarity between 'black' and 'criminal':\n", compare1)

compare2 = model.similarity('white','criminal') 
print("The similarity between 'white' and 'criminal':\n", compare2)