# Lab 6: A Quick Introduction to Word2Vec with Python and Gensim


## Setting up and getting some data

In [1]:
# We need a resource for our data
import nltk

In [2]:
# If you are running this in Colab, also run this cell; otherwise you can skip it
import warnings
warnings.filterwarnings('ignore')

In [2]:
# We are going to use the brown corpus
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package brown to C:\Users\Henry
[nltk_data]     Pham\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Let's start by printing a few sentences out of the "brown" corpus, to get an idea of what the data looks like.

In [3]:
brown_sent = brown.sents()
print(brown_sent[:3])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'to', 'investigate', 'reports', 'of', 'possible', '``', 'irregularities', "''", 'in', 'the', 'hard-fought', 'primary', 'which', 'was', 'won', 'by', 'Mayor-nominate', 'Ivan', 'Allen', 'Jr.', '.']]


## Building the Model

We don't want to build the whole model from scratch, so we will use the Gensim library instead.

In [4]:
from gensim.models import Word2Vec

We can now build an instance of the model!

In [5]:
# This is the whole model for the brown corpus (it might take a few minutes)!
brown_model = Word2Vec(brown_sent)

Let's look at an example!

In [6]:
test1 = brown_model.wv.most_similar('blue')
print("Most similar to 'blue':\n", test1[:3])

Most similar to 'blue':
 [('gray', 0.9581364393234253), ('red', 0.9496833682060242), ('brown', 0.9451567530632019)]


## Refining the model

Word2Vec takes a broad range of parameters. In our example above, we only specified where to get our sentences from and used the *default* settings for everything else. Let's now look at the most relevant ones (you can find a full list here: https://radimrehurek.com/gensim/models/word2vec.html). Gensim 4.0 renamed two of them; the older names are given in parentheses:

- **vector_size** (**size** before Gensim 4.0): The dimensionality of our embeddings (i.e. the length of each word vector).
- **window**: The maximum distance between the target word and the words that count as its context. The window size affects the type of similarity captured in the embeddings.
- **negative**: The number of negative samples (incorrect training-pair instances) that are drawn for each positive sample.
- **sg**: Training algorithm: 1 for skip-gram; otherwise CBOW.
- **min_count**: Ignores all words with total frequency lower than this.
- **epochs** (**iter** before Gensim 4.0): Number of iterations (epochs) over the corpus.
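To make the effect of **window** concrete, here is a minimal sketch of how (target, context) training pairs could be enumerated for skip-gram. The function `skipgram_pairs` and the toy sentence are made up for illustration; this is a simplified picture, not Gensim's actual training loop.

```python
def skipgram_pairs(sentence, window):
    """Return every (target, context) pair whose positions differ by at most `window`."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs


# Hypothetical toy sentence, not taken from the corpus
toy_sentence = ['the', 'jury', 'said', 'friday']

narrow_pairs = skipgram_pairs(toy_sentence, window=1)
wide_pairs = skipgram_pairs(toy_sentence, window=3)

print(narrow_pairs)
print(len(narrow_pairs), len(wide_pairs))
```

With `window=1` each word is only paired with its immediate neighbours, while `window=3` pairs every word in this short sentence with every other word. Larger windows therefore feed the model broader, more topical contexts.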

So let's now train our model by explicitly setting some of these parameters!

In [7]:
# This is the whole model (it's going to take a few minutes!)
brown_model = Word2Vec(brown_sent, window = 5, negative = 5, sg = 1, min_count = 5)

In [8]:
# We can do the same test as before
test = brown_model.wv.most_similar('blue')
print("Most similar to 'blue':\n", test[:3])

Most similar to 'blue':
 [('gray', 0.933104932308197), ('pale', 0.9232453107833862), ('pink', 0.9185042381286621)]
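The `min_count = 5` setting used above decides which words get a vector at all. The sketch below shows the idea on a tiny made-up corpus; `filter_vocab` is a hypothetical helper, not a Gensim function, and it only mimics the frequency cutoff.

```python
from collections import Counter


def filter_vocab(sentences, min_count):
    """Keep only the words that occur at least `min_count` times in total."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# Hypothetical toy corpus in the same list-of-token-lists shape as brown.sents()
toy_corpus = [['blue', 'sky'], ['blue', 'sea'], ['gray', 'sky']]

vocab = filter_vocab(toy_corpus, min_count=2)
print(sorted(vocab))
```

Here 'sea' and 'gray' occur only once, so they are dropped. In practice this means that querying a rare word in a trained model typically fails with a `KeyError`, which is worth keeping in mind when choosing words for the exercises.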


## Evaluating the Model

We are going to rely on our own **human intuitions** to decide how well the model is doing!

In [9]:
sim = brown_model.wv.similarity("cup", "water")
print("How similar is 'cup' to 'water':\n", sim)

sim = brown_model.wv.similarity("cup", "book")
print("How similar is 'cup' to 'book':\n", sim)

How similar is 'cup' to 'water':
 0.71435404
How similar is 'cup' to 'book':
 0.22795838
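The scores returned by `similarity` are cosine similarities between the two word vectors. A minimal pure-Python sketch of that computation is shown below; the three-dimensional vectors are invented for illustration and are not the model's real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (real embeddings have many more dimensions)
cup = [1.0, 2.0, 0.0]
water = [2.0, 4.0, 0.0]   # same direction as `cup`
book = [0.0, 0.0, 3.0]    # orthogonal to `cup`

print(cosine_similarity(cup, water))
print(cosine_similarity(cup, book))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.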


In [10]:
brown_test = brown_model.wv.most_similar('child')
print("Most similar to 'child':\n", brown_test[:3])

Most similar to 'child':
 [('artist', 0.8327049016952515), ('teacher', 0.8274621963500977), ('joy', 0.8111901879310608)]


We can do more complex comparisons, but some results will be less intuitive than others!

In [11]:
brown_test = brown_model.wv.most_similar(positive = ['child'], negative = ['person'])
print("Most similar to 'child' but dissimilar to 'person':\n", brown_test[:3])

Most similar to 'child' but dissimilar to 'person':
 [('your', 0.29096704721450806), ('living', 0.2397458702325821), ('health', 0.22753654420375824)]


### Let's try a few more interesting tests.

Which word is a mismatch in the sequence?

In [12]:
mismatch = brown_model.wv.doesnt_match(['teacher','professor','doctor','red','athlete','runner'])
print(mismatch)

red
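One way to think about this test: average the (length-normalized) vectors of all the words, then pick the word whose vector is least similar to that average. The sketch below follows that idea on invented two-dimensional vectors; `odd_one_out` is a hypothetical helper and only approximates what the library does internally.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean direction."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(vec[k] for vec in normed.values()) / len(normed)
                 for k in range(dim)])

    def score(word):
        return sum(a * b for a, b in zip(normed[word], mean))

    return min(normed, key=score)


# Hypothetical toy vectors: the professions point roughly the same way
toy_vectors = {
    'teacher': [1.0, 0.1],
    'professor': [0.9, 0.2],
    'doctor': [1.0, 0.0],
    'red': [0.0, 1.0],
}

print(odd_one_out(toy_vectors))
```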


Maybe not **just** semantic relations?

In [14]:
mismatch = brown_model.wv.doesnt_match(['running','swimming','singing','paper','reading','booking','catch'])
print(mismatch)

reading


In [15]:
compare = brown_model.wv.similarity('walk','walked') 
print("The similarity between 'walk' and 'walked':\n", compare)

compare = brown_model.wv.similarity('look','looked') 
print("The similarity between 'look' and 'looked':\n", compare)

compare = brown_model.wv.similarity('look','walk') 
print("The similarity between 'look' and 'walk':\n", compare)

The similarity between 'walk' and 'walked':
 0.77256596
The similarity between 'look' and 'looked':
 0.76084566
The similarity between 'look' and 'walk':
 0.76290905


## The choice of training data

As with the other parameters that we looked at, the **choice of training data** (our corpus) is essential in driving model performance.
For example, consider a very famous test case for Word2Vec: is the model able to derive the fact that "woman" is to "queen" what "man" is to "king"?

We can represent this question algebraically as:

$$\text{vector}(\text{woman}) + \text{vector}(\text{king}) - \text{vector}(\text{man}) = \text{vector}(\text{queen})$$
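The arithmetic in this equation can be sketched with hand-made vectors. In the toy space below the three dimensions stand for "royal", "male" and "female"; the vectors and the `analogy` helper are invented for illustration. The sketch adds and subtracts raw vectors and then looks for the nearest remaining word by cosine similarity, which is a simplification of what a real library does.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, positive, negative):
    """Add the `positive` vectors, subtract the `negative` ones, return the nearest other word."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # The query words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Hypothetical toy embeddings: [royal, male, female]
toy_vectors = {
    'king': [1.0, 1.0, 0.0],
    'man': [0.0, 1.0, 0.0],
    'woman': [0.0, 0.0, 1.0],
    'queen': [1.0, 0.0, 1.0],
    'apple': [0.2, 0.2, 0.2],
}

best = analogy(toy_vectors, positive=['woman', 'king'], negative=['man'])
print(best)
```

In this idealized space the answer is 'queen' by construction. Whether a trained model behaves this way depends on how well its training data supports those regularities.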

In [16]:
test = brown_model.wv.most_similar(positive=['woman','king'], negative=['man'], topn=1)
print(test)

[('singing', 0.8299245238304138)]


We got a *weird* result!

However, consider that the Brown corpus is fairly small (1M words) and fairly old. What would happen if we used a bigger, more recent corpus?

### Working with a pretrained model

Luckily, NLTK includes a pre-trained model. In particular, it includes part of a model trained on 100 billion words from the Google News Dataset. The full model is from https://code.google.com/p/word2vec/ (about 3 GB).

In [17]:
# we need to get the data
from nltk.data import find
nltk.download('word2vec_sample')

# we are going to use a pruned set
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

[nltk_data] Downloading package word2vec_sample to C:\Users\Henry
[nltk_data]     Pham\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping models\word2vec_sample.zip.


In [18]:
# This time we are **not** training it from scratch, we are just loading it in (it is still going to take a bit)!
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)
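Since `binary=False` means the vectors are stored as plain text, it can help to see what such a file looks like. In the word2vec text format the first line gives the vocabulary size and the vector dimension, and each following line holds a word followed by its numbers. The sketch below parses a tiny in-memory example; `load_word2vec_text` is a hypothetical helper written for illustration, and the values are made up.

```python
import io


def load_word2vec_text(handle):
    """Parse the word2vec text format: a 'vocab_size dim' header, then one word per line."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Hypothetical two-word file with three-dimensional vectors
toy_file = io.StringIO("2 3\nblue 0.1 0.2 0.3\nred 0.4 0.5 0.6\n")

vocab_size, dim, vectors = load_word2vec_text(toy_file)
print(vocab_size, dim, sorted(vectors))
```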

Let's do a sanity check!

In [19]:
model.most_similar("blue")[:3]

[('red', 0.7225173115730286),
 ('purple', 0.7134224772453308),
 ('white', 0.6606029868125916)]

Let's try our example once more!

In [20]:
model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)

[('queen', 0.7118193507194519)]

In [21]:
model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)

[('France', 0.7884091734886169)]

We can do more! Let's track **semantic shifts** (e.g. historical changes in meaning).

In [22]:
change1 = brown_model.wv.most_similar('gay')
print("Most similar to 'gay' in the brown corpus:\n", change1[:5])

Most similar to 'gay' in the brown corpus:
 [('lonely', 0.9122912883758545), ('Tommy', 0.906644880771637), ('unhappy', 0.900947630405426), ('awfully', 0.8988267779350281), ('gaiety', 0.8983256220817566)]


In [23]:
change2 = model.most_similar('gay')
print("Most similar to 'gay' in Google News:\n", change2[:5])

Most similar to 'gay' in Google News:
 [('homosexual', 0.8145633935928345), ('homosexuals', 0.7562745213508606), ('lesbians', 0.7516927719116211), ('queer', 0.6972684264183044), ('Gay', 0.6740463376045227)]
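One rough way to quantify such a shift is to measure how much the two neighbour lists overlap. The sketch below computes the Jaccard overlap of the top-5 neighbours printed above; the `jaccard` helper is an illustrative choice of metric, not something the lab prescribes.

```python
def jaccard(list_a, list_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(list_a), set(list_b)
    return len(set_a & set_b) / len(set_a | set_b)


# Neighbour lists copied from the two outputs above
brown_neighbours = ['lonely', 'Tommy', 'unhappy', 'awfully', 'gaiety']
news_neighbours = ['homosexual', 'homosexuals', 'lesbians', 'queer', 'Gay']

overlap = jaccard(brown_neighbours, news_neighbours)
print(overlap)
```

An overlap of 0 means the two models share none of their closest neighbours for this word, while identical lists would give 1. Note that the two models also differ in corpus size and genre, so a low overlap is suggestive of a meaning shift rather than proof of one.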


## Biases

Relying on frequency patterns in human-generated data to make inferences has some problems...

In [24]:
compare1 = model.similarity('she','engineer')
print("The similarity between 'she' and 'engineer':\n", compare1)

compare2 = model.similarity('he','engineer')
print("The similarity between 'he' and 'engineer':\n", compare2)

The similarity between 'she' and 'engineer':
 0.0032564793
The similarity between 'he' and 'engineer':
 0.107617


In [25]:
compare1 = model.similarity('woman','nurse')
print("The similarity between 'woman' and 'nurse':\n", compare1)

compare2 = model.similarity('man','nurse')
print("The similarity between 'man' and 'nurse':\n", compare2)

The similarity between 'woman' and 'nurse':
 0.44135568
The similarity between 'man' and 'nurse':
 0.25472283
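To compare these pairs more directly, we can look at the gap between the two similarity scores for the same target word. The sketch below uses the numbers printed above; `association_gap` is a hypothetical helper, and a simple difference is only one of several possible ways to summarize such a bias.

```python
def association_gap(similarities, word_a, word_b, target):
    """How much more similar `target` is to `word_a` than to `word_b`."""
    return similarities[(word_a, target)] - similarities[(word_b, target)]


# Similarity scores copied from the two outputs above
reported = {
    ('she', 'engineer'): 0.0032564793,
    ('he', 'engineer'): 0.107617,
    ('woman', 'nurse'): 0.44135568,
    ('man', 'nurse'): 0.25472283,
}

engineer_gap = association_gap(reported, 'he', 'she', 'engineer')
nurse_gap = association_gap(reported, 'woman', 'man', 'nurse')

print(round(engineer_gap, 3), round(nurse_gap, 3))
```

A positive gap means the target word sits closer to the first word of the pair. Both gaps here are positive, in line with the stereotyped associations discussed above.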


**Exercise 1. (5 points)**

Pick 2 words of your choice and use the code below to extract their 3 closest words from the Brown semantic space and the Google semantic space. Which model better captures your intuition? Sum up your considerations in a few sentences.

In [None]:
# NOTE: I just realized I should have used some gendered words but :p
# Word 1
# Brown: modify this code with a word of your choice
brown_test = brown_model.wv.most_similar('people')
print("Brown Corpus, most similar to 'people':\n", brown_test[:3])



# Google: modify this code with a word of your choice

google_test = model.most_similar("people")
print("Google Corpus, most similar to 'people':\n", google_test[:3])

Brown Corpus, most similar to 'people':
 [('Americans', 0.7912866473197937), ('Negroes', 0.7862326502799988), ('readers', 0.7780669331550598)]
Google Corpus, most similar to 'people':
 [('individuals', 0.5827619433403015), ('folks', 0.5794458985328674), ('citizens', 0.5653229355812073)]


In [31]:
# Word 2
# Brown: modify this code with a word of your choice

brown_test = brown_model.wv.most_similar('amazing')
print("Brown Corpus, most similar to 'amazing':\n", brown_test[:3])

# Google: modify this code with a word of your choice

google_test = model.most_similar("amazing")
print("Google Corpus, most similar to 'amazing':\n", google_test[:3])

Brown Corpus, most similar to 'amazing':
 [('contradiction', 0.9285436868667603), ('uncertain', 0.9285029768943787), ('god', 0.925239622592926)]
Google Corpus, most similar to 'amazing':
 [('incredible', 0.9054001569747925), ('awesome', 0.8282866477966309), ('unbelievable', 0.8201264142990112)]


*write your considerations for exercise 1 here*

The Google model captures my intuition much better for both of the words I chose. For my first word, I would expect 'people' to be a generic term with no specific group attached to it, yet the Brown model returned more specific groups and even included the dated term 'Negroes'. For my second word, 'amazing', I would also expect a generic term, yet the Brown model linked 'god' to it. Moreover, the Brown model returned words that are closer to something like 'unknown' than to my intuitive definition of amazing, which is 'cool', 'radical', or 'awesome'.

**Exercise 2. (5 points)**

Think of two more cases of implicit biases that you can test the model on (they can be based on gender as above, but it would be even better if you could think of other dimensions for bias). Then, modify the code below by switching w1 and w2 with words of your choice to test your idea. Did the model output what you expected? Summarize your conclusions in a couple of sentences.

In [35]:
#Bias example 1


compare1 = model.similarity('money', 'successful')
print("The similarity between 'money' and 'successful':\n", compare1)

compare2 = model.similarity('broke', 'successful')
print("The similarity between 'broke' and 'successful':\n", compare2)

The similarity between 'money' and 'successful':
 0.16267024
The similarity between 'broke' and 'successful':
 0.0012424086


In [36]:
#Bias example 2
compare3 = model.similarity('education', 'job')
print("The similarity between 'education' and 'job':\n", compare3)

compare4 = model.similarity('unemployed', 'job')
print("The similarity between 'unemployed' and 'job':\n", compare4)

The similarity between 'education' and 'job':
 0.22259279
The similarity between 'unemployed' and 'job':
 0.4136857


*write your considerations for exercise 2 here*

The model partly gave the results I expected. It showed a higher similarity between 'money' and 'successful' (0.16) than between 'broke' and 'successful' (0.001), which matches a common bias, although neither score is high in absolute terms. But it surprisingly linked 'unemployed' to 'job' (0.41) more strongly than 'education' to 'job' (0.22). This suggests the model reflects how words are used together in text, not just real-world values or assumptions: 'unemployed' and 'job' often appear in the same contexts even though they are opposed in meaning. It is a reminder that these models learn from patterns in text rather than from understanding meaning the way humans do. So while they can reflect real biases, they can also behave in unexpected ways based on word usage.