# Natural Language Processing

## Gensim Implementation

Author: Bingchen Wang

Last updated: 10 Jan, 2023

In [17]:
import torch
import gensim
import gensim.downloader as api
import numpy as np

## Load the Pre-trained Word2vec Google News Model

Here we use the <a href = "https://github.com/RaRe-Technologies/gensim-data">gensim downloader API</a> to load the model. A helpful Word2Vec demo can be found <a href= 'https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py'>here</a>.

In [2]:
wv = api.load("word2vec-google-news-300")

Check the dimension of the word embedding:

In [96]:
print(f"Dimensionality of a word vector: {wv['king'].shape}.")
print(f"Norm of the word vector for 'king': {np.linalg.norm(wv['king'])}")

Dimensionality of a word vector: (300,).
Norm of the word vector for 'king': 2.90225887298584


In [93]:
!python --version

Python 3.8.16


## Calculate the cosine similarity between two words

$$
\cos(\vec u, \vec v) = \frac{\vec u \cdot \vec v}{\Vert \vec u \Vert_2 \Vert \vec v \Vert_2}
$$

In [10]:
def cosine_similarity(u,v):
  # Special case: identical vectors (including two zero vectors, e.g. u = [0, 0] and v = [0, 0]) are given similarity 1
  if np.all(u == v):
    return 1
  # Calculate the components needed for the cosine similarity
  norm_u = np.linalg.norm(u, ord = 2)
  norm_v = np.linalg.norm(v, ord = 2)
  dot_product = np.dot(u,v)
  # Avoid the division by 0 issue
  if np.isclose(norm_u * norm_v, 0, atol = 1e-32):
    return 0
  
  # Use a local name that does not shadow the function itself
  similarity = dot_product/(norm_u*norm_v)
  
  return similarity
  

In [25]:
# Compare our self-defined function with the function provided by gensim
print(cosine_similarity(wv['king'], wv['queen']), wv.similarity('king','queen'))

# We can also use our function to check the similarity between differences
print(cosine_similarity(wv['king']- wv['queen'], wv['male'] - wv['female']))

0.6510956 0.6510957
0.27453682
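
As a quick sanity check of the formula, the sketch below evaluates it on small vectors whose angles are known by hand. The helper name `cosine` is illustrative and, unlike `cosine_similarity` above, it assumes neither input is the zero vector.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v (assumes neither is the zero vector)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

parallel = cosine([1, 2], [2, 4])      # same direction, expected close to 1
orthogonal = cosine([1, 0], [0, 3])    # right angle, expected close to 0
opposite = cosine([1, 2], [-1, -2])    # opposite direction, expected close to -1
diagonal = cosine([1, 0], [1, 1])      # 45 degrees, expected close to 1/sqrt(2)

print(parallel, orthogonal, opposite, diagonal)
```

The values depend only on direction, not on vector length, which is why `[1, 2]` and `[2, 4]` come out as (numerically) identical in this measure.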


## Analogical Reasoning

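Here we design a function that performs analogical reasoning: given words a, b and c, it looks for the word d such that "a is to b as c is to d". Note that gensim already provides built-in methods for this task, such as `most_similar`.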

In [22]:
print(gensim.__version__)
for index, word in enumerate(wv.index2word): # note that this attribute is replaced by index_to_key in later versions of gensim (4.0+)
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index2word)} is {word}")

3.6.0
word #0/3000000 is </s>
word #1/3000000 is in
word #2/3000000 is for
word #3/3000000 is that
word #4/3000000 is is
word #5/3000000 is on
word #6/3000000 is ##
word #7/3000000 is The
word #8/3000000 is with
word #9/3000000 is said
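
Because the vocabulary attribute is named `index2word` in gensim 3.x and `index_to_key` in gensim 4.x, a small helper can make the rest of the notebook independent of the installed version. This is a minimal sketch: the function name `vocabulary` is hypothetical, and the `SimpleNamespace` objects are stand-ins for a real `KeyedVectors` instance, used only to exercise both branches without loading the model.

```python
from types import SimpleNamespace

def vocabulary(keyed_vectors):
    """Return the index-ordered word list under either attribute name."""
    try:
        return keyed_vectors.index_to_key   # name used by gensim 4.x
    except AttributeError:
        return keyed_vectors.index2word     # name used by gensim 3.x

# Stand-ins for real KeyedVectors objects (assumption: only the attribute matters here)
new_style = SimpleNamespace(index_to_key=["</s>", "in", "for"])
old_style = SimpleNamespace(index2word=["</s>", "in", "for"])

words_new = vocabulary(new_style)
words_old = vocabulary(old_style)
print(words_new, words_old)
```

With the real model, `vocabulary(wv)` would be used wherever the cells in this notebook read `wv.index2word`.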


In [84]:
def complete_analogy(w_a,w_b,w_c,word2vec_map):
  e_a, e_b, e_c = word2vec_map[w_a], word2vec_map[w_b], word2vec_map[w_c]
  words = word2vec_map.index2word
  # Due to computation time, we limit the scope to the first 100k words
  cos_sim = [cosine_similarity(e_a - e_b, e_c - word2vec_map[w]) for w in words[:100000]]
  max_index = np.argmax(cos_sim)
  best_word =  words[max_index]
  return best_word

In [85]:
print(complete_analogy('man','boy','woman', wv))
print(complete_analogy('man','king','woman', wv))

girl
king
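
The Python loop in `complete_analogy` is the reason the search is capped at 100k words. The sketch below computes the same quantity, the cosine between `e_a - e_b` and `e_c - e_w`, for all candidates at once with NumPy broadcasting. The function name `complete_analogy_vectorized` and the toy 2-D embeddings are assumptions for illustration; zero-length differences are assigned similarity 0, matching the division-by-zero guard in `cosine_similarity`.

```python
import numpy as np

def complete_analogy_vectorized(e_a, e_b, e_c, vectors, words):
    """Return the word w maximizing cos(e_a - e_b, e_c - e_w), computed in bulk."""
    target = e_a - e_b                         # shape (d,)
    diffs = e_c - vectors                      # shape (n, d), broadcast over rows
    dots = diffs @ target                      # shape (n,)
    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    # Where the denominator is 0 (e.g. w equals the third input word), use similarity 0
    cos_sim = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    return words[int(np.argmax(cos_sim))]

# Toy embeddings (hypothetical): first axis ~ gender, second axis ~ age
words = ["man", "boy", "woman", "girl", "king"]
vectors = np.array([[1.0, 1.0], [1.0, 0.0], [-1.0, 1.0], [-1.0, 0.0], [1.0, 3.0]])
lookup = dict(zip(words, vectors))

best = complete_analogy_vectorized(lookup["man"], lookup["boy"], lookup["woman"], vectors, words)
print(best)
```

On the real model, one would presumably pass a slice of the embedding matrix (for example `wv.vectors[:100000]`) together with the matching slice of the vocabulary; this has not been run here.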


In [75]:
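# Locate 'queen' in the vocabulary (not displayed, because it is not the last expression in the cell)
np.argwhere(np.array(wv.index2word) == 'queen')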

print(cosine_similarity(wv['man'] - wv['king'], wv['woman']- wv['queen']))
print(cosine_similarity(wv['man'] - wv['king'], wv['woman']- wv['king']))

0.7580351
0.8824964


In [88]:
wv.most_similar(positive = ['woman', 'king'], negative = ['man'])

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]

In [91]:
wv.most_similar(positive = ['woman', 'footballer'], negative = ['man'])

[('footballers', 0.6041663885116577),
 ('sportswoman', 0.5877651572227478),
 ('actress', 0.5462757349014282),
 ('cricketer', 0.5384135246276855),
 ('supermodel', 0.5337250232696533),
 ('Footballer', 0.5334779620170593),
 ('popstar', 0.5287024974822998),
 ('model_Sophie_Anderton', 0.5286864042282104),
 ('Sarah_Marbeck', 0.5237902998924255),
 ('starlet', 0.5232126712799072)]

The differences between our `complete_analogy` function and gensim's `most_similar` suggest that the latter excludes certain words (specifically, the words used to pose the analogy) from consideration.
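
To make the effect of that exclusion concrete, here is a minimal sketch of the additive scheme `most_similar` is understood to follow: unit-normalize the input vectors, add the positive ones and subtract the negative ones, then rank all words by cosine similarity to the result. It is not gensim's actual code. The function `most_similar_sketch`, its `exclude_inputs` flag, and the 4-D toy embeddings are all assumptions for illustration.

```python
import numpy as np

def most_similar_sketch(positive, negative, embeddings, topn=3, exclude_inputs=True):
    """Rank words by cosine similarity to sum(positive) - sum(negative) of unit vectors."""
    words = list(embeddings)
    matrix = np.array([embeddings[w] for w in words], dtype=float)
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)  # unit-length rows
    index = {w: i for i, w in enumerate(words)}
    query = sum(unit[index[w]] for w in positive) - sum(unit[index[w]] for w in negative)
    query = query / np.linalg.norm(query)
    scores = unit @ query                       # cosine similarity to every word
    inputs = set(positive) | set(negative) if exclude_inputs else set()
    ranked = [(words[i], float(scores[i])) for i in np.argsort(-scores)
              if words[i] not in inputs]
    return ranked[:topn]

# Toy embeddings (hypothetical): axes ~ royalty, person, gender, unrelated noise
toy = {
    "man":   [0.0, 2.0, 1.0, 0.0],
    "woman": [0.0, 2.0, -1.0, 0.0],
    "king":  [3.0, 1.0, 0.5, 0.0],
    "queen": [3.0, 1.0, -0.5, 3.0],
    "apple": [0.0, 0.2, 0.0, 1.0],
}

with_inputs = most_similar_sketch(["woman", "king"], ["man"], toy, exclude_inputs=False)
without_inputs = most_similar_sketch(["woman", "king"], ["man"], toy, exclude_inputs=True)
print(with_inputs[0][0], without_inputs[0][0])
```

In this toy setup the query vector stays closest to `king` itself, so `king` ranks first unless the input words are filtered out, which mirrors the behaviour seen with `complete_analogy` above.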

In [82]:
def complete_analogy_improved(w_a,w_b,w_c,word2vec_map):
  e_a, e_b, e_c = word2vec_map[w_a], word2vec_map[w_b], word2vec_map[w_c]
  # Exclude the three input words; due to computation time, we limit the scope to the first 100k words
  candidates = [w for w in word2vec_map.index2word[:100000] if w not in {w_a,w_b,w_c}]
  # Keep the similarities in a numeric array: storing words and floats together
  # casts the floats to strings, so argmax would compare them as text
  cos_sim = np.array([cosine_similarity(e_a - e_b, e_c - word2vec_map[w]) for w in candidates])
  best_word = candidates[np.argmax(cos_sim)]

  return best_word

In [90]:
print(complete_analogy_improved('man','king','woman', wv))
print(complete_analogy_improved('male','footballer','female', wv))

queen
footballers


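Now the top-ranked words match those returned by gensim's `most_similar`. Note that the second call here uses `male`/`female`, whereas the `most_similar` call above used `man`/`woman`, and that `complete_analogy_improved` considers only the first 100,000 words in the vocabulary to save on computation.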

Interestingly, the analogical reasoning task does not reveal gender bias similar to that reported in <a href = "https://arxiv.org/abs/1607.06520">Bolukbasi et al. (2016)</a>. It should be noted that:
- Here, we did not normalize the word embeddings.
- The pretrained word2vec Google News embedding may have been updated since <a href = "https://arxiv.org/abs/1607.06520">Bolukbasi et al. (2016)</a>.
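
On the first point, the raw vectors are not unit length (the norm of `wv['king']` printed earlier is about 2.9), so difference vectors built from raw and from normalized embeddings need not point the same way. The sketch below illustrates this on two hypothetical 2-D vectors. The helper names `unit_normalize` and `cosine` are illustrative, and whether normalization changes the conclusions on the Google News vectors is not checked here.

```python
import numpy as np

def unit_normalize(vectors):
    """Scale each row to unit L2 norm (rows are assumed to be non-zero)."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two hypothetical embeddings with very different lengths
raw = np.array([[3.0, 4.0], [0.5, 0.0]])
normalized = unit_normalize(raw)

raw_diff = raw[0] - raw[1]                  # difference of raw vectors
unit_diff = normalized[0] - normalized[1]   # difference after normalization
agreement = cosine(raw_diff, unit_diff)     # 1.0 would mean identical directions

print(np.linalg.norm(normalized, axis=1), agreement)
```

Here the two difference vectors are clearly not parallel, so an analogy search based on differences can rank candidates differently depending on whether the embeddings were normalized first.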