# Based on: `GloVe: Global Vectors for Word Representation` by Jeffrey Pennington, Richard Socher, Christopher D. Manning, https://nlp.stanford.edu/projects/glove/

### Imports

In [2]:
import numpy as np

### Download pre-trained word vectors

In [1]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip 'glove.6B.zip'

--2023-10-19 05:32:05--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-10-19 05:32:05--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2023-10-19 05:34:45 (5.16 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


The archive contains four files, each holding embedding vectors of a different dimensionality (50, 100, 200, or 300).

### Function to parse downloaded files

In [8]:
def load_data(file_name):
  '''
  Loads data into a list of words and a dictionary that maps each word to its corresponding
  embedding vector

  Args:
    file_name (string): name of the file to load

  Returns:
    words (list): list of words in Glove embedding
    word_to_vec (dict): dictionary mapping word to embedding
  '''
  # Initialize list of words and mapping dictionary
  words = []
  word_to_vec = {}

  # GloVe files are plain UTF-8 text: one word followed by its vector per line
  with open(file_name, 'r', encoding='utf-8') as file:
    for line in file:
      line_splitted = line.split()
      # First entry in each line is the embedded word
      word = line_splitted[0]
      words.append(word)
      # The rest is the vector representation of the word (converted to floats)
      word_to_vec[word] = np.array(line_splitted[1:], dtype=float)

  return words, word_to_vec
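
To make the file format concrete, here is a minimal sketch of how a single line is parsed. The sample line is a shortened, illustrative one (an assumption for brevity): a real line in `glove.6B.50d.txt` carries 50 values after the word.

```python
import numpy as np

# Shortened, illustrative line: the word followed by its space-separated vector entries
sample_line = "the 0.418 0.24968 -0.41242\n"

tokens = sample_line.split()
word = tokens[0]                             # first token is the word itself
vector = np.array(tokens[1:], dtype=float)   # remaining tokens are the vector entries

print(word, vector.shape)  # the (3,)
```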


### Load 50-dimensional word representation

In [9]:
words, word_to_vec = load_data('glove.6B.50d.txt')

In [11]:
# Check if everything works
print(words[0]) #the
print(word_to_vec['the'])
print(type(word_to_vec['the']))

the
[ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]
<class 'numpy.ndarray'>


The similarity between two words represented by vectors $v$ and $w$ is defined as the cosine of the angle $\theta$ between them:

$\text{CosineSimilarity}(v, w) = \frac {v \cdot w} {||v|| \, ||w||} = \cos(\theta) $

In [13]:
def cosine_similarity(v, w):
  '''
  Computes cosine similarity measure between two vectors

  Args:
    v (1-d array): vector representation of a word
    w (1-d array): vector representation of a word

  Returns:
    similarity (float): cosine similarity between v and w, in [-1, 1]
  '''
  # Calculate norms of both vectors
  norm_v = np.linalg.norm(v)
  norm_w = np.linalg.norm(w)

  # Avoid division by 0 (by convention, return 0 if either vector is zero)
  if np.isclose(norm_v * norm_w, 0, atol=1e-32):
    return 0

  # Calculate cosine similarity between v and w
  similarity = np.dot(v, w)/(norm_v * norm_w)
  return similarity
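
As a quick sanity check of the formula, the sketch below evaluates it on a few hand-picked 2-d vectors. The helper `cosine` is a hypothetical stand-in that restates the formula without the zero-vector guard, so that the block runs on its own.

```python
import numpy as np

def cosine(v, w):
    # Plain cosine similarity; assumes neither vector is zero
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

e1 = np.array([1.0, 0.0])

same_direction = cosine(e1, np.array([2.0, 0.0]))       # parallel vectors
orthogonal = cosine(e1, np.array([0.0, 3.0]))           # perpendicular vectors
opposite_direction = cosine(e1, np.array([-4.0, 0.0]))  # anti-parallel vectors

print(same_direction, orthogonal, opposite_direction)  # 1.0 0.0 -1.0
```

The length of the vectors does not matter, only the angle between them.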

### Find analogies between words

The goal is to find $?$ such that $a$ is to $b$ as $c$ is to $?$. To do this we use vector representations of words. If $a$ relates to $b$ the same way $c$ relates to $?$, then $v_b - v_a \approx v_? - v_c$, so we look for the word whose vector $v_?$ maximizes $\text{CosineSimilarity}(v_b - v_a, v_? - v_c)$.

In [16]:
def find_analogy(word_a, word_b, word_c, word_to_vec = word_to_vec):
  '''
  Finds analogy as described above: a is to b as c is to __

  Args:
    word_a, word_b, word_c (strings): words composing an analogy
    word_to_vec (dict): dictionary mapping each word to its corresponding vector

  Returns:
    word_d (string): word that best matches the analogy
  '''
  # Get word embeddings for word_a, word_b, word_c
  vec_a = word_to_vec[word_a]
  vec_b = word_to_vec[word_b]
  vec_c = word_to_vec[word_c]

  # Loop through all available words and find word_d such that
  # CosineSimilarity(vec_b - vec_a, vec_d - vec_c) is as high as possible
  word_d = None
  best_score = -100
  all_words = list(word_to_vec.keys())

  for word in all_words:
    # To avoid trivially returning word_c itself, skip it
    if word == word_c:
      continue
    # Get embedding for word
    vec_word = word_to_vec[word]
    # Calculate similarity between v_b - v_a and v_word - v_c
    score = cosine_similarity(vec_b - vec_a, vec_word - vec_c)
    # Check if this gives best score so far
    if score > best_score:
      best_score = score
      word_d = word

  return word_d
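
Looping over every word in Python is slow for a large vocabulary. The sketch below shows one possible vectorized variant of the same search; it is an illustration under stated assumptions, not part of the original notebook. The name `find_analogy_vectorized`, the `exclude_all_inputs` flag, and the toy 2-d embeddings are all hypothetical.

```python
import numpy as np

def find_analogy_vectorized(word_a, word_b, word_c, word_to_vec, exclude_all_inputs=False):
    # Stack all embeddings into one (n_words, dim) matrix
    words = list(word_to_vec.keys())
    matrix = np.stack([word_to_vec[w] for w in words])

    target = word_to_vec[word_b] - word_to_vec[word_a]   # v_b - v_a
    diffs = matrix - word_to_vec[word_c]                 # v_word - v_c for every word

    # Cosine similarity of every row of diffs with target (0 where a norm is zero)
    denom = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    scores = np.zeros(len(words))
    valid = denom > 0
    scores[valid] = (diffs[valid] @ target) / denom[valid]

    # Always exclude word_c; optionally exclude word_a and word_b as well
    excluded = {word_c} | ({word_a, word_b} if exclude_all_inputs else set())
    for w in excluded:
        scores[words.index(w)] = -np.inf

    return words[int(np.argmax(scores))]

# Toy 2-d embeddings, made up purely for illustration
toy = {
    'man': np.array([1.0, 0.0]),
    'woman': np.array([1.0, 1.0]),
    'king': np.array([3.0, 0.0]),
    'queen': np.array([3.0, 1.0]),
    'apple': np.array([0.0, 5.0]),
}

print(find_analogy_vectorized('man', 'woman', 'king', toy))  # queen
```

With `exclude_all_inputs=True` the search can never return `word_a` or `word_b`, which is a common convention in analogy evaluation.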

In [18]:
# Test find_analogy function
analogies = [('poland', 'polish', 'england'),
             ('man', 'king', 'woman'),
             ('man', 'woman', 'boy'),
             ('italy', 'rome', 'spain')]

for elem in analogies:
  print(f"{elem[0]} is to {elem[1]} as {elem[2]} is to {find_analogy(elem[0], elem[1], elem[2])}")

poland is to polish as england is to scottish
man is to king as woman is to king
man is to woman as boy is to girl
italy is to rome as spain is to rome


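As we can see, it works rather poorly: only one of the four analogies (`boy` → `girl`) is answered correctly. In two cases (`king`, `rome`) the function simply returns `word_b`, which, unlike `word_c`, is not excluded from the candidates. Let's try higher-dimensional embeddings (300-dim instead of 50-dim).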

In [19]:
# Load 300-dim embeddings
words_300, word_to_vec_300 = load_data('glove.6B.300d.txt')

In [20]:
# Comparison between 50-dim and 300-dim embeddings:

analogies = [('poland', 'polish', 'england'),
             ('man', 'king', 'woman'),
             ('man', 'woman', 'boy'),
             ('italy', 'rome', 'spain')]
print("50-dim EMBEDDING FIND ANALOGY TASK:")
for elem in analogies:
  print(f"{elem[0]} is to {elem[1]} as {elem[2]} is to {find_analogy(elem[0], elem[1], elem[2])}")

print("300-dim EMBEDDING FIND ANALOGY TASK:")
for elem in analogies:
  print(f"{elem[0]} is to {elem[1]} as {elem[2]} is to {find_analogy(elem[0], elem[1], elem[2], word_to_vec = word_to_vec_300)}")

50-dim EMBEDDING FIND ANALOGY TASK:
poland is to polish as england is to scottish
man is to king as woman is to king
man is to woman as boy is to girl
italy is to rome as spain is to rome
300-dim EMBEDDING FIND ANALOGY TASK:
poland is to polish as england is to english
man is to king as woman is to king
man is to woman as boy is to girl
italy is to rome as spain is to rome


As we can see, the 300-dim representation performs slightly better than the 50-dim one: the first analogy now returns `english`, while the other three answers are unchanged.

## Remove gender bias from word embeddings

To motivate the following, let's calculate the cosine similarity between the vector $g = v_{man} - v_{woman}$, which should (roughly) represent gender, and a couple of other word vectors.

In [21]:
g = word_to_vec['man'] - word_to_vec['woman']

words_to_check = ['computer', 'doctor', 'physics', 'science', 'nurse', 'singer', 'fashion', 'teacher']

for word in words_to_check:
  print(word, cosine_similarity(word_to_vec[word], g))

computer 0.10330358873850498
doctor -0.11895289410935041
physics 0.09697462160304735
science 0.06082906540929701
nurse -0.38030879680687524
singer -0.1850051813649629
fashion -0.03563894625772699
teacher -0.17920923431825664


As we can see, some words tend to have gender stereotypes encoded in them: with $g = v_{man} - v_{woman}$, positive values lean towards `man` and negative values towards `woman`. Let's try to remove gender bias from word vectors as proposed in the paper `Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings` https://arxiv.org/abs/1607.06520. The idea is to project every word vector onto the subspace orthogonal to $g$ (the vector representing gender). That means we need to subtract from each vector $v$ its projection onto $g$:

$v^{debiased} = v - \frac{v \cdot g}{||g||^2} g$
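
As a quick check that this projection removes the gender component (a short derivation added for illustration, writing $v^{debiased}$ for the projected vector), take the dot product of the result with $g$:

$v^{debiased} \cdot g = v \cdot g - \frac{v \cdot g}{||g||^2} (g \cdot g) = v \cdot g - v \cdot g = 0$

So the cosine similarity between a debiased vector and $g$ is zero, up to floating-point rounding.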

In [23]:
# The paper assumes all word vectors have unit L2 norm
word_to_vec_unit_vectors = {
    word: embedding / np.linalg.norm(embedding)
    for word, embedding in word_to_vec.items()
}
# Note: g_unit is the difference of two unit vectors; it is not itself normalized
g_unit = word_to_vec_unit_vectors['man'] - word_to_vec_unit_vectors['woman']

In [24]:
def remove_gender_bias(word, g = g_unit, word_to_vec = word_to_vec_unit_vectors):
  '''
  Removes gender bias from word as described above

  Args:
    word (string): word to be debiased
    g (1-d array): vector representation of gender (v_{man} - v_{woman})
    word_to_vec (dict): dictionary mapping each word to its corresponding vector

  Returns:
    v_deb (1-d array): debiased word representation for word
  '''
  # Get vector representation of word
  vec_word = word_to_vec[word]

  # Subtract from vec_word its projection onto g
  v_deb = vec_word - (np.dot(vec_word, g)/np.linalg.norm(g)**2)*g

  return v_deb

In [27]:
# Test if everything works
word = 'nurse'

print(f"Cosine similarity before debiasing for '{word}' is {cosine_similarity(word_to_vec_unit_vectors[word], g_unit)}")

print(f"Cosine similarity after debiasing for '{word}' is {cosine_similarity(remove_gender_bias(word), g_unit)}")

Cosine similarity before debiasing for 'nurse' is -0.3008480330254639
Cosine similarity after debiasing for 'nurse' is -4.6672835363612994e-17


### Debiasing pairs of words

To see another problem that can arise, let's try to (gender) debias words `actor`, `actress` and `babysit`

In [35]:
vec_actor_debiased = remove_gender_bias('actor')
vec_actress_debiased = remove_gender_bias('actress')
vec_babysit_debiased = remove_gender_bias('babysit')

Now let's calculate cosine similarity between debiased words for `actor`, `actress` and `babysit`

In [36]:
print(f"Cosine similarity between actor and babysit is {cosine_similarity(vec_actor_debiased, vec_babysit_debiased)}")
print(f"Cosine similarity between actress and babysit is {cosine_similarity(vec_actress_debiased, vec_babysit_debiased)}")

Cosine similarity between actor and babysit is 0.04888115571301206
Cosine similarity between actress and babysit is -0.012170527529280502


As we can see, even though both words were debiased, `actor` and `actress` are not equally similar to `babysit`. Equalization addresses this for a chosen pair of words: the key idea is to make both words of the pair equidistant from the 49-dimensional subspace $g_\perp$. The equalization step also ensures that the two equalized words are now the same distance from $e_{babysit}^{debiased}$, or from any other word that has been neutralized. See Bolukbasi et al. For a pair of words with embeddings $e_{w1}$ and $e_{w2}$, the steps are:

* $\mu = \frac{e_{w1} + e_{w2}}{2}$

* $\mu_{B} = \frac{\mu \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} \, \text{bias_axis}$

* $\mu_{\perp} = \mu - \mu_{B}$

* $e_{w1B} = \frac{e_{w1} \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} \, \text{bias_axis}$

* $e_{w2B} = \frac{e_{w2} \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} \, \text{bias_axis}$

* $e_{w1B}^{corrected} = \sqrt{1 - ||\mu_{\perp}||_2^2} \, \frac{e_{w1B} - \mu_B}{||e_{w1B} - \mu_B||_2}$

* $e_{w2B}^{corrected} = \sqrt{1 - ||\mu_{\perp}||_2^2} \, \frac{e_{w2B} - \mu_B}{||e_{w2B} - \mu_B||_2}$

* $e_1 = e_{w1B}^{corrected} + \mu_{\perp}$

* $e_2 = e_{w2B}^{corrected} + \mu_{\perp}$
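
The steps above can be sketched in code as follows. This is a minimal illustration, not the paper's reference implementation: the helper names `project`, `unit` and `equalize` are hypothetical, and random 5-dimensional unit vectors stand in for real embeddings. It assumes the inputs have unit norm and that their projections onto the bias axis differ (otherwise the normalization divides by zero).

```python
import numpy as np

def project(v, axis):
    # Projection of v onto the direction given by axis
    return (np.dot(v, axis) / np.dot(axis, axis)) * axis

def unit(v):
    # Rescale v to unit L2 norm
    return v / np.linalg.norm(v)

def equalize(e_w1, e_w2, bias_axis):
    # Mean of the pair, split into bias and orthogonal components
    mu = (e_w1 + e_w2) / 2
    mu_B = project(mu, bias_axis)
    mu_perp = mu - mu_B

    # Bias components of each word
    e_w1B = project(e_w1, bias_axis)
    e_w2B = project(e_w2, bias_axis)

    # Rescale the centered bias components so the results have unit norm
    scale = np.sqrt(1 - np.dot(mu_perp, mu_perp))
    e_w1B_corrected = scale * (e_w1B - mu_B) / np.linalg.norm(e_w1B - mu_B)
    e_w2B_corrected = scale * (e_w2B - mu_B) / np.linalg.norm(e_w2B - mu_B)

    return e_w1B_corrected + mu_perp, e_w2B_corrected + mu_perp

# Random stand-ins for real embeddings (illustration only)
rng = np.random.default_rng(0)
bias_axis = rng.normal(size=5)
e_w1 = unit(rng.normal(size=5))
e_w2 = unit(rng.normal(size=5))

# A "neutralized" vector: its component along bias_axis has been removed
raw = rng.normal(size=5)
neutral = unit(raw - project(raw, bias_axis))

e_1, e_2 = equalize(e_w1, e_w2, bias_axis)

# Both equalized vectors are unit-length and equally similar to the neutral vector
print(np.linalg.norm(e_1), np.linalg.norm(e_2))
print(np.dot(e_1, neutral), np.dot(e_2, neutral))
```

In the notebook's setting, `bias_axis` would be `g_unit` and the pair would be the unit vectors for `actor` and `actress`. After equalization both words have the same cosine similarity to any neutralized word such as `babysit`.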