# Week 8: Word Vectors

In this week's session we are going to look at word vectors. Word vectors (or *word embeddings*) are pretrained numerical representations of words within a high-dimensional vector space. The number of dimensions is a design choice made when the vectors are trained; the pretrained GloVe vectors we use here come in sizes from 50 to 300 dimensions.

Word vectors are calculated from very large datasets of text, with the goal that words that are similar end up close to each other in vector space and words that are dissimilar end up far apart. After processing a vast amount of data, we end up with one vector for every word in the vocabulary. Unlike the other representations of words we have seen (one-hot, bag of words or TF-IDF), these feature representations capture something of the meaning of each word.
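
To make the contrast with one-hot vectors concrete, here is a minimal sketch in plain Python. The three-dimensional "dense" vectors are made-up toy numbers, not real GloVe values, and the helper `dot` is our own. The sketch only illustrates the idea that dense vectors can express degrees of relatedness, while distinct one-hot vectors never overlap at all.

```python
# Toy illustration: one-hot vectors vs. dense word vectors.
# All numbers below are invented for illustration; they are NOT real embeddings.

def dot(a, b):
    """Dot product of two equal-length lists of numbers."""
    return sum(x * y for x, y in zip(a, b))

# One-hot: each word gets its own dimension, so distinct words never overlap.
one_hot = {
    "cat": [1, 0, 0],
    "dog": [0, 1, 0],
    "car": [0, 0, 1],
}

# Dense (hypothetical) vectors: related words point in similar directions.
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(dot(one_hot["cat"], one_hot["dog"]))  # distinct one-hot words: always 0
print(dot(one_hot["cat"], one_hot["car"]))  # also 0, so no notion of "closer"

cat_dog = dot(dense["cat"], dense["dog"])
cat_car = dot(dense["cat"], dense["car"])
print(cat_dog > cat_car)  # in this toy data, cat overlaps more with dog than with car
```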

As these word vectors are **numerical representations**, we can perform mathematical operations on them to gain some interesting (and revealing) insights into what kind of data and biases these models contain.

First, let's do some imports:

In [1]:
import torch
import torch.nn as nn
import torchtext.vocab as vocab

#### Download word vectors

Here we are going to load our set of word vectors using the torchtext library. We are downloading the [GloVe pretrained word embeddings](https://nlp.stanford.edu/projects/glove/) trained on a 2014 data dump of Wikipedia (combined with the Gigaword 5 news corpus). There are [other pretrained word embeddings](https://torchtext.readthedocs.io/en/latest/vocab.html#pretrained-word-embeddings) available in torchtext. You can try loading in other ones later and see how that affects the results.

This download is about 1GB, so you should **run this before the class**. If you haven't managed to do that and it is taking too long to download in class, kill the cell (or restart the kernel) and instead use the alternative in the cell following the next one.

In [4]:
word_vectors = vocab.GloVe(name="6B",dim=100) 

##### (Alternative) load a sub-sample of word vectors

If the previous cell is taking too long to download, you can uncomment this line to load in a sample of the top 30K word vectors from GloVe to use for this exercise:

In [5]:
#word_vectors = vocab.Vectors(name = '../data/glove.6B.100d.top30k.txt')

Let's take a look at one of our word vectors. It looks like a big list of numbers!

In [6]:
word_vectors['dog']

tensor([ 0.3082,  0.3094,  0.5280, -0.9254, -0.7367,  0.6348,  0.4420,  0.1026,
        -0.0914, -0.5661, -0.5327,  0.2013,  0.7704, -0.1398,  0.1373,  1.1128,
         0.8930, -0.1787, -0.0020,  0.5729,  0.5948,  0.5043, -0.2899, -1.3491,
         0.4276,  1.2748, -1.1613, -0.4108,  0.0428,  0.5487,  0.1890,  0.3759,
         0.5803,  0.6697,  0.8116,  0.9386, -0.5100, -0.0701,  0.8282, -0.3535,
         0.2109, -0.2441, -0.1655, -0.7836, -0.4848,  0.3897, -0.8636, -0.0164,
         0.3198, -0.4925, -0.0694,  0.0189, -0.0983,  1.3126, -0.1212, -1.2399,
        -0.0914,  0.3529,  0.6464,  0.0896,  0.7029,  1.1244,  0.3864,  0.5208,
         0.9879,  0.7995, -0.3462,  0.1409,  0.8017,  0.2099, -0.8601, -0.1531,
         0.0745,  0.4082,  0.0192,  0.5159, -0.3443, -0.2453, -0.7798,  0.2743,
         0.2242,  0.2016,  0.0174, -0.0147, -1.0235, -0.3970, -0.0056,  0.3057,
         0.3175,  0.0214,  0.1184, -0.1132,  0.4246,  0.5340, -0.1672, -0.2718,
        -0.6255,  0.1288,  0.6253, -0.52

Let's take a look at another one:

In [28]:
word_vectors['cream']

tensor([-0.7574,  0.1693, -0.7839, -0.1091,  0.0082,  0.7234,  1.4583, -0.0700,
         0.0121, -0.1024, -0.4730, -0.3713, -0.1743,  0.9255,  0.5588,  0.2687,
         0.5315, -0.8269,  0.0700, -0.1635, -0.4140,  0.8368, -0.3771, -0.3150,
        -0.1433,  1.3757,  0.2553, -0.8395, -0.4538, -0.6820,  0.7295,  0.6717,
        -0.2971, -0.7698, -0.1653,  0.6540,  0.3992,  0.4613,  0.1260, -1.4694,
         0.9445, -1.7318, -0.4817, -1.0355,  0.1341,  0.4327, -0.2064,  0.0087,
         0.6242, -0.9442, -0.2482, -0.3284, -0.1797,  1.2036, -0.8806, -1.0946,
        -0.4835,  0.7340,  0.5827,  0.3725,  0.6041,  0.4534,  0.0388, -0.1667,
         0.2082, -0.5358,  0.6453, -0.1996, -0.0616, -0.8759, -0.2334, -0.0343,
        -0.0174,  0.6223,  0.6372,  0.8106,  0.4091, -0.8603,  0.8655, -0.0143,
         0.1666, -0.4490, -0.2643,  1.0010, -0.1944, -0.8739,  0.3933,  0.0464,
         0.3095, -0.0749, -0.0024,  0.1525, -1.1183, -0.5085, -0.3071, -1.1481,
        -0.5662,  0.0923,  1.0424,  0.35

On their own, these word vectors are not particularly meaningful. Nobody looking at one would be able to make sense of its meaning.

Where word vectors become powerful is when we make comparisons between them. We can use the [cosine similarity](https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html) function in PyTorch to get a measure of similarity between our two vectors. 

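As this is a similarity measurement, the higher the value, the more similar the two vectors are. 1 is the highest value we can get (the vectors point in the same direction), 0 means the vectors are perpendicular, and -1 is the lowest value (the vectors point in opposite directions). Let's compare our word vectors: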

In [36]:
cosine_sim = nn.CosineSimilarity(dim=0)

similarity = cosine_sim(word_vectors['dog'], word_vectors['dog'])
print(f'The words dog and dog have a cosine similarity of {similarity.item():.6f}')

similarity = cosine_sim(word_vectors['dog'], word_vectors['phone'])
print(f'The words dog and phone have a cosine similarity of {similarity.item():.6f}')

The words dog and dog have a cosine similarity of 1.000000
The words dog and phone have a cosine similarity of 0.302810
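
To see what the cosine similarity is actually computing, here is a minimal sketch of the formula in plain Python: the dot product of the two vectors divided by the product of their lengths. The function name and the toy vectors are our own, not part of PyTorch, and unlike `nn.CosineSimilarity` this sketch does not guard against all-zero vectors.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|) for two equal-length lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # fails on all-zero vectors (division by zero)

v = [1.0, 2.0, 3.0]  # toy vector, not a real embedding

print(cosine_similarity(v, v))                    # same direction: close to 1
print(cosine_similarity(v, [2 * x for x in v]))   # same direction, twice as long: still close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # perpendicular: 0
print(cosine_similarity(v, [-x for x in v]))      # opposite direction: close to -1
```

Note that the score depends only on the direction of the vectors, not on their length.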


Now let's compare some more words:

In [9]:
similarity = cosine_sim(word_vectors['dog'], word_vectors['fox'])
print(f'The words dog and fox have a cosine similarity of {similarity.item():.6f}')
similarity = cosine_sim(word_vectors['cat'], word_vectors['fox'])
print(f'The words cat and fox have a cosine similarity of {similarity.item():.6f}')

The words dog and fox have a cosine similarity of 0.414250
The words cat and fox have a cosine similarity of 0.393508


Foxes are in the dog family (Canidae), so it makes sense that fox is slightly closer to dog than to cat, although the difference is small.

Now let's compare London to some cities around the world:

In [10]:
similarity = cosine_sim(word_vectors['london'], word_vectors['paris'])
print(f'The words london and paris have a cosine similarity of {similarity.item():.6f}')
similarity = cosine_sim(word_vectors['london'], word_vectors['madrid'])
print(f'The words london and madrid have a cosine similarity of {similarity.item():.6f}')
similarity = cosine_sim(word_vectors['london'], word_vectors['beirut'])
print(f'The words london and beirut have a cosine similarity of {similarity.item():.6f}')
similarity = cosine_sim(word_vectors['london'], word_vectors['beijing'])
print(f'The words london and beijing have a cosine similarity of {similarity.item():.6f}')

The words london and paris have a cosine similarity of 0.733768
The words london and madrid have a cosine similarity of 0.481906
The words london and beirut have a cosine similarity of 0.396094
The words london and beijing have a cosine similarity of 0.454936


And cities in the UK:

In [11]:
similarity = cosine_sim(word_vectors['london'], word_vectors['edinburgh'])
print(f'The words london and edinburgh have a cosine similarity of {similarity.item():.6f}')
similarity = cosine_sim(word_vectors['london'], word_vectors['glasgow'])
print(f'The words london and glasgow have a cosine similarity of {similarity.item():.6f}')
similarity = cosine_sim(word_vectors['glasgow'], word_vectors['edinburgh'])
print(f'The words glasgow and edinburgh have a cosine similarity of {similarity.item():.6f}')

The words london and edinburgh have a cosine similarity of 0.683086
The words london and glasgow have a cosine similarity of 0.669558
The words glasgow and edinburgh have a cosine similarity of 0.840325


And cities on the island of Ireland:

In [12]:
similarity = cosine_sim(word_vectors['london'], word_vectors['dublin'])
print(f'The words london and dublin have a cosine similarity of {similarity.item():.6f}')
similarity = cosine_sim(word_vectors['london'], word_vectors['belfast'])
print(f'The words london and belfast have a cosine similarity of {similarity.item():.6f}')
similarity = cosine_sim(word_vectors['dublin'], word_vectors['belfast'])
print(f'The words dublin and belfast have a cosine similarity of {similarity.item():.6f}')

The words london and dublin have a cosine similarity of 0.691274
The words london and belfast have a cosine similarity of 0.603107
The words dublin and belfast have a cosine similarity of 0.797412


#### Measure similarity between your own words

Try putting your own words in here to see the similarity scores:

In [13]:
word1 = ''  # replace with your own word
word2 = ''  # replace with your own word
similarity = cosine_sim(word_vectors[word1], word_vectors[word2])
print(f'These words have a cosine similarity of: {similarity.item():.6f}')

These words have a cosine similarity of: 0.000000
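
The score of 0.000000 above comes from leaving both words empty. As far as we can tell from the torchtext defaults, a word that is not in the vocabulary (including an empty string or a typo) is returned as an all-zero vector rather than raising an error, which is why the score silently comes out as zero. A minimal sketch of checking for this first is below; the helper `missing_words` and the plain dict standing in for the vocabulary are hypothetical.

```python
# A plain dict stands in for the real vocabulary here (invented toy values).
# With torchtext you could test `word in word_vectors.stoi` instead
# (assumption: `stoi` is the word-to-index dictionary that pairs with `itos`).
toy_vocab = {"dog": [0.3, 0.5], "cat": [0.4, 0.4]}

def missing_words(words, vocab):
    """Return the words that have no entry in the vocabulary."""
    return [w for w in words if w not in vocab]

print(missing_words(["dog", "cat"], toy_vocab))      # nothing is missing
print(missing_words(["dog", "", "dgo"], toy_vocab))  # the empty string and the typo are missing
```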


### Finding closest words

The following function will let us look for the closest words in vector space to a target word. It calculates this using the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) instead of the cosine similarity, so here a lower value means the words are closer together.

In [14]:
# This function is sourced from: https://www.cs.toronto.edu/~lczhang/321/lec/glove_notes.html
def print_closest_words(vec, n=5):
    dists = torch.norm(word_vectors.vectors - vec, dim=1)       # Euclidean distance from vec to every word
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1])  # sort (index, distance) pairs, nearest first
    # skip the single nearest entry (the word itself when vec comes straight from the vocabulary), then take the next n
    for idx, difference in lst[1:n+1]:
        print(word_vectors.itos[idx], difference)
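
To make the distance calculation less of a black box, here is a minimal plain-Python sketch of the same idea on a tiny invented vocabulary. The function names and the two-dimensional vectors are hypothetical, chosen only for illustration. Unlike `print_closest_words`, this sketch does not skip the nearest entry, so the target word itself shows up first at distance 0.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_words(target, vocab, n=2):
    """Return the n (word, distance) pairs nearest to `target`, nearest first."""
    dists = [(word, euclidean_distance(target, vec)) for word, vec in vocab.items()]
    return sorted(dists, key=lambda pair: pair[1])[:n]

# Invented 2-D vectors, for illustration only.
toy_vocab = {
    "dog": [1.0, 1.0],
    "cat": [1.0, 2.0],
    "car": [5.0, 5.0],
}

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # the classic 3-4-5 triangle
print(closest_words(toy_vocab["dog"], toy_vocab))  # dog itself first, then cat
```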

In [15]:
print_closest_words(word_vectors["dog"], n=10)

cat 2.681131
dogs 3.2425272
puppy 3.950055
pet 3.9634414
horse 4.328852
pig 4.4629855
cats 4.518958
animal 4.5231004
rabbit 4.547051
boy 4.598282


In [16]:
print_closest_words(word_vectors["london"], n=10)

sydney 4.234696
paris 4.6191545
melbourne 4.6299014
dublin 4.6676564
edinburgh 4.843591
glasgow 4.863154
york 4.871911
opened 5.031126
birmingham 5.0848064
amsterdam 5.094095


In [17]:
print_closest_words(word_vectors["camberwell"], n=10)

hornsey 3.5537856
islington 3.5999157
croydon 3.648258
plaistow 3.7073412
shoreditch 3.7357748
coulsdon 3.8395576
eltham 3.9164743
highgate 3.939703
southfields 3.9632952
beckenham 3.971325


In [18]:
print_closest_words(word_vectors["potato"], n=10)

potatoes 3.9128337
peanut 3.9875195
tomato 4.1266975
bean 4.1955256
pumpkin 4.199014
baked 4.2010217
bread 4.3381357
fried 4.392579
toast 4.4026785
mashed 4.424257


In [19]:
print_closest_words(word_vectors["doctor"], n=10)

physician 3.6094282
nurse 3.8185012
patient 4.220124
dentist 4.256469
dr. 4.27268
surgeon 4.3080616
psychiatrist 4.4265003
doctors 4.4447513
colleague 4.538765
pharmacist 4.575756


#### Try putting your own words into this function:

In [20]:
my_word = 'word'
print_closest_words(word_vectors[my_word], n=10)

phrase 2.7446012
meaning 2.9835567
words 3.1709633
name 3.6496246
literally 3.9742608
referred 4.0050144
refer 4.006663
refers 4.013667
simply 4.051273
instance 4.0517354


#### Doing arithmetic on word vectors

We can do arithmetic on word vectors to create new vectors:

In [21]:
new_word_vector = word_vectors['king'] - word_vectors['man'] + word_vectors['woman']
new_word_vector

tensor([-0.1023, -0.8129,  0.1021,  0.9859,  0.3422,  1.0910, -0.4891, -0.0562,
        -0.2103, -1.0300, -0.8685,  0.3679,  0.0196,  0.5926, -0.2319, -1.0169,
        -0.0122, -1.1719, -0.5233,  0.6065, -0.9854, -1.0010,  0.4891,  0.6301,
         0.5822,  0.1591,  0.4369, -1.2535,  0.9705, -0.0655,  0.7338,  0.4422,
         1.2092,  0.1970, -0.1595,  0.3436, -0.4622,  0.3377,  0.1479, -0.2496,
        -0.7709,  0.5227, -0.1283, -0.9188, -0.0176, -0.4404, -0.5266,  0.3373,
         0.6064, -0.4507, -0.0416,  0.0841,  1.3146,  0.6774, -0.2432, -2.0710,
        -0.6065,  0.1971,  0.6357,  0.0782,  0.4916,  0.0817,  0.7086,  0.2019,
         0.5156, -0.2303, -0.4047,  0.3921, -0.5093, -0.1392,  0.2161, -0.6287,
         0.0889,  0.4917, -0.0664,  0.7610, -0.1944,  0.4113, -1.0448, -0.1480,
        -0.0984, -0.2512,  0.8090,  0.3631, -0.7820, -0.1048,  0.0834, -1.2407,
         0.6553, -0.9363,  0.6484, -0.5583,  0.4562,  0.2758, -1.5490, -0.1991,
        -0.5080, -0.1382,  0.2773, -0.75

Once again, this is not very interpretable. But we can use mathematical functions to learn more about the new word vectors we have created: 

In [22]:
similarity = cosine_sim(new_word_vector, word_vectors['man'])
print(f'Our new vector has a cosine similarity of {similarity.item():.6f} to the word man')
similarity = cosine_sim(new_word_vector, word_vectors['woman'])
print(f'Our new vector has a cosine similarity of {similarity.item():.6f} to the word woman')

Our new vector has a cosine similarity of 0.393379 to the word man
Our new vector has a cosine similarity of 0.557549 to the word woman


And we can use our search function to find the closest word vectors to our new vector in vector space:

In [23]:
print_closest_words(new_word_vector, n=10)

queen 4.081079
monarch 4.6429076
throne 4.9055004
elizabeth 4.921559
prince 4.981147
daughter 4.985715
mother 5.0640874
cousin 5.077497
princess 5.0786853
widow 5.1283097
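
Why might this arithmetic work at all? The following sketch uses hand-picked two-dimensional toy vectors, where we pretend the first axis means "royalty" and the second means "gender". These numbers and axis meanings are invented; real embeddings have no such neatly labelled dimensions and never land exactly on the target word, which is why we search for the nearest neighbours instead.

```python
# Invented 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
# These are hand-picked toy numbers, not real embeddings.
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def add(a, b):
    """Element-wise sum of two equal-length lists."""
    return [x + y for x, y in zip(a, b)]

def subtract(a, b):
    """Element-wise difference of two equal-length lists."""
    return [x - y for x, y in zip(a, b)]

# king - man removes the "gender" component; + woman adds the opposite one back.
result = add(subtract(toy["king"], toy["man"]), toy["woman"])
print(result)
print(result == toy["queen"])  # in this idealised toy space we land exactly on queen
```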


### Investigating bias in word vectors

Now let's use these tools to expose the biases encoded in word vectors.

If we subtract the vector for man from the vector for doctor, and add the vector for woman, the closest word vectors are:

In [24]:
print_closest_words(word_vectors['doctor'] - word_vectors['man'] + word_vectors['woman'])

nurse 4.2283154
physician 4.7054324
woman 4.8734255
dentist 4.969891
pregnant 5.014848


However, when we subtract woman from doctor and add man, we do not get the same effect:

In [25]:
print_closest_words(word_vectors['doctor'] - word_vectors['woman'] + word_vectors['man'])

man 4.8998694
dr. 5.05853
brother 5.144743
physician 5.1525483
taken 5.2571893


However, when we do the same thing with the word nurse, we do get the word doctor:

In [26]:
print_closest_words(word_vectors['nurse'] - word_vectors['woman'] + word_vectors['man'])

doctor 4.2283154
technician 4.7353873
sergeant 4.775118
physician 4.786661
paramedic 4.8385634
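
A simpler, complementary probe (a different technique from the vector arithmetic above) is to compare how similar a word is to each of two reference words using cosine similarity. The sketch below does this in plain Python on invented toy vectors; the function `lean` and all the numbers are hypothetical, and the toy data was chosen deliberately so that a skew shows up. With the real vectors you would plug in `word_vectors[...]` and see what the data actually says.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def lean(word, pole_a, pole_b, vectors):
    """Positive: `word` is more similar to pole_a; negative: more similar to pole_b."""
    return (cosine_similarity(vectors[word], vectors[pole_a])
            - cosine_similarity(vectors[word], vectors[pole_b]))

# Invented toy vectors, chosen so that the example shows a skew.
toy = {
    "man":    [1.0, 0.2],
    "woman":  [0.2, 1.0],
    "nurse":  [0.3, 0.9],
    "doctor": [0.6, 0.6],
}

print(lean("nurse", "woman", "man", toy))   # positive: leans towards "woman" in this toy data
print(lean("doctor", "woman", "man", toy))  # about zero: equally similar to both in this toy data
```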


#### Try investigating your own words for bias:

Plug in different words here and investigate your own kinds of bias. It does not have to be gender bias; it could be bias around race, class, sexuality, disability or something else.

In [27]:
original_word = 'academic'
negative_word = 'man'
positive_word = 'woman'
new_word_vector = word_vectors[original_word] - word_vectors[negative_word] + word_vectors[positive_word]
print_closest_words(new_word_vector)

undergraduate 5.0620193
graduate 5.3035765
student 5.346771
faculty 5.43235
educational 5.6167006


Have a go at [loading in](#download-word-vectors) some of the [different word vectors available on torchtext](https://torchtext.readthedocs.io/en/latest/vocab.html#pretrained-word-embeddings), or using a different dimensionality for the GloVe vectors, and then re-run the cells in this notebook. How does that impact the results? (You may want to make a copy of this notebook to make a side-by-side comparison.)