https://www.cs.toronto.edu/~lczhang/360/lec/w06/w2v.html

In [1]:
import torch
import torchtext

In [2]:
# The first time you run this, it downloads a ~823MB file
glove = torchtext.vocab.GloVe(name="6B", # trained on Wikipedia 2014 + Gigaword 5 (6B tokens)
                              dim=50)   # embedding size = 50

In [6]:
glove['cat']

tensor([ 0.4528, -0.5011, -0.5371, -0.0157,  0.2219,  0.5460, -0.6730, -0.6891,
         0.6349, -0.1973,  0.3368,  0.7735,  0.9009,  0.3849,  0.3837,  0.2657,
        -0.0806,  0.6109, -1.2894, -0.2231, -0.6158,  0.2170,  0.3561,  0.4450,
         0.6089, -1.1633, -1.1579,  0.3612,  0.1047, -0.7832,  1.4352,  0.1863,
        -0.2611,  0.8328, -0.2312,  0.3248,  0.1449, -0.4455,  0.3350, -0.9595,
        -0.0975,  0.4814, -0.4335,  0.6945,  0.9104, -0.2817,  0.4164, -1.2609,
         0.7128,  0.2378])

Measuring Distance  

Euclidean distance

In [None]:
x = glove['cat']
y = glove['dog']
torch.norm(y - x)

Cosine similarity

Cosine similarity is the cosine of the angle between two vectors. It depends only on the directions of the vectors, not on their magnitudes.

In [9]:
x = glove['cat']
y = glove['dog']
torch.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0))

tensor([0.9218])
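
To make the difference between the two measures concrete, here is a minimal sketch in plain Python that writes out both formulas by hand. The vectors `x`, `y`, and `y_scaled` are made-up 2-D toy values, not GloVe embeddings, and the helper names are illustrative only. Scaling `y` by 10 changes the Euclidean distance but leaves the cosine similarity unchanged.

```python
import math

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not real GloVe embeddings
x = [3.0, 4.0]
y = [4.0, 3.0]
y_scaled = [10 * v for v in y]  # same direction as y, 10x the magnitude

print("euclidean:", euclidean_distance(x, y), euclidean_distance(x, y_scaled))
print("cosine:   ", cosine_similarity(x, y), cosine_similarity(x, y_scaled))
```

The first line of output shows two very different distances, while the second shows the same similarity twice.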

In fact, we can look through our entire vocabulary for words that are closest to a point in the embedding space -- for example, we can look for words that are closest to another word like "cat".

In [10]:
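def print_closest_words(v, n=5):
    dists = torch.norm(glove.vectors - v, dim=1)     # Euclidean distance from v to every word vector
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1]) # sort (index, distance) pairs by distance
    # Skip lst[0]: when v is a word's own vector, the nearest entry is that word itself
    for idx, difference in lst[1:n+1]:         # take the next n closest words
        print(glove.itos[idx], difference)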

In [12]:
print_closest_words(glove['nurse'])

doctor 3.1274529
dentist 3.1306612
nurses 3.26872
pediatrician 3.3212206
counselor 3.3987114
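
The same search can rank neighbours by cosine similarity instead of Euclidean distance. The sketch below is one way to do that, under the assumption of a tiny made-up embedding table (`toy_vectors`) in place of `glove.vectors`; the function name `closest_words_by_cosine` is hypothetical. Note that higher similarity means closer, so the sort order is reversed relative to a distance-based search.

```python
import math

# Hypothetical toy embedding table; real GloVe vectors here have 50 dimensions
toy_vectors = {
    "cat": [1.0, 0.9],
    "dog": [0.9, 1.0],
    "kitten": [2.0, 1.7],
    "car": [-1.0, 0.2],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

def closest_words_by_cosine(query, vectors, n=2):
    """Return the n words most similar to `query`, excluding `query` itself."""
    q = vectors[query]
    scored = [(w, cosine_similarity(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:n]

for word, sim in closest_words_by_cosine("cat", toy_vectors):
    print(word, round(sim, 4))
```

In this toy table "kitten" ranks above "dog" even though its vector is roughly twice as long as the one for "cat", because only direction matters.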


Analogies  

One surprising aspect of GloVe vectors is that directions in the embedding space can be meaningful. Because of the structure of the GloVe vectors, certain analogy-like relationships like this one tend to hold:

```
king − man + woman ≈ queen
```

In [13]:
print_closest_words(glove['king'] - glove['man'] + glove['woman'])

queen 2.8391209
prince 3.6610038
elizabeth 3.7152522
daughter 3.8317878
widow 3.8493774
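
Note that `print_closest_words` always drops the first sorted entry, which assumes the nearest vector belongs to the query word. For an analogy vector that is not guaranteed, so a common alternative is to exclude the input words explicitly. The sketch below illustrates that idea with made-up 2-D vectors (`toy_vectors`, with one axis loosely standing for royalty and the other for gender); the values and the helper `closest_words_excluding` are hypothetical, not taken from GloVe.

```python
import math

# Hypothetical 2-D toy vectors (axis 0 ~ "royalty", axis 1 ~ "gender"); not GloVe data
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
}

def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def closest_words_excluding(target, vectors, exclude, n=2):
    """Return the n words nearest to `target`, skipping any word in `exclude`."""
    candidates = [
        (w, euclidean_distance(target, v))
        for w, v in vectors.items()
        if w not in exclude
    ]
    candidates.sort(key=lambda pair: pair[1])  # smallest distance first
    return candidates[:n]

# king - man + woman, computed coordinate by coordinate
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

result = closest_words_excluding(target, toy_vectors, exclude={"king", "man", "woman"})
for word, dist in result:
    print(word, round(dist, 4))
```

With these toy values the analogy vector lands exactly on "queen", which real embeddings only approximate.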


In [14]:
print_closest_words(glove['programmer'] - glove['bad'] + glove['good'])

versatile 4.381561
creative 4.5690007
entrepreneur 4.6343737
enables 4.7177725
intelligent 4.7349973


In [15]:
print_closest_words(glove['programmer'] - glove['good'] + glove['bad'])

hacker 3.8383653
glitch 4.003873
originator 4.041952
hack 4.047719
serial 4.2250676
