# Tutorial 6

First, we will take a more detailed look at embeddings using Appendix B - 'A Closer Look at Word Embeddings' from [here](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/main/legacy/B%20-%20A%20Closer%20Look%20at%20Word%20Embeddings.ipynb).

All of the code is taken from the following [GitHub page](https://github.com/bentrevett/pytorch-sentiment-analysis), which has many additional tutorials to offer.

## Embeddings

Embeddings transform a one-hot encoded vector (a vector that is 0 in all elements except one, which is 1) into a real-valued vector of much smaller dimension. The one-hot encoded vector is also known as a *sparse vector*, whilst the real-valued vector is known as a *dense vector*.
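
As a minimal sketch of the sparse-to-dense idea, the plain-Python snippet below multiplies a one-hot vector by a small embedding matrix and shows that the result is simply one row of that matrix. The matrix values, the vocabulary size of 4 and the helper names `one_hot` and `embed` are invented for illustration; they are not GloVe values or part of any library.

```python
# Toy embedding matrix: 4 words in the vocabulary, 3 dimensions per word.
# The values are invented for illustration and are not GloVe values.
embedding_matrix = [
    [0.1, 0.2, 0.3],   # index 0
    [0.4, 0.5, 0.6],   # index 1
    [0.7, 0.8, 0.9],   # index 2
    [1.0, 1.1, 1.2],   # index 3
]

def one_hot(index, size):
    """Return a sparse vector that is 1 at `index` and 0 elsewhere."""
    return [1 if i == index else 0 for i in range(size)]

def embed(one_hot_vector, matrix):
    """Multiply a one-hot row vector by the embedding matrix."""
    dims = len(matrix[0])
    return [
        sum(one_hot_vector[row] * matrix[row][col] for row in range(len(matrix)))
        for col in range(dims)
    ]

sparse = one_hot(2, len(embedding_matrix))
dense = embed(sparse, embedding_matrix)

print(sparse)                        # the sparse (one-hot) vector
print(dense)                         # the dense vector
print(dense == embedding_matrix[2])  # the product is just row 2 of the matrix
```

Because the product always selects a single row, embedding layers are usually implemented as an index lookup rather than an actual matrix multiplication.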

The key concept in these word embeddings is that words that appear in similar _contexts_ appear nearby in the vector space, i.e. the Euclidean distance between their word vectors is small. By context here, we mean the surrounding words. For example, in the sentences "I purchased some items at the shop" and "I purchased some items at the store" the words 'shop' and 'store' appear in the same context and thus should be close together in vector space.
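
To make "close together in vector space" concrete, here is a small sketch that computes Euclidean distances between hand-made vectors. The three vectors are assumptions chosen so that 'shop' and 'store' are near each other; real embeddings have many more dimensions and learned values.

```python
import math

# Invented 3-dimensional vectors, chosen so that 'shop' and 'store' are close.
toy_vectors = {
    'shop':   [1.0, 2.0, 0.5],
    'store':  [1.0, 2.0, 1.5],
    'banana': [-3.0, 0.0, 4.5],
}

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.dist(a, b)

d_close = euclidean(toy_vectors['shop'], toy_vectors['store'])
d_far = euclidean(toy_vectors['shop'], toy_vectors['banana'])

print(f"shop <-> store:  {d_close}")
print(f"shop <-> banana: {d_far}")
```

In this toy setup the distance between 'shop' and 'store' is smaller than the distance between 'shop' and 'banana', which is the behaviour we hope to see in trained embeddings for words that share contexts.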

If you want to know how *word2vec* works, check out a two part series [here](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [here](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/), and if you want to find out more about *GloVe*, check the website [here](https://nlp.stanford.edu/projects/glove/).

In PyTorch, we use word vectors with the `nn.Embedding` layer, which takes a _**[sentence length, batch size]**_ tensor and transforms it into a _**[sentence length, batch size, embedding dimensions]**_ tensor.
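
The shape change described above can be mimicked without PyTorch. The sketch below uses nested lists and a made-up lookup table to show how a _**[sentence length, batch size]**_ grid of indices becomes a _**[sentence length, batch size, embedding dimensions]**_ structure. It only illustrates the shapes and is not how `nn.Embedding` is implemented; `table`, `batch` and `shape` are hypothetical names.

```python
# Toy lookup table: 5 token indices, 2 embedding dimensions (made-up values).
table = [[float(i), float(i) + 0.5] for i in range(5)]

# A batch stored as [sentence length, batch size]: 3 time steps, 2 sentences.
batch = [
    [0, 3],
    [1, 4],
    [2, 2],
]

# Replace every index with its row from the table.
embedded = [[table[index] for index in step] for step in batch]

def shape(nested):
    """Return the shape of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

print(shape(batch))     # sentence length, batch size
print(shape(embedded))  # sentence length, batch size, embedding dimensions
```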


### Loading the GloVe vectors

First, we'll load the GloVe vectors. The `name` argument specifies what the vectors have been trained on; here, `6B` means a corpus of 6 billion words. The `dim` argument specifies the dimensionality of the word vectors. GloVe vectors are available in 50, 100, 200 and 300 dimensions. There are also `42B` and `840B` GloVe vectors, however they are only available in 300 dimensions.

**Note**: these vectors are about 862MB, so watch out if you have a limited internet connection.

In [None]:
%pip install torchtext torch==2.2.0

In [None]:
import torchtext.vocab

glove = torchtext.vocab.GloVe(name = '6B', dim = 50)

print(f'There are {len(glove.itos)} words in the vocabulary')

.vector_cache/glove.6B.zip: 862MB [02:39, 5.39MB/s]                           
100%|█████████▉| 399999/400000 [00:16<00:00, 24040.95it/s]


There are 400000 words in the vocabulary


As shown above, there are 400,000 unique words in the GloVe vocabulary. These are the most common words found in the corpus the vectors were trained on. **In this set of GloVe vectors, every single word is lower-case only.**

`glove.vectors` is the actual tensor containing the values of the embeddings.

In [None]:
glove.vectors.shape

torch.Size([400000, 50])

In [None]:
glove.vectors[0]

tensor([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
        -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
         2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
         1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
        -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
        -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
         4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
         7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
        -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
         1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01])

We can see what word is associated with each row by checking the `itos` (int to string) list.

The output below shows that row 0 is the vector associated with the word 'the', row 1 with ',' (comma), row 2 with '.' (period), etc.

In [None]:
glove.itos[:10]

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

We can also use the `stoi` (string to int) dictionary, in which we input a word and receive the associated integer/index. If you try to get the index of a word that is not in the vocabulary, you receive an error.

In [None]:
glove.stoi["the"]

0
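
Since `stoi` is a dictionary, one way to avoid the error for out-of-vocabulary words is the standard `dict.get` method. The sketch below uses a tiny stand-in dictionary built from the first few entries listed above; `toy_stoi` and `lookup_index` are hypothetical names, not part of torchtext.

```python
# Stand-in for `glove.stoi`, using the first few vocabulary entries.
toy_stoi = {'the': 0, ',': 1, '.': 2, 'of': 3}

def lookup_index(stoi, word, default=None):
    """Return the index of `word`, or `default` if it is out of vocabulary."""
    return stoi.get(word, default)

print(lookup_index(toy_stoi, 'the'))  # index of a known word
print(lookup_index(toy_stoi, 'The'))  # capitalised form is not in the vocabulary

# Plain indexing raises KeyError for an unknown word.
try:
    toy_stoi['The']
except KeyError as error:
    print(f'KeyError: {error}')
```

Because this set of vectors is lower-case only, calling `word.lower()` before the lookup is a simple way to reduce misses.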

We can get the vector of a word by first getting the integer associated with it and then indexing into the word embedding tensor with that index.

In [None]:
glove.vectors[glove.stoi['distribution']]

tensor([ 0.7140, -0.6182, -0.0561,  0.8156,  0.2053,  0.5305, -0.3795, -1.3388,
         1.7468,  0.4604,  1.1172, -0.0601, -0.0469, -0.4204,  0.2040,  0.2522,
         0.0595, -0.0693,  0.2508, -0.6176,  0.7069, -0.6959,  0.0909,  0.5258,
        -0.9721,  0.0638, -0.1614,  0.0575,  0.7902,  0.3413,  3.2132, -0.1495,
         0.1039, -0.8629, -0.4394,  0.0373, -0.3620, -0.0531, -0.0706,  0.7165,
         0.5031,  0.2062, -0.3848,  0.1896, -1.2933, -0.1960, -0.1562,  0.3917,
        -0.1183,  0.3970])

We'll be doing this a lot, so we'll create a function that takes in word embeddings and a word then returns the associated vector. It'll also throw an error if the word doesn't exist in the vocabulary.

In [None]:
def get_vector(embeddings, word):
    assert word in embeddings.stoi, f'*{word}* is not in the vocab!'
    return embeddings.vectors[embeddings.stoi[word]]

As before, we use a word to get the associated vector.

In [None]:
get_vector(glove, 'test')

tensor([ 0.1318, -0.2552, -0.0679,  0.2619, -0.2616,  0.2357,  0.1308, -0.0118,
         1.7659,  0.2078,  0.2620, -0.1643, -0.8464,  0.0201,  0.0702,  0.3978,
         0.1528, -0.2021, -1.6184, -0.5433, -0.1786,  0.5389,  0.4987, -0.1017,
         0.6626, -1.7051,  0.0572, -0.3241, -0.6683,  0.2665,  2.8420,  0.2684,
        -0.5954, -0.5004,  1.5199,  0.0396,  1.6659,  0.9976, -0.5597, -0.7049,
        -0.0309, -0.2830, -0.1356,  0.6429,  0.4149,  1.2362,  0.7659,  0.9780,
         0.5851, -0.3018])

## Similar Contexts

Now we can start looking at the context of different words.

If we want to find the words similar to a certain input word, we first find the vector of this input word, then we scan through our vocabulary calculating the distance between the vector of each word and our input word vector. We then sort these from closest to furthest away.

The function below returns the closest 10 words to an input word vector:

In [None]:
import torch

def closest_words(embeddings, vector, n = 10):

    distances = [(word, torch.dist(vector, get_vector(embeddings, word)).item())
                 for word in embeddings.itos]

    return sorted(distances, key = lambda w: w[1])[:n]
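
The same search can be sketched on toy data with only the standard library. This variant uses `heapq.nsmallest`, which keeps just the `n` best candidates instead of sorting the whole vocabulary. The two-dimensional vectors and the name `closest_words_toy` are assumptions made for illustration.

```python
import heapq
import math

# Invented 2-dimensional vectors standing in for real embeddings.
toy_embeddings = {
    'tower':    [0.0, 0.0],
    'towers':   [1.0, 0.0],
    'building': [0.0, 2.0],
    'banana':   [5.0, 5.0],
}

def closest_words_toy(embeddings, vector, n=3):
    """Return the n (word, distance) pairs nearest to `vector`."""
    distances = (
        (word, math.dist(vector, candidate))
        for word, candidate in embeddings.items()
    )
    # nsmallest avoids a full sort when only the top n entries are needed.
    return heapq.nsmallest(n, distances, key=lambda pair: pair[1])

result = closest_words_toy(toy_embeddings, toy_embeddings['tower'])
print(result)
```

As in the full version, the query word itself comes first with a distance of zero.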

In [None]:
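from numpy.linalg import norm

# Print the Euclidean (L2) norm of the first 10 word vectors;
# looping over all 400,000 rows would flood the output.
for vec in glove.vectors[:10]:
    print(norm(vec.numpy()))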

In [None]:
word_vector = get_vector(glove, 'tower')

closest_words(glove, word_vector)

[('tower', 0.0),
 ('towers', 2.4298789501190186),
 ('gate', 3.0436489582061768),
 ('building', 3.4286906719207764),
 ('skyscraper', 3.4428133964538574),
 ('roof', 3.4700679779052734),
 ('built', 3.4704749584198),
 ('dome', 3.524683713912964),
 ('facade', 3.633889675140381),
 ('constructed', 3.6388208866119385)]

Let's try it out with 'korea'. The closest word is the word 'korea' itself (not very interesting), but all of the other words are related in some way. Pyongyang is the capital of North Korea, DPRK is the official name of North Korea, etc.

Interestingly, we also get 'japan' and 'china', which implies that Korea, Japan and China are frequently talked about together in similar contexts. This makes sense as they are geographically situated near each other.

In [None]:
word_vector = get_vector(glove, 'korea')
closest_words(glove, word_vector)

[('korea', 0.0),
 ('korean', 2.937217950820923),
 ('pyongyang', 3.229891061782837),
 ('dprk', 3.329789400100708),
 ('seoul', 3.4446284770965576),
 ('japan', 3.669916868209839),
 ('china', 3.727811098098755),
 ('iran', 3.766406774520874),
 ('beijing', 3.93996262550354),
 ('koreans', 4.017017364501953)]

Looking at a very different kind of word, 'the', we get other extremely frequent words: 'which', 'of', 'in', 'on' and 'as', and even the period. This is most probably because these words carry little meaning on their own but occur in almost every context, so their vectors end up close together.

In [None]:
word_vector = get_vector(glove, 'the')

closest_words(glove, word_vector)

[('the', 0.0),
 ('which', 1.9375851154327393),
 ('part', 1.9737050533294678),
 ('of', 2.189652919769287),
 ('in', 2.193087339401245),
 ('on', 2.2333903312683105),
 ('one', 2.2391767501831055),
 ('.', 2.248121738433838),
 ('as', 2.264446258544922),
 ('same', 2.3636555671691895)]

We'll also create another function that will nicely print out the tuples returned by our `closest_words` function.

In [None]:
def print_tuples(tuples):
    for w, d in tuples:
        print(f'({d:02.04f}) {w}')

Finally, we repeat the query for 'the', this time printing the results with `print_tuples`.

In [None]:
word_vector = get_vector(glove, 'the')

print_tuples(closest_words(glove, word_vector))

(0.0000) the
(1.9376) which
(1.9737) part
(2.1897) of
(2.1931) in
(2.2334) on
(2.2392) one
(2.2481) .
(2.2644) as
(2.3637) same


## Analogies

Another property of word embeddings is that they can be operated on just as any standard vector and give interesting results.

We'll show an example of this first, and then explain it:

In [None]:
def analogy(embeddings, word1, word2, word3, n=5):

    #get vectors for each word
    word1_vector = get_vector(embeddings, word1)
    word2_vector = get_vector(embeddings, word2)
    word3_vector = get_vector(embeddings, word3)

    #calculate analogy vector
    analogy_vector = word2_vector - word1_vector + word3_vector

    #find closest words to analogy vector
    candidate_words = closest_words(embeddings, analogy_vector, n+3)

    #filter out words already in analogy
    candidate_words = [(word, dist) for (word, dist) in candidate_words
                       if word not in [word1, word2, word3]][:n]

    print(f'{word1} is to {word2} as {word3} is to...')

    return candidate_words

In [None]:
print_tuples(analogy(glove, 'male', 'king', 'female'))

male is to king as female is to...
(3.1078) prince
(3.5638) uncle
(3.6519) brother
(3.6756) queen
(3.7817) grandson


This is a variant of the canonical example which shows off this property of word embeddings, usually stated as "man is to king as woman is to queen". With 'male' and 'female', 'queen' only appears as the fourth closest candidate, after 'prince', 'uncle' and 'brother'. So why does it work at all? Why does the vector of 'female' added to the vector of 'king' minus the vector of 'male' land near 'queen'?

If we think about it, the vector calculated from 'king' minus 'male' gives us a "royalty vector". This is the vector associated with traveling from a male to his royal counterpart, a king. If we add this "royalty vector" to 'female', this should travel to her royal equivalent, which is a queen!
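
The arithmetic can be checked by hand on a toy example. In the sketch below the vectors are invented so that the first axis loosely encodes gender and the second encodes royalty; with real embeddings the structure is only approximate. `toy` and `analogy_toy` are hypothetical names.

```python
import math

# Invented 2-D vectors: axis 0 ~ "gender", axis 1 ~ "royalty".
toy = {
    'man':   [1.0, 0.0],
    'woman': [-1.0, 0.0],
    'king':  [1.0, 1.0],
    'queen': [-1.0, 1.0],
    'apple': [0.0, -3.0],
}

def analogy_toy(vectors, word1, word2, word3):
    """Return the word closest to word2 - word1 + word3, excluding the inputs."""
    target = [
        b - a + c
        for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])
    ]
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

answer = analogy_toy(toy, 'man', 'king', 'woman')
print(f'man is to king as woman is to {answer}')
```

Here 'king' minus 'man' is exactly `[0.0, 1.0]`, a pure "royalty" direction, so adding it to 'woman' lands exactly on 'queen'.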

We can do this with other analogies too, though not every analogy gives a clear answer. For example, "distribution is to probability as expectation is to...":

In [None]:
print_tuples(analogy(glove, 'distribution', 'probability', 'expectation'))

distribution is to probability as expectation is to...
(5.0663) certainty
(5.5545) likelihood
(5.6148) calculated
(5.6347) inclination
(5.6383) downside


For a "baby animal vector":

In [None]:
print_tuples(analogy(glove, 'cat', 'kitten', 'dog'))

cat is to kitten as dog is to...
(3.0314) puppy
(3.2785) rottweiler
(3.5163) spunky
(3.5478) toddler
(3.5482) mannequin


A "capital city vector":

In [None]:
print_tuples(analogy(glove, 'france', 'paris', 'germany'))

france is to paris as germany is to...
(2.3015) berlin
(3.4018) vienna
(3.4697) munich
(3.4750) frankfurt
(3.5025) hamburg


A "musician's genre vector":

In [None]:
print_tuples(analogy(glove, 'elvis', 'rock', 'eminem'))

elvis is to rock as eminem is to...
(4.5673) rap
(5.1407) hip-hop
(5.1510) rappers
(5.2317) hop
(5.2441) rapper


And an "ingredient vector":

In [None]:
print_tuples(analogy(glove, 'beer', 'barley', 'wine'))

beer is to barley as wine is to...
(4.1063) grape
(4.4254) legumes
(4.4577) grapes
(4.4731) varieties
(4.5731) beans


## Bonus (A Short Intro to Neural Style Transfer)

Take a look at the following [tutorial](https://pytorch.org/tutorials/advanced/neural_style_tutorial.html).