# Word Analogies

Word embeddings allow us to process text data in all kinds of interesting ways. 
One experiment is to use code to solve _word analogies_.

> Solving a word analogy "A is to B as X is to Y" means to find one of the parameters, given the other three.

For example:
- London is to UK as Moscow is to what?
- Cat is to kitten as dog is to what?

> Word analogies can be solved using word embeddings

What is the point of this?
- There seems to be little practical application
- But it can help 
    - To understand what word vectors represent
    - To determine if you've found a useful set of word embeddings

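Mathematically, that means finding the vector from $a$ to $b$, that is $b - a$, then adding it to $x$, so that $y \approx x + (b - a)$. The answer to the analogy is the word whose embedding lies closest to that point.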
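As a minimal sketch of that arithmetic, the snippet below uses hand-made 2-dimensional vectors. The values are assumptions chosen for illustration, not real embeddings, and `solve_analogy_vector` is a hypothetical helper name.

```python
# Toy 2-D "embeddings" chosen by hand for illustration; they are NOT real BERT vectors.
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
toy_embeddings = {
    "man": [0.0, 0.0],
    "woman": [0.0, 1.0],
    "king": [1.0, 0.0],
    "queen": [1.0, 1.0],
}

def solve_analogy_vector(a, b, x):
    """Return x + (b - a), the point where the answer y is expected to lie."""
    return [xi + (bi - ai) for ai, bi, xi in zip(a, b, x)]

y_estimate = solve_analogy_vector(
    toy_embeddings["man"], toy_embeddings["woman"], toy_embeddings["king"]
)
print(y_estimate)  # [1.0, 1.0], which is exactly the toy vector for "queen"
```

With real embeddings the result rarely lands exactly on a word, which is why the rest of this notebook searches for the nearest tokens instead.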

# TODO diagram of adding analogy vector to source

First, let's get some pre-trained word embeddings from BERT, a very widely used pre-trained language model:

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1


In [2]:
from transformers import BertModel, BertTokenizer

# %% GET BERT
model_name = 'bert-base-uncased'
model = BertModel.from_pretrained(model_name)  # download the pre-trained BERT model from Hugging Face
bert_tokenizer = BertTokenizer.from_pretrained(model_name)  # download the matching tokeniser

# EXAMPLE TOKENISATION
sentence = "Now I want to know what does this vector refers to in dictionary"
tokens = bert_tokenizer.encode(sentence)  # list of token ids, including the [CLS] and [SEP] special tokens

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
print(model.modules)

<bound method Module.modules of BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(

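In that printout, the first module is the embedding layer. Its `word_embeddings` sub-module is a lookup table with one 768-dimensional row for each of the 30,522 tokens in BERT's vocabulary.
The weights of this table are the input representation that BERT has learnt for each token, before any context from the surrounding sentence is mixed in.
These are the pre-trained embeddings that we will use.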
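
To make the "lookup table" idea concrete, here is a small sketch with a made-up three-token vocabulary. The matrix values, the `toy_ids_to_tokens` mapping and the `lookup` helper are all assumptions for illustration and are unrelated to BERT's real weights.

```python
# A tiny stand-in for an embedding matrix: one row per token id.
# The values and the 3-token vocabulary are made up for illustration.
toy_matrix = [
    [0.1, 0.2],  # row 0
    [0.3, 0.4],  # row 1
    [0.5, 0.6],  # row 2
]
toy_ids_to_tokens = {0: "[PAD]", 1: "cat", 2: "dog"}

def lookup(token_id):
    """Fetching a token's input embedding is just indexing a row by token id."""
    return toy_matrix[token_id]

for token_id, token in toy_ids_to_tokens.items():
    print(token, lookup(token_id))
```

The real matrix works the same way, only with 768 columns and tens of thousands of rows.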

In [4]:

embedding_matrix = model.embeddings.word_embeddings.weight  # the learnt (vocab_size, 768) weight matrix
embedding_matrix = embedding_matrix.detach()  # detach from the autograd graph; no gradients are needed here

n_embeddings = 30000
embedding_matrix = embedding_matrix[:n_embeddings]  # keep only the first n_embeddings rows

print("Embedding shape:", embedding_matrix.shape)


Embedding shape: torch.Size([30000, 768])


Now that we have the embeddings, we need to know which row corresponds to which token. We can get this mapping from the pre-trained BERT tokeniser:

In [5]:
embedding_labels = list(bert_tokenizer.ids_to_tokens.values())[:n_embeddings]  # token strings, in token-id order, for the rows we kept

Let's quickly define a helper function to visualise our embeddings using Tensorboard:

In [12]:
from torch.utils.tensorboard import SummaryWriter
from time import time

def visualise_embeddings(embeddings, labels=None, label_names="Label"):
    print("Embedding")

    writer = SummaryWriter()  # writes event files to ./runs by default
    start = time()
    writer.add_embedding(  # log the embeddings for TensorBoard's projector
        mat=embeddings,
        metadata=labels,
        metadata_header=label_names
    )
    writer.close()  # flush the event files to disk
    print("Total time:", time() - start)

    print("Embedding done")

visualise_embeddings(embedding_matrix, embedding_labels)

Embedding


AttributeError: ignored

To determine the vector that represents the transformation between $a$ and $b$, we first need the embedding for each of them:

In [13]:
def get_word_embedding(word):
    # Invert the tokeniser's id -> token mapping to get token -> id
    tokens_to_ids = {token: idx for idx, token in bert_tokenizer.ids_to_tokens.items()}

    token_id = tokens_to_ids[word]  # raises KeyError if `word` is not a single token in the vocabulary
    embedding = embedding_matrix[token_id]  # the row of the embedding matrix for this token id
    return embedding

example_word_embedding = get_word_embedding("apple")
print(example_word_embedding)

tensor([-1.6887e-02, -7.4078e-03, -7.0792e-02, -7.2979e-02,  2.6306e-02,
         1.2412e-02, -1.5166e-02, -5.7818e-02, -1.7665e-02, -6.0178e-02,
        -6.9499e-02, -8.4558e-02, -6.2827e-02, -3.8619e-02, -4.2123e-02,
        -3.3479e-02,  7.6708e-03, -5.8426e-02,  1.4515e-02, -1.3542e-01,
         4.4417e-02, -7.0895e-02,  3.5826e-02, -2.9868e-02, -3.8617e-02,
        -4.9124e-02, -7.3432e-02, -4.7727e-02, -1.3144e-02, -6.3145e-02,
        -8.0265e-02,  8.6743e-03, -2.0196e-02, -2.2212e-02, -4.2043e-02,
        -4.5627e-02, -5.2184e-02, -1.3404e-02, -3.0210e-02, -3.4542e-02,
        -6.8846e-03, -5.2005e-02,  9.3773e-03, -3.4767e-02,  1.5441e-02,
        -1.1546e-02, -4.0174e-02, -2.2193e-02, -9.8711e-02, -4.5019e-02,
        -2.8062e-02,  3.6789e-02, -1.1174e-02, -6.9229e-02, -4.1744e-03,
         1.7117e-02, -2.2168e-02,  2.7866e-02,  2.4114e-02, -5.9043e-03,
         2.2167e-02, -1.1400e-01, -6.4697e-02, -2.7417e-02, -9.1516e-02,
        -2.1976e-02,  1.4345e-02,  6.0132e-02, -5.4

To find the closest vector to an embedding, we'll need to compare it with every other token embedding. An effective way to do that is to take the cosine similarity, which measures the angle between two vectors and ignores their lengths.

## TODO diagram of comparing word vectors with cosine similarity

We could implement the cosine similarity ourselves, but we can also get a function to do that off the shelf, from the `torchmetrics` library. You can check out the documentation [here](https://torchmetrics.readthedocs.io/en/stable/pairwise/cosine_similarity.html).
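
For reference, a hand-rolled version might look like the sketch below. It uses plain Python lists rather than tensors, assumes no vector is all zeros, and `cosine_similarity`, `nearest_tokens` and the compass-direction vectors are illustrative names and values that do not come from `torchmetrics` or BERT.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|). Assumes neither vector is all zeros."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

def nearest_tokens(query, embeddings, n=2):
    """Rank the tokens in `embeddings` (token -> vector) by similarity to `query`."""
    ranked = sorted(
        embeddings,
        key=lambda token: cosine_similarity(query, embeddings[token]),
        reverse=True,  # most similar first
    )
    return ranked[:n]

# Made-up 2-D vectors pointing in compass directions.
toy = {
    "east": [1.0, 0.0],
    "north_east": [1.0, 1.0],
    "north": [0.0, 1.0],
    "west": [-1.0, 0.0],
}
print(nearest_tokens([2.0, 0.1], toy))  # ['east', 'north_east']
```

The query vector is twice as long as `east` yet still ranks it first, because only the direction matters.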

In [7]:
!pip install torchmetrics

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchmetrics
  Downloading torchmetrics-0.11.1-py3-none-any.whl (517 kB)
Installing collected packages: torchmetrics
Successfully installed torchmetrics-0.11.1


Often, the nearest token to the solved analogy embedding is the token that you started with or its plural.
So, we might want to get more than just the closest one. 

Now let's define a function to get the nearest $n$ tokens to an embedding:

In [14]:
import torch
import torchmetrics

def get_nearest_n_tokens_from_embedding(embedding, n=20):
    # cosine similarity between `embedding` and the embedding of every token
    similarity = torchmetrics.functional.pairwise_cosine_similarity(
        embedding.unsqueeze(0), embedding_matrix).squeeze()
    similarity_idx = torch.argsort(similarity, dim=0, descending=True)  # token ids, most similar first
    print(similarity_idx.shape)
    similarity_idx = similarity_idx[:n]  # keep the ids of the top n
    return [bert_tokenizer.ids_to_tokens[int(idx)] for idx in similarity_idx]  # map ids back to token strings

get_nearest_n_tokens_from_embedding(example_word_embedding)


torch.Size([30000])


['apple',
 'apples',
 '880',
 '1620',
 '930',
 '1100',
 '910',
 '870',
 '1621',
 '1682',
 '840',
 '680',
 '1650',
 '850',
 '820',
 '280',
 '1628',
 '1683',
 '318',
 '980']

Now, let's implement a function to solve the analogy:

In [15]:
def analogy_solver(a, b, c, embedding_matrix, labels, n=5):
    """
    Solves "A is to B as C is to D" given A, B and C, and prints the candidates for D.

    Note: the helper functions read the global `embedding_matrix`, so the
    `embedding_matrix` and `labels` arguments are currently unused.
    """

    # GET EMBEDDINGS FOR KNOWN WORDS
    a_embedding = get_word_embedding(a)
    b_embedding = get_word_embedding(b)

    # GET TRANSFORMATION APPLIED
    transformation_vector = b_embedding - a_embedding  # vector from a to b

    c_embedding = get_word_embedding(c)
    print(c_embedding.shape)
    d_embedding = c_embedding + transformation_vector  # apply the same offset to c
    print(d_embedding.shape)
    # ask for n+1 tokens, because c itself is often among the nearest and gets skipped
    nearest_tokens = get_nearest_n_tokens_from_embedding(d_embedding, n=n+1)
    for d in nearest_tokens:
        if d == c:  # skip the input word
            continue
        print(f"{a} is to {b} as {c} is to {d}")
    print()

Now let's use that to solve a few analogies:

In [16]:
analogy_solver("man", "woman", "king", embedding_matrix, embedding_labels)
analogy_solver("london", "uk", "moscow", embedding_matrix, embedding_labels)
analogy_solver("puppy", "dog", "kitten", embedding_matrix, embedding_labels)

torch.Size([768])
torch.Size([768])
torch.Size([30000])
man is to woman as king is to queen
man is to woman as king is to woman
man is to woman as king is to princess
man is to woman as king is to kings
man is to woman as king is to queens

torch.Size([768])
torch.Size([768])
torch.Size([30000])
london is to uk as moscow is to uk
london is to uk as moscow is to ussr
london is to uk as moscow is to russians
london is to uk as moscow is to kyiv
london is to uk as moscow is to leningrad

torch.Size([768])
torch.Size([768])
torch.Size([30000])
puppy is to dog as kitten is to dog
puppy is to dog as kitten is to dogs
puppy is to dog as kitten is to cat
puppy is to dog as kitten is to cats
puppy is to dog as kitten is to parrot



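You can see that the analogies work, but only roughly. For man/woman/king, the top answer is `queen`. For the other two, the top answer is just the word $b$ itself (`uk` and `dog`), and a more plausible answer sits further down the list: `ussr` in second place for Moscow and `cat` in third place for kitten.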
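
In the output above, the solver skips $c$ but not $a$ or $b$, which is why `uk` and `dog` appear as top answers. One possible refinement, sketched here with a hypothetical `filter_candidates` helper that is not part of the notebook, is to drop all three input words from the ranked candidates before reporting them.

```python
def filter_candidates(candidates, exclude, n=5):
    """Keep the first n candidates (already sorted by similarity) that are not in `exclude`."""
    excluded = set(exclude)
    return [token for token in candidates if token not in excluded][:n]

# Ranked candidates copied from the london/uk/moscow output above.
ranked = ["uk", "ussr", "russians", "kyiv", "leningrad"]
print(filter_candidates(ranked, exclude=["london", "uk", "moscow"], n=3))
# ['ussr', 'russians', 'kyiv']
```

To use this with the solver, you would request a few extra neighbours (for example `n + 3`) so that enough candidates remain after filtering.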