In [1]:
import os.path as path

import numpy as np
import torch as t
import torchtext as tt

In [2]:
DATAROOT = path.expanduser("~/mldata")

# Embedding Layer
An embedding is a lookup table that, given a word/token index, returns the vector representation of that word/token. The `num_embeddings` param tells the `Embedding` class how big this lookup table is; it is set to the size of our vocabulary. The `embedding_dim` param specifies the size of the resulting word vector. By default, the embedding has random values for each word/token index.

In [3]:
vocab_size = 100_000
embedding = t.nn.Embedding(num_embeddings=vocab_size, embedding_dim=11)

Now let's create a batch of 3 sequences, each sequence being 2 words long. Such a tensor has BATCH_SIZE x SEQ_LEN dimensions.

$$
\begin{bmatrix}
hello & world \\
goodbye & everybody \\
cookie & monster \\
\end{bmatrix}
$$

When I pass this through the embedding layer, it should replace each word/token with its corresponding vector. The output tensor will have dimensions BATCH_SIZE x SEQ_LEN x EMBEDDING_DIM. This matters because RNN layers by default expect input with dimensions SEQ_LEN x BATCH_SIZE x INPUT_DIM, so either the embedding output must be transposed or the RNN must be created with `batch_first=True`.
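The layout mismatch with RNN layers can be handled in two ways. A minimal sketch, assuming a toy vocabulary of 100 tokens and a hypothetical `hidden_size` of 5 (neither value comes from this notebook):

```python
import torch as t

embedding = t.nn.Embedding(num_embeddings=100, embedding_dim=11)
contents = t.randint(0, 100, (3, 2))  # BATCH_SIZE=3, SEQ_LEN=2
emb = embedding(contents)             # BATCH_SIZE x SEQ_LEN x EMBEDDING_DIM

# Option 1: tell the RNN that the batch dimension comes first.
rnn_bf = t.nn.RNN(input_size=11, hidden_size=5, batch_first=True)
out_bf, _ = rnn_bf(emb)

# Option 2: keep the default layout and swap the first two dimensions.
rnn = t.nn.RNN(input_size=11, hidden_size=5)
out, _ = rnn(emb.transpose(0, 1))

print(out_bf.shape)  # torch.Size([3, 2, 5])
print(out.shape)     # torch.Size([2, 3, 5])
```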

For now we don't have a real dataset, so let's pretend that these words have the following index values:
  * hello => 12545
  * world => 51
  * goodbye => 7373
  * everybody => 7771
  * cookie => 17185
  * monster => 6290

In [4]:
contents = t.tensor([
    [12545,    51],
    [ 7373,  7771],
    [17185,  6290]
])

In [5]:
emb = embedding(contents)
emb.shape

torch.Size([3, 2, 11])

In the generated embedding, the word vector corresponding to `hello` is `emb[0][0]`.

In [6]:
emb[0][0]

tensor([ 0.6146, -0.9629,  0.4635, -2.5220,  1.6283, -0.5316,  0.3511,  0.7853,
        -0.8683,  1.1452, -1.1283], grad_fn=<SelectBackward0>)
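Because the layer is just a lookup table, the same vector can also be read straight from its weight matrix. A minimal sketch, assuming a smaller table than the one above (the `vocab_size` of 20,000 and the seed are illustrative choices, not from this notebook):

```python
import torch as t

t.manual_seed(0)  # illustrative seed so the run is repeatable

# Smaller table than in the notebook, just large enough for the indexes used.
vocab_size = 20_000
embedding = t.nn.Embedding(num_embeddings=vocab_size, embedding_dim=11)

contents = t.tensor([
    [12545,    51],
    [ 7373,  7771],
    [17185,  6290],
])
emb = embedding(contents)

# The weight matrix has one row per token index.
print(embedding.weight.shape)  # torch.Size([20000, 11])

# Looking up `hello` (index 12545) is the same as reading row 12545.
same_as_row = t.equal(emb[0][0], embedding.weight[12545])
# Indexing the weight matrix with the whole batch reproduces the layer output.
same_as_indexing = t.equal(emb, embedding.weight[contents])
print(same_as_row, same_as_indexing)  # True True
```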

# Pre-trained Embeddings
One way to get word vectors is to train the embeddings along with the rest of my network. For big datasets with a very specific vocabulary, e.g., legal documents, this works well. However, for most other problems it makes sense to use existing pre-trained word vectors like those from GloVe.

For this notebook, let's use the `glove.6B.100d` embeddings, which were trained on a corpus of around 6B tokens, cover a vocabulary of 400K tokens, and have vectors of size 100.

In [7]:
glove_datapath = path.join(DATAROOT, "glove")
glove = tt.vocab.GloVe(name="6B", dim=100, cache=glove_datapath)

/Users/avilay/mldata/glove/glove.6B.zip: 862MB [02:39, 5.41MB/s]                               
100%|█████████▉| 399999/400000 [00:05<00:00, 76183.61it/s]


In [8]:
glove.stoi["the"]

0

In [9]:
glove.vectors[0]

tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,  0.8278,  0.27

The GloVe file format is very simple. Each line starts with the word/token, followed by its vector values on the same line, separated by spaces. So to get the word vector for `the`, which happens to be the first word in the dataset, just read the first line of `glove.6B.100d.txt`.
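Reading one such line by hand could look roughly like this. It is only a sketch: `parse_glove_line` is a hypothetical helper, not part of `torchtext`, and the sample string is a shortened stand-in for the real line, which has 100 values.

```python
import torch as t

def parse_glove_line(line: str) -> tuple[str, t.Tensor]:
    """Split one `token v1 v2 ...` line into the token and its vector."""
    token, *values = line.rstrip().split(" ")
    return token, t.tensor([float(v) for v in values])

# Shortened stand-in for the first line of the file (only the first 5 values).
sample = "the -0.038194 -0.24487 0.72812 -0.39961 0.083172\n"
token, vector = parse_glove_line(sample)
print(token, vector.shape)  # the torch.Size([5])
```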

In [10]:
expected_glove_the = t.tensor([-0.038194,-0.24487,0.72812,-0.39961,0.083172,0.043953,-0.39141,0.3344,-0.57545,0.087459,0.28787,-0.06731,0.30906,-0.26384,-0.13231,-0.20757,0.33395,-0.33848,-0.31743,-0.48336,0.1464,-0.37304,0.34577,0.052041,0.44946,-0.46971,0.02628,-0.54155,-0.15518,-0.14107,-0.039722,0.28277,0.14393,0.23464,-0.31021,0.086173,0.20397,0.52624,0.17164,-0.082378,-0.71787,-0.41531,0.20335,-0.12763,0.41367,0.55187,0.57908,-0.33477,-0.36559,-0.54857,-0.062892,0.26584,0.30205,0.99775,-0.80481,-3.0243,0.01254,-0.36942,2.2167,0.72201,-0.24978,0.92136,0.034514,0.46745,1.1079,-0.19358,-0.074575,0.23353,-0.052062,-0.22044,0.057162,-0.15806,-0.30798,-0.41625,0.37972,0.15006,-0.53212,-0.2055,-1.2526,0.071624,0.70565,0.49744,-0.42063,0.26148,-1.538,-0.30223,-0.073438,-0.28312,0.37104,-0.25217,0.016215,-0.017099,-0.38984,0.87424,-0.72569,-0.51058,-0.52028,-0.1459,0.8278,0.27062])

In [11]:
glove.vectors[glove.stoi["the"]].allclose(expected_glove_the)

True

Now I can create the `Embedding` layer with these pre-trained vectors.

In [12]:
embedding = t.nn.Embedding.from_pretrained(glove.vectors)
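Note that `from_pretrained` freezes the weights by default, so the pre-trained vectors are not updated during training unless `freeze=False` is passed. A small sketch, using a made-up 4 x 3 tensor as a stand-in for `glove.vectors`:

```python
import torch as t

# Stand-in for `glove.vectors`: 4 tokens with 3-dimensional vectors.
pretrained = t.arange(12, dtype=t.float32).reshape(4, 3)

# freeze=True is the default: the vectors stay fixed during training.
frozen = t.nn.Embedding.from_pretrained(pretrained)
# freeze=False lets the optimizer fine-tune the vectors.
tunable = t.nn.Embedding.from_pretrained(pretrained, freeze=False)

print(frozen.weight.requires_grad)   # False
print(tunable.weight.requires_grad)  # True
```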

In [13]:
contents = t.tensor([
    [glove.stoi["hello"], glove.stoi["world"]],
    [glove.stoi["goodbye"], glove.stoi["everybody"]],
    [glove.stoi["cookie"], glove.stoi["monster"]],
])
contents

tensor([[13075,    85],
        [10926,  2587],
        [13816,  7519]])

Now when I pass this tensor through the embedding layer, it will replace 13075 with the word vector for `hello`, 10926 with the word vector for `goodbye`, and so on.

In [14]:
hello = glove.vectors[glove.stoi["hello"]]
world = glove.vectors[glove.stoi["world"]]
goodbye = glove.vectors[glove.stoi["goodbye"]]
everybody = glove.vectors[glove.stoi["everybody"]]
cookie = glove.vectors[glove.stoi["cookie"]]
monster = glove.vectors[glove.stoi["monster"]]

exp = t.stack((
    t.stack((hello, world)), 
    t.stack((goodbye, everybody)), 
    t.stack((cookie, monster))
))
exp.shape

torch.Size([3, 2, 100])

In [15]:
emb = embedding(contents)
emb.shape

torch.Size([3, 2, 100])

In [16]:
exp.allclose(emb)

True

In [27]:
emb[0,1,-3:]

tensor([-0.5088,  0.6256,  0.4392])

## Recap

First assign all unique tokens in the corpus a unique index. The pseudocode for this will look something like -
```python
tok_to_idx = {}
last_idx = 0
for token in corpus:
    if token not in tok_to_idx:
        tok_to_idx[token] = last_idx
        last_idx += 1
```

The actual input that we want to feed to our model is -

$$
\begin{bmatrix}
hello & world \\
goodbye & everybody \\
cookie & monster \\
\end{bmatrix}
$$

I'll need to convert the input tokens to their corresponding indexes. The pseudocode will look something like -
```python
from typing import Dict, List

def convert(input: List[List[str]], tok_to_idx: Dict[str, int]) -> t.Tensor:
    # One row per example, one column per feature; each token becomes its index.
    return t.tensor([[tok_to_idx[token] for token in row] for row in input])
```

For ranking and recommendation problems, the tokens can be **post ids** or **location ids** or **product ids**, etc. They may look numerical, but I need to map them to some index space. This is why it is best to treat these ids as tokens. After all of this I'll end up with an $m \times n$ integer tensor where $m$ is the batch size and $n$ is the number of categorical features.
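As a rough illustration of treating ids as tokens, the sketch below maps made-up post and location ids to contiguous indexes. The id strings are hypothetical, and sharing one index space across both features is an assumption made for brevity; a separate mapping and embedding table per feature is just as plausible.

```python
import torch as t

# Hypothetical raw feature values: each row is (post id, location id).
rows = [
    ["post_982734", "loc_17"],
    ["post_11", "loc_17"],
    ["post_982734", "loc_4"],
]

# Assign each distinct token the next free index, in order of first appearance.
tok_to_idx: dict[str, int] = {}
for row in rows:
    for token in row:
        tok_to_idx.setdefault(token, len(tok_to_idx))

# m x n integer tensor: m = batch size, n = number of categorical features.
batch = t.tensor([[tok_to_idx[token] for token in row] for row in rows])
print(batch)
# tensor([[0, 1],
#         [2, 1],
#         [0, 3]])
```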

$$
\begin{bmatrix}
13075 & 85 \\
10926 & 2587 \\
13816 & 7519
\end{bmatrix}
$$

When I pass this input to the `Embedding` layer, it will convert each index into the corresponding embedding vector and return an $m \times n \times d$ tensor, where $d$ is the embedding dimension.

$$
\begin{bmatrix}
\begin{bmatrix}
0.2669 & 0.3963 & 0.6169 & \cdots & 0.3584 & -0.4846 & 0.3073 \\
0.4918 & 1.1164 & 1.1424 & \cdots & -0.5088 & 0.6256 & 0.4392 \\
\end{bmatrix}\\
\begin{bmatrix}
\mathbf{v}_{\text{goodbye}} \\
\mathbf{v}_{\text{everybody}} \\
\end{bmatrix}\\
\begin{bmatrix}
\mathbf{v}_{\text{cookie}} \\
\mathbf{v}_{\text{monster}} \\
\end{bmatrix}
\end{bmatrix}
$$

Only the first block is written out with values (the vectors for `hello` and `world`); $\mathbf{v}_{\text{goodbye}}$ and the others each stand for the full $d$-dimensional row of that token.

# Embedding Bag

Often the value of a feature is not a single token but multiple tokens. E.g., if my input consists of bi-grams, then each position will have two tokens. If my feature is the posts that a user has liked in the last 5 days, it can be multiple post IDs. In this case I generally want to get the embedding of each token from the embedding table and then reduce them somehow by summing, averaging, etc. `EmbeddingBag` does this in one step: it takes all the token indexes as one flat tensor, plus an `offsets` tensor marking where each bag starts, and reduces each bag according to its `mode` (the default is `mean`). Let's say my input has two bags of four tokens each -

$$
\begin{bmatrix}
\begin{bmatrix} 1 & 2 & 4 & 5 \end{bmatrix} & \begin{bmatrix} 4 & 3 & 2 & 9 \end{bmatrix} \\
\end{bmatrix}
$$

In [29]:
embeddings = t.nn.EmbeddingBag(num_embeddings=10, embedding_dim=3)

In [30]:
input = t.tensor([1, 2, 4, 5, 4, 3, 2, 9], dtype=t.long)
offsets = t.tensor([0, 4], dtype=t.long)
embs = embeddings(input, offsets)
embs

tensor([[-0.3959, -0.2102,  0.1154],
        [ 0.3190,  0.7162, -0.7253]], grad_fn=<EmbeddingBagBackward0>)
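Each output row is the reduction of one bag. With the default `mode="mean"`, that should match looking up each bag with a plain `Embedding` and averaging by hand. A small check along those lines, where the `table` layer is only a comparison aid that reuses the bag's weights:

```python
import torch as t

bag = t.nn.EmbeddingBag(num_embeddings=10, embedding_dim=3)  # mode="mean" by default
# A plain lookup table sharing the same weights, for comparison.
table = t.nn.Embedding.from_pretrained(bag.weight.detach())

input = t.tensor([1, 2, 4, 5, 4, 3, 2, 9], dtype=t.long)
offsets = t.tensor([0, 4], dtype=t.long)  # bag 0 = input[0:4], bag 1 = input[4:]

embs = bag(input, offsets)

# Look up each bag by hand and average over its tokens.
by_hand = t.stack((
    table(input[0:4]).mean(dim=0),
    table(input[4:]).mean(dim=0),
))
print(embs.shape)                 # torch.Size([2, 3])
print(t.allclose(embs, by_hand))  # True
```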

# Using embedding with a real dataset
So far I have been creating my `contents` matrix by hand and using the index values provided by the GloVe dataset. In reality, I'll have a text corpus and the vocabulary for that will be auto-generated by PyTorch. In such cases, the words will have different indexes. E.g., in the AG News dataset, the word `the` has index 3, whereas in the GloVe dataset it has an index of 0. For such pre-existing vocabulary objects, I can load the vectors of a pre-trained word vector dataset and the `Vocab` object will automatically map the words to their vectors. The index will still be what was in the original vocab. E.g., in the AG News dataset, after loading the GloVe vectors, the index of `the` will still be 3, but now its vector value will be the GloVe word vector.

When creating the `Embedding` object, I must take care to use the pre-trained vectors from the vocab, and not those from the `glove` object, because only the vocab's vectors are ordered by the vocab's own indexes.
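The re-indexing idea can be sketched in plain PyTorch without `torchtext`. This is not how `Vocab.load_vectors` is implemented; the two dictionaries and the 2-dimensional vectors are made up, and leaving tokens that are missing from the pre-trained table as zero rows is an assumption of this sketch.

```python
import torch as t

# Hypothetical pre-trained table: token -> row index, plus the vectors themselves.
pre_stoi = {"the": 0, "hello": 1, "world": 2}
pre_vectors = t.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Hypothetical corpus vocabulary with its own index order.
vocab_stoi = {"<unk>": 0, "world": 1, "zzyzx": 2, "the": 3}

# One row per corpus index; tokens missing from the pre-trained table stay zero.
vocab_vectors = t.zeros(len(vocab_stoi), pre_vectors.shape[1])
for token, idx in vocab_stoi.items():
    if token in pre_stoi:
        vocab_vectors[idx] = pre_vectors[pre_stoi[token]]

# Build the layer from the re-ordered matrix, not from `pre_vectors`.
embedding = t.nn.Embedding.from_pretrained(vocab_vectors)
the_vec = embedding(t.tensor([vocab_stoi["the"]]))[0]
print(t.equal(the_vec, pre_vectors[pre_stoi["the"]]))  # True
```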

### Aug 20, 2022: Code in this section is broken because `torchtext` has taken a dependency on `torchdata` which seems to be broken.

In [31]:
datapath = path.join(DATAROOT, "CoLA")
trainset, testset = tt.datasets.CoLA(datapath)
print(len(trainset), len(testset))

NameError: name 'IterableWrapper' is not defined

In [19]:
vocab = trainset.get_vocab()
vocab.stoi["the"]

AttributeError: '_RawTextIterableDataset' object has no attribute 'get_vocab'

Initially `vocab` does not have any vectors. All it has is the token and its index.

In [21]:
vocab.vectors[3]

NameError: name 'vocab' is not defined

After we load the `glove` vectors, the vocab will automatically map the right word indexes to the right vectors. As can be seen in the example below, the word with index 3, `the`, is mapped to the right word vector from `glove`, where it had the index 0.

In [None]:
vocab.load_vectors(glove)

In [None]:
vocab.vectors[3]

In [None]:
vocab.vectors[3].allclose(expected_glove_the)

In [None]:
embedding = t.nn.Embedding.from_pretrained(vocab.vectors)

In [None]:
contents = t.tensor([
    [vocab.stoi["hello"], vocab.stoi["world"]],
    [vocab.stoi["goodbye"], vocab.stoi["everybody"]],
    [vocab.stoi["cookie"], vocab.stoi["monster"]]
])
print(contents.shape)
contents

In [None]:
emb = embedding(contents)
emb.shape

Even though the `contents` values, i.e., the word indexes, are different, the embedding output is the same as before, i.e., the word `hello` is still replaced with the word vector for `hello` and so on.

In [None]:
exp.allclose(emb)

In [21]:
t.empty((2, 2), dtype=t.int)

tensor([[253231109,         1],
        [253311776,         1]], dtype=torch.int32)