In [39]:
import os.path as path
import torch as t
import torchtext as tt  # used by the GloVe section below

In [40]:
DATAROOT = path.expanduser("~/mldata")

# Embeddings

## Embedding Layer

An embedding is a lookup table that, given a token index, returns the vector representation of that token.

![embedding_1](./imgs/embedding_1.png)

It takes two parameters: `num_embeddings` is the height (number of rows) of the embedding table, and `embedding_dim` is the width (number of columns).

When initialized, the `Embedding` layer will have random embedding values, just like any other layer in PyTorch.

In [41]:
embedding = t.nn.Embedding(num_embeddings=5_000, embedding_dim=5)
for idx in [0, 3435, 988]:
    emb = embedding(t.tensor(idx))
    print(idx)
    print(emb.shape)
    print(emb)
    print()

0
torch.Size([5])
tensor([-0.4781, -0.1330,  0.9880,  3.0628,  0.0660],
       grad_fn=<EmbeddingBackward0>)

3435
torch.Size([5])
tensor([-2.7041,  1.1003, -0.1545, -0.1265, -0.6299],
       grad_fn=<EmbeddingBackward0>)

988
torch.Size([5])
tensor([ 0.8695,  0.6965, -0.5266, -0.3331, -0.2609],
       grad_fn=<EmbeddingBackward0>)



Like all other layers, the `Embedding` layer can accept a batch of indexes. Moreover, each row can have multiple indexes. If the input is an $m \times n$ integer tensor, where $m$ is the batch size and $n$ is the number of indexes in each row, then the output will be $m \times n \times d$, where $d$ is the embedding dimension.

In [4]:
input = t.tensor([
    [0, 3435, 988],
    [3840, 8, 2123]
])
emb = embedding(input)
print(emb.shape)
emb

torch.Size([2, 3, 5])


tensor([[[ 5.8387e-01, -7.7427e-01,  1.3158e+00,  1.4077e+00, -7.3324e-01],
         [-7.7331e-01, -1.7010e-01,  8.7688e-01, -1.3446e+00,  6.4063e-01],
         [-3.0518e-01,  8.4026e-01, -5.6898e-01,  9.8898e-01, -6.1587e-04]],

        [[-5.8915e-03, -4.6279e-02, -2.0383e-01,  4.3963e-01,  1.2567e+00],
         [ 1.9461e+00,  1.1002e+00, -5.0909e-01, -1.2554e+00,  1.7956e+00],
         [ 9.0478e-01, -5.3915e-02, -4.1651e-01,  2.5856e-02, -3.2406e-01]]],
       grad_fn=<EmbeddingBackward0>)
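
To make the "lookup table" view concrete, here is a minimal sketch that checks that calling the layer gives the same result as indexing rows of its `weight` tensor. The table size and the indexes are arbitrary values picked for illustration.

```python
import torch as t

# Small table with arbitrary sizes, chosen only for this illustration.
table = t.nn.Embedding(num_embeddings=10, embedding_dim=4)

idxs = t.tensor([
    [0, 7, 3],
    [9, 3, 1]
])

# Calling the layer gathers rows of the weight matrix.
looked_up = table(idxs)       # shape (2, 3, 4)
indexed = table.weight[idxs]  # plain tensor indexing, same shape

print(looked_up.shape)
print(t.equal(looked_up, indexed))
```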

## Features
Embeddings are used to model sparse or categorical features, including words in a text/language model. Let's understand each use case in more detail. Let's say I am an e-commerce website and my training set has the following features:
  * Monthly average of transaction amount: float
  * Monthly average of number of transactions: float
  * Number of days since last transaction: float
  * Product IDs of the last three purchased products: List[str]

As can be seen from above, the first three features are my normal "dense" features, but the last feature is a **"sparse"** feature.

Let's add another feature to our training set above, which is the home state of the user. This is similar to a sparse feature, except that instead of a list of strings, its dtype is a single string. This is often referred to as a **"categorical"** feature.

Let's add in a final feature, which is their review comments, more specifically, the first 5 words used by them. If they have fewer than 5 words, then we pad the end with empty tokens. This is a text or language feature.

| avg_amt | avg_n | n_days | purchased          | state | review                                      |
| ------- | ----- | ------ | ------------------ | ----- | ------------------------------------------- |
| 10.23   | 3     | 2      | [q2x8, 9juk, 90u7] | WA    | [I, was, very, happy, with]       |
| 28.21   | 10    | 0.5    | [89i2, a8p0, q2x8]  | CA    | [Very, good, quality, but, bad] |


Language models are another setting where we see text features. Let's say my text is composed of "it was the best of times it was the worst of times". If I want my sequence length to be 5 and my batch size to be 2, then my input will be -
```
["it", "was", "the", "best", "of"]
["of", "times", "it", "was", "the"]
```
Just for interest, my target output in this case will be -
```
["was", "the", "best", "of", "times"]
["times", "it", "was", "the", "worst"]
```
This is not germane to the discussion on embeddings.
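
For the curious, the windows above can be reproduced with a short sketch. The stride of `seq_len - 1` is an assumption read off the example (consecutive rows share one token); other windowing schemes are just as valid.

```python
text = "it was the best of times it was the worst of times"
tokens = text.split()

seq_len = 5
batch_size = 2
# Assumption: each window starts on the last token of the previous one.
stride = seq_len - 1

inputs, targets = [], []
for start in range(0, batch_size * stride, stride):
    inputs.append(tokens[start:start + seq_len])
    # The target is the input shifted one token to the right.
    targets.append(tokens[start + 1:start + seq_len + 1])

for row in inputs:
    print(row)
for row in targets:
    print(row)
```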

The string tokens in the above table will need to be converted to their corresponding indexes before being processed by the embedding layers. This can happen at a pre-processing step or for each batch. Regardless, we need a way to map the string token to its corresponding index. In the simple case each unique string token is given a unique index starting from 0. The pseudocode for creating this token to index mapping will look like -
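```python
# Toy training set for illustration; each row holds one raw text feature.
trainset = [
    {"feature_name": "I was very happy with"},
    {"feature_name": "Very good quality but bad"},
]

# Collect every unique token that appears in the feature.
alltoks = set()
for row in trainset:
    toks = row["feature_name"].strip().lower().split()
    alltoks.update(toks)

# Assign indexes 0..N-1. Sorting keeps the mapping the same across runs,
# because set iteration order is not stable for strings.
tok_to_idx = {tok: idx for idx, tok in enumerate(sorted(alltoks))}
```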
However, for most sparse features the number of unique tokens is very large, which makes the embedding table very big. E.g., a very big e-commerce company could have millions of products in its catalog. Another problem is that new items are always being added to the catalog, which would force us to retrain our embeddings every time a new product is added. To work around this problem we use the so-called hashing trick, where the token is hashed to a value within a reasonable range.
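
A minimal sketch of the hashing trick is below. The bucket count and the `hash_token` helper are hypothetical names of mine, and SHA-256 is just one stable hash that works here. Python's built-in `hash()` is avoided because its output for strings changes between processes.

```python
import hashlib

NUM_BUCKETS = 100_000  # hypothetical table height, picked for illustration


def hash_token(tok: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a string token to a stable index in [0, num_buckets)."""
    digest = hashlib.sha256(tok.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


product_ids = ["q2x8", "9juk", "90u7"]
idxs = [hash_token(pid) for pid in product_ids]
print(idxs)
```

The price of a fixed-size table is collisions: two different products can land in the same bucket and will then share an embedding row.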

Regardless of how this is done, we will eventually end up replacing the string tokens with their corresponding indexes. After replacement, the above table might look like -

| avg_amt | avg_n | n_days | purchased             | state | review                    |
| ------- | ----- | ------ | --------------------- | ----- | ------------------------- |
| 10.23   | 3     | 2      | [39091, 72705, 31948] | 46    | [41, 15, 191, 1751, 17]   |
| 28.21   | 10    | 0.5    | [65162, 31528, 39091] | 4     | [191, 219, 1506, 34, 978] |
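
Here is a rough sketch of how a review string could be turned into a fixed-length row of indexes, including the padding mentioned earlier. The tiny vocabulary, the reserved `PAD_IDX`/`UNK_IDX` values, and the `encode` helper are all assumptions made up for this example.

```python
import torch as t

# Hypothetical vocabulary: index 0 is reserved for padding, 1 for unknown tokens.
PAD_IDX, UNK_IDX = 0, 1
tok_to_idx = {"<pad>": PAD_IDX, "<unk>": UNK_IDX, "very": 2, "good": 3, "quality": 4, "happy": 5}


def encode(text, seq_len=5):
    """Keep the first seq_len tokens, map them to indexes, pad the end."""
    toks = text.strip().lower().split()[:seq_len]
    idxs = [tok_to_idx.get(tok, UNK_IDX) for tok in toks]
    idxs += [PAD_IDX] * (seq_len - len(idxs))
    return idxs


review = t.tensor([
    encode("Very good quality"),
    encode("Happy with it overall so far"),
])
print(review)
```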


## Model
Typically we will have one embedding table per sparse, categorical, or text feature. In the above e-commerce example, we will have three embedding tables. The height of the product embedding table will be the number of products in our catalog, the height of the home state embedding table will be the total number of states in the country, and the height of the review comment table will be the vocabulary size of my entire training set.

In the language model we will have a single embedding table whose height will be the same as the vocabulary size of my text corpus.

## Input Batch
In "regular" DNN models my input batch is an $m \times n$ float tensor where $m$ is the batch size and $n$ is the number of features. This entire tensor is fed to the first layer of my DNN. This setup won't work when I have mixed dense and sparse features. In the e-commerce model, I'll need to feed each sparse, categorical, and text feature to its own embedding table separately. 

To accommodate this, one way to structure my input batch is to have four tensors: the dense tensor is a float tensor of $m \times 3$ for the 3 dense features, the purchased tensor is an integer tensor of $m \times 3$ for the 3 product IDs in each row, the state tensor is an integer tensor of $m \times 1$, and the review tensor is an integer tensor of $m \times 5$ for the 5 words. The pseudocode for feeding an input batch into my DNN will be -
```python
for batch in traindl:
    dense, purchased, state, review = batch
    dense_out = model.mlp(dense)
    purchased_embs_batch = model.purchased_embeddings(purchased)
    state_embs_batch = model.state_embeddings(state)
    review_embs_batch = model.review_embeddings(review)
    # mix these outputs in some interaction arch
```

$$
dense = \begin{bmatrix}
10.23 & 3.0 & 2.0 \\
28.21 & 10.0 & 0.5 \\
\end{bmatrix}
$$

$$
purchased = \begin{bmatrix}
39091 & 72705 & 31948 \\
65162 & 31528 & 39091 \\
\end{bmatrix}
$$

$$
state = \begin{bmatrix}
46 \\
4 \\
\end{bmatrix}
$$

$$
review = \begin{bmatrix}
41 & 15 & 191 & 1751 & 17 \\
191 & 219 & 1506 & 34 & 978 \\
\end{bmatrix}
$$

Note that this is just one way to structure my input batch; there are other ways as well.
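
To tie the four tensors to a model, here is one possible sketch. The class name, the layer sizes, and the "reduce each feature to one vector and concatenate" interaction are all assumptions made for illustration, not a recommended architecture.

```python
import torch as t


class ECommerceModel(t.nn.Module):
    """Hypothetical model; every size below is an illustrative assumption."""

    def __init__(self, n_products=100_000, n_states=50, vocab_size=5_000, emb_dim=8):
        super().__init__()
        self.mlp = t.nn.Sequential(t.nn.Linear(3, emb_dim), t.nn.ReLU())
        self.purchased_embeddings = t.nn.Embedding(n_products, emb_dim)
        self.state_embeddings = t.nn.Embedding(n_states, emb_dim)
        self.review_embeddings = t.nn.Embedding(vocab_size, emb_dim)
        self.head = t.nn.Linear(4 * emb_dim, 1)

    def forward(self, dense, purchased, state, review):
        dense_out = self.mlp(dense)                                      # (m, d)
        purchased_vec = self.purchased_embeddings(purchased).sum(dim=1)  # (m, 3, d) -> (m, d)
        state_vec = self.state_embeddings(state).squeeze(1)              # (m, 1, d) -> (m, d)
        review_vec = self.review_embeddings(review).mean(dim=1)          # (m, 5, d) -> (m, d)
        # Simplest possible interaction: concatenate and apply a linear head.
        mixed = t.cat((dense_out, purchased_vec, state_vec, review_vec), dim=1)
        return self.head(mixed)


dense = t.tensor([[10.23, 3.0, 2.0], [28.21, 10.0, 0.5]])
purchased = t.tensor([[39091, 72705, 31948], [65162, 31528, 39091]])
state = t.tensor([[46], [4]])
review = t.tensor([[41, 15, 191, 1751, 17], [191, 219, 1506, 34, 978]])

model = ECommerceModel()
out = model(dense, purchased, state, review)
print(out.shape)
```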

Of course for the language model example the "regular" setup continues to work because -
  * The input is an $m \times n$ integer tensor, which is similar to the regular float input tensor.
  * The first layer is typically only the embedding layer that can be fed this entire input.

## Training
Like any other layer, `Embedding` layers can be trained as part of our DNN: their weights are regular parameters, so autograd computes gradients for them and optimizers know how to update them. However, in a lot of cases the embeddings are trained in some upstream model and are only used in the main model during forward propagation. The `Embedding` class has a convenience factory method, `from_pretrained`, to load pre-trained embeddings.

Let's say I have 100,000 items in my product catalog and my embedding dim is 8. Further, let's say I have obtained pre-trained product embeddings from some upstream model somewhere.

In [5]:
pre_trained_product_embs = t.rand((100_000, 8))

In [6]:
purchased_embeddings = t.nn.Embedding.from_pretrained(pre_trained_product_embs)
purchased_embeddings

Embedding(100000, 8)

In [7]:
purchased_batch = t.tensor([
    [39091, 72705, 31948],
    [65162, 31528, 39091]
])

In [8]:
purchased_embs_batch = purchased_embeddings(purchased_batch)
purchased_embs_batch

tensor([[[0.3683, 0.0027, 0.2053, 0.2611, 0.9904, 0.2372, 0.4524, 0.8966],
         [0.6351, 0.1236, 0.1456, 0.4281, 0.3921, 0.0811, 0.7195, 0.9048],
         [0.3984, 0.5188, 0.7603, 0.0374, 0.8143, 0.3342, 0.2345, 0.0059]],

        [[0.3247, 0.1131, 0.2974, 0.2839, 0.4851, 0.7985, 0.4210, 0.8215],
         [0.4340, 0.3442, 0.1957, 0.5510, 0.9181, 0.1466, 0.6244, 0.2238],
         [0.3683, 0.0027, 0.2053, 0.2611, 0.9904, 0.2372, 0.4524, 0.8966]]])

In [9]:
emb_39091 = pre_trained_product_embs[39091]
emb_72705 = pre_trained_product_embs[72705]
emb_31948 = pre_trained_product_embs[31948]
emb_65162 = pre_trained_product_embs[65162]
emb_31528 = pre_trained_product_embs[31528]

expected_purchased_embs_batch = t.stack((
    t.stack((emb_39091, emb_72705, emb_31948)),
    t.stack((emb_65162, emb_31528, emb_39091))
))
expected_purchased_embs_batch

tensor([[[0.3683, 0.0027, 0.2053, 0.2611, 0.9904, 0.2372, 0.4524, 0.8966],
         [0.6351, 0.1236, 0.1456, 0.4281, 0.3921, 0.0811, 0.7195, 0.9048],
         [0.3984, 0.5188, 0.7603, 0.0374, 0.8143, 0.3342, 0.2345, 0.0059]],

        [[0.3247, 0.1131, 0.2974, 0.2839, 0.4851, 0.7985, 0.4210, 0.8215],
         [0.4340, 0.3442, 0.1957, 0.5510, 0.9181, 0.1466, 0.6244, 0.2238],
         [0.3683, 0.0027, 0.2053, 0.2611, 0.9904, 0.2372, 0.4524, 0.8966]]])

In [10]:
expected_purchased_embs_batch.allclose(purchased_embs_batch)

True

The above was a roundabout and usage-specific way of seeing that the pre-trained embeddings are captured correctly in the embedding layer, i.e., `pre_trained_product_embs[i] == purchased_embeddings(t.tensor(i))`. Below is a more direct verification.

In [11]:
t.allclose(pre_trained_product_embs[39091], purchased_embeddings(t.tensor(39091)))

True
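
One detail worth knowing is that `from_pretrained` freezes the loaded weights by default, which matches the "forward propagation only" usage described above. The sketch below uses a small random tensor as a stand-in for real pre-trained vectors and shows the `freeze` flag.

```python
import torch as t

weights = t.rand((10, 4))  # stand-in for pre-trained vectors

# freeze=True is the default, so these weights get no gradient updates.
frozen = t.nn.Embedding.from_pretrained(weights)
# Pass freeze=False to fine-tune the embeddings along with the rest of the model.
trainable = t.nn.Embedding.from_pretrained(weights, freeze=False)

print(frozen.weight.requires_grad)
print(trainable.weight.requires_grad)
```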

### GloVe Embeddings
A common set of pre-trained embeddings for English is GloVe. For this notebook let's use the `glove.6B.100d` embeddings, which were trained on a corpus of around six billion tokens. The vocabulary has 400,000 unique tokens and the embedding dimension is 100.

In [12]:
glove_datapath = path.join(DATAROOT, "glove")
glove = tt.vocab.GloVe(name="6B", dim=100, cache=glove_datapath)

The GloVe dataset is very simple. Each line starts with the word or token followed by the vector values in the same line. Here is what the first line looks like -

```shell
(base) ॐ  glove $ head -1 glove.6B.100d.txt
the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062
```
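
Given that format, the file can also be parsed by hand. The sketch below is a rough illustration that builds a token to index dict and a vectors tensor from a couple of toy lines. The lines are shortened to three values each, and the values in the second line are made up.

```python
import torch as t

# Toy lines in the GloVe text format (real lines in this file carry 100 values).
lines = [
    "the -0.038194 -0.24487 0.72812",
    "of -0.1529 -0.24279 0.89837",  # made-up values
]

stoi, rows = {}, []
for idx, line in enumerate(lines):
    # The first field is the token, the rest are the vector values.
    tok, *vals = line.rstrip().split(" ")
    stoi[tok] = idx
    rows.append([float(v) for v in vals])

vectors = t.tensor(rows)
print(stoi)
print(vectors.shape)
```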

The `GloVe` class has a convenience dict called `stoi` to get the index of any given string in its corpus. The pre-trained embeddings are in a tensor called `vectors`.

In [13]:
glove.stoi["the"]

0

In [14]:
glove.vectors[0]

tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,  0.8278,  0.2706])

In [15]:
glove.vectors.shape

torch.Size([400000, 100])

In [16]:
review_embeddings = t.nn.Embedding.from_pretrained(glove.vectors)
review_embeddings

Embedding(400000, 100)

In [18]:
review_batch = t.tensor([
    [glove.stoi["i"], glove.stoi["was"], glove.stoi["very"], glove.stoi["happy"], glove.stoi["with"]],
    [glove.stoi["very"], glove.stoi["good"], glove.stoi["quality"], glove.stoi["but"], glove.stoi["bad"]],
])
review_batch

tensor([[  41,   15,  191, 1751,   17],
        [ 191,  219, 1506,   34,  978]])

In [19]:
review_embs_batch = review_embeddings(review_batch)
review_embs_batch.shape

torch.Size([2, 5, 100])

In [20]:
emb_i = glove.vectors[glove.stoi["i"]]
emb_was = glove.vectors[glove.stoi["was"]]
emb_very = glove.vectors[glove.stoi["very"]]
emb_happy = glove.vectors[glove.stoi["happy"]]
emb_with = glove.vectors[glove.stoi["with"]]
emb_good = glove.vectors[glove.stoi["good"]]
emb_quality = glove.vectors[glove.stoi["quality"]]
emb_but = glove.vectors[glove.stoi["but"]]
emb_bad = glove.vectors[glove.stoi["bad"]]

expected_review_embs_batch = t.stack((
    t.stack((emb_i, emb_was, emb_very, emb_happy, emb_with)),
    t.stack((emb_very, emb_good, emb_quality, emb_but, emb_bad))
))
expected_review_embs_batch.shape

torch.Size([2, 5, 100])

In [21]:
expected_review_embs_batch.allclose(review_embs_batch)

True

## Embedding Bags
As can be seen in the examples above, each row of sparse features in the batch results in multiple vectors after the embedding lookup. E.g., the first row of the purchased feature was $\begin{bmatrix}39091 & 72705 & 31948 \end{bmatrix}$ but it becomes -
$$
\begin{bmatrix}
0.3683 & 0.0027 & 0.2053 & 0.2611 & 0.9904 & 0.2372 & 0.4524 & 0.8966 \\
0.6351 & 0.1236 & 0.1456 & 0.4281 & 0.3921 & 0.0811 & 0.7195 & 0.9048 \\
0.3984 & 0.5188 & 0.7603 & 0.0374 & 0.8143 & 0.3342 & 0.2345 & 0.0059 \\
\end{bmatrix}
$$
after embedding lookup. A very common next step is to reduce these multiple vectors back into a single vector. This can be done by adding the vectors up, by averaging them, etc. The pseudocode for this will then be -
```python
purchased_embs = model.purchased_embeddings(purchased)
purchased_vec = t.sum(purchased_embs, axis=1)
```
Because this is such a common step, PyTorch has a convenience class called `EmbeddingBag` to do this. It is also more efficient than the above code, because it does not materialize the intermediate per-token embeddings.

In [67]:
purchased_batch

tensor([[39091, 72705, 31948],
        [65162, 31528, 39091]])

In [68]:
expected_purchased_vecs = t.sum(purchased_embeddings(purchased_batch), axis=1)
expected_purchased_vecs

tensor([[1.4017, 0.6451, 1.1112, 0.7266, 2.1968, 0.6525, 1.4064, 1.8073],
        [1.1270, 0.4600, 0.6984, 1.0960, 2.3936, 1.1823, 1.4978, 1.9419]])

In [69]:
# A more direct way of calculating the expected purchased vectors
expected_purchased_vecs = t.stack((
    emb_39091 + emb_72705 + emb_31948,
    emb_65162 + emb_31528 + emb_39091
))
expected_purchased_vecs

tensor([[1.4017, 0.6451, 1.1112, 0.7266, 2.1968, 0.6525, 1.4064, 1.8073],
        [1.1270, 0.4600, 0.6984, 1.0960, 2.3936, 1.1823, 1.4978, 1.9419]])

In [70]:
purchase_embeddings_bag = t.nn.EmbeddingBag.from_pretrained(pre_trained_product_embs, mode="sum")

In [71]:
purchased_vecs = purchase_embeddings_bag(purchased_batch)
purchased_vecs

tensor([[1.4017, 0.6451, 1.1112, 0.7266, 2.1968, 0.6525, 1.4064, 1.8073],
        [1.1270, 0.4600, 0.6984, 1.0960, 2.3936, 1.1823, 1.4978, 1.9419]])

In [72]:
expected_purchased_vecs.allclose(purchased_vecs)

True
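
Summing is not the only reduction. As a small sketch, the `mode` argument also accepts `"mean"` and `"max"`, and the same kind of check works for them. The random table and the indexes below are stand-ins picked for illustration.

```python
import torch as t

embs = t.rand((100, 8))  # stand-in for pre-trained vectors
batch = t.tensor([
    [3, 14, 15],
    [92, 65, 3]
])

lookup = t.nn.Embedding.from_pretrained(embs)
bag_mean = t.nn.EmbeddingBag.from_pretrained(embs, mode="mean")
bag_max = t.nn.EmbeddingBag.from_pretrained(embs, mode="max")

# Reference results computed with a plain lookup followed by a reduction.
expected_mean = lookup(batch).mean(dim=1)
expected_max = lookup(batch).max(dim=1).values

print(t.allclose(bag_mean(batch), expected_mean))
print(t.allclose(bag_max(batch), expected_max))
```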

### Uneven Tokens
While in the above example each row has the same number of tokens (3 in the case of the purchased feature), this is not always true. E.g., if my feature was the product IDs of products purchased in the last month, this will vary for each user and therefore each row will have a different number of product IDs.

| Purchase History |
| ---------------- |
| uxm3 a12o 8u2x   |
| eri8 wi3r        |
| w29k             |

To accommodate this use case, the `EmbeddingBag` layer can also take two flat integer tensors. The first is a flattened list of all the tokens in the batch. The second is a list of indexes demarcating where each example begins. Even though the documentation calls these "offsets", they mean offsets from the beginning of the list, so really...indexes 🤷🏾‍♂️

After replacing tokens with their indexes, let's say we get the following column -

$$
\begin{bmatrix}
\color{lightgreen}4445 & \color{lightgreen}5576 & \color{lightgreen}251 \\
\color{orange}8747 & \color{orange}8236 & \\
\color{cyan}880
\end{bmatrix}
$$

After flattening this batch we will get -
$$
\begin{aligned}
values &= \begin{bmatrix} \color{lightgreen}4445 & \color{lightgreen}5576 & \color{lightgreen}251 & \color{orange}8747 & \color{orange}8236 & \color{cyan}880\end{bmatrix} \\
offsets &= \begin{bmatrix} \color{lightgreen}0 & \color{orange} 3 & \color{cyan} 5 \end{bmatrix}
\end{aligned}
$$
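
The flattening step itself is easy to sketch. Assuming the ragged batch is available as a plain list of lists (the `rows` variable is my own naming), the offsets are the running total of the row lengths, shifted so that the first one is 0.

```python
import torch as t

# Ragged batch as a list of lists (hypothetical input format).
rows = [[4445, 5576, 251], [8747, 8236], [880]]

# Concatenate all the indexes into one flat tensor.
values = t.tensor([idx for row in rows for idx in row])

# Each offset is the position in `values` where a row starts.
lengths = t.tensor([len(row) for row in rows])
offsets = t.cat((t.zeros(1, dtype=t.long), lengths.cumsum(0)[:-1]))

print(values)
print(offsets)
```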

In [73]:
emb_4445 = pre_trained_product_embs[4445]
emb_5576 = pre_trained_product_embs[5576]
emb_251 = pre_trained_product_embs[251]
emb_8747 = pre_trained_product_embs[8747]
emb_8236 = pre_trained_product_embs[8236]
emb_880 = pre_trained_product_embs[880]

print("product 4445 = ", emb_4445)
print("product 5576 = ", emb_5576)
print("product 251 = ", emb_251)
print("product 8747 = ", emb_8747)
print("product 8236 = ", emb_8236)
print("product 880 = ", emb_880)

product 4445 =  tensor([0.8331, 0.7415, 0.4044, 0.3748, 0.8785, 0.6459, 0.0478, 0.6405])
product 5576 =  tensor([0.5124, 0.3451, 0.7072, 0.7841, 0.6407, 0.1628, 0.7097, 0.6242])
product 251 =  tensor([0.3455, 0.6196, 0.0983, 0.4827, 0.4779, 0.3397, 0.1890, 0.9614])
product 8747 =  tensor([0.9107, 0.8435, 0.9811, 0.5535, 0.9616, 0.1972, 0.9600, 0.9077])
product 8236 =  tensor([0.9844, 0.5585, 0.5481, 0.8700, 0.7604, 0.0689, 0.0379, 0.7440])
product 880 =  tensor([0.2010, 0.2330, 0.0684, 0.6042, 0.3434, 0.3394, 0.6218, 0.7564])


In [74]:
expected_purchased_vecs = t.stack((
    emb_4445 + emb_5576 + emb_251,
    emb_8747 + emb_8236,
    emb_880
))

expected_purchased_vecs

tensor([[1.6910, 1.7062, 1.2100, 1.6415, 1.9971, 1.1485, 0.9465, 2.2262],
        [1.8951, 1.4019, 1.5293, 1.4235, 1.7220, 0.2661, 0.9979, 1.6517],
        [0.2010, 0.2330, 0.0684, 0.6042, 0.3434, 0.3394, 0.6218, 0.7564]])

In [75]:
purchased_batch_flat = t.tensor([4445, 5576, 251, 8747, 8236, 880])
purchased_batch_offsets = t.tensor([0, 3, 5])
purchased_vecs = purchase_embeddings_bag(purchased_batch_flat, purchased_batch_offsets)
purchased_vecs

tensor([[1.6910, 1.7062, 1.2100, 1.6415, 1.9971, 1.1485, 0.9465, 2.2262],
        [1.8951, 1.4019, 1.5293, 1.4235, 1.7220, 0.2661, 0.9979, 1.6517],
        [0.2010, 0.2330, 0.0684, 0.6042, 0.3434, 0.3394, 0.6218, 0.7564]])

In [76]:
t.allclose(expected_purchased_vecs, purchased_vecs)

True
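
An alternative to flat values plus offsets is to pad every row to the same length and tell the bag which index is padding. This is a sketch that assumes a PyTorch version where `EmbeddingBag` supports `padding_idx`, and it assumes index 0 is reserved for padding, which is my own convention here. The random table is a stand-in for real embeddings.

```python
import torch as t

embs = t.rand((10_000, 8))  # stand-in for pre-trained vectors
PAD = 0                     # assumption: index 0 is reserved for padding

# Entries equal to padding_idx are left out of the reduction.
bag = t.nn.EmbeddingBag.from_pretrained(embs, mode="sum", padding_idx=PAD)

padded = t.tensor([
    [4445, 5576, 251],
    [8747, 8236, PAD],
    [880, PAD, PAD]
])
flat = t.tensor([4445, 5576, 251, 8747, 8236, 880])
offsets = t.tensor([0, 3, 5])

from_padded = bag(padded)
from_flat = bag(flat, offsets)
print(t.allclose(from_padded, from_flat))
```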