## PyTorch's nn.Embedding

In [2]:
import torch
from torch import nn

In [3]:
# define the dictionary
word_to_ix = {"p22": 0, "p23": 1, "p42": 2, "endp22": 3, "wait2": 4, "wait3": 5}

In [4]:
# set parameters
vocab_size = len(word_to_ix)
embedding_dim = 5

In [5]:
# example of word to integer
word_to_ix["p42"]

2

In [6]:
# create embedding layer
embeds = nn.Embedding(vocab_size, embedding_dim)  # 6 words in vocab, 5 dimensional embeddings

In [7]:
# convert text -> integer -> 1d-tensor
example_tensor = torch.tensor([word_to_ix["p42"]], dtype=torch.long)
print(example_tensor)

tensor([2])


In [17]:
# 1d-tensor of 10 token ids; the ids repeat, so the embedded rows repeat too
example_tensor = torch.tensor([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
example_tensor.shape

torch.Size([10])

In [18]:
# embed each id in the 1d-tensor as a 5-dim vector -> output shape (10, 5)
example_embed = embeds(example_tensor)
print(example_embed)

tensor([[ 1.4236, -0.5190,  0.0436,  0.0023,  1.0424],
        [-0.1578,  0.3750,  0.0773,  0.7948, -0.1578],
        [ 0.0985,  1.0753,  0.5511, -0.6132, -0.0939],
        [-0.4251,  0.9446,  2.0483,  0.0751, -0.1066],
        [ 2.0693,  1.5477, -1.0879, -0.3237,  0.2905],
        [ 1.4236, -0.5190,  0.0436,  0.0023,  1.0424],
        [-0.1578,  0.3750,  0.0773,  0.7948, -0.1578],
        [ 0.0985,  1.0753,  0.5511, -0.6132, -0.0939],
        [-0.4251,  0.9446,  2.0483,  0.0751, -0.1066],
        [ 2.0693,  1.5477, -1.0879, -0.3237,  0.2905]],
       grad_fn=<EmbeddingBackward>)


In [42]:
# example of embedding a batch of 5 words for a vocab size of 149

example_batch = torch.tensor([22, 23, 46, 52, 72])

embeds = nn.Embedding(149, 5)  # 149 words in vocab, 5 dimensional embeddings

test_embed = embeds(example_batch)  # look up the 5 ids -> output shape (5, 5)

print(test_embed)

tensor([[ 1.6425,  0.0796,  0.4239, -0.1386, -0.7610],
        [ 0.0021, -0.1807,  0.1980,  0.7447, -0.0712],
        [ 0.0964,  0.0788,  0.1742, -1.0710,  1.4882],
        [-0.4784,  0.8262,  0.7358, -1.5041,  0.8019],
        [-0.0998,  0.3637, -0.4066,  1.2723, -1.0680]],
       grad_fn=<EmbeddingBackward>)


## nn.Embedding

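1. define the vocab size (the first argument, num_embeddings)
2. takes in a tensor of integer token ids (e.g. 49, 23, 24), typically 1D; each id is a scalar index, not a one-hot vector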
3. no need for one-hot encoding
4. if the input is tensor([49, 23, 34]) and embedding_dim is 5
    - the output is a tensor of shape (3, 5)
    - in general, the output shape is (input_length, embedding_dim)
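
A minimal sketch of the points above, assuming the same 149-word vocab and 5-dimensional embeddings used in the earlier cell. It checks that the output shape is the input shape with `embedding_dim` appended, and that the lookup gives the same result as multiplying a one-hot matrix by `embeds.weight`, which is why no one-hot encoding is needed. The variable names (`ids`, `batch_ids`, `same`) are illustrative only.

```python
import torch
from torch import nn
import torch.nn.functional as F

vocab_size, embedding_dim = 149, 5  # values taken from the example above
embeds = nn.Embedding(vocab_size, embedding_dim)

# 1D input of 3 token ids -> output of shape (3, 5)
ids = torch.tensor([49, 23, 34])
out = embeds(ids)
print(out.shape)  # torch.Size([3, 5])

# 2D input (2 sequences of 3 ids each) -> output of shape (2, 3, 5)
batch_ids = torch.tensor([[49, 23, 34], [1, 2, 3]])
batch_out = embeds(batch_ids)
print(batch_out.shape)  # torch.Size([2, 3, 5])

# The lookup matches one-hot @ weight, without ever building the one-hot matrix
one_hot = F.one_hot(ids, num_classes=vocab_size).float()
same = torch.allclose(one_hot @ embeds.weight, out)
print(same)
```

The embedding layer is effectively a trainable lookup table: row `i` of `embeds.weight` is the vector for token id `i`.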

## to do
1. remove one-hot encoding
2. input should be 1d tensor
3. output is still the same?
    - usually output should be an embedding vector as well
    - then map it to one score per vocab entry and use argmax to get token_id
4. use jupyter notebook
5. remove keras's categorical
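
One possible way to handle the open item about the output is sketched below. It is an assumption, not something these notes settle: the model emits one score per vocab entry, and `torch.argmax` over that dimension gives a token id. The layer `to_vocab` and the tiny embed-then-project model are hypothetical stand-ins for the real network.

```python
import torch
from torch import nn

vocab_size, embedding_dim = 149, 5  # values taken from the example above

embeds = nn.Embedding(vocab_size, embedding_dim)
to_vocab = nn.Linear(embedding_dim, vocab_size)  # hypothetical output layer

ids = torch.tensor([22, 23, 46, 52, 72])  # 1d tensor of token ids, no one-hot encoding
logits = to_vocab(embeds(ids))            # shape (5, 149): one score per vocab entry
pred_ids = torch.argmax(logits, dim=-1)   # shape (5,): one predicted token id per input
print(logits.shape, pred_ids.shape)
```

With untrained weights the predicted ids are arbitrary; the sketch only shows the shapes and where argmax fits.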

## questions

1. should the output still be the same?
    - usually output should be an embedding vector as well
    - then map it to one score per vocab entry and use argmax to get token_id
2. what should the embedding dimension be?
    - use hyperparameter optimization to find the best value, e.g. searching a range from 50 to 1000
    - A common rule of thumb is the 4th root of the vocab size, e.g. 149^(1/4) ≈ 3.5
    - For large natural-language vocabularies, the typical number of dimensions is between 200 and 300.
    - The number of dimensions does not greatly impact how distances in the word embedding space encode semantic relationships. You can pick a power of 2 (64, 128, 256) to speed up model training.
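
The rule-of-thumb arithmetic above can be checked with a few lines of Python. This is only a sketch of the heuristics listed, not a recommendation; `next_power_of_two` is a hypothetical helper introduced here for illustration.

```python
vocab_size = 149  # vocab size from the example above

# Rule of thumb: 4th root of the vocab size
fourth_root = vocab_size ** 0.25
print(round(fourth_root, 2))  # 3.49


def next_power_of_two(x: float) -> int:
    """Hypothetical helper: smallest power of 2 that is >= x."""
    n = 1
    while n < x:
        n *= 2
    return n


# Rounding the heuristic up to a power of 2 gives a candidate embedding_dim
print(next_power_of_two(fourth_root))  # 4
```

For this vocab the heuristic lands close to the `embedding_dim = 5` used in the cells above; a hyperparameter search would still be needed to confirm the best value.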