Source: https://youtu.be/ISNdQcPhsts?si=_1mO7CBcvFHg15cJ

Umar Jamil has taken inspiration from the [Harvard pytorch transformer article](https://nlp.seas.harvard.edu/annotated-transformer/)

In [2]:
import torch
import torch.nn as nn
import numpy as np
import math

### Input Embedding

TODO:

- explore what `nn.Embedding` does


In [4]:
class InputEmbedding(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super(InputEmbedding, self).__init__()
        self.d_model = d_model  # in this paper, it 512
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)
        
    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)
        # check the last line on page 5: 
        # "In the embedding layers, we multiply those weights by d model."

### Positional Encoding

TODO:
- check Amirhossein Kazamnejad's blog on positional encoding

Umar Jamil uses the [Harvard pytorch transformer article implementation of positional encoding formula](https://nlp.seas.harvard.edu/annotated-transformer/#positional-encoding) mentioned in the paper by using log. He mentions in his video that applying log to an exponential nullifies the effect of log but makes the calculation more numerically stable. The value of the positional encoding calculated this way will be slightly different but the model will learn. Click [here](https://youtu.be/ISNdQcPhsts?si=HNaqDgkw6CfwgO-M&t=470) to watch that particular scene from the video.

Click [here](https://youtu.be/ISNdQcPhsts?si=cvEfkDJyW7LiBqkn&t=720) to see the reasoning behind using `self.register_buffer("pe", pe)`. The reasoning that when we want to save some variable not as a learned parameter (like weights and biases) but we want it to be saved when we save the file of the model, the we should register it as a buffer. This way it will be saved along with the state of the model.

Original formula:

$$PE_{(pos, 2i)} = sin \left( \frac{pos}{10000^{\frac{2i}{d_{model}}}} \right)$$

$$PE_{(pos, 2i+1)} = cos \left( \frac{pos}{10000^{\frac{2i}{d_{model}}}} \right)$$

<br></br>

Modified formula by Harvard Transformer article:



In [47]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super(PositionalEncoding, self).__init__()
        self.d_model = d_model  # in this paper, it 512
        self.seq_len = seq_len  # maximum length of the sequence
        self.dropout = nn.Dropout(p=dropout)
        # create a matrix of shape (seq_len, d_model)
        # pe stands for positional encoding
        pe = torch.zeros(seq_len, d_model)
        # create a vector of shape (seq_len, 1)
        position = torch.arange(0, seq_len, dtype=torch.float32).unsqueeze(1)
        # now, we will create the denominator of the positional encoding formulae
        # since it is a bit long, we will break it into a few lines
        # first, we need a vector containing multiples of 2 from 0 to d_model (here, 512)
        # this line is because of the 2i term which is the power of 10000
        # thus, this vector provides for the numbers we need for 2i
        vector = torch.arange(0, d_model, 2, dtype=torch.float32)
        # now, we raise 10,000 to the power of 2i/d_model
        denominator_original = torch.pow(10000, vector/d_model)
        # this is the one used by Harvard Transformer article
        denominator_harvard = torch.exp(vector * (-math.log(10000.0)/d_model))
        # we apply sin for even dimension and cos for odd dimenion
        # apply sin and store it in even indices of pe
        pe[:, 0::2] = torch.sin(position * denominator_original)
        # apply cos and store it in odd indices of pe
        pe[:, 1::2] = torch.cos(position * denominator_original)
        # we need to add the batch dimension so that we can apply it to 
        # batches of sentences
        pe = pe.unsqueeze(0)  # new shape: (1, seq_len, d_model)
        # register the pe tensor as a buffer so that it can be saved along with the
        # state of the model
        self.register_buffer("pe", pe)
        
    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)

Let's see how the positional encoding works by doing it on a smaller example.

In [46]:
def dummyfn():
    seq_len = 10
    d_model = 10
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float32).unsqueeze(1)
    vector = torch.arange(0, d_model, 2, dtype=torch.float32)
    denominator_original = torch.pow(10000, vector/d_model)
    denominator_harvard = torch.exp(vector * (-math.log(10000.0)/d_model))
    pe[:, 0::2] = torch.sin(position * denominator_original)
    pe[:, 1::2] = torch.cos(position * denominator_original)
    print(pe, pe[:, 0::2], pe[:, 1::2], sep='\n\n\n')

dummyfn()

tensor([[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000,
          0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.0264,  0.9997,  0.8573, -0.5148, -0.1383,  0.9904,
          0.9992,  0.0402],
        [ 0.9093, -0.4161,  0.0528,  0.9986, -0.8827, -0.4699, -0.2739,  0.9618,
          0.0803, -0.9968],
        [ 0.1411, -0.9900,  0.0791,  0.9969,  0.0516,  0.9987, -0.4042,  0.9147,
         -0.9927, -0.1205],
        [-0.7568, -0.6536,  0.1054,  0.9944,  0.8296, -0.5584, -0.5268,  0.8500,
         -0.1600,  0.9871],
        [-0.9589,  0.2837,  0.1316,  0.9913, -0.9058, -0.4237, -0.6393,  0.7690,
          0.9799,  0.1993],
        [-0.2794,  0.9602,  0.1577,  0.9875,  0.1031,  0.9947, -0.7395,  0.6732,
          0.2392, -0.9710],
        [ 0.6570,  0.7539,  0.1837,  0.9830,  0.7997, -0.6005, -0.8254,  0.5645,
         -0.9606, -0.2778],
        [ 0.9894, -0.1455,  0.2095,  0.9778, -0.9265, -0.3764, -0.8955,  0.4450,
         -0.3160,  0.9488],
        [ 0.4121, -