The code (and other stuff like images) is heavily inspired and taken from the following sources:
- [Attention is all you need - Vaswani, et al](https://arxiv.org/pdf/1706.03762)
- [Harvard Transformer Implementation](https://nlp.seas.harvard.edu/annotated-transformer/)
- [Umar Jamil Transformer Implementation](https://github.com/hkproj/pytorch-transformer)
- [Umar Jamil Transformer video](https://youtu.be/ISNdQcPhsts?si=_1mO7CBcvFHg15cJ)
- [Datacamp transformer tutorial](https://www.datacamp.com/tutorial/building-a-transformer-with-py-torch)

**This notebook is used for trying out and finalizing the code, after which the relevant code is put in the corresponding python files, so that the model can be run from the terminal.**

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import random
import math
import time
import os
import sys
import argparse
import matplotlib.pyplot as plt
from torchinfo import summary

**Code skeleton:**
- input and output embedding
- positional encoding
- encoder
    - multi-head attention
    - feed forward network
    - layer normalization and residual connections
- decoder
    - masked multi-head attention
    - multi-head attention
    - feed forward network
    - layer normalization and residual connections
- final linear layer and softmax (head)
- full transformer architecture

### Input and Output Embedding



In [3]:
class InputEmbedding(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super(InputEmbedding, self).__init__()
        # you can also do this:
        # super().__init__()
        self.d_model = d_model  # in this paper, it 512
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)
        
    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)
        # check the last line on page 5: 
        # "In the embedding layers, we multiply those weights by d model."

In [4]:
class OutputEmbedding(nn.Module):
    pass

### Positional Encoding

This is the one mentioned in the [Attention is all you need](https://arxiv.org/pdf/1706.03762) paper.
- $\displaystyle PE_{(pos, 2i)} = sin \left( \frac{pos}{10000^{\frac{2i}{\text{d\_model}}}} \right)$ 
    
- &nbsp;  $\displaystyle PE_{(pos, 2i+1)} = cos \left( \frac{pos}{10000^{\frac{2i}{\text{d\_model}}}} \right)$

To make the above 2 formulae more numerically stable, we apply logarithms and then raise them to exponents. You can see PE_derivation.png for the derivation:

- $\displaystyle PE_{(pos, 2i)} = sin \left( pos \times e^{\frac{-2i}{d} ln(10000)} \right)$ 
    
- &nbsp;  $\displaystyle PE_{(pos, 2i+1)} = cos \left( pos \times e^{\frac{-2i}{d} ln(10000)} \right)$

### Positional Encoding

TODO: check Amirhossein Kazamnejad's blog on positional encoding

Umar Jamil uses the [Harvard pytorch transformer article implementation of positional encoding formula](https://nlp.seas.harvard.edu/annotated-transformer/#positional-encoding) mentioned in the paper by using log. He mentions in his video that applying log to an exponential nullifies the effect of log but makes the calculation more numerically stable. The value of the positional encoding calculated this way will be slightly different but the model will learn. Click [here](https://youtu.be/ISNdQcPhsts?si=HNaqDgkw6CfwgO-M&t=470) to watch that particular scene from the video.

Click [here](https://youtu.be/ISNdQcPhsts?si=cvEfkDJyW7LiBqkn&t=720) to see the reasoning behind using `self.register_buffer("pe", pe)`. The reasoning that when we want to save some variable not as a learned parameter (like weights and biases) but we want it to be saved when we save the file of the model, the we should register it as a buffer. This way it will be saved along with the state of the model.

In [None]:
def sin_cos_pe(q_len, d_model, normalize=True):
    """
    Applies the sin-cos positional encoding to the input and output embedding.
    q_len: number of patches
    d_model: dimension of the model
    normalize: whether to normalize the positional encoding or not

    Returns: A tensor of shape (q_len, d_model)
    """
    # initialize the positional encoding matrix of size (q_len, d_model) with zeros
    pe = torch.zeros(q_len, d_model)
    
    # this is the 'pos' in the formula
    pos = torch.arange(0, q_len).unsqueeze(1)
    
    # this is the 'i' in the formula
    # no need to multiply by 2, because we are considering only even numbers
    i = torch.arange(0, d_model, 2)

    # below we implement the numerically more stable version of the sin and cos 
    # positional encoding (you can check the PE_derivation.png for the derivation) 
    denominator = torch.exp(-(i/d_model) * math.log(10000.0))

    # apply sin positional encoding to the even indices
    pe[:, 0::2] = torch.sin(pos * denominator)
    # apply cos positional encoding to the odd indices
    pe[:, 1::2] = torch.cos(pos * denominator)

    # normalize the positional encoding, ie, subtract the mean and 
    # divide by the standard deviation
    if normalize:
        # i think we are multiplying by 10 so as to make sure we don't get very small values
        # this is because division by very small values can lead to very large values
        pe = (pe - pe.mean()) / (pe.std() * 10)

    return pe



In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super(PositionalEncoding, self).__init__()
        self.d_model = d_model  # in this paper, it 512
        self.seq_len = seq_len  # maximum length of the sequence
        self.dropout = nn.Dropout(p=dropout)
        # create a matrix of shape (seq_len, d_model)
        # pe stands for positional encoding
        pe = torch.zeros(seq_len, d_model)
        # create a vector of shape (seq_len, 1)
        position = torch.arange(0, seq_len, dtype=torch.float32).unsqueeze(1)
        # now, we will create the denominator of the positional encoding formulae
        # since it is a bit long, we will break it into a few lines
        # first, we need a vector containing multiples of 2 from 0 to d_model (here, 512)
        # this line is because of the 2i term which is the power of 10000
        # thus, this vector provides for the numbers we need for 2i
        vector = torch.arange(0, d_model, 2, dtype=torch.float32)
        # now, we raise 10,000 to the power of 2i/d_model
        denominator_original = torch.pow(10000, vector/d_model)
        # this is the one used by Harvard Transformer article
        denominator_harvard = torch.exp(vector * (-math.log(10000.0)/d_model))
        # we apply sin for even dimension and cos for odd dimenion
        # apply sin and store it in even indices of pe
        pe[:, 0::2] = torch.sin(position * denominator_original)
        # apply cos and store it in odd indices of pe
        pe[:, 1::2] = torch.cos(position * denominator_original)
        # we need to add the batch dimension so that we can apply it to 
        # batches of sentences
        pe = pe.unsqueeze(0)  # new shape: (1, seq_len, d_model)
        # register the pe tensor as a buffer so that it can be saved along with the
        # state of the model
        self.register_buffer("pe", pe)
        
    def forward(self, x):
        # we don't want to train the positional encoding, ie, we don't want to make it
        # a learnable parameter, so we set its requires_grad to False
        x = x + self.pe[:, :x.size(1)].requires_grad_(False)  # (batch, seq_len, d_model)
        return self.dropout(x)

In [57]:
class PositionalEncoding(nn.Module):
    """
    Applies the sin-cos positional encoding to the input and output embedding.
    max_seq_len: The maximum length of the sequence for which positional encodings 
    are pre-computed.
    d_model: dimension of the input
    
    Returns: A tensor of shape (batch_size, max_seq_len, d_model)
    """
    def __init__(self, max_seq_len, d_model):
        super(PositionalEncoding, self).__init__()
        # initialize the positional encoding matrix of size (max_seq_len, d_model) 
        # with zeros
        pe = torch.zeros(max_seq_len, d_model)
        # this is the 'pos' in the formula
        # we want it to be of shape (max_seq_len, 1)
        pos = torch.arange(0, max_seq_len).unsqueeze(1)
        # this is the 'i' in the formula
        # no need to multiply by 2, because we are considering only even numbers
        i = torch.arange(0, d_model, 2)
        # below we implement the numerically more stable version of the sin and cos
        # positional encoding (you can check the PE_derivation.png for the derivation)
        denominator = torch.exp(-(i/d_model) * math.log(10000.0))
        # apply sin positional encoding to the even indices
        pe[:, 0::2] = torch.sin(pos * denominator)
        # apply cos positional encoding to the odd indices
        pe[:, 1::2] = torch.cos(pos * denominator)
        # we need to add the batch dimension so that we can apply it to 
        # batches of sentences
        pe = pe.unsqueeze(0)  # new shape: (1, max_seq_len, d_model)
        # pe is registered as a buffer, which means it will be part of the module's 
        # state but will not be considered a trainable parameter.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # we don't want to train the positional encoding, ie, we don't want to make it
        # a learnable parameter, so we set its requires_grad to False
        # shape of x is (batch_size, seq_len, d_model)
        x = x + self.pe[:, :x.size(1)].requires_grad_(False)
        return x
    

Let's see how the positional encoding works by doing it on a smaller example.

In [None]:
def dummyfn1():
    seq_len = 10
    d_model = 10
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float32).unsqueeze(1)
    vector = torch.arange(0, d_model, 2, dtype=torch.float32)
    denominator_original = torch.pow(10000, vector/d_model)
    denominator_harvard = torch.exp(vector * (-math.log(10000.0)/d_model))
    pe[:, 0::2] = torch.sin(position * denominator_original)
    pe[:, 1::2] = torch.cos(position * denominator_original)
    print(pe, pe[:, 0::2], pe[:, 1::2], sep='\n\n\n')

dummyfn1()

tensor([[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000,
          0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.0264,  0.9997,  0.8573, -0.5148, -0.1383,  0.9904,
          0.9992,  0.0402],
        [ 0.9093, -0.4161,  0.0528,  0.9986, -0.8827, -0.4699, -0.2739,  0.9618,
          0.0803, -0.9968],
        [ 0.1411, -0.9900,  0.0791,  0.9969,  0.0516,  0.9987, -0.4042,  0.9147,
         -0.9927, -0.1205],
        [-0.7568, -0.6536,  0.1054,  0.9944,  0.8296, -0.5584, -0.5268,  0.8500,
         -0.1600,  0.9871],
        [-0.9589,  0.2837,  0.1316,  0.9913, -0.9058, -0.4237, -0.6393,  0.7690,
          0.9799,  0.1993],
        [-0.2794,  0.9602,  0.1577,  0.9875,  0.1031,  0.9947, -0.7395,  0.6732,
          0.2392, -0.9710],
        [ 0.6570,  0.7539,  0.1837,  0.9830,  0.7997, -0.6005, -0.8254,  0.5645,
         -0.9606, -0.2778],
        [ 0.9894, -0.1455,  0.2095,  0.9778, -0.9265, -0.3764, -0.8955,  0.4450,
         -0.3160,  0.9488],
        [ 0.4121, -

In [None]:
def dummyfn2():
    torch.manual_seed(42)
    seq_len = 4
    d_model = 4
    dropout = 0.2
    x = torch.randn(d_model, seq_len)
    obj = PositionalEncoding(d_model, seq_len, dropout)
    return obj(x)

dummyfn2()

tensor([[[ 2.4086,  3.1091,  1.1259, -0.0000],
         [ 1.8999, -0.8678, -0.6868, -0.9279],
         [ 0.1965,  1.5407, -1.5822, -0.0000],
         [-0.0000, -1.9368, -2.2107,  0.9254]]])

### Multi-Head Attention

Queries, Keys, and Values are all just the duplication of the input for the encoder. In other words, in the encoder block, we store the same value of input in queries, keys, and values. So, they are all the same thing. You can also think of them as just the input used 3 times.

&nbsp;  

Check [this](https://sentry.io/answers/difference-between-staticmethod-and-classmethod-function-decorators-in-python/#:~:text=We%20can%20decorate%20a%20function,object%20to%20it%2C%20as%20below.&text=This%20can%20be%20useful%20when,the%20instance%20it's%20called%20on.) article for information on `@staticmethod`. Basically, when you put `@staticmethod` on top of a method in a class, then that method does not take the `self` argument, which is the object of the class.

&nbsp;  

**Scaled dot-product attention:**

$$\text{Attention(Q,K,V)} = \text{softmax} \left( \frac{Q K^T}{\sqrt{d_k}} \right) \cdot V$$

&nbsp;

**Multi-head attention:**

$$\text{MultiHead(Q,K,V)} = \text{Concat}(\text{head}_1, \; ..., \; \text{head}_h) \cdot W^O$$

$$, \text{where head}_i = \text{Attention}(Q \cdot W_i^Q, \;\;\; K \cdot W_i^K, \;\;\; V \cdot W_i^V)$$

, where the projections are parameter matrices:
- $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$
- $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$
- $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$
- $W_i^O \in \mathbb{R}^{(h \cdot d_v) \times d_{\text{model}}}$
- $\text{h = number of heads} = 8 = \text{number of parallel attention layers}$
- $d_{model} = 512$
- $d_k = d_v = d_{model} // h = 64, \quad \text{where `//' is integer division}$


&nbsp;  

- The encoder has 6 identical layers.
- Each layer has 2 sub-layers:
    1. multi-head attention
    2. feed-forward network
- Each of the sub-layer is connected by a residual connection, followed by layer normalization, ie, `LayerNorm(x + sub-layer(x))`, where `sub-layer` is multi-head attention and feed-forward network, ie,
    - `LayerNorm(x + multihead_attention(x))`
    - `LayerNorm(x + feed_forward_network(x))`
- To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension `d_model = 512`.

&nbsp;  

**TODO**: understand why `dim=-1` in the function `def scaled_dotproduct_attention` on the line `attn_weights = F.softmax(attn_scores, dim=-1)  # why dim=-1?`.

In [9]:
torch.arange(10).reshape(2,-1).transpose(-2,-1).shape
# .transpose(-2,-1)

torch.Size([5, 2])

In [6]:
def fn ():
    a = torch.arange(40).reshape(2,5,4)
    return a.size(-1)

fn()

4

In [27]:
def fn():
    d_model = 5
    d_k = 4
    m = nn.Linear(d_model, d_k)
    x1 = torch.randn(5, d_model)
    x2 = torch.randn(10, d_model)
    x3 = torch.randn(1, 12, d_model)
    print(x1.size())
    print(x2.size())
    print(x3.size())
    # note: the number of input features of m should be equal to the last dimension of x
    print(m(x1).size())
    print(m(x2).size())
    print(m(x3).size())    

fn()

torch.Size([5, 5])
torch.Size([10, 5])
torch.Size([1, 12, 5])
torch.Size([5, 4])
torch.Size([10, 4])
torch.Size([1, 12, 4])


In [29]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8, dropout=0.0):
        """
        d_model: dimension of the model (size of the embedding vector)
        num_heads: number of heads in multihead attention
        dropout: dropout probability
        """
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model           # embedding vector size
        self.h = h                       # number of heads
        # make sure d_model is divisible by h
        assert d_model % h == 0, "d_model must be divisible by h"
        # we assume d_v is always equal to d_k, so we just write d_k insead of d_v
        self.d_k = d_model // h          # dimension of vector seen by each head
        # weight matrices for Q, K, V, and O
        # The shapes of these matrices are mentioned in the paper
        # I have written them above for reference 
        
        self.wq = nn.Linear(d_model, self.d_k, bias=False)     # Wq
        self.wk = nn.Linear(d_model, self.d_k, bias=False)     # Wk
        self.wv = nn.Linear(d_model, self.d_k, bias=False)     # Wv
        # self.wo = nn.Linear(h*self.d_k, d_model, bias=False)   # Wo
        
        # self.wq = nn.Linear(d_model, d_model, bias=False)  # Wq
        # self.wk = nn.Linear(d_model, d_model, bias=False)  # Wk
        # self.wv = nn.Linear(d_model, d_model, bias=False)  # Wv
        self.wo = nn.Linear(d_model, d_model, bias=False)  # Wo

        # if dropout is zero, then nn.Dropout does not do anything
        self.dropout = nn.Dropout(p=dropout)  


    @staticmethod
    def scaled_dotproduct_attention(query, key, value, mask=None, dropout=None):
        """
        Compute the scaled dot-product attention.
        
        query: the query tensor
        key: the key tensor
        value: the value tensor
        mask: the mask tensor
        dropout: the dropout probability
        """
        d_k = query.size(-1)
        attn_scores = torch.matmul(query, key.transpose(-2,-1)) / math.sqrt(d_k)
        if mask is not None:
            # write a very low value (indicating -infinity) to the positions where mask == 0, 
            # this will tell softmax to replace those values with zero
            attn_scores = attn_scores.masked_fill(mask==0, -1e9)
        attn_weights = F.softmax(attn_scores, dim=-1)  # why dim=-1?
        attn_weights = dropout(attn_weights)
        output = torch.matmul(attn_weights, value)
        return output, attn_weights

    
    def forward(self, q, k, v, mask=None):
        """
        q: query matrix of shape (batch_size, q_len, d_model)
        k: key matrix of shape (batch_size, k_len, d_model)
        v: value matrix of shape (batch_size, v_len, d_model)
        mask: mask to prevent attention to certain positions

        Returns: output matrix of shape (batch_size, q_len, d_model)
        """
        # linear transformation for Q, K, and V
        # multiply Wq matrix by q
        # this matrix multiplication does not change the shape of q
        query = self.wq(q)              # (batch_size, q_len, d_model)
        print(1,query.shape,sep='\t-->\t')
        # similary for key and value
        key = self.wk(k)
        print(2,key.shape,sep='\t-->\t')
        value = self.wv(v)
        print(3,value.shape,sep='\t-->\t')
        # (batch, seq_len, d_model) --> (batch, seq_len, num_heads, d_k) 
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k)
        print(4,query.shape,sep='\t-->\t')
        # (batch, seq_len, num_heads, d_k) --> (batch, num_heads, seq_len, d_k)
        query = query.transpose(1,2) # interchange the indices 1 and 2 with each other
        print(5,query.shape,sep='\t-->\t')
        # similarly, we will change the dimensions of key and value
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k)
        print(6,key.shape,sep='\t-->\t')
        key = key.transpose(1,2)
        print(7,key.shape,sep='\t-->\t')
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k)
        print(8,value.shape,sep='\t-->\t')
        value = value.transpose(1,2)
        print(9,value.shape,sep='\t-->\t')
        x, self.attn_scores = MultiHeadAttention.scaled_dotproduct_attention(query, key, 
                                                                             value, mask,
                                                                             self.dropout)
        print(10,x.shape,self.attn_scores.shape,sep='\t-->\t')
        # combine all the heads together
        # (batch, num_heads, seq_len, d_k) --> (batch, seq_len, num_heads, d_k)
        x = x.transpose(1,2)
        print(11,x.shape,sep='\t-->\t')
        # (batch, seq_len, num_heads, d_k) --> (batch, seq_len, d_model)
        x = x.contiguous().view(x.shape[0], -1, self.h * self.d_k)
        print(12,x.shape,sep='\t-->\t')
        # now, multiply by Wo
        # this matrix multiplication does not change the shape of x
        # (batch, seq_len, d_model) --> (batch, seq_len, d_model)
        x = self.wo(x)
        print(13,x.shape,sep='\t-->\t')
        return x




In [58]:
# Second way of doing multihead attention

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        # Ensure that the model dimension (d_model) is divisible by the number of heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        # Initialize dimensions
        self.d_model = d_model # Model's dimension
        self.num_heads = num_heads # Number of attention heads
        self.d_k = d_model // num_heads # Dimension of each head's key, query, and value
        
        # Linear layers for transforming inputs
        self.W_q = nn.Linear(d_model, d_model) # Query transformation
        self.W_k = nn.Linear(d_model, d_model) # Key transformation
        self.W_v = nn.Linear(d_model, d_model) # Value transformation
        self.W_o = nn.Linear(d_model, d_model) # Output transformation
        

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Calculate attention scores
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        # Apply mask if provided (useful for preventing attention to certain parts like padding)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        
        # Softmax is applied to obtain attention probabilities
        attn_probs = torch.softmax(attn_scores, dim=-1)
        
        # Multiply by values to obtain the final output
        output = torch.matmul(attn_probs, V)
        return output
        

    def split_heads(self, x):
        # Reshape the input to have num_heads for multi-head attention
        batch_size, seq_length, d_model = x.size()
        # Recall that d_model = num_heads * d_k   OR   d_k = d_model // num_heads
        assert d_model == (self.num_heads * self.d_k), "d_model must be equal \
            to self.num_heads * self.d_k"
        x = x.view(batch_size, seq_length, self.num_heads, self.d_k)
        x = x.transpose(1, 2)
        # shape of x becomes: [batch_size, self.num_heads, seq_length, self.d_k]
        return x
        
        
    def combine_heads(self, x):
        """
        x: tensor of shape [batch_size, num_heads, seq_length, d_k]

        Returns a tensor of shape [batch_size, seq_length, d_model]

        Description:
        Combine the multiple heads back to original shape, ie,
        we want the shape of x to be [batch_size, seq_length, d_model]
        """
        # After calling split_heads function, the shape of x became:
        # [batch_size, self.num_heads, seq_length, self.d_k]
        batch_size, num_heads, seq_length, d_k = x.size()
        
        assert num_heads == self.num_heads, "Number of heads must be equal to self.num_heads"
        assert d_k == self.d_k, "d_k must be equal to self.d_k"
        assert self.d_model == (num_heads * d_k), "d_model must be equal to \
            self.num_heads * self.d_k"
        
        x = x.transpose(1, 2).contiguous()
        # shape of x becomes: [batch_size, seq_length, self.num_heads, self.d_k]
        x = x.view(batch_size, seq_length, self.d_model)
        return x
        

    def forward(self, Q, K, V, mask=None):
        # Apply linear transformations to Q, K, V
        Q = self.W_q(Q)
        K = self.W_k(K)
        V = self.W_v(V)
        
        # Split heads
        Q = self.split_heads(Q)
        K = self.split_heads(K)
        V = self.split_heads(V)

        # Perform scaled dot-product attention
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Combine heads 
        heads_combined = self.combine_heads(attn_output)
        
        # Apply output transformation
        output = self.W_o(heads_combined)
        return output
    
    

In [46]:
print(torch.randn(4,5,8).T.size())
print(torch.randn(4,5,8).transpose(-2,-1).size())

torch.Size([8, 5, 4])
torch.Size([4, 8, 5])


In [54]:
def fn():
    a = torch.arange(1,10).reshape(3,3)
    print(a, "\n")
    mask = torch.tensor([[1,1,0], [1,0,1], [0,1,1]])
    b = a.masked_fill(mask==0, -100)
    c = a.masked_fill(mask==1, 100)
    print(b,c,sep='\n\n')

fn()

tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]) 

tensor([[   1,    2, -100],
        [   4, -100,    6],
        [-100,    8,    9]])

tensor([[100, 100,   3],
        [100,   5, 100],
        [  7, 100, 100]])


In [35]:
def fn():
    x = torch.randn(1,2,15)
    x = x.view(1,2,3,5)
    print(x.size())
    x = x.transpose(1,2)
    print(x.size())
    
fn()

torch.Size([1, 2, 3, 5])
torch.Size([1, 3, 2, 5])


In [30]:
def fn():
    d_model = 512
    h = 8
    d_k = d_model // h
    dropout = 0.0
    mha = MultiHeadAttention(d_model, h, dropout)
    x = torch.randn(1, d_k, d_model)
    mha(x, x, x)
    # return summary(model=mha,
    #         input_size=([(1, d_k, d_model), (1, d_k, d_model), (1, d_k, d_model)]),
    #         dtypes=[torch.float32, torch.float32, torch.float32],
    #         col_names=("input_size", "output_size", "num_params", "trainable"),
    #         col_width=20,
    #         row_settings=["var_names"])

fn()

1	-->	torch.Size([1, 64, 64])
2	-->	torch.Size([1, 64, 64])
3	-->	torch.Size([1, 64, 64])


RuntimeError: shape '[1, 64, 8, 64]' is invalid for input of size 4096

`.view()` is used to reshape a tensor. We can reshape a tensor using `.view()` and stored the reshaped version in another variable. We must note that `.view()` passes the reference of a tensor, ie, memory address of the tensor. So, if we make changes to one tensor, then they get reflected in the other tensor as well. 


Check [this](https://stackoverflow.com/questions/48915810/what-does-contiguous-do-in-pytorch) for information on the use of `.contiguous()` in PyTorch.

Instead of using the following line of code: 
```python
x = x.contiguous().view(x.shape[0], -1, self.num_heads * self.d_k)
```

we can use the following:

```python
x = x.reshape(x.shape[0], -1, self.num_heads * self.d_k)
```

In [None]:
def dummyfn1():
    A = torch.ones(2,3)
    B = A.view(1,6)
    B[0,3] = 0  # change 1 at index [0,3] in B, this will also change the 
    # the 1 in A at the corresponding index
    print(A, B, sep='\n\n')
    pass
    
dummyfn1()

tensor([[1., 1., 1.],
        [0., 1., 1.]])

tensor([[1., 1., 1., 0., 1., 1.]])


In [37]:
def dummyfn1():
    A = torch.ones(3,4)
    A = A.transpose(-1,-2)
    A = A.view(A.shape[1], -1)  # we get error because we didn't use contiguous
    return A

dummyfn1()

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

In [None]:
def dummyfn1():
    A = torch.ones(3,4)
    A = A.transpose(-1,-2)
    A = A.contiguous().view(A.shape[1], -1)
    return A

dummyfn1()

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])

In [None]:
def dummyfn1():
    A = torch.ones(3,4)
    A = A.transpose(-1,-2)
    # instead of using contiguous and view, we can use reshape 
    A = A.reshape(A.shape[1], -1)
    return A

dummyfn1()

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])

**3 ways of doing transpose in PyTorch**

In [None]:
torch.ones(2,3)

tensor([[1., 1., 1.],
        [1., 1., 1.]])

In [None]:
torch.transpose(torch.ones(2,3), -1, -2)

tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])

In [None]:
torch.ones(2,3).T

tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])

### (Position-wise) Feed-Forward Network

$$\text{FFN}(x) = max(0, \;\; xW_1 + b_1)W_2 \; + \; b_2$$

, where:
- $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$
- $W_1 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$

&nbsp;  

See section 3.3 on page 5 of the paper.

In [64]:
class FeedForwardNetwork(nn.Module):
    """Implementation of the FFN equation."""
    def __init__(self, d_model = 512, d_ff = 2048, dropout: float = 0.0):
        super(FeedForwardNetwork, self).__init__()
        # the shapes are mentioned in the paper
        # I have written them above for reference
        self.w1 = nn.Linear(d_model, d_ff, bias=True)
        self.w2 = nn.Linear(d_ff, d_model, bias=True)
        self.relu = nn.ReLU()
        # if dropout is zero, then nn.Dropout does not do anything
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, x):
        x = self.relu(self.w1(x))
        x = self.dropout(x)
        x = self.w2(x)
        return x
     

### Layer Normalization

In layer normalization, we calculate the mean and variance of each data point independently from other data points. Then, we calculate new values for each data point using their own mean and their own variance.

Note: $\text{variance} = \text{(standard deviation)}^2$

We will use this formula:

$$\hat{x}_j = \alpha \times \left(\frac{x_j - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}\right) + \beta $$

, where:
- $\alpha$ is the multiplicative factor
- $\beta$ is the additive factor

In [27]:
def fn():
    a = torch.arange(20, dtype=torch.float32)
    print(a)
    mu1 = a.mean(dim=-1, keepdim=True)
    mu2 = a.mean()
    print(mu1, mu2, sep='\n\n\n')
    b = torch.sqrt(a[16])
    print(b)

fn()

tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13.,
        14., 15., 16., 17., 18., 19.])
tensor([9.5000])


tensor(9.5000)
tensor(4.)


In [73]:
class LayerNormalization(nn.Module):
    """This is the 'norm' part in the 'add & norm' block in the paper."""
    def __init__(self, features: int, eps: float = 1e-6) -> None:
        super(LayerNormalization, self).__init__()
        self.eps = eps
        # instead of simply doing self.alpha = torch.ones(1)
        # we use nn.Parameter() so that when we call the state dict of the model
        # we are able to see this alpha
        # only using torch.ones() won't allow us to see this alpha
        self.alpha = nn.Parameter(torch.ones(features))  # multiplied
        self.beta = nn.Parameter(torch.zeros(features))  # added
        
    def forward(self, x):
        # apply mean after the batch dimension
        # mean usually cancels the dimension to which it is applied,  
        # but we want to keep it
        mean = x.mean(dim=-1, keepdim=True)
        # similarly for standard deviation
        std = x.std(dim=-1, keepdim=True)
        # apply the layer normalization
        fraction = (x - mean) / (torch.sqrt(std**2 + self.eps))
        x_normalized = self.alpha * fraction + self.beta
        return x_normalized

### Residual Connections

In [77]:
class ResidualConnection(nn.Module):
    """This is the 'add' part in the 'add & norm' block."""
    def __init__(self, d_model: int, dropout: float = 0.0) -> None:
        super(ResidualConnection, self).__init__()
        # if dropout is zero, then nn.Dropout does not do anything
        self.dropout = nn.Dropout(p=dropout)
        self.norm = LayerNormalization(d_model)
    
    def forward(self, x, sublayer):
        """
        x: input
        sublayer: different layers of the transformer architecture (eg: multi-head
        attention, feed-forward network, etc.), we will pass these layers as
        functions to this class.
        
        Returns the skip or residual connection.
        """
        # most implementations first do normalization and then pass x to the sublayer
        # we will also do this way
        return x + self.dropout(sublayer(self.norm(x)))
        # however, the paper first passes x to the sublayer and then does the norm
        # return x + self.dropout(self.norm(sublayer(x)))

### Encoder

In [92]:
class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.0):
        super(EncoderBlock, self).__init__()
        self.mha = MultiHeadAttention(d_model, h, dropout)
        self.ffn = FeedForwardNetwork(d_model, d_ff, dropout)
        self.rc = ResidualConnection(d_model, dropout)

    def forward(self, x):
        """
        x: position-aware embedding (positional encoding + input embedding)
        """
        sublayer1 = lambda x: self.mha(x,x,x)  # q=x, k=x, v=x
        x = self.rc(x, sublayer1)
        x = self.rc(x, self.ffn)
        return x
    

In [63]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.0):
        super(EncoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForwardNetwork(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, mask=None):
        attn_output = self.mha(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x
    

In [96]:
def fn():
    enb = EncoderBlock(512, 8, 2048, 0.0)
    summary(model=enb,
            input_size=(64, 100, 512),
            col_names=("input_size", "output_size", "num_params", "trainable"),
            col_width=20,
            row_settings=["var_names"])

fn()

RuntimeError: Failed to run torchinfo. See above stack traces for more details. Executed layers up to: [LayerNormalization: 2, Linear: 2, Linear: 2, Linear: 2]

In [76]:
def fn():
    sublayer1 = lambda x: x + 1
    return sublayer1(x=2)

fn()

3

In [86]:
class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, h: int, dropout: float) -> None:
        self.mha = MultiHeadAttention(d_model, h, dropout)
        self.ffn = FeedForwardNetwork(d_model, d_ff, dropout)
        self.rc = ResidualConnection(d_model, dropout)
        # store 2 residual connection layers
        # we'l use one after self-attention layer and the other after feed-forward 
        # network as shown in figure 1 of the paper
        self.res_con = nn.ModuleList([ResidualConnection(d_model, dropout)
                                      for _ in range(2)])
        
    def forward(self, x, src_mask):
        # we apply the source mask because we don't want the padding word to 
        # interact with other words
        x = self.res_con[0](x, lambda x: self.mha(x,x,x,src_mask))
        x = self.res_con[1](x, self.ffn)

In [87]:
def fn():
    enb = EncoderBlock(512, 2048, 8, 0.0)
    print(enb)

fn()

AttributeError: cannot assign module before Module.__init__() call

### Decoder

In [None]:
class DecoderBlock(nn.Module):
    def __init__(self, features: int, selfattn_block: MultiHeadAttention,
                 crossattn_block: MultiHeadAttention, dropout: float,
                 feedforward_block: FeedForwardNetwork) -> None:
        super(DecoderBlock, self).__init__()
        self.selfattn_block = selfattn_block
        self.crossattn_block = crossattn_block
        self.feedforward_block = feedforward_block
        self.res_con = nn.ModuleList([ResidualConnection(features, dropout) 
                                      for _ in range(3)])
        
    def forward(self, x, encoder_output, src_mask, tgt_mask):
        x = self.res_con[0](x, lambda x: self.selfattn_block(x, x, x, tgt_mask))
        x = self.res_con[1](x, lambda x: self.crossattn_block(x, encoder_output, 
                                                              encoder_output, src_mask))
        x = self.res_con[2](x, self.feedforward_block)
        return x

In [None]:
class Decoder(nn.Module):
    def __init__(self, features: int, layers: nn.ModuleList):
        super(Decoder, self).__init__()
        self.layers = layers
        self.norm = LayerNormalization(features)
        
    def forward(self, x, encoder_output, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return self.norm(x)

In [61]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForwardNetwork(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, x, enc_out, src_mask=None, tgt_mask=None):
        # multi-head attention
        selfattn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(selfattn_output))
        # masked multi-head attention
        # query from decoder
        # key and value from encoder
        crossattn_output = self.cross_attn(x, enc_out, enc_out, src_mask)
        x = self.norm2(x + self.dropout(crossattn_output))
        # feed forward network
        ffn_output = self.ffn(x)
        x = self.norm3(x + self.dropout(ffn_output))
        return x
    

In [None]:
class ProjectionLayer(nn.Module):
    def __init__(self, d_model, vocab_size) -> None:
        super(ProjectionLayer, self).__init__()
        self.proj = nn.Linear(d_model, vocab_size)
    
    def forward(self, x):
        # (batch, seq_len, d_model) --> (batch, seq_len, vocab_size)
        return self.proj(x)

### The Transformer Class (collection of all the above methods)

In [None]:
class Transformer(nn.Module):
    def __init__(self, encoder: Encoder, decoder: Decoder, src_embed: InputEmbedding,
                 tgt_embed: InputEmbedding, src_pos: PositionalEncoding, 
                 tgt_pos: PositionalEncoding, proj_layer: ProjectionLayer) -> None:
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.src_pos = src_pos
        self.tgt_pos = tgt_pos
        self.proj_layer = proj_layer
    
    def encode(self, src, src_mask):
        # (batch, seq_len, d_model)
        src = self.src_embed(src)
        src = self.src_pos(src)
        return self.encoder(src, src_mask)
    
    def decode(self, encoder_output, src_mask, tgt, tgt_mask):
        # (batch, seq_len, d_model)
        tgt = self.tgt_embed(tgt)
        tgt = self.tgt_pos(tgt)
        return self.decoder(tgt, encoder_output, src_mask, tgt_mask)
    
    def project(self, x):
        # (batch, seq_len, vocab_size)
        return self.proj_layer(x)

In [66]:
EncoderLayer(512,8,2048,0.0)

EncoderLayer(
  (mha): MultiHeadAttention(
    (W_q): Linear(in_features=512, out_features=512, bias=True)
    (W_k): Linear(in_features=512, out_features=512, bias=True)
    (W_v): Linear(in_features=512, out_features=512, bias=True)
    (W_o): Linear(in_features=512, out_features=512, bias=True)
  )
  (ffn): FeedForwardNetwork(
    (w1): Linear(in_features=512, out_features=2048, bias=True)
    (w2): Linear(in_features=2048, out_features=512, bias=True)
    (relu): ReLU()
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (dropout): Dropout(p=0.0, inplace=False)
)

In [69]:
nn.ModuleList([
    EncoderLayer(512,8,2048,0.0) 
    for _ in range(6)
    ])

ModuleList(
  (0): EncoderLayer(
    (mha): MultiHeadAttention(
      (W_q): Linear(in_features=512, out_features=512, bias=True)
      (W_k): Linear(in_features=512, out_features=512, bias=True)
      (W_v): Linear(in_features=512, out_features=512, bias=True)
      (W_o): Linear(in_features=512, out_features=512, bias=True)
    )
    (ffn): FeedForwardNetwork(
      (w1): Linear(in_features=512, out_features=2048, bias=True)
      (w2): Linear(in_features=2048, out_features=512, bias=True)
      (relu): ReLU()
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (1): EncoderLayer(
    (mha): MultiHeadAttention(
      (W_q): Linear(in_features=512, out_features=512, bias=True)
      (W_k): Linear(in_features=512, out_features=512, bias=True)
      (W_v): Linear(in_features=512, out_features=512, bias=Tr

In [67]:
DecoderLayer(512,8,2048,0.0)

DecoderLayer(
  (self_attn): MultiHeadAttention(
    (W_q): Linear(in_features=512, out_features=512, bias=True)
    (W_k): Linear(in_features=512, out_features=512, bias=True)
    (W_v): Linear(in_features=512, out_features=512, bias=True)
    (W_o): Linear(in_features=512, out_features=512, bias=True)
  )
  (cross_attn): MultiHeadAttention(
    (W_q): Linear(in_features=512, out_features=512, bias=True)
    (W_k): Linear(in_features=512, out_features=512, bias=True)
    (W_v): Linear(in_features=512, out_features=512, bias=True)
    (W_o): Linear(in_features=512, out_features=512, bias=True)
  )
  (ffn): FeedForwardNetwork(
    (w1): Linear(in_features=512, out_features=2048, bias=True)
    (w2): Linear(in_features=2048, out_features=512, bias=True)
    (relu): ReLU()
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (norm3): LayerNorm((512,), eps=1e-05, 

In [68]:
nn.Embedding(5000,512)

Embedding(5000, 512)

In [96]:
def fn():
    a = torch.triu(torch.ones(4, 4, dtype=torch.long), diagonal=1)
    print(a,"\n")
    b = torch.triu(torch.ones(4, 4, dtype=torch.long), diagonal=0)
    print(b,"\n")
    c = a & b
    print(c)

fn()

tensor([[0, 1, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 0]]) 

tensor([[1, 1, 1, 1],
        [0, 1, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 1]]) 

tensor([[0, 1, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 0]])


In [83]:
def fn():
    src = torch.tensor([[11,0,13], [0,15,16], [17,18,0]], dtype=torch.long)
    tgt = torch.tensor([[1,0,3], [0,5,6], [7,8,0]], dtype=torch.long)
    print(f"src.shape = {src.shape}")
    print(f"tgt.shape = {tgt.shape}")
    # Create a mask for the source sequence (src)
    src_mask = (src != 0)
    print(f"src_mask.shape = {src_mask.shape}")
    src_mask = src_mask.unsqueeze(1)
    print(f"src_mask.shape = {src_mask.shape}")
    src_mask = src_mask.unsqueeze(2)
    print(f"src_mask.shape = {src_mask.shape}")
    # Create a mask for the target sequence (tgt)
    tgt_mask = (tgt != 0)
    print(f"tgt_mask.shape = {tgt_mask.shape}")
    tgt_mask = tgt_mask.unsqueeze(1)
    print(f"tgt_mask.shape = {tgt_mask.shape}")
    tgt_mask = tgt_mask.unsqueeze(3)
    print(f"tgt_mask.shape = {tgt_mask.shape}")
    # Determine the length of the target sequence
    seq_length = tgt.size(1)
    print(f"seq_length = {seq_length}")
    # Create a no-peak mask for the target sequence to prevent peeking into future tokens
    ones_vector = torch.ones(1, seq_length, seq_length, dtype=torch.long)
    print(f"ones_vector.shape = {ones_vector.shape}")
    print(f"ones_vector = \n{ones_vector}")
    peak_mask = torch.triu(ones_vector, diagonal=1)
    peak_mask_bool = peak_mask.bool()
    print(f"peak_mask.shape = {peak_mask.shape}")
    print(f"peak_mask = \n{peak_mask}")
    print(f"peak_mask_bool.shape = {peak_mask_bool.shape}")
    print(f"peak_mask_bool = \n{peak_mask_bool}")
    nopeak_mask = (1 - peak_mask)
    nopeak_mask_bool = nopeak_mask.bool()
    print(f"nopeak_mask.shape = {nopeak_mask.shape}")
    print(f"nopeak_mask = \n{nopeak_mask}")
    print(f"nopeak_mask_bool.shape = {nopeak_mask_bool.shape}")
    print(f"nopeak_mask_bool = \n{nopeak_mask_bool}")
    # Combine the padding mask and the no-peak mask for the target sequence
    print(f"tgt_mask = \n{tgt_mask}\n")
    print(f"nopeak_mask = \n{nopeak_mask}")
    tgt_mask = tgt_mask & nopeak_mask
    print(f"tgt_mask = \n{tgt_mask}")
    

fn()

src.shape = torch.Size([3, 3])
tgt.shape = torch.Size([3, 3])
src_mask.shape = torch.Size([3, 3])
src_mask.shape = torch.Size([3, 1, 3])
src_mask.shape = torch.Size([3, 1, 1, 3])
tgt_mask.shape = torch.Size([3, 3])
tgt_mask.shape = torch.Size([3, 1, 3])
tgt_mask.shape = torch.Size([3, 1, 3, 1])
seq_length = 3
ones_vector.shape = torch.Size([1, 3, 3])
ones_vector = 
tensor([[[1, 1, 1],
         [1, 1, 1],
         [1, 1, 1]]])
peak_mask.shape = torch.Size([1, 3, 3])
peak_mask = 
tensor([[[0, 1, 1],
         [0, 0, 1],
         [0, 0, 0]]])
peak_mask_bool.shape = torch.Size([1, 3, 3])
peak_mask_bool = 
tensor([[[False,  True,  True],
         [False, False,  True],
         [False, False, False]]])
nopeak_mask.shape = torch.Size([1, 3, 3])
nopeak_mask = 
tensor([[[1, 0, 0],
         [1, 1, 0],
         [1, 1, 1]]])
nopeak_mask_bool.shape = torch.Size([1, 3, 3])
nopeak_mask_bool = 
tensor([[[ True, False, False],
         [ True,  True, False],
         [ True,  True,  True]]])
tgt_mask =

In [109]:
F.softmax(torch.tensor([0.667, 0.354, 0.281]), dim=-1)

tensor([0.4148, 0.3033, 0.2819])

In [110]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, 
                 num_layers, d_ff, max_seq_len, dropout=0.0):
        """
        src_vocab_size: Source vocabulary size.
        tgt_vocab_size: Target vocabulary size.
        d_model: The dimensionality of the model's embeddings.
        num_heads: Number of attention heads in the multi-head attention mechanism.
        num_layers: Number of layers for both the encoder and the decoder.
        d_ff: Dimensionality of the inner layer in the feed-forward network.
        max_seq_len: Maximum sequence length for positional encoding.
        dropout: Dropout rate for regularization.
        """
        super(Transformer, self).__init__()
        self.enc_emb = nn.Embedding(src_vocab_size, d_model)
        self.dec_emb = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_enc = PositionalEncoding(max_seq_len, d_model)

        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout) 
            for _ in range(num_layers)
            ])
        
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout) 
            for _ in range(num_layers)
            ])
        
        self.proj = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)


    def generate_mask(self, src, tgt):
        """
        This method is used to create masks for the source and target sequences, 
        ensuring that padding tokens are ignored and that future tokens are not 
        visible during training for the target sequence.
        """
        # Create a mask for the source sequence (src)
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        # Create a mask for the target sequence (tgt)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        # Determine the length of the target sequence
        seq_length = tgt.size(1)
        # Create a no-peak mask for the target sequence to prevent peeking into future tokens
        # create a vector of ones of shape (1, seq_length, seq_length)
        ones_vector = torch.ones(1, seq_length, seq_length)
        # Create an upper triangular mask starting above the diagonal (i.e., diagonal=1). 
        # This results in a matrix where elements above the diagonal are set to one, and 
        # elements on and below the diagonal are zero.
        peak_mask = torch.triu(ones_vector, diagonal=1)
        # Invert the peak mask to create a no-peak mask
        nopeak_mask = (1 - peak_mask)
        # Convert the nopeak mask to a boolean mask
        nopeak_mask = nopeak_mask.bool()
        # Combine the padding mask and the no-peak mask for the target sequence
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask


    def forward(self, src, tgt):
        """
        This method defines the forward pass for the Transformer, taking source 
        and target sequences and producing the output predictions. Below we outline
        the steps involved in the forward pass:
        1. Input Embedding and Positional Encoding: The source and target sequences 
        are first embedded using their respective embedding layers and then added to 
        their positional encodings.
        2. Encoder Layers: The source sequence is passed through the encoder layers, 
        with the final encoder output representing the processed source sequence.
        3. Decoder Layers: The target sequence and the encoder's output are passed 
        through the decoder layers, resulting in the decoder's output.
        4. Final Linear Layer: The decoder's output is mapped to the target vocabulary 
        size using a fully connected (linear) layer.
        
        Output: The final output is a tensor representing the model's predictions for 
        the target sequence.
        """
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.pos_enc(self.enc_emb(src)))
        tgt_embedded = self.dropout(self.pos_enc(self.dec_emb(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.proj(dec_output)
        # calculate output probabilities
        output_prob = F.softmax(output, dim=-1)
        return output_prob
    

### Final Transformer Code

In [None]:
def build_transformer(src_vocab_size: int, tgt_vocab_size: int, src_seq_len: int, 
                      tgt_seq_len: int, d_model: int = 512, Nx: int = 6, h: int = 8,
                      dropout: float = 0.1, d_ff: int = 2048) -> Transformer:
    # Create the input and output embedding layers
    src_embed = InputEmbedding(d_model, src_vocab_size)
    tgt_embed = InputEmbedding(d_model, tgt_vocab_size)
    
    # Create the input and output positional encoding layers
    src_pos = PositionalEncoding(d_model, src_seq_len, dropout)
    tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)
    
    # Create the encoder blocks
    encoder_blocks = []
    for _ in range(Nx):
        encoder_selfattn_block = MultiHeadAttentionBlock(d_model, h, dropout)
        encoder_feedforward_block = PositionWiseFeedForward(d_model, d_ff, dropout)
        encoder_block = EncoderBlock(d_model, encoder_selfattn_block, 
                                     encoder_feedforward_block, dropout)
        encoder_blocks.append(encoder_block)
        
    # Create the decoder blocks
    decoder_blocks = []
    for _ in range(Nx):
        decoder_selfttn_block = MultiHeadAttentionBlock(d_model, h, dropout)
        decoder_crossattn_block = MultiHeadAttentionBlock(d_model, h, dropout)
        decoder_feedforward_block = PositionWiseFeedForward(d_model, d_ff, dropout)
        decoder_block = DecoderBlock(d_model, decoder_selfttn_block, 
                                     decoder_crossattn_block, decoder_feedforward_block,
                                     dropout)
        decoder_blocks.append(decoder_block)
        
    # Create the encoder and decoder
    encoder = Encoder(d_model, nn.ModuleList(encoder_blocks))
    decoder = Decoder(d_model, nn.ModuleList(decoder_blocks))
    
    # Create the projection layer
    proj_layer = ProjectionLayer(d_model, tgt_vocab_size)
    
    # Create the transformer
    transformer = Transformer(encoder, decoder, src_embed, tgt_embed, src_pos, 
                              tgt_pos, projection_layer)
    
    # Initialize the parameters
    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    
    return transformer