# **ATTENTION IS ALL YOU NEED!**

## Language Translator

Reference: [Attention is all you need (Transformer) - Model explanation (including math), Inference and Training](https://www.youtube.com/@umarjamilai)

---

Hello Readers,

In this project, we are going to implement the transformer from the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762).

---

### Project Overview

This notebook will guide you through the following steps:

1. **Data Preparation:** Loading and preprocessing the English-Italian translation dataset.
2. **Model Architecture:** Building the transformer model as described in the paper.
3. **Training:** Training the transformer model on the dataset.
4. **Evaluation:** Evaluating the model's performance on test data.
5. **Inference:** Translating English sentences to Italian using the trained model.

### Workflow

We will start with the usual imports and a few helper functions, then jump directly into the model architecture, prepare the dataset for training, and finally perform inference. Variable names in the code follow the paper.

Let's get started!

## Model Architecture

<img src="images/model_arch.png" alt="Image" width="300"/>

#### Required imports

In [36]:
import torch
import torch.nn as nn
import math
import torch.nn.functional as F

#### Helper Functions

In [37]:
def count_parameters(model):
    """
    Count and print the number of trainable parameters in a model.

    This function iterates over all parameters in the given model,
    filters those that require gradients (trainable parameters), 
    and prints each parameter's count. Finally, it prints the total
    count of trainable parameters.

    Args:
        model (torch.nn.Module): The model for which to count parameters.

    Example:
        model = YourModel()
        count_parameters(model)
    """
    params = [p.numel() for p in model.parameters() if p.requires_grad]
    for item in params:
        print(f'{item:>6}')
    print(f'______\n{sum(params):>6}')

### Input Embeddings

<img src="images/input_embedding.png" alt="Image" width="1000"/>

In [38]:
class InputEmbeddings(nn.Module):
    """
    Input Embeddings Class

    This class defines the input embedding layer used in the transformer model.
    It converts input tokens to their corresponding embeddings and scales them
    by the square root of the model's dimensionality (d_model).

    Args:
        d_model (int): The dimensionality of the embeddings.
        vocab_size (int): The size of the vocabulary (i.e., the number of unique tokens).

    Methods:
        forward(x): Computes the embeddings for the input tokens and scales them.
    """
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # Define the embedding layer
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        """
        Forward pass for the input embeddings.

        Args:
            x (torch.Tensor): Input tensor containing token indices.

        Returns:
            torch.Tensor: Scaled embeddings for the input tokens.
        """
        # Compute embeddings and scale them by sqrt(d_model)
        return self.embedding(x) * math.sqrt(self.d_model)


In [39]:
# Example usage
d_model = 512  # Dimension of the model
vocab_size = 25000  # Size of the vocabulary
input_tensor = torch.randint(low=0, high=vocab_size, size=(10, 50), dtype=torch.int)  # 10 sequences, 50 token ids each (high is exclusive)
input_embeddings = InputEmbeddings(d_model, vocab_size)
input_embeddings(input_tensor).shape

torch.Size([10, 50, 512])

In [40]:
count_parameters(input_embeddings) # weights=> 512*25000

12800000
______
12800000


### Positional Encoding

<img src="images/Positional_encoding_1.png" alt="Image" width="1000"/>
<img src="images/Positional_encoding_2.png" alt="Image" width="1000"/>

In [41]:
class PositionalEncoding(nn.Module):
    """
    Positional Encoding Class

    This class adds positional encoding to the input embeddings to provide 
    the model with information about the relative or absolute position of 
    tokens in the sequence. The implementation includes a dropout layer 
    for regularization.

    Args:
        d_model (int): The dimensionality of the embeddings.
        seq_len (int): The maximum length of the input sequences.
        dropout (float): The dropout rate for regularization.
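        div_term_implementation (str): Specifies the method to compute the 
            division term. 'original' evaluates 1 / 10000^(2i / d_model) directly;
            'modified' (the default) evaluates the same quantity as
            exp(-2i * ln(10000) / d_model).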
    
    Methods:
        forward(x): Adds positional encoding to the input tensor.
    """
    def __init__(self, d_model: int, seq_len: int, dropout: float, div_term_implementation: str = 'modified') -> None:
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)

        # Create a matrix of shape (seq_len, d_model) for positional encodings
        pe = torch.zeros(seq_len, d_model)
        # Create a vector of shape (seq_len, 1) for position indices
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)

        # Calculate the division term based on the specified implementation
        if div_term_implementation == 'original':
            div_term = 1.0 / (10000 ** (torch.arange(0, d_model, 2).float() / d_model))
        else:
            div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # Apply sine to even indices and cosine to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add an extra dimension for batch size
        pe = pe.unsqueeze(0)

        # Register the positional encoding matrix as a buffer
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Forward pass for the positional encoding.

        Args:
            x (torch.Tensor): Input tensor containing embeddings.

        Returns:
            torch.Tensor: Input tensor with added positional encoding.
        """
        # Add the fixed (non-trainable) positional encoding for the first x.shape[1] positions
        x = x + self.pe[:, :x.shape[1], :].requires_grad_(False)
        return self.dropout(x)


In [42]:
# Example usage
d_model = 512  # Dimension of the model
max_len = 5000  # Maximum length of the sequence
pos_encoder = PositionalEncoding(d_model, max_len, dropout=0.2)

# Dummy input tensor with shape (batch size, sequence length, d_model)
x = torch.zeros(32, 100, d_model)

# Apply positional encoding
x = pos_encoder(x)
print(x.shape)  # Should output: torch.Size([32, 100, 512])

torch.Size([32, 100, 512])


In [43]:
count_parameters(pos_encoder)

______
     0
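
The two `div_term_implementation` options should produce the same frequencies, since 1 / 10000^(2i / d_model) equals exp(-2i * ln(10000) / d_model). The pure-Python sketch below checks this numerically and builds the encoding for a single position. The function names `div_term_original`, `div_term_modified`, and `encode_position` are illustrative and not part of the notebook.

```python
import math

def div_term_original(d_model):
    # 1 / 10000^(2i / d_model), evaluated directly (i runs over the even indices)
    return [1.0 / (10000 ** (i / d_model)) for i in range(0, d_model, 2)]

def div_term_modified(d_model):
    # The same quantity evaluated as exp(-2i * ln(10000) / d_model)
    return [math.exp(i * (-math.log(10000.0) / d_model)) for i in range(0, d_model, 2)]

def encode_position(pos, d_model):
    # Interleave sin (even indices) and cos (odd indices) for one position
    row = []
    for freq in div_term_modified(d_model):
        row.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return row

original = div_term_original(512)
modified = div_term_modified(512)
same = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(original, modified))
print(same)
print(encode_position(0, 8))
```

At position 0 every sine term is 0 and every cosine term is 1, which is a quick sanity check for the even/odd interleaving.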


### Layer Normalization

<img src="images/Layer_Normalization.png" alt="Image" width="1000"/>

In [44]:
class LayerNormalization(nn.Module):
    """
    Layer Normalization Class

    This class applies layer normalization to a batch of inputs.
    Each feature vector (i.e., the last dimension) is normalized to
    have zero mean and unit standard deviation, then scaled and
    shifted by learnable parameters.

    Args:
        features (int): Size of the last dimension of the input (d_model).
        eps (float): A small value to avoid division by zero during normalization. 
                     Default is 1e-6.
    
    Methods:
        forward(x): Applies layer normalization to the input tensor.
    """
    def __init__(self, features: int, eps: float = 10**-6):
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(features))        # Learnable scaling parameter
        self.bias = nn.Parameter(torch.zeros(features))         # Learnable bias parameter

    def forward(self, x):
        """
        Forward pass for the layer normalization.

        Args:
            x (torch.Tensor): Input tensor to be normalized.

        Returns:
            torch.Tensor: Normalized input tensor with learnable scaling and bias.
        """

        # x: (batch, seq_len, hidden_size)
        mean = x.mean(dim=-1, keepdim=True) # (batch, seq_len, 1)
        std = x.std(dim=-1, keepdim=True) # (batch, seq_len, 1)
        # Apply normalization, scaling, and bias
        return self.alpha * (x - mean) / (std + self.eps) + self.bias


In [45]:
# Example usage
batch = 16
seq_len = 500
d_model = 512
# Input tensor of shape (batch_size, sequence_length, d_model)
x = torch.randn(batch, seq_len, d_model)
# Instantiate the LayerNormalization class
layer_norm = LayerNormalization(d_model)
# Apply layer normalization
output = layer_norm(x)

print("Input Shape:", x.shape)
print("Output Shape:", output.shape)

Input Shape: torch.Size([16, 500, 512])
Output Shape: torch.Size([16, 500, 512])


In [46]:
count_parameters(layer_norm)

   512
   512
______
  1024
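
The forward pass above divides by `x.std(...) + eps`, where `torch.std` applies Bessel's correction by default. PyTorch's built-in `nn.LayerNorm` instead divides by the square root of the biased variance plus `eps`. The sketch below, on an assumed random input, suggests the two differ only slightly for `d_model = 512`; it is a comparison for intuition, not a claim that either variant is the required one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 512
eps = 1e-6
x = torch.randn(4, 10, d_model)
mean = x.mean(dim=-1, keepdim=True)

# Formula used by the class above: unbiased std, eps added to the std
custom = (x - mean) / (x.std(dim=-1, keepdim=True) + eps)

# Built-in formula written out by hand: biased variance, eps inside the square root
biased_var = x.var(dim=-1, keepdim=True, unbiased=False)
manual_builtin = (x - mean) / torch.sqrt(biased_var + eps)

# Built-in module (its weight starts at 1 and its bias at 0)
builtin = nn.LayerNorm(d_model, eps=eps)(x)

max_diff = (custom - builtin).abs().max().item()
print(f"largest difference between the two variants: {max_diff:.5f}")
```

The gap comes from the factor sqrt((n - 1) / n) between the two standard deviations, so it shrinks as the feature dimension grows.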


### Feed Forward Block

In [47]:
class FeedForwardBlock(nn.Module):
    """
    FeedForward Block Class

    This class defines a feedforward neural network block used in the 
    transformer model. It consists of two linear transformations with 
    a ReLU activation in between, and dropout for regularization.

    Args:
        d_model (int): The dimensionality of the input and output features.
        d_ff (int): The dimensionality of the intermediate (hidden) features.
        dropout (float): The dropout rate for regularization.

    Methods:
        forward(x): Applies the feedforward block to the input tensor.
    """
    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        # First linear layer with input size d_model and output size d_ff
        self.linear_1 = nn.Linear(d_model, d_ff) # W1 and B1
        # Dropout layer
        self.dropout = nn.Dropout(dropout)
        # Second linear layer with input size d_ff and output size d_model
        self.linear_2 = nn.Linear(d_ff, d_model) # W2 and B2

    def forward(self, x):
        """
        Forward pass for the feedforward block.

        Args:
            x (torch.Tensor): Input tensor with shape (Batch, Seq_len, d_model).

        Returns:
            torch.Tensor: Output tensor with shape (Batch, Seq_len, d_model).
        """
        # Apply the first linear layer, ReLU activation, dropout, and then the second linear layer
        # (Batch, Seq_len, d_model) -> (Batch, Seq_len, d_ff) -> (Batch, Seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))


In [48]:
# Example usage
batch_size = 16
seq_len = 500
d_model = 512

# Input tensor of shape (batch_size, sequence_length, d_model)
x = torch.randn(batch_size, seq_len, d_model)  # Example input tensor

# Instantiate the FeedForwardBlock class
d_ff = 2048  # Dimension of the feedforward layer
dropout = 0.1  # Dropout rate
feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
# Apply the feedforward block
output = feed_forward_block(x)

print("Input Shape:", x.shape)
print("Output Shape:", output.shape)

Input Shape: torch.Size([16, 500, 512])
Output Shape: torch.Size([16, 500, 512])


In [49]:
count_parameters(feed_forward_block)

1048576
  2048
1048576
   512
______
2099712
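
The four numbers printed above are the weight and bias sizes of the two linear layers. A small arithmetic sketch (the helper names `linear_param_count` and `feed_forward_param_count` are illustrative) reproduces the total:

```python
def linear_param_count(in_features, out_features, bias=True):
    # A linear layer stores an (out_features x in_features) weight matrix
    # plus, optionally, one bias value per output feature
    return in_features * out_features + (out_features if bias else 0)

def feed_forward_param_count(d_model, d_ff):
    # linear_1: d_model -> d_ff, linear_2: d_ff -> d_model
    return linear_param_count(d_model, d_ff) + linear_param_count(d_ff, d_model)

print(feed_forward_param_count(512, 2048))  # 2099712
```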


### Multi-Head Attention

<img src="images/Multi_head_attention.png" alt="Image" width="1200"/>

In [50]:
class MultiHeadAttentionBlock(nn.Module):
    """
    Multi-Head Attention Block Class

    This class implements the multi-head attention mechanism as described 
    in the "Attention Is All You Need" paper. It allows the model to jointly 
    attend to information from different representation subspaces.

    Args:
        d_model (int): The dimensionality of the input and output features.
        h (int): The number of attention heads.
        dropout (float): The dropout rate for regularization.

    Methods:
        forward(q, k, v, mask): Applies multi-head attention to the input tensors.
    """
    def __init__(self, d_model: int, h: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.h = h
        assert d_model%h == 0, "d_model not divisible by h"

        self.d_k = d_model//h
        
        self.w_q = nn.Linear(d_model, d_model) #wq
        self.w_k = nn.Linear(d_model, d_model) #wk
        self.w_v = nn.Linear(d_model, d_model) #wv

        self.w_o = nn.Linear(d_model, d_model) #wo
        self.dropout = nn.Dropout(dropout)

    @staticmethod 
    def attention(query, key, value, mask, dropout: nn.Dropout):
        """
        Compute the attention scores and apply the attention mechanism.

        Args:
            query (torch.Tensor): Query tensor of shape (Batch, h, Seq_Len, d_k).
            key (torch.Tensor): Key tensor of shape (Batch, h, Seq_Len, d_k).
            value (torch.Tensor): Value tensor of shape (Batch, h, Seq_Len, d_k).
            mask (torch.Tensor): Mask tensor to avoid attending to certain positions.
            dropout (nn.Dropout): Dropout layer for regularization.

        Returns:
            torch.Tensor: The output tensor after applying attention.
            torch.Tensor: The attention scores.
        """
        d_k = query.shape[-1]
        
        # (Batch, h, seq_len, d_k) -> (Batch, h, seq_len, seq_len)
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)

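        if mask is not None:
            # Write a very low value (standing in for -inf) to positions where mask == 0.
            # masked_fill_ modifies the tensor in place; plain masked_fill returns a new tensor.
            attention_scores.masked_fill_(mask == 0, -1e9)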

        attention_scores = attention_scores.softmax(dim = -1) # (Batch, h, seq_len, seq_len)
        
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        # (batch, h, seq_len, seq_len) -> (batch, h, seq_len, d_k)
        # also return the attention scores, which are useful for visualization
        return (attention_scores @ value), attention_scores        

    def forward(self, q, k, v, mask):
        """
        Forward pass for the multi-head attention block.

        Args:
            q (torch.Tensor): Query tensor of shape (Batch, Seq_Len, d_model).
            k (torch.Tensor): Key tensor of shape (Batch, Seq_Len, d_model).
            v (torch.Tensor): Value tensor of shape (Batch, Seq_Len, d_model).
            mask (torch.Tensor): Mask tensor to avoid attending to certain positions.

        Returns:
            torch.Tensor: The output tensor after applying multi-head attention.
        """

        # Apply linear layers to get query, key, and value tensors
        query = self.w_q(q) # (Batch, Seq_Len, d_model) -> (Batch, Seq_Len, d_model)
        key = self.w_k(k) # (Batch, Seq_Len, d_model) -> (Batch, Seq_Len, d_model)
        value = self.w_v(v) # (Batch, Seq_Len, d_model) -> (Batch, Seq_Len, d_model)

        # (Batch, Seq_Len, d_model) -> (Batch, Seq_Len, h, d_k) -> (Batch, h, Seq_Len, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1,2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1,2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1,2)
        
        # Apply the attention mechanism
        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)

        # Reshape and transpose back to the original shape
        # (Batch, h, seq_len, d_k) -> (Batch, seq_len, h, d_k) -> (Batch, seq_len, d_model)
        x = x.transpose(1,2).contiguous().view(x.shape[0], -1, self.h*self.d_k)

        # Apply the final linear layer (Batch, seq_len, d_model)
        return self.w_o(x)

    

In [51]:
batch_size = 2
seq_len = 5
d_model = 16
h = 4
dropout = 0.1

# Create a random tensor with shape (Batch, Seq_Len, d_model)
q = torch.rand(batch_size, seq_len, d_model)
k = torch.rand(batch_size, seq_len, d_model)
v = torch.rand(batch_size, seq_len, d_model)
    
# Create a mask tensor to avoid attending to certain positions
mask = torch.ones(batch_size, 1, seq_len, seq_len)
mask[:, :, :, 2:] = 0  # Mask out the last three key positions

# Create a multi-head attention block
mha_block = MultiHeadAttentionBlock(d_model, h, dropout)

# Apply the multi-head attention block
output = mha_block(q, k, v, mask)
print("Output shape:", output.shape)

Output shape: torch.Size([2, 5, 16])
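
When masking attention scores, it matters whether the fill is done in place. `Tensor.masked_fill` returns a new tensor and leaves the original untouched, while `Tensor.masked_fill_` modifies the tensor in place. The toy tensors below are assumptions chosen to make the effect visible:

```python
import torch

scores = torch.zeros(1, 1, 2, 3)                   # (batch, h, query_len, key_len)
mask = torch.tensor([[[[1, 1, 0], [1, 1, 0]]]])    # last key position is masked

# Out-of-place call whose result is discarded: nothing changes
ignored = scores.clone()
ignored.masked_fill(mask == 0, -1e9)

# Keeping the returned tensor (or calling masked_fill_) applies the mask
filled = scores.masked_fill(mask == 0, -1e9)
weights = filled.softmax(dim=-1)

print(torch.equal(ignored, scores))
print(weights[0, 0, 0])
```

After the softmax, the masked key position receives zero weight and the remaining positions share the probability mass.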


In [52]:
count_parameters(mha_block)

   256
    16
   256
    16
   256
    16
   256
    16
______
  1088
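
The `view` and `transpose` calls in `forward` split `d_model` into `h` heads of size `d_k` and later merge them back. The sketch below uses a small assumed tensor to show that head `j` sees features `j * d_k` to `(j + 1) * d_k - 1` of each token, and that the merge step restores the original layout.

```python
import torch

batch, seq_len, d_model, h = 2, 5, 16, 4
d_k = d_model // h

x = torch.arange(batch * seq_len * d_model, dtype=torch.float).view(batch, seq_len, d_model)

# (batch, seq_len, d_model) -> (batch, seq_len, h, d_k) -> (batch, h, seq_len, d_k)
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)

# (batch, h, seq_len, d_k) -> (batch, seq_len, h, d_k) -> (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, h * d_k)

print(heads.shape)
print(torch.equal(merged, x))
```

The `contiguous()` call is needed because `transpose` returns a non-contiguous view, which `view` cannot reshape directly.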


### Residual Layer

<img src="images/Residual_network.png" alt="Image" width="300"/>

In [53]:
class ResidualConnection(nn.Module):
    """
    Residual Connection with Layer Normalization.

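    This module adds a residual connection around any given sublayer.
    Layer normalization is applied to the input before the sublayer
    (pre-norm), dropout is applied to the sublayer output, and the
    result is added back to the original input. Note that the paper
    normalizes after the addition instead: LayerNorm(x + Sublayer(x)).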

    Args:
        features (int): Number of input features
        dropout (float): Dropout rate to be applied after the sublayer.
    """
    def __init__(self, features: int, dropout: float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization(features)

    def forward(self, x, sublayer):
        """
        Apply residual connection to any sublayer.

        Args:
            x (torch.Tensor): Input tensor.
            sublayer (Callable): A sublayer function or module.

        Returns:
            torch.Tensor: Output tensor after applying the residual connection and dropout.
        """
        # Pre-norm: normalize, apply the sublayer, apply dropout, then add the input back
        return x + self.dropout(sublayer(self.norm(x)))


In [54]:
# Example usage

batch_size = 16
seq_len = 500
d_model = 512
dropout_rate = 0.1

# Create a random tensor with shape (Batch, Seq_Len, d_model)
x = torch.rand(batch_size, seq_len, d_model)

# Define a simple sublayer function for demonstration
class SimpleSublayer(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x):
        return F.relu(self.linear(x))

sublayer = SimpleSublayer(d_model)

# Create a residual connection block
residual_connection = ResidualConnection(d_model, dropout_rate)

# Apply the residual connection block with the sublayer
output = residual_connection(x, sublayer)
print("Output shape:", output.shape)

Output shape: torch.Size([16, 500, 512])


In [55]:
count_parameters(residual_connection)

   512
   512
______
  1024
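
The `forward` above normalizes the input before the sublayer (often called pre-norm), whereas the paper describes `LayerNorm(x + Sublayer(x))` (post-norm). The sketch below contrasts the two orderings using PyTorch's built-in `nn.LayerNorm` and an assumed toy `nn.Linear` sublayer, with dropout left out for clarity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or feed-forward
x = torch.randn(2, 3, d_model)

pre_norm = x + sublayer(norm(x))         # ordering used in this notebook
post_norm = norm(x + sublayer(x))        # ordering described in the paper

print(pre_norm.shape, post_norm.shape)
print(post_norm.mean(dim=-1).abs().max().item())  # close to 0: output is normalized
```

With pre-norm the residual path carries `x` through unchanged, which is why the `Encoder` and `Decoder` classes apply one extra normalization at the very end of the stack.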


### Encoder Block

<img src="images/Encoder.png" alt="Image" width="300"/>

In [56]:
class EncoderBlock(nn.Module):

    def __init__(self, features: int, self_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        """
        Encoder Block for a Transformer model.

        This block consists of a self-attention mechanism followed by a feed-forward network, 
        with residual connections and layer normalization applied to each sublayer.

        Args:
            features (int): Number of input Features
            self_attention_block (MultiHeadAttentionBlock): Multi-head self-attention mechanism.
            feed_forward_block (FeedForwardBlock): Feed-forward network.
            dropout (float): Dropout rate to be applied after each sublayer.
        """
        super().__init__()
        self.self_attention_block = self_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(features, dropout) for _ in range(2)])

    def forward(self, x, src_mask):
        """
        Forward pass through the encoder block.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, d_model).
            src_mask (torch.Tensor): Source mask tensor.

        Returns:
            torch.Tensor: Output tensor after applying self-attention, feed-forward network, and residual connections.
        """
        # Apply the first residual connection with the self-attention block.
        # The lambda binds the extra arguments (key, value, and mask).
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))

        # Apply the second residual connection with the feed-forward block
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x

In [57]:
# Example
batch_size = 2
seq_len = 5
d_model = 16
h = 4
d_ff = 32
dropout_rate = 0.1

# Create random tensors for input and mask
x = torch.rand(batch_size, seq_len, d_model)
# Create a mask tensor to avoid attending to certain positions
mask = torch.ones(batch_size, 1, seq_len, seq_len)
mask[:, :, :, 2:] = 0  # Mask out the last three key positions

# Initialize self-attention block and feed-forward block
self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout_rate)
feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout_rate)

# Initialize the encoder block
encoder_block = EncoderBlock(d_model, self_attention_block, feed_forward_block, dropout_rate)

# Apply the encoder block
output = encoder_block(x, mask)
print("Output shape:", output.shape)

Output shape: torch.Size([2, 5, 16])
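
In practice the source mask usually comes from the padding in each sentence rather than being set by hand. A minimal sketch, assuming a hypothetical padding token id `pad_id = 0` and made-up token ids, builds a mask of shape `(batch, 1, 1, seq_len)` that broadcasts over heads and query positions:

```python
import torch

pad_id = 0  # hypothetical padding id; the real value depends on the tokenizer
src = torch.tensor([[5, 7, 9, 0, 0],
                    [3, 4, 0, 0, 0]])          # (batch, seq_len), made-up token ids

# 1 where there is a real token, 0 where there is padding
src_mask = (src != pad_id).unsqueeze(1).unsqueeze(2).int()   # (batch, 1, 1, seq_len)

# The mask broadcasts against attention scores of shape (batch, h, seq_len, seq_len)
scores = torch.zeros(2, 4, 5, 5)
masked = scores.masked_fill(src_mask == 0, -1e9)

print(src_mask.shape)
print(masked[0, 0, 0])
```

Every query row of the first sentence ends up with a large negative score at the two padded key positions.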


### Encoder

In [58]:
class Encoder(nn.Module):
    """
    Encoder for a Transformer model.

    This module consists of a stack of encoder blocks (each with multi-head self-attention
    and a feed-forward network) followed by a final layer normalization.
    """
    def __init__(self, features: int, layers: nn.ModuleList) -> None:
        """
        Initializes the Encoder with a list of layers and a layer normalization module.

        Args:
            features (int): Number of input features
            layers (nn.ModuleList): List of layers to be included in the encoder.
        """
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization(features)  # Initialize layer normalization

    def forward(self, x, mask):
        """
        Forward pass through the encoder.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, d_model).
            mask (torch.Tensor): Mask tensor broadcastable to (batch_size, h, seq_len, seq_len),
                e.g. of shape (batch_size, 1, 1, seq_len).

        Returns:
            torch.Tensor: Output tensor after applying all layers and layer normalization.
        """
        for layer in self.layers:
            x = layer(x, mask)  # Apply each layer in the list to the input
        return self.norm(x)  # Apply layer normalization to the final output

In [59]:
# Example
batch_size = 2
seq_len = 5
d_model = 16
h = 4
d_ff = 32
dropout_rate = 0.1

# Create random tensors for input and mask
x = torch.rand(batch_size, seq_len, d_model)
# Create a mask tensor to avoid attending to certain positions
mask = torch.ones(batch_size, 1, seq_len, seq_len)
mask[:, :, :, 2:] = 0  # Mask out the last three key positions

# Build the encoder blocks; each block gets its own attention and feed-forward
# modules so that the layers do not share weights
encoder_blocks = nn.ModuleList()
for i in range(10):
    self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout_rate)
    feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout_rate)
    encoder_blocks.append(EncoderBlock(d_model, self_attention_block, feed_forward_block, dropout_rate))

encoder = Encoder(d_model, encoder_blocks)

# Apply the encoder
output = encoder(x, mask)
print("Output shape:", output.shape)

Output shape: torch.Size([2, 5, 16])
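
When stacking layers, it is worth checking whether each layer owns its own sub-modules. Passing the same module instance to several layers ties their weights together. The sketch below uses plain `nn.Linear` layers as stand-ins; the helper `n_params` is illustrative.

```python
import torch.nn as nn

def n_params(module):
    # parameters() yields each distinct parameter tensor once
    return sum(p.numel() for p in module.parameters())

shared = nn.Linear(16, 16)
tied = nn.ModuleList([shared for _ in range(3)])                    # one instance, reused
independent = nn.ModuleList([nn.Linear(16, 16) for _ in range(3)])  # three instances

print(n_params(tied))         # 272: a single set of weights
print(n_params(independent))  # 816: three separate sets
```

Weight tying across layers can be a deliberate design choice, but the model in the paper uses separate parameters for each of its N layers.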


In [60]:
 # count_parameters(encoder)

### Decoder Block

<img src="images/Decoder_Block.png" alt="Image" width="200"/>

In [61]:
class DecoderBlock(nn.Module):
    """
    Decoder block for a Transformer model.

    This block consists of a self-attention mechanism, a cross-attention mechanism (attending to encoder output),
    and a feed-forward network. Each sublayer is followed by a residual connection and layer normalization.
    """
    def __init__(
            self, 
            features: int,
            self_attention_block: MultiHeadAttentionBlock, 
            cross_attention_block: MultiHeadAttentionBlock,
            feed_forward_block: FeedForwardBlock,
            dropout: float = 0.1
        ) -> None:
        """
        Initializes the DecoderBlock with self-attention, cross-attention, feed-forward blocks,
        and residual connections.

        Args:
            features (int): Number of input features for layer normalization.
            self_attention_block (MultiHeadAttentionBlock): Multi-head self-attention mechanism.
            cross_attention_block (MultiHeadAttentionBlock): Cross-attention mechanism to attend to encoder output.
            feed_forward_block (FeedForwardBlock): Feed-forward network.
            dropout (float): Dropout rate to be applied after each sublayer.
        """
        super().__init__()
        self.self_attention_block = self_attention_block  # Self-attention mechanism
        self.cross_attention_block = cross_attention_block  # Cross-attention mechanism
        self.feed_forward_block = feed_forward_block  # Feed-forward network
        self.residual_connections = nn.ModuleList([ResidualConnection(features, dropout) for _ in range(3)])  # Residual connections with dropout

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        """
        Forward pass through the decoder block.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, d_model).
            encoder_output (torch.Tensor): Encoder output tensor of shape (batch_size, seq_len, d_model).
            src_mask (torch.Tensor): Source mask tensor.
            tgt_mask (torch.Tensor): Target mask tensor.

        Returns:
            torch.Tensor: Output tensor after applying self-attention, cross-attention, feed-forward network, and residual connections.
        """
        # Apply the first residual connection with the self-attention block
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, tgt_mask))

        # Apply the second residual connection with the cross-attention block
        x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))

        # Apply the third residual connection with the feed-forward block
        x = self.residual_connections[2](x, self.feed_forward_block)

        return x  # Return the final output tensor


In [62]:
# Example
d_model = 512  # Dimension of the model
num_heads = 8  # Number of attention heads
d_ff = 2048  # Dimension of the feed-forward layer
dropout = 0.1  # Dropout rate
seq_len = 10  # Sequence length
batch_size = 32  # Batch size

# Create instances of the attention and feed-forward blocks
self_attention_block = MultiHeadAttentionBlock(d_model, num_heads, dropout)
cross_attention_block = MultiHeadAttentionBlock(d_model, num_heads, dropout)
feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)

# Create an instance of the DecoderBlock
decoder_block = DecoderBlock(d_model, self_attention_block, cross_attention_block, feed_forward_block, dropout)

# Dummy input tensors
x = torch.randn(batch_size, seq_len, d_model)  # Input tensor
encoder_output = torch.randn(batch_size, seq_len, d_model)  # Encoder output tensor
src_mask = torch.ones(batch_size, 1, seq_len, seq_len)  # Source mask tensor
tgt_mask = torch.ones(batch_size, 1, seq_len, seq_len)  # Target mask tensor

# Forward pass through the DecoderBlock
output = decoder_block(x, encoder_output, src_mask, tgt_mask)

print("Output shape:", output.shape)

Output shape: torch.Size([32, 10, 512])
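
The example above uses an all-ones target mask, which lets every position attend to every other position. For decoder self-attention during training, the target mask is normally causal, so that position `i` only sees positions up to `i`. A minimal sketch, where the helper name `causal_mask` is illustrative:

```python
import torch

def causal_mask(size):
    # 1 on and below the diagonal, 0 above it
    return torch.tril(torch.ones(1, size, size, dtype=torch.int))

mask = causal_mask(4)                       # (1, size, size), broadcasts over the batch
scores = torch.zeros(1, 4, 4)               # toy attention scores
weights = scores.masked_fill(mask == 0, -1e9).softmax(dim=-1)

print(mask[0])
print(weights[0])
```

With uniform scores, row `i` of the weights spreads its probability evenly over the first `i + 1` positions and gives zero weight to later ones.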


### Decoder

In [63]:
class Decoder(nn.Module):
    """
    Decoder for a Transformer model.

    This decoder consists of a stack of decoder layers, each containing self-attention, 
    cross-attention, and feed-forward networks with residual connections and layer normalization.
    """
    def __init__(self, features: int, layers: nn.ModuleList) -> None:
        """
        Initializes the Decoder with a list of layers and a layer normalization module.

        Args:
            features (int): Number of input features for layer normalization
            layers (nn.ModuleList): List of decoder layers to be included in the decoder.
        """
        super().__init__()
        self.layers = layers  # List of decoder layers
        self.norm = LayerNormalization(features)  # Initialize layer normalization

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        """
        Forward pass through the decoder.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, d_model).
            encoder_output (torch.Tensor): Encoder output tensor of shape (batch_size, seq_len, d_model).
            src_mask (torch.Tensor): Source mask tensor.
            tgt_mask (torch.Tensor): Target mask tensor.

        Returns:
            torch.Tensor: Output tensor after applying all decoder layers and layer normalization.
        """
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)  # Apply each decoder layer to the input

        return self.norm(x)  # Apply layer normalization to the final output


In [64]:
# Example
d_model = 512  # Dimension of the model
num_heads = 8  # Number of attention heads
d_ff = 2048  # Dimension of the feed-forward layer
dropout = 0.1  # Dropout rate
seq_len = 10  # Sequence length
batch_size = 32  # Batch size

# Create a list of decoder blocks; each block gets its own attention and
# feed-forward modules so that the layers do not share weights
decoder_blocks = nn.ModuleList([
    DecoderBlock(
        d_model,
        MultiHeadAttentionBlock(d_model, num_heads, dropout),  # self-attention
        MultiHeadAttentionBlock(d_model, num_heads, dropout),  # cross-attention
        FeedForwardBlock(d_model, d_ff, dropout),
        dropout,
    )
    for _ in range(6)  # Number of decoder layers
])

# Create an instance of the Decoder
decoder = Decoder(d_model, decoder_blocks)

# Dummy input tensors
x = torch.randn(batch_size, seq_len, d_model)  # Input tensor
encoder_output = torch.randn(batch_size, seq_len, d_model)  # Encoder output tensor
src_mask = torch.ones(batch_size, 1, seq_len, seq_len)  # Source mask tensor
tgt_mask = torch.ones(batch_size, 1, seq_len, seq_len)  # Target mask tensor

# Forward pass through the Decoder
output = decoder(x, encoder_output, src_mask, tgt_mask)

print("Output shape:", output.shape)

Output shape: torch.Size([32, 10, 512])
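
The dummy tensors above use the same length for the decoder input and the encoder output, but cross-attention does not require that. In the sketch below (the lengths `tgt_len = 7` and `src_len = 5` are arbitrary assumptions), queries come from the decoder and keys and values from the encoder, so the score matrix has shape `(batch, h, tgt_len, src_len)`.

```python
import math
import torch

torch.manual_seed(0)
batch, h, d_k = 2, 4, 16
tgt_len, src_len = 7, 5

query = torch.randn(batch, h, tgt_len, d_k)   # projected decoder states
key = torch.randn(batch, h, src_len, d_k)     # projected encoder output
value = torch.randn(batch, h, src_len, d_k)   # projected encoder output

scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)   # (batch, h, tgt_len, src_len)
weights = scores.softmax(dim=-1)
context = weights @ value                                    # (batch, h, tgt_len, d_k)

print(scores.shape, context.shape)
```

This is also why a source mask of shape `(batch, 1, 1, src_len)` works for cross-attention: it broadcasts over heads and over target positions.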


### Projection Layer

In [65]:
class ProjectionLayer(nn.Module):
    """
    Projection layer for a Transformer model.

    This layer projects the hidden states from the model's d_model dimensionality
    to the vocabulary size dimensionality and applies a log softmax function.
    """
    def __init__(self, d_model: int, vocab_size: int) -> None:
        """
        Initializes the ProjectionLayer with a linear projection layer.

        Args:
            d_model (int): Dimensionality of the model's hidden states.
            vocab_size (int): Size of the vocabulary (number of target classes).
        """
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)  # Linear layer to project to vocabulary size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the projection layer.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, d_model).

        Returns:
            torch.Tensor: Output tensor of shape (batch_size, seq_len, vocab_size) with log softmax applied.
        """
        # (batch_size, seq_len, d_model) -> (batch_size, seq_len, vocab_size)
        return F.log_softmax(self.proj(x), dim=-1)  # Apply linear projection and log softmax


In [66]:
# Example
d_model = 512  # Dimensionality of the model's hidden states
vocab_size = 10000  # Size of the vocabulary
seq_len = 10  # Sequence length
batch_size = 32  # Batch size

# Create an instance of the ProjectionLayer
projection_layer = ProjectionLayer(d_model, vocab_size)

# Dummy input tensor of shape (batch_size, seq_len, d_model)
x = torch.randn(batch_size, seq_len, d_model)

# Forward pass through the ProjectionLayer
output = projection_layer(x)

# Output tensor should have shape (batch_size, seq_len, vocab_size)
print("Output shape:", output.shape)

Output shape: torch.Size([32, 10, 10000])
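
Because `ProjectionLayer` returns log-probabilities rather than raw logits, exponentiating its output should sum to 1 over the vocabulary, and the matching loss is negative log-likelihood. The sketch below checks both on a toy example. It uses a bare `nn.Linear` as a stand-in for the projection layer, and the small sizes are illustrative assumptions rather than the notebook's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Small illustrative sizes (assumptions, not the notebook's values)
batch_size, seq_len, d_model, vocab_size = 2, 5, 16, 50

proj = nn.Linear(d_model, vocab_size)        # stand-in for ProjectionLayer.proj
x = torch.randn(batch_size, seq_len, d_model)

logits = proj(x)                             # (batch_size, seq_len, vocab_size)
log_probs = F.log_softmax(logits, dim=-1)    # same operation as ProjectionLayer.forward

# Exponentiated log-probabilities sum to 1 over the vocabulary
prob_sums = log_probs.exp().sum(dim=-1)      # (batch_size, seq_len)

# Greedy choice: most likely token id at each position
next_tokens = log_probs.argmax(dim=-1)       # (batch_size, seq_len)

# NLL on log-probabilities matches cross-entropy on raw logits
targets = torch.randint(0, vocab_size, (batch_size, seq_len))
nll = F.nll_loss(log_probs.reshape(-1, vocab_size), targets.reshape(-1))
ce = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

print(prob_sums.shape, next_tokens.shape)
print(torch.allclose(nll, ce))
```

In practice this means a model ending in `log_softmax` can be paired with `nn.NLLLoss`; `nn.CrossEntropyLoss` expects raw logits and applies the log softmax itself.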


### Transformer

<img src="images/model_arch.png" alt="Image" width="300"/>

In [67]:
import torch
import torch.nn as nn

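class Transformer(nn.Module):
    """
    Full encoder-decoder Transformer for sequence-to-sequence translation.

    Args:
        src_vocab_size (int): Size of the source vocabulary.
        tgt_vocab_size (int): Size of the target vocabulary.
        src_seq_len (int): Source sequence length passed to the positional encoding.
        tgt_seq_len (int): Target sequence length passed to the positional encoding.
        d_model (int): Dimensionality of the model's hidden states.
        N (int): Number of encoder blocks and of decoder blocks.
        h (int): Number of attention heads.
        dropout (float): Dropout probability.
        d_ff (int): Hidden dimensionality of the feed-forward blocks.
    """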
    def __init__(
        self,
        src_vocab_size: int,
        tgt_vocab_size: int,
        src_seq_len: int,
        tgt_seq_len: int,
        d_model: int = 512,
        N: int = 6,
        h: int = 8,
        dropout: float = 0.1,
        d_ff: int = 2048
    ) -> None:
        super().__init__()

        # Embeddings
        self.src_embed = InputEmbeddings(d_model, src_vocab_size)
        self.tgt_embed = InputEmbeddings(d_model, tgt_vocab_size)

        # Positional encodings
        self.src_pos = PositionalEncoding(d_model, src_seq_len, dropout)
        self.tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)

        # Encoder
        encoder_blocks = []
        for _ in range(N):
            encoder_self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
            feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
            encoder_block = EncoderBlock(d_model, encoder_self_attention_block, feed_forward_block, dropout)
            encoder_blocks.append(encoder_block)
        self.encoder = Encoder(d_model, nn.ModuleList(encoder_blocks))

        # Decoder
        decoder_blocks = []
        for _ in range(N):
            decoder_self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
            decoder_cross_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
            feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
            decoder_block = DecoderBlock(d_model, decoder_self_attention_block, decoder_cross_attention_block, feed_forward_block, dropout)
            decoder_blocks.append(decoder_block)
        self.decoder = Decoder(d_model, nn.ModuleList(decoder_blocks))

        # Projection layer
        self.projection_layer = ProjectionLayer(d_model, tgt_vocab_size)

        # Initialize parameters
        self._init_parameters()

    def _init_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

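    def encode(self, src, src_mask):
        # (batch_size, src_seq_len) -> (batch_size, src_seq_len, d_model)
        src = self.src_embed(src)
        src = self.src_pos(src)
        return self.encoder(src, src_mask)

    def decode(self, encoder_output, src_mask, tgt, tgt_mask):
        # (batch_size, tgt_seq_len) -> (batch_size, tgt_seq_len, d_model)
        tgt = self.tgt_embed(tgt)
        tgt = self.tgt_pos(tgt)
        return self.decoder(tgt, encoder_output, src_mask, tgt_mask)

    def project(self, x):
        # (batch_size, seq_len, d_model) -> (batch_size, seq_len, tgt_vocab_size) log-probabilities
        return self.projection_layer(x)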

    def forward(self, src, tgt, src_mask, tgt_mask):
        encoder_output = self.encode(src, src_mask)
        decoder_output = self.decode(encoder_output, src_mask, tgt, tgt_mask)
        output = self.project(decoder_output)
        return output
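
`_init_parameters` applies Xavier-uniform initialization only to parameters with more than one dimension, so weight matrices are re-initialized while 1-D parameters such as biases keep their default values. A minimal sketch of that rule on a hypothetical toy module (the `toy` model below is an assumption standing in for the full `Transformer`):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy module standing in for the full Transformer
toy = nn.Sequential(nn.Linear(8, 4), nn.LayerNorm(4))

# Snapshot every parameter before re-initialization
before = {name: p.detach().clone() for name, p in toy.named_parameters()}

# Same rule as Transformer._init_parameters
for p in toy.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Record which parameters actually changed
changed = {name: not torch.equal(before[name], p) for name, p in toy.named_parameters()}
for name, p in toy.named_parameters():
    print(f"{name:<10} dim={p.dim()} re-initialized={changed[name]}")

# Xavier-uniform bound for the (4, 8) weight: sqrt(6 / (fan_in + fan_out))
bound = (6 / (8 + 4)) ** 0.5
```

Only the 2-D `Linear` weight is redrawn, and its values stay within the Xavier-uniform bound.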


In [68]:
def test_transformer_forward_pass():
    # Hyperparameters
    src_vocab_size = 30000
    tgt_vocab_size = 25000
    src_seq_len = 100
    tgt_seq_len = 80
    d_model = 512
    batch_size = 2

    # Instantiate the Transformer
    model = Transformer(
        src_vocab_size=src_vocab_size,
        tgt_vocab_size=tgt_vocab_size,
        src_seq_len=src_seq_len,
        tgt_seq_len=tgt_seq_len,
        d_model=d_model,
        N=6,
        h=8,
        dropout=0.1,
        d_ff=1024
    )

    # Create dummy inputs
    src = torch.randint(0, src_vocab_size, (batch_size, src_seq_len))  # [batch_size, src_seq_len]
    tgt = torch.randint(0, tgt_vocab_size, (batch_size, tgt_seq_len))  # [batch_size, tgt_seq_len]

    # All-ones masks: nothing is hidden (no padding, no causal masking), which is enough for a shape check
    src_mask = torch.ones((batch_size, 1, 1, src_seq_len))  # [batch_size, 1, 1, src_seq_len]
    tgt_mask = torch.ones((batch_size, 1, tgt_seq_len, tgt_seq_len))  # [batch_size, 1, tgt_seq_len, tgt_seq_len]

    # Forward pass
    out = model(src, tgt, src_mask, tgt_mask)

    # Check output shape: [batch_size, tgt_seq_len, tgt_vocab_size]
    assert out.shape == (batch_size, tgt_seq_len, tgt_vocab_size), f"Unexpected output shape: {out.shape}"

    print("✅ Transformer forward pass test passed!")

# Run the test
test_transformer_forward_pass()


✅ Transformer forward pass test passed!
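
The shape test above uses all-ones masks, so no position is hidden. For actual training, the target mask is normally causal (each position sees only itself and earlier positions), and both masks hide padding. The sketch below builds such masks under two assumptions: the padding id is 0, and the attention blocks treat a mask value of 0/False as "do not attend". The names `PAD_ID`, `make_src_mask`, and `make_tgt_mask` are hypothetical.

```python
import torch

PAD_ID = 0  # assumed padding token id

def make_src_mask(src: torch.Tensor) -> torch.Tensor:
    # (batch_size, src_len) -> (batch_size, 1, 1, src_len); True where a real token is present
    return (src != PAD_ID).unsqueeze(1).unsqueeze(2)

def make_tgt_mask(tgt: torch.Tensor) -> torch.Tensor:
    # Padding part: (batch_size, 1, 1, tgt_len)
    pad_mask = (tgt != PAD_ID).unsqueeze(1).unsqueeze(2)
    # Causal part: lower-triangular (tgt_len, tgt_len); position i sees positions <= i
    tgt_len = tgt.size(1)
    causal = torch.tril(torch.ones(tgt_len, tgt_len, device=tgt.device)).bool()
    # Broadcasts to (batch_size, 1, tgt_len, tgt_len)
    return pad_mask & causal.unsqueeze(0).unsqueeze(0)

src = torch.tensor([[5, 7, 9, 0, 0]])   # two padded positions
tgt = torch.tensor([[1, 4, 6, 0]])      # one padded position

src_mask = make_src_mask(src)
tgt_mask = make_tgt_mask(tgt)
print(src_mask.shape, tgt_mask.shape)
print(tgt_mask[0, 0].int())
```

The resulting shapes match the ones used in the test, `(batch_size, 1, 1, src_seq_len)` and `(batch_size, 1, tgt_seq_len, tgt_seq_len)`, so the masks broadcast across attention heads.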


In [69]:
# import torch
# import torch.nn as nn
# import torch.optim as optim
# from torch.utils.data import DataLoader, Dataset
# from tqdm import tqdm

# # -------------- Dummy Dataset --------------
# class DummyTranslationDataset(Dataset):
#     def __init__(self, src_vocab_size, tgt_vocab_size, src_seq_len, tgt_seq_len, num_samples=10000):
#         self.src_vocab_size = src_vocab_size
#         self.tgt_vocab_size = tgt_vocab_size
#         self.src_seq_len = src_seq_len
#         self.tgt_seq_len = tgt_seq_len
#         self.num_samples = num_samples
        
#     def __len__(self):
#         return self.num_samples
    
#     def __getitem__(self, idx):
#         src = torch.randint(0, self.src_vocab_size, (self.src_seq_len,))
#         tgt = torch.randint(0, self.tgt_vocab_size, (self.tgt_seq_len,))
#         return src, tgt

# # -------------- Mask Functions --------------
# def create_src_mask(src):
#     batch_size, src_len = src.shape
#     src_mask = torch.ones((batch_size, 1, 1, src_len), device=src.device)
#     return src_mask

# def create_tgt_mask(tgt):
#     batch_size, tgt_len = tgt.shape
#     causal_mask = torch.tril(torch.ones((tgt_len, tgt_len), device=tgt.device)).bool()
#     tgt_mask = causal_mask.unsqueeze(0).unsqueeze(0).expand(batch_size, 1, tgt_len, tgt_len)
#     return tgt_mask


# # -------------- Hyperparameters --------------
# src_vocab_size = 10000  # English vocab
# tgt_vocab_size = 12000  # Italian vocab
# src_seq_len = 30
# tgt_seq_len = 30
# batch_size = 64
# num_epochs = 10
# learning_rate = 1e-4
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# # -------------- Model, Loss, Optimizer --------------
# model = Transformer(
#     src_vocab_size=src_vocab_size,
#     tgt_vocab_size=tgt_vocab_size,
#     src_seq_len=src_seq_len,
#     tgt_seq_len=tgt_seq_len,
#     d_model=512,
#     N=6,
#     h=8,
#     dropout=0.1,
#     d_ff=1024
# ).to(device)

# optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# criterion = nn.NLLLoss(ignore_index=0)  # model already outputs log-probabilities; assume padding=0

# # -------------- Data Loaders --------------
# train_dataset = DummyTranslationDataset(src_vocab_size, tgt_vocab_size, src_seq_len, tgt_seq_len)
# train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# # -------------- Training Loop --------------
# for epoch in range(num_epochs):
#     model.train()
#     total_loss = 0

#     loop = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}", leave=False)

#     for src, tgt in loop:
#         src = src.to(device)
#         tgt = tgt.to(device)

#         src_mask = create_src_mask(src)

#         # Prepare inputs and targets
#         tgt_input = tgt[:, :-1]    # Remove last token for input
#         tgt_output = tgt[:, 1:]    # Remove first token for output

#         tgt_mask = create_tgt_mask(tgt_input)

#         preds = model(src, tgt_input, src_mask, tgt_mask)  # (batch_size, tgt_seq_len-1, tgt_vocab_size)

#         preds = preds.reshape(-1, preds.shape[-1])  # Flatten for loss
#         tgt_output = tgt_output.reshape(-1)         # Flatten target

#         loss = criterion(preds, tgt_output)

#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()

#         total_loss += loss.item()

#         loop.set_postfix(loss=loss.item())

#     avg_loss = total_loss / len(train_loader)
#     print(f"Epoch [{epoch+1}/{num_epochs}] - Loss: {avg_loss:.4f}")

# print("✅ Training finished!")
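
The training loop above relies on teacher forcing: the decoder receives the target sequence without its last token and is scored against the same sequence without its first token. The sketch below shows that shift and the flattened loss on a single toy sentence. The special-token ids are assumptions, and random log-probabilities stand in for the model output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

PAD_ID, BOS_ID, EOS_ID = 0, 1, 2   # assumed special-token ids
vocab_size = 10

# One toy target sentence: <bos> 5 6 7 <eos> <pad>
tgt = torch.tensor([[BOS_ID, 5, 6, 7, EOS_ID, PAD_ID]])

tgt_input = tgt[:, :-1]    # fed to the decoder
tgt_output = tgt[:, 1:]    # what the decoder should predict at each step

# Stand-in for the model output: random log-probabilities of shape
# (batch_size, tgt_len - 1, vocab_size)
log_probs = F.log_softmax(torch.randn(1, tgt_input.size(1), vocab_size), dim=-1)

# Flatten batch and time, and skip padded target positions
loss = F.nll_loss(
    log_probs.reshape(-1, vocab_size),
    tgt_output.reshape(-1),
    ignore_index=PAD_ID,
)

print(tgt_input.tolist())
print(tgt_output.tolist())
print(loss.item())
```

With `ignore_index=PAD_ID`, the final padded position contributes nothing, so the loss is averaged over the four real target tokens only.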
