<img src="../images/cover.jpg" width="1920"/>

# Transformer Architecture

The Transformer architecture, introduced in the paper *"[Attention Is All You Need](https://arxiv.org/abs/1706.03762)"* by Vaswani et al., revolutionized natural language processing by dispensing with the recurrent and convolutional layers that earlier sequence models relied on. Instead, it uses a mechanism called *self-attention* to compute the dependencies between all input tokens simultaneously, which allows efficient parallel processing and captures long-range relationships in sequences.

The architecture consists of an encoder-decoder structure, where the encoder processes input sequences and the decoder generates output sequences, both leveraging multi-head attention and feed-forward layers. This design achieved state-of-the-art results in machine translation and laid the foundation for subsequent advancements in language models like BERT and GPT.

<img src="../images/transformer architecture.webp" width="500"/>

<img src="../images/transformer explained.jpg" width="1920"/>

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
from torch.utils.data import Dataset, DataLoader
from collections import Counter
from typing import Optional

# Positional Encoding

The Transformer architecture, unlike RNNs, processes input sequences in parallel, making it inherently unaware of the sequential order of tokens. To address this, *positional encoding* is introduced, enabling the model to incorporate information about the position of each token in a sequence. This is achieved by embedding positional information directly into the input representations using a combination of sine and cosine functions. These functions encode position in a continuous and differentiable manner, allowing the model to capture relative and absolute positions of tokens effectively.

<img src="../images/positional encoding.png" width="1920"/>

## Sinusoidal Positional Encoding
Sine and cosine are evaluated at a range of frequencies, so each position receives a distinct pattern of values across the embedding dimensions. For a given position $ pos $ and dimension-pair index $ i $, the positional encoding is calculated as:

$$
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad 
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Here:
- $ d_{\text{model}} $ is the model dimension.
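- $ i $ indexes pairs of dimensions and runs from $ 0 $ to $ d_{\text{model}}/2 - 1 $: the even dimension $ 2i $ uses sine and the odd dimension $ 2i+1 $ uses cosine, both at the same frequency.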

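The alternating sine and cosine patterns give each position a distinct representation. Because each frequency appears as a sine/cosine pair, the encoding of position $ pos + k $ is a linear function of the encoding of position $ pos $ for any fixed offset $ k $, which makes relative positions easy for the model to represent. These positional encodings are added directly to the token embeddings, so the Transformer has access to positional information during both training and inference.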

In the provided implementation, this encoding is computed using PyTorch. The sine function is applied to even indices, and the cosine function to odd indices. The resulting positional encodings are stored as a buffer and added to the input tensor during the forward pass, allowing the model to leverage position-awareness without requiring additional trainable parameters.
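
As a sanity check on the formula, the sketch below evaluates it directly in plain Python for an assumed toy size of `d_model = 8`, and compares the exponential-log form used for `div_term` with the $ 1 / 10000^{2i/d_{\text{model}}} $ term written in the equation. The helper name `sinusoidal_pe` is illustrative and not part of the notebook's code.

```python
import math


def sinusoidal_pe(pos, d_model):
    """Encode one position straight from the formula (no vectorization)."""
    encoding = []
    for k in range(d_model):
        i = k // 2  # dimensions 2i and 2i+1 share one frequency
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


d_model = 8  # assumed toy size
even_dims = range(0, d_model, 2)  # the values 2i

# Form used in the implementation: exp(2i * -ln(10000) / d_model)
div_term = [math.exp(two_i * (-math.log(10000.0) / d_model)) for two_i in even_dims]
# Form written in the formula: 1 / 10000^(2i / d_model)
direct = [1.0 / (10000 ** (two_i / d_model)) for two_i in even_dims]

print(div_term)
print(direct)
print(sinusoidal_pe(0, d_model))  # position 0: sin(0) = 0 and cos(0) = 1 in every pair
```

The two lists agree up to floating-point rounding, which is why the implementation can compute the frequencies with a single `torch.exp` call.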

In [None]:
position = torch.arange(0, 10, dtype=torch.float).unsqueeze(1)
position

In [None]:
div_term = torch.exp(torch.arange(0, 8, 2).float() * (-math.log(10000.0) / 8))
div_term

In [None]:
position * div_term

In [None]:
torch.sin(position * div_term)

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_seq_length: int = 5000):
        super().__init__()

        # Create positional encoding matrix
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )

        # Apply sine to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cosine to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add batch dimension and register as buffer
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pe[:, : x.size(1)]

## Simple Positional Encoding

The `SimplePositionalEncoding` class provides an alternative approach to encoding positional information in sequence models. Instead of using fixed sine and cosine functions, this method introduces a learnable positional embedding. A trainable parameter matrix, `positional_embedding`, is initialized randomly and has dimensions corresponding to the maximum sequence length (`max_seq_length`) and model dimension (`d_model`).  

During the forward pass, the positional embeddings for the relevant sequence length are added directly to the input tensor, enabling the model to learn optimal positional encodings during training. This approach is simpler and more flexible but introduces additional parameters compared to fixed encodings like sinusoidal functions.
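
The parameter cost can be made concrete by counting trainable entries. The sketch below uses two hypothetical stand-in modules (`LearnedPositions` and `FixedPositions`, with made-up sizes) to show that a table stored as `nn.Parameter` is trained, while a table registered as a buffer is not.

```python
import torch
import torch.nn as nn


class LearnedPositions(nn.Module):
    """Hypothetical stand-in for a learnable positional table."""

    def __init__(self, d_model, max_seq_length):
        super().__init__()
        self.table = nn.Parameter(torch.randn(max_seq_length, d_model))

    def forward(self, x):
        # x: (batch, seq_len, d_model); the table slice broadcasts over the batch
        return x + self.table[: x.size(1)]


class FixedPositions(nn.Module):
    """Hypothetical stand-in for a fixed table stored as a buffer."""

    def __init__(self, d_model, max_seq_length):
        super().__init__()
        self.register_buffer("table", torch.zeros(max_seq_length, d_model))

    def forward(self, x):
        return x + self.table[: x.size(1)]


def count_trainable(module):
    """Number of scalar entries the optimizer would update."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


learned = LearnedPositions(d_model=16, max_seq_length=50)
fixed = FixedPositions(d_model=16, max_seq_length=50)

print(count_trainable(learned))  # max_seq_length * d_model entries are trained
print(count_trainable(fixed))    # buffers are not parameters
out = learned(torch.zeros(2, 10, 16))
print(out.shape)
```

A learned table also cannot represent positions beyond `max_seq_length`, since no row exists for them.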

In [None]:
# Alternative simple positional encoding
class SimplePositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_seq_length: int = 5000):
        super().__init__()
        self.positional_embedding = nn.Parameter(torch.randn(max_seq_length, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.positional_embedding[: x.size(1)]

# Multi-Head Attention

Multi-head attention is a core component of the Transformer architecture, enabling the model to focus on different parts of the input sequence simultaneously. It extends the self-attention mechanism by dividing the model's attention capacity into multiple "heads," each learning unique aspects of the input relationships.

### Query (Q), Key (K), and Value (V)  
The inputs to multi-head attention are transformed into three representations:  
- **Query (Q):** Represents what each token is looking for in the other tokens.  
- **Key (K):** Represents what each token offers for matching; query-key similarity determines how much attention a token receives.  
- **Value (V):** Contains the actual information that is aggregated from the attended tokens.  

These representations are computed using learned linear transformations:  
$$
Q = XW_Q, \quad K = XW_K, \quad V = XW_V
$$ 
where $ X $ is the input sequence and $ W_Q, W_K, W_V $ are trainable projection matrices.

### Multi-Head Mechanism  
The projected $ Q $, $ K $, and $ V $ are each split into $ h $ heads (determined by `num_heads`), with every head operating on a slice of size $ d_{\text{model}} / h $ of the total embedding dimension $ d_{\text{model}} $. For each head:  
1. **Scaled Dot-Product Attention** is computed:  
   $$
   \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
   $$ 
   Here, $ d_k = d_{\text{model}} / h $. Dividing by $ \sqrt{d_k} $ keeps the dot products from growing with $ d_k $; without it, large scores push the softmax into saturated regions where gradients become very small.  
2. Outputs from all heads are concatenated and projected back into the original dimension $ d_{\text{model}} $ using a linear layer.  
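
The split into heads and the later concatenation are pure reshapes. The sketch below traces the tensor shapes with assumed toy sizes; `x` stands in for an already projected $ Q $, $ K $, or $ V $.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 5, 16, 4  # assumed toy sizes
d_k = d_model // num_heads

x = torch.randn(batch_size, seq_len, d_model)  # stands in for a projected Q, K, or V

# Split the last dimension into (num_heads, d_k), then move heads next to batch
heads = x.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)  # (batch, heads, seq_len, d_k)

# Head 1 sees features d_k .. 2*d_k - 1 of every token
print(torch.equal(heads[:, 1], x[:, :, d_k : 2 * d_k]))

# Undo the split: move heads back and merge them into d_model again
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
print(torch.equal(merged, x))
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout.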

<img src="../images/multi-head-attention.png" width="500"/>

### Scaled Dot-Product Attention  
The scaled dot-product attention computes attention scores between the query and key matrices, determines the relevance of each token, and uses these scores to aggregate the value matrix. The result is a context-aware representation of the input sequence. 
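
A minimal single-head sketch of this computation, using random tensors and assumed toy sizes, is shown below. The second half illustrates why the $ \sqrt{d_k} $ scaling is useful: for random unit-variance vectors, the raw dot products have a standard deviation of about $ \sqrt{d_k} $, while the scaled ones stay near 1.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 4, 8  # assumed toy sizes
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)        # (seq_len, seq_len) similarity scores
weights = torch.softmax(scores, dim=-1)  # each row is a distribution over tokens
context = weights @ V                    # weighted average of the value vectors

print(weights.sum(dim=-1))  # every row sums to 1
print(context.shape)

# Effect of the scaling factor on the spread of the scores
big_d_k = 512
q = torch.randn(4096, big_d_k)
k = torch.randn(4096, big_d_k)
raw = (q * k).sum(dim=-1)           # unscaled dot products
scaled = raw / math.sqrt(big_d_k)   # scaled as in the attention formula
print(raw.std().item(), scaled.std().item())
```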

<img src="../images/scaled dot-product attention.png" width="500"/>

### Benefits of Multi-Head Attention  
Multi-head attention allows the model to jointly attend to information from different positions and capture diverse relationships in the input sequence. This enhances the model's ability to understand complex dependencies and patterns across tokens.

The provided implementation follows this structure: it defines separate linear layers for $ Q $, $ K $, and $ V $, applies scaled dot-product attention to all heads in one batched operation, then concatenates the head outputs and projects them back to the original embedding size.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear layers for Q, K, V projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

    def scaled_dot_product_attention(
        self,
        Q: torch.Tensor,
        K: torch.Tensor,
        V: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        attention_weights = torch.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        return torch.matmul(attention_weights, V)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        batch_size = query.size(0)

        # Linear projections and reshape
        Q = (
            self.W_q(query)
            .view(batch_size, -1, self.num_heads, self.d_k)
            .transpose(1, 2)
        )
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = (
            self.W_v(value)
            .view(batch_size, -1, self.num_heads, self.d_k)
            .transpose(1, 2)
        )

        # Apply attention
        x = self.scaled_dot_product_attention(Q, K, V, mask)

        # Reshape and apply final linear layer
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(x)

# Feed-Forward Networks

In the Transformer architecture, the feed-forward network (FFN) is a fully connected sublayer applied independently to each position in the sequence. It consists of two linear transformations with a non-linear activation (typically ReLU) in between. The FFN enhances the model's capacity to learn complex mappings between input and output embeddings.  

The process involves:  
1. Expanding the dimensionality of the input from $ d_{\text{model}} $ to $ d_{\text{ff}} $ using the first linear layer.  
2. Applying a non-linear activation function (ReLU) to introduce non-linearity.  
3. Reducing the dimensionality back to $ d_{\text{model}} $ using the second linear layer.  
4. Using dropout for regularization. In this implementation, dropout is applied to the ReLU output, before the second linear layer.  

### Benefits of the Feed-Forward Network  
- **Position-wise Computation:** The FFN operates on each token independently, allowing parallel computation across all sequence positions.  
- **Increased Model Capacity:** The intermediate expansion to $ d_{\text{ff}} $ provides a higher capacity to model complex transformations, improving expressiveness.  
- **Non-linearity:** The ReLU activation captures non-linear patterns that cannot be modeled by attention mechanisms alone.  

This implementation reflects the standard FFN, efficiently balancing computational complexity and modeling power.
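
The position-wise property can be checked directly: applying the network to a whole sequence gives the same result as applying it to each position separately. The sketch below uses an `nn.Sequential` stand-in without dropout and assumed toy sizes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32  # assumed toy sizes
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)  # (batch, seq_len, d_model)

with torch.no_grad():
    full = ffn(x)  # whole sequence at once
    # Same network applied to one position at a time, then re-stacked
    per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(full.shape)
print(torch.allclose(full, per_position, atol=1e-5))
```

Since no position depends on any other inside the FFN, all mixing of information between tokens has to happen in the attention sublayers.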

In [None]:
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear2(self.dropout(self.relu(self.linear1(x))))

# Encoder Layer

The encoder layer is a fundamental building block of the Transformer, designed to process input sequences and generate contextualized representations. Each encoder layer consists of two key subcomponents:  

1. **Self-Attention Block:**  
   - Employs multi-head attention to allow each token to attend to all other tokens in the input sequence, capturing relationships and dependencies.  
   - The attention output is added to the block's input through a residual connection, and the sum is normalized with LayerNorm, which supports stable training.  

2. **Feed-Forward Network:**  
   - A position-wise feed-forward network enhances the capacity of the model to capture complex transformations.  
   - Includes normalization and residual connections, similar to the self-attention block.  

### Workflow:  
1. Input passes through the self-attention block, with an optional mask so that padding tokens are ignored.  
2. The attention output is added to the original input (residual connection), normalized, and passed to the feed-forward network.  
3. The feed-forward output undergoes another residual connection and normalization step.  

The encoder layer effectively learns rich representations of the input sequence by combining attention mechanisms and non-linear transformations, enabling robust feature extraction.
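
The "add, then normalize" pattern used around each sublayer can be sketched in isolation. Here a plain `nn.Linear` is a hypothetical stand-in for the attention or feed-forward sublayer, and the sizes are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8  # assumed toy size
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # hypothetical stand-in for attention or the FFN

x = torch.randn(2, 5, d_model)  # (batch, seq_len, d_model)
with torch.no_grad():
    y = norm(x + sublayer(x))  # residual add first, then LayerNorm

print(y.shape)
print(y.mean(dim=-1))                 # close to 0 for every token
print(y.var(dim=-1, unbiased=False))  # close to 1 for every token
```

LayerNorm normalizes each token's feature vector independently, so the statistics are computed over the last dimension only, not over the batch or the sequence.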

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self, x: torch.Tensor, mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        # Self attention block
        attn_output = self.self_attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed forward block
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x

# Decoder Layer 

The decoder layer processes target sequences while integrating information from the encoder's output to generate predictions. It consists of three main components:  

1. **Self-Attention Block:**  
   - Applies multi-head self-attention to the target sequence, with masking to prevent tokens from attending to future positions (causal masking).  

2. **Cross-Attention Block:**  
   - Employs multi-head attention where the queries come from the decoder, and keys and values are derived from the encoder's output, allowing the decoder to attend to relevant parts of the source sequence.  

3. **Feed-Forward Network:**  
   - A position-wise feed-forward network applies non-linear transformations, enhancing representation capabilities.  

### Workflow:  
1. Input goes through the self-attention block, ensuring the model focuses only on prior positions in the sequence.  
2. Cross-attention combines the decoder's self-attention output with the encoder's output for contextual understanding.  
3. The result is passed through the feed-forward network for further processing.  

Each subcomponent is wrapped with dropout, a residual connection, and LayerNorm, which stabilizes training and helps gradients flow through a deep stack of layers. Together, the three blocks let the decoder model dependencies within the target sequence and between target and source.
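
The effect of causal masking is easiest to see with uniform scores. In the sketch below (sequence length chosen arbitrarily), position $ t $ ends up spreading its attention evenly over positions $ 0 $ to $ t $ and placing none on later ones.

```python
import torch

seq_len = 4  # assumed toy length
# True where attention is allowed: the diagonal and everything before it
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = torch.zeros(seq_len, seq_len)  # uniform scores make the effect easy to read
masked = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(causal_mask.int())
print(weights)
```

Filling blocked positions with $ -\infty $ before the softmax makes their weights exactly zero while the remaining weights in each row still sum to 1.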

In [None]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.cross_attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        enc_output: torch.Tensor,
        tgt_mask: Optional[torch.Tensor] = None,
        src_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # Self attention block
        self_attn_output = self.self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(self_attn_output))

        # Cross attention block
        cross_attn_output = self.cross_attention(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))

        # Feed forward block
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))

        return x

# Transformer 

The `Transformer` class implements the full Transformer architecture, combining encoder and decoder components to process sequences for tasks like machine translation and text generation.  

### Key Components:  
1. **Embedding and Positional Encoding:**  
   - Inputs and target sequences are embedded into continuous vectors and augmented with positional encodings to include token position information.  

2. **Encoder:**  
   - A stack of encoder layers processes the source sequence, generating context-rich representations by applying self-attention and feed-forward sublayers.  

3. **Decoder:**  
   - A stack of decoder layers combines target sequence information with the encoder's output. It uses self-attention to focus on prior tokens and cross-attention to incorporate relevant context from the source sequence.  

4. **Final Output Layer:**  
   - A linear layer maps the decoder's output to the target vocabulary space, producing logits for each token prediction.  

### Workflow:  
1. Input sequences are embedded and passed through the encoder layers, with masks applied to handle padding.  
2. Target sequences are similarly embedded and processed through decoder layers, with causal masks to prevent access to future tokens.  
3. The decoder output is transformed into predictions via the final linear layer.  

### Benefits:  
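The `Transformer` model is highly parallelizable during training because all positions of a sequence are processed at once, and attention connects distant tokens directly, which helps with long-range dependencies. When it was introduced, the architecture achieved state-of-the-art results in machine translation.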
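
The target mask combines a padding mask with a causal mask through broadcasting. The sketch below reproduces that combination on a small batch of made-up token ids, assuming index 0 is the padding token.

```python
import torch

PAD = 0  # assumed padding index
tgt = torch.tensor([[5, 7, 9, PAD],
                    [3, 4, PAD, PAD]])  # (batch=2, seq_len=4), made-up token ids
seq_len = tgt.size(1)

# Padding mask: which key positions hold real tokens -> (2, 1, 1, 4)
pad_mask = (tgt != PAD).unsqueeze(1).unsqueeze(2)
# Causal mask: which earlier positions each query may see -> (1, 4, 4)
causal = torch.tril(torch.ones(1, seq_len, seq_len, device=tgt.device)).bool()
# Broadcasting yields one (seq_len, seq_len) mask per sequence -> (2, 1, 4, 4)
tgt_mask = pad_mask & causal

print(tgt_mask.shape)
print(tgt_mask[1, 0].int())
```

The singleton dimension at index 1 lets the same mask broadcast across all attention heads.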

In [None]:
class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size: int,
        tgt_vocab_size: int,
        d_model: int = 512,
        num_heads: int = 8,
        num_layers: int = 6,
        d_ff: int = 2048,
        max_seq_length: int = 5000,
        dropout: float = 0.1,
    ):
        super().__init__()

        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        # Create encoder and decoder layers
        self.encoder_layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )

        self.decoder_layers = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )

        self.final_layer = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def create_mask(self, src: torch.Tensor, tgt: torch.Tensor) -> tuple:
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)

        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)
        seq_length = tgt.size(1)
        # Build the no-peek mask on tgt's device to avoid a CPU/GPU mismatch
        ones = torch.ones(1, seq_length, seq_length, device=tgt.device)
        nopeak_mask = (1 - torch.triu(ones, diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask

        return src_mask, tgt_mask

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        src_mask, tgt_mask = self.create_mask(src, tgt)

        # Encoder
        src_embedded = self.dropout(
            self.positional_encoding(self.encoder_embedding(src))
        )
        enc_output = src_embedded

        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        # Decoder
        tgt_embedded = self.dropout(
            self.positional_encoding(self.decoder_embedding(tgt))
        )
        dec_output = tgt_embedded

        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, tgt_mask, src_mask)

        output = self.final_layer(dec_output)
        return output

# Translation Dataset

### PyTorch Dataset

PyTorch's `Dataset` class provides a framework for handling and preprocessing data for machine learning models. It allows custom datasets to be created with methods to define how data is loaded and processed, making it compatible with PyTorch's data loaders for efficient batch processing.

### `TranslationDataset` Class:  
This custom dataset is designed for machine translation tasks, processing paired source-target text data.

#### Key Features:  
1. **Initialization (`__init__`):**  
   - Takes a dataset of source and target text pairs, along with vocabularies for token-to-index mapping.  
   - Supports sequence truncation and padding up to a specified `max_length`.  

2. **Length (`__len__`):**  
   - Returns the total number of text pairs in the dataset, enabling iteration.

3. **Item Retrieval (`__getitem__`):**  
   - Converts source (`src_text`) and target (`tgt_text`) sequences into token indices using the provided vocabularies.  
   - Truncates sequences to `max_length` and pads shorter ones with zeros to ensure uniform length.  
   - Returns a dictionary containing tokenized and padded sequences as PyTorch tensors.

### Benefits:  
- **Preprocessing Flexibility:** Simplifies text-to-index conversion and padding, making the dataset ready for use in models.  
- **Compatibility:** Works seamlessly with PyTorch's `DataLoader` for batch processing, shuffling, and parallel data loading.  
- **Customizability:** Can be extended to include additional features like masking or special token handling.
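
The conversion, truncation, and padding steps can be sketched as a standalone helper in plain Python. The `encode` function, the toy vocabulary, and the `UNK_ID` fallback for out-of-vocabulary tokens are assumptions added for illustration; the dataset class itself looks tokens up directly and would raise a `KeyError` on an unknown token.

```python
PAD_ID = 0  # matches the zero padding described above
UNK_ID = 1  # hypothetical id for out-of-vocabulary tokens


def encode(text, vocab, max_length):
    """Whitespace-tokenize, map to ids, truncate, then right-pad to max_length."""
    ids = [vocab.get(token, UNK_ID) for token in text.split()][:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))


toy_vocab = {"hello": 2, "world": 3}  # made-up vocabulary

short = encode("hello world", toy_vocab, max_length=5)
long = encode("hello there world again and again", toy_vocab, max_length=5)
print(short)  # padded on the right
print(long)   # truncated, with unknown tokens mapped to UNK_ID
```

Reserving fixed ids for padding and unknown tokens keeps them from colliding with real vocabulary entries.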

In [None]:
# Data loading and preprocessing
class TranslationDataset(Dataset):
    def __init__(self, data, src_vocab, tgt_vocab, max_length=100):
        self.data = data
        self.src_vocab = src_vocab
        self.tgt_vocab = tgt_vocab
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        src_text, tgt_text = self.data[idx]

        # Convert text to indices
        src_indices = [self.src_vocab[token] for token in src_text.split()][
            : self.max_length
        ]
        tgt_indices = [self.tgt_vocab[token] for token in tgt_text.split()][
            : self.max_length
        ]

        # Pad sequences
        src_indices = src_indices + [0] * (self.max_length - len(src_indices))
        tgt_indices = tgt_indices + [0] * (self.max_length - len(tgt_indices))

        return {"src": torch.tensor(src_indices), "tgt": torch.tensor(tgt_indices)}

In [None]:
data = [
    ("hello world", "hallo welt"),
    ("goodbye", "auf wiedersehen"),
    ("how are you?", "wie geht es dir?"),
]

src_vocab = {"hello": 10, "world": 1, "goodbye": 2, "how": 3, "are": 4, "you?": 5}
tgt_vocab = {"hallo": 10, "welt": 1, "auf": 2, "wiedersehen": 3, "wie": 4, "geht": 5}

dataset = TranslationDataset(data, src_vocab, tgt_vocab)

dataset[0]

In [None]:
# Training setup
def train_model(model, train_loader, criterion, optimizer, device, num_epochs=10):
    model.train()

    for epoch in range(num_epochs):
        total_loss = 0
        for batch in train_loader:
            src = batch["src"].to(device)
            tgt = batch["tgt"].to(device)

            # Create target input and output
            tgt_input = tgt[:, :-1]
            tgt_output = tgt[:, 1:]

            optimizer.zero_grad()
            output = model(src, tgt_input)

            # Reshape output and target for loss calculation
            output = output.view(-1, output.size(-1))
            tgt_output = tgt_output.contiguous().view(-1)

            loss = criterion(output, tgt_output)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}, Average Loss: {avg_loss:.4f}")
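
The `tgt[:, :-1]` and `tgt[:, 1:]` slices in `train_model` implement teacher forcing: at every step the decoder receives the true previous tokens and is trained to predict the next one. The plain-Python sketch below shows the alignment on one sequence. The `BOS` and `EOS` ids are hypothetical; the dataset in this notebook only pads and does not add start or end markers, so in that setup the first target token is never predicted.

```python
BOS, EOS, PAD = 1, 2, 0  # hypothetical special-token ids

tgt = [BOS, 11, 12, 13, EOS, PAD, PAD]  # one padded target sequence, made-up ids

tgt_input = tgt[:-1]   # what the decoder sees
tgt_output = tgt[1:]   # what it must predict at each step

for seen, expected in zip(tgt_input, tgt_output):
    print(f"decoder input {seen:>2} -> target {expected:>2}")
```

Steps whose target is `PAD` contribute nothing to the loss when the criterion is created with `ignore_index=0`.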

In [None]:
# Example usage
def train():
    train_iter = [
        ("hello world", "hallo welt"),
        ("goodbye", "auf wiedersehen"),
        ("how are you?", "wie geht es dir?"),
        ("I'm fine", "ich bin gut"),
        ("what's your name?", "wie heißt du?"),
        ("my name is John", "ich heiße John"),
        ("nice to meet you", "freut mich"),
        ("where are you from?", "woher kommst du?"),
        ("I'm from Germany", "ich komme aus Deutschland"),
        ("do you speak English?", "sprichst du Englisch?"),
        ("yes, a little", "ja, ein bisschen"),
        ("what time is it?", "wie spät ist es?"),
        ("it's three o'clock", "es ist drei Uhr"),
        ("I'm hungry", "ich habe Hunger"),
        ("let's eat something", "lass uns etwas essen"),
        ("the food is delicious", "das Essen ist lecker"),
        ("thank you very much", "vielen Dank"),
        ("you're welcome", "bitte schön"),
        ("see you tomorrow", "bis morgen"),
        ("have a good day", "einen schönen Tag"),
        ("I love reading", "ich liebe es zu lesen"),
        ("what's the weather like?", "wie ist das Wetter?"),
        ("it's raining", "es regnet"),
        ("the sun is shining", "die Sonne scheint"),
        ("I need help", "ich brauche Hilfe"),
        ("can you help me?", "kannst du mir helfen?"),
        ("of course", "natürlich"),
        ("excuse me", "entschuldigung"),
        ("I'm sorry", "es tut mir leid"),
        ("no problem", "kein Problem"),
        ("how much is this?", "wie viel kostet das?"),
        ("that's expensive", "das ist teuer"),
        ("I'll take it", "ich nehme es"),
        ("where is the bathroom?", "wo ist die Toilette?"),
        ("turn left", "links abbiegen"),
        ("turn right", "rechts abbiegen"),
        ("go straight ahead", "geradeaus gehen"),
        ("I'm lost", "ich habe mich verlaufen"),
        ("can you show me the way?", "können Sie mir den Weg zeigen?"),
        ("I don't understand", "ich verstehe nicht"),
        ("please speak slowly", "bitte sprechen Sie langsam"),
        ("could you repeat that?", "können Sie das wiederholen?"),
        ("what does this mean?", "was bedeutet das?"),
        ("I like this", "das gefällt mir"),
        ("that's interesting", "das ist interessant"),
        ("I agree", "ich stimme zu"),
        ("I disagree", "ich stimme nicht zu"),
        ("good morning", "guten Morgen"),
        ("good evening", "guten Abend"),
        ("good night", "gute Nacht"),
    ]

    src_vocab = Counter()
    tgt_vocab = Counter()

    for src, tgt in train_iter:
        src_vocab.update(src.split())
        tgt_vocab.update(tgt.split())

    # Create vocabulary dictionaries
    src_vocab = {word: idx + 1 for idx, (word, _) in enumerate(src_vocab.most_common())}
    tgt_vocab = {word: idx + 1 for idx, (word, _) in enumerate(tgt_vocab.most_common())}
    src_vocab["<pad>"] = 0
    tgt_vocab["<pad>"] = 0

    # Create dataset and dataloader
    dataset = TranslationDataset(list(train_iter), src_vocab, tgt_vocab)
    train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

    # Initialize model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = Transformer(
        src_vocab_size=len(src_vocab),
        tgt_vocab_size=len(tgt_vocab),
        d_model=512,
        num_heads=8,
        num_layers=6,
        d_ff=2048,
    ).to(device)

    # Training setup
    criterion = nn.CrossEntropyLoss(ignore_index=0)
    optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

    # Train the model; train_model updates it in place and returns None,
    # so its result must not be assigned back to model
    train_model(model, train_loader, criterion, optimizer, device, 10)
    return model

In [None]:
model = train()

Here are some valuable external resources to deepen your understanding of Transformers:

- [**Attention Is All You Need**](https://arxiv.org/abs/1706.03762) – The foundational paper that introduced the Transformer model.
- [**The Illustrated Transformer**](http://jalammar.github.io/illustrated-transformer/) by Jay Alammar – A visually intuitive explanation of the Transformer architecture.
- [**PyTorch Transformers from Scratch**](https://www.youtube.com/watch?v=U0s0f995w14&t=729s) by Aladdin Persson – A practical video guide on building Transformers in PyTorch.
- [**The Annotated Transformer**](http://nlp.seas.harvard.edu/annotated-transformer/) by Sasha Rush – A detailed, step-by-step explanation of the Transformer model's code.
- [**Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch**](https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html) by Sebastian Raschka – A deep dive into the self-attention mechanism and how to implement it from scratch.
- [**Transformers from Scratch**](https://peterbloem.nl/blog/transformers) by Peter Bloem – An insightful blog post on understanding and implementing Transformers from the ground up.

In [None]:
import torch

x = torch.randn(1, 3, 1, 5)
print(x.shape)  # Output: torch.Size([1, 3, 1, 5])

y = x.squeeze()
print(y.shape)  # Output: torch.Size([3, 5])

In [None]:
x