<a href="https://colab.research.google.com/github/VaishnaviMudaliar/deep-learning/blob/main/Attention_is_all_you_need.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 📥 Importing PyTorch Modules


### ✅ Explanation

- **`import torch`**  
  This imports the core PyTorch library. It provides the foundational tools for:
  - Creating and working with **tensors**
  - Performing operations on **GPU/CPU**
  - Handling **automatic differentiation** for training neural networks

- **`import torch.nn as nn`**  
  This imports the `torch.nn` module, which contains classes and functions to help define and train **neural networks**.  
  Common components include:
  - Layers like `nn.Linear`, `nn.Conv2d`, etc.
  - Activation functions like `nn.ReLU`, `nn.Sigmoid`, etc.
  - Loss functions like `nn.CrossEntropyLoss`, `nn.MSELoss`, etc.
  - The base class `nn.Module` to create custom models

By importing it as `nn`, we can easily write shorter code such as `nn.Linear(...)` instead of `torch.nn.Linear(...)`.
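
As a quick illustration of the pieces listed above, the following minimal sketch builds a tensor, passes it through an `nn.Linear` layer, and lets autograd compute gradients. The shapes and layer sizes are arbitrary choices for illustration, not values used later in the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A batch of 4 samples with 3 features each (illustrative sizes)
x = torch.randn(4, 3)

# nn.Linear(3, 2) holds a (2, 3) weight matrix and a bias of size 2
layer = nn.Linear(3, 2)
y = layer(x)                      # shape: (4, 2)

# Autograd tracks the computation, so backward() fills in .grad
loss = y.pow(2).mean()
loss.backward()

print(y.shape)                    # torch.Size([4, 2])
print(layer.weight.grad.shape)    # torch.Size([2, 3])
```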


In [9]:
import torch
import torch.nn as nn


## 🧠 Self-Attention Mechanism (Multi-Headed)

```python
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (self.head_dim * heads == embed_size), "Embed size needs to be divisible by heads"

        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)
```

### 🔧 `__init__` Method Explanation

- `embed_size`: Total dimensionality of the input embeddings.
- `heads`: Number of attention heads (multi-head attention splits the work into parallel "heads").
- `head_dim`: Size of each attention head = `embed_size // heads`.

- Three bias-free linear layers compute **values**, **keys**, and **queries**. Each maps `head_dim -> head_dim` and is applied to every head, so the same weights are shared across heads.
- `fc_out`: Final linear layer to combine all heads' output back into a single embedding.

> 📌 The `assert` ensures that `embed_size` is cleanly divisible by `heads`.

---

```python
    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        energy = torch.einsum("nqhd, nkhd -> nhqk", [queries, keys])
```

### 🔄 `forward` Method Explanation

- Inputs:
  - `values`, `keys`, `query`: Tensors with embedding data (shape: `[batch_size, seq_len, embed_size]`)
  - `mask`: Optional tensor to mask out positions in the sequence

- Reshaping inputs to split across attention heads:
  - New shape: `(batch_size, seq_len, heads, head_dim)`

- Applying linear projections to get keys, queries, and values.

- **Energy calculation**:
  - Using `torch.einsum("nqhd, nkhd -> nhqk", ...)` to compute dot-product attention scores.
  - Resulting shape: `(batch_size, heads, query_len, key_len)`
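
To make the `einsum` string less opaque, the sketch below checks that it produces the same result as an explicit batched matrix product `Q @ K^T` per head. The tensor sizes are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
N, query_len, key_len, heads, head_dim = 2, 5, 7, 4, 8

queries = torch.randn(N, query_len, heads, head_dim)
keys = torch.randn(N, key_len, heads, head_dim)

# einsum form used in the forward method: sum over the shared index d
energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

# Equivalent batched matmul: move heads next to batch, then Q @ K^T
q = queries.permute(0, 2, 1, 3)        # (N, heads, query_len, head_dim)
k_t = keys.permute(0, 2, 3, 1)         # (N, heads, head_dim, key_len)
energy_matmul = q @ k_t                # (N, heads, query_len, key_len)

print(energy.shape)                    # torch.Size([2, 4, 5, 7])
print(torch.allclose(energy, energy_matmul, atol=1e-5))
```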

---

```python
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        attention = torch.softmax(energy / (self.embed_size ** (1/2)), dim=3)
```

### 🎭 Masking and Softmax

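- If a mask is provided, positions where `mask == 0` have their energy set to a very large negative number (`-1e20`), so they receive effectively zero weight after the softmax.
- The scores are divided by `sqrt(embed_size)` and **softmax** is applied over the `key_len` dimension (`dim=3`) to get attention weights. Note that the original paper scales by `sqrt(d_k)`, the per-head dimension (`head_dim` here).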
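
The effect of the mask can be seen on a tiny example. This is a minimal sketch with made-up scores for one query over four keys, where the last key is treated as padding.

```python
import torch

# Scores for one query over four keys; the last key is padding
energy = torch.tensor([[2.0, 1.0, 0.5, 3.0]])
mask = torch.tensor([[1, 1, 1, 0]])    # 0 marks a position to ignore

# Same masking call as in the forward method
masked = energy.masked_fill(mask == 0, float("-1e20"))
attention = torch.softmax(masked, dim=-1)

# The masked position gets weight 0; the remaining weights sum to 1
print(attention)
```

Even though the padded key had the highest raw score (3.0), it contributes nothing after masking.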

---

```python
        out = torch.einsum("nhql,nlhd -> nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )

        out = self.fc_out(out)
        return out
```

### 🧪 Weighted Sum and Output

- Weighted combination of values based on attention scores using another `einsum`.
- Output shape after reshaping: `(batch_size, query_len, embed_size)`
- Final linear layer `fc_out` combines all heads' outputs into one.
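
The second `einsum` can be checked against a plain matrix product in the same way. The sketch below uses illustrative sizes and random inputs; it also shows the reshape that concatenates the heads.

```python
import torch

torch.manual_seed(0)
N, query_len, key_len, heads, head_dim = 2, 5, 7, 4, 8

attention = torch.softmax(torch.randn(N, heads, query_len, key_len), dim=3)
values = torch.randn(N, key_len, heads, head_dim)

# einsum form: sum over the key/value position l
out = torch.einsum("nhql,nlhd->nqhd", attention, values)

# Equivalent matmul: (N, heads, q, l) @ (N, heads, l, d) -> (N, heads, q, d)
out_matmul = (attention @ values.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)

# Concatenate heads: (N, query_len, heads * head_dim)
out_flat = out.reshape(N, query_len, heads * head_dim)

print(torch.allclose(out, out_matmul, atol=1e-5))
print(out_flat.shape)                  # torch.Size([2, 5, 32])
```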

---

### 🔚 Summary

This `SelfAttention` module implements multi-head scaled dot-product attention. It:
- Projects inputs into multiple attention heads
- Computes attention scores
- Applies softmax with optional masking
- Combines outputs from all heads
- Projects the result back to the original embedding size

> 🧠 This is the core of the **Transformer** architecture!


In [10]:
class SelfAttention(nn.Module):
  def __init__(self, embed_size, heads):
    super(SelfAttention, self).__init__()
    self.embed_size = embed_size
    self.heads = heads
    self.head_dim = embed_size // heads
    assert(self.head_dim * heads == embed_size), "Embed size needs to be divisible by heads"
    self.values = nn.Linear(self.head_dim, self.head_dim, bias = False)
    self.keys = nn.Linear(self.head_dim, self.head_dim, bias = False )
    self.queries = nn.Linear(self.head_dim, self.head_dim, bias = False )
    self.fc_out = nn.Linear(heads*self.head_dim, embed_size)

  def forward(self, values, keys, query, mask):
    N = query.shape[0]
    value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

    # Split embeddings into self.heads pieces

    values = values.reshape(N, value_len, self.heads, self.head_dim)
    keys = keys.reshape(N, key_len, self.heads, self.head_dim)
    queries = query.reshape(N, query_len, self.heads, self.head_dim)

    values = self.values(values)
    keys = self.keys(keys)
    queries = self.queries(queries)
    energy = torch.einsum("nqhd, nkhd -> nhqk" , [queries,keys])

    # queries shape : (N, query_len, heads, head_dim)
    # keys shape : (N, key_len, heads, head_dim)

    # energy shape : (N, heads, query_len, key_len)

    if mask is not None:
      energy = energy.masked_fill(mask == 0, float("-1e20"))

    attention = torch.softmax(energy / (self.embed_size ** (1/2)), dim = 3)

    out = torch.einsum("nhql,nlhd -> nqhd", [attention, values]).reshape(N, query_len, self.heads*self.head_dim)

    # attention shape: (N, heads, query_len, key_len)
    # values shape: (N, value_len, heads, head_dim)
    # after einsum: (N, query_len, heads, head_dim), then flatten the last two dimensions

    out = self.fc_out(out)
    return out


## 🔗 Transformer Block

This class implements a **single block of the Transformer encoder** architecture, combining self-attention and a feed-forward network with residual connections, layer normalization, and dropout.

```python
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )

        self.dropout = nn.Dropout(dropout)
```

### 🧱 `__init__` Constructor

- `embed_size`: Dimensionality of token embeddings
- `heads`: Number of self-attention heads
- `dropout`: Dropout probability to prevent overfitting
- `forward_expansion`: Multiplier for expanding the hidden layer in the feed-forward network

**Main Components:**

- `SelfAttention(...)`: Multi-head self-attention layer
- `LayerNorm(...)`: Layer normalization for stabilizing and speeding up training
- `feed_forward`: Two-layer MLP with ReLU activation
- `Dropout(...)`: Regularization to prevent overfitting

---

```python
    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)
        x = self.dropout(self.norm1(attention + query))

        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))

        return out
```

### 🔄 `forward` Method

1. **Self-Attention + Residual + LayerNorm**
   - Computes attention over the input sequence.
   - Adds the original `query` (residual connection).
   - Applies `LayerNorm` and `Dropout`.

2. **Feed-Forward Network + Residual + LayerNorm**
   - Passes through a 2-layer MLP (`feed_forward`).
   - Adds the residual input `x`.
   - Applies another `LayerNorm` and `Dropout`.
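
The "add residual, normalize, then dropout" pattern used in both steps can be sketched in isolation. Here `sublayer` is a hypothetical stand-in for either the attention or the feed-forward sub-layer, and the sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size = 16

norm = nn.LayerNorm(embed_size)
dropout = nn.Dropout(0.1)
# Hypothetical stand-in for the attention or feed-forward sub-layer
sublayer = nn.Linear(embed_size, embed_size)

x = torch.randn(2, 5, embed_size)          # (batch, seq_len, embed_size)

# Same ordering as the block above: add residual, normalize, then dropout
out = dropout(norm(sublayer(x) + x))

# Without dropout, each token vector has (near) zero mean over its features
normed = norm(sublayer(x) + x)
print(out.shape)                           # torch.Size([2, 5, 16])
print(normed.mean(dim=-1).abs().max() < 1e-5)
```

The residual connection requires the sub-layer output to have the same shape as its input, which is why every sub-layer maps back to `embed_size`.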

---

### 🧠 Summary

A `TransformerBlock` combines:

- 🧲 **Multi-head self-attention** (focuses on relevant parts of input)
- 🔁 **Residual connections** (helps gradient flow)
- 🧪 **Layer normalization** (improves training stability)
- 🚿 **Dropout** (regularization)
- 🔧 **Feed-forward network** (non-linearity and learning capacity)

> 📌 Multiple `TransformerBlock`s are stacked in Transformer encoders to build deep and powerful models.


In [11]:
class TransformerBlock(nn.Module):
  def __init__(self, embed_size, heads, dropout, forward_expansion):
    super(TransformerBlock, self).__init__()
    self.attention = SelfAttention(embed_size, heads)
    self.norm1 = nn.LayerNorm(embed_size)
    self.norm2 = nn.LayerNorm(embed_size)


    self.feed_forward = nn.Sequential(
        nn.Linear(embed_size, forward_expansion*embed_size),
        nn.ReLU(),
        nn.Linear(forward_expansion*embed_size, embed_size),


    )

    self.dropout = nn.Dropout(dropout)

  def forward(self, value, key, query, mask):
    attention = self.attention(value, key, query, mask)
    x = self.dropout(self.norm1(attention + query))

    forward = self.feed_forward(x)
    out = self.dropout(self.norm2(forward + x))

    return out





## 📘 Transformer Encoder

This class implements the **encoder** component of the Transformer model. It converts a sequence of input tokens into rich contextual embeddings using multiple layers of self-attention and feed-forward networks.

```python
class Encoder(nn.Module):
    def __init__(self, src_vocab_size, embed_size, num_layers, heads, device, forward_expansion, dropout, max_length):
        super(Encoder, self).__init__()
        self.embed_size = embed_size
        self.device = device

        self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)

        self.layers = nn.ModuleList([
            TransformerBlock(embed_size, heads, dropout=dropout, forward_expansion=forward_expansion)
            for _ in range(num_layers)
        ])

        self.dropout = nn.Dropout(dropout)
```

### 🔧 `__init__` Constructor Explanation

- **`src_vocab_size`**: Number of tokens in the source vocabulary.
- **`embed_size`**: Size of the embedding vectors.
- **`num_layers`**: Number of Transformer blocks (depth of encoder).
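- **`heads`**: Number of attention heads in each block.
- **`device`**: Device on which the position indices are created (`cpu` or `cuda`).
- **`forward_expansion`**: Multiplier for the hidden size of the feed-forward layers.
- **`dropout`**: Dropout rate.
- **`max_length`**: Maximum sequence length supported by the positional embeddings.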
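
The two embedding tables defined in this constructor are combined by summing a token embedding and a position embedding for every input position. A minimal sketch of that step, with small illustrative sizes that are assumptions rather than the notebook's defaults:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Illustrative sizes only
src_vocab_size, max_length, embed_size = 10, 100, 16

word_embedding = nn.Embedding(src_vocab_size, embed_size)
position_embedding = nn.Embedding(max_length, embed_size)

x = torch.tensor([[1, 5, 6, 4], [1, 8, 7, 0]])   # (N, seq_length) token ids
N, seq_length = x.shape

# Position ids 0..seq_length-1, repeated for every sequence in the batch
positions = torch.arange(0, seq_length).expand(N, seq_length)

out = word_embedding(x) + position_embedding(positions)
print(positions)
print(out.shape)     # torch.Size([2, 4, 16])
```

Because the positions are looked up in a learned table of size `max_length`, sequences longer than `max_length` would raise an index error.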


In [19]:
class Encoder(nn.Module):
  def __init__(self, src_vocab_size, embed_size, num_layers, heads, device, forward_expansion, dropout, max_length):
    super(Encoder, self).__init__()
    self.embed_size = embed_size
    self.device = device
    self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
    self.position_embedding = nn.Embedding(max_length, embed_size)

    self.layers = nn.ModuleList([
        TransformerBlock(embed_size, heads, dropout=dropout, forward_expansion=forward_expansion) for _ in range(num_layers)
    ] )

    self.dropout = nn.Dropout(dropout)

  def forward(self, x, mask):
    N, seq_length = x.shape
    positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)

    out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))

    for layer in self.layers:
      out = layer(out, out, out, mask)
    return out






## 🔁 Transformer Decoder Block

This class implements a **single decoder block** in the Transformer architecture. A decoder block uses two attention layers:
1. **Masked self-attention** on the decoder’s own inputs (to prevent peeking at future tokens).
2. **Encoder-decoder attention** (to attend to encoder outputs).

```python
class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout, device):
        super(DecoderBlock, self).__init__()

        self.attention = SelfAttention(embed_size, heads)
        self.norm = nn.LayerNorm(embed_size)
        self.transformer_block = TransformerBlock(embed_size, heads, dropout, forward_expansion)
        self.dropout = nn.Dropout(dropout)
```

### 🧱 `__init__` Constructor

- **`embed_size`**: Dimensionality of token embeddings.
- **`heads`**: Number of attention heads.
- **`forward_expansion`**: Multiplier for the hidden dimension in the feed-forward layer.
- **`dropout`**: Dropout rate for regularization.
- **`device`**: Device to run the model (`cpu` or `cuda`).

**Components:**
- `SelfAttention(...)`: Used for **masked self-attention**.
- `LayerNorm(...)`: Normalizes the output for stability.
- `TransformerBlock(...)`: Used for **encoder-decoder attention** followed by feed-forward.
- `Dropout(...)`: Adds regularization to avoid overfitting.

---

```python
    def forward(self, x, value, key, src_mask, trg_mask):
        attention = self.attention(x, x, x, trg_mask)
        query = self.dropout(self.norm(attention + x))
        out = self.transformer_block(value, key, query, src_mask)
        return out
```

### 🔄 `forward` Method Explanation

- **Inputs**:
  - `x`: Target-side representations: the embedded target tokens for the first block, or the output of the previous decoder block.
  - `value`, `key`: Outputs from the encoder (used in encoder-decoder attention).
  - `src_mask`: Mask for the encoder inputs (to ignore padding).
  - `trg_mask`: Mask for decoder inputs (to prevent future token access).

#### 🔹 Step-by-Step:

1. **Masked Self-Attention**:
   - Applies self-attention over decoder inputs `x`.
   - Uses `trg_mask` to block future tokens.
   - Residual connection: `attention + x`
   - Normalized and regularized.

2. **Encoder-Decoder Attention** (via `TransformerBlock`):
   - Uses encoder outputs (`key`, `value`) to inform decoding.
   - Takes the `query` from masked self-attention.
   - Uses `src_mask` to ignore encoder padding tokens.

---

### 🧠 Summary

A `DecoderBlock`:
- Performs **masked self-attention** on the target sequence, so each position can only attend to itself and earlier positions.
- Uses **encoder-decoder attention** to focus on relevant encoder output.
- Passes through a feed-forward layer with residuals and normalization.

> 🔄 Multiple decoder blocks are stacked to form the full decoder in an encoder-decoder Transformer (e.g., translation models). Decoder-only models such as GPT use a similar stack without the encoder-decoder attention.


In [20]:
class DecoderBlock(nn.Module):
  def __init__(self,embed_size, heads, forward_expansion, dropout, device ):
    super(DecoderBlock, self).__init__()

    self.attention = SelfAttention(embed_size, heads)
    self.norm = nn.LayerNorm(embed_size)
    self.transformer_block = TransformerBlock(embed_size, heads, dropout, forward_expansion)
    self.dropout = nn.Dropout(dropout)


  def forward(self, x, value, key, src_mask, trg_mask):
    attention = self.attention(x, x, x, trg_mask)
    query = self.dropout(self.norm(attention + x))
    out = self.transformer_block(value, key, query, src_mask)
    return out




## 📙 Transformer Decoder

The `Decoder` class implements the full **decoder** component of the Transformer architecture. It consists of multiple decoder blocks that perform masked self-attention, encoder-decoder attention, and feed-forward transformations.

```python
class Decoder(nn.Module):
    def __init__(self, trg_vocab_size, embed_size, num_layers, heads, forward_expansion, dropout, device, max_length):
        super(Decoder, self).__init__()
        self.device = device

        self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)

        self.layers = nn.ModuleList([
            DecoderBlock(embed_size, heads, forward_expansion, dropout, device)
            for _ in range(num_layers)
        ])

        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)
```

### 🧱 `__init__` Constructor Explanation

- **`trg_vocab_size`**: Number of tokens in the target language vocabulary.
- **`embed_size`**: Dimension of embeddings.
- **`num_layers`**: Number of decoder blocks.
- **`heads`**: Number of attention heads in each block.
- **`forward_expansion`**: Multiplier for hidden size in feed-forward layers.
- **`dropout`**: Dropout rate.
- **`device`**: Device to allocate tensors.
- **`max_length`**: Maximum length for positional embeddings.

**Main Components:**
- `word_embedding`: Converts token indices to vectors.
- `position_embedding`: Adds sequence position info.
- `DecoderBlock`: List of decoder blocks to build depth.
- `fc_out`: Final linear layer mapping to vocabulary logits.
- `dropout`: Applied to embeddings and intermediate steps.

---

```python
    def forward(self, x, enc_out, src_mask, trg_mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)

        x = self.dropout((self.word_embedding(x) + self.position_embedding(positions)))

        for layer in self.layers:
            x = layer(x, enc_out, enc_out, src_mask, trg_mask)

        out = self.fc_out(x)
        return out
```

### 🔄 `forward` Method Explanation

- **Inputs**:
  - `x`: Target token indices (`[N, tgt_seq_len]`).
  - `enc_out`: Output from the encoder (`[N, src_seq_len, embed_size]`).
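  - `src_mask`: Mask for the encoder input (to ignore padding).
  - `trg_mask`: Mask for the decoder input (to prevent attending to future tokens).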


In [30]:
class Decoder(nn.Module):
  def __init__(self, trg_vocab_size, embed_size, num_layers, heads, forward_expansion, dropout, device, max_length) -> None:
    super(Decoder, self).__init__()
    self.device = device
    self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
    self.position_embedding = nn.Embedding(max_length, embed_size)

    self.layers = nn.ModuleList([
        DecoderBlock(embed_size, heads, forward_expansion, dropout, device) for _ in range(num_layers)
    ])

    self.fc_out = nn.Linear(embed_size, trg_vocab_size)
    self.dropout = nn.Dropout(dropout)


  def forward(self, x, enc_out, src_mask, trg_mask):
    N, seq_length = x.shape
    positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
    x = self.dropout((self.word_embedding(x) + self.position_embedding(positions)))

    for layer in self.layers:
      x = layer(x, enc_out, enc_out, src_mask, trg_mask)

    out = self.fc_out(x)
    return out

# 🚀 Transformer Model (Full Architecture)

This `Transformer` class brings together the **Encoder** and **Decoder** into a full sequence-to-sequence model, as described in the original ["Attention is All You Need"](https://arxiv.org/abs/1706.03762) paper.

---

## 🧱 Constructor: `__init__`

```python
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx,
                 embed_size=256, num_layers=6, forward_expansion=4,
                 heads=8, dropout=0, device="cpu", max_length=100):
```

### 🔹 Arguments:

- **`src_vocab_size` / `trg_vocab_size`**: Vocabulary sizes for source and target languages.
- **`src_pad_idx` / `trg_pad_idx`**: Padding token indices (for masking).
- **`embed_size`**: Dimension of token embeddings.
- **`num_layers`**: Number of encoder and decoder blocks.
- **`heads`**: Number of attention heads.
- **`forward_expansion`**: Multiplier for feed-forward hidden size.
- **`dropout`**: Dropout rate.
- **`device`**: Device (`cpu` or `cuda`).
- **`max_length`**: Maximum sequence length.

### 🧩 Components:

```python
self.encoder = Encoder(...)
self.decoder = Decoder(...)
self.src_pad_idx = src_pad_idx
self.trg_pad_idx = trg_pad_idx
```

---

## 🛡️ Masking Functions

### 🔷 `make_src_mask`

```python
def make_src_mask(self, src):
    src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
    return src_mask.to(self.device)
```

- Creates a mask for the **source input** to ignore `<PAD>` tokens.
- Shape: `(N, 1, 1, src_len)` — broadcastable for attention.
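
A small example makes the shape and the broadcasting concrete. The token ids and the `energy` tensor below are made up for illustration.

```python
import torch

src_pad_idx = 0
src = torch.tensor([[1, 5, 6, 0, 0],
                    [1, 8, 7, 3, 2]])

# (N, src_len) -> (N, 1, 1, src_len)
src_mask = (src != src_pad_idx).unsqueeze(1).unsqueeze(2)
print(src_mask.shape)      # torch.Size([2, 1, 1, 5])
print(src_mask[0, 0, 0])   # first sequence: last two positions are False

# The mask broadcasts against energy of shape (N, heads, query_len, key_len)
energy = torch.zeros(2, 4, 5, 5)
masked = energy.masked_fill(src_mask == 0, float("-1e20"))
print(masked.shape)        # torch.Size([2, 4, 5, 5])
```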

### 🔶 `make_trg_mask`

```python
def make_trg_mask(self, trg):
    N, trg_len = trg.shape
    trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(N, 1, trg_len, trg_len)
    return trg_mask.to(self.device)
```

- Produces a **look-ahead mask** for the target sequence to prevent attending to future tokens.
- Shape: `(N, 1, trg_len, trg_len)`
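
Printing the mask for a short sequence shows the lower-triangular pattern: row `i` marks which key positions query position `i` may attend to. The sizes below are illustrative.

```python
import torch

N, trg_len = 2, 4
trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(N, 1, trg_len, trg_len)

print(trg_mask.shape)    # torch.Size([2, 1, 4, 4])
print(trg_mask[0, 0])
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
```

Note that this mask only hides future positions; it does not hide padding tokens in the target.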

---

## 🔄 `forward` Pass

```python
def forward(self, src, trg):
    src_mask = self.make_src_mask(src)
    trg_mask = self.make_trg_mask(trg)

    enc_src = self.encoder(src, src_mask)
    out = self.decoder(trg, enc_src, src_mask, trg_mask)
    return out
```

### 🧠 Process Flow:

1. Generate masks for both source and target inputs.
2. Pass the **source sequence** and mask to the encoder.
3. Pass the **target sequence**, encoder output, and both masks to the decoder.
4. Return the logits for each token in the target vocabulary.

---

## ✅ Summary

This `Transformer` class:
- Fully implements the **encoder-decoder Transformer architecture**.
- Handles positional embeddings, multi-head attention, masking, and feed-forward layers.
- Can be used for tasks like **machine translation**, **text generation**, and more.

> 🧪 Tip: To train this model, use `nn.CrossEntropyLoss(ignore_index=trg_pad_idx)` and optimize over the output predictions vs. the ground truth target sequence.
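
A minimal sketch of that loss computation is shown below. To stay self-contained it replaces the model call with random logits of the right shape, which is an assumption for illustration; in a real training loop `logits` would come from `model(src, decoder_input)`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
trg_pad_idx, trg_vocab_size = 0, 10

trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0],
                    [1, 5, 6, 2, 4, 7, 6, 2]])

decoder_input = trg[:, :-1]     # what the model would receive
labels = trg[:, 1:]             # what it should predict, shifted by one

# Stand-in for model(src, decoder_input): random logits of the right shape
logits = torch.randn(2, decoder_input.shape[1], trg_vocab_size, requires_grad=True)

criterion = nn.CrossEntropyLoss(ignore_index=trg_pad_idx)
# CrossEntropyLoss expects (num_items, num_classes) and (num_items,)
loss = criterion(logits.reshape(-1, trg_vocab_size), labels.reshape(-1))
loss.backward()

print(loss.item())
```

Positions whose label equals `trg_pad_idx` are skipped, so padding does not contribute to the loss or the gradients.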


In [34]:
class Transformer(nn.Module):
  def __init__(self, src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, embed_size = 256, num_layers = 6, forward_expansion = 4, heads = 8, dropout = 0, device = "cpu", max_length = 100 ):
    super(Transformer, self).__init__()

    self.encoder = Encoder(
        src_vocab_size,
        embed_size,
        num_layers,
        heads,
        device,
        forward_expansion,
        dropout,
        max_length
    )

    self.decoder = Decoder(
        trg_vocab_size,
        embed_size,
        num_layers,
        heads,
        forward_expansion,
        dropout,
        device,
        max_length
    )

    self.src_pad_idx = src_pad_idx
    self.trg_pad_idx = trg_pad_idx
    self.device = device


  def make_src_mask(self, src):
    src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)

    #(N,1,1,src_length)

    return src_mask.to(self.device)

  def make_trg_mask(self, trg):
    N, trg_len = trg.shape
    trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(N, 1, trg_len, trg_len)
    return trg_mask.to(self.device)

  def forward(self, src, trg):
    src_mask = self.make_src_mask(src)
    trg_mask = self.make_trg_mask(trg)
    enc_src = self.encoder(src, src_mask)
    out = self.decoder(trg, enc_src, src_mask, trg_mask)
    return out




# Transformer Model Forward Pass Example

This section demonstrates how the **Transformer model** works with source and target sequences for a typical sequence-to-sequence task. We'll walk through the forward pass using example input tensors.

## 🔧 Code Overview

```python
import torch

device = "cpu"

# Define input tensors (source and target sequences)
x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(device)
trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]]).to(device)

# Vocabulary and padding indices
src_pad_idx = 0
trg_pad_idx = 0
src_vocab_size = 10
trg_vocab_size = 10

# Clamping values to ensure they are within the valid range for vocab indices
x = torch.clamp(x, 0, src_vocab_size - 1)
trg = torch.clamp(trg, 0, trg_vocab_size - 1)

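# Initialize the model (pass device so masks and position indices are created on it)
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, device=device).to(device)

# Forward pass: the decoder input is the target without its last token, trg[:, :-1]
out = model(x, trg[:, :-1])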

# Output the shape of the result
print(out.shape)
```

## 💡 Explanation

### 1. **Input Tensors: `x` and `trg`**
- **`x` (source sequence)**:
  - Shape: `(2, 9)` — 2 sequences, each with 9 tokens.
- **`trg` (target sequence)**:
  - Shape: `(2, 8)` — 2 sequences, each with 8 tokens.

These sequences represent indices in the source and target vocabularies.

### 2. **Clamping Values**
- `torch.clamp` forces every index in `x` and `trg` into the valid range `[0, vocab_size - 1]` for `src_vocab_size` and `trg_vocab_size`, which avoids out-of-range lookups in the embedding layers.
- In this example all values are already in range, so the clamp is a no-op safeguard. Note that clamping silently remaps any out-of-range index to the nearest valid one rather than reporting it.
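
The behavior of `torch.clamp` on out-of-range ids can be seen in a small example. The input below deliberately contains invalid ids, and the range check at the end is a stricter alternative offered as an assumption, not something the notebook does.

```python
import torch

src_vocab_size = 10
x = torch.tensor([[1, 5, 12, 4], [3, 9, -1, 2]])   # 12 and -1 are out of range

# clamp silently maps out-of-range ids to the nearest valid id
clamped = torch.clamp(x, 0, src_vocab_size - 1)
print(clamped)      # 12 becomes 9, -1 becomes 0

# Stricter alternative (an assumption, not part of the notebook): detect bad ids
in_range = bool(((x >= 0) & (x < src_vocab_size)).all())
print(in_range)     # False
```

Checking the range first makes data problems visible instead of hiding them behind a remapped token.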

### 3. **Transformer Model Initialization**
The model is instantiated with the following key parameters:
- **`src_vocab_size = 10`**: Source vocabulary size.
- **`trg_vocab_size = 10`**: Target vocabulary size.
- **`src_pad_idx = 0` and `trg_pad_idx = 0`**: Padding token indices (0 in this case).
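- **Device**: Set to `"cpu"`. To run on GPU, change it to `"cuda"` and also pass it to the model as `device=device`, since the `Transformer` creates its masks and position indices on `self.device`.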

### 4. **Masking and Forward Pass**
- **Masking**: The `make_src_mask` and `make_trg_mask` methods are called inside the `Transformer` model's forward pass to create appropriate masks for the source and target sequences.
- **Target Sequence**: `trg[:,:-1]` is used as the input to the decoder. This slicing excludes the last token from the target sequence, as the decoder predicts the next token in the sequence.
  
### 5. **Expected Output Shape**
The output shape from the forward pass will be `(2, 7, 10)`:
- **2**: Batch size (2 sequences).
- **7**: Length of the target sequence minus the last token (`trg[:,:-1]`).
- **10**: The target vocabulary size (`trg_vocab_size`).

This shape indicates the decoder's prediction for each position in the target sequence: one unnormalized score (logit) per vocabulary token. Applying a softmax over the last dimension would turn these scores into probabilities.

---

## 🔍 Expected Output

```python
torch.Size([2, 7, 10])
```

This shape means that the model has produced predictions for 7 tokens in the target sequence for each of the 2 sequences in the batch, with 10 possible tokens (vocabulary size) at each position in the sequence.
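
To turn an output of this shape into token predictions, one can apply a softmax or an argmax over the last dimension. The sketch below uses a random tensor as a stand-in for the model output, which is an assumption made to keep the example self-contained.

```python
import torch

torch.manual_seed(0)
# Stand-in for the model output `out`: raw logits of shape (N, trg_len, vocab)
out = torch.randn(2, 7, 10)

probs = torch.softmax(out, dim=-1)      # probabilities over the vocabulary
predicted = out.argmax(dim=-1)          # most likely token id per position

print(probs.sum(dim=-1))    # each entry is 1 (up to rounding)
print(predicted.shape)      # torch.Size([2, 7])
```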

---

## Conclusion
In summary:
- The code demonstrates a forward pass through the **Transformer model**.
- It uses source and target sequences with appropriate masking and clamping to ensure valid indices.
- The final output shape reflects the model's predictions for the target sequence, considering the given input and vocabulary sizes.

> 🧑‍🏫 **Tip**: Ensure that your target sequences are properly shifted by 1 token when feeding them into the decoder, as is done with `trg[:,:-1]` here.


In [36]:
device = "cpu"

x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(device)

trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]]).to(device)

src_pad_idx = 0
trg_pad_idx = 0
src_vocab_size = 10
trg_vocab_size = 10

# Safeguard: clamp indices into the valid range [0, vocab_size - 1].
# All values above are already in range, so this is a no-op here.
x = torch.clamp(x, 0, src_vocab_size - 1)
trg = torch.clamp(trg, 0, trg_vocab_size - 1)

# Pass device so masks and position indices are created on the same device
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, device=device).to(device)

out = model(x, trg[:,:-1])

print(out.shape)

torch.Size([2, 7, 10])
