<div align="center">

# Understanding and Implementing Transformers from Scratch

</div>

The Transformer model was introduced in the paper "Attention Is All You Need." The standard (vanilla) Transformer consists of two main components: an encoder and a decoder. In this notebook, I will explain the model step by step and build it from scratch using PyTorch.

First, I will introduce the encoder, followed by the decoder. To enhance understanding, I will also illustrate the model with a simple example. Below is the architecture of the Transformer model (adapted from the original paper). This diagram provides a high-level overview of the model's structure. I will explore each component in detail, discussing both the conceptual understanding and the step-by-step implementation, based on my perspective as an applied mathematician.

<img src="transformer.png" alt="Transformer Model" width="200"/>

# Encoder

First, let's examine the encoder component. I will explain the process from the inputs to the input embedding block (**see the right figure**).

<div style="display: flex; justify-content: space-between;">
    <img src="encoder.png" alt="Encoder" width="45%">
    <img src="encoder_input_embedding.png" alt="Input Embedding" width="45%">
</div>

### Input Embedding

Suppose we need to translate from English to French, such as:
* "I love you" → "Je t’aime"
* "I like you" → "Je t’aime bien"

The sentences "I love you" and "I like you" should be fed into the encoder as inputs. However, before doing so, we must first transform the text into numerical representations. For example, we can map words to integers as follows: `{'I': 1, 'love': 2, 'you': 3, 'like': 4}`.

Using this mapping:
* "I love you" becomes [1, 2, 3] instead of the raw text.
* "I like you" becomes [1, 4, 3].
  
However, this method has a major limitation: the integer IDs are arbitrary labels, so the numerical distance between two IDs carries no meaning. The model has no way of knowing that "love" (2) and "like" (4) are semantically related, while "I" (1) and "you" (3), which are just as far apart numerically, are not.

To address this, we use embeddings, which map words into a continuous, high-dimensional space where semantically similar words have closer distances. Instead of treating words as mere integers, embeddings assign each word a learned vector representation, such as:
* "love" → [0.21, 0.87, −0.32, ...]
* "like" → [0.22, 0.85, −0.30, ...]

Since "love" and "like" are close in meaning, their vector representations are also close in space. This structure allows the model to understand word relationships, which is crucial for capturing meaning in natural language processing.
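
To make "close in space" concrete, the sketch below computes the cosine similarity between the two illustrative vectors above. It assumes only the first three components shown (the trailing `...` is ignored) and adds a made-up vector for an unrelated word, so the numbers are for intuition only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First three components of the illustrative vectors from the text.
love = [0.21, 0.87, -0.32]
like = [0.22, 0.85, -0.30]
# Hypothetical vector for an unrelated word (an assumption, not from the text).
unrelated = [-0.90, 0.10, 0.40]

sim_love_like = cosine_similarity(love, like)
sim_love_unrelated = cosine_similarity(love, unrelated)
print(f"love vs like:      {sim_love_like:.4f}")
print(f"love vs unrelated: {sim_love_unrelated:.4f}")
```

The first similarity is close to 1, while the second is much lower, which is the geometric sense in which related words sit near each other.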

Moreover, because Transformers perform mathematical operations (such as self-attention) on input data, they require continuous values. Embeddings enable the model to process and learn from text effectively.

This brings us to the concept of Input Embedding, where words are transformed from discrete integer IDs into meaningful high-dimensional representations.
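
Concretely, an embedding layer behaves like a trainable lookup table: embedding a token ID returns the corresponding row of a weight matrix. The sketch below illustrates this with arbitrary toy sizes (40 rows, 6 columns); the variable names `emb` and `ids` are chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random initial weights reproducible

emb = nn.Embedding(num_embeddings=40, embedding_dim=6)
ids = torch.tensor([[1, 2, 3]])  # "I love you" under the mapping above

looked_up = emb(ids)      # shape: (batch=1, seq_len=3, embedding_dim=6)
rows = emb.weight[ids]    # index the weight matrix directly

print(looked_up.shape)
print(torch.equal(looked_up, rows))
```

Because the rows are ordinary parameters, training can move the vectors of related words closer together.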

In [1]:
import torch
import torch.nn as nn
import numpy as np
# Define the embedding layer 
embedding = nn.Embedding(num_embeddings=40, embedding_dim=6)  # num_embeddings(the total number of unique words and special tokens).

# Example input tensor (word indices)
input_tensor = torch.tensor([[1, 2, 3]])

# Get the embedded output
embedded_output = embedding(input_tensor)

# Print shape and output
print("Shape of output:", embedded_output.shape)  
print("Embedded Output:\n", embedded_output) 

Shape of output: torch.Size([1, 3, 6])
Embedded Output:
 tensor([[[-0.6899, -0.7213, -1.3537, -1.0998,  0.2229,  0.3120],
         [ 0.1424,  1.2600,  0.7236, -1.4959,  1.2815,  0.4239],
         [ 1.9486,  0.4626,  1.1677, -1.0220, -0.9651,  0.6257]]],
       grad_fn=<EmbeddingBackward0>)


### Positional Encoding

<img src="positional_encoding.png" alt="Positional encoding" width="500"/>

Since the Transformer model's self-attention mechanism does not inherently capture the order of words in a sequence—unlike RNNs, which process inputs sequentially—we need to explicitly provide positional information along with the input embeddings. This ensures that the model can differentiate word positions and retain the correct order within a sentence.

For example, in the sentence "I love you", after applying self-attention, the model may treat the words ("I", "love", "you") in any order, as self-attention alone does not encode positional dependencies. To address this, we introduce positional encoding, which helps the model incorporate order information and correctly interpret the sequence.

The positional encoding should have the properties:
* Unique encoding for each position across the entire corpus: The position values should be consistent across all sentences in the dataset, ensuring that the model can correctly differentiate word positions regardless of sentence length.

* A predictable relationship between positions: If we know the position p of one word, we should be able to easily determine the position p + k of another word. This helps the model recognize positional patterns more effectively.

A natural way to define position is by using the index of each word in the sentence (e.g., 1, 2, 3, ...). However, if a sentence is very long, such as 200 words, the positional values would range from 1 to 200.

If we add such a positional value to the input embedding of the last word (e.g., "love") in that sentence, it would result in: Positional Encoding + Input Embedding --> [200 + 0.21, 200 + 0.87, 200 − 0.32, ...]. These large positional values could dominate the original embedding, making it harder for the model to retain the actual meaning of the word "love".

**Sinusoidal Encoding:**

One way to define positional encoding that satisfies the above conditions is by using sinusoidal functions. Suppose $d_{model}$ is the dimension of the model (i.e., the dimension of both the input embeddings and the positional encoding), and let $i = 0, 1, \dots, d_{model}/2 - 1$ index pairs of dimensions. The positional encoding is defined as:
* For even dimensions ($2i$):
  $$\text{Positional Encoding}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
* For odd dimensions ($2i+1$):
  $$\text{Positional Encoding}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
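
The definition can be evaluated directly in plain Python. This is a minimal sketch, assuming the standard pairing in which dimensions $2i$ and $2i+1$ share one frequency (the same convention as the PyTorch class later in this section); `sinusoidal_encoding` is a name chosen here for illustration.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional encoding of one position (d_model even)."""
    encoding = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle))  # dimension 2i
        encoding.append(math.cos(angle))  # dimension 2i + 1
    return encoding

d_model = 6
table = [sinusoidal_encoding(pos, d_model) for pos in range(3)]
for pos, row in enumerate(table):
    print(pos, [round(v, 4) for v in row])

# Every entry stays in [-1, 1], unlike a raw index such as 200.
print(all(-1.0 <= v <= 1.0 for row in table for v in row))
```

Because the values are bounded, adding them to an embedding shifts it without swamping it.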

**Claim: for any two distinct positions $p$ and $j$, the vectors generated by the sinusoidal positional encoding function differ in at least one dimension $i$ from $0$ to $d_{\text{model}} - 1$. It suffices to consider dimension $i = 0$.**

**Proof:**  
Let $p$ and $j$ be two distinct positions (non-negative integers) such that $p \neq j$. Suppose, for contradiction, that:

$\text{Positional Encoding}(p,0) = \text{Positional Encoding}(j,0)$

From the definition of positional encoding:

$\sin\left(\frac{p}{10000^{\frac{2 \cdot 0}{d_{\text{model}}}}} \right) = \sin\left(\frac{j}{10000^{\frac{2 \cdot 0}{d_{\text{model}}}}} \right)$

Since $10000^{0} = 1$, this reads $\sin(p) = \sin(j)$. Two angles have the same sine exactly when they differ by a multiple of $2\pi$ or are supplementary up to a multiple of $2\pi$, so for some integer $k$:

$p = j + 2k\pi \quad \text{or} \quad p = \pi - j + 2k\pi$

Simplifying, we obtain:

$p - j = 2k\pi \quad \text{or} \quad p + j = (2k+1)\pi$

Since $p$ and $j$ are integers, the left-hand side of each equation is an integer. In the first case, $k \neq 0$ because $p \neq j$, so $2k\pi$ is a nonzero integer multiple of $\pi$ and therefore irrational. In the second case, $(2k+1)\pi$ is always a nonzero integer multiple of $\pi$ and therefore irrational. Both cases are contradictions, as no integer can be equal to an irrational number.

Thus, the assumption that positional encoding vectors could be identical for distinct positions is false. Therefore, for any two distinct positions $p$ and $j$, their positional encoding vectors must differ in at least one dimension $i$. (A more general proof can be obtained by comparing transcendental and algebraic numbers: $\pi$ is transcendental, while integers are algebraic.)
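
As an informal numerical companion to the proof (not a substitute for it), the sketch below checks that dimension $0$, i.e. $\sin(pos)$, takes pairwise distinct floating-point values over a range of positions. The range of 200 positions is an arbitrary choice.

```python
import math

max_positions = 200  # arbitrary range for the check
first_dim = [math.sin(pos) for pos in range(max_positions)]

# If two positions collided in dimension 0, the set would be smaller than the list.
all_distinct = len(set(first_dim)) == len(first_dim)

# Smallest gap between any two values, to see how close the nearest pair gets.
ordered = sorted(first_dim)
min_gap = min(b - a for a, b in zip(ordered, ordered[1:]))

print(all_distinct)
print(min_gap)
```

Some pairs come quite close in this single dimension, which is one practical reason the encoding uses several frequencies rather than one.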


Now let us consider the second property of sinusoidal encoding:
* A predictable relationship between positions: If we know the position p of one word, we should be able to easily determine the position p + k of another word. This helps the model recognize positional patterns more effectively.

Transformer positional encoding pairs a sine and a cosine at each frequency. This pairing encodes relative positions in a structured way: for any two positions $p$ and $j$ with relative difference $h=j-p$, the encoding at $j$ is obtained from the encoding at $p$ by a linear transformation that depends only on $h$, not on the absolute positions. The original paper hypothesized that this makes it easy for the model to attend by relative position and may help it extrapolate to sequences longer than those seen in training, without requiring learned position embeddings.

**Proof:** 
Let's set $\omega_i = \frac{1}{10000^{2i/d_{model}}}$, then we have 
$$
\begin{bmatrix}
\sin(\omega_i j) \\
\cos(\omega_i j)
\end{bmatrix}
=
\begin{bmatrix}
\sin(\omega_i (p+h)) \\
\cos(\omega_i (p+h))
\end{bmatrix}
=
\begin{bmatrix}
\cos(\omega_i h) & \sin(\omega_i h) \\
-\sin(\omega_i h) & \cos(\omega_i h)
\end{bmatrix}
\cdot
\begin{bmatrix}
\sin(\omega_i p) \\
\cos(\omega_i p)
\end{bmatrix}
$$

This shows that when the difference between positions $p$ and $j$ in the positional index is the same, the corresponding transformation in the Sinusoidal Encoding remains unchanged:
$$
\begin{bmatrix}
\cos(\omega_i h) & \sin(\omega_i h) \\
-\sin(\omega_i h) & \cos(\omega_i h)
\end{bmatrix}
$$
This transformation depends only on the relative difference $h = j-p$ and is independent of the absolute values of $p$ and $j$.
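
The identity can be checked numerically. The sketch below, a small verification with an arbitrary choice of $p$, $h$, and $d_{model}$, builds the $2 \times 2$ matrix from $h$ alone and confirms that it maps the $(\sin, \cos)$ pair at position $p$ to the pair at position $p + h$ for every frequency $\omega_i$.

```python
import math

d_model = 6
p, h = 5, 3  # arbitrary absolute position and offset

max_error = 0.0
for i in range(d_model // 2):
    omega = 1.0 / (10000 ** (2 * i / d_model))

    # Encoding pair at position p.
    s_p, c_p = math.sin(omega * p), math.cos(omega * p)

    # Transformation matrix built from the offset h only.
    m00, m01 = math.cos(omega * h), math.sin(omega * h)
    m10, m11 = -math.sin(omega * h), math.cos(omega * h)

    # Apply the matrix to the pair at p.
    s_pred = m00 * s_p + m01 * c_p
    c_pred = m10 * s_p + m11 * c_p

    # Compare with the encoding pair computed directly at p + h.
    s_true, c_true = math.sin(omega * (p + h)), math.cos(omega * (p + h))
    max_error = max(max_error, abs(s_pred - s_true), abs(c_pred - c_true))

print(max_error)
```

Changing `p` while keeping `h` fixed leaves the matrix entries untouched, which is exactly the independence from absolute position stated above.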







In [2]:
import math
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
        
    def forward(self, x):
    
        return x + self.pe[:, :x.size(1)]  # the model will consume the sum of the embedding and the position encoding

In [3]:
d_model = 6  # consistent with the embedding example above
max_seq_length = 3 
positional_encoding = PositionalEncoding(d_model, max_seq_length)

position_plus_embedding = positional_encoding(embedded_output)
print(position_plus_embedding.detach().numpy().round(2))
print(position_plus_embedding - embedded_output)

[[[-0.69  0.28 -1.35 -0.1   0.22  1.31]
  [ 0.98  1.8   0.77 -0.5   1.28  1.42]
  [ 2.86  0.05  1.26 -0.03 -0.96  1.63]]]
tensor([[[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000],
         [ 0.8415,  0.5403,  0.0464,  0.9989,  0.0022,  1.0000],
         [ 0.9093, -0.4161,  0.0927,  0.9957,  0.0043,  1.0000]]],
       grad_fn=<SubBackward0>)


### Multi-Head Attention

<img src="mutihead_attention.png" alt="Multi-head attention" width="500"/>

#### Query, Key, and Value in Attention

For this block, the input is the sum of the input embedding and the positional encoding. This combined input is then transformed for the multi-head attention block into:
* Query: q
* Key: k
* Value: v

There are many explanations for query, key, and value in attention mechanisms. Here’s how I understand it:

Imagine you have a question and you know that the answer can be found in a book. When you open the book, you first scan through the table of contents, comparing the chapter titles with your question. Once you find keywords that closely match your question, you turn to the relevant section and read the details.

In this analogy:
* Your question is the query.
* The chapter titles in the table of contents are the keys.
* The detailed content inside each chapter is the value.

When comparing your query with the keys, the closer the match, the more information you extract from that section. If there’s a strong match, you rely heavily on that content. If the match is weak, you gather less information from it.

This is how attention mechanisms work: by weighing different pieces of information based on their relevance to the query. To obtain $q$, $k$, and $v$, we apply three learned linear transformations to each input vector $x$ (one row of the sum of input embedding and positional encoding): $q = xW_q$, $k = xW_k$, and $v = xW_v$. Below we show an example transforming inputs to query, key and value.



<img src="qkv.png" alt="Query, key, and value projections" width="500"/>

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super(SelfAttention, self).__init__()
        
        self.d_model = d_model
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def forward(self, Q, K, V, mask=None):
         
        Q = self.W_q(Q)
         
        K = self.W_k(K)
         
        V = self.W_v(V)

        return Q,K,V

In [5]:
self_attention = SelfAttention(d_model)
Q,K,V = self_attention(position_plus_embedding,position_plus_embedding,position_plus_embedding)
print(Q)
print(K)
print(V)

tensor([[[-0.5244,  0.1807,  0.5842, -0.7224, -0.3677,  0.6021],
         [ 1.1129,  0.5850,  0.5045, -0.0854, -0.6448, -0.7773],
         [ 0.8135,  1.2542, -0.1623,  1.6789, -0.0822, -0.6191]]],
       grad_fn=<ViewBackward0>)
tensor([[[-0.2834,  0.1222,  0.5653, -0.4237, -0.2857, -0.1194],
         [-0.5558, -0.0443, -0.7159, -1.0623,  0.3462,  0.3581],
         [ 0.4966, -0.2995, -0.6203, -1.0580,  0.3751,  1.6195]]],
       grad_fn=<ViewBackward0>)
tensor([[[ 0.2462,  0.2801,  0.2107, -0.0255,  0.6793, -0.0926],
         [ 0.3025,  0.0955,  0.5272, -0.3304,  0.4018,  0.3059],
         [ 1.1790, -0.4415, -0.4340, -0.0909,  0.6732,  0.3955]]],
       grad_fn=<ViewBackward0>)


#### Scaled Dot-Product Attention

<img src="self_attention.png" alt="Scaled dot-product attention" width="500"/>

We continue with the example "I love you". For each word—"I," "love," and "you"—we extract three properties: query, key, and value, as discussed above.

Following the same process, when answering the query for "I", we first compute the similarity (dot product) between its query vector $q_I$ and the keys from all words: "I," "love," and "you". We divide these scores by $\sqrt{d_{model}}$ to prevent excessively large values, and then apply a softmax function to turn them into weights that sum to 1 (with multiple heads, the scaling uses the per-head dimension $d_k$ instead):
$$\text{Attention weights}\quad  w = \text{softmax}\Big(\frac{q_I\cdot k}{\sqrt{d_{model}}}\Big)$$
where $k$ ranges over the keys $k_I$, $k_{love}$, and $k_{you}$. Using these weights, we aggregate the answer (output) for the query from the values corresponding to all words in the sentence:
$$\text{'answer'} = w_1 v_I + w_2 v_{love} + w_3 v_{you}$$
A similar process is applied to compute attention for "love" and "you," as illustrated in the figure above.
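
To see the two formulas at work, here is a toy computation for a single query. The query, key, and value vectors are made-up numbers chosen for readability (they are not the tensors printed earlier), and the dimension `d_toy = 2` is an assumption for the example.

```python
import math
import torch

d_toy = 2  # toy model dimension, assumed for this example

q_I = torch.tensor([1.0, 0.0])        # query for "I"
keys = torch.tensor([[1.0, 0.0],      # k_I
                     [0.0, 1.0],      # k_love
                     [1.0, 1.0]])     # k_you
values = torch.tensor([[1.0, 0.0],    # v_I
                       [0.0, 1.0],    # v_love
                       [2.0, 2.0]])   # v_you

# Similarity of the query with every key, scaled by sqrt of the dimension.
scores = keys @ q_I / math.sqrt(d_toy)

# Softmax turns the scores into non-negative weights that sum to 1.
weights = torch.softmax(scores, dim=-1)

# The "answer" is the weighted sum w_1 v_I + w_2 v_love + w_3 v_you.
answer = weights @ values

print(weights)
print(answer)
```

The key for "love" is orthogonal to the query, so it receives the smallest weight, while the two keys that overlap with the query share the rest equally.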

In [6]:
class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super(SelfAttention, self).__init__()
        
        self.d_model = d_model
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
    def scaled_product_attention(self, Q, K, V):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_model)
        
        attention_weights = torch.softmax(attn_scores, dim=-1)
        
        output = torch.matmul(attention_weights, V)
        return output
        
    def forward(self, Q, K, V, mask=None):
         
        Q = self.W_q(Q)
         
        K = self.W_k(K)
         
        V = self.W_v(V)
             
        attn_output = self.scaled_product_attention(Q, K, V)
        
        return attn_output


In [7]:
self_attention = SelfAttention(d_model)
attn_output = self_attention(position_plus_embedding,position_plus_embedding,position_plus_embedding)
print(attn_output.detach().numpy().round(2))

[[[ 0.04  0.36 -0.39  0.61  0.07  0.94]
  [-0.38  0.52 -0.41  0.45  0.22  0.97]
  [-0.57  0.58 -0.42  0.38  0.29  0.96]]]


The next step is to extend the SelfAttention model to a MultiHeadAttention model.

**Feature-Splitting Multi-Head Attention**

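One way is to split the projected features across multiple attention heads. Imagine reading a complex sentence. If you focus on everything at once, it’s overwhelming. Instead, your brain processes different parts separately. Splitting has three appealing properties:
* (A) Each head learns different attention patterns.
* (B) More expressive representations.
* (C) Comparable computational cost: each head works on only $d_{model}/num\_heads$ features, so the total cost stays similar to that of single-head attention over the full dimension.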

For example, consider the following query matrix:
$$
\text{query} =
\begin{bmatrix}
  0.53 & 0.11 & -1.24 & -0.05 & 0.03 & 0.08 \\
  0.45 & 0.17 & -1.32 & -0.04 & 0.25 & 0.04 \\
  0.45 & 0.20 & -1.30 & -0.04 & 0.27 & 0.03
\end{bmatrix}
$$

If we apply **Multi-Head Attention** with $num\_heads = 2$, we split the query into **two heads**, each processing a portion of the feature space:

$$
\text{split\_query} =
\begin{bmatrix}
  \begin{bmatrix} 0.53 & 0.11 & -1.24 \\ 0.45 & 0.17 & -1.32 \\ 0.45 & 0.20 & -1.30 \end{bmatrix},
  \begin{bmatrix} -0.05 & 0.03 & 0.08 \\ -0.04 & 0.25 & 0.04 \\ -0.04 & 0.27 & 0.03 \end{bmatrix}
\end{bmatrix}
$$
Similarly, the key and value matrices are split in the same way. Each head independently computes scaled dot-product attention on its respective sub-query, allowing different attention heads to focus on different parts of the input.

In essence, each head processes a sub-section of the query (and similarly, the key and value) before the results are combined to form the final attention output.
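
The split can be reproduced with a reshape. The sketch below uses the example query matrix, adds a batch dimension of size 1, and applies the same `view` + `permute` pattern as the class that follows; it is meant only to show which numbers end up in which head.

```python
import torch

query = torch.tensor([[[0.53, 0.11, -1.24, -0.05, 0.03, 0.08],
                       [0.45, 0.17, -1.32, -0.04, 0.25, 0.04],
                       [0.45, 0.20, -1.30, -0.04, 0.27, 0.03]]])  # (batch=1, seq_len=3, d_model=6)

num_heads = 2
batch_size, seq_length, d_model = query.shape
d_k = d_model // num_heads  # features per head

# (batch, seq_len, d_model) -> (batch, seq_len, heads, d_k) -> (batch, heads, seq_len, d_k)
split_query = query.view(batch_size, seq_length, num_heads, d_k).permute(0, 2, 1, 3)

print(split_query.shape)
print(split_query[0, 0])  # head 0: first three feature columns
print(split_query[0, 1])  # head 1: last three feature columns

# Undo the split: move the head axis back next to d_k and merge the two.
combined = split_query.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_length, d_model)
print(torch.equal(combined, query))
```

The `.contiguous()` call is needed because `permute` only changes strides, and `view` requires a contiguous memory layout.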



In [8]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

class SplittingMultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(SplittingMultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attention_weights = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output
        
    # def split_heads(self, x):
    #     batch_size, seq_length, d_model = x.size()
    #     return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
    # def combine_heads(self, x):
    #     batch_size, _, seq_length, d_k = x.size()
    #     return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).permute(0, 2, 1, 3)

    def combine_heads(self, x):
        batch_size, num_heads, seq_length, d_k = x.size()
        return x.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
         
        Q = self.split_heads(self.W_q(Q))
         
        K = self.split_heads(self.W_k(K))
         
        V = self.split_heads(self.W_v(V))
         
        
        attn_output = self.scaled_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output

**Independent-Head Attention**

* Each head acts as an independent "expert" that learns to understand the full input from a different perspective. Instead of splitting the features across heads (like standard multi-head attention), each head sees the entire representation.
* The heads "think" independently, process the full information, and then communicate their findings by concatenating their outputs.
* A final projection layer blends their insights together, ensuring that all perspectives contribute to the final decision.
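
One practical difference between the two variants is size. The following rough comparison, a sketch using the toy sizes of this notebook ($d_{model} = 6$, two heads) and assuming every `nn.Linear` keeps its default bias, counts the parameters of the projection layers each design needs.

```python
import torch.nn as nn

def count_parameters(layers):
    """Total number of weights and biases in a list of layers."""
    return sum(p.numel() for layer in layers for p in layer.parameters())

d_model, num_heads = 6, 2

# Feature-splitting variant: one W_q, W_k, W_v, W_o shared by all heads.
splitting_layers = [nn.Linear(d_model, d_model) for _ in range(4)]

# Independent-head variant: W_q, W_k, W_v per head, plus a wider W_o.
independent_layers = [nn.Linear(d_model, d_model) for _ in range(3 * num_heads)]
independent_layers.append(nn.Linear(d_model * num_heads, d_model))

splitting_params = count_parameters(splitting_layers)
independent_params = count_parameters(independent_layers)
print(splitting_params, independent_params)
```

The independent-head count grows with the number of heads, whereas the feature-splitting count does not depend on it.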

In [9]:
import torch
import torch.nn as nn
import math

class IndependentHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(IndependentHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        
        # Each head gets its own full-dimensional projections
        self.W_q = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_heads)])
        self.W_k = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_heads)])
        self.W_v = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_heads)])
        
        self.W_o = nn.Linear(d_model * num_heads, d_model)  # Adjust for concatenated heads

    def scaled_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_model)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attention_weights = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output

    def forward(self, Q, K, V, mask=None):
        head_outputs = []
        
        for i in range(self.num_heads):
            Q_i = self.W_q[i](Q)  
            K_i = self.W_k[i](K)  
            V_i = self.W_v[i](V)  
            
            attn_output = self.scaled_product_attention(Q_i, K_i, V_i, mask)
            head_outputs.append(attn_output)
        
        # Concatenate along feature dimension
        multi_head_output = torch.cat(head_outputs, dim=-1)
        
        # Final projection back to d_model
        output = self.W_o(multi_head_output)
        return output

### Add & Norm and Feed Forward blocks

<img src="add_norm.png" alt="Add & Norm and Feed Forward" width="500"/>

In a Transformer, the Add & Norm layer plays a crucial role in stabilizing training and improving information flow.

* The Add (Residual Connection) simply sums the attention output (`attn_output`) with the input to the attention block (`position_plus_embedding` in the first layer). This helps gradients flow more easily through the network, preventing vanishing gradients and allowing deeper models to be trained effectively.
* The Norm (Layer Normalization) step then normalizes the summed output across the feature dimension of each token to stabilize the distribution of features, making training more efficient.

After the Add & Norm step, the output is passed through the Feed Forward Network (FFN), which consists of: 
* A linear transformation followed by a non-linearity (e.g., ReLU or GELU). 
* Another linear transformation to project back to the original dimension.

Mathematically, with $x$ denoting the input to the block, the process can be written as:
**$x_1$ = Norm($x$ + Attn Output) ---> $x_2$ = Norm($x_1$ + Feed Forward($x_1$))**
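
A minimal sketch of this pipeline is shown below. It assumes the post-norm arrangement described above, takes the attention output as a given tensor rather than recomputing it, and omits dropout; the class name `AddNormFeedForward` and the random stand-in tensors are choices made for this illustration.

```python
import torch
import torch.nn as nn

class AddNormFeedForward(nn.Module):
    """Add & Norm after attention, then a feed-forward block with its own Add & Norm."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand
            nn.ReLU(),                  # non-linearity
            nn.Linear(d_ff, d_model),   # project back to d_model
        )

    def forward(self, x, attn_output):
        x = self.norm1(x + attn_output)            # Add (residual) & Norm
        x = self.norm2(x + self.feed_forward(x))   # Feed Forward, then Add & Norm
        return x

torch.manual_seed(0)
d_model, d_ff = 6, 24
x = torch.randn(1, 3, d_model)            # stand-in for embedding + positional encoding
attn_output = torch.randn(1, 3, d_model)  # stand-in for the attention output

block = AddNormFeedForward(d_model, d_ff)
out = block(x, attn_output)
print(out.shape)
```

The output keeps the shape of the input, which is what allows several such blocks to be stacked.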
              

# Decoder

<img src="decoder.png" alt="Decoder" width="600"/>

The decoder is quite similar to the encoder, but with some key differences. Therefore, we will focus on two main components: the output embeddings and the masked multi-head attention. In the masked multi-head attention, we will specifically discuss the masking mechanism, as the self-attention mechanism itself remains the same as in the encoder.

### Outputs and Output Embedding

To illustrate this, let’s consider the English-to-French translation of "I love you" ---> "Je t’aime". We begin by tokenizing the words separately. For example, we can assign token IDs as follows:

* English: `{"I": 1, "love": 2, "you": 3, "<pad>": 0, "<sos>": 4, "<eos>": 5}`
* French: `{"Je": 1, "t’aime": 2, "<pad>": 0, "<sos>": 3, "<eos>": 4}`
  
When working with multiple samples of varying sentence lengths, the `<pad>` token (0) is used to standardize input sizes by padding shorter sentences. Additionally, `<sos>` (Start of Sequence) and `<eos>` (End of Sequence) are special tokens that play a crucial role in training and inference. 

To train the decoder effectively, we right-shift the target sequence. This means the decoder input starts with `<sos>` and drops the last token of the full target, while the target output drops the leading `<sos>`. At each position the model therefore learns to predict the next token without seeing future words.

For this example, assume the maximum sentence length for both inputs and outputs, including the start-of-sequence and end-of-sequence tokens, is 5:
* The full target: `["<sos>", "Je", "t’aime", "<eos>", "<pad>"] → [3, 1, 2, 4, 0]`
* Target output: `["Je", "t’aime", "<eos>", "<pad>"] → [1, 2, 4, 0]`
* Right-shifted decoder input: `["<sos>", "Je", "t’aime", "<eos>"] → [3, 1, 2, 4]`
  
This setup enables the model to generate the next word autoregressively, aligning training with real-world inference. Once we obtain the tokens, the next step is to compute the output embeddings and apply positional encoding, just as in the encoder.
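
The shift itself is a pair of slices. The sketch below uses the French token IDs from this section; the helper name `pad_to_length` is made up for the example.

```python
import torch

PAD, SOS, EOS = 0, 3, 4  # French special-token IDs from the vocabulary above
max_seq_length = 5

def pad_to_length(token_ids, length, pad_id=PAD):
    """Right-pad a list of token IDs with pad_id up to the given length."""
    return token_ids + [pad_id] * (length - len(token_ids))

# "<sos> Je t'aime <eos>" -> [3, 1, 2, 4], then padded to length 5.
full_target = torch.tensor([pad_to_length([SOS, 1, 2, EOS], max_seq_length)])

decoder_input = full_target[:, :-1]  # drop the last token  -> what the decoder sees
target_output = full_target[:, 1:]   # drop the first token -> what it must predict

print(full_target.tolist())
print(decoder_input.tolist())
print(target_output.tolist())
```

Position by position, `decoder_input[t]` is the token the decoder receives and `target_output[t]` is the token it is trained to produce next.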



### Masked Multi-Head Attention

The masking operation serves two purposes: it prevents the model from attending to padding tokens and ensures that the decoder cannot access future tokens during training. The padding mask applies to both the encoder and decoder to ignore padded values, while the no-peek mask is used in the decoder to restrict attention to only past and present tokens, enforcing autoregressive generation.

<img src="mask.png" alt="Attention mask" width="600"/>

In [20]:
import torch
import torch.nn as nn
import numpy as np

src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 6
num_heads = 2
num_layers = 1
d_ff = 4*d_model
 
batch_size = 1
dropout = 0.1
num_epochs = 300


max_seq_length = 5
# Define the embedding layer 
embedding = nn.Embedding(num_embeddings=40, embedding_dim=d_model)  # num_embeddings(the total number of unique words and special tokens).

# Example input tensor (word indices)
src_tensor = torch.tensor([[1, 2, 3, 4, 0]])
print(src_tensor.shape)
tgt_tensor = torch.tensor([[3, 1, 2, 4, 0]])

# Get the embedded output
embedded_src = embedding(src_tensor)
embedded_tgt = embedding(tgt_tensor[:,:])
 
print("Embedded Output:\n", embedded_src) 
print("Embedded Output:\n", embedded_tgt) 

torch.Size([1, 5])
Embedded Output:
 tensor([[[ 0.2077,  0.9226,  0.8388,  1.4238, -0.8683, -1.1170],
         [-1.2872, -0.0714, -0.5810,  3.1085, -0.5321,  0.6713],
         [ 0.4108,  0.1605, -1.6998,  1.3860, -1.2406,  0.5303],
         [-0.2086,  0.0436,  1.0710, -0.0247,  0.5650,  0.1487],
         [-1.6737, -0.0327,  0.6171,  0.4939, -1.3666, -0.0686]]],
       grad_fn=<EmbeddingBackward0>)
Embedded Output:
 tensor([[[ 0.4108,  0.1605, -1.6998,  1.3860, -1.2406,  0.5303],
         [ 0.2077,  0.9226,  0.8388,  1.4238, -0.8683, -1.1170],
         [-1.2872, -0.0714, -0.5810,  3.1085, -0.5321,  0.6713],
         [-0.2086,  0.0436,  1.0710, -0.0247,  0.5650,  0.1487],
         [-1.6737, -0.0327,  0.6171,  0.4939, -1.3666, -0.0686]]],
       grad_fn=<EmbeddingBackward0>)


In [21]:
import math
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
        
    def forward(self, x):
    
        return x + self.pe[:, :x.size(1)]  # the model will consume the sum of the embedding and the position encoding

In [22]:
Positional_Encoding = PositionalEncoding(d_model,max_seq_length)

In [23]:
embedded_src_pos = Positional_Encoding(embedded_src)
embedded_tgt_pos = Positional_Encoding(embedded_tgt)
print(embedded_src_pos)
print(embedded_tgt_pos)

tensor([[[ 0.2077,  1.9226,  0.8388,  2.4238, -0.8683, -0.1170],
         [-0.4457,  0.4689, -0.5346,  4.1074, -0.5299,  1.6713],
         [ 1.3201, -0.2556, -1.6071,  2.3817, -1.2363,  1.5303],
         [-0.0674, -0.9464,  1.2098,  0.9657,  0.5714,  1.1487],
         [-2.4305, -0.6863,  0.8017,  1.4767, -1.3580,  0.9314]]],
       grad_fn=<AddBackward0>)
tensor([[[ 0.4108,  1.1605, -1.6998,  2.3860, -1.2406,  1.5303],
         [ 1.0492,  1.4629,  0.8851,  2.4227, -0.8661, -0.1170],
         [-0.3779, -0.4876, -0.4883,  4.1042, -0.5278,  1.6712],
         [-0.0674, -0.9464,  1.2098,  0.9657,  0.5714,  1.1487],
         [-2.4305, -0.6863,  0.8017,  1.4767, -1.3580,  0.9314]]],
       grad_fn=<AddBackward0>)


In [24]:
def generate_mask(src, tgt):
    src_mask = (src != 0).unsqueeze(1).unsqueeze(2)  # (batch_size, 1, 1, src_len)
    tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)  # (batch_size, 1, 1, tgt_len)
    
    seq_length = tgt.size(1)
    nopeak_mask = torch.tril(torch.ones(seq_length, seq_length)).bool().to(tgt.device) 
    tgt_mask = tgt_mask & nopeak_mask  # Apply no-peak mask correctly
    
    return src_mask, tgt_mask


In [25]:
print(tgt_tensor[:,:])
tgt_mask = (tgt_tensor[:,:] != 0).unsqueeze(1).unsqueeze(2) 
# print(tgt_mask)

seq_length = tgt_tensor[:,:].size(1)
nopeak_mask = torch.tril(torch.ones(seq_length, seq_length)).bool()
# print(nopeak_mask)

tgt_mask = tgt_mask & nopeak_mask
# print(tgt_mask)


print(tgt_mask==0)

tensor([[3, 1, 2, 4, 0]])
tensor([[[[False,  True,  True,  True,  True],
          [False, False,  True,  True,  True],
          [False, False, False,  True,  True],
          [False, False, False, False,  True],
          [False, False, False, False,  True]]]])


In [26]:
src_mask, tgt_mask =  generate_mask(src_tensor, tgt_tensor)
d_k = d_model // num_heads
def scaled_product_attention(Q, K, V, mask=None):
    attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    
    if mask is not None:
        attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
    print(attn_scores)
    attention_weights = torch.softmax(attn_scores, dim=-1)
    print(attention_weights) 
    output = torch.matmul(attention_weights, V)
    return output

In [27]:
# Note: the target mask is applied to the source embeddings here purely to illustrate masking.
output = scaled_product_attention(embedded_src_pos, embedded_src_pos, embedded_src_pos, mask=tgt_mask)

tensor([[[[ 6.4003e+00, -1.0000e+09, -1.0000e+09, -1.0000e+09, -1.0000e+09],
          [ 6.1088e+00,  1.1922e+01, -1.0000e+09, -1.0000e+09, -1.0000e+09],
          [ 2.9457e+00,  7.5901e+00,  8.0446e+00, -1.0000e+09, -1.0000e+09],
          [ 5.1447e-01,  2.6113e+00,  9.0066e-01,  2.8535e+00, -1.0000e+09],
          [ 2.0193e+00,  5.0083e+00,  1.3279e+00,  2.0226e+00, -1.0000e+09]]]],
       grad_fn=<MaskedFillBackward0>)
tensor([[[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.0030, 0.9970, 0.0000, 0.0000, 0.0000],
          [0.0037, 0.3868, 0.6094, 0.0000, 0.0000],
          [0.0477, 0.3879, 0.0701, 0.4943, 0.0000],
          [0.0447, 0.8881, 0.0224, 0.0449, 0.0000]]]],
       grad_fn=<SoftmaxBackward0>)


## Full Source Code

In [113]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy


src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 10
num_heads = 1
num_layers = 1
d_ff = 4*d_model
 
batch_size = 20
dropout = 0.1
num_epochs = 1

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_product_attention(self, Q, K, V, mask=None):
         
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attention_weights = torch.softmax(attn_scores, dim=-1)
        
        output = torch.matmul(attention_weights, V)
        return output
        
    # def split_heads(self, x):
    #     batch_size, seq_length, d_model = x.size()
    #     return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
    # def combine_heads(self, x):
    #     batch_size, _, seq_length, d_k = x.size()
    #     return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).permute(0, 2, 1, 3)

    def combine_heads(self, x):
        batch_size, num_heads, seq_length, d_k = x.size()
        return x.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
         
        Q = self.split_heads(self.W_q(Q))
         
        K = self.split_heads(self.W_k(K))
         
        V = self.split_heads(self.W_v(V))
         
        
        attn_output = self.scaled_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
        
    def forward(self, x):
    
        return x + self.pe[:, :x.size(1)]

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
         
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x
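
Before moving on, it can help to check what `split_heads` and `combine_heads` do to the tensor shapes. The sketch below repeats the same `view`/`permute` calls on a random tensor outside the class. The sizes are hypothetical and chosen only to make the shapes easy to read (the notebook itself uses `d_model = 10` and `num_heads = 1`). It also checks that `transpose(1, 2)` is an equivalent way to write the same permutation.

```python
import torch

# Hypothetical sizes for illustration only.
batch_size, seq_length, num_heads, d_k = 2, 5, 2, 4
d_model = num_heads * d_k

x = torch.randn(batch_size, seq_length, d_model)

# Split: (batch, seq, d_model) -> (batch, heads, seq, d_k)
split_permute = x.view(batch_size, seq_length, num_heads, d_k).permute(0, 2, 1, 3)
split_transpose = x.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)

# Combine: (batch, heads, seq, d_k) -> (batch, seq, d_model)
combined = split_permute.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_length, d_model)

print(split_permute.shape)                          # torch.Size([2, 2, 5, 4])
print(torch.equal(split_permute, split_transpose))  # True
print(torch.equal(combined, x))                     # True: combine undoes split
```

The `.contiguous()` call is needed because `permute` only changes strides, and `view` requires a contiguous memory layout.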

In [114]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    # def generate_mask(self, src, tgt):
    #     src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
    #     tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        
    #     seq_length = tgt.size(1)
    #     nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        
         
          
    #     tgt_mask = tgt_mask & nopeak_mask
         
    #     return src_mask, tgt_mask

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)  # (batch_size, 1, 1, src_len)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)  # (batch_size, 1, 1, tgt_len)
        
        seq_length = tgt.size(1)
        nopeak_mask = torch.tril(torch.ones(seq_length, seq_length)).bool().to(tgt.device) 
        tgt_mask = tgt_mask & nopeak_mask  # Combine the padding mask with the causal (no-peek) mask: (batch_size, 1, tgt_len, tgt_len)
        
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))
         
        

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)
        
        output = self.fc(dec_output)
        return output
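
The mask shapes in `generate_mask` are easy to get wrong, so here is a small standalone sketch of the same logic. The function name `build_masks` and the `pad_id` argument are my own for this illustration, and the token ids in the example are hypothetical. It shows how the padding mask of shape `(batch, 1, 1, tgt_len)` broadcasts against the `(tgt_len, tgt_len)` causal mask.

```python
import torch

def build_masks(src, tgt, pad_id=0):
    """Standalone copy of the masking logic, for illustration only."""
    src_mask = (src != pad_id).unsqueeze(1).unsqueeze(2)      # (batch, 1, 1, src_len)
    tgt_pad_mask = (tgt != pad_id).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, tgt_len)
    seq_length = tgt.size(1)
    causal = torch.tril(torch.ones(seq_length, seq_length, device=tgt.device)).bool()
    # Broadcasting gives (batch, 1, tgt_len, tgt_len).
    return src_mask, tgt_pad_mask & causal

# Hypothetical batch of one sentence; 0 is the padding id.
src = torch.tensor([[1, 2, 3, 39, 0]])
tgt = torch.tensor([[38, 3, 4, 0]])

src_mask, tgt_mask = build_masks(src, tgt)
print(src_mask.shape)        # torch.Size([1, 1, 1, 5])
print(tgt_mask.shape)        # torch.Size([1, 1, 4, 4])
print(tgt_mask[0, 0].int())
```

In the printed target mask, the last column is zero in every row because the fourth target token is padding, while the upper triangle is zero because of the causal part.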

In [115]:
# # Generate random sample data
# src_data = torch.randint(1, src_vocab_size, (batch_size, max_seq_length))  # (batch_size, seq_length)
# tgt_data = torch.randint(1, tgt_vocab_size, (batch_size, max_seq_length))  # (batch_size, seq_length)
# print(src_data)
 

from torch.utils.data import DataLoader, TensorDataset


# Define extended vocabulary for English (source) and French (target)
source_vocab_extended = {
    "I": 1, "love": 2, "you": 3, "hello": 4, "world": 5, 
    "good": 6, "morning": 7, "night": 8, "cat": 9, "dog": 10,
    "he": 11, "she": 12, "we": 13, "they": 14, "go": 15,
    "eat": 16, "run": 17, "walk": 18, "happy": 19, "sad": 20,
    "big": 21, "small": 22, "fast": 23, "slow": 24, "boy": 25, "girl": 26,
    "play": 27, "read": 28, "write": 29, "car": 30, "house": 31,
    "plays": 32, "reads": 33, "writes": 34, "walks": 35, "runs": 36, 'are':37,
    "<pad>": 0, "<sos>": 38, "<eos>": 39
}

target_vocab_extended = {
    "Je": 1, "t’aime": 2, "bonjour": 3, "monde": 4, "bon": 5, 
    "matin": 6, "nuit": 7, "chat": 8, "chien": 9, "il": 10,
    "elle": 11, "nous": 12, "ils": 13, "va": 14, "mange": 15,
    "court": 16, "marche": 17, "heureux": 18, "triste": 19,
    "grand": 20, "petit": 21, "rapide": 22, "lent": 23, "garçon": 24, "fille": 25,
    "joue": 26, "lit": 27, "écrit": 28, "voiture": 29, "maison": 30,
    "jouent": 31, "lisent": 32, "écrivent": 33, "marchent": 34, "courent": 35,"est":36,'sont': 37,
    "<pad>": 0, "<sos>": 38, "<eos>": 39
}

# Reverse mapping for translation
target_vocab_inv = {v: k for k, v in target_vocab_extended.items()}

# Predefined sentence pairs (note: "is" is not in source_vocab_extended, so tokenize() silently drops it)
sentence_pairs = [
    ("I love you", "Je t’aime"),
    ("hello world", "bonjour monde"),
    ("good morning", "bon matin"),
    ("good night", "bon nuit"),
    ("cat is small", "chat est petit"),
    ("dog is big", "chien est grand"),
    ("he runs fast", "il court rapide"),
    ("she walks slow", "elle marche lent"),
    ("we eat good", "nous mange bon"),
    ("they are happy", "ils sont heureux"),
    ("boy plays", "garçon joue"),
    ("girl reads", "fille lit"),
    ("he writes", "il écrit"),
    ("car is big", "voiture est grand"),
    ("house is small", "maison est petit"),
    ("she is sad", "elle est triste"),
    ("we go", "nous va"),
    ("they run morning", "ils courent matin"),
    ("I read", "Je lisent"),
    ("dog walks fast", "chien marche rapide")
]

# Function to tokenize sentences
def tokenize(sentence, vocab):
    words = sentence.split()  # No lowercasing to preserve vocabulary matching
    return [vocab[word] for word in words if word in vocab] + [vocab["<eos>"]]

# Generate tokenized dataset
data_samples = []
for src, tgt in sentence_pairs:
    src_tokenized = tokenize(src, source_vocab_extended)
    tgt_tokenized = [target_vocab_extended["<sos>"]] + tokenize(tgt, target_vocab_extended)
    data_samples.append((src_tokenized, tgt_tokenized))

# Convert tokenized data into tensor format
max_seq_length = max(max(len(src), len(tgt)) for src, tgt in data_samples)


# Pad sequences to the maximum length
src_data = torch.zeros((len(data_samples), max_seq_length), dtype=torch.long)
tgt_data = torch.zeros((len(data_samples), max_seq_length), dtype=torch.long)

for i, (src_seq, tgt_seq) in enumerate(data_samples):
    src_data[i, :len(src_seq)] = torch.tensor(src_seq, dtype=torch.long)
    tgt_data[i, :len(tgt_seq)] = torch.tensor(tgt_seq, dtype=torch.long)
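
One detail of `tokenize` is worth seeing in isolation: any word missing from the vocabulary is skipped without a warning. The sketch below reproduces that behaviour on a tiny vocabulary and contrasts it with a stricter variant. `tokenize_strict` and `mini_vocab` are hypothetical names introduced only for this example.

```python
def tokenize(sentence, vocab):
    """Same behaviour as the notebook's tokenize: unknown words are skipped."""
    words = sentence.split()
    return [vocab[word] for word in words if word in vocab] + [vocab["<eos>"]]

def tokenize_strict(sentence, vocab):
    """Hypothetical variant that fails loudly on out-of-vocabulary words."""
    words = sentence.split()
    missing = [word for word in words if word not in vocab]
    if missing:
        raise KeyError(f"words not in vocabulary: {missing}")
    return [vocab[word] for word in words] + [vocab["<eos>"]]

# Tiny vocabulary for illustration; "is" is deliberately absent.
mini_vocab = {"<pad>": 0, "cat": 9, "small": 22, "<eos>": 39}

print(tokenize("cat is small", mini_vocab))  # [9, 22, 39]

try:
    tokenize_strict("cat is small", mini_vocab)
except KeyError as err:
    print("strict tokenizer raised:", err)
```

For a toy dataset the lenient version is convenient, but the strict version makes it obvious when a sentence and the vocabulary are out of sync.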


In [116]:
# Create DataLoader
dataset = TensorDataset(src_data, tgt_data)
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Example usage: iterate through the DataLoader and inspect the first batch
for batch_src, batch_tgt in data_loader:
    print("Batch Source:", batch_src)
    print("Batch Target:", batch_tgt)
    print(batch_src.shape)
    break
    
    
     


Batch Source: tensor([[11, 34, 39,  0,  0],
        [30, 21, 39,  0,  0],
        [10, 21, 39,  0,  0],
        [14, 17,  7, 39,  0],
        [ 1,  2,  3, 39,  0],
        [10, 35, 23, 39,  0],
        [14, 37, 19, 39,  0],
        [13, 16,  6, 39,  0],
        [12, 35, 24, 39,  0],
        [11, 36, 23, 39,  0],
        [31, 22, 39,  0,  0],
        [ 4,  5, 39,  0,  0],
        [12, 20, 39,  0,  0],
        [25, 32, 39,  0,  0],
        [26, 33, 39,  0,  0],
        [ 1, 28, 39,  0,  0],
        [13, 15, 39,  0,  0],
        [ 6,  8, 39,  0,  0],
        [ 6,  7, 39,  0,  0],
        [ 9, 22, 39,  0,  0]])
Batch Target: tensor([[38, 10, 28, 39,  0],
        [38, 29, 36, 20, 39],
        [38,  9, 36, 20, 39],
        [38, 13, 35,  6, 39],
        [38,  1,  2, 39,  0],
        [38,  9, 17, 22, 39],
        [38, 13, 37, 18, 39],
        [38, 12, 15,  5, 39],
        [38, 11, 17, 23, 39],
        [38, 10, 16, 22, 39],
        [38, 30, 36, 21, 39],
        [38,  3,  4, 39,  0],
        [38

In [117]:


transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.001, betas=(0.9, 0.98), eps=1e-9)

In [118]:
# Training loop
transformer.train()
for epoch in range(num_epochs):  # Train for num_epochs epochs (set to 1 above)
    for batch_src, batch_tgt in data_loader:
        optimizer.zero_grad()
         
        output = transformer(batch_src, batch_tgt[:, :-1])
        print(output)
        print(batch_tgt[:, 1:])
        loss = criterion(output.contiguous().view(-1, tgt_vocab_size), batch_tgt[:, 1:].contiguous().view(-1))
        # loss = criterion(output.view(-1, tgt_vocab_size), batch_tgt[:, 1:].contiguous().view(-1))
        loss.backward()
        optimizer.step()
    if (epoch+1) % 100 == 0:  # Print the loss every 100 epochs
        print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

tensor([[[-0.8768, -1.1023, -0.0734,  ..., -0.6387, -0.1363,  0.0523],
         [-0.5000, -0.4919, -0.2958,  ..., -0.5241, -0.4329,  0.0198],
         [ 0.0143, -0.2099,  1.1591,  ...,  0.8882, -0.5303, -0.7495],
         [ 0.1193,  0.5671, -0.5095,  ..., -0.1803, -0.9110,  0.1656]],

        [[-1.1052, -1.1976,  0.0801,  ...,  0.5014, -0.6975, -0.5310],
         [-1.2910, -0.8343, -0.2651,  ...,  0.0698, -0.5597, -0.2602],
         [ 0.3867,  0.5624,  0.4221,  ...,  1.0820, -0.3656, -0.8576],
         [ 0.3392,  0.6169,  0.3211,  ...,  0.3042, -0.9341,  0.1040]],

        [[-0.5663, -0.7348, -0.1208,  ..., -0.6185, -0.2072,  0.1170],
         [-0.2922, -0.3419, -0.0221,  ..., -0.5045, -0.3769,  0.0067],
         [-0.2445, -0.7056,  1.2907,  ..., -0.0483, -0.0653, -0.1523],
         [ 0.4961,  0.7027, -0.0429,  ...,  0.2873, -0.7494, -0.1113]],

        ...,

        [[-1.1542, -1.2012, -0.0465,  ...,  0.1649, -0.4956, -0.4125],
         [ 0.9605,  0.5651, -0.0315,  ..., -0.4023, -0.19
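
The training loop above feeds `batch_tgt[:, :-1]` to the decoder and compares the output with `batch_tgt[:, 1:]`. The sketch below shows this one-step shift on a single target sequence and checks that `ignore_index=0` excludes the padded position from the loss. The logits are random and the vocabulary size of 40 is an assumption made for this example only.

```python
import torch
import torch.nn as nn

# One target sequence: <sos>=38, two word ids, <eos>=39, <pad>=0.
tgt = torch.tensor([[38, 3, 4, 39, 0]])

decoder_input = tgt[:, :-1]  # what the decoder sees:    [[38, 3, 4, 39]]
labels = tgt[:, 1:]          # what it should predict:   [[3, 4, 39, 0]]

vocab_size = 40              # assumed size for this toy example
torch.manual_seed(0)
logits = torch.randn(1, decoder_input.size(1), vocab_size)  # stand-in for model output

criterion = nn.CrossEntropyLoss(ignore_index=0)
loss_with_ignore = criterion(logits.view(-1, vocab_size), labels.reshape(-1))

# Same loss computed by hand over the three non-pad positions only.
loss_manual = nn.functional.cross_entropy(logits[0, :3], labels[0, :3])

print(decoder_input.tolist(), labels.tolist())
print(torch.allclose(loss_with_ignore, loss_manual))  # True
```

With the default mean reduction, the loss is averaged over the non-ignored targets, so padding neither contributes to the loss nor dilutes it.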

In [119]:
def decode_tokens(tokens, vocab_inv):
    return " ".join([vocab_inv[token] for token in tokens if token in vocab_inv and token not in [0, 38, 39]])
                    
# Function for model inference
def translate_sentence(model, sentence, src_vocab, tgt_vocab_inv, max_length=max_seq_length):
    model.eval()
    src_tokens = tokenize(sentence, src_vocab)
    src_tensor = torch.tensor(src_tokens, dtype=torch.long).unsqueeze(0)
    
    tgt_tokens = [38]  # Start with <sos>
    with torch.no_grad():  # Inference only, so no gradients are needed
        for _ in range(max_length):
            tgt_tensor = torch.tensor(tgt_tokens, dtype=torch.long).unsqueeze(0)
            output = model(src_tensor, tgt_tensor)
            next_token = output.argmax(-1)[:, -1].item()  # Greedy choice at the last position
            if next_token == 39:  # If <eos> token is generated, stop
                break
            tgt_tokens.append(next_token)
    
    return decode_tokens(tgt_tokens, tgt_vocab_inv)
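
The decoding strategy in `translate_sentence` is greedy: at each step it takes the most likely next token and appends it until `<eos>` appears. Here is a minimal sketch of that loop with the model replaced by a toy stand-in, so that the control flow can be followed without training anything. `greedy_decode` and `toy_step_fn` are hypothetical names, and the "model" simply follows a fixed script.

```python
import torch

SOS, EOS = 38, 39  # same special-token ids as in the vocabularies above

def greedy_decode(step_fn, max_length):
    """Repeatedly pick the arg-max token at the last position until <eos>."""
    tokens = [SOS]
    with torch.no_grad():
        for _ in range(max_length):
            tgt = torch.tensor(tokens, dtype=torch.long).unsqueeze(0)  # (1, len)
            logits = step_fn(tgt)                                      # (1, len, vocab)
            next_token = logits[0, -1].argmax().item()
            if next_token == EOS:
                break
            tokens.append(next_token)
    return tokens[1:]  # drop <sos>

def toy_step_fn(tgt):
    """Hypothetical stand-in for the model: predicts 1, then 2, then <eos>."""
    script = [1, 2, EOS]
    logits = torch.zeros(1, tgt.size(1), 40)
    for pos in range(tgt.size(1)):
        logits[0, pos, script[min(pos, len(script) - 1)]] = 1.0
    return logits

print(greedy_decode(toy_step_fn, max_length=5))  # [1, 2]
```

Greedy decoding is the simplest choice. It never revisits an earlier token, which is fine for a toy example like this one.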

In [73]:
for english_sentence, expected_french in sentence_pairs:
    predicted_translation = translate_sentence(transformer, english_sentence, source_vocab_extended, target_vocab_inv)
    
    print(f"🔹 Input: {english_sentence}")
    print(f"✅ Expected Translation: {expected_french}")
    print(f"🔍 Predicted Translation: {predicted_translation}\n")

encoderlayer_d: torch.Size([1, 4, 128])
encoderlayer_d: torch.Size([1, 4, 128])
encoderlayer_d: torch.Size([1, 4, 128])
🔹 Input: I love you
✅ Expected Translation: Je t’aime
🔍 Predicted Translation: 

encoderlayer_d: torch.Size([1, 3, 128])
🔹 Input: hello world
✅ Expected Translation: bonjour monde
🔍 Predicted Translation: 

encoderlayer_d: torch.Size([1, 3, 128])
encoderlayer_d: torch.Size([1, 3, 128])
encoderlayer_d: torch.Size([1, 3, 128])
🔹 Input: good morning
✅ Expected Translation: bon matin
🔍 Predicted Translation: 

encoderlayer_d: torch.Size([1, 3, 128])
🔹 Input: good night
✅ Expected Translation: bon nuit
🔍 Predicted Translation: 

encoderlayer_d: torch.Size([1, 3, 128])
encoderlayer_d: torch.Size([1, 3, 128])
encoderlayer_d: torch.Size([1, 3, 128])
🔹 Input: cat is small
✅ Expected Translation: chat est petit
🔍 Predicted Translation: 

encoderlayer_d: torch.Size([1, 3, 128])
🔹 Input: dog is big
✅ Expected Translation: chien est grand
🔍 Predicted Translation: 

encoderlayer_d: