<a href="https://colab.research.google.com/github/dhrits/LLM-Engineering-Foundations-to-SLMs/blob/main/02_The_Transformer/Encoder_Decoder_Transformer_from_Scratch_Hardmode_Assignment_Version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoder-Decoder Transformer Model from Scratch in PyTorch - Hardmode

In today's notebook, we'll be focusing on two major components of transformers:

1. The Building Blocks
2. Training the Model

As the bulk of the in-class time will be spent on the building blocks, we'll leave the training logic as an assignment to complete. We'll focus more specifically on training once we start using decoder-only architectures.

# How AIM Does Assignments

Throughout our time together - we'll be providing a number of assignments. Each assignment will be split into two broad categories:

1. Base Assignment - a more conceptual, theory-based assignment focused on locking in specific key concepts and learnings.
2. Hardmode Assignment - a more programming-focused assignment centered on core code concepts used in transformers.

Each assignment will have a few of the following categories of exercises:

1. ❓Questions - these will be questions that you will be expected to gather the answer to!
2. 🏗️ Activities - these will be work or coding activities meant to reinforce specific concepts or theory components.

You are expected to complete all of the activities in your selected notebook!

# The Building Block Fundamentals of Transformer Architecture

We're going to start with an example of an encoder-decoder model - the kind found in the classic paper:

[Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf).

We'll walk through each step in code - leveraging the PyTorch library heavily - in order to get an idea of how these models work.

While this example notebook could be extended to a real use case - we'll be using a toy dataset, and we will not train the model until it converges (we under-train it), as the full training process might take many days!

## The Desired Architecture

![image](https://i.imgur.com/YPjbqW6.png)

We'll skip over the diagram for now, and talk through each component in detail!

In [1]:
import torch
import torch.nn as nn
import math

## Embedding

![image](https://i.imgur.com/sFlEZ2e.png)

The first step will be to convert our tokenized sequence of inputs into embedding vectors. This allows us to capture a rich amount of information about input sequences and their semantic meanings.

As the embedding layer will be trained alongside the rest of the model - it will allow us to have an excellent vector representation of the tokens in our dataset.

Let's see how it looks in code!

#### 🏗️ Activity #1:

Complete the InputEmbedding Module.

In [2]:
class InputEmbeddings(nn.Module):
  def __init__(self, d_model: int, vocab_size: int, verbose=False) -> None:
    """
    vocab_size - the size of our vocabulary
    d_model - the dimension of our embeddings and the input dimension for our model
    """
    super().__init__()
    self.vocab_size = vocab_size
    self.d_model = d_model
    self.embedding = nn.Embedding(self.vocab_size, self.d_model)
    self.verbose = verbose

  def forward(self, x):
    if self.verbose:
      print(f"Embedding Vector (1st 5 elements): {self.embedding(x)[:5] * math.sqrt(self.d_model)}")
    return self.embedding(x) * math.sqrt(self.d_model) # scale embeddings by square root of d_model

### ❓Question 1:

Given:

1. Batch Size = `16`
2. Sequence Length = `350`

What will the output shape of the `InputEmbeddings` layer be?

### Test Embedding Layer

We'll set up a sample Embedding Layer and then test that it does what we'd expect!

In [3]:
def test_input_embeddings_with_example():
    # Create a small embedding layer
    embed = InputEmbeddings(d_model=512, vocab_size=1000)

    # Example sentence tokens (simplified)
    tokens = torch.tensor([[1, 2, 3, 4, 5]])  # "The cat sat down quickly"

    output = embed(tokens)
    print(f"Input shape: {tokens.shape}")
    print(f"Output shape: {output.shape}")
    print("\nExample shows how words are converted to high-dimensional vectors")

    # Run technical test
    assert output.shape == (1, 5, 512), f"Expected shape (1, 5, 512), got {output.shape}"
    print("✓ Input Embeddings Test Passed")

In [4]:
test_input_embeddings_with_example()

Input shape: torch.Size([1, 5])
Output shape: torch.Size([1, 5, 512])

Example shows how words are converted to high-dimensional vectors
✓ Input Embeddings Test Passed
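
As a shape check for larger inputs, here is a minimal sketch that uses a bare `nn.Embedding` instead of the `InputEmbeddings` class so that it runs on its own. The sizes (`d_model=512`, `vocab_size=1000`) are illustrative assumptions; the batch size and sequence length are taken from Question 1.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not fixed by the notebook)
d_model, vocab_size = 512, 1000
batch_size, seq_len = 16, 350

embedding = nn.Embedding(vocab_size, d_model)

# Random token ids standing in for a tokenized batch
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Same scaling as InputEmbeddings.forward: multiply by sqrt(d_model)
scaled = embedding(token_ids) * math.sqrt(d_model)

print(token_ids.shape)  # (batch_size, seq_len)
print(scaled.shape)     # (batch_size, seq_len, d_model)
```

The embedding lookup adds one trailing dimension of size `d_model` to whatever shape the token ids have.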


## Positional Encoding

![image](https://i.imgur.com/IIA3NK3.png)

We need to impart information about where each token is in the sequence, but we aren't using any recurrence or convolutions - the easiest way to encode positional information is to inject positional information into our input embeddings.

We're going to use the process outlined in the paper to do this - which is to use a specific combination of functions to add positional information to the embeddings.

#### 🏗️ Activity #2:

Complete the PositionalEncoding module.

In [5]:
class PositionalEncoding(nn.Module):
  def __init__(self, d_model: int, seq_len: int, dropout: float, verbose=False) -> None:
    super().__init__()
    self.d_model = d_model
    self.seq_len = seq_len
    self.dropout = nn.Dropout(dropout)
    self.verbose = verbose

    ### YOUR CODE HERE
    positional_embeddings = torch.zeros(seq_len, d_model)
    positional_sequence_vector = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    positional_model_vector = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    positional_embeddings[:, 0::2] = torch.sin(positional_sequence_vector * positional_model_vector)
    positional_embeddings[:, 1::2] = torch.cos(positional_sequence_vector * positional_model_vector)
    positional_embeddings = positional_embeddings.unsqueeze(0)

    self.register_buffer('positional_embeddings', positional_embeddings)

  def forward(self, x):
    x = x + self.positional_embeddings[:, :x.size(1), :]
    if self.verbose:
      print(f"Positional Encodings (1st 5 elements): {x}")
    return self.dropout(x)

### ❓Question 2:

Given:

1. Batch Size = `16`
2. Sequence Length = `350`

What will the output shape of the `PositionalEncoding` layer be?

The output has the same shape as the input: `Batch Size x Sequence Length x d_model` = `16 x 350 x d_model`.

### Test Positional Encoding Layer

We'll set up a sample Positional Encoding Layer and then test that it does what we'd expect!

In [6]:
def test_positional_encoding_with_example():
    pos = PositionalEncoding(d_model=512, seq_len=10, dropout=0.1)

    # Create sample embeddings for "The cat sat"
    x = torch.randn(1, 3, 512)

    output = pos(x)
    print("Input tokens position:  [1, 2, 3]")
    print("Added position info to each word's embedding")
    print(f"Output maintains shape: {output.shape}")

    # Verify position information was added
    assert not torch.allclose(output, x), "Position information should modify embeddings"
    print("✓ Positional Encoding Test Passed")

In [7]:
test_positional_encoding_with_example()

Input tokens position:  [1, 2, 3]
Added position info to each word's embedding
Output maintains shape: torch.Size([1, 3, 512])
✓ Positional Encoding Test Passed
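
To make the sinusoidal scheme more concrete, the sketch below rebuilds the same table as `PositionalEncoding.__init__` for a tiny, assumed configuration (`d_model=8`, `seq_len=4`) and inspects a few entries. It is a standalone illustration, not a replacement for the module above.

```python
import math
import torch

d_model, seq_len = 8, 4  # tiny illustrative sizes (assumptions)

# Column vector of positions: shape (seq_len, 1)
position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)

# One frequency per pair of dimensions: shape (d_model / 2,)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine

print(pe[0])  # position 0: sine entries are 0, cosine entries are 1
print(pe[1])  # position 1: the first entry is sin(1)
```

Every entry lies in `[-1, 1]`, and each position gets a distinct pattern because the frequencies differ across dimension pairs.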


## Add & Norm

Next we'll tackle the Add & Norm Block of the diagram.

![image](https://i.imgur.com/otdEq4D.png)

### Layer Normalization

The first step is to add layer normalization. You can read more about it [here](https://paperswithcode.com/method/layer-normalization)!

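The basic idea is that it rescales each token's features to zero mean and unit variance (followed by a learned scale and shift), which makes training the model a bit easier and allows the model to generalize a bit better.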

#### 🏗️ Activity #3:

Complete the LayerNormalization Module.

In [8]:
class LayerNormalization(nn.Module):
  def __init__(self, features: int, epsilon:float=10**-6) -> None:
    super().__init__()
    self.epsilon = epsilon
    self.gamma = nn.Parameter(torch.ones(features))
    self.beta = nn.Parameter(torch.zeros(features))

  def forward(self, x):
    mean = x.mean(dim = -1, keepdim = True)
    standard_deviation = x.std(dim = -1, keepdim = True)
    return self.gamma * (x - mean) / (standard_deviation + self.epsilon) + self.beta

### ❓Question 3:

What is the purpose of `epsilon` in the above code?

> HINT: Pay special attention to the math in the `return` statement.

`epsilon` helps prevent divide-by-zero errors when the `standard_deviation` is zero or very close to it. This can happen when all the features of a token in `x` are (nearly) identical.
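
A minimal sketch of that failure case, using a hand-picked input whose features are all identical so the standard deviation is exactly zero. The value `1e-6` mirrors the default `epsilon` in the class above.

```python
import torch

# A token whose features are all the same: the standard deviation is 0
x = torch.tensor([[2.0, 2.0, 2.0]])

mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)

# Without epsilon this is 0 / 0, which produces NaN
without_eps = (x - mean) / std

# With epsilon the denominator is never exactly zero
with_eps = (x - mean) / (std + 1e-6)

print(without_eps)
print(with_eps)
```

A single NaN like this would propagate through every later layer and into the loss, which is why the small constant is added to the denominator.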

### Test Layer Normalization

We'll set up a sample Layer Normalization and then test that it does what we'd expect!

In [9]:
def test_layer_normalization_with_example():
    layer_norm = LayerNormalization(features=3)  # Smaller feature size for example

    # Simulate word embeddings with different magnitudes
    word_embeddings = torch.tensor([
        [2.5, 4.1, -3.2],  # "The" (high magnitude)
        [0.1, 0.2, -0.1],  # "cat" (low magnitude)
        [8.2, -6.1, 5.5]   # "sat" (very high magnitude)
    ]).unsqueeze(0)

    normalized = layer_norm(word_embeddings)

    print("Before normalization (magnitudes vary greatly):")
    print(word_embeddings[0])
    print("\nAfter normalization (values scaled to similar ranges):")
    print(normalized[0])

    # Verify statistical properties
    mean = normalized.mean(dim=-1)
    var = normalized.var(dim=-1)
    assert torch.allclose(mean, torch.zeros_like(mean), atol=1e-5)
    assert torch.allclose(var, torch.ones_like(var), atol=1e-5)
    print("✓ Layer Normalization Test Passed")

In [10]:
test_layer_normalization_with_example()

Before normalization (magnitudes vary greatly):
tensor([[ 2.5000,  4.1000, -3.2000],
        [ 0.1000,  0.2000, -0.1000],
        [ 8.2000, -6.1000,  5.5000]])

After normalization (values scaled to similar ranges):
tensor([[ 0.3562,  0.7732, -1.1293],
        [ 0.2182,  0.8729, -1.0911],
        [ 0.7459, -1.1363,  0.3905]], grad_fn=<SelectBackward0>)
✓ Layer Normalization Test Passed


### Residual Connection

Another technique that makes model training easier is the residual connection: we add each sublayer's input back to its output (for example, around the Attention Block), which helps to prevent vanishing gradients.
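
The gradient argument can be sketched with a deliberately extreme case: a hypothetical `dead_sublayer` that multiplies its input by zero. With the skip path the input still receives a gradient; without it the gradient vanishes entirely. This toy example leaves out layer normalization and dropout to isolate the effect of the addition.

```python
import torch

def dead_sublayer(t):
    # Hypothetical sublayer that contributes nothing (and passes no gradient)
    return t * 0.0

# With a residual connection: output = x + sublayer(x)
x_skip = torch.ones(2, 3, requires_grad=True)
(x_skip + dead_sublayer(x_skip)).sum().backward()

# Without a residual connection: output = sublayer(x)
x_plain = torch.ones(2, 3, requires_grad=True)
dead_sublayer(x_plain).sum().backward()

print(x_skip.grad)   # gradient flows through the identity path
print(x_plain.grad)  # gradient is zero everywhere
```

Note that the `ResidualConnection` module in this notebook normalizes the input before the sublayer (`x + sublayer(norm(x))`), whereas the original paper normalizes after the addition (`norm(x + sublayer(x))`). Both variants keep the identity path shown here.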

#### 🏗️ Activity #4:

Complete the ResidualConnection Module.

In [11]:
class ResidualConnection(nn.Module):
  def __init__(self, features: int, dropout: float = 0.1) -> None:
    super().__init__()
    self.dropout = nn.Dropout(dropout)
    self.layernorm = LayerNormalization(features)

  def forward(self, x, sublayer):
    return x + self.dropout(sublayer(self.layernorm(x)))

### Testing Residual Connection

We'll set up a sample Residual Connection and then test that it does what we'd expect!

In [12]:
def test_residual_connection_with_example():
    residual = ResidualConnection(features=3, dropout=0.1)

    # Original input "The cat"
    x = torch.tensor([
        [1.0, 1.0, 1.0],
        [2.0, 2.0, 2.0]
    ]).unsqueeze(0)

    # Sublayer that makes meaningful changes
    def sublayer(x):
        return torch.nn.functional.relu(x + 0.5) # Non-linear transformation

    output = residual(x, sublayer)

    print("Original input:")
    print(x[0])
    print("\nAfter residual connection (combines original + transformed):")
    print(output[0])

    # Verify output changed but maintained shape
    assert output.shape == x.shape
    assert torch.any(torch.abs(output - x) > 1e-6), "Output should differ from input"
    print("✓ Residual Connection Test Passed")

In [13]:
test_residual_connection_with_example()

Original input:
tensor([[1., 1., 1.],
        [2., 2., 2.]])

After residual connection (combines original + transformed):
tensor([[1.5556, 1.0000, 1.5556],
        [2.0000, 2.5556, 2.5556]], grad_fn=<SelectBackward0>)
✓ Residual Connection Test Passed


## Feed Forward Network

![image](https://i.imgur.com/woEqBjQ.png)

Moving onto the next component, we have our feed forward network.

The feed forward network serves two purposes in our model:

1. It reforms the attention outputs into a format that works with the next block.

2. It helps add complexity to prevent each attention block acting in a similar fashion.
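
One property worth seeing in code is that the feed forward network is position-wise: the same two linear layers are applied to every token independently. The sketch below uses a plain `nn.Sequential` with assumed toy sizes (`d_model=4`, `d_ff=16`) and no dropout, and checks that processing the whole sequence matches processing each position on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff = 4, 16  # toy sizes (assumptions)

# Expand to d_ff, apply ReLU, project back to d_model
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(1, 3, d_model)  # (batch, seq_len, d_model)

# Whole sequence at once
full = ffn(x)

# One position at a time, then stacked back together
per_position = torch.stack([ffn(x[0, i]) for i in range(x.size(1))]).unsqueeze(0)

print(torch.allclose(full, per_position, atol=1e-6))
```

Mixing information between tokens is left entirely to the attention layers; the feed forward network only transforms each token's vector.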

#### 🏗️ Activity #5:

Complete the FeedForwardBlock Module

In [14]:
class FeedForwardBlock(nn.Module):
  def __init__(self, d_model: int, d_ff: int = 2048, dropout: float = 0.1) -> None:
    """
    d_model - dimension of model
    d_ff - dimension of feed forward network
    dropout - regularization measure
    """
    super().__init__()
    self.linear_1 = nn.Linear(d_model, d_ff)
    self.dropout = nn.Dropout(dropout)
    self.linear_2 = nn.Linear(d_ff, d_model)

  def forward(self, x):
    return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))

### Testing the Feed-forward Block

Let's test!

In [15]:
def test_feed_forward_block_with_example():
   ff_block = FeedForwardBlock(d_model=3, d_ff=8)  # Small dimensions for demonstration

   # Input: Word embeddings for "The cat"
   x = torch.tensor([
       [1.0, 0.5, 0.2],  # "The"
       [2.0, -0.3, 1.1]  # "cat"
   ]).unsqueeze(0)

   output = ff_block(x)

   print("Input embeddings:")
   print(x[0])
   print("\nAfter feed-forward transformation:")
   print(output[0])

   # First linear layer expands to d_ff dimensions
   # ReLU keeps only positive values
   # Second linear layer projects back to d_model dimensions
   assert output.shape == x.shape
   assert not torch.allclose(output, x)
   print("✓ Feed Forward Block Test Passed")

In [16]:
test_feed_forward_block_with_example()

Input embeddings:
tensor([[ 1.0000,  0.5000,  0.2000],
        [ 2.0000, -0.3000,  1.1000]])

After feed-forward transformation:
tensor([[ 0.0795,  0.2701, -0.0664],
        [ 0.1536,  0.1792, -0.0532]], grad_fn=<SelectBackward0>)
✓ Feed Forward Block Test Passed


## Multi-Head Attention

![image](https://i.imgur.com/4qOT46y.png)

Next up is the heart and soul of the Transformer - Multi-Head Attention.

We'll break it down into the basic building blocks in code in the following section!

### Multi-Head Attention Class



In [17]:
class MultiHeadAttention(nn.Module):
  def __init__(self, d_model: int = 512, num_heads: int = 8, dropout: float = 0.1) -> None:
    super().__init__()
    self.d_model = d_model
    self.num_heads = num_heads
    assert d_model % num_heads == 0, "d_model is not divisible by num_heads"

    self.d_k = d_model // num_heads

    self.w_q = nn.Linear(d_model, d_model, bias=False)
    self.w_k = nn.Linear(d_model, d_model, bias=False)
    self.w_v = nn.Linear(d_model, d_model, bias=False)

    self.w_o = nn.Linear(d_model, d_model, bias=False)

    self.dropout = nn.Dropout(dropout)

### ❓Question 4:

What do: Q, K, V, and O stand for in the above code?

* Q, or `query`: a transformation of the current token, used to look for relevant information in the rest of the sequence.
* K, or `key`: a transformation of every token in the sequence that acts like an address; it is matched against Q to measure similarity.
* V, or `value`: a transformation of every token in the sequence that holds the actual information retrieved for the query Q.
* O, or `output`: a final linear transformation applied after the attention heads are concatenated.

### Testing Multi-Head Attention

Let's test it out!

In [18]:
def test_multi_head_attention_with_example():
   mha = MultiHeadAttention(d_model=6, num_heads=2)  # Small dimensions for clarity

   # Input sequence: ["The", "cat", "sat"]
   query = key = value = torch.tensor([
       [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # "The"
       [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],  # "cat"
       [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]   # "sat"
   ]).unsqueeze(0)

   # Allow all words to attend to each other
   mask = torch.ones(1, 1, 3, 3)

   output = mha(query, key, value, mask)

   print("Input embeddings (each row is a word):")
   print(query[0])
   print("\nAttention output (words now contain mixed information from relevant words):")
   print(output[0])

   # Each head processes sequence differently, then results are combined
   assert output.shape == query.shape
   assert not torch.allclose(output, query)
   print("✓ Multi-Head Attention Test Passed")

In [19]:
test_multi_head_attention_with_example()

NotImplementedError: Module [MultiHeadAttention] is missing the required "forward" function

### Scaled Dot-Product Attention

![image](https://i.imgur.com/Yp48DuB.png)

In [20]:
def attention(query, key, value, mask, d_k, dropout: nn.Dropout = None):
  attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)

  if mask is not None:
    attention_scores.masked_fill_(mask == 0, -1e9)

  attention_scores = attention_scores.softmax(dim=-1)

  if dropout is not None:
    attention_scores = dropout(attention_scores)

  return (attention_scores @ value), attention_scores
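
To see what the mask and the softmax do to the scores, here is a small standalone sketch. It follows the same steps as the `attention` function above but uses the out-of-place `masked_fill` and omits dropout; the function name `scaled_dot_product_attention` and the padding-style mask are illustrative choices, not part of the notebook.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    d_k = query.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a very negative score
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = scores.softmax(dim=-1)
    return weights @ value, weights

torch.manual_seed(0)
q = k = v = torch.randn(1, 3, 4)  # (batch, seq_len, d_k)

# Pretend the last token is padding: no query may attend to it
padding_mask = torch.tensor([[[1, 1, 0]]])  # broadcasts over the query dimension

out, weights = scaled_dot_product_attention(q, k, v, padding_mask)
print(weights[0])
```

Each row of `weights` sums to 1, and the column for the masked token is zero, so that token contributes nothing to any output.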

### Forward Method

This code is required to do a forward pass with our model.

In [21]:
def forward(self, query, key, value, mask):
  query = self.w_q(query)
  key = self.w_k(key)
  value = self.w_v(value)

  query = query.view(query.shape[0], query.shape[1], self.num_heads, self.d_k).transpose(1, 2)
  key = key.view(key.shape[0], key.shape[1], self.num_heads, self.d_k).transpose(1, 2)
  value = value.view(value.shape[0], value.shape[1], self.num_heads, self.d_k).transpose(1, 2)

  x, self.attention_scores = MultiHeadAttention.attention(query, key, value, mask, self.dropout)

  x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.num_heads * self.d_k)

  return self.w_o(x)
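
The `view` and `transpose` calls are the least obvious part of the forward pass, so here is a shape-only sketch of how one tensor is split into heads and merged back. The sizes are assumptions chosen to keep the tensors small.

```python
import torch

batch, seq_len, d_model, num_heads = 2, 5, 8, 2  # toy sizes (assumptions)
d_k = d_model // num_heads

x = torch.arange(batch * seq_len * d_model, dtype=torch.float).view(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)

# Merge: (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)

print(heads.shape)
print(torch.equal(merged, x))
```

Each head sees a contiguous slice of `d_k` features per token: head 1 of the first token is `x[0, 0, 4:8]`. The `contiguous()` call is needed because `transpose` returns a non-contiguous view that `view` cannot reshape directly.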

### Combining it All Together

In [22]:
class MultiHeadAttention(nn.Module):
  def __init__(self, d_model: int = 512, num_heads: int = 8, dropout: float = 0.1) -> None:
    super().__init__()
    self.d_model = d_model
    self.num_heads = num_heads
    assert d_model % num_heads == 0, "d_model is not divisible by num_heads"

    self.d_k = d_model // num_heads

    self.w_q = nn.Linear(d_model, d_model, bias=False)
    self.w_k = nn.Linear(d_model, d_model, bias=False)
    self.w_v = nn.Linear(d_model, d_model, bias=False)

    self.w_o = nn.Linear(d_model, d_model, bias=False)

    self.dropout = nn.Dropout(dropout)

  @staticmethod
  def attention(query, key, value, mask, dropout: nn.Dropout = None):
    d_k = query.shape[-1]

    attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
      attention_scores.masked_fill_(mask == 0, -1e9)

    attention_scores = attention_scores.softmax(dim=-1)

    if dropout is not None:
      attention_scores = dropout(attention_scores)

    return (attention_scores @ value), attention_scores

  def forward(self, query, key, value, mask):
    query = self.w_q(query)
    key = self.w_k(key)
    value = self.w_v(value)

    query = query.view(query.shape[0], query.shape[1], self.num_heads, self.d_k).transpose(1, 2)
    key = key.view(key.shape[0], key.shape[1], self.num_heads, self.d_k).transpose(1, 2)
    value = value.view(value.shape[0], value.shape[1], self.num_heads, self.d_k).transpose(1, 2)

    x, self.attention_scores = MultiHeadAttention.attention(query, key, value, mask, self.dropout)

    x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.num_heads * self.d_k)

    return self.w_o(x)

### Testing MultiHeadAttention

Let's test it out!

In [23]:
def test_attention_mechanism():
   mha = MultiHeadAttention(d_model=6, num_heads=2)

   # Simple sequence: "The cat sleeps"
   seq = torch.tensor([
       [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
   ]).unsqueeze(0)  # [1, 3, 6]

   # Mask must be broadcastable to the attention scores [batch, heads, seq_len, seq_len]
   mask = torch.ones(1, 2, 3, 3)  # 2 heads, sequence length 3; all ones allows every connection

   print("Input sequence shape:", seq.shape)
   print("Input values (each row is a word):")
   print(seq[0])

   output = mha(seq, seq, seq, mask)
   print("\nOutput after attention:")
   print(output[0])

   # Verify output maintains shape but changes values
   assert output.shape == seq.shape
   assert not torch.allclose(output, seq)
   print("✓ Multi-Head Attention Test Passed")

In [24]:
test_attention_mechanism()

Input sequence shape: torch.Size([1, 3, 6])
Input values (each row is a word):
tensor([[1., 1., 0., 0., 0., 0.],
        [0., 0., 1., 1., 0., 0.],
        [0., 0., 0., 0., 1., 1.]])

Output after attention:
tensor([[-0.0057,  0.1820, -0.1763, -0.0338,  0.1358, -0.0313],
        [-0.0291,  0.1975, -0.0898, -0.0396,  0.0641, -0.0099],
        [-0.0279,  0.2055, -0.1045, -0.0411,  0.0785, -0.0070]],
       grad_fn=<SelectBackward0>)
✓ Multi-Head Attention Test Passed


## Encoder

When we pass information through our model - the first thing we will do is Encode it by passing it through our Encoder Blocks.


### Encoder Block

![image](https://i.imgur.com/nwNYZAT.png)

The encoder takes in the source language sentence (e.g. English). Each word is converted into a vector representation using an embedding layer. Then a positional encoder adds information about the position of each word. This goes through multiple self-attention layers, where each word vector attends to all other word vectors to build contextual representations.

#### 🏗️ Activity #6:

Complete the EncoderBlock Module.

In [25]:
class EncoderBlock(nn.Module):
  def __init__(self, features: int, self_attention_block: MultiHeadAttention, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
    super().__init__()
    self.self_attention_block = self_attention_block
    self.feed_forward_block = feed_forward_block
    self.residual_connections = nn.ModuleList([ResidualConnection(features, dropout) for _ in range(2)])

  def forward(self, x, input_mask):
    x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, input_mask))
    x = self.residual_connections[1](x, self.feed_forward_block)
    return x

### Testing the EncoderBlock

Testing time!

In [26]:
def test_encoder_block():
   # Create encoder block with small dimensions
   mha = MultiHeadAttention(d_model=6, num_heads=2)
   ff = FeedForwardBlock(d_model=6, d_ff=12)
   encoder = EncoderBlock(features=6, self_attention_block=mha, feed_forward_block=ff, dropout=0.1)

   # Input: "The cat sleeps"
   x = torch.tensor([
       [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # "The"
       [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],  # "cat"
       [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]   # "sleeps"
   ]).unsqueeze(0)

   # Attention mask
   mask = torch.ones(1, 2, 3, 3)  # Allow all connections

   output = encoder(x, mask)

   print("Input sequence:")
   print(x[0])
   print("\nAfter encoder processing (self-attention + feed-forward):")
   print(output[0])

   assert output.shape == x.shape
   assert not torch.allclose(output, x)
   print("✓ Encoder Block Test Passed")

In [27]:
test_encoder_block()

Input sequence:
tensor([[1., 1., 0., 0., 0., 0.],
        [0., 0., 1., 1., 0., 0.],
        [0., 0., 0., 0., 1., 1.]])

After encoder processing (self-attention + feed-forward):
tensor([[ 1.5114,  1.0243, -0.4289, -0.0856, -0.6263,  0.5252],
        [ 0.2721,  0.1055,  1.1354,  1.0113, -0.0537,  0.1418],
        [ 0.1484, -0.0359,  0.1036, -0.1714,  0.1753,  1.1145]],
       grad_fn=<SelectBackward0>)
✓ Encoder Block Test Passed


### Encoder Stack

Following along from the original paper - we will organize these blocks into a set of 6.

These 6 Encoder Blocks (each with 8 Attention Heads) will comprise our Encoding Stack.

In [28]:
class EncoderStack(nn.Module):
  def __init__(self, features: int, layers: nn.ModuleList) -> None:
    super().__init__()
    self.layers = layers
    self.norm = LayerNormalization(features)

  def forward(self, x, mask):
    for layer in self.layers:
      x = layer(x, mask)
    return self.norm(x)

## Decoder

Next, we will take the encoded sequence and decode it through our Decoder Blocks.

### Decoder Block

![image](https://i.imgur.com/HtAAXZc.png)

The decoder takes in the target language sentence (e.g. French). It also converts words to vectors and adds positional info. Then it goes through self-attention layers. Here, a mask is applied so each word can only see itself and the words before it, not those after.

The decoder also does attention over the encoder output. This allows each French word to find relevant connections with the English words.
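
The "only see earlier words" rule is usually implemented with a lower-triangular mask. The sketch below builds one with `torch.tril` for an assumed sequence length of 4 and applies it to uniform scores, so the effect on the attention weights is easy to read.

```python
import torch

seq_len = 4  # illustrative length (assumption)

# 1 where attention is allowed, 0 where the key lies in the future
causal_mask = torch.tril(torch.ones(1, 1, seq_len, seq_len))

# Uniform scores, so the weights depend only on the mask
scores = torch.zeros(1, 1, seq_len, seq_len)
weights = scores.masked_fill(causal_mask == 0, -1e9).softmax(dim=-1)

print(causal_mask[0, 0])
print(weights[0, 0])
```

Row `i` of `weights` spreads its attention evenly over positions `0..i` and assigns zero weight to every later position. The cross-attention mask has no such triangle, because the decoder may look at the entire source sentence.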

#### 🏗️ Activity #7:

Complete the DecoderBlock Module.

In [29]:
class DecoderBlock(nn.Module):
  def __init__(self, features: int, self_attention_block: MultiHeadAttention, cross_attention_block: MultiHeadAttention, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
    super().__init__()
    self.self_attention_block = self_attention_block
    self.cross_attention_block = cross_attention_block
    self.feed_forward_block = feed_forward_block
    self.residual_connections = nn.ModuleList([ResidualConnection(features, dropout) for _ in range(3)])

  def forward(self, x, encoder_output, input_mask, target_mask):
    x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, target_mask))
    x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, input_mask))
    x = self.residual_connections[2](x, self.feed_forward_block)
    return x

### Testing DecoderBlock

You know what's up next...testing!

In [30]:
def test_decoder_block():
   # Initialize components with small dimensions
   self_attn = MultiHeadAttention(d_model=6, num_heads=2)
   cross_attn = MultiHeadAttention(d_model=6, num_heads=2)
   ff = FeedForwardBlock(d_model=6, d_ff=12)
   decoder = DecoderBlock(features=6, self_attention_block=self_attn,
                         cross_attention_block=cross_attn,
                         feed_forward_block=ff, dropout=0.1)

   # Input: "El gato" (target sequence)
   x = torch.tensor([
       [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # "El"
       [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],  # "gato"
   ]).unsqueeze(0)

   # Encoder output: "The cat" (source sequence)
   encoder_output = torch.tensor([
       [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # "The"
       [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],  # "cat"
   ]).unsqueeze(0)

   # Masks
   src_mask = torch.ones(1, 2, 2, 2)  # Can attend to all encoder outputs
   tgt_mask = torch.tril(torch.ones(1, 2, 2, 2))  # Can only attend to previous words

   output = decoder(x, encoder_output, src_mask, tgt_mask)

   print("Input target sequence:")
   print(x[0])
   print("\nSource (encoder) sequence:")
   print(encoder_output[0])
   print("\nDecoder output (after self-attention, cross-attention, and feed-forward):")
   print(output[0])

   assert output.shape == x.shape
   assert not torch.allclose(output, x)
   print("✓ Decoder Block Test Passed")

In [31]:
test_decoder_block()

Input target sequence:
tensor([[1., 1., 0., 0., 0., 0.],
        [0., 0., 1., 1., 0., 0.]])

Source (encoder) sequence:
tensor([[1., 1., 0., 0., 0., 0.],
        [0., 0., 1., 1., 0., 0.]])

Decoder output (after self-attention, cross-attention, and feed-forward):
tensor([[ 0.7908,  1.3278,  0.0080,  0.5537,  0.6892,  0.3012],
        [-0.0924,  0.4200,  0.9624,  1.5692,  0.3715,  0.2245]],
       grad_fn=<SelectBackward0>)
✓ Decoder Block Test Passed


### Decoder Stack

We'll use the same number of Decoder Blocks as we did Encoder Blocks - leaving us with 6 Decoder Blocks in our Decoder Stack.

In [32]:
class DecoderStack(nn.Module):
  def __init__(self, features: int, layers: nn.ModuleList) -> None:
    super().__init__()
    self.layers = layers
    self.norm = LayerNormalization(features)

  def forward(self, x, encoder_output, input_mask, target_mask):
    for layer in self.layers:
      x = layer(x, encoder_output, input_mask, target_mask)
    return self.norm(x)

## Linear Projection Layer

After the decoder's self-attention and encoder-decoder attention layers, we have a context vector for each position in the target sequence. This context vector has `d_model` dimensions (e.g. 512 or 1024).

We want to turn this context vector into a probability distribution over the target vocabulary so we can pick the next translated word.

The linear projection layer helps with this. It projects the context vector into a much larger vector of logits - one entry per word in the target vocabulary. Applying a softmax to these logits gives the vocabulary distribution.

For example, if our target vocabulary has 50,000 words, the projected vector will have 50,000 dimensions. After the softmax, each dimension is the probability of that word being the next word in the translation.
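
A minimal sketch of that last step, with a bare `nn.Linear` standing in for the projection layer and assumed toy sizes (`d_model=8`, a 20-word vocabulary). Greedy `argmax` decoding is used here only for illustration; it is one of several ways to pick the next token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, vocab_size = 8, 20  # toy sizes (assumptions)
proj = nn.Linear(d_model, vocab_size)

# Stand-in for the decoder output: (batch, seq_len, d_model)
decoder_output = torch.randn(1, 3, d_model)

logits = proj(decoder_output)   # (batch, seq_len, vocab_size), unnormalized scores
probs = logits.softmax(dim=-1)  # each position now sums to 1

# Greedy choice for the next token, read from the last position
next_token = probs[:, -1].argmax(dim=-1)

print(logits.shape)
print(next_token)
```

During training the softmax is typically folded into the loss function, so the model itself only needs to return the logits.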

In [33]:
class LinearProjectionLayer(nn.Module):
  def __init__(self, d_model, vocab_size) -> None:
    super().__init__()
    self.proj = nn.Linear(d_model, vocab_size)

  def forward(self, x) -> torch.Tensor:
    return self.proj(x)

## The Transformer

At this point, all we need to do is create a class that represents our model!

In [34]:
class Transformer(nn.Module):
  def __init__(self, encoder: EncoderStack, decoder: DecoderStack, src_embed: InputEmbeddings, tgt_embed: InputEmbeddings, src_pos: PositionalEncoding, tgt_pos: PositionalEncoding, projection_layer: LinearProjectionLayer) -> None:
    super().__init__()
    self.encoder = encoder
    self.decoder = decoder
    self.src_embed = src_embed
    self.tgt_embed = tgt_embed
    self.src_pos = src_pos
    self.tgt_pos = tgt_pos
    self.projection_layer = projection_layer

  def encode(self, src, src_mask):
    src = self.src_embed(src)
    src = self.src_pos(src)
    return self.encoder(src, src_mask)

  def decode(self, encoder_output: torch.Tensor, src_mask: torch.Tensor, tgt: torch.Tensor, tgt_mask: torch.Tensor):
    tgt = self.tgt_embed(tgt)
    tgt = self.tgt_pos(tgt)
    return self.decoder(tgt, encoder_output, src_mask, tgt_mask)

  def project(self, x):
    return self.projection_layer(x)

## Building Our Transformer

Now that we have each of our components - we need to construct an actual model!

We'll use this helper function to aid in our goal and set up our Encoder/Decoder Stacks!

In [35]:
def build_transformer(input_vocab_size: int, target_vocab_size: int, input_seq_len: int, target_seq_len: int, d_model: int=512, N: int=6, num_heads: int=8, dropout: float=0.1, d_ff: int=2048, verbose=True) -> Transformer:
  input_embeddings = InputEmbeddings(d_model, input_vocab_size, verbose=verbose)
  target_embeddings = InputEmbeddings(d_model, target_vocab_size)

  input_position = PositionalEncoding(d_model, input_seq_len, dropout, verbose=verbose)
  target_position = PositionalEncoding(d_model, target_seq_len, dropout)

  encoder_blocks = []

  for _ in range(N):
    encoder_self_attention_block = MultiHeadAttention(d_model, num_heads, dropout)
    feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
    encoder_block = EncoderBlock(d_model, encoder_self_attention_block, feed_forward_block, dropout)
    encoder_blocks.append(encoder_block)

  decoder_blocks = []

  for _ in range(N):
    decoder_self_attention_block = MultiHeadAttention(d_model, num_heads, dropout)
    decoder_cross_attention_block = MultiHeadAttention(d_model, num_heads, dropout)
    feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
    decoder_block = DecoderBlock(d_model, decoder_self_attention_block, decoder_cross_attention_block, feed_forward_block, dropout)
    decoder_blocks.append(decoder_block)

  encoder_stack = EncoderStack(d_model, nn.ModuleList(encoder_blocks))
  decoder_stack = DecoderStack(d_model, nn.ModuleList(decoder_blocks))

  linear_projection_layer = LinearProjectionLayer(d_model, target_vocab_size)

  transformer = Transformer(encoder_stack, decoder_stack, input_embeddings, target_embeddings, input_position, target_position, linear_projection_layer)

  for p in transformer.parameters():
    if p.dim() > 1:
      nn.init.xavier_uniform_(p)

  return transformer
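
The initialization loop at the end of `build_transformer` relies on `p.dim() > 1` to select weight matrices and skip one-dimensional parameters such as biases. Here is a small sketch of the same loop on a single `nn.Linear` layer; the layer sizes are arbitrary assumptions.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(4, 8)  # weight has shape (8, 4), bias has shape (8,)
bias_before = layer.bias.detach().clone()

# Same loop as in build_transformer, applied to one layer
for p in layer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Xavier/Glorot uniform samples from [-bound, bound]
bound = math.sqrt(6.0 / (4 + 8))

print(layer.weight.abs().max().item() <= bound)
print(torch.equal(layer.bias.detach(), bias_before))
```

In the full model the same filter also leaves the `gamma` and `beta` vectors of `LayerNormalization` at their initial ones and zeros.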

# Training Our Transformer!

We will be using the resources created in [this](https://github.com/hkproj/pytorch-transformer/tree/main) repository to train our model on an English -> French translation task.



## Dataset Creation

`BilingualDataset` is a custom PyTorch dataset for translation data. It takes a tokenizer for each language, a dataset of sentence pairs, the source and target language codes, and the maximum sequence length.

The class tokenizes each sentence, adds the special start (`[SOS]`) and end (`[EOS]`) tokens, and pads with `[PAD]` so that every input and output has exactly `seq_len` tokens. Sentences that do not fit raise a `ValueError`.

When you grab a sample from the dataset, it returns the tensors the model needs: the encoder input, the decoder input, and the target labels. The decoder input starts with `[SOS]` and the label ends with `[EOS]`, so the label at each position is the token that follows the decoder input at that position. It also builds masks that separate real tokens from padding and that stop the decoder from attending to future tokens.

This layout matches the sequential nature of language: the model predicts each next token from the tokens that came before it, never from those that come after.
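To make the layout concrete, here is a minimal sketch in plain Python lists rather than tensors. The token ids and the `SOS`/`EOS`/`PAD` values are made up for illustration, and `layout_pair` is a hypothetical helper, not part of the notebook's code.

```python
# Hypothetical special-token ids, chosen only for this illustration
SOS, EOS, PAD = 2, 3, 1

def layout_pair(src_ids, tgt_ids, seq_len):
    """Return (encoder_input, decoder_input, label), each of length seq_len."""
    enc_pad = seq_len - len(src_ids) - 2  # encoder input carries both [SOS] and [EOS]
    dec_pad = seq_len - len(tgt_ids) - 1  # decoder input carries [SOS]; label carries [EOS]
    if enc_pad < 0 or dec_pad < 0:
        raise ValueError("Sentence is too long")
    encoder_input = [SOS] + src_ids + [EOS] + [PAD] * enc_pad
    decoder_input = [SOS] + tgt_ids + [PAD] * dec_pad
    label = tgt_ids + [EOS] + [PAD] * dec_pad
    return encoder_input, decoder_input, label

enc, dec, lab = layout_pair([10, 11, 12], [20, 21], seq_len=6)
print(enc)  # [2, 10, 11, 12, 3, 1]
print(dec)  # [2, 20, 21, 1, 1, 1]
print(lab)  # [20, 21, 3, 1, 1, 1]
```

Reading `dec` and `lab` side by side shows the one-position shift: at each step the decoder sees the tokens up to position `i` and is asked to predict the token at position `i` of the label.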

In [36]:
from torch.utils.data import Dataset

class BilingualDataset(Dataset):
  def __init__(self, ds, tokenizer_src, tokenizer_tgt, src_lang, tgt_lang, seq_len):
    super().__init__()
    self.seq_len = seq_len

    self.ds = ds
    self.tokenizer_src = tokenizer_src
    self.tokenizer_tgt = tokenizer_tgt
    self.src_lang = src_lang
    self.tgt_lang = tgt_lang

    self.sos_token = torch.tensor([tokenizer_tgt.token_to_id("[SOS]")], dtype=torch.int64)
    self.eos_token = torch.tensor([tokenizer_tgt.token_to_id("[EOS]")], dtype=torch.int64)
    self.pad_token = torch.tensor([tokenizer_tgt.token_to_id("[PAD]")], dtype=torch.int64)

  def __len__(self):
    return len(self.ds)

  def __getitem__(self, idx):
    src_target_pair = self.ds[idx]
    src_text = src_target_pair['translation'][self.src_lang]
    tgt_text = src_target_pair['translation'][self.tgt_lang]

    enc_input_tokens = self.tokenizer_src.encode(src_text).ids
    dec_input_tokens = self.tokenizer_tgt.encode(tgt_text).ids

    enc_num_padding_tokens = self.seq_len - len(enc_input_tokens) - 2
    dec_num_padding_tokens = self.seq_len - len(dec_input_tokens) - 1

    if enc_num_padding_tokens < 0 or dec_num_padding_tokens < 0:
        raise ValueError("Sentence is too long")

    encoder_input = torch.cat(
        [
            self.sos_token,
            torch.tensor(enc_input_tokens, dtype=torch.int64),
            self.eos_token,
            torch.tensor([self.pad_token] * enc_num_padding_tokens, dtype=torch.int64),
        ],
        dim=0,
    )

    decoder_input = torch.cat(
        [
            self.sos_token,
            torch.tensor(dec_input_tokens, dtype=torch.int64),
            torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype=torch.int64),
        ],
        dim=0,
    )

    label = torch.cat(
        [
            torch.tensor(dec_input_tokens, dtype=torch.int64),
            self.eos_token,
            torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype=torch.int64),
        ],
        dim=0,
    )

    assert encoder_input.size(0) == self.seq_len
    assert decoder_input.size(0) == self.seq_len
    assert label.size(0) == self.seq_len

    return {
        "encoder_input": encoder_input,
        "decoder_input": decoder_input,
        "encoder_mask": (encoder_input != self.pad_token).unsqueeze(0).unsqueeze(0).int(),
        "decoder_mask": (decoder_input != self.pad_token).unsqueeze(0).int() & causal_mask(decoder_input.size(0)),
        "label": label,
        "src_text": src_text,
        "tgt_text": tgt_text,
    }

def causal_mask(size):
  mask = torch.triu(torch.ones((1, size, size)), diagonal=1).type(torch.int)
  return mask == 0
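
As a rough illustration of what `decoder_mask` ends up containing, the sketch below rebuilds the same logic with nested Python lists instead of tensors. The helper names and the example ids are hypothetical; only the combination rule (padding mask AND causal mask) is taken from the code above.

```python
def causal_mask_rows(size):
    """True where query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]

def decoder_mask_rows(decoder_input, pad_id):
    """Combine the padding mask (broadcast over rows) with the causal mask."""
    size = len(decoder_input)
    not_pad = [tok != pad_id for tok in decoder_input]
    causal = causal_mask_rows(size)
    return [[not_pad[j] and causal[i][j] for j in range(size)] for i in range(size)]

# Made-up ids: [SOS], two real tokens, then one [PAD] (pad id 1)
mask = decoder_mask_rows([2, 20, 21, 1], pad_id=1)
for row in mask:
    print([int(v) for v in row])
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
```

Each row is one query position: the lower triangle comes from the causal mask, and the last column is zeroed because that key position holds padding.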

## Build Tokenizer For Training

In [37]:
!pip install transformers tokenizers datasets -qU


This generator yields every sentence in our dataset for a given language.

In [38]:
def get_all_sentences(ds, lang):
    for item in ds:
        yield item['translation'][lang]

We'll quickly train a tokenizer on our dataset for both our source and target languages.

We'll be sure to add the `[UNK]`, `[PAD]`, `[SOS]`, and `[EOS]` special tokens.
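
To give a feel for what a word-level vocabulary with a minimum frequency does, here is a simplified sketch in plain Python. It is not how the `tokenizers` library is implemented: `build_vocab` and `encode` are hypothetical helpers, splitting is done with `str.split()` rather than the `Whitespace` pre-tokenizer, and the id ordering is an assumption made for this example.

```python
from collections import Counter

SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[SOS]", "[EOS]"]

def build_vocab(sentences, min_frequency=2):
    """Map special tokens first, then every word seen at least min_frequency times."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # Most frequent words first; ties broken alphabetically to keep ids deterministic
    for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_frequency:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Words outside the vocabulary fall back to the [UNK] id."""
    return [vocab.get(w, vocab["[UNK]"]) for w in sentence.split()]

vocab = build_vocab(["the cat sat", "the dog sat", "the bird flew"])
print(vocab)  # {'[UNK]': 0, '[PAD]': 1, '[SOS]': 2, '[EOS]': 3, 'the': 4, 'sat': 5}
print(encode("the cat flew", vocab))  # [4, 0, 0]
```

Words that appear only once ("cat", "flew") never enter the vocabulary, so they encode to `[UNK]`. This keeps the vocabulary small at the cost of losing rare words.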

In [39]:
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

def build_tokenizer(config, ds, lang):
  tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
  tokenizer.pre_tokenizer = Whitespace()
  trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"], min_frequency=2)
  tokenizer.train_from_iterator(get_all_sentences(ds, lang), trainer=trainer)
  return tokenizer

Now we can create our dataset in a format that our model expects and can train with!

In [40]:
from torch.utils.data import DataLoader, random_split

def get_ds(config):
  # It only has the train split, so we divide it ourselves
  ds_raw = load_dataset(f"{config['datasource']}", f"{config['lang_src']}-{config['lang_tgt']}", split='train')

  # Build tokenizers
  tokenizer_src = build_tokenizer(config, ds_raw, config['lang_src'])
  tokenizer_tgt = build_tokenizer(config, ds_raw, config['lang_tgt'])

  # Keep 90% for training, 10% for validation
  train_ds_size = int(0.9 * len(ds_raw))
  val_ds_size = len(ds_raw) - train_ds_size
  train_ds_raw, val_ds_raw = random_split(ds_raw, [train_ds_size, val_ds_size])

  train_ds = BilingualDataset(train_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])
  val_ds = BilingualDataset(val_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])

  # Find the maximum tokenized length across all source and target sentences
  max_len_src = 0
  max_len_tgt = 0

  for item in ds_raw:
    src_ids = tokenizer_src.encode(item['translation'][config['lang_src']]).ids
    tgt_ids = tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids
    max_len_src = max(max_len_src, len(src_ids))
    max_len_tgt = max(max_len_tgt, len(tgt_ids))

  print(f'Max length of source sentence: {max_len_src}')
  print(f'Max length of target sentence: {max_len_tgt}')


  train_dataloader = DataLoader(train_ds, batch_size=config['batch_size'], shuffle=True)
  val_dataloader = DataLoader(val_ds, batch_size=1, shuffle=True)

  return train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt

We can build our model with this helper function.

In [41]:
def get_model(config, vocab_src_len, vocab_tgt_len):
  model = build_transformer(vocab_src_len, vocab_tgt_len, config["seq_len"], config['seq_len'], d_model=config['d_model'], verbose=False)
  return model

In [42]:
def get_weights_file_path(config, epoch: str):
  model_folder = f"{config['datasource']}_{config['model_folder']}"
  model_filename = f"{config['model_basename']}{epoch}.pt"
  return str(Path('.') / model_folder / model_filename)

In [43]:
def latest_weights_file_path(config):
  model_folder = f"{config['datasource']}_{config['model_folder']}"
  model_filename = f"{config['model_basename']}*"
  weights_files = list(Path(model_folder).glob(model_filename))
  if len(weights_files) == 0:
      return None
  weights_files.sort()
  return str(weights_files[-1])
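
One detail worth noting: `latest_weights_file_path` picks the last file after a plain sort, so it relies on checkpoint names sorting in epoch order. The sketch below, which uses only strings and a hypothetical `latest` helper, suggests why zero-padded epoch numbers matter for that.

```python
def latest(names):
    """Pick the last name in lexicographic order, or None if there are no names."""
    return sorted(names)[-1] if names else None

# Twelve checkpoints named with and without zero-padding of the epoch number
padded = [f"encoder_decoder_model_{epoch:02d}.pt" for epoch in range(12)]
unpadded = [f"encoder_decoder_model_{epoch}.pt" for epoch in range(12)]

print(latest(padded))    # encoder_decoder_model_11.pt
print(latest(unpadded))  # encoder_decoder_model_9.pt
```

With padded names the newest checkpoint sorts last. Without padding, "9" sorts after "11", so the wrong file would be picked once training passes ten epochs.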

Finally... our training loop!

We'll spend more time in the following weeks discussing this. For now, we'll quickly walk through what's happening:

1. Configure the training device (GPU/CPU) and print details. Set device in PyTorch.

2. Create directory for saving model weights based on config.

3. Get data loaders, tokenizers, and model. Move model to configured device.

4. Initialize Adam optimizer with the learning rate from config and a fixed epsilon of `1e-9`.

5. Set up initial training parameters like start epoch and global step.

6. Define cross-entropy loss function with label smoothing, ignoring padding.

---

- Main training loop over epochs:

  - Clear cache, set model to train mode, initialize progress bar.

  - For each batch:

    - Move data to device, run the forward pass (encode, decode, project).
    - Compute loss, backpropagate, update model weights.
    - Increment global step.
  - After each epoch, save model and optimizer checkpoint.
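
To unpack step 6 a little, the sketch below computes a label-smoothed cross-entropy that skips padding positions, using only the standard library. It illustrates the idea (mix the one-hot target with a uniform distribution, average over non-ignored positions) and is not PyTorch's implementation. The function name and the example numbers are hypothetical.

```python
import math

def smoothed_cross_entropy(logits, labels, ignore_index, smoothing=0.1):
    """Mean label-smoothed cross-entropy over positions whose label != ignore_index."""
    total, counted = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padding positions contribute nothing to the loss
        # Numerically stable log-softmax over the vocabulary dimension
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        log_probs = [x - log_z for x in row]
        nll = -log_probs[label]                # loss against the true class
        uniform = -sum(log_probs) / len(row)   # loss against a uniform target
        total += (1 - smoothing) * nll + smoothing * uniform
        counted += 1
    return total / counted if counted else 0.0

# Two positions over a 4-word vocabulary; the second label is padding (id 1 here)
logits = [[2.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
print(smoothed_cross_entropy(logits, [0, 1], ignore_index=1))
```

Because the second position is ignored, the result is the same as calling the function on the first position alone. Smoothing keeps the model from being pushed toward fully confident predictions.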

#### 🏗️ Activity #8:

Complete the inner training loop.

In [44]:
import warnings
from tqdm import tqdm
import os
from pathlib import Path

def train_model(config):
  # Define the device (prefer CUDA, then Apple MPS, then CPU)
  device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
  print("Using device:", device)
  if (device == 'cuda'):
    # `device` is still a string here, so query the current CUDA device explicitly
    cuda_index = torch.cuda.current_device()
    print(f"Device name: {torch.cuda.get_device_name(cuda_index)}")
    print(f"Device memory: {torch.cuda.get_device_properties(cuda_index).total_memory / 1024 ** 3} GB")
  else:
    print("Please ensure you're in a GPU enabled Colab Notebook instance.")
  device = torch.device(device)

  # Make sure the weights folder exists
  Path(f"{config['datasource']}_{config['model_folder']}").mkdir(parents=True, exist_ok=True)

  train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)
  model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size()).to(device)

  optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'], eps=1e-9)

  initial_epoch = 0
  global_step = 0

  # Labels are target-language ids, so take the padding id from the target tokenizer
  loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer_tgt.token_to_id('[PAD]'), label_smoothing=0.1).to(device)

  for epoch in range(initial_epoch, config['num_epochs']):
    torch.cuda.empty_cache()
    model.train()
    batch_iterator = tqdm(train_dataloader, desc=f"Processing Epoch {epoch:02d}")
    for batch in batch_iterator:
      ### YOUR CODE HERE ###
      # Move the batch to the training device
      encoder_input = batch['encoder_input'].to(device)
      decoder_input = batch['decoder_input'].to(device)
      encoder_mask = batch['encoder_mask'].to(device)
      decoder_mask = batch['decoder_mask'].to(device)
      label = batch['label'].to(device)

      # Forward pass: encode the source, decode with teacher forcing, project to vocabulary logits
      encoder_output = model.encode(encoder_input, encoder_mask)
      decoder_output = model.decode(encoder_output, encoder_mask, decoder_input, decoder_mask)
      proj_output = model.project(decoder_output)

      # Flatten logits and labels so each token position is one classification example
      loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))
      batch_iterator.set_postfix({"loss": f"{loss.item():6.3f}"})

      # Backward pass and parameter update
      loss.backward()
      optimizer.step()
      optimizer.zero_grad(set_to_none=True)

      global_step += 1


    model_filename = get_weights_file_path(config, f"{epoch:02d}")
    torch.save({
      'epoch': epoch,
      'model_state_dict': model.state_dict(),
      'optimizer_state_dict': optimizer.state_dict(),
      'global_step': global_step
    }, model_filename)

In [45]:
config = {
  "batch_size": 64,
  "num_epochs": 6,
  "lr": 1e-4,
  "seq_len": 350,
  "d_model": 512,
  "datasource": 'opus_books',
  "lang_src": "en",
  "lang_tgt": "it",
  "model_folder": "trained__en_it_translation_model",
  "model_basename": "encoder_decoder_model_"
}

In [46]:
train_model(config)

Using device: cuda
Device name: NVIDIA A100-SXM4-40GB
Device memory: 39.56427001953125 GB



Max length of source sentence: 309
Max length of target sentence: 274


Processing Epoch 00: 100%|██████████| 455/455 [05:27<00:00,  1.39it/s, loss=6.292]
Processing Epoch 01: 100%|██████████| 455/455 [05:25<00:00,  1.40it/s, loss=6.073]
Processing Epoch 02: 100%|██████████| 455/455 [05:25<00:00,  1.40it/s, loss=5.478]
Processing Epoch 03: 100%|██████████| 455/455 [05:26<00:00,  1.40it/s, loss=5.364]
Processing Epoch 04: 100%|██████████| 455/455 [05:25<00:00,  1.40it/s, loss=5.184]
Processing Epoch 05: 100%|██████████| 455/455 [05:25<00:00,  1.40it/s, loss=5.108]


In [47]:
def load_model(config):
    # Get dataloaders and tokenizers
    _, _, tokenizer_src, tokenizer_tgt = get_ds(config)

    # Initialize model
    model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size())

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Load trained weights, remapping tensors onto the available device
    model_filename = latest_weights_file_path(config)
    if model_filename:
        print(f"Loading weights from {model_filename}")
        state = torch.load(model_filename, map_location=device)
        model.load_state_dict(state['model_state_dict'])

    model.to(device)
    return model, tokenizer_src, tokenizer_tgt, device

def generate(model, tokenizer_src, tokenizer_tgt, src_text, device, max_length=350):
    model.eval()

    enc_input = tokenizer_src.encode(src_text).ids
    enc_input = torch.tensor([tokenizer_src.token_to_id('[SOS]')] + enc_input + [tokenizer_src.token_to_id('[EOS]')]).unsqueeze(0)

    enc_mask = (enc_input != tokenizer_src.token_to_id('[PAD]')).unsqueeze(0).unsqueeze(0).int()

    enc_input = enc_input.to(device)
    enc_mask = enc_mask.to(device)

    with torch.no_grad():
        enc_output = model.encode(enc_input, enc_mask)
        dec_input = torch.tensor([[tokenizer_tgt.token_to_id('[SOS]')]]).to(device)

        for _ in range(max_length):
            dec_mask = causal_mask(dec_input.size(1)).to(device)

            dec_output = model.decode(enc_output, enc_mask, dec_input, dec_mask)
            proj_output = model.project(dec_output)

            next_word = proj_output[:, -1].argmax(dim=-1)
            dec_input = torch.cat([dec_input, next_word.unsqueeze(-1)], dim=1)

            if next_word.item() == tokenizer_tgt.token_to_id('[EOS]'):
                break

    translated_tokens = [tokenizer_tgt.id_to_token(t.item()) for t in dec_input[0]]
    translated_text = ' '.join([t for t in translated_tokens if t not in ['[SOS]', '[EOS]', '[PAD]']])

    return translated_text


model, tokenizer_src, tokenizer_tgt, device = load_model(config)
model.eval()

Max length of source sentence: 309
Max length of target sentence: 274
Loading weights from opus_books_trained__en_it_translation_model/encoder_decoder_model_05.pt




Transformer(
  (encoder): EncoderStack(
    (layers): ModuleList(
      (0-5): 6 x EncoderBlock(
        (self_attention_block): MultiHeadAttention(
          (w_q): Linear(in_features=512, out_features=512, bias=False)
          (w_k): Linear(in_features=512, out_features=512, bias=False)
          (w_v): Linear(in_features=512, out_features=512, bias=False)
          (w_o): Linear(in_features=512, out_features=512, bias=False)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (feed_forward_block): FeedForwardBlock(
          (linear_1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear_2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (residual_connections): ModuleList(
          (0-1): 2 x ResidualConnection(
            (dropout): Dropout(p=0.1, inplace=False)
            (layernorm): LayerNormalization()
          )
        )
      )
    )
    (norm): LayerNormaliz

In [48]:
test_sentences = [
        "the weather is beautiful today",
        "how are you?"
    ]

print("English to Italian Translations:")
print("-" * 50)
for sentence in test_sentences:
    translation = generate(model, tokenizer_src, tokenizer_tgt, sentence, device)
    print(f"EN: {sentence}")
    print(f"IT: {translation}")
    print("-" * 50)

English to Italian Translations:
--------------------------------------------------
EN: the weather is beautiful today
IT: Il giorno , il resto , la questione .
--------------------------------------------------
EN: how are you?
IT: Come siete ?
--------------------------------------------------
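
The `generate` function above decodes greedily: at every step it appends the single highest-scoring next token and stops at `[EOS]` or at `max_length`. The sketch below shows the same loop with the model replaced by a hypothetical lookup table (`TOY_SCORES`), so the scores and tokens are invented for illustration and say nothing about the trained model.

```python
SOS, EOS = "[SOS]", "[EOS]"

# Hypothetical stand-in for encode/decode/project: next-token scores keyed by the last token
TOY_SCORES = {
    "[SOS]": {"come": 2.0, "stai": 0.5, "[EOS]": 0.1},
    "come": {"stai": 1.5, "come": 0.2, "[EOS]": 0.3},
    "stai": {"[EOS]": 2.0, "come": 0.1, "stai": 0.1},
}

def greedy_decode(next_token_scores, max_length=10):
    """Repeatedly append the highest-scoring token until [EOS] or max_length steps."""
    tokens = [SOS]
    for _ in range(max_length):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # argmax over the candidate tokens
        tokens.append(best)
        if best == EOS:
            break
    return [t for t in tokens if t not in (SOS, EOS)]

print(greedy_decode(lambda toks: TOY_SCORES[toks[-1]]))  # ['come', 'stai']
```

Greedy decoding commits to one token per step and never revisits it, which is simple and fast but can lock in an early mistake. With an under-trained model, that is one plausible reason a translation drifts, as in the first example above.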


#### Acknowledgements

This notebook is heavily adapted from a number of incredible resources on Transformers, including but not limited to:

- https://blog.floydhub.com/the-transformer-in-pytorch/
- https://arxiv.org/pdf/1706.03762.pdf
- https://txt.cohere.com/what-are-transformer-models/
- https://jalammar.github.io/illustrated-transformer/