# Homework 3: Transformers, T5, Machine Translation

This homework guides you through implementing a complete T5 (Text-To-Text Transfer Transformer) encoder-decoder architecture from scratch for machine translation. You will build the encoder stack, decoder stack with cross-attention, and train the model on English-German translation.

**Total Points: 20**

**Instructions:**
1. Complete all tasks in this notebook
2. Ensure your code runs without errors
3. Submit both this notebook and any additional files created
4. Write clear explanations for your approach
5. Train your model on the English-German translation dataset

## Setup

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import math
from typing import Optional, Tuple, List, Dict

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cuda


## Dataset Preparation

We'll use the WMT English-German translation dataset. This dataset contains parallel sentences in English and German, perfect for training a sequence-to-sequence translation model.

In [2]:
!pip install -q datasets

In [3]:
import os
from collections import Counter
from datasets import load_dataset
import tqdm

print("Initializing WMT Dataset Loader (No backups, real data)...")

# 1. Load the Real WMT Dataset
# We use WMT16 (German-English), a standard benchmark.
# We take the first 50,000 examples to keep RAM usage efficient while ensuring variety.
# If you want the full dataset (4.5M pairs), remove the [:50000] slice.
dataset = load_dataset("wmt16", "de-en", split="train[:50000]")

print(f"Successfully loaded {len(dataset)} pairs from WMT16.")

# 2. Extract pairs into memory
translation_pairs = []
for item in dataset:
    # WMT data structure is usually {'translation': {'de': '...', 'en': '...'}}
    en_text = item['translation']['en']
    de_text = item['translation']['de']
    translation_pairs.append((en_text, de_text))

# 3. Build Vocabulary (Character-level)
# Character-level tokenization keeps the vocabulary small.
# Note: for state-of-the-art results on WMT you would usually use BPE (Byte Pair Encoding),
# but character-level tokenization is enough for learning the fundamentals.
print("Building vocabulary from dataset...")

counter = Counter()
for en, de in translation_pairs:
    counter.update(en)  # case is preserved, as is usual for WMT
    counter.update(de)

# Filter rare characters to keep vocab clean (optional, but good for real web data)
MIN_FREQ = 5
common_chars = [char for char, count in counter.items() if count >= MIN_FREQ]

# Add special tokens
special_tokens = ['<pad>', '<sos>', '<eos>', '<unk>']
chars = special_tokens + sorted(common_chars)
vocab_size = len(chars)

print(f'Shared vocabulary size: {vocab_size}')

# 4. Mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

PAD_IDX = stoi['<pad>']
SOS_IDX = stoi['<sos>']
EOS_IDX = stoi['<eos>']
UNK_IDX = stoi['<unk>']

# 5. Encoding/Decoding Functions
def encode(s):
    """Encode string to token IDs (Character level)."""
    # Case is preserved (no lower()) so that no information is lost
    return [stoi.get(c, UNK_IDX) for c in s]

def decode(l):
    """Decode token IDs to string."""
    return ''.join([itos[i] for i in l if i != PAD_IDX])

print(f"Special tokens: PAD={PAD_IDX}, SOS={SOS_IDX}, EOS={EOS_IDX}, UNK={UNK_IDX}")

# 6. Train/Val Split (80/20)
n_train = int(0.8 * len(translation_pairs))
train_pairs = translation_pairs[:n_train]
val_pairs = translation_pairs[n_train:]

print("-" * 40)
print(f"Train pairs: {len(train_pairs)}")
print(f"Val pairs:   {len(val_pairs)}")
print("-" * 40)
print("Sample Data (Real WMT):")
print(f"  EN: {train_pairs[5][0]}")  # fixed index, chosen to show a real sentence
print(f"  DE: {train_pairs[5][1]}")
print(f"  Encoded: {encode(train_pairs[5][0])}")

Initializing WMT Dataset Loader (No backups, real data)...



Successfully loaded 50000 pairs from WMT16.
Building vocabulary from dataset...
Shared vocabulary size: 113
Special tokens: PAD=0, SOS=1, EOS=2, UNK=3
----------------------------------------
Train pairs: 40000
Val pairs:   10000
----------------------------------------
Sample Data (Real WMT):
  EN: Please rise, then, for this minute' s silence.
  DE: Ich bitte Sie, sich zu einer Schweigeminute zu erheben.
  Encoded: [45, 69, 62, 58, 76, 62, 4, 75, 66, 76, 62, 13, 4, 77, 65, 62, 71, 13, 4, 63, 72, 75, 4, 77, 65, 66, 76, 4, 70, 66, 71, 78, 77, 62, 9, 4, 76, 4, 76, 66, 69, 62, 71, 60, 62, 15]
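
To make the character-level scheme concrete, here is a minimal, self-contained sketch of the same encode/decode round trip on a toy corpus. The corpus and the names `toy_stoi`, `toy_encode`, and `toy_decode` are illustrative assumptions; only the special-token layout mirrors the cell above.

```python
# Toy corpus (an assumption for illustration; the notebook uses WMT16 pairs).
toy_corpus = ["Hello", "Hallo Welt"]

special_tokens = ['<pad>', '<sos>', '<eos>', '<unk>']
toy_chars = special_tokens + sorted(set("".join(toy_corpus)))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}
TOY_PAD, TOY_UNK = toy_stoi['<pad>'], toy_stoi['<unk>']

def toy_encode(s):
    """Map each character to its ID; unseen characters become <unk>."""
    return [toy_stoi.get(c, TOY_UNK) for c in s]

def toy_decode(ids):
    """Map IDs back to characters, dropping padding."""
    return "".join(toy_itos[i] for i in ids if i != TOY_PAD)

ids = toy_encode("Hallo")
print(ids, "->", toy_decode(ids))
print(toy_encode("Hz"))  # 'z' is not in the toy corpus, so it maps to <unk>
```

Characters dropped by the `MIN_FREQ` filter in the real pipeline behave like `'z'` here: they are encoded as `<unk>` and cannot be recovered by decoding.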


## TASK 1: T5 Encoder Implementation (6 points)

Build the encoder stack that processes source sequences with bidirectional attention. The encoder is similar to BERT but designed for the T5 architecture.

**1.1**: *Encoder Block* — Implement T5 encoder block with bidirectional self-attention, feed-forward network, layer normalization, and residual connections

**1.2**: *Encoder Stack* — Stack multiple encoder blocks with positional encoding and token embeddings

**1.3**: *Encoder Testing* — Test encoder implementation and verify bidirectional attention works correctly


In [4]:
# Helper function: Positional Encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))

        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe)

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        return x + self.pe[:x.size(1), :].unsqueeze(0)

In [5]:
class FeedForward(nn.Module):
    def __init__(self, n_embd, hidden_size, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, hidden_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, n_embd),
            nn.Dropout(dropout),
        )
    def forward(self, x): return self.net(x)

In [6]:
# Helper function: Scaled Dot-Product Attention
def scaled_dot_product_attention(query, key, value, mask=None, dropout=None):
    """
    Compute scaled dot-product attention.

    Args:
        query: [batch_size, seq_len, d_k] or [batch_size, n_heads, seq_len, d_k]
        key: [batch_size, seq_len, d_k] or [batch_size, n_heads, seq_len, d_k]
        value: [batch_size, seq_len, d_v] or [batch_size, n_heads, seq_len, d_v]
        mask: [batch_size, seq_len, seq_len] or None (0s for masked positions)
        dropout: Dropout layer or None
    Returns:
        output: [batch_size, seq_len, d_v] or [batch_size, n_heads, seq_len, d_v]
        attention_weights: [batch_size, seq_len, seq_len] or [batch_size, n_heads, seq_len, seq_len]
    """
    d_k = query.size(-1)

    # Compute attention scores: Q @ K^T / sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Apply mask if provided (masked positions get a large negative score before softmax)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Apply softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)

    # Apply dropout if provided
    if dropout is not None:
        attention_weights = dropout(attention_weights)

    # Multiply attention weights by values
    output = torch.matmul(attention_weights, value)

    return output, attention_weights

In [7]:
class MultiHeadAttention(nn.Module):
    def __init__(self, n_embd, n_head, dropout=0.1):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.head_dim = n_embd // n_head
        self.scale = self.head_dim ** -0.5
        self.query = nn.Linear(n_embd, n_embd)
        self.key = nn.Linear(n_embd, n_embd)
        self.value = nn.Linear(n_embd, n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        B, T, C = q.shape
        q = self.query(q).view(B, -1, self.n_head, self.head_dim).transpose(1, 2)
        k = self.key(k).view(B, -1, self.n_head, self.head_dim).transpose(1, 2)
        v = self.value(v).view(B, -1, self.n_head, self.head_dim).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) * self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = self.dropout(F.softmax(scores, dim=-1))
        out = (attn @ v).transpose(1, 2).contiguous().view(B, -1, C)
        return self.proj(out), attn

### Task 1.1 Implement T5 encoder block

Implement a T5 encoder block with:
- Bidirectional multi-head self-attention (all positions can attend to all positions)
- Feed-forward network
- Layer normalization (pre-norm architecture: normalize before sub-layers)
- Residual connections around each sub-layer

The encoder block should follow this structure:
1. Self-attention: `x = x + dropout(attention(layer_norm(x)))`
2. Feed-forward: `x = x + dropout(ffn(layer_norm(x)))`

In [8]:
class T5EncoderBlock(nn.Module):
    """T5 Encoder Block with bidirectional self-attention."""
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        ## YOUR_CODE_STARTS_HERE
        # TODO: Initialize:
        # - Multi-head self-attention (bidirectional, no masking)
        # - Feed-forward network
        # - Two layer normalization layers (one for attention, one for FFN)
        # - Dropout layer

        self.self_attention = MultiHeadAttention(d_model, n_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm_1 = nn.LayerNorm(d_model)
        self.norm_2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

        ## YOUR_CODE_ENDS_HERE

    def forward(self, x, mask=None):
        ## YOUR_CODE_STARTS_HERE
        # TODO: Implement encoder block with residual connections (pre-norm)
        # 1. Self-attention: x = x + dropout(attention(layer_norm(x)))
        # 2. Feed-forward: x = x + dropout(ffn(layer_norm(x)))
        # Note: Apply layer norm BEFORE the sub-layer (pre-norm architecture)

        h = self.norm_1(x)  # normalize once and reuse for Q, K, V
        attention, _ = self.self_attention(h, h, h, mask=mask)
        x = x + self.dropout(attention)

        feed_forward = self.feed_forward(self.norm_2(x))
        x = x + self.dropout(feed_forward)


        ## YOUR_CODE_ENDS_HERE

        return x

In [9]:
# Test your implementation
encoder_block = T5EncoderBlock(d_model=512, n_heads=8, d_ff=2048)
x = torch.randn(2, 10, 512)
output = encoder_block(x)
print(f"Encoder block output shape: {output.shape}")
## RESULT_CHECKING_POINT -> torch.Size([2, 10, 512])

Encoder block output shape: torch.Size([2, 10, 512])


### Task 1.2 Build encoder stack


In [10]:
class T5Encoder(nn.Module):
    """T5 Encoder: Stack of encoder blocks with embeddings and positional encoding."""
    def __init__(self, vocab_size, d_model, n_heads, n_layers, d_ff, max_len=5000, dropout=0.1):
        super().__init__()
        ## YOUR_CODE_STARTS_HERE
        # TODO: Initialize:
        # - Token embedding layer
        # - Positional encoding
        # - Stack of encoder blocks (n_layers)
        # - Final layer normalization (optional, but common in T5)

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = PositionalEncoding(d_model, max_len)
        self.dropout = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([T5EncoderBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(d_model)

        ## YOUR_CODE_ENDS_HERE

    def forward(self, x, mask=None):
        ## YOUR_CODE_STARTS_HERE
        # TODO: Implement encoder forward pass
        # 1. Token embeddings
        # 2. Add positional encoding
        # 3. Apply dropout
        # 4. Pass through encoder blocks
        # 5. Apply final layer norm

        tok_embed = self.token_embedding(x)
        x = self.pos_embedding(tok_embed)
        x = self.dropout(x)
        for block in self.blocks:
            x = block(x, mask)
        x = self.norm(x)
        ## YOUR_CODE_ENDS_HERE

        return x

# Test your implementation
encoder = T5Encoder(vocab_size=1000, d_model=512, n_heads=8, n_layers=6, d_ff=2048)
x = torch.randint(0, 1000, (2, 10))
output = encoder(x)
print(f"Encoder output shape: {output.shape}")
## RESULT_CHECKING_POINT -> torch.Size([2, 10, 512])

Encoder output shape: torch.Size([2, 10, 512])


### Task 1.3 Test encoder implementation

In [11]:
def test_encoder(encoder, sample_text, max_len=50):
    """Run the encoder on a text sample and print output statistics."""
    encoder.eval()

    # Encode text and place the tensor on the same device as the encoder
    enc_device = next(encoder.parameters()).device
    encoded = torch.tensor(encode(sample_text[:max_len]), device=enc_device).unsqueeze(0)

    # Forward pass
    with torch.no_grad():
        output = encoder(encoded)

    print(f"Input text: {sample_text[:max_len]}")
    print(f"Input shape: {encoded.shape}")
    print(f"Encoder output shape: {output.shape}")
    print(f"Encoder output mean: {output.mean().item():.4f}")
    print(f"Encoder output std: {output.std().item():.4f}")

    return output
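
Task 1.3 asks to verify that attention is bidirectional, but the helper above only prints output statistics. One possible check, sketched below under stated assumptions, perturbs the last position and observes whether the output at the first position changes. The helper `tiny_self_attention` is a hypothetical single-head attention without learned projections, not the notebook's `MultiHeadAttention`.

```python
import math
import torch
import torch.nn.functional as F

def tiny_self_attention(x, causal=False):
    """Single-head self-attention with Q = K = V = x (illustrative only)."""
    T = x.size(1)
    scores = x @ x.transpose(-2, -1) / math.sqrt(x.size(-1))
    if causal:
        mask = torch.tril(torch.ones(T, T)).bool()
        scores = scores.masked_fill(~mask, float('-inf'))
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(1, 5, 8)
x_perturbed = x.clone()
x_perturbed[:, -1, :] += 1.0  # change only the last position

# Bidirectional: position 0 attends to the last position, so its output moves.
bi_changed = not torch.allclose(tiny_self_attention(x)[:, 0],
                                tiny_self_attention(x_perturbed)[:, 0])
# Causal: position 0 sees only itself, so its output is unaffected.
causal_changed = not torch.allclose(tiny_self_attention(x, causal=True)[:, 0],
                                    tiny_self_attention(x_perturbed, causal=True)[:, 0])
print(f"bidirectional output at position 0 changed: {bi_changed}")
print(f"causal output at position 0 changed: {causal_changed}")
```

The same perturbation test can be applied to a `T5Encoder` instance in `eval()` mode, so that dropout does not introduce differences of its own.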

## TASK 2: T5 Decoder with Cross-Attention (7 points)

Build the decoder stack with causal self-attention and cross-attention to encoder outputs. The decoder uses masked self-attention (causal) for autoregressive generation and cross-attention to attend to the encoder's output.

**2.1**: *Cross-Attention Mechanism* — Implement cross-attention where queries come from decoder states, and keys/values come from encoder outputs

**2.2**: *T5 Decoder Block* — Implement decoder block with masked self-attention, cross-attention, and feed-forward network

**2.3**: *Decoder Stack* — Stack multiple decoder blocks with positional encoding and token embeddings

### Task 2.1 Implement cross-attention mechanism

In [12]:
class CrossAttentionHead(nn.Module):
    """Single head of cross-attention (decoder attends to encoder)."""
    def __init__(self, head_size, n_embd, dropout):
        super().__init__()
        ## YOUR_CODE_STARTS_HERE
        # Query comes from decoder, Key and Value come from encoder

        self.query = nn.Linear(n_embd, head_size)
        self.key = nn.Linear(n_embd, head_size)
        self.value = nn.Linear(n_embd, head_size)

        self.scale = head_size ** (-0.5) # scale for attention scores
        self.dropout = nn.Dropout(dropout)

        ## YOUR_CODE_ENDS_HERE

    def forward(self, decoder_states, encoder_states):
        ## YOUR_CODE_STARTS_HERE

        q = self.query(decoder_states)   # only query comes from decoder
        k = self.key(encoder_states)
        v = self.value(encoder_states)
        scores = q @ k.transpose(-2,-1) * self.scale # attention scores

        attention = F.softmax(scores, dim=-1)
        attention = self.dropout(attention)

        out = torch.matmul(attention, v) # product of attention weights and values

        ## YOUR_CODE_ENDS_HERE
        return out

In [13]:
class CrossMultiHeadAttention(nn.Module):
    """Multi-head cross-attention for T5."""
    def __init__(self, n_head, head_size, n_embd, dropout):
        super().__init__()
        ## YOUR_CODE_STARTS_HERE
        assert n_embd % n_head == 0 # equal split among heads
        self.n_head = n_head
        self.head_dim = n_embd // n_head

        # Q, K, V from concatenated heads
        self.query = nn.Linear(n_embd, n_embd)
        self.key = nn.Linear(n_embd, n_embd)
        self.value = nn.Linear(n_embd, n_embd)

        self.proj = nn.Linear(n_embd, n_embd)

        self.scale = self.head_dim ** (-0.5)
        self.dropout = nn.Dropout(dropout)

        ## YOUR_CODE_ENDS_HERE

    def forward(self, decoder_states, encoder_states):
        ## YOUR_CODE_STARTS_HERE

        B, T_decoder, _ = decoder_states.shape
        T_encoder = encoder_states.size(1)

        q = self.query(decoder_states)  # only query from decoder
        k = self.key(encoder_states)
        v = self.value(encoder_states)

        # reshape to fit multiple heads
        q = q.view(B, T_decoder, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T_encoder, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T_encoder, self.n_head, self.head_dim).transpose(1, 2)

        scores = q @ k.transpose(-2,-1) * self.scale

        attention = F.softmax(scores, dim=-1)
        attention = self.dropout(attention)

        out = torch.matmul(attention, v)

        # reconcatenate heads
        out = out.transpose(1,2).contiguous().view(B, T_decoder, -1)
        out = self.proj(out)

        ## YOUR_CODE_ENDS_HERE
        return out

In [14]:
# Test your implementation
cross_attn = CrossMultiHeadAttention(n_head=8, head_size=64, n_embd=512, dropout=0.1)
decoder_states = torch.randn(2, 10, 512)  # (batch, decoder_seq_len, d_model)
encoder_states = torch.randn(2, 15, 512)  # (batch, encoder_seq_len, d_model)
output = cross_attn(decoder_states, encoder_states)
print(f"Cross-attention output shape: {output.shape}")
## RESULT_CHECKING_POINT -> torch.Size([2, 10, 512])

Cross-attention output shape: torch.Size([2, 10, 512])
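
The cross-attention above takes no mask, so in a padded batch decoder queries can also attend to `<pad>` positions of the source. A common remedy, sketched here as an assumption and not as part of the required solution, is a key-padding mask of shape `[batch, 1, 1, T_encoder]` that broadcasts over heads and query positions. The tensors and the name `src_key_mask` are illustrative.

```python
import torch
import torch.nn.functional as F

PAD_IDX = 0  # matches the notebook's special-token layout

# Two source sequences, the second padded to length 4.
src = torch.tensor([[5, 6, 7, 8],
                    [5, 6, PAD_IDX, PAD_IDX]])
src_key_mask = (src != PAD_IDX)[:, None, None, :]  # [B, 1, 1, T_encoder]

# Stand-in attention scores: [B, n_head, T_decoder, T_encoder]
torch.manual_seed(0)
scores = torch.randn(2, 2, 3, 4)

# Padded keys get -inf, so softmax assigns them exactly zero weight.
masked_scores = scores.masked_fill(~src_key_mask, float('-inf'))
weights = F.softmax(masked_scores, dim=-1)

print(weights[1, 0])  # last two columns are zero for the padded sequence
```

In `CrossMultiHeadAttention.forward`, such a mask would be applied to `scores` just before the softmax. The encoder's existing `mask` argument can take the same tensor, since `MultiHeadAttention` masks wherever `mask == 0`.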


### Task 2.2 Implement T5 decoder block

Implement the decoder block with:
- Masked self-attention (causal, for autoregressive generation)
- Cross-attention (decoder attends to encoder)
- Feed-forward network
- Three layer normalization layers (pre-norm)
- Residual connections

In [15]:
def generate_causal_mask(seq_len):
    return torch.tril(torch.ones(seq_len, seq_len))

In [16]:
class T5DecoderBlock(nn.Module):
    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        head_size = n_embd // n_head
        ## YOUR_CODE_STARTS_HERE
        self.self_attention = MultiHeadAttention(
            n_head = n_head,
            n_embd = n_embd,
            dropout = dropout,
        )
        self.cross_attention = CrossMultiHeadAttention(
            n_head = n_head,
            head_size = head_size,
            n_embd = n_embd,
            dropout = dropout
        )
        self.feed_forward = nn.Sequential(
            nn.Linear(n_embd, 4*n_embd),
            nn.ReLU(),
            nn.Linear(4*n_embd, n_embd),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)

        self.norm_1 = nn.LayerNorm(n_embd)
        self.norm_2 = nn.LayerNorm(n_embd)
        self.norm_3 = nn.LayerNorm(n_embd)

        ## YOUR_CODE_ENDS_HERE

    def forward(self, x, encoder_output):
        ## YOUR_CODE_STARTS_HERE
        B, T, _ = x.size()
        # causal mask for self-attention, broadcast over batch and heads
        _mask = torch.tril(torch.ones(T, T, device=x.device)).unsqueeze(0).unsqueeze(0)

        h = self.norm_1(x)  # normalize once and reuse for Q, K, V
        attention, _ = self.self_attention(h, h, h, mask=_mask)
        x = x + self.dropout(attention)

        # decoder attends to the encoder output via cross-attention
        x = x + self.cross_attention(self.norm_2(x), encoder_output)

        feed_forward = self.feed_forward(self.norm_3(x))
        x = x + self.dropout(feed_forward)

        ## YOUR_CODE_ENDS_HERE
        return x

### Task 2.3 Build decoder stack


In [17]:
class T5Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers, d_ff, block_size, max_len, dropout):
        super().__init__()
        ## YOUR_CODE_STARTS_HERE

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = PositionalEncoding(d_model, block_size)
        self.dropout = nn.Dropout(dropout)
        # stack of decoder blocks
        self.blocks = nn.ModuleList([T5DecoderBlock(d_model, n_heads, block_size, dropout) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(d_model)
        self.block_size = block_size

        ## YOUR_CODE_ENDS_HERE

    def forward(self, x, encoder_output):
        ## YOUR_CODE_STARTS_HERE

        tok_embed = self.token_embedding(x)
        x = self.pos_embedding(tok_embed)
        x = self.dropout(x)
        for block in self.blocks:
            x = block(x, encoder_output)
        x = self.norm(x)

        return x

        ## YOUR_CODE_ENDS_HERE

In [18]:
print("Testing T5DecoderBlock...")
decoder_block = T5DecoderBlock(n_embd=512, n_head=8, block_size=128, dropout=0.1)
x = torch.randn(2, 10, 512)
encoder_output = torch.randn(2, 15, 512)
output = decoder_block(x, encoder_output)
print(f"Decoder block output shape: {output.shape}")

Testing T5DecoderBlock...
Decoder block output shape: torch.Size([2, 10, 512])


## TASK 3: Full T5 Model and Machine Translation (7 points)

Combine encoder and decoder into a complete T5 model and train it on English-German machine translation.

**3.1**: *Complete T5 Model* — Combine encoder and decoder stacks with shared token embeddings and language modeling head

**3.2**: *Translation Dataset Preparation* — Prepare English-German translation dataset with proper batching and padding

**3.3**: *Training and Testing* — Train model on translation task and evaluate performance

### Task 3.1 Build complete T5 model


In [19]:
class T5Model(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers, d_ff, block_size, max_len=5000, dropout=0.1):
        super().__init__()
        ## YOUR_CODE_STARTS_HERE

        self.vocab_size = vocab_size

        self.encoder = T5Encoder(     # Encoder
            vocab_size = vocab_size,
            d_model = d_model,
            n_heads = n_heads,
            n_layers = n_layers,
            d_ff = d_ff,
            max_len = max_len,
            dropout = dropout
        )

        self.decoder = T5Decoder(    # Decoder
            vocab_size = vocab_size,
            d_model = d_model,
            n_heads = n_heads,
            n_layers = n_layers,
            d_ff = d_ff,
            block_size = block_size,
            max_len = max_len,
            dropout = dropout
        )

        # one embedding table shared by encoder, decoder, and the tied LM head
        self.shared_token_embeddings = nn.Embedding(vocab_size, d_model)
        self.encoder.token_embedding = self.shared_token_embeddings
        self.decoder.token_embedding = self.shared_token_embeddings

        self.lm_head = nn.Linear(d_model, vocab_size, bias = False)
        self.lm_head.weight = self.shared_token_embeddings.weight

        ## YOUR_CODE_ENDS_HERE

    def forward(self, src, tgt):
        ## YOUR_CODE_STARTS_HERE

        encoder_output = self.encoder(src) # encode source
        decoder_states = self.decoder(tgt, encoder_output)  # decode based on encoding output

        logits = self.lm_head(decoder_states) # next token prediction

        ## YOUR_CODE_ENDS_HERE
        return logits
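
Because `lm_head` shares its weight with an `nn.Embedding`, whose entries are initialized from a standard normal distribution, the initial logits have a large scale. That is consistent with the step-0 loss of about 90 in the training log in Task 3.3, whereas a uniform guess over 113 characters would give ln 113 ≈ 4.73. Some T5 implementations (Hugging Face Transformers, for example) rescale the decoder output by `d_model ** -0.5` before a tied output projection. The sketch below illustrates the effect of that rescaling on random inputs; the stand-in `hidden` tensor (in place of real decoder states) and the helper `loss_for` are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 113, 128

embedding = nn.Embedding(vocab_size, d_model)      # weights ~ N(0, 1)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embedding.weight                  # weight tying, as in T5Model

# Stand-in for layer-normalized decoder states: [batch, seq_len, d_model]
hidden = torch.randn(4, 20, d_model)
targets = torch.randint(0, vocab_size, (4, 20))

def loss_for(states):
    """Cross-entropy of the tied head on the given states."""
    logits = lm_head(states)
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1)).item()

unscaled = loss_for(hidden)
scaled = loss_for(hidden * d_model ** -0.5)        # rescale before the tied head

print(f"uniform-guess loss: {math.log(vocab_size):.2f}")
print(f"unscaled tied head: {unscaled:.2f}")
print(f"scaled tied head:   {scaled:.2f}")
```

In `T5Model.forward` this would amount to multiplying `decoder_states` by `d_model ** -0.5` before `self.lm_head`, with `d_model` stored on the module. Untying the output head is another option.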

### Task 3.2 Prepare translation dataset


In [20]:
def get_translation_batch(split, batch_size=4):
    """Get a batch of English-German translation pairs."""
    pairs = train_pairs if split == 'train' else val_pairs

    # Random sampling
    indices = torch.randint(0, len(pairs), (batch_size,))

    src_batch = []
    tgt_batch = []

    for idx in indices:
        en_text, de_text = pairs[idx]
        src_ids = encode(en_text)
        # German: Add SOS at start, EOS at end
        tgt_ids = [SOS_IDX] + encode(de_text) + [EOS_IDX]

        src_batch.append(torch.tensor(src_ids, dtype=torch.long))
        tgt_batch.append(torch.tensor(tgt_ids, dtype=torch.long))

    # Pad
    src_batch = torch.nn.utils.rnn.pad_sequence(src_batch, batch_first=True, padding_value=PAD_IDX)
    tgt_batch = torch.nn.utils.rnn.pad_sequence(tgt_batch, batch_first=True, padding_value=PAD_IDX)

    # Teacher forcing split
    tgt_input = tgt_batch[:, :-1] # Input: SOS ... Token
    tgt_output = tgt_batch[:, 1:] # Target: Token ... EOS

    return src_batch.to(device), tgt_input.to(device), tgt_output.to(device)
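
The teacher-forcing split can be checked by hand on a single padded sequence. In this sketch the token IDs 10, 11, and 12 are placeholder characters. The decoder input drops the last column and the training target drops the first, so position `t` of the input is paired with token `t + 1` of the sequence.

```python
PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2  # same layout as the notebook

# One target sequence padded to length 6 (10, 11, 12 are placeholder token IDs).
tgt = [SOS_IDX, 10, 11, 12, EOS_IDX, PAD_IDX]

tgt_input = tgt[:-1]   # what the decoder sees
tgt_output = tgt[1:]   # what it should predict at each step

for seen, predict in zip(tgt_input, tgt_output):
    note = "  (ignored by the loss)" if predict == PAD_IDX else ""
    print(f"decoder sees {seen:>2} -> target {predict:>2}{note}")
```

The final pair (`<eos>` in, `<pad>` out) contributes nothing to training when the loss is built with `ignore_index=PAD_IDX`, as in the training cell.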

In [23]:
BLOCK_SIZE = 512 # Increased from 64 to 512 to handle sentences > 64 chars
BATCH_SIZE = 16

model = T5Model(
    vocab_size=vocab_size,
    d_model=128,
    n_heads=4,
    n_layers=2,
    d_ff=512,
    block_size=BLOCK_SIZE
).to(device)

print(f"Model initialized with block_size={BLOCK_SIZE}")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Updated Batch Function with Truncation
def get_safe_batch(split, batch_size=4, max_len=BLOCK_SIZE):
    src_batch, tgt_in, tgt_out = get_translation_batch(split, batch_size)

    # Truncate if they exceed block_size
    if src_batch.size(1) > max_len:
        src_batch = src_batch[:, :max_len]
    if tgt_in.size(1) > max_len:
        tgt_in = tgt_in[:, :max_len]
        tgt_out = tgt_out[:, :max_len] # Output must match input length

    return src_batch, tgt_in, tgt_out

print("Starting training...")
model.train()

Model initialized with block_size=512
Starting training...


T5Model(
  (encoder): T5Encoder(
    (token_embedding): Embedding(113, 128)
    (pos_embedding): PositionalEncoding()
    (dropout): Dropout(p=0.1, inplace=False)
    (blocks): ModuleList(
      (0-1): 2 x T5EncoderBlock(
        (self_attention): MultiHeadAttention(
          (query): Linear(in_features=128, out_features=128, bias=True)
          (key): Linear(in_features=128, out_features=128, bias=True)
          (value): Linear(in_features=128, out_features=128, bias=True)
          (proj): Linear(in_features=128, out_features=128, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (feed_forward): FeedForward(
          (net): Sequential(
            (0): Linear(in_features=128, out_features=512, bias=True)
            (1): GELU(approximate='none')
            (2): Dropout(p=0.1, inplace=False)
            (3): Linear(in_features=512, out_features=128, bias=True)
            (4): Dropout(p=0.1, inplace=False)
          )
        )
        (norm_1): Laye

### Task 3.3 Train and test translation model


In [24]:
PRINT_EVERY = 50

for epoch in range(5):   # each "epoch" is len(train_pairs)//BATCH_SIZE randomly sampled batches
    for step in range(len(train_pairs)//BATCH_SIZE):
        src, tgt_in, tgt_out = get_safe_batch('train', batch_size=BATCH_SIZE)

        logits = model(src, tgt_in)
        loss = criterion(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Print every few iterations
        if step % PRINT_EVERY == 0:
            print(f"Epoch {epoch} | Step {step}: Loss = {loss.item():.4f}")

print("✓ Training completed successfully!")


Epoch 0 | Step 0: Loss = 90.7166
Epoch 0 | Step 50: Loss = 4.9869
Epoch 0 | Step 100: Loss = 3.4599
Epoch 0 | Step 150: Loss = 3.0678
Epoch 0 | Step 200: Loss = 2.8170
Epoch 0 | Step 250: Loss = 2.7543
Epoch 0 | Step 300: Loss = 2.6322
Epoch 0 | Step 350: Loss = 2.6316
Epoch 0 | Step 400: Loss = 2.5281
Epoch 0 | Step 450: Loss = 2.5944
Epoch 0 | Step 500: Loss = 2.5094
Epoch 0 | Step 550: Loss = 2.5497
Epoch 0 | Step 600: Loss = 2.4533
Epoch 0 | Step 650: Loss = 2.5393
Epoch 0 | Step 700: Loss = 2.4462
Epoch 0 | Step 750: Loss = 2.5085
Epoch 0 | Step 800: Loss = 2.4955
Epoch 0 | Step 850: Loss = 2.4701
Epoch 0 | Step 900: Loss = 2.5037
Epoch 0 | Step 950: Loss = 2.4439
Epoch 0 | Step 1000: Loss = 2.3893
Epoch 0 | Step 1050: Loss = 2.5087
Epoch 0 | Step 1100: Loss = 2.4971
Epoch 0 | Step 1150: Loss = 2.4654
Epoch 0 | Step 1200: Loss = 2.4127
Epoch 0 | Step 1250: Loss = 2.4322
Epoch 0 | Step 1300: Loss = 2.4713
Epoch 0 | Step 1350: Loss = 2.5853
Epoch 0 | Step 1400: Loss = 2.4151
Epoch 0

In [26]:
# Set model to evaluation mode (disables dropout)
model.eval()

def translate_sentence(sentence, model, max_length=100):
    """
    Translates an English sentence to German using the trained T5Model.
    Uses greedy decoding.
    """
    # 1. Prepare Source (English)
    # Encode and add batch dimension [1, seq_len]
    src_ids = encode(sentence)
    src_tensor = torch.tensor(src_ids, dtype=torch.long).unsqueeze(0).to(device)

    # 2. Encoder Pass
    # We encode the source sequence once
    with torch.no_grad():
        encoder_output = model.encoder(src_tensor)

    # 3. Decoder Loop (Autoregressive)
    # Start with the Start-Of-Sequence token
    decoder_input = torch.tensor([[SOS_IDX]], dtype=torch.long).to(device)

    generated_tokens = []

    for _ in range(max_length):
        with torch.no_grad():
            # Pass current sequence and encoder output to decoder
            # decoder_input shape: [1, current_seq_len]
            decoder_output = model.decoder(decoder_input, encoder_output)

            # Project to vocabulary
            logits = model.lm_head(decoder_output)

            # Get the token with highest probability for the last position
            # logits shape: [1, current_seq_len, vocab_size]
            next_token_logits = logits[:, -1, :]
            next_token_id = torch.argmax(next_token_logits, dim=-1).item()

        # Stop if End-Of-Sequence token is generated
        if next_token_id == EOS_IDX:
            break

        generated_tokens.append(next_token_id)

        # Append prediction to decoder input for next step
        next_token_tensor = torch.tensor([[next_token_id]], dtype=torch.long).to(device)
        decoder_input = torch.cat([decoder_input, next_token_tensor], dim=1)

    # 4. Decode to string
    translated_text = decode(generated_tokens)
    return translated_text

# --- Test Translations ---
print("--- Translation Demo ---")
test_sentences = [
    "Hello",
    "Good morning",
    "I love you",
    "Where is the bathroom?",
    "This is difficult"
]

for s in test_sentences:
    trans = translate_sentence(s, model)
    print(f"EN: {s:25} -> DE: {trans}")

# Try a custom one
custom = "I am hungry"
print(f"\nCustom: {custom} -> {translate_sentence(custom, model)}")

--- Translation Demo ---
EN: Hello                     -> DE: Deee insmmmllieeellleen undeeeeeeeuueeeeeeeuueeeeeeeuuuneeeuuuneeeuuuneeeeuuuneeeeuuungslostensseenu
EN: Good morning              -> DE: Die Aussssuchungsnicht der Ausssuussssuchungen der Aussssuussssuchungsnisseondern und der Aussssusss
EN: I love you                -> DE: Ich mmmmennke, werdeeeinnnungssichtlich nicht geoneeeeeeineauseeeuuungsnicht werden, deeeineeellnnnn
EN: Where is the bathroom?    -> DE: Wir ssind schween wir sind schweendigeeeeeeeeeeeeeeeeeinein und deeinealle und schiedeuteeineicht un
EN: This is difficult         -> DE: Diess sind ssseisssontischeidungsssorgeeit deee in deeeeinsationalisiein ssteeetungssnotischeidungss

Custom: I am hungry -> Ich mmmin mich mich mmmin deer Schufendee ich eineemminsam Ich aus den Sie und der einealle Anschmit
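
The samples above look like German character sequences but are not translations, and the training loss alone does not quantify that. One simple way to score outputs, offered as a hedged suggestion and not as part of the assignment, is a character error rate: Levenshtein edit distance divided by the reference length. The function names `edit_distance` and `char_error_rate` and the example strings are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def char_error_rate(hypothesis, reference):
    """Edit distance normalized by reference length (0.0 is a perfect match)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)

print(char_error_rate("Ich bin hungrig", "Ich bin hungrig"))
print(char_error_rate("Ich bin hungrik", "Ich bin hungrig"))
```

Averaging `char_error_rate(translate_sentence(en, model), de)` over a sample of `val_pairs` would give a single number to compare between training runs.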


# BONUS (2 points): Solve any task with an LLM

**Goal.**  
Pick **one** of the homework tasks (Task 1, Task 2, or Task 3) and solve it using an **LLM**. Provide the **LLM** with the **task description** and any **starter/prerequisite code** it depends on, ask it to generate a complete **code solution** first, then run that **generated code** here in the code notebook yourself. Finally, document what you did and **compare** the LLM’s result to your own pipeline.

**What to deliver below.**
1) **LLM used** (name + version, e.g., “Llama-3-8B-Instruct”, “GPT-x”, “Claude-x”, “Mistral-x”, etc.).  
2) **Prompt(s)** you used.  
3) **LLM output** — copy and paste the generated code.  
4) **Comparison** to your solution: what matches or differs (quantitative or qualitative).  
5) **Reflection**: what the LLM was **good at** vs **bad at**, what it got **right** vs **wrong**.

> **No code required.** You do **not** need to run, share, or submit any code used for the LLM generation. Provide only the deliverables listed above.
> You may use any LLMs through any interface (API, web UI, local inference).


# **Solution**

1.  **LLM used:** GPT-5
2. **Prompts:** I attached the Setup and Dataset Preparation cells together with the helper code: PositionalEncoding, FeedForward, scaled_dot_product_attention and MultiHeadAttention. I then gave GPT-5 the description of Task 1 and asked it to solve the task using the helper code.
3. **LLM Output:**

In [27]:
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # root mean square
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        x = x / rms
        return self.weight * x

In [28]:
class T5EncoderBlock(nn.Module):
    def __init__(self, n_embd, n_head, ff_hidden, dropout=0.1):
        super().__init__()

        # --- Layer norms (RMSNorm for real T5) ---
        self.norm1 = RMSNorm(n_embd)
        self.norm2 = RMSNorm(n_embd)

        # --- Attention ---
        self.self_attn = MultiHeadAttention(n_embd, n_head, dropout=dropout)

        # --- Feedforward ---
        self.ff = FeedForward(n_embd, ff_hidden, dropout=dropout)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # ----------- Self-attention (pre-norm) -----------
        h = self.norm1(x)
        attn_out, _ = self.self_attn(h, h, h, mask=mask)
        x = x + self.dropout(attn_out)

        # ----------- Feed-forward (pre-norm) -----------
        h = self.norm2(x)
        ff_out = self.ff(h)
        x = x + self.dropout(ff_out)

        return x

In [29]:
class T5Encoder(nn.Module):
    def __init__(
        self,
        vocab_size,
        n_embd=256,
        n_head=4,
        ff_hidden=1024,
        num_layers=6,
        max_len=512,
        dropout=0.1
    ):
        super().__init__()

        self.token_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_enc = PositionalEncoding(n_embd, max_len)

        self.layers = nn.ModuleList([
            T5EncoderBlock(n_embd, n_head, ff_hidden, dropout)
            for _ in range(num_layers)
        ])

        self.norm = RMSNorm(n_embd)

    def forward(self, x, mask=None):
        # x: (batch, seq)
        x = self.token_emb(x)
        x = self.pos_enc(x)

        for layer in self.layers:
            x = layer(x, mask)

        return self.norm(x)

In [30]:
def test_encoder():
    batch = 2
    seqlen = 20
    vocab = vocab_size
    n_embd = 128
    n_head = 4
    ff_hidden = 256
    layers = 3

    encoder = T5Encoder(
        vocab_size=vocab,
        n_embd=n_embd,
        n_head=n_head,
        ff_hidden=ff_hidden,
        num_layers=layers,
        max_len=512,
        dropout=0.1
    ).to(device)

    # Random input
    x = torch.randint(0, vocab, (batch, seqlen)).to(device)

    # Mask: 1 = keep, 0 = pad
    mask = (x != PAD_IDX).unsqueeze(1).unsqueeze(2)

    print("Input shape:", x.shape)
    print("Mask shape:", mask.shape)

    out = encoder(x, mask)
    print("Output shape:", out.shape)

    # Check NaNs
    assert not torch.isnan(out).any(), "NaNs detected in encoder output!"

    # Gradient test
    out.sum().backward()
    print("Backprop OK.")

    print("✓ Encoder test passed!")

In [31]:
test_encoder()

Input shape: torch.Size([2, 20])
Mask shape: torch.Size([2, 1, 1, 20])
Output shape: torch.Size([2, 20, 128])
Backprop OK.
✓ Encoder test passed!


4. **Comparison:** The main difference in the LLM implementation is the use of RMSNorm. The LLM argued that T5 does not use standard LayerNorm but a root-mean-square normalization instead.

    The T5 encoder block follows the same logic as my original implementation: a pre-norm self-attention sublayer built on MultiHeadAttention, followed by a pre-norm feed-forward sublayer, each with a residual connection. The only change is that both sublayers are normalized with RMSNorm.

    The T5 encoder is likewise built the same way. However, the LLM sets its own default hyperparameters, and its `T5Encoder` neither defines a dropout layer nor applies dropout in `forward` (dropout appears only inside each block).

    In its accompanying text, the LLM also pointed out that T5 does not use positional encodings but a "relative position bias". It kept the positional encoding because of the provided helper function, and stated that the resulting implementation is closer to a BERT-style transformer encoder than to T5.

    Since testing was part of the task description, the LLM wrote its own test function with random input and a padding mask.
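
To make the "relative position bias" mentioned above more concrete, here is a minimal sketch in plain Python. It is not the LLM's code and not T5's exact scheme: T5 uses learned biases over log-spaced distance buckets, while this sketch simply clips the offset. The names `relative_bucket`, `add_position_bias` and the values in `bias_table` are hypothetical and chosen only for illustration.

```python
def relative_bucket(query_pos, key_pos, max_distance=2):
    """Map a (query, key) position pair to a bucket index by clipping the offset."""
    offset = key_pos - query_pos
    clipped = max(-max_distance, min(max_distance, offset))
    return clipped + max_distance  # shift into the range 0 .. 2 * max_distance


def add_position_bias(scores, bias_table, max_distance=2):
    """Add one scalar bias per bucket to a (seq x seq) matrix of attention logits."""
    n = len(scores)
    return [
        [scores[q][k] + bias_table[relative_bucket(q, k, max_distance)]
         for k in range(n)]
        for q in range(n)
    ]


seq_len, max_distance = 6, 2

# Assumption: hand-picked stand-ins for learned parameters, one per bucket.
bias_table = [-0.2, -0.1, 0.0, 0.1, 0.2]  # 2 * max_distance + 1 entries

# Zero logits so the printed matrix shows only the bias pattern.
scores = [[0.0] * seq_len for _ in range(seq_len)]
biased = add_position_bias(scores, bias_table, max_distance)

for row in biased:
    print(row)
```

The bias depends only on the offset between key and query, so every diagonal of the printed matrix holds a single value. This is what lets such a scheme replace absolute positional encodings added to the embeddings.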

          

---



5. **Reflection:** The core of the T5 encoder architecture is the same in both implementations, following the classic design with bidirectional self-attention. The main difference between the two is the LLM's use of root-mean-square normalization (RMSNorm) instead of LayerNorm.

    RMSNorm rescales each activation vector by its root mean square and, unlike LayerNorm, does not subtract the mean, which also makes it slightly cheaper to compute. Because the mean is not removed, whatever information it carries is preserved. This could make RMSNorm a better normalization choice for the encoder, since it may retain differences that are useful for training.
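
The difference between the two normalizations can be checked with a small numeric sketch. It uses plain Python rather than the notebook's `nn.Module` classes and leaves out the learned scale and shift parameters. The input values are made up for illustration. `eps` is added inside the square root, as in the `RMSNorm` class above.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm without learned scale/shift: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """RMSNorm without learned scale: divide by the root mean square only."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


# Illustrative activations with a clearly nonzero mean.
x = [2.0, 4.0, 6.0, 8.0]

ln = layer_norm(x)
rn = rms_norm(x)

ln_mean = sum(ln) / len(ln)
rn_mean = sum(rn) / len(rn)
rn_mean_sq = sum(v * v for v in rn) / len(rn)

print("LayerNorm output:", [round(v, 3) for v in ln], "mean:", round(ln_mean, 3))
print("RMSNorm output:  ", [round(v, 3) for v in rn], "mean:", round(rn_mean, 3))
```

LayerNorm centers the vector, so its output mean is zero. RMSNorm only rescales, so the output keeps a positive mean while its mean square is approximately one.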
    
