<a href="https://colab.research.google.com/github/Yuan-Yu-Han/PTAS/blob/main/machine_translation_using_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Machine Translation using transformers: Overview

Transformers are a groundbreaking neural network architecture that has significantly advanced the field of natural language processing (NLP), particularly in sequence-to-sequence tasks. They've become the backbone of many modern NLP applications, including machine translation, text summarization, and language modeling.

A defining feature of transformers is their ability to process entire input sequences simultaneously. Unlike traditional recurrent models, which handle one token at a time, transformers leverage a self-attention mechanism to analyze relationships between all elements in a sequence at once. This enables the model to focus on relevant parts of the input context more efficiently and with greater flexibility.
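
To make "all elements at once" concrete, here is a minimal sketch of scaled dot-product self-attention on a toy sequence. It is a simplification: queries, keys, and values are all set equal to the input, whereas a real transformer layer applies learned projections and several heads. The function name `toy_self_attention` is hypothetical and is not used elsewhere in this notebook.

```python
import math
import torch

def toy_self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d_model = x.size(-1)
    # Every token is scored against every other token in one matrix product.
    scores = x @ x.transpose(0, 1) / math.sqrt(d_model)  # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ x, weights

torch.manual_seed(0)
tokens = torch.rand(4, 8)  # 4 tokens, 8-dimensional embeddings
context, attn = toy_self_attention(tokens)
print(context.shape, attn.shape)
```

Each row of `attn` holds the weights one token assigns to all tokens in the sequence, so no token has to wait for a previous step as it would in a recurrent model.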

In this tutorial, we'll explore how to build a machine translation system using the transformer architecture.



In [1]:
# ✅ Install required packages with compatible versions for PyTorch-based English-to-French translation.
# Includes torch (plus torchvision and torchaudio), torchtext for vocabulary utilities,
# torchdata, portalocker, and a pinned NumPy version.


# Remove torchtune (if it is installed)
!pip uninstall torchtune -y

# Reinstall compatible versions
!pip install torch==2.3.0 torchvision torchaudio
!pip install torchtext==0.18.0 torchdata
!pip install 'portalocker>=2.0.0' --no-deps
!pip install numpy==1.26.4






In [2]:
!pip install gputil



In [3]:
# Confirm the torch version

import torch

torch.__version__

'2.3.0+cu121'

### Loading The Data

In [4]:
import torchtext

import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from torchtext.vocab import build_vocab_from_iterator
from tqdm import tqdm



##### Download the dataset

In [7]:
!pip install -U kagglehub
import kagglehub

# Download latest version
path = kagglehub.dataset_download("harishreddy18/english-to-french")

print("Path to dataset files:", path)

Collecting kagglehub
  Downloading kagglehub-0.4.2-py3-none-any.whl.metadata (38 kB)
Collecting kagglesdk<1.0,>=0.1.14 (from kagglehub)
  Downloading kagglesdk-0.1.15-py3-none-any.whl.metadata (13 kB)
Downloading kagglehub-0.4.2-py3-none-any.whl (69 kB)
Downloading kagglesdk-0.1.15-py3-none-any.whl (160 kB)
Installing collected packages: kagglesdk, kagglehub
  Attempting uninstall: kagglehub
    Found existing installation: kagglehub 0.3.13
    Uninstalling kagglehub-0.3.13:
      Successfully uninstalled kagglehub-0.3.13
Successfully installed kagglehub-0.4.2 kagglesdk-0.1.15


Using Colab cache for faster access to the 'english-to-french' dataset.
Path to dataset files: /kaggle/input/english-to-french


In [8]:
en_df = pd.read_csv('/kaggle/input/english-to-french/small_vocab_en.csv', header=None, usecols=[0])
fr_df = pd.read_csv('/kaggle/input/english-to-french/small_vocab_fr.csv', header=None, usecols=[0])

english_sentences = en_df[0].astype(str).tolist()
french_sentences = fr_df[0].astype(str).tolist()

print(english_sentences[0:7])
print(french_sentences[0:7])


['new jersey is sometimes quiet during autumn ', 'the united states is usually chilly during july ', 'california is usually quiet during march ', 'the united states is sometimes mild during june ', 'your least liked fruit is the grape ', 'his favorite fruit is the orange ', 'paris is relaxing during december ']
["new jersey est parfois calme pendant l' automne ", 'les états-unis est généralement froid en juillet ', 'california est généralement calme en mars ', 'les états-unis est parfois légère en juin ', 'votre moins aimé fruit est le raisin ', "son fruit préféré est l'orange ", 'paris est relaxant en décembre ']


### Vocabulary Building and Sentence Encoding

This section defines helper functions and builds vocabularies for the English (source) and French (target) datasets.

1. **`yield_tokens`**: A generator function that tokenizes each sentence by lowercasing and splitting on spaces.
2. **Vocabulary Creation**:
   - Uses `build_vocab_from_iterator` to create source (`src_vocab`) and target (`trg_vocab`) vocabularies from tokenized English and French sentences.
   - Adds special tokens: `<pad>` for padding, `<sos>` (start of sentence), and `<eos>` (end of sentence).
   - Sets the default index to 0, the index of the padding token (`<pad>`), so any out-of-vocabulary token is encoded as padding.
3. **`encode_sentence`**:
   - Converts a raw sentence into a list of token indices using the corresponding vocabulary.
   - Optionally prepends `<sos>` and appends `<eos>` tokens for use in sequence models.
   - Returns the encoded sentence as a PyTorch tensor.
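
The mapping from sentences to indices can be illustrated without torchtext. The sketch below is a plain-Python approximation: `build_vocab` and `encode` are hypothetical helpers, and the ordering rule (most frequent first, ties broken alphabetically) is an assumption made for this example rather than a statement about how `build_vocab_from_iterator` orders tokens.

```python
from collections import Counter

SPECIALS = ["<pad>", "<sos>", "<eos>"]

def build_vocab(sentences):
    """Map each token to an index; special tokens take indices 0, 1, 2."""
    counts = Counter(tok for s in sentences for tok in s.strip().lower().split())
    # Assumed ordering for this sketch: most frequent first, ties alphabetical.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(SPECIALS + ordered)}

def encode(sentence, stoi, add_specials=True):
    """Turn a raw sentence into a list of token indices."""
    tokens = sentence.strip().lower().split()
    if add_specials:
        tokens = ["<sos>"] + tokens + ["<eos>"]
    # Unknown tokens fall back to the <pad> index, mirroring set_default_index.
    return [stoi.get(tok, stoi["<pad>"]) for tok in tokens]

toy_vocab = build_vocab(["paris is quiet", "paris is mild"])
toy_ids = encode("Paris is snowy", toy_vocab)
print(toy_vocab)
print(toy_ids)  # [1, 4, 3, 0, 2]
```

The word "snowy" is not in the toy vocabulary, so it is encoded as 0, the same index as `<pad>`.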


In [9]:
def yield_tokens(sentences):
    for sentence in sentences:
        yield sentence.strip().lower().split()

SRC_PAD_IDX = 0
TRG_PAD_IDX = 0

src_vocab = build_vocab_from_iterator(yield_tokens(english_sentences), specials=["<pad>", "<sos>", "<eos>"])
src_vocab.set_default_index(SRC_PAD_IDX) # if a token is not found in vocab set it to 0.

trg_vocab = build_vocab_from_iterator(yield_tokens(french_sentences), specials=["<pad>", "<sos>", "<eos>"])
trg_vocab.set_default_index(TRG_PAD_IDX) # if a token is not found in vocab set it to 0.

def encode_sentence(sentence, vocab, add_specials=True):
    tokens = sentence.lower().strip().split()
    if add_specials:
        tokens = ['<sos>'] + tokens + ['<eos>']
    return torch.tensor(vocab(tokens), dtype=torch.long)


### Custom Dataset and DataLoader for Machine Translation

This section defines a PyTorch-compatible dataset and a collate function to prepare batches for training a transformer-based translation model.

1. **`TranslationDataset` Class**:
   - Inherits from `torch.utils.data.Dataset`.
   - Takes lists of source (`src`) and target (`trg`) sentences along with their respective vocabularies.
   - Implements `__len__` and `__getitem__`:
     - `__getitem__` encodes each source and target sentence into tensors of token indices using `encode_sentence`.

2. **`collate_fn` Function**:
   - A custom function used to combine individual samples into a batch.
   - Uses `pad_sequence` to pad sequences in the batch so they have equal lengths.
   - Ensures correct padding values (`SRC_PAD_IDX` and `TRG_PAD_IDX`) are used for source and target batches.

3. **`DataLoader` Setup**:
   - Wraps the dataset in a `DataLoader` to enable efficient batch processing.
   - Shuffles data and uses the custom `collate_fn` to handle variable-length sequences.
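
As a small illustration of what the collate step produces, the sketch below pads two toy sequences of different lengths. The token values are made up; only `pad_sequence` and its default sequence-first layout come from PyTorch.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # same padding index the notebook uses

# Two already-encoded sentences of different lengths (toy values).
seqs = [torch.tensor([1, 5, 6, 2]), torch.tensor([1, 7, 2])]

# Default batch_first=False, so the result has shape (max_seq_len, batch_size).
batch = pad_sequence(seqs, padding_value=PAD_IDX)
print(batch.shape)           # torch.Size([4, 2])
print(batch[:, 1].tolist())  # [1, 7, 2, 0]
```

The sequence-first layout matters later: the model, the masks, and the slicing in the training loop all assume that dimension 0 is time and dimension 1 is the batch.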


In [10]:
class TranslationDataset(Dataset):
    def __init__(self, src, trg, src_vocab, trg_vocab):
        self.src = src
        self.trg = trg
        self.src_vocab = src_vocab
        self.trg_vocab = trg_vocab

    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx):
        src_seq = encode_sentence(self.src[idx], self.src_vocab)
        trg_seq = encode_sentence(self.trg[idx], self.trg_vocab)
        return src_seq, trg_seq

def collate_fn(batch):
    src_batch, trg_batch = zip(*batch)
    src_batch = pad_sequence(src_batch, padding_value=SRC_PAD_IDX) # pads the seq to be the same length as the longest sequence
    trg_batch = pad_sequence(trg_batch, padding_value=TRG_PAD_IDX)
    return src_batch, trg_batch

dataset = TranslationDataset(english_sentences, french_sentences, src_vocab, trg_vocab)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)


### Split into train, validation, and test sets


In [11]:
from torch.utils.data import random_split

# Split ratios
train_ratio = 0.8
val_ratio = 0.1
test_ratio = 0.1

# Total size
dataset = TranslationDataset(english_sentences, french_sentences, src_vocab, trg_vocab)
total_size = len(dataset)

train_size = int(train_ratio * total_size)
val_size = int(val_ratio * total_size)
test_size = total_size - train_size - val_size

# Random split
train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])

# DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)


### Positional Encoding Module

This module adds **positional information** to input embeddings, enabling the transformer model to capture the order of tokens in a sequence.

- Uses **sine and cosine functions** of different frequencies to generate position-dependent vectors.
- Applies dropout for regularization.
- The positional encodings are **precomputed and stored** as a buffer to avoid recomputation during training.

The encoding is added to the input embeddings before they are passed into the transformer layers.


### Positional Encoding Formulas

For each position `pos` and pair index `i` (with `i = 0, 1, ..., d_model/2 - 1`), the positional encoding is defined as:

- **Even dimensions** (index `2i`, i.e. 0, 2, 4, ...):

$$
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$

- **Odd dimensions** (index `2i + 1`, i.e. 1, 3, 5, ...):

$$
PE(pos, 2i + 1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$$

---

### What this means:

- Each position `pos` has a **unique encoding vector**
- Each sine/cosine pair `i` uses a **different frequency**
- For a fixed offset `k`, the encoding at `pos + k` is a linear function of the encoding at `pos`, which makes it easier for the model to **use relative positions**
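
The formulas can be checked by hand with a direct, unvectorized implementation. The sketch below uses only the standard library; `positional_encoding` is a hypothetical helper written for this check and is separate from the `PositionalEncoding` module.

```python
import math

def positional_encoding(pos, d_model):
    """Return the encoding vector for one position, following the formulas literally."""
    vec = []
    for k in range(d_model):
        i = k // 2  # pair index: dimensions 2i and 2i + 1 share one frequency
        angle = pos / (10000 ** (2 * i / d_model))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec

pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)  # [0.0, 1.0, 0.0, 1.0]
print(pe1)
```

At position 0 every sine term is 0 and every cosine term is 1. At position 1 the first pair uses angle 1 and the second pair uses angle 0.01, which shows the frequency dropping as the pair index grows.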


In [12]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        pos_encoding = torch.zeros(max_len, d_model)
        positions = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))

        pos_encoding[:, 0::2] = torch.sin(positions * div_term) # for all 2i
        pos_encoding[:, 1::2] = torch.cos(positions * div_term) # for all 2i + 1

        self.register_buffer('pe', pos_encoding.unsqueeze(1))

    def forward(self, x):
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)


### Sequence-to-Sequence Transformer Model

This class defines the core **Transformer-based encoder-decoder architecture** for sequence-to-sequence tasks such as machine translation.

Key components:
- **Embedding Layers**: Convert token indices from the source and target sequences into dense vector representations.
- **Positional Encoding**: Adds position information to the embeddings to help the model understand word order.
- **`nn.Transformer`**: The main transformer block containing multi-head attention, feedforward networks, and residual connections.
  - Configurable number of encoder and decoder layers.
- **Output Generator**: A linear layer that maps the transformer's output to the target vocabulary size for prediction.

The `forward` method takes:
- `src`, `tgt`: Source and target token sequences.
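- `src_mask`, `tgt_mask`: Square attention masks over sequence positions. `tgt_mask` is the causal mask that blocks attention to future tokens; `src_mask` leaves every source position visible.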
- `src_padding_mask`, `tgt_padding_mask`: Masks to ignore padding in the attention mechanism.

It returns the output logits for the target sequence.
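
A quick way to see the tensor shapes involved is to run `nn.Transformer` on random embeddings. The sizes below (16-dimensional embeddings, 2 heads, one layer each) are arbitrary toy values chosen for this sketch, not the notebook's hyperparameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A deliberately tiny transformer; default layout is (seq_len, batch, d_model).
toy_transformer = nn.Transformer(d_model=16, nhead=2,
                                 num_encoder_layers=1, num_decoder_layers=1,
                                 dim_feedforward=32, dropout=0.0)

src_emb = torch.rand(5, 3, 16)  # 5 source positions, batch of 3
tgt_emb = torch.rand(4, 3, 16)  # 4 target positions, batch of 3

out = toy_transformer(src_emb, tgt_emb)
print(out.shape)  # torch.Size([4, 3, 16])
```

The output has one vector per target position, which is why the final linear layer can map it directly to one row of vocabulary logits per target token.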


In [13]:
class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_encoder_layers, num_decoder_layers, emb_size, nhead, src_vocab_size, tgt_vocab_size, dim_feedforward=512, dropout=0.1):
        super().__init__()
        self.transformer = nn.Transformer(d_model=emb_size, nhead=nhead,
                                          num_encoder_layers=num_encoder_layers,
                                          num_decoder_layers=num_decoder_layers,
                                          dim_feedforward=dim_feedforward,
                                          dropout=dropout)

        self.src_tok_emb = nn.Embedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = nn.Embedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)

        self.generator = nn.Linear(emb_size, tgt_vocab_size)

    def forward(self, src, tgt, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(tgt))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, None)
        return self.generator(outs)


### Mask Generation for Transformer

Transformers require masks to control how tokens attend to one another during training, especially for tasks like language modeling and translation. This section defines two functions:

1. **`generate_square_subsequent_mask(sz)`**:
   - Creates an **upper triangular matrix** filled with `-inf` above the diagonal.
   - Prevents the decoder from attending to future tokens during training (auto-regressive behavior).

2. **`create_mask(src, tgt)`**:
   - Generates the necessary masks for both source and target sequences:
     - `src_mask`: A placeholder (all `False`, meaning nothing is masked) since the encoder attends to the full input.
     - `tgt_mask`: Prevents the decoder from peeking ahead using `generate_square_subsequent_mask`.
     - `src_padding_mask` and `tgt_padding_mask`: Identify padded positions in the input, so they can be ignored during attention computation.

These masks are passed to the transformer during training and inference to ensure correct attention behavior.
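
The two kinds of mask are easy to inspect on a toy batch. The sketch below builds a boolean causal mask directly (`causal_mask` is a hypothetical helper, equivalent in effect to converting the `-inf` mask to `bool`) and a padding mask for a made-up target batch. In both masks, `True` means "do not attend here".

```python
import torch

PAD_IDX = 0

def causal_mask(sz):
    """Boolean mask: True strictly above the diagonal, i.e. on future positions."""
    return torch.triu(torch.ones(sz, sz, dtype=torch.bool), diagonal=1)

# Toy target batch in (seq_len=4, batch=2) layout; the first sentence ends in a pad.
tgt = torch.tensor([[1, 1],
                    [5, 6],
                    [2, 7],
                    [0, 2]])

mask = causal_mask(3)
pad_mask = (tgt == PAD_IDX).transpose(0, 1)  # (batch, seq_len)

print(mask.tolist())
# [[False, True, True], [False, False, True], [False, False, False]]
print(pad_mask.tolist())
# [[False, False, False, True], [False, False, False, False]]
```

Row `t` of the causal mask says which positions step `t` may not look at, while each row of the padding mask flags the padded slots of one sentence in the batch.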


In [14]:
def generate_square_subsequent_mask(sz):
    return torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1) # -inf strictly above the main diagonal, 0 on and below it

def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    # Ensure tgt_mask and src_mask are both bool type
    tgt_mask = generate_square_subsequent_mask(tgt_seq_len).type(torch.bool) # mask future tokens (all -infs are True)
    src_mask = torch.zeros((src_seq_len, src_seq_len), dtype=torch.bool) # all tokens of src is known (all False)

    # Padding masks for src and tgt (batch, seq_len)
    src_padding_mask = (src == SRC_PAD_IDX).transpose(0, 1) # mask for pad tokens
    tgt_padding_mask = (tgt == TRG_PAD_IDX).transpose(0, 1)

    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask


### Training
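
The training loop relies on teacher forcing: the decoder receives the target sequence without its last position and is asked to predict the same sequence shifted one step to the left. The sketch below shows that shift on a toy batch; the token ids 10 to 21 are made up for illustration.

```python
import torch

PAD, SOS, EOS = 0, 1, 2

# (seq_len=5, batch=2): column 0 holds three words, column 1 holds two words plus one pad.
tgt = torch.tensor([[SOS, SOS],
                    [10,  20],
                    [11,  21],
                    [12,  EOS],
                    [EOS, PAD]])

tgt_input = tgt[:-1, :]  # what the decoder sees
tgt_out = tgt[1:, :]     # what the decoder should predict at each step

print(tgt_input[:, 0].tolist(), tgt_out[:, 0].tolist())
# [1, 10, 11, 12] [10, 11, 12, 2]
print(tgt_input[:, 1].tolist(), tgt_out[:, 1].tolist())
# [1, 20, 21, 2] [20, 21, 2, 0]
```

For the shorter sentence, the slice removes a pad rather than `<eos>`, and the final target is a pad. That position contributes nothing to the loss because `CrossEntropyLoss` is created with `ignore_index=TRG_PAD_IDX`.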

In [15]:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, random_split
from tqdm import tqdm
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction

# Accuracy function
def calculate_accuracy(logits, target):
    pred = logits.argmax(dim=-1)
    non_pad = target != TRG_PAD_IDX
    correct = (pred == target) & non_pad
    return correct.sum().item() / non_pad.sum().item()

# Setup
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

SRC_PAD_IDX = src_vocab['<pad>']
SRC_SOS_IDX = src_vocab['<sos>']
SRC_EOS_IDX = src_vocab['<eos>']

TRG_PAD_IDX = trg_vocab['<pad>']
TRG_SOS_IDX = trg_vocab['<sos>']
TRG_EOS_IDX = trg_vocab['<eos>']

SRC_VOCAB_SIZE = len(src_vocab)
TGT_VOCAB_SIZE = len(trg_vocab)
EMB_SIZE = 256
NHEAD = 8
FFN_HID_DIM = 512
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
BATCH_SIZE = 32
NUM_EPOCHS = 3

model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                           NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM).to(DEVICE)

loss_fn = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

itos = trg_vocab.get_itos()
smoothie = SmoothingFunction().method4 # smoothing for the BLEU score

# Training loop
for epoch in range(NUM_EPOCHS):
    model.train()
    total_loss = 0
    total_acc = 0
    total_bleu = 0
    total_samples = 0

    for src, tgt in tqdm(train_loader, desc=f"Epoch {epoch+1} - Train"):
        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        tgt_input = tgt[:-1, :]  # decoder input: drop the last position (<eos> for the longest sequence)
        tgt_out = tgt[1:, :]     # prediction targets: drop the leading <sos>

        src_mask, tgt_mask, src_pad_mask, tgt_pad_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask.to(DEVICE), tgt_mask.to(DEVICE),
                       src_pad_mask.to(DEVICE), tgt_pad_mask.to(DEVICE))

        optimizer.zero_grad()
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_acc += calculate_accuracy(logits, tgt_out)

        pred_tokens = logits.argmax(-1).transpose(0, 1).tolist()
        target_tokens = tgt_out.transpose(0, 1).tolist()

        for pred, ref in zip(pred_tokens, target_tokens):
            pred_clean = [itos[tok] for tok in pred if tok not in {TRG_PAD_IDX, TRG_SOS_IDX, TRG_EOS_IDX}]
            ref_clean = [itos[tok] for tok in ref if tok not in {TRG_PAD_IDX, TRG_SOS_IDX, TRG_EOS_IDX}]
            if len(ref_clean) > 0 and len(pred_clean) > 0:
                bleu = sentence_bleu([ref_clean], pred_clean, smoothing_function=smoothie)
                total_bleu += bleu
                total_samples += 1

    avg_train_loss = total_loss / len(train_loader)
    avg_train_acc = total_acc / len(train_loader)
    avg_train_bleu = (total_bleu / total_samples) * 100 if total_samples > 0 else 0

    # ------------------ Validation -------------------
    model.eval()
    val_loss, val_acc = 0, 0
    val_bleu = 0
    val_samples = 0

    with torch.no_grad():
        for src, tgt in tqdm(val_loader, desc=f"Epoch {epoch+1} - Val"):
            src, tgt = src.to(DEVICE), tgt.to(DEVICE)
            tgt_input = tgt[:-1, :]
            tgt_out = tgt[1:, :]

            src_mask, tgt_mask, src_pad_mask, tgt_pad_mask = create_mask(src, tgt_input)

            logits = model(src, tgt_input, src_mask.to(DEVICE), tgt_mask.to(DEVICE),
                           src_pad_mask.to(DEVICE), tgt_pad_mask.to(DEVICE))

            loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
            val_loss += loss.item()
            val_acc += calculate_accuracy(logits, tgt_out)

            pred_tokens = logits.argmax(-1).transpose(0, 1).tolist()
            target_tokens = tgt_out.transpose(0, 1).tolist()

            for pred, ref in zip(pred_tokens, target_tokens):
                pred_clean = [itos[tok] for tok in pred if tok not in {TRG_PAD_IDX, TRG_SOS_IDX, TRG_EOS_IDX}]
                ref_clean = [itos[tok] for tok in ref if tok not in {TRG_PAD_IDX, TRG_SOS_IDX, TRG_EOS_IDX}]
                if len(ref_clean) > 0 and len(pred_clean) > 0:
                    bleu = sentence_bleu([ref_clean], pred_clean, smoothing_function=smoothie)
                    val_bleu += bleu
                    val_samples += 1

    avg_val_loss = val_loss / len(val_loader)
    avg_val_acc = val_acc / len(val_loader)
    avg_val_bleu = (val_bleu / val_samples) * 100 if val_samples > 0 else 0

    print(f"Epoch {epoch+1}:")
    print(f"  Train Loss = {avg_train_loss:.4f}, Train Acc = {avg_train_acc:.4f}, Train BLEU = {avg_train_bleu:.2f}")
    print(f"  Val   Loss = {avg_val_loss:.4f}, Val   Acc = {avg_val_acc:.4f}, Val   BLEU = {avg_val_bleu:.2f}")


Epoch 1 - Train: 100%|██████████| 3447/3447 [01:27<00:00, 39.47it/s]
Epoch 1 - Val: 100%|██████████| 431/431 [00:04<00:00, 96.59it/s]


Epoch 1:
  Train Loss = 0.4568, Train Acc = 0.9037, Train BLEU = 75.57
  Val   Loss = 0.0807, Val   Acc = 0.9745, Val   BLEU = 88.28


Epoch 2 - Train: 100%|██████████| 3447/3447 [01:25<00:00, 40.40it/s]
Epoch 2 - Val: 100%|██████████| 431/431 [00:04<00:00, 101.24it/s]


Epoch 2:
  Train Loss = 0.0810, Train Acc = 0.9753, Train BLEU = 87.52
  Val   Loss = 0.0537, Val   Acc = 0.9818, Val   BLEU = 90.45


Epoch 3 - Train: 100%|██████████| 3447/3447 [01:25<00:00, 40.48it/s]
Epoch 3 - Val: 100%|██████████| 431/431 [00:04<00:00, 88.87it/s]

Epoch 3:
  Train Loss = 0.0572, Train Acc = 0.9809, Train BLEU = 88.74
  Val   Loss = 0.0441, Val   Acc = 0.9848, Val   BLEU = 91.11





### Evaluation on test set
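
The BLEU loops in this notebook take the argmax at every time step and then filter out `<pad>`, `<sos>`, and `<eos>`. Whatever the model predicts at padded positions after the end of a sentence is kept unless it happens to be a special token. One possible variant, shown only as a sketch and not what produced the scores reported here, is to cut the prediction at the first `<eos>`. The helper `clean_prediction` is hypothetical.

```python
def clean_prediction(token_ids, pad_idx=0, sos_idx=1, eos_idx=2):
    """Keep tokens up to (not including) the first <eos>, skipping <pad> and <sos>."""
    cleaned = []
    for tok in token_ids:
        if tok == eos_idx:
            break  # everything after <eos> is ignored
        if tok not in (pad_idx, sos_idx):
            cleaned.append(tok)
    return cleaned

pred = [7, 8, 9, 2, 8, 8]  # toy prediction: three words, <eos>, then two extra tokens
filtered_only = [tok for tok in pred if tok not in {0, 1, 2}]

print(filtered_only)           # [7, 8, 9, 8, 8]
print(clean_prediction(pred))  # [7, 8, 9]
```

Note also that these scores are computed under teacher forcing, so they measure next-token prediction given the reference prefix rather than free-running translation quality.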

In [16]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

itos = trg_vocab.get_itos()

# Test evaluation
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)

model.eval()
test_loss, test_acc = 0, 0
total_bleu = 0
total_samples = 0

with torch.no_grad():
    for src, tgt in tqdm(test_loader, desc="Test Set"):
        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        tgt_input = tgt[:-1, :]
        tgt_out = tgt[1:, :]

        src_mask, tgt_mask, src_pad_mask, tgt_pad_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask.to(DEVICE), tgt_mask.to(DEVICE),
                       src_pad_mask.to(DEVICE), tgt_pad_mask.to(DEVICE))

        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        test_loss += loss.item()
        test_acc += calculate_accuracy(logits, tgt_out)

        # BLEU computation
        pred_tokens = logits.argmax(-1).transpose(0, 1).tolist()
        target_tokens = tgt_out.transpose(0, 1).tolist()

        for pred, ref in zip(pred_tokens, target_tokens):
            pred_clean = [itos[tok] for tok in pred if tok not in {TRG_PAD_IDX, TRG_SOS_IDX, TRG_EOS_IDX}]
            ref_clean = [itos[tok] for tok in ref if tok not in {TRG_PAD_IDX, TRG_SOS_IDX, TRG_EOS_IDX}]
            if len(ref_clean) > 0 and len(pred_clean) > 0:
                bleu = sentence_bleu([ref_clean], pred_clean, smoothing_function=smoothie)
                total_bleu += bleu
                total_samples += 1

avg_test_loss = test_loss / len(test_loader)
avg_test_acc = test_acc / len(test_loader)
avg_test_bleu = (total_bleu / total_samples) * 100 if total_samples > 0 else 0

print(f"\n📊 Test Loss: {avg_test_loss:.4f}, Test Accuracy: {avg_test_acc:.4f}, Test BLEU: {avg_test_bleu:.2f}")


Test Set: 100%|██████████| 431/431 [00:04<00:00, 101.39it/s]


📊 Test Loss: 0.0417, Test Accuracy: 0.9856, Test BLEU: 91.14





### Checking on samples from test set

In [17]:
def translate(model, sentence, max_len=50):
    """Greedy decoding: repeatedly pick the most likely next token until <eos> or max_len."""
    model.eval()
    itos = trg_vocab.get_itos()
    sos_idx, eos_idx = trg_vocab['<sos>'], trg_vocab['<eos>']

    src = encode_sentence(sentence, src_vocab).unsqueeze(1).to(DEVICE)
    num_tokens = src.shape[0]
    src_mask = torch.zeros((num_tokens, num_tokens), dtype=torch.bool, device=DEVICE)

    with torch.no_grad():  # inference only, so no gradients are needed
        memory = model.transformer.encoder(model.positional_encoding(model.src_tok_emb(src)), src_mask)
        ys = torch.tensor([[sos_idx]], dtype=torch.long, device=DEVICE)

        for _ in range(max_len):
            tgt_mask = generate_square_subsequent_mask(ys.size(0)).to(DEVICE)
            out = model.transformer.decoder(model.positional_encoding(model.tgt_tok_emb(ys)), memory, tgt_mask)
            out = model.generator(out)
            next_word = out[-1, 0].argmax().item()
            ys = torch.cat([ys, torch.tensor([[next_word]], device=DEVICE)], dim=0)
            if next_word == eos_idx:
                break

    # Drop <sos>; drop the final token only if it really is <eos>
    tokens = ys.squeeze(1).tolist()[1:]
    if tokens and tokens[-1] == eos_idx:
        tokens = tokens[:-1]
    return ' '.join(itos[tok] for tok in tokens)


In [18]:
print("\nTest Set Samples")

for i in range(3):
    src_sentence = test_dataset[i][0]
    tgt_sentence = test_dataset[i][1]

    # Decode token IDs back to words using vocab
    src_text = ' '.join([src_vocab.get_itos()[tok] for tok in src_sentence if tok != SRC_PAD_IDX])
    tgt_text = ' '.join([trg_vocab.get_itos()[tok] for tok in tgt_sentence if tok != TRG_PAD_IDX])

    print(f"\nSample {i+1}:")
    print(f"  English (src):  {src_text}")
    print(f"  Predicted French (tgt): {translate(model, src_text)}")
    print(f"  GT French  (tgt):  {tgt_text}")



Test Set Samples

Sample 1:
  English (src):  <sos> pears are my least liked fruit . <eos>
  Predicted French (tgt): les poires sont mes fruits moins aimé .
  GT French  (tgt):  <sos> les poires sont mes fruits moins aimé . <eos>

Sample 2:
  English (src):  <sos> france is snowy during september <eos>
  Predicted French (tgt): la france est la neige au mois de septembre
  GT French  (tgt):  <sos> la france est la neige au mois de septembre <eos>

Sample 3:
  English (src):  <sos> the peach is your least favorite fruit <eos>
  Predicted French (tgt): la pêche est votre fruit préféré moins
  GT French  (tgt):  <sos> la pêche est votre fruit préféré moins <eos>
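
One detail worth noticing in these samples: the decoded source text still contains the literal `<sos>` and `<eos>` strings, and `translate` passes it through `encode_sentence`, which wraps the sentence in special tokens a second time. A hedged sketch of removing them before translating is shown below; `strip_specials` is a hypothetical helper, and the outputs above were produced without it.

```python
SPECIAL_TOKENS = {"<pad>", "<sos>", "<eos>"}

def strip_specials(text):
    """Remove special-token strings from a whitespace-tokenized sentence."""
    return " ".join(tok for tok in text.split() if tok not in SPECIAL_TOKENS)

decoded = "<sos> france is snowy during september <eos>"
print(strip_specials(decoded))  # france is snowy during september
```

With such a helper, the call in the sample loop could become `translate(model, strip_specials(src_text))`, so the model would see the same input format it was trained on.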


#### To-Do

Q. Visualize results for other samples from the test set.

Q. Try different epoch values (e.g., 4, 5, 6) and observe how the loss changes.

Q. Experiment with different architecture configurations, such as changing the number of encoder and decoder layers.

Q. Try different batch sizes.

Q. If you find a different translation dataset, try the model on it.

### Resources

[The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/)


### [Quiz](https://docs.google.com/forms/d/e/1FAIpQLScM0wn9cLVWCQvJemeOWlrK_RhGBmdeKOJBaqFkERFGmyqK3w/viewform?usp=dialog)