# Real-World Translation with Transformers

## 1. Introduction

In the previous notebook, we built the Encoder-Decoder architecture and trained it on a tiny synthetic dataset. Now, we will apply that same architecture to a **real-world translation task**.

We will build an **English-to-French translator**.

### What's New in This Notebook?

1.  **Real Data Pipeline**: Downloading, cleaning, and processing raw text files.
2.  **Vocabulary Building**: Mapping each distinct word to an integer index.
3.  **Batching & Padding**: Handling sentences of different lengths using `pad_sequence` and `collate_fn`.
4.  **Inference**: Translating English sentences with greedy, token-by-token decoding.

### The Dataset
We will use the English-French dataset from [Tatoeba](https://tatoeba.org/), hosted by [ManyThings.org](http://www.manythings.org/anki/). It contains sentence pairs ranging from very simple ("Go.") to complex.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
import math
import random
import numpy as np
import re
import unicodedata
import os
import urllib.request
import zipfile

# Setup Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set seeds
torch.manual_seed(42)
random.seed(42)

Using device: cuda


## 2. Data Acquisition and Cleaning

First, we download the dataset as a zip archive and extract it. The archive contains `fra.txt`, a tab-separated text file with one sentence pair per line.

In [3]:
import urllib.request
import os
import zipfile

def download_data():
    url = "http://www.manythings.org/anki/fra-eng.zip"
    filename = "fra-eng.zip"
    
    if not os.path.exists("fra.txt"):
        print("Downloading dataset...")
        
        # Create a request with a browser-like User-Agent header
        req = urllib.request.Request(
            url, 
            headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
        )
        
        # Download using urlopen instead of urlretrieve to support headers easily
        with urllib.request.urlopen(req) as response, open(filename, 'wb') as out_file:
            data = response.read()
            out_file.write(data)
            
        print("Extracting dataset...")
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall(".")
        print("Done!")
    else:
        print("Dataset already exists.")

download_data()

Downloading dataset...
Extracting dataset...
Done!


### Preprocessing Utils

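Real text is messy. We need to:
1.  **Unicode Normalization**: Decompose accented characters and drop the accent marks, so `é` becomes `e`.
2.  **Clean**: Lowercase the text, split `.`, `!`, and `?` off as separate tokens, and replace every other non-letter character (digits, apostrophes, hyphens) with a space.
3.  **Tokenize**: Split sentences into words on spaces.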

In [4]:
# Turn a Unicode string to plain ASCII
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    # Add space between punctuation and words (e.g., "hi!" -> "hi !")
    s = re.sub(r"([.!?])", r" \1", s)
    # Remove anything that isn't a letter or punctuation
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s.strip()

def read_text_file(filename, limit=None):
    print("Reading lines...")
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    
    pairs = []
    for line in lines:
        # File format: English \t French \t Attribution
        parts = line.split('\t')
        if len(parts) >= 2:
            eng = normalize_string(parts[0])
            fra = normalize_string(parts[1])
            pairs.append([eng, fra])
            
    # Limit data for faster training in this tutorial
    if limit:
        pairs = pairs[:limit]
        
    print(f"Read {len(pairs)} sentence pairs.")
    return pairs

# Load the first 50,000 pairs (enough for a good demo).
# The file appears to run from short to long sentences, so this subset skews short.
pairs = read_text_file("fra.txt", limit=50000)
print("Sample:", pairs[100])

Reading lines...
Read 50000 sentence pairs.
Sample: ['go now .', 'allez y maintenant .']
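
To see what this normalization does to French text, the short sketch below re-declares the two helpers so that it runs on its own and applies them to a few strings. The sample strings are made up for illustration. Note that accents and apostrophes are both lost, which is why `c'est` shows up as `c est` in the translations later on.

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Decompose accented characters (NFD), then drop the combining marks (category "Mn")
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)       # "hi!" -> "hi !"
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)   # any other run of characters becomes one space
    return s.strip()

# Made-up sample strings
examples = ["C'est l'été!", "Où vas-tu ?", "Hi!  How are you?"]
normalized = [normalize_string(s) for s in examples]
for raw, clean in zip(examples, normalized):
    print(f"{raw!r} -> {clean!r}")
```

Stripping accents keeps the vocabulary small, at the cost of merging words that differ only by an accent.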


## 3. Vocabulary Building

We need to convert words into numbers (indices). We'll create a `Vocabulary` class that:
1.  Assigns unique IDs to words.
2.  Handles special tokens: `<pad>`, `<sos>`, `<eos>`, `<unk>`.

In [5]:
class Vocabulary:
    def __init__(self):
        self.word2index = {}
        self.index2word = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "<unk>"}
        self.n_words = 4  # Start count

    def add_sentence(self, sentence):
        for word in sentence.split(' '):
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.n_words += 1
            
    def sentence_to_indices(self, sentence):
        return [self.word2index.get(word, 3) for word in sentence.split(' ')] # 3 is <unk>

# Build vocabularies
input_vocab = Vocabulary()
output_vocab = Vocabulary()

for pair in pairs:
    input_vocab.add_sentence(pair[0])
    output_vocab.add_sentence(pair[1])

print(f"English Vocab Size: {input_vocab.n_words}")
print(f"French Vocab Size: {output_vocab.n_words}")

English Vocab Size: 5865
French Vocab Size: 9942
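
The `Vocabulary` class above gives every distinct word an ID in order of first appearance. A common variant counts words first and maps rare ones to `<unk>`, which shrinks the vocabulary and gives the model some practice with unknown words. The following is a minimal sketch of that idea, not part of the notebook's pipeline; `build_vocab`, `encode`, and `min_freq` are hypothetical names chosen for illustration.

```python
from collections import Counter

SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]  # same IDs 0-3 as in the notebook

def build_vocab(sentences, min_freq=2):
    """Hypothetical helper: keep words seen at least `min_freq` times, most frequent first."""
    counts = Counter(word for s in sentences for word in s.split())
    word2index = {tok: i for i, tok in enumerate(SPECIALS)}
    for word, freq in counts.most_common():
        if freq < min_freq:
            break  # most_common() is sorted by count, so everything after is rarer
        word2index[word] = len(word2index)
    return word2index

def encode(sentence, word2index):
    # Words outside the vocabulary fall back to the <unk> index
    unk = word2index["<unk>"]
    return [word2index.get(w, unk) for w in sentence.split()]

# Toy corpus, made up for illustration
toy = ["i am happy .", "i am here .", "she is happy ."]
w2i = build_vocab(toy, min_freq=2)
print(w2i)
print(encode("she is here .", w2i))
```

With `min_freq=2`, the words `here`, `she`, and `is` each occur once in the toy corpus and are therefore encoded as `<unk>`.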


## 4. Dataset and Batching

Unlike the synthetic example where all sentences were the same length, real sentences vary. We must:
1.  Convert sentences to tensors.
2.  **Pad** shorter sentences in a batch to match the longest one.
3.  Use a `collate_fn` in the DataLoader to handle this padding dynamically.

In [6]:
class TranslationDataset(Dataset):
    def __init__(self, pairs, input_vocab, output_vocab):
        self.pairs = pairs
        self.input_vocab = input_vocab
        self.output_vocab = output_vocab

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        eng_text, fra_text = self.pairs[idx]
        
        # Convert to indices
        eng_indices = self.input_vocab.sentence_to_indices(eng_text)
        fra_indices = self.output_vocab.sentence_to_indices(fra_text)
        
        # Add <sos> and <eos> to the target only
        # Src: [word, word, ...]
        # Tgt: [<sos>, word, word, ..., <eos>]
        # (The source is left unwrapped; adding <sos>/<eos> there is optional for this model)
        
        eng_tensor = torch.tensor(eng_indices, dtype=torch.long)
        fra_tensor = torch.tensor([1] + fra_indices + [2], dtype=torch.long)
        
        return eng_tensor, fra_tensor

def collate_fn(batch):
    # Sort batch by source length. This habit comes from RNN pipelines that pack
    # padded sequences; the Transformer does not need it, so it is optional here.
    batch.sort(key=lambda x: len(x[0]), reverse=True)
    
    src_batch, tgt_batch = zip(*batch)
    
    # Pad sequences
    # padding_value=0 is our <pad> token
    src_padded = pad_sequence(src_batch, batch_first=True, padding_value=0)
    tgt_padded = pad_sequence(tgt_batch, batch_first=True, padding_value=0)
    
    return src_padded, tgt_padded

# Create DataLoaders
dataset = TranslationDataset(pairs, input_vocab, output_vocab)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, collate_fn=collate_fn)

# Check one batch
src_sample, tgt_sample = next(iter(dataloader))
print("Batch Shapes:", src_sample.shape, tgt_sample.shape)

Batch Shapes: torch.Size([64, 7]) torch.Size([64, 11])
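
To make the padding step concrete, the sketch below runs `pad_sequence` on three made-up index sequences and derives the boolean padding mask that the model builds later with `src == 0`. The token IDs are arbitrary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three made-up source sentences of different lengths (token IDs are arbitrary)
seqs = [
    torch.tensor([5, 6, 7, 8]),
    torch.tensor([9, 10]),
    torch.tensor([11, 12, 13]),
]

# Pad every sequence with 0 (<pad>) up to the length of the longest one
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded)

# True marks padded positions that attention should ignore
pad_mask = (padded == 0)
print(pad_mask)
```

Because padding happens per batch, each batch is only as wide as its own longest sentence, which is why the batch shapes above differ from one batch to the next.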


## 5. The Model (Transformer)

We reuse the Encoder-Decoder architecture. 

**Key Details:**
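-   **Positional Encoding**: Injects token-position information, since attention on its own is order-agnostic.
-   **Encoder**: Processes the padded source batch.
-   **Decoder**: Processes the target batch with masking.
-   **Masks**: A padding mask hides `<pad>` tokens (index 0) from attention, and a causal mask stops the decoder from looking at future target tokens.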
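
For reference, the sinusoidal positional encoding computed by the `PositionalEncoding` class can be written as follows, where $pos$ is the token position, $i$ indexes pairs of embedding dimensions, and $d_{\text{model}}$ is the embedding size:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

The `div_term` in the code is the factor $10000^{-2i/d_{\text{model}}}$, computed as $\exp\!\left(-2i \,\ln(10000) / d_{\text{model}}\right)$.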

In [7]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1), :]

class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=256, nhead=4, 
                 num_layers=2, dim_feedforward=512, dropout=0.1):
        super().__init__()
        
        self.embedding_src = nn.Embedding(src_vocab_size, d_model)
        self.embedding_tgt = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model)
        
        self.transformer = nn.Transformer(
            d_model=d_model, 
            nhead=nhead, 
            num_encoder_layers=num_layers, 
            num_decoder_layers=num_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        
        self.out_fc = nn.Linear(d_model, tgt_vocab_size)
        
    def create_src_mask(self, src):
        # Mask: True where value is 0 (padding)
        return (src == 0)
    
    def create_tgt_mask(self, tgt):
        # Padding mask
        tgt_pad_mask = (tgt == 0)
        
        # Causal mask (prevent looking forward)
        sz = tgt.size(1)
        tgt_mask = torch.triu(
            torch.full((sz, sz), float('-inf'), device=tgt.device), diagonal=1
        )
        
        return tgt_mask, tgt_pad_mask

    def forward(self, src, tgt):
        src_key_padding_mask = self.create_src_mask(src)
        tgt_mask, tgt_key_padding_mask = self.create_tgt_mask(tgt)
        
        # Embeddings + Positional
        src_emb = self.pos_encoder(self.embedding_src(src))
        tgt_emb = self.pos_encoder(self.embedding_tgt(tgt))
        
        # Transformer Pass
        # Note: memory_key_padding_mask tells decoder to ignore source padding
        outs = self.transformer(
            src=src_emb, 
            tgt=tgt_emb, 
            src_key_padding_mask=src_key_padding_mask, 
            tgt_mask=tgt_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            memory_key_padding_mask=src_key_padding_mask
        )
        
        return self.out_fc(outs)
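
The two kinds of mask are easy to confuse, so here is a small standalone sketch on a made-up target batch. It builds an additive causal mask in the same style as `create_tgt_mask` (0 where attention is allowed, `-inf` where it is blocked) and a boolean padding mask (True where the token is `<pad>`). The `blocked` view is only there to make the causal mask easier to read.

```python
import torch

# A made-up target batch: 2 sentences padded to length 4 (0 is <pad>, 1 is <sos>)
tgt = torch.tensor([
    [1, 7, 8, 9],
    [1, 5, 0, 0],
])

# Causal mask, shape [seq_len, seq_len]: -inf above the diagonal means
# "position i may not attend to any position j > i"
sz = tgt.size(1)
causal_mask = torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)
print(causal_mask)

# The same information as booleans (True = blocked), for readability only
blocked = torch.isinf(causal_mask)
print(blocked)

# Padding mask, shape [batch, seq_len]: True where the token is <pad>
pad_mask = (tgt == 0)
print(pad_mask)
```

The causal mask is shared by every sentence in the batch, while the padding mask differs per sentence.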

## 6. Training

We train the model with Adam and a cross-entropy loss. Passing `ignore_index=0` to the loss excludes padded target positions, so the model is neither rewarded nor penalized for whatever it predicts there.
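
Written out, with the default mean reduction, the loss averages the negative log-likelihood over the non-padding target positions only. Here $t$ ranges over all target positions in the batch, $y_t$ is the target token at position $t$, $y_{<t}$ are the earlier tokens of the same sentence, $x$ is the source sentence, and $p_\theta$ is the model's predicted distribution. The notation is introduced here for illustration and does not appear in the code.

$$
\mathcal{L} = -\frac{1}{\left|\{t : y_t \neq 0\}\right|} \sum_{t \,:\, y_t \neq 0} \log p_\theta\!\left(y_t \mid y_{<t},\, x\right)
$$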

In [8]:
# Config
SRC_VOCAB = input_vocab.n_words
TGT_VOCAB = output_vocab.n_words
EPOCHS = 10  # Increase this for better results (e.g., 20-30)

model = TransformerModel(SRC_VOCAB, TGT_VOCAB).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0) # Ignore <pad>
optimizer = optim.Adam(model.parameters(), lr=0.0001)

print(f"Training on {len(pairs)} sentences...")

model.train()
for epoch in range(EPOCHS):
    total_loss = 0
    for src, tgt in dataloader:
        src, tgt = src.to(device), tgt.to(device)
        
        # tgt_input: <sos> ... words
        tgt_input = tgt[:, :-1]
        # tgt_output: words ... <eos>
        tgt_output = tgt[:, 1:]

        optimizer.zero_grad()
        
        output = model(src, tgt_input)
        
        # Reshape for loss calculation
        # Output: [batch, seq_len, vocab_size] -> [batch*seq_len, vocab_size]
        loss = criterion(output.reshape(-1, TGT_VOCAB), tgt_output.reshape(-1))
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Prevent exploding grads
        optimizer.step()
        
        total_loss += loss.item()
    
    print(f"Epoch {epoch+1}/{EPOCHS} | Loss: {total_loss / len(dataloader):.4f}")

Training on 50000 sentences...
Epoch 1/10 | Loss: 3.8314
Epoch 2/10 | Loss: 2.6951
Epoch 3/10 | Loss: 2.2988
Epoch 4/10 | Loss: 2.0161
Epoch 5/10 | Loss: 1.7894
Epoch 6/10 | Loss: 1.6052
Epoch 7/10 | Loss: 1.4514
Epoch 8/10 | Loss: 1.3193
Epoch 9/10 | Loss: 1.2052
Epoch 10/10 | Loss: 1.1077
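
The `tgt[:, :-1]` and `tgt[:, 1:]` slices in the training loop implement the teacher-forcing shift: at each position the decoder reads the correct previous tokens and is trained to predict the next one. The sketch below shows the alignment on a single made-up sentence, using plain lists instead of tensors.

```python
# One target sentence as the Dataset would produce it (words shown instead of indices)
tgt = ["<sos>", "je", "suis", "heureux", ".", "<eos>"]

tgt_input = tgt[:-1]   # what the decoder reads
tgt_output = tgt[1:]   # what the decoder should predict at each step

for step, target in enumerate(tgt_output):
    context = " ".join(tgt_input[: step + 1])
    print(f"step {step}: decoder has seen [{context}] -> predict {target!r}")
```

In a padded batch, shorter sentences keep their `<eos>` in `tgt_input`; the matching `tgt_output` position is `<pad>`, which `ignore_index=0` drops from the loss.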


## 7. Inference / Translation

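Now we can try to translate new sentences. The process is:
1. Normalize the input string, split it into words, and map them to indices.
2. Run the encoder once to obtain the source representation (`memory`).
3. Autoregressively generate the output: start with `<sos>`, then repeatedly append the most likely next token until `<eos>` appears or `max_len` is reached.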

In [9]:
@torch.no_grad()  # inference only: skip gradient tracking to save memory and time
def translate(sentence, model, max_len=50):
    model.eval()
    
    # Preprocess input
    sentence = normalize_string(sentence)
    src_indices = input_vocab.sentence_to_indices(sentence)
    src_tensor = torch.tensor(src_indices, dtype=torch.long).unsqueeze(0).to(device)
    
    # Get Encoder memory
    src_mask = model.create_src_mask(src_tensor)
    src_emb = model.pos_encoder(model.embedding_src(src_tensor))
    memory = model.transformer.encoder(src_emb, src_key_padding_mask=src_mask)
    
    # Start decoder with <sos>
    tgt_indices = [1]
    
    for i in range(max_len):
        tgt_tensor = torch.tensor(tgt_indices, dtype=torch.long).unsqueeze(0).to(device)
        
        # Create masks for decoder
        tgt_mask, _ = model.create_tgt_mask(tgt_tensor)
        
        tgt_emb = model.pos_encoder(model.embedding_tgt(tgt_tensor))
        
        # Decoder forward pass
        out = model.transformer.decoder(
            tgt_emb, 
            memory, 
            tgt_mask=tgt_mask, 
            memory_key_padding_mask=src_mask
        )
        
        # Output projection
        out = model.out_fc(out)
        
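        # Greedy decoding: pick the highest-scoring token at the last position
        # (these are logits; argmax gives the same result with or without softmax)
        logits = out[0, -1, :]
        next_token = torch.argmax(logits).item()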
        
        if next_token == 2: # <eos>
            break
        
        tgt_indices.append(next_token)
    
    # Convert indices to words
    translated_words = [output_vocab.index2word.get(idx, "") for idx in tgt_indices[1:]]
    return " ".join(translated_words)

# Test on some sentences from the dataset (or new ones)
test_sentences = [
    "I am happy.",
    "She is my friend.",
    "Where are you going?",
    "This is a book."
]

print("--- Translations ---")
for s in test_sentences:
    print(f"En: {s}")
    print(f"Fr: {translate(s, model)}")
    print("-"*20)

--- Translations ---
En: I am happy.
Fr: je suis heureux .
--------------------
En: She is my friend.
Fr: elle est mon amie .
--------------------
En: Where are you going?
Fr: ou vas tu ?
--------------------
En: This is a book.
Fr: c est un livre .
--------------------
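
The `translate` function decodes greedily: at every step it keeps only the single highest-scoring token. Beam search instead keeps the `beam_width` best partial translations and picks the best complete one at the end. The sketch below is a minimal, model-agnostic version under stated assumptions: `next_log_probs` is a hypothetical callback standing in for one decoder step, and the toy probability table replaces a trained model so that the example runs on its own.

```python
import math

SOS, EOS = 1, 2  # same special-token IDs as the notebook

def beam_search(next_log_probs, beam_width=3, max_len=10):
    """Return the best-scoring token sequence, without <sos>/<eos>.

    `next_log_probs(prefix)` is a hypothetical callback returning a dict
    {token_id: log_probability} for the token that follows `prefix`.
    """
    beams = [([SOS], 0.0)]   # (tokens so far, cumulative log-probability)
    finished = []

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        # Keep only the best `beam_width` expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break

    # Fall back to unfinished beams if nothing produced <eos> within max_len
    pool = finished or beams
    # Divide by length so longer hypotheses are not penalized just for being longer
    best_tokens, _ = max(pool, key=lambda c: c[1] / len(c[0]))
    return [t for t in best_tokens if t not in (SOS, EOS)]

# Toy "model": the next-token distribution depends only on the last token
TABLE = {
    1: {10: 0.6, 11: 0.4},      # after <sos>
    10: {12: 0.55, 2: 0.45},
    11: {13: 0.95, 2: 0.05},
    12: {2: 1.0},
    13: {2: 1.0},
}

def toy_next_log_probs(prefix):
    return {tok: math.log(p) for tok, p in TABLE[prefix[-1]].items()}

greedy_result = beam_search(toy_next_log_probs, beam_width=1)
beam_result = beam_search(toy_next_log_probs, beam_width=2)
print("beam_width=1:", greedy_result)
print("beam_width=2:", beam_result)
```

In the toy table, token 10 looks best at the first step, but the sequence starting with token 11 has the higher overall probability (0.38 versus 0.33). A beam width of 1 behaves like greedy decoding and misses it; a width of 2 finds it.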


## 8. Conclusion

You have successfully trained a Transformer on a real translation dataset!

### Observations
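1.  **Overfitting**: With 50k sentences, a small model, and no held-out validation set, the model may memorize common phrases while still struggling with complex grammar, and the training loss alone cannot reveal this. More data and stronger dropout help; a validation split makes the problem visible.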
2.  **Training Time**: Real vocabularies (about 5,900 English and 9,900 French words here) enlarge the embedding tables and the final Linear layer, which increases computation compared to the synthetic example.
3.  **Preprocessing**: Much of the work in real-world NLP is just getting the data cleaned, tokenized, and batched correctly.
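
The training loop in this notebook uses all 50,000 pairs, so it cannot measure overfitting directly. One way to check is to hold out a slice of the data and compare training and validation loss. A minimal sketch follows; `split_pairs` and `val_fraction` are hypothetical names, and `toy_pairs` stands in for the real `pairs` list.

```python
import random

def split_pairs(pairs, val_fraction=0.1, seed=42):
    """Shuffle a copy of `pairs` and split it into (train, validation) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)   # local RNG: leaves the global seed untouched
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Toy stand-in for the real list of [english, french] pairs
toy_pairs = [[f"en {i}", f"fr {i}"] for i in range(20)]
train_pairs, val_pairs = split_pairs(toy_pairs)
print(len(train_pairs), len(val_pairs))
```

With such a split, the vocabularies would be built from `train_pairs` only, and the validation loss would be computed after each epoch under `model.eval()` and `torch.no_grad()`.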

### Next Steps to Improve
-   **Use Subword Tokenization**: We used simple whitespace splitting. Modern models use BPE (Byte Pair Encoding) or WordPiece to handle unknown words and morphology better.
-   **Beam Search**: Replace the `argmax` in the translation loop with beam search, which keeps several candidate translations at each step instead of one.
-   **Evaluation Metric**: Implement BLEU score to quantitatively measure translation quality.
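
As a starting point for the evaluation metric, here is a simplified sketch of corpus-level BLEU: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes one reference per hypothesis and applies no smoothing. `simple_bleu` is a name made up for this sketch, not a library function, and the sentences are invented examples.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU: one reference per hypothesis, no smoothing."""
    matched = [0] * max_n   # clipped n-gram matches, per n
    total = [0] * max_n     # n-grams in the hypotheses, per n
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp, ref = hyp.split(), ref.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Counter & Counter keeps the minimum count, i.e. the clipped matches
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matched) == 0:
        return 0.0   # without smoothing, any zero precision gives BLEU = 0

    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

# Invented examples: the second hypothesis differs from its reference by one word
hyps = ["je suis heureux .", "elle est mon ami ."]
refs = ["je suis heureux .", "elle est mon amie ."]
perfect = simple_bleu(refs, refs)
score = simple_bleu(hyps, refs)
print(f"identical: {perfect:.3f}  one word off: {score:.3f}")
```

For reported numbers, an established implementation such as sacreBLEU is the safer choice, since tokenization and smoothing details change the score.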