A Transformer-based Neural Machine Translation model that translates English to Italian, trained from scratch on the opus_books corpus using the architecture described in "Attention Is All You Need" (Vaswani et al., 2017).
- Overview
- Architecture
- Training Results
- Project Structure
- Quick Start
- Configuration
- Sample Translations
- Key Design Decisions
- Future Work
This project implements a complete neural machine translation pipeline — from raw text to trained model — without relying on pre-trained weights. The goal was to deeply understand the Transformer architecture by building and training every component from the ground up.
Key highlights:
- Built the full Transformer encoder–decoder architecture in PyTorch
- Trained BPE (Byte-Pair Encoding) tokenizers from scratch using HuggingFace `tokenizers`
- Trained for 20 epochs on an NVIDIA Tesla P100 GPU (~4h 35m total)
- Loss reduced from 6.17 → 2.28 (best checkpoint: 2.276 at epoch 18)
- Qualitative translations improve visibly epoch-by-epoch (see Sample Translations)
```
Input (English)                              Output (Italian)
      │                                            ▲
      ▼                                     ┌──────┴──────┐
Token Embedding + Positional Encoding       │ Projection  │
      │                                     │  + Softmax  │
      ▼                                     └─────────────┘
┌──────────────┐                                   ▲
│   Encoder    │      Encoder Output        ┌──────┴──────┐
│  (6 layers)  │ ──────────────────────────►│   Decoder   │
└──────────────┘                            │  (6 layers) │
                                            └─────────────┘
                                                   ▲
                                                   │
                           Token Embedding + Positional Encoding
                                                   │
                               Target (Italian) shifted right
```
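The "Token Embedding + Positional Encoding" step can be sketched with the sinusoidal encoding from the paper. This is a minimal illustration, assuming the implementation in `src/model.py` follows the paper's formula (the real module may add dropout or embedding scaling):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=350, d_model=512)
print(pe.shape)  # torch.Size([350, 512])
```

The encoding is added to the token embeddings so the otherwise position-blind attention layers can distinguish word order.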
Each Encoder block consists of:
- Multi-Head Self-Attention (8 heads)
- Add & Norm (residual connection + layer normalisation)
- Position-wise Feed-Forward Network (d_ff = 2048)
- Add & Norm
Each Decoder block uses the same components, plus:
- Masked Multi-Head Self-Attention (causal masking prevents attending to future positions)
- Multi-Head Cross-Attention (attends to the encoder output)
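The encoder block structure above can be sketched as follows. Note this is an illustrative sketch only: the repository builds attention from scratch in `src/model.py`, whereas this example leans on `torch.nn.MultiheadAttention` for brevity, and assumes post-norm residuals as in the original paper:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: self-attention -> Add & Norm -> FFN -> Add & Norm."""
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Multi-Head Self-Attention, then Add & Norm
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise FFN, then Add & Norm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

block = EncoderBlock()
out = block(torch.randn(2, 10, 512))  # (batch, seq, d_model)
print(out.shape)  # torch.Size([2, 10, 512])
```

A decoder block would add the masked self-attention and cross-attention sublayers before the FFN, each with its own Add & Norm.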
| Parameter | Value |
|---|---|
| Model dimension (d_model) | 512 |
| Feed-forward dimension (d_ff) | 2048 |
| Attention heads (h) | 8 |
| Encoder / Decoder layers (N) | 6 |
| Dropout | 0.1 |
| Max sequence length | 350 |
| Batch size | 16 |
| Learning rate | 1e-4 (Adam, ε=1e-9) |
| Label smoothing | 0.1 |
| Epochs | 20 |
| Hardware | NVIDIA Tesla P100 (Kaggle) |
Training was conducted on the full opus_books en–it split (~29k sentence pairs, 90/10 train/val).
| Epoch | Train Loss | Epoch | Train Loss |
|---|---|---|---|
| 1 | 6.173 | 11 | 3.726 |
| 2 | 5.004 | 12 | 3.738 |
| 3 | 5.172 | 13 | 3.135 |
| 4 | 4.990 | 14 | 3.423 |
| 5 | 4.682 | 15 | 2.942 |
| 6 | 4.502 | 16 | 3.075 |
| 7 | 4.372 | 17 | 2.720 |
| 8 | 4.648 | 18 | 2.276 ✓ |
| 9 | 3.900 | 19 | 2.676 |
| 10 | 4.237 | 20 | 2.287 |
Best checkpoint: Epoch 18 — loss = 2.276
Total training time: ~4 hours 35 minutes on Tesla P100
Steps per epoch: 1,819 @ ~2.22 batch/s
The non-monotonic loss curve (e.g., the upticks at epochs 8 and 10) is typical when training with Adam and label smoothing on a small literary corpus — the model briefly overfits before generalising.
```
TranslationLM/
│
├── src/
│   ├── config.py            # All hyperparameters and file path helpers
│   ├── model.py             # Full Transformer architecture (from scratch)
│   ├── dataset.py           # Bilingual dataset + BPE tokenizer training
│   ├── train.py             # Training loop with TensorBoard + checkpointing
│   └── translate.py         # Inference script (CLI)
│
├── notebooks/
│   └── translationlm.ipynb  # Original Kaggle training notebook
│
├── requirements.txt
├── .gitignore
├── LICENSE                  # Apache 2.0
└── README.md
```
```bash
git clone https://github.com/atandra2000/TranslationLM.git
cd TranslationLM
pip install -r requirements.txt
```

```bash
python src/train.py
```

This will:
- Download the `opus_books` en–it corpus (~29k pairs) automatically
- Train BPE tokenizers and save them as `tokenizer_en.json` / `tokenizer_it.json`
- Save checkpoints to `weights/translationlm_<epoch>.pt`
- Log loss curves to `runs/` (view with `tensorboard --logdir runs`)

```bash
python src/translate.py --text "The sun sets over the mountains."
# → Il sole tramonta sulle montagne.

python src/translate.py --text "She could not hide her feelings."
# → Non riusciva a nascondere i suoi sentimenti.
```

```bash
tensorboard --logdir runs
# Open http://localhost:6006
```

See `notebooks/translationlm.ipynb`. The notebook mounts the source files as a Kaggle dataset and runs `train.py` directly on a P100 GPU.
All hyperparameters live in `src/config.py`:

```python
def get_config():
    return {
        'batch_size': 16,
        'num_epochs': 20,
        'learning_rate': 1e-4,
        'seq_len': 350,
        'd_model': 512,
        'd_ff': 2048,
        'h': 8,        # attention heads
        'N': 6,        # encoder/decoder layers
        'dropout': 0.1,
        'datasource': 'opus_books',
        'lang_src': 'en',
        'lang_tgt': 'it',
    }
```

Greedy-decoded outputs logged during training on the same probe sentence show the model's progression from random noise to structured Italian:
| Epoch | Predicted (greedy) |
|---|---|
| 1 | Non si , ma non si , ma non si , e si , e si , e si . |
| 4 | Per quanto a questo , egli , senza aver sentito , senza aver fatto il salotto ... |
| 9 | Per quanto egli , senza pensare , si aspettava la fine ... dove non c'era nulla ... |
| 13 | Per chiarire via , non si sentì , ma si aspettava la fine della sala di discorso ... |
| 17 | Con questo sentimento , senza guardare , senza ascoltare , si aspettava la fine ... |
| 20 | Con questo sentimento , senza guardare , uscì dalla fine di quella discussione e di quella sala , dove nessuno c'era ... |
Source (probe): "To free himself from this feeling he went, without waiting to hear the end of the discussion, into the refreshment room, where there was no one except the waiters at the buffet."
Target: "Per liberarsi da questa sensazione penosa, senz'attendere la fine del dibattito, se ne andò in una sala dove non c'era nessuno, tranne i servitori vicino a una credenza."
The model learns Italian grammar structure by epoch 4–5, picks up key vocabulary by epoch 10, and produces largely fluent sentences by epoch 17–20.
Why train from scratch?
The goal was to understand every component — not just call from_pretrained(). Building the attention mechanism, positional encoding, and training loop manually gave me a concrete understanding of what each piece does.
Why BPE tokenization?
Byte-Pair Encoding handles unknown words gracefully (by splitting into subword units) and produces compact vocabularies — important for a small corpus like opus_books.
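Training such a tokenizer with the HuggingFace `tokenizers` library takes only a few lines. This is a minimal sketch on a toy corpus; the actual special tokens and `min_frequency` used in `src/dataset.py` may differ:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus standing in for the ~29k opus_books English sentences.
corpus = [
    "The sun sets over the mountains.",
    "She could not hide her feelings.",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"],
                     min_frequency=1)
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer_en.json")  # same filename convention as the repo

ids = tokenizer.encode("The sun sets.").ids  # subword token ids
```

An unseen word is split into known subword units instead of mapping wholesale to `[UNK]`, which matters when the training vocabulary comes from a small literary corpus.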
Why label smoothing?
Label smoothing (ε=0.1) prevents the model from becoming overconfident on the training set, which helps generalisation on a small literary corpus.
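In PyTorch this is a one-argument change to the loss function. A minimal sketch (the `PAD_ID` and toy tensor shapes are illustrative, not the repo's actual values):

```python
import torch
import torch.nn as nn

PAD_ID = 0  # hypothetical padding-token id

# With label_smoothing=0.1, each target keeps 0.9 probability mass and
# spreads the remaining 0.1 uniformly over the other classes, so the
# model is never pushed toward fully saturated predictions.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=PAD_ID)

vocab_size = 1000
logits = torch.randn(4, 20, vocab_size)            # (batch, seq_len, vocab)
targets = torch.randint(1, vocab_size, (4, 20))    # target token ids

loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))
```

`ignore_index` additionally masks padding positions out of the loss, which is the usual pairing for sequence-to-sequence training.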
Why greedy decoding during training?
Beam search would give better BLEU scores but greedy decoding is deterministic, fast, and sufficient for monitoring training quality per epoch.
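The greedy loop itself is simple. In this sketch, `next_token_logits` stands in for a full decoder forward pass, and the `SOS_ID`/`EOS_ID` values are hypothetical; the real loop in `src/translate.py` may differ in detail:

```python
import torch

SOS_ID, EOS_ID = 1, 2  # hypothetical special-token ids

def greedy_decode(next_token_logits, max_len=350):
    """Repeatedly append the argmax token until EOS or max_len.

    `next_token_logits(ys)` stands in for a decoder forward pass:
    given the tokens generated so far, it returns logits over the vocab.
    """
    ys = [SOS_ID]
    for _ in range(max_len):
        logits = next_token_logits(torch.tensor(ys))
        next_id = int(torch.argmax(logits))
        ys.append(next_id)
        if next_id == EOS_ID:
            break
    return ys

# Toy "model" that emits token 5 twice, then EOS.
def toy_logits(ys):
    vocab = torch.zeros(10)
    vocab[5 if len(ys) < 3 else EOS_ID] = 1.0
    return vocab

print(greedy_decode(toy_logits))  # [1, 5, 5, 2]
```

Because argmax is deterministic, the same probe sentence yields directly comparable outputs across epochs, which is exactly what the Sample Translations table relies on.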
- Implement beam search decoding for better translation quality
- Evaluate BLEU score on a held-out test set
- Add learning rate warm-up schedule (as in the original paper)
- Extend to other language pairs (e.g., en–fr, en–de)
- Experiment with larger datasets (WMT14, OPUS-100)
- Deploy as a FastAPI web service
This project is released under the Apache 2.0 License.
Built and trained by Atandra Bharati