A Transformer-based Neural Machine Translation model that translates English to Italian, trained from scratch on the opus_books corpus using the architecture described in "Attention Is All You Need" (Vaswani et al., 2017).
- Overview
- Architecture
- Training Results
- Project Structure
- Quick Start
- Configuration
- Sample Translations
- Key Design Decisions
- Future Work
This project implements a complete neural machine translation pipeline — from raw text to trained model — without relying on pre-trained weights. The goal was to deeply understand the Transformer architecture by building and training every component from the ground up.
Key highlights:
- Built the full Transformer encoder–decoder architecture in PyTorch
- Trained BPE (Byte-Pair Encoding) tokenizers from scratch using HuggingFace `tokenizers`
- Trained for 20 epochs on an NVIDIA Tesla P100 GPU (~4h 35m total)
- Loss reduced from 6.17 → 2.28 (best checkpoint: 2.276 at epoch 18)
- Qualitative translations improve visibly epoch-by-epoch (see Sample Translations)
```
Input (English)                              Output (Italian)
      │                                            ▲
      ▼                                     ┌──────┴──────┐
Token Embedding + Positional Encoding       │ Projection  │
      │                                     │  + Softmax  │
      ▼                                     └─────────────┘
┌──────────────┐                                   ▲
│   Encoder    │      Encoder Output        ┌──────┴──────┐
│  (6 layers)  │ ──────────────────────────►│   Decoder   │
└──────────────┘                            │  (6 layers) │
                                            └─────────────┘
                                                   ▲
                                                   │
                           Token Embedding + Positional Encoding
                                                   │
                               Target (Italian) shifted right
```
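The "Token Embedding + Positional Encoding" step can be sketched with the sinusoidal encoding from the paper. This is a minimal illustration, assuming the implementation in `src/model.py` follows the paper's formula (the real module may add dropout or embedding scaling):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=350, d_model=512)
print(pe.shape)  # torch.Size([350, 512])
```

The encoding is added to the token embeddings so the otherwise position-blind attention layers can distinguish word order.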
Each Encoder block consists of:
- Multi-Head Self-Attention (8 heads)
- Add & Norm (residual connection + layer normalisation)
- Position-wise Feed-Forward Network (d_ff = 2048)
- Add & Norm
Each Decoder block uses the same components, plus:
- Masked Multi-Head Self-Attention (causal masking prevents attending to future positions)
- Multi-Head Cross-Attention (attends to the encoder output)
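The encoder block structure above can be sketched as follows. Note this is an illustrative sketch only: the repository builds attention from scratch in `src/model.py`, whereas this example leans on `torch.nn.MultiheadAttention` for brevity, and assumes post-norm residuals as in the original paper:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: self-attention -> Add & Norm -> FFN -> Add & Norm."""
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Multi-Head Self-Attention, then Add & Norm
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise FFN, then Add & Norm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

block = EncoderBlock()
out = block(torch.randn(2, 10, 512))  # (batch, seq, d_model)
print(out.shape)  # torch.Size([2, 10, 512])
```

A decoder block would add the masked self-attention and cross-attention sublayers before the FFN, each with its own Add & Norm.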
| Parameter | Value |
|---|---|
| Model dimension (d_model) | 512 |
| Feed-forward dimension (d_ff) | 2048 |
| Attention heads (h) | 8 |
| Encoder / Decoder layers (N) | 6 |
| Dropout | 0.1 |
| Max sequence length | 350 |
| Batch size | 16 |
| Learning rate | 1e-4 (Adam, ε=1e-9) |
| Label smoothing | 0.1 |
| Epochs | 20 |
| Hardware | NVIDIA Tesla P100 (Kaggle) |
Training was conducted on the full opus_books en–it split (~29k sentence pairs, 90/10 train/val).
| Epoch | Train Loss | Epoch | Train Loss |
|---|---|---|---|
| 1 | 6.173 | 11 | 3.726 |
| 2 | 5.004 | 12 | 3.738 |
| 3 | 5.172 | 13 | 3.135 |
| 4 | 4.990 | 14 | 3.423 |
| 5 | 4.682 | 15 | 2.942 |
| 6 | 4.502 | 16 | 3.075 |
| 7 | 4.372 | 17 | 2.720 |
| 8 | 4.648 | 18 | 2.276 ✓ |
| 9 | 3.900 | 19 | 2.676 |
| 10 | 4.237 | 20 | 2.287 |
Best checkpoint: Epoch 18 — loss = 2.276
Total training time: ~4 hours 35 minutes on Tesla P100
Steps per epoch: 1,819 @ ~2.22 batch/s
The non-monotonic loss curve (e.g., the upticks at epochs 8 and 10) is typical when training with Adam and label smoothing on a small literary corpus — the model briefly overfits before generalising.
```
TranslationLM/
│
├── src/
│   ├── config.py            # All hyperparameters and file path helpers
│   ├── model.py             # Full Transformer architecture (from scratch)
│   ├── dataset.py           # Bilingual dataset + BPE tokenizer training
│   ├── train.py             # Training loop with TensorBoard + checkpointing
│   └── translate.py         # Inference script (CLI)
│
├── notebooks/
│   └── translationlm.ipynb  # Original Kaggle training notebook
│
├── requirements.txt
├── .gitignore
├── LICENSE                  # Apache 2.0
└── README.md
```
```bash
git clone https://github.com/atandra2000/TranslationLM.git
cd TranslationLM
pip install -r requirements.txt
```

```bash
python src/train.py
```

This will:
- Download the `opus_books` en–it corpus (~29k pairs) automatically
- Train BPE tokenizers and save them as `tokenizer_en.json` / `tokenizer_it.json`
- Save checkpoints to `weights/translationlm_<epoch>.pt`
- Log loss curves to `runs/` (view with `tensorboard --logdir runs`)

```bash
python src/translate.py --text "The sun sets over the mountains."
# → Il sole tramonta sulle montagne.

python src/translate.py --text "She could not hide her feelings."
# → Non riusciva a nascondere i suoi sentimenti.
```

```bash
tensorboard --logdir runs
# Open http://localhost:6006
```

See `notebooks/translationlm.ipynb`. The notebook mounts the source files as a Kaggle dataset and runs `train.py` directly on a P100 GPU.
All hyperparameters live in `src/config.py`:

```python
def get_config():
    return {
        'batch_size': 16,
        'num_epochs': 20,
        'learning_rate': 1e-4,
        'seq_len': 350,
        'd_model': 512,
        'd_ff': 2048,
        'h': 8,        # attention heads
        'N': 6,        # encoder/decoder layers
        'dropout': 0.1,
        'datasource': 'opus_books',
        'lang_src': 'en',
        'lang_tgt': 'it',
    }
```

Greedy-decoded outputs logged during training on the same probe sentence show the model's progression from random noise to structured Italian:
| Epoch | Predicted (greedy) |
|---|---|
| 1 | Non si , ma non si , ma non si , e si , e si , e si . |
| 4 | Per quanto a questo , egli , senza aver sentito , senza aver fatto il salotto ... |
| 9 | Per quanto egli , senza pensare , si aspettava la fine ... dove non c'era nulla ... |
| 13 | Per chiarire via , non si sentì , ma si aspettava la fine della sala di discorso ... |
| 17 | Con questo sentimento , senza guardare , senza ascoltare , si aspettava la fine ... |
| 20 | Con questo sentimento , senza guardare , uscì dalla fine di quella discussione e di quella sala , dove nessuno c'era ... |
Source (probe): "To free himself from this feeling he went, without waiting to hear the end of the discussion, into the refreshment room, where there was no one except the waiters at the buffet."
Target: "Per liberarsi da questa sensazione penosa, senz'attendere la fine del dibattito, se ne andò in una sala dove non c'era nessuno, tranne i servitori vicino a una credenza."
The model learns Italian grammar structure by epoch 4–5, picks up key vocabulary by epoch 10, and produces largely fluent sentences by epoch 17–20.
Why train from scratch?
The goal was to understand every component — not just call from_pretrained(). Building the attention mechanism, positional encoding, and training loop manually gave me a concrete understanding of what each piece does.
Why BPE tokenization?
Byte-Pair Encoding handles unknown words gracefully (by splitting into subword units) and produces compact vocabularies — important for a small corpus like opus_books.
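Training such a tokenizer with the HuggingFace `tokenizers` library takes only a few lines. This is a minimal sketch on a toy corpus; the actual special tokens and `min_frequency` used in `src/dataset.py` may differ:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus standing in for the ~29k opus_books English sentences.
corpus = [
    "The sun sets over the mountains.",
    "She could not hide her feelings.",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"],
                     min_frequency=1)
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer_en.json")  # same filename convention as the repo

ids = tokenizer.encode("The sun sets.").ids  # subword token ids
```

An unseen word is split into known subword units instead of mapping wholesale to `[UNK]`, which matters when the training vocabulary comes from a small literary corpus.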
Why label smoothing?
Label smoothing (ε=0.1) prevents the model from becoming overconfident on the training set, which helps generalisation on a small literary corpus.
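In PyTorch this is a one-argument change to the loss function. A minimal sketch (the `PAD_ID` and toy tensor shapes are illustrative, not the repo's actual values):

```python
import torch
import torch.nn as nn

PAD_ID = 0  # hypothetical padding-token id

# With label_smoothing=0.1, each target keeps 0.9 probability mass and
# spreads the remaining 0.1 uniformly over the other classes, so the
# model is never pushed toward fully saturated predictions.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=PAD_ID)

vocab_size = 1000
logits = torch.randn(4, 20, vocab_size)            # (batch, seq_len, vocab)
targets = torch.randint(1, vocab_size, (4, 20))    # target token ids

loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))
```

`ignore_index` additionally masks padding positions out of the loss, which is the usual pairing for sequence-to-sequence training.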
Why greedy decoding during training?
Beam search would give better BLEU scores but greedy decoding is deterministic, fast, and sufficient for monitoring training quality per epoch.
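The greedy loop itself is simple. In this sketch, `next_token_logits` stands in for a full decoder forward pass, and the `SOS_ID`/`EOS_ID` values are hypothetical; the real loop in `src/translate.py` may differ in detail:

```python
import torch

SOS_ID, EOS_ID = 1, 2  # hypothetical special-token ids

def greedy_decode(next_token_logits, max_len=350):
    """Repeatedly append the argmax token until EOS or max_len.

    `next_token_logits(ys)` stands in for a decoder forward pass:
    given the tokens generated so far, it returns logits over the vocab.
    """
    ys = [SOS_ID]
    for _ in range(max_len):
        logits = next_token_logits(torch.tensor(ys))
        next_id = int(torch.argmax(logits))
        ys.append(next_id)
        if next_id == EOS_ID:
            break
    return ys

# Toy "model" that emits token 5 twice, then EOS.
def toy_logits(ys):
    vocab = torch.zeros(10)
    vocab[5 if len(ys) < 3 else EOS_ID] = 1.0
    return vocab

print(greedy_decode(toy_logits))  # [1, 5, 5, 2]
```

Because argmax is deterministic, the same probe sentence yields directly comparable outputs across epochs, which is exactly what the Sample Translations table relies on.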
- Implement beam search decoding for better translation quality
- Evaluate BLEU score on a held-out test set
- Add learning rate warm-up schedule (as in the original paper)
- Extend to other language pairs (e.g., en–fr, en–de)
- Experiment with larger datasets (WMT14, OPUS-100)
- Deploy as a FastAPI web service
This project is released under the Apache 2.0 License.
Built and trained by Atandra Bharati