## Lesson Notebook 6: Machine Translation With Shakespeare and T5

In this notebook we will look at one example related to machine translation:

   * Train a transformer from scratch on translation of Shakespeare to Modern English



<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup)
  * 2. [Shakespeare-to-Modern English Translation with a Seq2Seq Transformer](#shakespeare)
  * 3. [Answers](#answers)








  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-fall-main/blob/master/materials/lesson_notebooks/lesson_6_Machine_Translation_Shakespeare_T5.ipynb)

[Return to Top](#returnToTop)  
<a id = 'setup'></a>

## 1. Setup


We'll start with the usual setup. We install `sentencepiece` first, since some of the tokenizers we use depend on it.

In [5]:
!pip install -q sentencepiece
!pip install -q transformers
!pip install -q tokenizers
!pip install -q datasets

In [6]:
import random
import transformers
from datasets import Dataset

from transformers import T5TokenizerFast, T5Config, T5ForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

[Return to Top](#returnToTop)  
<a id = 'shakespeare'></a>

## 2. Shakespeare-to-Modern English Translation with a Seq2Seq Transformer

What if we want to create a translation model from scratch? These days, you'll usually be working with pre-trained transformers. But there are some circumstances in which it makes sense to build a new model, especially if you're working with a very rare language.

To explore how we might do so, just as a learning exercise, we'll train a brand-new sequence-to-sequence transformer model for the task of translating text from Shakespearean English to Modern English.


### 2.1 Downloading the data

The data includes aligned sentences from a number of plays by William Shakespeare. The data was copied from the repo at [https://github.com/cocoxu/Shakespeare](https://github.com/cocoxu/Shakespeare) and consolidated into one file for easier handling.

You will need to grab a copy from our git repo and upload it to your Google Drive. From there you'll be able to load it easily into a Colab notebook.

In [10]:
#This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
#Modify this path to the appropriate location in your Drive
text_file = '/content/drive/MyDrive/ISchool/MIDS/266/train_plays-org-mod.txt'

### 2.2 Parsing the data

Each line contains a Shakespearean sentence and its corresponding modern English translation.
The Shakespearean sentence is the *source sequence* and the modern English one is the *target sequence*.

In [19]:
# Open a text file whose path is in `text_file`
with open(text_file) as f:
    # Read the whole file into a single string, split on newlines,
    # then drop the last element (often an empty string if the file ends with \n)
    lines = f.read().split("\n")[:-1]

text_pairs = []
for line in lines:
    # Expect each line to have two fields separated by a TAB
    old, mod = line.split("\t")
    # Normalize to lowercase
    old = old.lower()
    mod = mod.lower()
    # Append as a tuple ('old', 'mod')
    text_pairs.append((old, mod))
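
The loop above assumes every line contains exactly one tab, so a single malformed line would raise a `ValueError`. As a minimal sketch of more defensive parsing (the helper name `parse_pairs` and the sample lines are hypothetical, not part of the lesson code), we could skip and count bad lines instead:

```python
def parse_pairs(lines):
    """Split tab-separated lines into (old, mod) pairs, skipping malformed ones."""
    pairs, skipped = [], 0
    for line in lines:
        fields = line.split("\t")
        # Skip lines with the wrong number of fields or with an empty side
        if len(fields) != 2 or not all(f.strip() for f in fields):
            skipped += 1
            continue
        old, mod = (f.lower() for f in fields)   # same lowercasing as the cell above
        pairs.append((old, mod))
    return pairs, skipped

# Made-up sample lines: one valid pair, one with no tab, one with an empty source
sample = ["Unarm, Eros.\tRemove your armor, Eros.", "no tab here", "\tonly target"]
pairs, skipped = parse_pairs(sample)
print(pairs, skipped)
```

Reporting `skipped` makes it easy to notice if the file format is not what we expected.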


In [18]:
#look at some examples
for _ in range(5):
    print(random.choice(text_pairs))

('i am sorry that i am deceived in him. ', 'i am sorry that i was wrong about him. ')
('do you not remember he saw a flea stick upon bardolph’s nose, and he said it was a black soul burning in hell?', 'don’t you remember how he saw a flea land on bardolph’s nose and said it was a black soul burning in hell?')
('if he have caught the benedick, it will cost him a thousand pound ere a be cured.', 'if he’s caught the benedick, he’ll lose all his money before he’s cured.')
('heaven preserve you!', 'heaven preserve you!')
('i shall the effect of this good lesson keep as watchman to my heart.', 'i shall keep the purpose of this good lesson as watchman to my heart.')


In [20]:
# Let's create some splits
random.shuffle(text_pairs)                                  # in-place shuffle (NON-reproducible unless you set a seed)

num_val_samples = int(0.06 * len(text_pairs))               # 6% for validation
num_train_samples = len(text_pairs) - 2 * num_val_samples   # leave another 6% for test; rest for train

# slice the shuffled list into train/val/test in order
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

# report counts
print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

print(train_pairs[0])                                       # quick sanity check of one training example


19088 total pairs
16798 training pairs
1145 validation pairs
1145 test pairs
('unarm, eros.', 'remove your armor, eros.')
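
Because `random.shuffle` is unseeded in the cell above, each run produces a different split. A minimal sketch of a reproducible version follows; the helper name `split_pairs`, the seed value, and the toy pairs are assumptions made for illustration only.

```python
import random

def split_pairs(pairs, val_frac=0.06, seed=266):
    """Shuffle a copy of `pairs` with a fixed seed, then slice train/val/test."""
    shuffled = list(pairs)                    # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)     # private RNG, so the global state is not affected
    num_val = int(val_frac * len(shuffled))
    num_train = len(shuffled) - 2 * num_val   # validation and test get the same size
    train = shuffled[:num_train]
    val = shuffled[num_train:num_train + num_val]
    test = shuffled[num_train + num_val:]
    return train, val, test

# Toy stand-in with the same number of pairs as the real file
toy_pairs = [(f"old {i}", f"mod {i}") for i in range(19088)]
train, val, test = split_pairs(toy_pairs)
print(len(train), len(val), len(test))  # 16798 1145 1145
```

The sizes match the counts printed above, and calling `split_pairs` again with the same seed returns the same split.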



### Dataset size & expectations

* **Split plan:** 6% validation and 6% test (**1,145 pairs each**), leaving **16,798 pairs for training**.
* **Reality check:** ~17k sentence pairs is **small for training a model from scratch** → set modest expectations; still worthwhile as an experiment.

### 2.3 Define vocabulary & tokenizer

* **Goal:** Build a **smaller custom tokenizer** tailored to this dataset to reduce vocab size vs. typical pre-trained models.
* **Model choice:** Use a **T5-style architecture**, but a **much smaller variant** trained **from scratch** on your data.
* **Shared vocab:** T5 uses **one vocabulary** for **both encoder and decoder** → combine **Shakespearean + Modern English** text to fit the joint vocab.
* **Practical approach:**

  1. **Load an existing T5 tokenizer** as a base.
  2. **Retrain/fit** it on your corpus with your **chosen vocab size** (smaller than standard T5).
  3. Use the resulting tokenizer consistently for both sides (encoder/decoder).


In [21]:
# The size of our vocabulary covers both languages
VOCAB_SIZE = 15000              # target vocab size for a *joint* tokenizer

MAX_SEQUENCE_LENGTH = 40        # hard cap used later for trunc/pad during encoding

def get_word_piece_tokenizer(text_samples, vocab_size):
    # Load a *T5* tokenizer as the base config (normalizer, pre/post-processors, specials)
    base_tokenizer = T5TokenizerFast.from_pretrained('t5-base')

    # Train a NEW tokenizer from your text iterator, with a fixed vocab size.
    # Despite the function name, the T5 base tokenizer is a SentencePiece-style
    # Unigram model, so the retrained tokenizer is Unigram too (not WordPiece).
    new_tokenizer = base_tokenizer.train_new_from_iterator(
        text_samples,
        vocab_size=vocab_size    # use the argument rather than the global VOCAB_SIZE
    )

    return new_tokenizer         # returns a Hugging Face fast tokenizer object


In [22]:
shakespeare_samples = [text_pair[0] for text_pair in train_pairs]
# ^ take the "old/Shakespearean" side from training only (good: no leakage)

modern_samples = [text_pair[1] for text_pair in train_pairs]
# ^ take the "modern" side from training only (also good)

t5_tokenizer = get_word_piece_tokenizer(
    shakespeare_samples + modern_samples,  # joint vocab for encoder+decoder
    VOCAB_SIZE                              # target vocab size
)



In [25]:
print("Vocab Tokens: ", t5_tokenizer.decode(range(110, 130)))

Vocab Tokens:  i of my? thati a ins is your! for be not with he have this me


In [24]:
old_input_ex = [text_pairs[1][0]]                  # takes the 2nd pair's Shakespearean text (list of 1 string)
old_tokens_ex = t5_tokenizer.batch_encode_plus(old_input_ex)  # tokenizes (returns a dict of lists/tensors)

print("Shakespearean English sentence: ", old_input_ex)
print("Tokens: ", old_tokens_ex)                   # prints the *entire* dict (input_ids, attention_mask, etc.)
print(
    "Recovered text after detokenizing: ",
    t5_tokenizer.batch_decode(old_tokens_ex['input_ids']),  # detokenizes IDs back to text (includes specials by default)
)

print()

mod_input_ex = [text_pairs[1][1]]                  # takes the 2nd pair's Modern English text
mod_tokens_ex = t5_tokenizer.batch_encode_plus(mod_input_ex)

print("Modern English sentence: ", mod_input_ex)
print("Tokens: ", mod_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    t5_tokenizer.batch_decode(mod_tokens_ex['input_ids']),
)


Shakespearean English sentence:  ['what, look you pale?']
Tokens:  {'input_ids': [[130, 103, 222, 108, 930, 113, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1]]}
Recovered text after detokenizing:  ['what, look you pale?</s>']

Modern English sentence:  ['what, do you look pale?']
Tokens:  {'input_ids': [[130, 103, 139, 108, 222, 930, 113, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1]]}
Recovered text after detokenizing:  ['what, do you look pale?</s>']


### 2.4 Format Datasets

Let's turn our data into a Hugging Face dataset, so that we can work with it as we did in earlier lesson notebooks. We'll need to write a preprocess function that converts the input and output texts into sequences of vocab IDs, using the tokenizer we just trained. Then we'll map the preprocess function over the dataset.

In [26]:
def make_dataset(pairs):
    org_texts, mod_texts = zip(*pairs)          # unzip list[tuple] → two tuples (raises if pairs is empty)
    org_texts = list(org_texts)                 # materialize as lists
    mod_texts = list(mod_texts)

    dataset = Dataset.from_dict({"shakespeare": org_texts, "modern": mod_texts})
    return dataset.shuffle()                     # randomizes row order (non-deterministic unless you pass a seed)

# training/validation HF datasets
train_ds = make_dataset(train_pairs)
val_ds   = make_dataset(val_pairs)


In [27]:
def preprocess_batch(batch_text_pairs):
    # Encode the Shakespearean source texts into token IDs (PyTorch tensors),
    # padding/truncating each to MAX_SEQUENCE_LENGTH.
    shakespeare_encoded = t5_tokenizer.batch_encode_plus(
        batch_text_pairs["shakespeare"],
        max_length=MAX_SEQUENCE_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    # Encode the Modern English target texts similarly (same length cap and padding).
    modern_encoded = t5_tokenizer.batch_encode_plus(
        batch_text_pairs["modern"],
        max_length=MAX_SEQUENCE_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    # Return the two tensors in the format expected by HF seq2seq trainers:
    # - 'input_ids' for the encoder inputs (source)
    # - 'labels' for the decoder targets (ground truth)
    return {
        'input_ids': shakespeare_encoded['input_ids'],
        'labels': modern_encoded['input_ids']
    }


In [28]:
train_ds = train_ds.map(preprocess_batch, batched=True)
val_ds = val_ds.map(preprocess_batch, batched=True)


### 2.5 Define the model

* **Approach:** Use Hugging Face to instantiate a model **from a config** (random init), not from a pretrained checkpoint.
* **Why config?** Lets you **customize architecture** (sizes, layers) while keeping the T5 design.
* **Project fit:** Build a **very small T5**:

  * **Vocabulary:** your custom small vocab.
  * **Dimensions:** `d_model=256`, half the `T5Config` default of 512; `d_ff=2048` stays at the default.
  * **Depth:** **1 transformer layer** in the **encoder** and **decoder**.
* **Outcome:** A lightweight, untrained T5 variant tailored to this dataset.
* **Looking ahead (A3):** You’ll explore these hyperparameters further, including **adding more transformer layers**.
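
To get a feel for how small this model is, here is a rough back-of-the-envelope weight count. It is only a sketch under stated assumptions: the helper `estimate_t5_params` is hypothetical, it assumes the `T5Config` defaults of `d_kv=64`, tied input/output embeddings, and a non-gated feed-forward block, and it ignores layer norms and relative-position biases.

```python
def estimate_t5_params(vocab_size, d_model, d_ff, num_heads, num_layers, d_kv=64):
    """Rough weight count; ignores layer norms and relative-position biases."""
    inner = num_heads * d_kv                 # width of the attention projections
    attn = 4 * d_model * inner               # q, k, v, o projections (no bias terms assumed)
    ff = 2 * d_model * d_ff                  # input and output projections of the MLP
    embed = vocab_size * d_model             # one shared embedding matrix (assumes tying)
    encoder = num_layers * (attn + ff)       # self-attention + feed-forward per layer
    decoder = num_layers * (2 * attn + ff)   # self-attention + cross-attention + feed-forward
    return embed + encoder + decoder

estimate = estimate_t5_params(
    vocab_size=15000, d_model=256, d_ff=2048, num_heads=8, num_layers=1
)
print(f"{estimate:,}")
```

Under these assumptions the estimate is about 7.5 million weights, roughly half of which sit in the embedding matrix. Once the model is built, `sum(p.numel() for p in t5_model.parameters())` gives the exact count.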


In [29]:
# Define some hyperparameter values for our transformer model
EMBED_DIM = 256          # T5 d_model: token/hidden embedding size per position
INTERMEDIATE_DIM = 2048  # T5 d_ff: width of the feed-forward (MLP) layer
NUM_HEADS = 8            # Attention heads; in T5 the per-head size is d_kv (default 64), set independently of d_model
NUM_LAYERS = 1           # Encoder and decoder depth (T5 uses the same count for both by default)

# Also define some training parameters we'll use next
BATCH_SIZE = 64          # Batch size for training/validation steps
EPOCHS = 25              # Number of training epochs

t5_config = T5Config(
    vocab_size=VOCAB_SIZE,        # Size of your jointly trained tokenizer vocab
    d_model=EMBED_DIM,            # Hidden dimension (matches EMBED_DIM)
    d_ff=INTERMEDIATE_DIM,        # Feed-forward layer dimension (MLP width)
    num_heads=NUM_HEADS,          # Number of attention heads per layer
    num_layers=NUM_LAYERS,        # Number of transformer blocks in encoder and decoder
    decoder_start_token_id=t5_tokenizer.pad_token_id  # T5 convention: decoder starts with <pad> token
)


In [30]:
t5_model = T5ForConditionalGeneration(config=t5_config)

### 2.6 Train the model

To train a Hugging Face model, as we did with BERT, we'll create a Trainer object and its associated training arguments. This time we need the Seq2Seq versions, `Seq2SeqTrainer` and `Seq2SeqTrainingArguments`.

In [31]:
args = Seq2SeqTrainingArguments(
    "shakespeare_translation_model",   # output directory: checkpoints, logs, etc.
    eval_strategy='epoch',             # run evaluation at the end of each epoch
    per_device_train_batch_size=BATCH_SIZE,  # train batch size per GPU/CPU device
    per_device_eval_batch_size=BATCH_SIZE,   # eval batch size per device
    num_train_epochs=EPOCHS,           # total number of training epochs
    report_to='none'                   # disable logging integrations (e.g., WandB/Comet)
)


In [32]:
trainer = Seq2SeqTrainer(
    t5_model,
    args,
    train_dataset=train_ds,
    eval_dataset=val_ds
)

Training this small model on our dataset takes a few minutes on a Colab T4 GPU (the run logged below took about 200 seconds). Note that we're training for quite a few epochs so the model can start to pick up the task, since it has not been pre-trained in any way. We might stop early in the live session if we're running low on time.

In [33]:
trainer.train()

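| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 2.622731 |
| 2 | 2.619400 | 2.479866 |
| 3 | 2.619400 | 2.379849 |
| 4 | 2.326100 | 2.310635 |
| 5 | 2.326100 | 2.249267 |
| 6 | 2.185800 | 2.200431 |
| 7 | 2.185800 | 2.155406 |
| 8 | 2.089800 | 2.120936 |
| 9 | 2.089800 | 2.089776 |
| 10 | 2.023500 | 2.062647 |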


TrainOutput(global_step=6575, training_loss=2.0026319148422647, metrics={'train_runtime': 202.2011, 'train_samples_per_second': 2076.892, 'train_steps_per_second': 32.517, 'total_flos': 370074184704000.0, 'train_loss': 2.0026319148422647, 'epoch': 25.0})
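
The `global_step` value in the output can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; the helper name `total_optimizer_steps` is hypothetical.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Steps per epoch is ceil(examples / batch) when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# 16,798 training pairs, BATCH_SIZE = 64, EPOCHS = 25
steps_per_epoch, total = total_optimizer_steps(16798, 64, 25)
print(steps_per_epoch, total)  # 263 6575
```

With 263 steps per epoch, and assuming the Trainer's default of logging the training loss every 500 steps, the first logged value arrives during epoch 2. That would explain the "No log" entry for epoch 1 and why each training loss value appears for two consecutive epochs.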

### 2.7 Generate and examine some test sentences

Finally, let's write a function to generate translations of new inputs. We'll use the model's `.generate()` method and the tokenizer's `.batch_decode()` method. Hugging Face text generation models have many options built into `.generate()`, including beam search, top-k/top-p sampling, constraints on repeated n-grams, and minimum and maximum output lengths. We'll start with simple settings that keep the generation loop relatively fast. You'll explore these options more in Assignment 3.

In [34]:
def generate_output(input_sentences):
    # Tokenize a list of input strings; pad to the longest one so sentences of
    # different lengths can be stacked into a single PyTorch tensor.
    inputs_encoded = t5_tokenizer(input_sentences, padding=True, return_tensors='pt')

    # Move the inputs to whichever device the model is on (the GPU after training on Colab).
    inputs_encoded = inputs_encoded.to(t5_model.device)

    # Generate output token IDs from the model.
    output_ids = t5_model.generate(
        inputs_encoded['input_ids'],
        attention_mask=inputs_encoded['attention_mask'],  # ignore padded positions
        num_beams=1,              # greedy search (no beam expansion)
        no_repeat_ngram_size=4    # avoid repeating any 4-gram in the output
    )

    # Decode token IDs back to strings.
    generated_sentences = t5_tokenizer.batch_decode(
        output_ids,
        skip_special_tokens=True,           # drop <pad>, </s>, etc.
        clean_up_tokenization_spaces=False  # keep spacing exactly as decoded
    )
    return generated_sentences
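
To make the `no_repeat_ngram_size` option more concrete, here is a simplified pure-Python illustration of the idea: before each step, find the tokens that would complete an n-gram already present in the output, and forbid them. This is a sketch of the concept, not the library's actual implementation, and `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(generated, n):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n:
        return set()                          # no complete n-gram exists yet
    # The last n-1 tokens are the start of the n-gram we are about to finish
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:              # same n-1 tokens seen earlier
            banned.add(ngram[-1])             # so its old continuation is not allowed
    return banned

so_far = ["i", "will", "be", "with", "her", "and", "i", "will", "be"]
print(banned_next_tokens(so_far, n=4))  # {'with'}
```

Here the output already contains "i will be with", so after the second "i will be" the token "with" is blocked. The model must pick something else, which limits but does not eliminate the looping we see in the examples below.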


In [35]:
test_org_texts = [pair[0] for pair in test_pairs]
# ^ collect the Shakespearean/source side from the held-out test split

for i in range(4):
    # pick a random source sentence from the test set
    input_sentence = random.choice(test_org_texts)

    # run generation (expects a list of strings)
    translated = generate_output([input_sentence])

    # unpack the single returned string from the list
    translated = translated[0]

    # print a simple before/after demo
    print(f"** Example {i} **")
    print(input_sentence)  # original (Shakespearean) sentence
    print(translated)      # model-generated modern translation
    print()                # blank line for readability


** Example 0 **
be buried quick with her, and so will i.
i will be with her, and i will be a her, i will be to her, and

** Example 1 **
o, i am out of breath in this fond chase!
o, i am this is this, i am of this, i’ll be a man of of

** Example 2 **
i am dying, egypt, dying.
i am am am, i am.

** Example 3 **
look on the tragic loading of this bed; this is thy work.
this is your the



This doesn't look very good yet, but that's to be expected. It's very difficult to get an NLP model to do well on a complex task without pre-training it on huge amounts of raw text. In Assignment 3, we'll try a few more options to see if we can get this model to do at least a little better on this task.

Remember that training a model from scratch is not something you would normally do, but it can be a useful educational exercise to better understand the starting point of these models and what it takes for them to learn language processing tasks.


**QUESTION 2**: What things could we do to improve the output?
* add more sentence pairs
* ensure a good distribution over all the sentence lengths
* add another transformer layer to encoder and decoder
* change the generation hyperparameters
* ???

[Return to Top](#returnToTop)  
<a id = 'answers'></a>

## ANSWERS

1.  The T5 model doesn't have the token type ids that BERT uses to identify different segments.

2.  The first two suggestions, more sentence pairs and a better balance of sentence lengths, are a good start. More and better data typically lead to improved performance. We might also look into separately "pre-training" our encoder and decoder as their own language models. We could then use those pre-trained models as a foundation on which to train our connected encoder and decoder. We could also look into using back translation to augment our existing sentence pairs.

### Biggest wins

* **Fine-tune a pretrained model** (e.g., `t5-small`/`t5-base`) on your pairs instead of training from scratch.
* **Train your tokenizer properly** (SentencePiece Unigram like T5) and **raise max length** to cover ≥95% of sequences.

### Data

* **More pairs** (both quantity and coverage of styles/lengths).
* **Balance lengths** (stratify by length so train/val/test distributions match).
* **Clean/normalize** (consistent quotes, punctuation, casing; remove duplicates/leakage).
* **Augment** lightly (paraphrases on the modern side; filter noisy pairs).
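
As one way to act on the duplicate and leakage point above, the sketch below drops repeated training pairs and any training pair whose source sentence also appears in the held-out data. The helper name `dedupe_and_remove_leakage` and the toy pairs are hypothetical.

```python
def dedupe_and_remove_leakage(train_pairs, held_out_pairs):
    """Drop repeated training pairs and pairs whose source also appears in held-out data."""
    held_out_sources = {old for old, _ in held_out_pairs}
    seen, cleaned = set(), []
    for old, mod in train_pairs:
        if (old, mod) in seen or old in held_out_sources:
            continue                      # exact duplicate, or source leaked from val/test
        seen.add((old, mod))
        cleaned.append((old, mod))
    return cleaned

# Toy data: one duplicate pair and one source sentence shared with the held-out set
toy_train = [("a", "b"), ("a", "b"), ("c", "d"), ("e", "f")]
toy_held_out = [("c", "x")]
cleaned = dedupe_and_remove_leakage(toy_train, toy_held_out)
print(cleaned)  # [('a', 'b'), ('e', 'f')]
```

In the notebook this could be called with `train_pairs` and `val_pairs + test_pairs` before building the datasets.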

### Tokenizer

* **Joint corpus training** (both sides, well-interleaved).
* **Tune `vocab_size`** (e.g., 16k–24k) to reduce OOV/over-fragmentation.
* **Inspect length histogram** post-tokenization; set `MAX_SEQUENCE_LENGTH` accordingly (e.g., 64–96).
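
A minimal sketch of how one might choose `MAX_SEQUENCE_LENGTH` from the data: compute the token length that covers a chosen fraction of the sentences. The helper `length_covering` is hypothetical, and `str.split` stands in for the real tokenizer so the example runs on its own.

```python
import math

def length_covering(texts, encode, coverage=0.95):
    """Smallest token length that covers at least `coverage` of `texts`."""
    lengths = sorted(len(encode(t)) for t in texts)
    index = math.ceil(coverage * len(lengths)) - 1   # position of the coverage quantile
    return lengths[max(index, 0)]

# Toy sentences with token lengths 3, 1, 2, 10, 4
texts = ["a b c", "a", "a b", "a b c d e f g h i j", "a b c d"]
half = length_covering(texts, str.split, coverage=0.5)
full = length_covering(texts, str.split, coverage=1.0)
print(half, full)  # 3 10
```

With the notebook's tokenizer, `encode` could be `lambda t: t5_tokenizer(t)["input_ids"]`, applied to both the Shakespearean and modern sides.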

### Model/Config

* **Depth**: increase to **2–4 layers** encoder/decoder.
* **Width**: consider `d_model=384/512`, `d_ff≈2–4×d_model`.
* **Dropout**: 0.1–0.2 to combat overfitting.
* **Label masking**: set pad tokens in `labels` to **-100** so they don’t count toward loss.
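
The label-masking point can be sketched in plain Python. The value -100 is the index that PyTorch's cross-entropy loss ignores by default; the pad ID of 0 is an assumption based on the T5 convention (check `t5_tokenizer.pad_token_id`), and `mask_pad_labels` is a hypothetical helper.

```python
def mask_pad_labels(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding IDs in each label sequence with `ignore_index`."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Token IDs from the modern-English example earlier, padded with two assumed pad IDs (0)
labels = [[130, 103, 139, 108, 222, 930, 113, 1, 0, 0]]
masked = mask_pad_labels(labels, pad_token_id=0)
print(masked)  # [[130, 103, 139, 108, 222, 930, 113, 1, -100, -100]]
```

Without this step, every padded position in `labels` counts toward the loss, so the model is rewarded for predicting padding, and the reported loss is not a clean measure of translation quality.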

### Training

* **Optimizer/schedule**: AdamW, **lr ~1e-4–3e-4**, **warmup** (e.g., 500 steps), cosine/linear decay.
* **Batching**: dynamic padding or bucketing by length.
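* **Mixed precision** (`fp16`/`bf16`) for speed and memory; prefer `bf16` where the GPU supports it, since T5 models have a reputation for numerical overflow in `fp16`.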
* **Early stopping** on val loss; train **longer** (more epochs) if underfitting.
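
To show what a warmup-plus-decay schedule looks like, here is a small standalone sketch of linear warmup followed by linear decay. The function name and default values are illustrative assumptions (the total of 6,575 steps matches the run above), not settings taken from the lesson.

```python
def linear_warmup_decay(step, peak_lr=3e-4, warmup_steps=500, total_steps=6575):
    """Learning rate that ramps up linearly, then decays linearly to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # ramp from 0 up to peak_lr
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 250, 500, 3000, 6575):
    print(step, linear_warmup_decay(step))
```

With the Trainer, the same shape can be requested through the `learning_rate`, `warmup_steps`, and `lr_scheduler_type` training arguments rather than implemented by hand.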

### Decoding (quick wins you can try now)

* **Beam search**: `num_beams=4–8`, **length_penalty=1.0–1.2**.
* **Sampling** (if you want diversity): `do_sample=True, top_p=0.9, temperature=0.7`.

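```python
# Tokenize one example sentence and move it to the model's device
inputs_encoded = t5_tokenizer(["what, look you pale?"], return_tensors="pt").to(t5_model.device)
output_ids = t5_model.generate(
    **inputs_encoded, num_beams=6, length_penalty=1.1,
    no_repeat_ngram_size=4, max_new_tokens=64
)
print(t5_tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```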

### Evaluation & debugging

* Track **BLEU/ROUGE/chrF/BERTScore** on val/test.
* **Qualitative error sets**: bucket errors by length, archaisms, syntax.
* **Ablations**: change one knob at a time (depth, max length, decoding) to see impact.
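
Metrics such as BLEU are built on n-gram overlap between the model output and a reference. The sketch below computes a clipped n-gram precision for one sentence pair. It is a simplified stand-in for illustration, not full BLEU (no brevity penalty, no averaging over n-gram orders), and the example sentences are made up.

```python
from collections import Counter

def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference (counts clipped)."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0                        # candidate too short to contain any n-gram
    # Each candidate n-gram is credited at most as often as it appears in the reference
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total

unigram = clipped_ngram_precision("i am am am", "i am dying", n=1)
bigram = clipped_ngram_precision("i am am am", "i am dying", n=2)
print(unigram, bigram)
```

Clipping matters for outputs like ours: the repeated "am" is only credited once, so a model cannot score well just by repeating a word that appears in the reference.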

### If staying scratch-only (educational)

* Gradually **scale depth/width**.
* **Curriculum**: start with short/easy pairs → progressively harder/longer.
* **Regularize** (dropout, label smoothing 0.1).

