## Lesson Notebook 6: Machine Translation With Shakespeare and T5

In this notebook we will look at one example related to machine translation:

   * Train a transformer from scratch on translation of Shakespeare to Modern English



<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup)
  * 2. [Shakespeare-to-Modern English translation with a from-scratch transformer](#shakespeare)








  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-summer-main/blob/master/materials/lesson_notebooks/lesson_6_Machine_Translation_Shakespeare_T5.ipynb)

[Return to Top](#returnToTop)  
<a id = 'setup'></a>

## 1. Setup


We'll start with the usual setup. We first install sentencepiece, which is needed to tokenize the text for some of the models.

In [1]:
!pip install -q sentencepiece
!pip install -q transformers
!pip install -q tokenizers
!pip install -q datasets

In [2]:
import random
import transformers
from datasets import Dataset

from transformers import T5TokenizerFast, T5Config, T5ForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

[Return to Top](#returnToTop)  
<a id = 'shakespeare'></a>

## 2. Shakespeare-to-Modern English Translation with a Seq2Seq Transformer

What if we want to create a translation model from scratch? These days, you'll usually be working with pre-trained transformers. But there are some circumstances in which it makes sense to build a new model, especially if you're working with a very rare language.

To explore how we might do so, just as a learning exercise, we'll train a brand-new sequence-to-sequence transformer model for the task of translating text from Shakespearean English to Modern English.


### 2.1 Downloading the data

The data includes aligned sentences from a number of plays by William Shakespeare.  The data was copied from the repo [https://github.com/cocoxu/Shakespeare](https://github.com/cocoxu/Shakespeare) and consolidated into one file for easier handling.

You will need to grab a copy from our git repo and upload it to your Google Drive.  From there you'll be able to easily load it into a Colab notebook.

In [3]:
#This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
#Modify this path to the appropriate location in your Drive
text_file = 'drive/MyDrive/ISchool/MIDS/266/data/train_plays-org-mod.txt'

### 2.2 Parsing the data

Each line contains a Shakespearean sentence and its corresponding modern English translation.
The Shakespearean sentence is the *source sequence* and the modern English one is the *target sequence*.

In [5]:
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]

text_pairs = []
for line in lines:
    old, mod = line.split("\t")
    old = old.lower()
    mod = mod.lower()
    text_pairs.append((old, mod))
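
The parsing cell above assumes every line contains exactly one tab. As a minimal sketch of a more defensive variant, the hypothetical helper `parse_pairs` below (not part of the original notebook) skips lines that do not split into exactly two non-empty fields and reports how many were skipped. The sample lines are made up for illustration.

```python
def parse_pairs(lines):
    """Split tab-separated lines into lowercase (source, target) pairs.

    Lines that do not contain exactly two non-empty fields are skipped.
    """
    pairs = []
    skipped = 0
    for line in lines:
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            skipped += 1
            continue
        old, mod = parts
        pairs.append((old.strip().lower(), mod.strip().lower()))
    return pairs, skipped


# Made-up sample lines; the real data file is not assumed here
sample_lines = [
    "Where be your gibes now?\tWhere are your jokes now?",
    "",
    "a line without a tab",
]
sample_pairs, num_skipped = parse_pairs(sample_lines)
print(sample_pairs, num_skipped)
```

If the data file is clean, this produces the same pairs as the cell above, apart from the added whitespace stripping.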

In [6]:
#look at some examples
for _ in range(5):
    print(random.choice(text_pairs))

('where be your gibes now?', 'where are your jokes now?')
("think yourself a baby, that you have ta'en these tenders for true pay, which are not sterling.", 'think that you are a baby, that you have taken these offers for true love, which are not true offers.')
('well, i am glad that all things sort so well.', 'well, i’m glad that everything has been sorted out.')
('let him greet england with our sharp defiance.', 'tell him to greet the king of england with our sharp defiance.')
("here comes the fool, i' faith.", 'look, here comes the fool.')


In [7]:
#Let's create some splits
random.shuffle(text_pairs)
num_val_samples = int(0.06 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

19088 total pairs
16798 training pairs
1145 validation pairs
1145 test pairs


Note that if we hold out 6% each for validation and test (about 1,100 pairs each), we have almost 17,000 sentence pairs for training. This is still a small amount of data for training a language model from scratch, so we shouldn't have very high expectations, but it will be interesting to try.
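
The split sizes printed above follow directly from the arithmetic in the split cell. The sketch below reproduces that arithmetic with a hypothetical helper, `split_sizes`, assuming the same 6% holdout fraction for both validation and test.

```python
def split_sizes(total, holdout_fraction=0.06):
    """Return (train, val, test) sizes using the same arithmetic as the split cell."""
    num_val = int(holdout_fraction * total)   # rounds down
    num_test = num_val                        # test set is the same size as validation
    num_train = total - num_val - num_test
    return num_train, num_val, num_test


print(split_sizes(19088))  # (16798, 1145, 1145)
```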

### 2.3 Define vocabulary and tokenizer

We'll want a new tokenizer for our data, so that we can use a smaller vocabulary than typical pre-trained models. We're going to use a T5 model as the basis of our model architecture, but we'll make a much smaller version of it to train from scratch only on our dataset. We'll need to use a similar tokenizer to go with it.

T5 models use one vocabulary for both encoder and decoder, so we'll combine the text from our Shakespearean and Modern English (they do share a lot of words in common). The easiest thing to do is to load an existing T5 tokenizer, then retrain a new version of it with our own text and chosen vocab size.

In [8]:
# The size of our vocabulary covers both languages
VOCAB_SIZE = 15000
MAX_SEQUENCE_LENGTH = 40


def get_word_piece_tokenizer(text_samples, vocab_size):

    base_tokenizer = T5TokenizerFast.from_pretrained('t5-base')
    new_tokenizer = base_tokenizer.train_new_from_iterator(
        text_samples,
        vocab_size=vocab_size  # use the function argument rather than the global constant
    )

    return new_tokenizer

In [9]:
shakespeare_samples = [text_pair[0] for text_pair in train_pairs]
modern_samples = [text_pair[1] for text_pair in train_pairs]

t5_tokenizer = get_word_piece_tokenizer(shakespeare_samples + modern_samples, VOCAB_SIZE)

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

In [10]:
print("Vocab Tokens: ", t5_tokenizer.decode(range(110, 130)))

Vocab Tokens:  i of my?i that a ins is your! for be not with have he this it


In [11]:
old_input_ex = [text_pairs[1][0]]
old_tokens_ex = t5_tokenizer.batch_encode_plus(old_input_ex)
print("Shakespearean English sentence: ", old_input_ex)
print("Tokens: ", old_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    t5_tokenizer.batch_decode(old_tokens_ex['input_ids']),
)

print()

mod_input_ex = [text_pairs[1][1]]
mod_tokens_ex = t5_tokenizer.batch_encode_plus(mod_input_ex)
print("Modern English sentence: ", mod_input_ex)
print("Tokens: ", mod_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    t5_tokenizer.batch_decode(mod_tokens_ex['input_ids']),
)

Shakespearean English sentence:  ['i say again, give out that anne my queen is sick and like to die.']
Tokens:  {'input_ids': [[110, 187, 351, 103, 209, 192, 115, 4546, 112, 586, 119, 655, 109, 169, 107, 329, 105, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
Recovered text after detokenizing:  ['i say again, give out that anne my queen is sick and like to die.</s>']

Modern English sentence:  ['i repeat, spread the rumor that anne, my wife, is sick and likely to die.']
Tokens:  {'input_ids': [[110, 4301, 103, 3106, 106, 2832, 115, 4546, 103, 112, 396, 103, 119, 655, 109, 1951, 107, 329, 105, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
Recovered text after detokenizing:  ['i repeat, spread the rumor that anne, my wife, is sick and likely to die.</s>']
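
Both example sequences above are well under `MAX_SEQUENCE_LENGTH = 40`, but longer sentences will be truncated. As a rough sketch, the hypothetical helper `truncation_stats` below counts how many tokenized sequences exceed a given maximum length. It is demonstrated on the two ID lists printed above. In the notebook it could presumably be applied to something like `t5_tokenizer(shakespeare_samples)['input_ids']`, which is an assumption about usage rather than a step from the original.

```python
def truncation_stats(token_id_lists, max_length):
    """Summarize how many sequences are longer than max_length."""
    lengths = [len(ids) for ids in token_id_lists]
    num_truncated = sum(length > max_length for length in lengths)
    return {
        "max_len": max(lengths),
        "num_truncated": num_truncated,
        "fraction_truncated": num_truncated / len(lengths),
    }


# The two token ID sequences shown in the output above
example_ids = [
    [110, 187, 351, 103, 209, 192, 115, 4546, 112, 586, 119, 655, 109, 169, 107, 329, 105, 1],
    [110, 4301, 103, 3106, 106, 2832, 115, 4546, 103, 112, 396, 103, 119, 655, 109, 1951, 107, 329, 105, 1],
]
print(truncation_stats(example_ids, max_length=40))
```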


### 2.4 Format Datasets

Let's turn our data into a Hugging Face dataset, so that we can work with it similarly to earlier lesson notebooks. We'll need to write a preprocess function to convert the input and output texts into sequences of vocab IDs, using the tokenizer we just made. Then we'll map the preprocess function to the dataset.

In [12]:
def make_dataset(pairs):
    org_texts, mod_texts = zip(*pairs)
    org_texts = list(org_texts)
    mod_texts = list(mod_texts)

    dataset = Dataset.from_dict({"shakespeare": org_texts, "modern": mod_texts})
    return dataset.shuffle()

#make the training data
train_ds = make_dataset(train_pairs)

#make the validation data
val_ds = make_dataset(val_pairs)

In [13]:
def preprocess_batch(batch_text_pairs):
    shakespeare_encoded = t5_tokenizer.batch_encode_plus(
        batch_text_pairs["shakespeare"],
        max_length=MAX_SEQUENCE_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    modern_encoded = t5_tokenizer.batch_encode_plus(
        batch_text_pairs["modern"],
        max_length=MAX_SEQUENCE_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    return {'input_ids': shakespeare_encoded['input_ids'],
            'labels': modern_encoded['input_ids']}

In [14]:
train_ds = train_ds.map(preprocess_batch, batched=True)
val_ds = val_ds.map(preprocess_batch, batched=True)

Map:   0%|          | 0/16798 [00:00<?, ? examples/s]

Map:   0%|          | 0/1145 [00:00<?, ? examples/s]
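
Note that `preprocess_batch` pads the labels with the tokenizer's pad token ID, so padded positions count toward the training loss. Hugging Face seq2seq models skip label positions set to `-100` when computing the loss, so a common variant replaces padding IDs in the labels with that value. The sketch below shows the idea on plain Python lists. The helper name `mask_padding_in_labels` and the pad ID of 0 are assumptions for illustration, and this is not what the notebook's training run used, so applying it would change the loss values reported later.

```python
IGNORE_INDEX = -100  # label value that the loss skips in Hugging Face seq2seq models


def mask_padding_in_labels(label_ids, pad_token_id):
    """Replace padding IDs in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in seq]
        for seq in label_ids
    ]


# Made-up padded label sequences, assuming a pad token ID of 0
padded_labels = [[110, 329, 105, 1, 0, 0], [110, 1, 0, 0, 0, 0]]
masked_labels = mask_padding_in_labels(padded_labels, pad_token_id=0)
print(masked_labels)
```

In the notebook, the real pad ID would come from `t5_tokenizer.pad_token_id`, and the function would be applied to the label IDs as lists before returning them from the preprocess function.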

### 2.5 Define the model

Hugging Face allows us to use an existing type of model architecture, but to load a randomly initialized version with no pretrained weights. To do so, we'll load a model from a config, instead of from a pretrained checkpoint.

Since this will be a new untrained model, we can specify some of the dimensions that we want in the config, to fit our specific project. Let's make a very small version of a T5 model, with our small vocabulary and an embedding dimension of 256, half the `T5Config` default of 512. We keep the default intermediate (feed-forward) dimension of 2048 and use only a single transformer layer in both the encoder and decoder.

(In assignment 3, you will explore these parameters more, including adding more transformer layers.)

In [15]:
# Define some hyperparameter values for our transformer model
EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8
NUM_LAYERS = 1

# Also define some training parameters we'll use next
BATCH_SIZE = 64
EPOCHS = 25  # Should be at least 25 to converge; takes 5-6 mins to train


t5_config = T5Config(
    vocab_size=VOCAB_SIZE,
    d_model=EMBED_DIM,
    d_ff=INTERMEDIATE_DIM,
    num_heads=NUM_HEADS,
    num_layers=NUM_LAYERS,
    decoder_start_token_id=t5_tokenizer.pad_token_id
)

In [16]:
t5_model = T5ForConditionalGeneration(config=t5_config)

### 2.6 Train the model

To train a Hugging Face model, as we did with BERT, we'll create a Trainer and its associated TrainingArguments. This time we need the Seq2Seq versions of both.

In [18]:
args = Seq2SeqTrainingArguments(
    "shakespeare_translation_model",
    eval_strategy='epoch',
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    report_to='none'
)

In [19]:
trainer = Seq2SeqTrainer(
    t5_model,
    args,
    train_dataset=train_ds,
    eval_dataset=val_ds
)

Training this small model on our dataset will take about 5 minutes on a Colab T4 GPU. Note that we're training for quite a few epochs to get the model to start to pick up the task, since the model has not been pre-trained in any way. We might stop early in the live session if we're running low on time.

In [20]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 2.603108 |
| 2 | 2.637300 | 2.449992 |
| 3 | 2.637300 | 2.357891 |
| 4 | 2.326000 | 2.288018 |
| 5 | 2.326000 | 2.229813 |
| 6 | 2.179700 | 2.180087 |
| 7 | 2.179700 | 2.133529 |
| 8 | 2.100500 | 2.094637 |
| 9 | 2.100500 | 2.060402 |
| 10 | 2.012900 | 2.031612 |


TrainOutput(global_step=6575, training_loss=1.9973726291438927, metrics={'train_runtime': 343.7513, 'train_samples_per_second': 1221.668, 'train_steps_per_second': 19.127, 'total_flos': 370074184704000.0, 'train_loss': 1.9973726291438927, 'epoch': 25.0})
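
The `global_step=6575` in the output above can be checked by hand. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. The helper name `total_training_steps` is hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Return (steps per epoch, total optimizer steps) for a simple training setup."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# 16,798 training pairs, BATCH_SIZE = 64, EPOCHS = 25
print(total_training_steps(16798, 64, 25))  # (263, 6575)
```

With about 263 steps per epoch, this is also consistent with the "No log" entry for epoch 1 and the repeated training-loss values: if the training loss is only logged every 500 steps (the `Trainer` default, as far as we know), a new value appears roughly every two epochs.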

### 2.7 Generate and examine some test sentences

Finally, let's write a function to generate some translations of new inputs. We'll use the model's .generate() method and the tokenizer's .batch_decode() method. Hugging Face text generation models have multiple options built into their .generate() method, including beam search or top-k/top-p sampling, constraints on repeated n-grams, min and max length constraints for the output, etc. We'll start with simple defaults here that keep the generation loop relatively fast. You'll explore these options more in Assignment 3.

In [21]:
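def generate_output(input_sentences):
    # Pad so that a batch of sentences with different lengths can be stacked into one tensor
    inputs_encoded = t5_tokenizer(input_sentences, padding=True, return_tensors='pt')
    # Move the inputs to the model's device instead of assuming a GPU is present
    inputs_encoded = inputs_encoded.to(t5_model.device)
    output_ids = t5_model.generate(
        inputs_encoded['input_ids'],
        attention_mask=inputs_encoded['attention_mask'],  # ignore padded positions
        num_beams=1,              # greedy decoding
        no_repeat_ngram_size=4    # never generate the same 4-gram twice
    )

    generated_sentences = t5_tokenizer.batch_decode(output_ids,
                                                    skip_special_tokens=True,
                                                    clean_up_tokenization_spaces=False)
    return generated_sentences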

In [22]:
test_org_texts = [pair[0] for pair in test_pairs]
for i in range(4):
    input_sentence = random.choice(test_org_texts)
    translated = generate_output([input_sentence])
    translated = translated[0]
    print(f"** Example {i} **")
    print(input_sentence)
    print(translated)
    print()

** Example 0 **
your majesty shall mock at me.
your your your your at at at at your your your.

** Example 1 **
she was the wife of caius marcellus.
she was the was was was was of a good she was was was the of the of a of

** Example 2 **
the king doth wake tonight and takes his rouse, keeps wassail, and the swaggering upspring reels, and as he drains his draughts of rhenish down, the kettle-drum and trumpet thus bray out the triumph of his pledge.
the king, and the king, he was his his his his and the king of his his his

** Example 3 **
the ladies, her attendants of her chamber saw her abed, and in the morning early they found the bed untreasured of their mistress.
they have the king, and her her her her and they have her her her, they have the



This doesn't look very good yet, but that's to be expected. It's very difficult to get an NLP model to do well on a complex task without pre-training it on huge amounts of raw text. In Assignment 3, we'll try a few more options to see if we can get this model to do at least a little better on this task.
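
To build some intuition for the `no_repeat_ngram_size=4` setting used in the generation function, here is a simplified pure-Python sketch of the idea behind it. It is not the library's implementation, and the helper name `banned_next_tokens` is hypothetical. Given the tokens generated so far, it returns the tokens that would complete an n-gram that has already appeared.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the set of next tokens that would repeat an existing n-gram."""
    if ngram_size <= 0 or len(generated) + 1 < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram we are about to complete
    prefix = tuple(generated[len(generated) - (ngram_size - 1):]) if ngram_size > 1 else ()
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# After three repeats a fourth is still allowed; after four, a fifth is blocked
print(banned_next_tokens(["your"] * 3, 4))  # set()
print(banned_next_tokens(["your"] * 4, 4))  # {'your'}
```

Assuming each repeated word is a single token, this matches the runs of at most four identical words in the examples above: the constraint stops a fifth repeat, but it does nothing to make the model choose a sensible alternative.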

Remember that training a model from scratch is not something you would normally do, but it can be a useful educational exercise to better understand the starting point of these models and what it takes for them to learn language processing tasks.


**QUESTION 2**: What things could we do to improve the output?
* add more sentence pairs
* ensure a good distribution over all the sentence lengths
* add another transformer layer to encoder and decoder
* change the generation hyperparameters
* ???

[Return to Top](#returnToTop)  
<a id = 'answers'></a>

## ANSWERS

1.  The T5 model doesn't have the token type ids that BERT uses to identify different segments.

2.  The first two suggestions, more sentence pairs and a better balance of sentence lengths, are a good start.  More and better data typically lead to improved performance.  We might also look into separately "pre-training" our encoder and decoder with their own language models.  We could then use those pre-trained models as a foundation on which we train our connected encoder and decoder.  We could also look into using back-translation to augment our existing sentence pairs.