# Assignment 3: Machine Translation with T5

**Description:** This assignment notebook builds on the material from the
[lesson 6 notebook](https://github.com/datasci-w266/2025-fall-main/blob/master/materials/lesson_notebooks/lesson_6_Machine_Translation_With_Transformer.ipynb), in which we set up a new, very small version of a T5 encoder-decoder model to train from scratch on translations from Shakespearean to Modern English. Since the model was trained from scratch, it didn't work very well. In this notebook, we'll first try to make that model work a little better by changing the model configuration and output generation parameters. Then we'll fine-tune a small pre-trained T5 model on this task, to see how much better we can do with even a small pre-trained model. We'll apply several evaluation metrics, find some trade-offs, and try adding a secondary dataset to address some of the remaining challenges.

This notebook should be run on Google Colab with a GPU. By default, when you open the notebook in Colab it will try to use a GPU. Because Colab provides free GPU access, it places limits on that access. You might therefore want to turn off the GPU (Edit -> Notebook Settings) while editing and initially debugging your code (at least the setup before you train each model). You will need a GPU to fully train or evaluate each of the models. Total runtime of the entire notebook (with solutions and a Colab GPU) should be about 1-2 hours, but potentially more depending on how much you experiment. If Colab tells you that you have reached your GPU limit, wait 10-24 hours and you should be able to access a GPU again.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2024-fall-main/blob/master/assignment/a3/Machine_Translation_T5.ipynb)

The overall assignment structure is as follows:


0. Setup
  
  0.1 Libraries

  0.2 Data Acquisition

  0.3 Data Preparation


1. Tiny Seq2Seq Model Trained From Scratch
  
  1.1 Tokenizer and Model Setup

  1.2 Experimenting with Model Dimensions

  1.3 Text Generation Parameters

  1.4 Test Set Evaluation Metrics

2. Small Pre-Trained T5 Model

  2.1 Pre-Trained Model Setup and Tokenization

  2.2 Fine-Tuning the Pre-Trained Model

  2.3 Fine-Tuned Model Evaluation

  2.4 Style Classifier

  2.5 Revisit Decoder .Generate() Options

3. Adding Supplementary Paraphrase Dataset

  3.1 Load and preprocess the supplemental dataset

  3.2 Train T5 on Paraphrasing Task

  3.3 Fine-Tune Paraphrase-Trained Model on Main Task
  
  3.4 Paraphrase-Trained Model Evaluation

## 0. Setup

### 0.1 Libraries

In [None]:
# !pip install -q -U transformers
# !pip install -q -U datasets
# !pip install -q -U evaluate
# !pip install -q -U tokenizers

In [1]:
import re
import random
import numpy as np
from scipy.special import softmax

import torch
import transformers
import evaluate
from datasets import Dataset, load_dataset

# For from-scratch T5 model
from transformers import T5TokenizerFast, T5Config, T5ForConditionalGeneration

# For pre-trained T5 model
from transformers import T5Tokenizer, T5ForConditionalGeneration  # this won't import twice, just noting here what's for each model

# For all T5 models
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# For BLEURT (to load a trained model for evaluation)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# For style classifier model (also for evaluating the seq2seq model output)
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import TrainingArguments, Trainer



### 0.2 Data Acquisition

We'll use the Shakespeare-to-Modern-English translation dataset from Lesson 6. The data includes aligned sentences from a number of plays by William Shakespeare.

The data was copied from this repo, [https://github.com/cocoxu/Shakespeare](https://github.com/cocoxu/Shakespeare), and consolidated into one file for easier handling.

You will need to grab a copy from our git repo and import it to your Google Drive. From there you'll be able to easily load it into a Colab notebook.

In [None]:
# # This cell will authenticate you and mount your Drive in the Colab.
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
# Check the working directory and GPU
import os
print("Current working directory:", os.getcwd())
print("GPU available:", torch.cuda.is_available())
print("GPU device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU found")

Current working directory: /share/crsp/lab/pkaiser/ddlin/mids/datasci-266/2025-fall-main/assignment/a3
GPU available: True
GPU device name: NVIDIA A30


In [4]:
# # Modify this path to the appropriate location in your Drive
# text_file = 'drive/MyDrive/ISchool/MIDS/266/data/train_plays-org-mod.txt'
text_file = 'train_plays-org-mod.txt'

### 0.3 Data Preparation

Each line contains a Shakespearean sentence and its corresponding modern English translation.

The Shakespearean sentence is the *source sequence* and the modern English one is the *target sequence*.

In [6]:
# Read the Shakespeare-to-Modern-English translation data from the text file
with open(text_file, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]  # Drop the last element (empty if the file ends with a newline)

text_pairs = []
for line in lines:
    # Each line is "<Shakespearean sentence>\t<Modern English translation>"
    old, mod = line.split("\t")
    text_pairs.append((old, mod))  # Add the pair as a tuple to the list

In [7]:
# Look at some examples
for _ in range(5):
    print(random.choice(text_pairs))

('Be aidant and remediate In the good man’s distress.', 'May they relieve a sick old man’s suffering.')
('Go softly on.', 'Go softly on.')
('I do follow here in the chase, not like a hound that hunts, but one that fills up the cry.', 'I followed you here in the chase, not like a hound that hunts, but like the hunted by the hound.')
('Belovèd Regan, Thy sister’s naught.', 'My dear Regan, your sister’s not worth anything.')
('Farewell!', 'Goodbye!')


In [None]:
# Shuffle in-place so splits are random (not reproducible unless we set a seed)
random.shuffle(text_pairs)

# Reserve ~6% for validation and ~6% for test; the rest for training.
# Note: int() floors; with small datasets this can make val/test = 0.
num_val_samples = int(0.06 * len(text_pairs))

# Train gets what's left after taking 2 * num_val_samples for val+test
num_train_samples = len(text_pairs) - 2 * num_val_samples

# Slice the shuffled list into splits
train_pairs = text_pairs[:num_train_samples]
val_pairs   = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs  = text_pairs[num_train_samples + num_val_samples :]

# Report counts
print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")


19088 total pairs
16798 training pairs
1145 validation pairs
1145 test pairs
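
The split sizes reported above follow directly from the 6% rule used in the cell. As a quick sanity check, the arithmetic can be reproduced on its own. This is only an illustrative sketch; `split_sizes` is a hypothetical helper and not part of the assignment code.

```python
def split_sizes(total, holdout_fraction=0.06):
    """Return (train, val, test) counts using the same rule as the split cell."""
    # int() floors, so val and test each get floor(fraction * total) pairs.
    num_val = int(holdout_fraction * total)
    num_test = num_val
    # Training keeps everything that is left over.
    num_train = total - num_val - num_test
    return num_train, num_val, num_test

train_n, val_n, test_n = split_sizes(19088)
print(train_n, val_n, test_n)
```

With a very small dataset the floor can leave the validation and test sets empty, which is the caveat noted in the cell's comments.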


As in the lesson notebook, let's create a Hugging Face dataset object from our data, so that it's easy to work with and pass to our model trainer.

In [9]:
def make_dataset(pairs):
    """
    Build a Hugging Face `Dataset` from (original, modernized) text pairs.

    Args:
        pairs: Iterable of 2-tuples (original_text, modern_text).
               Example: [("To be, or not to be", "To live or not"), ...]

    Returns:
        datasets.Dataset: A shuffled dataset with two string columns:
            - "shakespeare": original texts
            - "modern": modernized texts

    Notes:
        - Uses `Dataset.from_dict` to construct columns.
        - Calls `.shuffle()` without an explicit seed, so the row order may differ
          between runs. For reproducible shuffling, pass `seed=...` to `.shuffle()`.
        - Assumes every element of `pairs` unpacks cleanly into two strings.

    Example:
        >>> ds = make_dataset([("a", "A"), ("b", "B")])
        >>> ds.column_names
        ['shakespeare', 'modern']
        >>> len(ds)
        2
    """
    # Unzip list of (orig, modern) tuples into two parallel sequences.
    org_texts, mod_texts = zip(*pairs)

    # Materialize as lists to ensure they are concrete (not iterators).
    org_texts = list(org_texts)
    mod_texts = list(mod_texts)

    # Construct a Dataset with explicit column names.
    dataset = Dataset.from_dict({"shakespeare": org_texts, "modern": mod_texts})

    # Return a shuffled view (use seed=... here if you need determinism).
    return dataset.shuffle()

# Make the training data
train_dataset = make_dataset(train_pairs)

# Make the validation data
val_dataset = make_dataset(val_pairs)


## 1. Tiny Seq2Seq Model Trained From Scratch

As in the lesson 6 notebook, for our first model, we'll make a new tokenizer and model based on the T5 architecture, which we'll train from scratch only on our task dataset.

### 1.1 Tokenizer and Model Setup

The easiest way to make a new tokenizer is to load an existing T5 one, then call .train_new_from_iterator(), providing our own dataset and vocab size.

In [None]:
# Vocab size = how many distinct tokens your tokenizer can produce (rows in the embedding table).

# Embedding size (a.k.a. hidden size, d_model in T5) = the dimensionality of each token vector (columns in the embedding table).

VOCAB_SIZE = 15000

def get_word_piece_tokenizer(text_samples, vocab_size):
    """
    Train a new T5-style subword tokenizer from raw text.

    Args:
        text_samples: An iterable (list/generator) of strings. Each item is a training sample.
                      Large corpora can be streamed to avoid loading everything in RAM.
        vocab_size:   Target vocabulary size (e.g., 15_000). Special tokens are handled
                      by the base tokenizer and counted toward the size.

    Returns:
        transformers.T5TokenizerFast: A newly trained fast tokenizer that keeps
        T5's special tokens (e.g., <pad>, </s>) and normalization/pre-tokenization
        behavior, but with a vocabulary learned from `text_samples`.

    Notes:
        - Uses `train_new_from_iterator` on a T5 *fast* tokenizer, which trains a
          T5-compatible subword model (Unigram/SentencePiece-like) via 🤗 Tokenizers.
        - The iterator will be consumed once. If you pass a generator, it cannot be reused.
        - For reproducibility, ensure `text_samples` order is deterministic.
        - If you need custom special tokens, pass them when loading the base tokenizer
          (or add them after training with `tokenizer.add_special_tokens`).

    Example:
        >>> corpus = (line.strip() for line in open("corpus.txt"))
        >>> tok = get_word_piece_tokenizer(corpus, 15000)
        >>> tok("To be, or not to be.")
        {'input_ids': [...], 'attention_mask': [...]}
    """
    # Start from a pretrained T5 tokenizer so we inherit T5 specials & processing.
    base_tokenizer = T5TokenizerFast.from_pretrained("t5-base")

    # Train a new subword vocabulary on your samples, keeping T5 conventions.
    new_tokenizer = base_tokenizer.train_new_from_iterator(
        text_samples,
        vocab_size=vocab_size  # use the function arg (was hardcoded before)
    )

    return new_tokenizer

We train a **single tokenizer** that will be used on **both** styles (Shakespearean and Modern English). To work well, that tokenizer needs to learn subword units that cover **both domains**. Building the two lists separately and then concatenating them is simply an explicit way to train on the **union** of the two corpora.

Why this helps:

* **Shared vocabulary:** One model/tokenizer handles inputs and targets. Joint training lets it learn pieces common to both (e.g., roots, affixes), improving compression and reducing OOVs for either side.
* **Balanced coverage:** If you only fed Shakespeare (or only Modern), the tokenizer would overfit that style and fragment rare words in the other. Concatenating the two lists is the simplest way to include both. (You can also balance explicitly if one side is much bigger.)
* **Better sequence lengths:** Joint subwords lead to shorter, more consistent token sequences across styles, which helps training speed and quality.
* **Consistency for seq2seq:** In style-transfer/translation setups, sharing the tokenizer eases learning alignments between source and target tokens.

In [11]:
shakespeare_samples = [text_pair[0] for text_pair in train_pairs]
modern_samples = [text_pair[1] for text_pair in train_pairs]

part1_tokenizer = get_word_piece_tokenizer(shakespeare_samples + modern_samples, VOCAB_SIZE)





In [13]:
modern_samples[0]

'Are these the wretches that we threw dice for?'

We'll need to preprocess the data using the tokenizer. Since our task is to translate from Shakespearean to Modern English, the Shakespeare text will be our input_ids and the Modern English will be the labels we use for training and evaluation. We'll create a function to do the tokenization, and then map it over our Hugging Face datasets containing the train and validation data. The function takes a tokenizer as an argument, because later we'll use a different, pre-trained one.

In [14]:
MAX_SEQUENCE_LENGTH = 40

def preprocess_translation_batch(batch_text_pairs, tokenizer, prefix=""):
    """
    Prepare a mini-batch for seq2seq training (e.g., T5) from parallel text.

    Args:
        batch_text_pairs: A dict-like batch with two string lists of equal length:
            - "shakespeare": source texts (list[str])
            - "modern":      target texts (list[str])
        tokenizer: A Hugging Face tokenizer with `batch_encode_plus`.
                   (For T5-style models, this should already include special tokens.)
        prefix: Optional instruction/task prefix prepended to each source string.
                Common for T5 prompts (e.g., "translate Shakespearean to Modern English: ").

    Returns:
        dict with:
          - 'input_ids': Tensor[int] shape (batch, MAX_SEQUENCE_LENGTH)
                         Tokenized + padded + truncated source.
          - 'labels':    Tensor[int] shape (batch, MAX_SEQUENCE_LENGTH)
                         Tokenized + padded + truncated target.

    Notes:
        - Uses fixed-length padding ('max_length') to `MAX_SEQUENCE_LENGTH`.
        - Truncation will cut off longer samples; pick length carefully.
        - Attention masks are not returned here. Without one, the model treats
          every position, including padding, as a real token.
        - For label padding masking (-100), handle downstream (e.g., via
          data collator) if the loss expects it.
    """
    # If a task prefix is provided, prepend it to every source example.
    if prefix:
        batch_text_pairs["shakespeare"] = [prefix + text for text in batch_text_pairs["shakespeare"]]

    # Tokenize the source side (inputs to the encoder).
    shakespeare_encoded = tokenizer.batch_encode_plus(
        batch_text_pairs["shakespeare"],
        max_length=MAX_SEQUENCE_LENGTH,  # hard cap on tokenized length, hardcoded here
        padding='max_length',            # pad to exactly MAX_SEQUENCE_LENGTH
        truncation=True,                 # truncate sequences longer than the cap
        return_tensors='pt'              # return PyTorch tensors
    )

    # Tokenize the target side (labels for the decoder).
    modern_encoded = tokenizer.batch_encode_plus(
        batch_text_pairs["modern"],
        max_length=MAX_SEQUENCE_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    # Return only IDs expected by most Trainer setups:
    #  - input_ids feed the encoder
    #  - labels are the teacher-forced targets for the decoder
    return {
        'input_ids': shakespeare_encoded['input_ids'],
        'labels': modern_encoded['input_ids']
    }
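
To make the fixed-length padding, truncation, and the `-100` label masking mentioned in the docstring concrete, here is a small pure-Python sketch on toy token IDs. It is an illustration only: `pad_or_truncate`, `attention_mask_from`, and `mask_label_padding` are hypothetical helpers, the pad ID of 0 is assumed to match T5's `<pad>` token, and the real tokenizer additionally takes care of the end-of-sequence token.

```python
PAD_ID = 0           # assumed to match T5's <pad> token ID
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut `ids` to `max_length`, then right-pad with `pad_id` up to `max_length`."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

def attention_mask_from(ids, pad_id=PAD_ID):
    """1 for real tokens, 0 for padding positions."""
    return [0 if token == pad_id else 1 for token in ids]

def mask_label_padding(label_ids, pad_id=PAD_ID):
    """Replace padding in the labels so those positions do not contribute to the loss."""
    return [IGNORE_INDEX if token == pad_id else token for token in label_ids]

# Toy token IDs (made up for illustration; 1 stands in for the end-of-sequence ID).
source = pad_or_truncate([71, 12, 305, 1], max_length=6)
labels = mask_label_padding(pad_or_truncate([88, 9, 1], max_length=6))
print(source)
print(attention_mask_from(source))
print(labels)
```

As far as the code shown in this notebook goes, `preprocess_translation_batch` does not apply this masking itself, so padded label positions are likely included in the reported loss unless a data collator handles them.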


In [15]:
train_ds_part1 = train_dataset.map(preprocess_translation_batch, batched=True,
                                   fn_kwargs={'tokenizer': part1_tokenizer})
val_ds_part1 = val_dataset.map(preprocess_translation_batch, batched=True,
                               fn_kwargs={'tokenizer': part1_tokenizer})

Map: 100%|██████████| 16798/16798 [00:02<00:00, 7654.01 examples/s]
Map: 100%|██████████| 1145/1145 [00:00<00:00, 5967.26 examples/s]


We'll need to create the new model from a config, specifying the model's dimensions. Then we'll need to make training arguments and trainer objects to be able to train the model. Let's create a function for each of those purposes, so that later we can use the functions to experiment with the available options.

First, make a function to create the model config and the model itself. Use the Lesson 6 notebook as a guide, and make sure to include all of the arguments that we've included in the function definition below. Those are what you'll experiment with next.

In [17]:
"""
Fill in the code to create a T5Config and new T5 model, using all of the function arguments
"""

def create_from_scratch_model(num_layers, embed_dim, keyvalue_dim, dense_dim, num_heads):
    # Build a minimal-but-complete T5 configuration using your hyperparameters.
    
    """
    Args:
        num_layers: Number of Transformer blocks in both encoder and decoder (T5 shares this count).
        embed_dim:  Model hidden size (T5 `d_model`); also the token embedding width.
        keyvalue_dim: Dimension per-head for key/value projections (`d_kv` in T5Config).
        dense_dim:  Feed-forward (MLP) hidden size (T5 `d_ff`).
        num_heads:  Number of attention heads. T5 projects to `num_heads * keyvalue_dim`, which need not equal `embed_dim`.

    Returns:
        A `T5ForConditionalGeneration` initialized from a fresh `T5Config`.

    Notes:
        - `vocab_size` is taken from the global VOCAB_SIZE you defined earlier to match your tokenizer.
        - `decoder_start_token_id` is set to 0 (T5’s default pad token id). If you have a tokenizer,
        you can later do: `model.config.decoder_start_token_id = tokenizer.pad_token_id`.
        - If you change your tokenizer’s size, call `model.resize_token_embeddings(len(tokenizer))`.
    """
    
    t5_config = T5Config(
        vocab_size=VOCAB_SIZE,        # must match your trained tokenizer
        d_model=embed_dim,            # hidden size
        d_ff=dense_dim,               # feed-forward width
        num_layers=num_layers,        # encoder/decoder depth
        num_heads=num_heads,          # attention heads
        d_kv=keyvalue_dim,            # per-head key/value size
        decoder_start_token_id=0      # T5 default (<pad>=0); can be overwritten later
    )

    # Create an untrained (randomly initialized) seq2seq model from the config.
    t5_model = T5ForConditionalGeneration(config=t5_config)
    return t5_model

We'll also need to specify training arguments and a trainer for our model. Use the Seq2SeqTrainingArguments and Seq2SeqTrainer classes imported at the top of this notebook. You can use the Lesson 6 notebook as a guide for this too.

In [18]:

def create_seq2seq_training_args(batch_size, num_epochs):
    
    """
    Create Hugging Face Seq2SeqTrainingArguments with fixed fields matching the template.

    Args:
        batch_size: Per-device batch size for both training and evaluation.
        num_epochs: Total number of training epochs.

    Returns:
        A `Seq2SeqTrainingArguments` object configured as in the provided template.
    """
    training_args = Seq2SeqTrainingArguments(
        output_dir="shakespeare_translation_model",  # checkpoints/logs directory
        eval_strategy="epoch",                       # evaluate at end of each epoch
        per_device_train_batch_size=batch_size,      # train batch size per device
        per_device_eval_batch_size=batch_size,       # eval batch size per device
        num_train_epochs=num_epochs,                 # total epochs
        report_to="none"                             # disable external loggers (W&B/Comet)
    )
    return training_args

In [19]:

def create_seq2seq_trainer(model, training_args, train_ds, val_ds):
    
    """
    Create a Seq2SeqTrainer that wires model, args, and datasets together.

    Args:
        model: A `T5ForConditionalGeneration` (or compatible seq2seq) model.
        training_args: The `Seq2SeqTrainingArguments` returned above.
        train_ds: Tokenized training dataset (must provide 'input_ids' and 'labels').
        val_ds: Tokenized validation dataset (same feature keys as train_ds).

    Returns:
        A `Seq2SeqTrainer` ready to `.train()`.
    """
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=val_ds
    )
    return trainer

### 1.2 Experimenting with Model Dimensions

In the Lesson 6 Notebook, we created a very small T5-style model with just one transformer layer and smaller dimensions for some of the internal layers. Now, you'll explore these options yourself, to see if you can get the model to work a little better when trained on this task.

Without adding any additional training data, can we configure the model to perform better when trained on this task? What happens if we add another one or more transformer layers to the encoder and decoder, or make some of the internal dimensions smaller or larger?

The T5Config gives us several hyperparameters to adjust the model's parameter dimensions. You can see the available arguments and their default values in the [T5Config documentation](https://huggingface.co/docs/transformers/v4.46.3/en/model_doc/t5#transformers.T5Config).

We'll give you the batch size and num_epochs:


In [22]:
part1_batch_size = 64
part1_num_epochs = 30

Now you decide the rest.

Try changing the values for *num_layers* (number of transformer blocks), *d_model* (size of embedding and pooler layers), *d_kv* (size of query, key, and value vectors per attention head), *num_heads* (the number of attention heads), and *d_ff* (size of feed forward layers after each attention layer).
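
To get a feel for how these settings trade off against each other, it can help to estimate the parameter count before training. The sketch below is a rough estimate rather than an official formula: `estimate_t5_params` is a hypothetical helper that assumes the default (non-gated) feed-forward layers, tied input and output embeddings, 32 relative-attention buckets, and the same depth for encoder and decoder.

```python
def estimate_t5_params(vocab_size, d_model, d_kv, num_heads, d_ff, num_layers,
                       num_buckets=32):
    """Rough parameter count for a T5-style encoder-decoder (assumptions in the text above)."""
    inner_dim = num_heads * d_kv            # width of the Q/K/V projections
    attention = 4 * d_model * inner_dim     # q, k, v, and output projections (no biases)
    feed_forward = 2 * d_model * d_ff       # two linear layers (no biases)
    layer_norm = d_model                    # T5 layer norm has a scale vector only
    embeddings = vocab_size * d_model       # shared by encoder, decoder, and output head
    relative_bias = num_buckets * num_heads # stored once per stack, in the first block

    encoder_layer = attention + feed_forward + 2 * layer_norm
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm  # self- plus cross-attention

    encoder = num_layers * encoder_layer + relative_bias + layer_norm  # plus final layer norm
    decoder = num_layers * decoder_layer + relative_bias + layer_norm
    return embeddings + encoder + decoder

# Example: a 3-layer model with a 15,000-token vocabulary.
print(estimate_t5_params(vocab_size=15000, d_model=256, d_kv=32,
                         num_heads=8, d_ff=1024, num_layers=3))
```

Under these assumptions, the settings of the public `t5-small` config (vocabulary 32128, `d_model` 512, `d_kv` 64, 8 heads, `d_ff` 2048, 6 layers) come out at roughly 60.5 million parameters. For an instantiated model, `sum(p.numel() for p in model.parameters())` gives the exact count.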

Find hyperparameters that finish training 30 epochs in 10-20 minutes on a free Colab T4 GPU, and that give you as low a validation loss as you can, at least below 1.8. Also try to do this without overwhelming overfitting, i.e. try to keep training_loss / validation_loss > 0.6 after 30 epochs.

Then answer the questions below.

In [23]:
"""
Define the values you want to use for num_layers, d_model, d_kv, num_heads, and d_ff, for the T5Config below.
"""

### YOUR CODE HERE

embed_dim   = 256     # d_model
keyvalue_dim= 32      # d_kv (per-head size; here 256 / 8, so num_heads * d_kv == d_model)
num_heads   = 8       # number of attention heads
dense_dim   = 1024    # d_ff ≈ 4 * d_model
num_layers  = 3       # encoder/decoder depth (balanced speed vs quality)


### END YOUR CODE

In [24]:
part1_model = create_from_scratch_model(num_layers, embed_dim, keyvalue_dim, dense_dim, num_heads)
part1_training_args = create_seq2seq_training_args(part1_batch_size, part1_num_epochs)
part1_trainer = create_seq2seq_trainer(part1_model, part1_training_args,
                                       train_ds_part1, val_ds_part1)

part1_trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,2.569462
2,2.686400,2.430478
3,2.686400,2.343031
4,2.375000,2.272528
5,2.375000,2.210593
6,2.249500,2.165639
7,2.249500,2.122116
8,2.147800,2.079524
9,2.147800,2.043637
10,2.068100,2.014134


TrainOutput(global_step=7890, training_loss=1.9883010999633635, metrics={'train_runtime': 491.8976, 'train_samples_per_second': 1024.482, 'train_steps_per_second': 16.04, 'total_flos': 666334785945600.0, 'train_loss': 1.9883010999633635, 'epoch': 30.0})
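
The `global_step` value and the "No log" entry for epoch 1 can be checked with a little arithmetic. The sketch below assumes a single GPU, no gradient accumulation, and the Trainer's default of logging the training loss every 500 steps; `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

per_epoch = steps_per_epoch(16798, 64)   # training pairs, part1_batch_size
total_steps = per_epoch * 30             # part1_num_epochs

LOGGING_STEPS = 500                      # assumed Trainer default
first_logged_epoch = math.ceil(LOGGING_STEPS / per_epoch)

print(per_epoch, total_steps, first_logged_epoch)
```

Under these assumptions the total agrees with `global_step=7890` in the output. It also suggests why epoch 1 shows "No log" and why the training loss column repeats: the loss is only recorded every 500 steps, which is a little less than two epochs.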

**QUESTION:**

 1.a What is the final validation loss that you were able to achieve for the part1 model after training for 30 epochs? (Copy and paste the decimal value for the final validation loss, to 5 significant digits, e.g. a number like 0.56781 or 0.87632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
 - 1.8061

**QUESTION:**

 1.b Which model config parameters (if any) did you increase, to achieve a lower validation loss, while staying within the training time and overfitting guidelines? (List the names of the parameters you increased, e.g. embed_dim, keyvalue_dim, num_heads, dense_dim, num_layers. Put this list in square brackets in the answers file.)
 - num_layers

**QUESTION:**

 1.c Which model config parameters (if any) did you decrease, to achieve a lower validation loss, while staying within the training time and overfitting guidelines? (List the names of the parameters you decreased, e.g. embed_dim, keyvalue_dim, num_heads, dense_dim, num_layers. Put this list in square brackets in the answers file.)
  - dense_dim

In [25]:
"""
Before moving on, save a checkpoint of the model you just trained in your Drive,
so that you can pick up where you left off later if needed.
"""

# Modify this path to the location in your Drive where you want to save the part1 model
# part1_model_checkpoint_filepath = 'drive/MyDrive/ISchool/MIDS/266/model_checkpoints/part1_model'
part1_model_checkpoint_filepath = 'model_checkpoints/part1_model'

In [27]:
# Run this line only after you've trained the part1 model
part1_model.save_pretrained(part1_model_checkpoint_filepath)

In [22]:
# Run this line only if you need to reload the model you trained earlier
part1_model = T5ForConditionalGeneration.from_pretrained(part1_model_checkpoint_filepath)

### 1.3 Text Generation Parameters

Cross-entropy loss is great for training, but it's not a very interpretable metric for manually reviewing how well the model is doing as we experiment with available options. Ultimately, we want to actually look at the translations the model outputs, compare them to human translations, and potentially judge other aspects of the actual output.

To do that, we need to actually generate some model output. Remember that the model itself predicts probabilities for each word in the vocabulary, based on what words have already been generated, at each decoder time-step. In order to select which actual words to output, there are multiple decoding strategies we can use that are built on top of the model's predicted probabilities (e.g., beam search, top-k or top-p sampling, repeat n-gram constraints, min/max length constraints, etc.).
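
To illustrate how a decoding strategy sits on top of the predicted probabilities, here is a minimal pure-Python sketch of greedy selection, top-k filtering, and top-p (nucleus) filtering for a single decoding step. The distribution is made up, the helper names are hypothetical, and this is a simplification of what `.generate()` does internally, not its actual implementation.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize their probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = ranked[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng):
    """Draw one token index from a filtered {index: probability} distribution."""
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

# Toy next-token distribution over a five-word vocabulary (made-up numbers).
vocab = ["you", "thou", "thee", "ye", "hark"]
probs = [0.50, 0.20, 0.15, 0.10, 0.05]

greedy = vocab[max(range(len(probs)), key=lambda i: probs[i])]  # always the top token
top2 = top_k_filter(probs, k=2)        # sample only among the 2 most likely tokens
nucleus = top_p_filter(probs, p=0.8)   # sample among tokens covering 80% of the mass
rng = random.Random(0)
print(greedy, sorted(top2), sorted(nucleus), vocab[sample(nucleus, rng)])
```

Greedy selection is deterministic, while the two filters only narrow the candidate set before sampling, so repeated runs can produce different outputs unless the random seed is fixed.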

Let's define a function below to generate translations for new inputs. Then we'll define another function to translate the validation set and calculate some standard evaluation metrics for translation, as well as print out some translations for manual inspection. We'll include some arguments that you'll experiment with next.

In [28]:
def generate_output(model, tokenizer, input_sentences, batch_size, **kwargs):
    """
    Generate decoded text for a list of input sentences in mini-batches.

    Args:
        model:       A seq2seq HF model with `.generate(...)` (e.g., T5ForConditionalGeneration).
        tokenizer:   Matching tokenizer; must support `__call__` and `batch_decode`.
        input_sentences: List[str] of source texts to translate/generate from.
        batch_size:  Integer batch size used to chunk `input_sentences`.
        **kwargs:    Extra arguments passed through to `model.generate(...)`
                     (e.g., num_beams, do_sample, top_p, temperature, max_new_tokens, etc.).

    Returns:
        List[str]: One decoded string per input sentence, in original order.

    Notes:
        - This moves the model to CUDA inside the loop (`model.cuda()`), which works but is
          inefficient; typically we'll move the model to device once before the loop.
        - Only `input_ids` are passed to `generate`; if your model benefits from attention
          masks, consider including `inputs_encoded['attention_mask']` (outside this function).
        - The batching loop uses `range(int(len(...) / batch_size) + 1)` to cover the final
          partial batch; we break when `start_i` exceeds the input length.
        - `skip_special_tokens=True` strips tokens like <pad>, </s>, etc.; set to False if
          we need to inspect them.
    """
    all_outputs = []

    # Iterate over contiguous chunks of size `batch_size`
    for i in range(int(len(input_sentences) / batch_size) + 1):
        start_i, end_i = i * batch_size, (i + 1) * batch_size
        if start_i >= len(input_sentences):  # stop when no items remain
            break

        # Tokenize the current chunk and return PyTorch tensors.
        # padding=True pads to the longest sequence in this mini-batch.
        inputs_encoded = tokenizer(
            input_sentences[start_i:end_i],
            padding=True,
            return_tensors='pt'
        )

        # Move model and inputs to GPU (as written). `**kwargs` forwards decoding options.
        output_ids = model.cuda().generate(
            inputs_encoded['input_ids'].cuda(),
            **kwargs
        )

        # Convert generated token IDs back to strings.
        generated_sentences = tokenizer.batch_decode(
            output_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )

        # Accumulate this batch’s outputs
        all_outputs.extend(generated_sentences)

    return all_outputs
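
The batching loop in `generate_output` can also be written with a stepped `range`, which avoids the extra iteration and the explicit `break`. The following is a small sketch of that alternative on plain lists; `iter_batches` is a hypothetical helper and is not used elsewhere in the notebook.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = ["a", "b", "c", "d", "e"]
batches = list(iter_batches(sentences, batch_size=2))
print(batches)
```

Both versions produce the same chunks, including the final partial batch; the stepped `range` simply never produces an empty chunk when the input length is an exact multiple of the batch size.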


In [None]:
# Load the BLEU metric and the trained BLEURT model for semantic similarity scoring

# BLEU (n-gram overlap metric): quick, lexical similarity score between hypothesis and reference(s).
#   - Pros: fast, standard.
#   - Cons: surface-level; penalizes valid paraphrases; insensitive to meaning.
bleu = evaluate.load("bleu")

# BLEURT (learned semantic metric): a fine-tuned transformer that scores how well a hypothesis
# matches a reference in meaning (trained with human judgments).
#   - Pros: captures semantics beyond exact n-grams; better correlation with human eval.
#   - Cons: heavier to run; model-specific; requires tokenizer+model.
bleurt_tokenizer = AutoTokenizer.from_pretrained("Elron/bleurt-base-512")
bleurt_model = AutoModelForSequenceClassification.from_pretrained("Elron/bleurt-base-512")


Downloading builder script: 5.94kB [00:00, 3.14MB/s]
Downloading extra modules: 4.07kB [00:00, 2.83MB/s]                   
Downloading extra modules: 3.34kB [00:00, 3.10MB/s]
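
To see why BLEU is described above as a surface-level metric, here is a simplified sketch of its core ingredient, the clipped n-gram precision. It deliberately leaves out the brevity penalty and the geometric mean over n-gram orders that the full metric uses, it tokenizes by whitespace only, and `clipped_precision` is a hypothetical helper rather than part of the `evaluate` library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, with counts clipped."""
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

# A valid paraphrase still loses credit for the one word that differs.
print(clipped_precision("farewell my friend", "goodbye my friend", 1))
print(clipped_precision("farewell my friend", "goodbye my friend", 2))

# Clipping stops a hypothesis from scoring well by repeating a common word.
print(clipped_precision("the the the cat", "the cat is on the mat", 1))
```

A learned metric such as BLEURT is meant to complement this kind of lexical overlap by scoring meaning rather than exact word matches.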


In [None]:
def calculate_eval_metrics(text_pairs, model, tokenizer, batch_size, prefix="", **kwargs):
    """
    Generate translations for (source, target) pairs, then compute BLEU and BLEURT.

    Args:
        text_pairs:  Iterable of (source_text, target_text) tuples.
        model:       Seq2seq HF model supporting `.generate(...)` (e.g., T5ForConditionalGeneration).
        tokenizer:   Matching tokenizer for the model.
        batch_size:  Mini-batch size used for generation and BLEURT scoring.
        prefix:      Optional prompt prefix prepended to every source (e.g., "translate ...: ").
        **kwargs:    Passed directly to `model.generate(...)` (e.g., num_beams, max_new_tokens, etc.).

    Returns:
        translations: List[str] of model outputs aligned with `text_pairs`.

    Notes:
        - BLEU here is from `evaluate.load("bleu")`. It performs its own tokenization.
          If we need strict control, pre-tokenize and pass tokens instead.
        - BLEURT is computed in mini-batches for speed. This snippet uses
          `max_length=MAX_SEQUENCE_LENGTH` when tokenizing for BLEURT; the
          `Elron/bleurt-base-512` model supports up to 512 tokens, so we may
          consider increasing this (e.g., 512) if your texts are long.
        - For GPU acceleration, move `bleurt_model` and tokenized tensors to CUDA.
          (Not done here to keep the function minimal.)
    """
    # Build source (optionally with a task prefix) and gold labels
    original_texts = [prefix + pair[0] for pair in text_pairs]
    label_texts = [pair[1] for pair in text_pairs]

    # Translate original texts using the provided generation kwargs
    translations = generate_output(model, tokenizer, original_texts, batch_size, **kwargs)

    # ---- BLEU (lexical overlap) ----
    bleu_results = bleu.compute(predictions=translations, references=label_texts)
    print('BLEU: ', bleu_results)

    # ---- BLEURT (semantic similarity) ----
    bleurt_scores = []
    # Step through the translations one mini-batch at a time.
    for start_i in range(0, len(translations), batch_size):
        end_i = start_i + batch_size

        with torch.no_grad():
            scores = bleurt_model(
                **bleurt_tokenizer(
                    label_texts[start_i:end_i],          # references
                    translations[start_i:end_i],         # hypotheses
                    truncation=True,
                    max_length=MAX_SEQUENCE_LENGTH,      # BLEURT-base supports up to 512; adjust if needed
                    padding='max_length',
                    return_tensors='pt'
                )
            )[0].squeeze(-1).numpy()

        # squeeze(-1) drops only the trailing score dimension, so `scores` stays
        # 1-D even when the last batch holds a single item.
        bleurt_scores.extend(scores)

    print('BLEURT: ', np.mean(bleurt_scores))

    return translations


First, choose some keyword arguments to pass to the generate_output() function. These can be any parameters accepted by the .generate() method (e.g. beam search, top-k or top-p sampling, no_repeat_ngram_size, etc.). Try each of the options listed in Question 1.e below so that you can answer that question; note that some of them can't be used at the same time. More info on each can be found in the [Hugging Face documentation on text generation](https://huggingface.co/docs/transformers/en/main_classes/text_generation).
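
To make one of these options concrete, here is a minimal sketch of the idea behind `no_repeat_ngram_size`: before each decoding step, ban any next token that would complete an n-gram already present in the output. The helper name `banned_next_tokens` is hypothetical, and it works on whole words for readability; the library applies the same idea to token ids during generation.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an existing n-gram if emitted next."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = generated[start:start + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


# Illustrative word-level prefix, loosely modeled on repetitive part1 output.
tokens = ["Come", ",", "Come", "on", ",", "Come"]
print(banned_next_tokens(tokens, 2))
print(banned_next_tokens(tokens, 3))
```

A smaller `ngram_size` bans more continuations: in this toy prefix, size 2 blocks both "," and "on" after "Come", while size 3 blocks only "on".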

Then run the function to translate the validation set and print out eval metrics. The function returns the translations, so we'll also print out a sample of those to manually inspect. Use what you see to iterate on the .generate() arguments, trying to find the most reasonable .generate() arguments that you can for the model you trained.

The output will not be great no matter what you do, but you should be able to make it a little more readable, with slightly better BLEU and BLEURT metrics, than the basic options specified in the Lesson 6 notebook.

Then answer the questions below.

In [None]:
"""
Fill in the decoder .generate() arguments that you want to use, like num_beams or top_p, etc.
"""

part1_generate_kwargs = {
    # Decode strategy: moderate beams for quality
    "num_beams": 4,
    "early_stopping": True,

    # Length control: cap new tokens so outputs don’t run on
    # (use new-token cap so input length doesn’t affect it)
    "max_new_tokens": 40,          # ~ encoder cap; tweak 32–64 if needed

    # Repetition controls: stop obvious loops/phrases
    "no_repeat_ngram_size": 3,     # never repeat any 3-gram; a smaller n is stricter
    "repetition_penalty": 1.3,     # soft deterrent (>1.0 penalizes repeats); 1.05–1.2 is common, 1.3 is on the strong side

    # Beam length bias: mild preference for not-too-short outputs
    "length_penalty": 1.05,        # 1.0–1.2 typical; >1 favors longer
}


part1_val_translations = calculate_eval_metrics(
    val_pairs,
    part1_model,
    part1_tokenizer,
    part1_batch_size,
    **part1_generate_kwargs
)

BLEU:  {'bleu': 0.032094072946544074, 'precisions': [0.2832250622305871, 0.06407412240636827, 0.015578739602424925, 0.003752776288580838], 'brevity_penalty': 1.0, 'length_ratio': 1.1775092936802973, 'translation_length': 16471, 'reference_length': 13988}
BLEURT:  -1.1488445
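
The fields in the BLEU dictionary above fit together in a simple way. As a minimal sketch (the helper name `bleu_from_components` is hypothetical, and it assumes the standard unsmoothed corpus BLEU with uniform weights over 1- to 4-grams), the overall score is the geometric mean of the n-gram precisions multiplied by the brevity penalty:

```python
import math


def bleu_from_components(precisions, translation_length, reference_length):
    """Combine n-gram precisions and corpus lengths into an overall BLEU score."""
    if min(precisions) <= 0:
        return 0.0
    # Geometric mean of the n-gram precisions (uniform weights).
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    # The brevity penalty only applies when the output is shorter than the reference.
    ratio = translation_length / reference_length
    brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1.0 - 1.0 / ratio)
    return geo_mean * brevity_penalty


# Values copied from the validation output above.
part1_val_bleu = bleu_from_components(
    [0.2832250622305871, 0.06407412240636827, 0.015578739602424925, 0.003752776288580838],
    translation_length=16471,
    reference_length=13988,
)
print(round(part1_val_bleu, 6))  # 0.032094
```

Because the mean is geometric, the tiny 4-gram precision (about 0.004) pulls the overall score down much more than the unigram precision (about 0.28) lifts it. That is why the score is so low even though many individual words match.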


In [39]:
# Print out a sample of outputs to manually review
for i in range(10):
    sample_i = random.choice(range(len(part1_val_translations)))
    print('Original:    ', val_pairs[sample_i][0])
    print('Reference:   ', val_pairs[sample_i][1])
    print('Translation: ', part1_val_translations[sample_i])
    print()

Original:     Lovers and madmen have such seething brains, Such shaping fantasies, that apprehend More than cool reason ever comprehends.
Reference:    Lovers and madmen have such busy brains, Such ability to shape fantasies, that they catch More than cool reason ever understands.
Translation:  I have to see that, and you have a

Original:     Thanks, Rosencrantz and gentle Guildenstern.
Reference:    Thanks, Rosencrantz and gentle Guildenstern.
Translation:  I’m not, and gentle gentle gentle it’s a good.

Original:     Come, let us go.
Reference:    Come, let’s go.
Translation:  Come, Come on, Come to the Come, let let’s

Original:     Drown thyself?
Reference:    drown yourself!
Translation:  ?

Original:     No, you shall paint when you are old.
Reference:    No, he means you’ll use makeup when you’re old.
Translation:  No, you are so that you are as you are.

Original:     What seest thou else In the dark backward and abyss of time?
Reference:    What else do You see in the old, da

**QUESTION:**

 1.d What seems to be particularly bad about the part1 model's translations? (Choose one of the following options that you agree with most and put it in the answers file.)

 - A. The model keeps repeating the same common words or phrases over and over, which don't produce very meaningful statements.

 - B. The model is generating pretty good modern English, but it's quite offensive.

 - C. The model's output has mostly the same meaning as the input, but with minor grammatical mistakes.

 - D. The model is making up elaborate narrative details that don't appear in the original text.

 -> A

**QUESTION:**

 1.e Which .generate() parameter seemed to help the most in addressing the main shortcoming(s) that you noticed in the part1 model's output? (Choose one of the following options and put it in the answers file.)

 - A. num_beams
 - B. do_sample
 - C. top_k
 - D. top_p
 - E. temperature
 - F. no_repeat_ngram_size

 -> F

### 1.4 Test Set Evaluation Metrics

Once you've settled on training hyperparameters that produce good validation loss, and generation options that produce the best output you can so far, go ahead and calculate evaluation metrics on the test set, to wrap up this from-scratch model.

Then answer the questions below.

In [40]:
# Print out eval metrics for the part1_model on the TEST split.
# - Reuses the same generation kwargs you tuned for validation (part1_generate_kwargs).
# - Returns the list of test translations so you can inspect or save them later.
part1_test_translations = calculate_eval_metrics(
    test_pairs,          # [(source, target), ...] for the test set
    part1_model,         # trained T5-style model
    part1_tokenizer,     # matching tokenizer
    part1_batch_size,    # batch size for generation/metric computation
    **part1_generate_kwargs  # e.g., num_beams, max_new_tokens, no_repeat_ngram_size, etc.
)


BLEU:  {'bleu': 0.03598463879191276, 'precisions': [0.2962437550114106, 0.06975046456065835, 0.018530489118724413, 0.004379105411323116], 'brevity_penalty': 1.0, 'length_ratio': 1.1763041427845897, 'translation_length': 16213, 'reference_length': 13783}
BLEURT:  -1.1321745


**QUESTION:**

 1.f What is the overall BLEU score that you achieved on the test set for the part1 model? (Copy and paste the decimal value for the overall BLEU score, to 5 significant digits, e.g. a number like 0.03671 or 0.09763. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
  - 0.035985

**QUESTION:**

 1.g What is the mean BLEURT score that you achieved on the test set for the part1 model? (Copy and paste the decimal value for the mean BLEURT score, to 5 significant digits, e.g. a number like -1.12345 or -0.54321. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
  - -1.13217

## 2. Small Pre-Trained T5 Model

What if we use a model that has already been pre-trained to recognize English (at least modern English), even if it hasn't yet been trained for our particular translation task?

We'll use a T5 small model, which should be able to generate good modern English, but we'll need to train it to encode and translate Shakespearean text.

### 2.1 Pre-trained Model Setup and Tokenization

The next two cells load the pre-trained model, and preprocess the data with the pre-trained tokenizer. Fill in the necessary code for each of these cells.

For preprocessing, you'll need to map the `preprocess_translation_batch` function that we created earlier over the `train_dataset` and `val_dataset`. Use the code from part 1 as an example, but now pass in the pretrained T5 tokenizer as a function keyword argument (kwarg). Also pass in the given `task_prefix` as the "prefix" kwarg for the preprocessing function.

In [41]:
"""
Load the pre-trained model and tokenizer
"""

t5_pretrained_checkpoint_name = 'google-t5/t5-small'

### YOUR CODE HERE

# Pretrained T5 tokenizer (SentencePiece). Handles special tokens like <pad>, </s>.
part2_tokenizer = T5TokenizerFast.from_pretrained(t5_pretrained_checkpoint_name)

# Pretrained seq2seq model weights (encoder–decoder). Good modern-English prior.
part2_model = T5ForConditionalGeneration.from_pretrained(t5_pretrained_checkpoint_name)

# T5 convention: decoder starts with the pad token. (Usually set in config already; this ensures it.)
part2_model.config.decoder_start_token_id = part2_tokenizer.pad_token_id

### END YOUR CODE

In [42]:
"""
Preprocess the datasets using the pretrained tokenizer, and the given task_prefix.
Use the task_prefix as the "prefix" argument to the function preprocess_translation_batch().
"""

task_prefix = 'Translate Shakespeare to Modern English: '

train_ds_part2 = train_dataset.map(
    preprocess_translation_batch,
    batched=True,
    fn_kwargs={
        # Use the pretrained tokenizer and prepend the task prefix to each source
        "tokenizer": part2_tokenizer,
        "prefix": task_prefix,
    }
)

val_ds_part2 = val_dataset.map(
    preprocess_translation_batch,
    batched=True,
    fn_kwargs={
        # Same preprocessing for validation to ensure consistency
        "tokenizer": part2_tokenizer,
        "prefix": task_prefix,
    }
)

Map: 100%|██████████| 16798/16798 [00:02<00:00, 6364.99 examples/s]
Map: 100%|██████████| 1145/1145 [00:00<00:00, 8807.39 examples/s]


### 2.2 Fine-Tuning the Pre-Trained Model

Now create the training args and trainer to fine-tune this pre-trained model. We've given you part of the code: you'll use the same functions as above for `create_seq2seq_training_args` and `create_seq2seq_trainer`. Fill in the rest of the arguments that you need for this version of the model. Use the provided batch size and num_epochs.

In [43]:
"""
Create the training args and trainer for the pre-trained model.
Use the batch size and num_epochs provided below for this model.
"""

part2_batch_size = 32
part2_num_epochs = 4

# Use the helper to build Seq2SeqTrainingArguments (same fixed fields as Part 1)
part2_training_args = create_seq2seq_training_args(
    part2_batch_size,
    part2_num_epochs
)

# Wire up the pretrained model, args, and the preprocessed datasets
part2_trainer = create_seq2seq_trainer(
    part2_model,
    part2_training_args,
    train_ds_part2,
    val_ds_part2
)

Run the cell below to fine-tune the part2 model, then answer the following questions.

In [44]:
part2_trainer.train()

Epoch,Training Loss,Validation Loss
1,0.9684,0.71193
2,0.7437,0.693941
3,0.7278,0.687807
4,0.714,0.684969


TrainOutput(global_step=2100, training_loss=0.7855644952683222, metrics={'train_runtime': 196.8031, 'train_samples_per_second': 341.417, 'train_steps_per_second': 10.671, 'total_flos': 710459869102080.0, 'train_loss': 0.7855644952683222, 'epoch': 4.0})

**QUESTION:**

 2.a What is the final validation loss that you were able to achieve for the part2 model after training for 4 epochs? (Copy and paste the decimal value for the final validation loss, to 5 significant digits, e.g. a number like 0.56781 or 0.87632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
  - 0.68497

In [None]:
"""
Before moving on, save a checkpoint of the model you just trained in your Drive,
so that you can pick up where you left off later if needed.
"""

# Modify this path to the location in your Drive where you want to save the part2 model
# part2_model_checkpoint_filepath = 'drive/MyDrive/ISchool/MIDS/266/model_checkpoints/part2_model'
part2_model_checkpoint_filepath = 'model_checkpoints/part2_model'

In [46]:
# Run this line only after you've fine-tuned the part2_model.
# (`from_pt` is a loading option for `from_pretrained`, so it is not needed when saving.)
part2_model.save_pretrained(part2_model_checkpoint_filepath)

In [35]:
# Run this line only if you need to reload the model you fine-tuned earlier
part2_model = T5ForConditionalGeneration.from_pretrained(part2_model_checkpoint_filepath)

### 2.3 Fine-Tuned Model Evaluation

Now use the calculate_eval_metrics() function defined above to translate the test set and calculate evaluation metrics. Also print out a sample of the translated outputs. For now, use the same decoder .generate() kwargs that you chose for part1.

Then answer the questions below.

In [47]:
# Print out eval metrics for the part2_model on the test set

part2_test_translations = calculate_eval_metrics(
    test_pairs,
    part2_model,
    part2_tokenizer,
    part2_batch_size,
    task_prefix,
    **part1_generate_kwargs
)

BLEU:  {'bleu': 0.3499995259354115, 'precisions': [0.6313748881598569, 0.41110295915871853, 0.2961431268542659, 0.218063872255489], 'brevity_penalty': 0.972717288631285, 'length_ratio': 0.973082783138649, 'translation_length': 13412, 'reference_length': 13783}
BLEURT:  -0.011276488


In [48]:
# Print out a sample of the translated outputs to look at as well

for i in range(10):
    sample_i = random.choice(range(len(part2_test_translations)))
    print('Original:    ', test_pairs[sample_i][0])
    print('Reference:   ', test_pairs[sample_i][1])
    print('Translation: ', part2_test_translations[sample_i])
    print()

Original:     There the grown serpent lies; the worm that's fled Hath nature that in time will venom breed, No teeth for the present.
Reference:    There the grown serpent lies; the worm that has fled Has a nature that in time will breed venom, But he has no fangs now.
Translation:  There the grown serpent lies; the worm that's fled Hath nature that in time will venom breed, No teeth for the present.

Original:     Signior, it is the Moor.
Reference:    Signior, it is the Moor.
Translation:  Signior, it is the Moor.

Original:     She died, my lord, but whiles her slander lived.
Reference:    She was only dead, my lord, as long as her slander lived.
Translation:  She died, my lord, but while her slander lived.

Original:     Thou worms' meat in respect of a good piece of flesh, indeed.
Reference:    You are about as much of a thinker as worm’s meat is a nice steak.
Translation:  You worms' meat in respect of a good piece of flesh, indeed.

Original:     A very fine one.
Reference:    H

**QUESTION:**

 2.b What is the overall BLEU score that you achieved on the test set for the part2 model? (Copy and paste the decimal value for the overall BLEU score, to 5 significant digits, e.g. a number like 0.03671 or 0.09763. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
  - 0.35000

**QUESTION:**

 2.c What is the mean BLEURT score that you achieved on the test set for the part2 model? (Copy and paste the decimal value for the mean BLEURT score, to 5 significant digits, e.g. a number like -1.12345 or -0.54321. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
 - -0.011276


**QUESTION:**

 2.d What do you notice about the part2 model's output? It should be much better than the part1 model's output. But the translations still probably don't perfectly match the reference human translations. What does the part2 model seem to still be doing poorly? (Choose one of the following options that you agree with most, and put it in the answers file.)

 - A. The generated translations are gibberish.

 - B. The generated translations are written in a far more casual style than the reference human translations.

 - C. The generated translations mean something completely different from the input text and reference translations.

 - D. The generated translations are too similar to the input text, and haven't been rephrased as much as the reference human translations.

  -> D
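
The pattern in 2.d can also be checked numerically. The sketch below is one rough way to do it, using only the standard library: count exact copies of the input and measure word-level similarity between each input and its translation. The helper name `word_similarity` is hypothetical, and the three pairs are copied from the sample output above.

```python
from difflib import SequenceMatcher

# (original, translation) pairs copied from the sample output above.
samples = [
    ("Signior, it is the Moor.",
     "Signior, it is the Moor."),
    ("She died, my lord, but whiles her slander lived.",
     "She died, my lord, but while her slander lived."),
    ("Thou worms' meat in respect of a good piece of flesh, indeed.",
     "You worms' meat in respect of a good piece of flesh, indeed."),
]


def word_similarity(source, output):
    """Similarity ratio in [0, 1] between two texts, compared word by word."""
    return SequenceMatcher(None, source.split(), output.split()).ratio()


exact_copies = sum(src == out for src, out in samples)
similarities = [word_similarity(src, out) for src, out in samples]

print("Exact copies of the input:", exact_copies, "of", len(samples))
for (src, _), sim in zip(samples, similarities):
    print(f"{sim:.2f}  {src}")
```

Running the same comparison over the full `test_pairs` and `part2_test_translations` lists would give a corpus-level view of how often the model changes only a word or two.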

### 2.4 Style Classifier

Now that the model is able to output more coherent translations, we can start to get more picky about different aspects of the output. We should also make sure that our evaluation metrics are capturing everything we want to be able to assess and improve in the model's output.

One thing we're not capturing yet is if the output has the right **style**. This task is sort of a translation task, but since it's between two different forms of English, we can also think of it as a style transfer task.

BLEU might help a little with that, but when the model chooses different words from the human reference, it could do so in ways that are still good modern English or that are still too much like Shakespeare. BLEURT won't tell us anything about the style, as long as the meaning is still similar to the reference.

How can we tell whether the output has the right style? We could train a separate classification model to predict whether text is Shakespearean or modern English. We have the data to do it! We just need to repurpose our data for a classification problem.

Use the code below to train a BERT classifier to predict whether a sentence is Shakespearean or modern English. We're providing this code for you, because it's not the main task and not based on a similar example from class. We want you to use it as one of your evaluation metrics, to help you iterate on your models for the main task.

In [None]:
def make_style_classifier_data(text_pairs):
    """
    Build a binary **style** dataset from parallel pairs for a classifier.

    Args:
        text_pairs: Iterable of (shakespeare_text, modern_text) tuples.

    Returns:
        datasets.Dataset
            Shuffled dataset with two columns:
              - "text":  strings (Shakespeare and Modern examples, interleaved by the final shuffle)
              - "label": ints    (0 = Shakespeare, 1 = Modern)

    Notes:
        - Class labeling: source (old/Shakespeare) → 0, target (modern) → 1.
        - Balance: for each pair you add one example to each class, so the split
          is naturally balanced (given balanced input pairs).
        - Shuffle: randomize order to avoid batches that are all one class first.
          (For reproducibility, you can pass a seed: `.shuffle(seed=42)`.)
    """
    # Concatenate texts from both domains into a single list
    style_texts  = [pair[0] for pair in text_pairs] + [pair[1] for pair in text_pairs]

    # Parallel labels: 0's for every Shakespeare sample, 1's for every Modern sample
    style_labels = [0 for _ in text_pairs]          + [1 for _ in text_pairs]

    # Construct a Hugging Face Dataset with explicit columns
    style_dataset = Dataset.from_dict({"text": style_texts, "label": style_labels})

    # Shuffle to interleave classes and break up ordering artifacts
    return style_dataset.shuffle()

# Build train/validation datasets for the style classifier
style_train_ds = make_style_classifier_data(train_pairs)
style_valid_ds = make_style_classifier_data(val_pairs)


In [50]:
bert_checkpoint_name = 'bert-base-cased'
bert_tokenizer = BertTokenizer.from_pretrained(bert_checkpoint_name)
bert_style_classifier_model = BertForSequenceClassification.from_pretrained(bert_checkpoint_name)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
def preprocess_style_text(data):
    """
    Tokenize a batch of raw texts for the style classifier (e.g., BERT).

    Args:
        data: A batch dict from HF Datasets with key "text" (list[str]).

    Returns:
        Dict of batched tensors suitable for modeling:
          - input_ids: (batch, MAX_SEQUENCE_LENGTH)
          - attention_mask: (batch, MAX_SEQUENCE_LENGTH)
          - token_type_ids: (batch, MAX_SEQUENCE_LENGTH)  # BERT next-sentence segments (all zeros for single-sentence)
    Notes:
        - MAX_SEQUENCE_LENGTH caps the sequence; longer texts are truncated.
        - Using `padding='max_length'` yields fixed-size batches (good for static shapes).
        - `return_tensors="pt"` returns PyTorch tensors directly to the dataset mapping.
          (You can also return lists/np arrays and later call `set_format("torch")`.)
        - Some models (RoBERTa, DistilBERT) ignore `token_type_ids`; harmless to include.
    """
    return bert_tokenizer.batch_encode_plus(
        data['text'],
        max_length=MAX_SEQUENCE_LENGTH,   # hard cap (set to 40 in this notebook); 128–256 is worth considering for longer texts
        padding='max_length',             # pad all sequences to the same length
        truncation=True,                  # cut off texts longer than the cap
        return_attention_mask=True,       # mask distinguishes real tokens vs padding
        return_token_type_ids=True,       # BERT uses this; others may ignore
        return_tensors="pt"               # return PyTorch tensors
    )

# Apply batched tokenization to train/val splits
style_train_ds_preprocessed = style_train_ds.map(preprocess_style_text, batched=True)
style_valid_ds_preprocessed = style_valid_ds.map(preprocess_style_text, batched=True)


Map: 100%|██████████| 33596/33596 [00:09<00:00, 3638.16 examples/s]
Map: 100%|██████████| 2290/2290 [00:00<00:00, 4514.48 examples/s]


In [52]:
style_classifier_batch_size = 32
style_classifier_num_epochs = 2

# TrainingArguments for the BERT style classifier.
# - output_dir: where checkpoints/logs are written
# - per_device_*_batch_size: batch size per GPU/CPU device (effective batch = this × #devices)
# - num_train_epochs: total passes over the training data
# - eval_strategy: run evaluation at the end of each epoch
# - save_strategy: save a checkpoint at the end of each epoch
# - report_to: disable external loggers (e.g., "wandb", "tensorboard", etc.)
style_training_args = TrainingArguments(
    output_dir="bert_shakespeare_style_classifier",
    per_device_train_batch_size=style_classifier_batch_size,
    per_device_eval_batch_size=style_classifier_batch_size,
    num_train_epochs=style_classifier_num_epochs,
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to='none'
)


In [54]:
# Accuracy will be our primary evaluation metric for the style classifier.
metric = evaluate.load("accuracy")

def compute_metrics(p):
    """
    Convert model outputs to class predictions and compute accuracy.

    Args:
        p: Tuple or EvalPrediction from HF Trainer.
           - p.predictions: logits array of shape (batch, num_labels)
           - p.label_ids (or labels): true label ids of shape (batch,)

    Returns:
        dict: {"accuracy": <float>} suitable for Trainer logging/saving.
    """
    predictions, labels = p  # Trainer passes (predictions, label_ids)
    # Turn logits into hard class ids by argmax over the label dimension.
    predictions = np.argmax(predictions, axis=1)
    # Compute accuracy against the provided references.
    return metric.compute(predictions=predictions, references=labels)


In [55]:
# Hugging Face Trainer wiring for the BERT style classifier.
# - model:          the sequence classification model to fine-tune (2 labels: Shakespeare vs Modern)
# - args:           TrainingArguments controlling batch sizes, epochs, eval/save cadence, etc.
# - train_dataset:  tokenized training split (must include input_ids, attention_mask, [token_type_ids], label)
# - eval_dataset:   tokenized validation split with the same columns as train
# - compute_metrics:function that takes (predictions, labels) and returns a dict (e.g., {"accuracy": ...})
style_trainer = Trainer(
    model=bert_style_classifier_model,
    args=style_training_args,
    train_dataset=style_train_ds_preprocessed,
    eval_dataset=style_valid_ds_preprocessed,
    compute_metrics=compute_metrics
)


In [56]:
style_trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.4079,0.365615,0.821397
2,0.2895,0.37284,0.834498


TrainOutput(global_step=2100, training_loss=0.36029799415951685, metrics={'train_runtime': 216.7505, 'train_samples_per_second': 309.997, 'train_steps_per_second': 9.689, 'total_flos': 1381168596230400.0, 'train_loss': 0.36029799415951685, 'epoch': 2.0})

In [57]:
"""
Before moving on, save a checkpoint of the model you just trained in your Drive,
so that you can pick up where you left off later if needed.
"""

# Modify this path to the location in your Drive where you want to save the style classifier
# style_classifier_checkpoint_filepath = 'drive/MyDrive/ISchool/MIDS/266/model_checkpoints/style_classifier'
style_classifier_checkpoint_filepath = 'model_checkpoints/style_classifier'

In [58]:
# Run this line only after you've trained the style classifier model.
# (`from_pt` is a loading option for `from_pretrained`, so it is not needed when saving.)
bert_style_classifier_model.save_pretrained(style_classifier_checkpoint_filepath)

In [47]:
# Run this line only if you need to reload the style classifier you trained earlier
bert_style_classifier_model = BertForSequenceClassification.from_pretrained(style_classifier_checkpoint_filepath)

Now let's use the style classifier to classify the output from the Shakespeare translation model, using the test set from our main task. The function reports the average predicted probability of the positive class, which is the modern English style (and which is our goal for our main task model).

We should also classify the original Shakespearean text and the human translations from the test set, to compare the scores as references.
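
As a minimal sketch of what that score means, the snippet below turns classifier logits into probabilities with a softmax and averages the probability of class 1 (modern). It is written in plain Python so it runs without a GPU. The function names and logit values are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_modern_probability(batch_logits, modern_index=1):
    """Average probability of the modern class across a batch of logit rows."""
    probs = [softmax(row)[modern_index] for row in batch_logits]
    return sum(probs) / len(probs)


# Hypothetical logits for three texts; columns are [Shakespeare, Modern].
example_logits = [[2.0, -1.0], [0.0, 0.0], [-1.5, 2.5]]
score = mean_modern_probability(example_logits)
print(round(score, 3))
```

A score near 1 means the classifier is confident that most texts are modern, a score near 0 means it sees them as Shakespearean, and a mid-range score can come either from uncertain predictions or from a mix of confident ones.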

Run the next two cells of code, then answer the following questions.

In [59]:
def get_modern_style_score(text):
    """
    Compute the average probability that a batch of texts are in **Modern English** style
    using the fine-tuned BERT style classifier.

    Args:
        text: List[str] (or any iterable of strings). Each string is a sample to score.

    Returns:
        float: Mean P(modern) across the batch, where class index 1 = Modern.
               (If you pass a single string in a length-1 list, this is that sample’s P(modern).)

    Notes:
        - Tokenization:
          * Caps each sequence at MAX_SEQUENCE_LENGTH tokens and pads to that length.
          * Returns PyTorch tensors sized (batch, MAX_SEQUENCE_LENGTH).
        - Device usage:
          * Moves inputs and the model to CUDA for scoring, then brings logits back to CPU.
          * This function relocates the model every call; for performance, move the model
            to GPU once outside this function and just feed tensors here.
        - Probabilities:
          * Applies softmax over logits to obtain P(class). Index 1 corresponds to Modern.
          * Returns the **mean** probability across all provided texts.
        - token_type_ids:
          * Included for BERT; models like RoBERTa ignore this field (harmless to provide).
    """
    text_inputs = bert_tokenizer.batch_encode_plus(
        text,
        max_length=MAX_SEQUENCE_LENGTH,     # truncate long texts to this cap
        padding='max_length',               # pad shorter texts to fixed length
        truncation=True,
        return_attention_mask=True,
        return_token_type_ids=True,
        return_tensors="pt"
    )

    with torch.no_grad():
        logits = bert_style_classifier_model.cuda()(
            text_inputs['input_ids'].cuda(),
            attention_mask=text_inputs['attention_mask'].cuda()
        ).logits

    # Convert logits -> probabilities; column 1 is P(Modern)
    probs = softmax(logits.cpu().numpy(), axis=1)
    return np.mean(probs[:, 1])



In [None]:
# Prepare three comparable text sets for style scoring:
#  - Generated translations from your MT model
#  - Human (reference) modern translations
#  - Original Shakespeare lines, with the task prefix prepended (see note below)

test_original_texts = [task_prefix + pair[0] for pair in test_pairs]
test_label_texts    = [pair[1] for pair in test_pairs]

# Compute the mean probability that each set is Modern style (class 1).
# get_modern_style_score(...) returns the **mean** P(Modern) across the batch.
# Note: the Shakespeare lines are scored with the task prefix
# "Translate Shakespeare to Modern English: " attached. That modern-English prefix
# may inflate their modern style score slightly, but the three scores are still
# usable for a rough comparison.
translations_score = get_modern_style_score(part2_test_translations)
reference_score    = get_modern_style_score(test_label_texts)
shakespeare_score  = get_modern_style_score(test_original_texts)

print("Modern style score for generated translations:  ", translations_score)
print("Modern style score for reference translations:  ", reference_score)
print("Modern style score for original Shakespeare:    ", shakespeare_score)


Modern style score for generated translations:   0.4517341
Modern style score for reference translations:   0.82950926
Modern style score for original Shakespeare:     0.33652818


**QUESTION:**

 2.e What is the modern style classifier score that you got for the part2 model's generated translations? (Copy and paste the decimal value from the get_modern_style_score function above, to 5 significant digits, e.g. a number like 0.36712 or 0.97632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
  - 0.45173

**QUESTION:**

 2.f What is the modern style classifier score that you got for the human reference translations? (Copy and paste the decimal value from the get_modern_style_score function above, to 5 significant digits, e.g. a number like 0.36712 or 0.97632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
  - 0.82951

**QUESTION:**

 2.g What is the modern style classifier score that you got for the original Shakespeare text? (Copy and paste the decimal value from the get_modern_style_score function above, to 5 significant digits, e.g. a number like 0.36712 or 0.97632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
 - 0.33653

**QUESTION:**

 2.h What do you notice about differences between these scores, and what does that tell you about what the part2 model is doing? (Choose one of the following options that you agree with most, and put it in the answers file.)

 - A. The part2 model is generating output that is way more modern, casual, and younger generation speak than the human translations.

 - B. The part2 model is generating output that looks about as modern as the human translations, even if it doesn't always mean the same thing.

 - C. The part2 model is generating output that is partly modernized, more modern than the original Shakespeare, but still not as modern as the human references.

 - D. The part2 model is generating output that is still pretty much the same style as the original Shakespeare text.

 -> C
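
One way to put a number on "partly modernized" is to ask how much of the gap between the original Shakespeare score and the human reference score the model has closed. The helper name `style_gap_closed` is hypothetical; the inputs are the three scores printed above.

```python
def style_gap_closed(translation_score, source_score, reference_score):
    """Fraction of the source-to-reference style gap covered by the translations."""
    return (translation_score - source_score) / (reference_score - source_score)


# Scores copied from the output above.
gap = style_gap_closed(
    translation_score=0.4517341,
    source_score=0.33652818,
    reference_score=0.82950926,
)
print(f"{gap:.1%}")  # roughly 23%
```

A value of 0% would mean the output looks as Shakespearean as the input, and 100% would mean it looks as modern as the human references. By this measure, the part2 model with these generation settings covers roughly a quarter of the distance.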

### 2.5 Revisit Decoder .Generate() Options

Now that we have one more evaluation metric, let's go back to the decoder .generate() arguments we used before. Are there any arguments you want to change, to try to do better on this latest evaluation metric?

Try different options for the part2_generate_kwargs below, and run the two cells afterward with each set of choices to see how the evaluation metrics change.

Then answer the questions below.

In [127]:
"""
Fill in the decoder .generate() arguments that you want to use for the part2 model, like num_beams or top_p, etc.
"""
part2_generate_kwargs = {
    "do_sample": True,
    "top_p": 0.95,
    "temperature": 1.05,
    "max_new_tokens": 48,
    "min_new_tokens": 10,
    "no_repeat_ngram_size": 2,
    "repetition_penalty": 1.15,
}




In [128]:
# Print out eval metrics for the part2_model on the test set, with the new kwargs

part2_test_translations = calculate_eval_metrics(
    test_pairs,
    part2_model,
    part2_tokenizer,
    part2_batch_size,
    task_prefix,
    **part2_generate_kwargs
)

BLEU:  {'bleu': 0.19298524299603242, 'precisions': [0.42736639492753625, 0.23161208305587505, 0.14693638610641344, 0.09536861339517887], 'brevity_penalty': 1.0, 'length_ratio': 1.2815787564390917, 'translation_length': 17664, 'reference_length': 13783}
BLEURT:  -0.51604944
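
As a reading aid for the BLEU dictionary above, the sketch below recombines the reported n-gram precisions and lengths into the overall score using the standard BLEU definition (brevity penalty times the geometric mean of the 1- to 4-gram precisions). The numbers are copied from the output above; the variable names are only for illustration.

```python
import math

# Values copied from the BLEU output above.
precisions = [0.42736639492753625, 0.23161208305587505,
              0.14693638610641344, 0.09536861339517887]
translation_length = 17664
reference_length = 13783

# The brevity penalty only penalizes output that is shorter than the references.
if translation_length > reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)

# Geometric mean of the 1- to 4-gram precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu = brevity_penalty * geo_mean
print(round(bleu, 5))
```

Because the translations are longer than the references here (length ratio about 1.28), the brevity penalty is 1.0 and the score is driven entirely by the n-gram precisions, with the 3- and 4-gram precisions pulling the geometric mean down the most.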


In [129]:
# Calculate modern style scores for the part2 translations after using the new kwargs

translations_score = get_modern_style_score(part2_test_translations)

print("Modern style score for generated translations:  ", translations_score)

Modern style score for generated translations:   0.6875998


In [130]:
# Print out a sample of the translated outputs with the revised .generate() parameters

for i in range(10):
    sample_i = random.choice(range(len(part2_test_translations)))
    print('Original:    ', test_pairs[sample_i][0])
    print('Reference:   ', test_pairs[sample_i][1])
    print('Translation: ', part2_test_translations[sample_i])
    print()

Original:     Methinks the ground is even.
Reference:    The ground feels flat to me.
Translation:  Methinks the ground is actually even.’

Original:     Yond island carrions, desperate of their bones, Ill-favoredly become the morning field.
Reference:    Those island-bred skeletons, terrified for their bones, are an offensive sight on the morning field.
Translation:  Ill-favored island carrions, desperate from their bones, become the morning field.

Original:     If thou canst nod, speak too.
Reference:    If you can nod, speak too.
Translation:  If you can’t nod, speak too.

Original:     But you, O you, So perfect and so peerless, are created Of every creature's best!
Reference:    So perfect and so peerless, are created Out of every creature's best virtues.
Translation:  That means your eyes are so perfect and that they are always rounded, that you are created From every creature's best!

Original:     Most royal majesty, I crave no more than hath your highness offered.
Reference: 

**QUESTION:**

 2.i Which decoder strategy seemed to increase the modern style score the most? (Choose one of the following options and put it in the answers file.)

 - A. Using a stricter option to always choose the highest predicted possibility output (e.g. beam search, or small k or p when using sampling).

 - B. Using a looser sampling method to allow the model to choose more varied output (e.g. top-k or top-p rather than beam search, especially with higher k or p and/or higher temperature).

 -> B

**QUESTION:**

 2.j What happens to the other evaluation metrics when you try to increase the modern style score by varying the decoder strategy discussed in 2.i? (Choose one of the following options and put it in the answers file.)

 - A. BLEU and BLEURT both seem to be positively correlated with the modern style score, when changing the decoder strategy.

 - B. BLEU and BLEURT both seem to be negatively correlated with the modern style score, when changing the decoder strategy.

 - C. BLEU seems to move with the modern style score, but BLEURT seems to go the other direction.

 - D. BLEURT seems to move with the modern style score, but BLEU seems to go the other direction.

 -> B

**QUESTION:**

 2.k Why do you think the relationship in question 2.j is happening? (Choose one of the following options and put it in the answers file.)

 - A. A stricter decoder strategy makes the model more likely to output the best translation, which is good for BLEU, BLEURT, and modern style objectives.

 - B. A looser decoder strategy gives the model more freedom to find a good modern style translation, which should also end up saying more of the same things in the same way as the human translation.

 - C. A stricter decoder strategy makes the model more likely to output a translation that has correct exact words and style, increasing BLEU and modern style scores, but might not mean the same thing as the human translation.

 - D. A looser decoder strategy gives the model more freedom to choose more modern style words, which the pre-trained model is more familiar with, but that freedom can make the model less likely to end up with the exact same words or meaning as the human translation.

 - E. A stricter decoder strategy makes the model more likely to choose more of the exact same words as used in the dataset, but not necessarily in the same order, so the meaning and style don't end up being as close to the human translation.

  -> D
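
The trade-off in 2.j and 2.k can be made concrete with a toy version of the clipped n-gram precision that BLEU is built on. This is a sketch under simplifying assumptions: it uses whitespace tokenization and a single reference, which is not how the BLEU metric used above tokenizes text, and the `paraphrase` sentence is written for this example rather than taken from model output.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total

reference = "If you can nod, speak too."
close_copy = "If you can nod, speak too."
paraphrase = "If you are able to nod, then talk as well."  # hypothetical example sentence

for name, candidate in [("close copy", close_copy), ("paraphrase", paraphrase)]:
    p1 = ngram_precision(candidate, reference, 1)
    p2 = ngram_precision(candidate, reference, 2)
    print(f"{name:>10}: unigram={p1:.2f}  bigram={p2:.2f}")
```

The paraphrase keeps the meaning but shares few exact words with the reference, so its overlap scores drop sharply. That is the mechanism by which freer, more modern rewording can lower BLEU even when the output is acceptable.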

## 3. Adding Supplementary Paraphrase Dataset

Can we do anything else to make the model capable of rephrasing the input text more into a different style (i.e. modernizing it more fully away from the Shakespeare), but still keep the same meaning?

One related task that could help is a paraphrasing task. The [GLUE Microsoft Research Paraphrase Corpus (MRPC)](https://huggingface.co/datasets/nyu-mll/glue) dataset has pairs of sentences with labels indicating whether the two sentences are equivalent (i.e. they mean the same thing) or not.

We could use that data as an additional supporting task for our T5 model, to see if it helps our model get better at accurately rephrasing the input text into a differently worded output.

### 3.1 Load and preprocess the supplemental dataset

Load the dataset from Hugging Face and look at its contents.

In [107]:
mrpc_data = load_dataset('SetFit/mrpc', trust_remote_code=True)

`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'SetFit/mrpc' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
Repo card metadata block was not found. Setting CardData to empty.
Generating train split: 100%|██████████| 3668/3668 [00:00<00:00, 42567.27 examples/s]
Generating validation split: 100%|██████████| 408/408 [00:00<00:00, 95655.45 examples/s]
Generating test split: 100%|██████████| 1725/1725 [00:00<00:00, 296973.87 examples/s]


In [108]:
mrpc_data

DatasetDict({
    train: Dataset({
        features: ['text1', 'text2', 'label', 'idx', 'label_text'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['text1', 'text2', 'label', 'idx', 'label_text'],
        num_rows: 408
    })
    test: Dataset({
        features: ['text1', 'text2', 'label', 'idx', 'label_text'],
        num_rows: 1725
    })
})

In [110]:

mrpc_data['train'].features['label']

Value('int64')

In [112]:
mrpc_data['train'].features['text2']

Value('string')

Let's use just the pairs that are labeled as equivalent (correct paraphrases), and split the sentences to use the first as the model's input and second as the model's output. Then we can use that to train our T5 model to better generate rephrased statements in modern English with the same meaning as the input but in different words.

In [113]:
# Keep only MRPC pairs labeled as *equivalent* (label == 1 → true paraphrases).
# We'll later map `text1` → source and `text2` → target for seq2seq training.

mrpc_equiv_train = mrpc_data["train"].filter(lambda example: example["label"] == 1)
mrpc_equiv_valid = mrpc_data["validation"].filter(lambda example: example["label"] == 1)


Filter: 100%|██████████| 3668/3668 [00:00<00:00, 65783.72 examples/s]
Filter: 100%|██████████| 408/408 [00:00<00:00, 55323.81 examples/s]


Fill in the code below to encode sentence1 (the "text1" column) as the model's input and sentence2 (the "text2" column) as the model's output.

We will also add a different prefix for this supporting task, so make sure to add the prefix to the inputs in the function below. (You can use the preprocess_translation_batch function above as an example.)

In [114]:
def preprocess_mrpc_for_paraphrase_generation(mrpc_ds, tokenizer, prefix):
    """
    Prepare MRPC paraphrase pairs for seq2seq generation:
      - Inputs:   text1 (with a task prefix prepended)
      - Outputs:  text2 (target paraphrase)
    Returns tensors for Trainer: {'input_ids', 'labels'}.
    """
    # Optionally prepend a task instruction to each source sentence
    # (build a new list rather than modifying the incoming batch in place)
    source_texts = list(mrpc_ds["text1"])
    if prefix:
        source_texts = [prefix + t for t in source_texts]

    # Encode inputs (encoder side)
    input_encoded = tokenizer.batch_encode_plus(
        source_texts,
        max_length=MAX_SEQUENCE_LENGTH,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )

    # Encode targets (decoder side labels)
    output_encoded = tokenizer.batch_encode_plus(
        mrpc_ds["text2"],
        max_length=MAX_SEQUENCE_LENGTH,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )

    return {
        "input_ids": input_encoded["input_ids"],
        "labels": output_encoded["input_ids"],
    }
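
One detail worth checking with `padding="max_length"` on the targets: the padded positions end up in `labels` as the tokenizer's pad id. The loss in `T5ForConditionalGeneration` ignores label positions equal to -100, so a common pattern is to replace pad ids in the labels with -100. Whether the helper functions defined earlier in this notebook already handle this is not visible in this section, so treat the sketch below as an optional check rather than a required change. The function name and the toy ids are hypothetical.

```python
IGNORE_INDEX = -100   # label value that PyTorch cross-entropy skips by default
T5_PAD_TOKEN_ID = 0   # pad id of the T5 tokenizers; confirm with tokenizer.pad_token_id

def mask_padding_in_labels(label_ids, pad_token_id=T5_PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX in a batch of label id lists."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Made-up token ids for illustration; 1 is T5's end-of-sequence id, 0 is padding.
toy_labels = [[71, 1782, 1, 0, 0], [27, 1, 0, 0, 0]]
print(mask_padding_in_labels(toy_labels))
```

If padding is not masked, the model is also trained to predict pad tokens after the end-of-sequence token, which can make reported losses look lower than the loss on real target tokens alone.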


Now map the preprocessing function to the MRPC train and validation datasets. Use the part2_tokenizer to preprocess the data, since we're using the same T5 pre-trained model checkpoint as in part 2. For the preprocessing function's "prefix" argument, use the paraphrase_prefix provided below.

In [115]:
paraphrase_prefix = 'Paraphrase in modern English: '

### YOUR CODE HERE

# Map MRPC (label==1) splits into seq2seq features using the pretrained tokenizer
mrpc_paraphrase_train = mrpc_equiv_train.map(
    preprocess_mrpc_for_paraphrase_generation,
    batched=True,
    fn_kwargs={"tokenizer": part2_tokenizer, "prefix": paraphrase_prefix}
)

mrpc_paraphrase_valid = mrpc_equiv_valid.map(
    preprocess_mrpc_for_paraphrase_generation,
    batched=True,
    fn_kwargs={"tokenizer": part2_tokenizer, "prefix": paraphrase_prefix}
)



### END YOUR CODE

Map: 100%|██████████| 2474/2474 [00:00<00:00, 6061.51 examples/s]
Map: 100%|██████████| 279/279 [00:00<00:00, 4707.51 examples/s]


### 3.2 Train T5 on Paraphrasing Task

Load a fresh copy of the pre-trained T5 model (using the same pre-trained model checkpoint as part2), so that we can train it first on the paraphrase task and then on the main task data.

In [131]:
"""
Load a new copy of the same pre-trained model (we'll use the same tokenizer as in part2)
"""

t5_pretrained_checkpoint_name = 'google-t5/t5-small'

### YOUR CODE HERE

part3_model = T5ForConditionalGeneration.from_pretrained(t5_pretrained_checkpoint_name)

# T5 convention: decoder starts with the pad token (usually set already, but make it explicit if you like)
part3_model.config.decoder_start_token_id = part2_tokenizer.pad_token_id


### END YOUR CODE

Now create the training args and trainer for the paraphrase task, and train the model. Use the `create_seq2seq_training_args` and `create_seq2seq_trainer` functions like before.

You'll be using the part3_model you just loaded, and the MRPC data you preprocessed. Use the batch_size and num_epochs provided for the paraphrase task below.

In [132]:
"""
Create the training args and trainer for the paraphrase task.
"""

paraphrase_batch_size = 32
paraphrase_num_epochs = 4

### YOUR CODE HERE

# Build training arguments (same fixed fields as before)
paraphrase_training_args = create_seq2seq_training_args(
    paraphrase_batch_size,
    paraphrase_num_epochs
)

# Wire up model, args, and MRPC paraphrase datasets
paraphrase_trainer = create_seq2seq_trainer(
    part3_model,
    paraphrase_training_args,
    mrpc_paraphrase_train,
    mrpc_paraphrase_valid
)


### END YOUR CODE

In [133]:
paraphrase_trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,1.359217
2,No log,1.255481
3,No log,1.228767
4,No log,1.223685


TrainOutput(global_step=312, training_loss=1.6392105298164563, metrics={'train_runtime': 28.8769, 'train_samples_per_second': 342.696, 'train_steps_per_second': 10.804, 'total_flos': 104636130263040.0, 'train_loss': 1.6392105298164563, 'epoch': 4.0})
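
A quick sanity check on the `global_step` value above: assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept, the step count follows directly from the dataset size, batch size, and number of epochs. The helper name below is hypothetical.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device, no gradient accumulation, keeping the last partial batch."""
    return math.ceil(num_examples / batch_size) * num_epochs

# 2474 equivalent-labeled training pairs (from the Map output above), batch size 32, 4 epochs.
print(total_optimizer_steps(2474, 32, 4))  # 312, matching global_step in the TrainOutput
```

The "No log" entries in the Training Loss column would be consistent with a logging interval longer than these 312 steps (the `Trainer` default for `logging_steps` is 500), assuming the `create_seq2seq_training_args` helper does not override it.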

### 3.3 Fine-Tune Paraphrase-Trained Model on Main Task

Now create the training args and trainer for the main task. Use the `create_seq2seq_training_args` and `create_seq2seq_trainer` functions one more time.

You'll be using the same model that you just trained on the paraphrase task (part3_model). Use the batch size and num epochs provided below.

For training data, use the same data as part2: `train_ds_part2` and `val_ds_part2`. We're using the same pre-trained model checkpoint, i.e. the same tokenizer, and the same task prefix, so the data has already been preprocessed correctly.

In [134]:
"""
Create the training args and trainer for the main task using the part3_model.
"""

part3_batch_size = 32
part3_num_epochs = 4

### YOUR CODE HERE

# Training args for fine-tuning on the Shakespeare→Modern task (same helper)
part3_training_args = create_seq2seq_training_args(
    part3_batch_size,
    part3_num_epochs
)

# Trainer that continues training the *same* model (part3_model) on the main task data
part3_trainer = create_seq2seq_trainer(
    part3_model,
    part3_training_args,
    train_ds_part2,
    val_ds_part2
)

### END YOUR CODE

In [135]:
part3_trainer.train()

Epoch,Training Loss,Validation Loss
1,0.8016,0.707475
2,0.7378,0.691277
3,0.7239,0.685622
4,0.7108,0.683007


TrainOutput(global_step=2100, training_loss=0.742597910563151, metrics={'train_runtime': 193.3912, 'train_samples_per_second': 347.441, 'train_steps_per_second': 10.859, 'total_flos': 710459869102080.0, 'train_loss': 0.742597910563151, 'epoch': 4.0})

In [136]:
"""
Before moving on, save a checkpoint of the model you just trained in your Drive,
so that you can pick up where you left off later if needed.
"""

# Modify this path to the location in your Drive where you want to save the part3 model
# part3_model_checkpoint_filepath = 'drive/MyDrive/ISchool/MIDS/266/model_checkpoints/part3_model'
part3_model_checkpoint_filepath = 'model_checkpoints/part3_model'

In [137]:
# Run this line only after you've trained the part3 model on both tasks
part3_model.save_pretrained(part3_model_checkpoint_filepath)

In [138]:
# Run this line only if you need to reload the part3 model you trained earlier
part3_model = T5ForConditionalGeneration.from_pretrained(part3_model_checkpoint_filepath)

### 3.4 Paraphrase-Trained Model Evaluation

Use the functions defined above to translate the test set and calculate the same set of evaluation metrics as used on the part2 model.

Use the same decoder .generate() arguments as part2 (`part2_generate_kwargs`), so that we can compare the part2 and part3 models as closely as possible.

Run the next three cells, then answer the questions below.

In [139]:
# Print out eval metrics for the part3_model on the test set, with the new kwargs

part3_test_translations = calculate_eval_metrics(
    test_pairs,
    part3_model,
    part2_tokenizer,
    part3_batch_size,
    task_prefix,
    **part2_generate_kwargs
)

BLEU:  {'bleu': 0.18760278053208665, 'precisions': [0.42463608420808446, 0.22679836004472606, 0.14151006486992576, 0.0908893395133256], 'brevity_penalty': 1.0, 'length_ratio': 1.2510338823187985, 'translation_length': 17243, 'reference_length': 13783}
BLEURT:  -0.51736


In [140]:
# Calculate modern style scores for the part3 translations after using the new kwargs

translations_score = get_modern_style_score(part3_test_translations)

print("Modern style score for generated translations:  ", translations_score)

Modern style score for generated translations:   0.6705979


In [141]:
# Print out a sample of the translated outputs to look at as well

for i in range(10):
    sample_i = random.choice(range(len(part3_test_translations)))
    print('Original:    ', test_pairs[sample_i][0])
    print('Reference:   ', test_pairs[sample_i][1])
    print('Translation: ', part3_test_translations[sample_i])
    print()

Original:     There the grown serpent lies; the worm that's fled Hath nature that in time will venom breed, No teeth for the present.
Reference:    There the grown serpent lies; the worm that has fled Has a nature that in time will breed venom, But he has no fangs now.
Translation:  When the grown serpent lie, the worm that has escaped the Hath nature that in time will venom breed, No teeth for the present.

Original:     Signior, it is the Moor.
Reference:    Signior, it is the Moor.
Translation:  Signior, it is the Moor.

Original:     She died, my lord, but whiles her slander lived.
Reference:    She was only dead, my lord, as long as her slander lived.
Translation:  She died, my lord, but as she was living, she lived.

Original:     Thou worms' meat in respect of a good piece of flesh, indeed.
Reference:    You are about as much of a thinker as worm’s meat is a nice steak.
Translation:  You're pretty nice, a good piece of flesh.

Original:     A very fine one.
Reference:    He’s a 

**QUESTION:**

 3.a What is the overall BLEU score that you achieved on the test set for the part3 model? (Copy and paste the decimal value for the overall BLEU score, to 5 significant digits, e.g. a number like 0.03671 or 0.09763. Put the answer in the answers file; it should match the value shown in your output in this notebook.)  
  - 0.18760

**QUESTION:**

 3.b What is the mean BLEURT score that you achieved on the test set for the part3 model? (Copy and paste the decimal value for the mean BLEURT score, to 5 significant digits, e.g. a number like -1.12345 or -0.54321. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
  - -0.51736

**QUESTION:**

 3.c What is the modern style classifier score that you got for the part3 model's generated translations? (Copy and paste the decimal value from the get_modern_style_score function above, to 5 significant digits, e.g. a number like 0.36712 or 0.97632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)
  - 0.67060

**QUESTION:**

 3.d How do the part3 model's evaluation scores and output compare to the part2 model? Write a short answer about what you observe in the markdown cell below.

*** YOUR ANSWER TO QUESTION 3.d HERE IN THIS TEXT CELL***

Compared to the part2 model, the part3 model (which was first fine-tuned on the MRPC paraphrase task and then on the main task) scores slightly lower on all three metrics.

BLEU Score: Part 3 (0.18760) is slightly lower than Part 2 (0.19299).

BLEURT Score: Part 3 (-0.51736) is marginally lower than Part 2 (-0.51605), indicating a very small decrease in semantic similarity to the reference.

Modern Style Score: Part 3 (0.67060) is also slightly lower than Part 2 (0.68760), suggesting its output is classified as slightly less "modern."

Observing the sample outputs, the additional paraphrase training seems to have made the model more prone to changing the core meaning of the original text, rather than just the style. For example, it translated "Retire!" as "Come on! I mean, let’s wait!", which is a significant semantic shift. It appears that while the paraphrase training encouraged rephrasing, it didn't align well with the specific goal of Shakespeare-to-Modern style transfer, leading to a slight degradation in performance on the target task.

*** END YOUR ANSWER ***
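
To put the part2 and part3 comparison side by side, the sketch below computes the per-metric differences from the scores printed earlier in this notebook. The dictionary names are only for illustration. Because both evaluations use `do_sample=True`, some run-to-run variation is expected, so small gaps like these are best read with caution unless they hold up across repeated runs.

```python
# Scores copied from the part2 and part3 evaluation outputs in this notebook.
part2_scores = {"BLEU": 0.19298524299603242, "BLEURT": -0.51604944, "modern_style": 0.6875998}
part3_scores = {"BLEU": 0.18760278053208665, "BLEURT": -0.51736, "modern_style": 0.6705979}

# Positive delta means part3 scored higher than part2 on that metric.
deltas = {metric: part3_scores[metric] - part2_scores[metric] for metric in part2_scores}

for metric, delta in deltas.items():
    print(f"{metric:>12}: part2={part2_scores[metric]:.5f}  "
          f"part3={part3_scores[metric]:.5f}  delta={delta:+.5f}")
```

All three deltas are negative, and the largest gap is in the modern style score (about 0.017), while the BLEURT difference is close to 0.001.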