# Finetuning T5 for Size Selection

This notebook documents the process of finetuning Google's T5 model to translate from Literary Tibetan to English, with the model size as the only setting that changes between runs. It relies on datasets stored as pickled pandas dataframes, each consisting of a single column, 'translation'. Each entry in that column should be a Python dictionary of the form: {'bo': 'Tibetan text', 'en': 'English text'}.
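
As a minimal sketch of that layout, the snippet below builds a two-row dataframe with a single 'translation' column, pickles it, and reads it back. The sample strings, the `example-pairs.p` filename, and the temporary directory are placeholders for illustration and are not part of the project's data.

```python
import os
import tempfile

import pandas as pd

# Placeholder sentence pairs; real data would hold Tibetan source text under 'bo'.
pairs = [
    {"bo": "Tibetan text 1", "en": "English text 1"},
    {"bo": "Tibetan text 2", "en": "English text 2"},
]

# A single column named 'translation', one dictionary per row.
df = pd.DataFrame({"translation": pairs})

# Pickle the dataframe (hypothetical filename) and read it back to check the shape.
path = os.path.join(tempfile.mkdtemp(), "example-pairs.p")
df.to_pickle(path)
restored = pd.read_pickle(path)

print(list(restored.columns))
print(restored["translation"][0])
```

A file written this way has the same structure that the `load_dataset('pandas', ...)` calls in this notebook expect.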

In creating this notebook I drew on the following tutorial from HuggingFace: https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt

In [1]:
# only this line should be changed for testing different sizes
# must be one of ['small', 'base', 'large', '3b']
size = 'large'

## Load and Split the Data for Training

In [2]:
from datasets import load_dataset
train_dataset = load_dataset('pandas', data_files='/home/j/Documents/Projects/MLotsawa/data/size-selection-data/1M-train.p')
eval_dataset = load_dataset('pandas', data_files='/home/j/Documents/Projects/MLotsawa/data/size-selection-data/100k-eval.p')

## Format and Tokenize the Data

This notebook uses the T5 checkpoint selected by `size` above, together with its associated tokenizer. T5 performs well on translation tasks even at its smaller sizes. The data must be reformatted, though, to accommodate the model's expectations.

We will format each input as 'translate Tibetan to English: \<Tibetan text\>' with the English translation as the target.
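
The formatting step can be tried on its own, without a tokenizer. The sketch below uses a hypothetical helper, `format_pairs`, and a one-row placeholder batch shaped like the 'translation' column described earlier.

```python
PREFIX = "translate Tibetan to English: "


def format_pairs(batch, source_lang="bo", target_lang="en"):
    """Return (inputs, targets) for a batch shaped like {'translation': [...]}."""
    inputs = [PREFIX + pair[source_lang] for pair in batch["translation"]]
    targets = [pair[target_lang] for pair in batch["translation"]]
    return inputs, targets


# Placeholder batch; real batches come from the dataset's map() call.
batch = {"translation": [{"bo": "<Tibetan text>", "en": "English text"}]}
inputs, targets = format_pairs(batch)

print(inputs[0])
print(targets[0])
```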

Once the sentence pairs are formatted, we can tokenize them for processing by the model.

In [5]:
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

checkpoint = f"google-t5/t5-{size}"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
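# `model` expects a model object rather than a checkpoint name, so it is omitted here;
# T5 builds its decoder inputs from the labels on its own
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)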

In [6]:
source_lang = 'bo'
target_lang = 'en'
prefix = "translate Tibetan to English: "

def preprocess_function(examples):

    inputs = [prefix + example[source_lang] for example in examples['translation']]
    targets = [example[target_lang] for example in examples['translation']]
    
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

    return model_inputs


In [7]:
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_eval_dataset = eval_dataset.map(preprocess_function, batched=True)

## Prepare Evaluation Metrics

We will use BLEU, as implemented by SacreBLEU, as our evaluation metric. BLEU (BiLingual Evaluation Understudy) is a standard (if not uncontroversial) metric in machine translation. BLEU is defined on a scale from 0 to 1, where 0 means the model's predicted translations share nothing with the reference translations and 1 means the predicted translations are identical to the references. SacreBLEU reports the same score on a 0 to 100 scale, which is the scale of the `eval_bleu` values in the training log. You can read more about the specifics here: https://en.wikipedia.org/wiki/BLEU
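
To make the idea concrete, here is a simplified, single-sentence BLEU written from scratch and scaled to 0 to 100. It is a sketch only: the function names are hypothetical, it splits on whitespace, and it applies no smoothing, so its numbers will not match SacreBLEU, which uses its own tokenization and corpus-level statistics.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=4):
    """Simplified single-reference BLEU on a 0-100 scale (not SacreBLEU)."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clip each predicted n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages predictions shorter than the reference.
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))
print(simple_bleu("the cat sat on the mat today", "the cat sat on the mat"))
```

An exact match scores 100, a prediction with no shared words scores 0, and a prediction with one extra word lands in between.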

In [8]:
import evaluate

metric = evaluate.load("sacrebleu")

In [9]:
import numpy as np


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

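    # -100 marks label positions that the loss ignores (padding);
    # swap in the pad token id so the labels can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)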
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
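
The two array operations inside `compute_metrics` can be checked in isolation. The sketch below assumes a pad token id of 0 and uses made-up token ids; only the `np.where` and `np.count_nonzero` calls mirror the function above.

```python
import numpy as np

PAD_ID = 0            # assumed pad token id, for illustration only
IGNORE_INDEX = -100   # value that masks padded label positions

# Made-up label ids, padded with -100 as the data collator would do.
labels = np.array([[37, 1712, 1, -100, -100],
                   [71, 1, -100, -100, -100]])

# Replace the masked positions with the pad id so the rows can be decoded.
cleaned = np.where(labels != IGNORE_INDEX, labels, PAD_ID)

# Made-up generated ids; the generation length counts the non-pad tokens per row.
preds = np.array([[0, 37, 1712, 1, 0],
                  [0, 71, 1, 0, 0]])
gen_lens = [int(np.count_nonzero(p != PAD_ID)) for p in preds]

print(cleaned.tolist())
print(gen_lens, float(np.mean(gen_lens)))
```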

## Load and Train the Model

Finally, we train the model. The model is loaded with 'device_map="auto"', which places it on the available hardware automatically, so no separate 'model.to("cuda:0")' call is needed on a machine with a CUDA-compatible GPU. The training arguments below are adapted from the HuggingFace tutorial cited above.

In [10]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback

early_stop = EarlyStoppingCallback()

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

In [11]:
training_args = Seq2SeqTrainingArguments(
    output_dir=f"../../models/size-selection/{size}/",
    learning_rate=2e-5,
    auto_find_batch_size=True,
    weight_decay=0.01,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    optim="adafactor"
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset['train'],
    eval_dataset=tokenized_eval_dataset['train'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stop]
)

trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: billingsmoore. Use `wandb login --relogin` to force relogin


  0%|          | 0/375000 [00:00<?, ?it/s]

{'loss': 4.0054, 'grad_norm': 2.791461229324341, 'learning_rate': 1.9973333333333334e-05, 'epoch': 0.0}
{'loss': 3.7206, 'grad_norm': 2.7115252017974854, 'learning_rate': 1.9946666666666667e-05, 'epoch': 0.01}
{'loss': 3.5792, 'grad_norm': 3.222254753112793, 'learning_rate': 1.9920000000000002e-05, 'epoch': 0.01}
{'loss': 3.4919, 'grad_norm': 2.945887565612793, 'learning_rate': 1.9893333333333335e-05, 'epoch': 0.02}
{'loss': 3.4448, 'grad_norm': 3.42431902885437, 'learning_rate': 1.9866666666666667e-05, 'epoch': 0.02}
{'loss': 3.3361, 'grad_norm': 3.5975170135498047, 'learning_rate': 1.9840000000000003e-05, 'epoch': 0.02}
{'loss': 3.2599, 'grad_norm': 4.015270233154297, 'learning_rate': 1.9813333333333336e-05, 'epoch': 0.03}
{'loss': 3.2441, 'grad_norm': 4.453704357147217, 'learning_rate': 1.9786666666666668e-05, 'epoch': 0.03}
{'loss': 3.1757, 'grad_norm': 3.2731821537017822, 'learning_rate': 1.976e-05, 'epoch': 0.04}
{'loss': 3.1305, 'grad_norm': 4.653375148773193, 'learning_rate': 1



  0%|          | 0/12500 [00:00<?, ?it/s]

{'eval_loss': 0.9471283555030823, 'eval_bleu': 34.1513, 'eval_gen_len': 12.7709, 'eval_runtime': 10288.2974, 'eval_samples_per_second': 9.72, 'eval_steps_per_second': 1.215, 'epoch': 1.0}
{'loss': 1.237, 'grad_norm': 4.952690601348877, 'learning_rate': 1.3306666666666667e-05, 'epoch': 1.0}
{'loss': 1.2496, 'grad_norm': 5.2980546951293945, 'learning_rate': 1.3280000000000002e-05, 'epoch': 1.01}
{'loss': 1.2195, 'grad_norm': 4.0981292724609375, 'learning_rate': 1.3253333333333334e-05, 'epoch': 1.01}
{'loss': 1.2483, 'grad_norm': 6.4036431312561035, 'learning_rate': 1.3226666666666668e-05, 'epoch': 1.02}
{'loss': 1.2435, 'grad_norm': 4.18477201461792, 'learning_rate': 1.3200000000000002e-05, 'epoch': 1.02}
{'loss': 1.2049, 'grad_norm': 4.696059703826904, 'learning_rate': 1.3173333333333333e-05, 'epoch': 1.02}
{'loss': 1.2498, 'grad_norm': 4.632524013519287, 'learning_rate': 1.3146666666666667e-05, 'epoch': 1.03}
{'loss': 1.2377, 'grad_norm': 4.924367904663086, 'learning_rate': 1.312000000



  0%|          | 0/12500 [00:00<?, ?it/s]

{'eval_loss': 0.5441192984580994, 'eval_bleu': 50.5472, 'eval_gen_len': 12.8646, 'eval_runtime': 10311.6134, 'eval_samples_per_second': 9.698, 'eval_steps_per_second': 1.212, 'epoch': 2.0}
{'loss': 0.8139, 'grad_norm': 5.939826011657715, 'learning_rate': 6.640000000000001e-06, 'epoch': 2.0}
{'loss': 0.8128, 'grad_norm': 5.081154823303223, 'learning_rate': 6.613333333333334e-06, 'epoch': 2.01}
{'loss': 0.8308, 'grad_norm': 6.368315696716309, 'learning_rate': 6.5866666666666666e-06, 'epoch': 2.01}
{'loss': 0.8308, 'grad_norm': 4.87898063659668, 'learning_rate': 6.560000000000001e-06, 'epoch': 2.02}
{'loss': 0.8281, 'grad_norm': 6.912765979766846, 'learning_rate': 6.533333333333334e-06, 'epoch': 2.02}
{'loss': 0.809, 'grad_norm': 5.473793029785156, 'learning_rate': 6.5066666666666665e-06, 'epoch': 2.02}
{'loss': 0.8124, 'grad_norm': 6.679011344909668, 'learning_rate': 6.480000000000001e-06, 'epoch': 2.03}
{'loss': 0.8252, 'grad_norm': 4.134126663208008, 'learning_rate': 6.453333333333334e



  0%|          | 0/12500 [00:00<?, ?it/s]

{'eval_loss': 0.4411280155181885, 'eval_bleu': 56.1403, 'eval_gen_len': 12.8662, 'eval_runtime': 10276.0708, 'eval_samples_per_second': 9.731, 'eval_steps_per_second': 1.216, 'epoch': 3.0}


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


{'train_runtime': 206664.8716, 'train_samples_per_second': 14.516, 'train_steps_per_second': 1.815, 'train_loss': 1.2270930525716146, 'epoch': 3.0}


TrainOutput(global_step=375000, training_loss=1.2270930525716146, metrics={'train_runtime': 206664.8716, 'train_samples_per_second': 14.516, 'train_steps_per_second': 1.815, 'total_flos': 3.93261829496832e+17, 'train_loss': 1.2270930525716146, 'epoch': 3.0})
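
The step counts and learning rates in the log above can be reproduced with a little arithmetic. The sketch below is an inference from the log rather than something the notebook states: it assumes 1,000,000 training and 100,000 evaluation examples (from the data filenames), a batch size of 8, and a learning rate that decays linearly from 2e-5 to 0 with no warmup.

```python
import math

train_examples = 1_000_000   # assumed from the '1M-train.p' filename
eval_examples = 100_000      # assumed from the '100k-eval.p' filename
batch_size = 8               # assumed; consistent with the step counts in the log
epochs = 3
base_lr = 2e-5

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_steps = math.ceil(eval_examples / batch_size)


def linear_lr(step):
    """Learning rate under linear decay from base_lr to 0 with no warmup."""
    return base_lr * max(0.0, 1 - step / total_steps)


print(total_steps, eval_steps)
print(linear_lr(500))       # compare with the first logged learning rate
print(linear_lr(125_500))   # compare with the first log line after epoch 1
```

Under these assumptions the totals match the 375000 training steps and 12500 evaluation steps shown in the progress bars, and the computed rates line up with the logged `learning_rate` values.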