## **Model Fine-tuning** (adapted from the translation notebook listed [here](https://huggingface.co/docs/transformers/notebooks))

Set `wb = True` to enable logging with Weights and Biases:

In [1]:
wb = False

In [None]:
import os
work_dir = os.getcwd()
if work_dir == '/content':
  from google.colab import drive
  drive.mount('/content/drive')
  os.chdir('/content/drive/MyDrive/XLdefgen')

If running this on Colab, run the following cell to install the required packages.

In [None]:
!pip install datasets transformers sacrebleu sentencepiece wandb
!apt install git-lfs

In [None]:
if wb:
  import wandb
  wandb.login()
  %env WANDB_PROJECT=XLdefgen

If storing the model on the HF Model Hub, uncomment the following:

In [None]:
# from huggingface_hub import notebook_login
# notebook_login()

A script version of this notebook to fine-tune the model in a distributed fashion using multiple GPUs or TPUs is available [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

Specify model checkpoint to load (from HF Model Hub)


In [None]:
model_checkpoint = "google/mt5-small"

## Loading the dataset

In [None]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("wmt16", "de-en")
metric = load_metric("sacrebleu")

To get a sense of what the data looks like, the following function (commented out by default) shows some examples picked randomly from the dataset.

In [None]:
# import datasets
# import random
# import pandas as pd
# from IPython.display import display, HTML

# def show_random_elements(dataset, num_examples=5):
#     assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
#     picks = []
#     for _ in range(num_examples):
#         pick = random.randint(0, len(dataset)-1)
#         while pick in picks:
#             pick = random.randint(0, len(dataset)-1)
#         picks.append(pick)
    
#     df = pd.DataFrame(dataset[picks])
#     for column, typ in dataset.features.items():
#         if isinstance(typ, datasets.ClassLabel):
#             df[column] = df[column].transform(lambda i: typ.names[i])
#     display(HTML(df.to_html()))

In [None]:
# show_random_elements(raw_datasets["train"])

Demonstration of the metric in use:

In [None]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)
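The BLEU score reported by the metric is built on clipped n-gram precision. As a rough illustration only, the sketch below computes that quantity for a single n-gram order on whitespace-split text. It is not sacreBLEU's implementation: it omits sacreBLEU's tokenization, smoothing, the geometric mean over orders, and the brevity penalty. The helper names `ngrams` and `clipped_precision` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(predictions, references, n=1):
    """Corpus-level clipped n-gram precision.

    predictions: list of strings
    references:  list of lists of strings (one or more references per
                 prediction), the same nesting passed to metric.compute
    """
    matched, total = 0, 0
    for pred, refs in zip(predictions, references):
        pred_counts = ngrams(pred.split(), n)
        # For each n-gram, the highest count seen in any single reference
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref.split(), n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        # Clip each predicted n-gram count by its reference count
        matched += sum(min(count, max_ref_counts[gram])
                       for gram, count in pred_counts.items())
        total += sum(pred_counts.values())
    return matched / total if total else 0.0

fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
print(clipped_precision(fake_preds, fake_labels, n=1))  # 1.0

# Repeating a word is not rewarded: only 2 of the 3 predicted tokens match
print(clipped_precision(["hello hello there"], [["hello there"]], n=1))
```

Note that the references are nested one level deeper than the predictions, because each prediction may have several acceptable references.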

## Preprocessing the data

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Model-specific tokenizer adaptations

In [None]:
if "t5" in model_checkpoint:
    prefix = "translate English to German: "
    print("Inputs will include prefix!")
else:
    prefix = ""
    print("Inputs will not include prefix!")

if "mbart" in model_checkpoint:
    tokenizer.src_lang = "en-XX"
    tokenizer.tgt_lang = "de-DE"

Inputs will include prefix!


Create preprocessing function

In [None]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "de"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
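To make the shape of the batched data concrete, the sketch below mimics `preprocess_function` with a toy whitespace tokenizer. `toy_tokenize` and `toy_preprocess` are hypothetical stand-ins: they imitate only the prefixing, the truncation, and the `labels` key, and they return word strings instead of the integer IDs that the real SentencePiece tokenizer produces.

```python
PREFIX = "translate English to German: "

def toy_tokenize(texts, max_length):
    """Hypothetical stand-in for a tokenizer: split on whitespace and truncate."""
    return {"input_ids": [text.split()[:max_length] for text in texts]}

def toy_preprocess(examples, source_lang="en", target_lang="de", max_length=8):
    # A batched example holds a list of {"en": ..., "de": ...} dicts
    # under the "translation" key
    inputs = [PREFIX + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = toy_tokenize(inputs, max_length)
    # Target token sequences are stored under the "labels" key
    model_inputs["labels"] = toy_tokenize(targets, max_length)["input_ids"]
    return model_inputs

batch = {"translation": [{"en": "I like tea", "de": "Ich mag Tee"}]}
out = toy_preprocess(batch)
print(out["input_ids"][0])
print(out["labels"][0])
```

The prefix counts toward the input length, so with a small `max_length` the prefix alone can consume most of the budget.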

Specify whether reduced dataset should be passed to model

In [None]:
trim_datasets = True
train_size = 10000
eval_size = 1000

Preprocess data

In [None]:
from datasets import DatasetDict  # needed below; `datasets` itself is not imported above

if trim_datasets:
  small_train_dataset = raw_datasets["train"].shuffle(seed=42).select(range(train_size))
  small_eval_dataset = raw_datasets["validation"].shuffle(seed=42).select(range(eval_size))
  raw_datasets_trim = DatasetDict({'train': small_train_dataset, 'validation': small_eval_dataset})
  tokenized_datasets = raw_datasets_trim.map(preprocess_function, batched=True)
  print("Datasets trimmed and tokenized.")
else:
  tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
  print("Raw datasets tokenized.")
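The effect of `shuffle(seed=42).select(range(train_size))` is a reproducible random subset. A plain-Python sketch of the same idea follows. It uses `random.Random` rather than the 🤗 Datasets library, so the indices it picks will differ from those the library picks; the helper name `seeded_subset` is hypothetical.

```python
import random

def seeded_subset(items, size, seed=42):
    """Shuffle the indices of `items` with a fixed seed and keep the first `size`."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    return [items[i] for i in indices[:size]]

data = [f"example-{i}" for i in range(100)]
first = seeded_subset(data, 10)
second = seeded_subset(data, 10)
print(first == second)   # the same seed gives the same subset
print(len(set(first)))   # the subset contains no duplicates
```

Fixing the seed means repeated runs train and evaluate on the same examples, which keeps scores comparable across runs.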

The results are automatically cached by the 🤗 Datasets library, so this step is not repeated the next time you run the notebook. The library can normally detect when the function you pass to `map` has changed, in which case it does not reuse the cached data. 🤗 Datasets warns you when it uses cached files; to ignore them and force the preprocessing to run again, pass `load_from_cache_file=False` in the call to `map`.

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence, we use the `AutoModelForSeq2SeqLM` class. As with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [None]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)


Specify batch size and training arguments

In [None]:
batch_size = 8
model_name = model_checkpoint.split("/")[-1]
if wb:
  report = "wandb"
else:
  report = "none"
args = Seq2SeqTrainingArguments(
    # f"drive/MyDrive/{model_name}-finetuned-{source_lang}-to-{target_lang}",
    # f"drive/MyDrive/XLdefgen-{source_lang}-to-{target_lang}",
    f"XLdefgen-trans-{source_lang}-to-{target_lang}-train{train_size}-bat{batch_size}", #output directory
    evaluation_strategy = "steps",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3, #max num of checkpoints to keep
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,         #mixed precision (acceleration) - doesn't work well with t5 models
    push_to_hub=False,  #push to HF Model Hub
    report_to=report,   #for data logging
    ignore_data_skip=True   #if true and loading from checkpoint, this will start at beginning of dataset rather than where left off
)

using `logging_steps` to initialize `eval_steps` to 500
PyTorch: setting up devices


Add a data collator that pads inputs and labels to the longest sequence in each batch

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
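A minimal sketch of what per-batch padding does, written with plain lists. It assumes right-side padding, a pad token id of 0, and the label padding value -100; the helper `pad_batch` is hypothetical and leaves out everything else the real collator handles (tensors, decoder inputs, and so on).

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels are padded with -100 so the loss ignores those positions
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch

features = [
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10, 11]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[5, 6, 7], [5, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(batch["labels"])          # [[8, 9, -100, -100], [8, 9, 10, 11]]
```

Padding to the longest sequence in the batch, rather than to a fixed global length, avoids spending compute on padding when a batch contains only short sentences.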

Post-processing and compute metrics

In [None]:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
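Two steps in `compute_metrics` are easy to misread: the replacement of -100 in the labels and the `gen_len` computation. The sketch below reproduces both with plain lists instead of NumPy arrays. The pad token id of 0 and the helper names are assumptions made for illustration.

```python
PAD_ID = 0  # assumed pad token id for this illustration

def restore_padding(labels, pad_token_id=PAD_ID, ignore_index=-100):
    """Replace the loss-masking value -100 with the pad id so rows can be decoded."""
    return [[pad_token_id if tok == ignore_index else tok for tok in row]
            for row in labels]

def mean_generated_length(preds, pad_token_id=PAD_ID):
    """Average number of non-pad tokens per prediction (the gen_len value)."""
    lengths = [sum(1 for tok in row if tok != pad_token_id) for row in preds]
    return sum(lengths) / len(lengths)

labels = [[8, 9, -100, -100], [8, 9, 10, 11]]
preds = [[0, 8, 9, 1, 0], [0, 8, 9, 10, 1]]
print(restore_padding(labels))        # [[8, 9, 0, 0], [8, 9, 10, 11]]
print(mean_generated_length(preds))   # 3.5
```

The replacement is needed because -100 is not a valid vocabulary index; once it is swapped for the pad id, decoding with `skip_special_tokens=True` drops those positions.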

Instantiate Trainer

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Train/fine-tune the model

In [None]:
torch.cuda.empty_cache() #to free up space
if wb:
  wandb.init()
trainer.train(resume_from_checkpoint=False)
if wb:
  wandb.finish()

The following columns in the training set  don't have a corresponding argument in `MT5ForConditionalGeneration.forward` and have been ignored: translation.
***** Running training *****
  Num examples = 10000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1250
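The step count in the log follows from the dataset size and the batch size. A small sketch of the arithmetic, assuming a single device and no gradient accumulation as in the log above; the helper is hypothetical, and the `Trainer` computes this internally in a way that may differ in edge cases.

```python
import math

def total_optimization_steps(num_examples, per_device_batch_size, num_epochs=1,
                             num_devices=1, gradient_accumulation_steps=1):
    """Estimate the number of optimizer updates for a training run."""
    total_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / total_batch)
    return steps_per_epoch * num_epochs

print(total_optimization_steps(10000, 8))  # 1250
```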



RuntimeError: ignored

## Model testing

Test model predictive capacity with an example

In [None]:
input_ids = tokenizer.encode(prefix + 'I enjoy walking with my cute dog', return_tensors='pt')
print(input_ids)

input_ids = tokenizer(prefix + 'I enjoy walking with my cute dog', return_tensors='pt').input_ids
print(input_ids)

input_ids = input_ids.to(device)

greedy_output = model.generate(input_ids)
print("\nGreedy Output:")
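# min_length is a generation option, not a decoding option; pass it to
# model.generate() if a minimum output length is wanted
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))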

outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3)
print("\n" + 100 * '-' + "\n\nBeam Output:")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

tensor([[37194,  5413,   288, 20567,   267,   336,  9070,   259, 42822,   514,
          1037, 64712, 10990,     1]])
tensor([[37194,  5413,   288, 20567,   267,   336,  9070,   259, 42822,   514,
          1037, 64712, 10990,     1]])

Greedy Output:
Für einen kleines Haustiere spielen!

----------------------------------------------------------------------------------------------------

Beam Output:
['。『 walking with I walking with  my cute sheeping.', '。『 walking with I walking with  my cute sheeping, dass ich', '。『 walking with I walking with  my cute sheeping and a ']
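The difference between the greedy and beam outputs comes from how each strategy searches over token sequences. The toy sketch below illustrates the two strategies on a hand-written table of next-token probabilities. The table `NEXT` and both functions are hypothetical; the real `generate` method works on model logits and also handles end-of-sequence tokens, length penalties, and batching.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
NEXT = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"x": 0.9, "y": 0.1},
}

def greedy(steps=2):
    """Pick the single most probable next token at every step."""
    seq, logp = ["<s>"], 0.0
    for _ in range(steps):
        token, p = max(NEXT[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(token)
        logp += math.log(p)
    return seq[1:], logp

def beam_search(num_beams=2, steps=2):
    """Keep the num_beams best partial sequences by cumulative log-probability."""
    beams = [(["<s>"], 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + [token], logp + math.log(p))
            for seq, logp in beams
            for token, p in NEXT[seq[-1]].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return [(seq[1:], logp) for seq, logp in beams]

greedy_seq, greedy_logp = greedy()
best_seq, best_logp = beam_search()[0]
print(greedy_seq, round(math.exp(greedy_logp), 2))  # ['A', 'x'] 0.24
print(best_seq, round(math.exp(best_logp), 2))      # ['B', 'x'] 0.36
```

Greedy search commits to `A` because it is the most probable first token, while beam search keeps `B` alive and finds a sequence with higher overall probability.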


Push Model to HF Model Hub

In [None]:
# trainer.push_to_hub()