## Fine-tuning a Seq2Seq model (T5) in Colab with limited RAM, PyTorch version

This notebook is an extension of the notebook [Training NLP models in Colab without running out of RAM](https://github.com/datasci-w266/2022-fall-main/blob/master/materials/walkthrough_notebooks/keras_with_limited_ram/keras_training_with_limited_ram.ipynb).

This one focuses on sequence-to-sequence (encoder-decoder, text generation) models like T5, because fine-tuning the Hugging Face pretrained versions of those models is a little different from fine-tuning BERT. With T5, you use the full pretrained model end-to-end without adding any layers on top.

Some Hugging Face models are only available as PyTorch versions. Fine-tuning those is a little different from using Keras with TensorFlow models. But Hugging Face provides a trainer class that is pretty easy to use and can work with an iterator somewhat like the data generator used in the first notebook (to avoid loading all your data at once).

(If you're using a TensorFlow model, we have a [separate notebook for fine-tuning a pretrained TensorFlow T5 model](https://github.com/datasci-w266/2022-fall-main/blob/master/materials/walkthrough_notebooks/keras_with_limited_ram/fine_tune_t5_with_limited_ram_keras.ipynb).)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2022-fall-main/blob/master/materials/walkthrough_notebooks/keras_with_limited_ram/fine_tune_t5_with_limited_ram_pytorch.ipynb)

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1


In [None]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97


In [None]:
import os
import re
import numpy as np
import pandas as pd

import torch  # This notebook uses the PyTorch version of T5
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

### Data

To fine-tune T5, we'll use the dataset from the [week 6 lesson notebook](https://github.com/datasci-w266/2022-fall-main/blob/master/materials/lesson_notebooks/lesson_6_Machine_Translation.ipynb) for translating Shakespeare to modern English. You can [download the dataset here](https://github.com/cocoxu/Shakespeare), or access [the copy that is in the lesson_notebooks directory](https://github.com/datasci-w266/2022-fall-main/blob/master/materials/lesson_notebooks/train_plays-org-mod.txt) in the class git repo, and then upload it to your Drive folder.

In [None]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Modify this path to where you saved the Shakespeare data in your Drive
text_file = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/train_plays-org-mod.txt'

In [None]:
with open(text_file) as f:
    lines = f.read().split('\n')[:-1]

prefix = 'translate old to modern: '
text_pairs = []
for line in lines:
    orig, target = line.split('\t')
    text_pairs.append({'orig': prefix + orig, 'target': target})

In [None]:
# Look at some examples
for _ in range(5):
    print(np.random.choice(text_pairs))

{'orig': 'translate old to modern: Marry, they say my uncle grew so fast That he could gnaw a crust at two hours old.', 'target': 'Indeed, they say my uncle grew so fast that he could gnaw on a crust of bread when he was two hours old.'}
{'orig': 'translate old to modern: Sunday comes apace.', 'target': 'Sunday is just around the corner.'}
{'orig': 'translate old to modern: There is one goat for you.', 'target': 'Here’s a goat for you.'}
{'orig': 'translate old to modern: Never make denial.', 'target': 'Don’t even think about refusing.'}
{'orig': 'translate old to modern: Lorenzo, certain; and my love, indeed; For who love I so much?', 'target': 'Lorenzo, to be sure and my love indeed, Because who do love I so much?'}


In [None]:
# Let's create some splits
np.random.shuffle(text_pairs)
num_valid_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_valid_samples
train_pairs = text_pairs[:num_train_samples]
valid_pairs = text_pairs[num_train_samples : num_train_samples + num_valid_samples]
test_pairs = text_pairs[num_train_samples + num_valid_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(valid_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

19088 total pairs
13362 training pairs
2863 validation pairs
2863 test pairs
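
As a quick sanity check on the split sizes printed above, the arithmetic can be reproduced without the data. This is only a sketch: `split_sizes` is a hypothetical helper, not part of the notebook, and it assumes the same 15% validation and test fractions used in the cell above.

```python
def split_sizes(n_total, valid_fraction=0.15):
    """Return (train, valid, test) sizes using the same rule as the cell above."""
    n_valid = int(valid_fraction * n_total)   # int() truncates toward zero
    n_train = n_total - 2 * n_valid           # validation and test get the same share
    n_test = n_total - n_train - n_valid      # whatever is left after train and valid
    return n_train, n_valid, n_test

print(split_sizes(19088))
```

With 19088 pairs this gives 13362 training pairs and 2863 pairs each for validation and test, matching the printed counts.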


In [None]:
# Save splits to separate csv files, to load only part at a time later
train_file = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/train_pairs.csv'
valid_file = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/valid_pairs.csv'
test_file = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/test_pairs.csv'

pd.DataFrame(train_pairs).to_csv(train_file)
pd.DataFrame(valid_pairs).to_csv(valid_file)
pd.DataFrame(test_pairs).to_csv(test_file)
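
The point of writing each split to its own CSV is that a single row can be read back later without loading the whole file. The sketch below illustrates that idea using only the standard library (the notebook itself uses `pd.read_csv` with `skiprows` and `nrows`). `read_row` and the toy file are hypothetical, and the toy file only has the `orig` and `target` columns.

```python
import csv
import itertools
import os
import tempfile

def read_row(filename, row_number):
    """Return data row `row_number` (1-based, header excluded) as a dict.

    Earlier rows are read and discarded one at a time, never held in memory together.
    """
    with open(filename, newline='') as f:
        reader = csv.DictReader(f)  # consumes the header line
        return next(itertools.islice(reader, row_number - 1, None))

# Toy file standing in for one of the split files
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_pairs.csv')
with open(toy_path, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['orig', 'target'])
    writer.writeheader()
    writer.writerow({'orig': 'Sunday comes apace.',
                     'target': 'Sunday is just around the corner.'})
    writer.writerow({'orig': 'Never make denial.',
                     'target': "Don't even think about refusing."})

second_row = read_row(toy_path, 2)
print(second_row['orig'])
```

The file is still scanned from the top to reach the requested row, so this trades some speed for a small memory footprint.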

### Preprocessor and Data Iterator

As in the previous notebook for BERT models, we'll define a preprocessing function and data iterator class, to only load a little data at a time while training.

The preprocessor and data iterator need to be defined slightly differently from the data generator used in the previous notebook. The trainer will get one example at a time from our iterator. So the preprocessing function needs to return a one-dimensional vector of input_ids for each example, not a two-dimensional batch. (The trainer will put them together into a batch to pass into the model.)
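
To make the one-example-at-a-time contract concrete, here is a rough sketch of what batching does with the per-example dictionaries. Plain Python lists stand in for tensors, and `collate` is a hypothetical stand-in for the trainer's collation step, which stacks tensors rather than lists.

```python
def collate(examples):
    """Combine a list of per-example dicts into one dict of batched lists."""
    keys = examples[0].keys()
    return {key: [example[key] for example in examples] for key in keys}

# Two already-padded examples, each a one-dimensional sequence of length 4
example_a = {'input_ids': [13, 7, 1, 0], 'attention_mask': [1, 1, 1, 0], 'labels': [9, 1, 0, 0]}
example_b = {'input_ids': [21, 5, 8, 1], 'attention_mask': [1, 1, 1, 1], 'labels': [4, 6, 1, 0]}

batch = collate([example_a, example_b])
# Batch size, then sequence length
print(len(batch['input_ids']), len(batch['input_ids'][0]))
```

This only works because every example is padded to the same length, which is why the preprocessor pads to max_length.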

For the preprocessor, the PyTorch models want the inputs in a dictionary with keys for 'input_ids', 'attention_mask', and 'labels'. The first two are the inputs to the encoder (the original text), and the labels are the vocab ids of the translated text. Since we're passing only the labels, we don't need to separate out the decoder input_ids. The model will build those from the labels (shifted right by one position).
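
The one-position offset between labels and decoder inputs can be sketched as follows. This mirrors, in plain Python, the right shift that T5 applies to the labels. `shift_right` and `mask_padding` are hypothetical helpers, and the pad and start ids of 0 are the values in the T5 config printed near the end of this notebook. The second helper shows a common optional refinement that this notebook does not apply: replacing pad positions in the labels with -100 so the loss skips them.

```python
PAD_ID = 0            # T5 pad_token_id
START_ID = 0          # T5 decoder_start_token_id (same id as pad)
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss

def shift_right(labels):
    """Build decoder input ids: prepend the start id and drop the last label."""
    cleaned = [PAD_ID if token == IGNORE_INDEX else token for token in labels]
    return [START_ID] + cleaned[:-1]

def mask_padding(labels):
    """Replace pad ids in the labels with IGNORE_INDEX."""
    return [IGNORE_INDEX if token == PAD_ID else token for token in labels]

labels = [262, 31, 1, 0, 0]   # toy target ids: two tokens, EOS (1), then padding
decoder_input_ids = shift_right(labels)
masked_labels = mask_padding(labels)

print(decoder_input_ids)
print(masked_labels)
```

At each position the decoder sees the previous target token and is trained to predict the current one.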

In [None]:
def preprocess_data(text_pair, tokenizer, max_length=128):
    orig_text, target_text = text_pair
    orig_encoded = tokenizer.batch_encode_plus(
        [orig_text],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )

    orig_input_ids = orig_encoded['input_ids'][0]
    orig_attention_mask = orig_encoded['attention_mask'][0]
    
    target_encoded = tokenizer.batch_encode_plus(
        [target_text],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    
    label_ids = target_encoded['input_ids'][0]
    
    return {'input_ids': orig_input_ids,
            'attention_mask': orig_attention_mask,
            'labels': label_ids}

In [None]:
class TranslationDataIterator:
    
    def __init__(self,
                 tokenizer,
                 n_examples,
                 data_filename,
                 max_length=128,
                 shuffle=True):
        
        self.tokenizer = tokenizer
        self.n_examples = n_examples
        self.data_filename = data_filename
        self.max_length = max_length
        self.shuffle = shuffle
        
        # Initialize row order, call on_epoch_end to shuffle row indices
        self.row_order = np.arange(1, self.n_examples+1)
        self.on_epoch_end()
    
    def __len__(self):
        return self.n_examples
    
    def __getitem__(self, idx):
        row_to_load = self.row_order[idx]
        df = pd.read_csv(self.data_filename,
                         skiprows=range(1, row_to_load),
                         nrows=1)
        
        text_pair = df[['orig', 'target']].values.astype(str)[0]
        
        item_data = preprocess_data(
            text_pair,
            self.tokenizer,
            self.max_length
        )

        return item_data
    
    def __call__(self):
        for i in range(self.__len__()):
            yield self.__getitem__(i)
            
            if i == self.__len__()-1:
                self.on_epoch_end()
    
    def on_epoch_end(self):
        if self.shuffle:
            self.row_order = list(np.random.permutation(self.row_order))
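
The indirection through `row_order` is the key idea in the class above: position `idx` maps to a shuffled CSV row number, so every row is still visited exactly once per pass. Here is a toy sketch with no file access, where `ToyRowOrder` is hypothetical and the standard `random` module stands in for `np.random`.

```python
import random

class ToyRowOrder:
    """Map positions 0..n-1 to 1-based row numbers, reshuffled on request."""

    def __init__(self, n_examples, shuffle=True, seed=0):
        self.n_examples = n_examples
        self.shuffle = shuffle
        self.rng = random.Random(seed)                    # seeded for repeatability
        self.row_order = list(range(1, n_examples + 1))   # row 0 is the CSV header
        self.on_epoch_end()

    def __len__(self):
        return self.n_examples

    def __getitem__(self, idx):
        return self.row_order[idx]

    def on_epoch_end(self):
        if self.shuffle:
            self.rng.shuffle(self.row_order)

rows = ToyRowOrder(5)
first_pass = [rows[i] for i in range(len(rows))]
rows.on_epoch_end()
second_pass = [rows[i] for i in range(len(rows))]
print(sorted(first_pass), sorted(second_pass))
```

Any object with `__len__` and `__getitem__` can be indexed this way, which is what lets a PyTorch `DataLoader` treat it as a map-style dataset.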

In [None]:
# Download tokenizer and model

model_name = 't5-base'
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)
t5_model = T5ForConditionalGeneration.from_pretrained(model_name)


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.



In [None]:
# Create the data iterators for train and validation data, pytorch version

max_length = 32

train_data_iterator = TranslationDataIterator(
    tokenizer=t5_tokenizer,
    n_examples=len(train_pairs),
    data_filename=train_file,
    max_length=max_length
)

valid_data_iterator = TranslationDataIterator(
    tokenizer=t5_tokenizer,
    n_examples=len(valid_pairs),
    data_filename=valid_file,
    max_length=max_length
)

In [None]:
# Specify batch size and other training arguments

batch_size = 16

# Modify this filepath to where you want to save the model after fine-tuning
dir_path = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/'
file_path = dir_path + 't5base-finetuned-shakespeare-to-modern'

args = Seq2SeqTrainingArguments(
    file_path,
    evaluation_strategy='epoch',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
)

In [None]:
# Define the trainer, passing in the model, training args, and data iterators

trainer = Seq2SeqTrainer(
    t5_model,
    args,
    train_dataset=train_data_iterator,
    eval_dataset=valid_data_iterator
)

In [None]:
# Call train

trainer.train()

***** Running training *****
  Num examples = 13362
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 836


Epoch,Training Loss,Validation Loss
1,1.0129,0.727536


Saving model checkpoint to drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/t5base-finetuned-shakespeare-to-modern/checkpoint-500
Configuration saved in drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/t5base-finetuned-shakespeare-to-modern/checkpoint-500/config.json
Model weights saved in drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/t5base-finetuned-shakespeare-to-modern/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2863
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=836, training_loss=0.9167570999364534, metrics={'train_runtime': 485.8905, 'train_samples_per_second': 27.5, 'train_steps_per_second': 1.721, 'total_flos': 508555958353920.0, 'train_loss': 0.9167570999364534, 'epoch': 1.0})
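
The step count in the log follows from the dataset size and batch size. Here is a short worked check, assuming a single device and no gradient accumulation (both shown in the log), and assuming the 500-step checkpoint interval implied by the `checkpoint-500` folder name.

```python
import math

num_examples = 13362
batch_size = 16
num_epochs = 1
save_steps = 500  # assumed: checkpoint interval when none is passed explicitly

steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * num_epochs
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(total_steps, checkpoint_steps)
```

So the only checkpoint written during this run reflects the model after 500 of 836 steps, roughly 60% of the way through the epoch.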

### Does it seem to have worked?

Depending on your task, you'll add your own model evaluation after training. Here's a simple check to make sure it does seem to have fine-tuned T5 for this new task we defined.

In [None]:
for test_input_text in ['Hence forth thou shalt not vex me e\'er again.',
                        'Dost thou foresake me?',
                        'Makest thine own dinner.']:
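    test_inputs = t5_tokenizer([prefix + test_input_text], return_tensors='pt')
    # Move the inputs to whichever device the model is on (the GPU after training, if one was used)
    test_output_ids = t5_model.generate(test_inputs['input_ids'].to(t5_model.device))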

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True, 
                               clean_up_tokenization_spaces=False) for out_ids in test_output_ids])



['You’ll not vex me again.']
['Do you foresee me?']
['Make your own dinner.']


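You can load the model you trained using the same .from_pretrained function you use for pretrained models. If you look in your Drive folder, at the filepath you used in the training arguments, you'll see a checkpoint folder. Use the full path to that checkpoint folder as the argument to .from_pretrained to load the saved model again later. Note that this checkpoint was saved at step 500 of 836, so it is not quite the final model. To keep the final weights as well, you can call trainer.save_model() after training.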
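
If a longer run leaves several checkpoint folders, it helps to pick the most recent one programmatically. A minimal sketch, assuming the `checkpoint-<step>` folder naming seen in the log above. `latest_checkpoint` is a hypothetical helper, and a temporary directory stands in for the Drive folder.

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N folder, or None."""
    pattern = re.compile(r'^checkpoint-(\d+)$')
    candidates = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Tuples compare by step number first, so max() picks the latest step
    return os.path.join(output_dir, max(candidates)[1])

# Toy output directory with three fake checkpoint folders
toy_dir = tempfile.mkdtemp()
for step in (500, 1000, 1500):
    os.makedirs(os.path.join(toy_dir, f'checkpoint-{step}'))

print(os.path.basename(latest_checkpoint(toy_dir)))
```

The returned path can then be passed to .from_pretrained in place of the hard-coded checkpoint folder.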

In [None]:
t5_model_saved = T5ForConditionalGeneration.from_pretrained(file_path + '/checkpoint-500')

loading configuration file drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/t5base-finetuned-shakespeare-to-modern/checkpoint-500/config.json
Model config T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams":

In [None]:
# Still works?
for test_input_text in ['Hence forth thou shalt not vex me e\'er again.',
                        'Dost thou foresake me?',
                        'Makest thine own dinner.']:
    test_inputs = t5_tokenizer([prefix + test_input_text], return_tensors='pt')
    test_output_ids = t5_model_saved.generate(test_inputs['input_ids'])

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True, 
                               clean_up_tokenization_spaces=False) for out_ids in test_output_ids])

['You’ll not vex me again.']
['Do you foresee me?']
['Make your own dinner.']
