## Fine-tuning a Seq2Seq model (T5) in Colab with limited RAM, PyTorch version

This notebook extends [Training NLP models in Colab without running out of RAM](https://github.com/datasci-w266/2024-summer-main/blob/master/materials/walkthrough_notebooks/keras_with_limited_ram/keras_training_with_limited_ram.ipynb). The series focuses on avoiding out-of-memory errors by loading only part of your data at a time while you train, and by saving model checkpoints as you go. We recommend reading that earlier notebook first: it explains these techniques more completely, though in the context of fine-tuning a BERT model.

This notebook focuses on sequence-to-sequence (encoder-decoder, text generation) models like T5, because fine-tuning the Hugging Face pretrained versions of those models works a little differently than it does for BERT. With T5, you use the full pretrained model end-to-end without adding any layers.

Some Hugging Face models are only available in PyTorch versions. Fine-tuning those is a little different from using Keras with TensorFlow models. But Hugging Face provides a trainer class that is fairly easy to use and can work with an iterator somewhat like the data generator used in the first notebook (to avoid loading all your data at once).

(If you're using a TensorFlow model, we have a [separate notebook for fine-tuning a pretrained TensorFlow T5 model](https://github.com/datasci-w266/2024-summer-main/blob/master/materials/walkthrough_notebooks/keras_with_limited_ram/fine_tune_t5_with_limited_ram_keras.ipynb).)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2024-summer-main/blob/master/materials/walkthrough_notebooks/keras_with_limited_ram/fine_tune_t5_with_limited_ram_pytorch.ipynb)

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.23.0


In [None]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [None]:
import os
import re
import numpy as np
import pandas as pd

# TensorFlow is not needed for the PyTorch version, so we skip importing it
import torch
import transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

### Data

To fine-tune T5, we'll use the dataset from the [week 6 lesson notebook](https://github.com/datasci-w266/2024-summer-main/blob/master/materials/lesson_notebooks/lesson_6_Machine_Translation.ipynb) for translating Shakespeare to modern English. You can [download the dataset here](https://github.com/cocoxu/Shakespeare), or access [the copy that is in the lesson_notebooks directory](https://github.com/datasci-w266/2024-summer-main/blob/master/materials/lesson_notebooks/train_plays-org-mod.txt) in the class git repo, and then upload it to your Drive folder.

In [None]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Modify this path to where you saved the Shakespeare data in your Drive
text_file = 'drive/MyDrive/ISchool/MIDS/266/lessons/train_plays-org-mod.txt'

In [None]:
with open(text_file) as f:
    lines = f.read().split('\n')[:-1]

prefix = 'translate old to modern: '
text_pairs = []
for line in lines:
    orig, target = line.split('\t')
    text_pairs.append({'orig': prefix + orig, 'target': target})

In [None]:
# Look at some examples
for _ in range(5):
    print(np.random.choice(text_pairs))

{'orig': 'translate old to modern: All hail, Macbeth!', 'target': 'All hail, Macbeth!'}
{'orig': 'translate old to modern: Fie on her!', 'target': 'The hell with her!'}
{'orig': 'translate old to modern: If it were done when ’tis done, then ’twere well It were done quickly.', 'target': 'If this deed were done when it is done, then it would be better If it were done quickly.'}
{'orig': 'translate old to modern: Tis one Iago, ancient to the general.', 'target': 'It is one Iago, ensign to the general.'}
{'orig': 'translate old to modern: Come, sir, dispatch.', 'target': 'Come, sir, get going.'}


In [None]:
# Let's create some splits
np.random.shuffle(text_pairs)
num_valid_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_valid_samples
train_pairs = text_pairs[:num_train_samples]
valid_pairs = text_pairs[num_train_samples : num_train_samples + num_valid_samples]
test_pairs = text_pairs[num_train_samples + num_valid_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(valid_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

19088 total pairs
13362 training pairs
2863 validation pairs
2863 test pairs


In [None]:
# Save splits to separate csv files, to load only part at a time later
train_file = 'drive/MyDrive/ISchool/MIDS/266/walkthroughs/train_pairs.csv'
valid_file = 'drive/MyDrive/ISchool/MIDS/266/walkthroughs/valid_pairs.csv'
test_file = 'drive/MyDrive/ISchool/MIDS/266/walkthroughs/test_pairs.csv'

pd.DataFrame(train_pairs).to_csv(train_file)
pd.DataFrame(valid_pairs).to_csv(valid_file)
pd.DataFrame(test_pairs).to_csv(test_file)

### Preprocessor and Data Iterator

As in the previous notebook for BERT models, we'll define a preprocessing function and data iterator class, to only load a little data at a time while training. The preprocessor and data iterator need to be defined slightly differently from the data generator used in the previous notebook.

For the preprocessor, the PyTorch models want the inputs in a dictionary with keys for 'input_ids', 'attention_mask', and 'labels'. The first two are the inputs to the encoder (the original text), and the labels are the vocab ids of the translated text. Since we're passing all of this into a trainer, we don't need to build the decoder input_ids separately: when only labels are provided, the model derives them from the labels (shifted right by one position).
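
To make the "offset by one" idea concrete, here is a minimal sketch of deriving decoder inputs from labels by shifting them one position to the right, so that at each step the decoder sees the previous target tokens and is trained to predict the current one. It assumes T5's convention that the pad id (0) is also used as the decoder start token and that 1 is the end-of-sequence id; the function name `shift_right` and the example ids are made up for illustration, and the library's internal implementation may differ in detail.

```python
def shift_right(label_ids, decoder_start_token_id=0):
    """Return decoder inputs: the start token followed by all labels but the last."""
    return [decoder_start_token_id] + list(label_ids[:-1])

# Hypothetical label ids for a short target, ending in the end-of-sequence id (1)
labels = [571, 33, 25, 1]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)  # [0, 571, 33, 25]
```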

For the input data, the trainer will index an item from our iterator for one example at a time (not a whole batch). So the preprocessing function needs to return a one-dimensional vector of input_ids for each example, not a two-dimensional batch. (The trainer will put them together into a batch to pass into the model.)

This process would get very slow if we called pandas.read_csv once for every single example (row of data), though. So we'll add a method to the iterator that pre-loads larger chunks of data (but not so large that we run out of memory), and then returns one example each time the get-item method is called. (The first notebook in this series explains why we're using pandas, which has a handy option for specifying which rows of a csv to skip and which to load. Pandas isn't the fastest IO method, but the alternatives get more complicated, so replace it if you're comfortable using something else.)
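
As a small illustration of the `skiprows` option mentioned above, the sketch below writes a tiny stand-in CSV (made-up rows, in the same layout that `to_csv` produced for the split files) and then loads only two of its rows. The file name and row choices are hypothetical. Note that line 0 of the file is the header, which is why the data rows are numbered from 1.

```python
import os
import tempfile

import pandas as pd

# Write a tiny stand-in CSV with made-up rows
df = pd.DataFrame({'orig': [f'old {i}' for i in range(6)],
                   'target': [f'new {i}' for i in range(6)]})
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'pairs.csv')
df.to_csv(csv_path)

# File line 0 is the header, so the six data rows are file lines 1..6.
rows_to_load = [2, 5]
rows_to_skip = [i for i in range(1, 7) if i not in rows_to_load]

# Only the header and the two requested rows are parsed into memory
chunk = pd.read_csv(csv_path, skiprows=rows_to_skip)
print(chunk[['orig', 'target']])
```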

In [None]:
def preprocess_data(text_pair, tokenizer, max_length=128):
    orig_text, target_text = text_pair
    orig_encoded = tokenizer.batch_encode_plus(
        [orig_text],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )

    orig_input_ids = orig_encoded['input_ids'][0]
    orig_attention_mask = orig_encoded['attention_mask'][0]

    target_encoded = tokenizer.batch_encode_plus(
        [target_text],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )

    label_ids = target_encoded['input_ids'][0]

    return {'input_ids': orig_input_ids,
            'attention_mask': orig_attention_mask,
            'labels': label_ids}
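
One optional refinement, not part of the code above: because the labels are padded to `max_length`, the loss is also computed on the padding positions. A common convention with Hugging Face models is to replace padding ids in the labels with -100, which the loss function ignores. The sketch below shows the idea on a NumPy array; `mask_label_padding` is a hypothetical helper name, the example ids are made up, and the pad id of 0 is T5's default. The same compare-and-replace can be applied to a PyTorch tensor.

```python
import numpy as np

def mask_label_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Return a copy of label_ids with padding positions set to ignore_index."""
    label_ids = np.asarray(label_ids)
    return np.where(label_ids == pad_token_id, ignore_index, label_ids)

# Hypothetical labels padded out to length 8 with the pad id (0)
padded_labels = np.array([571, 33, 25, 1, 0, 0, 0, 0])
print(mask_label_padding(padded_labels))
```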

In [None]:
class TranslationDataIterator:

    def __init__(self,
                 tokenizer,
                 n_examples,
                 max_load_at_once,
                 data_filename,
                 max_length=128,
                 shuffle=True):

        self.tokenizer = tokenizer
        self.n_examples = n_examples
        self.max_load_at_once = max_load_at_once
        self.data_filename = data_filename
        self.max_length = max_length
        self.shuffle = shuffle

        # Initialize row order, call on_epoch_end to shuffle row indices
        self.row_order = np.arange(1, self.n_examples+1)
        self.on_epoch_end()

        # Load the first chunk of max_load_at_once examples.
        # _load_next_chunk stores the chunk on self.df_curr_loaded itself
        # (it returns None, so we must not assign its return value).
        self._load_next_chunk(0)
        self.curr_idx_in_load = 0

    def _load_next_chunk(self, idx):
        load_start = idx
        load_end = idx + self.max_load_at_once

        # Rows to skip are all the ones in the (shuffled) row_order before and
        # after the slice we want to load for this chunk. Convert to lists so
        # that + concatenates even when row_order is still a numpy array
        # (which happens when shuffle=False).
        load_idx_skip = (list(self.row_order[:load_start])
                         + list(self.row_order[load_end:]))
        self.df_curr_loaded = pd.read_csv(self.data_filename, skiprows=load_idx_skip)
        self.df_curr_loaded = self.df_curr_loaded.sample(frac=1)

    def __len__(self):
        return self.n_examples

    def __getitem__(self, idx):
        if self.df_curr_loaded is None or self.curr_idx_in_load >= len(self.df_curr_loaded):
            self._load_next_chunk(idx)
            self.curr_idx_in_load = 0

        text_pair = self.df_curr_loaded[['orig', 'target']].values.astype(str)[self.curr_idx_in_load]
        self.curr_idx_in_load += 1

        item_data = preprocess_data(
            text_pair,
            self.tokenizer,
            self.max_length
        )

        return item_data

    def __call__(self):
        for i in range(self.__len__()):
            yield self.__getitem__(i)

            if i == self.__len__()-1:
                self.on_epoch_end()

    def on_epoch_end(self):
        if self.shuffle:
            self.row_order = list(np.random.permutation(self.row_order))

The code below loads the pretrained T5 PyTorch model and tokenizer, creates the data iterators for train and validation data (we have to pass in the tokenizer), then creates the training arguments and trainer objects that we'll use to fine-tune the model. The code cells below show the main options you can change: max length, the max number of examples our iterator loads at once, and batch size. Reducing any of these will lower your memory usage if you're still having trouble with resources.

In [None]:
# Download tokenizer and model

model_name = 't5-base'
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)
t5_model = T5ForConditionalGeneration.from_pretrained(model_name)

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Downloading model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [None]:
# Create the data iterators for train and validation data, pytorch version

max_length = 32
max_load_at_once = 100

train_data_iterator = TranslationDataIterator(
    tokenizer=t5_tokenizer,
    n_examples=len(train_pairs),
    max_load_at_once=max_load_at_once,
    data_filename=train_file,
    max_length=max_length
)

valid_data_iterator = TranslationDataIterator(
    tokenizer=t5_tokenizer,
    n_examples=len(valid_pairs),
    max_load_at_once=max_load_at_once,
    data_filename=valid_file,
    max_length=max_length
)

In [None]:
# Specify batch size and other training arguments

batch_size = 16

# Modify this filepath to where you want to save the model after fine-tuning
dir_path = 'drive/MyDrive/ISchool/MIDS/266/walkthroughs/'
file_path = dir_path + 't5base-finetuned-shakespeare-to-modern'

args = Seq2SeqTrainingArguments(
    file_path,
    evaluation_strategy='epoch',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
)

In [None]:
# Define the trainer, passing in the model, training args, and data iterators

trainer = Seq2SeqTrainer(
    t5_model,
    args,
    train_dataset=train_data_iterator,
    eval_dataset=valid_data_iterator
)

In [None]:
# Call train

trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.9925 | 0.705465 |


TrainOutput(global_step=836, training_loss=0.8966267599443499, metrics={'train_runtime': 319.5366, 'train_samples_per_second': 41.817, 'train_steps_per_second': 2.616, 'total_flos': 508555958353920.0, 'train_loss': 0.8966267599443499, 'epoch': 1.0})
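
As a sanity check on the `global_step` value in the output above, the number of optimizer steps per epoch is the number of training examples divided by the batch size, rounded up. This sketch assumes a single device and no gradient accumulation (the defaults), and uses the counts printed earlier in this notebook.

```python
import math

num_train_examples = 13362  # training pairs, from the split above
batch_size = 16
num_epochs = 1

# The last batch is smaller than the rest, hence the ceiling
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 836, matching global_step in the TrainOutput
```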

### Does it seem to have worked?

Depending on your task, you'll add your own model evaluation after training. Here's a simple check to make sure it does seem to have fine-tuned T5 for this new task we defined.
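
If you want a number in addition to eyeballing outputs, one very simple option is a token-overlap F1 between each generated sentence and its reference. The sketch below is only an illustration: `token_f1` is a hypothetical helper, the reference sentence is made up, and for translation-style tasks an established metric such as BLEU is usually the better choice.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection counts each shared token at most as often as it
    # appears in both strings
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Made-up reference, compared against a model-style output
score = token_f1('Do you want to leave me?', 'Are you leaving me?')
print(round(score, 2))
```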

In [None]:
# Reduce unnecessary output
transformers.logging.set_verbosity_error()

for test_input_text in ['Hence forth thou shalt not vex me e\'er again.',
                        'Dost thou foresake me?',
                        'Makest thine own dinner.']:
    test_inputs = t5_tokenizer([prefix + test_input_text], return_tensors='pt')
    # Move the inputs to whichever device the model is on (GPU or CPU)
    test_output_ids = t5_model.generate(test_inputs['input_ids'].to(t5_model.device))

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True,
                               clean_up_tokenization_spaces=False) for out_ids in test_output_ids])



['You’ll never vex me again.']
['Do you want to leave me?']
['Make your own dinner.']


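You might have noticed that the Seq2SeqTrainingArguments object takes a filepath where you want to save model checkpoints. If you look in your Drive folder after calling .train(), at the filepath you used in the training arguments, you'll see a checkpoint folder saved. By default the trainer saves a checkpoint every 500 steps, so this 836-step run produces a single checkpoint-500 folder, which holds the weights as of step 500 rather than the end of training.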

You can load previously saved model weights from that filepath the same way you load external pretrained models from Hugging Face, using the ".from_pretrained()" method. Pass the full path to the checkpoint folder as the argument to from_pretrained to load the saved model again later.
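
If a longer run leaves several checkpoint folders behind, you may want to pick the most recent one without hard-coding the step number. The sketch below assumes the folders follow the `checkpoint-<step>` naming seen in this notebook; `latest_checkpoint` is a hypothetical helper, and the demo uses a temporary folder with made-up checkpoint names rather than real saved models.

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r'^checkpoint-(\d+)$')
    steps = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f'checkpoint-{max(steps)}')

# Demo on a temporary folder with made-up checkpoint names
demo_dir = tempfile.mkdtemp()
for step in (500, 1000, 1500):
    os.makedirs(os.path.join(demo_dir, f'checkpoint-{step}'))
print(latest_checkpoint(demo_dir))
```

The returned path can then be passed to from_pretrained in place of a hard-coded checkpoint folder.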

In [None]:
t5_model_saved = T5ForConditionalGeneration.from_pretrained(file_path + '/checkpoint-500')

In [None]:
# Still works?
for test_input_text in ['Hence forth thou shalt not vex me e\'er again.',
                        'Dost thou foresake me?',
                        'Makest thine own dinner.']:
    test_inputs = t5_tokenizer([prefix + test_input_text], return_tensors='pt')
    test_output_ids = t5_model_saved.generate(test_inputs['input_ids'])

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True,
                               clean_up_tokenization_spaces=False) for out_ids in test_output_ids])

['You’ll never vex me again.']
['Do you want to leave me?']
['Make your own dinner.']
