## Fine-tuning a Seq2Seq model (FLAN-T5) in Colab with limited RAM

This notebook shows how to fine-tune a sequence-to-sequence (encoder-decoder) model, FLAN-T5, on a new task. It also shows how to avoid running out of memory by loading only part of your data at a time during training (using a Hugging Face streaming dataset), and how to save model checkpoints as you go.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-summer-main/blob/master/materials/walkthrough_notebooks/fine_tuning_with_limited_ram/fine_tune_t5_with_limited_ram.ipynb)

In [1]:
!pip install -q -U transformers
!pip install -q -U datasets
!pip install -q sentencepiece
# !pip install accelerate -U


In [2]:
import os
import re
import numpy as np
import pandas as pd

import torch
import transformers
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

### Data

To fine-tune T5, we'll use a small dataset for translating Shakespeare to modern English. You can [download the dataset here](https://github.com/cocoxu/Shakespeare) and save it somewhere in your Drive. In the next cells, we'll mount our Drive folder and load the data from there.

In [3]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Modify this path to where you saved the Shakespeare data in your Drive.
text_file = 'drive/MyDrive/ISchool/MIDS/266/data/train_plays-org-mod.txt'

In [5]:
# Confirm we can access the file
!ls -lh drive/MyDrive/ISchool/MIDS/266/data

total 123M
-rw------- 1 root root 651K Mar 13 23:11 embold_github_dev.csv
-rw------- 1 root root  24M Mar 13 23:11 embold_github_train.csv
-rw------- 1 root root  94M Mar 13 23:09 embold_train.json
-rw------- 1 root root 261K Mar 15 05:46 HerdOfModelsPaperTable.png
-rw------- 1 root root 440K Jun 29 18:56 test_pairs.csv
-rw------- 1 root root 2.1M Jun 29 18:56 train_pairs.csv
-rw------- 1 root root 2.1M Aug 13  2023 train_plays-org-mod.txt
-rw------- 1 root root 459K Jun 29 18:56 valid_pairs.csv


In [6]:
# Read the data, which is in one txt file, with one example per line.
# Each example is a pair of Shakespearean and modern English sentences, separated by a tab.

with open(text_file) as f:
    lines = f.read().split('\n')[:-1]

prefix = 'Translate the following text to modern English: '
text_pairs = []
for line in lines:
    orig, target = line.split('\t')
    text_pairs.append({'orig': prefix + orig, 'target': target})

In [7]:
# Look at some examples
for _ in range(5):
    print(np.random.choice(text_pairs))

{'orig': 'Translate the following text to modern English: Away!', 'target': 'Let’s go!'}
{'orig': "Translate the following text to modern English: Get your apparel together; good strings to your beards, new ribbons to your pumps; meet presently at the palace; every man look o'er his part; for the short and the long is, our play is preferred.", 'target': 'Get your costumes together; good strings to your beards, new ribbons to your shoes; meet presently at the palace; every man look over his part; for the short and the long of it is, they want to see our play.'}
{'orig': 'Translate the following text to modern English: Conrade, I say!', 'target': 'Conrade, I say!'}
{'orig': 'Translate the following text to modern English: Kent banished thus?', 'target': 'Kent’s been banished just like that?'}
{'orig': 'Translate the following text to modern English: He hath fought today As if a god, in hate of mankind, had Destroyed in such a shape.', 'target': 'Today he fought as though he were a god wh

In [8]:
# Let's create some splits
np.random.shuffle(text_pairs)
num_valid_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_valid_samples
train_pairs = text_pairs[:num_train_samples]
valid_pairs = text_pairs[num_train_samples : num_train_samples + num_valid_samples]
test_pairs = text_pairs[num_train_samples + num_valid_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(valid_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

19088 total pairs
13362 training pairs
2863 validation pairs
2863 test pairs


In [9]:
# Save this because we'll need to tell the trainer how many examples we have
num_train_examples = len(train_pairs)

In [10]:
# Save splits to separate csv files, to load only part at a time later
train_filepath = 'drive/MyDrive/ISchool/MIDS/266/data/train_pairs.csv'
valid_filepath = 'drive/MyDrive/ISchool/MIDS/266/data/valid_pairs.csv'
test_filepath = 'drive/MyDrive/ISchool/MIDS/266/data/test_pairs.csv'

pd.DataFrame(train_pairs).to_csv(train_filepath, index=False)
pd.DataFrame(valid_pairs).to_csv(valid_filepath, index=False)
pd.DataFrame(test_pairs).to_csv(test_filepath, index=False)

In [11]:
# Remove the full dataset from memory
train_pairs = None
valid_pairs = None
test_pairs = None

The code below loads the pretrained T5 PyTorch model and tokenizer. We'll need the tokenizer before we write our preprocessing function, which we'll map over the streaming datasets, so let's load both now.

We'll use FLAN-T5, which is a version of T5 that has been fine-tuned on a collection of instruction tuning datasets. This makes it better than the original T5 at learning new tasks, when provided with a natural language instruction prompt explaining the new task.

In [12]:
# Download tokenizer and model

model_name = 'google/flan-t5-base'
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)
t5_model = T5ForConditionalGeneration.from_pretrained(model_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



### Streaming Dataset

Now we'll create our train and validation datasets. We're going to use Hugging Face dataset objects, which allow us to load publicly available datasets directly from Hugging Face, or from a local source (we'll use the CSVs stored in our Drive).

When loading a dataset, we can set streaming=True, which means the data won't all be loaded at once. Instead, examples are read lazily, so only the current batch of data needs to be in memory while the model is trained.
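
To make the idea concrete, here is a minimal pure-Python sketch of reading a CSV file one batch at a time. It only illustrates the concept and is not how the datasets library implements streaming; the helper name stream_csv_batches and the demo rows are hypothetical.

```python
import csv
import os
import tempfile
from itertools import islice


def stream_csv_batches(path, batch_size):
    """Yield lists of row dicts, reading at most batch_size rows at a time."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)  # rows are parsed lazily as we iterate
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch


# Tiny demo file with the same column names as our CSVs (made-up rows)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_pairs.csv')
with open(demo_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['orig', 'target'])
    writer.writeheader()
    for i in range(5):
        writer.writerow({'orig': f'orig {i}', 'target': f'target {i}'})

batch_sizes = [len(batch) for batch in stream_csv_batches(demo_path, batch_size=2)]
print(batch_sizes)  # [2, 2, 1]
```

At no point does the sketch hold more than batch_size rows in memory, which is the property we rely on when streaming the training data.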

In [13]:
# Create the datasets for train and validation data

train_dataset = load_dataset("csv", data_files=train_filepath, streaming=True)['train']
valid_dataset = load_dataset("csv", data_files=valid_filepath, streaming=True)['train']

Make a preprocessing function that takes one example from the dataset, containing a pair of sentences (input text and output text). We use the tokenizer to encode the input text into vocab IDs, which will be the inputs to the model's encoder.

We also encode the output text into vocab_ids, and use those as the labels. (For training, the model will automatically create the decoder's input_ids from the labels, shifting them one token to the right and inserting the start of sequence token first.)
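
The shift described above can be sketched on plain Python lists. This is an illustration rather than the library's code: the helper name shift_right and the token IDs are made up, and it assumes T5's convention that the decoder start token is the padding token (ID 0) and that end-of-sequence is ID 1.

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    """Build decoder inputs by prepending the start token and dropping the last label."""
    shifted = [decoder_start_token_id] + list(labels[:-1])
    # Label positions excluded from the loss (-100) are not valid token IDs,
    # so they are replaced with padding in the decoder inputs.
    return [pad_token_id if tok == -100 else tok for tok in shifted]


# Hypothetical token IDs for a short target sentence, ending in end-of-sequence (1)
labels = [37, 1712, 19, 1]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)  # [0, 37, 1712, 19]
```

At each position the decoder sees the tokens before the one it has to predict, which is why the inputs lag the labels by one step.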

In [14]:
max_length = 128

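def preprocess_data(text_pair):
    # Each streamed example is a dict keyed by CSV column name. Index it by key:
    # unpacking the dict directly would yield the column names, not the text.
    orig_text, target_text = text_pair['orig'], text_pair['target']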
    orig_encoded = t5_tokenizer.batch_encode_plus(
        [orig_text],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )

    orig_input_ids = orig_encoded['input_ids'][0]
    orig_attention_mask = orig_encoded['attention_mask'][0]

    target_encoded = t5_tokenizer.batch_encode_plus(
        [target_text],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )

    label_ids = target_encoded['input_ids'][0]

    return {'input_ids': orig_input_ids,
            'attention_mask': orig_attention_mask,
            'labels': label_ids}
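
One optional refinement, which is not part of the cell above: because the labels are padded to max_length, the loss is also computed on the padding positions. A common convention with Hugging Face models is to set those label positions to -100, which the cross-entropy loss ignores. Here is a minimal sketch on plain Python lists, with a hypothetical helper name and made-up token IDs, assuming T5's padding ID of 0:

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_padding(label_ids, pad_token_id=0):
    """Replace padding IDs in the labels so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


# Hypothetical padded labels: three real tokens, end-of-sequence (1), then padding (0)
padded_labels = [37, 1712, 19, 1, 0, 0, 0]
masked_labels = mask_padding(padded_labels)
print(masked_labels)  # [37, 1712, 19, 1, -100, -100, -100]
```

On the tensor used in preprocess_data, the equivalent would be label_ids[label_ids == t5_tokenizer.pad_token_id] = -100 before returning.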

Now map the preprocessing function over the train and validation datasets. Since they're set to stream data, the preprocessing function won't be called until batches of data are loaded during training.

In [15]:
# Map the preprocessing function to the datasets (it will be called when batches are loaded)

train_dataset = train_dataset.map(preprocess_data)
valid_dataset = valid_dataset.map(preprocess_data)
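
The deferred behaviour can be illustrated with a plain Python generator. This is a conceptual sketch only; lazy_map and preprocess are hypothetical stand-ins, not the datasets API.

```python
calls = []


def preprocess(example):
    calls.append(example)  # record that the function actually ran
    return {'length': len(example)}


def lazy_map(function, examples):
    """Generator version of map: applies function only when an item is requested."""
    for example in examples:
        yield function(example)


mapped = lazy_map(preprocess, ['Away!', 'Conrade, I say!'])
calls_before = len(calls)  # nothing has been processed yet
first = next(mapped)       # pulls and processes exactly one example
calls_after = len(calls)
print(calls_before, calls_after, first)  # 0 1 {'length': 5}
```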

### Training

Now we're ready to set up the trainer. We need a Seq2SeqTrainingArguments object, where we specify the batch size and other training settings. Since the streaming dataset doesn't indicate how much data there actually is, we can't just ask for a number of epochs. Instead, we specify the total number of batches (training steps) we want to run in the argument max_steps.

We also specify a filepath where the model is saved during training. Don't use a path in the notebook's temporary local storage, or the saved model will be lost when the notebook shuts down. Instead, save your fine-tuned model to a place in your mounted Drive folder. That way, you can load the model you trained later on (e.g. to continue training for more epochs, and/or to do model evaluation), rather than having to start over.
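
As a quick check of the max_steps arithmetic, here is the calculation with the numbers from this notebook. The helper name steps_for_epochs is hypothetical, and the sketch assumes training on a single device, so that the per-device batch size equals the total batch size.

```python
def steps_for_epochs(num_examples, batch_size, num_epochs):
    """Number of whole training steps needed to cover the requested epochs."""
    return int(num_epochs * num_examples / batch_size)


# Values from this notebook: 13362 training pairs, batch size 16, 3 epochs
max_steps = steps_for_epochs(13362, 16, 3)
print(max_steps)  # 2505
```

One pass over the data is 13362 / 16 = 835.125 steps, so three passes come to 2505.375 steps, which int() truncates to 2505.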

In [16]:
# Specify batch size and other training arguments

batch_size = 16
num_epochs = 3

# Modify this filepath to where you want to save the model after fine-tuning
dir_path = 'drive/MyDrive/ISchool/MIDS/266/model_checkpoints/'
file_path = dir_path + 't5base-finetuned-shakespeare-to-modern'

args = Seq2SeqTrainingArguments(
    file_path,
    eval_strategy='epoch',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    max_steps=int(num_epochs * num_train_examples / batch_size),
    report_to='none'
)

The Seq2SeqTrainer class takes the training arguments we just defined, the model to use, and the train and validation datasets (our streaming datasets).

Then we just call trainer.train()!

In [17]:
# Define the trainer, passing in the model, training args, and streaming datasets

trainer = Seq2SeqTrainer(
    t5_model,
    args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset
)

In [18]:
# Call train - note this will take a while to complete the three epochs

trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 0 | 1.1113 | 2.9e-05 |
| 1 | 0.0002 | 1.1e-05 |
| 2 | 0.0001 | 8e-06 |


TrainOutput(global_step=2505, training_loss=0.22204420561083135, metrics={'train_runtime': 2913.7441, 'train_samples_per_second': 13.755, 'train_steps_per_second': 0.86, 'total_flos': 6856475370061824.0, 'train_loss': 0.22204420561083135, 'epoch': 2.3325349301397207})

### Does it seem to have worked?

Depending on your task, you'll add your own model evaluation after training. Here's a simple check that fine-tuning FLAN-T5 on our new task seems to have worked. It's not perfect, but it does convert at least some of the antiquated Shakespearean-style text into modern equivalents (e.g. "Dost thou" to "Do you" and "Makest thine" to "Make your").

In [19]:
# Reduce unnecessary output
transformers.logging.set_verbosity_error()

for test_input_text in ['Hence forth thou shalt not vex me e\'er again.',
                        'Dost thou foresake me?',
                        'Makest thine own dinner.']:
    test_inputs = t5_tokenizer([prefix + test_input_text], return_tensors='pt')
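    # Move the inputs to whichever device the model is on (the GPU after training in Colab)
    test_output_ids = t5_model.generate(test_inputs['input_ids'].to(t5_model.device))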

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True,
                               clean_up_tokenization_spaces=False) for out_ids in test_output_ids])

['Thus forth thou shalt not vex me again.']
['Do you foresee me?']
['Make your own dinner.']
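
Beyond spot-checking a few sentences, a fuller evaluation would compare model outputs against reference translations over the whole test split. As one possible starting point, here is a minimal sketch of an exact-match rate. The helper name is hypothetical and the reference translations are made up for illustration; they are not taken from the dataset.

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions that equal their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError('predictions and references must have the same length')
    if not predictions:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(predictions)


# Model outputs from the check above, paired with made-up reference translations
predictions = ['Thus forth thou shalt not vex me again.',
               'Do you foresee me?',
               'Make your own dinner.']
references = ['From now on you will never annoy me again.',
              'Are you abandoning me?',
              'Make your own dinner.']
match_rate = exact_match_rate(predictions, references)
print(f'{match_rate:.2f}')
```

Exact match is a strict metric. For translation-style tasks, overlap-based metrics such as BLEU are more commonly reported, since many different outputs can be acceptable.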


Later, we can load previously saved model checkpoints from the filepath we specified in Drive, the same way we load external pretrained models from Hugging Face, using the ".from_pretrained()" method. Look in Drive for the exact checkpoint name that was saved in the directory you chose.

In [20]:
# Edit this to use the correct checkpoint name that was saved in your Drive.
# If it saved one per epoch, you can use the latest one, or the one that
# had the lowest validation loss during training.

t5_model_saved = T5ForConditionalGeneration.from_pretrained(file_path + '/checkpoint-2505')
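
Rather than typing the checkpoint name by hand, you could look it up. The sketch below assumes the checkpoint folders are named checkpoint-<step>, as in the cell above; the helper name latest_checkpoint is hypothetical, and the demo uses a temporary folder with made-up checkpoint names.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> subfolder with the highest step, or None."""
    pattern = re.compile(r'^checkpoint-(\d+)$')
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo on a temporary folder with made-up checkpoint names
demo_dir = tempfile.mkdtemp()
for step in (500, 1000, 2505):
    os.makedirs(os.path.join(demo_dir, f'checkpoint-{step}'))

latest_name = os.path.basename(latest_checkpoint(demo_dir))
print(latest_name)  # checkpoint-2505
```

The step numbers are compared as integers because a plain string sort would rank checkpoint-500 after checkpoint-2505.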

In [21]:
# Check that it still works
for test_input_text in ['Hence forth thou shalt not vex me e\'er again.',
                        'Dost thou foresake me?',
                        'Makest thine own dinner.']:
    test_inputs = t5_tokenizer([prefix + test_input_text], return_tensors='pt')
    test_output_ids = t5_model_saved.generate(test_inputs['input_ids'])

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True,
                               clean_up_tokenization_spaces=False) for out_ids in test_output_ids])

['Thus forth thou shalt not vex me again.']
['Do you foresee me?']
['Make your own dinner.']
