# Overview

In this lab, you will fine-tune Google's FLAN-T5 model on a custom dataset using Vertex AI Workbench with one or more attached GPUs.


# Goals

The goal of this lab is for you to learn the end-to-end workflow for fine-tuning Google's FLAN-T5 model through hands-on exercises.

# Google FLAN-T5

Flan-T5 is a large language model (LLM) developed by Google AI. It is a fine-tuned version of T5, the Text-to-Text Transfer Transformer. Flan-T5 is trained on a mixture of tasks rather than a single task, which allows it to learn a more general-purpose representation of language. Flan-T5 includes the same improvements as T5 version 1.1, as well as the following new features:

- Instruction finetuning: Flan-T5 is finetuned using a mixture of tasks, rather than a single task. This allows the model to learn a more general-purpose representation of language.
- Mixed-precision training: Flan-T5 is trained using mixed-precision training, which performs much of the computation in a lower-precision number format. This reduces memory use and can shorten training times.
- Data augmentation: Flan-T5 uses data augmentation to artificially increase the size of the training dataset. This helps the model to learn more robust representations of language.

<img src="https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67" alt="Drawing" style="width: 1080px;"/>


It was released in the paper [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf). Flan-T5 is a powerful language model that can be used for a variety of tasks, such as text generation, translation, summarization, and question answering.

## Bias, risks, and limitations
The information in this section is quoted from the model's official paper:

> Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.

## Ethical considerations and risks

Flan-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.

## Known limitations

Flan-T5 has not been tested in real world applications.

## Sensitive use

Flan-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.

# Environment setup

## Check for attached GPU

Fine-tuning models is a computationally intensive task. You will need a good GPU to support the workload. To check the attached GPU of this notebook instance, please run the following code:

In [None]:
! nvidia-smi -L

GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-b0c857a0-2e9a-5d49-8dd0-457178afa2fb)


It should show something like "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-XXX-XX)"
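
If you want the notebook to detect a missing GPU programmatically, the same check can be scripted. The sketch below is a minimal example that assumes `nvidia-smi` is available on the `PATH` of a GPU instance; the helper name `list_gpus` is hypothetical and is not part of the lab.

```python
import shutil
import subprocess


def list_gpus():
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if unavailable."""
    # Hypothetical helper: nvidia-smi is absent on machines without NVIDIA drivers.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(f"Found {len(gpus)} GPU(s)")
for gpu in gpus:
    print(gpu)
```

An empty list means that no GPU was detected, in which case fine-tuning would fall back to the CPU and run far more slowly.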

## Install required packages

To run this notebook successfully, you need to install the required packages. You can do this by executing the cell below.

In [None]:
! pip install datasets py7zr transformers[torch] rouge-score nltk evaluate --upgrade

⚠️⚠️⚠️⚠️⚠️ **In order to reflect the changes, you will need to restart the runtime after the installation.** ⚠️⚠️⚠️⚠️⚠️

# Dataset

## Download dataset

Here you will download the [samsum](https://huggingface.co/datasets/samsum) dataset. It is a collection of about 16,000 messenger-like conversations with summaries. The conversations were created by linguists fluent in English.

In [None]:
from datasets import load_dataset

# Define dataset ID to download
DATASET_ID = "samsum"

# Load the defined dataset
dataset = load_dataset(DATASET_ID)


Here you will analyze the data in a fair bit of detail. You will check the following:

- The column names
- The column data types
- The number of samples
- A few samples of the data

In [None]:
# Find out column names
dataset.column_names

{'train': ['id', 'dialogue', 'summary'],
 'test': ['id', 'dialogue', 'summary'],
 'validation': ['id', 'dialogue', 'summary']}

In [None]:
# Print out the value types of columns
dataset['train'].features

{'id': Value(dtype='string', id=None),
 'dialogue': Value(dtype='string', id=None),
 'summary': Value(dtype='string', id=None)}

In [None]:
# Check for number of samples
for key, value in dataset.shape.items():
    print(f"{key} dataset: {value[0]} rows")

train dataset: 14732 rows
test dataset: 819 rows
validation dataset: 818 rows


In [None]:
import random

# Print out a few samples from the train dataset
for i in random.sample(range(len(dataset["train"])), 3):
    sample = dataset['train'][i]
    print(f"DIALOGUE:\n{sample['dialogue']}\n")
    print(f"SUMMARY:\n{sample['summary']}")
    print("-" * 80)

DIALOGUE:
Isla: Why didn’t you tell me you were dating Mike?
Linda: I thought you wouldn’t like it
Isla: I don’t care what he does
Isla: Or who he dates
Isla: This chapter is over for me
Isla: All I can tell you is good luck
Isla: Maybe it will work out for you

SUMMARY:
Linda didn't tell Isla about dating Mike. Isla is ok with it and hopes it will work out for them.
--------------------------------------------------------------------------------
DIALOGUE:
Lionel: What's your name?
Simona: You see my channel right?
Lionel: Yeah, what's your name?
Simona: Like, the one in the channel, dude. 

SUMMARY:
Simona's channels name is her real name too.
--------------------------------------------------------------------------------
DIALOGUE:
Caleb: Eva put channel 5 on
Eva: why???
Caleb: they're playing Broadchurch :D
Eva: whoa, thanks :D

SUMMARY:
Channel 5 is airing Broadchurch.
--------------------------------------------------------------------------------


## Prepare dataset

FLAN-T5 requires token IDs as input. In this step, you will use the **AutoTokenizer** from the **transformers** library to tokenize the data and convert it to token IDs.

To do this, pass the correct model ID to **AutoTokenizer** so that it initializes the tokenizer that matches the model. Using a mismatched tokenizer may fail outright or produce token IDs that the model interprets incorrectly.

If you know the required tokenizer for the FLAN-T5 model, you can also directly import it. For FLAN-T5, the tokenizer is **T5Tokenizer** or **T5TokenizerFast**.

In [None]:
from transformers import AutoTokenizer

# Define the ID of the model
MODEL_ID = "google/flan-t5-base"

# Load tokenizer of FLAN-t5-base
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)


In [None]:
# Print out tokenized words
print(tokenizer.tokenize(sample['dialogue']))

['▁Cal', 'e', 'b', ':', '▁Eva', '▁put', '▁channel', '▁5', '▁on', '▁Eva', ':', '▁why', '???', '▁Cal', 'e', 'b', ':', '▁they', "'", 're', '▁playing', '▁Broad', 'church', '▁', ':', 'D', '▁Eva', ':', '▁who', 'a', ',', '▁thanks', '▁', ':', 'D']
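
In the output above, the `▁` character marks the start of a new word, so pieces without it continue the previous word. As a simplified illustration (in practice you would call `tokenizer.decode` or `tokenizer.convert_tokens_to_string`), the sketch below rebuilds text from such pieces. The helper name `pieces_to_text` is hypothetical.

```python
def pieces_to_text(pieces):
    """Rebuild text from SentencePiece-style pieces, where "▁" marks a word start."""
    # Concatenate the pieces, then turn each word-start marker into a space.
    return "".join(pieces).replace("▁", " ").strip()


pieces = ["▁Cal", "e", "b", ":", "▁Eva", "▁put", "▁channel", "▁5", "▁on"]
print(pieces_to_text(pieces))  # Caleb: Eva put channel 5 on
```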


In [None]:
# The result after converting to token IDs
tokenizer(sample['dialogue'])

{'input_ids': [3104, 15, 115, 10, 17627, 474, 4245, 305, 30, 17627, 10, 572, 8665, 3104, 15, 115, 10, 79, 31, 60, 1556, 13017, 28854, 3, 10, 308, 17627, 10, 113, 9, 6, 2049, 3, 10, 308, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Abstractive summarization is a text-to-text generation task: the model takes text as input and generates a summary as output. Hence, before fine-tuning the model for summarization, you need to find the maximum token length for both the input and output texts.

Once you know the maximum token length, you can truncate the sequences that are longer than it and pad the ones that are shorter. All sequences then have the same length, so they can be stacked into fixed-size batches for training.
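
To make the truncate-or-pad idea concrete, here is a minimal pure-Python sketch. It is an illustration only: the helper name `pad_or_truncate` is hypothetical, and the real tokenizer additionally takes care of the end-of-sequence token when truncating, which this sketch ignores.

```python
PAD_TOKEN_ID = 0  # T5 tokenizers use 0 as the pad token ID


def pad_or_truncate(token_ids, max_length, pad_id=PAD_TOKEN_ID):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])      # cut sequences that are too long
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)         # 0 when nothing needs padding
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = pad_or_truncate([3104, 15, 115, 1], max_length=6)
long_ids, long_mask = pad_or_truncate([3104, 15, 115, 10, 17627, 474, 1], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions hold real tokens (1) and which hold padding (0).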

In [None]:
from datasets import concatenate_datasets

# Combine the train and test datasets
combined_dataset = concatenate_datasets([dataset["train"], dataset["test"]])

# Tokenize the dialogue column and find the maximum length.
# Note: truncation=True caps each sequence at the tokenizer's maximum length,
# so the reported value cannot exceed that limit.
tokenized_inputs = combined_dataset.map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_input_length = max(len(x) for x in tokenized_inputs["input_ids"])
print(f"Maximum input length: {max_input_length} tokens")

# Tokenize the summary column and find the maximum length
tokenized_targets = combined_dataset.map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_output_length = max(len(x) for x in tokenized_targets["input_ids"])
print(f"Maximum output length: {max_output_length} tokens")

Maximum input length: 512 tokens

Maximum output length: 95 tokens


In [None]:
def preprocess_func(sample, padding=True):
    # add prefix to convert the dialogue to prompt
    prompts = [f"summarize: {item}" for item in sample["dialogue"]]

    # Tokenize the prompts as inputs
    model_inputs = tokenizer(prompts, max_length=max_input_length, padding=padding, truncation=True)

    # Tokenize the summaries as labels, `text_target` keyword argument is used to tokenize targets
    labels = tokenizer(text_target=sample["summary"], max_length=max_output_length, padding=padding, truncation=True)

    # If padding is applied here, replace all tokenizer.pad_token_id in the labels with -100.
    # Positions labeled -100 are ignored when the training loss is computed
    if padding:
        labels["input_ids"] = [
            [(label if label != tokenizer.pad_token_id else -100) for label in label_list] for label_list in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Create a tokenized dataset
tokenized_dataset = dataset.map(preprocess_func, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")


Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
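
The reason for replacing pad tokens with -100 in the labels is that PyTorch's cross-entropy loss skips targets equal to its default `ignore_index` of -100. The sketch below imitates that behavior in plain Python under simplified assumptions: the per-token loss values are made up for illustration, and the helper name `masked_mean_loss` is hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def masked_mean_loss(token_losses, labels, ignore_index=IGNORE_INDEX):
    """Average per-token losses, skipping positions labeled ignore_index."""
    kept = [loss for loss, label in zip(token_losses, labels) if label != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0


labels = [8649, 1, -100, -100]        # two real tokens followed by two padded positions
token_losses = [2.0, 1.0, 9.0, 9.0]   # illustrative values only
print(masked_mean_loss(token_losses, labels))  # 1.5
```

Without the mask, the two padded positions would inflate the average and the model would be penalized for what it predicts at positions that carry no information.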


# Model

## Load FLAN-T5 model

Here you will load the model using the `MODEL_ID` defined earlier. As T5 is a Seq2Seq model, you will use **AutoModelForSeq2SeqLM** to load the correct model.

In [None]:
from transformers import AutoModelForSeq2SeqLM

# Load the model using pre-defined model ID
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)


## Define evaluation metrics

Although a domain expert can judge the model's responses by hand, measurable metrics make it easier to compare models objectively. In this step, you will define the functions that compute the ROUGE metrics for the model's predictions.
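
To give an intuition for what ROUGE measures, the sketch below computes a simplified ROUGE-1 F1 score, which is the unigram overlap between a prediction and a reference. It is not the implementation used by the `evaluate` library: it assumes whitespace tokenization and lowercasing only, with no stemming, and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("channel 5 is showing broadchurch", "channel 5 is airing broadchurch")
print(round(score, 2))  # 0.8
```

ROUGE-2 applies the same idea to pairs of adjacent words, and ROUGE-L is based on the longest common subsequence instead of fixed-size n-grams.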

In [None]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Use rouge metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_outputs(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(preds_and_labels):
    preds, labels = preds_and_labels
    if isinstance(preds, tuple):
        preds = preds[0]

    # Batch decode the predictions using tokenizer
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 with the pad token ID so the labels can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Post-process the outputs, both labels and predictions
    decoded_preds, decoded_labels = postprocess_outputs(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



# Fine-tuning

## Prepare for fine-tuning

As mentioned earlier, if the samples in the training or validation dataset are shorter than the maximum input length, padding is required to ensure that the samples can be processed in batches. In this step, you will use the **DataCollatorForSeq2Seq** class to create a data collator that pads the samples.

In [None]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100

# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
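
To illustrate what the collator does with `pad_to_multiple_of=8`, the sketch below pads a batch of label sequences to the longest length in the batch, rounded up to the next multiple of 8. It is a plain-Python approximation rather than the collator's actual code, and the helper name `pad_batch` is hypothetical. Padding to a multiple of 8 is commonly recommended so that tensor shapes suit NVIDIA Tensor Cores.

```python
def pad_batch(sequences, pad_value, multiple_of=8):
    """Pad every sequence to the batch's longest length, rounded up to multiple_of."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple_of) * multiple_of  # ceiling to the next multiple
    return [list(seq) + [pad_value] * (target - len(seq)) for seq in sequences]


label_batch = [[8649, 305, 1], [6206, 26, 9, 31, 1]]
padded = pad_batch(label_batch, pad_value=-100)
print([len(row) for row in padded])  # [8, 8]
```

Because the padding length is decided per batch, short batches stay short instead of always being padded to the dataset-wide maximum.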

## Fine-tune the model

Now, you have all the necessary components to begin the fine-tuning process. All that remains is to define the training arguments and initialize the trainer object. You will do both in the code cell below.

In [None]:
import os
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

LOGGING_DIR = os.path.join(MODEL_ID.split("/")[-1], "logs")
OUTPUT_DIR = os.path.join(MODEL_ID.split("/")[-1], "output")

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    # model training parameters
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    predict_with_generate=True,
    fp16=False,
    learning_rate=5e-5,
    num_train_epochs=3,  # the recorded run below completed 3 epochs

    # output & logging
    output_dir=OUTPUT_DIR,
    logging_dir=LOGGING_DIR,
    logging_strategy="steps",
    logging_steps=500,

    # evaluation strategies
    # (recent transformers releases rename this argument to `eval_strategy`)
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
)

# Create trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
)


Here you will fine-tune the model by training it on the new dataset. This will take a while.

In [None]:
# Start the training
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.464 | 1.39833 | 47.9942 | 24.7961 | 40.5846 | 44.6705 | 17.506112 |
| 2 | 1.4026 | 1.385565 | 48.0781 | 24.8065 | 40.4669 | 44.5701 | 17.464548 |
| 3 | 1.3618 | 1.387767 | 48.199 | 25.2246 | 40.8419 | 44.8597 | 17.405868 |


TrainOutput(global_step=1842, training_loss=1.3994161997245267, metrics={'train_runtime': 1638.5979, 'train_samples_per_second': 26.972, 'train_steps_per_second': 1.124, 'total_flos': 3.026353594879181e+16, 'train_loss': 1.3994161997245267, 'epoch': 3.0})
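
As a sanity check, the `global_step` value in this output can be reproduced from the dataset size and batch size. The arithmetic below assumes a single GPU and no gradient accumulation, so that one optimizer step consumes one batch of 24 samples.

```python
import math

train_rows = 14732      # size of the train split reported earlier
batch_size = 24         # per_device_train_batch_size
epochs_completed = 3    # from 'epoch': 3.0 in the TrainOutput

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs_completed
print(steps_per_epoch, total_steps)  # 614 1842
```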

Here you will evaluate the fine-tuned model.

In [None]:
trainer.evaluate()

{'eval_loss': 1.3855650424957275,
 'eval_rouge1': 48.0781,
 'eval_rouge2': 24.8065,
 'eval_rougeL': 40.4669,
 'eval_rougeLsum': 44.5701,
 'eval_gen_len': 17.464547677261614,
 'eval_runtime': 30.2485,
 'eval_samples_per_second': 27.043,
 'eval_steps_per_second': 1.157,
 'epoch': 3.0}
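
Note that these numbers match epoch 2 of the training log rather than the final epoch. With `load_best_model_at_end=True` and no `metric_for_best_model` specified, the trainer selects the checkpoint with the lowest validation loss. The short sketch below reproduces that selection from the logged values.

```python
# Validation loss per epoch, copied from the training log above.
validation_loss = {1: 1.39833, 2: 1.385565, 3: 1.387767}

# Pick the epoch whose validation loss is smallest.
best_epoch = min(validation_loss, key=validation_loss.get)
print(best_epoch, validation_loss[best_epoch])  # 2 1.385565
```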

## Test the fine-tuned Model

The moment of truth. Here you will test the fine-tuned model with a sample from the test split.

In [None]:
from transformers import pipeline

# Create an inference pipeline with the fine-tuned model and tokenizer
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, device=0)

# Select a fixed test sample so the result is reproducible
sample = dataset['test'][10]
print(f"DIALOGUE:\n{sample['dialogue']}\n")
print(f"GIVEN SUMMARY:\n{sample['summary']}\n")

# Summarize the dialogue using the fine-tuned model,
# capping the summary length at the dialogue's token count
max_length = len(tokenizer(sample['dialogue'])['input_ids'])
response = summarizer(sample["dialogue"], max_length=max_length)

print(f"flan-t5-base-tuned SUMMARY:\n{response[0]['summary_text']}")

DIALOGUE:
Wanda: Let's make a party!
Gina: Why?
Wanda: beacuse. I want some fun!
Gina: ok, what do u need?
Wanda: 1st I need too make a list
Gina: noted and then?
Wanda: well, could u take yours father car and go do groceries with me?
Gina: don't know if he'll agree
Wanda: I know, but u can ask :)
Gina: I'll try but theres no promisess
Wanda: I know, u r the best!
Gina: When u wanna go
Wanda: Friday?
Gina: ok, I'll ask

GIVEN SUMMARY:
Wanda wants to throw a party. She asks Gina to borrow her father's car and go do groceries together. They set the date for Friday. 

flan-t5-base-tuned SUMMARY:
Wanda and Gina are going to make a party on Friday. Gina will take her father's car and go grocery shopping with her.


## Save the fine-tuned model & its tokenizer

When you are satisfied with the model's performance, run the cell below to save the model and its tokenizer to local disk. You can also upload the model and tokenizer to deploy them on the cloud.

In [None]:
SAVE_DIR = "fine-tuned-flan-t5"
MODEL_DIR = os.path.join(SAVE_DIR, "model")
TOKENIZER_DIR = os.path.join(SAVE_DIR, "tokenizer")

trainer.save_model(MODEL_DIR)
tokenizer.save_pretrained(TOKENIZER_DIR)

('fine-tuned-flan-t5/tokenizer/tokenizer_config.json',
 'fine-tuned-flan-t5/tokenizer/special_tokens_map.json',
 'fine-tuned-flan-t5/tokenizer/tokenizer.json')

## Load the saved model & its tokenizer

In the next cell, you will reload the model and tokenizer from local disk and perform inference again.

In [None]:
local_model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR)
local_tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)

In [None]:
# Create an inference pipeline with locally saved model and tokenizer
summarizer = pipeline("summarization", model=local_model, tokenizer=local_tokenizer, device=0)

# Select the same test sample as before
random_sample = dataset['test'][10]
print(f"DIALOGUE:\n{random_sample['dialogue']}\n")
print(f"GIVEN SUMMARY:\n{random_sample['summary']}\n")

# Summarize the dialogue using the reloaded model and tokenizer
max_length = len(local_tokenizer(random_sample['dialogue'])['input_ids'])
response = summarizer(random_sample["dialogue"], max_length=max_length)

print(f"flan-t5-base-tuned SUMMARY:\n{response[0]['summary_text']}")

DIALOGUE:
Wanda: Let's make a party!
Gina: Why?
Wanda: beacuse. I want some fun!
Gina: ok, what do u need?
Wanda: 1st I need too make a list
Gina: noted and then?
Wanda: well, could u take yours father car and go do groceries with me?
Gina: don't know if he'll agree
Wanda: I know, but u can ask :)
Gina: I'll try but theres no promisess
Wanda: I know, u r the best!
Gina: When u wanna go
Wanda: Friday?
Gina: ok, I'll ask

GIVEN SUMMARY:
Wanda wants to throw a party. She asks Gina to borrow her father's car and go do groceries together. They set the date for Friday. 

flan-t5-base-tuned SUMMARY:
Wanda and Gina are going to make a party on Friday. Gina will take her father's car and go grocery shopping with her.


# Conclusion

This is the end of the lab. In this lab, you have worked through:
- A summary of the FLAN-T5 model and its bias, risks, and limitations
- Checking for a GPU attached to the instance
- Setting up the environment for the lab
- Downloading and preparing the dataset for the fine-tuning task
- Loading the model and defining evaluation metrics for the fine-tuning task
- Fine-tuning and evaluating the model
- Testing, saving, and loading the model and its tokenizer

Now you are ready to fine-tune similar LLMs with your own dataset. Whenever you create a new AI model, please try to follow the [Responsible AI Practices](https://ai.google/responsibility/responsible-ai-practices/). Also, have a look at the [Google AI Principles](https://ai.google/responsibility/principles/) if you are in need of guidelines.

If you want to learn more about T5 fine-tuning, this blog from Google Research is a good read.