# Fine-tuning a T5-small model for summarization

This notebook covers the fine-tuning, quantization and evaluation parts of the practical use case.

I chose to work on summarization as it is a task we all need from time to time, or even daily. Moreover, several datasets are available on HuggingFace. I will be using the [XSum](https://arxiv.org/pdf/1808.08745) dataset available on [HuggingFace](https://huggingface.co/datasets/EdinburghNLP/xsum), which contains BBC articles paired with single-sentence summaries.

Now, let's talk about the chosen model. T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks. It works well on a variety of tasks out of the box by prepending a task-specific prefix to the input, e.g., for translation or summarization. In my case, the prefix for the summarization task is `summarize: `. T5 uses relative scalar embeddings and comes in different sizes: small, base, large, 3b and 11b.
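
As a small illustration of the prefix mechanism, the sketch below builds T5-style inputs for two tasks. The `TASK_PREFIXES` table and the `build_input` helper are hypothetical names introduced here for illustration; the translation prefix is the one listed in the public `t5-small` configuration.

```python
# Hypothetical helper: map a task name to the text prefix T5 expects
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_to_de": "translate English to German: ",
}

def build_input(task, text):
    """Prepend the task prefix so the model knows which task to perform."""
    return TASK_PREFIXES[task] + text

summarization_input = build_input("summarization", "The storm caused flooding across the region.")
translation_input = build_input("translation_en_to_de", "The house is small.")
print(summarization_input)
print(translation_input)
```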

To fine-tune the model, I use LoRA (Low-Rank Adaptation), a method from the PEFT (Parameter-Efficient Fine-Tuning) family. It speeds up the fine-tuning of large models while consuming less memory, which is valuable in my Google Colab environment! The idea is to freeze the original pre-trained weights and, next to each targeted weight matrix, add an update expressed as the product of two much smaller low-rank matrices. Only these small matrices are trained on the new data, which greatly reduces the number of trainable parameters. At the end, the learned update can be merged with the original weights.
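
To make the idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. The dimensions, the initialisation and the `lora_forward` name are illustrative assumptions; this is not the PEFT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 512, 512   # illustrative size of one projection matrix
r, alpha = 4, 32         # LoRA rank and scaling factor (illustrative values)

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: the update starts at 0

def lora_forward(x):
    """Frozen path plus the scaled low-rank update."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
y_initial = lora_forward(x)

full_params = W.size           # parameters of the frozen matrix
lora_params = A.size + B.size  # parameters actually trained
print(full_params, lora_params)

# Pretend training has updated B, then merge the update into a single matrix
B = rng.normal(scale=0.01, size=(d_out, r))
W_merged = W + (alpha / r) * (B @ A)
print(np.allclose(x @ W_merged.T, lora_forward(x)))
```

With these sizes, the two low-rank matrices hold 4,096 values against 262,144 for the full matrix, which is where the memory saving comes from.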

To evaluate the fine-tuned model and compare it with the original one, I chose the ROUGE metric. [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge), or Recall-Oriented Understudy for Gisting Evaluation, is used for evaluating automatic summarization and machine translation software in Natural Language Processing (NLP). In my case, it will compare an automatically produced summary with one or more human-written reference summaries. ROUGE is case-insensitive, and the HuggingFace implementation reports four scores: `rouge1` and `rouge2` (unigram and bigram overlap), `rougeL` (longest common subsequence) and `rougeLsum` (longest common subsequence computed over newline-separated sentences).
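
To give an intuition of what these scores measure, here is a simplified pure-Python sketch of ROUGE-N and ROUGE-L F1 scores. It assumes plain whitespace tokenization and no stemming, so its numbers will not match the `rouge` metric used later exactly; the function names are mine.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall, 0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(prediction, reference, n=1):
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(pred.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction, reference):
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))

prediction = "The cat sat on the mat"
reference = "the cat lay on the mat"
print(round(rouge_n(prediction, reference, 1), 4))  # 0.8333
print(round(rouge_n(prediction, reference, 2), 4))  # 0.6
print(round(rouge_l(prediction, reference), 4))     # 0.8333
```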

## Requirements and setup

In [1]:
from google.colab import drive
import os

drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/CDI/UseCaseLLM_Valeo/training')

Mounted at /content/drive


In [2]:
!pip install datasets evaluate transformers rouge-score nltk peft

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting peft
  Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)

In [3]:
from os import path
import numpy as np
import nltk
import transformers
from datasets import load_dataset
from evaluate import load, SummarizationEvaluator
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, T5ForConditionalGeneration, pipeline
from peft import LoraConfig, get_peft_model
import torch.quantization

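from packaging import version  # compare versions numerically, not as strings
assert version.parse(transformers.__version__) >= version.parse("4.11.0")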
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## T5-small summarization fine-tuning with LoRA

### Load the data

We will use the HuggingFace [Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation.

In [4]:
import datasets
datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory='.': True

raw_dataset = load_dataset("xsum", trust_remote_code=True)
metric = load("rouge")


Let's see some excerpts of the dataset:

In [5]:
print("XSum dataset summary:")
print(raw_dataset)

print("\nXSum dataset sample:")
raw_dataset["train"][0]

XSum dataset summary:
DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

XSum dataset sample:


 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}

### Pre-process the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a HuggingFace Transformers `Tokenizer`, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs the model requires.

To do all of this, I instantiate a tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- the tokenizer corresponds to the T5-small model architecture;
- the vocabulary used when pretraining this specific checkpoint is downloaded.

In [6]:
model_checkpoint = "t5-small"
prefix = "summarize: "

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


Then, the inputs are fed to the tokenizer with `truncation=True` and a `max_length`, so that any input longer than the limit is cut down to `max_length` tokens while shorter inputs are left unchanged. Padding is dealt with later on (in a data collator), so that examples are padded to the longest length in their batch rather than in the whole dataset.
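
The difference between truncation and per-batch padding can be sketched with plain lists of token IDs. This is a simplified illustration with made-up IDs: the `truncate` and `pad_batch` helpers are hypothetical, and the real tokenizer also takes care of special tokens such as the end-of-sequence token when truncating.

```python
def truncate(token_ids, max_length):
    """Keep at most max_length tokens; shorter sequences are unchanged."""
    return token_ids[:max_length]

def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

raw = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27], [31]]
truncated = [truncate(seq, max_length=5) for seq in raw]
input_ids, attention_mask = pad_batch(truncated)

print(input_ids)       # every row now has the length of the longest truncated row
print(attention_mask)  # 1 for real tokens, 0 for padding
```

Padding per batch means a batch of short articles is not inflated to the length of the longest article in the dataset.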

In [7]:
max_input_length = 1024
max_target_length = 128

def preprocessing(sample, prefix="summarize: "):
    inputs = [prefix + doc for doc in sample["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = raw_dataset.map(preprocessing, batched=True)


### LoRA fine-tuning

Let's download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class.

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

For the LoRA fine-tuning, we need to freeze the parameters of the model.

In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

In [None]:
config = LoraConfig(
    r=4, # dimension of the low-rank matrices
    lora_alpha=32, # scaling factor for the low-rank matrices
    lora_dropout=0.05, # dropout probability of the LoRA layers
    target_modules=["k","q","v","o"],
    bias="none",
    task_type="SEQ_2_SEQ_LM" # PEFT task type for sequence-to-sequence language modeling
)

model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 294,912 || all params: 60,801,536 || trainable%: 0.4850
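
The number of trainable parameters printed above can be recovered by hand. The sketch below assumes the dimensions of the public `t5-small` configuration (6 encoder and 6 decoder layers, a hidden size of 512, 8 heads of size 64); the variable names are mine.

```python
d_model = 512               # hidden size of T5-small
inner_dim = 8 * 64          # num_heads * d_kv, the size of the attention projections
r = 4                       # LoRA rank used in LoraConfig
encoder_layers, decoder_layers = 6, 6

# One self-attention block per encoder layer,
# one self-attention plus one cross-attention block per decoder layer
attention_blocks = encoder_layers * 1 + decoder_layers * 2
adapted_matrices = attention_blocks * 4        # q, k, v and o in each block

# Each adapted matrix gets A (r x in_features) and B (out_features x r)
params_per_matrix = r * d_model + inner_dim * r
trainable = adapted_matrices * params_per_matrix

total = 60_801_536                             # "all params" reported above
print(f"{trainable:,} trainable parameters")
print(f"{100 * trainable / total:.4f} % of all parameters")
```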


We also set the training arguments and create a special kind of data collator, which pads not only the inputs to the maximum length in the batch, but also the labels.

In [None]:
batch_size = 32

training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=2, # equivalent to batch_size = 64
    per_device_eval_batch_size=batch_size,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=100,
    num_train_epochs=1,
    # fp16=True,
    evaluation_strategy="steps",
    eval_steps=0.1,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    predict_with_generate=True,
    save_strategy="steps",
    save_steps=0.1,
    save_total_limit=3,
    output_dir="outputs"
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
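
The training arguments above determine how many optimizer steps one epoch contains and how often evaluation and checkpointing happen. The arithmetic below is a sketch that assumes a single GPU and assumes the `Trainer` rounds as shown (ceiling for the number of batches and for the fractional `eval_steps`, floor division for gradient accumulation). Its results are consistent with the step numbers reported in the training logs (319, 638, ... and `global_step=3188`).

```python
import math

train_examples = 204_045    # size of the XSum train split
per_device_batch = 32
grad_accum = 2
eval_ratio = 0.1            # eval_steps / save_steps given as a fraction of total steps

effective_batch = per_device_batch * grad_accum
batches_per_epoch = math.ceil(train_examples / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum
eval_every = math.ceil(updates_per_epoch * eval_ratio)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {updates_per_epoch}")
print(f"evaluate and save every {eval_every} steps")
```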



We need to define a function to compute the ROUGE metric, which will just use the metric we loaded earlier. We also have to do a bit of pre-processing to decode the predictions into texts.

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    # Extract a few results
    result = {key: value * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
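
The two NumPy tricks used in `compute_metrics` are easier to see on toy data. The IDs below are made up; only the padding ID (0 for T5) and the -100 marker for ignored label positions come from the setup above.

```python
import numpy as np

pad_token_id = 0  # T5 uses 0 as its padding ID

# Toy label IDs: -100 marks the padded positions that the loss ignores
labels = np.array([[71, 388, 1, -100, -100],
                   [37, 1782, 19, 8, 1]])
# -100 is not a valid token ID, so swap it for the pad ID before decoding
decodable = np.where(labels != -100, labels, pad_token_id)

# Toy generated IDs: generation starts with the pad ID and is padded with it
predictions = np.array([[0, 71, 388, 1, 0, 0],
                        [0, 37, 1782, 19, 8, 1]])
gen_len = [int(np.count_nonzero(pred != pad_token_id)) for pred in predictions]

print(decodable.tolist())
print(gen_len, float(np.mean(gen_len)))
```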

Now, we can start the fine-tuning.

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    args=training_args,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [None]:
trainer.train()

| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 319 | No log | 2.986775 | 18.0733 | 2.75 | 14.1757 | 14.3312 | 18.7725 |
| 638 | 3.457900 | 2.794375 | 20.7295 | 3.7632 | 16.0454 | 16.0487 | 18.5356 |

The logs above stop at step 638, well short of the 3,188 steps of a full epoch, so training is resumed from the latest saved checkpoint until the epoch completes.

In [None]:
trainer.train(resume_from_checkpoint=True)



In [None]:
trainer.train(resume_from_checkpoint=True)

| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1914 | 2.9765 | 2.694472 | 24.1718 | 5.0889 | 18.6163 | 18.6197 | 18.6437 |
| 2233 | 2.9611 | 2.688494 | 24.3006 | 5.161 | 18.7483 | 18.7466 | 18.6482 |
| 2552 | 2.9424 | 2.684112 | 24.4401 | 5.2554 | 18.8577 | 18.8551 | 18.6632 |
| 2871 | 2.9424 | 2.681842 | 24.5071 | 5.2821 | 18.9143 | 18.9133 | 18.6684 |


TrainOutput(global_step=3188, training_loss=1.4718977522520977, metrics={'train_runtime': 8694.082, 'train_samples_per_second': 23.469, 'train_steps_per_second': 0.367, 'total_flos': 5.547512787841843e+16, 'train_loss': 1.4718977522520977, 'epoch': 0.9998431864513094})

Let's save the LoRA checkpoint.

In [None]:
model.save_pretrained("outputs/lora-xsum-t5-small")  # saves only the LoRA adapter weights and config


### Quantization

**This part can be run INDEPENDENTLY from the *LoRA fine-tuning* and the *Evaluation* parts**

To quantize the model, we could perform a QLoRA fine-tuning, which combines LoRA fine-tuning (used previously) with quantization. However, this would require a second GPU run, so I did not do it for lack of resources.

Even though T5-small is already fast and does not really need an inference speed-up, we can still try to dynamically quantize the model.

In [37]:
finetuned_ckpt = "/content/drive/MyDrive/CDI/UseCaseLLM_Valeo/training/outputs/lora-xsum-t5-small"
finetuned_t5 = T5ForConditionalGeneration.from_pretrained(finetuned_ckpt)

def count_parameters(model):
  return sum(p.numel() for p in model.parameters())

param_count = count_parameters(finetuned_t5)

memory = (param_count * 4) / (1024 * 1024)  # 1 float32 => 4 bytes
print(f"The model weights use {memory:.2f} MB of memory.")

quantized_t5 = torch.quantization.quantize_dynamic(finetuned_t5, {torch.nn.Linear}, dtype=torch.qint8)
torch.save(quantized_t5.state_dict(), "outputs/quantized_t5.pt")
quantized_memory = path.getsize("./outputs/quantized_t5.pt") / (1024 * 1024)
print(f"The quantized model weights use {quantized_memory:.2f} MB of memory.")

The model weights use 231.94 MB of memory.
The quantized model weights use 121.14 MB of memory.


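We reduced the size by almost half! int8 weights take a quarter of the space of float32 ones, but dynamic quantization only converts the `torch.nn.Linear` layers here, so the rest of the model (notably the embedding table) stays in float32. Note also that the first figure is computed from the parameter count, while the second is the size of the saved file. Now let's see if we have increased the inference speed.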
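
To illustrate where the saving comes from, here is a NumPy sketch of int8 quantization of one weight matrix, followed by the memory arithmetic used above. It uses a simplified symmetric per-tensor scheme with random weights, which is an assumption made for illustration and not PyTorch's `quantize_dynamic` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=(512, 512)).astype(np.float32)

# Map the float range [-max, max] onto the integer range [-127, 127]
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
restored = q.astype(np.float32) * scale  # dequantized values used at compute time

max_err = float(np.abs(weights - restored).max())
print(f"float32: {weights.nbytes // 1024} KiB, int8: {q.nbytes // 1024} KiB")
print(f"largest rounding error: {max_err:.6f} (at most half a quantization step)")

# Same arithmetic as in the cell above: 4 bytes per float32 parameter
param_count = 60_801_536
memory_mb = param_count * 4 / (1024 * 1024)
print(f"{memory_mb:.2f} MB in float32")
```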

In [46]:
%timeit finetuned_t5.generate(torch.tensor([tokenized_dataset['test'][0]['input_ids']], dtype=torch.int))



2.91 s ± 979 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [47]:
%timeit quantized_t5.generate(torch.tensor([tokenized_dataset['test'][0]['input_ids']], dtype=torch.int))

AttributeError: 'function' object has no attribute 'dtype'

We can't evaluate the impact of quantization on inference time, as a bug in the `transformers` library still causes this error (see the PR [here](https://github.com/huggingface/transformers/pull/21843), apparently not integrated in version 4.44.2, the latest one, used here).

This quantized model won't be evaluated because of this bug and the lack of GPU resources. Still, [here](https://huggingface.co/Intel/t5-small-xsum-int8-dynamic-inc) is what we can expect from such a quantization.

### Evaluation

**This part can be run INDEPENDENTLY from the *LoRA fine-tuning* and the *Quantization* parts**

Let's reload the base model and the fine-tuned model to compare them.

In [None]:
base_ckpt = "t5-small"
finetuned_ckpt = "/content/drive/MyDrive/CDI/UseCaseLLM_Valeo/training/outputs/lora-xsum-t5-small"

base_t5 = T5ForConditionalGeneration.from_pretrained(base_ckpt).to('cuda:0')
finetuned_t5 = T5ForConditionalGeneration.from_pretrained(finetuned_ckpt).to('cuda:0')

evaluator = SummarizationEvaluator()

Now, we use the test split to compare the fine-tuned model with the base one. We start by evaluating the base T5-small model, downloaded from HuggingFace.

In [None]:
evaluator.compute(
    model_or_pipeline=base_t5,
    tokenizer=tokenizer,
    data=tokenized_dataset['test'],
    metric=metric,
    input_column='document',
    label_column='summary')



{'rouge1': 0.19024007944134058,
 'rouge2': 0.02759556554189877,
 'rougeL': 0.12973109642401592,
 'rougeLsum': 0.12973427387429495,
 'total_time_in_seconds': 8122.8225630328525,
 'samples_per_second': 1.3953277831749382,
 'latency_in_seconds': 0.7166774804158155}

Then, we evaluate the fine-tuned T5-small model.

In [None]:
evaluator.compute(
    finetuned_t5,
    tokenizer=tokenizer,
    data=tokenized_dataset['test'],
    metric=metric,
    input_column='document',
    label_column='summary')



{'rouge1': 0.24084408398175577,
 'rouge2': 0.051862710863645536,
 'rougeL': 0.17612313934129226,
 'rougeLsum': 0.17612388769238957,
 'total_time_in_seconds': 8200.11055256,
 'samples_per_second': 1.3821764874208469,
 'latency_in_seconds': 0.723496607778366}

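We observe a modest improvement in the scores of the fine-tuned model compared with those of the base T5-small model: ROUGE-1 rises from 19.0 to 24.1, ROUGE-2 from 2.8 to 5.2 and ROUGE-L from 13.0 to 17.6 (on a 0-100 scale). This is disappointing but predictable, as we only fine-tuned the model for 1 epoch.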
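
The gap between the two models can be quantified directly from the two result dictionaries above. The snippet below only re-uses the reported scores; the `base_scores`, `finetuned_scores` and `gains` names are mine.

```python
# Scores copied from the two evaluator outputs above (0-1 scale)
base_scores = {
    "rouge1": 0.19024007944134058,
    "rouge2": 0.02759556554189877,
    "rougeL": 0.12973109642401592,
}
finetuned_scores = {
    "rouge1": 0.24084408398175577,
    "rouge2": 0.051862710863645536,
    "rougeL": 0.17612313934129226,
}

gains = {}
for name, base in base_scores.items():
    tuned = finetuned_scores[name]
    absolute = 100 * (tuned - base)      # gain in points on a 0-100 scale
    relative = 100 * (tuned / base - 1)  # gain relative to the base score
    gains[name] = (absolute, relative)
    print(f"{name}: +{absolute:.2f} points ({relative:+.1f} % relative)")
```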

Let's visualize the predictions of both models on a specific example.

In [None]:
data = raw_dataset['test'][1]
doc = data["document"]
summary = data["summary"]
input_ids = tokenizer(prefix + doc, return_tensors="pt").input_ids.to("cuda:0")

base_output = base_t5.generate(input_ids)
finetuned_output = finetuned_t5.generate(input_ids)

print(f"Original summary:\n{summary}\n")
print(f"Prediction from base T5-small model:\n{tokenizer.decode(base_output[0], skip_special_tokens=True)}\n")
print(f"Prediction from the fine-tuned T5-small model:\n{tokenizer.decode(finetuned_output[0], skip_special_tokens=True)}")

Original summary:
A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.

Prediction from base T5-small model:
police say three firearms, ammunition and a five-figure sum of money were recovered

Prediction from the fine-tuned T5-small model:
Police have recovered three firearms, ammunition and a five-figure sum of money from a man who was arrested and charged with aggravated burglary.
