# Fine-Tune an LLM for a Summarization Task: SAMSum Dataset

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, which provides a high-quality instruction-tuned model and can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Set up the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [1]:
# %pip install --upgrade pip
# %pip install --disable-pip-version-check torch==1.13.1 torchdata==0.5.1 --quiet
# %pip install transformers==4.27.2 datasets==2.11.0 py7zr evaluate==0.4.0 rouge_score==0.1.2 loralib==0.1.1 peft==0.3.0 --quiet



Import the necessary components. Some of them are new this week and will be discussed later in the notebook.

In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

2023-09-21 07:11:10.691580: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

The [SAMSum](https://huggingface.co/datasets/samsum) dataset contains about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English. Linguists were asked to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger conversations. The style and register are diversified: conversations could be informal, semi-formal or formal, and they may contain slang words, emoticons and typos. Then, the conversations were annotated with summaries. It was assumed that summaries should be a concise brief of what people talked about in the conversation in third person. The SAMSum dataset was prepared by Samsung R&D Institute Poland and is distributed for research purposes (non-commercial licence: CC BY-NC-ND 4.0).

In [3]:
huggingface_dataset_name = "samsum"

dataset = load_dataset(huggingface_dataset_name)

dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type in which the model weights are loaded, which determines their memory footprint.

In [4]:
model_name='google/flan-t5-base'
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

cuda



It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function does that; at this stage, you do not need to go into its details.

In [5]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%
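
As a rough back-of-the-envelope check on what `torch_dtype=torch.bfloat16` changes, the parameter count printed above can be turned into an approximate weight-memory figure. This is only a sketch: it counts the weights alone (no activations, gradients, or optimizer state), and the helper name `weight_memory_bytes` is illustrative, not part of any library.

```python
# Parameter count reported by print_number_of_trainable_model_parameters above.
NUM_PARAMS = 247_577_856

# Bytes used to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_memory_bytes(num_params: int, dtype: str) -> int:
    """Approximate memory needed to hold the weights alone (hypothetical helper)."""
    return num_params * BYTES_PER_PARAM[dtype]

for dtype in BYTES_PER_PARAM:
    size_mb = weight_memory_bytes(NUM_PARAMS, dtype) / 1e6
    print(f"{dtype}: {size_mb:.0f} MB")
```

Under these assumptions, loading the weights in bfloat16 takes half the memory that float32 would.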


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates that the model can be fine-tuned to the task at hand.

In [6]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt').to(device)
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-' * 99  # 99 dashes, same result as '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

Abdellilah: Where are you?
Sam: work
Abdellilah: What time you finish?
Sam: Not til 5
Abdellilah: Are your bringing him over tonight:
Sam: No in the morning:
Abdellilah: ok, what time?
Sam: About 9. Is that ok?
Abdellilah: ok - see you then

Summary:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Sam won't finish work till 5. Sam is bringing him over about 9 am. Sam will see Abdellilah in the morning. 

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
Sam finishes work at 5 and will bring Abdellilah to work at about 9.


<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dialog-Summary Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with `Summarize the following conversation` and to the start of the summary with `Summary` as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary: 
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```
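
The template above can be sketched as a small string-building function. This is a minimal illustration using the same instruction strings the notebook's `tokenize_function` uses; the helper name `build_prompt` and the constant names are hypothetical and do not appear in the notebook.

```python
# Instruction text placed before and after each dialogue.
START_PROMPT = "Summarize the following conversation.\n\n"
END_PROMPT = "\n\nSummary: "

def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the instruction template (hypothetical helper)."""
    return START_PROMPT + dialogue + END_PROMPT

example_dialogue = (
    "Chris: This is his part of the conversation.\n"
    "Antje: This is her part of the conversation."
)
print(build_prompt(example_dialogue))
```

Only the dialogue is wrapped this way; the summary is used unchanged as the training response.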

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [7]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# The dataset contains three different splits: train, validation, and test.
# With batched=True, tokenize_function processes every split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'dialogue', 'summary',])


No subsampling is applied here, so the full dataset is used. Check the shapes of all three parts of the dataset:

In [8]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (14732, 2)
Validation: (818, 2)
Test: (819, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 818
    })
})


The output dataset is ready for fine-tuning.
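
Every row ends up with fixed-length `input_ids` and `labels` because the tokenizer was called with `padding="max_length"` and `truncation=True`. The effect can be sketched on plain lists of integers. This is a simplification under stated assumptions: `pad_or_truncate` is a hypothetical helper, the pad id of `0` and the toy length of 5 are chosen for illustration, and the real tokenizer also handles special tokens such as the end-of-sequence marker.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return a list of exactly max_length ids (hypothetical helper)."""
    clipped = token_ids[:max_length]                   # mimics truncation=True
    padding = [pad_id] * (max_length - len(clipped))   # mimics padding="max_length"
    return clipped + padding

short_ids = [101, 102, 103]
long_ids = list(range(1, 11))

print(pad_or_truncate(short_ids, max_length=5))   # short input is padded
print(pad_or_truncate(long_ids, max_length=5))    # long input is cut off
```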

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now use the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset together with a reference to the original model. Note that `Trainer` updates the model object it is given in place. The other training parameters were found experimentally, and there is no need to go into details about them at the moment.

In [9]:
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    weight_decay=0.01,
    num_train_epochs=2,
    logging_steps=500,
    eval_steps=500,
    save_steps=500,
    evaluation_strategy="steps",
    load_best_model_at_end=True
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

Start the training process. In the run recorded below, it took about 36 minutes (`train_runtime` of roughly 2176 seconds).

In [10]:
trainer.train()



Step,Training Loss,Validation Loss
500,39.117,38.819073
1000,33.1493,36.078239
1500,31.993,35.715771
2000,31.937,35.691319
2500,31.8903,35.690708
3000,31.888,35.683372
3500,31.8775,35.685818


TrainOutput(global_step=3684, training_loss=33.06504478827362, metrics={'train_runtime': 2176.0261, 'train_samples_per_second': 13.54, 'train_steps_per_second': 1.693, 'total_flos': 2.017569063252787e+16, 'train_loss': 33.06504478827362, 'epoch': 2.0})
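
The `global_step` value in the output above can be reproduced from the dataset size and the training arguments. The following is a small worked calculation, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook); the variable names are illustrative.

```python
import math

train_examples = 14732   # rows in tokenized_datasets['train']
batch_size = 8           # per_device_train_batch_size
num_epochs = 2           # num_train_epochs
save_steps = 500         # save_steps

# One optimizer step per batch; the last, partial batch still counts as a step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Checkpoints are written every save_steps steps, so the last one is the
# largest multiple of save_steps that does not exceed total_steps.
last_checkpoint = (total_steps // save_steps) * save_steps

print(steps_per_epoch, total_steps, last_checkpoint)
```

Under these assumptions the total matches the reported `global_step`, and the last checkpoint step matches the `checkpoint-3500` directory.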

The training run above saved checkpoints to the output directory. To use the fully fine-tuned model in the rest of this notebook, load the last saved checkpoint. This fully fine-tuned model will also be referred to as the **instruct model** in this lab.

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [11]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./dialogue-summary-training-1695280355/checkpoint-3500", torch_dtype=torch.bfloat16)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

You can see how the fine-tuned model is able to create a reasonable summary of the dialogue, compared with the output of the original model.

In [12]:
index = 20
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""
print(device)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
# Move both models to the selected device (works on CPU-only machines too).
original_model.to(device)
instruct_model.to(device)

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (603 > 512). Running this sequence through the model will result in indexing errors


cuda
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Beth wants to organize a girls weekend to celebrate her mother's 40th birthday. She also wants to work at Deidre's beauty salon. Deidre offers her a few hours on Saturdays as work experience. They set up for a meeting tomorrow.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Beth wants to do a girls weekend with Kira and the girls. Deirdre wants to ask Beth about her mum's 40th birthday. Beth wants to try out the beauty therapist. Deirdre wants to meet Beth tomorrow after school.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Beth's mother's 40th birthday is in 6 weeks. Deirdre is looking for Saturday girls. Beth wants to try a bit of work experience in the salon. Deirdre is looking for Saturday girls. Beth is 16 and wants t

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
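
To make the idea concrete, here is a deliberately simplified sketch of ROUGE-1, which scores unigram overlap between a candidate summary and a reference. It assumes lowercased whitespace tokenization and no stemming, so its numbers will not match the `evaluate` implementation used below; `rouge1_f1` is a hypothetical helper written for illustration only.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "Emma will be home soon and she will let Will know."
candidate = "Emma will be home soon."
print(f"ROUGE-1 F1: {rouge1_f1(reference, candidate):.3f}")
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of fixed-size n-grams.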

In [13]:
rouge = evaluate.load('rouge')


Generate the outputs for the full test dataset (819 dialogues and summaries), and save the results.

In [14]:
dialogues = dataset['test']['dialogue']

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

In [16]:
human_baseline_summaries = dataset['test']['summary']
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
pd.set_option('display.max_colwidth', 1000) 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df.head(50)

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry. | Hannah wants to know Betty's number. Amanda can't find it. Amanda asks Larry Larry if Betty's phone number is the same. | Amanda can't find Betty's number. Amanda will ask Larry. |
| 1 | Eric and Rob are going to watch a stand-up on youtube. | Eric and Rob are watching a train video. | Eric and Rob are watching a stand-up. Eric and Rob will watch some of his stand-ups on youtube. |
| 2 | Lenny can't decide which trousers to buy. Bob advised Lenny on that topic. Lenny goes with Bob's advice to pick the trousers that are of best quality. | Lenny will buy the third pair of trousers from Bob. | Lenny wants to buy two pairs of black trousers. Bob recommends the first pair. |
| 3 | Emma will be home soon and she will let Will know. | Emma will be home soon. Will will pick her up. | Emma will be home soon. Will will pick her up. |
| 4 | Jane is in Warsaw. Ollie and Jane has a party. Jane lost her calendar. They will get a lunch this week on Friday. Ollie accidentally called Jane and talked about whisky. Jane cancels lunch. They'll meet for a tea at 6 pm. | Jane is in Warsaw. Ollie reminds Jane that they have a party on Friday. Jane lost her calendar. Ollie reminds Jane that she has lost it. Jane is in Morocco. Jane is busy. Ollie is going to bring some sun with Jane. Jane is busy. Jane is on her way. Jane is going to bring some tea. Ollie is on her way. Jane is on her way. Ollie will bring some pastries. Jane is on her way. Jane is on her way. Jane is on her way. | Jane lost her calendar. Ollie and Jane have lunch on Friday. Jane will be in Morocco at 6 pm. Ollie will bring pastries. |
| 5 | Hilary has the keys to the apartment. Benjamin wants to get them and go take a nap. Hilary is having lunch with some French people at La Cantina. Hilary is meeting them at the entrance to the conference hall at 2 pm. Benjamin and Elliot might join them. They're meeting for the drinks in the evening. | Hilary, Elliot and Benjamin are meeting for drinks in the evening. Hilary and Hilary are meeting at the entrance to the conference hall. Hilary will meet with the French people at La Cantina. Hilary and Benjamin will have lunch together. Hilary and Hilary will meet for a nap. | Hilary and Elliot are meeting at the conference hall at 2 pm. Hilary will meet them at the entrance to the conference hall at 2 pm. Hilary will meet them at La Cantina. Hilary will take the keys and take a nap. |
| 6 | Payton provides Max with websites selling clothes. Payton likes browsing and trying on the clothes but not necessarily buying them. Payton usually buys clothes and books as he loves reading. | Payton usually buys clothes from 2 or 3 of them. Max will check them out. | Payton likes to browse and try on clothes. Max will check out the websites. |
| 7 | Rita and Tina are bored at work and have still 4 hours left. | Rita is tired and bored at work. | Rita is tired and is not able to concentrate at work. |
| 8 | Beatrice wants to buy Leo a scarf, but he doesn't like scarves. She cares about his health and will buy him a scarf no matter his opinion. | Beatrice is in town, and she's looking for a scarf for Leo. | Beatrice is in town, shopping. She has a scarf in the shop next to the church. Leo doesn't like them. |
| 9 | Eric doesn't know if his parents let him go to Ivan's brother's wedding. Ivan will talk to them. | Eric is coming to Eric's brother's wedding. Ivan will take care of his parents. | Eric is coming to Ivan's brother's wedding. Eric has a lot of work at home. Eric will talk to Ivan's parents. |


In [17]:
df.shape

(819, 3)

Evaluate the models by computing ROUGE metrics. Notice the improvement in the results!

In [18]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print(len(instruct_model_summaries))
print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

819
ORIGINAL MODEL:
{'rouge1': 0.4193288022161734, 'rouge2': 0.17492223858519834, 'rougeL': 0.344462213698505, 'rougeLsum': 0.34458873490984254}
INSTRUCT MODEL:
{'rouge1': 0.4809178705029691, 'rouge2': 0.2372639007386037, 'rougeL': 0.40366507380512123, 'rougeLsum': 0.40344438847621056}


The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results, which you can use to evaluate on a larger section of data. In this run, the full test set (819 examples) was already evaluated above.

The results show substantial improvement in all ROUGE metrics:

In [19]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 6.16%
rouge2: 6.23%
rougeL: 5.92%
rougeLsum: 5.89%


<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
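
The relationship between the frozen weights and the adapter can be sketched with toy matrices. This is a minimal illustration under stated assumptions, not the PEFT library's implementation: it uses plain Python lists, a rank of 1, made-up values, and the `alpha / r` scaling commonly used for LoRA. The helper functions `matmul`, `add`, and `scale` are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(A, s):
    """Multiply every element of a matrix by the scalar s."""
    return [[a * s for a in row] for row in A]

# Frozen base weight W (d_out x d_in); toy values.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]

# LoRA adapter with rank r=1: B is (d_out x r), A is (r x d_in).
r, alpha = 1, 2.0
B = [[0.5], [-0.5]]
A = [[1.0, 2.0, 3.0]]

x = [[1.0], [1.0], [1.0]]  # input column vector (d_in x 1)

# Separate path: base output plus the scaled low-rank update.
h_separate = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Merged path: fold the adapter into the base weight once, then do one matmul.
W_merged = add(W, scale(matmul(B, A), alpha / r))
h_merged = matmul(W_merged, x)

print(h_separate, h_merged)
```

Both paths give the same output, which is why one base model can be kept in memory and paired with different small adapters for different tasks.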

<a name='3.1'></a>
### 3.1 - Set up the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [20]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [21]:
original_model_pre = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = get_peft_model(original_model_pre, 
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
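
The trainable-parameter count above can be reproduced by hand from the LoRA configuration. The calculation below is a sketch that assumes the usual `flan-t5-base` architecture (hidden size 768, query/value projection width 768, 12 encoder blocks and 12 decoder blocks); those figures come from the model's published configuration rather than from this notebook, and the variable names are illustrative.

```python
d_model = 768        # assumed hidden size of flan-t5-base
inner_dim = 768      # assumed width of the q and v projections (12 heads x 64)
rank = 32            # r in lora_config

encoder_blocks = 12  # assumed; each has self-attention with q and v
decoder_blocks = 12  # assumed; each has self-attention and cross-attention

# target_modules=["q", "v"]: 2 per encoder block, 4 per decoder block.
adapted_modules = encoder_blocks * 2 + decoder_blocks * 4

# Each adapted module gains A (rank x d_model) and B (inner_dim x rank).
params_per_module = rank * d_model + inner_dim * rank
lora_params = adapted_modules * params_per_module

base_params = 247_577_856  # printed earlier for the original model
print(lora_params, base_params + lora_params)
print(f"{100 * lora_params / (base_params + lora_params):.2f}%")
```

Under these assumptions the result agrees with the counts printed above, which shows that the adapter size grows linearly with the rank `r`.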


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define the training arguments and create a `Trainer` instance.

In [22]:
output_dir = f'./peft-samsum-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=3, # Overridden by max_steps below.
    logging_steps=200,
    max_steps=2000 # When positive, takes precedence over num_train_epochs.
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

Now everything is ready to train the PEFT adapter and save the model.

In [23]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



Step,Training Loss
200,2.6132
400,0.1257
600,0.1104
800,0.1076
1000,0.1033
1200,0.1022
1400,0.1006
1600,0.0981
1800,0.0979
2000,0.0987


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [24]:
from peft import PeftModel, PeftConfig

original_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(original_model_base, 
                                       './peft-dialogue-summary-checkpoint-local/', 
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

The number of trainable parameters will be `0` due to the `is_trainable=False` setting:

In [25]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 251116800
percentage of trainable model parameters: 0.00%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in section [1.3](#1.3) (index 200; section [2.3](#2.3) used index 20) with the original model, the fully fine-tuned model, and the PEFT model.

In [26]:
index = 200
dialogue = dataset['test'][index]['dialogue']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
# Move all three models to the selected device (works on CPU-only machines too).
original_model.to(device)
instruct_model.to(device)
peft_model.to(device)
original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{baseline_human_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Sam won't finish work till 5. Sam is bringing him over about 9 am. Sam will see Abdellilah in the morning. 
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Sam finishes work at about 9.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Sam finishes work at 5 and is not bringing Abdellilah over tonight. Sam will bring Abdellilah to work around 9 in the morning.
---------------------------------------------------------------------------------------------------
PEFT MODEL: Sam finishes work at 5 and will bring Abdellilah to work at about 9.


<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inferences for the full test dataset (819 dialogues and summaries).

In [27]:
dialogues = dataset['test']['dialogue']
human_baseline_summaries = dataset['test']['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)
    
    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)
    
    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

Token indices sequence length is longer than the specified maximum sequence length for this model (530 > 512). Running this sequence through the model will result in indexing errors


Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.,Amanda can't find Betty's number. Amanda and Hannah are going to text him.,Amanda can't find Betty's number. Amanda will ask Larry.,Amanda can't find Betty's number. She asked Larry last time she was at the park together. Hannah doesn't know Larry well.
1,Eric and Rob are going to watch a stand-up on youtube.,Eric and Rob like the Russian stand-up.,Eric and Rob are watching a stand-up. Eric and Rob will watch some of his stand-ups on youtube.,Eric and Rob are watching a Russian stand-up on youtube. Eric and Rob will watch some of his stand-ups on youtube.
2,Lenny can't decide which trousers to buy. Bob advised Lenny on that topic. Lenny goes with Bob's advice to pick the trousers that are of best quality.,Lenny will buy purple trousers.,Lenny wants to buy two pairs of black trousers. Bob recommends the first pair.,Lenny wants to buy two pairs of purple trousers. Bob has four black trousers. Lenny will buy the first or the third pair.
3,Emma will be home soon and she will let Will know.,Emma will be home soon.,Emma will be home soon. Will will pick her up.,Emma will be home soon. Will will pick her up when she gets home.
4,Jane is in Warsaw. Ollie and Jane has a party. Jane lost her calendar. They will get a lunch this week on Friday. Ollie accidentally called Jane and talked about whisky. Jane cancels lunch. They'll meet for a tea at 6 pm.,Jane is in Warsaw. Jane will be free on the 19th and the 18th. Ollie will be free on the 18th. Jane will be in Morocco. Ollie and Jane will have lunch this week. Jane will be in Warsaw for the meeting. Jane will be in Morocco at 6pm tomorrow.,Jane lost her calendar. Ollie and Jane have lunch on Friday. Jane will be in Morocco at 6 pm. Ollie will bring pastries.,Jane lost her calendar. She will be at the party on Friday. Ollie will bring some sun with her. Jane will be in Morocco at 6 pm after her courses. Ollie will bring the pastries.
...,...,...,...,...
814,Benjamin didn't come to see a basketball game on Friday's night. The team supported by Alex won 101-98. Benjamin's mom has a flu and he's looking after her. Benjamin declares to attend the next basketball match.,"Benjamin didn't attend the basketball game on Friday. He's taking care of his mom, who has a nasty flu.",Benjamin missed Friday night's basketball game because his mom is sick. Benjamin will go to the next game next weekend.,Benjamin missed Friday night's basketball game. He was unable to attend because his mom is sick. He will go to the next game next weekend.
815,The audition starts at 7.30 P.M. in Antena 3.,The audition starts at 7.30 P.M.,The audition starts at 7.30 P.M.,Jamilla and Kiki are going to the audition at 7.30 PM on Rogers 3 station.
816,"Marta sent a file accidentally,",Marta's gifs are not working.,Marta clicked something by accident.,Marta clicked something by accident.
817,There was a meet-and-greet with James Charles in Birmingham which gathered 8000 people.,Cora and Ellie are talking about a meet and greet with James Charles in Birmingham.,James Charles met with 8000 fans in Birmingham.,James Charles was a celebrity meet and greeter in Birmingham. 8000 fans showed up for the meet and greet. The host from LBC tried to find an answer to an unanswerable question. James called him and introduced himself on air.


In [28]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.4263967291442344, 'rouge2': 0.18548751320085638, 'rougeL': 0.3534714469453165, 'rougeLsum': 0.3536871906870632}
INSTRUCT MODEL:
{'rouge1': 0.4809178705029691, 'rouge2': 0.2372639007386037, 'rougeL': 0.40366507380512123, 'rougeLsum': 0.40344438847621056}
PEFT MODEL:
{'rouge1': 0.49352677359911207, 'rouge2': 0.2435781694750285, 'rougeL': 0.4054014140779256, 'rougeLsum': 0.4055151761823076}


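Notice that the PEFT model's results are on par with the fully fine-tuned model's, even though training was much lighter!

In this run the PEFT scores are marginally higher than those of the fully fine-tuned (instruct) model. PEFT does not always match full fine-tuning, but even when its metrics are slightly lower, the savings in compute and memory typically outweigh the difference.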

Calculate the improvement of PEFT over the original model:

In [29]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 6.71%
rouge2: 5.81%
rougeL: 5.19%
rougeLsum: 5.18%
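The numbers above are differences in ROUGE score expressed in percentage points, not relative gains. As a minimal sketch, the snippet below contrasts the two views using the ROUGE values printed earlier, rounded to four decimals. The helper `compare` is a hypothetical name introduced here for illustration and is not part of the notebook.

```python
# ROUGE scores copied from the cell output above, rounded to four decimals.
original = {"rouge1": 0.4264, "rouge2": 0.1855, "rougeL": 0.3535, "rougeLsum": 0.3537}
peft = {"rouge1": 0.4935, "rouge2": 0.2436, "rougeL": 0.4054, "rougeLsum": 0.4055}


def compare(baseline, candidate):
    """Return {metric: (absolute gain in points, relative gain in percent)}.

    Hypothetical helper for illustration only.
    """
    gains = {}
    for key, base in baseline.items():
        diff = candidate[key] - base
        absolute = diff * 100         # percentage points, as printed above
        relative = diff / base * 100  # percent change relative to the baseline
        gains[key] = (absolute, relative)
    return gains


gains = compare(original, peft)
for key, (absolute, relative) in gains.items():
    print(f"{key}: +{absolute:.2f} points absolute, +{relative:.1f}% relative")
```

The relative view makes the gain on ROUGE-2 look largest because its baseline score is the smallest, so state which of the two you are reporting.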


Now calculate the improvement of PEFT over the fully fine-tuned model:

In [30]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: 1.26%
rouge2: 0.63%
rougeL: 0.17%
rougeLsum: 0.21%


Here you see a small increase in the ROUGE metrics (under 1.3 percentage points) over the fully fine-tuned model. Additionally, PEFT training requires far less compute and memory (often just a single GPU).

A fully fine-tuned model needs careful hyperparameter tuning to outperform PEFT techniques.
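To get a rough sense of why LoRA-style PEFT needs so much less memory, the sketch below compares the parameters in one dense projection matrix with the parameters in its low-rank adapter. The numbers are illustrative assumptions, not values taken from this section: a 768 x 768 projection (the hidden size of FLAN-T5 base) and a hypothetical LoRA rank of 32. The function `lora_param_counts` is likewise a made-up helper.

```python
def lora_param_counts(d_in, d_out, rank):
    """Count parameters for a dense weight and for its LoRA adapter.

    Hypothetical helper: LoRA replaces the update to a d_out x d_in weight
    with two low-rank factors, B (d_out x rank) and A (rank x d_in).
    """
    full = d_in * d_out            # parameters updated by full fine-tuning
    lora = rank * (d_in + d_out)   # parameters updated by the adapter
    return full, lora


# Assumed sizes for illustration: one 768 x 768 projection, rank 32.
full, lora = lora_param_counts(d_in=768, d_out=768, rank=32)
print(f"full fine-tuning updates {full} weights in this matrix")
print(f"LoRA adapter updates {lora} weights ({lora / full:.1%} of the matrix)")
```

Because adapters are usually attached to only a few projection matrices while everything else stays frozen, the trainable share of the whole model is far smaller than the per-matrix ratio shown here.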