# Fine-Tune a Generative AI Model for Dialogue Summarization

(***Note to self***: I ran it on Google Colab!)
In this notebook, we will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. We will use the FLAN-T5 model, which provides a high-quality instruction-tuned model and can summarize text out of the box. To improve the inferences, we will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then, we will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.

In [2]:
%pip install --upgrade pip
%pip install --disable-pip-version-check\
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    peft==0.3.0 \
    loralib==0.1.1 --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.3.0+cu121 requires torch==2.3.0, but you have torch 1.13.1 which is incompatible.
t

In [3]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

### 1.2 - Load Dataset and LLM
We are going to continue experimenting with the DialogSum Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [4]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})

Load the pre-trained FLAN-T5 model and its tokenizer directly from Hugging Face. Notice that we will be using the base version of FLAN-T5 (`google/flan-t5-base`). Setting `torch_dtype=torch.bfloat16` specifies the data type in which the model weights are loaded, which reduces memory use compared with 32-bit floats.

In [5]:
model_name = "google/flan-t5-base"
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)



It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function does that; at this stage, we do not need to go into its details.

In [6]:
def print_number_of_trainable_parameters(model):
    trainable_model_parameters = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_parameters += param.numel()
    return f"trainable model parameters: {trainable_model_parameters}\nall model parameters: {all_model_params}\npercentage of trainable parameters: {100 * trainable_model_parameters / all_model_params:.2f}"

print(print_number_of_trainable_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable parameters: 100.00
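
The parameter count also gives a rough idea of how much memory the weights need. The sketch below is only back-of-the-envelope arithmetic: it assumes 4 bytes per parameter for float32 and 2 bytes for bfloat16, and it ignores activations, gradients, and optimizer state. The helper name `weight_memory_mb` is made up for this illustration.

```python
NUM_PARAMS = 247_577_856  # "all model parameters" reported above


def weight_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


fp32_mb = weight_memory_mb(NUM_PARAMS, 4)  # float32: 4 bytes per value
bf16_mb = weight_memory_mb(NUM_PARAMS, 2)  # bfloat16: 2 bytes per value

print(f"float32 weights:  ~{fp32_mb:.0f} MB")
print(f"bfloat16 weights: ~{bf16_mb:.0f} MB")
```

Under these assumptions, loading the model in bfloat16 roughly halves the memory taken by the weights.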


### 1.3 - Test the Model with Zero Shot Inferencing
Test the model with zero-shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates the model can be fine-tuned to the task at hand.

In [7]:
index = 200

dialogue = dataset["test"][index]["dialogue"]
summary = dataset["test"][index]["summary"]

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors="pt")
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
)[0],
    skip_special_tokens=True,
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f"INPUT PROMPT: \n{prompt}")
print(dash_line)
print(f"BASELINE HUMAN SUMMARY: \n{summary}\n")
print(dash_line)
print(f"MODEL GENERATION - ZERO SHOT: \n{output}")

---------------------------------------------------------------------------------------------------
INPUT PROMPT: 

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

------------------------------------------------------------------

## 2 - Perform Full Fine-Tuning

### 2.1 - Preprocess the Dialog-Summary Dataset
We need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation.` to the start of the dialog and append `Summary:` after it, as follows:

Training prompt (dialogue):

```
Summarize the following conversation.
  
  Chris: This is his part of the conversation.
  Antje: This is her part of the conversation.

Summary:
```

Training response (summary):

```
  Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).



In [10]:
def tokenize_function(example):
  start_prompt = "Summarize the following conversation.\n\n"
  end_prompt = "\n\nSummary:"
  prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
  example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors='pt').input_ids
  example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors='pt').input_ids
  return example


# The dataset actually contains 3 different splits: train, validation, test.
# The `tokenize_function` code handles all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["id", "topic", "dialogue", "summary",])
tokenized_datasets



DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
})

To save some time in the lab, we will subsample the dataset:

In [11]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Check the shapes of all three parts of the dataset:

In [13]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (125, 2)
Validation: (5, 2)
Test: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
})
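
The row counts above follow directly from keeping every index divisible by 100. As a quick sanity check, the arithmetic can be reproduced without the dataset; `subsample_size` is a hypothetical helper written only for this illustration, and the split sizes are the ones printed earlier in this notebook.

```python
def subsample_size(num_rows: int, step: int = 100) -> int:
    """Count the indices i in [0, num_rows) that satisfy i % step == 0."""
    return len(range(0, num_rows, step))


# Split sizes as reported when the dataset was loaded.
full_sizes = {"train": 12460, "validation": 500, "test": 1500}

for split, rows in full_sizes.items():
    print(f"{split}: {rows} -> {subsample_size(rows)}")
```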


The output dataset is ready for fine-tuning.


### 2.2 - Fine-Tune the Model with the Preprocessed Dataset
Now utilize the built-in Hugging Face `Trainer` class. Pass the preprocessed dataset with a reference to the original model. The other training parameters were found experimentally, and there is no need to go into detail about them at the moment.

In [14]:
output_dir = f"./dialogue-summary-training-{str(int(time.time()))}"

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

Start training process...

In [15]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 1 | 49.75 |


TrainOutput(global_step=1, training_loss=49.75, metrics={'train_runtime': 5.8483, 'train_samples_per_second': 1.368, 'train_steps_per_second': 0.171, 'total_flos': 5478058819584.0, 'train_loss': 49.75, 'epoch': 0.06})

Training a fully fine-tuned version of the model would take a few hours on a GPU. To save time, we download a checkpoint of the fully fine-tuned model to use in the rest of this notebook. This fully fine-tuned model will also be referred to as the **instruct model** in this lab.

In [17]:
# !aws s3 cp --recursive s3://dlai-generative-ai/models/flan-dialog-summary-checkpoint/ ./flan-dialog-summary-checkpoint/

instruct_model_name = "truocpham/flan-dialogue-summary-checkpoint"
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(instruct_model_name, torch_dtype=torch.bfloat16)



### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)
As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see that the fine-tuned model creates a reasonable summary of the dialogue, whereas the original model failed to understand what was being asked of it.

In [21]:
index = 200
dialogue = dataset["test"][index]["dialogue"]
human_baseline_summary = dataset["test"][index]["summary"]

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(
    input_ids.to(original_model.device),
    generation_config=GenerationConfig(
        max_new_tokens=200,
        num_beams=1,
))
original_model_text_output = tokenizer.decode(
        original_model_outputs[0],
        skip_special_tokens=True,
)

instruct_model_outputs = instruct_model.generate(
    input_ids.to(instruct_model.device),
    generation_config=GenerationConfig(
        max_new_tokens=200,
        num_beams=1,
))
instruct_model_text_output = tokenizer.decode(
        instruct_model_outputs[0],
        skip_special_tokens=True,
)

print(dash_line)
print(f"BASELINE HUMAN SUMMARY: \n{human_baseline_summary}\n")
print(dash_line)
print(f"ORIGINAL MODEL:\n{original_model_text_output}")
print(dash_line)
print(f"INSTRUCT MODEL:\n{instruct_model_text_output}")

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY: 
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm thinking of adding a painting program to your software. #Person2#: That would be a bonus. #Person1#: I'm thinking about adding a more powerful processor, more memory, and a faster modem. #Person2#: I'm thinking about adding a CD-ROM drive.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# suggests #Person2# adding a painting program to #Person2#'s software and upgrading the hardware. #Person2# also wants to add a CD-ROM drive.


### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
The ROUGE metric helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
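
To make the idea of n-gram overlap concrete, here is a minimal sketch of ROUGE-N written from scratch. It is a simplification and not the implementation behind `evaluate.load("rouge")`: it assumes lowercase whitespace tokenization and no stemming, and the function names `ngrams` and `rouge_n` are invented for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N: precision, recall and F1 over n-gram overlap."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped count of shared n-grams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print("ROUGE-1:", rouge_n(prediction, reference, n=1))
print("ROUGE-2:", rouge_n(prediction, reference, n=2))
```

In this toy pair, a single changed word costs one unigram match but two bigram matches, which is why ROUGE-2 is typically lower than ROUGE-1.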

In [22]:
rouge = evaluate.load("rouge")

Generate the outputs from the sample of the test dataset (only 10 dialogues and summaries to save time), and save results.

In [23]:
dialogues = dataset["test"][0:10]["dialogue"]
human_baseline_summaries = dataset["test"][0:10]["summary"]

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
    Summarize the following conversation.

    {dialogue}

    Summary:
    """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(
        input_ids.to(original_model.device),
        generation_config=GenerationConfig(
            max_new_tokens=200,))
    original_model_text_output = tokenizer.decode(
        original_model_outputs[0],
        skip_special_tokens=True,
    )
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(
        input_ids.to(instruct_model.device),
        generation_config=GenerationConfig(
            max_new_tokens=200,))
    instruct_model_text_output = tokenizer.decode(
        instruct_model_outputs[0],
        skip_special_tokens=True,
    )
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=["human_baseline_summary", "original_model_summary", "instruct_model_summary"])
df

| | human_baseline_summary | original_model_summary | instruct_model_summary |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person1#: This memo should go out as an intra... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | #Person1#: I need to take a dictation. | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | Employees are being advised that they will not... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | The following is a list of people who have bee... | #Person2# got stuck in traffic again. #Person1... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The traffic in Montreal is always congested. | #Person2# got stuck in traffic again. #Person1... |
| 5 | #Person2# complains to #Person1# about the tra... | The driver of the car is a waste of time. | #Person2# got stuck in traffic again. #Person1... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | #Person1: Masha and Hero are getting a divorce. | Masha and Hero are getting divorced. Kate can'... |
| 8 | #Person1# and Kate talk about the divorce betw... | #Person1: I'm so sorry, but I'm not sure what ... | Masha and Hero are getting divorced. Kate can'... |
| 9 | #Person1# and Brian are at the birthday party ... | Brian's birthday is coming. | Brian's birthday is coming. #Person1# invites ... |


Evaluate the models by computing ROUGE metrics. Notice the improvement in the results!

In [24]:
original_model_results = rouge.compute(predictions=original_model_summaries,
                                       references=human_baseline_summaries[0:len(original_model_summaries)],
                                       use_aggregator=True,
                                       use_stemmer=True,
                                       )

instruct_model_results = rouge.compute(predictions=instruct_model_summaries,
                                       references=human_baseline_summaries[0:len(instruct_model_summaries)],
                                       use_aggregator=True,
                                       use_stemmer=True,
                                       )

print(f"ORIGINAL MODEL ROUGE METRICS: \n{original_model_results}")
print(f"INSTRUCT MODEL ROUGE METRICS: \n{instruct_model_results}")

ORIGINAL MODEL ROUGE METRICS: 
{'rouge1': 0.2218497975732803, 'rouge2': 0.07043084792445406, 'rougeL': 0.1951029228812898, 'rougeLsum': 0.19809402337596624}
INSTRUCT MODEL ROUGE METRICS: 
{'rouge1': 0.41026607717457186, 'rouge2': 0.17840645241958838, 'rougeL': 0.2977022096267017, 'rougeLsum': 0.2987374187518165}
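
The `rougeL` entries are based on the longest common subsequence (LCS) between prediction and reference, so unlike ROUGE-1 they are sensitive to word order. The following is a simplified sketch under the same assumptions as a toy tokenizer would impose (lowercase, whitespace split, no stemming); `lcs_length` and `rouge_l` are illustrative names, not functions from the `evaluate` library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming over rows; only the previous row is kept.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """Simplified ROUGE-L F1 score based on the LCS length."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
shuffled = "the mat the cat sat on"  # same words, different order
print(f"ROUGE-L: {rouge_l(shuffled, reference):.3f}")
```

The shuffled sentence contains exactly the same words as the reference, so a unigram-overlap score would be perfect, while the LCS-based score drops because only four tokens appear in the same order.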


In [26]:
print("Relative improvement of INSTRUCT MODEL over ORIGINAL MODEL (as a fraction):")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values()))) / np.array(list(original_model_results.values()))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f"{key}: {value:.2f}")

Relative improvement of INSTRUCT MODEL over ORIGINAL MODEL (as a fraction):
rouge1: 0.85
rouge2: 1.53
rougeL: 0.53
rougeLsum: 0.51
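
The figures just printed are computed as (instruct - original) / original, so they are relative changes expressed as fractions. A small sketch using the ROUGE scores reported above shows how this differs from an absolute difference in percentage points; the helper `improvements` is a made-up name for this illustration.

```python
# ROUGE scores copied from the output of the evaluation cell above.
original = {
    "rouge1": 0.2218497975732803,
    "rouge2": 0.07043084792445406,
    "rougeL": 0.1951029228812898,
    "rougeLsum": 0.19809402337596624,
}
instruct = {
    "rouge1": 0.41026607717457186,
    "rouge2": 0.17840645241958838,
    "rougeL": 0.2977022096267017,
    "rougeLsum": 0.2987374187518165,
}


def improvements(baseline, candidate):
    """Map each metric to (absolute difference, relative change)."""
    result = {}
    for key in baseline:
        absolute = candidate[key] - baseline[key]
        result[key] = (absolute, absolute / baseline[key])
    return result


for key, (absolute, relative) in improvements(original, instruct).items():
    print(f"{key}: {absolute * 100:+.2f} points absolute, {relative:+.0%} relative")
```

The two views answer different questions: the absolute difference says how many points the score moved, while the relative change says how large that move is compared with where the original model started.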


## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)
Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the original LLM remains unchanged and a newly trained "LoRA adapter" emerges. This LoRA adapter is much, much smaller than the original LLM: on the order of a single-digit percentage of the original LLM size (MBs vs. GBs).

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can reuse the original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.
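
As a rough illustration of how an adapter is "combined" with a frozen layer, the sketch below follows the commonly used LoRA formulation `h = W x + (alpha / r) * B (A x)`, where `W` is the frozen weight and only the small matrices `A` and `B` are trained. It uses plain Python lists and toy dimensions; the helper names and all numeric values are invented for this example and are not taken from the `peft` library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base projection plus a scaled low-rank update."""
    base = matvec(W, x)               # frozen path: W x
    update = matvec(B, matvec(A, x))  # low-rank path: B (A x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


d, r, alpha = 4, 1, 2.0
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen, d x d
A = [[0.5, 0.5, 0.5, 0.5]]                                          # trainable, r x d
B_zero = [[0.0] for _ in range(d)]                                  # trainable, d x r
x = [1.0, 2.0, 3.0, 4.0]

# With B at zero (a common initialization), the adapter changes nothing.
print(lora_forward(W, A, B_zero, x, alpha, r))

# After "training", B is non-zero and shifts the output; W itself is untouched.
B_trained = [[1.0], [0.0], [0.0], [0.0]]
h_trained = lora_forward(W, A, B_trained, x, alpha, r)
print(h_trained)

# Trainable values: r * (d + d) for the adapter versus d * d for the full matrix.
print("adapter params:", r * (d + d), "full params:", d * d)
```

Because `W` is shared and never modified, several `(A, B)` pairs trained for different tasks can be swapped in on top of the same base weights.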

### 3.1 - Set Up the PEFT/LoRA Model for Fine-Tuning
We need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (r) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [28]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,  # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [29]:
peft_model = get_peft_model(original_model, lora_config)

print(print_number_of_trainable_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable parameters: 1.41
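
The trainable count of 3538944 can be reproduced by hand. The sketch below relies on assumptions about the `google/flan-t5-base` architecture that are not stated in this notebook: a hidden size of 768, 12 encoder and 12 decoder blocks, one attention module per encoder block, two (self-attention and cross-attention) per decoder block, and square 768 x 768 `q` and `v` projections. Each adapted projection adds `r * (d_in + d_out)` parameters.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Assumed architecture of flan-t5-base (see the hedges in the text above).
D_MODEL = 768
ENCODER_BLOCKS = 12
DECODER_BLOCKS = 12
RANK = 32                   # r in the LoraConfig
TARGETS_PER_ATTENTION = 2   # target_modules=["q", "v"]

attention_modules = ENCODER_BLOCKS * 1 + DECODER_BLOCKS * 2
adapted_projections = attention_modules * TARGETS_PER_ATTENTION
trainable = adapted_projections * lora_param_count(D_MODEL, D_MODEL, RANK)

BASE_PARAMS = 247_577_856   # parameter count of the original model, reported earlier
total = BASE_PARAMS + trainable
pct = 100 * trainable / total

print(f"adapted projections: {adapted_projections}")
print(f"trainable parameters: {trainable}")
print(f"all parameters: {total}")
print(f"trainable share: {pct:.2f}%")
```

Under these assumptions the arithmetic matches the numbers printed by the notebook, which suggests the adapter really is confined to the `q` and `v` projections.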


### 3.2 - Train PEFT Adapter
Define training arguments and create `Trainer` instance.

In [30]:
output_dir = f"./dialogue-summary-training-{str(int(time.time()))}"

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1,
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

Now everything is ready to train the PEFT adapter and save the model.

In [31]:
peft_trainer.train()

peft_model_path='./peft-dialogue-summary-checkpoint-local'

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



| Step | Training Loss |
|------|---------------|
| 1 | 49.0 |


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [32]:
# peft_model_name="z7ye/peft-dialogue-summary-checkpoint"
# peft_model = AutoModelForSeq2SeqLM.from_pretrained(peft_model_name, torch_dtype=torch.bfloat16)

from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base, peft_model_path,
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False,  # setting this parameter is important here!
                                       )



The number of trainable parameters will be 0 due to the `is_trainable=False` setting:

In [33]:
print(print_number_of_trainable_parameters(peft_model))

trainable model parameters: 0
all model parameters: 251116800
percentage of trainable parameters: 0.00


### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)
Run inference on the same example as in sections 1.3 and 2.3, with the original model, the fully fine-tuned model, and the PEFT model.

In [35]:
index = 200
dialogue = dataset["test"][index]["dialogue"]
human_baseline_summary = dataset["test"][index]["summary"]

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(
    input_ids.to(original_model.device),
    generation_config=GenerationConfig(
        max_new_tokens=200,
    ))
original_model_text_output = tokenizer.decode(
        original_model_outputs[0],
        skip_special_tokens=True,
)

instruct_model_outputs = instruct_model.generate(
    input_ids.to(instruct_model.device),
    generation_config=GenerationConfig(
        max_new_tokens=200,
    ))
instruct_model_text_output = tokenizer.decode(
        instruct_model_outputs[0],
        skip_special_tokens=True,
)

peft_model_outputs = peft_model.generate(
    input_ids=input_ids.to(peft_model.device),
    generation_config=GenerationConfig(
        max_new_tokens=200,
    ))
peft_model_text_output = tokenizer.decode(
        peft_model_outputs[0],
        skip_special_tokens=True,
)

print(dash_line)
print(f"BASELINE HUMAN SUMMARY: \n{human_baseline_summary}\n")
print(dash_line)
print(f"ORIGINAL MODEL:\n{original_model_text_output}")
print(dash_line)
print(f"INSTRUCT MODEL:\n{instruct_model_text_output}")
print(dash_line)
print(f"PEFT MODEL:\n{peft_model_text_output}")

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY: 
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1: Have you considered upgrading your software?
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# suggests #Person2# adding a painting program to #Person2#'s software and upgrading the hardware. #Person2# also wants to add a CD-ROM drive.
---------------------------------------------------------------------------------------------------
PEFT MODEL:
#Person1#: I'm thinking of upgrading my computer.


### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Run inference on a sample of the test dataset (only 10 dialogues and summaries to save time).

In [36]:
dialogues = dataset["test"][0:10]["dialogue"]
human_baseline_summaries = dataset["test"][0:10]["summary"]

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
  prompt = f"""
  Summarize the following conversation.

  {dialogue}

  Summary:
  """

  input_ids = tokenizer(prompt, return_tensors="pt").input_ids

  human_baseline_text_output = human_baseline_summaries[idx]

  original_model_outputs = original_model.generate(
      input_ids.to(original_model.device),
      generation_config=GenerationConfig(
          max_new_tokens=200))
  original_model_text_output = tokenizer.decode(
      original_model_outputs[0],
      skip_special_tokens=True,
  )
  original_model_summaries.append(original_model_text_output)

  instruct_model_outputs = instruct_model.generate(
      input_ids.to(instruct_model.device),
      generation_config=GenerationConfig(
          max_new_tokens=200))
  instruct_model_text_output = tokenizer.decode(
      instruct_model_outputs[0],
      skip_special_tokens=True,
  )
  instruct_model_summaries.append(instruct_model_text_output)

  peft_model_outputs = peft_model.generate(
      input_ids=input_ids.to(peft_model.device),
      generation_config=GenerationConfig(
          max_new_tokens=200))
  peft_model_text_output = tokenizer.decode(
      peft_model_outputs[0],
      skip_special_tokens=True,
  )
  peft_model_summaries.append(peft_model_text_output)


In [41]:
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
df = pd.DataFrame(zipped_summaries, columns=["human_baseline_summary", "original_model_summary", "instruct_model_summary", "peft_model_summary"])
df

| | human_baseline_summary | original_model_summary | instruct_model_summary | peft_model_summary |
|---|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Employees are required to sign this memo by th... | #Person1# asks Ms. Dawson to take a dictation ... | #Person1#: I need to take a dictation for you. |
| 1 | In order to prevent employees from wasting tim... | #Person1#: This memo is to be a memo to all em... | #Person1# asks Ms. Dawson to take a dictation ... | #Person1#: I need to take a dictation for you. |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | Employees are to be directed to the following ... | #Person1# asks Ms. Dawson to take a dictation ... | #Person1#: I need to take a dictation for you. |
| 3 | #Person2# arrives late because of traffic jam.... | The driver has a lot of problems. | #Person2# got stuck in traffic again. #Person1... | The traffic jam at the Carrefour intersection ... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The traffic jam in the city was a terrible one. | #Person2# got stuck in traffic again. #Person1... | The traffic jam at the Carrefour intersection ... |
| 5 | #Person2# complains to #Person1# about the tra... | The driver of the car is a bit stressed becaus... | #Person2# got stuck in traffic again. #Person1... | The traffic jam at the Carrefour intersection ... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... | Masha and Hero are getting divorced. |
| 7 | #Person1# tells Kate that Masha and Hero are g... | #Person1#: Masha and Hero are getting divorced... | Masha and Hero are getting divorced. Kate can'... | Masha and Hero are getting divorced. |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... | Masha and Hero are getting divorced. |
| 9 | #Person1# and Brian are at the birthday party ... | Brian's birthday is today. | Brian's birthday is coming. #Person1# invites ... | #Person1#: Happy birthday, Brian. #Person2#: I... |


Compute ROUGE score for this subset of the data.

In [42]:
rouge = evaluate.load("rouge")

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print("ORIGINAL MODEL:")
print(original_model_results)
print("INSTRUCT MODEL:")
print(instruct_model_results)
print("PEFT MODEL:")
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2158799462199012, 'rouge2': 0.07161076604554865, 'rougeL': 0.1849479172751595, 'rougeLsum': 0.188327392869481}
INSTRUCT MODEL:
{'rouge1': 0.41026607717457186, 'rouge2': 0.17840645241958838, 'rougeL': 0.2977022096267017, 'rougeLsum': 0.2987374187518165}
PEFT MODEL:
{'rouge1': 0.24089921652421653, 'rouge2': 0.11769053708439897, 'rougeL': 0.22001958689458687, 'rougeLsum': 0.22134175465057818}


Notice that the PEFT model results are not too bad, while the training process was much easier!

In [43]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL:")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f"{key}: {value*100:.2f}%")

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL:
rouge1: 2.50%
rouge2: 4.61%
rougeL: 3.51%
rougeLsum: 3.30%


Now, calculate the difference between the PEFT model and the fully fine-tuned model (a negative value means PEFT scores lower):


In [44]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL:")

improvement = np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values()))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f"{key}: {value*100:.2f}%")

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL:
rouge1: -16.94%
rouge2: -6.07%
rougeL: -7.77%
rougeLsum: -7.74%
