# Fine-Tuning a Generative AI model for dialogue summarization

I use the FLAN-T5 model, a high-quality instruction-tuned model that can summarize text out of the box. I first explore full fine-tuning and evaluate the result with ROUGE metrics, then perform Parameter-Efficient Fine-Tuning (PEFT), evaluate the resulting model, and look at the benefits of PEFT.

# Content
- 1.Load required dependencies, dataset and LLM
  - 1.1-Load the dataset and LLM
  - 1.2-Test the model with zero-shot inference
- 2.Perform Full Fine-Tuning
  - 2.1-Preprocess the dialogue-summary dataset
  - 2.2-Fine-tune the model with the preprocessed dataset
  - 2.3-Evaluate the model quality with human evaluation
  - 2.4-Evaluate the model quality with ROUGE metrics
- 3.Perform Parameter-Efficient Fine-Tuning
  - 3.1-Set up the PEFT/LoRA model for fine-tuning
  - 3.2-Train the PEFT adapter
  - 3.3-Evaluate model quality with human evaluation
  - 3.4-Evaluate model quality with ROUGE metrics


In [1]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
  torch==1.13.1 \
  torchdata==0.5.1 --quiet
%pip install \
  transformers==4.27.2 \
  datasets==2.11.0 \
  evaluate==0.4.0 \
  rouge_score==0.1.2 \
  loralib==0.1.0 \
  peft==0.3.0 --quiet


# Importing

In [1]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM , AutoTokenizer , GenerationConfig , TrainingArguments , Trainer
import torch
import numpy as np
import pandas as pd
import time
import evaluate

# 1.1 Load dataset and LLM

The [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset contains 10,000+ dialogues with corresponding manually labeled summaries and topics.

In [2]:
hugging_face_model_name =  'google/flan-t5-base'
hugging_face_data_name  =  'knkarthick/dialogsum'


#loading data
data = load_dataset(hugging_face_data_name)
# load model and tokenizer
original_model =AutoModelForSeq2SeqLM.from_pretrained(hugging_face_model_name)
tokenizer      =AutoTokenizer.from_pretrained(hugging_face_model_name)





In [3]:
# showing data
data

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})

Count the model's parameters and find out how many of them are trainable.

In [4]:
def print_the_number_of_trainable_parameters(model):
  trainable = 0
  all_params= 0

  for _, params in model.named_parameters():
    all_params+= params.numel()
    if params.requires_grad:
      trainable +=params.numel()
  return f'Trainable model parameters :{trainable}\nAll parameters :{all_params}\npercentage of trainable parameters :{100* trainable/all_params:.2f}%'




print(print_the_number_of_trainable_parameters(original_model))

Trainable model parameters :247577856
All parameters :247577856
percentage of trainable parameters :100.00%


# 1.2 Test the model with zero-shot inference

The model struggles to summarize the dialogue compared with the baseline (human) summary, but it does pull out important information from the text, which indicates that the model can be fine-tuned for the task at hand.

In [5]:
index = 200

zero_dialogue = data['test'][index]['dialogue']
zero_summary  = data['test'][index]['summary']

zero_prompt = f'''
summarize the following conversation.

{zero_dialogue}

summary:
 '''



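# `inputs` rather than `input`, so the Python built-in is not shadowed
inputs  = tokenizer(zero_prompt , return_tensors='pt')
generate= original_model.generate(inputs['input_ids'] , max_new_tokens=200)[0]
output  = tokenizer.decode(generate , skip_special_tokens=True)


# separator line of 99 dashes (same result as joining 100 empty strings with '-')
dash_line = '-' * 99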

# show results
print(dash_line)
print(f'Prompt:\n{zero_prompt}')
print(dash_line)
print(f'Human summary:\n{zero_summary}')
print(dash_line)
print(f'Model generate:\n{output}')



---------------------------------------------------------------------------------------------------
Prompt:

summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

summary:
 
------------------------------------------------------------------------

# 2.Perform Full Fine Tuning

# 2.1-Preprocess the dialogue-summary dataset

We need to convert the dialogue-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation.` to the start of the dialogue and append `Summary:` after it, so that the summary becomes the expected response, as follows:


Training Prompt(dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary:
```


Training Response(summary):

`Both Chris and Antje participated in the conversation.`

Then preprocess the prompt-response data into tokens and extract their `input_ids`.


In [6]:
def tokenize_function(example):


  start_prompt = 'Summarize the following conversation.\n\n'
  start_summary= '\n\nSummary: '
  prompt =  [start_prompt + dialogue + start_summary for dialogue in example['dialogue']]

  example['input_ids'] = tokenizer(prompt , return_tensors='pt' , padding ='max_length' , truncation=True).input_ids
  example['labels']    = tokenizer(example['summary'] , return_tensors='pt' , padding ='max_length' , truncation=True).input_ids

  return example



tokenized_data = data.map(tokenize_function , batched=True)
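
The `padding='max_length'` and `truncation=True` arguments in the cell above force every tokenized sequence to the same length. The following is a minimal sketch of that shape effect only, not the tokenizer's actual implementation: the helper name `pad_or_truncate` and the tiny `max_length` are hypothetical, and the real tokenizer also appends an end-of-sequence token, which this sketch ignores.

```python
# Illustrative only: mimics the shape effect of padding='max_length'
# combined with truncation=True. Helper name and lengths are hypothetical.
PAD_ID = 0  # T5 tokenizers use id 0 for the <pad> token


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return a copy of token_ids with exactly max_length entries."""
    clipped = token_ids[:max_length]                   # truncation
    padding = [pad_id] * (max_length - len(clipped))   # right-padding
    return clipped + padding


short_example = pad_or_truncate([71, 12, 9], max_length=6)
long_example = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_example)  # [71, 12, 9, 0, 0, 0]
print(long_example)   # [1, 2, 3, 4, 5, 6]
```

Fixed-length sequences are what allow the examples to be stacked into tensors with `return_tensors='pt'`.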






In [7]:
tokenized_data=tokenized_data.remove_columns(['id', 'topic', 'dialogue', 'summary',])
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
})

To save time, subsample the dataset by keeping only every 100th example of each split.



In [8]:
tokenized_data = tokenized_data.filter(lambda example , index : index % 100==0 , with_indices=True)
tokenized_data






DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
})
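
The row counts above follow directly from the `index % 100 == 0` filter. As a quick sanity check, the short sketch below reproduces them from the original split sizes without touching the dataset; the helper name `rows_kept` is made up for illustration.

```python
def rows_kept(num_rows, step=100):
    """Count the indices in range(num_rows) that satisfy index % step == 0."""
    return sum(1 for index in range(num_rows) if index % step == 0)


# Split sizes as reported by the DatasetDict before filtering.
split_sizes = {"train": 12460, "test": 1500, "validation": 500}
subsampled = {name: rows_kept(size) for name, size in split_sizes.items()}

print(subsampled)  # {'train': 125, 'test': 15, 'validation': 5}
```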

# 2.2-Fine tune the model with preprocessed data


Now use the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass it the preprocessed dataset along with a reference to the original model.

In [11]:
output_dir = f'/content/sample_data/dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir       = output_dir ,
    learning_rate    = 1e-5 ,
    num_train_epochs = 1 ,
    weight_decay     = 0.01 ,
    logging_steps    = 1,
    max_steps        = 1     # a positive max_steps overrides num_train_epochs: a single step, for demonstration only
)


trainer = Trainer(
    model         = original_model ,
    args          = training_args ,
    train_dataset = tokenized_data['train'],
    eval_dataset  = tokenized_data['validation']
)



In [14]:
# trainer.train()

Training a fully fine-tuned version of the model would take a few hours on a GPU. To save time, download a checkpoint of the fully fine-tuned model to use in the rest of this notebook. This fully fine-tuned model will also be referred to as the **instruct model**.

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.
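
The outline pairs this qualitative check with a ROUGE-based evaluation, for which the notebook installs the `evaluate` and `rouge_score` packages. To make the idea concrete, here is a simplified, dependency-free sketch of ROUGE-1 (unigram overlap). It is an assumption-laden illustration, not the `rouge_score` implementation: it uses plain whitespace tokenization with no stemming, and the two sentences are made-up examples, not model outputs.

```python
from collections import Counter


def rouge1_scores(reference, candidate):
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up sentences for illustration only.
reference = "both chris and antje participated in the conversation"
candidate = "chris and antje talked in the conversation"
scores = rouge1_scores(reference, candidate)

print({name: round(value, 3) for name, value in scores.items()})
# {'precision': 0.857, 'recall': 0.75, 'f1': 0.8}
```

Six of the seven candidate words appear in the eight-word reference, which gives a precision of 6/7 and a recall of 6/8. Scores from the real metric can differ because of its tokenization and stemming options.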

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
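
To illustrate what "combining" an adapter with its base model means, the sketch below uses the standard LoRA formulation, in which a frozen weight `W` is augmented by a scaled low-rank product `(lora_alpha / r) * B @ A`. It is a toy example in plain Python with made-up 3x3 numbers, not the `peft` library's code; the helpers `matmul` and `add` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Frozen base weight W (3 x 3) and a rank-1 adapter: B is 3 x 1, A is 1 x 3.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]
A = [[0.5, 1.0, 0.0]]
rank, lora_alpha = 1, 1
scale = lora_alpha / rank

x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Option 1: keep the adapter separate and add its output to the base output.
separate = add(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Option 2: merge the adapter into the weights once, W' = W + scale * (B @ A).
W_merged = add(W, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(separate)  # [[9.5], [2.0], [11.0]]
print(merged)    # [[9.5], [2.0], [11.0]]
```

Both options give the same output. Keeping `W` untouched is what lets several adapters share one copy of the base model, while only the small `A` and `B` matrices are stored per task.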

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [18]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

In [19]:
peft_model = get_peft_model(original_model,
                            lora_config)
print(print_the_number_of_trainable_parameters(peft_model))

Trainable model parameters :3538944
All parameters :251116800
percentage of trainable parameters :1.41%
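
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the `flan-t5-base` architecture has a hidden size of 768, 12 encoder layers and 12 decoder layers, and that the `q` and `v` projections are 768 x 768; these values are assumptions taken from the published model configuration rather than from this notebook. Each adapted matrix gains an `A` matrix of shape `r x 768` and a `B` matrix of shape `768 x r`.

```python
d_model = 768        # assumed hidden size of flan-t5-base
inner_dim = 768      # assumed attention projection size (12 heads x 64)
rank = 32            # r from the LoraConfig above
encoder_layers = 12  # assumed
decoder_layers = 12  # assumed

# A (rank x d_model) plus B (inner_dim x rank) for every adapted projection.
params_per_matrix = rank * (d_model + inner_dim)

# q and v in encoder self-attention, plus q and v in both decoder
# self-attention and decoder cross-attention.
adapted_matrices = 2 * encoder_layers + 2 * 2 * decoder_layers

lora_params = params_per_matrix * adapted_matrices
base_params = 247_577_856  # count printed for the original model
total_params = base_params + lora_params

print(lora_params)                                # 3538944
print(total_params)                               # 251116800
print(f"{100 * lora_params / total_params:.2f}%")  # 1.41%
```

Under these assumptions the arithmetic matches the printed output: 72 adapted matrices with 49,152 new parameters each.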


In [21]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1 # A positive max_steps overrides num_train_epochs: a single step, for demonstration only.
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_data["train"],
)

In [None]:
peft_trainer.train()

# peft_model_path="./peft-dialogue-summary-checkpoint-local"

# peft_trainer.model.save_pretrained(peft_model_path)
# tokenizer.save_pretrained(peft_model_path)



In [None]:
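# peft_trainer was created without an eval_dataset, so pass one explicitly
peft_trainer.evaluate(eval_dataset=tokenized_data["validation"])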