# Fine-Tune a Gen AI Model for Dialogue Summarization

In this notebook, we fine-tune an existing LLM from Hugging Face (FLAN-T5) for improved dialogue summarization. <br>
To improve inference quality, we perform full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT), then evaluate both with ROUGE metrics.
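
As a rough illustration of what ROUGE measures, the sketch below computes a simplified ROUGE-1 score (unigram overlap) in plain Python. It assumes lowercased, whitespace-split tokens and no stemming, so its numbers will not necessarily match those of the `rouge_score` package; the function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: unigram overlap on lowercased, whitespace-split tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())

    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference/candidate pair: 5 of the 6 tokens overlap.
scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Recall is the share of reference tokens recovered by the candidate, and precision is the share of candidate tokens that appear in the reference.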

In [1]:
# %pip install --upgrade pip
# %pip install --disable-pip-version-check \
#     torch==1.13.1 \
#     torchdata==0.5.1 --quiet

# %pip install \
#     transformers==4.27.2 \
#     datasets==2.11.0 \
#     evaluate==0.4.0 \
#     rouge_score==0.1.2 \
#     loralib==0.1.1 \
#     peft==0.3.0 --quiet

# %pip install accelerate -U
# %pip install -U huggingface_hub

In [2]:
# %pip install -U datasets
# %pip install --upgrade transformers

In [3]:
# %pip install --upgrade ipywidgets

In [4]:
# import statements

from datasets import load_dataset
import torch
import time
import pandas as pd
import numpy as np
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, Trainer, TrainingArguments
import evaluate



### Load Dataset and LLM

We build the same summarization LLM as before, using the same DialogSum dataset from Hugging Face.

In [5]:
dataset = load_dataset("knkarthick/dialogsum")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [6]:
print(dataset['test'][0])



Loading the pre-trained FLAN-T5 model and its tokenizer from Hugging Face. 

In [7]:
model_name = 'google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)  # load the weights in bfloat16 to roughly halve memory use compared with float32
tokenizer = AutoTokenizer.from_pretrained(model_name)

To report the total number of model parameters and how many of them are trainable, I use the following function, found on Stack Overflow.

In [8]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    
    for _, param in model.named_parameters():
        all_model_params += param.numel()

        if param.requires_grad:
            trainable_model_params += param.numel()
    
    return f"Trainable model parameters: {trainable_model_params}\nAll model parameters: {all_model_params}\nPercentage of trainable params to all params: {(trainable_model_params/all_model_params)*100}%"

print(print_number_of_trainable_model_parameters(original_model))

Trainable model parameters: 247577856
All model parameters: 247577856
Percentage of trainable params to all params: 100.0%
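
To put the parameter count in context, the following back-of-the-envelope sketch estimates the memory needed just to hold the weights, in bfloat16 (as loaded above) versus float32. It counts weights only; gradients, optimizer state, and activations during full fine-tuning add to this. The helper name `weight_memory_bytes` is hypothetical.

```python
NUM_PARAMS = 247_577_856  # "All model parameters" reported above

# Storage size of one parameter for each data type.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_bytes(num_params: int, dtype: str) -> int:
    """Bytes required to store the weights alone in the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    mib = weight_memory_bytes(NUM_PARAMS, dtype) / 2**20
    print(f"{dtype}: {mib:.1f} MiB for the weights alone")
```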


### Testing the Model with Zero Shot Inferencing

In [9]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs['input_ids'],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))

print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'MODEL_GENERATION:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

As before, the base model is not able to summarize the conversation.

## Perform Full Fine-Tuning

We need to convert the dialogue-summary pairs into explicit instructions for the LLM. To do this, prepend the instruction "Summarize the following conversation." to each dialogue and append a "Summary" cue after it, so that the model learns to continue the prompt with the summary.
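
Before tokenization, the instruction template can be checked with plain strings. This is a minimal sketch: `build_prompt` is a hypothetical helper, the two-line dialogue is shortened from the sample shown earlier, and the exact wording of the trailing cue is an assumption chosen to mirror the zero-shot prompt.

```python
START_PROMPT = "Summarize the following conversation.\n\n"
END_PROMPT = "\n\nSummary: "  # assumed cue, mirroring the zero-shot prompt


def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the instruction template."""
    return START_PROMPT + dialogue + END_PROMPT


# Shortened excerpt of the sample dialogue, for illustration only.
example_dialogue = "#Person1#: Do you have a CD-ROM drive?\n#Person2#: No."
prompt = build_prompt(example_dialogue)
print(prompt)
```

Printing one prompt this way makes it easy to confirm that the training template matches the one used at inference time.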

In [10]:
def tokenize_function(example):
    start_prompt = "Summarize the following conversation.\n\n"
    end_prompt = "\n\nSummary: "  # same 'Summary:' cue as the zero-shot prompt
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example['dialogue']]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors='pt').input_ids
    # The model's forward() takes the targets as `labels`, so the column must use that name
    example['labels'] = tokenizer(example['summary'], padding='max_length', truncation=True, return_tensors='pt').input_ids

    return example

# The dataset contains 3 splits: train, validation and test.
# With batched=True, tokenize_function processes every split in batches.

tokenize_datasets = dataset.map(tokenize_function, batched=True)
tokenize_datasets = tokenize_datasets.remove_columns(['id','topic','dialogue','summary'])

Map: 100%|██████████| 1500/1500 [00:00<00:00, 1925.25 examples/s]


Subsampling the dataset

In [11]:
tokenize_datasets = tokenize_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Filter: 100%|██████████| 1500/1500 [00:00<00:00, 2605.42 examples/s]


In [12]:
# Check the shapes of all three parts of the dataset

print("Shapes of the datasets:")
print(f"Training: {tokenize_datasets['train'].shape}")
print(f"Validation: {tokenize_datasets['validation'].shape}")
print(f"Testing: {tokenize_datasets['test'].shape}")

print(tokenize_datasets)

Shapes of the datasets:
Training: (125, 2)
Validation: (5, 2)
Testing: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
})
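
The row counts above follow directly from the filter condition `index % 100 == 0`, which keeps every 100th example starting at index 0. The sketch below reproduces that arithmetic in plain Python from the original split sizes; `subsampled_size` is a hypothetical helper, not part of the `datasets` API.

```python
# Original split sizes reported when the dataset was loaded.
SPLIT_SIZES = {"train": 12460, "validation": 500, "test": 1500}


def subsampled_size(num_rows: int, every: int = 100) -> int:
    """Number of rows whose index is a multiple of `every`."""
    return sum(1 for index in range(num_rows) if index % every == 0)


for split, num_rows in SPLIT_SIZES.items():
    print(split, subsampled_size(num_rows))
```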


### Fine-Tune the model with the preprocessed dataset

We use the Hugging Face Trainer class, passing it the preprocessed dataset and a reference to the original model. To keep compute requirements low, the model is trained with minimal argument values.

In [13]:
# configure the training run and create the Trainer

output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenize_datasets['train'],
    eval_dataset=tokenize_datasets['validation']
)

max_steps is given, it will override any value given in num_train_epochs
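
The effect of this override can be estimated with simple arithmetic. The sketch below assumes a single device, no gradient accumulation, and the `TrainingArguments` default of `per_device_train_batch_size=8`, which the cell above does not change; `planned_steps` is a hypothetical helper and not a Trainer API.

```python
import math

TRAIN_ROWS = 125           # rows in the subsampled training split
PER_DEVICE_BATCH_SIZE = 8  # assumed: TrainingArguments default, not overridden above


def planned_steps(rows: int, batch_size: int, epochs: int, max_steps: int = -1) -> int:
    """Optimizer steps that would run; a positive max_steps takes precedence over epochs."""
    steps_per_epoch = math.ceil(rows / batch_size)
    return max_steps if max_steps > 0 else steps_per_epoch * epochs


epoch_based = planned_steps(TRAIN_ROWS, PER_DEVICE_BATCH_SIZE, epochs=1)
capped = planned_steps(TRAIN_ROWS, PER_DEVICE_BATCH_SIZE, epochs=1, max_steps=1)

print(f"steps for one full epoch: {epoch_based}")
print(f"steps with max_steps=1:   {capped}")
print(f"examples seen with max_steps=1: {capped * PER_DEVICE_BATCH_SIZE}")
```

Under these assumptions, a single step touches only one batch of the training split, so this run is a smoke test of the pipeline rather than meaningful fine-tuning.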



In [14]:
# start the training process

trainer.train()