# EDA and Fine-tuning of FLAN-T5 model

Dataset
- Since there are only 70-ish samples, I didn't feel it was ideal to use a LLM model. Instead, I used the FLAN-T5 model, which is a fine-tuned version of the T5 model.
- The dataset is very clean and well-structured.
- In initial experiments, I found that splitting the dataset into train, test, and validation sets was not ideal. I decided to use only the training set for training the model and the test set for evaluation.

Fine-tuning
- All training was done on Google Colab T4 GPU and `bfloat16` dtype due to resource constraints
- I do full fine-tuning of the model on the dataset and evaluate the results with ROUGE metrics.


Download the fine-tuned model from [here](https://drive.google.com/drive/folders/1eu2l3tWXEsXTyGj-PIPkoguGeQnvNgHf?usp=sharing)

In [None]:
%pip install --upgrade pip
%pip install transformers datasets evaluate rouge_score loralib peft accelerate

In [1]:
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    GenerationConfig,
    TrainingArguments,
    Trainer,
)
import torch
import time
import evaluate
import pandas as pd
import numpy as np

### Loading dataset and creating splits

In [2]:
from datasets import load_dataset, DatasetDict

hf_ds = load_dataset("Kaludi/Customer-Support-Responses")
ds = hf_ds["train"].train_test_split(test_size=0.2)
ds

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['query', 'response'],
        num_rows: 59
    })
    test: Dataset({
        features: ['query', 'response'],
        num_rows: 15
    })
})

### Load Model

- I used the large variant of the FLAN-T5 model for fine-tuning
- The base model was not good at generating answers, so I chose the large variant

In [3]:
model_name = "google/flan-t5-large"
original_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [4]:
def trainable_params(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    print(f"Trainable model parameters: {trainable_model_params}")
    print(f"All model parameters: {all_model_params}")
    print(
        f"Percentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"
    )


trainable_params(original_model)

Trainable model parameters: 783150080
All model parameters: 783150080
Percentage of trainable model parameters: 100.00%


### Sample zero-shot predictions and comparision with the ground truth

Test the model with the zero shot inferencing. You can see that the model is already very good at some of the questions but we can improve it further by fine-tuning it on the dataset.

In [1]:
index = np.random.randint(0, len(ds["test"]))

query = ds["test"][index]["query"]
response = ds["test"][index]["response"]

prompt = f"""
You are a chatbot which generates automated responses to customer queries.
Answer the following query.

{query}

Response:"""

inputs = tokenizer(prompt, return_tensors="pt")
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True,
)

dash_line = "-".join("" for x in range(100))
print(dash_line)
print(f"INPUT PROMPT:\n{prompt}")
print(dash_line)
print(f"BASELINE RESPONSE:\n{response}\n")
print(dash_line)
print(f"MODEL GENERATION - ZERO SHOT:\n{output}")


---------------------------------------------------------------------------------------------------
INPUT PROMPT:

You are a chatbot which generates automated responses to customer queries.
Answer the following query.

I want to change my shipping address.

Response:
---------------------------------------------------------------------------------------------------
BASELINE RESPONSE:
No problem. Can you please provide your order number and the new shipping address you'd like to use?

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
I can help you with that. What is your shipping address?




### Preprocess the dataset

- Here we are tokenizing the dataset so that the model can understand the inputs and outputs.

In [6]:
def tokenize_function(example):
    start_prompt = "You are a chatbot which generates automated responses to customer queries. Answer the following query.\n\n"
    end_prompt = "\n\nResponse:"
    prompt = [start_prompt + query + end_prompt for query in example["query"]]
    example["input_ids"] = tokenizer(
        prompt, padding="max_length", truncation=True, return_tensors="pt"
    ).input_ids
    example["labels"] = tokenizer(
        example["response"], padding="max_length", truncation=True, return_tensors="pt"
    ).input_ids

    return example


tokenized_ds = ds.map(tokenize_function, batched=True)
tokenized_ds = tokenized_ds.remove_columns(["query", "response"])

Map:   0%|          | 0/59 [00:00<?, ? examples/s]

Map:   0%|          | 0/15 [00:00<?, ? examples/s]

### Fine-tuning

- I use the HugginFace Trainer API for fine-tuning the model
- The model parameters are set to the normally used values

In [7]:
import os

os.environ["WANDB_DISABLED"] = "true"

output_dir = f"./cust-service-{str(int(time.time()))}"

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-4,
    num_train_epochs=30,
    weight_decay=0.01,
    logging_steps=10,
    report_to=None,
    auto_find_batch_size=True,
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    # eval_dataset=tokenized_ds['valid']
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [8]:
trainer.train()

Step,Training Loss
10,32.6
20,22.4875
30,35.5375
40,21.4375
50,12.0375
60,6.4906
70,6.225
80,5.4156
90,17.2812
100,8.525


TrainOutput(global_step=900, training_loss=2.116962890625, metrics={'train_runtime': 2177.4378, 'train_samples_per_second': 0.813, 'train_steps_per_second': 0.413, 'total_flos': 4079439273000960.0, 'train_loss': 2.116962890625, 'epoch': 30.0})

### Save model

Saving the model and copying to Google Drive to use again


In [9]:
model_save_path = "/content/ft_2"
trainer.save_model(model_save_path)

In [13]:
from google.colab import drive

drive.mount("/content/drive")
!cp -r "/content/ft_2" "/content/drive/MyDrive/ft_2"

### Evaluation

- We load the best model and evaluate it on the test set using ROUGE metrics
- The complete dataframe is saved as a CSV file (`test_responses.csv`) for further analysis

In [23]:
original_model = original_model.to("cuda")

In [24]:
ft_model = AutoModelForSeq2SeqLM.from_pretrained(
    "/content/ft_2", torch_dtype=torch.bfloat16
).to("cuda")

In [25]:
rouge = evaluate.load("rouge")

In [31]:
queries = ds["test"][:]["query"]
baseline_responses = ds["test"][:]["response"]

original_model_responses = []
ft_model_responses = []

for _, query in enumerate(queries):
    prompt = f"""
You are a chatbot which generates automated responses to customer queries.
Answer the following query.

{query}

Response:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    orig_model_op = tokenizer.decode(
        original_model.generate(input_ids.to("cuda"), max_new_tokens=200)[0],
        skip_special_tokens=True,
    )

    ft_model_op = tokenizer.decode(
        ft_model.generate(input_ids.to("cuda"), max_new_tokens=200)[0],
        skip_special_tokens=True,
    )

    original_model_responses.append(orig_model_op)
    ft_model_responses.append(ft_model_op)

zipped_responses = list(
    zip(queries, baseline_responses, original_model_responses, ft_model_responses)
)

df = pd.DataFrame(
    zipped_responses,
    columns=[
        "queries",
        "baseline_responses",
        "original_model_responses",
        "fine-tuned_model_responses",
    ],
)
df.to_csv("test_response.csv", index=False)
df

Unnamed: 0,queries,baseline_responses,original_model_responses,fine-tuned_model_responses
0,Can I get a replacement part for my product?,Certainly. Can you please provide the product ...,Can you please provide the model number of the...,I can help you with that. Can you please provi...
1,How do I reset my password?,We can help with that. Please provide your acc...,You can reset your password by using the passw...,You can reset your password by clicking on the...
2,How do I update my payment information?,We can help with that. Can you please provide ...,"I'm sorry, we don't accept credit cards.",You can update your payment information by cli...
3,Can I place a custom order?,We'd be happy to assist you. Can you please pr...,"Can you please provide the product name, model...",Can you please provide your order number and t...
4,How do I schedule a consultation or appointment?,We'd be happy to help. Can you please provide ...,"You can do this by clicking the button ""Sign u...",You can schedule an appointment by calling the...
5,Can I pre-order an item?,Certainly. Can you please provide the product ...,I'm sorry to hear that. Can you please provide...,Can you please provide your order number and t...
6,I can't find the item I'm looking for.,We're here to help. Can you please provide a d...,I'm sorry to hear that. Can you please provide...,I'm sorry to hear that. Can you please provide...
7,What is the status of my warranty claim?,We'd be happy to check for you. Can you please...,Is there anything I can do to help you?,I'm sorry to hear that. Can you please provide...
8,I want to change my shipping address.,No problem. Can you please provide your order ...,I'm sorry to have to wait. Can you please prov...,I can help you with that. Can you please provi...
9,I have a question about my bill.,We'd be happy to help. Can you please provide ...,"I'm sorry, I can't help you right now. Can you...",I can help you with that. Can you please provi...


In [27]:
original_model_results = rouge.compute(
    predictions=original_model_responses,
    references=baseline_responses[0 : len(original_model_responses)],
    use_aggregator=True,
    use_stemmer=True,
)

finetuned_model_results = rouge.compute(
    predictions=ft_model_responses,
    references=baseline_responses[0 : len(ft_model_responses)],
    use_aggregator=True,
    use_stemmer=True,
)

print("ORIGINAL MODEL:")
print(original_model_results)
print("FINE-TUNED MODEL:")
print(finetuned_model_results)

ORIGINAL MODEL:
{'rouge1': 0.32270984309506645, 'rouge2': 0.15173339381852052, 'rougeL': 0.2775705042172192, 'rougeLsum': 0.27441058727451895}
FINE-TUNED MODEL:
{'rouge1': 0.4083546591142728, 'rouge2': 0.21167942706642395, 'rougeL': 0.3457435884800957, 'rougeLsum': 0.34846650719480987}


In [29]:
print("Absolute percentage improvement of FINE-TUNED MODEL over BASELINE")

improvement = np.array(list(finetuned_model_results.values())) - np.array(
    list(original_model_results.values())
)
for key, value in zip(finetuned_model_results.keys(), improvement):
    print(f"{key}: {value*100:.2f}%")

Absolute percentage improvement of FINE-TUNED MODEL over BASELINE
rouge1: 8.56%
rouge2: 5.99%
rougeL: 6.82%
rougeLsum: 7.41%


### Conclusion

- The model is able to generate good answers for the questions in the dataset
- The ROUGE scores are also good, which means the model is able to generate answers that are similar to the ground truth
- The model improves rouge1 scores by 8.56%, rouge2 scores by 5.99%, rougeL scores by 6.82%, and rougeLsum scores by 7.41% after fine-tuning
- The CSV files show us that after fine-tuning, the model responses are more relevant, to the point, and similar to the ground truth
- Quality-wise also, the fine-tuned model is better than the zero-shot model