# Experiment with FLAN-T5: Fine-Tune a Generative AI Model for Dialogue Summarization

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [1]:
!pip install --upgrade pip
!pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

!pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet



In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

You are going to continue experimenting with the [ubuntu_dialogs_corpus](https://huggingface.co/datasets/ubuntu_dialogs_corpus) Hugging Face dataset. 

The Ubuntu Dialogue Corpus is a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets, and the unstructured nature of interactions from microblog services such as Twitter.

### Using v2 of the ubuntu_dialogs_corpus

Preprocessed during download for stemming and lemmatization  
Source: https://github.com/rkadlec/ubuntu-ranking-dataset-creator  
!git clone https://github.com/rkadlec/ubuntu-ranking-dataset-creator.git .  
./ubuntu-ranking-dataset-creator/src/./generate.sh   
!cp ./ubuntu-ranking-dataset-creator/src/*.csv  ../Datasets/Ubuntu-corpus  

In [3]:
huggingface_dataset_name = "ubuntu_dialogs_corpus"

train = load_dataset("csv", data_files = "./train.csv")
#test = load_dataset("csv", data_files = "./test.csv")
#valid = load_dataset("csv", data_files = "./valid.csv")

type(train)

Found cached dataset csv (/home/jupyter/.cache/huggingface/datasets/csv/default-d8175c05d28261e1/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


datasets.dataset_dict.DatasetDict

In [4]:
train

DatasetDict({
    train: Dataset({
        features: ['Context', 'Utterance', 'Label'],
        num_rows: 416487
    })
})

In [5]:
train['train'].features

{'Context': Value(dtype='string', id=None),
 'Utterance': Value(dtype='string', id=None),
 'Label': Value(dtype='float64', id=None)}

In [6]:
train_corpus = pd.DataFrame(train['train'])

train_corpus.columns = ['context', 'response', 'label']

In [7]:
train_corpus = train_corpus[train_corpus["label"] == 1]

In [8]:
train_corpus.head()

| | context | response | label |
|---|---|---|---|
| 0 | i think we could import the old comments via r... | yes. same binary packages. __eou__ | 1.0 |
| 3 | interesting __eou__ grub-install worked with /... | i fully endorse this suggestion </quimby> __eo... | 1.0 |
| 6 | edd will turn up here soon too, btw __eou__ __... | well, it's not strictly in the gnome list, but... | 1.0 |
| 8 | and because Python gives Mark a woody __eou__ ... | ouch :-/ __eou__ | 1.0 |
| 9 | thanks a lot! __eou__ __eot__ you're welcome _... | trying to do it now __eou__ | 1.0 |


# Ubuntu conversation

v2 added differentiation between the end of an utterance (`__eou__`) and the end of a turn (`__eot__`). In the original dataset, we concatenated all consecutive utterances by the same user into one utterance, and put `__EOS__` at the end. Here, we also denote where the original utterances were (with `__eou__`). Also, the terminology should now be consistent between the training and test set (instead of both `__EOS__` and `</s>`).

Source: https://github.com/rkadlec/ubuntu-ranking-dataset-creator
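
As a minimal sketch of how the two markers described above can be used, the helper below splits a context string into turns and each turn into utterances. The name `split_turns` is hypothetical and not part of the dataset tooling, and the example string is made up for illustration; the sketch assumes the markers appear exactly as `__eot__` and `__eou__`.

```python
from typing import List


def split_turns(context: str) -> List[List[str]]:
    """Split a v2 context string into turns, each a list of utterances."""
    turns = []
    for turn in context.split("__eot__"):
        # Strip whitespace and drop the empty piece left after a trailing __eou__.
        utterances = [u.strip() for u in turn.split("__eou__") if u.strip()]
        if utterances:
            turns.append(utterances)
    return turns


# Made-up example string, not taken from the corpus.
example = "thanks a lot! __eou__ __eot__ you're welcome __eou__ no problem __eou__"
print(split_turns(example))
```

Each inner list holds the consecutive utterances of one speaker, so the second turn in the example contains two utterances.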

## context
The context consists of the sequence of utterances appearing in the conversation prior to the `response`.

In [9]:
train_corpus.context[0].split('__eot__')

['i think we could import the old comments via rsync, but from there we need to go via email. I think it is easier than caching the status on each bug and than import bits here and there __eou__ ',
 ' it would be very easy to keep a hash db of message-ids  __eou__ sounds good __eou__ ',
 ' ok __eou__ perhaps we can ship an ad-hoc apt_prefereces __eou__ ',
 ' version? __eou__ ',
 ' thanks __eou__ ',
 ' not yet __eou__ it is covered by your insurance? __eou__ ',
 " yes __eou__ but it's really not the right time :/ __eou__ with a changing house upcoming in 3 weeks __eou__ ",
 ' you will be moving into your house soon? __eou__ posted a message recently which explains what to do if the autoconfiguration does not do what you expect __eou__ ',
 ' how urgent is #896? __eou__ ',
 ' not particularly urgent, but a policy violation __eou__ ',
 ' i agree that we should kill the -novtswitch __eou__ ',
 ' ok __eou__ ',
 ' would you consider a package split a feature? __eou__ ',
 ' context? __eou__ ',

## response

The `response` (the `Utterance` column in the original CSV) is the target (output) response which we aim to correctly identify.

In [10]:
train_corpus.response[0]

'yes. same binary packages. __eou__'

## Label

The `flag` or `label` is a Boolean variable indicating whether or not the response was the actual next response after the given context. In this CSV it is stored as a float (`1.0` for true, `0.0` for false).

In [11]:
train_corpus.label[0]

1.0

## Another example

In [12]:
train_corpus.context[8].split('__eot__')

['and because Python gives Mark a woody __eou__ ',
 ' I thought it gave him a warty __eou__ ',
 ' watch out, it probably makes all your files writable or something __eou__ warty base ... has that not been set with priorities? __eou__ ',
 ' debootstrap __eou__ do you think we need ACPI fan module support in d-i? __eou__ certainly some nCipher people are a bit worried about the whole thing ... __eou__ ',
 ' yeah, we\'ve been making far too many "feature" changes too close to the release __eou__ ',
 " AIUI we've fixed the really broken bits __eou__ ",
 ' they end up getting *used* for things __eou__ ',
 " whoa, that's a stupid idea __eou__ ",
 ' huh?  it uses it by default? __eou__ ',
 ' dude, LANG=en_GB.UTF-8 gnome-terminal starts up a terminal in en_GB __eou__ ',
 ' What is the "Run command as a login shell" option set to? __eou__ ',
 " I started one, then immediately in that terminal ran 'LANG=en_GB.UTF-8 gnome-terminal'. it did not work properly. __eou__ ",
 " I thought that was Mithr

In [13]:
train_corpus.response[3].split('__eot__')

['i fully endorse this suggestion </quimby> __eou__ how did your reinstall go? __eou__']

## Sample questions with responses in dataset

In [14]:
sample_questions=[
    "How do you install X11?",
    "Can grub-install work with ext3",
    "How do you start gnome-terminal in a terminal"
]
    

## Experimentation v1

In [15]:
import pandas as pd
from datasets import DatasetDict, Dataset
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from evaluate import load

### Experimenting with only 10% of the data

In [16]:
fraction_of_dataset = .10

total_len = len(train_corpus)*fraction_of_dataset
train_len = int(total_len * 0.70)
test_len = int(total_len * 0.30)


# Create train and test sets (no separate validation split is used)
train_set = train_corpus[:train_len]
test_set = train_corpus[train_len:train_len + test_len]

# Convert to Dataset
train_dataset = Dataset.from_pandas(train_set)
test_dataset = Dataset.from_pandas(test_set)


dataset = DatasetDict({
    'train': train_dataset,
    'test': test_dataset 
})


In [17]:
dataset

DatasetDict({
    train: Dataset({
        features: ['context', 'response', 'label', '__index_level_0__'],
        num_rows: 14559
    })
    test: Dataset({
        features: ['context', 'response', 'label', '__index_level_0__'],
        num_rows: 6239
    })
})

## Experimentation: Goal is to create a support bot

# Load the Model

In [18]:
# Load the base model and its tokenizer
model_name='google/flan-t5-large'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto",torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)

# Test the Model with Zero Shot Inferencing

In [19]:
original_model.eval()

def test_model_with_zero_shot(question, model_eval):
    test_inputs = tokenizer(question, return_tensors="pt")

    # Assign inputs to the GPU
    test_inputs = {key: value.to('cuda') for key, value in test_inputs.items()}
    test_outputs = model_eval.generate(**test_inputs,
                            min_length=64,
                            max_new_tokens=256,
                            length_penalty = 1.8,
                            repetition_penalty=2.1,
                            num_beams = 6,
                            no_repeat_ngram_size=3,
                            temperature = 0.9,  # ignored by beam search unless do_sample=True
                            early_stopping=True)
    print("Question:",question,"\n")
    print("Model Output:",tokenizer.batch_decode(test_outputs, skip_special_tokens=True),"\n")
    

In [20]:
for sample in sample_questions:
    test_model_with_zero_shot(sample,original_model)

Question: How do you install X11? 

Model Output: ['Go to the X11 download page and click on the link that says "Install X11.dll". Then follow the on-screen prompts to install. Once you\'re done, you\'ll be able to start using x11 without having to worry about it being installed in your system.'] 

Question: Can grub-install work with ext3 

Model Output: ["grub-install can't be installed on ext3 because it doesn't know how to read the filesystem. Is there a way to get it to work with ext3, or is it just not possible to install Grub-Install on e:program files (x86)"] 

Question: How do you start gnome-terminal in a terminal 

Model Output: ['Press  Enter and type gnome-terminal into the Terminal window. It will open in a new terminal window. You can also press Ctrl+Alt+T to launch it from a command prompt, or by double-clicking on the terminal icon in the top-left corner of the screen.'] 



In [21]:
def tokenize_function(example):
    #start_prompt = 'Act like a chat agent in a turn conversation similar to this dialogue between two individuals where one individual is looking to resolve Ubuntu technical issues. For every new request, provide a response\n\n'
    start_prompt = """
    Q&A: Learn from the [dialogue] on how to respond to questions in the Ubuntu forum. Your Goal is to respond to questions in a conversational manner and finally provide correct [response]\n\n
    __eou__ represents the end of utterance and __eot__ represents end of turn in the the conversation, after which the next person speaks
    """
    end_prompt = '\n\nCorrect response to technical question: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["context"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["response"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# The dataset contains two splits: train and test.
# With batched=True, tokenize_function processes each split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['response', 'context', 'label',])


In [22]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)
print(tokenized_datasets['train'][:1])

Shapes of the datasets:
Training: (14559, 3)
Test: (6239, 3)
DatasetDict({
    train: Dataset({
        features: ['__index_level_0__', 'input_ids', 'labels'],
        num_rows: 14559
    })
    test: Dataset({
        features: ['__index_level_0__', 'input_ids', 'labels'],
        num_rows: 6239
    })
})
{'__index_level_0__': [0], 'input_ids': [[1593, 184, 188, 10, 4001, 45, 8, 784, 22233, 12220, 908, 30, 149, 12, 3531, 12, 746, 16, 8, 22998, 5130, 5, 696, 17916, 19, 12, 3531, 12, 746, 16, 3, 9, 3634, 138, 3107, 11, 2031, 370, 2024, 784, 60, 7, 5041, 7, 15, 908, 3, 834, 834, 15, 1063, 834, 834, 5475, 8, 414, 13, 3, 5108, 663, 11, 3, 834, 834, 15, 32, 17, 834, 834, 5475, 414, 13, 919, 16, 8, 8, 3634, 6, 227, 84, 8, 416, 568, 12192, 3, 23, 317, 62, 228, 4830, 8, 625, 2622, 1009, 3, 52, 7, 63, 29, 75, 6, 68, 45, 132, 62, 174, 12, 281, 1009, 791, 5, 27, 317, 34, 19, 1842, 145, 212, 8509, 8, 2637, 30, 284, 8143, 11, 145, 4830, 14120, 270, 11, 132, 3, 834, 834, 15, 1063, 834, 834, 3, 834, 
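
Because `labels` are padded to the maximum length, the padding positions count toward the training loss unless they are masked. A common convention in Hugging Face seq2seq examples is to replace padding ids in the labels with -100, the index that the cross-entropy loss ignores. The sketch below shows that replacement on plain Python lists. It is not part of the notebook as run: `mask_label_padding` is a hypothetical helper, the ids are made up, and a pad token id of 0 (the T5 default) is assumed.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_label_padding(label_ids, pad_token_id=0):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Toy label batch (made-up ids): two sequences padded with 0 to length 5.
labels = [[4273, 5, 1, 0, 0], [150, 1, 0, 0, 0]]
print(mask_label_padding(labels))
```

In the notebook this could be applied to `example['labels']` inside `tokenize_function`, passing `tokenizer.pad_token_id` instead of the assumed 0.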

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [30]:
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

# Define LoRA Config
lora_config = LoraConfig(
 r=16,
 lora_alpha=16,
 target_modules=["q", "v"],
 lora_dropout=0.01,
 bias="none",
 task_type=TaskType.SEQ_2_SEQ_LM,
)

### Prepare int-8 model for training

In [31]:
model_int8 = prepare_model_for_int8_training(original_model)

### Add LoRA adapter

In [32]:
model_int8_lora = get_peft_model(model_int8, lora_config)
model_int8_lora.print_trainable_parameters()

trainable params: 4718592 || all params: 787868672 || trainable%: 0.5989059049678777
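
The trainable-parameter count printed above can be reproduced by hand. Each adapted `Linear(in, out)` layer gains two matrices, `lora_A` (r × in) and `lora_B` (out × r), so it adds r · (in + out) parameters. The sketch below assumes the published FLAN-T5 large shape (24 encoder blocks, 24 decoder blocks, 1024-dimensional `q` and `v` projections) and that `target_modules=["q", "v"]` matches self-attention in both stacks plus cross-attention in the decoder. The block counts are assumptions about the base model, not values printed by the notebook.

```python
r = 16                      # LoRA rank from lora_config
d_in = d_out = 1024         # q/v projection size in FLAN-T5 large
encoder_blocks = 24         # assumed number of encoder blocks
decoder_blocks = 24         # assumed number of decoder blocks

# One attention module per encoder block; two per decoder block
# (self-attention and cross-attention).
attention_modules = encoder_blocks + 2 * decoder_blocks
adapted_layers = attention_modules * 2          # q and v in each module

params_per_layer = r * (d_in + d_out)           # lora_A plus lora_B
trainable = adapted_layers * params_per_layer

all_params = 787_868_672                        # total reported by PEFT above
print(trainable)
print(100 * trainable / all_params)
```

Doubling `r` doubles the adapter size, which is why the rank is the main knob for trading adapter capacity against memory.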


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [33]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=2,
    logging_steps=30,
    save_strategy = "no"
)
    
trainer = Seq2SeqTrainer(
    model=model_int8_lora,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    #eval_dataset=tokenized_datasets["test"]
)

original_model.config.use_cache = False

In [34]:
# Train model
trainer.train()



| Step | Training Loss |
|---|---|
| 30 | 30.2748 |
| 60 | 26.2815 |
| 90 | 4.8472 |
| 120 | 1.7606 |
| 150 | 0.2535 |
| 180 | 0.1709 |
| 210 | 0.1659 |
| 240 | 0.1549 |
| 270 | 0.1764 |
| 300 | 0.1558 |


TrainOutput(global_step=7280, training_loss=0.4046094639615698, metrics={'train_runtime': 5005.2678, 'train_samples_per_second': 5.817, 'train_steps_per_second': 1.454, 'total_flos': 6.75323135876137e+16, 'train_loss': 0.4046094639615698, 'epoch': 2.0})
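
The `global_step` value can be checked against the dataset size. This is a minimal sketch, assuming `auto_find_batch_size` settled on a batch size of 4 (inferred from `train_samples_per_second / train_steps_per_second` ≈ 4; the notebook does not print it), a single device, and no gradient accumulation.

```python
import math

train_rows = 14559      # rows in tokenized_datasets["train"]
epochs = 2              # num_train_epochs
batch_size = 4          # assumption inferred from the reported throughput

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

Under those assumptions the total matches the 7280 steps reported by `TrainOutput`.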

In [35]:
# Save our LoRA model & tokenizer results
peft_model_id="results"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)
# Optionally save the base model as well (written to the same directory)
trainer.model.base_model.save_pretrained(peft_model_id)

### Evaluate & run Inference with LoRA FLAN-T5

We are going to use the `evaluate` library to compute the ROUGE score. We can run inference using PEFT and transformers with our FLAN-T5 large model.
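
To make the metric concrete, the sketch below computes a simplified ROUGE-1 F1 score (clipped unigram overlap) in plain Python. It is only an illustration: the `rouge_score` package used by `evaluate` applies its own tokenization and optional stemming, so its numbers can differ. The function name `rouge1_f1` and the example strings are made up for this sketch.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercase tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("open a terminal and type gnome-terminal",
                  "type gnome-terminal in a terminal")
print(round(score, 3))
```

Here four of the six predicted tokens appear in the five-token reference, so precision is 4/6 and recall is 4/5.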

In [36]:
from peft import PeftModel, PeftConfig
# Load peft config for pre-trained checkpoint etc. 

peft_model_id = "results"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
test_model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

## Load the Lora model
test_model = PeftModel.from_pretrained(test_model, peft_model_id, device_map='auto')
test_model.eval()

PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): T5ForConditionalGeneration(
      (shared): Embedding(32128, 1024)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 1024)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): Linear(
                    in_features=1024, out_features=1024, bias=False
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.01, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1024, out_features=16, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=16, out_features=1024, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embeddi

## Testing with sample questions

In [37]:
for sample in sample_questions:
    test_model_with_zero_shot(sample,test_model)

Question: How do you install X11? 

Model Output: ['Install X11 on your computer using the following steps: Click on the "Install" button and follow the on-screen prompts to install X11. Once you\'re done, you\'ll be greeted by a window asking you if you want to accept the license agreement. If you don\'t wish to accept, skip to the next step. You\'ll need to restart your computer in order to complete the installation.'] 

Question: Can grub-install work with ext3 

Model Output: ["no grub-install can't work with ext3 because it doesn't know how to read the partition table. Is there a way to fix this? Thanks! It's working for me on /dev/sda1 and /usr/local/share/grub_install."] 

Question: How do you start gnome-terminal in a terminal 

Model Output: ['open a terminal and type gnome-terminal into it, then press ctrl+alt+x to start the terminal. You can also right-click on the terminal and select "Open Terminal" from the menu that pops up. It\'s in the top-left corner of the screen.'] 
