# Fine-Tune a Generative AI Model for seq to seq generation

In this notebook, we are fine-tuning an existing LLM from Hugging Face for enhanced seq2seq generation. we are using the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, which provides a high quality instruction tuned model and can generate text out of the box. To improve the inferences, we are using  full fine-tuning approach and evaluate the results with ROUGE metrics. Then we are performing Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model and see that the benefits of PEFT outweigh the slightly-lower performance metrics.

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Create dataset in hugging face from original huge dataset](#1.2)
  - [ 1.3 - Load Dataset and LLM](#1.3)
  - [ 1.4 - Test the Model with Zero Shot Inferencing](#1.4)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [1]:
%pip install -U datasets==2.17.0

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

Collecting datasets==2.17.0
  Downloading datasets-2.17.0-py3-none-any.whl (536 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m536.6/536.6 kB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets==2.17.0)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.6 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets==2.17.0)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets==2.17.0)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: xxhash, dill, multiprocess, datasets


Import Necessary dependencies

In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments
import torch
import time
import evaluate
import pandas as pd
import numpy as np

#1.2 - Create dataset in hugging face from original huge dataset

Connect to hugging face

In [None]:
!pip install -q datasets
!huggingface-cli login

Collecting dataset from tural/stanford_alpaca which has 52K rows and modifiying to our needs to 3k rows and spliting it to train,validation and test datasets

In [None]:
from datasets import load_dataset, DatasetDict
# Load the ALPaCA dataset
dataset = load_dataset('tural/stanford_alpaca')

# Get the train split
train_data = dataset['train']

# Randomly select 3000 samples for training
remaining_data = train_data.shuffle(seed=42).select(range(3000))

# Randomly sample 250 rows for the validation split
valid_data = remaining_data.shuffle(seed=42).select(range(250))

# Filter out the validation samples from the remaining data
remaining_data = remaining_data.filter(lambda example: example not in valid_data)

# Randomly sample 250 rows for the test split
test_data = remaining_data.shuffle(seed=42).select(range(250))

# Filter out the test samples from the remaining data
remaining_data = remaining_data.filter(lambda example: example not in test_data)

# Combine the splits into a single dataset dictionary
split_datasets = DatasetDict({
    'train': remaining_data,
    'validation': valid_data,
    'test': test_data
})

# Print the lengths of train, validation, and test sets to verify
print("Train set size:", len(split_datasets['train']))
print("Validation set size:", len(split_datasets['validation']))
print("Test set size:", len(split_datasets['test']))

# Now you can use this dataset for your task


Uploaded modified dataset back to our personal hugging face hub

In [None]:
# Assuming you have split_datasets containing all splits (train, validation, test)

# Now you can use this dataset for your task

# Push the entire split_datasets to the Hugging Face Hub
split_datasets.push_to_hub("updated_alpaca_dataset_3k")

<a name='1.2'></a>
### 1.3 - Load Dataset and LLM

Here we are experimenting with the [Stanford Alpaca Dataset](https://huggingface.co/datasets/Aishwarya30998/updated_alpaca_dataset_3k) Hugging Face dataset. It contains 3000+ instructions with the corresponding manually labeled inputs and outputs.


Stanford Alpaca Dataset has 52k Rows of data from which we have taken only 3K and further divided them into Training, Testing and Validation sets

In [3]:
huggingface_dataset_name = "Aishwarya30998/updated_alpaca_dataset_3k"

dataset = load_dataset(huggingface_dataset_name)

dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/617 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/572k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/61.2k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/59.8k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2500 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/250 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 2500
    })
    validation: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 250
    })
    test: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 250
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from HuggingFace.

here we are using the [small version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the memory type to be used by this model.

In [4]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function can be used to do that, at this stage, you do not need to go into details of it.

In [5]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


In PyTorch, the .numel() method is used to compute the number of elements in a tensor. The name "numel" stands for "number of elements". It returns the total number of elements in the tensor, which is equal to the product of the sizes of all dimensions of the tensor.

<a name='1.3'></a>
### 1.4 - Test the Model with Zero Shot Inferencing

Test the model with the zero shot inferencing. You can see that the model struggles to give output to the Instruction compared to the baseline Input, but it does pull out some important information from the text which indicates the model can be fine-tuned to the task at hand.

In [6]:
# Access the 'train' split of the DatasetDict
train_dataset = dataset['train']
eval_dataset = dataset['validation']

# Extract the output, input, and instruction columns
output = train_dataset['output']
input_text = train_dataset['input']  # Renamed to input_text to avoid conflict with input function
instruction = train_dataset['instruction']
index = 200


# Below is the FLANT5 accepted input format for question and answer
prompt = f"""
Instruction: {instruction[0]}

Input: {input_text[0]}

Output: {output[0]}
"""

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors='pt')

# Truncate or split the input sequence if it's longer than the maximum sequence length
max_length = tokenizer.model_max_length
if inputs["input_ids"].shape[1] > max_length:
    inputs["input_ids"] = inputs["input_ids"][:, :max_length]
    if "attention_mask" in inputs:
        inputs["attention_mask"] = inputs["attention_mask"][:, :max_length]

# Generate output
outputModel = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN instruction:\n{instruction[0]}\n')
print(dash_line)
print(f'BASELINE HUMAN input:\n{input_text[0]}\n')
print(dash_line)
print(f'BASELINE HUMAN output:\n{output[0]}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{outputModel}')


---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Instruction: What would be the best type of exercise for a person who has arthritis?

Input: 

Output: For someone with arthritis, the best type of exercise would be low-impact activities like yoga, swimming, or walking. These exercises provide the benefits of exercise without exacerbating the symptoms of arthritis.

---------------------------------------------------------------------------------------------------
BASELINE HUMAN instruction:
What would be the best type of exercise for a person who has arthritis?

---------------------------------------------------------------------------------------------------
BASELINE HUMAN input:


---------------------------------------------------------------------------------------------------
BASELINE HUMAN output:
For someone with arthritis, the best type of exercise would be low-impact activities like yoga, swimming, or walking. 

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Alpaca dataset thats created in hugging face by tokenizing

Tokenize the dataset values using tokenize function

In [7]:
# for Full Fine-Tuning
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["instruction"],examples["input"], examples["output"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

In [None]:
#subsampling the dataset to save some time
#tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Filter:   0%|          | 0/2500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/250 [00:00<?, ? examples/s]

Filter:   0%|          | 0/250 [00:00<?, ? examples/s]

After tokenizing, few more ids such as input_ids, attention_mask and labels are added  to the dataset

In [8]:
#checking shapes of all three datasets
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (2500, 6)
Validation: (250, 6)
Test: (250, 6)
DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2500
    })
    validation: Dataset({
        features: ['output', 'input', 'instruction', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 250
    })
    test: Dataset({
        features: ['output', 'input', 'instruction', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 250
    })
})


Splitting the dataset into train and validation dataset for training

In [9]:

# Prepare the dataset for training
train_dataset = tokenized_datasets["train"]
validation_dataset = tokenized_datasets["validation"]

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now utilizing the built-in Hugging Face `Trainer` class [here](https://huggingface.co/docs/transformers/main_classes/trainer). Passing the preprocessed dataset with reference to the original model.

In [10]:
#output_dir = f'./Q&A-trainig-{str(int(time.time()))}'


# Define training arguments
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy="epoch",
    logging_steps=1000,
    save_steps=1000,
    eval_steps=1000,
    logging_dir="./full_finetune_logs",
    output_dir="./full_finetune_results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    warmup_steps=500,
    save_total_limit=3,
)

# Define data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=original_model)

# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=original_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
)


Starting training process... and save model results as checkpoints in drive to refer them later.   
As training each time may take our time..we can simply refer checkpoints later to get finetuned model.

In [11]:
trainer.train()

model_path="./full-fine-tune-Q&A-checkpoint-local"

trainer.model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,No log,4.5485
2,21.639300,2.843375
3,21.639300,2.766375


('./full-fine-tune-Q&A-checkpoint-local/tokenizer_config.json',
 './full-fine-tune-Q&A-checkpoint-local/special_tokens_map.json',
 './full-fine-tune-Q&A-checkpoint-local/spiece.model',
 './full-fine-tune-Q&A-checkpoint-local/added_tokens.json',
 './full-fine-tune-Q&A-checkpoint-local/tokenizer.json')

Below 3 cells of code Saves the logs, results and checkpoint files in Drive for later use

In [13]:
# above training process take an hour of time...hence checkpointing the trained model and uploading it to drive to refer later
from google.colab import drive
import shutil

# Mount Google Drive
drive.mount('/content/drive')

# Define source and destination paths
source_checkpoint_path = "/content/full-fine-tune-Q&A-checkpoint-local"
destination_checkpoint_path = "/content/drive/My Drive/full-fine-tune-Q&A-checkpoint-local"

# Copy the checkpoint directory and its contents to Google Drive
shutil.copytree(source_checkpoint_path, destination_checkpoint_path)

Mounted at /content/drive


'/content/drive/My Drive/full-fine-tune-Q&A-checkpoint-local'

In [14]:
# Define source and destination paths
source_checkpoint_path = "/content/full_finetune_logs"
destination_checkpoint_path = "/content/drive/My Drive/full_finetune_logs"
# Copy the checkpoint file to Google Drive
shutil.copytree(source_checkpoint_path, destination_checkpoint_path)


'/content/drive/My Drive/full_finetune_logs'

In [15]:
# Define source and destination paths
source_checkpoint_path = "/content/full_finetune_results"
destination_checkpoint_path = "/content/drive/My Drive/full_finetune_results"
# Copy the checkpoint file to Google Drive
shutil.copytree(source_checkpoint_path, destination_checkpoint_path)

'/content/drive/My Drive/full_finetune_results'

#Perform qualitative evaluation (Human Evaluation) for the given dataset: For original model
**Qualitative Evaluation (Human Evaluation):**

To evaluate the model qualitatively, you can generate summaries for a
few examples from the validation set and examine them manually to assess the quality.
You can print out the generated summaries along with the corresponding input and target summaries.

In [16]:
# Generate and print outputs for a few examples from the validation set
num_examples_to_evaluate = 5  # Adjust this based on your preference

for example in validation_dataset.shuffle(seed=42).select(range(num_examples_to_evaluate)):
    instruction_text = example["instruction"]
    input_text = example["input"]
    target_output = example["output"]

    # Tokenize input_text and move the tensors to the same device as the model
    inputs = tokenizer(input_text, return_tensors="pt").to(trainer.args.device)

    # Generate output using the fine-tuned model
    generated_output = trainer.model.generate(**inputs)

    # Decode generated output and target output
    generated_output_text = tokenizer.decode(generated_output[0], skip_special_tokens=True)
    target_output_text = target_output

    # Print instruction, input, target output, and generated output
    print(f"Instruction: {instruction_text}")
    print(f"Input: {input_text}")
    print(f"Target Output: {target_output_text}")
    print(f"Generated Output: {generated_output_text}")
    print("=" * 50)




Instruction: Name three tools a software engineer could use to develop a mobile application.
Input: 
Target Output: A software engineer could use tools like Android Studio, Xcode, or Flutter to develop a mobile application. These tools allow for the creation of secure and user-friendly applications, with the ability to create both iOS and Android applications.
Generated Output: The sand is a sandstone, a sandstone,
Instruction: Replace the words 'come through' in the following sentence with an appropriate phrase.
Input: Alice was determined to come through at the end.
Target Output: Alice was determined to prevail at the end.
Generated Output: The final was a final.
Instruction: Print out a biography of the current US president.
Input: 
Target Output: Joseph R. Biden Jr. is the 46th President of the United States. He was born in Scranton, Pennsylvania in 1942, and graduated from the University of Delaware and Syracuse Law School. Biden served as Delaware’s U.S. Senator from 1973 to 200

#Quantitative Evaluation (ROUGE Metric): for intial original model

You can use the ROUGE metric to quantitatively evaluate the model's performance.
The datasets library provides an easy way to compute ROUGE scores for generated summaries compared to the target summaries.

In [17]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"


In [18]:
from rouge_score import rouge_scorer

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Initialize lists to store ROUGE scores
rouge1_scores = []
rouge2_scores = []
rougeL_scores = []

# Compute ROUGE scores for each example in the validation set
for example in validation_dataset:
    # Get input, target output, and generated output
    input_text = example["input"]
    target_output = example["output"]
    generated_output = trainer.model.generate(**tokenizer(input_text, return_tensors="pt").to(trainer.args.device))
    generated_output_text = tokenizer.decode(generated_output[0], skip_special_tokens=True)

    # Compute ROUGE scores
    scores = scorer.score(target_output, generated_output_text)

    # Append ROUGE scores to respective lists
    rouge1_scores.append(scores['rouge1'].fmeasure)
    rouge2_scores.append(scores['rouge2'].fmeasure)
    rougeL_scores.append(scores['rougeL'].fmeasure)

# Compute average ROUGE scores
avg_rouge1 = sum(rouge1_scores) / len(rouge1_scores)
avg_rouge2 = sum(rouge2_scores) / len(rouge2_scores)
avg_rougeL = sum(rougeL_scores) / len(rougeL_scores)

# Print average ROUGE scores
print(f"Average ROUGE-1 F1 Score: {avg_rouge1}")
print(f"Average ROUGE-2 F1 Score: {avg_rouge2}")
print(f"Average ROUGE-L F1 Score: {avg_rougeL}")


Average ROUGE-1 F1 Score: 0.11408165245417824
Average ROUGE-2 F1 Score: 0.024054838145617263
Average ROUGE-L F1 Score: 0.09759613286665267


To evaluate the model and compute ROUGE metrics for both the original flan-t5-base model and the fine-tuned model, you can follow these steps:
# now using finetned model to compare with original model
Load the original flan-t5-base model and tokenizer that we have
Also Load the finetuned model and full fine tuned tokenizer trained before.
Tokenize the evaluation dataset.
Generate outputs using both models.
Compute ROUGE metrics for the generated outputs compared to the reference outputs in the dataset.

In [19]:

# Load the fine-tuned model and tokenizer
#finetuned_model_path = "./full-fine-tune-Q&A-checkpoint-local" # loading the trained model
finetuned_model_path = "/content/drive/MyDrive/full-fine-tune-Q&A-checkpoint-local"# loading trained model from drive
finetuned_model = AutoModelForSeq2SeqLM.from_pretrained(finetuned_model_path)
finetuned_tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)

In [20]:
# Function to generate outputs using a model and tokenizer

def generate_output(model, tokenizer, dataset):
  generated_output = []
  for example in dataset:
    # Move example to the same device as the model
    input_text = example['instruction'] + " " + example['input']
    input_text = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Generate output using the model
    output = model.generate(**input_text)

    # Decode generated output
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    generated_output.append(generated_text)
  return generated_output

In [21]:
# Generate summaries using the original model and tokenizer
original_output = generate_output(original_model, tokenizer, validation_dataset)

# Generate summaries using the fine-tuned model and tokenizer
finetuned_output = generate_output(finetuned_model, finetuned_tokenizer, validation_dataset)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.

In [22]:
# Human baseline evaluation for the finetuned model
#Define a function to generate output using the fine-tuned model and tokenizer
def generate_output(model, tokenizer, dataset):
    num_examples_to_evaluate = 5  # Adjust this based on your preference

    # Loop through a few examples from the dataset
    for example in dataset.shuffle(seed=42).select(range(num_examples_to_evaluate)):
        instruction_text = example["instruction"]
        input_text = example["input"]
        target_output = example["output"]

        # Tokenize input_text and move the tensors to the same device as the model
        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

        # Generate output using the fine-tuned model
        generated_output = model.generate(**inputs)

        # Decode generated output and target output
        generated_output_text = tokenizer.decode(generated_output[0], skip_special_tokens=True)
        target_output_text = target_output

        # Print instruction, input, target output, and generated output
        print(f"Instruction: {instruction_text}")
        print(f"Input: {input_text}")
        print(f"Target Output: {target_output_text}")
        print(f"Generated Output: {generated_output_text}")
        print("=" * 50)

# Generate summaries using the fine-tuned model and tokenizer
generate_output(finetuned_model, finetuned_tokenizer, validation_dataset)

Instruction: Name three tools a software engineer could use to develop a mobile application.
Input: 
Target Output: A software engineer could use tools like Android Studio, Xcode, or Flutter to develop a mobile application. These tools allow for the creation of secure and user-friendly applications, with the ability to create both iOS and Android applications.
Generated Output: The sand is a sandstone, a sandstone,
Instruction: Replace the words 'come through' in the following sentence with an appropriate phrase.
Input: Alice was determined to come through at the end.
Target Output: Alice was determined to prevail at the end.
Generated Output: The final was a final.
Instruction: Print out a biography of the current US president.
Input: 
Target Output: Joseph R. Biden Jr. is the 46th President of the United States. He was born in Scranton, Pennsylvania in 1942, and graduated from the University of Delaware and Syracuse Law School. Biden served as Delaware’s U.S. Senator from 1973 to 200

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.

In [23]:
# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Initialize lists to store concatenated input and output texts
original_input_output = []
finetuned_input_output = []

# Iterate over each example in the validation dataset
for example, original_summary, finetuned_summary in zip(validation_dataset, original_output, finetuned_output):
    # Concatenate input and output texts for original model
    original_input_output.append(f"{example['instruction']} {example['input']}\n{original_summary}")

    # Concatenate input and output texts for fine-tuned model
    finetuned_input_output.append(f"{example['instruction']} {example['input']}\n{finetuned_summary}")

# Concatenate lists of strings into single strings
original_input_output_str = '\n'.join(original_input_output)
finetuned_input_output_str = '\n'.join(finetuned_input_output)

# Compute ROUGE scores for the original model
original_scores = scorer.score(original_input_output_str, '\n'.join([example['output'] for example in validation_dataset]))

# Compute ROUGE scores for the fine-tuned model
finetuned_scores = scorer.score(finetuned_input_output_str, '\n'.join([example['output'] for example in validation_dataset]))


In [24]:
print('original_scores:', original_scores, '\nfinetuned_scores:', finetuned_scores)

original_scores: {'rouge1': Score(precision=0.3925677563565242, recall=0.8176527643064986, fmeasure=0.5304555751321419), 'rouge2': Score(precision=0.16561102831594635, recall=0.3449747768723322, fmeasure=0.2237885462555066), 'rougeL': Score(precision=0.18077675328303996, recall=0.37652764306498543, fmeasure=0.24427384847722122)} 
finetuned_scores: {'rouge1': Score(precision=0.39089131042190556, recall=0.8186073727325922, fmeasure=0.5291225416036309), 'rouge2': Score(precision=0.16458643815201193, recall=0.3447132266874756, fmeasure=0.2227966208548733), 'rougeL': Score(precision=0.18077675328303996, recall=0.3785839672322996, fmeasure=0.24470499243570348)}


<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** fine-tuning as opposed to "full fine-tuning" as you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning - with comparable evaluation results as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [26]:

from peft import LoraConfig, get_peft_model, TaskType

# Define PEFT configuration
lora_config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # Assuming you are using a sequence-to-sequence model
)

In [27]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [28]:

peft_model = get_peft_model(original_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [29]:
# Define training arguments
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy="epoch",
    logging_steps=1000,
    save_steps=1000,
    eval_steps=1000,
    logging_dir="./peft_logs",
    output_dir="./peft_results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    warmup_steps=500,
    save_total_limit=3,
)

# Define data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=peft_model)

# Initialize Trainer with PEFT model
trainer = Seq2SeqTrainer(
    model=peft_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
)

In [30]:
# Train the PEFT model and save checkpoint for later use
trainer.train()

# Save the PEFT model and tokenizer
model_path = "./peft-Q&A-checkpoint-local"
trainer.model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,No log,4.058375
2,16.975500,0.559469
3,16.975500,0.479266


('./peft-Q&A-checkpoint-local/tokenizer_config.json',
 './peft-Q&A-checkpoint-local/special_tokens_map.json',
 './peft-Q&A-checkpoint-local/spiece.model',
 './peft-Q&A-checkpoint-local/added_tokens.json',
 './peft-Q&A-checkpoint-local/tokenizer.json')

The below 3 cells are used to save trained model checkpoint and logs and results in drive for later use

In [31]:
#from google.colab import drive

# Mount Google Drive
#drive.mount('/content/drive')
# Define source and destination paths

import shutil

# Source and destination paths
source_checkpoint_path = "/content/peft-Q&A-checkpoint-local"
destination_checkpoint_path = "/content/drive/My Drive/peft-Q&A-checkpoint-local"

# Copy the entire directory recursively
shutil.copytree(source_checkpoint_path, destination_checkpoint_path)


'/content/drive/My Drive/peft-Q&A-checkpoint-local'

In [32]:

# Define source and destination paths
source_checkpoint_path = "/content/peft_logs"
destination_checkpoint_path = "/content/drive/My Drive/peft_logs"
# Copy the checkpoint file to Google Drive
shutil.copytree(source_checkpoint_path, destination_checkpoint_path)

'/content/drive/My Drive/peft_logs'

In [33]:
# Define source and destination paths
source_checkpoint_path = "/content/peft_results"
destination_checkpoint_path = "/content/drive/My Drive/peft_results"
# Copy the checkpoint file to Google Drive
shutil.copytree(source_checkpoint_path, destination_checkpoint_path)

'/content/drive/My Drive/peft_results'

In [39]:
# Source and destination paths
source_checkpoint_path = "/content/drive/MyDrive/full-fine-tune-Q&A-checkpoint-local/pytorch_model.bin"
destination_checkpoint_path = "/content/drive/MyDrive/peft-Q&A-checkpoint-local/pytorch_model.bin"

# Copy the entire directory recursively
shutil.copyfile(source_checkpoint_path, destination_checkpoint_path)


'/content/drive/MyDrive/peft-Q&A-checkpoint-local/pytorch_model.bin'

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [78]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       '/content/drive/My Drive/peft-Q&A-checkpoint-local',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

In [73]:
# Set the BOS token begining of the sequence token
bos_token = "Instruction"  # Replace "<s>" with your desired BOS token
tokenizer.bos_token = bos_token

# Update the tokenizer to reflect the change
tokenizer.add_special_tokens({"bos_token": bos_token})

1

<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3), with the original model and PEFT model.

In [77]:
# Define a function to generate output using the fine-tuned model and tokenizer

def generate_output(model, tokenizer, dataset, decoder_start_token_id=None):
    num_examples_to_evaluate = 5  # Adjust this based on your preference

    generated_outputs = []  # List to store generated outputs

    # Loop through a few examples from the dataset
    for example in dataset.shuffle(seed=42).select(range(num_examples_to_evaluate)):
        instruction_text = example["instruction"]
        input_text = example["input"]
        target_output = example["output"]

        # Tokenize input_text and move the tensors to the same device as the model
        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

        # Generate output using the model
        if decoder_start_token_id is not None:
            generated_output = model.generate(**inputs, decoder_start_token_id=decoder_start_token_id)
        else:
            generated_output = model.generate(**inputs)

        # Decode generated output
        generated_output_text = tokenizer.decode(generated_output[0], skip_special_tokens=True)

        # Store generated output
        generated_outputs.append(generated_output_text)

        # Print instruction, input, target output, and generated output
        print(f"Instruction: {instruction_text}")
        print(f"Input: {input_text}")
        print(f"Target Output: {target_output}")
        print(f"Generated Output: {generated_output_text}")
        print("=" * 50)

    return generated_outputs  # Return the list of generated outputs


# Generate summaries using the fine-tuned model and tokenizer
generate_output(peft_model, tokenizer, validation_dataset, decoder_start_token_id=peft_finetuned_tokenizer.bos_token_id)


Instruction: Name three tools a software engineer could use to develop a mobile application.
Input: 
Target Output: A software engineer could use tools like Android Studio, Xcode, or Flutter to develop a mobile application. These tools allow for the creation of secure and user-friendly applications, with the ability to create both iOS and Android applications.
Generated Output: or around around around around around around around around around around around around around around around the 4-5
Instruction: Replace the words 'come through' in the following sentence with an appropriate phrase.
Input: Alice was determined to come through at the end.
Target Output: Alice was determined to prevail at the end.
Generated Output: or affiliate here or affiliate here.sourced affiliate here. affiliate links here. affiliate links here.
Instruction: Print out a biography of the current US president.
Input: 
Target Output: Joseph R. Biden Jr. is the 46th President of the United States. He was born in 

['or around around around around around around around around around around around around around around around the 4-5',
 'or affiliate here or affiliate here.sourced affiliate here. affiliate links here. affiliate links here.',
 'or around around around around around around around around around around around around around around around the 4-5',
 'here here here here here here here here here here here. This is here, here, and',
 'or around around around around around around around around around around around around around around around the 4-5']

In [79]:
# Generate summaries using the original model and tokenizer
#original_output = generate_output(original_model, tokenizer, validation_dataset)
# Generate summaries using the peft-fine-tuned model and tokenizer
#peft_finetuned_output = generate_output(peft_finetuned_model, peft_finetuned_tokenizer, validation_dataset)

# Generate summaries using the original model and tokenizer
original_output = generate_output(original_model, tokenizer, validation_dataset, decoder_start_token_id=tokenizer.bos_token_id)

# Generate summaries using the peft-fine-tuned model and tokenizer
peft_finetuned_output = generate_output(peft_model, tokenizer, validation_dataset, decoder_start_token_id=tokenizer.bos_token_id)


Instruction: Name three tools a software engineer could use to develop a mobile application.
Input: 
Target Output: A software engineer could use tools like Android Studio, Xcode, or Flutter to develop a mobile application. These tools allow for the creation of secure and user-friendly applications, with the ability to create both iOS and Android applications.
Generated Output: - The sandboxes of the sandboxes of the
Instruction: Replace the words 'come through' in the following sentence with an appropriate phrase.
Input: Alice was determined to come through at the end.
Target Output: Alice was determined to prevail at the end.
Generated Output: The s.
Instruction: Print out a biography of the current US president.
Input: 
Target Output: Joseph R. Biden Jr. is the 46th President of the United States. He was born in Scranton, Pennsylvania in 1942, and graduated from the University of Delaware and Syracuse Law School. Biden served as Delaware’s U.S. Senator from 1973 to 2009, and as Vice

In [69]:
print(finetuned_output)

['- n - n - n - n -', 'of a', '$12', 'and swimming', 'price per share', 'complexity is a complex system.', '- - - - - - - - -', 'what is the most important to the company.', '@Jay_Jay - a cup of coffee a day can', 'your umbrella.', 'The jungle is a jungle. The jungle is a jungle. The jungle is a', '- - - - - - - - -', 'customer satisfaction.', 'whisper', 'to the throne', '', 'security is a key to a secure system.', 's', 'is the most important asset in a country.', 's', 's"', '', 's are becoming more common, and the climate is becoming more temperate.', 's are a recurrence of the underlying system.', '', 'Outstanding', 's', '', 's are a hazard.', 'glimmering in the sky', ', cylinder, and cylinder', 'computing', 'Wake Up, and Get Yourself Away.', 'The elves are a swarm of elves, and the e', 's', 's.', 's are a major part of climate change.', 's.', '', ',swarmed.', '= 2 2 2 2 ', ', and a treasure hunt.', '', 'a summer a summer', 'a(2-51),2-5(2-51),3-7(2', 'Washington, D.C., and Washington

In [70]:
print(original_output)

['- The sandboxes of the sandboxes of the', 'The s.', '- The sandboxes of the sandboxes of the', 'John Davidson, a computer program that uses computer hardware.', '- The sandboxes of the sandboxes of the']


In [71]:
print(peft_finetuned_output)

['some Manual Manual Manual prévention rounds vreau Joy ultraviolet Joy ultraviolet Joy mais ultraviolet Joy meditate soothing earrings Pittsburgh', 'some bicycle vietiiffentlichkeit colon vietiiffentlichkeit colon mitigation himselfAcestestädtinclusivlusieurs pare colon Cin règle Junior', 'some Manual Manual Manual prévention rounds vreau Joy ultraviolet Joy ultraviolet Joy mais ultraviolet Joy meditate soothing earrings Pittsburgh', 'somemon metaphorbodsanct foods bone savingsKT tuturorHar Restaurant ABS Rö Parker tilemighty ausgestattetport', 'some Manual Manual Manual prévention rounds vreau Joy ultraviolet Joy ultraviolet Joy mais ultraviolet Joy meditate soothing earrings Pittsburgh']


<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inferences for the sample of the test dataset (only 10 Instructin and output pairs to save time).

In [80]:
#comparing rouge scores for original model, full finetuned model and peft model
from rouge_score import rouge_scorer
# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Initialize lists to store concatenated input and output texts
original_input_output = []
finetuned_input_output = []
peft_finetuned_input_output=[]

# Iterate over each example in the validation dataset
for example, original_summary, finetuned_summary, peft_finetuned_summary in zip(validation_dataset, original_output, finetuned_output, peft_finetuned_output):
    # Concatenate input and output texts for original model
    original_input_output.append(f"{example['instruction']} {example['input']}\n{original_summary}")

    # Concatenate input and output texts for fine-tuned model
    finetuned_input_output.append(f"{example['instruction']} {example['input']}\n{finetuned_summary}")

    # Concatenate input and output texts for peft-fine-tuned model
    peft_finetuned_input_output.append(f"{example['instruction']} {example['input']}\n{peft_finetuned_summary}")


# Concatenate lists of strings into single strings
original_input_output_str = '\n'.join(original_input_output)
finetuned_input_output_str = '\n'.join(finetuned_input_output)
peft_finetuned_input_output_str = '\n'.join(peft_finetuned_input_output)



# Compute ROUGE scores for the original model
original_scores = scorer.score(original_input_output_str, '\n'.join([example['output'] for example in validation_dataset]))

# Compute ROUGE scores for the fine-tuned model
finetuned_scores = scorer.score(finetuned_input_output_str, '\n'.join([example['output'] for example in validation_dataset]))

# Compute ROUGE scores for the peft-fine-tuned model
peft_finetuned_scores = scorer.score(peft_finetuned_input_output_str, '\n'.join([example['output'] for example in validation_dataset]))


In [81]:
print('original_scores:', original_scores, '\nfinetuned_scores:', finetuned_scores, '\npeft_finetuned_scores:', peft_finetuned_scores)


original_scores: {'rouge1': Score(precision=0.0072645990500139705, recall=0.8387096774193549, fmeasure=0.014404432132963989), 'rouge2': Score(precision=0.003912071535022355, recall=0.45652173913043476, fmeasure=0.007757665312153676), 'rougeL': Score(precision=0.005495017230138772, recall=0.6344086021505376, fmeasure=0.010895660203139427)} 
finetuned_scores: {'rouge1': Score(precision=0.006053832541678308, recall=0.8904109589041096, fmeasure=0.012025901942645696), 'rouge2': Score(precision=0.0031669150521609537, recall=0.4722222222222222, fmeasure=0.006291635825314582), 'rougeL': Score(precision=0.004097978951289932, recall=0.6027397260273972, fmeasure=0.008140610545790935)} 
peft_finetuned_scores: {'rouge1': Score(precision=0.0072645990500139705, recall=0.8387096774193549, fmeasure=0.014404432132963989), 'rouge2': Score(precision=0.003912071535022355, recall=0.45652173913043476, fmeasure=0.007757665312153676), 'rougeL': Score(precision=0.005495017230138772, recall=0.6344086021505376, f