# PEFT Fine-Tune a Generative AI Model

In this notebook, we will perform PEFT fine-tuning with LoRA (Low-Rank Adaptation of Large Language Models), evaluate the resulting model, and look at the benefits of PEFT.

<a name='1'></a>
## 1 - Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Required Dependencies

In [1]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()



<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

#### 1.2.1 - Model Selection

Many models could be used for this task. `google/flan-t5-base` is a strong choice when you need a general-purpose, multi-task model that can also summarize.


For this task we are going to use the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 13k+ human dialogues, each paired with a summary and a topic.

In [4]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

We will load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that we use the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type in which the model weights are loaded.

- torch_dtype = torch.bfloat16 (brain floating point) is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.
- AutoModelForSeq2SeqLM is used to load any seq2seq (or encoder-decoder) architecture model, like T5 and BART, while AutoModelForCausalLM is used for auto-regressive language models like all the GPT models.

In [9]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Let's pull out the number of model parameters and find out how many of them are trainable.

In [10]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.0%
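
As a rough sanity check on why the dtype matters, the parameter count printed above can be turned into a weight-memory estimate. This is a back-of-the-envelope sketch: it counts only the weights (no activations, gradients, or optimizer state) and assumes 4 bytes per `float32` value and 2 bytes per `bfloat16` value. `weight_memory_mib` is a helper made up for this illustration.

```python
# Parameter count reported by the cell above for google/flan-t5-base.
NUM_PARAMS = 247_577_856

# Bytes used to store one value in each dtype.
BYTES_PER_VALUE = {"float32": 4, "bfloat16": 2}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Estimate weight storage in MiB (weights only; illustrative helper)."""
    return num_params * BYTES_PER_VALUE[dtype] / 2**20


for dtype in BYTES_PER_VALUE:
    print(f"{dtype}: {weight_memory_mib(NUM_PARAMS, dtype):.0f} MiB")
```

Under these assumptions, loading in `bfloat16` halves the memory needed for the weights compared with `float32`.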


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Let's test the model with zero-shot inference. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates the model can be fine-tuned to the task at hand.

In [12]:
index = 100

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = ('-'.join('' for x in range(100)))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: OK, that's a cut! Let's start from the beginning, everyone.
#Person2#: What was the problem that time?
#Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation.
#Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would be one of both anger and sadness?
#Person1#: At this point, no. I think he would react the way most guys would, and then later on, we would see his real feelings.
#Person2#: I'm not so sure about that.
#Person1#: Let's try it my way, and you can see how you feel when you're saying your lines. After that, if it still doesn't feel right, we can try something else.

Summary:

------------------

<a name='2.1'></a>
## 2 - Preprocess the Dialog-Summary Dataset

We need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation.` to each dialogue and append `Summary: ` after it; the reference summary is the expected response.

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).
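
To make the template concrete before tokenizing, here is a minimal sketch of how one prompt-response pair is assembled. The two template strings match the ones used by `tokenize_function` in the cell that follows; `build_prompt` and the toy dialogue and summary are made up for this illustration and are not taken from the dataset.

```python
# Same template strings as tokenize_function in this notebook.
START_PROMPT = "Summarize the following conversation.\n\n"
END_PROMPT = "\n\nSummary: "


def build_prompt(dialogue: str) -> str:
    """Wrap one dialogue in the instruction template (illustrative helper)."""
    return START_PROMPT + dialogue + END_PROMPT


# Made-up dialogue and summary, not taken from DialogSum.
toy_dialogue = "#Person1#: Is the report ready?\n#Person2#: I will send it tonight."
toy_summary = "#Person2# will send #Person1# the report tonight."

# The model input is the prompt; the label is the summary on its own.
print("MODEL INPUT:")
print(build_prompt(toy_dialogue))
print("EXPECTED RESPONSE:")
print(toy_summary)
```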

In [13]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# The dataset contains 3 different splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])


To save some time in the lab, you will subsample the dataset:

In [15]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)


Check the shapes of all three parts of the dataset:

In [16]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (125, 2)
Validation: (5, 2)
Test: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
})


The output dataset is ready for fine-tuning.
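
The row counts printed above follow directly from keeping only the examples whose index is a multiple of 100. The short check below reproduces them from the original split sizes; `subsampled_rows` is a helper made up for this illustration.

```python
def subsampled_rows(num_rows: int, step: int = 100) -> int:
    """Count the indices in [0, num_rows) that are multiples of `step`."""
    return len(range(0, num_rows, step))


# Split sizes reported when the dataset was loaded in section 1.2.
original_sizes = {"train": 12460, "validation": 500, "test": 1500}

for split, num_rows in original_sizes.items():
    print(f"{split}: {num_rows} -> {subsampled_rows(num_rows)}")
```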

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

### **Explanation of PEFT, LoRA, and Prompt Tuning**  

#### **1️⃣ What is PEFT?**  
PEFT stands for **Parameter-Efficient Fine-Tuning**. It’s a broad category of techniques that fine-tune large models **without modifying all of their parameters**. PEFT methods aim to make fine-tuning more efficient by **reducing computational and memory requirements**.  

PEFT includes:  
- **LoRA (Low-Rank Adaptation)**  
- **Prompt tuning** (not to be confused with prompt engineering!)  



#### **2️⃣ What is LoRA?**  
**LoRA (Low-Rank Adaptation)** is a popular PEFT method that **fine-tunes large language models (LLMs) with significantly lower compute costs**.  

Here’s how it works:  
- Instead of **modifying** all parameters of the LLM during fine-tuning, LoRA **adds small trainable layers (LoRA adapters) to specific parts of the model** (e.g., attention layers).  
- These LoRA adapters are **much smaller** than the full LLM (typically only a few MBs, while LLMs are in GBs or TBs).  
- This approach allows fine-tuning **without changing the original LLM weights**.  

✅ **Key Benefit:** The original model stays the same, and **only the small LoRA adapter is trained**.  



#### **3️⃣ LoRA Adapters at Inference Time**  
- Once fine-tuning is complete, the trained **LoRA adapter is stored separately** from the original model.  
- At inference time, the **LoRA adapter is merged back into the original LLM** to handle requests.  
- **Advantage:** Multiple LoRA adapters can reuse the same LLM, meaning different tasks or use cases can run efficiently without loading multiple full models into memory.  

🔹 **Example:**  
Imagine you have an LLM serving multiple clients, each needing slightly different fine-tuning (e.g., a legal chatbot vs. a medical chatbot).  
- Instead of fine-tuning **and storing separate full models**, you **store only small LoRA adapters** for each client and reuse the original LLM.  
- This dramatically **reduces memory usage and compute costs**, especially on AWS **SageMaker Endpoints** where multiple tasks are served.  
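
To put rough numbers on the multi-client example, the sketch below compares storing one full model per client with storing one shared base model plus one adapter per client. It uses the parameter counts this notebook prints for FLAN-T5 base and for the LoRA adapter configured in section 3.1, and it assumes 2 bytes per parameter (`bfloat16`). `storage_mib` is a hypothetical helper, and real checkpoint files carry some extra overhead.

```python
BYTES_PER_PARAM = 2          # assumption: bfloat16 storage
BASE_PARAMS = 247_577_856    # base model size printed in section 1.2
ADAPTER_PARAMS = 3_538_944   # LoRA adapter size printed in section 3.1


def storage_mib(num_clients: int) -> tuple[float, float]:
    """Return (full copies, shared base + adapters) in MiB for N clients."""
    full_copies = num_clients * BASE_PARAMS * BYTES_PER_PARAM
    shared = (BASE_PARAMS + num_clients * ADAPTER_PARAMS) * BYTES_PER_PARAM
    return full_copies / 2**20, shared / 2**20


for clients in (1, 2, 10):
    full, shared = storage_mib(clients)
    print(f"{clients:>2} clients: full copies {full:,.0f} MiB, "
          f"base + adapters {shared:,.0f} MiB")
```

With a single client the adapter adds a little storage on top of the base model; the saving appears as soon as a second client shares the same base.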


<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*rOW5plKBuMlGgpD0SO8nZA.png" width="600"/>

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

We need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. By using PEFT/LoRA, we are freezing the underlying LLM and only training the adapter.



### **1️⃣ `r` (Rank of Adaptation)**  
- **Definition:** This is the rank of the low-rank matrices used in the adaptation.
- **Explanation:** LoRA keeps each targeted weight matrix frozen and learns its update as the product of two much smaller matrices (a low-rank approximation of the update). The **rank (`r`)** is the inner dimension shared by those two matrices. A higher rank can capture more complex changes but also increases memory and computation.
- **Typical values:** Values like `16`, `32`, or `64` are common choices, balancing model capacity and efficiency.
- **In this case:** `r=32` means the rank of the low-rank matrix will be 32.
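
To make the effect of `r` concrete, the sketch below counts the values in the two low-rank factors for a single square projection and compares them with the size of the full matrix. The hidden size of 768 is an assumption about FLAN-T5 base, and `lora_params` is a helper made up for this illustration.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Values in the two LoRA factors: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


D = 768          # assumption: hidden size of FLAN-T5 base
full = D * D     # values in one full (D x D) projection matrix

for r in (8, 16, 32, 64):
    n = lora_params(D, D, r)
    print(f"r={r:>2}: {n:>6} trainable values "
          f"({100 * n / full:.1f}% of the {full} in the full matrix)")
```

The count grows linearly with `r`, which is why the rank is the main knob for trading adapter capacity against size.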



### **2️⃣ `lora_alpha` (Scaling Factor)**  
- **Definition:** A scaling factor applied to the low-rank matrices.
- **Explanation:** The low-rank update is multiplied by **`lora_alpha / r`** before it is added to the output of the frozen layer, which controls the magnitude of the adaptation. For a fixed `r`, larger values of `lora_alpha` make the LoRA adaptation more influential, while smaller values limit the influence.
- **Typical values:** It's often chosen based on experimentation, and values like `16`, `32`, or `64` are common.
- **In this case:** `lora_alpha=32` with `r=32` gives a scale of `32 / 32 = 1`, so the low-rank update is added without extra amplification.
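
The interaction between `r` and `lora_alpha` is easier to see in a tiny forward pass. The sketch below uses plain Python lists and toy 3-dimensional matrices; it assumes the standard LoRA formulation in which the update is scaled by `lora_alpha / r`. The function names and all numbers are made up for this illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x); W is frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: 3-dimensional input and output, rank 2.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weights
A = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]                   # r x d_in
B = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]                 # d_out x r
x = [2.0, 4.0, 6.0]

print(lora_forward(x, W, A, B, lora_alpha=2, r=2))  # scale 1.0 -> [3.0, 6.0, 6.0]
print(lora_forward(x, W, A, B, lora_alpha=4, r=2))  # scale 2.0 -> [4.0, 8.0, 6.0]
```

Doubling `lora_alpha` at a fixed rank doubles the contribution of the adapter while the frozen part of the output stays the same.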



### **3️⃣ `target_modules` (Target Modules to Adapt)**  
- **Definition:** Specifies which parts of the model should be adapted with LoRA.
- **Explanation:** In LoRA, you typically apply the adaptation to specific layers of the model, often the attention projections. The module names depend on the architecture: T5 calls its query and value projections `q` and `v`, while many other models call them `q_proj` and `v_proj`. These projections are a common choice because they control the flow of information between different tokens.
- **In this case:** `target_modules=["q", "v"]` means that LoRA will be applied to the **query projection** and **value projection** modules in the attention mechanism.
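
Target names have to match the module names of the model being adapted. The sketch below illustrates this with a simplified suffix match over a few module names written in the T5 layout; the names are assumed for illustration, `is_target` is a hypothetical helper, and the actual matching logic inside the PEFT library may differ in detail.

```python
def is_target(module_name: str, target_modules: list[str]) -> bool:
    """Simplified check: the last path component must equal a target name."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )


# Assumed examples of T5-style module names (not read from a real model).
module_names = [
    "encoder.block.0.layer.0.SelfAttention.q",
    "encoder.block.0.layer.0.SelfAttention.k",
    "encoder.block.0.layer.0.SelfAttention.v",
    "encoder.block.0.layer.0.SelfAttention.o",
    "decoder.block.0.layer.1.EncDecAttention.q",
    "decoder.block.0.layer.1.EncDecAttention.v",
]

t5_matches = [n for n in module_names if is_target(n, ["q", "v"])]
other_matches = [n for n in module_names if is_target(n, ["q_proj", "v_proj"])]

print("matched with ['q', 'v']:", t5_matches)
print("matched with ['q_proj', 'v_proj']:", other_matches)
```

On names like these, `["q", "v"]` selects the query and value projections, while `["q_proj", "v_proj"]` selects nothing. Printing the names of your own model (for example with `model.named_modules()`) is a safe way to confirm which strings to pass.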



### **4️⃣ `lora_dropout` (Dropout Rate for LoRA)**  
- **Definition:** The dropout probability applied inside the LoRA branch.
- **Explanation:** Dropout is a regularization technique that randomly "drops" (sets to zero) a proportion of units during training to prevent overfitting. The **`lora_dropout`** applies dropout to the input of the low-rank branch, before it passes through the LoRA matrices; the frozen base layer is not affected. This helps the adapter generalize better during fine-tuning, particularly when training on limited data.
- **Typical values:** Values like `0.1` or `0.2` are often used.
- **In this case:** `lora_dropout=0.1` means each input activation of the LoRA branch has a 10% chance of being zeroed during fine-tuning. Dropout is inactive at inference time.
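
For readers who have not seen dropout written out, here is a minimal pure-Python sketch of the usual "inverted dropout" scheme: each value is zeroed with probability `p`, and the survivors are rescaled by `1 / (1 - p)` so the expected value is unchanged. It illustrates the general technique only and is not the code PEFT runs; `dropout` is a helper made up for this example and expects `p < 1`.

```python
import random


def dropout(values, p, rng):
    """Zero each value with probability p and rescale survivors by 1 / (1 - p)."""
    if p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)   # seeded so the run is repeatable
x = [1.0] * 10           # toy activations entering the LoRA branch

print(dropout(x, p=0.1, rng=rng))  # training: some entries may be zeroed
print(dropout(x, p=0.0, rng=rng))  # p=0 behaves like inference: unchanged
```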



### **5️⃣ `bias` (Bias in Adaptation)**  
- **Definition:** Controls which bias terms are trained along with the LoRA matrices.
- **Explanation:** Many neural network layers have bias terms (added to the layer output). This parameter selects whether those biases stay frozen or are updated during fine-tuning.
- **Options:** 
  - `"none"`: No bias terms are trained.
  - `"all"`: All bias terms in the model are trained.
  - `"lora_only"`: Only the bias terms of the layers that receive LoRA adapters are trained.
- **In this case:** `bias="none"` means **all bias terms stay frozen**.


In [17]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Let's add the LoRA adapter layers/parameters to the original LLM so they can be trained.

In [18]:
peft_model = get_peft_model(original_model, 
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.4092820552029972%
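
The trainable count above can be reproduced by hand from the LoRA configuration. The sketch below assumes the FLAN-T5 base layout: hidden size 768, 12 encoder and 12 decoder layers, one attention block per encoder layer and two (self-attention and cross-attention) per decoder layer, and square 768 x 768 query and value projections. These architecture figures are assumptions stated here for the arithmetic, not values read from the model.

```python
D_MODEL = 768                 # assumption: hidden size of FLAN-T5 base
RANK = 32                     # r from lora_config
ENCODER_LAYERS = 12           # assumption
DECODER_LAYERS = 12           # assumption
BASE_PARAMS = 247_577_856     # printed in section 1.2

# One attention block per encoder layer, two per decoder layer.
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2

# LoRA is attached to the q and v projections of every attention block.
adapted_modules = attention_blocks * 2

# Each adapted module gets A (RANK x D_MODEL) and B (D_MODEL x RANK).
params_per_module = RANK * D_MODEL + D_MODEL * RANK

trainable = adapted_modules * params_per_module
total = BASE_PARAMS + trainable

print(f"adapted modules: {adapted_modules}")
print(f"trainable parameters: {trainable}")
print(f"all parameters: {total}")
print(f"percentage trainable: {100 * trainable / total:.4f}%")
```

Under these assumptions the arithmetic gives 72 adapted modules and matches the 3,538,944 trainable parameters and 251,116,800 total parameters reported by the cell.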


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [19]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
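    num_train_epochs=1,  # has no effect here because max_steps is set
    logging_steps=1,
    max_steps=1,  # overrides num_train_epochs: train for a single optimizer step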
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)
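
Because `max_steps=1` takes precedence over `num_train_epochs`, this run sees only a small part of the subsampled training set. The estimate below assumes the `Trainer` default batch size of 8 per device, a single device, and no gradient accumulation; `auto_find_batch_size=True` may lower the batch size if memory runs out, so treat the numbers as illustrative.

```python
import math

TRAIN_EXAMPLES = 125   # rows in the subsampled training split
BATCH_SIZE = 8         # assumption: default per-device batch size, one device
MAX_STEPS = 1          # value passed to TrainingArguments

# Steps a full epoch would take (the last partial batch is kept).
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)

# Examples actually processed when training stops after MAX_STEPS.
examples_seen = min(MAX_STEPS * BATCH_SIZE, TRAIN_EXAMPLES)

print(f"steps in one full epoch: {steps_per_epoch}")
print(f"examples seen with max_steps={MAX_STEPS}: {examples_seen}")
```

Raising `max_steps`, or removing it so that `num_train_epochs` applies, would be the way to train on more of the data.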

In [20]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



| Step | Training Loss |
|------|---------------|
| 1    | 51.25         |


('./peft-dialogue-summary-checkpoint/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint/spiece.model',
 './peft-dialogue-summary-checkpoint/added_tokens.json',
 './peft-dialogue-summary-checkpoint/tokenizer.json')

Let's prepare this model by adding the trained adapter to the original FLAN-T5 model. We set `is_trainable=False` because the plan is only to perform inference with this PEFT model. If we were preparing the model for further training, we would set `is_trainable=True`.

In [21]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       './peft-dialogue-summary-checkpoint/', 
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)



Let's check the number of trainable parameters:

In [22]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 251116800
percentage of trainable model parameters: 0.0%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Let's run inference on the same example as in section [1.3](#1.3).


In [31]:
index = 100
dialogue = dataset['test'][index]['dialogue']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids


peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)


dash_line = ('-'.join('' for x in range(100)))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{baseline_human_summary}\n')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: OK, that's a cut! Let's start from the beginning, everyone.
#Person2#: What was the problem that time?
#Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation.
#Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would be one of both anger and sadness?
#Person1#: At this point, no. I think he would react the way most guys would, and then later on, we would see his real feelings.
#Person2#: I'm not so sure about that.
#Person1#: Let's try it my way, and you can see how you feel when you're saying your lines. After that, if it still doesn't feel right, we can try something else.

Summary: 
------------------

# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>