## PEFT (Parameter Efficient Fine-Tuning) flan-T5 LLM model

What is covered?
1. Load flan-t5 model & dialogue-summarization dataset.
2. Paramerter Efficient fine-tuning of flan-T5 model on nvidia A6000 GPU
3. Test inference of Base model and Fine-tuned model
4. Test the fine-tuned model with rough and bleu scores
5. Track the experiment with wandb (weights and biases)
6. Learn to use Paperspace Gradient service to train your model for finetuning

In [1]:
# installing dependencies
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m87.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.3.1
    Uninstalling pip-22.3.1:
      Successfully uninstalled pip-22.3.1
Successfully installed pip-23.3.2
[0mNote: you may need to restart the kernel to use updated packages.
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 0.12.1+cu116 requires torch==1.12.1, but you have torch 1.13.1 which is incompatible.
torchvision 0.13.1+cu116 requires torch==1.12.1, but you have torch 1.13.1 which is incompatible.[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.
[0mNote: you may need to restart the kernel to use updated packages.


In [3]:
# import statements
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer, EarlyStoppingCallback
import torch
import time
import evaluate
import pandas as pd
import numpy as np
import wandb

wandb.login()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

## 1. Load the flan-t5 model and dialogue summarization dataset
1. Check the datatype of model's tensor
2. Check where exactly the model is loaded (cpu or gpu)
3. Redo the datasplits for balalanced & optimum test/validation/test split
4. Tokenize the dataset for training

In [4]:
# load dialogue-summary dataset
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)
#load model and tokenzier
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

# dtype check on model tensor. You could change it to bfloat16 to reduce the memory usage. 
# Note: bfloat16 won't work on Apple Silicon Macs
dtype = next(original_model.parameters()).dtype
print(f"Tensor's dataType -->{dtype}")

#check where the model is loaded (should print either cpu or cuda)
print(f"Model is loaded on -->{next(original_model.parameters()).device}")

Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading and preparing dataset csv/knkarthick--dialogsum to /root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)"config.json";:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Downloading (…)"model.safetensors";:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)ration_config.json";:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)enizer_config.json";:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Downloading (…)"spiece.model";:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)"tokenizer.json";:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)al_tokens_map.json";:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Tensor's dataType -->torch.float32
Model is loaded on -->cpu


### 1.3 Redoing the datasplits for balalanced & optimum test/validation/test split 

In [5]:
#redoing the dataset split as the default one is not balanced for model training
from datasets import load_dataset, concatenate_datasets, DatasetDict

# Combine the splits (train, test, validation)
combined_dataset = concatenate_datasets([dataset["train"], dataset["test"], dataset["validation"]])

# Shuffle the combined dataset
combined_dataset = combined_dataset.shuffle(seed=42)

# Split the dataset into 80% train, 10% test, 10% validation
train_test_split = combined_dataset.train_test_split(test_size=0.20)  # Splitting 20% for test+validation
test_validation_split = train_test_split['test'].train_test_split(test_size=0.5)  # Splitting the 20% into two equal halves

# Creating the final DatasetDict
final_dataset = DatasetDict({
    'train': train_test_split['train'],
    'test': test_validation_split['test'],
    'validation': test_validation_split['train']
})

test_summaries = final_dataset['test']['summary']

### 1.4 Tokenizing the dataset for training

In [6]:
def tokenize_function(examples):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompts = [start_prompt + dialogue + end_prompt for dialogue in examples["dialogue"]]
    model_max_input_length = tokenizer.model_max_length

    # Tokenize the input dialogue text
    tokenized_inputs = tokenizer(prompts, max_length=model_max_input_length, padding="max_length", truncation=True)
    
    # Tokenize the labels for the dialogues
    tokenized_labels = tokenizer(examples["summary"], max_length=model_max_input_length, padding="max_length", truncation=True)

    # We need to replace the labels token ids of padding with -100 so they are not taken into account in the loss computation
    tokenized_labels["input_ids"] = [
        [(label if label != tokenizer.pad_token_id else -100) for label in labels] for labels in tokenized_labels["input_ids"]
    ]

    return {"input_ids": tokenized_inputs["input_ids"], "labels": tokenized_labels["input_ids"]}

# Tokenize the entire dataset
tokenized_datasets = final_dataset.map(tokenize_function, batched=True)

# Remove columns which are not necessary for training
columns_to_remove = ['id', 'topic', 'dialogue', 'summary']
tokenized_datasets = tokenized_datasets.remove_columns(columns_to_remove)

Map:   0%|          | 0/11568 [00:00<?, ? examples/s]

Map:   0%|          | 0/1446 [00:00<?, ? examples/s]

Map:   0%|          | 0/1446 [00:00<?, ? examples/s]

## 2. Perfrom Parameter Efficient Fine Tuning (PEFT)

PEFT, encompassing Low-Rank Adaptation (LoRA) and prompt tuning (distinct from prompt engineering), is primarily associated with LoRA. LoRA enables efficient model fine-tuning, often utilizing minimal computational resources such as a single GPU. The process retains the original Large Language Model (LLM) intact while generating a significantly smaller "LoRA adapter," typically a single-digit percentage of the original LLM's size (measured in MBs compared to GBs). For inference, this adapter is integrated with the original LLM. The advantage of LoRA lies in its ability to use multiple adapters with a single LLM, optimizing memory usage across various tasks and applications.

In [8]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"], # focusing on query and value 
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)
peft_model = get_peft_model(original_model, lora_config)

In [9]:
timestamp = str(int(time.time()))

output_dir = f'./peft-models/peft-dialogue-summary-training-{timestamp}'

# early stopping callback will help to stop the training if no siginficant reduction in error is observed.
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.009)

peft_training_args = TrainingArguments(
    report_to="wandb",
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-4, # Higher learning rate than full fine-tuning.
    num_train_epochs=5,
    logging_steps=100,
    eval_steps=100,
    max_steps=1000,
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end = True,
    gradient_accumulation_steps=2,   
    max_grad_norm=1.0,
    warmup_steps=250, 
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets['validation'],
    callbacks=[early_stopping_callback]
)

In [11]:
run = wandb.init(project='genai-llm', name=f'flant5-PEFT-{timestamp}')

start_time = time.time()
peft_trainer.train()
training_time = time.time() - start_time
run.log({"Training time (seconds)":training_time})
run.log({"Training configuration":peft_training_args.to_dict()})

peft_model_path=f"{output_dir}/final/"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



[34m[1mwandb[0m: Currently logged in as: [33maambekar234[0m. Use [1m`wandb login --relogin`[0m to force relogin


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




Step,Training Loss,Validation Loss
100,2.0985,1.487497
200,1.5553,1.255934
300,1.3635,1.203612
400,1.3164,1.187709
500,1.331,1.177025
600,1.3017,1.162241
700,1.2784,1.161294
800,1.2807,1.153541
900,1.2642,1.152996
1000,1.2857,1.151971


('./peft-models/peft-dialogue-summary-training-1703600695/final/tokenizer_config.json',
 './peft-models/peft-dialogue-summary-training-1703600695/final/special_tokens_map.json',
 './peft-models/peft-dialogue-summary-training-1703600695/final/spiece.model',
 './peft-models/peft-dialogue-summary-training-1703600695/final/added_tokens.json',
 './peft-models/peft-dialogue-summary-training-1703600695/final/tokenizer.json')

In [12]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       peft_model_path,
                                       is_trainable=False)

In [13]:
#let's get inference from original model
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

example_record = 200
dialogue = dataset['test'][example_record]['dialogue']
generation_config = GenerationConfig(max_new_tokens=200, num_beams=1)

print(dialogue)

start_prompt = 'Summarize the following conversation.\n\n'
end_prompt = '\n\nSummary: '
prompt = start_prompt + dialogue + end_prompt


input_ids = tokenizer(prompt, return_tensors='pt').input_ids
output_tokens = original_model.generate(input_ids=input_ids, generation_config = generation_config,)
original_model_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Summary-->")
print(original_model_output)

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.
Summary-->
#Person1#: I'm thinking of upgrading my computer.


In [21]:
# load full fine tuned model we trained in part2
artifact = run.use_artifact('aambekar234/genai-llm/model_artifact:v0', type='model')
artifact_dir = artifact.download()
#let's get inference from original model
fullfinetuned_model = AutoModelForSeq2SeqLM.from_pretrained(artifact_dir)

[34m[1mwandb[0m: Downloading large artifact model_artifact:v0, 947.60MB. 8 files... 
[34m[1mwandb[0m:   8 of 8 files downloaded.  
Done. 0:0:0.0


In [24]:
#lets get inference from peft model
fullfinetuned_model.to("cuda:0")
output_tokens = fullfinetuned_model.generate(input_ids=input_ids.to("cuda:0"), generation_config = GenerationConfig(max_new_tokens=200, num_beams=1))
finetuned_model_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)


#lets get inference from peft model
peft_model.to("cuda:0")
output_tokens = peft_model.generate(input_ids=input_ids.to("cuda:0"), generation_config = GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("#### Human Baseline Summary -->")
print(dataset['test'][example_record]['summary'])
print("#### Summary Generated by original model->")
print(original_model_output)
print("#### Summary Generated by finetuned model->")
print(finetuned_model_output)
print("#### Summary Generated by peft model->")
print(peft_model_output)

#### Human Baseline Summary -->
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
#### Summary Generated by original model->
#Person1#: I'm thinking of upgrading my computer.
#### Summary Generated by finetuned model->
#Person2# wants to upgrade #Person2#'s system and hardware. #Person1# suggests adding a painting program to #Person2#'s software and adding a CD-ROM drive.
#### Summary Generated by peft model->
#Person2# considers upgrading #Person1#'s system and hardware. #Person1# recommends adding a painting program to #Person2#'s software. #Person2# also considers adding a CD-ROM drive.


In [26]:
from tqdm import tqdm

# to save time we will only use 150 items from test split for evaluation
dialogues = final_dataset['test']['dialogue'][:150]
print(len(dialogues))

human_baseline_summaries = final_dataset['test']['dialogue'][:150]
original_model_summaries = []
fullfinetuned_model_smmaries = []
peft_model_summaries = []

# moving model to gpu for faster inference
original_model.to("cuda:0")

for dialogue in tqdm(dialogues, desc="Generating summaries from original & finetuned models..."):
    prompt = f"""
    Summarize the following conversation.

    {dialogue}

    Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config = generation_config)
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config = generation_config)
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
    peft_model_summaries.append(peft_model_text_output)
    
    fullfinetuned_model_outputs = fullfinetuned_model.generate(input_ids=input_ids, generation_config = generation_config)
    fullfinetuned_text_output = tokenizer.decode(fullfinetuned_model_outputs[0], skip_special_tokens=True)
    fullfinetuned_model_smmaries.append(fullfinetuned_text_output)

150


Generating summaries from original & finetuned models...:  31%|███       | 46/150 [01:04<02:37,  1.51s/it]Token indices sequence length is longer than the specified maximum sequence length for this model (727 > 512). Running this sequence through the model will result in indexing errors
Generating summaries from original & finetuned models...: 100%|██████████| 150/150 [03:36<00:00,  1.44s/it]


In [30]:
import evaluate
rouge = evaluate.load('rouge')
human_baseline_summaries = test_summaries[:150]

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)

fullfinetuned_model_results = rouge.compute(
    predictions=fullfinetuned_model_smmaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('Finetuned MODEL:')
print(fullfinetuned_model_results)
print('PEFT MODEL:')
print(peft_model_results)
run.summary["rouge_score"] = peft_model_results

ORIGINAL MODEL:
{'rouge1': 0.22947279572730211, 'rouge2': 0.0781999252930429, 'rougeL': 0.1977396569141013, 'rougeLsum': 0.19679439493636844}
Finetuned MODEL:
{'rouge1': 0.4702793069042287, 'rouge2': 0.21222551380111027, 'rougeL': 0.3730673327044676, 'rougeLsum': 0.3731581482211145}
PEFT MODEL:
{'rouge1': 0.4612038420317723, 'rouge2': 0.20200270894870215, 'rougeL': 0.37124483104115225, 'rougeLsum': 0.370299949862882}


In [31]:
bleu = evaluate.load("bleu")
    
original_model_results = bleu.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries
)

fullfinetuned_model_results = bleu.compute(
    predictions=fullfinetuned_model_smmaries,
    references=human_baseline_summaries
)

peft_model_results = bleu.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('Finetuned MODEL:')
print(fullfinetuned_model_results)
print('PEFT MODEL:')
print(peft_model_results)

run.summary["bleu_score"] = peft_model_results
run.finish()

ORIGINAL MODEL:
{'bleu': 0.05809275269853388, 'precisions': [0.2793687901811806, 0.11613691931540342, 0.061178731582319026, 0.0211978465679677], 'brevity_penalty': 0.721293229102033, 'length_ratio': 0.7537444933920705, 'translation_length': 3422, 'reference_length': 4540}
Finetuned MODEL:
{'bleu': 0.20580074626413755, 'precisions': [0.4581073989854819, 0.25902640560445483, 0.16485139376038396, 0.09170305676855896], 'brevity_penalty': 1.0, 'length_ratio': 1.259251101321586, 'translation_length': 5717, 'reference_length': 4540}
PEFT MODEL:
{'bleu': 0.2036505381322652, 'precisions': [0.4776618775831529, 0.2683025755424863, 0.16356410792721188, 0.08205571150939323], 'brevity_penalty': 1.0, 'length_ratio': 1.1191629955947135, 'translation_length': 5081, 'reference_length': 4540}


0,1
Training time (seconds),▁
eval/loss,█▃▂▂▂▁▁▁▁▁
eval/runtime,▁█▇▆▆▂▂▃▄▅
eval/samples_per_second,█▁▂▃▃▇▇▆▅▄
eval/steps_per_second,█▁▂▃▃▇▆▆▅▃
train/epoch,▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇███
train/global_step,▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇█████
train/learning_rate,▄▇█▇▆▅▄▃▂▁
train/loss,█▃▂▁▂▁▁▁▁▁
train/total_flos,▁

0,1
Training time (seconds),1500.25268
eval/loss,1.15197
eval/runtime,40.8829
eval/samples_per_second,35.369
eval/steps_per_second,4.427
train/epoch,1.38
train/global_step,1000.0
train/learning_rate,0.0
train/loss,1.2857
train/total_flos,1.1130063814656e+16


### Conclusion
With this experiment with validated the efficiency of PEFT and its advantages over full-finetuning of LLMs. Next Article we will explore RLHF techniques for further improving LLMs with help of human feedback. 

**🌟 Connect on LinkedIn!** 

If you've found this content _useful_ and would like to explore more about **data science**, **machine learning**, and related fields, I'd be delighted to see you on my LinkedIn network. I share insights, resources, and the latest trends that could be beneficial for your learning journey.

➤ [**_Follow on LinkedIn_**](https://www.linkedin.com/in/aambekar234/)

_Your support and interaction are always appreciated._

**Best Regards,**
**Abhijeet Ambekar**