# Parameter-Efficient Fine-Tuning (PEFT) Techniques for Large Language Models (LLMs)

In this notebook, we will explore various techniques for fine-tuning large language models in a parameter-efficient way (PEFT). These methods allow us to adapt pre-trained language models to new tasks without updating all the parameters of the model, which is computationally expensive and requires a large amount of data.

PEFT strategies are crucial in scenarios where computational resources are limited, or when working with large models like GPT, BERT, or T5. We'll discuss the following techniques:

- **LoRA (Low-Rank Adaptation)**
- **Prefix Tuning**

Let's dive in!

In [None]:
# Installing the necessary libraries
!pip install -q transformers datasets
# install peft from github
!pip install -q git+https://github.com/huggingface/peft

## Low-Rank Adaptation (LoRA)
LoRA is a parameter-efficient fine-tuning technique that freezes the pre-trained weights and learns the weight updates as a product of two small, low-rank matrices. Approximating the updates in this low-dimensional subspace, rather than with full-rank matrices, makes training feasible with far fewer resources.
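
To make the low-rank idea concrete, here is a minimal pure-Python sketch of a LoRA-style layer. The matrix sizes, the values, and the helper names (`matvec`, `lora_forward`) are illustrative assumptions and not part of the PEFT library. The frozen weight `W` is never modified; only the two small matrices `A` and `B` would be trained.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute (W + (alpha / r) * B @ A) @ x without modifying W."""
    base = matvec(W, x)                 # output of the frozen layer
    low_rank = matvec(B, matvec(A, x))  # project down to rank r, then back up
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

# Hypothetical 4x4 frozen weight (identity) and a rank-1 update.
d, r, alpha = 4, 1, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
A = [[0.5, 0.0, 0.0, 0.0]]        # shape (r, d)
B = [[0.0], [0.0], [0.0], [0.0]]  # shape (d, r), zero-initialised

x = [1.0, 2.0, 3.0, 4.0]
y_start = lora_forward(W, A, B, x, alpha, r)    # equals W @ x because B is zero

B[1][0] = 1.0                                   # pretend training changed B
y_trained = lora_forward(W, A, B, x, alpha, r)

full_params = d * d          # parameters in a full-rank update
lora_params = r * d + d * r  # parameters in the two low-rank factors
print(y_start, y_trained, full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one, and the gap between `full_params` and `lora_params` grows quickly as the layer dimension increases.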

### Data loading and preprocessing

In this example, we will use the SAMSum dataset, which consists of ~16k conversations. Each conversation comes with a summary. The objective is to fine-tune a model that can generate a summary when given a dialogue.

In [None]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("samsum", trust_remote_code=True)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

print("Sample example:")
print(dataset['train'][0])

To train a model, the text should be converted to machine-readable units, which are the token IDs. This can be done by using a tokenizer.
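
As a rough illustration of what "converting text to token IDs" means, the sketch below uses a toy whitespace tokenizer. The vocabulary and the helpers `toy_encode` and `toy_decode` are hypothetical; real tokenizers, including the one loaded in this notebook, rely on learned subword vocabularies rather than whole words.

```python
# Hypothetical toy vocabulary; real tokenizers use learned subword vocabularies.
toy_vocab = {"<pad>": 0, "<unk>": 1, "hello": 2, "how": 3, "are": 4, "you": 5}

def toy_encode(text, vocab):
    """Map each lowercased, whitespace-separated word to an ID; unknown words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def toy_decode(ids, vocab):
    """Map IDs back to their vocabulary entries."""
    inverse = {idx: token for token, idx in vocab.items()}
    return " ".join(inverse[i] for i in ids)

ids = toy_encode("Hello how are you today", toy_vocab)
print(ids)
print(toy_decode(ids, toy_vocab))
```

The word "today" is not in the toy vocabulary, so it collapses to the unknown token. Subword tokenizers avoid most of this loss by splitting rare words into smaller known pieces.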

In this example, we'll use a small model from BigScience for demonstration.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="bigscience/mt0-small"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
from datasets import concatenate_datasets
import numpy as np

# Here we tokenize the dialogues, which is the input of our model
# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# use the 85th percentile of the input lengths as the maximum, for better utilization
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# Here we tokenize the summary, which should be the output of our model
# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# use the 90th percentile of the target lengths as the maximum, for better utilization
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")
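
The reason for choosing a percentile instead of the absolute maximum can be seen on a small made-up example. The sketch below is an assumption-laden illustration: the `percentile` helper mirrors the linear interpolation that `np.percentile` uses by default, and the token counts are invented.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, for q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Hypothetical token counts for ten dialogues, including one extreme outlier.
lengths = [40, 55, 60, 62, 70, 75, 80, 90, 111, 900]

cutoff = int(percentile(lengths, 85))
truncated = [n for n in lengths if n > cutoff]

# Padding tokens needed when every sequence is padded to a fixed length.
padding_at_max = sum(max(lengths) - n for n in lengths)
padding_at_cutoff = sum(max(cutoff - n, 0) for n in lengths)

print(f"cutoff={cutoff}, truncated={truncated}")
print(f"padding tokens: {padding_at_max} at max length vs {padding_at_cutoff} at the cutoff")
```

A single very long dialogue would otherwise force every example to be padded to its length. The percentile cutoff trades a few truncated examples for much less padding.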

We will now preprocess the data. It's handy to save the preprocessed data to disk so that this step does not have to be repeated.

In [None]:
def preprocess_function(sample,padding="max_length"):
    # add a task prefix to each input, following the T5 convention
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    # tokenize the inputs (the dialogues)
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # tokenize the targets (the summaries) with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")
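
The `-100` replacement in `preprocess_function` matters because padded label positions should not contribute to the loss. The following sketch shows the effect with made-up numbers. `mask_padding` and `mean_nll` are hypothetical helpers, and the probabilities stand in for what a model might assign to the correct token at each position.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding token IDs in the labels with the ignore index."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]

def mean_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not ignored.

    token_probs[i] is the (hypothetical) probability assigned to the correct
    token at position i.
    """
    kept = [-math.log(p) for p, label in zip(token_probs, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)

labels = mask_padding([17, 42, 1, 0, 0], pad_token_id=0)
probs = [0.5, 0.5, 0.5, 0.01, 0.01]  # the last two positions are padding
loss = mean_nll(probs, labels)
print(labels, loss)
```

Without the mask, the two padding positions with very low probability would dominate the average even though they carry no information about the summary.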

### Model loading and training

Now that we have our dataset ready, we can start the fine-tuning process. First we need to load the base model.

In [None]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType

# When using a large model, you can quantize it to save memory, for example with
# `load_in_4bit=True` or `load_in_8bit=True` (newer transformers versions expect these
# flags inside a `BitsAndBytesConfig` passed as `quantization_config`).
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

To fine-tune a model with PEFT, you define a fine-tuning configuration and wrap the model in a PEFT object.

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Define LoRA Config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)
# prepare the int-8 model for training when you use a quantized model
# model = prepare_model_for_kbit_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


The output of `print_trainable_parameters()` shows that only a small fraction of the parameters is being trained, which saves a lot of memory, especially for bigger models!
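
The trainable share can also be estimated by hand. The sketch below assumes that each adapted weight matrix of shape `(d_out, d_in)` receives two rank-`r` factors and that the fraction is taken relative to all parameters, including the adapter. The model size and layer shapes are invented for illustration and do not describe the model used above.

```python
def lora_trainable_fraction(frozen_params, adapted_shapes, r):
    """Return (trainable, fraction) when each adapted (d_out, d_in) matrix gets rank-r factors."""
    trainable = sum(r * (d_in + d_out) for d_out, d_in in adapted_shapes)
    return trainable, trainable / (frozen_params + trainable)

# Hypothetical model: 100M frozen parameters and 24 adapted 512x512 projections.
trainable, fraction = lora_trainable_fraction(
    frozen_params=100_000_000,
    adapted_shapes=[(512, 512)] * 24,
    r=16,
)
print(f"trainable params: {trainable:,} ({fraction:.2%} of all parameters)")
```

Increasing `r` or adding more entries to `target_modules` raises the trainable count linearly, which is a useful sanity check against the printed summary.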

Now we create a data collator that takes care of padding the data and creating batches.

In [None]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
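
What the collator does with `pad_to_multiple_of=8` and `label_pad_token_id=-100` can be sketched in plain Python. The helper `pad_batch` is hypothetical, right-padding is assumed, and `0` stands in for the tokenizer's pad token ID.

```python
def pad_batch(sequences, pad_value, multiple_of=8):
    """Right-pad every sequence to the longest length, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple_of) * multiple_of  # ceiling division
    return [seq + [pad_value] * (target - len(seq)) for seq in sequences]

input_ids = [[5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16]]
labels = [[21, 22], [23, 24, 25]]

padded_inputs = pad_batch(input_ids, pad_value=0)  # 0 stands in for the pad token ID
padded_labels = pad_batch(labels, pad_value=-100)  # -100 is ignored by the loss

print([len(seq) for seq in padded_inputs])
print(padded_labels)
```

Padding per batch rather than to a global maximum keeps batches short, and rounding up to a multiple of 8 is commonly done because it tends to suit GPU tensor cores.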

Lastly, we define the hyperparameters of our training process

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir="tutorial"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

Now we can finally train the model

In [None]:
trainer.train()

### Model saving and evaluation

Make sure you save your model and reload it to check whether everything works as expected!

In [None]:
# Save our LoRA model & tokenizer results
peft_model_id="path_to_trained_model"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)
# if you also want to save the base model, call:
# trainer.model.base_model.save_pretrained(peft_model_id)

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load peft config for pre-trained checkpoint etc.
peft_model_id = "path_to_trained_model"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

print("Peft model loaded")

Try it with one example from the dataset to see if it works

In [None]:
# use the first sample of the test set
sample = dataset['test'][0]

# use the same "summarize: " prefix that was added during preprocessing
input_ids = tokenizer("summarize: " + sample["dialogue"], return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=10, do_sample=True, top_p=0.9)
print(f"input sentence: {sample['dialogue']}\n{'---'* 20}")

print(f"summary:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")
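
The generation call above samples with `top_p=0.9`. As a rough sketch of what nucleus (top-p) sampling does, the code below keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`, renormalises, and samples from that set. The helpers `top_p_filter` and `sample` and the probabilities are hypothetical; the library's own implementation works on logits and batches.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}  # renormalised distribution

def sample(filtered, rng):
    """Draw one token ID from the filtered distribution."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
nucleus = top_p_filter(probs, top_p=0.9)
token = sample(nucleus, random.Random(0))
print(sorted(nucleus), token)
```

Because sampling is random, repeated runs of `model.generate` with `do_sample=True` can yield different summaries. Setting `do_sample=False` gives deterministic greedy decoding instead.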

That's it for the LoRA fine-tuning!

## Prefix Tuning

Prefix tuning lets the model learn a sequence of continuous, task-specific vectors, the prefix, which is added to the beginning of the input. Only the prefix parameters are optimized while the rest of the model stays frozen, which makes training efficient and greatly reduces memory and computational costs.
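
One way to picture the prefix is as extra key and value vectors that every attention layer can attend to. The sketch below is a simplified, single-query illustration under that assumption. All numbers and the helpers `softmax` and `attend` are made up, and real implementations handle multiple heads, layers, and batches.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return output, weights

# Keys and values produced by the frozen model for two input tokens (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 1.0], [2.0, 2.0]]

# Trainable prefix: one virtual token with its own key and value.
prefix_keys = [[1.0, 1.0]]
prefix_values = [[10.0, -10.0]]

query = [1.0, 0.0]
plain_output, plain_weights = attend(query, keys, values)
prefixed_output, prefixed_weights = attend(query, prefix_keys + keys, prefix_values + values)
print(plain_output, prefixed_output)
```

Only `prefix_keys` and `prefix_values` would receive gradient updates. The model's own keys and values are untouched, yet the attention output shifts because part of the attention mass now goes to the prefix.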

### Data loading and preprocessing

We will use the Financial PhraseBank dataset, which contains sentiment labels for financial news sentences.


In [None]:
from datasets import load_dataset

dataset = load_dataset("financial_phrasebank", "sentences_allagree", trust_remote_code=True)
dataset = dataset["train"].train_test_split(test_size=0.1)
dataset["validation"] = dataset["test"]
del dataset["test"]

classes = dataset["train"].features["label"].names
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["label"]]},
    batched=True,
    num_proc=1,
)

dataset["train"][0]

Again, we preprocess the data using the tokenizer. In this example, the t5-large model is used.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="t5-large"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

text_column = "sentence"
label_column = "text_label"
max_length = 128

def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = tokenizer(targets, max_length=2, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

Now that the preprocessing is done, we create a data loader object to forward to the model for training

In [None]:
from transformers import default_data_collator, get_linear_schedule_with_warmup
from torch.utils.data import DataLoader

batch_size = 8

train_dataloader = DataLoader(
    processed_datasets["train"], shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
)
eval_dataloader = DataLoader(processed_datasets["validation"], collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)

### Model loading and training

After preparing the data, we can load the model and start initializing the training configuration

In [None]:
from peft import get_peft_model, get_peft_model_state_dict, PrefixTuningConfig, TaskType
from tqdm import tqdm
import torch
import os

model_id="t5-large"

peft_config = PrefixTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, num_virtual_tokens=20)

model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

Set up your optimizer and learning rate scheduler.

In [None]:
lr = 1e-2
num_epochs = 5

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)
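
The shape of a linear schedule with warmup can be sketched in a few lines. The function below is a hand-written approximation of the behaviour, not the library code, and the step counts are hypothetical.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor

# Hypothetical run: 100 optimizer steps, no warmup, base learning rate 1e-2.
schedule = [linear_schedule_lr(step, 1e-2, 0, 100) for step in (0, 50, 100)]
print(schedule)

# The same run with 10 warmup steps starts at zero and peaks at step 10.
warmup_schedule = [linear_schedule_lr(step, 1e-2, 10, 100) for step in (0, 5, 10)]
print(warmup_schedule)
```

With `num_warmup_steps=0`, as in the cell above, training starts at the full learning rate and decays linearly until the last step.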

Make sure the model is set to the right device and start training the model

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = model.to(device)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
        )

    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")
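
The perplexity values printed by the loop are simply the exponential of the mean loss. A small standalone sketch with hypothetical per-batch losses:

```python
import math

def perplexity(batch_losses):
    """Exponentiate the mean of per-batch average cross-entropy losses."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)

# Hypothetical per-batch losses from one epoch.
losses = [math.log(4.0), math.log(4.0), math.log(4.0)]
ppl = perplexity(losses)
print(ppl)
```

A perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens. Note that averaging per-batch means, as the loop does, weights every batch equally even if the last batch is smaller.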

After training the model, check how well it performs on the validation set.

In [None]:
correct = 0
total = 0
for pred, true in zip(eval_preds, dataset["validation"]["text_label"]):
    if pred.strip() == true.strip():
        correct += 1
    total += 1
accuracy = correct / total * 100
print(f"{accuracy=} % on the evaluation dataset")
print(f"{eval_preds[:10]=}")
print(f"{dataset['validation']['text_label'][:10]=}")

### Model saving and evaluation

Be sure to again save your newly trained model and reload it to check if it works properly!

You can either push it to the huggingface hub or save it locally.

In [None]:
# pushing it to the huggingface hub
from huggingface_hub import notebook_login
from peft import PeftConfig, PeftModel

notebook_login()

peft_model_id = "your-name/t5-large_PREFIX_TUNING_SEQ2SEQ"
model.push_to_hub(peft_model_id, use_auth_token=True)

# after pushing it, you can check whether you can load the PEFT-model:
# load the base model first, then attach the pushed prefix weights to it
config = PeftConfig.from_pretrained(peft_model_id)
base_model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
peft_model = PeftModel.from_pretrained(base_model, peft_model_id)

In [None]:
# saving it locally

# save the fine-tuned parameters from training
model.save_pretrained("path_to_save_directory")

# load the base model, which in this case was t5-large
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# load the PEFT model from the saved weights in "path_to_save_directory"
peft_model = PeftModel.from_pretrained(base_model, "path_to_save_directory")

Check with an example whether the reloaded model works as expected.

In [None]:
# put the reloaded model in evaluation mode (disables dropout) and move it to the device
peft_model.eval()
peft_model = peft_model.to(device)

inputs = tokenizer(
    "The Lithuanian beer market made up 14.41 million liters in January , a rise of 0.8 percent from the year-earlier figure , the Lithuanian Brewers ' Association reporting citing the results from its members .",
    return_tensors="pt",
)

with torch.no_grad():
    inputs = {k: v.to(device) for k, v in inputs.items()}
    # generate with the reloaded PEFT model to verify that saving and loading worked
    outputs = peft_model.generate(input_ids=inputs["input_ids"], max_new_tokens=10)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

## Conclusion

In this notebook, we explored two PEFT techniques for fine-tuning large language models. By training only a small number of parameters, these techniques allow us to adapt pre-trained models to new tasks more efficiently, without requiring extensive computational resources or massive amounts of data.
- **LoRA** learns low-rank parameter updates on top of the frozen weights, making training more efficient.
- **Prefix Tuning** optimizes only the prefix parameters, a sequence of continuous task-specific vectors attached to the beginning of the input.

These methods enable us to leverage the power of large models while minimizing the computational cost.