# Parameter-efficient fine-tuning with LoRA

### The goal of this project is to first evaluate a pretrained DistilBERT model on the MultiNLI dataset, and then fine-tune it parameter-efficiently with Low-Rank Adaptation (LoRA)

The MultiNLI dataset is a crowd-sourced collection of 433k sentence pairs labeled for entailment, contradiction, or neutrality.  It was also used in the RepEval 2017 shared task at EMNLP. Each data point in MultiNLI has a premise and a hypothesis, and the task is to figure out the relationship between the two. The label can be one of three types:

Entailment (0) – the hypothesis clearly follows from the premise

Neutral (1) – the hypothesis could be true, but it’s not certain

Contradiction (2) – the hypothesis directly contradicts the premise

It’s a straightforward three-class classification task, but still a good challenge for testing how well a model understands language and reasoning.

# 1. Imports

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch
from peft import LoraConfig, get_peft_model, TaskType, AutoPeftModelForSequenceClassification, PeftModel
from sklearn.metrics import f1_score, precision_score, recall_score
import os
from typing import Tuple, Dict
import numpy as np


In [2]:
torch.cuda.is_available()

True

# 2. Load the dataset

## 2.1 Read in the dataset and display the size

In the MultiNLI dataset, there are 2 types of validation sets that can be used:

Validation Matched: This split contains validation examples from the same genres as those seen during training (e.g., government, fiction).

Validation Mismatched: This split includes examples from genres that were not used in training (e.g., face-to-face conversations, letters).

The idea is to test both in-domain (matched) and out-of-domain (mismatched) generalization—so you can see not just how well your model performs on familiar styles of text, but also how it handles new, unseen ones.

In [3]:
# Load the MultiNLI dataset
dataset = load_dataset("multi_nli")

print("Train dataset size:", len(dataset["train"]))
print("Validation (matched) dataset size:", len(dataset["validation_matched"]))
print("Validation (mismatched) dataset size:", len(dataset["validation_mismatched"]))

Train dataset size: 392702
Validation (matched) dataset size: 9815
Validation (mismatched) dataset size: 9832


## 2.2 Show a few example rows from the training set

Recapping labels:

Entailment (0) – the hypothesis clearly follows from the premise

Neutral (1) – the hypothesis could be true, but it’s not certain

Contradiction (2) – the hypothesis directly contradicts the premise


In [4]:
label_map = {0: "Entailment", 1: "Neutral", 2: "Contradiction"}

def show_one_example_per_label(dataset_split, offset=0):
    seen_labels = set()
    subset = dataset_split.select(range(offset, len(dataset_split)))
    
    for example in subset:
        label = example["label"]
        if label in [0, 1, 2] and label not in seen_labels:
            print(f"\nLabel: {label} ({label_map[label]})")
            print("Premise:", example["premise"])
            print("Hypothesis:", example["hypothesis"])
            seen_labels.add(label)
        if len(seen_labels) == 3:
            break

# Reuse the dataset loaded above and show one example per label, starting at row 550
show_one_example_per_label(dataset["train"], offset=550)


Label: 2 (Contradiction)
Premise: so i i know the people at TI who are doing this and i heard about it so i called them and ask if i could could participate and uh
Hypothesis: I don't know anyone at TI and that's okay because I don't want to participate anyway.

Label: 1 (Neutral)
Premise: We also show how the advocate component of statewide websites promotes effective representation by sharing legal resources and expertise - generally a function of legal work supervisors.
Hypothesis: Legal work supervisors are usually tasked with sharing legal advice and resources.

Label: 0 (Entailment)
Premise: The New Territories can be explored by taking the Kowloon Canton Railway (KCR), which makes 10 stops between the station in Kowloon and Sheung Shui, the last stop before entering China.
Hypothesis: The Kowloon Canton Railway makes over five stops between the station and Sheung Shui.


# 3. Define the model name and load the tokenizer

Why this model and tokenizer?

The distilbert-base-uncased model and its tokenizer are well-suited for this task for several reasons:

1) Lightweight yet effective: As a distilled version of BERT, DistilBERT is reported to retain about 97% of BERT’s language-understanding performance (measured on the GLUE benchmark, which includes NLI tasks), while being 40% smaller and 60% faster. This makes it an efficient choice for fine-tuning without giving up much accuracy.

2) Uncased variant: The model treats uppercase and lowercase letters the same, which reduces vocabulary size and simplifies training. This is particularly useful for NLI, where case sensitivity is typically not essential.

3) Seamless integration with Hugging Face tools: It is fully compatible with Hugging Face's transformers library, including AutoTokenizer and AutoModelForSequenceClassification, making setup and fine-tuning straightforward.

4) Pretrained on diverse data: The model has been pretrained on a large and varied text corpus (including Wikipedia and BookCorpus), providing a strong foundation for transfer learning on datasets like MultiNLI.

5) Resource-friendly: With fewer parameters than BERT, DistilBERT runs efficiently on limited hardware, making it ideal for quick experiments, smaller machines, or educational setups.

In [5]:

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4. Preprocess the data

This step gets the dataset ready for training by tokenizing the premise–hypothesis pairs with the chosen tokenizer. padding="max_length" pads every pair to the tokenizer's maximum length, truncation=True cuts off pairs that exceed it, and the label column is copied to labels, which is the key the Hugging Face Trainer expects. The .map() function applies this to the full dataset, and all original columns are removed since they are no longer needed for training; this keeps the dataset clean and avoids passing unnecessary data to the model.

In [6]:
# Define preprocessing function
def preprocess_function(examples):
    tokenized_inputs = tokenizer(
        examples["premise"],
        examples["hypothesis"],
        padding="max_length",
        truncation=True
    )
    # Use 'labels' key as Trainer expects this key
    tokenized_inputs["labels"] = examples["label"]
    return tokenized_inputs

# Map the function over the dataset, remove columns
tokenized_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)

# Print a sample of the tokenized dataset
print(tokenized_datasets["train"][0])

{'input_ids': [101, 17158, 2135, 6949, 8301, 25057, 2038, 2048, 3937, 9646, 1011, 4031, 1998, 10505, 1012, 102, 4031, 1998, 10505, 2024, 2054, 2191, 6949, 8301, 25057, 2147, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
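The sample above follows the usual sentence-pair layout: `[CLS]` (id 101), the premise tokens, `[SEP]` (id 102), the hypothesis tokens, another `[SEP]`, then padding zeros. The sketch below reproduces that layout with a toy function. It is not the real tokenizer: `encode_pair` is a hypothetical helper, the truncation is deliberately naive, and `max_length=512` is an assumption chosen to match the 512 position embeddings of this model. The token ids are copied from the printed sample.

```python
# Toy illustration of the sentence-pair layout; NOT the Hugging Face tokenizer.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode_pair(premise_ids, hypothesis_ids, max_length):
    """Build [CLS] premise [SEP] hypothesis [SEP], then truncate and pad to max_length."""
    ids = [CLS_ID] + premise_ids + [SEP_ID] + hypothesis_ids + [SEP_ID]
    ids = ids[:max_length]  # naive truncation; the real tokenizer trims more carefully
    num_padding = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * num_padding,
        "attention_mask": [1] * len(ids) + [0] * num_padding,
    }

# Token ids copied from the sample printed above
premise = [17158, 2135, 6949, 8301, 25057, 2038, 2048, 3937, 9646, 1011, 4031, 1998, 10505, 1012]
hypothesis = [4031, 1998, 10505, 2024, 2054, 2191, 6949, 8301, 25057, 2147, 1012]

encoded = encode_pair(premise, hypothesis, max_length=512)
real_tokens = sum(encoded["attention_mask"])
padding_share = 1 - real_tokens / len(encoded["input_ids"])
print(f"real tokens: {real_tokens}, padded length: {len(encoded['input_ids'])}")
print(f"share of positions that are padding: {padding_share:.1%}")
```

Under these assumptions most positions in this example are padding, which is why padding each batch only to its longest member is a common alternative to `padding="max_length"` when speed matters.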

# 5. Downsample the dataset for faster training

In [7]:
# Downsample the dataset for faster training
train_dataset = tokenized_datasets["train"].select(range(50000))
eval_dataset = tokenized_datasets["validation_matched"].select(range(4000))

print("Train subset size:", len(train_dataset))
print("Eval subset size:", len(eval_dataset))

Train subset size: 50000
Eval subset size: 4000


# 6. Load the pretrained model

In [8]:
# Load the pretrained model
base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
base_model = base_model.to(torch.device('cuda' if torch.cuda.is_available() else 'cpu')) # Leverage cuda if gpu is available, else use cpu

# Display the pretrained model
base_model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


The printed model follows the standard DistilBertForSequenceClassification setup. It has 6 transformer layers (half of BERT-base’s 12), which keeps it compact without losing too much performance. The embedding layer uses a vocabulary of 30,522 tokens and learned position embeddings for up to 512 positions, which let the model keep track of word order.

On top of the base encoder, there is a simple classification head: a linear "pre-classifier", a ReLU activation, dropout, and a final linear layer that maps to 3 output classes for the MultiNLI task. Inside the encoder, the feed-forward blocks use GELU as the activation function, which is smoother than ReLU and tends to work well in transformers. Dropout and LayerNorm are used throughout for regularization and training stability.

Overall, it's a lightweight, efficient setup that’s a great fit for NLI.

# 7. Evaluate the pre-trained model


In this step, the model’s performance is evaluated using macro-averaged F1 score, precision, and recall. The predicted class for each example is obtained by applying argmax to the model’s output logits, and these predictions are compared to the ground truth labels. Macro averaging ensures that all classes are treated equally, which is important for a dataset like MultiNLI where the label distribution may not be perfectly balanced.

In [9]:
# Function to compute evaluation metrics
def compute_metrics(eval_pred: Tuple[np.ndarray, np.ndarray]) -> Dict[str, float]:
    """
    Compute macro-averaged F1 score, precision, and recall for classification predictions.

    Parameters
    ----------
    eval_pred : Tuple[np.ndarray, np.ndarray]
        A tuple containing:
        - predictions: array-like of shape (n_samples, n_classes), raw model outputs (logits).
        - labels: array-like of shape (n_samples,), ground truth class labels.

    Returns
    -------
    Dict[str, float]
        A dictionary with macro-averaged evaluation metrics:
        - "f1_score": float
        - "precision": float
        - "recall": float
    """
    predictions, labels = eval_pred
    # Pick the highest-scoring class per example (predictions arrive as a NumPy array)
    preds = np.argmax(predictions, axis=-1)
    f1 = f1_score(labels, preds, average='macro')
    precision = precision_score(labels, preds, average='macro')
    recall = recall_score(labels, preds, average='macro')
    return {"f1_score": f1, "precision": precision, "recall": recall}
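To make the effect of macro averaging concrete, here is a minimal pure-Python sketch (not the scikit-learn implementation) that scores a degenerate classifier which always predicts class 0 on a small balanced toy set. `macro_scores` is a hypothetical helper written for this illustration, the data is made up, and undefined ratios are treated as 0.

```python
def macro_scores(labels, preds, num_classes=3):
    """Return macro-averaged (precision, recall, f1); undefined ratios count as 0."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    # Macro averaging: every class contributes equally, regardless of its size
    return (sum(precisions) / num_classes,
            sum(recalls) / num_classes,
            sum(f1s) / num_classes)

# Toy data: three examples per class, and a model that always predicts class 0
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
preds = [0] * len(labels)

precision, recall, f1 = macro_scores(labels, preds)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case macro recall is 1/3 (one class fully recovered, two missed entirely), while macro F1 is lower still. A model that collapses onto one class is penalized on every class it ignores.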

The following step runs a quick evaluation of the base DistilBERT model before any fine-tuning, just to get a baseline on how well it performs out of the box. The Trainer is set up to skip training and only do evaluation on the MultiNLI validation set.

The batch size is kept small to avoid memory issues, and logging is minimal since the goal here is just to get a reference point. Even though evaluation_strategy="epoch" isn’t really needed without training, it keeps things consistent with the rest of the pipeline.

Because the classification head is newly initialized (see the warning printed when the model was loaded), this baseline is expected to sit near chance for three classes. Its main value is as a reference point to compare against once fine-tuning is done.

In [10]:
# Evaluate the pretrained base model
training_args_base = TrainingArguments(
    output_dir="./base_model_eval",
    per_device_eval_batch_size=8,
    do_train=False,
    do_eval=True,
    evaluation_strategy="epoch",
    logging_steps=10,
    report_to="none"
)

trainer_base = Trainer(
    model=base_model,
    args=training_args_base,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

print("Evaluating pretrained base model...")
results_base = trainer_base.evaluate()
print(results_base)



Evaluating pretrained base model...


{'eval_loss': 1.0992169380187988, 'eval_model_preparation_time': 0.001, 'eval_f1_score': 0.195431192931623, 'eval_precision': 0.32689638378180935, 'eval_recall': 0.3362731580642897, 'eval_runtime': 45.1009, 'eval_samples_per_second': 88.69, 'eval_steps_per_second': 11.086}
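One way to read the eval_loss above is to compare it with the cross-entropy of a predictor that assigns equal probability to all three classes. The short check below is only a sanity calculation; the reported value is copied from the output above.

```python
import math

num_classes = 3

# Cross-entropy of a predictor that gives every class probability 1/3
uniform_loss = -math.log(1.0 / num_classes)

# Copied from the evaluation output above
reported_eval_loss = 1.0992169380187988

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"reported eval loss: {reported_eval_loss:.4f}")
print(f"difference:         {reported_eval_loss - uniform_loss:.4f}")
```

The two values are nearly identical, which is consistent with an untrained classification head producing close-to-uniform predictions.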


In [11]:
# Confirm the model is running on cuda
print("Model is running on:", next(trainer_base.model.parameters()).device)

Model is running on: cuda:0


# 8. Implement parameter efficient fine-tuning with Low Rank Adaptation (LoRA)

This step sets up LoRA to fine-tune only a small, targeted subset of the base DistilBERT model, making training more efficient. 


## 8.1 Define target modules from DistilBERT for LoRA

The target_modules focus on the query, key, value, and output projection layers of the self-attention block in the last transformer layer (layer.5). Since attention layers play a key role in how the model processes sentence pairs, modifying them directly makes sense for a task like NLI, and narrowing it to just one layer keeps the update lightweight.




In [12]:
target_modules = [
    "distilbert.transformer.layer.5.attention.q_lin",
    "distilbert.transformer.layer.5.attention.k_lin",
    "distilbert.transformer.layer.5.attention.v_lin",
    "distilbert.transformer.layer.5.attention.out_lin"
]



## 8.2 Define the LoRA configuration

The LoraConfig defines how LoRA will be applied. The rank r=64 sets the inner dimension of the low-rank update matrices, and lora_alpha=16 scales the update; a small dropout (0.05) adds regularization. use_rslora=True enables rank-stabilized LoRA, which scales the update by lora_alpha/sqrt(r) instead of lora_alpha/r. This is intended to keep training stable at higher ranks such as r=64; it does not change memory use.

bias="none" means LoRA skips modifying bias terms — a common choice to keep things simple unless there's a strong reason to include them.
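As a rough illustration of what these settings control, the sketch below implements a LoRA-style linear layer in plain Python: the frozen weight `W` is left untouched and a scaled low-rank product `B·A` is added on top. This is a simplified sketch, not PEFT's implementation: the helper names and the tiny matrices are made up for the example, and dropout is omitted.

```python
import math

def lora_scaling(lora_alpha, r, use_rslora):
    """Scaling applied to the low-rank update: alpha/sqrt(r) with rsLoRA, else alpha/r."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_linear(x, W, A, B, scaling):
    """y = W x + scaling * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]

# Scaling factors for the configuration used here (r=64, lora_alpha=16)
standard = lora_scaling(16, 64, use_rslora=False)
rank_stabilized = lora_scaling(16, 64, use_rslora=True)
print("alpha / r       =", standard)
print("alpha / sqrt(r) =", rank_stabilized)

# Toy layer: 3 inputs, 2 outputs, rank 1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # frozen pretrained weight (d_out x d_in)
A = [[0.5, 0.5, 0.0]]                    # r x d_in
B_zero = [[0.0], [0.0]]                  # d_out x r, zero at initialization
B_trained = [[1.0], [0.0]]               # pretend values after some training
x = [1.0, 2.0, 3.0]

y_init = lora_linear(x, W, A, B_zero, rank_stabilized)
y_trained = lora_linear(x, W, A, B_trained, rank_stabilized)
print("output with zero B:   ", y_init)
print("output with trained B:", y_trained)
```

With `B` at zero the layer reproduces the frozen base output exactly, so training starts from the pretrained behavior. For r=64 the rank-stabilized factor is eight times larger than the standard one, so the same `A` and `B` produce a proportionally larger update.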

In [13]:



# Define LoRA configuration
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=target_modules,
    bias="none",
    use_rslora=True,
)





## 8.3 Apply PEFT

The model is wrapped with get_peft_model to inject the LoRA layers, and it’s moved to the GPU if available. Finally, printing the trainable parameters confirms that only a small fraction of the full model is being updated — a big win in terms of speed and memory, without compromising much on performance.

In [14]:
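# Apply PEFT: wrap the base model so LoRA layers are injected into the target modules
peft_model = get_peft_model(base_model, peft_config)
# Move the wrapped model to the GPU if available, else keep it on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
peft_model.to(device)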

# Print trainable parameters
peft_model.print_trainable_parameters()

trainable params: 986,115 || all params: 67,941,894 || trainable%: 1.4514
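The trainable-parameter count above can be reproduced by hand. The arithmetic below assumes that, for a sequence-classification task, PEFT keeps the classification head (pre_classifier and classifier) trainable alongside the LoRA matrices. That assumption is not stated in the output, but the resulting total matches the printed figure. The layer sizes are taken from the model printout earlier.

```python
hidden = 768        # in/out features of q_lin, k_lin, v_lin, out_lin
rank = 64           # LoRA rank r
num_targets = 4     # q_lin, k_lin, v_lin, out_lin in layer 5
num_labels = 3

# Each adapted Linear gets A (rank x hidden) and B (hidden x rank)
lora_params = num_targets * (rank * hidden + hidden * rank)

# Assumption: the classification head is also trained (weights + biases)
pre_classifier_params = hidden * hidden + hidden
classifier_params = hidden * num_labels + num_labels

trainable = lora_params + pre_classifier_params + classifier_params
all_params = 67_941_894  # copied from the output above

print(f"LoRA matrices:  {lora_params:,}")
print(f"head (assumed): {pre_classifier_params + classifier_params:,}")
print(f"trainable:      {trainable:,}")
print(f"trainable%:     {100 * trainable / all_params:.4f}")
```

Under this reading, the LoRA matrices account for well under half of the trainable parameters; the rest is the classification head, which has to be learned from scratch in any case.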


## 8.4 Run PEFT

This setup fine-tunes the PEFT (LoRA-injected) model on the MultiNLI dataset using Hugging Face’s Trainer, with training arguments tuned for efficiency and stability — especially when running on a GPU:

1) evaluation_strategy="epoch" and save_strategy="epoch" ensure the model is evaluated and checkpointed at the end of every epoch. Since training on a GPU is relatively fast, epoch-level evaluation provides a good balance between performance tracking and speed.

2) load_best_model_at_end=True automatically restores the best checkpoint based on validation loss, which is especially useful when training for many epochs, since it saves having to manually track which checkpoint did best.

3) learning_rate=2e-5 is a conservative value borrowed from full fine-tuning practice. LoRA runs are often trained with larger learning rates, so this is a setting worth tuning in later experiments.

4) num_train_epochs=12 is intentionally set a bit high — LoRA only updates a small number of parameters, so more epochs are usually needed for the model to fully adapt.

5) weight_decay=0.01 adds light regularization, helping avoid overfitting even on high-capacity hardware like GPUs.

6) per_device_train_batch_size=8 and per_device_eval_batch_size=8 are conservative and help avoid OOM errors. On a GPU, this keeps training smooth without pushing the limits — especially useful if other processes are sharing the GPU.

7) dataloader_num_workers=10 helps fully utilize the CPU to keep the GPU fed with data, reducing bottlenecks in the training loop.

8) Logging every 10 steps gives frequent updates without spamming, and report_to="none" keeps the run lightweight unless logging integrations are explicitly needed.

Running on a GPU makes this setup more efficient overall, but the structure still prioritizes stability and good generalization — which is especially important in a PEFT workflow where you’re only fine-tuning a small part of the model.






In [16]:
training_args = TrainingArguments(
    output_dir="./peft_multi_nli_results",
    evaluation_strategy="epoch",   
    save_strategy="epoch",         
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=12,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    report_to="none",
    dataloader_num_workers=10,
)

# Prepare the Trainer for PEFT training
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)
trainer.label_names = ["labels"]

# Fine-tune the PEFT (LoRA) model
trainer.train()

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Epoch | Training Loss | Validation Loss | F1 Score | Precision | Recall |
|---|---|---|---|---|---|
| 1 | 0.8525 | 0.869497 | 0.597022 | 0.599923 | 0.597384 |
| 2 | 0.7968 | 0.82398 | 0.623062 | 0.625834 | 0.623797 |
| 3 | 0.8417 | 0.798245 | 0.644088 | 0.644385 | 0.644189 |
| 4 | 0.7551 | 0.794242 | 0.646046 | 0.648953 | 0.646444 |
| 5 | 0.7815 | 0.785159 | 0.651598 | 0.651919 | 0.651671 |
| 6 | 0.7253 | 0.771869 | 0.659214 | 0.659477 | 0.659287 |
| 7 | 0.5977 | 0.773765 | 0.658147 | 0.65978 | 0.65863 |
| 8 | 0.779 | 0.77265 | 0.661238 | 0.663629 | 0.661598 |
| 9 | 0.785 | 0.770835 | 0.656224 | 0.66047 | 0.657407 |
| 10 | 0.735 | 0.767022 | 0.661701 | 0.664589 | 0.662606 |


TrainOutput(global_step=75000, training_loss=0.786420759938558, metrics={'train_runtime': 11464.7455, 'train_samples_per_second': 52.334, 'train_steps_per_second': 6.542, 'total_flos': 8.12994637824e+16, 'train_loss': 0.786420759938558, 'epoch': 12.0})
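The global_step and throughput figures in the TrainOutput above follow directly from the subset size and batch size. The check below assumes a single device and no gradient accumulation, which is consistent with the numbers reported.

```python
import math

train_examples = 50_000       # size of the downsampled training subset
batch_size = 8                # per_device_train_batch_size
num_epochs = 12               # num_train_epochs
train_runtime_s = 11464.7455  # 'train_runtime' copied from the output above

# Assumption: one device, no gradient accumulation
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
steps_per_second = total_steps / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"steps / second:  {steps_per_second:.3f}")
print(f"runtime (hours): {runtime_hours:.2f}")
```

The computed step count and steps per second agree with the reported global_step and train_steps_per_second, and the full run took a little over three hours.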

Training ran for 12 epochs. Over the ten epochs displayed above, the validation F1 score rises from about 0.60 to about 0.66, with precision and recall tracking closely, which suggests consistent gains in overall performance.

The validation loss falls from 0.869 to 0.767 with only a slight uptick at epoch 7, a good sign that the model keeps generalizing better over time without major overfitting. Even though the logged training loss fluctuates (as expected with small batches), the upward trend in F1 is clear.

Notably, the largest gains come in the first three epochs, and improvement slows markedly after epoch 6, so in a future run, early stopping with a small patience might be a good option. Overall, the model shows solid learning dynamics and benefits from longer training, which aligns with expectations for PEFT setups where only a small subset of parameters is being updated.
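To see what early stopping would have done on this run, the sketch below replays a simple patience rule over the validation losses logged above (epochs 1 to 10). `early_stopping_epoch` is a hypothetical helper written for this illustration. It mimics the usual rule of stopping after a fixed number of evaluations without improvement; Hugging Face's EarlyStoppingCallback could play this role in a real Trainer run.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), 1-indexed; stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Validation losses copied from the training log above (epochs 1-10)
val_losses = [0.869497, 0.82398, 0.798245, 0.794242, 0.785159,
              0.771869, 0.773765, 0.77265, 0.770835, 0.767022]

for patience in (1, 2, 3):
    stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience)
    print(f"patience={patience}: stop at epoch {stop_epoch}, best epoch so far {best_epoch}")
```

On these numbers a patience of 1 or 2 would stop the run at epoch 7 or 8 and keep the epoch 6 checkpoint, while a patience of 3 would let training continue through the small improvements at epochs 9 and 10. The choice of patience trades compute against those late, modest gains.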

# 9. Save the PEFT adapters for future use


In [17]:
peft_model.save_pretrained("./peft_multi_nli")
print("PEFT adapters saved to ./peft_multi_nli")

PEFT adapters saved to ./peft_multi_nli


With training complete, the LoRA-adapted model clearly outperforms the base model: macro F1 on the validation subset rises from about 0.20 before fine-tuning to about 0.66. By saving only the adapter weights instead of the full model, the result is a lightweight, modular checkpoint that is easy to share or reuse while still capturing the task-specific learning. This setup makes fine-tuning both efficient and practical, especially for larger models or when working with limited compute.
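For a rough sense of how much smaller an adapter-only checkpoint is, the estimate below converts the parameter counts printed earlier into storage sizes. It assumes 32-bit floats and that the saved adapter holds roughly the trainable parameters; actual file sizes will differ somewhat because of metadata and file format.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights stored as 32-bit floats
MIB = 1024 ** 2

trainable_params = 986_115   # from print_trainable_parameters() above
all_params = 67_941_894      # from print_trainable_parameters() above

adapter_mib = trainable_params * BYTES_PER_FLOAT32 / MIB
full_mib = all_params * BYTES_PER_FLOAT32 / MIB

print(f"adapter-only estimate: {adapter_mib:.2f} MiB")
print(f"all parameters:        {full_mib:.2f} MiB")
print(f"ratio:                 {full_mib / adapter_mib:.0f}x")
```

Under these assumptions the adapter is a few megabytes, against a few hundred for the full set of weights.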

# 10. Next Steps

1) Try other PEFT variants such as QLoRA or AdaLoRA.

2) Use larger models like RoBERTa or BERT-large.