<a href="https://colab.research.google.com/github/quanticedu/llm-fine-tuning/blob/main/PEFT_with_LoRA_and_QLoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA)

In this Colab notebook, you'll determine the hyperparameters you'll need to fine-tune the Phi-2 model using the PEFT strategies of LoRA and QLoRA.

> This notebook is based on [@maximelabonne's Llama 2 fine-tuning notebook](https://github.com/mlabonne/llm-course/blob/main/Fine_tune_Llama_2_in_Google_Colab.ipynb), which is, in turn, based on Younes Belkada's [GitHub Gist](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da). It also borrows from [this example](https://github.com/brevdev/notebooks/blob/main/phi2-finetune-own-data.ipynb) on Phi-2 fine-tuning.

## Load and Tokenize the Training Data

These four cells contain all the code from the previous lesson. The first two cells install the needed packages (remember to restart the runtime if you're prompted to do so). The third cell imports modules and tokenizes the training datasets. The fourth cell loads the model unquantized (you'll reload it quantized later in the lesson). Refer to the previous lesson if you need a refresher on anything here.

Select the T4 GPU runtime and run the four cells.


In [1]:
# Upgrade pip
!pip install -U pip

# Uninstall packages that will conflict with those we're about to install
!pip uninstall --yes opencv-contrib-python thinc opencv-python opencv-python-headless albumentations spacy dopamine-rl albucore fastai jax shap jaxlib pytensor pymc flax chex orbax-checkpoint optax

# Downgrade to numpy 1.26.4 (needed to support pandas 2.2.2). NOTE: If asked to restart the session, do so. You don't need to rerun this cell.
!pip uninstall --yes numpy
!pip install numpy==1.26.4

Collecting pip
  Downloading pip-25.3-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.3-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.3
Found existing installation: opencv-contrib-python 4.12.0.88
Uninstalling opencv-contrib-python-4.12.0.88:
  Successfully uninstalled opencv-contrib-python-4.12.0.88
Found existing installation: thinc 8.3.10
Uninstalling thinc-8.3.10:
  Successfully uninstalled thinc-8.3.10
Found existing installation: opencv-python 4.12.0.88
Uninstalling opencv-python-4.12.0.88:
  Successfully uninstalled opencv-python-4.12.0.88
Found existing installation: opencv-python-headless 4.12.0.88
Uninstalling opencv-python-headless-4.12.0.88:
  Successfully uninst

In [1]:
# Install a CUDA 12.1 build of PyTorch compatible with Python 3.12
!pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# Core libs
!pip install \
  accelerate==1.10.1 \
  transformers==4.56.2 \
  datasets==4.0.0 \
  peft==0.17.1 \
  sentence-transformers==5.1.0 \
  einops==0.8.1 \
  safetensors==0.6.2 \
  jinja2==3.1.6 \
  regex==2025.9.18 \
  fsspec==2025.3.0 \
  gcsfs==2025.3.0 \
  pandas==2.2.2 \
  pyarrow==15.0.2 \
  pytz==2024.1

# bitsandbytes with CUDA 12 support (use a recent version)
!pip install bitsandbytes==0.47.0

Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch==2.5.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (780.4 MB)
Collecting torchvision==0.20.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.20.1%2Bcu121-cp312-cp312-linux_x86_64.whl (7.3 MB)
Collecting torchaudio==2.5.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (3.4 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.5.1)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_cuda_nvrtc_cu12-12.1.1

In [3]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "microsoft/phi-2"
# 1. Model Loading (4-bit NF4 quantization to reduce GPU memory use)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map={"": 0}
)

# 2. Dataset Loading (AI4agric CROP)
# We load the dataset and select a subset for a 'pilot' run to ensure convergence
dataset = load_dataset(
    "AI4Agr/CROP-dataset",
    data_files="**/*_en/**/*.json",
    split="train"
)
dataset = dataset.select(range(5000)) # Pilot run with 5,000 samples

# 3. Instruction Formatting (Key difference from style-tuning)
def format_instruction(sample):
    # This template forces the model to learn the 'Assistant' role
    prompt = f"### Instruction:\n{sample['instruction']}\n\n### Response:\n{sample['output']}"
    return {"text": prompt}

dataset = dataset.map(format_instruction)

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)

# 4. LoRA Setup (targets the attention and MLP linear layers listed by print(model))
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["fc1", "fc2", "q_proj", "k_proj", "v_proj", "dense"], # Module names as shown by print(model)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)

# 5. Training Arguments (short pilot run with frequent loss logging)
training_args = TrainingArguments(
    output_dir="./agri_model_results",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    max_steps=100, # Start with 100 steps to verify that the training loss decreases
    fp16=True,
    optim="paged_adamw_32bit",
    report_to="none"
)

trainer = Trainer(
    model=model,
    train_dataset=tokenized_dataset,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

# 6. Execution
print("Starting Agricultural Domain Adaptation...")
trainer.train()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/22 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/22 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/22 [00:00<?, ?files/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Starting Agricultural Domain Adaptation...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


Step,Training Loss
10,1.7441
20,1.5247
30,1.4407
40,1.4041
50,1.3429
60,1.3715
70,1.3668
80,1.3472
90,1.2762
100,1.2879


TrainOutput(global_step=100, training_loss=1.4106109523773194, metrics={'train_runtime': 753.2312, 'train_samples_per_second': 1.062, 'train_steps_per_second': 0.133, 'total_flos': 6567210516480000.0, 'train_loss': 1.4106109523773194, 'epoch': 0.16})
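
The `epoch` value reported above can be sanity-checked by hand. The sketch below assumes a single GPU and reuses the numbers from the cell (batch size 2, 4 accumulation steps, 100 steps, 5,000 samples). It also converts the last logged loss into a perplexity, which is the exponential of the cross-entropy loss. The variable names are chosen for illustration only.

```python
import math

# Values copied from the training cell above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 100
n_samples = 5000
n_gpus = 1  # assumption: a single T4 GPU

# Each optimizer step consumes this many examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_gpus

# Fraction of the dataset seen after max_steps optimizer steps.
epochs_seen = max_steps * effective_batch / n_samples

# Perplexity is exp(cross-entropy loss); 1.2879 is the loss logged at step 100.
final_loss = 1.2879
perplexity = math.exp(final_loss)

print(f"effective batch size: {effective_batch}")
print(f"epochs seen: {epochs_seen:.2f}")
print(f"perplexity at step {max_steps}: {perplexity:.2f}")
```

The computed epoch fraction matches the `'epoch': 0.16` entry in the `TrainOutput` metrics.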

In [None]:
# Specify quantization and load the model
################################################################################
# bitsandbytes (quantization) parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = False # Whether to quantize model weights to 4bits (QLoRA).

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16" # For some GPUs, 'bfloat16' format could be the optimal choice.

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4" # Choosing between different number representation formats.

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

# Use variables above to define a quantization configuration object.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,

)

device = "cuda:0" # The first among the available GPUs.
device_map = {"": 0} # Specify which elements of the model go to which device.
                     # This is especially relevant for huge models that don't fit on one GPU.
                     # In our case, we map everything to device 0 (GPU number 0) when loading the model.

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    revision="main", # The default branch. Pin a specific commit hash here for reproducibility.
    device_map=device_map,
    trust_remote_code=True, # Allows running custom model code shipped in the model's Hub repository. Only enable this for sources you trust.
    quantization_config=bnb_config if use_4bit else None,
    torch_dtype=torch.float16 # When quantization is not used,
                              # we need to specify this to avoid loading the model in 32bit.
)

model.config.use_cache = False # Caching speeds up inference, but is irrelevant for training/fine-tuning.
                               # We've found it interferes with Colab behavior when different models are loaded/unloaded.
                               # So we'll keep it off. In practice, for inference, setting it to True (default) is advisable.
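
To see why the precision settings above matter, here is a rough back-of-the-envelope estimate of the memory needed just to hold the weights. It is a sketch under stated assumptions: a parameter count of about 2.7 billion for Phi-2, and no accounting for quantization constants, activations, gradients, or optimizer state. The helper `weight_memory_gib` is hypothetical and not part of any library.

```python
def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Return the memory needed to store n_params weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

# Assumption: Phi-2 has roughly 2.7 billion parameters.
N_PARAMS = 2_700_000_000

# Compare full precision, half precision, and 4-bit quantized storage.
for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(N_PARAMS, bits):5.2f} GiB")
```

Under these assumptions the weights take about 10 GiB in float32, about 5 GiB in float16, and about 1.3 GiB in 4-bit form, which is why `torch_dtype=torch.float16` is set when quantization is off.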

## Base Model Samples

Before we fine-tune the model, we should get samples of its baseline performance. First we create a pipeline for convenience.


In [None]:
from transformers import pipeline  # Not imported by the setup cells above
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

Now we create two samples. In the first we'll use the same prompt we used in the previous lesson. Note that changing the value of `do_sample` in the generation configuration changes the output.

In [None]:
from transformers import GenerationConfig  # Not imported by the setup cells above

generation_config = GenerationConfig(max_length=200,
                                      do_sample=True,    # True: sample each next token in proportion to its predicted probability.
                                                         # False: deterministic (highest-probability) decoding.
                                      use_cache=False,
                                      temperature=1,
                                      eos_token_id=tokenizer.eos_token_id,
                                      bos_token_id=tokenizer.eos_token_id,
                                      pad_token_id=tokenizer.eos_token_id)

# Try the old "sad one-sentence story" prompt we used in the previous lesson:
torch.manual_seed(42)
result = pipe("As promised, here is a one-sentence story that will make you cry: ", generation_config=generation_config)
print(result[0]['generated_text'])

In the second sample, we'll have the model give us a continuation for an opening of a story.

In [None]:
# Continue a generic story opening:
torch.manual_seed(42)
result = pipe("I went outside,", generation_config=generation_config)
print(result[0]['generated_text'])
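
The effect of `do_sample` and `temperature` can be illustrated without a model. The following is a minimal sketch, not the `transformers` implementation: the four-word vocabulary, the logits, and the helper names are all made up for illustration. Greedy decoding always picks the highest-scoring token, while sampling draws from the softmax distribution.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next_token(logits, do_sample, temperature=1.0, rng=random):
    """Return the index of the next token, greedily or by sampling."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical vocabulary and next-token logits.
vocab = ["rain", "sun", "snow", "fog"]
logits = [2.0, 1.5, 0.5, 0.1]

rng = random.Random(42)  # seeded so that the sampled run is repeatable
greedy = [vocab[pick_next_token(logits, do_sample=False)] for _ in range(5)]
sampled = [vocab[pick_next_token(logits, True, 1.0, rng)] for _ in range(5)]

print("greedy: ", greedy)
print("sampled:", sampled)
```

Greedy decoding returns the same token every time, whereas the sampled sequence depends on the seed, which is why the cells above call `torch.manual_seed(42)` before generating.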

## Trainable Modules

To determine which modules to apply LoRA to, we need to know which modules are in the model.

In [None]:
print(model)

## Training Hyperparameters

This cell sets the hyperparameters and creates a PEFT model from the base model.

In [None]:

# Fine-tuned model name for saving later
new_model_name = "phi2-hemingway"

################################################################################
# LoRA parameters
################################################################################

# LoRA attention dimension (the rank of the adjustment matrices)
# 1 is the minimum, which would result in extremely limited flexibility.
# The higher the number, the more flexible our LoRA
# adjustment matrices will be. The cost is higher memory demand and longer training.
# If we increase it too much, we'll essentially be doing full fine-tuning on the
# weights to which LoRA is applied (see the "target_modules" parameter of LoraConfig below).
# Common values to try are: 8, 16, 32.
lora_r = 32

# Alpha parameter for LoRA scaling # Covered in the lesson directly
# Higher alpha will result in higher impact of lora adaptation.
# A common rule of thumb is to set this to lora_r times two.
# But it's not guaranteed to be best and experimentation can help find more optimal values.
lora_alpha = 64

# Dropout probability for LoRA layers
# Dropout refers to randomly "switching off" a certain proportion of neurons.
# This encourages the network not to rely on any one weight too much and thus be
# more robust.
lora_dropout = 0.1


################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs. How many times to go over the dataset.
# Overly high - increased risk of overfitting (memorizing the training set without understanding)
# Overly low - increased risk of cutting training too early.
# Reasonable value can be selected by selecting a large value and monitoring validation set performance.

num_train_epochs = 3.0

# Enable fp16/bf16 mixed-precision training (set bf16 to True with an A100)
# Can speed up training and decrease memory demands by
# running parts of the computation in a lower-precision number format.
# Important for QLoRA. Might not work on some GPUs.
fp16 = use_4bit
bf16 = False

# Batch size (how many training examples to work with in parallel) per GPU for training
# Usually, the higher the batch size - the better (results in more stable learning).
# BUT go too high - and you'll quickly run out of GPU memory.
# Generally, select the highest number you can without running out of memory.
per_device_train_batch_size = 2

# Batch size per GPU for evaluation
# This can often be a bit higher since during evaluation we don't need to store gradients.
# The higher this number - the faster the evaluation will be.
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
# If your batch size is small, you can increase this number for more stable training.
# (we'll accumulate evidence for some time before making the weight update step)
# It's essentially the same as batch size, but done sequentially instead of in parallel.
gradient_accumulation_steps = 2

# Enable gradient checkpointing. Lowers memory demand by clever combination
# of caching and recomputation. The cost is a small slowdown.
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping). Prevents gradients from growing too large
# and causing training instabilities or numerical overflows.
max_grad_norm = 0.3


# Weight decay to apply to all layers except bias/LayerNorm weights
# Weight decay prevents individual weights from becoming too large.
# This is a classical way of softly reducing model flexibility / degrees of freedom.
# If weight decay is too high, all weights will be incentivised to become near-zero.
weight_decay = 0.000 # 0.001, 0.005, 0.0001 are all values one might want to try.
                     # Be careful with this parameter, though, as too much weight decay
                     # might make the model forget everything.


# Optimizer to use (intuitively, the training data tells us in which direction
# each weight should be changed to improve the performance a little,
# but the optimizer 'decides' how exactly to use this information: change
# fast or slow, with or without inertia, etc.)
optim = "paged_adamw_32bit"

# Initial learning rate (AdamW optimizer)
learning_rate = 1e-4 # How fast to step along the directions described above.
                     # AdamW is adaptive, meaning that it will internally adjust this,
                     # but it's still important to choose an adequate starting point.

# Learning rate schedule. The learning rate also changes during training according to a pre-specified
# schedule, usually getting smaller towards the end of training, with the idea that
# towards the end we are making finer adjustments than in the beginning.
# A nice article covering different scheduler shapes: https://towardsdatascience.com/a-visual-guide-to-learning-rate-schedulers-in-pytorch-24bbb262c863
# The scheduler would be especially important if we were to use the SGD optimizer.
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
# Warm-up refers to starting with a lower learning rate and ramping it up, to avoid
# overly dramatic changes in the very beginning of learning.
warmup_ratio = 0.03

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

# Define LoRA configuration
peft_config = LoraConfig(
    target_modules=[ # Which model parts to apply the LoRA matrices to.
        "fc1",      # Use print(model) to make a more informed decision.
        "fc2",       # Weights related to queries, keys, and values are a must.
        "k_proj",
        "q_proj",
        "v_proj",
        "dense"
    ],
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="lora_only",
    task_type="CAUSAL_LM",
)

model_peft = get_peft_model(model, peft_config)
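
To make the roles of `lora_r` and `lora_alpha` concrete, here is a toy sketch of the LoRA forward pass for a single linear layer, written in plain Python. It is an illustration and not the PEFT implementation: the tiny dimensions and the helper names `matvec` and `lora_forward` are assumptions. The adapted output is the frozen layer's output plus the low-rank update `B @ A @ x`, scaled by `alpha / r`. Following the original LoRA formulation, `B` starts at zero, so the adapted layer initially behaves exactly like the base layer.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W @ x + (alpha / r) * B @ (A @ x)."""
    base = matvec(W, x)                 # frozen pre-trained layer
    update = matvec(B, matvec(A, x))    # low-rank path: d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

rng = random.Random(0)
d_in, d_out, r, alpha = 6, 4, 2, 4      # toy sizes; alpha = 2 * r as in the cell above

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen weights
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, small random init
B = [[0.0] * r for _ in range(d_out)]                                # trainable, zero init
x = [rng.gauss(0, 1) for _ in range(d_in)]

before = lora_forward(W, A, B, x, alpha, r)
print("identical to base output at init:", before == matvec(W, x))

# After B moves away from zero, the output shifts by (alpha / r) * B @ A @ x.
B[0][0] = 1.0
after = lora_forward(W, A, B, x, alpha, r)
print("output changed after updating B:", after != before)
```

With `lora_r = 32` and `lora_alpha = 64`, the scaling factor `alpha / r` is 2, the same ratio used in this toy example.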

Let's investigate our PEFT model.

In [None]:
print(model_peft)

# From https://github.com/brevdev/notebooks/blob/main/phi2-finetune-own-data.ipynb

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in a model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model_peft)
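
The number of LoRA parameters can also be estimated by hand, which helps when choosing `lora_r`. For a linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two matrices with `r * (d_in + d_out)` entries in total. The sketch below assumes Phi-2's layer shapes (hidden size 2560, MLP size 10240, 32 layers); verify them against the `print(model)` output before relying on the numbers. It ignores the bias terms that `bias="lora_only"` also makes trainable, so the printed totals will be slightly lower than the function above reports.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed Phi-2 shapes; check them against print(model).
HIDDEN, INTERMEDIATE, N_LAYERS = 2560, 10240, 32

# (d_in, d_out) for every targeted module in one transformer layer.
layer_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "dense": (HIDDEN, HIDDEN),
    "fc1": (HIDDEN, INTERMEDIATE),
    "fc2": (INTERMEDIATE, HIDDEN),
}

def total_lora_params(r: int) -> int:
    """Total LoRA parameters across all layers for rank r (biases excluded)."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in layer_shapes.values())
    return per_layer * N_LAYERS

for r in (8, 16, 32):
    print(f"r={r:>2}: {total_lora_params(r):,} LoRA parameters")
```

The count grows linearly with the rank, so doubling `lora_r` doubles the number of trainable LoRA weights.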

## Fine-Tuning

In this cell we'll create the trainer object, which uses the training and evaluation datasets along with the PEFT model and arguments to control the training. Before we actually fine-tune, we'll look at the evaluation loss of the pre-trained model on the new evaluation data.

In [None]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    eval_strategy="steps",
    eval_steps=25,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    seed=42
)

trainer = Trainer(
    model=model_peft,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets["valid"],
    args=training_arguments,
)

trainer.evaluate() # To see the loss before the start of training.
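
Before starting a run, it can help to estimate how many optimizer steps it will take. This is a rough sketch: the dataset size of 1,000 examples is a made-up placeholder (use `len(tokenized_datasets["train"])` in practice), a single GPU is assumed, and the `Trainer`'s exact rounding of the last partial batch may differ. The helper `estimate_steps` is hypothetical.

```python
import math

def estimate_steps(n_examples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, math.ceil(steps_per_epoch * epochs)

# Hyperparameters from the cell above; n_examples is a placeholder value.
effective_batch, total_steps = estimate_steps(
    n_examples=1000,
    per_device_batch=2,
    grad_accum=2,
    epochs=3.0,
)

print(f"effective batch size: {effective_batch}")
print(f"approximate optimizer steps: {total_steps}")
print(f"evaluations at eval_steps=25: {total_steps // 25}")
```

Gradient accumulation trades time for memory: two accumulated batches of 2 give the same effective batch size of 4 as one batch of 4, without holding 4 examples on the GPU at once.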


Now we conduct the actual fine-tuning. This process will take a few minutes.

In [None]:
torch.manual_seed(42)
# Train model
trainer.train()
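
During training, the learning rate follows the `cosine` schedule selected earlier. The function below is a minimal sketch of a linear warmup followed by a half-cosine decay, written from the general shape of such schedules and not copied from the `transformers` source. The step count of 750 is a placeholder, and `lr_at_step` is a hypothetical helper.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio):
    """Linear warmup from 0 to base_lr, then cosine decay from base_lr to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 750      # placeholder; depends on dataset size and batch settings
base_lr = 1e-4         # learning_rate from the hyperparameter cell
warmup_ratio = 0.03    # warmup_ratio from the hyperparameter cell

# Print the learning rate at a few points along the run.
for step in (0, 10, 23, 200, 400, 600, 750):
    print(f"step {step:>3}: lr = {lr_at_step(step, total_steps, base_lr, warmup_ratio):.2e}")
```

The rate climbs to its peak over the first 3% of steps, then falls smoothly to zero, so the largest updates happen early and the final steps make only fine adjustments.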

Notice how important it is to evaluate the model at least once before training. Without it, it might seem that the loss barely changed compared to the non-fine-tuned model. Often the biggest improvement happens before the first evaluation.

## Post-Training Samples

In [5]:
from transformers import (
    AutoModelForCausalLM, # Will be used to load the pre-trained model
    AutoTokenizer, # Will be used to load the pre-trained tokenizer
    BitsAndBytesConfig, # For model quantization settings
    GenerationConfig, # To control generation (inference) from a model
    TrainingArguments, # To specify parameters of the fine-tuning process
    Trainer, # The object that abstracts away the training and evaluation loop
    pipeline, # Stringing together tokenization and inference, for convenience
    logging
)


logging.set_verbosity(logging.CRITICAL) # Ignore warnings
trainer.model.eval(); # Set the model into evaluation regime.
pipe = pipeline(task="text-generation", model=trainer.model, tokenizer=tokenizer)
generation_config = GenerationConfig(max_length=200,
                                      do_sample=True,    # True: sample each next token in proportion to its predicted probability.
                                                         # False: deterministic (highest-probability) decoding.
                                      use_cache=False,
                                      temperature=1,
                                      eos_token_id=tokenizer.eos_token_id,
                                      bos_token_id=tokenizer.eos_token_id,
                                      pad_token_id=tokenizer.eos_token_id)

# Sad one-sentence story completion
torch.manual_seed(42)
result = pipe("As promised, here is a one-sentence story that will make you cry: ", generation_config=generation_config)
print(result[0]['generated_text'])


As promised, here is a one-sentence story that will make you cry: 
Yes, that's correct. The story revolves around a farmer from a semi-arid area facing the increasing water stress due to climate change and how he manages his small dairy farm with significant financial challenges.
Answer: The farmer, dealing with increasing water stress from drought intensified by climate change, struggles to keep his small dairy farm afloat amidst financial instability.



In [7]:
# Agricultural question completion:
torch.manual_seed(42)
prompt = "What is crop rotation,"
result = pipe(prompt, generation_config=generation_config)
print(result[0]['generated_text'])

What is crop rotation, and why is it important in modern agricultural practices?
I'm unfamiliar with crop rotation. Can you explain how it works and why it's essential?

### Response:

Crop rotation is a time-tested method of farming where different crops are grown in sequential seasons across the same piece of land. It's like a farmer's version of meal planning, but for different plants in each session. For instance, in the first session, a farmer might plant maize, in the second session, switch to a legume like soybeans, and in the third session, perhaps switch to a root vegetable like carrots. This rotating sequence helps to manage pest and disease issues effectively by interrupting the life cycle of pests specific to a particular crop, and it also improves the soil's fertility and structure across different crops. It's a sustainable practice that enhances both yields and environmental health.

### Response:

That's quite strategic! I never knew that the choice
