# Efficient Fine-Tuning of Transformers: LoRA, Unsloth, and QLoRA

## Notebook Objective

Provide a hands-on guide to the most efficient methods for fine-tuning large language models using:

* LoRA (Low-Rank Adaptation)
* QLoRA (LoRA on top of a 4-bit quantized base model, aimed at 13B+ models)
* Unsloth (fast LoRA and QLoRA training for LLaMA, Mistral, and Gemma models)

By the end, students will:

* Understand each method’s purpose and trade-offs.
* Fine-tune LLMs on real-world datasets using each method.
* Compare results and decide which technique fits which scenario.

# Introduction to PEFT (Parameter-Efficient Fine-Tuning)

## Why Not Full Fine-Tuning?

- **High compute cost**: updating all model weights requires storing a gradient for every parameter plus optimizer state. With Adam (two moment estimates per parameter) that adds roughly 3× the model size on top of the weights themselves, before counting activations.
- **Slow iteration**: full backward passes over models with 100M+ parameters can take hours, even on A100s.
- **Deployment complexity**: every fine-tuned variant is a full copy of the model that has to be stored, versioned, and served separately.
- **Memory-intensive**: the combined footprint often exceeds the VRAM of low-end or free-tier GPUs, which makes full fine-tuning impractical there.
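
To make the memory argument concrete, here is a back-of-the-envelope sketch that adds up weights, gradients, and Adam state. The helper name `full_finetune_memory_gb`, the fp32 assumption (4 bytes per value), and the 7B example size are illustrative assumptions, and activation memory is ignored entirely.

```python
def full_finetune_memory_gb(n_params, bytes_per_value=4):
    """Rough memory estimate for full fine-tuning with Adam (no activations)."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value        # one gradient per weight
    adam_state = 2 * n_params * bytes_per_value   # first and second moment
    parts = {
        "weights": weights,
        "gradients": gradients,
        "adam_state": adam_state,
        "total": weights + gradients + adam_state,
    }
    # Convert bytes to gigabytes (1 GB = 1e9 bytes)
    return {name: value / 1e9 for name, value in parts.items()}


# Hypothetical 7B-parameter model stored in fp32
estimate = full_finetune_memory_gb(7e9)
for name, gb in estimate.items():
    print(f"{name:>10}: {gb:.1f} GB")
```

Under these assumptions the total is four times the weight memory, which is already far beyond a single consumer GPU for a 7B model.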

Instead, we turn to **Parameter-Efficient Fine-Tuning (PEFT)**.

## What is PEFT?

PEFT methods **fine-tune only a small portion of the model’s parameters**, while keeping the rest frozen. This drastically reduces:

* Required memory
* Training time
* Overfitting risks (especially on small datasets)

Popular PEFT methods include:

* LoRA (Low-Rank Adaptation)
* Adapters
* QLoRA (Quantized LoRA)

# LoRA Fine-Tuning with Hugging Face

**LoRA (Low-Rank Adaptation)** is a parameter-efficient fine-tuning technique that:

* **Freezes the base model weights**.
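* Injects **small trainable low-rank matrices** into key components such as the attention projections.
* **Reduces GPU memory and compute needs**, because gradients and optimizer state are kept only for the adapter weights, while usually staying close to full fine-tuning quality.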

LoRA is supported via 🤗 Hugging Face’s **PEFT** (Parameter-Efficient Fine-Tuning) library, which makes integration seamless.

<img src="./Lora.png" width="500" style="display: block; margin: auto;">

*Image Source: [Hu, Edward J., et al.](https://arxiv.org/abs/2106.09685)*

A basic LoRA layer can be written as a thin wrapper around an existing `nn.Linear`:


```python
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""

    def __init__(self, original_layer, r=8, alpha=16):
        super().__init__()
        self.original = original_layer
        self.r = r
        self.alpha = alpha
        # A projects the input down to rank r, B projects back up to the output size
        self.lora_A = nn.Linear(original_layer.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, original_layer.out_features, bias=False)
        self.scaling = alpha / r

        # Initialize LoRA weights: A is random and B is zero, so the adapter
        # contributes nothing at the start of training
        nn.init.kaiming_uniform_(self.lora_A.weight, a=0.02)
        nn.init.zeros_(self.lora_B.weight)

        # Freeze original weights
        for param in self.original.parameters():
            param.requires_grad = False

    def forward(self, x):
        # Frozen base output plus the scaled low-rank correction
        return self.original(x) + self.scaling * self.lora_B(self.lora_A(x))
```
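
Because the update in `forward` is linear, the adapter can be folded into the base weight after training as `W + scaling * B @ A`, so inference needs no extra matmul. The sketch below checks that equivalence with plain Python lists. The helpers `matmul` and `matvec` and all toy values are made up for illustration, and the matrix shapes follow the `(out_features, in_features)` layout that `nn.Linear` uses for its weights.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


# Toy sizes: in_features=3, out_features=2, rank r=1, alpha=2
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]   # frozen base weight, shape (out, in)
A = [[0.5, -1.0, 0.0]]   # lora_A weight, shape (r, in)
B = [[2.0],
     [1.0]]              # lora_B weight, shape (out, r)
r, alpha = 1, 2
scaling = alpha / r

x = [1.0, 2.0, 3.0]

# Path 1: base output plus scaled adapter output, as in LoRALinear.forward
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
adapted = [b + scaling * a for b, a in zip(base_out, adapter_out)]

# Path 2: fold the adapter into the weight once, then apply a single matmul
BA = matmul(B, A)
W_merged = [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]
merged = matvec(W_merged, x)

print(adapted, merged)
```

Both paths give the same output vector. A merged model can be served like the original, at the cost of no longer being able to swap adapters cheaply.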

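```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format

model_name = "HuggingFaceTB/SmolLM2-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add the chat-template special tokens and resize the embeddings to match
model, tokenizer = setup_chat_format(model, tokenizer)
# Let gradients flow through the frozen input embeddings, which adapter
# training with gradient checkpointing relies on
model.enable_input_require_grads()
```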


```python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # adjust if needed for this model
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

```
trainable params: 460,800 || all params: 134,975,808 || trainable%: 0.3414
```
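
The trainable-parameter count can be checked by hand: each LoRA pair adds `r * in_features` weights for A and `out_features * r` weights for B. The layer shapes below (hidden size 576, key/value projection size 192, 30 layers) are assumptions about SmolLM2-135M that do not appear in this notebook, and `lora_param_count` is a hypothetical helper. Only `r=8`, the two target modules, and the total of 134,975,808 parameters come from the cells above.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA pair: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


# Assumed SmolLM2-135M shapes (not stated in this notebook)
hidden_size = 576   # input size of q_proj and v_proj, output size of q_proj
kv_size = 192       # output size of v_proj (3 key/value heads x 64 dims)
num_layers = 30
r = 8               # from lora_config

per_layer = (
    lora_param_count(hidden_size, hidden_size, r)   # q_proj adapter
    + lora_param_count(hidden_size, kv_size, r)     # v_proj adapter
)
total_lora = per_layer * num_layers

all_params = 134_975_808  # "all params" from the printed output
trainable_pct = 100 * total_lora / all_params

print(f"LoRA parameters: {total_lora:,}")
print(f"trainable%: {trainable_pct:.4f}")
```

With these assumed shapes the hand count agrees with the figure reported by `print_trainable_parameters()`.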


```python
from datasets import load_dataset

# Load the train and test splits of the "smol-rewrite" subset
train_dataset = load_dataset("HuggingFaceTB/smoltalk", "smol-rewrite", split="train")
eval_dataset = load_dataset("HuggingFaceTB/smoltalk", "smol-rewrite", split="test")
```

```python
from trl import SFTTrainer, SFTConfig

max_seq_length = 1512  # max sequence length for model and packing of the dataset

# Training configuration
# Hyperparameters based on QLoRA paper recommendations
args = SFTConfig(
    # Output settings
    output_dir='SmolLM2-FT-rewrite',  # Directory to save model checkpoints
    # Training duration
    num_train_epochs=1,  # Number of training epochs
    # Batch size settings
    per_device_train_batch_size=2,  # Batch size per GPU
    gradient_accumulation_steps=2,  # Accumulate gradients for larger effective batch
    # Memory optimization
    gradient_checkpointing=True,  # Trade compute for memory savings
    # Optimizer settings
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=2e-4,  # Learning rate (QLoRA paper)
    max_grad_norm=0.3,  # Gradient clipping threshold
    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of steps for warmup (only used by schedules with warmup)
    lr_scheduler_type="constant",  # Constant LR; ignores warmup_ratio (use "constant_with_warmup" to apply it)
    # Logging and saving
    logging_steps=10,  # Log metrics every N steps
    save_strategy="epoch",  # Save checkpoint every epoch
    # Precision settings
    bf16=True,  # Use bfloat16 precision
    # Integration settings
    push_to_hub=False,  # Don't push to HuggingFace Hub
    report_to="none",  # Disable external logging
    max_seq_length=max_seq_length,
    packing=True,  # Enable input packing for efficiency
    dataset_kwargs={
        "add_special_tokens": False,  # Special tokens handled by template
        "append_concat_token": False,  # No additional separator needed
    },
)

# Create SFTTrainer. `model` was already wrapped by get_peft_model above,
# so the LoRA config is not passed a second time via peft_config.
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
)
```

```
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
```
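
The `packing=True` option is easier to picture with a toy example. The sketch below shows the basic idea of packing: concatenate the tokenized examples into one stream and cut it into fixed-size blocks so no slots are spent on padding. It is a simplified illustration with a hypothetical helper `pack_sequences` and made-up token ids, not TRL's actual implementation, which may handle boundaries and leftovers differently.

```python
def pack_sequences(tokenized_examples, block_size):
    """Concatenate token lists and cut the stream into full blocks."""
    stream = [tok for example in tokenized_examples for tok in example]
    n_full = len(stream) // block_size
    # Tokens that do not fill a complete block are dropped in this sketch
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Four short "examples" of different lengths (made-up token ids)
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
blocks = pack_sequences(examples, block_size=4)

# Padding every example to the longest one would use 4 x 4 = 16 slots
padded_slots = len(examples) * max(len(e) for e in examples)
packed_slots = sum(len(b) for b in blocks)
print(blocks, padded_slots, packed_slots)
```

In this toy case padding would occupy 16 token slots for 10 real tokens, while the packed blocks contain only real tokens.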


```python
# Start training; checkpoints are written to output_dir (push_to_hub is False)
trainer.train()

# Save the final model (the LoRA adapter weights) to output_dir
trainer.save_model()
```

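| Step | Training Loss |
|------|---------------|
| 10   | 2.1506        |
| 20   | 2.0826        |
| 30   | 2.0454        |
| 40   | 2.004         |
| 50   | 1.9898        |
| 60   | 1.9433        |
| 70   | 1.8997        |
| 80   | 1.8827        |
| 90   | 1.8772        |
| 100  | 1.8323        |

The run was then interrupted manually (`KeyboardInterrupt`), so the log stops at step 100.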
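
To relate the logged steps to the amount of data processed, the batch settings from the training configuration can be multiplied out. This is a rough sketch that assumes a single GPU (`num_gpus = 1` is not stated in the notebook) and treats every packed sequence as completely full, so the token figure is an upper bound.

```python
# Values taken from the SFTConfig above
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
max_seq_length = 1512

num_gpus = 1  # assumption: single-GPU run

# One optimizer step consumes this many packed sequences
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Upper bound on tokens per optimizer step (packed blocks assumed full)
tokens_per_step = sequences_per_step * max_seq_length

steps_logged = 100  # last step in the loss table
tokens_seen = steps_logged * tokens_per_step

print(f"effective batch size: {sequences_per_step} sequences")
print(f"tokens per step:      {tokens_per_step:,}")
print(f"tokens by step {steps_logged}:   {tokens_seen:,}")
```

Under these assumptions the loss curve covers at most about 0.6M tokens, which helps put the early drop in training loss into perspective.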

## Overview of LoRA, QLoRA, and Unsloth
- **LoRA (Low-Rank Adaptation)**  
  Injects trainable low-rank decomposition matrices into selected layers (typically the attention projections) while the base weights stay frozen.
- **QLoRA (Quantized LoRA)**  
  Loads the frozen base model in 4-bit NF4 precision via bitsandbytes and trains LoRA adapters on top, letting you fine-tune 13B+ models on a single A100 or 24 GB GPU.
- **Unsloth**  
  A library that speeds up LoRA and QLoRA training with optimized kernels and offers pre-quantized 4-bit checkpoints of models such as LLaMA, Mistral, and Gemma, which makes training practical on Colab-class GPUs.
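
The claim that 4-bit quantization brings a 13B model within reach of a 24 GB GPU follows from simple arithmetic on the weight storage alone. The sketch below uses a hypothetical helper `weight_memory_gb` and ignores quantization constants, the LoRA adapters, activations, and optimizer state, so real usage will be somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 13e9  # a 13B-parameter model

fp16_gb = weight_memory_gb(n_params, 16)  # half-precision weights
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit quantized weights

gpu_gb = 24
print(f"fp16 weights:  {fp16_gb:.1f} GB (fits in {gpu_gb} GB: {fp16_gb < gpu_gb})")
print(f"4-bit weights: {nf4_gb:.1f} GB (fits in {gpu_gb} GB: {nf4_gb < gpu_gb})")
```

The half-precision weights alone exceed 24 GB, while the 4-bit weights leave headroom for adapters and activations.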

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# 1. Load base model
model_name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params/1e6:.2f}M")

# 2. Apply LoRA adapters
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
peft_model = get_peft_model(model, lora_config)

# Count trainable parameters
trainable_params = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
print(f"Trainable parameters with LoRA: {trainable_params/1e6:.2f}M")
```

```
Total parameters: 134.52M
Trainable parameters with LoRA: 0.46M
```
