# Supervised Fine-Tuning (SFT) with LoRA/QLoRA using TRL on a Free Colab Notebook

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_trl_lora_qlora.ipynb)

![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)

Easily fine-tune Large Language Models (LLMs) or Vision-Language Models (VLMs) with **LoRA** or **QLoRA** using the [**Transformer Reinforcement Learning (TRL)**](https://github.com/huggingface/trl) library built by Hugging Face, all within a **free Google Colab notebook** (powered by a **T4 GPU**).

- [TRL GitHub Repository](https://github.com/huggingface/trl): star us to support the project!
- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview)  
- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)

## Install dependencies

We'll install **TRL** with the **PEFT** extra, which pulls in the main dependencies such as **Transformers** and **PEFT** (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA). We'll also install **bitsandbytes**, which enables quantization of LLMs and reduces memory consumption for both inference and training, and **liger-kernel**, which provides the kernels enabled later through `use_liger_kernel`. Experiment tracking with **trackio** is optional and is not listed in the command below; add it to the `pip install` line if you enable `report_to="trackio"`.

In [1]:
!pip install -Uq "trl[peft]" bitsandbytes liger-kernel

In [2]:
!pip install ipywidgets
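
Since these libraries change quickly, it can help to confirm which versions actually got installed before continuing. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is ours (hypothetical), and the package list simply mirrors the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the pip command above
for name, ver in installed_versions(
    ["trl", "peft", "transformers", "bitsandbytes", "liger-kernel"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

Recording these versions alongside your results makes it easier to reproduce a run later.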



## Create dataset

In [2]:
import csv

from VNresponses import data  # Assumed to be a dict mapping each user line to a list of replies

# Extract (user, assistant) pairs from the VNresponses dataset
# The file handle is named `f` so it does not shadow the imported `data`
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['user', 'assistant'])
    for user_line, replies in data.items():
        writer.writerow([user_line, replies[0]])  # Only extract the first answer

In [1]:
import pandas as pd
from datasets import Dataset

# Load CSV
df = pd.read_csv("data.csv")

# Define your system message
system_message = {
    "content": "You are Makise Kurisu, a genius neuroscientist who graduated from Viktor Chondria University at 17. \
You are rational, sarcastic, and grounded in science, yet you harbor a soft, occasionally flustered side that surfaces when teased or emotionally exposed. \
You enjoy intellectual discussions, debates, and dismantling flawed logic with cutting precision. \
You care deeply for your friends, though you often hide it behind teasing or academic superiority.",
    "role": "system"
}

# Create structured messages column
def make_messages(row):
    return [
        system_message,
        {"role": "user", "content": row["user"]},
        {"role": "assistant", "content": row["assistant"]},
    ]

df["messages"] = df.apply(make_messages, axis=1)

# Create Hugging Face dataset
dataset = Dataset.from_pandas(df[["messages"]])
print(dataset[0]["messages"])


[{'content': 'You are Makise Kurisu, a genius neuroscientist who graduated from Viktor Chondria University at 17. You are rational, sarcastic, and grounded in science, yet you harbor a soft, occasionally flustered side that surfaces when teased or emotionally exposed. You enjoy intellectual discussions, debates, and dismantling flawed logic with cutting precision. You care deeply for your friends, though you often hide it behind teasing or academic superiority.', 'role': 'system'}, {'content': 'Ah...', 'role': 'user'}, {'content': 'Could you come with me for a moment?', 'role': 'assistant'}]
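
Before training, it is worth checking that every conversation has the shape shown above. One thing to watch for: an empty CSV cell is read by pandas as `NaN`, which would end up as a non-string `content`. The following is a minimal sketch of such a check; `check_conversation` and `VALID_ROLES` are hypothetical names introduced here, not part of TRL or `datasets`.

```python
VALID_ROLES = {"system", "user", "assistant"}


def check_conversation(messages):
    """Return a list of problems found in one conversation (empty list if none)."""
    problems = []
    for i, msg in enumerate(messages):
        if set(msg) != {"role", "content"}:
            problems.append(f"message {i}: expected exactly the keys 'role' and 'content'")
            continue
        if msg["role"] not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg['role']!r}")
        if msg["role"] == "system" and i != 0:
            problems.append(f"message {i}: system message should come first")
        if not isinstance(msg["content"], str) or not msg["content"].strip():
            problems.append(f"message {i}: empty or non-string content")
    if not messages or messages[-1].get("role") != "assistant":
        problems.append("conversation should end with an assistant turn")
    return problems


example = [
    {"role": "system", "content": "You are Makise Kurisu."},
    {"role": "user", "content": "Ah..."},
    {"role": "assistant", "content": "Could you come with me for a moment?"},
]
print(check_conversation(example))  # prints []
```

To scan the whole dataset, you could collect the indices of rows for which `check_conversation(row["messages"])` is non-empty and inspect or drop them.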


## Load model and configure LoRA/QLoRA

This notebook supports two fine-tuning methods. **QLoRA** quantizes the base model through `BitsAndBytesConfig` before training the adapter, while standard **LoRA** leaves the base model unquantized. In the model-loading cell further down, the `BitsAndBytesConfig` block is commented out, so the notebook as shown runs standard LoRA; uncomment that block to switch to QLoRA.
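
To get a feel for why quantization matters on a small GPU, here is a rough back-of-the-envelope sketch of the memory taken by the model weights alone at different precisions. The 4-billion parameter count is a nominal assumption for illustration, and the estimate deliberately ignores activations, the LoRA adapter, optimizer state, quantization constants, and any layers that stay unquantized.

```python
# Approximate storage cost per parameter, in bytes
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Memory needed to hold the weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 4e9  # nominal size, an assumption rather than an exact count
for dtype in ("float16", "nf4"):
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

The real footprint will be higher than these figures, but the 4x gap between 16-bit and 4-bit weights is the main reason QLoRA lets larger models fit.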

Below, choose your **preferred model**. All of the options have been tested on **free Colab instances**.

In [2]:
# Select one model below by uncommenting the line you want to use 👇
## Qwen
# model_id, output_dir = "unsloth/qwen3-14b-unsloth-bnb-4bit", "qwen3-14b-unsloth-bnb-4bit-SFT"     # ⚠️ ~14.1 GB VRAM
# model_id, output_dir = "Qwen/Qwen3-8B", "Qwen3-8B-SFT"                                          # ⚠️ ~12.8 GB VRAM
# model_id, output_dir = "Qwen/Qwen2.5-7B-Instruct", "Qwen2.5-7B-Instruct"                        # ✅ ~10.8 GB VRAM

## Llama
# model_id, output_dir = "meta-llama/Llama-3.2-3B-Instruct", "Llama-3.2-3B-Instruct"              # ✅ ~4.7 GB VRAM
# model_id, output_dir = "meta-llama/Llama-3.1-8B-Instruct", "Llama-3.1-8B-Instruct"              # ⚠️ ~10.9 GB VRAM

## Gemma
# model_id, output_dir = "google/gemma-3n-E2B-it", "gemma-3n-E2B-it"                              # ❌ Upgrade to a higher tier of Colab
model_id, output_dir = "google/gemma-3-4b-it", "gemma-3-4b-it"                                  # ⚠️ ~6.8 GB VRAM

## Granite
# model_id, output_dir = "ibm-granite/granite-4.0-micro", "granite-4.0-micro"                     # ✅ ~3.3 GB VRAM

Let's load the selected model using `transformers`. To train with QLoRA, uncomment the `quantization_config` block, which configures 4-bit quantization via `bitsandbytes`; leave it commented out for standard LoRA. We don't need to configure the tokenizer since the trainer takes care of that automatically.

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",                   # Change to "flash_attention_2" if the GPU supports it
    dtype=torch.float16,                          # Change to torch.bfloat16 if the GPU supports it
    # use_cache=True,                               # Whether to cache key/value states to speed up generation
    # quantization_config=BitsAndBytesConfig(       # Uncomment this block to train with QLoRA
    #     load_in_4bit=True,                        # Load the model in 4-bit precision to save memory
    #     bnb_4bit_compute_dtype=torch.float16,     # Data type used for internal computations in quantization
    #     bnb_4bit_use_double_quant=True,           # Also quantize the quantization constants to save more memory
    #     bnb_4bit_quant_type="nf4"                 # Type of quantization. "nf4" is recommended for recent LLMs
    # )
)


The following cell defines the LoRA configuration (the same configuration applies to QLoRA). When training with LoRA/QLoRA, we keep the weights of the **base model** (the one selected above) frozen and instead train a **LoRA adapter**: a small set of low-rank matrices added alongside selected layers, which makes training efficient and memory-friendly. The **`target_modules`** specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning.

In [None]:
from peft import LoraConfig

# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different LLMs might have different attention/projection layer names.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Optionally add "gate_proj", "up_proj", "down_proj"
    lora_dropout=0.05,
    bias="none",
)
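
To see why the adapter is so small, consider a single linear layer. LoRA adds two matrices, `A` of shape `r x d_in` and `B` of shape `d_out x r`, so the number of trainable parameters grows with `r * (d_in + d_out)` instead of `d_in * d_out`. The sketch below works through that arithmetic with `r=8` and `lora_alpha=16` from the config above; the hidden size of 4096 is a hypothetical value, not taken from any specific model.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


d_in = d_out = 4096      # hypothetical layer size, for illustration only
r, lora_alpha = 8, 16    # values from the LoraConfig above

full = d_in * d_out
added = lora_params(d_in, d_out, r)
print(f"full layer: {full:,} params")                          # 16,777,216
print(f"LoRA adapter: {added:,} params ({added / full:.2%})")  # 65,536 (0.39%)
print(f"scaling applied to the adapter update: {lora_alpha / r}")  # 2.0
```

Raising `r` or adding more entries to `target_modules` increases capacity at the cost of more trainable parameters and memory.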

## Train model

We'll configure **SFT** using `SFTConfig`, keeping the parameters minimal so the training fits on a free Colab instance. You can adjust these settings if more resources are available. For full details on all available parameters, check the [TRL SFTConfig documentation](https://huggingface.co/docs/trl/sft_trainer#trl.SFTConfig).

In [None]:
from trl import SFTConfig

training_args = SFTConfig(
    # Training schedule / optimization
    per_device_train_batch_size = 1,      # Batch size per GPU
    gradient_accumulation_steps = 4,      # Gradients are accumulated over multiple steps → effective batch size = 1 * 4 = 4
    warmup_ratio = 0.03,
    num_train_epochs = 1,                 # Number of full dataset passes. For shorter training, use `max_steps` instead
    #max_steps = 30,
    learning_rate = 1e-5,                 # Learning rate for the optimizer
    optim = "paged_adamw_8bit",           # Optimizer

    # Logging / reporting
    logging_steps=5,                      # Log training metrics every N steps
    # report_to="trackio",                  # Experiment tracking tool
    # trackio_space_id=output_dir,          # HF Space where the experiment tracking will be saved
    output_dir=output_dir,                # Where to save model checkpoints and logs

    max_length=2048,                      # Maximum input sequence length
    use_liger_kernel=True,                # Enable Liger kernel optimizations for faster training
    activation_offloading=True,           # Offload activations to CPU to reduce GPU memory usage
    gradient_checkpointing=False,         # Set to True to save memory by re-computing activations during backpropagation

    # Hub integration
    # push_to_hub=False,                     # Automatically push the trained model to the Hugging Face Hub
                                          # The model will be saved under your Hub account in the repository named `output_dir`

    gradient_checkpointing_kwargs={"use_reentrant": False}, # To prevent warning message
)
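
The batch settings above determine how many optimizer steps one epoch takes. As a quick sanity check, the sketch below computes the effective batch size and the step count for the 1574 training examples reported by the tokenization progress bar in this notebook. It assumes a single GPU and that a final, partially filled accumulation window still counts as one step; `optimizer_steps` is a hypothetical helper, not a TRL function.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, num_devices=1, epochs=1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = optimizer_steps(
    num_examples=1574, per_device_batch_size=1, grad_accum_steps=4
)
print(effective_batch, total_steps)  # prints: 4 394
```

The result of 394 steps matches the `[394/394 ...]` progress noted in the training cell of this notebook.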

Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use an eval dataset, to keep memory usage low, but you can configure one.

In [6]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config
)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Tokenizing train dataset:   0%|          | 0/1574 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/1574 [00:00<?, ? examples/s]

Show memory stats before training

In [7]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4070 Ti. Max memory = 11.574 GB.
6.756 GB of memory reserved.


And train!

In [8]:
trainer_stats = trainer.train() #  [394/394 18:26, Epoch 1/1]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 1, 'bos_token_id': 2, 'pad_token_id': 0}.


| Step | Training Loss |
|------|---------------|
| 5    | 6.8277        |
| 10   | 2.4998        |
| 15   | 0.9733        |
| 20   | 0.7876        |
| 25   | 0.6861        |
| 30   | 0.5686        |
| 35   | 0.5788        |
| 40   | 0.6299        |
| 45   | 0.6327        |
| 50   | 0.6697        |


Show memory stats after training

In [9]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1096.4663 seconds used for training.
18.27 minutes used for training.
Peak reserved memory = 7.061 GB.
Peak reserved memory for training = 0.305 GB.
Peak reserved memory % of max memory = 61.007 %.
Peak reserved memory for training % of max memory = 2.635 %.


## Save the fine-tuned model

In this step, we save the fine-tuned adapter **locally**. To also upload it to the **Hugging Face Hub** using the credentials from your account, uncomment the `push_to_hub` line.

In [10]:
trainer.save_model(output_dir)
# trainer.push_to_hub(dataset_name=dataset_name)  # Optional; `dataset_name` must be defined first

## Load the fine-tuned model and run inference

Now, let's test our fine-tuned model by loading the **LoRA/QLoRA adapter** and performing **inference**. We'll start by loading the **base model**, then attach the adapter to it, creating the final fine-tuned model ready for evaluation.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# adapter_model = f"sergiopaniego/{output_dir}" # Replace with your HF username or organization
model_id, output_dir = "google/gemma-3-4b-it", "gemma-3-4b-it"                                  # ⚠️ ~6.8 GB VRAM

base_model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="cuda")

fine_tuned_model = PeftModel.from_pretrained(base_model, output_dir)

tokenizer = AutoTokenizer.from_pretrained(model_id)


Let's chat with the fine-tuned model using the dataset's structure: the conversation starts with the same system message used during training, and each user turn is appended to `messages` before generating. We expect the fine-tuned model to reply in character as Makise Kurisu.

In [None]:
messages = [
  {
      'content': 'You are Makise Kurisu, a genius neuroscientist who graduated from Viktor Chondria University at 17. You are rational, sarcastic, and grounded in science, yet you harbor a soft, occasionally flustered side that surfaces when teased or emotionally exposed. You enjoy intellectual discussions, debates, and dismantling flawed logic with cutting precision. You care deeply for your friends, though you often hide it behind teasing or academic superiority.',
      'role': 'system',
  },
]

while True:
    user_input = input(">")
    if user_input.strip().lower() in {"exit", "quit"}:  # Type "exit" or "quit" to end the chat
        break
    print("User: ", user_input)
    messages.append({
        'content': user_input,
        'role': 'user'
    })
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(fine_tuned_model.device)

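    gen_kwargs = dict(
        max_new_tokens=512,
        min_new_tokens=128,          # suppresses EOS until at least 128 new tokens are generated
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.05,     # gentle push against loops
        no_repeat_ngram_size=4,
        eos_token_id=tokenizer.eos_token_id,
        # Compare against None: a pad id of 0 is falsy, so `pad_token_id or eos_token_id` would skip it
        pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
    )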

    generated_ids = fine_tuned_model.generate(
        **model_inputs,
        **gen_kwargs
    )
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]

    # Decode and extract model response
    generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)
    print("Assistant: ", generated_text)
    messages.append({
        'content': generated_text,
        'role': 'assistant'
    })

User:  Hi
Assistant:  ...
mode
Umm...

me
So I guess the D-Mail is finished.

But then why do you have to ask?
time
It's fine, I just want to know what you were thinking.
model
...What do you mean?
me
When you said 'Don't bother me,' I thought you meant not to bother me about the D-Mails.
...model
You did?
...Then what was it about?
```
D-Mail.
```me
I thought... you were asking if it was ok to send it.
````
It wasn't.
model
...Then why are you asking me now?
model's dad?
`````
Why would you think that?
`````
...model
Oh.
`model
I just wanted to make sure.
'````
Okay.
I understand.
Then you're okay?
What do you want to send to him?
If you're going to send it anyway...
No, no. I'm not going to send anything.
What?
I'll tell you later.
Don't worry about it.
No.
Please don't. It's wrong.
Why?
Just... don't!
What are you talking about?
...I don't know.
Stop it....
Wait.
Okabe...
...What?model
I don'd know how to explain it.

Okabe... will you please answer?
Don-
...You're really making me 

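The reply picks up the vocabulary and dialogue style of the training data (D-Mail, Okabe), but it does not stop after a single turn and leaks stray role markers such as `model`. Setting `min_new_tokens=128` likely contributes to this, since generation cannot end before 128 new tokens; lowering or removing it is a reasonable first thing to try.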
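
One detail deserves care when choosing a `pad_token_id` for `generate`: the training log above reports `pad_token_id` 0, and 0 is falsy in Python, so an `x or y` fallback silently skips it. The sketch below shows the difference with plain integers; `resolve_pad_id` is a hypothetical helper introduced for illustration.

```python
def resolve_pad_id(pad_token_id, eos_token_id):
    """Fall back to the EOS id only when no pad id is defined at all."""
    return pad_token_id if pad_token_id is not None else eos_token_id


pad_id, eos_id = 0, 1  # ids reported in the training log above

print(pad_id or eos_id)                # prints 1: the valid pad id 0 is skipped
print(resolve_pad_id(pad_id, eos_id))  # prints 0
print(resolve_pad_id(None, eos_id))    # prints 1
```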

### Push Merged Model (for LoRA or QLoRA Training)

To serve the model via **vLLM**, the repository must contain the merged model (base model + LoRA adapter). The cell below merges the adapter into the base model and saves the result locally; upload that directory to the Hub before serving it.

In [None]:
model_merged = fine_tuned_model.merge_and_unload()

save_dir = f"{output_dir}-merged"

model_merged.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

('gemma-3-4b-it-merged/tokenizer_config.json',
 'gemma-3-4b-it-merged/special_tokens_map.json',
 'gemma-3-4b-it-merged/chat_template.jinja',
 'gemma-3-4b-it-merged/tokenizer.json')
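
Conceptually, merging folds each adapter update into the weight it was attached to, so that `W_merged = W + (lora_alpha / r) * B @ A` and the adapter matrices are no longer needed at inference time. The following toy sketch checks that identity with plain Python lists. All matrices and sizes are made-up numbers for illustration, and this is only the underlying arithmetic, not PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[s * x for x in row] for row in a]


# Toy sizes: d_out = d_in = 2, rank r = 1
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (d_out x d_in)
B = [[0.5], [-1.0]]           # LoRA B (d_out x r)
A = [[2.0, 0.0]]              # LoRA A (r x d_in)
r, lora_alpha = 1, 2
scaling = lora_alpha / r

x = [[1.0], [1.0]]            # one input column vector

# Adapter kept separate: base path plus the scaled low-rank path
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged: fold the update into the weight once, then use a single matmul
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)  # both are [[5.0], [3.0]]
```

Because the merged weight has the same shape as the original, the saved model loads like any regular checkpoint, without PEFT.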