# Supervised Fine-Tuning (SFT) with TRL and Qwen2-0.5B

This Colab notebook demonstrates Supervised Fine-Tuning (SFT) with the `trl` library on the Qwen2-0.5B model. We load the pre-trained base model, prepare a dataset (a small subset of GSM8K), and fine-tune the model on it, then compare its outputs before and after training.

# Setup

In [1]:
pip install trl

Collecting trl
  Downloading trl-0.27.1-py3-none-any.whl.metadata (11 kB)
Downloading trl-0.27.1-py3-none-any.whl (532 kB)
Installing collected packages: trl
Successfully installed trl-0.27.1


## Imports

In [2]:
import torch
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig

## Check GPU - Colab Instance

In [3]:
import torch

# Check if CUDA (GPU) is available
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Current CUDA device: {torch.cuda.current_device()}")
    print(f"CUDA device name: {torch.cuda.get_device_name(0)}")

CUDA available: True
Current CUDA device: 0
CUDA device name: Tesla T4


In [4]:
# Show GPU status and memory usage (usage stays near zero until a model is loaded)
!nvidia-smi

Sat Jan 31 23:15:13 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   39C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Global Parameters

In [5]:
MAX_TOKEN_SIZE = 100
MODEL_NAME = 'Qwen/Qwen2-0.5B'
USE_ACCELERATOR = True # Use a GPU/TPU when one is available

## Helper Functions

In [6]:
def generate_responses(model, tokenizer, user_message, system_message=None,
                       max_new_tokens=MAX_TOKEN_SIZE):
    # Format chat using tokenizer's chat template
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})

    # We assume every example is a single-turn conversation
    messages.append({"role": "user", "content": user_message})

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    # Move the tokenized inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # For faster inference, consider a dedicated engine such as vLLM, SGLang, or TensorRT
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    input_len = inputs["input_ids"].shape[1]
    generated_ids = outputs[0][input_len:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return response
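
For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, which is why the function above slices at `input_len` before decoding. A toy illustration, with plain Python lists standing in for token-id tensors (the ids are made up, and `strip_prompt` is a hypothetical helper, not part of `transformers`):

```python
def strip_prompt(output_ids, prompt_len):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_len:]


prompt_ids = [101, 7, 42]             # made-up ids for the prompt
output_ids = prompt_ids + [9, 13, 2]  # the output echoes the prompt first
print(strip_prompt(output_ids, len(prompt_ids)))  # [9, 13, 2]
```

Without this slice, the decoded response would repeat the full chat-formatted prompt before the answer.
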

In [7]:
def test_model_with_questions(model, tokenizer, questions,
                              system_message=None, title="Model Output"):
    print(f"\n=== {title} ===")
    for i, question in enumerate(questions, 1):
        response = generate_responses(model, tokenizer, question,
                                      system_message)
        print(f"\nModel Input {i}:\n{question}\nModel Output {i}:\n{response}\n")


In [8]:
def load_model_and_tokenizer(model_name, use_accelerator=False):

    # Load base model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    if use_accelerator:
        if torch.cuda.is_available():
            device = "cuda"
            print(f"Using CUDA device: {device}")
        else:
            try:
                # torch_xla is only installed on TPU runtimes
                import torch_xla.core.xla_model as xm
                device = xm.xla_device()
                print(f"Using XLA device: {device}")
            except (ImportError, RuntimeError):
                device = "cpu"
                print("No accelerator found (CUDA or XLA), falling back to CPU.")
        model.to(device)
    else:
        device = "cpu"
        model.to(device)
        print("Accelerator disabled, falling back to CPU.")

    if not tokenizer.chat_template:
        tokenizer.chat_template = """{% for message in messages %}
                {% if message['role'] == 'system' %}System: {{ message['content'] }}\n
                {% elif message['role'] == 'user' %}User: {{ message['content'] }}\n
                {% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }} <|endoftext|>

                {% endif %}
                {% endfor %}"""

    # Tokenizer config
    if not tokenizer.pad_token:
        tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

In [9]:
def display_dataset(dataset):
    # Visualize the dataset
    rows = []
    for i in range(3):
        example = dataset[i]
        user_msg = next(m['content'] for m in example['messages']
                        if m['role'] == 'user')
        assistant_msg = next(m['content'] for m in example['messages']
                             if m['role'] == 'assistant')
        rows.append({
            'User Prompt': user_msg,
            'Assistant Response': assistant_msg
        })

    # Display as table
    df = pd.DataFrame(rows)
    pd.set_option('display.max_colwidth', None)  # Avoid truncating long strings
    display(df)

## Load Base Model

In [10]:
questions = [
    "Give me an 1-sentence introduction of LLM." ,
    "What's the difference between thread and process?",
    "Calculate 2^4",
    "I am a all mountain skier, give me a best carving ski recommendation!"
]

In [11]:
model, tokenizer = load_model_and_tokenizer(MODEL_NAME, USE_ACCELERATOR)

test_model_with_questions(model, tokenizer, questions,
                          title="Base Model (Before SFT) Output")

del model, tokenizer

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Using CUDA device: cuda

=== Base Model (Before SFT) Output ===

Model Input 1:
Give me an 1-sentence introduction of LLM.
Model Output 1:
LLM stands for "language model," which is a type of artificial intelligence that is trained to generate human-like text. It is used in natural language processing (NLP) tasks such as text generation, summarization, and question answering. LLMs are capable of generating text that is similar to human language, but they are not capable of understanding or generating human-like emotions or opinions. LLMs are used in various applications such as chatbots, virtual assistants, and natural language processing systems.


Model Input 2:
What's the difference between thread and process?
Model Output 2:
The main difference between a thread and a process is that a thread is a small, isolated unit of execution that is executed in a separate process, while a process is a larger, more complex unit of execution that is executed in a separate process.
What is the sol

## Prepare the Dataset for SFT

`SFTTrainer` expects the dataset in a specific format. Each entry should be a dictionary with a `messages` key holding a list of message dictionaries (each with `role` and `content`). We use the `openai/gsm8k` dataset, which contains grade school math problems, and take a small subset of it to run SFT on Qwen2-0.5B.
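
As a minimal sketch of the record shape described above, the snippet below builds one single-turn record and checks it structurally. The helper names `to_messages` and `is_valid_record` are hypothetical and not part of TRL, and the sample question and answer are made up in the style of GSM8K.

```python
def to_messages(question, answer):
    """Wrap a question/answer pair as a single-turn conversation record."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


def is_valid_record(record):
    """Check that a record holds a non-empty list of {role, content} dicts."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    allowed_roles = {"system", "user", "assistant"}
    return all(
        isinstance(m, dict)
        and m.get("role") in allowed_roles
        and isinstance(m.get("content"), str)
        for m in messages
    )


# Made-up example in the GSM8K answer style
record = to_messages("What is 2 + 3?", "2 + 3 = <<2+3=5>>5\n#### 5")
print(is_valid_record(record))
```

A check like this is cheap to run over a whole dataset before training, so malformed rows surface early rather than inside the trainer.
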

In [12]:
# Load the GSM8K grade school math dataset.
# The "main" configuration contains the train and test splits.
dataset_gsm8k = load_dataset("openai/gsm8k", "main")

# Function to format the gsm8k data into the SFTTrainer's expected 'messages' format
def format_gsm8k_example(example):
    return {
        "messages": [
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["answer"]}
        ]
    }

# Apply the formatting function to the training split
dataset = dataset_gsm8k["train"].map(format_gsm8k_example, remove_columns=dataset_gsm8k["train"].column_names)

# Display the first few examples to verify the format
print("Formatted GSM8K Dataset (first 3 examples):")
display_dataset(dataset)


Formatted GSM8K Dataset (first 3 examples):


| | User Prompt | Assistant Response |
|---|---|---|
| 0 | Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? | Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72 |
| 1 | Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? | Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.\nWorking 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.\n#### 10 |
| 2 | Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet? | In the beginning, Betty has only 100 / 2 = $<<100/2=50>>50.\nBetty's grandparents gave her 15 * 2 = $<<15*2=30>>30.\nThis means, Betty needs 100 - 50 - 30 - 15 = $<<100-50-30-15=5>>5 more.\n#### 5 |


In [13]:
print(len(dataset))

7473


In [18]:
train_dataset = dataset.select(range(100))

In [19]:
display_dataset(train_dataset)

| | User Prompt | Assistant Response |
|---|---|---|
| 0 | Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? | Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72 |
| 1 | Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? | Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.\nWorking 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.\n#### 10 |
| 2 | Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet? | In the beginning, Betty has only 100 / 2 = $<<100/2=50>>50.\nBetty's grandparents gave her 15 * 2 = $<<15*2=30>>30.\nThis means, Betty needs 100 - 50 - 30 - 15 = $<<100-50-30-15=5>>5 more.\n#### 5 |


## Load Base Model for Fine-Tuning

In [15]:
base_model, tokenizer = load_model_and_tokenizer(MODEL_NAME, USE_ACCELERATOR)

Using CUDA device: cuda


## Configure and Run SFT Trainer

Now, let's set up the SFTTrainer. We'll define some `SFTConfig` parameters for our training run.

In [23]:
sft_config = SFTConfig(
    output_dir="./Qwen2-0.5B-SFT", # Directory to save the model and logs
    num_train_epochs=2, # Number of training epochs
    per_device_train_batch_size=2, # Batch size per GPU/TPU core
    gradient_accumulation_steps=8, # Number of batches whose gradients are accumulated before each optimizer update
    gradient_checkpointing=True, # Recompute activations in the backward pass to save memory
    # optim="paged_adamw_32bit", # Optimizer to use
    learning_rate=8e-5, # Learning rate
    fp16=True, # Use fp16 mixed precision training
    logging_steps=10, # Log every N steps
    save_steps=20, # Save checkpoint every N steps
    push_to_hub=False, # Whether to push the model to the Hugging Face Hub
    # max_length=MAX_TOKEN_SIZE, # Maximum sequence length for training (called max_seq_length in older TRL releases)
    dataset_text_field="messages", # Column holding the training data; datasets with a "messages" column are treated as conversational
    packing=True, # Pack multiple short examples into one longer sequence to save compute
)


In [24]:
trainer = SFTTrainer(
    model=base_model,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    args=sft_config,
)

# Start training
trainer.train()



Step,Training Loss


TrainOutput(global_step=4, training_loss=0.9295732975006104, metrics={'train_runtime': 245.3821, 'train_samples_per_second': 0.171, 'train_steps_per_second': 0.016, 'total_flos': 88652793934848.0, 'train_loss': 0.9295732975006104})
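
The reported `global_step=4` can be sanity-checked with a little arithmetic. The sketch below assumes steps are counted with ceiling division per epoch, which may differ between library versions; it is not the Trainer's actual code, and `optimizer_steps` is a hypothetical helper. The number of packed sequences is not shown in the notebook, so the value 20 is purely illustrative.

```python
import math


def optimizer_steps(num_sequences, per_device_batch_size, grad_accum_steps,
                    num_epochs, num_devices=1):
    """Estimate the number of optimizer updates for a training run.

    Assumes the last partial accumulation window of each epoch still
    triggers an update (ceiling division).
    """
    batches_per_epoch = math.ceil(
        num_sequences / (per_device_batch_size * num_devices)
    )
    steps_per_epoch = max(math.ceil(batches_per_epoch / grad_accum_steps), 1)
    return steps_per_epoch * num_epochs


# 100 unpacked examples with the settings used in sft_config
print(optimizer_steps(100, 2, 8, 2))  # 14
# A hypothetical 20 packed sequences (illustrative value only)
print(optimizer_steps(20, 2, 8, 2))   # 4
```

Under these assumptions, 4 steps over 2 epochs is consistent with packing having merged the 100 examples into roughly 17 to 32 sequences. Because `logging_steps=10` is larger than the total step count, no loss row would be logged, which would explain the empty Step/Training Loss table above.
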

## Test the fine-tuned model

After training, let's see how the fine-tuned model performs on the questions.

Comment: the model is overfitting. The training set is a small sample of grade school math problems, so the model overfits to the formatted style in which those problems are answered.
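
One way to examine this more quantitatively would be to score generations on held-out GSM8K problems. The reference answers end with a `#### <number>` line, as in the examples displayed earlier. Below is a minimal sketch of an answer extractor; `extract_final_answer` and `is_correct` are hypothetical helpers, not part of TRL or `datasets`, and the comparison is a plain string match (so `10` and `10.0` would count as different).

```python
import re

# Number after a '####' marker, allowing a sign, thousands separators, decimals
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(text):
    """Return the number after the last '####' marker, or None if absent."""
    matches = _FINAL_ANSWER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(prediction, reference):
    """Compare the final answers of a model output and a reference solution."""
    predicted = extract_final_answer(prediction)
    expected = extract_final_answer(reference)
    return predicted is not None and predicted == expected


reference = "Natalia sold 48+24 = <<48+24=72>>72 clips.\n#### 72"
print(extract_final_answer(reference))
print(is_correct("The following\nA.", reference))
```

Averaging `is_correct` over a held-out split, for both the base and the fine-tuned model, would give a simple accuracy number to set beside the qualitative outputs below.
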

In [22]:
test_model_with_questions(base_model, tokenizer, questions,
                          title="Fine-tuned Model (After SFT) Output")

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Caching is incompatible with gradient checkpointing in Qwen2DecoderLayer. Setting `past_key_values=None`.



=== Fine-tuned Model (After SFT) Output ===

Model Input 1:
Give me an 1-sentence introduction of LLM.
Model Output 1:
L'In the function (1.


Model Input 2:
What's the difference between thread and process?
Model Output 2:
The following
A.


Model Input 3:
Calculate 2^4
Model Output 3:
To ensure that is a.


Model Input 4:
I am a all mountain skier, give me a best carving ski recommendation!
Model Output 4:
Based on the function (1.



In [25]:
test_model_with_questions(base_model, tokenizer, questions,
                          title="Fine-tuned Model (After SFT) Output - 2 epochs")


=== Fine-tuned Model (After SFT) Output - 2 epochs ===

Model Input 1:
Give me an 1-sentence introduction of LLM.
Model Output 1:
Lapsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystem


Model Input 2:
What's the difference between thread and process?
Model Output 2:
Threadsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsystemsy