<a href="https://colab.research.google.com/github/Vinooj/llm-fine_tuning-experiments/blob/main/qwen3__4b_grpo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**NEW** Unsloth now supports training the new **gpt-oss** model from OpenAI! You can start fine-tuning gpt-oss for free with our **[Colab notebook](https://x.com/UnslothAI/status/1953896997867729075)**!

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
# %%capture is a cell magic, so it is placed on the first line of the cell; it suppresses the installation output.
# Outside Google Colab, this cell installs unsloth and vllm with pip; inside Colab it does nothing (see the next cell).
import os
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install or uv pip install
    !pip install unsloth vllm
else:
    pass # For Colab / Kaggle, we need extra instructions hidden below \/

In [None]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
!pip install --upgrade -qqq uv
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install!
    !pip install unsloth vllm
else:
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except: is_t4 = False
    get_vllm, get_triton = ("vllm==0.10.1", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    !uv pip install -qqq --upgrade \
        unsloth {get_vllm} {get_numpy} torchvision bitsandbytes xformers
    !uv pip install -qqq {get_triton}
!uv pip install transformers==4.55.4

### Unsloth

Goal: To convert `Qwen3-4B-Base` into a reasoning model via GRPO by using OpenR1's Math dataset.

We first run a short supervised "pre fine-tune" so the model already follows our output format. GRPO then does not have to spend steps learning the formatting, which speeds it up.

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)

### GRPO chat template
Since we're using a base model, we should set a chat template. You can make your own chat template as well!
1. DeepSeek uses `<think>` and `</think>`, but this is **not** necessary - you can customize it however you like!
2. A `system_prompt` is recommended to at least guide the model's responses.

In [None]:
reasoning_start = "<start_working_out>" # Acts as <think>
reasoning_end   = "<end_working_out>"   # Acts as </think>
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>"""
system_prompt

We create a simple chat template below. Notice that when `add_generation_prompt` is set, the template appends `<start_working_out>` to the end of the prompt, which guides the model to begin with its reasoning process.

In [None]:
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '{reasoning_start}' }}"\
    "{% endif %}"

# Replace the placeholders with our specific system prompt and reasoning marker:
chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")
tokenizer.chat_template = chat_template

Let's see how our chat template behaves on an example:

In [None]:
tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : f"<start_working_out>I think it's 2.<end_working_out><SOLUTION>2</SOLUTION>"},
    {"role" : "user", "content" : "What is 2+2?"},
], tokenize = False, add_generation_prompt = True)
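
To make the template logic easier to follow, here is a minimal plain-Python sketch that mirrors the Jinja branches above. The function name `render_chat`, the `"<EOS>"` placeholder, and the `"SYSTEM PROMPT"` string are illustrative assumptions; the notebook itself uses `tokenizer.eos_token` and the real `system_prompt`.

```python
def render_chat(messages, default_system, eos_token, reasoning_start,
                add_generation_prompt=False):
    """Plain-Python mirror of the Jinja chat template (illustrative only)."""
    parts = []
    # A leading system message is used as-is; otherwise the default is inserted.
    if messages and messages[0]["role"] == "system":
        parts.append(messages[0]["content"] + eos_token)
        loop_messages = messages[1:]
    else:
        parts.append(default_system + eos_token)
        loop_messages = messages
    for message in loop_messages:
        if message["role"] == "user":
            parts.append(message["content"])              # no EOS after user turns
        elif message["role"] == "assistant":
            parts.append(message["content"] + eos_token)  # EOS closes assistant turns
    if add_generation_prompt:
        parts.append(reasoning_start)                     # nudge the model to start reasoning
    return "".join(parts)


EOS = "<EOS>"  # placeholder, not the real tokenizer.eos_token
rendered = render_chat(
    [
        {"role": "user", "content": "What is 1+1?"},
        {"role": "assistant", "content": "<start_working_out>I think it's 2.<end_working_out><SOLUTION>2</SOLUTION>"},
        {"role": "user", "content": "What is 2+2?"},
    ],
    default_system="SYSTEM PROMPT",
    eos_token=EOS,
    reasoning_start="<start_working_out>",
    add_generation_prompt=True,
)
print(rendered)
```

The printed string starts with the system prompt followed by the EOS placeholder, and ends with `What is 2+2?<start_working_out>`, which is exactly where the model is expected to continue.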

### Pre fine-tuning for formatting
We now use a subset of NVIDIA's [Open Math Reasoning dataset](https://huggingface.co/datasets/nvidia/OpenMathReasoning) which was filtered to only include high quality DeepSeek R1 traces.

We'll keep only ~59 or so examples to first "prime" / pre fine-tune the model to understand our custom GRPO formatting.

In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np

dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
dataset = dataset.to_pandas()[
    ["expected_answer", "problem", "generated_solution"]
]

# Try converting to number - if not, replace with NaN
is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors = "coerce").notnull()
# Select only numbers
dataset = dataset.iloc[np.where(is_number)[0]]

dataset

We have to format the dataset to follow our GRPO style formatting:

In [None]:
def format_dataset(x):
    expected_answer = x["expected_answer"]
    problem = x["problem"]

    # Remove generated <think> and </think>
    thoughts = x["generated_solution"]
    thoughts = thoughts.replace("<think>", "").replace("</think>", "")

    # Strip newlines on left and right
    thoughts = thoughts.strip()
    # Add our custom formatting
    final_prompt = \
        reasoning_start + thoughts + reasoning_end + \
        solution_start + expected_answer + solution_end
    return [
        {"role" : "system",    "content" : system_prompt},
        {"role" : "user",      "content" : problem},
        {"role" : "assistant", "content" : final_prompt},
    ]

dataset["Messages"] = dataset.apply(format_dataset, axis = 1)
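
As a small illustration of what `format_dataset` produces for the assistant turn, the sketch below applies the same string operations to a made-up row. The helper name `build_assistant_turn` and the sample row are hypothetical; the marker strings are the ones defined earlier in the notebook.

```python
reasoning_start, reasoning_end = "<start_working_out>", "<end_working_out>"
solution_start, solution_end = "<SOLUTION>", "</SOLUTION>"


def build_assistant_turn(generated_solution, expected_answer):
    # Drop the original <think> tags, then trim surrounding whitespace
    thoughts = generated_solution.replace("<think>", "").replace("</think>", "").strip()
    # Wrap the reasoning and the answer in our custom markers
    return (reasoning_start + thoughts + reasoning_end
            + solution_start + expected_answer + solution_end)


# Hypothetical row, much shorter than a real reasoning trace
row = {"generated_solution": "<think>\n2 + 2 = 4\n</think>", "expected_answer": "4"}
assistant_turn = build_assistant_turn(row["generated_solution"], row["expected_answer"])
print(assistant_turn)
# <start_working_out>2 + 2 = 4<end_working_out><SOLUTION>4</SOLUTION>
```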

Check to see if it worked:

In [None]:
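# Use .iloc[0] to get the first remaining row; after filtering, the index label 0 may no longer exist
tokenizer.apply_chat_template(dataset["Messages"].iloc[0], tokenize = False)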

Let's shorten the pre fine-tuning dataset by keeping only examples whose tokenized length is at most `max_seq_length/2`, since we don't want overly long reasoning traces.

Note this might take 2 minutes!

In [None]:
dataset["N"] = dataset["Messages"].apply(lambda x: len(tokenizer.apply_chat_template(x)))

dataset = dataset.loc[dataset["N"] <= max_seq_length/2].copy()
dataset.shape

We then render the messages to plain text with the chat template (`tokenize = False`) and convert the result to a Hugging Face `Dataset`:

In [None]:
from datasets import Dataset

dataset["text"] = tokenizer.apply_chat_template(dataset["Messages"].values.tolist(), tokenize = False)
dataset = Dataset.from_pandas(dataset)
dataset

Let's now pre fine-tune the model so it follows our custom GRPO formatting!

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 1, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 2, # Number of passes over the small formatting dataset
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

In [None]:
trainer.train()

Let's check if the model has learnt to follow the custom format:

In [None]:
text = tokenizer.apply_chat_template(
    dataset[0]["Messages"][:2],
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 0,
    max_new_tokens = 1024,
    streamer = TextStreamer(tokenizer, skip_prompt = False),
)

Yes it did follow the formatting! Great! Let's remove some items before the GRPO step

In [None]:
del dataset
torch.cuda.empty_cache()
import gc
gc.collect()

### Data Prep
<a name="Data"></a>

We're using Hugging Face's [Open R1 Math dataset](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed). You can also use OpenAI's famous [GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k).

In [None]:
from datasets import load_dataset
dataset = load_dataset("open-r1/DAPO-Math-17k-Processed", "en", split = "train")
dataset

Let's look at the first row:

In [None]:
dataset[0]["prompt"]

In [None]:
dataset[0]["solution"]

In GSM8K, every answer ends with a `####` marker followed by the final result, so we would extract the text after it. For the Open R1 dataset, we can skip this step, which is why the extraction lines below are commented out.

In [None]:
def extract_hash_answer(text):
    # if "####" not in text: return None
    # return text.split("####")[1].strip()
    return text
extract_hash_answer(dataset[0]["solution"])
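
For reference, this is roughly what the commented-out GSM8K path would do if it were enabled. The function name `extract_hash_answer_gsm8k` and the sample solution string are hypothetical and only illustrate the `####` convention.

```python
def extract_hash_answer_gsm8k(text):
    # GSM8K-style solutions end with "#### <final answer>"
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


# Made-up GSM8K-style solution, for illustration only
sample = "She has 16 - 3 - 4 = 9 eggs left and earns 9 * 2 = 18 dollars.\n#### 18"
print(extract_hash_answer_gsm8k(sample))  # 18
print(extract_hash_answer_gsm8k("42"))    # None, since there is no #### marker
```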

Let's map the dataset and look at the first row:

In [None]:
dataset = dataset.map(lambda x: {
    "prompt" : [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": x["prompt"]},
    ],
    "answer": extract_hash_answer(x["solution"]),
})
dataset[0]

We create a regex format to match the reasoning sections and answers:

In [None]:
import re

# Add optional EOS token matching
solution_end_regex = r"</SOLUTION>[\s]{0,}" + \
    "(?:" + re.escape(tokenizer.eos_token) + ")?"

match_format = re.compile(
    rf"{reasoning_end}.*?"\
    rf"{solution_start}(.+?){solution_end_regex}"\
    rf"[\s]{{0,}}$",
    flags = re.MULTILINE | re.DOTALL
)
match_format

We verify it works:

In [None]:
match_format.findall(
    "Let me think!<end_working_out>"\
    f"<SOLUTION>\n2\n</SOLUTION>",
)

In [None]:
match_format.findall(
    "<start_working_out>Let me think!<end_working_out>"\
    f"<SOLUTION>  2  </SOLUTION>\n\n",
)
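
The following self-contained sketch rebuilds the same pattern with a placeholder EOS string so that it can be run without a tokenizer. `"<EOS>"` is an assumption standing in for `tokenizer.eos_token`; the test strings are made up.

```python
import re

reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
EOS = "<EOS>"  # placeholder for tokenizer.eos_token

# Same construction as in the notebook, with the placeholder EOS
solution_end_regex = r"</SOLUTION>[\s]{0,}" + "(?:" + re.escape(EOS) + ")?"
match_format = re.compile(
    rf"{reasoning_end}.*?"
    rf"{solution_start}(.+?){solution_end_regex}"
    rf"[\s]{{0,}}$",
    flags=re.MULTILINE | re.DOTALL,
)

examples = {
    "well formed": "Thinking...<end_working_out><SOLUTION>2</SOLUTION>",
    "with EOS": "Thinking...<end_working_out><SOLUTION>2</SOLUTION><EOS>",
    "no reasoning end": "Thinking...<SOLUTION>2</SOLUTION>",
    "trailing text on same line": "x<end_working_out><SOLUTION>2</SOLUTION> extra words",
}
results = {name: match_format.findall(text) for name, text in examples.items()}
for name, found in results.items():
    print(f"{name:28s} -> {found}")
```

The first two examples yield `['2']`, while the last two yield an empty list: the pattern requires the reasoning end marker, and nothing but whitespace (and an optional EOS) may follow `</SOLUTION>` on the same line.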

We now want to create a reward function to match the format exactly - we reward it with 3 points if it succeeds:

In [None]:
def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match if format is seen exactly!
        if match_format.search(response) is not None: score += 3.0
        scores.append(score)
    return scores

If it fails, we want to reward the model if it at least follows the format partially, by counting each symbol:

In [None]:
def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Count how many keywords are seen - we penalize if too many!
        # If we see 1, then plus some points!

        # No need to reward <start_working_out> since we always prepend it!
        # score += 0.5 if response.count(reasoning_start) == 1 else -1.0
        score += 0.5 if response.count(reasoning_end)   == 1 else -1.0
        score += 0.5 if response.count(solution_start)  == 1 else -1.0
        score += 0.5 if response.count(solution_end)    == 1 else -1.0
        scores.append(score)
    return scores
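
To see how the partial-format reward adds up, here is a stripped-down version of the scoring rule applied to plain strings rather than to the `completions` structure the trainer passes in. The helper name `format_points` and the sample responses are hypothetical.

```python
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"


def format_points(response):
    # +0.5 for each marker that appears exactly once, -1.0 otherwise
    score = 0.0
    for marker in (reasoning_end, solution_start, solution_end):
        score += 0.5 if response.count(marker) == 1 else -1.0
    return score


print(format_points("work<end_working_out><SOLUTION>2</SOLUTION>"))   # 1.5
print(format_points("work<end_working_out>2"))                        # -1.5
print(format_points("<SOLUTION>1</SOLUTION><SOLUTION>2</SOLUTION>"))  # -3.0
```

A fully formatted answer earns 1.5 points here, while repeating or omitting markers is penalized per marker, so the model still gets a graded signal when the exact-match reward is zero.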

Finally, we want to extract the generated answer, and reward or penalize it! We also reward it based on how close the answer is to the true one via ratios:

In [None]:
def check_answer(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0
        if guess is None:
            scores.append(-2.0)
            continue
        # Correct answer gets 5 points!
        if guess == true_answer:
            score += 5.0
        # Match if spaces are seen, but less reward
        elif guess.strip() == true_answer.strip():
            score += 3.5
        else:
            # We also reward it if the answer is close via ratios!
            # Ie if the answer is within some range, reward it!
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1: score += 2.0
                elif ratio >= 0.8 and ratio <= 1.2: score += 1.5
                else: score -= 2.5 # Penalize wrong answers
            except:
                score -= 4.5 # Penalize
        scores.append(score)
    return scores
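
The reward tiers in `check_answer` can be summarized for a single guess as follows. This is a sketch that mirrors the branches above; the helper name `answer_points` is hypothetical, and it catches only `ValueError` and `ZeroDivisionError` instead of using a bare `except`.

```python
def answer_points(guess, true_answer):
    if guess is None:
        return -2.0                      # format did not match, nothing extracted
    if guess == true_answer:
        return 5.0                       # exact string match
    if guess.strip() == true_answer.strip():
        return 3.5                       # match up to surrounding whitespace
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):
        return -4.5                      # not comparable as numbers
    if 0.9 <= ratio <= 1.1:
        return 2.0                       # within 10% of the true value
    if 0.8 <= ratio <= 1.2:
        return 1.5                       # within 20% of the true value
    return -2.5                          # numeric but far off


for guess in ["34", " 34 ", "36", "40", "100", "thirty-four", None]:
    print(repr(guess), "->", answer_points(guess, "34"))
```

With a true answer of `"34"`, the guesses score 5.0, 3.5, 2.0, 1.5, -2.5, -4.5, and -2.0 respectively.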

Also sometimes it might not be 1 number as the answer, but like a sentence for example "The solution is $20" -> we extract 20.

We also remove possible commas for example as in 123,456

In [None]:
match_numbers = re.compile(
    solution_start + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags = re.MULTILINE | re.DOTALL
)
print(match_numbers.findall("<SOLUTION>  0.34  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>  123,456  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>  -0.234  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>17</SOLUTION>"))
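
The sketch below combines the number regex with the comma removal and `float` conversion that `check_numbers` performs later. The helper name `extract_number` and the sample strings are hypothetical. It also shows one caveat of the character class `[\d\.\,]`: a comma that appears before the number is captured first.

```python
import re

match_numbers = re.compile(
    r"<SOLUTION>.*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags=re.MULTILINE | re.DOTALL,
)


def extract_number(text):
    found = match_numbers.search(text)
    if found is None:
        return None
    try:
        # Remove thousands separators such as in 123,456 before converting
        return float(found.group(1).strip().replace(",", ""))
    except ValueError:
        return None


print(extract_number("<SOLUTION>  123,456  </SOLUTION>"))         # 123456.0
print(extract_number("<SOLUTION>The solution is $20</SOLUTION>"))  # 20.0
print(extract_number("<SOLUTION>no digits</SOLUTION>"))            # None
# Caveat: the first run of digits, dots, or commas wins, so the lone "," is captured
print(extract_number("<SOLUTION>Roughly, 20</SOLUTION>"))          # None
```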

We now prepare our main function which will print out the generated responses and the true answer, along with another reward function which converts text to float via `float` and sees if it's the same.

In [None]:
global PRINTED_TIMES
PRINTED_TIMES = 0
global PRINT_EVERY_STEPS
PRINT_EVERY_STEPS = 5

def check_numbers(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    # Print only every few steps
    global PRINTED_TIMES
    global PRINT_EVERY_STEPS
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        print(
            '*'*20 + f"Question:\n{question}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}"
        )
    PRINTED_TIMES += 1

    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:
            scores.append(-2.5)
            continue
        # Convert to numbers
        try:
            true_answer = float(true_answer.strip())
            # Remove commas like in 123,456
            guess       = float(guess.strip().replace(",", ""))
            scores.append(3.5 if guess == true_answer else -1.5)
        except:
            scores.append(0)
            continue
    return scores

Get the 90th-percentile prompt length so that we don't accidentally truncate prompts during training.

That is, we'll remove the longest 10% of prompts.

In [None]:
tokenized = dataset.map(
    lambda x: {"tokens" : tokenizer.apply_chat_template(x["prompt"], add_generation_prompt = True, tokenize = True)},
    batched = True,
)
print(tokenizer.decode(tokenized[0]["tokens"]))
tokenized = tokenized.map(lambda x: {"L" : len(x["tokens"])})

import numpy as np
maximum_length = int(np.quantile(tokenized["L"], 0.9))
print("Max Length = ", maximum_length)

# Filter only samples smaller than 90% max length
dataset = dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])
del tokenized
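
To illustrate the length filter without NumPy, here is a small pure-Python sketch of a linearly interpolated quantile (the default method of `np.quantile`) applied to made-up prompt lengths. The helper name `quantile_linear` and the list `prompt_lengths` are hypothetical.

```python
import math


def quantile_linear(values, q):
    # Linear interpolation between the two nearest order statistics
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


# Made-up token counts with one very long outlier
prompt_lengths = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 500]
maximum_length = int(quantile_linear(prompt_lengths, 0.9))
kept = [n for n in prompt_lengths if n <= maximum_length]
print("Max Length =", maximum_length, "| kept", len(kept), "of", len(prompt_lengths))
```

In this toy example only the 500-token outlier is dropped, which keeps `max_prompt_length` small and leaves more of `max_seq_length` available for completions.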

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [None]:
max_prompt_length = maximum_length + 1 # + 1 just in case!
max_completion_length = max_seq_length - max_prompt_length

from vllm import SamplingParams
vllm_sampling_params = SamplingParams(
    min_p = 0.1,
    top_p = 1.0,
    top_k = -1,
    seed = 3407,
    stop = [tokenizer.eos_token],
    include_stop_str_in_output = True,
)

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    vllm_sampling_params = vllm_sampling_params,
    temperature = 1.0,
    learning_rate = 5e-6,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 100,
    save_steps = 100,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",

    # For optional training + evaluation
    # fp16_full_eval = True,
    # per_device_eval_batch_size = 4,
    # eval_accumulation_steps = 1,
    # eval_strategy = "steps",
    # eval_steps = 1,
)
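
The `num_generations` setting matters because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that group-relative normalization on made-up rewards. It is a simplified illustration, not TRL's implementation: the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-4):
    # Each completion is scored relative to the mean of its own group
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical summed rewards for the 4 completions of one prompt
group_rewards = [3.0, 1.0, -1.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])

# If every completion gets the same reward, there is no learning signal
flat = group_relative_advantages([2.0, 2.0, 2.0, 2.0])
print(flat)
```

Completions that beat their group's average get a positive advantage and the others a negative one. When all completions in a group receive the same reward, every advantage is zero, so that prompt contributes nothing to the update.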

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [None]:
# For optional training + evaluation
# new_dataset = dataset.train_test_split(test_size = 0.01)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,
        match_format_approximately,
        check_answer,
        check_numbers,
    ],
    args = training_args,
    train_dataset = dataset,

    # For optional training + evaluation
    # train_dataset = new_dataset["train"],
    # eval_dataset = new_dataset["test"],
)
trainer.train()

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without the GRPO-trained LoRA:

In [None]:
text = "What is the sqrt of 101?"

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

And now with the LoRA we just trained with GRPO. We save the LoRA first!

In [None]:
model.save_lora("grpo_saved_lora")

Verify LoRA is actually trained!

In [None]:
from safetensors import safe_open

tensors = {}
with safe_open("grpo_saved_lora/adapter_model.safetensors", framework = "pt") as f:
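    # Verify that no LoRA tensor (A or B) is entirely zero
    for key in f.keys():
        tensor = f.get_tensor(key)
        # Fraction of entries that are exactly zero (1.0 would mean untrained)
        zero_fraction = (tensor == 0).sum() / tensor.numel()
        assert zero_fraction.item() < 1.0, f"{key} is all zeros"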

Now we load the LoRA and test:

In [None]:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "What is the sqrt of 101?"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>


# Task
Explain the provided notebook code step-by-step, focusing on concepts like LoRA, GRPO, reward functions, and data preparation, to help a user new to these topics understand the process of fine-tuning a language model.

## Explain LoRA


This code block is specifically for Google Colab and Kaggle environments. It uses `uv pip install` to install the required libraries, including `unsloth`, `vllm`, `numpy`, `torchvision`, `bitsandbytes`, and `xformers`. It also checks for the presence of a Tesla T4 GPU to install specific versions of `vllm` and `triton` for optimal performance. Finally, it installs a specific version of the `transformers` library.

This code block initializes the base language model (`unsloth/Qwen3-4B-Base`) and its tokenizer using `FastLanguageModel.from_pretrained`.

*   `max_seq_length`: Sets the maximum sequence length the model can handle.
*   `lora_rank`: Defines the rank for the LoRA adapters, influencing the number of trainable parameters.
*   `load_in_4bit = False`: Specifies that the base model is loaded in 16-bit precision for LoRA training. Setting it to `True` would instead load 4-bit quantized weights to save memory.
*   `fast_inference = True`: Enables vLLM for faster inference.
*   `max_lora_rank`: Sets the maximum LoRA rank allowed.
*   `gpu_memory_utilization`: Controls the GPU memory allocation for vLLM.

After loading the model, `FastLanguageModel.get_peft_model` is used to prepare the model for PEFT (Parameter-Efficient Fine-Tuning) with LoRA.

*   `r`: Sets the LoRA rank.
*   `target_modules`: Specifies which layers of the model will have LoRA adapters injected.
*   `lora_alpha`: A scaling factor for the LoRA updates.
*   `use_gradient_checkpointing = "unsloth"`: Enables gradient checkpointing to reduce memory usage during training.
*   `random_state`: Sets a random seed for reproducibility.
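
To make the role of `r` and `lora_alpha` concrete: LoRA keeps the original weight matrix frozen and learns two small matrices, `B` (`d_out x r`) and `A` (`r x d_in`), whose product is added to the frozen weight with a scale of `lora_alpha / r`. The sketch below counts parameters for a single layer. The 4096 x 4096 layer size is a hypothetical example, not the actual dimensions of Qwen3-4B.

```python
def lora_param_counts(d_in, d_out, rank):
    full = d_in * d_out            # parameters in the frozen weight matrix
    lora = rank * (d_in + d_out)   # parameters in A (rank x d_in) plus B (d_out x rank)
    return full, lora


lora_rank = 32
lora_alpha = lora_rank * 2         # as in the notebook
d_in = d_out = 4096                # hypothetical layer size, for illustration only

full, lora = lora_param_counts(d_in, d_out, lora_rank)
scale = lora_alpha / lora_rank     # factor applied to the low-rank update B @ A

print(f"full layer:   {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({lora / full:.2%} of the layer)")
print(f"update scale: {scale}")
```

For this example layer, the adapter trains about 1.56% as many parameters as the full matrix, which is why LoRA fits on much smaller GPUs than full fine-tuning.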

This code block defines the custom tokens and the system prompt that will be used to structure the input and output during the GRPO training.

*   `reasoning_start`, `reasoning_end`: Tokens to mark the beginning and end of the model's step-by-step reasoning process.
*   `solution_start`, `solution_end`: Tokens to mark the beginning and end of the final solution.
*   `system_prompt`: A guiding instruction for the model, explaining its role and how to format its output using the defined tokens.

This code block defines a custom chat template using the Jinja templating language. This template structures the conversation turns between the system, user, and assistant roles, ensuring that the model receives input and generates output in a consistent format that aligns with the GRPO training objective. It also prepends the `reasoning_start` token when generating a new response to encourage the model to begin with its thought process.

This code block demonstrates how the custom chat template defined in the previous cell formats a sample conversation. It applies the template to a list of messages with different roles and shows the resulting string that will be fed into the model. The `add_generation_prompt = True` argument ensures that the `reasoning_start` token is added at the end of the prompt for the model to start generating its response.

This code block loads a subset of the NVIDIA Open Math Reasoning dataset from the Hugging Face Hub using `load_dataset`. It then converts the dataset to a pandas DataFrame and filters it to include only examples where the `expected_answer` can be converted to a number. This filtered dataset will be used for the initial pre-fine-tuning step to teach the model the desired output format.

This code block defines a function `format_dataset` that takes a row from the dataset and formats it into the desired GRPO chat structure. It extracts the problem, expected answer, and generated solution (reasoning), removes the existing `<think>` and `</think>` tags from the reasoning, and then builds the assistant response from the custom `reasoning_start`, `reasoning_end`, `solution_start`, and `solution_end` markers. Finally, `dataset.apply(format_dataset, axis = 1)` runs this function on every row and stores the resulting system/user/assistant conversations in a new "Messages" column.

This code block checks the output of the `format_dataset` function by applying the tokenizer's chat template to the first example in the "Messages" column of the dataset. This helps verify that the dataset has been formatted correctly according to the custom chat template.

This code block calculates the number of tokens for each formatted message in the dataset and stores it in a new column "N". It then filters the dataset to keep only those examples where the number of tokens is less than or equal to half of the `max_seq_length`. Note that examples are dropped rather than cut short: this filtering step removes excessively long reasoning traces from the pre-fine-tuning data, which helps manage memory usage and training efficiency.

This code block prepares the dataset for training with the `trl` library's `SFTTrainer`. It applies the chat template to the "Messages" column to create a "text" column containing the fully formatted prompt and response as a single string. Then, it converts the pandas DataFrame back into a Hugging Face `Dataset` object, which is the required format for the `SFTTrainer`.

This code block sets up the `SFTTrainer` for the pre-fine-tuning phase.

*   `model`, `tokenizer`: The LoRA-enabled model and its tokenizer.
*   `train_dataset`: The formatted dataset prepared in the previous steps.
*   `args`: An `SFTConfig` object containing various training parameters:
    *   `dataset_text_field`: Specifies the column in the dataset that contains the training text.
    *   `per_device_train_batch_size`, `gradient_accumulation_steps`: Control the effective batch size for training.
    *   `warmup_steps`: Number of steps for a learning rate warmup.
    *   `num_train_epochs`: Number of full passes over the training data.
    *   `learning_rate`: The learning rate for the optimizer.
    *   `logging_steps`: How often to log training progress.
    *   `optim`: The optimizer to use.
    *   `weight_decay`: Weight decay coefficient, a regularization strength that shrinks the trained weights toward zero.
    *   `lr_scheduler_type`: Type of learning rate scheduler.
    *   `seed`: Random seed for reproducibility.
    *   `report_to`: Where to report training metrics (e.g., "none" to disable).
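
To make the interaction between the batch-size parameters concrete, here is a small arithmetic sketch. All numbers are hypothetical, and the step count is only an approximation, since the trainer applies its own rounding.

```python
import math

# Hypothetical values; the notebook's SFTConfig sets its own.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 2
num_examples = 59   # assumed dataset size after filtering
num_devices = 1     # e.g. a single GPU

# Gradients from several small batches are accumulated before each optimizer
# step, so the optimizer effectively sees this many examples per update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate number of optimizer steps over the whole run.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```

A step count like this is useful for judging whether `warmup_steps` and `logging_steps` are sensible relative to the length of the run.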

This code block starts the training process for the pre-fine-tuning phase using the `trainer.train()` method. This step trains the LoRA adapters on the formatted dataset to teach the model the desired output structure and the use of the custom GRPO tokens.

This code block tests if the model has learned to follow the custom GRPO format after the pre-fine-tuning. It constructs a prompt using the chat template and then uses `model.generate` to produce a response. The `TextStreamer` is used to display the generated text as it's being produced. The goal is to see if the model starts its response with the `reasoning_start` token and attempts to follow the defined structure.

This code block cleans up memory after the pre-fine-tuning step. It deletes the `dataset` variable, then uses `gc.collect()` to reclaim unreferenced Python objects and `torch.cuda.empty_cache()` to release cached GPU memory, which is important before moving to the next training phase.

This code block loads the main dataset for GRPO training, which is the "open-r1/DAPO-Math-17k-Processed" dataset from the Hugging Face Hub. It specifies the "en" configuration and loads the "train" split. This dataset contains a larger collection of math problems and solutions that will be used for the main GRPO training.

This code block displays the "prompt" field of the first example in the loaded dataset. This shows the input question for a math problem.

This code block displays the "solution" field of the first example in the loaded dataset. This shows the expected solution, which includes both the reasoning steps and the final answer.

This code block defines a function `extract_hash_answer`. Its original logic, now commented out, was likely intended to extract the final answer from solutions that mark it with "####", as is common in datasets like GSM8K. For the Open R1 dataset used here, the function simply returns the entire solution string as the "answer".

This code block maps the dataset to apply the formatting for the GRPO training. It creates a "prompt" field containing the system prompt and user question in the chat template format. It also creates an "answer" field by applying the `extract_hash_answer` function to the "solution" field. Finally, it displays the first example of the processed dataset to show the new structure.

This code block defines a regular expression pattern `match_format` to extract the final answer from the model's generated output based on the custom GRPO tokens.

*   `rf"{reasoning_end}.*?"`: Matches the `reasoning_end` token followed by any characters (non-greedily).
*   `rf"{solution_start}(.+?){solution_end_regex}"`: Matches the `solution_start` token, captures any characters in between (the solution) non-greedily using `(.+?)`, and then matches the `solution_end_regex`.
*   `solution_end_regex`: This part is defined separately to optionally match the end-of-sequence token (`eos_token`) after the `solution_end` token.
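*   `[\s]{{0,}}$`: Matches zero or more whitespace characters before the end of a line or of the string. The doubled braces are f-string escapes, so the compiled pattern contains `[\s]{0,}`.
*   `flags = re.MULTILINE | re.DOTALL`: `re.MULTILINE` lets `^` and `$` match at line boundaries as well as at the string boundaries, and `re.DOTALL` allows the dot (`.`) to match newline characters.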

This regex is crucial for extracting the model's predicted answer from its structured output during GRPO training.
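
The pieces listed above can be assembled as in the sketch below. The tag strings and the EOS token are assumptions standing in for the values the notebook defines (`tokenizer.eos_token` in particular), and `re.escape` is applied to them here as a precaution.

```python
import re

# Hypothetical stand-ins for values defined earlier in the notebook.
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"
eos_token = "<|endoftext|>"  # stand-in for tokenizer.eos_token

# Closing tag, optional whitespace, then an optional EOS token.
solution_end_regex = (
    re.escape(solution_end) + r"[\s]{0,}" + "(?:" + re.escape(eos_token) + ")?"
)

match_format = re.compile(
    rf"{re.escape(reasoning_end)}.*?"
    rf"{re.escape(solution_start)}(.+?){solution_end_regex}"
    rf"[\s]{{0,}}$",
    flags=re.MULTILINE | re.DOTALL,
)

plain = match_format.findall("Let me think!<end_working_out><SOLUTION>2</SOLUTION>")
spaced = match_format.findall("<end_working_out><SOLUTION>  2  </SOLUTION>\n\n")
with_eos = match_format.findall("<end_working_out><SOLUTION>2</SOLUTION><|endoftext|>")
print(plain, spaced, with_eos)
```

Note that the capture group keeps the whitespace inside the solution tags, which is why the answer-checking reward later compares both the raw and the stripped guess.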

This code block verifies that the `match_format` regular expression works correctly by applying it to a sample string that follows the expected GRPO output format. It uses `findall` to extract the captured group, which should be the content between the `solution_start` and `solution_end` tokens.

This code block further verifies the `match_format` regular expression with another sample string, including leading/trailing whitespace around the solution and newline characters. This ensures the regex is robust to variations in whitespace.

This code block defines the `match_format_exactly` reward function. This function is used in GRPO to reward the model based on whether its generated response exactly matches the expected format defined by the custom tokens and their order. If the `match_format` regex successfully finds a match in the completion, the model receives a score of 3.0; otherwise, it receives 0. This encourages the model to adhere strictly to the desired output structure.
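
A minimal sketch of such a reward function is shown below. It assumes the completions arrive as a list of single-message conversations (a list of lists of dicts with a "content" key), uses hypothetical tag strings, and relies on a simplified format regex without EOS handling.

```python
import re

# Hypothetical tag strings; the notebook defines its own values earlier.
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

# Simplified format regex (no EOS token handling).
match_format = re.compile(
    rf"{re.escape(reasoning_end)}.*?"
    rf"{re.escape(solution_start)}(.+?){re.escape(solution_end)}[\s]{{0,}}$",
    flags=re.MULTILINE | re.DOTALL,
)

def match_format_exactly(completions, **kwargs):
    """Return 3.0 for each completion that follows the full format, else 0.0."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        matched = match_format.search(response) is not None
        scores.append(3.0 if matched else 0.0)
    return scores

completions = [
    [{"role": "assistant", "content": "thinking<end_working_out><SOLUTION>4</SOLUTION>"}],
    [{"role": "assistant", "content": "The answer is 4"}],
]
scores = match_format_exactly(completions)
print(scores)
```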

This code block defines the `match_format_approximately` reward function. This function provides a partial reward if the model's output includes some of the required formatting tokens, even if it doesn't match the exact structure. It checks for the presence of `reasoning_end`, `solution_start`, and `solution_end` tokens. The model gets a positive score (0.5) for each token present exactly once and a negative score (-1.0) otherwise. This helps guide the model towards the correct format even when it doesn't get it perfectly right.
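
The per-token scoring can be sketched as follows, again with hypothetical tag strings and the same assumed completion structure (a list of single-message conversations).

```python
# Hypothetical tag strings; the notebook defines its own values earlier.
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

def match_format_approximately(completions, **kwargs):
    """Score each completion by how many format tags appear exactly once."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        score = 0.0
        for tag in (reasoning_end, solution_start, solution_end):
            # +0.5 when the tag appears exactly once; -1.0 when it is
            # missing or repeated.
            score += 0.5 if response.count(tag) == 1 else -1.0
        scores.append(score)
    return scores

completions = [
    [{"content": "work<end_working_out><SOLUTION>4</SOLUTION>"}],  # all tags once
    [{"content": "<SOLUTION>4</SOLUTION>"}],                       # reasoning_end missing
    [{"content": "The answer is 4"}],                              # no tags at all
]
scores = match_format_approximately(completions)
print(scores)
```

The resulting scores range from -3.0 (no tag used correctly) to 1.5 (every tag present exactly once), so partially formatted outputs still receive a graded signal.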

This code block defines the `check_answer` reward function. This function focuses on evaluating the correctness of the extracted answer from the model's completion.

*   It first attempts to extract the answer using the `match_format` regex.
*   If no answer is extracted, the model is penalized (-2.0).
*   If an answer is extracted, it compares it to the true answer from the dataset.
*   An exact match gets the highest reward (5.0).
*   A match after stripping whitespace gets a slightly lower reward (3.5).
*   If the extracted answer and true answer can be converted to numbers, it calculates a ratio and rewards the model based on how close the ratio is to 1.0.
*   Incorrect numerical answers or non-numerical answers that don't match are penalized.

This reward function directly incentivizes the model to produce the correct final answer.
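
The scoring ladder above might look like the sketch below. The -2.0, 5.0, and 3.5 values come from the description; the ratio bands and the remaining penalties are hypothetical placeholders, as are the tag strings and the simplified format regex.

```python
import re

# Hypothetical tag strings and a simplified format regex (no EOS handling).
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"
match_format = re.compile(
    rf"{re.escape(reasoning_end)}.*?"
    rf"{re.escape(solution_start)}(.+?){re.escape(solution_end)}[\s]{{0,}}$",
    flags=re.MULTILINE | re.DOTALL,
)

def check_answer(prompts, completions, answer, **kwargs):
    scores = []
    for completion, true_answer in zip(completions, answer):
        match = match_format.search(completion[0]["content"])
        if match is None:
            scores.append(-2.0)            # nothing could be extracted
            continue
        guess = match.group(1)
        if guess == true_answer:
            scores.append(5.0)             # exact match
        elif guess.strip() == true_answer.strip():
            scores.append(3.5)             # match after stripping whitespace
        else:
            try:
                ratio = float(guess) / float(true_answer)
                # Hypothetical bands: closer ratios earn more reward.
                if 0.9 <= ratio <= 1.1:
                    scores.append(2.0)
                elif 0.8 <= ratio <= 1.2:
                    scores.append(1.5)
                else:
                    scores.append(-2.5)    # hypothetical penalty: wrong number
            except (ValueError, ZeroDivisionError):
                scores.append(-4.5)        # hypothetical penalty: not comparable
    return scores

answers = ["4", "4", "100", "100", "100", "100"]
outputs = [
    "<end_working_out><SOLUTION>4</SOLUTION>",
    "<end_working_out><SOLUTION> 4 </SOLUTION>",
    "<end_working_out><SOLUTION>95</SOLUTION>",
    "<end_working_out><SOLUTION>50</SOLUTION>",
    "<end_working_out><SOLUTION>abc</SOLUTION>",
    "The answer is 100",
]
completions = [[{"content": text}] for text in outputs]
scores = check_answer(None, completions, answers)
print(scores)
```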

This code block defines a regular expression `match_numbers` specifically to extract potential numerical answers from the model's output, even if they are not perfectly enclosed within the `solution_start` and `solution_end` tokens or include commas. It looks for a number (potentially negative, with decimals and commas) after the `solution_start` token. The `print` statements demonstrate how this regex extracts numbers from various sample strings.
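
A regex along these lines could be written as in the sketch below. The tag string is a hypothetical stand-in, and the exact pattern in the notebook may differ.

```python
import re

solution_start = "<SOLUTION>"  # hypothetical; defined earlier in the notebook

# After the opening tag, skip as little as possible, then capture an optional
# minus sign followed by a run of digits, dots, and commas.
match_numbers = re.compile(
    re.escape(solution_start) + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags=re.MULTILINE | re.DOTALL,
)

decimal = match_numbers.findall("<SOLUTION>  0.34  </SOLUTION>")
with_commas = match_numbers.findall("<SOLUTION>  123,456  </SOLUTION>")
negative = match_numbers.findall("<SOLUTION>  -0.234  </SOLUTION>")
integer = match_numbers.findall("<SOLUTION>17</SOLUTION>")
print(decimal, with_commas, negative, integer)
```

Because the pattern does not require a closing tag, it still yields a number when the model opens the solution block but fails to close it.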

This code block defines the `check_numbers` reward function, which is similar to `check_answer` but specifically focuses on extracting and comparing numerical answers using the `match_numbers` regex.

*   It attempts to extract a number using `match_numbers`.
*   If no number is extracted, it penalizes the model (-2.5).
*   If a number is extracted, it attempts to convert both the extracted guess and the true answer to floating-point numbers, handling commas in the guess.
*   It rewards the model (3.5) if the converted numbers are equal and penalizes it (-1.5) otherwise.
*   A print statement is included to periodically show the question, true answer, model response, and extracted answer during training, which helps in monitoring the training process.

This reward function provides a separate signal to the model based on the numerical correctness of its output.
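
A sketch of this numeric check follows. The -2.5, 3.5, and -1.5 values come from the description; the score for unparsable values is a hypothetical choice, the tag string is assumed, and the periodic debug printing is left out for brevity.

```python
import re

solution_start = "<SOLUTION>"  # hypothetical; defined earlier in the notebook
match_numbers = re.compile(
    re.escape(solution_start) + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags=re.MULTILINE | re.DOTALL,
)

def check_numbers(prompts, completions, answer, **kwargs):
    scores = []
    for completion, true_answer in zip(completions, answer):
        match = match_numbers.search(completion[0]["content"])
        if match is None:
            scores.append(-2.5)  # no number found after the solution tag
            continue
        try:
            # Remove thousands separators before converting the guess.
            guess = float(match.group(1).strip().replace(",", ""))
            truth = float(true_answer.strip())
            scores.append(3.5 if guess == truth else -1.5)
        except ValueError:
            scores.append(0.0)   # hypothetical: neutral score if parsing fails
    return scores

answers = ["1234", "8", "8"]
outputs = [
    "<SOLUTION>1,234</SOLUTION>",
    "<SOLUTION>7</SOLUTION>",
    "The answer is 8",
]
completions = [[{"content": text}] for text in outputs]
scores = check_numbers(None, completions, answers)
print(scores)
```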

This code block calculates the token length of each prompt in the dataset after applying the chat template and adding the generation prompt. It then determines the 90th percentile of these lengths. This `maximum_length` is used to filter out the longest 10% of prompts to prevent excessive sequence lengths during training, which can lead to memory issues or truncation. The dataset is then filtered to include only prompts shorter than or equal to this maximum length.
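
The percentile cut-off can be illustrated with the standard library alone. The token counts below are made up, and `statistics.quantiles` stands in for whatever quantile routine the notebook uses.

```python
import statistics

# Hypothetical prompt lengths in tokens.
prompt_lengths = [12, 15, 18, 20, 22, 25, 30, 35, 60, 200]

# quantiles(n=10) returns the nine decile cut points; the last one is the
# 90th percentile (linear interpolation with method="inclusive").
maximum_length = int(
    statistics.quantiles(prompt_lengths, n=10, method="inclusive")[-1]
)

# Keep only prompts at or below the cut-off.
kept = [n for n in prompt_lengths if n <= maximum_length]
print(maximum_length, len(kept))
```

In this toy data the single 200-token outlier is removed, which lowers the prompt budget from 200 to 74 tokens at the cost of one example.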

This code block sets up the configurations for the main GRPO training.

*   `max_prompt_length`: Set based on the calculated maximum length from the dataset analysis.
*   `max_completion_length`: Calculated as the remaining length after the prompt, up to the `max_seq_length`.
*   `vllm_sampling_params`: Defines the parameters for sampling completions from the model using vLLM during the GRPO process (e.g., temperature, top_p, stop tokens).
*   `training_args`: A `GRPOConfig` object containing various training parameters specific to GRPO, such as learning rate, weight decay, optimizer, batch size, number of generations per prompt, and the maximum number of training steps. It also includes optional parameters for evaluation.
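
The length budget described in the first two bullets is simple arithmetic. The numbers below are hypothetical, and the notebook may add a small margin to the prompt length.

```python
# Hypothetical values; the notebook derives these from its own settings.
max_seq_length = 2048   # total context budget for prompt plus completion
maximum_length = 201    # 90th-percentile prompt length from the previous step

max_prompt_length = maximum_length
# Whatever the prompt does not use is left for the generated completion.
max_completion_length = max_seq_length - max_prompt_length

print(max_prompt_length, max_completion_length)
```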

This code block initializes and starts the GRPO training process using the `GRPOTrainer`.

*   `model`, `processing_class`: The LoRA-enabled model and its tokenizer.
*   `reward_funcs`: A list of the reward functions defined earlier (`match_format_exactly`, `match_format_approximately`, `check_answer`, `check_numbers`). The GRPO trainer will use these functions to calculate rewards for the model's generated completions.
*   `args`: The `GRPOConfig` containing the training parameters.
*   `train_dataset`: The dataset prepared for GRPO training.

The `trainer.train()` method starts the GRPO algorithm, which iteratively generates completions, calculates rewards based on the provided functions, and updates the model's parameters to maximize the expected reward. The markdown cell above this provides an example of the training log, showing how the reward should ideally increase over time.
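
To give a feel for how the rewards drive the update, the sketch below combines per-function scores for one group of completions sampled from the same prompt and normalizes them within the group, which is the core idea of GRPO. It is an illustration only: the scores are made up, and the trainer's exact normalization and any reward weighting are assumptions here (population standard deviation and a small `eps`).

```python
import statistics

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical scores for four completions of one prompt, one row per
# reward function.
per_function_scores = [
    [3.0, 0.0, 3.0, 0.0],     # match_format_exactly
    [1.5, -3.0, 1.5, 0.0],    # match_format_approximately
    [5.0, -2.0, -2.5, -2.0],  # check_answer
]

# Sum the reward functions for each completion.
total_rewards = [sum(column) for column in zip(*per_function_scores)]
advantages = group_advantages(total_rewards)
print(total_rewards)
print([round(a, 2) for a in advantages])
```

Completions that score above the group mean get a positive advantage and are reinforced, while those below the mean are discouraged.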

This code block demonstrates how to perform inference with the model *without* the GRPO-trained LoRA adapters applied. It takes a simple question ("What is the sqrt of 101?") and uses `model.fast_generate` with specified `SamplingParams` to get the model's response. This serves as a baseline to compare against the model's performance with the GRPO-trained adapters.

This code block saves the trained LoRA adapters to a local directory named "grpo_saved_lora" using `model.save_lora`. This saves only the small LoRA weights, not the entire base model, making it efficient for storing and sharing the fine-tuned changes.

This code block verifies that the LoRA adapters were saved correctly and that they contain non-zero values (meaning the training actually updated the weights). It uses `safetensors` to open the saved adapter file and checks that the number of zero elements in the weight tensors is not equal to the total number of elements.

This code block demonstrates how to perform inference using the model with the *trained* LoRA adapters loaded.

*   It constructs the input prompt using the chat template, including the system prompt and the user question.
*   It uses `model.fast_generate` for inference.
*   Crucially, it passes `model.load_lora("grpo_saved_lora")` to the `lora_request` parameter. This tells vLLM to apply the saved LoRA weights during generation, allowing you to see the effect of the GRPO training.

The output of this cell should show the model attempting to use the GRPO format and providing a more reasoned answer compared to the inference before training.

This markdown cell provides commentary on the inference results, noting that the model's reasoning has improved after training, even if not always perfectly correct, due to the limited training time in this example. It suggests that extending the training time and sequence length would likely lead to better performance.

This markdown cell introduces the concept of saving the model in different formats for deployment and sharing, specifically focusing on saving to `float16` or `int4` formats and pushing to the Hugging Face Hub.

This code block provides commented-out examples of how to save the fine-tuned model in different merged formats (`merged_16bit` and `merged_4bit`) and how to push these merged models to the Hugging Face Hub. It also shows how to save and push just the LoRA adapters as a fallback. These options are useful for deploying the model for inference with libraries that support merged models or LoRA adapters.

This markdown cell explains how to save the model in the GGUF format, which is compatible with `llama.cpp` and enables running the model on various hardware platforms, including CPUs. It mentions that Unsloth natively supports GGUF conversion and lists some common quantization methods like `q8_0`, `q4_k_m`, and `q5_k_m`, recommending the K_M methods for better trade-offs between size and performance. It also links to an Ollama notebook for finetuning and auto-exporting to Ollama.

This code block provides commented-out examples of how to save the fine-tuned model in different GGUF quantization methods (`q8_0`, `f16`, `q4_k_m`) locally and how to push these GGUF files to the Hugging Face Hub. It also shows how to save and push multiple GGUF quantization methods at once for efficiency. You would uncomment and modify these lines based on your desired GGUF format and Hugging Face repository.

This final markdown cell concludes the notebook, providing instructions on how to use the saved GGUF file with `llama.cpp`. It also includes links to the Unsloth Discord server for support and community interaction, as well as links to other relevant Unsloth notebooks and documentation.