# GRPO Fine-tuning - All-in-One Notebook By *WEBSPACEAI RESEARCH*

This notebook provides a complete end-to-end pipeline for turning Qwen3-4B-Base (or another compatible base model) into a reasoning model via GRPO (Group Relative Policy Optimization), using the Open R1 math dataset. Everything is included in this single notebook, from environment validation to model export.

## Overview

1. **Environment Validation**: Check Python version, GPU availability, and disk space
2. **Dependency Installation**: Automatically install all required packages
3. **Model Setup**: Load and configure the Qwen3-4B model with LoRA
4. **Pre-Fine-Tuning**: Briefly fine-tune the model so it learns the custom reasoning format
5. **Data Preparation**: Process the Open R1 Math dataset
6. **GRPO Training**: Train the model for mathematical reasoning
7. **Inference**: Test the model's reasoning capabilities
8. **Model Export**: Save the model in various formats

## Requirements

- CUDA-compatible GPU with 16+ GB VRAM (recommended)
- Python 3.8+
- ~30 GB disk space for model and datasets


## 1. Environment Validation

First, let's check if your environment meets the requirements for training.
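
As a minimal sketch of these checks, the Python comparison can be done on version tuples rather than on the version string, which sidesteps suffixes such as `rc1`. The helper name `environment_report` and its default thresholds are illustrative assumptions that mirror the requirements listed above; they are not part of any library.

```python
import shutil
import sys


def environment_report(min_python=(3, 8), min_free_gb=30, path="."):
    """Return a dict describing whether this machine meets the stated minimums."""
    # sys.version_info compares element-wise against a plain tuple.
    python_ok = sys.version_info[:2] >= min_python
    # shutil.disk_usage reports bytes; 2**30 bytes is one GiB.
    free_gb = shutil.disk_usage(path).free / 2**30
    return {
        "python_ok": python_ok,
        "free_gb": free_gb,
        "disk_ok": free_gb >= min_free_gb,
    }


report = environment_report()
print(report)
```

Returning a dict instead of printing directly makes it easy to decide later whether to stop or continue.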

In [None]:
import os
import sys
import subprocess
import platform
import shutil
from pathlib import Path

# Helper functions for pretty output
def print_header(message):
    print("\n" + "=" * 80)
    print(f" {message}")
    print("=" * 80)

def print_step(message):
    print(f"\n>> {message}")

def print_success(message):
    print(f"✅ {message}")

def print_warning(message):
    print(f"⚠️ {message}")

def print_error(message):
    print(f"❌ {message}")

print_header("Environment Validation")

In [None]:
# Check Python version
print_step("Checking Python version")
python_version = platform.python_version()
print(f"Python version: {python_version}")

# Use sys.version_info so that versions such as "3.12.0rc1" do not break parsing
major, minor = sys.version_info[:2]
if major < 3 or (major == 3 and minor < 8):
    print_warning("Python 3.8+ is recommended for this notebook")
    python_ok = False
else:
    print_success("Python version is compatible")
    python_ok = True

In [None]:
# Check for GPU availability
print_step("Checking for GPU")

# Try to import torch
try:
    import torch
    if torch.cuda.is_available():
        device_count = torch.cuda.device_count()
        print(f"Found {device_count} CUDA device(s)")
        
        for i in range(device_count):
            device_name = torch.cuda.get_device_name(i)
            device_memory = torch.cuda.get_device_properties(i).total_memory / 1e9
            print_success(f"GPU {i}: {device_name} ({device_memory:.2f} GB)")
        
        gpu_available = True
    else:
        print_warning("No CUDA-compatible GPU detected")
        
        # Check if NVIDIA drivers are installed
        try:
            subprocess.run("nvidia-smi", shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            print_warning("NVIDIA drivers found but CUDA is not available in PyTorch")
            print("Please ensure PyTorch is installed with CUDA support")
        except subprocess.CalledProcessError:
            print_error("NVIDIA drivers not found or not properly installed")
            print("Please install NVIDIA drivers: https://www.nvidia.com/Download/index.aspx")
        
        gpu_available = False
except ImportError:
    print_warning("PyTorch not installed, will attempt to install it")
    gpu_available = None

In [None]:
# Check available disk space
print_step("Checking available disk space")

total, used, free = shutil.disk_usage(os.getcwd())
free_gb = free / (2**30)
print(f"Available disk space: {free_gb:.2f} GB")

if free_gb < 30:
    print_warning("Less than 30GB free disk space. You may encounter issues downloading models and datasets")
    disk_ok = False
else:
    print_success(f"Sufficient disk space available ({free_gb:.2f} GB)")
    disk_ok = True

In [None]:
# Summarize environment check
print_header("Environment Check Summary")

if not python_ok:
    print_warning("Python version check: FAILED - Continuing with unsupported Python version, but you may encounter issues")
else:
    print_success("Python version check: PASSED")

if gpu_available is False:  # None means not installed yet
    print_warning("GPU check: FAILED - No GPU detected. This notebook requires a GPU for training")
    proceed = input("Continue anyway? (y/n): ").lower() == 'y'
    if not proceed:
        print("Notebook execution stopped. Please run on a machine with a CUDA-compatible GPU.")
        # This will only work in interactive mode
        # import sys; sys.exit()
elif gpu_available is True:
    print_success("GPU check: PASSED")
else:
    print_warning("GPU check: UNKNOWN - Will attempt to install PyTorch with CUDA support")

if not disk_ok:
    print_warning("Disk space check: FAILED - Low disk space may cause issues during model download and training")
    proceed = input("Continue anyway? (y/n): ").lower() == 'y'
    if not proceed:
        print("Notebook execution stopped. Please free up disk space and try again.")
        # import sys; sys.exit()
else:
    print_success("Disk space check: PASSED")

## 2. Dependency Installation

Now let's install all required dependencies for training. This may take several minutes.
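
When a requirement needs extra pip options, each option has to reach pip as its own argument; a version specifier and a flag packed into one string would be read as a single malformed requirement. The sketch below only builds the argument list, so it can be inspected without installing anything. The helper name `build_pip_command` is hypothetical; `--extra-index-url` and `--quiet` are standard pip options.

```python
import sys


def build_pip_command(requirement, extra_index_url=None, quiet=False):
    """Build an argv list for `python -m pip install`, one token per element."""
    cmd = [sys.executable, "-m", "pip", "install", requirement]
    if extra_index_url is not None:
        # The flag and its value must be separate argv entries.
        cmd += ["--extra-index-url", extra_index_url]
    if quiet:
        cmd.append("--quiet")
    return cmd


cmd = build_pip_command(
    "torch>=2.0.0",
    extra_index_url="https://download.pytorch.org/whl/cu118",
)
print(cmd[1:])
```

The resulting list can be passed directly to `subprocess.check_call`.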

In [None]:
# Function to run pip install with error handling
def pip_install(package, quiet=False):
    import shlex
    # Split so that options such as "--extra-index-url <url>" reach pip as
    # separate arguments instead of one malformed requirement string
    cmd = [sys.executable, "-m", "pip", "install", *shlex.split(package)]
    if quiet:
        cmd.append("--quiet")
    try:
        subprocess.check_call(cmd)
        return True
    except subprocess.CalledProcessError:
        print(f"Failed to install {package}")
        return False

print_header("Installing Dependencies")

# First install PyTorch with CUDA support if not already installed
try:
    import torch
    print(f"PyTorch {torch.__version__} already installed")
    torch_installed = True
except ImportError:
    print("Installing PyTorch with CUDA support...")
    success = pip_install("torch>=2.0.0 --extra-index-url https://download.pytorch.org/whl/cu118")
    if not success:
        print_warning("Failed to install PyTorch with CUDA 11.8, trying CUDA 11.7...")
        success = pip_install("torch>=2.0.0 --extra-index-url https://download.pytorch.org/whl/cu117")
    if not success:
        print_warning("Failed to install PyTorch with CUDA 11.7, trying default...")
        success = pip_install("torch>=2.0.0")
    if not success:
        print_error("Failed to install PyTorch")
        torch_installed = False
    else:
        torch_installed = True
        # Import torch again to verify installation
        try:
            import torch
            print_success(f"PyTorch {torch.__version__} installed successfully")
        except ImportError:
            print_error("PyTorch installation failed")
            torch_installed = False

In [None]:
# Install transformers first as it's a core dependency
print_step("Installing transformers...")
success = pip_install("transformers>=4.34.0")
if not success:
    print_error("Failed to install transformers")
    transformers_installed = False
else:
    transformers_installed = True
    print_success("Transformers installed successfully")

In [None]:
# Install key packages individually
print_step("Installing key packages...")

packages = [
    "unsloth>=2023.11.0",
    "vllm==0.8.5.post1",
    "bitsandbytes>=0.39.0",
    "accelerate>=0.23.0",
    "xformers==0.0.29.post3",
    "peft>=0.5.0",
    "trl>=0.7.2",
    "triton>=2.0.0",
    "safetensors>=0.3.2",
    "datasets>=3.4.1",
    "pandas>=2.0.0",
    "numpy>=1.24.0",
    "sentencepiece",
    "protobuf",
    "huggingface_hub",
    "cut_cross_entropy",
    "unsloth_zoo"
]

# Bolt Optimization: Batch install for performance
print(f"Installing {len(packages)} packages in batch...")
import sys
import subprocess
try:
    # Batch install allows pip to resolve dependencies more efficiently
    # We explicitly import sys and subprocess to ensure they are available in this cell
    cmd = [sys.executable, "-m", "pip", "install"] + packages + ["--quiet"]
    subprocess.check_call(cmd)
    print_success("Package installation completed")
except subprocess.CalledProcessError:
    if "print_warning" in globals():
        print_warning("Batch installation failed, falling back to sequential installation")
    else:
        print("Batch installation failed, falling back to sequential installation")
    
    for package in packages:
        print(f"Installing {package}...")
        pip_install(package, quiet=True)
    print_success("Sequential installation completed")

In [None]:
# Validate key imports
print_step("Validating package imports")

key_packages = [
    "torch", 
    "transformers", 
    "unsloth", 
    "vllm", 
    "datasets", 
    "peft", 
    "trl"
]

all_success = True
for package in key_packages:
    try:
        __import__(package)
        print_success(f"Successfully imported {package}")
    except ImportError as e:
        print_error(f"Failed to import {package}: {e}")
        all_success = False

if all_success:
    print_success("All key packages imported successfully")
else:
    print_warning("Some packages failed to import. You may encounter issues during training.")

## 3. Load and Configure Model

Now we'll load the Qwen3-4B-Base model and configure it with LoRA for efficient fine-tuning.
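
To make "efficient" concrete: a LoRA adapter on a weight matrix of shape `d_out x d_in` trains two small matrices of rank `r` instead of the full matrix, and the low-rank update is scaled by `lora_alpha / r` in standard LoRA. The sketch below works through the arithmetic for one projection. The 4096 x 4096 shape is a hypothetical example, not a value read from the model's configuration.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical square projection; not taken from any specific model config.
d_in = d_out = 4096
rank = 32
lora_alpha = rank * 2

full = d_in * d_out
added = lora_param_count(d_in, d_out, rank)
scaling = lora_alpha / rank  # factor applied to the low-rank update

print(f"full weight: {full:,} params")
print(f"LoRA adapter: {added:,} params ({added / full:.2%} of the full weight)")
print(f"scaling alpha/r: {scaling}")
```

Doubling the rank doubles the adapter size, which is the trade-off behind the `lora_rank` setting used in the next cell.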

In [None]:
from unsloth import FastLanguageModel
import torch

# Model configuration
max_seq_length = 2048  # Can increase for longer reasoning traces
lora_rank = 32  # Larger rank = smarter, but slower

print_header("Loading Qwen3-4B-Base Model")
print("This may take a few minutes to download and prepare the model...")

try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Qwen3-4B-Base",  # Swap in another base model here if desired
        max_seq_length = max_seq_length,
        load_in_4bit = False,  # False for LoRA 16bit
        fast_inference = True,  # Enable vLLM fast inference
        max_lora_rank = lora_rank,
        gpu_memory_utilization = 0.7,  # Reduce if out of memory
    )

    # Configure LoRA for efficient fine-tuning
    model = FastLanguageModel.get_peft_model(
        model,
        r = lora_rank,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
        target_modules = [
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha = lora_rank*2,  # *2 speeds up training
        use_gradient_checkpointing = "unsloth",  # Reduces memory usage
        random_state = 3407,
    )

    print_success("Model loaded and configured successfully")
except Exception as e:
    print_error(f"Error loading model: {e}")
    print("Please check your internet connection and GPU memory availability.")
    # Uncomment to stop execution if model loading fails
    # raise e

## 4. Configure GRPO Chat Template

Now we'll set up a custom chat template for the GRPO training process. This defines how the model formats its reasoning and solutions.
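
The template in the next cell is written in Jinja, which can be hard to read at a glance. As a rough plain-Python equivalent of its logic, the sketch below renders a conversation the same way: a system block followed by the end-of-sequence token, user turns as-is, assistant turns followed by the end-of-sequence token, and the reasoning marker appended when a generation prompt is requested. The function name `render_chat` and the `<eos>` value are stand-ins; the real token comes from the tokenizer.

```python
def render_chat(messages, system_prompt, eos_token, reasoning_start,
                add_generation_prompt=False):
    """Mirror the template logic: system block, then user/assistant turns."""
    if messages and messages[0]["role"] == "system":
        out = messages[0]["content"] + eos_token
        turns = messages[1:]
    else:
        # No system message supplied: fall back to the default system prompt.
        out = system_prompt + eos_token
        turns = messages
    for message in turns:
        if message["role"] == "user":
            out += message["content"]
        elif message["role"] == "assistant":
            out += message["content"] + eos_token
    if add_generation_prompt:
        # Generation always begins inside the reasoning section.
        out += reasoning_start
    return out


text = render_chat(
    [{"role": "user", "content": "What is 1+1?"}],
    system_prompt="SYSTEM",
    eos_token="<eos>",  # stand-in; the real value comes from the tokenizer
    reasoning_start="<start_working_out>",
    add_generation_prompt=True,
)
print(text)
```

Because the generation prompt already contains the opening reasoning marker, the model's own output starts with the reasoning text rather than with the marker.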

In [None]:
# Define reasoning format markers
reasoning_start = "<start_working_out>"  # Acts as <think>
reasoning_end   = "<end_working_out>"    # Acts as </think>
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

# Create system prompt that instructs the model how to format its responses
system_prompt = f"""\
You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""

print("System prompt:")
print(system_prompt)

# Create a custom chat template
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '{reasoning_start}' }}"\
    "{% endif %}"

# Replace with our specific template values
chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")
tokenizer.chat_template = chat_template

# Test the chat template
example_chat = [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": f"{reasoning_start}I think it's 2.{reasoning_end}{solution_start}2{solution_end}"},
    {"role": "user", "content": "What is 2+2?"},
]

formatted_chat = tokenizer.apply_chat_template(example_chat, tokenize=False, add_generation_prompt=True)
print("\nExample chat with template applied:")
print(formatted_chat)

## 5. Pre-Fine-Tuning for Formatting

Before the main GRPO training, we'll briefly fine-tune the model on a small supervised dataset so that it learns our custom format. This makes the GRPO training more efficient, because the model already produces the markers that the reward functions look for.

In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np

print_header("Loading Pre-training Dataset")
try:
    dataset = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")
    dataset = dataset.to_pandas()[
        ["expected_answer", "problem", "generated_solution"]
    ]

    # Filter for numerical answers only
    is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors="coerce").notnull()
    dataset = dataset.iloc[np.where(is_number)[0]]
    
    print_success(f"Dataset loaded with {len(dataset)} examples")
except Exception as e:
    print_warning(f"Error loading dataset: {e}")
    print("Creating a minimal example dataset instead...")
    # Create a minimal dataset if the original can't be loaded
    data = {
        "expected_answer": ["2", "4", "10", "5", "9"],
        "problem": [
            "What is 1+1?", 
            "What is 2+2?", 
            "What is 5+5?",
            "What is 10/2?",
            "What is 3*3?"
        ],
        "generated_solution": [
            "<think>To find 1+1, I add 1 and 1 together. 1+1=2.</think>",
            "<think>To find 2+2, I add 2 and 2 together. 2+2=4.</think>",
            "<think>To find 5+5, I add 5 and 5 together. 5+5=10.</think>",
            "<think>To find 10/2, I divide 10 by 2. 10/2=5.</think>",
            "<think>To find 3*3, I multiply 3 by 3. 3*3=9.</think>"
        ]
    }
    dataset = pd.DataFrame(data)
    print_success(f"Created minimal dataset with {len(dataset)} examples")

In [None]:
# Format the dataset to follow our GRPO style
def format_dataset(x):
    expected_answer = x["expected_answer"]
    problem = x["problem"]

    # Remove generated <think> and </think>
    thoughts = x["generated_solution"]
    thoughts = thoughts.replace("<think>", "").replace("</think>", "")

    # Strip newlines on left and right
    thoughts = thoughts.strip()
    # Add our custom formatting
    final_prompt = \
        reasoning_start + thoughts + reasoning_end + \
        solution_start + expected_answer + solution_end
    return [
        {"role": "system",    "content": system_prompt},
        {"role": "user",      "content": problem},
        {"role": "assistant", "content": final_prompt},
    ]

dataset["Messages"] = dataset.apply(format_dataset, axis=1)

# Check an example
print("\nExample formatted message:")
# Use .iloc because filtering above may have removed index label 0
example = tokenizer.apply_chat_template(dataset["Messages"].iloc[0], tokenize=False)
print(example[:500] + "..." if len(example) > 500 else example)

# Truncate dataset to max_seq_length/2
print("\nTruncating dataset to appropriate length...")
dataset["N"] = dataset["Messages"].apply(lambda x: len(tokenizer.apply_chat_template(x)))
dataset = dataset.loc[dataset["N"] <= max_seq_length/2].copy()
print(f"Dataset size after truncation: {dataset.shape[0]} examples")

# Convert to Hugging Face dataset format
from datasets import Dataset as HFDataset
dataset["text"] = tokenizer.apply_chat_template(dataset["Messages"].values.tolist(), tokenize=False)
hf_dataset = HFDataset.from_pandas(dataset)
print_success("Dataset prepared for pre-fine-tuning")

In [None]:
from trl import SFTTrainer, SFTConfig

print_header("Starting Pre-fine-tuning")
print("This step teaches the model to follow our custom format...")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=hf_dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,  # Use GA to mimic batch size!
        warmup_steps=5,
        num_train_epochs=2,  # Number of passes over the formatting dataset
        learning_rate=2e-4,  # Reduce to 2e-5 for long training runs
        logging_steps=5,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="none",  # Use this for WandB etc
    ),
)

trainer.train()
print_success("Pre-fine-tuning complete")

## 6. Test Pre-Fine-Tuned Model

Let's check if the model has learned to follow our custom format.
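
Reading the streamed output by eye works, but a small check makes "follows the format" explicit. The sketch below is one possible test, assuming the marker strings defined earlier in this notebook: each closing and solution marker must appear exactly once and in order. The opening reasoning marker is not checked because the generation prompt already supplies it. The function name `follows_format` is hypothetical.

```python
def follows_format(text,
                   reasoning_end="<end_working_out>",
                   solution_start="<SOLUTION>",
                   solution_end="</SOLUTION>"):
    """True if each marker appears exactly once and in the expected order."""
    markers = [reasoning_end, solution_start, solution_end]
    if any(text.count(m) != 1 for m in markers):
        return False
    positions = [text.find(m) for m in markers]
    return positions == sorted(positions)


good = "I add 1 and 1.<end_working_out><SOLUTION>2</SOLUTION>"
bad = "The answer is 2."
print(follows_format(good), follows_format(bad))
```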

In [None]:
print_header("Testing Pre-fine-tuned Model")

# Test the model with an example from the dataset
if len(dataset) > 0:
    test_text = tokenizer.apply_chat_template(
        dataset["Messages"].iloc[0][:2],  # Just system and user message
        tokenize=False,
        add_generation_prompt=True,  # Must add for generation
    )

    from transformers import TextStreamer
    print("Generating response with pre-fine-tuned model...")
    _ = model.generate(
        **tokenizer(test_text, return_tensors="pt").to("cuda"),
        temperature=0,
        max_new_tokens=1024,
        streamer=TextStreamer(tokenizer, skip_prompt=True),
    )

# Clean up to free memory
del dataset
del hf_dataset
torch.cuda.empty_cache()
import gc
gc.collect()

## 7. Data Preparation for GRPO Training

Now we'll prepare the main training dataset from Open R1 Math for GRPO training.
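
Datasets store the reference answer differently. The processed Open R1 set used here keeps it in a `solution` column that can be used as-is, whereas GSM8K-style solutions end with a `#### <answer>` line. If you swap datasets, an extraction helper along these lines may be needed. This is a sketch under that assumption, and it differs slightly from the pass-through helper defined below in that it also strips whitespace.

```python
def extract_hash_answer(text):
    """Return the text after the last '####' marker, or the stripped input."""
    if "####" in text:
        return text.rsplit("####", 1)[1].strip()
    return text.strip()


print(extract_hash_answer("3 + 4 = 7\n#### 7"))
print(extract_hash_answer("34"))
```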

In [None]:
from datasets import load_dataset

print_header("Loading Open R1 Math Dataset")
try:
    dataset = load_dataset("open-r1/DAPO-Math-17k-Processed", "en", split="train")
    print_success(f"Dataset loaded with {len(dataset)} examples")
except Exception as e:
    print_warning(f"Error loading dataset: {e}")
    print("Creating a minimal example dataset instead...")
    # Create a minimal dataset if the original can't be loaded
    minimal_data = {
        "prompt": [
            "What is the square root of 16?",
            "If x + 5 = 10, what is x?",
            "What is 7 * 8?",
            "What is 144 divided by 12?",
            "What is 3^4?",
            "What is the square root of 100?",
            "If 2x - 3 = 7, what is x?",
            "What is 25% of 80?",
            "What is 15 + 27?",
            "What is 99 - 45?"
        ],
        "solution": ["4", "5", "56", "12", "81", "10", "5", "20", "42", "54"]
    }
    from datasets import Dataset as HFDataset
    dataset = HFDataset.from_dict(minimal_data)
    print_success(f"Created minimal dataset with {len(dataset)} examples")

# Show an example
print("\nExample problem:")
print(dataset[0]["prompt"])
print("\nExample solution:")
print(dataset[0]["solution"])

In [None]:
# Extract answers (for GSM8K dataset, we would extract from #### sections)
def extract_hash_answer(text):
    # For GSM8K: if "####" in text: return text.split("####")[1].strip()
    # For Open R1: just return the text
    return text

# Bolt Optimization: Use batched processing for speed
# Batched mapping is significantly faster (~7x) for large datasets
def format_dataset_batched(examples):
    return {
        "prompt": [
            [
                {"role": "system", "content": system_prompt},
                {"role": "user",   "content": p},
            ]
            for p in examples["prompt"]
        ],
        "answer": [extract_hash_answer(s) for s in examples["solution"]]
    }

dataset = dataset.map(format_dataset_batched, batched=True)

print("\nExample formatted data:")
print(dataset[0])

## 8. Set Up Reward Functions for GRPO

Now we'll define several reward functions that will guide the GRPO training process. These functions evaluate the model's responses and provide rewards based on format adherence and correctness.
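
Most of the reward functions rely on one regular expression that locates the final answer. As a standalone sketch of that idea, the pattern below requires the end-of-reasoning marker, then a solution block, then nothing but optional whitespace and an optional end-of-sequence token. The `<eos>` value is a stand-in for `tokenizer.eos_token`, and `extract_solution` is a hypothetical helper. Unlike the pattern in the next cell, this sketch omits `re.MULTILINE`, so `$` anchors to the end of the whole response.

```python
import re

reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"
eos_token = "<eos>"  # stand-in; the notebook reads this from the tokenizer

pattern = re.compile(
    re.escape(reasoning_end) + r".*?"
    + re.escape(solution_start) + r"(.+?)" + re.escape(solution_end)
    + r"\s*(?:" + re.escape(eos_token) + r")?\s*$",
    flags=re.DOTALL,
)


def extract_solution(response):
    """Return the stripped solution text, or None if the format is not followed."""
    match = pattern.search(response)
    return match.group(1).strip() if match else None


print(extract_solution("Let me think!<end_working_out>\n<SOLUTION>\n2\n</SOLUTION>"))
print(extract_solution("No markers here"))
```

Returning `None` for malformed responses lets a reward function penalize them separately from well-formed but incorrect answers.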

In [None]:
import re

print_header("Setting Up GRPO Reward Functions")

# Create regex pattern to match our formatting
solution_end_regex = r"</SOLUTION>[\s]{0,}" + \
    "(?:" + re.escape(tokenizer.eos_token) + ")?"

match_format = re.compile(
    rf"{reasoning_end}.*?"\
    rf"{solution_start}(.+?){solution_end_regex}"\
    rf"[\s]{{0,}}$",
    flags=re.MULTILINE | re.DOTALL
)

# Test the regex pattern
test_examples = [
    f"Let me think!{reasoning_end}\n{solution_start}\n2\n{solution_end}",
    f"{reasoning_start}Let me think!{reasoning_end}\n{solution_start}  2  {solution_end}\n\n",
]

print("Testing regex pattern:")
for i, example in enumerate(test_examples):
    matches = match_format.findall(example)
    print(f"Example {i+1} matches: {matches}")

In [None]:
# Define reward functions

# 1. Exact format matching - 3 points if format is followed exactly
def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match if format is seen exactly!
        if match_format.search(response) is not None: score += 3.0
        scores.append(score)
    return scores

# 2. Approximate format matching - partial points for each format element
def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Count how many keywords are seen - we penalize if too many!
        score += 0.5 if response.count(reasoning_end)   == 1 else -1.0
        score += 0.5 if response.count(solution_start)  == 1 else -1.0
        score += 0.5 if response.count(solution_end)    == 1 else -1.0
        scores.append(score)
    return scores

# 3. Answer checking - reward based on correctness of the answer
def check_answer(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0
        if guess is None:
            scores.append(-2.0)
            continue
        # Correct answer gets 5 points!
        if guess == true_answer:
            score += 5.0
        # Match if spaces are seen, but less reward
        elif guess.strip() == true_answer.strip():
            score += 3.5
        else:
            # We also reward it if the answer is close via ratios!
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1: score += 2.0
                elif ratio >= 0.8 and ratio <= 1.2: score += 1.5
                else: score -= 2.5  # Penalize wrong answers
            except Exception:  # e.g. non-numeric text or division by zero
                score -= 4.5  # Penalize
        scores.append(score)
    return scores

In [None]:
# 4. Number extraction - for handling numeric answers in various formats
match_numbers = re.compile(
    solution_start + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags=re.MULTILINE | re.DOTALL
)

# Test the number extraction
number_examples = [
    f"{solution_start}  0.34  {solution_end}",
    f"{solution_start}  123,456  {solution_end}",
    f"{solution_start}  -0.234  {solution_end}",
    f"{solution_start}17{solution_end}"
]

print("\nTesting number extraction:")
for example in number_examples:
    print(match_numbers.findall(example))

# 5. Numeric comparison with detailed logging
PRINTED_TIMES = 0
PRINT_EVERY_STEPS = 5

def check_numbers(prompts, completions, answer, **kwargs):
    global PRINTED_TIMES, PRINT_EVERY_STEPS
    
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    # Print only every few steps
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        print(
            '*'*20 + f"Question:\n{question}", 
            f"\nAnswer:\n{answer[0]}", 
            f"\nResponse:\n{responses[0][:300]}..." if len(responses[0]) > 300 else f"\nResponse:\n{responses[0]}", 
            f"\nExtracted:\n{extracted_responses[0]}"
        )
    PRINTED_TIMES += 1

    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:
            scores.append(-2.5)
            continue
        # Convert to numbers
        try:
            true_answer = float(true_answer.strip())
            # Remove commas like in 123,456
            guess = float(guess.strip().replace(",", ""))
            scores.append(3.5 if guess == true_answer else -1.5)
        except Exception:  # Answer or guess could not be parsed as a number
            scores.append(0)
            continue
    return scores

print_success("Reward functions defined successfully")

## 9. Prepare Dataset for Training

Filter the dataset to ensure all examples fit within our sequence length.
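
The idea is to pick a cut-off from the distribution of prompt lengths, drop the few prompts above it, and give the remaining token budget to the completion. The sketch below walks through that arithmetic in plain Python with hypothetical lengths. The `quantile` helper uses linear interpolation between neighbouring values, which is also the default behaviour of `np.quantile`.

```python
import math


def quantile(values, q):
    """Linearly interpolated quantile of a non-empty sequence, 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical prompt lengths in tokens, with one long outlier.
lengths = [80, 95, 100, 110, 120, 130, 140, 150, 160, 905]
max_seq_length = 2048

maximum_length = int(quantile(lengths, 0.9))
kept = [n for n in lengths if n <= maximum_length]

# Whatever the prompt does not use is available for the completion.
max_prompt_length = maximum_length + 1
max_completion_length = max_seq_length - max_prompt_length
print(maximum_length, len(kept), max_completion_length)
```

A single very long prompt would otherwise shrink the completion budget for every example, which is why the cut-off is a percentile rather than the maximum.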

In [None]:
print_header("Preparing Dataset for GRPO Training")

# Tokenize and measure prompt lengths
print("Measuring prompt lengths...")

# Efficiently calculate lengths without storing full tokens
# Using batched=True and not storing the intermediate 'tokens' column saves memory
def get_token_len(x):
    return {"L": [len(t) for t in tokenizer.apply_chat_template(x["prompt"], add_generation_prompt=True, tokenize=True)]}

tokenized = dataset.map(
    get_token_len,
    batched=True,
)

# Show an example of tokenized prompt
print("Example tokenized prompt:")
# Tokenize just one example for display purposes
example_tokens = tokenizer.apply_chat_template(dataset[0]["prompt"], add_generation_prompt=True, tokenize=True)
print(tokenizer.decode(example_tokens)[:200] + "...")

# Find the 90th percentile length to avoid outliers
import numpy as np
# Use the L column from the mapped dataset
maximum_length = int(np.quantile(tokenized["L"], 0.9))
print(f"90th percentile token length: {maximum_length}")

# Filter dataset to only include examples below the maximum length
dataset = dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])
print_success(f"Dataset filtered to {len(dataset)} examples")

# Clean up to free memory
del tokenized
torch.cuda.empty_cache()
import gc
gc.collect()

## 10. Configure GRPO Training

Set up the GRPO trainer with our reward functions and dataset.
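
It helps to see why `num_generations` matters. GRPO samples several completions for the same prompt, scores each one, and then judges every completion relative to the others in its group rather than against a learned value function. A minimal sketch of that normalization follows. It assumes the per-function rewards have already been summed into one score per completion and uses the population standard deviation; the trainer's exact normalization and epsilon may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions against their own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical summed rewards for four completions of the same prompt.
rewards = [9.5, 1.5, -2.0, 1.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward {r:+.1f} -> advantage {a:+.2f}")
```

When all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal, which is one reason the reward functions give partial credit.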

In [None]:
print_header("Configuring GRPO Training")

# Calculate sequence lengths for prompts and completions
max_prompt_length = maximum_length + 1  # +1 just in case
max_completion_length = max_seq_length - max_prompt_length
print(f"Max prompt length: {max_prompt_length}")
print(f"Max completion length: {max_completion_length}")

# Configure vLLM sampling parameters
from vllm import SamplingParams
vllm_sampling_params = SamplingParams(
    min_p=0.1,
    top_p=1.0,
    top_k=-1,
    seed=3407,
    stop=[tokenizer.eos_token],
    include_stop_str_in_output=True,
)

# Configure GRPO training
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    vllm_sampling_params=vllm_sampling_params,
    temperature=1.0,
    learning_rate=5e-6,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    optim="adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # Increase to 4 for smoother training
    num_generations=4,  # Decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_completion_length,
    # num_train_epochs=1,  # Set to 1 for a full training run
    max_steps=100,  # For quick testing, increase for better results
    save_steps=100,
    report_to="none",  # Can use Weights & Biases
    output_dir="outputs",
)

print_success("GRPO training configuration complete")

## 11. Run GRPO Training

Start the GRPO training process. This may take a while depending on your dataset size and hardware.

In [None]:
print_header("Starting GRPO Training")
print("This process may take a while. You'll see reward metrics during training.")
print("The goal is to see the 'reward' column increase over time.")

# Optional: Create train/test split for evaluation
# new_dataset = dataset.train_test_split(test_size=0.01)

try:
    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        reward_funcs=[
            match_format_exactly,
            match_format_approximately,
            check_answer,
            check_numbers,
        ],
        args=training_args,
        train_dataset=dataset,
        # For optional training + evaluation
        # train_dataset=new_dataset["train"],
        # eval_dataset=new_dataset["test"],
    )

    trainer.train()
    print_success("GRPO training complete")
except Exception as e:
    print_error(f"Error during GRPO training: {e}")
    print("This could be due to GPU memory issues or other constraints.")
    print("Try reducing num_generations, max_seq_length, or using load_in_4bit=True when loading the model.")

## 12. Test the Model

Let's test our trained model on a simple example.

In [None]:
print_header("Testing Model Performance")

# First test the model without GRPO training (baseline)
test_question = "What is the sqrt of 101?"

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature=1.0,
    top_k=50,
    max_tokens=1024,
)

print("Testing model without GRPO training (baseline):")
try:
    output = model.fast_generate(
        [test_question],
        sampling_params=sampling_params,
        lora_request=None,  # No LoRA
    )[0].outputs[0].text

    print(output)
except Exception as e:
    print_error(f"Error generating baseline response: {e}")

In [None]:
# Save the trained LoRA
print("Saving LoRA weights...")
try:
    model.save_lora("grpo_saved_lora")
    print_success("LoRA weights saved to 'grpo_saved_lora'")
except Exception as e:
    print_error(f"Error saving LoRA weights: {e}")

# Verify LoRA is trained properly
try:
    from safetensors import safe_open
    
    print("Verifying LoRA weights...")
    with safe_open("grpo_saved_lora/adapter_model.safetensors", framework="pt") as f:
        # Check if tensors contain non-zero values
        for key in list(f.keys())[:3]:  # Just check a few keys
            tensor = f.get_tensor(key)
            zero_ratio = ((tensor == 0).sum() / tensor.numel()).item()
            print(f"Key: {key}, Non-zero ratio: {1 - zero_ratio:.4f}")
            # A tensor that is entirely zero would mean the adapter was never updated
            assert zero_ratio < 1.0
    print_success("LoRA weights verified successfully")
except Exception as e:
    print_warning(f"Error verifying LoRA weights: {e}")

In [None]:
# Now test with our trained GRPO model
test_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "What is the sqrt of 101?"},
]

test_text = tokenizer.apply_chat_template(
    test_messages,
    add_generation_prompt=True,  # Must add for generation
    tokenize=False,
)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature=1.0,
    top_k=50,
    max_tokens=2048,
)

print("\nTesting model with GRPO training:")
try:
    output = model.fast_generate(
        test_text,
        sampling_params=sampling_params,
        lora_request=model.load_lora("grpo_saved_lora"),
    )[0].outputs[0].text

    print(output)
except Exception as e:
    print_error(f"Error generating GRPO response: {e}")

## 13. Save the Model

Save the trained model in different formats for deployment.

In [None]:
print_header("Model Export Options")

print("Uncomment the relevant code blocks below to save the model in your preferred format.")

# Option 1: Merge to 16-bit
print("\n1. Merge to 16-bit (full model, moderate size):")
print("# model.save_pretrained_merged(\"model\", tokenizer, save_method=\"merged_16bit\")")

# Option 2: Merge to 4-bit
print("\n2. Merge to 4-bit (full model, smallest size):")
print("# model.save_pretrained_merged(\"model\", tokenizer, save_method=\"merged_4bit\")")

# Option 3: Save LoRA adapters only
print("\n3. Save LoRA adapters only (requires base model to use):")
print("# model.save_pretrained(\"model\")")
print("# tokenizer.save_pretrained(\"model\")")

# Option 4: Save to GGUF format
print("\n4. Save to GGUF format (for llama.cpp):")
print("# model.save_pretrained_gguf(\"model\", tokenizer, quantization_method=\"q4_k_m\")")

print("\nTo save in your preferred format, uncomment the relevant code above and run this cell again.")

In [None]:
# Uncomment one of these blocks to save the model

# 1. Merge to 16-bit (full model, moderate size)
# model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")

# 2. Merge to 4-bit (full model, smallest size)
# model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit")

# 3. Save LoRA adapters only (requires base model to use)
# model.save_pretrained("model")
# tokenizer.save_pretrained("model")

# 4. Save to GGUF format (for llama.cpp)
# model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")

## Conclusion

Congratulations! You've trained a Qwen3-4B model for mathematical reasoning using GRPO. The model should now follow the custom format for its reasoning and final answer; how much its accuracy improves depends on how long you train.

### What We've Accomplished

1. **Environment Setup**: Validated Python, GPU, and disk space requirements
2. **Dependency Installation**: Installed all necessary packages
3. **Model Configuration**: Loaded Qwen3-4B and set up LoRA for efficient training
4. **Custom Formatting**: Defined a reasoning format with working-out and solution sections
5. **Pre-Fine-Tuning**: Taught the model our custom format
6. **GRPO Training**: Used reward functions to improve mathematical reasoning
7. **Testing**: Compared the model's output with and without the trained LoRA
8. **Model Export**: Explored options for saving and deploying the model

### Next Steps

1. **Increase Training Steps**: For better results, increase `max_steps` or set `num_train_epochs=1`
2. **Try Different Datasets**: Experiment with GSM8K or other reasoning datasets
3. **Adjust Hyperparameters**: Try different learning rates, batch sizes, or LoRA ranks
4. **Deploy the Model**: Save in your preferred format and deploy for inference

For more information, visit [Unsloth's documentation](https://docs.unsloth.ai/).