# GRPO Training Test: Qwen3-4B-Thinking-2507

Tests Group Relative Policy Optimization (GRPO) reinforcement learning with Unsloth on Qwen3-4B-Thinking-2507.

**Key features tested:**
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- GRPOTrainer with thinking-aware reward function
- Rewards self-questioning reasoning in `<think>` blocks
- Post-training inference verification

**GRPO Overview:**
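GRPO is a reinforcement learning method that samples several completions for each prompt, scores each one with a reward function, and updates the policy toward the completions that score above their group's average. Because the group itself provides the baseline, no separate value (critic) model is needed.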
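
As a minimal sketch of the group-relative idea (not TRL's actual implementation), the snippet below turns the rewards of one prompt's completions into advantages by subtracting the group mean and dividing by the group standard deviation. The helper name `group_advantages` and the `eps` constant are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalize one prompt's rewards against its own group (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two completions for the same prompt, scored 1.1 and 0.3:
# the better one gets a positive advantage, the worse one a negative advantage.
print(group_advantages([1.1, 0.3]))

# Identical rewards give zero advantages, so this group carries no learning signal.
print(group_advantages([-1.0, -1.0]))
```

The second call shows why a reward function with only a few discrete levels can stall training when groups are small: if both completions land in the same tier, their advantages are zero.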

**Thinking Reward:**
The reward function evaluates:
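- Presence of `<think>...</think>` tags (a lone `</think>` also counts, since the chat template may supply the opening tag)
- Length of the reasoning in words, used as a rough proxy for quality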
- Bonus for self-questioning (question marks in thinking)
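
The tag check has to cope with an implicit opening tag: the chat template for thinking models may already emit `<think>` in the prompt, so a completion can contain only the closing `</think>`. A minimal sketch of the extraction step follows, using a hypothetical helper name `extract_thinking`. Unlike the reward function in this notebook, which treats an unclosed `<think>` as empty reasoning, this sketch keeps the partial text; that choice is an assumption made for illustration.

```python
import re

def extract_thinking(completion: str):
    """Return the reasoning text, or None if no think tags are present (illustrative helper)."""
    match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if match:
        return match.group(1).strip()
    if "</think>" in completion:
        # Opening tag came from the prompt template; text before the close is reasoning
        return completion.split("</think>", 1)[0].strip()
    if "<think>" in completion:
        # Opened but never closed, for example when generation was truncated
        return completion.split("<think>", 1)[1].strip()
    return None

print(extract_thinking("<think>Is this recursive?</think>Yes."))  # explicit tags
print(extract_thinking("What is the base case?</think>Answer."))   # implicit opening tag
print(extract_thinking("Plain answer with no reasoning."))         # no tags at all
```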

**Important:** This notebook includes a kernel shutdown cell at the end to release all GPU memory.

In [None]:
# Environment Setup
import os

# FIX: Set ACCELERATE_MIXED_PRECISION BEFORE importing unsloth
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'

from dotenv import load_dotenv
load_dotenv()

# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

import torch
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset

# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"ACCELERATE_MIXED_PRECISION: {os.environ.get('ACCELERATE_MIXED_PRECISION', 'not set')}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")

In [None]:
# Load Qwen3-4B-Thinking-2507 with 4-bit quantization
MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]}...")

model, tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,
    max_seq_length=1024,  # Increased for thinking content
    load_in_4bit=True,
    dtype=None,
)

# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

print(f"Model loaded: {type(model).__name__}")

In [None]:
# Apply LoRA adapters for GRPO training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")

In [None]:
# Create minimal synthetic prompt dataset for GRPO (5 prompts)
# GRPO requires prompts only - completions are generated during training

prompts = [
    "Explain the concept of recursion in programming.",
    "What are the benefits of using version control?",
    "Describe how a hash table works.",
    "What is the difference between a stack and a queue?",
    "Explain what an API is to a beginner.",
]

# Format prompts for GRPO (requires "prompt" field)
dataset = Dataset.from_dict({
    "prompt": [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": p}],
            tokenize=False,
            add_generation_prompt=True
        ) for p in prompts
    ]
})

print(f"Dataset created: {len(dataset)} prompts")
print(f"Sample prompt:\n{dataset[0]['prompt'][:150]}...")

In [None]:
# Define thinking-aware reward function
# Rewards self-questioning reasoning in <think> blocks
import re

def thinking_reward_fn(completions, prompts=None, **kwargs):
    """
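    Score each completion by the length of its <think> reasoning.
    - Returns -1.0 when neither <think> nor </think> appears
    - 0.3 for fewer than 10 words, 0.7 for 10-29 words, 1.0 for 30 or more
    - +0.1 bonus in the two upper tiers if the reasoning contains a question mark
    Returns a list with one float per completion.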
    """
    rewards = []
    for completion in completions:
        # Check for thinking tags
        has_thinking = "<think>" in completion or "</think>" in completion
        
        if has_thinking:
            # Extract thinking content
            think_match = re.search(r'<think>(.*?)</think>', completion, re.DOTALL)
            if not think_match and '</think>' in completion:
                # Handle case where <think> is implicit (from template)
                thinking_content = completion.split('</think>')[0]
            else:
                thinking_content = think_match.group(1) if think_match else ""
            
            # Reward based on thinking quality
            thinking_words = len(thinking_content.split())
            
            # Bonus for self-questioning indicators
            question_marks = thinking_content.count('?')
            has_self_questions = question_marks >= 1
            
            if thinking_words < 10:
                reward = 0.3  # Minimal thinking
            elif thinking_words < 30:
                reward = 0.7 + (0.1 if has_self_questions else 0)
            else:
                reward = 1.0 + (0.1 if has_self_questions else 0)
        else:
            reward = -1.0  # No thinking tags
        
        rewards.append(reward)
    
    return rewards

print("Thinking-aware reward function defined")
print("Rewards: thinking quality + self-questioning bonus")

In [None]:
# GRPO Training Configuration (minimal steps for testing)
grpo_config = GRPOConfig(
    output_dir="outputs_grpo_qwen_think_test",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=2,  # Minimal steps for testing
    warmup_steps=0,
    learning_rate=1e-5,
    logging_steps=1,
    fp16=not is_bf16_supported(),
    bf16=is_bf16_supported(),
    optim="adamw_8bit",
    max_completion_length=128,  # Kept short for a smoke test; long reasoning may be cut off before </think>
    num_generations=2,
    beta=0.1,
    seed=42,
)

# Initialize GRPO Trainer
trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    reward_funcs=thinking_reward_fn,
)

print("Starting GRPO training with thinking rewards (2 steps)...")
trainer_stats = trainer.train()
print("GRPO training completed!")
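
To make the batch settings above concrete, the arithmetic below shows how many distinct prompts this configuration touches. It assumes a single GPU and that each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` completions, split into groups of `num_generations`. The exact batching rules differ between TRL versions, so treat this as an illustration rather than the trainer's own logic.

```python
# Values copied from the GRPOConfig in this notebook
per_device_train_batch_size = 2
gradient_accumulation_steps = 1
num_generations = 2
max_steps = 2
num_prompts = 5

# Assumption: one process, so the per-device batch is the global batch
completions_per_step = per_device_train_batch_size * gradient_accumulation_steps

# Each prompt needs a full group of completions, so the batch must split evenly
assert completions_per_step % num_generations == 0

prompts_per_step = completions_per_step // num_generations
prompts_seen = min(num_prompts, prompts_per_step * max_steps)
print(f"{prompts_per_step} prompt(s) per step, {prompts_seen} of {num_prompts} prompts seen")
```

Under these assumptions the 2-step test exercises only part of the dataset, which is enough for a smoke test but not for judging reward trends.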

In [None]:
# Post-training inference test
FastLanguageModel.for_inference(model)

test_prompt = "Explain what machine learning is in simple terms."
messages = [{"role": "user", "content": test_prompt}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

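# Decode only the newly generated tokens so the prompt is not echoed into the output
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)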

# Parse thinking vs response
if '</think>' in response:
    parts = response.split('</think>', 1)
    thinking = parts[0].split('<think>')[-1] if '<think>' in parts[0] else parts[0]
    final_resp = parts[1].strip() if len(parts) > 1 else ""
else:
    thinking, final_resp = "", response

print("=" * 60)
print("GRPO Training Pipeline Test (Thinking Mode) PASSED")
print("=" * 60)
# Show at most 200 characters of each part, adding an ellipsis when truncated
print(f"\nTHINKING: {thinking[:200]}{'...' if len(thinking) > 200 else ''}")
print(f"\nRESPONSE: {final_resp[:200]}{'...' if len(final_resp) > 200 else ''}")

## Test Complete

The GRPO Training Pipeline test with thinking rewards has completed successfully. The kernel will now shut down to release all GPU memory.

### What Was Verified
- FastLanguageModel loading with 4-bit quantization (Qwen3-4B-Thinking-2507)
- LoRA adapter configuration for RL training
- Synthetic prompt dataset creation
- Thinking-aware reward function (rewards reasoning quality + self-questioning)
- GRPOTrainer training loop (2 steps)
- Post-training inference with thinking output

### GRPO Concepts with Thinking
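- **Thinking Reward**: Scores `<think>` content by word count and returns -1.0 when no thinking tags are found
- **Self-Questioning Bonus**: Extra 0.1 reward when reasoning of 10 or more words contains at least one question mark
- **KL Penalty (beta)**: Keeps the policy from drifting too far from the reference model (`beta=0.1` in this test)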
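
To illustrate what the KL penalty measures, the sketch below computes a per-token KL estimate from log-probabilities under the policy and the reference model. It uses the estimator `exp(d) - d - 1` with `d = ref_logp - policy_logp`, a form commonly used in GRPO implementations. The log-probability values are made up for illustration, and `kl_estimate` is a hypothetical helper, not a TRL function.

```python
import math

def kl_estimate(policy_logp: float, ref_logp: float) -> float:
    """Non-negative per-token KL estimate: exp(d) - d - 1, with d = ref_logp - policy_logp."""
    d = ref_logp - policy_logp
    return math.exp(d) - d - 1.0

beta = 0.1  # same value as in this notebook's GRPOConfig

# Made-up log-probabilities for three generated tokens
policy_logps = [-0.5, -1.2, -2.0]
ref_logps = [-0.5, -1.0, -3.0]

for p, r in zip(policy_logps, ref_logps):
    kl = kl_estimate(p, r)
    # The penalty grows as the policy's token probabilities move away from the reference
    print(f"policy={p:+.1f} ref={r:+.1f} kl={kl:.4f} penalty={beta * kl:.4f}")
```

The estimate is zero when both models assign the same probability to a token and positive otherwise, so a larger `beta` pulls the policy more strongly back toward the reference.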

### Ready for Production
If this test passed, your environment is ready for:
- GRPO training with thinking-focused reward functions
- RL pipelines that optimize reasoning quality
- Chain-of-thought preference optimization

In [None]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)