# Online Reinforcement Learning with GRPO Tutorial

This notebook demonstrates how to use Group Relative Policy Optimization (GRPO) for online reinforcement learning to improve model performance on mathematical reasoning tasks.

## What is Online RL with GRPO?

GRPO (Group Relative Policy Optimization) is an online RL method that:
- Generates multiple responses per prompt
- Uses a reward function to score responses
- Updates the policy based on relative performance within the group
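- Needs no separate value (critic) model, and in this tutorial no learned reward model either, because a rule-based function scores the responses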

## Key Components:
1. **Reward Function**: Evaluates response quality (e.g., math accuracy)
2. **Multiple Generations**: Creates several responses per prompt
3. **Relative Scoring**: Compares each response's reward with the rewards of the other responses to the same prompt
4. **Policy Updates**: Improves the model based on relative performance
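
The group-relative comparison in the list above can be sketched in a few lines. This is a minimal illustration, not the course implementation: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` value are assumptions made for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt: only the first one is correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every response received the same reward, so none stands out.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)    # one positive advantage, three negative ones
print(uniform)  # all zeros
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal. This is worth keeping in mind with a very small model that rarely produces a correct answer.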

## Use Case: Mathematical Reasoning
We'll train a model to better solve math problems using reward signals from correct/incorrect answers.

---
*Based on Lesson 7 from DeepLearning.AI's "Post-training LLMs" course*

## Setup and Imports

In [None]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

import sys
import os
import re
import pandas as pd
from tqdm import tqdm

# Add the src directory to the path
sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

from utils.model_utils import load_model_and_tokenizer, generate_responses
from training.rl_trainer import RLTrainingPipeline
from evaluation.metrics import compute_math_accuracy
from datasets import load_dataset
import torch

## Configuration

In [None]:
# Configuration
USE_GPU = False  # Set to True if you have a GPU available
MAX_TRAIN_SAMPLES = 10  # Small number for demonstration
MAX_EVAL_SAMPLES = 5    # Small number for evaluation

# Model and dataset configuration
BASE_MODEL = "HuggingFaceTB/SmolLM2-135M-Instruct"  # Instruction-tuned model
MATH_DATASET = "openai/gsm8k"  # Grade school math dataset
DATASET_SUBSET = "main"

# GRPO training parameters
NUM_GENERATIONS = 4  # Number of responses per prompt
LEARNING_RATE = 5e-6  # Lower learning rate for RL

# System prompt for math problems
MATH_SYSTEM_PROMPT = (
    "You are a helpful assistant that solves problems step-by-step. "
    "Always include the final numeric answer inside \\boxed{}."
)

print("Configuration:")
print(f"- Base model: {BASE_MODEL}")
print(f"- Dataset: {MATH_DATASET}")
print(f"- Train samples: {MAX_TRAIN_SAMPLES}")
print(f"- Eval samples: {MAX_EVAL_SAMPLES}")
print(f"- Generations per prompt: {NUM_GENERATIONS}")
print(f"- Learning rate: {LEARNING_RATE}")

## Step 1: Understanding the Reward Function

Let's first understand how the reward function works for mathematical reasoning.

In [None]:
# The reward function is RLTrainingPipeline.math_reward_function
# (RLTrainingPipeline was already imported in the setup cell)

# Test the reward function with examples
print("=== REWARD FUNCTION EXAMPLES ===")

# Positive example (correct answer)
correct_completion = [[{"role": "assistant", 
                      "content": "Let me solve this step by step. First, I calculate 5 × 8 = 40. Then I add 12: 40 + 12 = 52. Therefore, the answer is \\boxed{52}."}]]
ground_truth_correct = ["52"]

reward_correct = RLTrainingPipeline.math_reward_function(correct_completion, ground_truth_correct)
print(f"Correct answer reward: {reward_correct[0]}")
print(f"Response: {correct_completion[0][0]['content'][:100]}...")
print()

# Negative example (incorrect answer)
incorrect_completion = [[{"role": "assistant", 
                        "content": "I think the answer is about 50. Let me guess \\boxed{51}."}]]
ground_truth_incorrect = ["52"]

reward_incorrect = RLTrainingPipeline.math_reward_function(incorrect_completion, ground_truth_incorrect)
print(f"Incorrect answer reward: {reward_incorrect[0]}")
print(f"Response: {incorrect_completion[0][0]['content'][:100]}...")
print()

# Example without boxed format
no_box_completion = [[{"role": "assistant", 
                      "content": "The answer is 52 but I forgot to put it in the box format."}]]
ground_truth_no_box = ["52"]

reward_no_box = RLTrainingPipeline.math_reward_function(no_box_completion, ground_truth_no_box)
print(f"No box format reward: {reward_no_box[0]}")
print(f"Response: {no_box_completion[0][0]['content']}")
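
The three cases above fit a reward that extracts the `\boxed{...}` value and compares it with the ground truth. The sketch below shows one way such a function could look. It is an assumption about the behavior, not the actual body of `RLTrainingPipeline.math_reward_function`: the name `boxed_match_reward`, the binary 1.0/0.0 scoring, and the exact-string comparison are all illustrative.

```python
import re

# Matches \boxed{...} and captures what is inside the braces.
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")

def boxed_match_reward(completions, ground_truth):
    """Return 1.0 when the last boxed value equals the reference, else 0.0."""
    rewards = []
    for completion, truth in zip(completions, ground_truth):
        text = completion[0]["content"]
        found = BOXED_PATTERN.findall(text)
        answer = found[-1].strip() if found else None
        rewards.append(1.0 if answer == truth.strip() else 0.0)
    return rewards

demo = [[{"role": "assistant", "content": "40 + 12 = 52, so \\boxed{52}."}]]
print(boxed_match_reward(demo, ["52"]))  # [1.0]
```

Under this sketch a correct number outside the box earns nothing, so the format instruction in the system prompt is part of what gets rewarded.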

## Step 2: Load and Explore the Math Dataset

In [None]:
# Load the GSM8K dataset
print(f"Loading dataset: {MATH_DATASET}")
dataset = load_dataset(MATH_DATASET, DATASET_SUBSET)

train_dataset = dataset["train"]
test_dataset = dataset["test"]

print(f"Full train dataset size: {len(train_dataset)}")
print(f"Full test dataset size: {len(test_dataset)}")
print(f"Dataset columns: {train_dataset.column_names}")

# Select subsets for demonstration (a falsy limit keeps the full split)
if MAX_TRAIN_SAMPLES:
    train_dataset = train_dataset.select(range(min(MAX_TRAIN_SAMPLES, len(train_dataset))))
eval_dataset = test_dataset  # default, so eval_dataset is always defined
if MAX_EVAL_SAMPLES:
    eval_dataset = test_dataset.select(range(min(MAX_EVAL_SAMPLES, len(test_dataset))))

print(f"\nUsing {len(train_dataset)} training samples")
print(f"Using {len(eval_dataset)} evaluation samples")

In [None]:
# Display sample math problems
print("=== SAMPLE MATH PROBLEMS ===")
for i in range(2):
    example = train_dataset[i]
    print(f"\nProblem {i+1}:")
    print(f"Question: {example['question']}")
    print(f"Answer: {example['answer']}")
    print("-" * 50)
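
In GSM8K, the `answer` field holds a worked solution that ends with `#### ` followed by the final number. To compare against a boxed answer, only that final number is needed. The helper below is a hypothetical sketch of this extraction (`extract_gsm8k_answer` is not a function from the course code, and `prepare_math_dataset` may do this differently). The example text is made up in the GSM8K style.

```python
def extract_gsm8k_answer(answer_text):
    """Return the value after the '####' marker, without thousands separators."""
    final = answer_text.split("####")[-1].strip()
    return final.replace(",", "")

# Made-up solution text that follows the GSM8K layout.
example_answer = "She buys 3 * 4 = <<3*4=12>>12 apples.\n#### 12"
print(extract_gsm8k_answer(example_answer))  # 12
```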

## Step 3: Load and Test Base Model

Let's load the base model and evaluate its performance on math problems.

In [None]:
# Initialize RL pipeline
print("Initializing RL training pipeline...")
rl_pipeline = RLTrainingPipeline(BASE_MODEL, use_gpu=USE_GPU)
rl_pipeline.load_model()

print(f"\nModel loaded: {BASE_MODEL}")
print(f"Model device: {next(rl_pipeline.model.parameters()).device}")
print(f"Number of parameters: {sum(p.numel() for p in rl_pipeline.model.parameters()):,}")

In [None]:
# Prepare datasets for training
print("Preparing datasets...")
train_dataset_processed = rl_pipeline.prepare_math_dataset(train_dataset)
eval_dataset_processed = rl_pipeline.prepare_math_dataset(eval_dataset)

print("Datasets prepared successfully!")

# Show processed format
sample = train_dataset_processed[0]
print("\nSample processed data:")
print(f"Prompt: {sample['prompt'][-1]['content'][:100]}...")
print(f"Ground truth: {sample['ground_truth']}")
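
The cell above reads the question from `sample['prompt'][-1]['content']` and the reference from `sample['ground_truth']`, which suggests each processed row holds a chat-style message list plus the answer. A sketch of that layout follows, assuming a system message comes first. `to_chat_example` is a hypothetical helper, and the real `prepare_math_dataset` may add or name fields differently.

```python
def to_chat_example(question, ground_truth, system_prompt):
    """Build one training row: a chat prompt plus the reference answer."""
    return {
        "prompt": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "ground_truth": ground_truth,
    }

row = to_chat_example(
    question="What is 5 * 8 + 12?",
    ground_truth="52",
    system_prompt="Always include the final numeric answer inside \\boxed{}.",
)
print(row["prompt"][-1]["content"])  # the user question
print(row["ground_truth"])           # 52
```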

## Step 4: Evaluate Base Model Performance

In [None]:
# Evaluate base model before training
print("Evaluating base model performance...")
base_accuracy = rl_pipeline.evaluate_model(
    eval_dataset_processed,
    rl_pipeline.math_reward_function,
    title="Base Model Performance (Before RL Training)"
)

print(f"\nBase model accuracy: {base_accuracy:.1%}")
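
With a binary reward, accuracy is the fraction of evaluation problems that earn a positive reward. The sketch below assumes that `evaluate_model` reports accuracy this way, which the notebook does not show. The rewards in the example are made up.

```python
def accuracy_from_rewards(rewards):
    """Fraction of responses that earned a positive reward."""
    if not rewards:
        raise ValueError("need at least one reward")
    return sum(1 for r in rewards if r > 0) / len(rewards)

rewards = [1.0, 0.0, 0.0, 1.0, 0.0]  # made-up rewards for 5 eval problems
print(f"Accuracy: {accuracy_from_rewards(rewards):.1%}")  # Accuracy: 40.0%
print(f"Smallest possible change: {1 / len(rewards):.0%}")  # 20%
```

With `MAX_EVAL_SAMPLES = 5`, a single problem moves the score by 20 percentage points, so small differences between runs should be read with caution.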

## Step 5: Set Up and Run GRPO Training

Now we'll configure and run the GRPO training to improve mathematical reasoning.

In [None]:
# Setup GRPO training
print("Setting up GRPO training...")
rl_pipeline.setup_training(
    train_dataset_processed,
    rl_pipeline.math_reward_function,
    learning_rate=LEARNING_RATE,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_generations=NUM_GENERATIONS,
    logging_steps=2
)

print("GRPO training configuration set up successfully!")
print(f"- Will generate {NUM_GENERATIONS} responses per math problem")
print(f"- Learning rate: {LEARNING_RATE}")
print(f"- Training on {len(train_dataset_processed)} problems")
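
For a rough sense of cost, the number of completions that must be generated and scored grows with both the prompt count and the group size. The estimate below assumes each prompt is visited once per epoch and ignores how the trainer batches the work. `sampling_budget` is a hypothetical helper, not part of the pipeline.

```python
def sampling_budget(num_prompts, num_generations, num_epochs=1):
    """Total completions generated and scored over the whole run."""
    return num_prompts * num_generations * num_epochs

# Values copied from the configuration cell of this notebook.
NUM_PROMPTS = 10
GROUP_SIZE = 4

print(sampling_budget(NUM_PROMPTS, GROUP_SIZE))  # 40
```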

In [None]:
# Run GRPO training
print("Starting GRPO training...")
print("This process will:")
print("1. Generate multiple responses per math problem")
print("2. Score each response using the reward function")
print("3. Update the model to prefer better responses")
print("-" * 60)

rl_pipeline.train()

print("-" * 60)
print("GRPO training completed!")
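
Step 3 of the process printed above, updating the model to prefer better responses, is described in the GRPO literature as a PPO-style clipped objective weighted by the group-relative advantage. The sketch below shows that objective for a single response. It is a simplification under stated assumptions: real implementations work per token and may add a KL penalty against a reference model, and the notebook does not show which settings `RLTrainingPipeline` uses. The name `clipped_surrogate` and the `clip_eps` default are illustrative.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped objective for one response; the training loss is its negative."""
    ratio = math.exp(logp_new - logp_old)  # new policy prob / old policy prob
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to move the ratio past the clip range.
    return min(ratio * advantage, clipped * advantage)

# A good response (positive advantage) whose probability already rose by 50%:
# the objective is capped at 1.2 * advantage.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))

# A bad response (negative advantage) whose probability already halved:
# the objective is held at 0.8 * advantage.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
```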

## Step 6: Evaluate Trained Model Performance

In [None]:
# Evaluate the trained model
print("Evaluating trained model performance...")
trained_accuracy = rl_pipeline.evaluate_model(
    eval_dataset_processed,
    rl_pipeline.math_reward_function,
    title="RL-Trained Model Performance (After GRPO)"
)

print(f"\nTrained model accuracy: {trained_accuracy:.1%}")

## Step 7: Performance Analysis and Comparison

In [None]:
# Performance summary
improvement = trained_accuracy - base_accuracy

print("=" * 60)
print("TRAINING PERFORMANCE SUMMARY")
print("=" * 60)
print(f"Base model accuracy:    {base_accuracy:.1%}")
print(f"Trained model accuracy: {trained_accuracy:.1%}")
print(f"Absolute improvement:   {improvement:+.1%}")
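if base_accuracy > 0:
    print(f"Relative improvement:   {improvement / base_accuracy:+.1%}")
else:
    # A ratio against a zero baseline is undefined, so do not report one
    print("Relative improvement:   n/a (base accuracy is 0)")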
print("=" * 60)

if improvement > 0:
    print("✅ GRPO training successfully improved model performance!")
elif improvement == 0:
    print("➖ No change in performance. Consider more training or different hyperparameters.")
else:
    print("⚠️  Performance decreased. This can happen with very small datasets or high learning rates.")

## Step 8: Detailed Response Comparison

In [None]:
# Compare responses before and after training
print("=== DETAILED RESPONSE COMPARISON ===")

# Load original model for comparison
original_model, original_tokenizer = load_model_and_tokenizer(BASE_MODEL, USE_GPU)

# Test on a few problems
comparison_problems = eval_dataset_processed.select(range(2))

for i, problem in enumerate(comparison_problems):
    print(f"\n--- Problem {i+1} ---")
    print(f"Question: {problem['prompt'][-1]['content']}")
    print(f"Ground Truth: {problem['ground_truth']}")
    
    # Original model response
    original_response = generate_responses(
        original_model, original_tokenizer, full_message=problem['prompt']
    )
    
    # Trained model response
    trained_response = generate_responses(
        rl_pipeline.trainer.model, rl_pipeline.tokenizer, full_message=problem['prompt']
    )
    
    print(f"\nOriginal Model: {original_response}")
    print(f"\nTrained Model:  {trained_response}")
    
    # Check if answers are correct
    original_correct = rl_pipeline.math_reward_function([[{"role": "assistant", "content": original_response}]], [problem['ground_truth']])[0]
    trained_correct = rl_pipeline.math_reward_function([[{"role": "assistant", "content": trained_response}]], [problem['ground_truth']])[0]
    
    print(f"\nOriginal Correct: {'✅' if original_correct else '❌'}")
    print(f"Trained Correct:  {'✅' if trained_correct else '❌'}")
    print("-" * 80)

# Clean up
del original_model, original_tokenizer

## Step 9: Save the RL-Trained Model

In [None]:
# Save the trained model
output_dir = "../models/rl_trained_model"
rl_pipeline.save_model(output_dir)

print(f"RL-trained model saved to: {output_dir}")
print("You can now load this model for inference or further training.")

## Summary and Key Takeaways

### What we accomplished:

1. **Tested a reward function** for mathematical reasoning on correct, incorrect, and unboxed answers
2. **Loaded and processed** the GSM8K math dataset
3. **Evaluated base model performance** on math problems
4. **Trained using GRPO** with multiple generations per prompt
5. **Measured improvement** in mathematical reasoning ability
6. **Compared responses** before and after training

### Key insights about Online RL with GRPO:

- **Reward-driven learning**: The model learns from immediate feedback on response quality
- **Multiple generations**: Generating several responses per prompt provides richer training signal
- **Relative optimization**: GRPO compares each response with the other responses to the same prompt (its group), not with the rest of the batch
- **Task-specific improvement**: Performance improves specifically on the rewarded task (math reasoning)

### GRPO advantages:

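- **No learned reward model needed here**: When answers can be checked programmatically, as with math, a direct reward function replaces a learned reward model
- **Online learning**: Trains on the model's own generated data
- **Stable training**: Normalizing rewards within each group gives a per-prompt baseline, which reduces the variance of the policy-gradient estimate
- **Efficient**: Verifiable rewards need no human preference labels, so useful training signal can come from relatively small datasets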

### Considerations:

- **Reward function quality**: Performance is limited by the quality of the reward function
- **Computational cost**: Multiple generations increase computational requirements
- **Hyperparameter sensitivity**: Learning rate and generation count need careful tuning
- **Task specificity**: Improvements may not transfer to other tasks

### Next steps:

- Try with larger models and datasets
- Experiment with different reward functions
- Combine RL with SFT and DPO for comprehensive post-training
- Apply to other reasoning tasks (coding, logical reasoning, etc.)
