# GRPO Training for Qwen2.5-Math-1.5B on Google Colab

This notebook demonstrates how to train a math reasoning model using **Group Relative Policy Optimization (GRPO)** on the MATH dataset.

## Requirements
- Google Colab with GPU (T4 or better, A100 recommended)
- ~16GB GPU memory for training

## What is GRPO?
GRPO (introduced in DeepSeekMath and later used for DeepSeek-R1) is a policy gradient method that:
1. Generates multiple responses per question
2. Computes rewards based on answer correctness
3. Normalizes rewards within each group to get advantages
4. Updates the policy with a policy gradient loss weighted by those advantages
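
The group normalization in step 3 can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the function name `group_normalized_advantages`, the `eps` value, and the use of the population standard deviation are assumptions (some implementations use the sample standard deviation or skip the division entirely).

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, group_size, eps=1e-6):
    """Normalize rewards within consecutive groups of `group_size` responses.

    `rewards` is a flat list where each run of `group_size` entries holds the
    rewards for responses sampled from the same question.
    """
    if len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a multiple of group_size")

    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mu = mean(group)
        sigma = pstdev(group)  # population std; an assumption, see lead-in
        # eps keeps the division finite when every reward in a group is equal
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two questions, four sampled responses each (1.0 = correct, 0.0 = incorrect)
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = group_normalized_advantages(rewards, group_size=4)
print(adv)
```

In the second group every response has the same reward, so all of its advantages are zero and that question contributes no gradient signal for the step.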

## 1. Setup Environment

In [13]:
# Check GPU availability
!nvidia-smi

Sun Feb  8 03:52:55 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
| N/A   34C    P0             54W /  400W |       5MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [14]:
# Clone the repository
!git clone https://github.com/bearbearyu1223/qwen_math_grpo.git
%cd qwen_math_grpo

Cloning into 'qwen_math_grpo'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 36 (delta 9), reused 32 (delta 5), pack-reused 0 (from 0)
Receiving objects: 100% (36/36), 204.80 KiB | 762.00 KiB/s, done.
Resolving deltas: 100% (9/9), done.
/content/qwen_math_grpo/qwen_math_grpo/qwen_math_grpo


In [None]:
# Install uv package manager
!curl -LsSf https://astral.sh/uv/install.sh | sh

# Add uv to PATH (source doesn't work with ! in Colab)
import os
os.environ["PATH"] = f"{os.environ['HOME']}/.local/bin:{os.environ['PATH']}"

# Install all dependencies including vLLM
!uv sync --extra vllm

In [16]:
# Verify installation
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

PyTorch version: 2.9.1+cu128
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB
GPU Memory: 85.2 GB


## 2. Download Dataset and Model

In [17]:
# Download the MATH dataset
!uv run python scripts/download_dataset.py

Downloading dataset: nlile/hendrycks-MATH-benchmark
Output directory: /content/qwen_math_grpo/qwen_math_grpo/qwen_math_grpo/data/math
Splits: ['train', 'test']

Saving train split (12000 examples) to data/math/train.jsonl
  Saved 12000 examples
Saving test split (500 examples) to data/math/test.jsonl
  Saved 500 examples

Download complete!


In [18]:
# Verify dataset
!wc -l data/math/train.jsonl data/math/test.jsonl

   12000 data/math/train.jsonl
     500 data/math/test.jsonl
   12500 total


In [19]:
# Preview a sample from the dataset
import json

with open('data/math/train.jsonl') as f:
    sample = json.loads(f.readline())

print("Problem:")
print(sample['problem'][:500])
print("\nAnswer:", sample['answer'])

Problem:
How many vertical asymptotes does the graph of $y=\frac{2}{x^2+x-6}$ have?

Answer: 2


## 3. Run GRPO Training

### Training Configuration

For Colab with a single GPU, we'll use single-GPU mode. Adjust parameters based on your GPU memory:

| GPU | Recommended Settings |
|-----|---------------------|
| T4 (16GB) | `--rollout-batch-size 8 --train-batch-size 8` |
| A100 (40GB) | `--rollout-batch-size 32 --train-batch-size 32` |
| A100 (80GB) | `--rollout-batch-size 64 --train-batch-size 64` |
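
As a rough sketch, the table can be turned into a small helper that picks the flags from the detected GPU memory. The function name `suggest_batch_flags` and the 70 GB and 35 GB thresholds are illustrative assumptions chosen to separate the three GPUs in the table; they do not come from the repository.

```python
def suggest_batch_flags(total_memory_gb: float) -> str:
    """Return the table's batch-size flags for a GPU with the given memory.

    The thresholds below are assumptions that separate the three table rows.
    """
    if total_memory_gb >= 70:      # A100 (80GB)
        size = 64
    elif total_memory_gb >= 35:    # A100 (40GB)
        size = 32
    else:                          # T4 (16GB) or smaller
        size = 8
    return f"--rollout-batch-size {size} --train-batch-size {size}"


print(suggest_batch_flags(85.2))
```

In the notebook, the memory figure could come from `torch.cuda.get_device_properties(0).total_memory / 1e9`, the same expression used in the installation check.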

In [None]:
# Run a quick test (10 steps) to verify training and collect metrics
!uv run python scripts/run_grpo.py \
    --model-name-or-path Qwen/Qwen2.5-Math-1.5B \
    --single-gpu \
    --policy-device cuda:0 \
    --rollout-batch-size 32 \
    --train-batch-size 32 \
    --gradient-accumulation-steps 8 \
    --n-grpo-steps 10 \
    --eval-steps 5 \
    --output-dir outputs/grpo_model

## 4. Evaluate the Trained Model

**Note:** If you encounter GPU memory issues when running vLLM evaluation, restart the runtime (Runtime → Restart runtime) to free GPU memory, then run the cells below to restore the environment.

In [None]:
# Run this cell after restarting runtime to restore environment
import os
os.environ["PATH"] = f"{os.environ['HOME']}/.local/bin:{os.environ['PATH']}"
%cd /content/qwen_math_grpo

In [None]:
# Check saved model and training history
!ls -la outputs/grpo_model/
print("\nTraining history saved:")
!ls -la outputs/grpo_model/training_history.json 2>/dev/null || echo "Training history will be saved after training"

In [None]:
# Evaluate the GRPO-trained model (100 samples for quick evaluation)
!uv run python scripts/run_math_eval.py \
    --model-name-or-path outputs/grpo_model/final \
    --input-path data/math/test.jsonl \
    --output-path outputs/grpo_eval_results.jsonl \
    --backend vllm \
    --num-samples 100

In [32]:
# Evaluate the base model for comparison
!uv run python scripts/run_math_eval.py \
    --model-name-or-path Qwen/Qwen2.5-Math-1.5B \
    --input-path data/math/test.jsonl \
    --output-path outputs/base_eval_results.jsonl \
    --backend vllm \
    --num-samples 100

2026-02-08 04:41:55,196 - __main__ - INFO - Evaluating model: Qwen/Qwen2.5-Math-1.5B
2026-02-08 04:41:55,196 - __main__ - INFO - Backend: vllm
2026-02-08 04:41:55,196 - __main__ - INFO - Input: data/math/test.jsonl
2026-02-08 04:41:55,196 - __main__ - INFO - Output: outputs/base_eval_results.jsonl
2026-02-08 04:41:55,200 - cs336_alignment.evaluate_math - INFO - Read 500 examples from data/math/test.jsonl
2026-02-08 04:41:55,200 - cs336_alignment.evaluate_math - INFO - Limiting evaluation to 100 samples
2026-02-08 04:41:59,222 - cs336_alignment.evaluate_math - INFO - Loading model Qwen/Qwen2.5-Math-1.5B with vLLM backend...
INFO 02-08 04:41:59 [utils.py:261] non-default args: {'trust_remote_code': True, 'disable_log_stats': True, 'model': 'Qwen/Qwen2.5-Math-1.5B'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.

## 5. Compare Results

In [34]:
import json
from statistics import mean

def load_results(path):
    results = []
    with open(path) as f:
        for line in f:
            results.append(json.loads(line))
    return results

def compute_metrics(results):
    format_rewards = [r['metrics']['format_reward'] for r in results]
    answer_rewards = [r['metrics']['answer_reward'] for r in results]
    return {
        'format_accuracy': mean(format_rewards),
        'answer_accuracy': mean(answer_rewards),
        'n_samples': len(results)
    }

# Load and compare results
try:
    grpo_results = load_results('outputs/grpo_eval_results.jsonl')
    base_results = load_results('outputs/base_eval_results.jsonl')

    grpo_metrics = compute_metrics(grpo_results)
    base_metrics = compute_metrics(base_results)

    print("=" * 50)
    print("EVALUATION COMPARISON")
    print("=" * 50)
    print(f"\n{'Model':<25} {'Format Acc':<15} {'Answer Acc':<15}")
    print("-" * 55)
    print(f"{'Base (Qwen2.5-Math-1.5B)':<25} {base_metrics['format_accuracy']:<15.2%} {base_metrics['answer_accuracy']:<15.2%}")
    print(f"{'GRPO-Trained':<25} {grpo_metrics['format_accuracy']:<15.2%} {grpo_metrics['answer_accuracy']:<15.2%}")
    print("-" * 55)

    improvement = grpo_metrics['answer_accuracy'] - base_metrics['answer_accuracy']
    print(f"\nImprovement: {improvement:+.2%}")
except FileNotFoundError as e:
    print(f"Results file not found: {e}")
    print("Make sure to run the evaluation cells above first.")

==================================================
EVALUATION COMPARISON
==================================================

Model                     Format Acc      Answer Acc     
-------------------------------------------------------
Base (Qwen2.5-Math-1.5B)  38.00%          19.00%         
GRPO-Trained              59.00%          28.00%         
-------------------------------------------------------

Improvement: +9.00%
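
With only 100 evaluation samples, each accuracy figure carries noticeable sampling noise. The sketch below estimates it with the binomial standard error, which assumes each test problem is an independent draw. The helper name `accuracy_with_stderr` is hypothetical; the accuracies are the ones printed above.

```python
from math import sqrt


def accuracy_with_stderr(accuracy: float, n_samples: int):
    """Return (accuracy, binomial standard error) for a measured accuracy."""
    stderr = sqrt(accuracy * (1.0 - accuracy) / n_samples)
    return accuracy, stderr


# Answer accuracies reported in the comparison table (100 samples each)
for name, acc in [("Base", 0.19), ("GRPO-Trained", 0.28)]:
    p, se = accuracy_with_stderr(acc, 100)
    print(f"{name}: {p:.2%} +/- {se:.2%} (one standard error)")
```

Each standard error is roughly 4 percentage points, so the 9-point gain is suggestive rather than conclusive. Evaluating on all 500 test problems would tighten the estimate.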


## 6. Plot Training Metrics

Visualize the training progress with reward and loss curves.

In [None]:
# Install matplotlib and plot training metrics
!pip install matplotlib -q

import json
import matplotlib.pyplot as plt

# Load training history
history_path = 'outputs/grpo_model/training_history.json'
try:
    with open(history_path) as f:
        history = json.load(f)
    
    # Extract metrics
    steps = [h["grpo_step"] for h in history]
    reward_mean = [h["reward_mean"] for h in history]
    answer_reward = [h["answer_reward_mean"] for h in history]
    loss = [h.get("loss", float("nan")) for h in history]  # NaN leaves a gap instead of plotting a fake 0
    val_reward = [h.get("val_reward") for h in history]
    
    # Create figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle("GRPO Training Metrics", fontsize=14)
    
    # Plot 1: Reward Mean
    ax1 = axes[0, 0]
    ax1.plot(steps, reward_mean, "b-", linewidth=1.5, label="Reward Mean")
    ax1.set_xlabel("GRPO Step")
    ax1.set_ylabel("Reward")
    ax1.set_title("Average Reward per Step")
    ax1.grid(True, alpha=0.3)
    ax1.legend()
    
    # Plot 2: Answer Reward
    ax2 = axes[0, 1]
    ax2.plot(steps, answer_reward, "g-", linewidth=1.5, label="Answer Reward")
    # Add validation reward if available
    val_steps = [s for s, v in zip(steps, val_reward) if v is not None]
    val_values = [v for v in val_reward if v is not None]
    if val_values:
        ax2.plot(val_steps, val_values, "r--", linewidth=2, marker="o", markersize=6, label="Val Reward")
    ax2.set_xlabel("GRPO Step")
    ax2.set_ylabel("Answer Reward")
    ax2.set_title("Answer Reward (Train vs Val)")
    ax2.grid(True, alpha=0.3)
    ax2.legend()
    
    # Plot 3: Loss
    ax3 = axes[1, 0]
    ax3.plot(steps, loss, "r-", linewidth=1.5, label="Policy Loss")
    ax3.set_xlabel("GRPO Step")
    ax3.set_ylabel("Loss")
    ax3.set_title("Policy Gradient Loss")
    ax3.grid(True, alpha=0.3)
    ax3.legend()
    
    # Plot 4: Reward Statistics
    ax4 = axes[1, 1]
    reward_max = [h["reward_max"] for h in history]
    reward_min = [h["reward_min"] for h in history]
    ax4.fill_between(steps, reward_min, reward_max, alpha=0.3, color="blue", label="Min-Max Range")
    ax4.plot(steps, reward_mean, "b-", linewidth=1.5, label="Mean")
    ax4.set_xlabel("GRPO Step")
    ax4.set_ylabel("Reward")
    ax4.set_title("Reward Range (Min/Max/Mean)")
    ax4.grid(True, alpha=0.3)
    ax4.legend()
    
    plt.tight_layout()
    plt.savefig("outputs/grpo_model/training_plot.png", dpi=150, bbox_inches="tight")
    plt.show()
    print("\nPlot saved to outputs/grpo_model/training_plot.png")
    
except FileNotFoundError:
    print(f"Training history not found at {history_path}")
    print("Make sure training has completed successfully.")

In [None]:
# Print training summary
try:
    with open('outputs/grpo_model/training_history.json') as f:
        history = json.load(f)
    
    if history:
        first = history[0]
        last = history[-1]
        
        print("=" * 60)
        print("TRAINING SUMMARY")
        print("=" * 60)
        print(f"\nTotal GRPO steps: {len(history)}")
        
        print(f"\nInitial metrics (step 0):")
        print(f"  Reward Mean: {first['reward_mean']:.4f}")
        print(f"  Answer Reward: {first['answer_reward_mean']:.4f}")
        print(f"  Loss: {first.get('loss', 'N/A')}")
        
        print(f"\nFinal metrics (step {last['grpo_step']}):")
        print(f"  Reward Mean: {last['reward_mean']:.4f}")
        print(f"  Answer Reward: {last['answer_reward_mean']:.4f}")
        print(f"  Loss: {last.get('loss', 'N/A')}")
        
        # Improvement
        if first["answer_reward_mean"] != 0:
            improvement = (last["answer_reward_mean"] - first["answer_reward_mean"]) / first["answer_reward_mean"] * 100
            print(f"\nAnswer Reward Improvement: {improvement:+.1f}%")
        else:
            improvement = last["answer_reward_mean"] - first["answer_reward_mean"]
            print(f"\nAnswer Reward Change: {improvement:+.4f}")
        
        # Best validation reward
        val_rewards = [(h["grpo_step"], h["val_reward"]) for h in history if h.get("val_reward") is not None]
        if val_rewards:
            best_step, best_val = max(val_rewards, key=lambda x: x[1])
            print(f"\nBest Validation Reward: {best_val:.4f} (step {best_step})")
        
        print("=" * 60)
except FileNotFoundError:
    print("Training history not found. Run training first.")

## 7. View Evaluation Reports (Optional)

View detailed analysis of model predictions.

In [None]:
# View GRPO model analysis report
!head -n 60 outputs/grpo_eval_results_analysis.txt 2>/dev/null || echo "Analysis report not found. Run evaluation first."

In [None]:
# View base model analysis report
!head -n 60 outputs/base_eval_results_analysis.txt 2>/dev/null || echo "Analysis report not found. Run base model evaluation first."

## 8. Interactive Testing

In [None]:
# Load the trained model for interactive testing
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "outputs/grpo_model/final"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

print("Model loaded successfully!")

In [None]:
# Test with a math problem
def solve_math_problem(question):
    prompt = f"""A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: {question}
Assistant: <think>"""

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1024,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
        )

    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response

# Example problem
question = "What is the sum of all positive integers n such that n^2 + n + 1 divides n^4 + 2n^3 + 3n^2 + 2n + 1?"
print(f"Question: {question}\n")
print("Model's Response:")
print(solve_math_problem(question))

In [None]:
# Try your own math problem
your_question = "If x + y = 10 and xy = 21, what is x^2 + y^2?"
print(f"Question: {your_question}\n")
print("Model's Response:")
print(solve_math_problem(your_question))
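
To pull just the final answer out of a response, a small regex helper is often enough. This sketch assumes the model closes its reasoning with `</think>` and wraps the final answer in `<answer>...</answer>` tags, which the prompt above does not state explicitly. The helper name `extract_answer` is hypothetical, and the example response is hand-written, not model output.

```python
import re

# Non-greedy match so only the first <answer>...</answer> pair is captured
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response: str):
    """Return the text inside the first <answer> tag pair, or None if absent."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).strip() if match else None


# Hand-written example in the assumed output format
example = "x^2 + y^2 = (x + y)^2 - 2xy = 100 - 42 = 58 </think> <answer> 58 </answer>"
print(extract_answer(example))
```

A `None` result means the response did not follow the assumed format, which is a useful signal in its own right when spot-checking generations.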

## Notes

### Training Tips
- Start with a small number of steps (5-10) to verify everything works
- Monitor GPU memory usage and adjust batch sizes accordingly
- Use Weights & Biases for experiment tracking: add `--wandb-project your-project-name`

### Expected Results
- Base Qwen2.5-Math-1.5B: 38% format accuracy and 19% answer accuracy in the 100-sample evaluation above; expect some variation between runs
- After GRPO training: both format and answer accuracy should improve (59% and 28% in the comparison above)

### Troubleshooting
- **OOM Error**: Reduce `--rollout-batch-size` and `--train-batch-size`
- **Slow Training**: This is expected on T4; consider using A100 for faster training
- **Low Accuracy**: Try more training steps or adjust learning rate