# GRPO Training for Qwen2.5-Math-1.5B on Google Colab

This notebook demonstrates how to train a math reasoning model using **Group Relative Policy Optimization (GRPO)** on the MATH dataset.

## Requirements
- Google Colab with GPU (T4 or better, A100 recommended)
- ~16GB GPU memory for training

## What is GRPO?
GRPO (introduced in the DeepSeekMath paper and later used to train DeepSeek-R1) is a policy gradient method that:
1. Generates multiple responses per question
2. Computes a reward for each response based on answer correctness
3. Normalizes rewards within each group to get advantages
4. Updates the policy with a policy gradient loss weighted by those advantages
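
Step 3 is the part that gives GRPO its name, so a small sketch may help. The function below is an illustration only, not the code used by `scripts/run_grpo.py`: the name `group_normalized_advantages`, the `eps` constant, and the choice of population standard deviation are all assumptions (some implementations use the sample standard deviation, or subtract the mean without dividing).

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, group_size, eps=1e-6):
    """Turn a flat list of rewards into per-response advantages.

    Assumes rewards are ordered so that each consecutive run of
    `group_size` entries belongs to the same question.
    """
    if len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a multiple of group_size")
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mu = mean(group)
        sigma = pstdev(group)  # population std; an assumption of this sketch
        # eps keeps the division finite when every reward in a group is equal
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two questions, four sampled responses each (1.0 = correct, 0.0 = wrong)
rewards = [1.0, 0.0, 0.0, 1.0,   # question 1: mixed outcomes
           0.0, 0.0, 0.0, 0.0]   # question 2: every response wrong
print(group_normalized_advantages(rewards, group_size=4))
```

Note that a group in which every response receives the same reward yields all-zero advantages, so that question contributes no gradient signal for the step.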

## 1. Setup Environment

In [None]:
# Check GPU availability
!nvidia-smi

In [None]:
# Clone the repository
!git clone https://github.com/bearbearyu1223/qwen_math_grpo.git
%cd qwen_math_grpo

In [None]:
# Install uv package manager
!curl -LsSf https://astral.sh/uv/install.sh | sh

# Add uv to PATH (each ! command runs in its own subshell, so `source` would not persist)
import os
os.environ["PATH"] = f"{os.environ['HOME']}/.local/bin:{os.environ['PATH']}"

# Install base dependencies
!uv sync

# Install vLLM separately (needs system CUDA compatibility)
# Quote the requirement so the shell does not treat ">" as an output redirect
!uv pip install "vllm>=0.8.4"

In [None]:
# Verify installation
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Download Dataset and Model

In [None]:
# Download the MATH dataset
!uv run python scripts/download_dataset.py

In [None]:
# Verify dataset
!wc -l data/math/train.jsonl data/math/test.jsonl

In [None]:
# Preview a sample from the dataset
import json

with open('data/math/train.jsonl') as f:
    sample = json.loads(f.readline())
    
print("Problem:")
print(sample['problem'][:500])
print("\nAnswer:", sample['answer'])

## 3. Run GRPO Training

### Training Configuration

On Colab with a single GPU, run in single-GPU mode (`--single-gpu`). Adjust the batch sizes to your GPU memory:

| GPU | Recommended Settings |
|-----|---------------------|
| T4 (16GB) | `--rollout-batch-size 8 --train-batch-size 8` |
| A100 (40GB) | `--rollout-batch-size 32 --train-batch-size 32` |
| A100 (80GB) | `--rollout-batch-size 64 --train-batch-size 64` |
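
If you would rather pick the flags programmatically, the table above can be encoded as a small lookup. This is a sketch under assumptions: `suggest_batch_flags` is a hypothetical helper, and the 35 GB and 75 GB thresholds are guesses chosen to leave slack below the nominal 40 GB and 80 GB sizes, because the memory a driver reports rarely matches the marketing figure exactly.

```python
def suggest_batch_flags(gpu_memory_gb):
    """Map total GPU memory (in GB) to the batch-size flags from the table."""
    # (minimum memory in GB, batch size), largest tier first.
    # Thresholds are assumptions, set a little below the nominal sizes.
    tiers = [(75, 64), (35, 32), (0, 8)]
    for min_gb, batch in tiers:
        if gpu_memory_gb >= min_gb:
            return f"--rollout-batch-size {batch} --train-batch-size {batch}"
    raise ValueError("gpu_memory_gb must be non-negative")


# Example values in the same units as the "GPU Memory" line printed earlier
for memory_gb in (15.8, 42.5, 85.0):
    print(f"{memory_gb:>5.1f} GB -> {suggest_batch_flags(memory_gb)}")
```

Anything below the middle threshold falls back to the T4 settings, which is the conservative choice when the GPU is not one of the three listed.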

In [None]:
# Quick test run (5 steps) to verify everything works
!uv run python scripts/run_grpo.py \
    --model-name-or-path Qwen/Qwen2.5-Math-1.5B \
    --single-gpu \
    --policy-device cuda:0 \
    --rollout-batch-size 8 \
    --train-batch-size 8 \
    --gradient-accumulation-steps 8 \
    --n-grpo-steps 5 \
    --output-dir outputs/grpo_test

In [None]:
# Full training run (adjust n-grpo-steps based on your time budget)
!uv run python scripts/run_grpo.py \
    --model-name-or-path Qwen/Qwen2.5-Math-1.5B \
    --single-gpu \
    --policy-device cuda:0 \
    --rollout-batch-size 8 \
    --train-batch-size 8 \
    --gradient-accumulation-steps 8 \
    --n-grpo-steps 100 \
    --eval-steps 20 \
    --save-steps 50 \
    --output-dir outputs/grpo_model

## 4. Evaluate the Trained Model

In [None]:
# Check saved model
!ls -la outputs/grpo_model/

In [None]:
# Evaluate the GRPO-trained model
!uv run python scripts/run_math_eval.py \
    --model-name-or-path outputs/grpo_model/final \
    --input-path data/math/test.jsonl \
    --output-path outputs/grpo_eval_results.jsonl \
    --backend transformers \
    --num-samples 100

In [None]:
# Evaluate the base model for comparison
!uv run python scripts/run_math_eval.py \
    --model-name-or-path Qwen/Qwen2.5-Math-1.5B \
    --input-path data/math/test.jsonl \
    --output-path outputs/base_eval_results.jsonl \
    --backend transformers \
    --num-samples 100

## 5. Compare Results

In [None]:
import json
from statistics import mean

def load_results(path):
    results = []
    with open(path) as f:
        for line in f:
            results.append(json.loads(line))
    return results

def compute_metrics(results):
    format_rewards = [r['metrics']['format_reward'] for r in results]
    answer_rewards = [r['metrics']['answer_reward'] for r in results]
    return {
        'format_accuracy': mean(format_rewards),
        'answer_accuracy': mean(answer_rewards),
        'n_samples': len(results)
    }

# Load and compare results
try:
    grpo_results = load_results('outputs/grpo_eval_results.jsonl')
    base_results = load_results('outputs/base_eval_results.jsonl')
    
    grpo_metrics = compute_metrics(grpo_results)
    base_metrics = compute_metrics(base_results)
    
    print("=" * 50)
    print("EVALUATION COMPARISON")
    print("=" * 50)
    print(f"\n{'Model':<25} {'Format Acc':<15} {'Answer Acc':<15}")
    print("-" * 55)
    # Width and alignment come before precision and type in a format spec: "<15.2%"
    print(f"{'Base (Qwen2.5-Math-1.5B)':<25} {base_metrics['format_accuracy']:<15.2%} {base_metrics['answer_accuracy']:<15.2%}")
    print(f"{'GRPO-Trained':<25} {grpo_metrics['format_accuracy']:<15.2%} {grpo_metrics['answer_accuracy']:<15.2%}")
    print("-" * 55)
    
    improvement = grpo_metrics['answer_accuracy'] - base_metrics['answer_accuracy']
    print(f"\nImprovement: {improvement:+.2%}")
except FileNotFoundError as e:
    print(f"Results file not found: {e}")
    print("Make sure to run the evaluation cells above first.")
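
With only 100 evaluation samples, a difference of a few points between the two models can be noise. The sketch below estimates how wide that noise band is using a normal-approximation interval. The helper `accuracy_interval` is hypothetical, and the 50% figure is an illustrative input, not a measured result.

```python
from math import sqrt


def accuracy_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation ~95% interval for an accuracy measured on n samples."""
    se = sqrt(accuracy * (1.0 - accuracy) / n_samples)  # binomial standard error
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)


# Illustrative input only: an accuracy of 50% measured on 100 problems
low, high = accuracy_interval(0.50, 100)
print(f"50% on 100 samples -> roughly {low:.1%} to {high:.1%}")
```

For this input the interval spans about ten points either side of the estimate, so raise `--num-samples` if you need to resolve smaller improvements.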

## 6. View Analysis Reports

In [None]:
# View GRPO model analysis report
!head -n 100 outputs/grpo_eval_results_analysis.txt

In [None]:
# View base model analysis report
!head -n 100 outputs/base_eval_results_analysis.txt

## 7. Save Model to Google Drive (Optional)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Copy trained model to Google Drive
!cp -r outputs/grpo_model /content/drive/MyDrive/grpo_model_backup

## 8. Interactive Testing

In [None]:
# Load the trained model for interactive testing
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "outputs/grpo_model/final"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

print("Model loaded successfully!")

In [None]:
# Test with a math problem
def solve_math_problem(question):
    prompt = f"""A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: {question}
Assistant: <think>"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1024,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
        )
    
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response

# Example problem
question = "What is the sum of all positive integers n such that n^2 + n + 1 divides n^4 + 2n^3 + 3n^2 + 2n + 1?"
print(f"Question: {question}\n")
print("Model's Response:")
print(solve_math_problem(question))

In [None]:
# Try your own math problem
your_question = "If x + y = 10 and xy = 21, what is x^2 + y^2?"
print(f"Question: {your_question}\n")
print("Model's Response:")
print(solve_math_problem(your_question))
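
The raw completion mixes reasoning and the final answer, which makes it awkward to check by eye. The helper below is a sketch that separates the two. It assumes the model closes its reasoning with `</think>` and wraps the final answer in `<answer>...</answer>`; that tag format is an assumption about the training prompt and is not confirmed by this notebook, so `split_response` returns `None` for the answer when the tags are absent.

```python
import re


def split_response(response):
    """Split a completion into (reasoning, answer).

    ASSUMPTION: the completion looks like
    "...reasoning...</think> <answer>...</answer>".
    The answer is None when no <answer> tags are found.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    answer = match.group(1).strip() if match else None
    # Everything before the first </think> is treated as reasoning;
    # without that tag the whole response is returned as reasoning.
    reasoning = response.split("</think>", 1)[0].strip()
    return reasoning, answer


# Hand-written example completion (not real model output)
example = " x^2 + y^2 = (x + y)^2 - 2xy = 100 - 42 </think> <answer>58</answer>"
reasoning, answer = split_response(example)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

To use it on a real completion, call `split_response(solve_math_problem(your_question))` and check whether the returned answer is `None` before comparing it with the expected value.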

## Notes

### Training Tips
- Start with a small number of steps (5-10) to verify everything works
- Monitor GPU memory usage and adjust batch sizes accordingly
- Use Weights & Biases for experiment tracking: add `--wandb-project your-project-name`
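
One way to monitor GPU memory from a notebook cell is to query `nvidia-smi` in CSV mode and parse the result. This is a minimal sketch: `parse_gpu_memory` and `gpu_memory_usage` are hypothetical helper names, and the code assumes `nvidia-smi` is on the PATH, as it is on Colab GPU runtimes.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (in MiB) into a list of (used, total) tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def gpu_memory_usage():
    """Ask nvidia-smi for per-GPU memory use in MiB (one tuple per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


# Parsing a hand-written sample line; call gpu_memory_usage() on a GPU runtime
print(parse_gpu_memory("1234, 15360\n"))
```

Calling `gpu_memory_usage()` between a short test run and the full run gives a rough idea of how much headroom remains before raising the batch sizes.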

### Expected Results
- Base Qwen2.5-Math-1.5B: ~50-60% format accuracy; answer accuracy varies
- After GRPO training: expect both format and answer accuracy to improve

### Troubleshooting
- **OOM Error**: Reduce `--rollout-batch-size` and `--train-batch-size`
- **Slow Training**: This is expected on T4; consider using A100 for faster training
- **Low Accuracy**: Train for more steps or adjust the learning rate