# Qwen 1.5B GRPO Training on Colab

This notebook runs GRPO (Group Relative Policy Optimization) on **Qwen2.5-Coder-1.5B-Instruct** to improve its pass rate on the Trapping Rain Water problem.

## Why Colab?

GRPO requires both a **policy model** and a frozen **reference model**, so two copies of the weights plus gradients and optimizer state must fit in GPU memory:
- Local RTX 3080 (10GB): Cannot fit 1.5B + 1.5B + optimizer
- Colab T4 (16GB): Should work with some optimization
- Colab V100/A100: Comfortable fit
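
A rough back-of-the-envelope estimate makes these limits concrete. The sketch below assumes roughly 1.5 billion parameters (an approximation), float16 weights and gradients, and two AdamW moment buffers stored in the same dtype. It ignores activations, the KV cache during generation, and CUDA overhead, so real usage is higher. The function name `estimate_training_memory_gb` is made up for this illustration.

```python
def estimate_training_memory_gb(
    num_params: float = 1.5e9,   # assumed parameter count (approximate)
    bytes_per_value: int = 2,    # float16
) -> dict:
    """Rough lower bound on GPU memory for full fine-tuning with a frozen reference model."""
    gb = 1e9
    weights = num_params * bytes_per_value / gb
    return {
        "policy_weights": weights,
        "reference_weights": weights,
        "gradients": weights,
        "adamw_moments": 2 * weights,  # exp_avg and exp_avg_sq
    }


breakdown = estimate_training_memory_gb()
for name, size in breakdown.items():
    print(f"{name:18} {size:5.1f} GB")
print(f"{'total':18} {sum(breakdown.values()):5.1f} GB")
```

Under these assumptions the static footprint alone is about 15 GB, which is consistent with a 10GB card being too small and a 16GB T4 being tight.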

## Background

From our experiments:
- Qwen 1.5B solves 9/10 hard problems natively (90%)
- Only failure: **Trapping Rain Water** (4/5 = 80%)
- Goal: Use GRPO to fix this single failure

---

## Step 1: Check Runtime

**IMPORTANT**: Make sure you're using a GPU runtime!
- Go to Runtime -> Change runtime type
- Select **GPU** (T4 is fine, V100/A100 is better)
- Click Save

In [None]:
# Check GPU availability and memory
import torch

print("=" * 60)
print("GPU CHECK")
print("=" * 60)

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"Memory: {gpu_memory:.1f} GB")
    
    if gpu_memory < 15:
        print("\nWARNING: Less than 15GB VRAM. Training may be tight.")
        print("Consider using T4 (16GB) or better.")
    else:
        print("\nGood! Sufficient memory for GRPO on 1.5B model.")
else:
    print("ERROR: No GPU found!")
    print("Go to Runtime -> Change runtime type -> GPU")
    raise RuntimeError("GPU required for this notebook")

## Step 2: Install Dependencies

Install the required packages. This takes 2-3 minutes.

In [None]:
%%time
# Install dependencies
!pip install -q torch transformers accelerate peft bitsandbytes datasets trl
print("\nDependencies installed!")
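
Because the install above is unpinned, results can differ between Colab sessions. As a small optional sketch (the helper name `installed_versions` is hypothetical), the standard library can report which versions were actually installed so they can be recorded next to the results:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string}, or 'not installed' when a package is missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = "not installed"
    return result


for name, ver in installed_versions(["torch", "transformers", "accelerate"]).items():
    print(f"{name:14} {ver}")
```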

## Step 3: Configuration

Set up the experiment parameters.

In [None]:
# =============================================================
# CONFIGURATION
# =============================================================

CONFIG = {
    # Model
    "model_name": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
    
    # Problem to fix
    "target_problem": "trapping_rain_water",
    
    # GRPO settings
    "num_steps": 10,           # Training steps
    "num_generations": 4,      # Generations per prompt (G)
    "learning_rate": 5e-5,
    "beta": 0.04,              # KL penalty coefficient
    "max_seq_length": 768,     # Not read by the code below; generation uses max_new_tokens=512
    
    # Problem settings
    "difficulty": 5,           # 1-10 scale
    "num_test_cases": 5,
    
    # Seed
    "seed": 42,
}

print("=" * 60)
print("EXPERIMENT CONFIGURATION")
print("=" * 60)
for key, value in CONFIG.items():
    print(f"  {key:20} = {value}")

## Step 4: Define Problem Generator

Create the Trapping Rain Water problem generator.

In [None]:
import random
from dataclasses import dataclass
from typing import List, Any

@dataclass
class TestCase:
    """A single test case."""
    input_args: List[Any]
    expected_output: Any

@dataclass 
class AlgorithmicProblem:
    """A problem with test cases."""
    problem_type: str
    problem_id: str
    title: str
    description: str
    function_signature: str
    function_name: str
    test_cases: List[TestCase]
    difficulty: int = 5
    
    def to_prompt(self) -> str:
        """Convert to prompt for model."""
        examples = "\n".join(
            f"  {self.function_name}({repr(tc.input_args[0])}) -> {repr(tc.expected_output)}"
            for tc in self.test_cases[:3]
        )
        return f"""## {self.title}

{self.description}

### Function Signature
```python
{self.function_signature}
```

### Examples
```python
{examples}
```

Implement the function. Your solution must pass ALL test cases."""


class TrappingRainWaterGenerator:
    """
    Generates Trapping Rain Water problems.
    
    Given an elevation map, calculate how much water can be trapped.
    Classic two-pointer or DP problem.
    """
    
    def __init__(self, seed: int = None):
        self.rng = random.Random(seed)
        self._counter = 0
    
    @property
    def problem_type(self) -> str:
        return "trapping_rain_water"
    
    @property
    def title(self) -> str:
        return "Trapping Rain Water"
    
    @property
    def description(self) -> str:
        return """Given n non-negative integers representing an elevation map where the width of each bar is 1, compute how much water it can trap after raining.

The elevation map is represented as a list of integers where each integer represents the height of a bar.

Example: For heights [0,1,0,2,1,0,1,3,2,1,2,1], the answer is 6.
The water fills the valleys between the bars."""
    
    @property
    def function_signature(self) -> str:
        return "def trap(height: list) -> int:"
    
    @property
    def function_name(self) -> str:
        return "trap"
    
    def _solve(self, height: List[int]) -> int:
        """Reference solution using two pointers."""
        if not height:
            return 0
        
        left, right = 0, len(height) - 1
        left_max = right_max = 0
        water = 0
        
        while left < right:
            if height[left] < height[right]:
                if height[left] >= left_max:
                    left_max = height[left]
                else:
                    water += left_max - height[left]
                left += 1
            else:
                if height[right] >= right_max:
                    right_max = height[right]
                else:
                    water += right_max - height[right]
                right -= 1
        
        return water
    
    def _generate_heights(self, difficulty: int) -> List[int]:
        """Generate random height array."""
        # Length based on difficulty
        length = 5 + difficulty * 2
        max_height = 3 + difficulty
        
        # Heights are sampled uniformly at random; no structure is enforced,
        # so some arrays may trap little or no water.
        return [self.rng.randint(0, max_height) for _ in range(length)]
    
    def generate(self, difficulty: int = 5, num_test_cases: int = 5) -> AlgorithmicProblem:
        """Generate a problem instance."""
        self._counter += 1
        
        test_cases = []
        for _ in range(num_test_cases):
            heights = self._generate_heights(difficulty)
            expected = self._solve(heights)
            test_cases.append(TestCase(
                input_args=[heights],
                expected_output=expected
            ))
        
        return AlgorithmicProblem(
            problem_type=self.problem_type,
            problem_id=f"{self.problem_type}_{self._counter}",
            title=self.title,
            description=self.description,
            function_signature=self.function_signature,
            function_name=self.function_name,
            test_cases=test_cases,
            difficulty=difficulty,
        )


# Test the generator
print("=" * 60)
print("PROBLEM GENERATOR TEST")
print("=" * 60)

gen = TrappingRainWaterGenerator(seed=42)
test_problem = gen.generate(difficulty=5, num_test_cases=5)

print(f"\nGenerated problem: {test_problem.problem_id}")
print(f"Test cases: {len(test_problem.test_cases)}")
print("\nSample test cases:")
for i, tc in enumerate(test_problem.test_cases[:3], 1):
    print(f"  {i}. trap({tc.input_args[0][:8]}...) -> {tc.expected_output}")
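
Every reward in this notebook depends on the generator's reference solver being correct, so it is worth cross-checking it against an independent implementation. The sketch below is standalone: `trap_two_pointers` restates the two-pointer logic, and `trap_prefix_max` is a separate prefix/suffix-maximum version. Both function names are chosen for this example only.

```python
import random
from typing import List


def trap_two_pointers(height: List[int]) -> int:
    """Standalone restatement of the two-pointer reference logic."""
    left, right = 0, len(height) - 1
    left_max = right_max = 0
    water = 0
    while left < right:
        if height[left] < height[right]:
            left_max = max(left_max, height[left])
            water += left_max - height[left]
            left += 1
        else:
            right_max = max(right_max, height[right])
            water += right_max - height[right]
            right -= 1
    return water


def trap_prefix_max(height: List[int]) -> int:
    """Independent check: water above bar i is min(max on left, max on right) - height[i]."""
    n = len(height)
    if n == 0:
        return 0
    left_max = [0] * n
    right_max = [0] * n
    left_max[0] = height[0]
    for i in range(1, n):
        left_max[i] = max(left_max[i - 1], height[i])
    right_max[-1] = height[-1]
    for i in range(n - 2, -1, -1):
        right_max[i] = max(right_max[i + 1], height[i])
    return sum(min(left_max[i], right_max[i]) - height[i] for i in range(n))


# Compare the two implementations on random inputs, including empty lists
rng = random.Random(0)
for _ in range(1000):
    heights = [rng.randint(0, 8) for _ in range(rng.randint(0, 15))]
    assert trap_two_pointers(heights) == trap_prefix_max(heights), heights

print(trap_prefix_max([0, 1, 0, 2, 1, 0, 1, 3, 2, 1, 2, 1]))  # 6, as in the problem description
```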

## Step 5: Load the Model

Load Qwen 1.5B with float16 for memory efficiency.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

print("=" * 60)
print("LOADING MODEL")
print("=" * 60)
print(f"\nModel: {CONFIG['model_name']}")
print("Loading... (this takes 1-2 minutes)")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_name"])
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
print("Tokenizer loaded!")

# Load model in float16
model = AutoModelForCausalLM.from_pretrained(
    CONFIG["model_name"],
    torch_dtype=torch.float16,
    device_map="auto",
)
print(f"Model loaded on: {model.device}")

# Model info
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e9:.2f}B")

# Check memory usage
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"\nGPU Memory:")
    print(f"  Allocated: {allocated:.2f} GB")
    print(f"  Reserved: {reserved:.2f} GB")

## Step 6: Baseline Evaluation

Test the model on Trapping Rain Water before training.

In [None]:
import re

def extract_code(response: str) -> str:
    """Extract Python code from model response."""
    patterns = [r"```python\n(.*?)```", r"```\n(.*?)```"]
    for pattern in patterns:
        matches = re.findall(pattern, response, re.DOTALL)
        if matches:
            return matches[-1].strip()
    
    # Fallback: look for function definition
    if "def " in response:
        lines = response.split("\n")
        code_lines = []
        in_function = False
        for line in lines:
            if line.strip().startswith("def "):
                in_function = True
            if in_function:
                code_lines.append(line)
        if code_lines:
            return "\n".join(code_lines).strip()
    
    return response


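def verify_solution(code: str, problem: AlgorithmicProblem) -> tuple:
    """Verify solution against all test cases.

    Returns (all_passed, fraction_passed, error), where error is a message if
    the code fails to load or defines no function, and None otherwise.
    Note: exec() runs the generated code in this process with no timeout.
    """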
    func_name = problem.function_name
    namespace = {}
    
    try:
        exec(code, namespace)
    except Exception as e:
        return False, 0.0, str(e)
    
    if func_name not in namespace:
        # Try to find any function
        funcs = [k for k, v in namespace.items() if callable(v) and not k.startswith("_")]
        if funcs:
            func_name = funcs[0]
        else:
            return False, 0.0, "No function found"
    
    func = namespace[func_name]
    passed = 0
    total = len(problem.test_cases)
    
    for tc in problem.test_cases:
        try:
            result = func(*tc.input_args)
            if result == tc.expected_output:
                passed += 1
        except Exception:
            pass
    
    success = passed == total
    partial = passed / total if total > 0 else 0.0
    return success, partial, None


def generate_solution(model, tokenizer, problem, temperature=0.2):
    """Generate a solution for a problem."""
    messages = [
        {"role": "system", "content": "You are an expert Python programmer. Write ONLY the function implementation."},
        {"role": "user", "content": problem.to_prompt()},
    ]
    
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
        )
    
    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return response


# Baseline evaluation
print("=" * 60)
print("BASELINE EVALUATION")
print("=" * 60)

generator = TrappingRainWaterGenerator(seed=CONFIG["seed"])
num_eval = 5
passed_count = 0

print(f"\nTesting {num_eval} problems...")
for i in range(num_eval):
    problem = generator.generate(difficulty=CONFIG["difficulty"], num_test_cases=CONFIG["num_test_cases"])
    response = generate_solution(model, tokenizer, problem)
    code = extract_code(response)
    success, partial, error = verify_solution(code, problem)
    
    status = "PASS" if success else "FAIL"
    print(f"  [{i+1}] {status} ({partial*100:.0f}%)")
    
    if success:
        passed_count += 1

baseline_accuracy = passed_count / num_eval
print(f"\nBaseline Accuracy: {passed_count}/{num_eval} ({baseline_accuracy*100:.0f}%)")
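
Since `verify_solution` executes model output inside the notebook process, a generated infinite loop would hang the whole run. One possible mitigation, sketched here under the assumption that test inputs and outputs are JSON-serializable, is to run each candidate in a child Python process with a timeout. The names `run_candidate` and `_DRIVER` are hypothetical, and a separate process guards against hangs and crashes but is not a security boundary.

```python
import json
import subprocess
import sys

# Driver executed in the child process: load the candidate, run each test, print results as JSON
_DRIVER = """
import json, sys
payload = json.loads(sys.stdin.read())
namespace = {}
exec(payload["code"], namespace)
func = namespace[payload["function_name"]]
results = []
for args, expected in payload["tests"]:
    try:
        results.append(func(*args) == expected)
    except Exception:
        results.append(False)
print(json.dumps(results))
"""


def run_candidate(code, function_name, tests, timeout_s=5.0):
    """Return the fraction of tests passed; 0.0 on crash, timeout, or missing function."""
    payload = json.dumps({"code": code, "function_name": function_name, "tests": tests})
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _DRIVER],
            input=payload, capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    if proc.returncode != 0 or not tests:
        return 0.0
    try:
        # The driver prints its JSON last, after anything the candidate printed
        results = json.loads(proc.stdout.strip().splitlines()[-1])
    except (ValueError, IndexError):
        return 0.0
    return sum(results) / len(tests)


tests = [([[0, 1, 0, 2, 1, 0, 1, 3, 2, 1, 2, 1]], 6), ([[1, 2, 3]], 0)]
always_zero = "def trap(height):\n    return 0\n"
print(run_candidate(always_zero, "trap", tests))  # passes only the second test
```

The returned fraction plays the same role as the `partial` value from `verify_solution`, so it could feed the same reward logic.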

## Step 7: GRPO Training

Now we implement and run GRPO to improve the model.

In [None]:
import copy
from torch.optim import AdamW

print("=" * 60)
print("GRPO TRAINING SETUP")
print("=" * 60)

# Create reference model (frozen copy)
print("\nCreating reference model...")
ref_model = copy.deepcopy(model)
ref_model.eval()
for param in ref_model.parameters():
    param.requires_grad = False
print("Reference model created!")

# Check memory after creating reference
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    print(f"\nGPU Memory after ref model: {allocated:.2f} GB")

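# Optimizer
# Note: the parameters are float16, and AdamW's default eps=1e-8 is below
# float16's smallest positive value, so it rounds to zero. If the loss turns
# into NaN, a larger eps (for example 1e-4) is one thing to try.
optimizer = AdamW(model.parameters(), lr=CONFIG["learning_rate"])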
print(f"Optimizer: AdamW (lr={CONFIG['learning_rate']})")
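
The reference model only enters training through the KL term scaled by `beta`. As a hedged illustration of how that term can be estimated from log-probabilities, the sketch below compares two per-sample estimators: the plain log-ratio, and the non-negative form `exp(r) - r - 1` that appears in the GRPO formulation. The function names and the example numbers are made up for this comparison.

```python
import math


def kl_naive(policy_logp: float, ref_logp: float) -> float:
    """Single-sample log-ratio; it can be negative for an individual sample."""
    return policy_logp - ref_logp


def kl_nonnegative(policy_logp: float, ref_logp: float) -> float:
    """exp(r) - r - 1 with r = ref_logp - policy_logp; zero only when the two agree."""
    r = ref_logp - policy_logp
    return math.exp(r) - r - 1.0


# Illustrative log-probabilities: equal, policy lower than ref, policy higher than ref
for policy_logp, ref_logp in [(-1.0, -1.0), (-1.2, -1.0), (-0.8, -1.0)]:
    print(
        f"policy={policy_logp:+.1f} ref={ref_logp:+.1f} "
        f"naive={kl_naive(policy_logp, ref_logp):+.4f} "
        f"nonnegative={kl_nonnegative(policy_logp, ref_logp):.4f}"
    )
```

The non-negative form penalizes drift in either direction, whereas the plain difference changes sign depending on which model assigns the sample more probability.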

In [None]:
def create_reward_function(problem):
    """Create reward function for a problem."""
    def reward_fn(completions: list) -> torch.Tensor:
        rewards = []
        for completion in completions:
            code = extract_code(completion)
            success, partial, _ = verify_solution(code, problem)
            if success:
                reward = 1.0
            else:
                reward = partial * 0.5  # Partial credit
            rewards.append(reward)
        return torch.tensor(rewards, dtype=torch.float32)
    return reward_fn


def compute_log_probs(model, input_ids, attention_mask, labels):
    """Compute log probabilities for a sequence."""
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    
    # Shift for next token prediction
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    
    # Compute log probs
    log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)
    token_log_probs = torch.gather(log_probs, dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)
    
    # Mask padding
    mask = (shift_labels != tokenizer.pad_token_id).float()
    sequence_log_prob = (token_log_probs * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    
    return sequence_log_prob


def grpo_step(model, ref_model, optimizer, prompt, problem, num_generations=4):
    """
    Single GRPO training step.
    
    1. Generate G completions
    2. Compute rewards for each
    3. Compute group-relative advantages
    4. Update policy to maximize advantage-weighted log probs
    """
    model.train()
    
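    # Build prompt (same system message as generate_solution, so training matches evaluation)
    messages = [
        {"role": "system", "content": "You are an expert Python programmer. Write ONLY the function implementation."},
        {"role": "user", "content": prompt},
    ]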
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    prompt_length = inputs.input_ids.shape[1]
    
    # Generate completions
    completions = []
    completion_ids = []
    
    model.eval()
    with torch.no_grad():
        for _ in range(num_generations):
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                pad_token_id=tokenizer.pad_token_id,
            )
            completion = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
            completions.append(completion)
            completion_ids.append(outputs[0])
    
    model.train()
    
    # Compute rewards
    reward_fn = create_reward_function(problem)
    rewards = reward_fn(completions)
    
    # Group-relative advantages
    mean_reward = rewards.mean()
    std_reward = rewards.std() + 1e-8
    advantages = (rewards - mean_reward) / std_reward
    
    # Compute loss
    total_loss = 0.0
    
    for i, (completion_id, advantage) in enumerate(zip(completion_ids, advantages)):
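        # Prepare inputs
        input_ids = completion_id.unsqueeze(0)
        attention_mask = torch.ones_like(input_ids)
        
        # Score only the generated tokens: prompt positions are set to the pad id,
        # which compute_log_probs already masks out
        labels = input_ids.clone()
        labels[:, :prompt_length] = tokenizer.pad_token_id
        
        # Policy log prob
        policy_log_prob = compute_log_probs(model, input_ids, attention_mask, labels)
        
        # Reference log prob (for KL penalty)
        with torch.no_grad():
            ref_log_prob = compute_log_probs(ref_model, input_ids, attention_mask, labels)
        
        # KL penalty: non-negative estimator exp(r) - r - 1, with r = ref - policy log prob.
        # The plain difference (policy - ref) can be negative and only adds a constant
        # pull on the log prob, because the reference term carries no gradient.
        log_ratio = ref_log_prob - policy_log_prob
        kl_penalty = torch.exp(log_ratio) - log_ratio - 1
        
        # Loss: -advantage * log_prob + beta * KL
        loss = -advantage.to(model.device) * policy_log_prob + CONFIG["beta"] * kl_penalty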
        total_loss += loss
    
    total_loss = total_loss / num_generations
    
    # Backward pass
    optimizer.zero_grad()
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    
    return {
        "loss": total_loss.item(),
        "mean_reward": mean_reward.item(),
        "max_reward": rewards.max().item(),
        "completions": completions,
        "rewards": rewards.tolist(),
    }

print("GRPO functions defined!")
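
The group-relative advantage in `grpo_step` is easiest to see on concrete numbers. The sketch below redoes the same normalization in plain Python, using the sample standard deviation to match the default of `rewards.std()`. The helper name `group_relative_advantages` and the reward values are illustrative only.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (sample std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mean) / (std + eps) for r in rewards]


# One full pass, two half-credit attempts, one complete failure
print(group_relative_advantages([1.0, 0.5, 0.0, 0.5]))

# All completions scored the same: every advantage is zero
print(group_relative_advantages([0.5, 0.5, 0.5, 0.5]))
```

The second case matters in practice: when all G completions earn the same reward, the advantages vanish and the step provides no policy-gradient signal, leaving only the KL term.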

In [None]:
# Run GRPO training
print("=" * 60)
print("GRPO TRAINING")
print("=" * 60)
print(f"\nSteps: {CONFIG['num_steps']}")
print(f"Generations per step: {CONFIG['num_generations']}")
print()

generator = TrappingRainWaterGenerator(seed=CONFIG["seed"] + 100)  # Different seed from eval
training_metrics = []

for step in range(CONFIG["num_steps"]):
    # Generate new problem
    problem = generator.generate(
        difficulty=CONFIG["difficulty"],
        num_test_cases=CONFIG["num_test_cases"]
    )
    prompt = problem.to_prompt()
    
    # GRPO step
    metrics = grpo_step(
        model=model,
        ref_model=ref_model,
        optimizer=optimizer,
        prompt=prompt,
        problem=problem,
        num_generations=CONFIG["num_generations"]
    )
    
    training_metrics.append(metrics)
    
    print(f"Step {step+1:2d}/{CONFIG['num_steps']}: "
          f"Loss={metrics['loss']:.4f}, "
          f"Avg Reward={metrics['mean_reward']:.3f}, "
          f"Max Reward={metrics['max_reward']:.3f}")
    
    # Clear cache periodically
    if (step + 1) % 5 == 0:
        torch.cuda.empty_cache()

print("\nTraining complete!")
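
Each step above uses a different problem and only a handful of samples, so the per-step average reward is noisy. A trailing moving average is one simple way to read the trend. The sketch below uses hypothetical reward values standing in for `[m["mean_reward"] for m in training_metrics]`, and `running_mean` is a name chosen for this example.

```python
def running_mean(values, window=3):
    """Trailing moving average; early entries average over the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical per-step mean rewards, not results from an actual run
example_rewards = [0.25, 0.50, 0.25, 0.75, 0.50, 1.00]
for step, value in enumerate(running_mean(example_rewards), 1):
    print(f"step {step}: smoothed reward {value:.3f}")
```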

## Step 8: Post-Training Evaluation

Evaluate the model after GRPO training.

In [None]:
# Delete reference model to free memory
del ref_model
torch.cuda.empty_cache()

print("=" * 60)
print("POST-TRAINING EVALUATION")
print("=" * 60)

# Reuse the baseline seed so before/after accuracy is measured on the same problems
# (training drew its problems from seed + 100)
eval_generator = TrappingRainWaterGenerator(seed=CONFIG["seed"])
num_eval = 5
passed_count = 0

print(f"\nTesting {num_eval} problems...")
model.eval()

for i in range(num_eval):
    problem = eval_generator.generate(
        difficulty=CONFIG["difficulty"],
        num_test_cases=CONFIG["num_test_cases"]
    )
    response = generate_solution(model, tokenizer, problem, temperature=0.2)
    code = extract_code(response)
    success, partial, error = verify_solution(code, problem)
    
    status = "PASS" if success else "FAIL"
    print(f"  [{i+1}] {status} ({partial*100:.0f}%)")
    
    if success:
        passed_count += 1

final_accuracy = passed_count / num_eval
print(f"\nFinal Accuracy: {passed_count}/{num_eval} ({final_accuracy*100:.0f}%)")

## Step 9: Results Summary

In [None]:
print("=" * 60)
print("RESULTS SUMMARY")
print("=" * 60)

print(f"\nModel: {CONFIG['model_name']}")
print(f"Problem: {CONFIG['target_problem']}")
print(f"Training Steps: {CONFIG['num_steps']}")

print(f"\n{'Metric':<20} {'Before':>12} {'After':>12} {'Change':>12}")
print("-" * 58)
change = (final_accuracy - baseline_accuracy) * 100
print(f"{'Accuracy':<20} {baseline_accuracy*100:>11.0f}% {final_accuracy*100:>11.0f}% {change:>+11.0f}%")

# Training curve
print(f"\nTraining Progress:")
print(f"{'Step':>4} {'Avg Reward':>12} {'Max Reward':>12}")
print("-" * 30)
for i, m in enumerate(training_metrics):
    print(f"{i+1:>4} {m['mean_reward']:>12.3f} {m['max_reward']:>12.3f}")

if final_accuracy > baseline_accuracy:
    print(f"\n{'='*60}")
    print("SUCCESS! GRPO improved performance on Trapping Rain Water!")
    print(f"{'='*60}")
elif final_accuracy == baseline_accuracy:
    print(f"\n{'='*60}")
    print("No change. May need more training steps or different hyperparameters.")
    print(f"{'='*60}")
else:
    print(f"\n{'='*60}")
    print("Performance decreased. Consider reducing learning rate.")
    print(f"{'='*60}")
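
With only 5 evaluation problems, a change of one problem moves accuracy by 20 points, so the before/after comparison should be read with care. As a rough guide, the sketch below computes an approximate 95% Wilson score interval for a pass rate. The helper name `wilson_interval` is an assumption of this example, and the interval treats problems as independent trials.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a pass rate (z=1.96)."""
    if total == 0:
        return (0.0, 1.0)
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


for passed in (3, 4, 5):
    lo, hi = wilson_interval(passed, 5)
    print(f"{passed}/5 -> [{lo:.2f}, {hi:.2f}]")
```

The intervals for 4/5 and 5/5 overlap heavily, so a larger `num_eval` is needed before a single-problem improvement can be distinguished from sampling noise.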

## Step 10: Save Model (Optional)

Save the trained model to Google Drive.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Save model
save_path = "/content/drive/MyDrive/axiom-rl/models/qwen-1.5b-grpo-trapping"
print(f"Saving model to: {save_path}")

model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print("Model saved!")

---

## Next Steps

1. **If successful**: Download the model and test locally on all hard problems
2. **If unsuccessful**: Try:
   - More training steps (20-50)
   - Lower learning rate (1e-5)
   - More generations per step (8)
   - Teacher distillation instead of GRPO

## Memory Tips

If you get OOM errors:
1. Reduce `num_generations` to 2
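2. Reduce `max_new_tokens` in `grpo_step` below 512 (the `max_seq_length` config value is not read by the training code)
3. Enable gradient checkpointing with `model.gradient_checkpointing_enable()`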
4. Upgrade to A100 runtime