# GRPO-RLVR Training with Qwen2.5-0.5B on SageMaker

This notebook demonstrates Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), fine-tuning Qwen2.5-0.5B on mathematical reasoning tasks.

## Key Features:
- GSM8K dataset with few-shot CoT prompting
- Verifiable rewards for mathematical reasoning
- Step-by-step solution verification
- Follows RLVR_finetuning.py pattern
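
To make "verifiable rewards" concrete, here is a minimal sketch of a binary correctness reward. It is not taken from `sama_rl` or `RLVR_finetuning.py`: the names `last_number` and `correctness_reward` are hypothetical, and the actual recipe may score outputs differently (for example, with partial credit for intermediate steps).

```python
import re
from typing import Optional


def last_number(text: str) -> Optional[float]:
    """Return the last number appearing in text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def correctness_reward(completion: str, gold_answer: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the final number matches, else 0.0."""
    predicted = last_number(completion)
    gold = last_number(gold_answer)
    if predicted is None or gold is None:
        return 0.0
    return 1.0 if abs(predicted - gold) < 1e-6 else 0.0


print(correctness_reward("16 - 3 - 4 = 9 eggs, and 9 * 2 = 18 dollars.", "18"))
print(correctness_reward("She makes 20 dollars.", "18"))
```

The reward is "verifiable" in the sense that it is computed by a deterministic check against a known answer rather than by a learned reward model.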

In [None]:
# Import from sama_rl
import sys
sys.path.append('./sama_rl')
from sama_rl import GRPO_RLVR, create_inference_model

## Configuration

Configure GRPO-RLVR for mathematical reasoning:

In [None]:
# Create GRPO-RLVR trainer
rlvr_trainer = GRPO_RLVR(
    yaml_file="./sama_rl/recipes/GRPO_RLVR/Qwen2.5-0.5B.yaml",
    instance_type="ml.g6.48xlarge",  # 8x NVIDIA L4 GPUs, 24 GB each
    hf_token="",  # Hugging Face token, required for model access
    max_steps=100,
    wandb_api_key=""  # Weights & Biases API key for experiment logging
)

print("GRPO-RLVR trainer configured:")
print(f"Model: {rlvr_trainer.config.model['name']}")
print(f"Dataset: {rlvr_trainer.config.data['dataset_name']}")
#print(f"Num shots: {rlvr_trainer.config.data['num_shots']}")
#print(f"Instance: {rlvr_trainer.config.sagemaker['instance_type']}")

## Dataset Preparation

Prepare GSM8K dataset with few-shot CoT prompting:

In [None]:
# Prepare GSM8K dataset
dataset = rlvr_trainer.prepare_dataset(
    dataset_name="gsm8k",
    num_shots=8,
    test_size=0.1
)

print("Dataset prepared:")
print(f"Train samples: {len(rlvr_trainer.train_dataset)}")
print(f"Val samples: {len(rlvr_trainer.val_dataset)}")
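
The few-shot CoT prompting step can be pictured as follows. This is only a sketch: `build_few_shot_prompt` and the exemplars are hypothetical, and this notebook does not show how `prepare_dataset` actually formats its prompts. The "Question: ... Let me think step by step." template is borrowed from the test cells later in this notebook.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    exemplars: List[Tuple[str, str]], question: str, num_shots: int = 8
) -> str:
    """Prepend up to num_shots worked (question, solution) pairs to a new question."""
    parts = []
    for q, solution in exemplars[:num_shots]:
        parts.append(f"Question: {q}\nLet me think step by step.\n{solution}\n")
    # The final question is left open for the model to complete.
    parts.append(f"Question: {question}\nLet me think step by step.\n")
    return "\n".join(parts)


# Illustrative exemplars written for this sketch, not drawn from GSM8K.
exemplars = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?",
     "He starts with 3 and adds 2, so 3 + 2 = 5. The answer is 5."),
    ("A pack holds 4 pens. How many pens are in 3 packs?",
     "Each pack has 4 pens, so 3 * 4 = 12. The answer is 12."),
]

prompt = build_few_shot_prompt(
    exemplars, "A book costs $7. How much do 6 books cost?", num_shots=2
)
print(prompt)
```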

In [None]:
# Upload dataset to S3
train_s3_path, val_s3_path = rlvr_trainer.upload_dataset_to_s3()

print("Dataset uploaded to S3:")
print(f"Train: {train_s3_path}")
print(f"Val: {val_s3_path}")

## Training

Start GRPO-RLVR training with verifiable rewards:

In [None]:
# Start training
rlvr_trainer.train()

print(f"Training job: {rlvr_trainer.training_job_name}")
print("GRPO-RLVR training started with verifiable rewards!")
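
The notebook does not say whether `train()` blocks until the job finishes. If it returns immediately, the job can be polled through the SageMaker API. The sketch below assumes a boto3 SageMaker client is passed in; `wait_for_training_job` is a hypothetical helper, not part of `sama_rl`, while `describe_training_job` and the `TrainingJobStatus` field are part of the boto3 SageMaker client.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, max_polls=240):
    """Poll a SageMaker training job until it reaches a terminal status."""
    status = None
    for _ in range(max_polls):
        desc = sm_client.describe_training_job(TrainingJobName=job_name)
        status = desc["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return status


# Usage with a real client (requires AWS credentials):
#   import boto3
#   status = wait_for_training_job(
#       boto3.client("sagemaker"), rlvr_trainer.training_job_name
#   )
```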

## Monitor Training

The training will:
- Use GSM8K dataset with 8-shot CoT prompting
- Apply verifiable rewards for mathematical reasoning
- Reward step-by-step solutions and correct operations
- Optimize for mathematical accuracy
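
GRPO samples several completions per prompt and scores each one relative to the others in its group, which removes the need for a separate value model. The sketch below shows only that normalization step. It is not the `sama_rl` training loop, the function name is hypothetical, and implementations differ on details such as population versus sample standard deviation (population is used here).

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt; two earned the correctness reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every completion in a group receives the same reward, all advantages are zero, so that prompt contributes no learning signal for the step.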

Expected training time: ~1 hour on ml.g5.2xlarge. The configuration above requests ml.g6.48xlarge, so the actual duration may differ.

In [None]:
# Get model artifacts after training
model_uri = rlvr_trainer.get_model_artifacts()
print(f"Trained model artifacts: {model_uri}")

## Deployment

Deploy the trained model for mathematical reasoning:

In [None]:
# Deploy using sama_rl inference
inference_model = create_inference_model(
    model_uri=model_uri,
    instance_type="ml.g4dn.xlarge"
)

print(f"Model deployed: {inference_model}")

## Test Mathematical Reasoning

Test the trained model on GSM8K-style problems:

In [None]:
# Test mathematical reasoning problems
test_problems = [
    "Janet's ducks lay 16 eggs per day. She eats 3 for breakfast every morning and bakes muffins for her friends every day with 4. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?",
    "Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?"
]

for i, problem in enumerate(test_problems, 1):
    prompt = f"Question: {problem}\nLet me think step by step.\n"
    response = inference_model.predict(prompt, max_new_tokens=200)
    
    print(f"\n=== Problem {i} ===")
    print(f"Question: {problem}")
    print(f"Solution: {response}")
    print("=" * 50)

## Evaluation

Evaluate the model's mathematical reasoning accuracy:

In [None]:
import re

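def extract_answer(text):
    """Extract the last number in the model output as a string, or None."""
    # Allow an optional decimal part so "18.50" is not split into "18" and "50".
    numbers = re.findall(r'-?\d+(?:\.\d+)?', text.replace(',', ''))
    return numbers[-1] if numbers else None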

def evaluate_gsm8k_sample(model, problems_and_answers):
    """Evaluate model on GSM8K problems"""
    correct = 0
    total = len(problems_and_answers)
    
    for problem, expected_answer in problems_and_answers:
        prompt = f"Question: {problem}\nLet me think step by step.\n"
        response = model.predict(prompt, max_new_tokens=200)
        predicted_answer = extract_answer(response)
        
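        # Compare numerically so "18" and "18.0" count as the same answer.
        if predicted_answer is not None and float(predicted_answer) == float(expected_answer):
            correct += 1
            
    # Guard against an empty problem list.
    accuracy = correct / total if total else 0.0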
    print(f"Accuracy: {correct}/{total} = {accuracy:.2%}")
    return accuracy

# Sample problems with answers
sample_problems = [
    ("Janet's ducks lay 16 eggs per day. She eats 3 for breakfast every morning and bakes muffins for her friends every day with 4. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?", "18"),
    ("A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?", "3")
]

# Run evaluation
accuracy = evaluate_gsm8k_sample(inference_model, sample_problems)
print(f"\nModel shows {accuracy:.1%} accuracy on sample GSM8K problems")
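
To evaluate on more than two hand-typed problems, the expected answers can be pulled from GSM8K reference solutions, which end with a line of the form `#### <answer>`. The helper below is a hypothetical sketch, not part of `sama_rl`; the `reference` string is written in the GSM8K reference style for the first sample problem.

```python
import re
from typing import Optional


def gold_answer_from_gsm8k(solution: str) -> Optional[str]:
    """Return the final answer that follows '####' in a GSM8K reference solution."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", solution)
    # Strip thousands separators so the result matches extract_answer's format.
    return match.group(1).replace(",", "") if match else None


reference = (
    "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\n"
    "She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer's market.\n"
    "#### 18"
)
print(gold_answer_from_gsm8k(reference))
```

The returned string can be used as the second element of each `(problem, expected_answer)` pair passed to `evaluate_gsm8k_sample`.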