# SAMA RL: GRPO Training Guide

**A complete guide to training language models with Group Relative Policy Optimization (GRPO)**

## What You'll Learn
- How to create reward functions that guide model behavior
- How to train models with GRPO on SageMaker
- How to deploy and run inference on trained models
- How to monitor and optimize training jobs

## Overview
GRPO trains a model through a reinforcement learning loop:
- **Model generates responses** → The model produces several text outputs for each prompt
- **Reward function scores responses** → A function assigns each output a numeric score
- **Model learns from rewards** → Training shifts the model toward higher-scoring outputs
- **GRPO** → The training algorithm that drives these updates
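
The "group relative" part of GRPO can be sketched in a few lines: rewards for completions sampled from the same prompt are compared against each other instead of against an absolute baseline. The snippet below is a minimal illustration, not SAMA RL's internals. The function name `group_relative_advantages` is hypothetical, and it uses the population standard deviation, whereas implementations differ on that detail.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four completions sampled from the same prompt
rewards = [-2.5, 0.0, -10.0, -0.9]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:7.2f}  advantage={a:+.2f}")
```

Completions that beat their group's average get a positive advantage and are reinforced; those below average are discouraged. Only the relative ordering within the group matters, so a reward function does not need a calibrated absolute scale.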

## Table of Contents
1. [Setup & Imports](#setup)
2. [Understanding Reward Functions](#rewards)
3. [GRPO Training](#training)
4. [Model Deployment](#deployment)
5. [Inference](#inference)
6. [Advanced Usage](#advanced)

## 1. Setup & Imports

In [None]:
# Install SAMA RL
# pip install -e .

# Core SAMA RL imports
from sama_rl import GRPO, create_inference_model

# Standard libraries
from typing import List
import time

print("SAMA RL imported successfully")
print("Ready to start training")

## 2. Understanding Reward Functions

**Goal**: Teach the model what "good" responses look like

### Anatomy of a Reward Function
```python
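def my_reward_function(completions: List[str], **kwargs) -> List[float]:
    # completions: list of model responses to score
    # kwargs: extra info passed by the trainer (tokenizer, etc.)
    # returns: one reward score per completion, in the same order
    return [0.0 for _ in completions]  # placeholder: replace with real scoring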
```

### Length-Based Reward Function

**Goal**: Train model to write responses of a specific length  
**Use Cases**: Summaries, tweets, product descriptions

In [None]:
def create_length_reward(target_length: int = 400):
    """
    Creates reward function targeting specific length
    
    How it works:
    - Counts tokens in response
    - Gives higher reward for responses closer to target
    - Uses quadratic penalty for distance from target
    """
    def length_reward(completions: List[str], **kwargs) -> List[float]:
        tokenizer = kwargs.get('tokenizer')
        rewards = []
        
        for completion in completions:
            # Count tokens (or words as fallback)
            if tokenizer:
                num_tokens = len(tokenizer.encode(completion, add_special_tokens=False))
            else:
                num_tokens = len(completion.split())
            
            # Reward: closer to target = higher score
            distance = abs(num_tokens - target_length)
            reward = -(distance ** 2) / 1000
            rewards.append(reward)
        
        return rewards
    
    return length_reward

# Create length reward targeting 400 tokens
length_400_reward = create_length_reward(target_length=400)

print("Length reward function created")
print("Target: 400 tokens")
print("Penalty increases quadratically with distance from target")
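
To get a feel for the quadratic penalty, it helps to evaluate the formula on a few token counts. The sketch below applies the same expression used in `length_reward` to a single count; the helper name `quadratic_length_reward` is introduced here only for illustration.

```python
def quadratic_length_reward(num_tokens: int, target_length: int = 400) -> float:
    """Same formula as length_reward, applied to a single token count."""
    distance = abs(num_tokens - target_length)
    return -(distance ** 2) / 1000


# The reward is 0 at the target and falls off with the square of the distance
for n in (400, 390, 350, 200, 0):
    print(f"{n:3d} tokens -> reward {quadratic_length_reward(n):8.2f}")
```

A response that misses the target by 50 tokens scores -2.5, while one that misses by 200 tokens scores -40: four times the distance costs sixteen times the penalty. The best possible reward is 0, reached only at exactly 400 tokens.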

### Sentiment-Based Reward Function

**Goal**: Train model to write positive, helpful responses  
**Use Cases**: Customer service, educational content, friendly chatbots

In [None]:
def create_sentiment_reward(positive_weight: float = 1.0, negative_weight: float = -0.5):
    """
    Creates reward function based on sentiment
    
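    How it works:
    - Checks which positive words (good, great, helpful, etc.) appear in the text
    - Checks which negative words (bad, terrible, awful, etc.) appear in the text
    - Each listed word counts once per completion and is matched as a substring
    - Rewards positive sentiment, penalizes negative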
    """
    positive_words = ['good', 'great', 'excellent', 'amazing', 'helpful', 'useful', 'clear']
    negative_words = ['bad', 'terrible', 'awful', 'horrible', 'useless', 'wrong', 'confusing']
    
    def sentiment_reward(completions: List[str], **kwargs) -> List[float]:
        rewards = []
        
        for completion in completions:
            text_lower = completion.lower()
            
            # Count sentiment words
            positive_count = sum(1 for word in positive_words if word in text_lower)
            negative_count = sum(1 for word in negative_words if word in text_lower)
            
            # Calculate sentiment score
            reward = (positive_count * positive_weight) + (negative_count * negative_weight)
            rewards.append(reward)
        
        return rewards
    
    return sentiment_reward

# Create sentiment reward favoring positive language
positive_sentiment_reward = create_sentiment_reward(positive_weight=1.0, negative_weight=-0.5)

print("Sentiment reward function created")
print("Rewards: good, great, helpful, excellent")
print("Penalizes: bad, terrible, awful, horrible")
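
Because `sentiment_reward` tests each listed word with `in`, it matches substrings ("unclear" counts as the positive word "clear") and counts a repeated word only once. If that is not what you want, one possible variant tokenizes the text first and counts whole words. This is a sketch under assumed requirements; `whole_word_sentiment_reward` is a hypothetical name and the weights are fixed at the defaults used above.

```python
import re
from typing import List

POSITIVE = {"good", "great", "excellent", "amazing", "helpful", "useful", "clear"}
NEGATIVE = {"bad", "terrible", "awful", "horrible", "useless", "wrong", "confusing"}


def whole_word_sentiment_reward(completions: List[str], **kwargs) -> List[float]:
    """Score every whole-word occurrence of a sentiment word."""
    rewards = []
    for completion in completions:
        # Split into lowercase word tokens so "unclear" does not match "clear"
        words = re.findall(r"[a-z']+", completion.lower())
        positive_count = sum(1 for w in words if w in POSITIVE)
        negative_count = sum(1 for w in words if w in NEGATIVE)
        rewards.append(positive_count * 1.0 + negative_count * -0.5)
    return rewards


print(whole_word_sentiment_reward(["This explanation is unclear."]))
print(whole_word_sentiment_reward(["Great, great, great!"]))
```

Note that counting every occurrence makes the reward easier to exploit by repeating a positive word, so counting each word once, as the original does, can be the safer choice during training.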

### Test Your Reward Functions

Let's see how our reward functions work on sample text:

In [None]:
# Test completions with different characteristics
test_completions = [
    "Short response.",  # Short
    "This is a much longer response that contains detailed information and explanations that should score higher on length-based rewards.",  # Long
    "This response is great and excellent, providing wonderful insights that are very helpful.",  # Positive
    "This is a bad and terrible response that is awful and provides useless information.",  # Negative
    "This is a medium-length response with neutral tone and factual content."  # Neutral, medium
]

print("Testing reward functions on sample completions:")
print(f"Number of test completions: {len(test_completions)}")

# Test length rewards (no tokenizer is passed, so length falls back to word count)
length_rewards = length_400_reward(test_completions)
print("\nLength Rewards (target: 400; counted in words, no tokenizer passed):")
for i, (completion, reward) in enumerate(zip(test_completions, length_rewards), 1):
    word_count = len(completion.split())
    print(f"{i}. Reward: {reward:6.2f} | Words: {word_count:3d} | {completion[:50]}...")

# Test sentiment rewards
sentiment_rewards = positive_sentiment_reward(test_completions)
print("\nSentiment Rewards:")
for i, (completion, reward) in enumerate(zip(test_completions, sentiment_rewards), 1):
    sentiment = "Positive" if reward > 0 else "Negative" if reward < 0 else "Neutral"
    print(f"{i}. Reward: {reward:6.2f} | {sentiment} | {completion[:50]}...")
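
Before launching a paid training job, it can be worth checking that a reward function honors the contract described earlier: one finite number per completion, with and without a tokenizer. The sketch below is one way to do that locally. `WhitespaceTokenizer`, `check_reward_function`, and `token_count_reward` are hypothetical helpers, and the stub only mimics the `encode(text, add_special_tokens=False)` call that `length_reward` relies on.

```python
import math
from typing import Callable, List


class WhitespaceTokenizer:
    """Hypothetical stand-in exposing the encode() call that length_reward relies on."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        # One fake token id per whitespace-separated word
        return list(range(len(text.split())))


def check_reward_function(reward_fn: Callable[..., List[float]], samples: List[str], **kwargs) -> List[float]:
    """Raise ValueError unless reward_fn returns one finite number per sample."""
    rewards = reward_fn(samples, **kwargs)
    if len(rewards) != len(samples):
        raise ValueError(f"expected {len(samples)} rewards, got {len(rewards)}")
    for reward in rewards:
        if isinstance(reward, bool) or not isinstance(reward, (int, float)) or not math.isfinite(reward):
            raise ValueError(f"reward must be a finite number, got {reward!r}")
    return rewards


def token_count_reward(completions: List[str], **kwargs) -> List[float]:
    """Toy reward for the demo: negative distance from a 5-token target."""
    tokenizer = kwargs.get("tokenizer")
    counts = [
        len(tokenizer.encode(c, add_special_tokens=False)) if tokenizer else len(c.split())
        for c in completions
    ]
    return [float(-abs(n - 5)) for n in counts]


samples = ["Short response.", "One two three four five"]
checked = check_reward_function(token_count_reward, samples, tokenizer=WhitespaceTokenizer())
print(checked)
```

The same check can be pointed at `length_400_reward` or `positive_sentiment_reward` in place of the toy function.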

## 3. GRPO Training

**Goal**: Train your model with reward functions on SageMaker

### Basic GRPO Training

In [None]:
# Create GRPO trainer with length-based reward
trainer = GRPO(
    yaml_file="sama_rl/recipes/GRPO/qwen2-0.5b-grpo-config.yaml",
    reward_functions=[length_400_reward],  # Use our length reward
    max_steps=10,  # Small number for testing
    wandb_api_key="your_wandb_key_here"  # Replace with your key
)

print("GRPO Trainer Created")
print(f"Model: {trainer.config.model['name']}")
print(f"Dataset: {trainer.config.data['dataset_name']}")
print(f"Max steps: {trainer.config.training['max_steps']}")

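# Preview the training job name (dots are stripped too: SageMaker job names allow only letters, digits, and hyphens)
model_name = trainer.config.model['name'].split('/')[-1].lower().replace('-', '').replace('.', '')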
timestamp = int(time.time())
job_name = f"sama-grpo-{model_name}-{timestamp}"
print(f"Training job will be named: {job_name}")
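
SageMaker training job names may contain only letters, digits, and hyphens, and are limited to 63 characters. A slightly more defensive way to build the name is sketched below. `make_job_name` is a hypothetical helper, not part of SAMA RL, and the model ID `Qwen/Qwen2-0.5B-Instruct` is an assumption inferred from the recipe file name.

```python
import re
import time
from typing import Optional

MAX_JOB_NAME_LENGTH = 63  # SageMaker limit for training job names


def make_job_name(model_id: str, prefix: str = "sama-grpo", timestamp: Optional[int] = None) -> str:
    """Build a job name from a model ID, keeping only characters SageMaker accepts."""
    # Keep the part after the org name and drop everything except letters and digits
    short_name = re.sub(r"[^a-z0-9]", "", model_id.split("/")[-1].lower())
    suffix = f"-{int(time.time()) if timestamp is None else timestamp}"
    # Truncate the model part so prefix + "-" + model + suffix fits the limit
    budget = MAX_JOB_NAME_LENGTH - len(prefix) - 1 - len(suffix)
    return f"{prefix}-{short_name[:budget]}{suffix}"


print(make_job_name("Qwen/Qwen2-0.5B-Instruct", timestamp=1234567890))
```

Passing an explicit `timestamp` makes the result reproducible, which is convenient when you need to reconstruct a job name later for deployment.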

### Start Training

**Warning: This will launch a real SageMaker training job and incur costs**

In [None]:
# Uncomment to start actual training
# trainer.train()

print("Training is commented out to prevent accidental costs")
print("\nTo start training:")
print("  1. Uncomment the line above")
print("  2. Add your real W&B API key")
print("  3. Run the cell")
print("\nExpected cost: ~$8-12 for 800 steps on ml.g4dn.2xlarge (the trainer above is set to max_steps=10)")
print("Expected time: ~30-45 minutes")

print("\nMonitor training:")
print("  • SageMaker Console: Training jobs")
print("  • W&B Dashboard: Real-time metrics")
print("  • CloudWatch: Detailed logs")
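
Besides the consoles listed above, job status can be polled programmatically. The sketch below uses the `describe_training_job` call of the boto3 SageMaker client and reads `TrainingJobStatus` from the response. `wait_for_training_job` and `FakeSageMakerClient` are hypothetical names; the fake client exists only so the example runs without AWS credentials.

```python
import time
from typing import Any, Dict

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client: Any, job_name: str, poll_seconds: float = 60.0, max_polls: int = 120) -> str:
    """Poll describe_training_job until the job reaches a terminal state or polls run out."""
    status = "Unknown"
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return status


class FakeSageMakerClient:
    """Offline stand-in that replays a fixed sequence of statuses."""

    def __init__(self) -> None:
        self._statuses = iter(["InProgress", "InProgress", "Completed"])

    def describe_training_job(self, TrainingJobName: str) -> Dict[str, str]:
        return {"TrainingJobStatus": next(self._statuses)}


final_status = wait_for_training_job(FakeSageMakerClient(), "sama-grpo-demo", poll_seconds=0.0)
# With real credentials: wait_for_training_job(boto3.client("sagemaker"), job_name)
```

The function returns the last status it saw, so a caller can tell a completed job from a failed one or from a job that was still running when polling stopped.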

## 4. Model Deployment

**Goal**: Deploy your trained model to a SageMaker endpoint for inference

### Deploy from Existing Training Job

In [None]:
# Load existing training job and deploy
trainer = GRPO(training_job_name="sama-grpo-qwen205binstruct-1234567890")  # Replace with your job name

# Deploy with auto-selected instance (based on model size)
endpoint_name = trainer.deploy()

print(f"Model deployed to endpoint: {endpoint_name}")
print("Deployment complete - ready for inference")

### Deploy with Custom Instance

In [None]:
# Deploy with specific instance type
# endpoint_name = trainer.deploy(instance_type="ml.g5.2xlarge")

print("Available GPU instances for deployment:")
print("• ml.g5.xlarge - Small models (0.5B-1B) - ~$1.00/hour")
print("• ml.g5.2xlarge - Medium models (1B-3B) - ~$1.50/hour")
print("• ml.g5.4xlarge - Large models (7B+) - ~$2.50/hour")
print("• ml.g5.12xlarge - Very large models (13B+) - ~$7.00/hour")
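
The size guidance printed above can be turned into a small lookup when you want to pass `instance_type` explicitly. This is only a sketch of that table, not the logic `trainer.deploy()` uses internally. `suggest_instance_type` is a hypothetical helper, and sizes between 3B and 7B, which the list does not cover, are assigned to `ml.g5.4xlarge` as an assumption.

```python
def suggest_instance_type(model_size_billions: float) -> str:
    """Map a parameter count (in billions) to the instance sizes listed above."""
    if model_size_billions <= 1:
        return "ml.g5.xlarge"      # small models (0.5B-1B)
    if model_size_billions <= 3:
        return "ml.g5.2xlarge"     # medium models (1B-3B)
    if model_size_billions < 13:
        return "ml.g5.4xlarge"     # large models (7B+); 3B-7B placed here by assumption
    return "ml.g5.12xlarge"        # very large models (13B+)


for size in (0.5, 3, 7, 13):
    print(f"{size}B -> {suggest_instance_type(size)}")
```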

## 5. Inference

**Goal**: Run inference on your deployed model

### Basic Inference

In [None]:
# Create inference model from deployed endpoint
model = create_inference_model(endpoint_name)

# Single inference
completion = model.generate(
    prompt="What is machine learning?",
    max_new_tokens=200,
    temperature=0.7
)

print("Prompt: What is machine learning?")
print(f"Completion: {completion}")
print(f"Token count: {model.get_token_count(completion)}")

### Batch Inference

In [None]:
# Test multiple prompts
test_prompts = [
    "Explain artificial intelligence in simple terms.",
    "What are the benefits of renewable energy?",
    "Describe the process of photosynthesis."
]

print("Running batch inference:")
for i, prompt in enumerate(test_prompts, 1):
    completion = model.generate(
        prompt=prompt,
        max_new_tokens=150,
        temperature=0.7,
        stop_on_repetition=True
    )
    tokens = model.get_token_count(completion)
    
    print(f"\nPrompt {i}: {prompt}")
    print(f"Completion ({tokens} tokens): {completion[:100]}...")

### Inference Parameters

In [None]:
# Different temperature settings
prompt = "Write a short story about a robot."

temperatures = [0.0, 0.7, 1.2]
for temp in temperatures:
    completion = model.generate(
        prompt=prompt,
        max_new_tokens=100,
        temperature=temp,
        stop_on_repetition=True
    )
    
    creativity = "Deterministic" if temp == 0.0 else "Balanced" if temp < 1.0 else "Creative"
    print(f"\nTemperature {temp} ({creativity}):")
    print(f"{completion[:80]}...")
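
Why temperature changes the output can be seen from how it rescales the model's next-token scores before sampling. The sketch below applies a temperature-scaled softmax to three made-up logits. Treating a temperature of 0 as greedy selection is an assumption about how an endpoint might handle that value, and `softmax_with_temperature` is a hypothetical helper.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into sampling probabilities at the given temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 puts all probability on the top token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temp in (0.0, 0.7, 1.2):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, which makes output more repeatable; higher temperatures flatten the distribution, so less likely tokens are sampled more often.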

## 6. Advanced Usage

### Multi-Reward Training

In [None]:
# Create trainer with multiple reward functions
multi_trainer = GRPO(
    yaml_file="sama_rl/recipes/GRPO/qwen2-0.5b-grpo-config.yaml",
    reward_functions=[
        length_400_reward,           # Target 400 tokens
        positive_sentiment_reward    # Positive language
    ],
    max_steps=10,
    wandb_api_key="your_key_here"
)

print("Multi-reward trainer created")
print("Reward 1: Length (400 tokens)")
print("Reward 2: Positive sentiment")
print("Model will optimize for both objectives")
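
When rewards live on very different scales, one objective can drown out the other. In the test cell earlier, "Short response." scores about -158 on the length reward, while sentiment scores stay within a few points. This guide does not state how SAMA RL combines multiple rewards, so the sketch below shows one common approach, a weighted sum, wrapped as a single reward function. `combine_rewards` and the two toy rewards are hypothetical.

```python
from typing import Callable, List, Sequence

RewardFn = Callable[..., List[float]]


def combine_rewards(reward_fns: Sequence[RewardFn], weights: Sequence[float]) -> RewardFn:
    """Return a reward function that is the weighted sum of several others."""
    def combined(completions: List[str], **kwargs) -> List[float]:
        totals = [0.0] * len(completions)
        for fn, weight in zip(reward_fns, weights):
            for i, score in enumerate(fn(completions, **kwargs)):
                totals[i] += weight * score
        return totals
    return combined


def word_length_reward(completions: List[str], **kwargs) -> List[float]:
    """Toy length reward: quadratic penalty around a 5-word target."""
    return [-(abs(len(c.split()) - 5) ** 2) / 1000 for c in completions]


def thanks_reward(completions: List[str], **kwargs) -> List[float]:
    """Toy sentiment reward: 1.0 if the text contains 'thanks'."""
    return [1.0 if "thanks" in c.lower() else 0.0 for c in completions]


combined = combine_rewards([word_length_reward, thanks_reward], weights=[1.0, 0.1])
print(combined(["thanks a lot my friend", "no"]))
```

Passing a single combined function with explicit weights makes the trade-off between objectives visible and tunable, instead of leaving it to whatever scale each reward happens to produce.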

### Configuration Overrides

In [None]:
# Override configuration parameters at runtime
advanced_trainer = GRPO(
    yaml_file="sama_rl/recipes/GRPO/qwen2-0.5b-grpo-config.yaml",
    reward_functions=[length_400_reward],
    
    # Override training parameters
    max_steps=1000,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    
    # Override SageMaker settings
    instance_type="ml.g4dn.4xlarge",
    
    # Override W&B settings
    wandb_api_key="your_key"
)

print("Configuration Overrides Applied:")
print(f"Max steps: {advanced_trainer.config.training['max_steps']}")
print(f"Learning rate: {advanced_trainer.config.training['learning_rate']}")
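
Conceptually, runtime overrides replace individual values loaded from the YAML recipe while leaving everything else untouched. The sketch below illustrates that idea with plain dictionaries. It is not SAMA RL's implementation: `apply_overrides`, the section layout, and the default values are all hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict

Config = Dict[str, Dict[str, Any]]


def apply_overrides(config: Config, **overrides: Any) -> Config:
    """Return a copy of config with each override written into the section that defines it."""
    merged = deepcopy(config)  # leave the loaded defaults untouched
    for key, value in overrides.items():
        section_name = next((name for name, section in merged.items() if key in section), None)
        if section_name is None:
            raise KeyError(f"unknown config key: {key}")
        merged[section_name][key] = value
    return merged


# Hypothetical defaults, standing in for values loaded from the YAML recipe
defaults: Config = {
    "training": {"max_steps": 100, "learning_rate": 1e-5, "per_device_train_batch_size": 1},
    "sagemaker": {"instance_type": "ml.g4dn.2xlarge"},
}

config = apply_overrides(defaults, max_steps=1000, learning_rate=2e-5, instance_type="ml.g4dn.4xlarge")
print(config["training"]["max_steps"], config["sagemaker"]["instance_type"])
```

Rejecting unknown keys, as this sketch does, catches typos such as `max_step` early, before a misconfigured job is launched.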

## Summary

### What You've Learned
- Create custom reward functions for any objective
- Train models with GRPO on SageMaker
- Deploy models to endpoints for inference
- Run inference with various parameters
- Override configurations for different use cases

### Complete Workflow
1. **Define reward function** → `create_length_reward(400)`
2. **Configure training** → `GRPO(yaml_file, reward_functions)`
3. **Train model** → `trainer.train()`
4. **Deploy model** → `trainer.deploy()`
5. **Run inference** → `model.generate(prompt)`

### Best Practices
- Test with small max_steps first (10-50)
- Use appropriate GPU instances for deployment
- Cap training costs by setting max_run time limits
- Use stop_on_repetition for cleaner outputs
- Start with smaller models and scale up

You now have the tools to train language models with reinforcement learning using SAMA RL.