# verl + Modular RL Primitives Training

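This notebook demonstrates how to combine:
- **verl**: a production-oriented RL framework with FSDP, vLLM, and distributed training support
- **Modular primitives**: flexible, composable components for custom rewards and environments

The cells below check whether verl and vLLM are installed, but the training loop itself is a simplified single-process PPO sketch built on PyTorch and Hugging Face Transformers. Scaling it up with verl is covered under Next Steps.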

## What you'll learn:
1. Setting up verl with modular primitives
2. Running PPO training with custom rewards
3. Experimenting with different configurations
4. Evaluating and analyzing results

## Setup and Imports

In [None]:
import sys
import os
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
import json
import torch
import numpy as np
from pathlib import Path

# Add parent directory to path
sys.path.insert(0, os.path.abspath('../..'))

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

In [None]:
# Check verl installation
try:
    import verl
    print("✓ verl installed")
    VERL_AVAILABLE = True
except ImportError:
    print("✗ verl not installed")
    print("Install with: pip install verl")
    VERL_AVAILABLE = False

# Check vLLM installation
try:
    import vllm
    print("✓ vLLM installed")
    VLLM_AVAILABLE = True
except ImportError:
    print("✗ vLLM not installed (optional, but recommended for fast generation)")
    print("Install with: pip install vllm")
    VLLM_AVAILABLE = False

## 1. Configuration

Configure your training run. Start with small values for quick experimentation.

In [None]:
@dataclass
class Config:
    """Training configuration"""
    
    # Model settings
    model_name: str = "gpt2"  # Start small for testing
    # model_name: str = "meta-llama/Llama-2-7b-hf"  # Use for production
    
    # Training settings
    num_epochs: int = 3
    rollout_batch_size: int = 8  # Small for notebook
    train_batch_size: int = 8
    
    # PPO hyperparameters
    learning_rate: float = 1e-5
    ppo_epochs: int = 4
    gamma: float = 1.0
    gae_lambda: float = 0.95
    clip_range: float = 0.2
    vf_coef: float = 0.5
    kl_coef: float = 0.05
    
    # Generation settings
    max_prompt_length: int = 256
    max_new_tokens: int = 128
    temperature: float = 0.8
    top_p: float = 0.9
    
    # verl backend
    use_vllm: bool = VLLM_AVAILABLE
    
    # Paths
    output_dir: str = "./outputs/verl_notebook"
    
config = Config()

# Create output directory
os.makedirs(config.output_dir, exist_ok=True)

print("Configuration:")
print(f"  Model: {config.model_name}")
print(f"  Epochs: {config.num_epochs}")
print(f"  Batch size: {config.rollout_batch_size}")
print(f"  Using vLLM: {config.use_vllm}")
print(f"  Output dir: {config.output_dir}")
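
One way to try several settings without editing the class is `dataclasses.replace`, which returns a modified copy. The sketch below uses a hypothetical `SweepConfig` stand-in that carries only a few of the fields above so that it runs on its own; the swept learning rates are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SweepConfig:
    """Hypothetical stand-in carrying a few fields from Config above."""
    learning_rate: float = 1e-5
    clip_range: float = 0.2
    kl_coef: float = 0.05


base = SweepConfig()

# Each variant is an independent copy; `base` is never mutated.
variants = [replace(base, learning_rate=lr) for lr in (1e-6, 1e-5, 1e-4)]

for v in variants:
    print(f"lr={v.learning_rate:g} clip={v.clip_range} kl={v.kl_coef}")
```

Freezing the dataclass is a design choice here: it prevents a later cell from silently changing a configuration that an earlier run already used.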

## 2. Setup Training Data

Create sample prompts for training. Replace with your actual dataset.

In [None]:
# Sample prompts for demonstration
train_prompts = [
    "Explain how neural networks learn:",
    "What is the difference between supervised and unsupervised learning?",
    "Describe the attention mechanism in transformers:",
    "How does backpropagation work?",
    "What is overfitting and how can it be prevented?",
    "Explain the concept of gradient descent:",
    "What are the advantages of deep learning?",
    "Describe how convolutional neural networks process images:",
    "What is transfer learning and when should it be used?",
    "Explain the role of activation functions:",
]

eval_prompts = [
    "What is reinforcement learning?",
    "Explain batch normalization:",
    "What are recurrent neural networks used for?",
]

print(f"Training prompts: {len(train_prompts)}")
print(f"Eval prompts: {len(eval_prompts)}")
print("\nExample prompt:")
print(f"  {train_prompts[0]}")
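
To replace the sample prompts with a real dataset, one option is a JSON Lines file with one object per line. The following is a minimal sketch under that assumption: the `prompt` field name, the `prompts.jsonl` filename, and the helper names `load_prompts` and `split_prompts` are all hypothetical.

```python
import json
import random
from pathlib import Path
from typing import List, Tuple


def load_prompts(path: str, field: str = "prompt") -> List[str]:
    """Read one JSON object per line and return the non-empty `field` values."""
    prompts = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            text = json.loads(line).get(field, "").strip()
            if text:
                prompts.append(text)
    return prompts


def split_prompts(
    prompts: List[str], eval_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically and hold out a fraction for evaluation."""
    shuffled = prompts[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Usage (assumes a file named prompts.jsonl exists):
# train_prompts, eval_prompts = split_prompts(load_prompts("prompts.jsonl"))
```

Seeding the shuffle keeps the train/eval split stable across notebook restarts, so evaluation rewards remain comparable between runs.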

## 3. Initialize Model and Tokenizer

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

print(f"Loading model: {config.model_name}...")

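# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Set pad_token = eos_token")
# Decoder-only models need left padding for batched generation;
# otherwise pad tokens sit between the prompt and the generated text.
tokenizer.padding_side = "left"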

# Load the policy in float32: AdamW updates on pure float16 weights are
# numerically fragile. For larger models, consider bfloat16 or mixed precision.
# Note: device_map="auto" requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    torch_dtype=torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
)

# Create reference model (for KL penalty); half precision is fine for inference
ref_model = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
)
ref_model.eval()  # Disable dropout
ref_model.requires_grad_(False)  # eval() alone does not freeze weights

print("✓ Model loaded")
print(f"  Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
print(f"  Device: {model.device}")

## 4. Define Custom Reward Function

**This is where you customize!** Define how to score model outputs.

In [None]:
class CustomRewardComputer:
    """
    Custom reward function for your task.
    
    Modify this to implement your specific reward logic:
    - Call a reward model
    - Rule-based scoring
    - Verification (code execution, math checking)
    - Human feedback proxy
    """
    
    def __init__(self):
        # Initialize reward model or scoring tools here
        pass
    
    def compute_reward(self, prompt: str, response: str) -> float:
        """
        Compute reward for a prompt-response pair.
        
        Returns:
            float: Reward score (higher is better)
        """
        reward = 0.0
        
        # 1. Length reward (prefer detailed responses)
        words = response.split()
        length_score = min(len(words) / 50.0, 1.0)  # Cap at 50 words
        reward += length_score * 0.3
        
        # 2. Informativeness (presence of explanatory words)
        informative_words = [
            'because', 'therefore', 'however', 'which', 'when',
            'where', 'how', 'why', 'example', 'such as'
        ]
        info_count = sum(1 for word in informative_words 
                        if word in response.lower())
        reward += min(info_count / 5.0, 1.0) * 0.3
        
        # 3. Structure (has proper sentences)
        has_period = '.' in response
        has_capital = any(c.isupper() for c in response)
        structure_score = (has_period + has_capital) / 2.0
        reward += structure_score * 0.2
        
        # 4. Relevance (contains words from prompt)
        prompt_words = set(prompt.lower().split())
        response_words = set(response.lower().split())
        overlap = len(prompt_words & response_words) / max(len(prompt_words), 1)
        reward += overlap * 0.2
        
        return reward
    
    def compute_batch_rewards(
        self,
        prompts: List[str],
        responses: List[str]
    ) -> np.ndarray:
        """Compute rewards for a batch"""
        rewards = [
            self.compute_reward(p, r)
            for p, r in zip(prompts, responses)
        ]
        return np.array(rewards, dtype=np.float32)

# Initialize reward computer
reward_computer = CustomRewardComputer()

# Test the reward function
test_prompt = "Explain neural networks:"
test_response_good = "Neural networks are computational models inspired by biological neurons. They learn by adjusting weights through backpropagation, which allows them to recognize patterns in data."
test_response_bad = "Networks."

print("Testing reward function:")
print(f"  Good response: {reward_computer.compute_reward(test_prompt, test_response_good):.3f}")
print(f"  Bad response:  {reward_computer.compute_reward(test_prompt, test_response_bad):.3f}")
print("\n✓ Reward function ready")
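
The informativeness term above uses substring tests, so `'how'` also matches inside "show" and `'which'` inside "whichever". A minimal sketch of whole-word matching with regular expressions follows; `count_informative` is a hypothetical helper, and the word list is copied from the cell above.

```python
import re

INFORMATIVE = [
    "because", "therefore", "however", "which", "when",
    "where", "how", "why", "example", "such as",
]


def count_informative(response: str, whole_words: bool = True) -> int:
    """Count how many marker words or phrases occur in the response."""
    text = response.lower()
    if not whole_words:
        # Substring matching, as in CustomRewardComputer above
        return sum(1 for w in INFORMATIVE if w in text)
    # \b anchors the match at word boundaries on both sides
    return sum(
        1 for w in INFORMATIVE if re.search(rf"\b{re.escape(w)}\b", text)
    )


sample = "The slides show whichever layer fires."
print(count_informative(sample, whole_words=False))  # 2 ("which", "how")
print(count_informative(sample, whole_words=True))   # 0
```

Whether the stricter match is preferable depends on the task; for a reward signal it mainly reduces accidental credit for unrelated words.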

## 5. Setup Training Components

Initialize optimizer and training utilities.

In [None]:
from torch.optim import AdamW
from torch.nn import functional as F

# Optimizer
optimizer = AdamW(
    model.parameters(),
    lr=config.learning_rate,
    betas=(0.9, 0.95),
    weight_decay=0.01
)

print("✓ Optimizer initialized")
print(f"  Learning rate: {config.learning_rate}")

# Metrics tracking
metrics_history = {
    'epoch': [],
    'policy_loss': [],
    'value_loss': [],
    'total_loss': [],
    'reward_mean': [],
    'reward_std': [],
    'kl_div': [],
}

## 6. Generation Function

Generate responses from the model.

In [None]:
def generate_responses(
    prompts: List[str],
    model,
    tokenizer,
    max_new_tokens: int = 128,
    temperature: float = 0.8,
    top_p: float = 0.9,
) -> tuple:
    """
    Generate responses for a batch of prompts.
    
    Returns:
        responses: List of generated text
        response_ids: Tensor of token IDs
        log_probs: Log probabilities of generated tokens
    """
    # Tokenize prompts
    inputs = tokenizer(
        prompts,
        padding=True,
        truncation=True,
        max_length=config.max_prompt_length,
        return_tensors="pt"
    ).to(model.device)
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            return_dict_in_generate=True,
            output_scores=True,
        )
    
    # Decode responses
    response_ids = outputs.sequences[:, inputs.input_ids.shape[1]:]
    responses = tokenizer.batch_decode(response_ids, skip_special_tokens=True)
    
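    # Compute log probabilities of the sampled tokens.
    # outputs.scores holds one (batch, vocab) tensor of processed logits per step,
    # so these are log-probs under the sampling distribution (after temperature/top-p).
    step_log_probs = torch.stack(
        [F.log_softmax(score.float(), dim=-1) for score in outputs.scores],
        dim=1,
    )  # (batch, new_tokens, vocab)
    log_probs = step_log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    # Positions after EOS are padding; mask them before using these values in a loss.
    
    return responses, response_ids, log_probs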

print("✓ Generation function ready")

## 7. PPO Training Functions

In [None]:
def compute_advantages(
    rewards: np.ndarray,
    values: Optional[np.ndarray] = None,
    gamma: float = 1.0,
    gae_lambda: float = 0.95,
) -> np.ndarray:
    """
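    Compute sequence-level advantages from scalar rewards.
    
    This is a simplified baseline-subtracted estimate, not full GAE:
    each response receives a single reward, so there is no per-token
    bootstrapping. `gamma` and `gae_lambda` are accepted for interface
    compatibility but are unused here; they matter once a per-token
    value function is added.
    """
    if values is None:
        # Simple baseline: mean reward of the batch
        advantages = rewards - rewards.mean()
    else:
        # Learned baseline: subtract the value estimate for each response
        advantages = rewards - values
    
    # Normalize to zero mean and unit variance for a stable update scale
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    
    return advantages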


def compute_kl_divergence(
    model,
    ref_model,
    input_ids: torch.Tensor,
    response_ids: torch.Tensor,
) -> torch.Tensor:
    """
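    Compute the per-token KL divergence KL(current || reference) over the
    response tokens.
    
    `input_ids` are the prompt token IDs and `response_ids` the generated
    token IDs. Padding is not masked here, which is acceptable for a sketch
    but should be handled in real training.
    """
    full_ids = torch.cat([input_ids, response_ids], dim=1)
    # Logits at position t predict token t + 1, so these positions score the response.
    start = input_ids.shape[1] - 1
    
    with torch.no_grad():
        ref_logits = ref_model(input_ids=full_ids).logits[:, start:-1]
        ref_log_probs = F.log_softmax(ref_logits.float(), dim=-1)
    
    curr_logits = model(input_ids=full_ids).logits[:, start:-1]
    curr_log_probs = F.log_softmax(curr_logits.float(), dim=-1)
    
    # F.kl_div(input, target) computes KL(target || input), so the current
    # policy is passed as the target.
    kl_div = F.kl_div(
        ref_log_probs,
        curr_log_probs,
        log_target=True,
        reduction='none',
    ).sum(dim=-1).mean()
    
    return kl_div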

print("✓ PPO functions ready")
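
`compute_advantages` works at the sequence level. For comparison, a per-token Generalized Advantage Estimation pass could look like the sketch below. It is dependency-free and assumes a single trajectory whose value after the final step is zero; `gae_advantages` is a hypothetical helper, not part of verl.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """
    Generalized Advantage Estimation over one trajectory.

    rewards[t] is the reward at step t and values[t] the critic's estimate
    V(s_t). The value after the final step is taken to be 0 (episode ends).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running  # discounted sum of residuals
        advantages[t] = running
    return advantages


# Terminal-only reward, as with a sequence-level score placed on the last token.
adv = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
print(adv)  # [0.5, 0.5, 0.5]
```

With `gamma=1.0` and `lam=1.0` every token receives the return minus its value estimate; lowering `lam` concentrates credit near the token where the reward arrives.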

## 8. Training Loop

Main PPO training loop combining all components.

In [None]:
import random
from tqdm.notebook import tqdm

print("Starting training...\n")

for epoch in range(config.num_epochs):
    print(f"{'='*80}")
    print(f"Epoch {epoch + 1}/{config.num_epochs}")
    print(f"{'='*80}")
    
    # Sample batch of prompts
    batch_prompts = random.sample(
        train_prompts,
        min(config.rollout_batch_size, len(train_prompts))
    )
    
    # 1. Generate responses
    print(f"\n[1/5] Generating {len(batch_prompts)} responses...")
    model.eval()
    responses, response_ids, log_probs = generate_responses(
        batch_prompts,
        model,
        tokenizer,
        max_new_tokens=config.max_new_tokens,
        temperature=config.temperature,
        top_p=config.top_p,
    )
    
    # Show example
    print(f"\nExample generation:")
    print(f"  Prompt: {batch_prompts[0]}")
    print(f"  Response: {responses[0][:100]}...")
    
    # 2. Compute rewards
    print(f"\n[2/5] Computing rewards...")
    rewards = reward_computer.compute_batch_rewards(
        batch_prompts,
        responses
    )
    
    print(f"  Reward stats:")
    print(f"    Mean: {rewards.mean():.3f}")
    print(f"    Std:  {rewards.std():.3f}")
    print(f"    Min:  {rewards.min():.3f}")
    print(f"    Max:  {rewards.max():.3f}")
    
    # 3. Compute advantages
    print(f"\n[3/5] Computing advantages...")
    advantages = compute_advantages(
        rewards,
        gamma=config.gamma,
        gae_lambda=config.gae_lambda
    )
    advantages_tensor = torch.tensor(
        advantages,
        dtype=torch.float32,
        device=model.device
    )
    
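    # 4. PPO update
    print(f"\n[4/5] Running PPO update ({config.ppo_epochs} epochs)...")
    
    # Rebuild prompt + response sequences so the policy can re-score its own samples.
    prompt_inputs = tokenizer(
        batch_prompts,
        padding=True,
        truncation=True,
        max_length=config.max_prompt_length,
        return_tensors="pt",
    ).to(model.device)
    full_ids = torch.cat([prompt_inputs.input_ids, response_ids], dim=1)
    # pad_token may equal eos_token, so this mask also drops the final EOS (a simplification).
    response_mask = (response_ids != tokenizer.pad_token_id).float()
    full_mask = torch.cat([prompt_inputs.attention_mask, response_mask.long()], dim=1)
    start = prompt_inputs.input_ids.shape[1] - 1  # logits at t predict token t + 1
    n_tokens = response_mask.sum().clamp(min=1.0)
    
    def response_log_probs(policy):
        """Log-probabilities that `policy` assigns to each sampled response token."""
        logits = policy(input_ids=full_ids, attention_mask=full_mask).logits[:, start:-1]
        token_log_probs = F.log_softmax(logits.float(), dim=-1)
        return token_log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    
    # Old-policy and reference log-probs are fixed targets for this rollout.
    with torch.no_grad():
        old_log_probs = response_log_probs(model)
        ref_log_probs = response_log_probs(ref_model)
    
    model.train()
    adv = advantages_tensor.unsqueeze(1)  # one sequence-level advantage, broadcast over tokens
    
    epoch_policy_loss = 0.0
    epoch_kl_div = 0.0
    
    for ppo_epoch in range(config.ppo_epochs):
        optimizer.zero_grad()
        
        # Probability ratio between the current and the rollout-time policy
        new_log_probs = response_log_probs(model)
        ratio = torch.exp(new_log_probs - old_log_probs)
        
        # PPO clipped surrogate objective, averaged over response tokens
        clipped_ratio = torch.clamp(ratio, 1 - config.clip_range, 1 + config.clip_range)
        surrogate = torch.min(ratio * adv, clipped_ratio * adv)
        policy_loss = -(surrogate * response_mask).sum() / n_tokens
        
        # KL penalty: sample-based estimate of KL(current || reference) on the sampled tokens
        kl_div = ((new_log_probs - ref_log_probs) * response_mask).sum() / n_tokens
        
        # Total loss (this simplified loop has no value head, so vf_coef is unused)
        loss = policy_loss + config.kl_coef * kl_div
        
        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        
        epoch_policy_loss += policy_loss.item()
        epoch_kl_div += kl_div.item()
    
    avg_policy_loss = epoch_policy_loss / config.ppo_epochs
    avg_kl_div = epoch_kl_div / config.ppo_epochs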
    
    # 5. Logging
    print(f"\n[5/5] Metrics:")
    print(f"  Policy loss: {avg_policy_loss:.4f}")
    print(f"  KL div:      {avg_kl_div:.4f}")
    print(f"  Reward:      {rewards.mean():.3f} ± {rewards.std():.3f}")
    
    # Track metrics
    metrics_history['epoch'].append(epoch)
    metrics_history['policy_loss'].append(avg_policy_loss)
    metrics_history['reward_mean'].append(rewards.mean())
    metrics_history['reward_std'].append(rewards.std())
    metrics_history['kl_div'].append(avg_kl_div)

print("\n" + "="*80)
print("Training complete!")
print("="*80)
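
The clipped surrogate objective at the heart of PPO can be checked with plain numbers. The sketch below is a scalar, dependency-free illustration: the default `clip_range=0.2` mirrors `Config.clip_range`, while the ratios and advantages are made-up values chosen to show each branch of the `min`.

```python
import math


def ppo_clipped_objective(
    new_logp: float,
    old_logp: float,
    advantage: float,
    clip_range: float = 0.2,
) -> float:
    """Per-token PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
gain_capped = ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0)

# Ratio 1.5 with a negative advantage: the unclipped, more pessimistic term is kept.
penalty_kept = ppo_clipped_objective(math.log(1.5), 0.0, advantage=-1.0)

print(f"{gain_capped:.2f} {penalty_kept:.2f}")
```

The training loss is the negative of this quantity averaged over tokens, so clipping removes the incentive to push the ratio further once it leaves the `[1 - eps, 1 + eps]` band in the favorable direction.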

## 9. Visualize Training Progress

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Reward over time
axes[0, 0].plot(metrics_history['epoch'], metrics_history['reward_mean'], 'b-', label='Mean')
axes[0, 0].fill_between(
    metrics_history['epoch'],
    np.array(metrics_history['reward_mean']) - np.array(metrics_history['reward_std']),
    np.array(metrics_history['reward_mean']) + np.array(metrics_history['reward_std']),
    alpha=0.3
)
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Reward')
axes[0, 0].set_title('Reward over Training')
axes[0, 0].grid(True)

# Policy loss
axes[0, 1].plot(metrics_history['epoch'], metrics_history['policy_loss'], 'r-')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Policy Loss')
axes[0, 1].set_title('Policy Loss over Training')
axes[0, 1].grid(True)

# KL divergence
axes[1, 0].plot(metrics_history['epoch'], metrics_history['kl_div'], 'g-')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('KL Divergence')
axes[1, 0].set_title('KL Divergence over Training')
axes[1, 0].grid(True)

# Summary statistics
axes[1, 1].axis('off')
summary_text = f"""
Training Summary:

Epochs: {config.num_epochs}
Final Reward: {metrics_history['reward_mean'][-1]:.3f}
Initial Reward: {metrics_history['reward_mean'][0]:.3f}
Improvement: {metrics_history['reward_mean'][-1] - metrics_history['reward_mean'][0]:.3f}

Model: {config.model_name}
Batch Size: {config.rollout_batch_size}
Learning Rate: {config.learning_rate}
"""
axes[1, 1].text(0.1, 0.5, summary_text, fontsize=10, verticalalignment='center')

plt.tight_layout()
plt.savefig(os.path.join(config.output_dir, 'training_metrics.png'))
plt.show()

print(f"\n✓ Metrics saved to {config.output_dir}/training_metrics.png")

## 10. Evaluate Trained Model

In [None]:
print("Evaluating on held-out prompts...\n")

model.eval()

eval_results = []

for prompt in eval_prompts:
    # Generate response
    responses, _, _ = generate_responses(
        [prompt],
        model,
        tokenizer,
        max_new_tokens=config.max_new_tokens,
        temperature=0.7,  # Lower temperature for evaluation
    )
    response = responses[0]
    
    # Compute reward
    reward = reward_computer.compute_reward(prompt, response)
    
    eval_results.append({
        'prompt': prompt,
        'response': response,
        'reward': reward
    })
    
    print(f"{'-'*80}")
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print(f"Reward: {reward:.3f}")
    print()

avg_eval_reward = np.mean([r['reward'] for r in eval_results])
print(f"\nAverage evaluation reward: {avg_eval_reward:.3f}")

## 11. Compare Before/After Training

In [None]:
print("Comparing trained model vs reference model:\n")

test_prompt = "Explain gradient descent:"

# Trained model
model.eval()
trained_response, _, _ = generate_responses(
    [test_prompt],
    model,
    tokenizer,
    max_new_tokens=config.max_new_tokens,
    temperature=0.7,
)

# Reference model (pretrained weights, no RL updates)
ref_model.eval()
ref_response, _, _ = generate_responses(
    [test_prompt],
    ref_model,
    tokenizer,
    max_new_tokens=config.max_new_tokens,
    temperature=0.7,
)

trained_reward = reward_computer.compute_reward(test_prompt, trained_response[0])
ref_reward = reward_computer.compute_reward(test_prompt, ref_response[0])

print(f"Test prompt: {test_prompt}\n")
print(f"{'-'*80}")
print(f"TRAINED MODEL (Reward: {trained_reward:.3f}):")
print(trained_response[0])
print(f"\n{'-'*80}")
print(f"REFERENCE MODEL (Reward: {ref_reward:.3f}):")
print(ref_response[0])
print(f"\n{'-'*80}")
print(f"\nImprovement: {trained_reward - ref_reward:.3f}")
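
A single sampled response per model is a noisy basis for comparison, because generation here is stochastic. One hedged way to make the comparison more robust is to score several samples per model and report the mean with its standard error. The sketch below shows only the statistics; `mean_and_stderr` is a hypothetical helper, and the reward lists are made-up placeholders, not results from this notebook.

```python
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(rewards: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean (needs at least 2 values)."""
    mean = statistics.fmean(rewards)
    stderr = statistics.stdev(rewards) / len(rewards) ** 0.5
    return mean, stderr


# Placeholder rewards standing in for several sampled responses per model.
trained_rewards = [0.62, 0.55, 0.70, 0.58]
reference_rewards = [0.50, 0.61, 0.47, 0.52]

for name, rewards in [("trained", trained_rewards), ("reference", reference_rewards)]:
    mean, stderr = mean_and_stderr(rewards)
    print(f"{name}: {mean:.3f} +/- {stderr:.3f}")
```

If the two intervals overlap heavily, the difference between the models is probably within sampling noise and more samples or prompts are needed.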

## 12. Save Trained Model

In [None]:
# Save model
model_save_path = os.path.join(config.output_dir, "trained_model")
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

print(f"✓ Model saved to {model_save_path}")

# Save metrics
metrics_path = os.path.join(config.output_dir, "metrics.json")
with open(metrics_path, 'w') as f:
    json.dump(metrics_history, f, indent=2, default=float)

print(f"✓ Metrics saved to {metrics_path}")

# Save config
config_path = os.path.join(config.output_dir, "config.json")
with open(config_path, 'w') as f:
    json.dump({
        'model_name': config.model_name,
        'num_epochs': config.num_epochs,
        'batch_size': config.rollout_batch_size,
        'learning_rate': config.learning_rate,
    }, f, indent=2)

print(f"✓ Config saved to {config_path}")
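
The cell above writes four hand-picked fields to `config.json`. An alternative is to serialize every field with `dataclasses.asdict`, so that a setting added later is not silently left out. The sketch below uses a hypothetical `RunConfig` stand-in with a few of the fields from `Config` and a temporary directory so that it runs on its own.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class RunConfig:
    """Hypothetical stand-in for Config with a few of its fields."""
    model_name: str = "gpt2"
    num_epochs: int = 3
    learning_rate: float = 1e-5


def save_config(cfg: RunConfig, path: str) -> None:
    """Write every dataclass field to JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(cfg), f, indent=2)


def load_config(path: str) -> RunConfig:
    """Rebuild the dataclass from a JSON file written by save_config."""
    with open(path, encoding="utf-8") as f:
        return RunConfig(**json.load(f))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    save_config(RunConfig(num_epochs=5), path)
    restored = load_config(path)

print(restored)
```

Round-tripping through the same dataclass also gives an early error if a saved file contains a field the current code no longer knows about.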

## Next Steps

Now that you have a working PPO training pipeline with modular primitives:

1. **Customize Rewards**: Modify `CustomRewardComputer` for your specific task
   - Add reward model integration
   - Implement verification (code execution, math checking)
   - Add multi-objective optimization

2. **Scale Up with verl**: 
   - Install verl for production training
   - Use the vLLM backend for faster generation (often cited as 5-10x, depending on model size, batch size, and hardware)
   - Enable FSDP for multi-GPU training
   - Use Ray for distributed rollouts

3. **Experiment**:
   - Try different models (Llama, Mistral, etc.)
   - Adjust hyperparameters (learning rate, clip range, KL coef)
   - Test different reward functions
   - Compare PPO vs GRPO

4. **Production**:
   - Add proper evaluation suite
   - Implement checkpointing and resumption
   - Add wandb/tensorboard logging
   - Set up continuous evaluation
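
For the PPO vs GRPO comparison mentioned above, the main change on the advantage side is that GRPO-style methods sample several responses per prompt and normalize each reward against its own group instead of using a learned value function. A minimal sketch of that normalization, assuming population standard deviation (implementations differ on such details) and a hypothetical helper name:

```python
import statistics
from typing import List


def group_relative_advantages(
    group_rewards: List[float], eps: float = 1e-8
) -> List[float]:
    """Normalize rewards of several responses to one prompt by group mean and std."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    # eps keeps the division finite when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in group_rewards]


# Four sampled responses to the same prompt, scored by the reward function.
adv = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
print([round(a, 3) for a in adv])
```

Responses that score above their group's mean receive positive advantages and those below receive negative ones, which removes the need for the `vf_coef` value loss at the cost of more generations per prompt.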

## Resources

- verl documentation: https://github.com/volcengine/verl
- RL primitives: See `../rl_primitives/` for modular components
- Example scripts: See `./verl_ppo_training.py` for full implementation