# Fine-Tuning Your First LLM with Reinforcement Learning

This notebook walks you through the three stages of the RLHF pipeline, using a heuristic scoring function in place of human annotators:
1. **Supervised Fine-Tuning (SFT)** - Teaching format
2. **Reward Modeling** - Learning preferences
3. **PPO Training** - Optimizing for quality

**Requirements:**
- Free Colab GPU (T4)
- ~1-2 hours runtime

**What you'll learn:**
- How each RLHF stage works
- The role of the KL penalty
- How to detect reward hacking

This notebook accompanies the guide at [rlbook.ai/applications/finetune-llm](https://rlbook.ai/applications/finetune-llm)

## Setup

In [None]:
# Install required packages
!pip install -q transformers datasets accelerate peft trl bitsandbytes
!pip install -q torch torchvision torchaudio

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
from dataclasses import dataclass
from typing import List, Dict, Tuple, Optional
from tqdm.notebook import tqdm
import numpy as np
import random

# Check GPU
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Set seeds for reproducibility
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

## Configuration

In [None]:
# Model configuration
MODEL_NAME = "Qwen/Qwen2.5-0.5B"  # Small model for learning, fits on free Colab

# Training configuration
@dataclass
class Config:
    # SFT
    sft_epochs: int = 3
    sft_lr: float = 1e-4
    sft_batch_size: int = 2
    
    # Reward Model
    rm_epochs: int = 5
    rm_lr: float = 1e-4
    rm_batch_size: int = 2
    
    # PPO
    ppo_iterations: int = 20
    ppo_epochs: int = 4
    ppo_lr: float = 1e-5
    ppo_batch_size: int = 4
    clip_range: float = 0.2
    kl_coef: float = 0.1
    target_kl: float = 0.02
    
    # Generation
    max_length: int = 256
    max_new_tokens: int = 100
    temperature: float = 0.7
    
    # LoRA
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.1

config = Config()
print("Configuration loaded!")

## Load Base Model

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto"
)

print(f"Model loaded: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
print(f"Tokenizer vocab size: {len(tokenizer)}")

## Apply LoRA

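LoRA (Low-Rank Adaptation) freezes the base weights and trains small low-rank adapter matrices injected into selected layers (here the attention `q_proj` and `v_proj` projections), so only a small fraction of the parameters receives gradients.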
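
A rough parameter count shows why this is cheap. The sketch below compares updating one full weight matrix with updating a rank-`r` pair of adapter matrices. The 1024 x 1024 layer size and the helper name `lora_param_counts` are assumptions chosen for illustration, not values read from the model.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = d_in * d_out        # updating the weight matrix W directly
    lora = r * (d_in + d_out)  # A has shape (r, d_in), B has shape (d_out, r)
    return full, lora


# Hypothetical layer size, for illustration only
full, lora = lora_param_counts(d_in=1024, d_out=1024, r=16)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.2%}")

# In the standard LoRA formulation the adapter output is scaled by alpha / r
lora_alpha, r = 32, 16
scaling = lora_alpha / r
print(f"adapter scaling: {scaling}")
```

The ratio shrinks further as the layer gets wider, because the adapter cost grows linearly in the layer width while the full matrix grows quadratically.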

In [None]:
# LoRA configuration
lora_config = LoraConfig(
    r=config.lora_r,
    lora_alpha=config.lora_alpha,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=config.lora_dropout,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

---
# Part 1: Supervised Fine-Tuning (SFT)

First, we teach the model the format of good Q&A responses.
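
Concretely, SFT is ordinary next-token prediction on the formatted text: the loss is the mean negative log-likelihood of each target token. Padded positions are usually excluded by giving them the label `-100`, which PyTorch's cross-entropy skips by default. The sketch below shows that averaging in plain Python; the function name `masked_nll` and the token probabilities are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # label value that cross-entropy ignores by default


def masked_nll(token_log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    kept = [-lp for lp, y in zip(token_log_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical per-token log probabilities; the last position is padding
token_log_probs = [math.log(0.5), math.log(0.25), math.log(0.9)]
labels = [11, 42, IGNORE_INDEX]

sft_loss = masked_nll(token_log_probs, labels)
print(f"masked loss: {sft_loss:.4f}")
```

Without the mask, the padding position would be averaged in and the reported loss would mostly reflect how well the model predicts padding.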

In [None]:
# Training examples for SFT
SFT_EXAMPLES = [
    {
        "question": "What is the capital of France?",
        "answer": "The capital of France is Paris."
    },
    {
        "question": "How does photosynthesis work?",
        "answer": "Photosynthesis is the process by which plants convert sunlight, water, and carbon dioxide into glucose and oxygen. It occurs in chloroplasts using chlorophyll pigments."
    },
    {
        "question": "What is machine learning?",
        "answer": "Machine learning is a subset of artificial intelligence where systems learn patterns from data rather than being explicitly programmed. Models improve through experience."
    },
    {
        "question": "Why is the sky blue?",
        "answer": "The sky appears blue because of Rayleigh scattering. Sunlight entering the atmosphere scatters off air molecules, with shorter blue wavelengths scattering more than longer red wavelengths."
    },
    {
        "question": "What is reinforcement learning?",
        "answer": "Reinforcement learning is a type of machine learning where an agent learns to make decisions by taking actions in an environment and receiving rewards or penalties. The goal is to maximize cumulative reward over time."
    },
    {
        "question": "What is a neural network?",
        "answer": "A neural network is a computational model inspired by biological neurons. It consists of layers of interconnected nodes that transform inputs into outputs through learned weights."
    },
    {
        "question": "What is gradient descent?",
        "answer": "Gradient descent is an optimization algorithm that iteratively adjusts parameters in the direction of steepest decrease of a loss function. It is fundamental to training neural networks."
    },
    {
        "question": "What is the purpose of an activation function?",
        "answer": "Activation functions introduce non-linearity into neural networks, allowing them to learn complex patterns. Without them, multiple layers would collapse to a single linear transformation."
    },
]

def format_for_training(example):
    """Format Q&A pair as training text."""
    return f"Question: {example['question']}\n\nAnswer: {example['answer']}"

# Preview
print("Example formatted data:")
print("-" * 40)
print(format_for_training(SFT_EXAMPLES[0]))

In [None]:
class QADataset(Dataset):
    """Dataset for SFT training."""
    
    def __init__(self, examples, tokenizer, max_length=256):
        self.examples = examples
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.examples)
    
    def __getitem__(self, idx):
        text = format_for_training(self.examples[idx])
        encodings = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt"
        )
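        input_ids = encodings["input_ids"].squeeze(0)
        attention_mask = encodings["attention_mask"].squeeze(0)
        # Exclude padding from the loss: positions labelled -100 are ignored
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }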

# Create dataset and dataloader
sft_dataset = QADataset(SFT_EXAMPLES, tokenizer, config.max_length)
sft_dataloader = DataLoader(sft_dataset, batch_size=config.sft_batch_size, shuffle=True)

print(f"SFT dataset: {len(sft_dataset)} examples")

In [None]:
# SFT Training
optimizer = AdamW(model.parameters(), lr=config.sft_lr)
model.train()

print("Starting SFT training...")
for epoch in range(config.sft_epochs):
    total_loss = 0
    for batch in tqdm(sft_dataloader, desc=f"Epoch {epoch+1}/{config.sft_epochs}"):
        batch = {k: v.to(model.device) for k, v in batch.items()}
        
        outputs = model(**batch)
        loss = outputs.loss
        
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        
        total_loss += loss.item()
    
    avg_loss = total_loss / len(sft_dataloader)
    print(f"Epoch {epoch+1}: Loss = {avg_loss:.4f}")

print("SFT training complete!")

In [None]:
def generate_response(model, tokenizer, question, max_new_tokens=100, temperature=0.7):
    """Generate a response to a question."""
    prompt = f"Question: {question}\n\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    model.eval()  # disable LoRA dropout while sampling
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id
        )
    
    # Decode only the newly generated tokens so the prompt never leaks into the answer
    prompt_len = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
    return response.strip()

# Test the SFT model
print("Testing SFT model:")
print("-" * 50)
test_questions = [
    "What is Python?",
    "How do computers work?",
    "What is the meaning of life?"
]

for q in test_questions:
    print(f"Q: {q}")
    print(f"A: {generate_response(model, tokenizer, q)}")
    print()

---
# Part 2: Reward Modeling

Now we train a model to predict which responses are better.
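
The reward model is trained on pairs with the Bradley-Terry loss, `-log(sigmoid(r_chosen - r_rejected))`: the loss is small when the chosen response scores well above the rejected one and large when the order is reversed. Here is a minimal scalar sketch of that loss; the function name and the example scores are assumptions for illustration.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)); branch so exp() never overflows
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical reward scores
print(bradley_terry_loss(2.0, 0.0))  # correct ranking: small loss
print(bradley_terry_loss(0.0, 0.0))  # tie: loss equals log(2)
print(bradley_terry_loss(0.0, 2.0))  # wrong ranking: large loss
```

Only the difference between the two scores matters, so the reward model's outputs are meaningful relative to each other rather than on an absolute scale.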

In [None]:
def score_response(prompt, response):
    """
    Heuristic scoring function to simulate human preferences.
    In real RLHF, this comes from actual human annotations.
    """
    score = 0
    
    # Prefer complete sentences
    if response.endswith(('.', '!', '?')):
        score += 2
    
    # Prefer reasonable length
    length = len(response)
    if 20 <= length <= 200:
        score += 2
    elif length < 20:
        score -= 2
    elif length > 300:
        score -= 1
    
    # Penalize filler phrases
    filler_phrases = ["I think", "maybe", "perhaps", "it depends"]
    for phrase in filler_phrases:
        if phrase.lower() in response.lower():
            score -= 1
    
    # Reward direct answers
    if response and response[0].isupper():
        score += 1
    
    return score

In [None]:
# Generate preference data
TRAINING_PROMPTS = [
    "What is Python?",
    "How do neural networks learn?",
    "Explain gradient descent.",
    "What is the purpose of activation functions?",
    "How does backpropagation work?",
    "What is overfitting?",
    "Explain the bias-variance tradeoff.",
    "What is a loss function?",
    "How do transformers work?",
    "What is attention in deep learning?",
]

def create_preference_data(model, tokenizer, prompts, n_samples=4):
    """Generate responses and create preference pairs."""
    preference_data = []
    
    for prompt in tqdm(prompts, desc="Generating preferences"):
        # Generate multiple responses
        responses = []
        for _ in range(n_samples):
            response = generate_response(model, tokenizer, prompt)
            responses.append(response)
        
        # Create pairs with preferences
        for i in range(len(responses)):
            for j in range(i + 1, len(responses)):
                score_i = score_response(prompt, responses[i])
                score_j = score_response(prompt, responses[j])
                
                if score_i > score_j:
                    chosen, rejected = responses[i], responses[j]
                elif score_j > score_i:
                    chosen, rejected = responses[j], responses[i]
                else:
                    continue
                
                preference_data.append({
                    "prompt": prompt,
                    "chosen": chosen,
                    "rejected": rejected
                })
    
    return preference_data

print("Generating preference data...")
preference_data = create_preference_data(model, tokenizer, TRAINING_PROMPTS, n_samples=4)
print(f"Created {len(preference_data)} preference pairs")

In [None]:
# Preview preference data
if preference_data:
    example = preference_data[0]
    print("Example preference pair:")
    print(f"Prompt: {example['prompt']}")
    print(f"Chosen: {example['chosen'][:150]}...")
    print(f"Rejected: {example['rejected'][:150]}...")

In [None]:
class RewardModel(nn.Module):
    """Reward model for RLHF."""
    
    def __init__(self, base_model_name, device="cuda"):
        super().__init__()
        self.device = device
        
        # Load a fresh copy of the base model
        self.base = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        
        # Freeze base model
        for param in self.base.parameters():
            param.requires_grad = False
        
        # Add reward head. Keep it in float32: it is the only trained part,
        # and AdamW updates on float16 weights are numerically fragile.
        hidden_size = self.base.config.hidden_size
        self.reward_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Linear(hidden_size // 2, 1)
        ).to(device).float()
    
    def forward(self, input_ids, attention_mask=None):
        # The base is frozen, so skip building its autograd graph
        with torch.no_grad():
            outputs = self.base(
                input_ids,
                attention_mask=attention_mask,
                output_hidden_states=True
            )
        
        hidden = outputs.hidden_states[-1]
        
        if attention_mask is not None:
            # Index of the last non-padding token (assumes right padding)
            seq_lengths = attention_mask.sum(dim=1) - 1
            batch_size = hidden.shape[0]
            last_hidden = hidden[torch.arange(batch_size, device=hidden.device), seq_lengths]
        else:
            last_hidden = hidden[:, -1, :]
        
        # Cast the float16 hidden state to match the float32 head
        reward = self.reward_head(last_hidden.float()).squeeze(-1)
        return reward

# Initialize reward model
print("Initializing reward model...")
reward_model = RewardModel(MODEL_NAME)
print("Reward model initialized!")

In [None]:
class PreferenceDataset(Dataset):
    """Dataset for reward model training."""
    
    def __init__(self, preference_data, tokenizer, max_length=256):
        self.data = preference_data
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        
        chosen_text = f"Question: {item['prompt']}\n\nAnswer: {item['chosen']}"
        chosen_enc = self.tokenizer(
            chosen_text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt"
        )
        
        rejected_text = f"Question: {item['prompt']}\n\nAnswer: {item['rejected']}"
        rejected_enc = self.tokenizer(
            rejected_text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt"
        )
        
        return {
            "chosen_ids": chosen_enc["input_ids"].squeeze(),
            "chosen_mask": chosen_enc["attention_mask"].squeeze(),
            "rejected_ids": rejected_enc["input_ids"].squeeze(),
            "rejected_mask": rejected_enc["attention_mask"].squeeze(),
        }

rm_dataset = PreferenceDataset(preference_data, tokenizer)
rm_dataloader = DataLoader(rm_dataset, batch_size=config.rm_batch_size, shuffle=True)
print(f"Reward model dataset: {len(rm_dataset)} pairs")

In [None]:
# Train reward model
rm_optimizer = AdamW(reward_model.reward_head.parameters(), lr=config.rm_lr)
reward_model.train()

print("Training reward model...")
for epoch in range(config.rm_epochs):
    total_loss = 0
    total_correct = 0
    total_pairs = 0
    
    for batch in tqdm(rm_dataloader, desc=f"RM Epoch {epoch+1}/{config.rm_epochs}"):
        chosen_ids = batch["chosen_ids"].to(reward_model.device)
        chosen_mask = batch["chosen_mask"].to(reward_model.device)
        rejected_ids = batch["rejected_ids"].to(reward_model.device)
        rejected_mask = batch["rejected_mask"].to(reward_model.device)
        
        r_chosen = reward_model(chosen_ids, chosen_mask)
        r_rejected = reward_model(rejected_ids, rejected_mask)
        
        # Bradley-Terry loss
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()
        
        rm_optimizer.zero_grad()
        loss.backward()
        rm_optimizer.step()
        
        total_loss += loss.item()
        total_correct += (r_chosen > r_rejected).sum().item()
        total_pairs += len(r_chosen)
    
    accuracy = total_correct / total_pairs if total_pairs > 0 else 0
    avg_loss = total_loss / len(rm_dataloader)
    print(f"Epoch {epoch+1}: Loss = {avg_loss:.4f}, Accuracy = {accuracy:.2%}")

print("Reward model training complete!")

---
# Part 3: PPO Training

Now we use RL to optimize the policy based on the reward model.
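
The core of PPO is the clipped surrogate objective: the probability ratio between the updated and the old policy is clipped to `[1 - clip_range, 1 + clip_range]`, and the more pessimistic of the clipped and unclipped terms is used as the loss. Below is a single-token sketch in plain Python; the function name and the numbers are assumptions for illustration.

```python
import math


def clipped_surrogate_loss(new_logp, old_logp, advantage, clip_range=0.2):
    """PPO clipped policy loss for one token (lower is better)."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1 - clip_range, min(1 + clip_range, ratio))
    # Taking the max of the two losses keeps the more pessimistic estimate
    return max(-advantage * ratio, -advantage * clipped_ratio)


# Hypothetical case: the new policy made a good token 1.5x more likely
loss_up = clipped_surrogate_loss(math.log(1.5), 0.0, advantage=1.0)
# Hypothetical case: the new policy made a bad token 1.5x more likely
loss_bad = clipped_surrogate_loss(math.log(1.5), 0.0, advantage=-1.0)
print(loss_up, loss_bad)
```

In the first case the gain is capped at the clipped ratio, so there is no incentive to push the ratio beyond `1 + clip_range` in a single update. In the second case the unclipped term is kept, so the policy is still fully penalized for moving in the wrong direction.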

In [None]:
# Create reference policy: a frozen copy of the SFT model, so the KL
# penalty measures drift from the PPO starting point
import copy

print("Creating reference policy...")
ref_policy = copy.deepcopy(model)
ref_policy.eval()
for param in ref_policy.parameters():
    param.requires_grad = False
print("Reference policy created!")

In [None]:
class SimplePPOTrainer:
    """
    Simplified PPO trainer for educational purposes.
    Focuses on clarity over efficiency.
    """
    
    def __init__(self, policy, ref_policy, reward_model, tokenizer, config):
        self.policy = policy
        self.ref_policy = ref_policy
        self.reward_model = reward_model
        self.tokenizer = tokenizer
        self.config = config
        
        self.optimizer = AdamW(
            [p for p in self.policy.parameters() if p.requires_grad],
            lr=config.ppo_lr
        )
        self.kl_coef = config.kl_coef
    
    def train_step(self, prompts: List[str]) -> Dict:
        """Complete training step: generate, compute rewards, update."""
        
        self.policy.eval()
        batch_data = []
        
        # Generate responses and collect data
        for prompt in prompts:
            formatted = f"Question: {prompt}\n\nAnswer:"
            inputs = self.tokenizer(formatted, return_tensors="pt").to(self.policy.device)
            
            with torch.no_grad():
                # Generate with policy
                outputs = self.policy.generate(
                    **inputs,
                    max_new_tokens=self.config.max_new_tokens,
                    temperature=self.config.temperature,
                    do_sample=True,
                    return_dict_in_generate=True,
                    output_scores=True,
                    pad_token_id=self.tokenizer.pad_token_id
                )
                
                prompt_len = inputs['input_ids'].shape[1]
                response_ids = outputs.sequences[0, prompt_len:]
                
                # Log probs of the sampled tokens under the current policy.
                # Recomputed with a forward pass because generate()'s `scores`
                # are already temperature-scaled (dividing them again would
                # apply the temperature twice)
                policy_logits = self.policy(outputs.sequences).logits[0, prompt_len-1:-1]
                policy_logp = F.log_softmax(policy_logits.float() / self.config.temperature, dim=-1)
                log_probs = [
                    policy_logp[i, response_ids[i]].item()
                    for i in range(len(response_ids))
                ]
                
                # Reference log probs (float32 log_softmax avoids log(0) in float16)
                ref_outputs = self.ref_policy(outputs.sequences)
                ref_logits = ref_outputs.logits[0, prompt_len-1:-1]
                ref_logp = F.log_softmax(ref_logits.float() / self.config.temperature, dim=-1)
                ref_log_probs = [
                    ref_logp[i, response_ids[i]].item()
                    for i in range(min(len(response_ids), len(ref_logp)))
                ]
                
                # Compute reward
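                # Drop special tokens so the text matches what the reward model was trained on
                full_text = self.tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)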
                reward_input = self.tokenizer(full_text, return_tensors="pt").to(self.reward_model.device)
                reward = self.reward_model(reward_input['input_ids'], reward_input['attention_mask']).item()
                
                response = self.tokenizer.decode(response_ids, skip_special_tokens=True)
            
            batch_data.append({
                "prompt": prompt,
                "response": response,
                "log_probs": log_probs,
                "ref_log_probs": ref_log_probs,
                "reward": reward
            })
        
        # Compute advantages
        rewards = []
        advantages = []
        for data in batch_data:
            n_tokens = min(len(data["log_probs"]), len(data["ref_log_probs"]))
            if n_tokens == 0:
                continue
                
            token_rewards = []
            for t in range(n_tokens):
                kl_penalty = -self.kl_coef * (data["log_probs"][t] - data["ref_log_probs"][t])
                r = kl_penalty
                if t == n_tokens - 1:
                    r += data["reward"]
                token_rewards.append(r)
            
            # Simple returns (no value function)
            returns = []
            cumulative = 0
            for r in reversed(token_rewards):
                cumulative = r + 0.99 * cumulative
                returns.insert(0, cumulative)
            
            rewards.append(sum(token_rewards))
            advantages.append(returns)
        
        # PPO update
        self.policy.train()
        total_loss = 0
        total_kl = 0
        
        for _ in range(self.config.ppo_epochs):
            for i, data in enumerate(batch_data):
                if i >= len(advantages) or len(advantages[i]) == 0:
                    continue
                
                formatted = f"Question: {data['prompt']}\n\nAnswer: {data['response']}"
                inputs = self.tokenizer(formatted, return_tensors="pt").to(self.policy.device)
                
                outputs = self.policy(**inputs)
                logits = outputs.logits
                
                prompt_formatted = f"Question: {data['prompt']}\n\nAnswer:"
                prompt_len = len(self.tokenizer(prompt_formatted)['input_ids'])
                response_ids = inputs['input_ids'][0, prompt_len:]
                
                n_tokens = min(len(response_ids), len(data["log_probs"]), len(advantages[i]))
                if n_tokens == 0:
                    continue
                
                # Log probs of the same tokens under the updated policy (float32 for stability)
                step_logits = logits[0, prompt_len - 1 : prompt_len - 1 + n_tokens].float()
                step_logp = F.log_softmax(step_logits / self.config.temperature, dim=-1)
                new_log_probs = step_logp[torch.arange(n_tokens, device=step_logp.device), response_ids[:n_tokens]]
                
                old_log_probs = torch.tensor(data["log_probs"][:n_tokens], device=self.policy.device)
                advs = torch.tensor(advantages[i][:n_tokens], device=self.policy.device, dtype=torch.float32)
                
                # Normalize (the std of a single element is undefined, so only center it)
                if n_tokens > 1:
                    advs = (advs - advs.mean()) / (advs.std() + 1e-8)
                else:
                    advs = advs - advs.mean()
                
                # PPO clipped objective
                ratio = torch.exp(new_log_probs - old_log_probs)
                pg_loss1 = -advs * ratio
                pg_loss2 = -advs * torch.clamp(ratio, 1 - self.config.clip_range, 1 + self.config.clip_range)
                pg_loss = torch.max(pg_loss1, pg_loss2).mean()
                
                self.optimizer.zero_grad()
                pg_loss.backward()
                torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
                self.optimizer.step()
                
                total_loss += pg_loss.item()
                with torch.no_grad():
                    total_kl += (old_log_probs - new_log_probs).mean().item()
        
        n_updates = len(batch_data) * self.config.ppo_epochs
        avg_kl = total_kl / max(n_updates, 1)
        
        # Adaptive KL
        if avg_kl > self.config.target_kl * 1.5:
            self.kl_coef *= 1.2
        elif avg_kl < self.config.target_kl / 1.5:
            self.kl_coef /= 1.2
        self.kl_coef = max(0.01, min(1.0, self.kl_coef))
        
        return {
            "mean_reward": np.mean(rewards) if rewards else 0,
            "pg_loss": total_loss / max(n_updates, 1),
            "kl": avg_kl,
            "kl_coef": self.kl_coef,
            "n_responses": len(batch_data)
        }

In [None]:
# Initialize PPO trainer
ppo_trainer = SimplePPOTrainer(
    policy=model,
    ref_policy=ref_policy,
    reward_model=reward_model,
    tokenizer=tokenizer,
    config=config
)

# Training prompts
rl_prompts = TRAINING_PROMPTS * 3  # Repeat the list so a sampled batch can contain the same prompt more than once

In [None]:
# PPO Training Loop
print("Starting PPO training...")
print("=" * 60)

training_history = []

for iteration in range(config.ppo_iterations):
    # Sample batch
    batch_prompts = random.sample(rl_prompts, min(config.ppo_batch_size, len(rl_prompts)))
    
    # Training step
    metrics = ppo_trainer.train_step(batch_prompts)
    training_history.append(metrics)
    
    # Logging
    print(f"Iteration {iteration+1}/{config.ppo_iterations}: "
          f"Reward={metrics['mean_reward']:.3f}, "
          f"KL={metrics['kl']:.4f}, "
          f"Loss={metrics['pg_loss']:.4f}, "
          f"KL_coef={metrics['kl_coef']:.4f}")

print("=" * 60)
print("PPO training complete!")

---
# Evaluation

In [None]:
# Plot training progress
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Reward
rewards = [h['mean_reward'] for h in training_history]
axes[0].plot(rewards)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Mean Reward')
axes[0].set_title('Reward During Training')

# KL divergence
kls = [h['kl'] for h in training_history]
axes[1].plot(kls)
axes[1].axhline(y=config.target_kl, color='r', linestyle='--', label='Target KL')
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('KL Divergence')
axes[1].set_title('KL Divergence During Training')
axes[1].legend()

# Loss
losses = [h['pg_loss'] for h in training_history]
axes[2].plot(losses)
axes[2].set_xlabel('Iteration')
axes[2].set_ylabel('Policy Loss')
axes[2].set_title('Policy Loss During Training')

plt.tight_layout()
plt.show()

In [None]:
# Compare responses before and after training
EVAL_PROMPTS = [
    "What is the difference between a list and a tuple in Python?",
    "Explain how convolutional neural networks work.",
    "What is the purpose of regularization?",
]

print("COMPARISON: Reference vs Trained Model")
print("=" * 60)

for prompt in EVAL_PROMPTS:
    print(f"\nQuestion: {prompt}")
    print("-" * 40)
    
    # Reference model
    ref_response = generate_response(ref_policy, tokenizer, prompt)
    print(f"Reference: {ref_response[:200]}...")
    
    # Trained model
    trained_response = generate_response(model, tokenizer, prompt)
    print(f"Trained: {trained_response[:200]}...")
    
    # Get rewards
    with torch.no_grad():
        ref_text = f"Question: {prompt}\n\nAnswer: {ref_response}"
        ref_input = tokenizer(ref_text, return_tensors="pt").to(reward_model.device)
        ref_reward = reward_model(ref_input['input_ids']).item()
        
        train_text = f"Question: {prompt}\n\nAnswer: {trained_response}"
        train_input = tokenizer(train_text, return_tensors="pt").to(reward_model.device)
        train_reward = reward_model(train_input['input_ids']).item()
    
    print(f"\nReward: Reference={ref_reward:.3f}, Trained={train_reward:.3f}")
    print("=" * 60)

In [None]:
# Check for reward hacking
def check_reward_hacking(model, tokenizer, prompts):
    """Check for signs of reward hacking."""
    responses = [generate_response(model, tokenizer, p) for p in prompts]
    
    print("Reward Hacking Check")
    print("-" * 40)
    
    # Repetition
    unique = set(responses)
    repetition = 1 - len(unique) / len(responses)
    print(f"Repetition rate: {repetition:.1%}")
    if repetition > 0.3:
        print("  ⚠️ High repetition - possible mode collapse")
    else:
        print("  ✓ Low repetition")
    
    # Length
    lengths = [len(r) for r in responses]
    print(f"Average length: {np.mean(lengths):.0f} chars (std: {np.std(lengths):.0f})")
    if np.mean(lengths) > 300:
        print("  ⚠️ Long responses - possible length gaming")
    else:
        print("  ✓ Reasonable length")
    
    print("-" * 40)
    return responses

test_prompts = [
    "What is a variable?",
    "Explain functions.",
    "What is a loop?",
    "What is an array?",
    "What is recursion?",
]

_ = check_reward_hacking(model, tokenizer, test_prompts)

---
# Summary

**What you've built:**
1. SFT model that understands Q&A format
2. Reward model that predicts response quality
3. PPO-trained policy optimized for the reward

**Key insights:**
- SFT teaches format; RL teaches quality
- The KL penalty keeps the policy close to the reference model, which limits (but does not eliminate) reward hacking
- Reward models are imperfect proxies
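
The KL penalty enters training as a per-token reward adjustment: each token is charged `kl_coef * (log p_policy - log p_ref)`, and the reward model's score is added on the final token. The sketch below mirrors the shaping loop in the Part 3 trainer; the function name and the example log probabilities are assumptions for illustration.

```python
def kl_shaped_rewards(policy_logps, ref_logps, final_reward, kl_coef=0.1):
    """Per-token rewards: a KL penalty everywhere, plus the score on the last token."""
    rewards = [-kl_coef * (p - q) for p, q in zip(policy_logps, ref_logps)]
    if rewards:
        rewards[-1] += final_reward
    return rewards


# Hypothetical log probs: the policy likes the first token more than the reference does
shaped = kl_shaped_rewards(
    policy_logps=[-1.0, -0.5],
    ref_logps=[-1.5, -0.5],
    final_reward=2.0,
)
print(shaped)
```

A token the policy has made more likely than the reference did gets a negative reward, so a high reward-model score has to outweigh the accumulated drift before an update is worthwhile.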

**Next steps:**
- Try larger models (7B+)
- Use real human preferences
- Explore DPO as an alternative

Learn more at [rlbook.ai/chapters/rlhf](https://rlbook.ai/chapters/rlhf)