# Reward Modeling: Teaching AI to Understand Human Preferences

The reward model is the "judge" in RLHF - it learns to score responses based on human preferences.

## What You'll Learn

By the end of this notebook, you'll understand:
- The essay grader analogy: how reward models learn preferences
- The Bradley-Terry model: math behind preference learning
- Building a reward model from scratch
- Common pitfalls and how to avoid them
- Using TRL for reward model training

**Prerequisites:** Notebook 1 (What is RLHF)

**Time:** ~30 minutes

---
## The Big Picture: The Essay Grader Analogy

```
    ┌────────────────────────────────────────────────────────────────┐
    │          THE ESSAY GRADER ANALOGY                              │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  Imagine training a teaching assistant to grade essays...     │
    │                                                                │
    │  THE PROBLEM:                                                 │
    │    Professor can't grade 10,000 essays personally.            │
    │    But they CAN compare pairs of essays:                      │
    │      "Essay A is better than Essay B"                        │
    │                                                                │
    │  THE SOLUTION (Reward Model):                                 │
    │    Train a TA to grade like the professor would!              │
    │                                                                │
    │    Step 1: Professor compares 1000 essay pairs               │
    │    Step 2: TA learns from these comparisons                   │
    │    Step 3: TA can now grade 10,000 essays automatically!      │
    │                                                                │
    │  IN RLHF:                                                     │
    │    Professor = Human annotators                               │
    │    Essays = LLM responses                                     │
    │    TA = Reward Model                                         │
    │                                                                │
    │  The reward model learns: RM(prompt, response) → score       │
    │  Higher score = More human-preferred response                 │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
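
To make the analogy concrete, each comparison made by the "professor" can be stored as one record holding a prompt and two responses. The sketch below uses a hypothetical `PreferencePair` dataclass. The class name, field names, and example texts are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison: the annotator preferred `chosen` over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


# Illustrative examples only; real datasets contain thousands of such pairs.
pairs = [
    PreferencePair(
        prompt="How do I learn Python?",
        chosen="Start with the basics: variables, loops, and functions.",
        rejected="Just read the source code of CPython.",
    ),
    PreferencePair(
        prompt="What is the capital of France?",
        chosen="The capital of France is Paris.",
        rejected="I'm not sure, maybe Lyon?",
    ),
]

for pair in pairs:
    # The reward model should learn RM(prompt, chosen) > RM(prompt, rejected).
    print(f"{pair.prompt!r}: prefer {pair.chosen!r} over {pair.rejected!r}")
```

Note that no record carries an absolute grade. The only supervision is which of the two responses won.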

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Circle

# Visualize the reward model concept
fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('Reward Model: Learning to Score Responses', fontsize=16, fontweight='bold')

# Input: Prompt + Response
input_box = FancyBboxPatch((1, 5.5), 4, 3, boxstyle="round,pad=0.1",
                            facecolor='#e3f2fd', edgecolor='#1976d2', linewidth=3)
ax.add_patch(input_box)
ax.text(3, 7.8, 'INPUT', ha='center', fontsize=12, fontweight='bold', color='#1976d2')
ax.text(3, 7, 'Prompt + Response', ha='center', fontsize=10)
ax.text(3, 6.2, '"How do I learn Python?"', ha='center', fontsize=9, style='italic')
ax.text(3, 5.8, '"Start with basics..."', ha='center', fontsize=9, style='italic')

# Reward Model
rm_box = FancyBboxPatch((6, 5), 3, 4, boxstyle="round,pad=0.1",
                         facecolor='#fff3e0', edgecolor='#f57c00', linewidth=3)
ax.add_patch(rm_box)
ax.text(7.5, 8.2, 'REWARD', ha='center', fontsize=12, fontweight='bold', color='#f57c00')
ax.text(7.5, 7.6, 'MODEL', ha='center', fontsize=12, fontweight='bold', color='#f57c00')
ax.text(7.5, 6.5, 'Transformer\n+ Reward Head', ha='center', fontsize=10)
ax.text(7.5, 5.5, 'Trained on human\npreferences', ha='center', fontsize=9, style='italic')

# Output: Score
output_box = FancyBboxPatch((10.5, 6), 2.5, 2, boxstyle="round,pad=0.1",
                             facecolor='#c8e6c9', edgecolor='#388e3c', linewidth=3)
ax.add_patch(output_box)
ax.text(11.75, 7.5, 'OUTPUT', ha='center', fontsize=12, fontweight='bold', color='#388e3c')
ax.text(11.75, 6.8, 'Score: 7.3', ha='center', fontsize=11, fontweight='bold')
ax.text(11.75, 6.3, '(Scalar)', ha='center', fontsize=9)

# Arrows
ax.annotate('', xy=(5.9, 7), xytext=(5.1, 7),
            arrowprops=dict(arrowstyle='->', lw=3, color='#666'))
ax.annotate('', xy=(10.4, 7), xytext=(9.1, 7),
            arrowprops=dict(arrowstyle='->', lw=3, color='#666'))

# Scale explanation
ax.text(7, 2.5, 'Score Interpretation (illustrative scale; real RM scores are unbounded):', ha='center', fontsize=11, fontweight='bold')
scale_items = [
    ('High (8-10)', '#4caf50', 'Excellent, preferred by humans'),
    ('Medium (5-7)', '#ff9800', 'Acceptable, but could be better'),
    ('Low (1-4)', '#f44336', 'Poor, humans would reject'),
]
for i, (label, color, desc) in enumerate(scale_items):
    ax.text(4 + i*3.5, 1.5, label, ha='center', fontsize=10, fontweight='bold', color=color)
    ax.text(4 + i*3.5, 1, desc, ha='center', fontsize=8, color='#666')

plt.tight_layout()
plt.show()

print("\nREWARD MODEL SUMMARY:")
print("  Input: (prompt, response) pair")
print("  Output: Single scalar score")
print("  Training: Learn from human preference comparisons")

---
## The Bradley-Terry Model: Learning from Comparisons

```
    ┌────────────────────────────────────────────────────────────────┐
    │              THE BRADLEY-TERRY MODEL                           │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  THE KEY INSIGHT:                                             │
    │    We don't need ABSOLUTE scores, just RELATIVE preferences!  │
    │                                                                │
    │  PROBABILITY MODEL:                                           │
    │    Given responses A and B, probability A is preferred:       │
    │                                                                │
    │    P(A > B) = exp(RM(A)) / (exp(RM(A)) + exp(RM(B)))          │
    │             = σ(RM(A) - RM(B))                                 │
    │                                                                │
    │    where σ is the sigmoid function                            │
    │                                                                │
    │  TRAINING LOSS:                                               │
    │    We have pairs where humans chose A over B                  │
    │    We want to maximize P(A > B)                               │
    │                                                                │
    │    Loss = -log P(chosen > rejected)                           │
    │         = -log σ(RM(chosen) - RM(rejected))                   │
    │                                                                │
    │  INTUITION:                                                   │
    │    If RM(chosen) >> RM(rejected) → Loss is small             │
    │    If RM(chosen) ≈ RM(rejected) → Loss ≈ log 2 ≈ 0.69        │
    │    If RM(chosen) << RM(rejected) → Loss is very large!       │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
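
The two forms of the probability in the box, and the claim that only relative scores matter, can be checked numerically. This is a minimal sketch using only the standard library. The function names and the example scores are assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bt_prob_softmax(r_a: float, r_b: float) -> float:
    """P(A > B) written as exp(r_a) / (exp(r_a) + exp(r_b))."""
    return math.exp(r_a) / (math.exp(r_a) + math.exp(r_b))


def bt_prob_sigmoid(r_a: float, r_b: float) -> float:
    """The same probability written as sigmoid(r_a - r_b)."""
    return sigmoid(r_a - r_b)


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Loss = -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))


r_a, r_b = 2.0, 0.5

# Both forms of the probability agree.
print(bt_prob_softmax(r_a, r_b), bt_prob_sigmoid(r_a, r_b))

# Adding the same constant to both scores leaves the probability unchanged,
# which is why only score differences matter.
shift = 10.0
print(bt_prob_sigmoid(r_a + shift, r_b + shift))

# Equal scores give P = 0.5, so the loss equals log 2.
print(bt_loss(1.0, 1.0), math.log(2))
```

Because the probability is invariant to a common shift, a trained reward model's scores have no meaningful zero point. Only differences between responses to the same prompt are interpretable.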

In [None]:
# Demonstrate the Bradley-Terry model

def bradley_terry_prob(score_a, score_b):
    """
    Probability that A is preferred over B.
    P(A > B) = σ(score_A - score_B)
    """
    return torch.sigmoid(score_a - score_b)


def reward_model_loss(score_chosen, score_rejected):
    """
    Bradley-Terry ranking loss.
    
    Loss = -log σ(score_chosen - score_rejected)
    
    Args:
        score_chosen: RM score for human-preferred response
        score_rejected: RM score for rejected response
    
    Returns:
        Mean loss over the batch
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()


# Demonstrate with examples
print("BRADLEY-TERRY MODEL DEMONSTRATION")
print("="*60)

scenarios = [
    ("RM clearly prefers chosen", torch.tensor(8.0), torch.tensor(2.0)),
    ("RM slightly prefers chosen", torch.tensor(5.5), torch.tensor(5.0)),
    ("RM is uncertain", torch.tensor(5.0), torch.tensor(5.0)),
    ("RM prefers rejected (wrong!)", torch.tensor(3.0), torch.tensor(7.0)),
]

print("\nScenario Analysis:")
print("-"*60)
print(f"{'Scenario':<35} {'Chosen':>8} {'Rejected':>8} {'P(C>R)':>8} {'Loss':>8}")
print("-"*60)

for name, chosen, rejected in scenarios:
    prob = bradley_terry_prob(chosen, rejected).item()
    loss = reward_model_loss(chosen, rejected).item()
    print(f"{name:<35} {chosen.item():>8.1f} {rejected.item():>8.1f} {prob:>8.1%} {loss:>8.3f}")

print("\n" + "="*60)
print("KEY INSIGHT: The loss is low when RM agrees with human preference!")

In [None]:
# Visualize the loss function

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Probability curve
ax1 = axes[0]
score_diff = np.linspace(-6, 6, 100)
prob = 1 / (1 + np.exp(-score_diff))  # Sigmoid

ax1.plot(score_diff, prob, 'b-', linewidth=3)
ax1.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

ax1.fill_between(score_diff, prob, 0.5, where=(score_diff > 0), alpha=0.3, color='green', label='RM prefers chosen')
ax1.fill_between(score_diff, prob, 0.5, where=(score_diff < 0), alpha=0.3, color='red', label='RM prefers rejected')

ax1.set_xlabel('RM(chosen) - RM(rejected)', fontsize=11)
ax1.set_ylabel('P(chosen > rejected)', fontsize=11)
ax1.set_title('Bradley-Terry Probability', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right: Loss curve
ax2 = axes[1]
loss = -np.log(prob + 1e-10)  # -log(sigmoid(diff))

ax2.plot(score_diff, loss, 'r-', linewidth=3)
ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

# Annotate regions
ax2.annotate('Low loss\n(RM correct)', xy=(3, 0.1), fontsize=10, ha='center', color='green')
ax2.annotate('High loss\n(RM wrong!)', xy=(-3, 3), fontsize=10, ha='center', color='red')

ax2.set_xlabel('RM(chosen) - RM(rejected)', fontsize=11)
ax2.set_ylabel('Loss', fontsize=11)
ax2.set_title('Bradley-Terry Loss', fontsize=14, fontweight='bold')
ax2.set_ylim(0, 6)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nTRAINING OBJECTIVE:")
print("  The RM learns to make score(chosen) > score(rejected)")
print("  Larger margin = lower loss = more confident")
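
The shape of the loss curve follows from its slope. As a short derivation, using the shorthand $\Delta$ for the score difference (notation introduced here for convenience):

$$
\Delta = RM(\text{chosen}) - RM(\text{rejected}), \qquad
\frac{d}{d\Delta}\bigl[-\log \sigma(\Delta)\bigr]
= -\frac{\sigma(\Delta)\,\bigl(1-\sigma(\Delta)\bigr)}{\sigma(\Delta)}
= -\bigl(1-\sigma(\Delta)\bigr)
= -\sigma(-\Delta)
$$

The gradient magnitude $\sigma(-\Delta)$ approaches 1 when the RM ranks the pair the wrong way ($\Delta \ll 0$), equals 0.5 at $\Delta = 0$, and decays toward 0 as the margin grows. Pairs the RM already ranks correctly with a wide margin therefore contribute little further update.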

---
## Building a Reward Model from Scratch

```
    ┌────────────────────────────────────────────────────────────────┐
    │              REWARD MODEL ARCHITECTURE                         │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  STRUCTURE:                                                    │
    │    RM = Transformer Backbone + Reward Head                    │
    │                                                                │
    │    ┌───────────────────────────┐                              │
    │    │  Last-token hidden state  │                              │
    │    └───────────┬───────────────┘                              │
    │                │                                               │
    │    ┌───────────▼───────────────┐                              │
    │    │     Linear Layer          │  (hidden_dim → 1)            │
    │    └───────────┬───────────────┘                              │
    │                │                                               │
    │    ┌───────────▼───────────────┐                              │
    │    │     Scalar Output         │  (no activation!)            │
    │    └───────────────────────────┘                              │
    │                                                                │
    │  OPTIONS FOR BASE MODEL:                                      │
    │    • Same architecture as SFT model (common)                  │
    │    • Smaller model (faster)                                   │
    │    • Initialized from SFT checkpoint (better!)                │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
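
For decoder-only backbones such as GPT-2, the vector fed to the reward head is commonly the hidden state of the last non-padding token. The sketch below shows that pooling step and the linear head in plain Python, assuming right-padded inputs. The helper names, the toy hidden states, and the weights are made up for illustration. A real model would do this with tensors.

```python
from typing import Sequence


def last_token_index(attention_mask: Sequence[int]) -> int:
    """Index of the last real (non-padding) token, assuming right padding."""
    return sum(attention_mask) - 1


def reward_head(hidden: Sequence[float], weight: Sequence[float], bias: float) -> float:
    """Linear layer hidden_dim -> 1 with no activation, so the score is unbounded."""
    return sum(h * w for h, w in zip(hidden, weight)) + bias


# Toy "backbone output": 4 token positions, hidden_dim = 3 (made-up numbers).
hidden_states = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.0, 0.0, 0.0],  # padding position
]
attention_mask = [1, 1, 1, 0]  # 1 = real token, 0 = padding

weight = [1.0, -1.0, 2.0]  # hypothetical learned weights
bias = 0.5

idx = last_token_index(attention_mask)
score = reward_head(hidden_states[idx], weight, bias)
print(f"pooled position: {idx}, reward: {score:.2f}")
```

Leaving the output without a sigmoid or tanh matters: the Bradley-Terry loss applies its own sigmoid to the score difference, so squashing each score first would cap the margin the model can express.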

In [None]:
class SimpleRewardModel(nn.Module):
    """
    A simplified Reward Model for demonstration.
    
    In practice, this would be a transformer model.
    Here we use a simple MLP to show the concept.
    
    Architecture:
        Input embedding → Hidden layers → Scalar output
    """
    
    def __init__(self, input_dim=64, hidden_dim=128):
        super().__init__()
        
        # "Backbone" - in practice, this is a transformer
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
        )
        
        # Reward head - outputs scalar
        self.reward_head = nn.Linear(hidden_dim, 1)
    
    def forward(self, x):
        """
        Forward pass: embedding → scalar reward.
        
        Args:
            x: Input embedding (batch_size, input_dim)
            
        Returns:
            Scalar reward for each input (batch_size, 1)
        """
        features = self.backbone(x)
        reward = self.reward_head(features)
        return reward


# Create and inspect the model
print("SIMPLE REWARD MODEL ARCHITECTURE")
print("="*60)

rm = SimpleRewardModel(input_dim=64, hidden_dim=128)
print(rm)

# Count parameters
total_params = sum(p.numel() for p in rm.parameters())
print(f"\nTotal parameters: {total_params:,}")

# Test forward pass
test_input = torch.randn(4, 64)  # Batch of 4 embeddings
rewards = rm(test_input)
print(f"\nInput shape: {test_input.shape}")
print(f"Output shape: {rewards.shape}")
print(f"Sample rewards: {rewards.squeeze().tolist()}")
print("="*60)

In [None]:
def train_reward_model(rm, preference_data, epochs=100, lr=0.001):
    """
    Train the reward model on preference data.
    
    Args:
        rm: Reward model
        preference_data: Dict with 'chosen' and 'rejected' embeddings
        epochs: Number of training epochs
        lr: Learning rate
    
    Returns:
        Training history (losses, accuracies)
    """
    optimizer = optim.Adam(rm.parameters(), lr=lr)
    
    losses = []
    accuracies = []
    
    chosen = preference_data['chosen']
    rejected = preference_data['rejected']
    
    for epoch in range(epochs):
        optimizer.zero_grad()
        
        # Get rewards for both
        reward_chosen = rm(chosen)
        reward_rejected = rm(rejected)
        
        # Bradley-Terry loss
        loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
        
        # Compute accuracy
        accuracy = (reward_chosen > reward_rejected).float().mean().item()
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        losses.append(loss.item())
        accuracies.append(accuracy)
    
    return {'losses': losses, 'accuracies': accuracies}


# Create synthetic preference data
print("TRAINING REWARD MODEL")
print("="*60)

# Generate data where "chosen" has a clear pattern
n_samples = 500
torch.manual_seed(42)  # torch.randn and model init use PyTorch's RNG, not NumPy's

# Chosen responses have higher values in first few dimensions
chosen_embeddings = torch.randn(n_samples, 64)
chosen_embeddings[:, :5] += 2  # Add "quality signal"

# Rejected responses have lower values
rejected_embeddings = torch.randn(n_samples, 64)
rejected_embeddings[:, :5] -= 1  # Lower quality

preference_data = {
    'chosen': chosen_embeddings,
    'rejected': rejected_embeddings
}

print(f"Training samples: {n_samples}")
print(f"Embedding dimension: 64")
print("\nTraining...")

# Train
rm = SimpleRewardModel(input_dim=64, hidden_dim=128)
history = train_reward_model(rm, preference_data, epochs=200, lr=0.01)

print(f"\nFinal loss: {history['losses'][-1]:.4f}")
print(f"Final accuracy: {history['accuracies'][-1]:.1%}")
print("="*60)

In [None]:
# Visualize training

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Loss curve
ax1 = axes[0]
ax1.plot(history['losses'], color='#f57c00', linewidth=2)
ax1.set_xlabel('Epoch', fontsize=11)
ax1.set_ylabel('Loss', fontsize=11)
ax1.set_title('Reward Model Training Loss', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Right: Accuracy curve
ax2 = axes[1]
ax2.plot(history['accuracies'], color='#4caf50', linewidth=2)
ax2.axhline(y=0.5, color='gray', linestyle='--', label='Random (50%)')
ax2.set_xlabel('Epoch', fontsize=11)
ax2.set_ylabel('Accuracy', fontsize=11)
ax2.set_title('Preference Prediction Accuracy', fontsize=14, fontweight='bold')
ax2.set_ylim(0.4, 1.0)
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nThe reward model learned to rank chosen above rejected on this synthetic training set!")

In [None]:
# Visualize learned reward distribution

fig, ax = plt.subplots(figsize=(10, 6))

with torch.no_grad():
    chosen_rewards = rm(chosen_embeddings).numpy().flatten()
    rejected_rewards = rm(rejected_embeddings).numpy().flatten()

ax.hist(chosen_rewards, bins=30, alpha=0.7, label='Chosen (Preferred)', color='#4caf50', density=True)
ax.hist(rejected_rewards, bins=30, alpha=0.7, label='Rejected', color='#f44336', density=True)

ax.axvline(x=np.mean(chosen_rewards), color='#388e3c', linestyle='--', linewidth=2, 
           label=f'Chosen mean: {np.mean(chosen_rewards):.2f}')
ax.axvline(x=np.mean(rejected_rewards), color='#d32f2f', linestyle='--', linewidth=2,
           label=f'Rejected mean: {np.mean(rejected_rewards):.2f}')

ax.set_xlabel('Reward Score', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Learned Reward Score Distribution', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nThe distributions are well separated!")
print(f"Mean difference: {np.mean(chosen_rewards) - np.mean(rejected_rewards):.2f}")

---
## Common Pitfalls in Reward Modeling

```
    ┌────────────────────────────────────────────────────────────────┐
    │              REWARD MODEL PITFALLS                             │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  1. REWARD HACKING                                            │
    │     RM learns spurious cues that the policy then exploits      │
    │     Example: Longer responses get higher scores               │
    │     Solution: Length normalization, diverse training data     │
    │                                                                │
    │  2. DISTRIBUTION SHIFT                                        │
    │     RM trained on SFT outputs, but PPO produces different    │
    │     Example: RM never saw adversarial outputs                │
    │     Solution: Iterative training, online data collection     │
    │                                                                │
    │  3. ANNOTATOR DISAGREEMENT                                    │
    │     Humans don't always agree on preferences                 │
    │     Example: Humor is subjective                             │
    │     Solution: Multiple annotators, uncertainty modeling      │
    │                                                                │
    │  4. OVERCONFIDENCE                                           │
    │     RM gives high scores to out-of-distribution inputs       │
    │     Solution: Ensemble RMs, calibration                      │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
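
One possible form of the "uncertainty modeling" mentioned for annotator disagreement is to train on the vote share rather than a hard winner. The sketch below is an assumption about how that could look, not a prescribed method: it replaces the hard label with the fraction of annotators who preferred A and uses cross-entropy against the Bradley-Terry probability. The function names and the 3-to-2 vote split are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def soft_label_bt_loss(score_a: float, score_b: float, p_a: float) -> float:
    """Cross-entropy between the annotator vote share p_a and sigmoid(score_a - score_b)."""
    diff = score_a - score_b
    return -(p_a * log_sigmoid(diff) + (1.0 - p_a) * log_sigmoid(-diff))


# Five annotators compared the same pair: 3 preferred A, 2 preferred B.
votes_for_a, votes_for_b = 3, 2
p_a = votes_for_a / (votes_for_a + votes_for_b)

# The loss is minimized when the score gap equals the log-odds of the votes.
target_gap = math.log(p_a / (1.0 - p_a))

for gap in (0.0, target_gap, 5.0):
    loss = soft_label_bt_loss(gap, 0.0, p_a)
    print(f"score gap {gap:+.3f} -> loss {loss:.4f}")
```

With `p_a = 1.0` this reduces to the standard loss `-log σ(RM(chosen) - RM(rejected))`. With a split vote, pushing the gap far beyond the vote log-odds is penalized, which discourages overconfident scores on contested pairs.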

In [None]:
# Demonstrate length bias (common pitfall)

print("PITFALL: LENGTH BIAS DEMONSTRATION")
print("="*60)

# Simulate: longer responses get higher scores (BAD!)
def biased_scoring(responses):
    """A biased scoring function that prefers longer responses."""
    return [len(r) * 0.1 for r in responses]

responses = [
    "Paris.",  # Correct but short
    "The capital of France is Paris.",  # Good response
    "The capital of France is Paris. Paris is known as the City of Light and has many famous landmarks including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.",  # Verbose
    "I'm not entirely sure but I think maybe possibly it could be Paris or perhaps some other city I don't really know for certain...",  # Verbose but BAD
]

biased_scores = biased_scoring(responses)

print("\nResponses scored by length-biased RM:")
print("-"*60)
for i, (resp, score) in enumerate(zip(responses, biased_scores)):
    print(f"Response {i+1} (len={len(resp):3d}): Score = {score:.1f}")
    print(f"  '{resp[:60]}...'" if len(resp) > 60 else f"  '{resp}'")
    print()

print("PROBLEM: Response 4 is BAD but outscores the short, correct Responses 1 and 2!")
print("SOLUTION: Normalize by length, train with diverse data.")
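
A simple way to catch this kind of bias before RL training is to measure how strongly scores track response length on held-out data. The sketch below is one such diagnostic, using a hand-written Pearson correlation. The `hypothetical_quality_scores` values are invented for illustration and do not come from a real reward model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


held_out_responses = [
    "Paris.",
    "The capital of France is Paris.",
    "I am not sure, but it could possibly be Paris or maybe some other city.",
]
lengths = [len(r) for r in held_out_responses]

# Scorer 1: the length-biased rule from the demonstration above.
length_biased_scores = [0.1 * n for n in lengths]

# Scorer 2: made-up scores that reward correctness and confidence instead.
hypothetical_quality_scores = [0.7, 1.0, 0.1]

r_biased = pearson(lengths, length_biased_scores)
r_quality = pearson(lengths, hypothetical_quality_scores)

print(f"length vs. biased scores:  r = {r_biased:+.2f}")
print(f"length vs. quality scores: r = {r_quality:+.2f}")
```

A correlation near +1 on a set where long answers are not actually better is a warning sign. Three samples are far too few for a real check, so treat this only as the shape of the diagnostic.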

---
## Using TRL for Reward Model Training

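In practice, you rarely write this training loop by hand. Hugging Face's TRL library provides a `RewardTrainer` that applies the same kind of pairwise loss to a real transformer.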

In [None]:
# Check TRL availability and show example
try:
    from trl import RewardTrainer, RewardConfig
    TRL_AVAILABLE = True
    print("✓ TRL is installed!")
except ImportError:
    TRL_AVAILABLE = False
    print("✗ TRL not installed.")
    print("  Install with: pip install trl transformers datasets")

# Show example code
print("\n" + "="*60)
print("TRL REWARD MODEL TRAINING EXAMPLE")
print("="*60)

example_code = '''
from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

# Load base model (often same architecture as SFT model)
model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=1  # Single scalar output
)
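tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# GPT-2 has no pad token by default; the classification head needs one to handle batches
model.config.pad_token_id = tokenizer.pad_token_id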

# Load preference dataset
# Format: {"chosen": "good response", "rejected": "bad response"}
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1000]")

# Configure training
training_args = RewardConfig(
    output_dir="reward_model",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=1e-5,
    max_length=512,
)

# Create trainer
trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
)

# Train!
trainer.train()

# Save
trainer.save_model("reward_model_final")
'''

print(example_code)
print("="*60)

---
## Summary: Key Takeaways

### Reward Model Concept

| Component | Description |
|-----------|-------------|
| **Input** | (prompt, response) pair |
| **Output** | Scalar reward score |
| **Training** | Bradley-Terry loss on preferences |

### Bradley-Terry Model

```
P(A > B) = σ(RM(A) - RM(B))

Loss = -log σ(RM(chosen) - RM(rejected))
```

### Common Pitfalls

| Pitfall | Solution |
|---------|----------|
| Length bias | Normalize, diverse data |
| Distribution shift | Iterative training |
| Annotator disagreement | Multiple annotators |
| Overconfidence | Ensemble models |

---
## Test Your Understanding

**1. What is the Bradley-Terry model?**
<details>
<summary>Click to reveal answer</summary>
Bradley-Terry is a probability model for pairwise comparisons. It models the probability that response A is preferred over B as:

P(A > B) = σ(RM(A) - RM(B))

where σ is the sigmoid function. The reward model is trained to maximize this probability for human-chosen responses.
</details>

**2. Why do we train on comparisons instead of absolute scores?**
<details>
<summary>Click to reveal answer</summary>
Comparisons are easier for humans to provide! Saying "A is better than B" is more natural than assigning a score like "7.3". Also, comparison data is more consistent across annotators and less prone to calibration issues.
</details>

**3. What is reward hacking and how do we prevent it?**
<details>
<summary>Click to reveal answer</summary>
Reward hacking is when the RL policy exploits weaknesses in the reward model to get high scores without actually being helpful. For example, if the RM prefers longer responses, the policy might generate verbose nonsense.

Prevention: Length normalization, diverse training data, KL penalty during PPO, iterative reward model updates.
</details>

**4. What architecture is typically used for reward models?**
<details>
<summary>Click to reveal answer</summary>
Reward models typically use the same transformer architecture as the language model being aligned, but with a reward head (a linear layer) that outputs a scalar instead of vocabulary logits. They are often initialized from the SFT checkpoint, which gives a better starting point.
</details>

---
## What's Next?

Now that you understand reward modeling, let's see how to use it with PPO!

**Continue to:** [Notebook 3: PPO for Language Models](03_ppo_for_language_models.ipynb)

---

*The reward model is the judge - it learns to distinguish good from bad responses!*