# Part 6.4: PPO and Modern RL

This is the capstone of our RL journey — and it connects directly to how modern AI assistants like ChatGPT and Claude are trained. **Proximal Policy Optimization (PPO)** is the workhorse algorithm behind RLHF, the technique that transforms a raw language model into a helpful, harmless, and honest assistant.

PPO addresses the trust region problem from Notebook 19 with a simple clipped objective that approximates a trust region without constrained optimization. Combined with **Generalized Advantage Estimation (GAE)**, it provides stable, efficient policy optimization that scales from simple control tasks to aligning billion-parameter language models.

## Learning Objectives

- [ ] Derive and implement PPO's clipped surrogate objective
- [ ] Understand why clipping prevents destructively large policy updates
- [ ] Implement Generalized Advantage Estimation (GAE)
- [ ] Build a complete PPO agent from scratch in PyTorch
- [ ] Train PPO on a control task and analyze its behavior
- [ ] Implement a reward model trained on preference data
- [ ] Build the complete RLHF pipeline: SFT → Reward Model → PPO
- [ ] Understand the KL penalty and why it's critical for RLHF
- [ ] Connect the full curriculum: from linear algebra to language model alignment

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import defaultdict
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

np.random.seed(42)
torch.manual_seed(42)

print("Part 6.4: PPO and Modern RL")
print("=" * 50)

---

## 1. The PPO Objective

Recall the problem from Notebook 19: vanilla policy gradients can take destructively large steps, causing the policy to collapse. TRPO solved this with constrained optimization, but it's complex and expensive.

### PPO's Key Insight: Clipping

PPO uses the **probability ratio** between the new and old policies:

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

- $r_t = 1$: New policy same as old
- $r_t > 1$: New policy makes this action *more* likely
- $r_t < 1$: New policy makes this action *less* likely

The **clipped surrogate objective** is:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]$$

where $\epsilon$ (typically 0.2) is the clipping parameter.
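
To make the formula concrete, the sketch below evaluates the clipped objective on four made-up samples. The probabilities and advantages are illustrative values rather than outputs of a trained policy, and `clipped_surrogate` is a hypothetical helper name.

```python
import numpy as np

def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Per-sample PPO clipped surrogate terms (hypothetical helper)."""
    ratio = np.exp(new_log_probs - old_log_probs)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return ratio, np.minimum(unclipped, clipped)

# Illustrative values only: the old policy assigns probability 0.5 to each action
old_log_probs = np.log(np.array([0.50, 0.50, 0.50, 0.50]))
new_log_probs = np.log(np.array([0.75, 0.75, 0.25, 0.25]))  # ratios 1.5, 1.5, 0.5, 0.5
advantages = np.array([1.0, -1.0, 1.0, -1.0])

ratio, terms = clipped_surrogate(new_log_probs, old_log_probs, advantages)
for r, a, t in zip(ratio, advantages, terms):
    print(f"r = {r:.2f}, A = {a:+.1f} -> term = {t:+.2f}")
```

The four terms come out as +1.20, -1.50, +0.50 and -0.80. Clipping caps the term only in the first and last samples, where the ratio has already moved in the direction the advantage favours. In the other two samples the unclipped term is the smaller one and is kept, so moves in the wrong direction are still fully penalized.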

### Why This Works

The min and clip create a "trust region" around the old policy:

- If $\hat{A}_t > 0$ (good action): We want to increase $r_t$, but the clip at $1+\epsilon$ prevents going too far
- If $\hat{A}_t < 0$ (bad action): We want to decrease $r_t$, but the clip at $1-\epsilon$ prevents going too far

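The result: once the ratio has moved past the clip boundary in the direction the advantage favours, the objective stops rewarding further movement, so repeated updates have little incentive to drift far from the policy that collected the data. This avoids much of the instability of unconstrained policy gradients. Note that the clip is a heuristic: unlike TRPO's constrained update, it does not guarantee monotonic improvement.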
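
The two cases above can be checked numerically. The sketch below estimates the derivative of a single clipped term with respect to the ratio using central finite differences. `ppo_term` and `grad_wrt_ratio` are hypothetical helper names, and the ratios are arbitrary test points.

```python
def ppo_term(ratio, advantage, epsilon=0.2):
    """One sample's clipped surrogate term (hypothetical helper)."""
    clipped_ratio = min(max(ratio, 1 - epsilon), 1 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)

def grad_wrt_ratio(ratio, advantage, epsilon=0.2, h=1e-6):
    """Central finite-difference estimate of d(term)/d(ratio)."""
    upper = ppo_term(ratio + h, advantage, epsilon)
    lower = ppo_term(ratio - h, advantage, epsilon)
    return (upper - lower) / (2 * h)

for advantage in (1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        g = grad_wrt_ratio(ratio, advantage)
        print(f"A = {advantage:+.1f}, r = {ratio:.1f} -> d(term)/dr = {g:+.2f}")
```

The derivative vanishes only at (A > 0, r = 1.5) and (A < 0, r = 0.5). Everywhere else it equals the advantage, so the policy can still be pulled back when the ratio has moved in the wrong direction.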

### Visualization: The Clipped Objective

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
epsilon = 0.2
r = np.linspace(0.2, 2.0, 500)

# Case 1: Positive advantage
ax = axes[0]
A = 1.0  # Positive advantage
unclipped = r * A
clipped = np.clip(r, 1 - epsilon, 1 + epsilon) * A
objective = np.minimum(unclipped, clipped)

ax.plot(r, unclipped, 'b--', linewidth=2, label='Unclipped: r·A', alpha=0.6)
ax.plot(r, clipped, 'r--', linewidth=2, label='Clipped: clip(r)·A', alpha=0.6)
ax.plot(r, objective, 'g-', linewidth=3, label='PPO objective: min(·,·)')
ax.axvline(x=1.0, color='gray', linestyle=':', alpha=0.5)
ax.axvline(x=1-epsilon, color='orange', linestyle='--', alpha=0.5, label=f'1-ε = {1-epsilon}')
ax.axvline(x=1+epsilon, color='orange', linestyle='--', alpha=0.5, label=f'1+ε = {1+epsilon}')
ax.fill_between(r, objective, alpha=0.1, color='green')
ax.set_xlabel('Probability ratio r(θ)', fontsize=12)
ax.set_ylabel('Objective', fontsize=12)
ax.set_title('Positive Advantage (A > 0)\n"Good action — increase probability, but not too much"',
             fontsize=12, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Case 2: Negative advantage
ax = axes[1]
A = -1.0  # Negative advantage
unclipped = r * A
clipped = np.clip(r, 1 - epsilon, 1 + epsilon) * A
objective = np.minimum(unclipped, clipped)

ax.plot(r, unclipped, 'b--', linewidth=2, label='Unclipped: r·A', alpha=0.6)
ax.plot(r, clipped, 'r--', linewidth=2, label='Clipped: clip(r)·A', alpha=0.6)
ax.plot(r, objective, 'g-', linewidth=3, label='PPO objective: min(·,·)')
ax.axvline(x=1.0, color='gray', linestyle=':', alpha=0.5)
ax.axvline(x=1-epsilon, color='orange', linestyle='--', alpha=0.5)
ax.axvline(x=1+epsilon, color='orange', linestyle='--', alpha=0.5)
ax.fill_between(r, objective, alpha=0.1, color='green')
ax.set_xlabel('Probability ratio r(θ)', fontsize=12)
ax.set_ylabel('Objective', fontsize=12)
ax.set_title('Negative Advantage (A < 0)\n"Bad action — decrease probability, but not too much"',
             fontsize=12, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

plt.suptitle('PPO Clipped Surrogate Objective (ε = 0.2)', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("Key: the green line (PPO objective) is flat for r > 1+ε when A > 0, and for r < 1-ε when A < 0.")
print("In those regions there is NO gradient incentive to push the policy further in the favoured direction.")

---

## 2. Generalized Advantage Estimation (GAE)

To compute advantages, we need a good estimate. **GAE** provides a tunable tradeoff between bias and variance:

$$\hat{A}_t^{GAE} = \sum_{l=0}^{T-t} (\gamma \lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. Here $r_t$ denotes the reward at step $t$, not the probability ratio $r_t(\theta)$ from Section 1.

The parameter $\lambda \in [0, 1]$ controls the tradeoff:

| λ | Estimate | Bias | Variance |
|---|---------|------|----------|
| 0 | One-step TD: $\delta_t$ | High bias | Low variance |
| 1 | Full Monte Carlo return | No bias | High variance |
| 0.95 | Exponentially weighted mix of multi-step estimates (typical PPO setting) | Some bias | Moderate variance |
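
The two extremes of λ can be verified directly on a tiny trajectory. The sketch below is a simplified GAE for a single segment with no episode terminations. The rewards and value estimates are made-up numbers, and `gae` is a hypothetical helper separate from the fuller implementation in this notebook.

```python
import numpy as np

def gae(rewards, values, next_value, gamma, lam):
    """Backward-recursive GAE for one non-terminating segment (sketch)."""
    values_ext = np.append(values, next_value)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]  # TD errors
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages, deltas

# Illustrative values only
rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 1.0, 1.5])
next_value = 0.8
gamma = 0.9

adv_td, deltas = gae(rewards, values, next_value, gamma, lam=0.0)
adv_mc, _ = gae(rewards, values, next_value, gamma, lam=1.0)

# Discounted returns, bootstrapped from next_value at the end of the segment
disc_returns = np.zeros(len(rewards))
G = next_value
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G
    disc_returns[t] = G

print("lam=0 equals TD errors:       ", np.allclose(adv_td, deltas))
print("lam=1 equals returns - values:", np.allclose(adv_mc, disc_returns - values))
```

With λ = 0 the recursion keeps only the current TD error. With λ = 1 the value terms telescope, leaving the discounted return minus $V(s_t)$.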

In [None]:
def compute_gae(rewards, values, next_value, gamma=0.99, lam=0.95, dones=None):
    """Compute Generalized Advantage Estimation.
    
    Args:
        rewards: list of rewards [r_0, r_1, ..., r_T]
        values: list of value estimates [V(s_0), V(s_1), ..., V(s_T)]
        next_value: V(s_{T+1}) (bootstrap value)
        gamma: discount factor
        lam: GAE parameter (0 = TD, 1 = Monte Carlo)
        dones: list of done flags
    
    Returns:
        advantages: GAE advantage estimates
        returns: advantages + values (targets for value function)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    
    if dones is None:
        dones = [False] * T
    
    # Work backwards
    gae = 0
    for t in reversed(range(T)):
        if t == T - 1:
            next_val = next_value * (1 - dones[t])
        else:
            next_val = values[t + 1] * (1 - dones[t])
        
        # TD error
        delta = rewards[t] + gamma * next_val - values[t]
        
        # GAE: exponentially-weighted sum of TD errors
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    
    returns = advantages + np.array(values)
    return advantages, returns


# Demonstrate GAE with different lambda values
np.random.seed(42)
T = 20
rewards = np.random.randn(T) * 0.5 + 0.5
values = np.cumsum(np.random.randn(T) * 0.3) + 5
next_value = values[-1] + 0.1

fig, ax = plt.subplots(1, 1, figsize=(12, 6))

lambdas = [0.0, 0.5, 0.95, 1.0]
colors = ['#e74c3c', '#f39c12', '#2ecc71', '#3498db']

for lam, color in zip(lambdas, colors):
    advantages, _ = compute_gae(rewards, values, next_value, gamma=0.99, lam=lam)
    ax.plot(advantages, 'o-', color=color, linewidth=2, markersize=4,
            label=f'λ = {lam} (var = {np.var(advantages):.2f})')

ax.axhline(y=0, color='black', linewidth=0.5)
ax.set_xlabel('Time step', fontsize=12)
ax.set_ylabel('Advantage estimate', fontsize=12)
ax.set_title('GAE: Bias-Variance Tradeoff with λ', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("λ=0: Low variance but biased (only uses one-step TD error)")
print("λ=1: Unbiased but high variance (uses full returns)")
print("λ=0.95: Sweet spot — PPO's default, used in practice")

---

## 3. Building PPO from Scratch

Let's build a complete PPO implementation. The algorithm:

1. **Collect** a batch of trajectories using the current policy
2. **Compute** advantages using GAE
3. **Optimize** the clipped surrogate objective for multiple epochs on the same batch
4. **Repeat**

The key innovation: PPO reuses the same batch for **multiple gradient steps**, unlike vanilla policy gradients, which take a single gradient step per batch. The clipping keeps these repeated steps from moving the policy too far from the one that collected the data.

In [None]:
class CartPoleSimple:
    """CartPole environment (reused from Notebooks 18-19)."""
    def __init__(self):
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.total_mass = self.masscart + self.masspole
        self.length = 0.5
        self.polemass_length = self.masspole * self.length
        self.force_mag = 10.0
        self.tau = 0.02
        self.x_threshold = 2.4
        self.theta_threshold = 12 * np.pi / 180
        self.state_dim = 4
        self.n_actions = 2
        self.state = None
    
    def reset(self):
        self.state = np.random.uniform(-0.05, 0.05, size=4)
        return self.state.copy()
    
    def step(self, action):
        x, x_dot, theta, theta_dot = self.state
        force = self.force_mag if action == 1 else -self.force_mag
        cos_theta, sin_theta = np.cos(theta), np.sin(theta)
        temp = (force + self.polemass_length * theta_dot**2 * sin_theta) / self.total_mass
        theta_acc = (self.gravity * sin_theta - cos_theta * temp) / (
            self.length * (4.0/3.0 - self.masspole * cos_theta**2 / self.total_mass))
        x_acc = temp - self.polemass_length * theta_acc * cos_theta / self.total_mass
        # Explicit Euler integration step
        x += self.tau * x_dot
        x_dot += self.tau * x_acc
        theta += self.tau * theta_dot
        theta_dot += self.tau * theta_acc
        self.state = np.array([x, x_dot, theta, theta_dot])
        done = (abs(x) > self.x_threshold or abs(theta) > self.theta_threshold)
        return self.state.copy(), 1.0 if not done else 0.0, done

In [None]:
class PPOActorCritic(nn.Module):
    """Shared actor-critic network for PPO."""
    
    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh()
        )
        self.actor = nn.Linear(hidden_dim, n_actions)
        self.critic = nn.Linear(hidden_dim, 1)
    
    def forward(self, x):
        features = self.shared(x)
        action_logits = self.actor(features)
        value = self.critic(features).squeeze(-1)
        return action_logits, value
    
    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        dist = Categorical(logits=logits)
        action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), value
    
    def evaluate_action(self, state, action):
        """Evaluate a previously taken action (for PPO update)."""
        logits, value = self.forward(state)
        dist = Categorical(logits=logits)
        return dist.log_prob(action), dist.entropy(), value


class PPOAgent:
    """Proximal Policy Optimization agent."""
    
    def __init__(self, state_dim, n_actions, lr=3e-4, gamma=0.99, lam=0.95,
                 clip_epsilon=0.2, value_coef=0.5, entropy_coef=0.01,
                 ppo_epochs=4, mini_batch_size=64, max_grad_norm=0.5):
        
        self.gamma = gamma
        self.lam = lam
        self.clip_epsilon = clip_epsilon
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.ppo_epochs = ppo_epochs
        self.mini_batch_size = mini_batch_size
        self.max_grad_norm = max_grad_norm
        
        self.network = PPOActorCritic(state_dim, n_actions)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
    
    def collect_rollout(self, env, n_steps=2048):
        """Collect a batch of experience from the environment."""
        states, actions, rewards, dones = [], [], [], []
        log_probs, values = [], []
        
        state = env.reset()
        episode_rewards = []
        episode_lengths = []
        current_ep_reward = 0
        current_ep_length = 0
        
        for _ in range(n_steps):
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            
            with torch.no_grad():
                action, log_prob, _, value = self.network.get_action_and_value(state_tensor)
            
            next_state, reward, done = env.step(action.item())
            
            states.append(state)
            actions.append(action.item())
            rewards.append(reward)
            dones.append(float(done))
            log_probs.append(log_prob.item())
            values.append(value.item())
            
            current_ep_reward += reward
            current_ep_length += 1
            
            if done:
                episode_rewards.append(current_ep_reward)
                episode_lengths.append(current_ep_length)
                current_ep_reward = 0
                current_ep_length = 0
                state = env.reset()
            else:
                state = next_state
        
        # Bootstrap value for last state
        with torch.no_grad():
            _, next_value = self.network(torch.FloatTensor(state).unsqueeze(0))
            next_value = next_value.item()
        
        # Compute GAE advantages
        advantages, returns = compute_gae(
            rewards, values, next_value, self.gamma, self.lam, dones
        )
        
        rollout = {
            'states': np.array(states),
            'actions': np.array(actions),
            'log_probs': np.array(log_probs),
            'returns': returns,
            'advantages': advantages,
            'values': np.array(values),
        }
        
        return rollout, episode_rewards, episode_lengths
    
    def update(self, rollout):
        """Perform PPO update on collected rollout."""
        states = torch.FloatTensor(rollout['states'])
        actions = torch.LongTensor(rollout['actions'])
        old_log_probs = torch.FloatTensor(rollout['log_probs'])
        returns = torch.FloatTensor(rollout['returns'])
        advantages = torch.FloatTensor(rollout['advantages'])
        
        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        n_samples = len(states)
        metrics = {'policy_loss': [], 'value_loss': [], 'entropy': [],
                   'approx_kl': [], 'clip_fraction': []}
        
        # Multiple epochs over the same data (the PPO innovation!)
        for epoch in range(self.ppo_epochs):
            # Random mini-batch indices
            indices = np.arange(n_samples)
            np.random.shuffle(indices)
            
            for start in range(0, n_samples, self.mini_batch_size):
                end = start + self.mini_batch_size
                batch_idx = indices[start:end]
                
                # Get current policy's evaluation of old actions
                new_log_probs, entropy, new_values = self.network.evaluate_action(
                    states[batch_idx], actions[batch_idx]
                )
                
                # Probability ratio
                ratio = torch.exp(new_log_probs - old_log_probs[batch_idx])
                
                # Clipped surrogate objective
                batch_advantages = advantages[batch_idx]
                surr1 = ratio * batch_advantages
                surr2 = torch.clamp(ratio, 1 - self.clip_epsilon,
                                    1 + self.clip_epsilon) * batch_advantages
                policy_loss = -torch.min(surr1, surr2).mean()
                
                # Value loss
                value_loss = F.mse_loss(new_values, returns[batch_idx])
                
                # Entropy bonus
                entropy_loss = -entropy.mean()
                
                # Total loss
                total_loss = (policy_loss + 
                             self.value_coef * value_loss + 
                             self.entropy_coef * entropy_loss)
                
                self.optimizer.zero_grad()
                total_loss.backward()
                nn.utils.clip_grad_norm_(self.network.parameters(), self.max_grad_norm)
                self.optimizer.step()
                
                # Track metrics
                with torch.no_grad():
                    approx_kl = (old_log_probs[batch_idx] - new_log_probs).mean().item()
                    clip_frac = ((ratio - 1.0).abs() > self.clip_epsilon).float().mean().item()
                
                metrics['policy_loss'].append(policy_loss.item())
                metrics['value_loss'].append(value_loss.item())
                metrics['entropy'].append(-entropy_loss.item())
                metrics['approx_kl'].append(approx_kl)
                metrics['clip_fraction'].append(clip_frac)
        
        return {k: np.mean(v) for k, v in metrics.items()}


print("PPO Agent architecture:")
agent = PPOAgent(state_dim=4, n_actions=2)
print(agent.network)
total_params = sum(p.numel() for p in agent.network.parameters())
print(f"\nTotal parameters: {total_params:,}")
print(f"PPO epochs per update: {agent.ppo_epochs}")
print(f"Clip epsilon: {agent.clip_epsilon}")
print(f"GAE lambda: {agent.lam}")

### Training the PPO Agent

In [None]:
def train_ppo(env, agent, n_iterations=50, n_steps_per_rollout=2048):
    """Train PPO agent."""
    all_rewards = []
    all_lengths = []
    all_metrics = []
    
    for iteration in range(n_iterations):
        # Collect experience
        rollout, ep_rewards, ep_lengths = agent.collect_rollout(env, n_steps_per_rollout)
        
        # PPO update
        metrics = agent.update(rollout)
        
        all_rewards.extend(ep_rewards)
        all_lengths.extend(ep_lengths)
        all_metrics.append(metrics)
        
        if (iteration + 1) % 5 == 0:
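            # A rollout may contain no finished episode (e.g. the pole never falls),
            # so report NaN instead of a misleading 0
            avg_reward = np.mean(ep_rewards) if ep_rewards else float('nan')
            avg_length = np.mean(ep_lengths) if ep_lengths else float('nan')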
            print(f"Iter {iteration+1:3d} | Avg Reward: {avg_reward:6.1f} | "
                  f"Avg Length: {avg_length:5.1f} | "
                  f"KL: {metrics['approx_kl']:.4f} | "
                  f"Clip%: {metrics['clip_fraction']:.2%}")
    
    return all_rewards, all_lengths, all_metrics


# Train PPO
env = CartPoleSimple()
ppo_agent = PPOAgent(state_dim=4, n_actions=2, lr=3e-4)

rewards_ppo, lengths_ppo, metrics_ppo = train_ppo(env, ppo_agent, n_iterations=50)

### Visualization: PPO Training Metrics

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Episode lengths
ax = axes[0, 0]
window = 20
ax.plot(lengths_ppo, alpha=0.3, color='#3498db')
if len(lengths_ppo) >= window:
    smoothed = np.convolve(lengths_ppo, np.ones(window)/window, mode='valid')
    ax.plot(range(window-1, len(lengths_ppo)), smoothed, color='#2c3e50', linewidth=2)
ax.axhline(y=200, color='red', linestyle='--', label='Goal (200 steps)')
ax.set_xlabel('Episode')
ax.set_ylabel('Episode Length')
ax.set_title('PPO: Episode Length', fontsize=13, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Approximate KL divergence
ax = axes[0, 1]
kls = [m['approx_kl'] for m in metrics_ppo]
ax.plot(kls, 'o-', color='#9b59b6', markersize=4)
ax.set_xlabel('PPO Iteration')
ax.set_ylabel('Approx KL Divergence')
ax.set_title('KL Divergence (Policy Change per Update)', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)

# Clip fraction
ax = axes[1, 0]
clips = [m['clip_fraction'] for m in metrics_ppo]
ax.plot(clips, 'o-', color='#e74c3c', markersize=4)
ax.set_xlabel('PPO Iteration')
ax.set_ylabel('Fraction of Clipped Updates')
ax.set_title('Clip Fraction (How Often Clipping Activates)', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)

# Entropy
ax = axes[1, 1]
entropies = [m['entropy'] for m in metrics_ppo]
ax.plot(entropies, 'o-', color='#2ecc71', markersize=4)
ax.set_xlabel('PPO Iteration')
ax.set_ylabel('Policy Entropy')
ax.set_title('Entropy (Exploration Level)', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)

plt.suptitle('PPO Training Dashboard', fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

print("Key observations:")
print("  - KL divergence stays small → policy changes are controlled")
print("  - Clip fraction shows how often the trust region is active")
print("  - Entropy gradually decreases as policy becomes more confident")

---

## 4. PPO vs. Previous Methods

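Let's compare PPO against A2C on the same task. Note that the two runs are not matched on environment steps: PPO above used 50 rollouts of 2,048 steps (102,400 steps in total), while A2C below trains for 500 episodes of up to 500 steps each, and both curves are plotted per episode. Treat the plot as a qualitative comparison of stability rather than a controlled benchmark.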

In [None]:
# A2C agent (from Notebook 19) for comparison
class A2CAgent:
    def __init__(self, state_dim, n_actions, lr=3e-4, gamma=0.99):
        self.gamma = gamma
        self.network = PPOActorCritic(state_dim, n_actions)  # Same architecture
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
    
    def train_episode(self, env, max_steps=500):
        state = env.reset()
        log_probs, values, rewards, entropies = [], [], [], []
        
        for step in range(max_steps):
            state_t = torch.FloatTensor(state).unsqueeze(0)
            action, lp, ent, val = self.network.get_action_and_value(state_t)
            next_state, reward, done = env.step(action.item())
            
            log_probs.append(lp)
            values.append(val)
            rewards.append(reward)
            entropies.append(ent)
            state = next_state
            if done:
                break
        
        # Compute returns
        returns_list = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns_list.insert(0, G)
        returns_t = torch.FloatTensor(returns_list)
        
        values_t = torch.cat(values)
        log_probs_t = torch.cat(log_probs)
        entropies_t = torch.cat(entropies)
        advantages = returns_t - values_t.detach()
        if len(advantages) > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        loss = (-(log_probs_t * advantages).mean() + 
                0.5 * F.mse_loss(values_t, returns_t) - 
                0.01 * entropies_t.mean())
        
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(self.network.parameters(), 0.5)
        self.optimizer.step()
        
        return sum(rewards), step + 1


# Run comparison
print("Training A2C for comparison...")
a2c_agent = A2CAgent(4, 2)
l_a2c = []
for ep in range(500):
    _, length = a2c_agent.train_episode(CartPoleSimple())
    l_a2c.append(length)
    if (ep + 1) % 100 == 0:
        print(f"  Episode {ep+1}: avg length = {np.mean(l_a2c[-100:]):.1f}")

# Compare
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
window = 30

for data, label, color in [(lengths_ppo, 'PPO', '#2ecc71'),
                            (l_a2c, 'A2C', '#3498db')]:
    if len(data) >= window:
        smoothed = np.convolve(data, np.ones(window)/window, mode='valid')
        ax.plot(smoothed, label=label, color=color, linewidth=2.5)

ax.axhline(y=200, color='gray', linestyle='--', alpha=0.5, label='Goal')
ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel('Episode Length (smoothed)', fontsize=12)
ax.set_title('PPO vs A2C: Stability and Performance', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

## 5. From RL to RLHF: The Complete Pipeline

Now we connect everything to **language model alignment**. The RLHF pipeline has three stages:

### Stage 1: Supervised Fine-Tuning (SFT)
Train the LM on high-quality demonstrations (human-written responses).

### Stage 2: Reward Model Training
Train a model to predict human preferences. Given two responses, it learns which one humans prefer.

### Stage 3: PPO Optimization
Use PPO to optimize the LM's policy to maximize the reward model's scores, with a KL penalty to stay close to the SFT model.

$$\text{objective} = \mathbb{E}_{x \sim D, y \sim \pi_\theta}\left[R_\phi(x, y) - \beta \cdot D_{KL}(\pi_\theta \| \pi_{SFT})\right]$$

The KL penalty is **critical** — without it, the model would find degenerate ways to maximize the reward model (reward hacking).
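
In practice the KL term is usually estimated from the sampled tokens rather than computed exactly. The sketch below assumes one common convention: each generated token receives a penalty of $-\beta(\log \pi_\theta - \log \pi_{SFT})$, and the reward model score is added at the final token. The log-probabilities are made-up numbers, and `kl_shaped_rewards` is a hypothetical helper name.

```python
import numpy as np

def kl_shaped_rewards(policy_log_probs, sft_log_probs, rm_score, beta=0.1):
    """Per-token rewards with a sampled KL penalty (sketch of one convention)."""
    # Single-sample estimate of the KL contribution at each generated token
    per_token_kl = policy_log_probs - sft_log_probs
    rewards = -beta * per_token_kl
    rewards[-1] += rm_score  # the sequence-level score arrives at the last token
    return rewards, per_token_kl.sum()

# Illustrative log-probs of the sampled tokens under each model
policy_log_probs = np.array([-1.2, -0.4, -2.0, -0.9])
sft_log_probs = np.array([-1.5, -0.7, -1.8, -1.6])

rewards, kl_estimate = kl_shaped_rewards(
    policy_log_probs, sft_log_probs, rm_score=1.0, beta=0.1
)
print("Per-token rewards:", np.round(rewards, 3))
print(f"Sequence KL estimate: {kl_estimate:.2f}")
print(f"Total reward: {rewards.sum():.2f}")
```

Summing the per-token rewards recovers the reward model score minus β times the sequence-level KL estimate, so the shaped rewards can be passed to GAE like any other reward sequence.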

In [None]:
# Visualization: RLHF Pipeline
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('The RLHF Pipeline', fontsize=18, fontweight='bold')

# Stage boxes
stages = [
    (0.5, 7, 3.5, 2, 'Stage 1: SFT', '#3498db',
     'Fine-tune LM on\nhuman demonstrations'),
    (5, 7, 3.5, 2, 'Stage 2: Reward\nModel', '#e74c3c',
     'Learn human\npreferences'),
    (9.5, 7, 3.5, 2, 'Stage 3: PPO', '#2ecc71',
     'Optimize policy with\nreward + KL penalty'),
]

for x, y, w, h, title, color, desc in stages:
    box = mpatches.FancyBboxPatch((x, y), w, h, boxstyle="round,pad=0.3",
                                   facecolor=color, edgecolor='black', linewidth=2, alpha=0.9)
    ax.add_patch(box)
    ax.text(x + w/2, y + h/2 + 0.2, title, ha='center', va='center',
            fontsize=11, fontweight='bold', color='white')
    ax.text(x + w/2, y - 0.5, desc, ha='center', va='center',
            fontsize=9, color='gray', style='italic')

# Arrows between stages
ax.annotate('', xy=(5, 8), xytext=(4, 8),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='gray'))
ax.annotate('', xy=(9.5, 8), xytext=(8.5, 8),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='gray'))

# Data flows
data_items = [
    (2.25, 5.8, 'Human\nDemonstrations', '#3498db'),
    (6.75, 5.8, 'Comparison\nData (A vs B)', '#e74c3c'),
    (11.25, 5.8, 'PPO + KL Penalty\n+ Reward Signal', '#2ecc71'),
]

for x, y, text, color in data_items:
    ax.text(x, y, text, ha='center', va='center', fontsize=9,
            color=color, fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.3', facecolor='white', 
                     edgecolor=color, alpha=0.8))

# Result
result_box = mpatches.FancyBboxPatch((4, 2), 6, 1.5, boxstyle="round,pad=0.3",
                                      facecolor='#f39c12', edgecolor='black', linewidth=2)
ax.add_patch(result_box)
ax.text(7, 2.75, 'Aligned Language Model\nHelpful, Harmless, Honest',
        ha='center', va='center', fontsize=12, fontweight='bold', color='white')

ax.annotate('', xy=(7, 3.5), xytext=(7, 5.3),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='#f39c12'))

# RL mapping
mapping = [
    (1, 1, 'Agent: LM'),
    (4, 1, 'State: Prompt + context'),
    (7.5, 1, 'Action: Next token'),
    (11, 1, 'Reward: R(x,y) - β·KL'),
]

for x, y, text in mapping:
    ax.text(x, y, text, fontsize=9, color='#2c3e50',
            bbox=dict(boxstyle='round,pad=0.2', facecolor='#ecf0f1', alpha=0.8))

ax.text(7, 0.3, 'RL Mapping', ha='center', fontsize=10, fontweight='bold', color='gray')

plt.tight_layout()
plt.show()

---

## 6. Building a Reward Model

The reward model is trained on **preference data**: pairs of responses where humans indicated which they prefer. Let's build one from scratch.
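
A standard way to turn pairwise preferences into a training signal is the Bradley-Terry model, under which the probability that the chosen response beats the rejected one is $\sigma(r_{chosen} - r_{rejected})$. The sketch below evaluates the resulting negative log-likelihood for a few hypothetical score margins. `bradley_terry_loss` is an illustrative helper that uses only the standard library.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the observed preference (sketch)."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

for margin in (-2.0, 0.0, 2.0):
    loss = bradley_terry_loss(margin, 0.0)
    prob = math.exp(-loss)  # model's probability that the chosen response wins
    print(f"margin = {margin:+.1f} -> P(chosen wins) = {prob:.3f}, loss = {loss:.3f}")
```

Only the difference between the two scores enters the loss, so a reward model trained this way is determined only up to an additive constant.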

In [None]:
class RewardModel(nn.Module):
    """Reward model trained on preference data.
    
    Takes a (prompt, response) embedding and outputs a scalar reward score.
    Trained with the Bradley-Terry preference model.
    """
    
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)  # Scalar reward
        )
    
    def forward(self, x):
        return self.net(x).squeeze(-1)


def generate_preference_data(n_pairs=1000, input_dim=16):
    """Simulate preference data.
    
    We simulate a hidden 'quality' function that humans implicitly use to
    rank responses. The reward model must learn to approximate this function.
    """
    # Hidden quality function: true reward = linear combination + noise
    true_weights = np.random.randn(input_dim)
    
    pairs_chosen = []
    pairs_rejected = []
    
    for _ in range(n_pairs):
        # Generate two candidate responses (as embeddings)
        response_a = np.random.randn(input_dim) * 0.5
        response_b = np.random.randn(input_dim) * 0.5
        
        # True quality scores
        quality_a = np.dot(response_a, true_weights) + np.random.randn() * 0.3
        quality_b = np.dot(response_b, true_weights) + np.random.randn() * 0.3
        
        # Human "chooses" the higher quality response
        if quality_a > quality_b:
            pairs_chosen.append(response_a)
            pairs_rejected.append(response_b)
        else:
            pairs_chosen.append(response_b)
            pairs_rejected.append(response_a)
    
    return (torch.FloatTensor(np.array(pairs_chosen)),
            torch.FloatTensor(np.array(pairs_rejected)),
            true_weights)


def train_reward_model(model, chosen, rejected, n_epochs=50, lr=1e-3):
    """Train reward model using Bradley-Terry preference loss.
    
    Loss = -log(sigmoid(r_chosen - r_rejected))
    Minimizing this loss pushes the chosen response's score above the rejected one's.
    """
    optimizer = optim.Adam(model.parameters(), lr=lr)
    losses = []
    accuracies = []
    
    for epoch in range(n_epochs):
        r_chosen = model(chosen)
        r_rejected = model(rejected)
        
        # Bradley-Terry loss: -log P(chosen > rejected)
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Accuracy: how often does the model rank chosen > rejected?
        with torch.no_grad():
            accuracy = (r_chosen > r_rejected).float().mean().item()
        
        losses.append(loss.item())
        accuracies.append(accuracy)
        
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1:3d} | Loss: {loss.item():.4f} | Accuracy: {accuracy:.3f}")
    
    return losses, accuracies


# Generate data and train
input_dim = 16
chosen, rejected, true_weights = generate_preference_data(n_pairs=2000, input_dim=input_dim)

reward_model = RewardModel(input_dim)
rm_losses, rm_accuracies = train_reward_model(reward_model, chosen, rejected)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(rm_losses, color='#e74c3c', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Bradley-Terry Loss', fontsize=12)
axes[0].set_title('Reward Model: Training Loss', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

axes[1].plot(rm_accuracies, color='#2ecc71', linewidth=2)
axes[1].axhline(y=0.5, color='gray', linestyle='--', label='Random baseline')
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Preference Accuracy', fontsize=12)
axes[1].set_title('Reward Model: Agreement with Human Preferences', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(0.4, 1.0)

plt.tight_layout()
plt.show()

# Verify: reward model scores align with true quality
with torch.no_grad():
    test_responses = torch.randn(200, input_dim) * 0.5
    predicted_rewards = reward_model(test_responses).numpy()
    true_rewards = test_responses.numpy() @ true_weights

fig, ax = plt.subplots(1, 1, figsize=(7, 6))
ax.scatter(true_rewards, predicted_rewards, alpha=0.5, s=30, color='#3498db')
# Least-squares linear fit through the scatter
z = np.polyfit(true_rewards, predicted_rewards, 1)
p = np.poly1d(z)
xs = np.sort(true_rewards)
ax.plot(xs, p(xs), 'r-', linewidth=2, label='Linear fit')
ax.set_xlabel('True Quality Score', fontsize=12)
ax.set_ylabel('Reward Model Prediction', fontsize=12)
ax.set_title('Reward Model Captures True Preferences', fontsize=13, fontweight='bold')
correlation = np.corrcoef(true_rewards, predicted_rewards)[0, 1]
ax.text(0.05, 0.95, f'Correlation: {correlation:.3f}', transform=ax.transAxes,
        fontsize=12, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

## 7. The KL Penalty: Preventing Reward Hacking

Without constraints, RL optimization will find **degenerate solutions** that maximize the reward model's score without actually being helpful. This is called **reward hacking**.

Example: A reward model might give high scores to long, confident-sounding responses. Without a KL penalty, the LM would learn to generate extremely long, repetitive text that sounds confident but says nothing useful.

The **KL divergence penalty** keeps the RL-trained policy close to the SFT model:

$$\text{reward}_{\text{total}} = R_\phi(x, y) - \beta \cdot D_{KL}(\pi_\theta(\cdot|x) \| \pi_{SFT}(\cdot|x))$$

- $\beta$ small: More freedom to optimize reward → risk of reward hacking
- $\beta$ large: Stay close to SFT model → limited improvement from RL
- $\beta$ just right: Meaningful improvement while maintaining quality
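
To make the trade-off concrete, here is a minimal NumPy sketch for a categorical policy over a handful of response styles. The two candidate policies and the helper names `kl_categorical` and `kl_penalized_objective` are assumptions made for illustration; the sketch evaluates the expected reward minus $\beta$ times the exact KL divergence.

```python
import numpy as np

def kl_categorical(p, q):
    """Exact KL(p || q) for two categorical distributions with full support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def kl_penalized_objective(policy, sft_policy, rewards, beta):
    """Expected reward under `policy` minus beta * KL(policy || sft_policy)."""
    expected_reward = float(np.dot(policy, rewards))
    return expected_reward - beta * kl_categorical(policy, sft_policy)

# Illustrative values (assumed for this sketch)
sft = np.array([0.2, 0.3, 0.15, 0.25, 0.1])
rewards = np.array([0.3, 0.8, -0.2, 0.9, -0.5])
collapsed = np.array([0.01, 0.01, 0.01, 0.96, 0.01])  # nearly all mass on one style
shifted = np.array([0.15, 0.35, 0.05, 0.40, 0.05])    # mild shift toward high reward

for beta in (0.0, 0.1, 1.0):
    j_collapsed = kl_penalized_objective(collapsed, sft, rewards, beta)
    j_shifted = kl_penalized_objective(shifted, sft, rewards, beta)
    print(f"beta={beta:.1f} | collapsed: {j_collapsed:.3f} | shifted: {j_shifted:.3f}")
```

With these numbers the collapsed policy scores higher when $\beta = 0$, while at $\beta = 1$ the mildly shifted policy wins because the collapsed one pays a large KL cost.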

In [None]:
# Demonstrate the KL penalty effect
def simulate_rlhf_with_kl(beta, n_steps=100):
    """Simulate RLHF optimization with different KL penalty strengths."""
    # Simulate a policy as a distribution over 5 "response styles"
    sft_policy = np.array([0.2, 0.3, 0.15, 0.25, 0.1])  # SFT baseline
    reward_scores = np.array([0.3, 0.8, -0.2, 0.9, -0.5])  # Reward model scores
    
    policy = sft_policy.copy()
    policy_history = [policy.copy()]
    reward_history = []
    kl_history = []
    
    lr = 0.1
    for step in range(n_steps):
        # Sample from current policy
        action = np.random.choice(5, p=policy)
        reward = reward_scores[action]
        
        # KL divergence from SFT policy
        kl = np.sum(policy * np.log(policy / (sft_policy + 1e-8) + 1e-8))
        
        # Total reward with KL penalty
        total_reward = reward - beta * kl
        
        # Update (simplified policy gradient)
        gradient = np.zeros(5)
        gradient[action] = total_reward
        logits = np.log(policy + 1e-8) + lr * gradient
        policy = np.exp(logits) / np.exp(logits).sum()
        
        policy_history.append(policy.copy())
        reward_history.append(reward)
        kl_history.append(kl)
    
    return np.array(policy_history), reward_history, kl_history


fig, axes = plt.subplots(1, 3, figsize=(15, 5))

betas = [0.0, 0.1, 1.0]
titles = ['β=0 (No KL penalty)', 'β=0.1 (Moderate)', 'β=1.0 (Strong)']
style_names = ['Verbose', 'Concise', 'Rude', 'Helpful', 'Off-topic']
style_colors = ['#e74c3c', '#2ecc71', '#95a5a6', '#3498db', '#f39c12']

for ax, beta, title in zip(axes, betas, titles):
    policy_hist, rewards, kls = simulate_rlhf_with_kl(beta, n_steps=200)
    
    for i, (name, color) in enumerate(zip(style_names, style_colors)):
        ax.plot(policy_hist[:, i], color=color, linewidth=2, label=name)
    
    ax.set_xlabel('Step', fontsize=11)
    ax.set_ylabel('Policy probability', fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.set_ylim(0, 1)
    ax.grid(True, alpha=0.3)
    if ax == axes[2]:
        ax.legend(fontsize=8, loc='center right')

plt.suptitle('KL Penalty Controls Policy Drift from SFT', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("Typical pattern (individual runs vary with the sampled actions):")
print("  β=0: Policy collapses to one style (reward hacking)")
print("  β=0.1: Policy shifts toward high-reward styles while maintaining diversity")
print("  β=1.0: Policy barely moves from SFT (too conservative)")

---

## 8. Simplified RLHF Pipeline

Let's build a complete (simplified) RLHF pipeline that trains a small "language model" using PPO with a reward model and KL penalty.

In [None]:
class SimpleLanguageModel(nn.Module):
    """Simplified 'language model' that generates responses as continuous vectors.
    
    In reality, LLMs output token probabilities. We simplify by having the
    model output a response embedding directly.
    """
    
    def __init__(self, prompt_dim=8, response_dim=16, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(prompt_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden_dim, response_dim)
        self.log_std = nn.Parameter(torch.zeros(response_dim) - 1.0)
    
    def forward(self, prompt):
        features = self.net(prompt)
        mean = self.mean_head(features)
        std = torch.exp(self.log_std)
        return mean, std
    
    def generate(self, prompt):
        """Generate a response (sample from policy)."""
        mean, std = self.forward(prompt)
        dist = torch.distributions.Normal(mean, std)
        # PPO uses the likelihood ratio, so no reparameterized gradient is needed
        response = dist.sample()
        log_prob = dist.log_prob(response).sum(dim=-1)
        return response, log_prob
    
    def log_prob(self, prompt, response):
        """Compute log probability of a response under this model."""
        mean, std = self.forward(prompt)
        dist = torch.distributions.Normal(mean, std)
        return dist.log_prob(response).sum(dim=-1)


class RLHFTrainer:
    """Complete RLHF training pipeline."""
    
    def __init__(self, prompt_dim=8, response_dim=16, kl_coef=0.1,
                 clip_epsilon=0.2, lr=1e-4):
        self.kl_coef = kl_coef
        self.clip_epsilon = clip_epsilon
        
        # The policy being trained
        self.policy = SimpleLanguageModel(prompt_dim, response_dim)
        
        # SFT reference model (frozen)
        self.ref_model = SimpleLanguageModel(prompt_dim, response_dim)
        self.ref_model.load_state_dict(self.policy.state_dict())
        for param in self.ref_model.parameters():
            param.requires_grad = False
        
        # Pre-trained reward model
        self.reward_model = reward_model  # From section 6
        for param in self.reward_model.parameters():
            param.requires_grad = False
        
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
    
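    def compute_reward(self, prompts, responses):
        """Compute reward = R(x,y) - β * log(π(y|x) / π_ref(y|x)).

        The log-ratio of a sampled response is a single-sample estimate of
        KL(π || π_ref): individual values can be negative, but the batch mean
        approximates the true KL.
        """
        with torch.no_grad():
            # Reward model score, flattened to shape (batch,) to match the log-probs
            rm_reward = self.reward_model(responses).reshape(-1)
            
            # Per-sample log-ratio between policy and reference (KL estimate)
            policy_logprob = self.policy.log_prob(prompts, responses)
            ref_logprob = self.ref_model.log_prob(prompts, responses)
            kl_div = policy_logprob - ref_logprob
            
            # Total reward: reward model score minus the KL penalty
            total_reward = rm_reward - self.kl_coef * kl_div
        
        return total_reward, rm_reward, kl_div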
    
    def train_step(self, prompts, n_ppo_epochs=4):
        """One PPO training step."""
        # Generate responses from current policy
        with torch.no_grad():
            responses, old_log_probs = self.policy.generate(prompts)
        
        # Compute rewards
        total_rewards, rm_rewards, kl_divs = self.compute_reward(prompts, responses)
        
        # Batch-normalized rewards serve as the advantage estimate
        # (this simplified setup has no value-function baseline)
        advantages = (total_rewards - total_rewards.mean()) / (total_rewards.std() + 1e-8)
        
        # PPO update
        for _ in range(n_ppo_epochs):
            new_log_probs = self.policy.log_prob(prompts, responses)
            ratio = torch.exp(new_log_probs - old_log_probs.detach())
            
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.clip_epsilon,
                               1 + self.clip_epsilon) * advantages
            loss = -torch.min(surr1, surr2).mean()
            
            self.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
            self.optimizer.step()
        
        return {
            'loss': loss.item(),
            'rm_reward': rm_rewards.mean().item(),
            'kl_div': kl_divs.mean().item(),
            'total_reward': total_rewards.mean().item(),
        }


# Run RLHF training
trainer = RLHFTrainer(prompt_dim=8, response_dim=16, kl_coef=0.1)

history = {'rm_reward': [], 'kl_div': [], 'total_reward': []}

print("Training RLHF pipeline...")
for step in range(200):
    # Generate random prompts
    prompts = torch.randn(32, 8) * 0.5
    
    metrics = trainer.train_step(prompts)
    
    for key in history:
        history[key].append(metrics[key])
    
    if (step + 1) % 50 == 0:
        print(f"Step {step+1:3d} | RM Reward: {metrics['rm_reward']:7.3f} | "
              f"KL: {metrics['kl_div']:6.3f} | Total: {metrics['total_reward']:7.3f}")

In [None]:
# Visualize RLHF training
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Reward model score
ax = axes[0]
ax.plot(history['rm_reward'], color='#2ecc71', linewidth=2)
ax.set_xlabel('Step', fontsize=12)
ax.set_ylabel('Reward Model Score', fontsize=12)
ax.set_title('RM Reward Increases\n(Model learns to please RM)', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)

# KL divergence
ax = axes[1]
ax.plot(history['kl_div'], color='#e74c3c', linewidth=2)
ax.set_xlabel('Step', fontsize=12)
ax.set_ylabel('KL Divergence', fontsize=12)
ax.set_title('KL Divergence from SFT\n(Policy drift, controlled by β)', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)

# Total reward
ax = axes[2]
ax.plot(history['total_reward'], color='#3498db', linewidth=2)
ax.set_xlabel('Step', fontsize=12)
ax.set_ylabel('Total Reward (RM - β·KL)', fontsize=12)
ax.set_title('Total RLHF Objective\n(RM reward minus KL penalty)', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)

plt.suptitle('RLHF Training Progress', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("When training behaves as intended, the plots show that the pipeline:")
print("  1. Increases reward model score (learns better responses)")
print("  2. Controls KL divergence (doesn't drift too far from SFT)")
print("  3. Maximizes the total objective (RM reward minus the KL penalty)")
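
The `kl_div` tracked by the trainer is the log-ratio $\log \pi_\theta(y|x) - \log \pi_{ref}(y|x)$ of sampled responses, which estimates the KL divergence only on average. For the Gaussian policies used here, PyTorch also offers a closed form through `torch.distributions.kl_divergence`. The sketch below compares the two on a pair of hand-picked Gaussians (assumed values, not taken from the trained models).

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

# Two diagonal Gaussians standing in for policy and reference (assumed values)
policy_dist = Normal(torch.tensor([0.5, -0.2]), torch.tensor([0.4, 0.6]))
ref_dist = Normal(torch.tensor([0.0, 0.0]), torch.tensor([0.5, 0.5]))

# Exact KL: sum the per-dimension KL of the independent Gaussians
exact_kl = kl_divergence(policy_dist, ref_dist).sum()

# Monte Carlo estimate: average log-ratio over samples drawn from the policy
samples = policy_dist.sample((20000,))
log_ratio = (policy_dist.log_prob(samples) - ref_dist.log_prob(samples)).sum(dim=-1)
mc_kl = log_ratio.mean()

negative_fraction = (log_ratio < 0).float().mean()
print(f"Exact KL: {exact_kl.item():.4f} | Sample estimate: {mc_kl.item():.4f}")
print(f"Fraction of negative per-sample log-ratios: {negative_fraction.item():.3f}")
```

Individual log-ratios can be negative even though the true KL never is, which is why a per-step `kl_div` curve can look noisy or dip below zero on small batches.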

---

## 9. The Full Picture: From Linear Algebra to RLHF

Let's trace how every notebook in this curriculum connects to RLHF:

| Notebook | Topic | How It's Used in RLHF |
|----------|-------|----------------------|
| **01** | Linear Algebra | Embeddings, weight matrices, attention |
| **02** | Calculus | Gradients, backpropagation, optimization |
| **03** | Probability | KL divergence, policy distributions, Bradley-Terry model |
| **04** | Python OOP | Model architectures, training loops |
| **05** | NumPy | Efficient tensor operations |
| **06** | Perceptrons | Foundation of neural networks |
| **07** | Backpropagation | How all networks learn |
| **08** | PyTorch | Framework for building everything |
| **09** | Training Deep Networks | Optimization, regularization, stability |
| **10** | CNNs | Feature extraction (vision RL uses CNNs) |
| **11** | RNNs | Sequential processing, precursor to transformers |
| **12** | Attention | The core mechanism of transformers |
| **13** | Transformers | Architecture of the language model being aligned |
| **14** | Language Models | The base model that RLHF fine-tunes |
| **15** | Embeddings | How text is represented as vectors |
| **16** | Fine-tuning & PEFT | SFT stage of RLHF, LoRA for efficient training |
| **17** | RL Fundamentals | MDPs, value functions, Bellman equations |
| **18** | Q-Learning & DQN | Value-based RL, foundation for understanding |
| **19** | Policy Gradients | Policy optimization, the mechanism PPO uses |
| **20** | PPO & Modern RL | **The algorithm that aligns language models** |

In [None]:
# The journey visualization
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
ax.set_xlim(0, 20)
ax.set_ylim(0, 12)
ax.axis('off')

parts = [
    (1, 'Math\nFoundations', '#3498db', '01-03', 10),
    (4.5, 'Python\nFoundations', '#2ecc71', '04-05', 8.5),
    (7.5, 'Neural Network\nFundamentals', '#e74c3c', '06-09', 7),
    (10.5, 'Neural Network\nArchitectures', '#9b59b6', '10-12', 5.5),
    (13.5, 'Transformers\n& LLMs', '#f39c12', '13-16', 4),
    (16.5, 'Reinforcement\nLearning', '#1abc9c', '17-20', 2.5),
]

for x, label, color, nbs, y in parts:
    box = mpatches.FancyBboxPatch((x, y), 2.5, 1.5, boxstyle="round,pad=0.2",
                                   facecolor=color, edgecolor='black', linewidth=2, alpha=0.9)
    ax.add_patch(box)
    ax.text(x + 1.25, y + 0.75, label, ha='center', va='center',
            fontsize=9, fontweight='bold', color='white')
    ax.text(x + 1.25, y - 0.3, f'NB {nbs}', ha='center', fontsize=8, color='gray')

# Arrows connecting them
for i in range(len(parts) - 1):
    x1 = parts[i][0] + 2.5
    y1 = parts[i][4] + 0.75
    x2 = parts[i+1][0]
    y2 = parts[i+1][4] + 0.75
    ax.annotate('', xy=(x2, y2), xytext=(x1, y1),
               arrowprops=dict(arrowstyle='->', lw=2, color='gray',
                              connectionstyle='arc3,rad=0.15'))

# Final destination
box = mpatches.FancyBboxPatch((6, 0.2), 8, 1.2, boxstyle="round,pad=0.3",
                               facecolor='#2c3e50', edgecolor='gold', linewidth=3)
ax.add_patch(box)
ax.text(10, 0.8, 'RLHF: Aligned Language Models', ha='center', va='center',
        fontsize=14, fontweight='bold', color='gold')

ax.annotate('', xy=(10, 1.4), xytext=(17.75, 2.5),
           arrowprops=dict(arrowstyle='->', lw=3, color='gold'))

ax.set_title('The Complete Learning Journey: 20 Notebooks', fontsize=16, fontweight='bold')

plt.tight_layout()
plt.show()

---

## Exercises

### Exercise 1: PPO Hyperparameter Study

The clipping parameter ε is crucial. Train PPO on CartPole with ε = {0.05, 0.1, 0.2, 0.3, 0.5}. Plot learning curves and clip fractions. What happens when ε is too small or too large?
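
One common way to measure the clip fraction is the share of samples whose probability ratio falls outside $[1-\epsilon, 1+\epsilon]$. The helper below is a hypothetical sketch with made-up probabilities; `clip_fraction` is not a method of the notebook's `PPOAgent`, so adapt it to wherever your agent computes the ratio.

```python
import torch

def clip_fraction(new_log_probs, old_log_probs, clip_epsilon):
    """Fraction of samples whose probability ratio lies outside [1-eps, 1+eps]."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    outside = (ratio - 1.0).abs() > clip_epsilon
    return outside.float().mean().item()

# Made-up action probabilities under the old and new policy
old_lp = torch.log(torch.tensor([0.5, 0.5, 0.5, 0.5]))
new_lp = torch.log(torch.tensor([0.5, 0.55, 0.7, 0.3]))  # ratios: 1.0, 1.1, 1.4, 0.6

for eps in (0.05, 0.2, 0.5):
    print(f"eps={eps}: clip fraction = {clip_fraction(new_lp, old_lp, eps):.2f}")
```

For the same policy change, a smaller ε clips more samples, so fewer of them contribute gradient in later PPO epochs.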

In [None]:
# Exercise 1: Your code here
# Hint: Create PPOAgent instances with different clip_epsilon values
# and compare their training curves


### Exercise 2: Reward Hacking Demonstration

Train the RLHF pipeline with β=0 (no KL penalty) and β=0.5 (strong penalty). Show that without the KL penalty, the model finds degenerate solutions that maximize the reward model but produce low-quality outputs.

In [None]:
# Exercise 2: Your code here
# Hint: Run RLHFTrainer with different kl_coef values
# Monitor both rm_reward and kl_div over training


### Exercise 3: GAE Lambda Ablation

Train PPO with λ = {0, 0.5, 0.9, 0.95, 1.0} and compare learning stability and final performance. Verify that λ=0 (pure TD) has lower variance but higher bias, while λ=1 (Monte Carlo) has higher variance but lower bias.
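
Before running the ablation, it can help to check the two limiting cases on a toy episode. The function below is a standalone sketch, independent of the `PPOAgent` code, and assumes a single episode that terminates after the last reward: with λ=0 the advantages reduce to one-step TD residuals, and with λ=1 they equal the discounted return minus the value estimate.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one episode that terminates after the last reward.

    `values` has one entry per step; the value after the final step is 0.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), 0.0)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    # Work backwards so each step reuses the accumulated future residuals
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3]
gamma = 0.9

adv_td = compute_gae(rewards, values, gamma, lam=0.0)  # one-step TD residuals
adv_mc = compute_gae(rewards, values, gamma, lam=1.0)  # Monte Carlo return minus value
print("lambda=0:", np.round(adv_td, 3))
print("lambda=1:", np.round(adv_mc, 3))
```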

In [None]:
# Exercise 3: Your code here
# Hint: Modify the lam parameter in PPOAgent and compare training curves


---

## Summary

### Key Concepts

- **PPO** uses a clipped surrogate objective to prevent destructively large policy updates: $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t\right)\right]$
- **GAE** ($\lambda$) provides a tunable bias-variance tradeoff for advantage estimation
- PPO reuses data for **multiple gradient steps** per batch (unlike vanilla policy gradients)
- The **reward model** learns human preferences from comparison data using the Bradley-Terry model
- The **KL penalty** prevents reward hacking by keeping the policy close to the SFT reference
- The RLHF pipeline: **SFT → Reward Model → PPO** transforms a base LM into an aligned assistant
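
As a quick numerical check of the first point, the sketch below differentiates the clipped surrogate with respect to the ratio for a few hand-picked cases. The helper names `clipped_surrogate` and `surrogate_grad` are introduced here for illustration only.

```python
import torch

def clipped_surrogate(ratio, advantage, clip_epsilon=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantage
    return torch.min(surr1, surr2)

def surrogate_grad(ratio_value, advantage_value, clip_epsilon=0.2):
    """Gradient of the clipped surrogate with respect to the ratio."""
    ratio = torch.tensor(ratio_value, requires_grad=True)
    objective = clipped_surrogate(ratio, torch.tensor(advantage_value), clip_epsilon)
    objective.backward()
    return ratio.grad.item()

cases = [(1.1, 1.0), (1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
for r, a in cases:
    grad = surrogate_grad(r, a)
    print(f"ratio={r:.1f}, advantage={a:+.1f} -> d(objective)/d(ratio) = {grad:+.1f}")
```

The gradient vanishes exactly when the ratio has already moved past the clip range in the direction the advantage favors (ratio above $1+\epsilon$ with a positive advantage, or below $1-\epsilon$ with a negative one). Moves in the harmful direction are never clipped, so the policy can always be pulled back.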

### Fundamental Insight

PPO's genius is simplicity: a single clipping operation replaces TRPO's complex constrained optimization while achieving similar results. This simplicity is what made it practical enough to scale to RLHF with billion-parameter language models. The algorithm that makes AI assistants helpful, harmless, and honest is, at its core, just the policy gradient theorem + a clipped ratio + a KL penalty.

### The Complete Journey

From matrix multiplication to RLHF, we've traced the complete path from mathematical foundations through neural network architectures, language models, and reinforcement learning. Every concept built on the last — linear algebra enabled neural networks, which enabled transformers, which enabled language models, which are aligned using RL. You now have the conceptual and implementation foundation to understand how modern AI systems work.

---

## What's Next?

Congratulations on completing the full curriculum! Here are paths for continued learning:

- **Scaling**: How do these techniques work at the scale of GPT-4 / Claude? Study distributed training, mixed precision, model parallelism
- **DPO**: Direct Preference Optimization, an alternative to PPO-based RLHF that optimizes the policy directly on preference pairs, with no separately trained reward model and no RL loop
- **Constitutional AI**: Anthropic's approach to alignment using AI-generated feedback
- **Multi-modal models**: Extending transformers to vision, audio, and beyond
- **Agents**: Using LLMs as reasoning engines that take actions in the world
- **Safety & Alignment**: The broader challenge of ensuring AI systems remain beneficial