# PPO From Scratch: The Algorithm Behind ChatGPT

Welcome to PPO (Proximal Policy Optimization), one of the most widely used and practical RL algorithms today! PPO was used in the RLHF fine-tuning stage of ChatGPT and is a common choice for many other AI systems.

## What You'll Learn

By the end of this notebook, you'll understand:
- Why policy gradient methods need constraints (with a driving analogy!)
- What PPO does and why it works
- The clipping trick (the secret sauce of PPO)
- Actor-Critic architecture
- How to implement PPO from scratch
- How to train an agent on CartPole

**Prerequisites:** `policy-gradient/` notebooks (REINFORCE, Actor-Critic)

**Time:** ~45 minutes

---
## The Big Picture: Why PPO?

### The Problem with Vanilla Policy Gradient

```
    ┌────────────────────────────────────────────────────────┐
    │            THE POLICY UPDATE PROBLEM                    │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  VANILLA POLICY GRADIENT:                              │
    │    "If an action worked, make it MORE likely"          │
    │                                                         │
    │  THE PROBLEM:                                          │
    │    Updates can be TOO LARGE!                           │
    │                                                         │
    │    Before: π(action) = 30%                             │
    │    After:  π(action) = 95%   ← Too much change!        │
    │                                                         │
    │    This can:                                           │
    │    • Destabilize training                              │
    │    • Cause the policy to "forget" good behavior        │
    │    • Lead to catastrophic performance drops            │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```
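
To make the box above concrete, here is a minimal sketch of a single vanilla policy-gradient step on a two-action policy. The setup is illustrative rather than taken from a real run: the policy is one logit `z` passed through a sigmoid, and the `advantage` and `lr` values are assumptions chosen to make the jump visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def vanilla_pg_step(z, advantage, lr):
    """One gradient-ascent step on advantage * log pi(a) for a sigmoid policy."""
    p = sigmoid(z)
    grad_log_p = 1.0 - p          # d/dz of log(sigmoid(z))
    return z + lr * advantage * grad_log_p

# Hypothetical numbers: start at pi(action) = 30%, then see one high-advantage sample.
z_before = math.log(0.3 / 0.7)
z_after = vanilla_pg_step(z_before, advantage=5.0, lr=1.0)

p_before = sigmoid(z_before)
p_after = sigmoid(z_after)
print(f"before: {p_before:.2f}, after: {p_after:.2f}")
```

With these assumed values, one update moves the action probability from 30% to above 90%. Nothing in the plain policy-gradient update limits the size of that jump.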

### The Driving Analogy

```
    ┌────────────────────────────────────────────────────────┐
    │            PPO = CAREFUL DRIVING                        │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  VANILLA POLICY GRADIENT = Aggressive Driver           │
    │    "Turn the wheel as hard as possible!"               │
    │    → Sometimes overshoots and crashes                  │
    │                                                         │
    │  PPO = Careful Driver                                  │
    │    "Turn the wheel, but not too much at once"          │
    │    → Smooth, stable progress                           │
    │                                                         │
    │  The key insight:                                      │
    │    LIMIT how much the policy can change in one update! │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```

---
## PPO's Solution: The Clipping Trick

PPO limits policy updates using a clever clipping mechanism:

```
    ┌────────────────────────────────────────────────────────┐
    │              PPO CLIPPING EXPLAINED                     │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  First, define the RATIO of new to old policy:         │
    │                                                         │
    │         π_new(a|s)                                     │
    │  r(θ) = ─────────── = "How much did policy change?"   │
    │         π_old(a|s)                                     │
    │                                                         │
    │  • r = 1.0: Policy unchanged                           │
    │  • r = 2.0: Action is now 2x more likely               │
    │  • r = 0.5: Action is now half as likely               │
    │                                                         │
    │  Then, CLIP the ratio to stay close to 1:              │
    │                                                         │
    │  clipped_r = clip(r, 1-ε, 1+ε)                         │
    │                                                         │
    │  With ε = 0.2:                                         │
    │    • clipped_r always lies between 0.8 and 1.2         │
    │    • No extra reward for pushing r past this range     │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```
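
As a quick numeric check of the box above, the following sketch clips a handful of ratios with ε = 0.2. The helper `clip_ratio` is a hypothetical name used only here; it does for a single number what `np.clip` or `torch.clamp` do for arrays and tensors.

```python
def clip_ratio(r, epsilon=0.2):
    """Clamp a probability ratio into [1 - epsilon, 1 + epsilon]."""
    return max(1.0 - epsilon, min(r, 1.0 + epsilon))

ratios = [0.5, 0.9, 1.0, 1.1, 2.0]
clipped = [clip_ratio(r) for r in ratios]

for r, c in zip(ratios, clipped):
    status = "clipped" if c != r else "unchanged"
    print(f"r = {r:.1f} -> {c:.1f} ({status})")
```

Ratios already inside the range pass through untouched; only the values at 0.5 and 2.0 are pulled back to the nearest boundary.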

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Rectangle

try:
    import gymnasium as gym
except ImportError:
    import gym

# Visualize the PPO clipping
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: The ratio and clipping
ax1 = axes[0]
epsilon = 0.2

# Ratio range
r = np.linspace(0.5, 2.0, 100)

# Clipped ratio
clipped_r = np.clip(r, 1 - epsilon, 1 + epsilon)

ax1.plot(r, r, 'b-', linewidth=2, label='Original ratio r(θ)')
ax1.plot(r, clipped_r, 'r-', linewidth=3, label='Clipped ratio')
ax1.axhline(y=1, color='gray', linestyle='--', alpha=0.5)
ax1.axhline(y=1+epsilon, color='green', linestyle=':', linewidth=2, label=f'Upper limit (1+ε = {1+epsilon})')
ax1.axhline(y=1-epsilon, color='orange', linestyle=':', linewidth=2, label=f'Lower limit (1-ε = {1-epsilon})')

ax1.fill_between(r, 1-epsilon, 1+epsilon, alpha=0.2, color='green', label='Allowed range')

ax1.set_xlabel('Policy Ratio r(θ) = π_new / π_old', fontsize=12)
ax1.set_ylabel('Effective Ratio (after clipping)', fontsize=12)
ax1.set_title('PPO Clipping Mechanism\n(ε = 0.2)', fontsize=14, fontweight='bold')
ax1.legend(loc='upper left')
ax1.set_xlim(0.5, 2.0)
ax1.set_ylim(0.5, 2.0)
ax1.grid(True, alpha=0.3)

# Right: Effect on objective
ax2 = axes[1]

# For positive advantage (good action)
advantage = 1.0
objective = r * advantage
clipped_objective = clipped_r * advantage
ppo_objective = np.minimum(objective, clipped_objective)

ax2.plot(r, objective, 'b--', linewidth=2, alpha=0.5, label='Unclipped objective')
ax2.plot(r, ppo_objective, 'r-', linewidth=3, label='PPO objective (min of both)')
ax2.axvline(x=1, color='gray', linestyle='--', alpha=0.5)

ax2.fill_between(r, 0, ppo_objective, alpha=0.2, color='red')

ax2.set_xlabel('Policy Ratio r(θ)', fontsize=12)
ax2.set_ylabel('Objective Value', fontsize=12)
ax2.set_title('PPO Objective (Positive Advantage)\n"Good action - increase probability"', fontsize=14, fontweight='bold')
ax2.legend()
ax2.set_xlim(0.5, 2.0)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("KEY INSIGHT: Why Clipping Works")
print("="*60)
print("\n• Good action (A > 0), ratio > 1.2: No extra reward for going higher")
print("  → \"You've increased enough, stop pushing!\"")
print("\n• Bad action (A < 0), ratio < 0.8: No extra reward for going lower")
print("  → \"You've decreased enough, stop pulling!\"")
print("\n• This keeps updates STABLE and SAFE!")
print("="*60)

---
## The PPO-Clip Objective

The full PPO-Clip objective is:

```
    ┌────────────────────────────────────────────────────────┐
    │              PPO-CLIP OBJECTIVE                         │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  L^CLIP(θ) = E[ min( r(θ) × A,                         │
    │                      clip(r(θ), 1-ε, 1+ε) × A ) ]      │
    │                                                         │
    │  Where:                                                │
    │    r(θ) = π_new(a|s) / π_old(a|s)  (probability ratio) │
    │    A    = Advantage (how much better than average?)    │
    │    ε    = Clip parameter (usually 0.2)                 │
    │                                                         │
    │  The MIN takes the MORE PESSIMISTIC option:            │
    │    • Good actions (A > 0): no extra gain past 1+ε       │
    │    • Bad actions (A < 0): no extra gain below 1-ε       │
    │                                                         │
    │  This prevents "overshooting" in either direction!     │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```
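
The effect of the `min` is easiest to see by enumerating the four combinations of advantage sign and ratio side. The sketch below is plain Python with hypothetical helper names (`ppo_term`, `is_flat`). It evaluates the per-sample objective and reports whether the objective is flat in r at that point, which would mean a zero gradient through the ratio.

```python
def clip(r, eps):
    return max(1.0 - eps, min(r, 1.0 + eps))

def ppo_term(r, adv, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r) * A)."""
    return min(r * adv, clip(r, eps) * adv)

def is_flat(r, adv, eps=0.2, h=1e-6):
    """True if nudging r does not change the objective (zero gradient)."""
    return abs(ppo_term(r + h, adv, eps) - ppo_term(r - h, adv, eps)) < 1e-12

cases = [
    (1.5, +1.0),  # good action, ratio already above 1 + eps
    (0.5, +1.0),  # good action, ratio below 1 - eps
    (1.5, -1.0),  # bad action, ratio above 1 + eps
    (0.5, -1.0),  # bad action, ratio already below 1 - eps
]
for r, adv in cases:
    print(f"r={r}, A={adv:+.0f}: objective={ppo_term(r, adv):+.2f}, flat={is_flat(r, adv)}")
```

Only the first and last cases are flat: the update has already moved in the direction the advantage favors, so there is nothing more to gain. In the other two cases the ratio moved the wrong way, the unclipped term is selected, and the gradient still pushes the policy back.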

In [None]:
def ppo_clip_objective(old_log_probs, new_log_probs, advantages, clip_epsilon=0.2):
    """
    Compute the PPO-Clip objective.
    
    Args:
        old_log_probs: Log probabilities from the OLD policy
        new_log_probs: Log probabilities from the NEW policy
        advantages: How much better than average each action was
        clip_epsilon: How much to clip (default 0.2 = 20%)
    
    Returns:
        The PPO loss (negative because we want to maximize)
    """
    # Step 1: Compute the probability RATIO
    # r(θ) = π_new(a|s) / π_old(a|s)
    # In log space: log(r) = log(π_new) - log(π_old)
    # So: r = exp(log(π_new) - log(π_old))
    ratio = torch.exp(new_log_probs - old_log_probs)
    
    # Step 2: Compute the CLIPPED ratio
    # Keep ratio between [1-ε, 1+ε]
    clipped_ratio = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon)
    
    # Step 3: Compute both objectives
    objective1 = ratio * advantages          # Unclipped
    objective2 = clipped_ratio * advantages  # Clipped
    
    # Step 4: Take the MINIMUM (pessimistic bound)
    # This prevents too large updates in either direction
    ppo_objective = torch.min(objective1, objective2)
    
    # Return negative because we want to MAXIMIZE the objective
    # (but optimizers MINIMIZE loss)
    loss = -ppo_objective.mean()
    
    return loss


# Demonstrate the loss function
print("PPO LOSS FUNCTION DEMONSTRATION")
print("="*60)

# Example data
old_log_probs = torch.tensor([-1.0, -0.5, -2.0])
new_log_probs = torch.tensor([-0.5, -0.3, -2.5])  # Policy changed
advantages = torch.tensor([1.0, -0.5, 2.0])  # Some good, some bad actions

# Compute ratio
ratio = torch.exp(new_log_probs - old_log_probs)
print(f"\nProbability ratios (π_new / π_old):")
for i, r in enumerate(ratio):
    change = "MORE likely" if r > 1 else "LESS likely"
    print(f"  Action {i}: ratio = {r:.2f} ({change})")

# Compute loss
loss = ppo_clip_objective(old_log_probs, new_log_probs, advantages)
print(f"\nPPO Loss: {loss:.4f}")
print("="*60)

---
## Actor-Critic Architecture

PPO uses an **Actor-Critic** architecture:

```
    ┌────────────────────────────────────────────────────────┐
    │              ACTOR-CRITIC ARCHITECTURE                  │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │                    ┌─────────┐                         │
    │                    │  State  │                         │
    │                    └────┬────┘                         │
    │                         │                              │
    │                    ┌────▼────┐                         │
    │                    │ Shared  │                         │
    │                    │ Layers  │                         │
    │                    └────┬────┘                         │
    │                   ╱     │     ╲                        │
    │                  ╱      │      ╲                       │
    │           ┌─────▼───┐       ┌───▼─────┐               │
    │           │  ACTOR  │       │ CRITIC  │               │
    │           │ (Policy)│       │ (Value) │               │
    │           └────┬────┘       └────┬────┘               │
    │                │                 │                     │
    │           ┌────▼────┐       ┌────▼────┐               │
    │           │ Action  │       │  V(s)   │               │
    │           │ Probs   │       │(Value)  │               │
    │           └─────────┘       └─────────┘               │
    │                                                         │
    │  ACTOR: "What action should I take?"                   │
    │  CRITIC: "How good is this state?"                     │
    │                                                         │
    │  They SHARE layers because state understanding is      │
    │  useful for both!                                      │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```
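
The "Action Probs" in the diagram are usually produced from raw scores (logits) rather than output directly. As a rough sketch of what a categorical distribution built from logits computes (softmax probabilities, the log-probability of a chosen action, and entropy), here is a plain-Python version. The function names are illustrative, and the logits are made-up values for a two-action problem like CartPole.

```python
import math

def softmax(logits):
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(logits, action):
    """Log-probability of one action under softmax(logits)."""
    return math.log(softmax(logits)[action])

def entropy(logits):
    """Shannon entropy of softmax(logits); highest when all actions are equally likely."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)

logits = [0.2, -0.1]   # hypothetical actor output for [LEFT, RIGHT]
probs = softmax(logits)
print("probs:   ", [round(p, 3) for p in probs])
print("log_prob:", round(log_prob(logits, 0), 3))
print("entropy: ", round(entropy(logits), 3))
```

The log-probability is what PPO stores at collection time to form the ratio later, and the entropy is the quantity an exploration bonus rewards.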

In [None]:
class ActorCriticNetwork(nn.Module):
    """
    Actor-Critic Network for PPO.
    
    Has two "heads":
    - Actor head: outputs action probabilities (policy)
    - Critic head: outputs state value V(s)
    
    They share early layers because understanding the state
    is useful for both tasks!
    """
    
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        
        # ========================================
        # SHARED LAYERS (Feature extraction)
        # ========================================
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),  # Tanh is common in policy networks
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh()
        )
        
        # ========================================
        # ACTOR HEAD (Policy: state → action probs)
        # ========================================
        self.actor = nn.Linear(hidden_dim, action_dim)
        # No softmax here - we'll use it when sampling
        
        # ========================================
        # CRITIC HEAD (Value: state → V(s))
        # ========================================
        self.critic = nn.Linear(hidden_dim, 1)
    
    def forward(self, x):
        """
        Forward pass through the network.
        
        Returns:
            action_logits: Raw scores for each action
            value: Estimated state value V(s)
        """
        # Shared feature extraction
        features = self.shared(x)
        
        # Actor head: action logits
        action_logits = self.actor(features)
        
        # Critic head: state value
        value = self.critic(features)
        
        return action_logits, value
    
    def get_action_and_value(self, state, action=None):
        """
        Get action, log probability, entropy, and value.
        
        If action is provided, compute its log prob.
        If not, sample a new action.
        """
        logits, value = self.forward(state)
        
        # Create categorical distribution from logits
        dist = torch.distributions.Categorical(logits=logits)
        
        # Sample action if not provided
        if action is None:
            action = dist.sample()
        
        # Compute log probability and entropy
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        
        return action, log_prob, entropy, value.squeeze(-1)


# Create and demonstrate the network
print("ACTOR-CRITIC NETWORK")
print("="*60)

network = ActorCriticNetwork(state_dim=4, action_dim=2)
print(f"\nNetwork architecture:")
print(network)

# Test with sample state
sample_state = torch.FloatTensor([[0.01, 0.02, -0.03, 0.04]])

with torch.no_grad():
    action, log_prob, entropy, value = network.get_action_and_value(sample_state)

print(f"\nSample state: {sample_state.numpy()[0]}")
print(f"\nOutputs:")
print(f"  Action sampled: {action.item()} ({'LEFT' if action.item() == 0 else 'RIGHT'})")
print(f"  Log probability: {log_prob.item():.4f}")
print(f"  Entropy: {entropy.item():.4f}")
print(f"  State value V(s): {value.item():.4f}")
print("="*60)

---
## Generalized Advantage Estimation (GAE)

PPO uses **GAE** to compute advantages, which balances bias and variance:

```
    ┌────────────────────────────────────────────────────────┐
    │      GENERALIZED ADVANTAGE ESTIMATION (GAE)             │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  ADVANTAGE = How much BETTER was this action than      │
    │              what we expected?                          │
    │                                                         │
    │  δ_t = r_t + γV(s_{t+1}) - V(s_t)    ← TD residual      │
    │        ↑     ↑                 ↑                        │
    │        │     │                 └── What we expected     │
    │        │     └── Future value (discounted)             │
    │        └── Immediate reward                            │
    │                                                         │
    │  GAE combines multiple TD residuals:                   │
    │                                                         │
    │  A^GAE_t = δ_t + (γλ)δ_{t+1} + (γλ)²δ_{t+2} + ...     │
    │                                                         │
    │  λ = 0: Use only one-step TD (high bias, low variance) │
    │  λ = 1: Use full returns (low bias, high variance)     │
    │  λ = 0.95: Good balance (typical choice)               │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```
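
The two limiting cases in the box can be checked numerically. The sketch below is a plain-Python GAE for a single episode that terminates at its last step. `gae_single_episode` is a hypothetical helper, and the rewards and value estimates are made-up numbers.

```python
def gae_single_episode(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one episode that terminates after the last reward.

    values[t] is V(s_t) for each visited state; the value after the
    terminal step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 0.5]   # hypothetical critic estimates
gamma = 0.9

# lambda = 0: each advantage is just the one-step TD residual.
adv_td = gae_single_episode(rewards, values, gamma, lam=0.0)
td_residuals = [1.0 + gamma * 1.5 - 2.0, 1.0 + gamma * 0.5 - 1.5, 1.0 - 0.5]

# lambda = 1: each advantage is the discounted return minus V(s_t).
adv_mc = gae_single_episode(rewards, values, gamma, lam=1.0)
returns = [1 + gamma * (1 + gamma * 1), 1 + gamma * 1, 1.0]
mc_advantages = [g - v for g, v in zip(returns, values)]

print("lambda=0:", adv_td, "vs TD residuals:", td_residuals)
print("lambda=1:", adv_mc, "vs return - V:  ", mc_advantages)
```

Values of λ between 0 and 1 interpolate between these two estimates, which is where the bias-variance trade-off comes from.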

In [None]:
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """
    Compute Generalized Advantage Estimation (GAE).
    
    Args:
        rewards: List of rewards [r_0, r_1, ..., r_T]
        values: List of value estimates [V(s_0), V(s_1), ..., V(s_T), V(s_{T+1})]
        dones: List of done flags [done_0, done_1, ..., done_T]
        gamma: Discount factor
        lam: GAE lambda (trade-off between bias and variance)
    
    Returns:
        advantages: GAE advantages for each timestep
        returns: Discounted returns (advantages + values)
    """
    advantages = []
    gae = 0
    
    # Work backwards through time
    for t in reversed(range(len(rewards))):
        # TD residual: δ_t = r_t + γV(s_{t+1}) - V(s_t)
        if dones[t]:
            # Terminal state: no future value
            delta = rewards[t] - values[t]
            gae = delta
        else:
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = delta + gamma * lam * gae
        
        advantages.insert(0, gae)
    
    advantages = torch.tensor(advantages, dtype=torch.float32)
    
    # Returns = advantages + values
    returns = advantages + torch.tensor(values[:-1], dtype=torch.float32)
    
    return advantages, returns


# Demonstrate GAE
print("GAE DEMONSTRATION")
print("="*60)

# Example trajectory
rewards = [1, 1, 1, 1, 10]  # Normal steps, then big reward
values = [5.0, 5.5, 6.0, 7.0, 9.0, 0.0]  # V(s) estimates (last is terminal)
dones = [False, False, False, False, True]

advantages, returns = compute_gae(rewards, values, dones)

print(f"\nRewards:    {rewards}")
print(f"Values:     {values[:-1]}")
print(f"Dones:      {dones}")
print(f"\nAdvantages: {advantages.tolist()}")
print(f"Returns:    {returns.tolist()}")

print("\n" + "-"*60)
print("Interpretation:")
print("  Positive advantage = Action was BETTER than expected")
print("  Negative advantage = Action was WORSE than expected")
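print("  The critic already predicted V = 9 before the final reward of 10,")
print("  so the last advantage is small; earlier steps, where V was further")
print("  below the eventual return, get the larger advantages.")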
print("="*60)

---
## The Complete PPO Agent

In [None]:
class PPOAgent:
    """
    Complete PPO Agent.
    
    PPO was used in the RLHF stage of training ChatGPT and in many other AI systems!
    """
    
    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99, 
                 lam=0.95, clip_epsilon=0.2, epochs=10, batch_size=64):
        """
        Args:
            state_dim: Size of state observations
            action_dim: Number of possible actions
            lr: Learning rate
            gamma: Discount factor
            lam: GAE lambda
            clip_epsilon: PPO clip parameter
            epochs: Number of epochs per update
            batch_size: Mini-batch size for updates
        """
        self.gamma = gamma
        self.lam = lam
        self.clip_epsilon = clip_epsilon
        self.epochs = epochs
        self.batch_size = batch_size
        
        # Actor-Critic network
        self.network = ActorCriticNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
        
        # Storage for trajectory
        self.states = []
        self.actions = []
        self.log_probs = []
        self.rewards = []
        self.values = []
        self.dones = []
    
    def select_action(self, state):
        """Select action and store data for later update."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        
        with torch.no_grad():
            action, log_prob, _, value = self.network.get_action_and_value(state_tensor)
        
        return action.item(), log_prob.item(), value.item()
    
    def store(self, state, action, log_prob, reward, value, done):
        """Store one step of interaction."""
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.rewards.append(reward)
        self.values.append(value)
        self.dones.append(done)
    
    def update(self, next_value=0):
        """
        Update the policy using collected trajectories.
        
        This is where the PPO magic happens!
        """
        # Add next value for GAE computation
        values = self.values + [next_value]
        
        # Compute advantages and returns using GAE
        advantages, returns = compute_gae(
            self.rewards, values, self.dones, self.gamma, self.lam
        )
        
        # Normalize advantages (important for stable training!)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # Convert to tensors
        states = torch.FloatTensor(np.array(self.states))
        actions = torch.LongTensor(self.actions)
        old_log_probs = torch.FloatTensor(self.log_probs)
        
        # ========================================
        # PPO UPDATE: Multiple epochs over the data
        # ========================================
        total_loss = 0
        n_samples = len(self.states)
        
        for epoch in range(self.epochs):
            # Random permutation for mini-batches
            indices = np.random.permutation(n_samples)
            
            for start in range(0, n_samples, self.batch_size):
                end = start + self.batch_size
                batch_indices = indices[start:end]
                
                # Get batch data
                batch_states = states[batch_indices]
                batch_actions = actions[batch_indices]
                batch_old_log_probs = old_log_probs[batch_indices]
                batch_advantages = advantages[batch_indices]
                batch_returns = returns[batch_indices]
                
                # Get new log probs and values
                _, new_log_probs, entropy, new_values = self.network.get_action_and_value(
                    batch_states, batch_actions
                )
                
                # ========================================
                # COMPUTE PPO LOSSES
                # ========================================
                
                # 1. Policy loss (PPO-Clip)
                ratio = torch.exp(new_log_probs - batch_old_log_probs)
                clipped_ratio = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon)
                policy_loss = -torch.min(
                    ratio * batch_advantages,
                    clipped_ratio * batch_advantages
                ).mean()
                
                # 2. Value loss (MSE between predicted and actual returns)
                value_loss = 0.5 * ((new_values - batch_returns) ** 2).mean()
                
                # 3. Entropy bonus (encourages exploration)
                entropy_loss = -0.01 * entropy.mean()
                
                # Total loss
                loss = policy_loss + 0.5 * value_loss + entropy_loss
                
                # Update network
                self.optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.network.parameters(), 0.5)
                self.optimizer.step()
                
                total_loss += loss.item()
        
        # Clear storage
        self.states = []
        self.actions = []
        self.log_probs = []
        self.rewards = []
        self.values = []
        self.dones = []
        
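        # Average over the actual number of mini-batch updates (ceiling division)
        n_batches = (n_samples + self.batch_size - 1) // self.batch_size
        return total_loss / (self.epochs * n_batches)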


print("PPO AGENT CREATED")
print("="*60)
print("""
PPO combines several key ideas:

1. ACTOR-CRITIC: Two outputs from one network
   • Actor: Outputs action probabilities
   • Critic: Outputs state value V(s)

2. GAE: Compute advantages with bias-variance trade-off

3. CLIPPING: Limit policy updates to stay stable

4. MULTIPLE EPOCHS: Reuse data for efficiency

5. ENTROPY BONUS: Encourage exploration
""")
print("="*60)

---
## Training PPO on CartPole

In [None]:
def train_ppo(env_name='CartPole-v1', n_episodes=500, rollout_length=2048, verbose=True):
    """
    Train a PPO agent.
    
    Args:
        env_name: Gymnasium environment name
        n_episodes: Maximum number of episodes
        rollout_length: Steps to collect before each update
        verbose: Whether to print progress
    
    Returns:
        agent: Trained PPO agent
        rewards_history: List of episode rewards
    """
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    agent = PPOAgent(state_dim, action_dim)
    
    rewards_history = []
    episode_reward = 0
    step_count = 0
    episode = 0
    
    state, _ = env.reset()
    
    if verbose:
        print("TRAINING PPO ON CARTPOLE")
        print("="*60)
    
    while episode < n_episodes:
        # ========================================
        # COLLECT ROLLOUT
        # ========================================
        for _ in range(rollout_length):
            # Select action
            action, log_prob, value = agent.select_action(state)
            
            # Take step in environment
            next_state, reward, terminated, truncated, _ = env.step(action)
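            # Simplification: a time-limit truncation is treated like a termination,
            # so GAE does not bootstrap from V(s) at a truncated step.
            done = terminated or truncated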
            
            # Store experience
            agent.store(state, action, log_prob, reward, value, done)
            
            episode_reward += reward
            step_count += 1
            
            if done:
                rewards_history.append(episode_reward)
                episode += 1
                
                if verbose and episode % 50 == 0:
                    avg_reward = np.mean(rewards_history[-50:])
                    print(f"Episode {episode:4d} | Avg Reward (last 50): {avg_reward:.1f}")
                
                episode_reward = 0
                state, _ = env.reset()
                
                if episode >= n_episodes:
                    break
            else:
                state = next_state
        
        # ========================================
        # UPDATE POLICY
        # ========================================
        if len(agent.states) > 0:
            # Get value of last state for GAE
            with torch.no_grad():
                _, _, _, next_value = agent.network.get_action_and_value(
                    torch.FloatTensor(state).unsqueeze(0)
                )
                next_value = next_value.item() if not done else 0
            
            agent.update(next_value)
    
    env.close()
    
    if verbose:
        print("="*60)
        print(f"Training complete!")
        final_avg = np.mean(rewards_history[-100:]) if len(rewards_history) >= 100 else np.mean(rewards_history)
        print(f"Final average reward (last 100): {final_avg:.1f}")
        print(f"Solved threshold: 475.0")  # CartPole-v1 (195.0 applies to CartPole-v0)
        print(f"Status: {'SOLVED!' if final_avg >= 475 else 'Keep training...'}")
        print("="*60)
    
    return agent, rewards_history

# Train the agent
agent, rewards = train_ppo(n_episodes=300)

In [None]:
# Plot training progress
plt.figure(figsize=(12, 5))

plt.plot(rewards, alpha=0.3, color='blue', label='Raw')

# Smoothed
window = 20
if len(rewards) >= window:
    smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
    plt.plot(range(window-1, len(rewards)), smoothed, color='blue', 
             linewidth=2, label=f'Smoothed (window={window})')

plt.axhline(y=475, color='green', linestyle='--', linewidth=2, label='Solved (475, CartPole-v1)')

plt.xlabel('Episode', fontsize=12)
plt.ylabel('Total Reward', fontsize=12)
plt.title('PPO Training on CartPole', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
## PPO in the Real World: RLHF for ChatGPT

PPO was the RL algorithm used in the RLHF stage of ChatGPT's training, and it has been applied to other language models as well!

```
    ┌────────────────────────────────────────────────────────┐
    │           PPO FOR LANGUAGE MODELS (RLHF)                │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  In CartPole:                                          │
    │    • State: Cart position, pole angle, etc.            │
    │    • Action: Push LEFT or RIGHT                        │
    │    • Reward: +1 for each timestep balanced             │
    │                                                         │
    │  In ChatGPT:                                           │
    │    • State: Conversation so far                        │
    │    • Action: Next token to generate                    │
    │    • Reward: Human preference (helpful? harmless?)     │
    │                                                         │
    │  The SAME algorithm, just different state/action/reward!│
    │                                                         │
    │  Why PPO for LLMs?                                     │
    │    • Stable: Won't break the language model            │
    │    • Sample efficient: Uses each experience well       │
    │    • Proven: Works at scale!                           │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```

---
## Summary: Key Takeaways

### What is PPO?

PPO = Policy Gradient + Clipping + Multiple Epochs

### The Core Idea

| Concept | Description |
|---------|-------------|
| **Probability Ratio** | r(θ) = π_new / π_old (how much did policy change?) |
| **Clipping** | Keep ratio between [1-ε, 1+ε] |
| **PPO Objective** | min(r × A, clip(r) × A) |

### Why PPO Works

1. **Clipping discourages catastrophic updates** - no incentive to push the ratio far from 1
2. **Multiple epochs** - uses data efficiently
3. **Simple to implement** - no complex constraints
4. **Stable** - works across many environments
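
Points 1 and 2 can be seen together in a toy setting. The sketch below repeatedly applies gradient ascent to a single (state, action) sample with a positive advantage, once with the clipped objective and once without. Everything here is an assumption for illustration: a one-logit sigmoid policy, the hypothetical function `run_epochs`, and the chosen `lr` and `n_steps`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def run_epochs(clip_eps, n_steps=200, lr=0.1, advantage=1.0):
    """Repeated gradient ascent on one (state, action) sample.

    Policy: pi(a) = sigmoid(z). Pass clip_eps=None for the unclipped objective.
    Returns the final ratio pi_new(a) / pi_old(a).
    """
    z = 0.0
    p_old = sigmoid(z)                      # 0.5, frozen at data-collection time
    for _ in range(n_steps):
        p = sigmoid(z)
        ratio = p / p_old
        # With A > 0, the clipped objective is flat once ratio > 1 + eps.
        if clip_eps is not None and ratio > 1.0 + clip_eps:
            grad = 0.0
        else:
            grad = advantage * p * (1.0 - p) / p_old   # d(ratio * A) / dz
        z += lr * grad
    return sigmoid(z) / p_old

clipped_ratio = run_epochs(clip_eps=0.2)
unclipped_ratio = run_epochs(clip_eps=None)
print(f"final ratio with clipping:    {clipped_ratio:.3f}")
print(f"final ratio without clipping: {unclipped_ratio:.3f}")
```

With clipping, the ratio settles just past 1 + ε and then stops moving, however many further steps are taken on the same sample. Without it, the same data keeps pushing the probability toward 1. The clipped ratio can end slightly above 1.2 because the last step before the gradient vanishes may overshoot: clipping removes the incentive to move further rather than enforcing a hard bound.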

### Key Hyperparameters

| Parameter | Typical Value | Meaning |
|-----------|---------------|----------|
| ε (clip) | 0.2 | Ratio is clipped to [0.8, 1.2] in the objective |
| λ (GAE) | 0.95 | Advantage estimation smoothing |
| epochs | 10 | Reuse each batch 10 times |
| γ | 0.99 | Discount factor |
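
One way to keep these values together in code is a small configuration object. The `PPOConfig` class below is a hypothetical container, not part of any library. It also derives two quantities that help when reading the table: the clip range implied by ε, and the common rule-of-thumb horizon 1 / (1 - γ).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    """Hypothetical container for the hyperparameters in the table above."""
    clip_epsilon: float = 0.2
    gae_lambda: float = 0.95
    epochs: int = 10
    gamma: float = 0.99

    @property
    def clip_range(self):
        """Interval of ratios inside which the objective is unclipped."""
        return (1.0 - self.clip_epsilon, 1.0 + self.clip_epsilon)

    @property
    def effective_horizon(self):
        """Rough number of steps over which rewards matter: 1 / (1 - gamma)."""
        return 1.0 / (1.0 - self.gamma)

config = PPOConfig()
print(config)
print("clip range:", config.clip_range)
print("effective horizon:", round(config.effective_horizon))
```

Making the dataclass frozen is a design choice for this sketch: it prevents hyperparameters from being changed by accident in the middle of a training run.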

---
## Test Your Understanding

**1. What problem does PPO solve?**
<details>
<summary>Click to reveal answer</summary>
PPO solves the problem of unstable policy updates in policy gradient methods. Without constraints, policy updates can be too large, causing the policy to "overshoot" and lose previously learned behavior. PPO clips the objective to prevent this.
</details>

**2. What is the probability ratio r(θ)?**
<details>
<summary>Click to reveal answer</summary>
r(θ) = π_new(a|s) / π_old(a|s). It measures how much more (or less) likely an action is under the new policy compared to the old policy. If r = 1.0, the policy hasn't changed. If r = 2.0, the action is twice as likely now.
</details>

**3. What does clipping do in PPO?**
<details>
<summary>Click to reveal answer</summary>
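Clipping restricts the ratio used in the objective to [1-ε, 1+ε]. For example, with ε=0.2, the clipped ratio stays between 0.8 and 1.2. Once the real ratio moves past that range in the direction the advantage favors, the objective stops improving, so there is no incentive for large, destabilizing changes.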
</details>

**4. Why does PPO use multiple epochs?**
<details>
<summary>Click to reveal answer</summary>
PPO updates the policy multiple times (epochs) on the same batch of data to use it more efficiently. Unlike vanilla policy gradient, which uses each sample once, PPO can reuse data with less risk because the clipping removes the incentive to move the policy far from the one that collected the data.
</details>

**5. How is PPO used to train ChatGPT?**
<details>
<summary>Click to reveal answer</summary>
In RLHF (Reinforcement Learning from Human Feedback), PPO is used with: State = conversation so far, Action = next token to generate, Reward = score from a reward model trained on human preferences. The clipping is especially important to prevent the language model from changing too much and losing its language abilities.
</details>

---
## What's Next?

Excellent work! You've implemented PPO from scratch!

In the next notebook, we'll use the **Stable-Baselines3** library which provides highly optimized implementations:
- PPO with vectorized environments
- Built-in logging and monitoring
- Easy hyperparameter tuning

**Continue to:** [Notebook 3: PPO with Stable-Baselines3](03_ppo_with_stable_baselines.ipynb)

---

*You've just implemented the same core algorithm that was used in the RLHF stage of ChatGPT's training! The principles carry over, just with different states (text instead of cart position) and rewards (human preferences instead of balance time).*