# Part 6.3: Policy Gradient Methods — The Formula 1 Edition

DQN learns a value function and derives a policy from it. But this approach has fundamental limitations: it can only handle discrete actions, and it can't learn stochastic policies. **Policy gradient methods** take a radically different approach — they parameterize the policy directly as a neural network $\pi_\theta(a|s)$ and optimize it using gradient ascent on expected reward.

This is the family of methods that includes **PPO**, the algorithm widely used for RLHF, including in the InstructGPT work that preceded ChatGPT. Understanding policy gradients means understanding a core mechanism used to fine-tune modern AI assistants toward helpful behavior.

**The F1 Connection:** Policy gradient methods are like training a race strategy *directly* by reinforcing good race decisions and suppressing bad ones. Instead of building a lookup table of Q-values (DQN), the network directly outputs: "Given P3 with worn tires and rain approaching, the probability of each action is: pit for inters 65%, push one more lap 25%, conserve 10%." The strategy is learned by reinforcing decisions that led to good race outcomes and penalizing those that didn't. This is how a driver builds racecraft — not by memorizing a table, but by developing instincts through thousands of laps of experience.

## Learning Objectives

- [ ] Derive the policy gradient theorem and understand why it works
- [ ] Implement REINFORCE, the simplest policy gradient algorithm
- [ ] Understand the high-variance problem and why baselines help
- [ ] Implement a variance-reducing baseline
- [ ] Build an actor-critic architecture from scratch
- [ ] Understand the advantage function and why it's central to modern RL
- [ ] Implement Advantage Actor-Critic (A2C)
- [ ] Compare policy gradient methods against DQN
- [ ] Recognize why trust regions matter (motivation for PPO)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import defaultdict
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

np.random.seed(42)
torch.manual_seed(42)

print("Part 6.3: Policy Gradient Methods")
print("=" * 50)

---

## 1. The Policy Gradient Theorem

### The Key Idea

Instead of learning Q-values and deriving a policy, we directly parameterize a policy:

$$\pi_\theta(a|s) = P(a|s; \theta)$$

and optimize the **expected return**:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

We want $\nabla_\theta J(\theta)$ so we can do gradient ascent. But the expectation is over trajectories sampled from the policy — how do we differentiate through sampling?
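One standard answer is the log-derivative (score-function) trick. The sketch below uses shorthand that is not defined elsewhere in this notebook: $p_\theta(\tau)$ for the probability of a trajectory under the policy, $R(\tau)$ for its discounted return, $\rho_0$ for the initial-state distribution, and $P$ for the environment dynamics.

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right]$$

The middle step uses the identity $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$. The trajectory log-probability splits into policy terms and environment terms:

$$\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T} \left[\log \pi_\theta(a_t|s_t) + \log P(s_{t+1}|s_t, a_t)\right]$$

Only the policy terms depend on $\theta$, so $\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)$. The gradient becomes an expectation we can estimate from sampled trajectories, with no need to differentiate through the environment. Because an action cannot influence rewards received before it, $R(\tau)$ can then be replaced by the return from time $t$ onward.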

### The Policy Gradient Theorem

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time $t$.
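As a minimal sketch of this definition, the returns for a whole episode can be computed in one backward pass using $G_t = r_t + \gamma G_{t+1}$. The function name `discounted_returns` and the toy inputs are illustrative choices, not part of the notebook's code.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] where G_t = sum_{k>=t} gamma^(k-t) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Earlier timesteps accumulate more future reward, which is why early actions in a long episode receive the largest weights.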

**In words**: To improve the policy, increase the log-probability of actions that led to high returns, and decrease it for actions with low returns.

### Intuitive Explanation

Think of it like training a comedian:
- Try different jokes (sample actions from policy)
- Note which ones get big laughs (high returns)
- Tell those jokes more often (increase their probability)
- Stop telling jokes that bomb (decrease their probability)

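The $\nabla_\theta \log \pi_\theta$ term gives the direction in parameter space that makes the sampled action more likely; $G_t$ sets the sign and size of the step along that direction.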

**F1 analogy:** Imagine a rookie driver learning racecraft over their first season. Each race, they make decisions: pit early at Barcelona (gained 3 positions — great return), pushed too hard at Spa (spun off — terrible return), defended conservatively at Monza (maintained position — decent return). The policy gradient says: *increase the probability of early pitting in similar situations, decrease the probability of pushing on worn tires in high-speed corners, keep the defensive approach at power circuits*. Over many races, the driver's instincts (policy) converge toward the strategy that maximizes championship points. The $\log \pi_\theta$ is how to adjust the driving instincts, and $G_t$ is whether the race outcome was good or bad.
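To see that the sampled estimator really does point the right way, here is a small sanity check on a one-step problem (a three-armed bandit), where the exact gradient is available in closed form. Everything in it is an assumption made for illustration: the softmax-over-logits policy, the fixed per-action rewards, and the helper names. It uses only the standard library.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """dJ/dtheta_k = pi_k * (r_k - J) for J = sum_a pi_a * r_a."""
    pi = softmax(logits)
    J = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - J) for p, r in zip(pi, rewards)]


def sampled_gradient(logits, rewards, n_samples, seed=0):
    """Average of grad log pi(a) * r_a over actions sampled from the policy."""
    rng = random.Random(seed)
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    actions = rng.choices(range(len(logits)), weights=pi, k=n_samples)
    for a in actions:
        for k in range(len(logits)):
            # d/dtheta_k log pi(a) = 1[k == a] - pi_k for a softmax policy
            score = (1.0 if k == a else 0.0) - pi[k]
            grad[k] += score * rewards[a]
    return [g / n_samples for g in grad]


logits = [0.0, 0.0, 0.0]    # uniform policy over three actions
rewards = [-2.0, 5.0, 1.0]  # hypothetical deterministic reward per action
exact = exact_gradient(logits, rewards)
estimate = sampled_gradient(logits, rewards, n_samples=100_000)
print("exact:  ", [round(g, 3) for g in exact])
print("sampled:", [round(g, 3) for g in estimate])
```

The two vectors should agree closely, and the largest positive component belongs to the highest-reward action. With far fewer samples the estimate gets visibly noisier, which previews the variance problem discussed later.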

### Visualization: How Policy Gradients Work

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Before update: uniform-ish policy
actions = ['Left', 'Right', 'Up', 'Down']
probs_before = [0.25, 0.25, 0.25, 0.25]
returns = [-2.0, 5.0, 1.0, -1.0]  # Returns for each action in this state

colors = ['#e74c3c' if r < 0 else '#2ecc71' for r in returns]

axes[0].bar(actions, probs_before, color='#95a5a6', edgecolor='black')
axes[0].set_ylabel('π(a|s)', fontsize=12)
axes[0].set_title('Before Update', fontsize=13, fontweight='bold')
axes[0].set_ylim(0, 0.7)
axes[0].grid(True, alpha=0.3, axis='y')

# Gradient signal
axes[1].bar(actions, returns, color=colors, edgecolor='black')
axes[1].set_ylabel('Return G_t', fontsize=12)
axes[1].set_title('Gradient Signal (Returns)', fontsize=13, fontweight='bold')
axes[1].axhline(y=0, color='black', linewidth=0.5)
axes[1].grid(True, alpha=0.3, axis='y')
for i, r in enumerate(returns):
    axes[1].text(i, r + (0.2 if r >= 0 else -0.4), f'{r:+.1f}',
                ha='center', fontweight='bold', fontsize=11)

# After update: shifted toward high-return actions
# (illustrative: equivalent to adding 0.3 * return to each action's logit)
weights = np.array(probs_before) * np.exp(np.array(returns) * 0.3)
probs_after = weights / weights.sum()

bar_colors = ['#e74c3c' if r < 0 else '#2ecc71' for r in returns]
axes[2].bar(actions, probs_after, color=bar_colors, edgecolor='black')
axes[2].set_ylabel('π(a|s)', fontsize=12)
axes[2].set_title('After Update', fontsize=13, fontweight='bold')
axes[2].set_ylim(0, 0.7)
axes[2].grid(True, alpha=0.3, axis='y')

# Annotations
for i, (before, after) in enumerate(zip(probs_before, probs_after)):
    change = after - before
    symbol = '↑' if change > 0 else '↓'
    axes[2].text(i, after + 0.03, f'{symbol}{abs(change):.2f}',
                ha='center', fontsize=9, color='#2c3e50')

plt.suptitle('Policy Gradient: Increase Probability of High-Return Actions',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

---

## 2. CartPole Environment (Reused from Notebook 18)

In [None]:
class CartPoleSimple:
    """Simplified CartPole environment."""
    
    def __init__(self):
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.total_mass = self.masscart + self.masspole
        self.length = 0.5
        self.polemass_length = self.masspole * self.length
        self.force_mag = 10.0
        self.tau = 0.02
        self.x_threshold = 2.4
        self.theta_threshold = 12 * np.pi / 180
        self.state_dim = 4
        self.n_actions = 2
        self.state = None
    
    def reset(self):
        self.state = np.random.uniform(-0.05, 0.05, size=4)
        return self.state.copy()
    
    def step(self, action):
        x, x_dot, theta, theta_dot = self.state
        force = self.force_mag if action == 1 else -self.force_mag
        cos_theta = np.cos(theta)
        sin_theta = np.sin(theta)
        temp = (force + self.polemass_length * theta_dot**2 * sin_theta) / self.total_mass
        theta_acc = (self.gravity * sin_theta - cos_theta * temp) / (
            self.length * (4.0/3.0 - self.masspole * cos_theta**2 / self.total_mass))
        x_acc = temp - self.polemass_length * theta_acc * cos_theta / self.total_mass
        x += self.tau * x_dot
        x_dot += self.tau * x_acc
        theta += self.tau * theta_dot
        theta_dot += self.tau * theta_acc
        self.state = np.array([x, x_dot, theta, theta_dot])
        done = (abs(x) > self.x_threshold or abs(theta) > self.theta_threshold)
        reward = 1.0 if not done else 0.0
        return self.state.copy(), reward, done


env = CartPoleSimple()
print(f"State dim: {env.state_dim}, Actions: {env.n_actions}")

---

## 3. REINFORCE: The Simplest Policy Gradient

REINFORCE (Williams, 1992) is the most direct implementation of the policy gradient theorem:

1. Run the policy for a complete episode, collecting $(s_t, a_t, r_t)$
2. Compute returns $G_t$ for each timestep
3. Compute the policy gradient: $\nabla_\theta J \approx \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$
4. Update: $\theta \leftarrow \theta + \alpha \nabla_\theta J$

In [None]:
class PolicyNetwork(nn.Module):
    """Simple policy network: outputs action probabilities."""
    
    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions)
        )
    
    def forward(self, x):
        logits = self.net(x)
        return F.softmax(logits, dim=-1)
    
    def get_action(self, state):
        """Sample an action from the policy."""
        state = torch.FloatTensor(state).unsqueeze(0)
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)


class REINFORCE:
    """REINFORCE policy gradient algorithm."""
    
    def __init__(self, state_dim, n_actions, lr=1e-3, gamma=0.99):
        self.gamma = gamma
        self.policy = PolicyNetwork(state_dim, n_actions)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        
        # Episode storage
        self.log_probs = []
        self.rewards = []
    
    def select_action(self, state):
        """Select action and store log probability."""
        action, log_prob = self.policy.get_action(state)
        self.log_probs.append(log_prob)
        return action
    
    def store_reward(self, reward):
        self.rewards.append(reward)
    
    def update(self):
        """Update policy after a complete episode."""
        # Compute discounted returns (backwards)
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        
        returns = torch.FloatTensor(returns)
        
        # Normalize returns (variance reduction trick)
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
        # Policy gradient loss: -log_prob * return (negative for gradient ascent)
        policy_loss = []
        for log_prob, G in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * G)
        
        loss = torch.stack(policy_loss).sum()
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Clear episode data
        self.log_probs = []
        self.rewards = []
        
        return loss.item()


def train_reinforce(env, agent, n_episodes=1000, max_steps=500):
    """Train a REINFORCE agent."""
    episode_rewards = []
    episode_lengths = []
    losses = []
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        
        for step in range(max_steps):
            action = agent.select_action(state)
            next_state, reward, done = env.step(action)
            agent.store_reward(reward)
            total_reward += reward
            state = next_state
            if done:
                break
        
        loss = agent.update()
        losses.append(loss)
        episode_rewards.append(total_reward)
        episode_lengths.append(step + 1)
        
        if (episode + 1) % 100 == 0:
            avg_len = np.mean(episode_lengths[-100:])
            print(f"Episode {episode+1:4d} | Avg Length: {avg_len:6.1f}")
    
    return episode_rewards, episode_lengths, losses


# Train REINFORCE
env = CartPoleSimple()
reinforce_agent = REINFORCE(state_dim=4, n_actions=2, lr=1e-3, gamma=0.99)
r_reinforce, l_reinforce, loss_reinforce = train_reinforce(env, reinforce_agent, n_episodes=1000)

### Visualization: REINFORCE Training

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

window = 50

# Episode lengths
ax = axes[0]
ax.plot(l_reinforce, alpha=0.2, color='#3498db')
smoothed = np.convolve(l_reinforce, np.ones(window)/window, mode='valid')
ax.plot(range(window-1, len(l_reinforce)), smoothed, color='#2c3e50', linewidth=2,
        label=f'{window}-ep avg')
ax.axhline(y=200, color='red', linestyle='--', label='Goal')
ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel('Episode Length', fontsize=12)
ax.set_title('REINFORCE: Learning to Balance', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# High variance illustration
ax = axes[1]
# Show rolling std of episode lengths
if len(l_reinforce) >= window:
    rolling_mean = np.convolve(l_reinforce, np.ones(window)/window, mode='valid')
    rolling_std = np.array([np.std(l_reinforce[i-window+1:i+1])
                            for i in range(window-1, len(l_reinforce))])
    x = range(window-1, len(l_reinforce))
    ax.fill_between(x, rolling_mean - rolling_std, rolling_mean + rolling_std,
                    alpha=0.3, color='#e74c3c', label='±1 std')
    ax.plot(x, rolling_mean, color='#2c3e50', linewidth=2, label='Mean')

ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel('Episode Length', fontsize=12)
ax.set_title('REINFORCE: High Variance Problem', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice the high variance — REINFORCE learns, but noisily.")
print("This is because it uses complete episode returns as the gradient signal.")

---

## 4. The Variance Problem and Baselines

REINFORCE suffers from **high variance** because the return $G_t$ can vary wildly between episodes, even for the same state-action pair. This means the gradient estimates are noisy, leading to slow and unstable learning.

### Baselines to the Rescue

We can subtract any function $b(s)$ that doesn't depend on the action without biasing the gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t))\right]$$

**Why no bias?** Because $\mathbb{E}_a[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0$
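The identity can be checked numerically. The sketch below assumes a softmax policy over a handful of logits (an illustrative choice) and computes the expectation exactly by summing over all actions rather than sampling; the helper names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_baseline_term(logits, baseline):
    """E_a[grad_theta log pi(a|s) * b(s)], computed exactly over all actions."""
    pi = softmax(logits)
    n = len(logits)
    term = [0.0] * n
    for a in range(n):
        for k in range(n):
            score = (1.0 if k == a else 0.0) - pi[k]  # d log pi(a) / d theta_k
            term[k] += pi[a] * score * baseline
    return term


# Every component is zero up to floating-point rounding, whatever the baseline
print(expected_baseline_term([0.3, -1.2, 2.0], baseline=50.0))
```

Changing `baseline` to any other constant leaves the result at zero, which is why the baseline can be chosen purely to reduce variance.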

The **optimal baseline** turns out to be close to $V(s)$ — the expected return from that state. This gives us:

$$G_t - b(s_t) \approx G_t - V(s_t)$$

This is a single-sample estimate of the **advantage**: how much better this action's return was than what we expected from the state. Actions with positive advantage get reinforced; actions with negative advantage get suppressed.

**F1 analogy:** Without a baseline, the strategist reinforces *every* decision from a race that scored points — even the bad ones, because the overall return was positive. At a race where you finished P4 (great!), every pit stop call gets reinforced, including the one where you stayed out too long on worn tires. With a baseline (expected result for P4-quality car = P5), only the decisions that made you *better than expected* get reinforced. The baseline says "P5 was expected, so only the specific decisions that got you from P5 to P4 deserve credit." This is the advantage — how much better was this decision compared to the average outcome from this situation?

In [None]:
# Demonstrate the variance reduction from baselines
np.random.seed(42)

# Simulate returns for a state where the true value is ~50
n_samples = 1000
true_value = 50
returns = np.random.normal(true_value, 30, n_samples)  # High variance returns

# Score d/dz_0 log pi(a|s) for a uniform two-action softmax policy:
# +0.5 if action 0 was sampled, -0.5 if action 1 was sampled
sampled_actions = np.random.randint(0, 2, n_samples)
score = np.where(sampled_actions == 0, 0.5, -0.5)

# Weight on the score: G_t without a baseline, G_t - V(s) with one.
# Subtracting a constant leaves Var(G_t) unchanged; what shrinks is the
# variance of the gradient estimate score * weight.
weight_no_baseline = returns
weight_with_baseline = returns - true_value
grad_no_baseline = score * weight_no_baseline
grad_with_baseline = score * weight_with_baseline

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(weight_no_baseline, bins=50, color='#e74c3c', alpha=0.7, edgecolor='black')
axes[0].axvline(x=0, color='black', linewidth=2)
axes[0].set_xlabel('Weight on score (G_t)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title(f'Without Baseline\ngradient std = {np.std(grad_no_baseline):.1f}',
                  fontsize=13, fontweight='bold')
axes[0].text(0.05, 0.95, 'Almost all positive!\nEvery action gets reinforced',
            transform=axes[0].transAxes, fontsize=10, va='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

axes[1].hist(weight_with_baseline, bins=50, color='#2ecc71', alpha=0.7, edgecolor='black')
axes[1].axvline(x=0, color='black', linewidth=2)
axes[1].set_xlabel('Weight on score (G_t - V(s))', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title(f'With Baseline\ngradient std = {np.std(grad_with_baseline):.1f}',
                  fontsize=13, fontweight='bold')
axes[1].text(0.05, 0.95, 'Centered at zero!\nOnly above-average\nactions reinforced',
            transform=axes[1].transAxes, fontsize=10, va='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.suptitle('Why Baselines Reduce Variance', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

variance_ratio = np.var(grad_no_baseline) / np.var(grad_with_baseline)
print(f"Gradient variance without baseline: {np.var(grad_no_baseline):.1f}")
print(f"Gradient variance with baseline:    {np.var(grad_with_baseline):.1f}")
print(f"Same expected gradient, but {variance_ratio:.1f}x lower variance!")

---

## 5. REINFORCE with Baseline

Let's add a learned baseline (value function) to REINFORCE. This is the first step toward actor-critic methods.

In [None]:
class ValueNetwork(nn.Module):
    """Critic network: estimates V(s)."""
    
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, x):
        return self.net(x).squeeze(-1)


class REINFORCEWithBaseline:
    """REINFORCE with a learned value function baseline."""
    
    def __init__(self, state_dim, n_actions, lr_policy=1e-3, lr_value=1e-3, gamma=0.99):
        self.gamma = gamma
        
        # Actor (policy)
        self.policy = PolicyNetwork(state_dim, n_actions)
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr_policy)
        
        # Baseline (value function)
        self.value_net = ValueNetwork(state_dim)
        self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=lr_value)
        
        self.log_probs = []
        self.rewards = []
        self.states = []
    
    def select_action(self, state):
        action, log_prob = self.policy.get_action(state)
        self.log_probs.append(log_prob)
        self.states.append(state)
        return action
    
    def store_reward(self, reward):
        self.rewards.append(reward)
    
    def update(self):
        # Compute returns
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        returns = torch.FloatTensor(returns)
        
        # Get baseline values
        states = torch.FloatTensor(np.array(self.states))
        values = self.value_net(states).detach()
        
        # Advantages = returns - baseline
        advantages = returns - values
        if len(advantages) > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # Policy loss (REINFORCE with baseline)
        policy_loss = []
        for log_prob, adv in zip(self.log_probs, advantages):
            policy_loss.append(-log_prob * adv)
        policy_loss = torch.stack(policy_loss).sum()
        
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()
        
        # Value loss (fit baseline to returns)
        value_predictions = self.value_net(states)
        value_loss = F.mse_loss(value_predictions, returns)
        
        self.value_optimizer.zero_grad()
        value_loss.backward()
        self.value_optimizer.step()
        
        # Clear
        self.log_probs = []
        self.rewards = []
        self.states = []
        
        return policy_loss.item(), value_loss.item()


# Train REINFORCE with baseline
env = CartPoleSimple()
baseline_agent = REINFORCEWithBaseline(state_dim=4, n_actions=2)

r_baseline = []
l_baseline = []

for episode in range(1000):
    state = env.reset()
    total_reward = 0
    
    for step in range(500):
        action = baseline_agent.select_action(state)
        next_state, reward, done = env.step(action)
        baseline_agent.store_reward(reward)
        total_reward += reward
        state = next_state
        if done:
            break
    
    baseline_agent.update()
    r_baseline.append(total_reward)
    l_baseline.append(step + 1)
    
    if (episode + 1) % 100 == 0:
        print(f"Episode {episode+1:4d} | Avg Length: {np.mean(l_baseline[-100:]):6.1f}")

In [None]:
# Compare REINFORCE vs REINFORCE with baseline
fig, ax = plt.subplots(1, 1, figsize=(12, 6))

window = 50
for data, label, color in [(l_reinforce, 'REINFORCE (no baseline)', '#e74c3c'),
                            (l_baseline, 'REINFORCE + Baseline', '#2ecc71')]:
    smoothed = np.convolve(data, np.ones(window)/window, mode='valid')
    ax.plot(smoothed, label=label, color=color, linewidth=2.5)

ax.axhline(y=200, color='gray', linestyle='--', alpha=0.5, label='Goal')
ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel('Episode Length (smoothed)', fontsize=12)
ax.set_title('Baselines Reduce Variance and Speed Up Learning', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

## 6. The Advantage Function

The **advantage function** is the difference between the Q-value and the state value:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

**Interpretation**: "How much better is action $a$ compared to the average action in state $s$?"

- $A > 0$: This action is better than average → increase its probability
- $A < 0$: This action is worse than average → decrease its probability
- $A = 0$: This action is exactly average

The advantage is **the** central concept in modern policy gradient methods. PPO, A2C, A3C, and TRPO all use the advantage to compute policy gradients.
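A quick worked example may help pin down what "average" means here: $V^\pi(s)$ is the policy-weighted mean of the Q-values, so the advantages always average to zero under the policy. The Q-values, the policy, and the helper name below are made up for illustration.

```python
def advantages(q_values, policy_probs):
    """Return (V, [A(s, a)]) for one state, given Q(s, a) and pi(a|s)."""
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    return v, [q - v for q in q_values]


q = [3.0, 5.0, 8.0]     # hypothetical Q(s, a) for three actions
pi = [0.5, 0.25, 0.25]  # hypothetical non-uniform policy
v, adv = advantages(q, pi)
print(v)    # 4.75
print(adv)  # [-1.75, 0.25, 3.25]

# Weighted by the policy, the advantages cancel out exactly
print(sum(p * a for p, a in zip(pi, adv)))  # 0.0
```

Because the policy favors the worst action here, $V$ sits below the plain average of the Q-values, and the middle action comes out slightly better than average.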

**F1 analogy:** The advantage answers: "Was pitting on lap 25 better or worse than our average action from that position?" Suppose you're P3 with worn tires and the average outcome from this state is a P4 finish (V(s) = 12 points). If pitting on lap 25 led to P2 (Q(s, pit_lap_25) = 18 points), the advantage is +6: strongly reinforce that decision. If staying out led to P6 (Q(s, stay_out) = 8 points), the advantage is -4: suppress that decision. The advantage strips away "were we in a good position?" and focuses purely on "did THIS specific decision help or hurt?"

### Estimating the Advantage

We don't know $Q(s,a)$ directly, but we can estimate the advantage using the TD error:

$$\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

This is a one-step estimate. In Notebook 20, we'll see **Generalized Advantage Estimation (GAE)**, which blends between one-step and full-return estimates.
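As a rough sketch of the one-step estimate, the function below turns a short rollout into TD advantages. The function name, the argument layout, and the toy numbers are assumptions for illustration; terminal transitions drop the bootstrap term because there is no future value after the episode ends.

```python
def td_advantages(rewards, values, last_value, dones, gamma):
    """One-step estimates: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    values[t] is V(s_t); last_value is V of the state after the final step.
    dones[t] is True when s_{t+1} is terminal, so its value is treated as 0.
    """
    next_values = list(values[1:]) + [last_value]
    result = []
    for r, v, v_next, done in zip(rewards, values, next_values, dones):
        bootstrap = 0.0 if done else gamma * v_next
        result.append(r + bootstrap - v)
    return result


adv = td_advantages(
    rewards=[1.0, 1.0, 0.0],
    values=[2.0, 1.5, 0.5],
    last_value=0.0,
    dones=[False, False, True],
    gamma=0.5,
)
print(adv)  # [-0.25, -0.25, -0.5]
```

All three estimates are negative here because the hypothetical critic was overly optimistic about every state it visited.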

In [None]:
# Visualize the advantage concept
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Q, V, and A relationship
ax = axes[0]
actions = ['Left', 'Stay', 'Right']
q_values = [3.0, 5.0, 8.0]
v_value = 5.33  # Average Q
advantages = [q - v_value for q in q_values]

x = np.arange(len(actions))
bars_q = ax.bar(x - 0.15, q_values, 0.3, label='Q(s,a)', color='#3498db', alpha=0.8)
ax.axhline(y=v_value, color='#e74c3c', linewidth=2.5, linestyle='--', label=f'V(s) = {v_value:.2f}')

# Show advantage annotations
for i, (q, a) in enumerate(zip(q_values, advantages)):
    color = '#2ecc71' if a > 0 else '#e74c3c'
    ax.annotate(f'A = {a:+.2f}', xy=(i, q), xytext=(i + 0.4, q),
               fontsize=10, fontweight='bold', color=color,
               arrowprops=dict(arrowstyle='->', color=color))

ax.set_xticks(x)
ax.set_xticklabels(actions)
ax.set_ylabel('Value', fontsize=12)
ax.set_title('Advantage = Q(s,a) - V(s)', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis='y')

# Right: How advantage shapes policy updates
ax = axes[1]
adv_range = np.linspace(-3, 3, 100)

# Policy update magnitude: proportional to advantage
ax.fill_between(adv_range[adv_range < 0], 0, adv_range[adv_range < 0],
                alpha=0.3, color='#e74c3c', label='Decrease probability')
ax.fill_between(adv_range[adv_range >= 0], 0, adv_range[adv_range >= 0],
                alpha=0.3, color='#2ecc71', label='Increase probability')
ax.plot(adv_range, adv_range, 'k-', linewidth=2)
ax.axhline(y=0, color='black', linewidth=0.5)
ax.axvline(x=0, color='black', linewidth=0.5)

ax.set_xlabel('Advantage A(s,a)', fontsize=12)
ax.set_ylabel('Policy gradient signal', fontsize=12)
ax.set_title('Advantage Drives Policy Updates', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 7. Actor-Critic: The Best of Both Worlds

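REINFORCE (even with a baseline) waits until the end of an episode to update, and it weights each action by the full Monte Carlo return. **Actor-critic** methods instead bootstrap from a learned value function, which lets them update after every step (or every few steps) using the TD advantage estimate. (For simplicity, the A2C implementation in this section batches one full episode per update.) They have two components: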

- **Actor**: The policy network $\pi_\theta(a|s)$ — decides what to do
- **Critic**: The value network $V_\phi(s)$ — evaluates how good states are

The critic provides the advantage signal that the actor uses to improve.

**F1 analogy:** The actor is the driver making real-time decisions on track — push, conserve, defend, pit. The critic is the race strategist on the pit wall evaluating the situation: "We're in P4, that's worth X expected points given current tire state." The driver (actor) makes a decision, the strategist (critic) evaluates the outcome: "That undercut moved us to P3 — better than the P4 we expected, advantage is positive." The driver adjusts their instincts (policy) based on the strategist's feedback. Neither works optimally alone — the driver has instincts but no big picture, the strategist has analysis but can't drive. Together, they form an actor-critic system.

| Component | REINFORCE | Actor-Critic | F1 Parallel |
|-----------|-----------|---------------|-------------|
| **Updates** | After full episode | After each step | Post-race review vs. lap-by-lap radio calls |
| **Gradient signal** | Full return $G_t$ | TD advantage $r + \gamma V(s') - V(s)$ | Full race result vs. "that lap was 0.3s better than expected" |
| **Variance** | High (even with baseline) | Lower (TD estimates) | Race outcomes vary wildly; lap-by-lap feedback is steadier |
| **Bias** | Unbiased | Slightly biased (bootstrapping) | Full picture but noisy vs. immediate feedback with assumptions |
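To make the "after each step" column concrete, here is a minimal tabular sketch of one-step actor-critic. It is not the notebook's A2C agent: the four-state chain environment, the hyperparameters, and all names are hypothetical, chosen so the example runs with only the standard library. The agent starts at the left end and receives a reward of 1 on reaching the right end.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_chain_actor_critic(n_states=4, episodes=2000, gamma=0.9,
                             lr_actor=0.5, lr_critic=0.1, max_steps=100, seed=0):
    """One-step actor-critic on a toy chain: start at state 0, goal at the right end."""
    rng = random.Random(seed)
    goal = n_states - 1
    theta = [[0.0, 0.0] for _ in range(n_states)]  # actor logits: [left, right]
    value = [0.0] * n_states                       # critic V(s)

    for episode in range(episodes):
        s = 0
        for step in range(max_steps):
            pi = softmax(theta[s])
            a = 0 if rng.random() < pi[0] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            done = s_next == goal
            r = 1.0 if done else 0.0

            # The TD error doubles as the one-step advantage estimate
            target = r if done else r + gamma * value[s_next]
            delta = target - value[s]

            # Critic update, then actor update, both after this single step
            value[s] += lr_critic * delta
            for k in range(2):
                score = (1.0 if k == a else 0.0) - pi[k]
                theta[s][k] += lr_actor * delta * score

            if done:
                break
            s = s_next
    return [softmax(t)[1] for t in theta[:goal]], value


p_right, value = train_chain_actor_critic()
print("P(right) per state:", [round(p, 2) for p in p_right])
print("V(s):", [round(v, 2) for v in value])
```

Neither update waits for the episode to finish: the critic's estimate of the next state stands in for the rest of the return, which is where both the lower variance and the bootstrapping bias in the table come from.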

In [None]:
class ActorCritic(nn.Module):
    """Actor-Critic with shared feature layers."""
    
    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        
        # Shared feature extraction
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
        )
        
        # Actor head (policy)
        self.actor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions)
        )
        
        # Critic head (value function)
        self.critic = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, x):
        features = self.shared(x)
        action_probs = F.softmax(self.actor(features), dim=-1)
        value = self.critic(features).squeeze(-1)
        return action_probs, value
    
    def get_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0)
        probs, value = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action), value


class A2CAgent:
    """Advantage Actor-Critic (A2C) agent."""
    
    def __init__(self, state_dim, n_actions, lr=3e-4, gamma=0.99,
                 value_coef=0.5, entropy_coef=0.01):
        self.gamma = gamma
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        
        self.network = ActorCritic(state_dim, n_actions)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
        
        # Episode storage
        self.log_probs = []
        self.values = []
        self.rewards = []
        self.entropies = []
    
    def select_action(self, state):
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        probs, value = self.network(state_tensor)
        dist = Categorical(probs)
        action = dist.sample()
        
        self.log_probs.append(dist.log_prob(action))
        self.values.append(value)
        self.entropies.append(dist.entropy())
        
        return action.item()
    
    def store_reward(self, reward):
        self.rewards.append(reward)
    
    def update(self, next_state, done):
        """Update after collecting a batch of experience."""
        # Bootstrap value of last state
        with torch.no_grad():
            next_state_tensor = torch.FloatTensor(next_state).unsqueeze(0)
            _, next_value = self.network(next_state_tensor)
            next_value = next_value * (1 - done)
        
        # Compute returns and advantages
        returns = []
        G = next_value.item()
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        returns = torch.FloatTensor(returns)
        
        values = torch.cat(self.values)
        log_probs = torch.cat(self.log_probs)
        entropies = torch.cat(self.entropies)
        
        advantages = returns - values.detach()
        if len(advantages) > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # Actor loss: policy gradient with advantage
        actor_loss = -(log_probs * advantages).mean()
        
        # Critic loss: fit value function to returns
        critic_loss = F.mse_loss(values, returns)
        
        # Entropy bonus: encourage exploration
        entropy_loss = -entropies.mean()
        
        # Total loss
        total_loss = (actor_loss + 
                     self.value_coef * critic_loss + 
                     self.entropy_coef * entropy_loss)
        
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.network.parameters(), 0.5)
        self.optimizer.step()
        
        # Clear
        self.log_probs = []
        self.values = []
        self.rewards = []
        self.entropies = []
        
        return actor_loss.item(), critic_loss.item()


# Train A2C
env = CartPoleSimple()
a2c_agent = A2CAgent(state_dim=4, n_actions=2, lr=3e-4)

l_a2c = []
actor_losses = []
critic_losses = []

for episode in range(1000):
    state = env.reset()
    
    for step in range(500):
        action = a2c_agent.select_action(state)
        next_state, reward, done = env.step(action)
        a2c_agent.store_reward(reward)
        state = next_state
        if done:
            break
    
    a_loss, c_loss = a2c_agent.update(next_state, float(done))
    l_a2c.append(step + 1)
    actor_losses.append(a_loss)
    critic_losses.append(c_loss)
    
    if (episode + 1) % 100 == 0:
        print(f"Episode {episode+1:4d} | Avg Length: {np.mean(l_a2c[-100:]):6.1f} | "
              f"Actor Loss: {np.mean(actor_losses[-100:]):7.3f} | "
              f"Critic Loss: {np.mean(critic_losses[-100:]):7.3f}")

### Visualization: A2C Training and Architecture

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Performance comparison
ax = axes[0]
window = 50
for data, label, color in [(l_reinforce, 'REINFORCE', '#e74c3c'),
                            (l_baseline, 'REINFORCE + Baseline', '#f39c12'),
                            (l_a2c, 'A2C', '#2ecc71')]:
    smoothed = np.convolve(data, np.ones(window)/window, mode='valid')
    ax.plot(smoothed, label=label, color=color, linewidth=2.5)

ax.axhline(y=200, color='gray', linestyle='--', alpha=0.5)
ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel('Episode Length', fontsize=12)
ax.set_title('Policy Gradient Methods Comparison', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

# Actor and critic losses
ax = axes[1]
window = 20
a_smooth = np.convolve(actor_losses, np.ones(window)/window, mode='valid')
c_smooth = np.convolve(critic_losses, np.ones(window)/window, mode='valid')
ax.plot(a_smooth, label='Actor Loss', color='#3498db', linewidth=2)
ax.plot(c_smooth, label='Critic Loss', color='#e74c3c', linewidth=2)
ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('A2C: Actor and Critic Losses', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Architecture diagram
fig, ax = plt.subplots(1, 1, figsize=(10, 7))
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')
ax.set_title('Actor-Critic Architecture', fontsize=16, fontweight='bold')

# Input
box = mpatches.FancyBboxPatch((3.5, 6.5), 3, 0.8, boxstyle="round,pad=0.2",
                               facecolor='#95a5a6', edgecolor='black', linewidth=2)
ax.add_patch(box)
ax.text(5, 6.9, 'State s', ha='center', va='center', fontsize=12, fontweight='bold', color='white')

# Shared layers
box = mpatches.FancyBboxPatch((3, 5), 4, 0.8, boxstyle="round,pad=0.2",
                               facecolor='#9b59b6', edgecolor='black', linewidth=2)
ax.add_patch(box)
ax.text(5, 5.4, 'Shared Features', ha='center', va='center', fontsize=12, fontweight='bold', color='white')
ax.annotate('', xy=(5, 5.8), xytext=(5, 6.5), arrowprops=dict(arrowstyle='->', lw=2, color='gray'))

# Actor head
box = mpatches.FancyBboxPatch((1, 3), 3, 0.8, boxstyle="round,pad=0.2",
                               facecolor='#3498db', edgecolor='black', linewidth=2)
ax.add_patch(box)
ax.text(2.5, 3.4, 'Actor π(a|s)', ha='center', va='center', fontsize=12, fontweight='bold', color='white')
ax.annotate('', xy=(3.5, 3.8), xytext=(4.5, 5), arrowprops=dict(arrowstyle='->', lw=2, color='gray'))

# Critic head
box = mpatches.FancyBboxPatch((6, 3), 3, 0.8, boxstyle="round,pad=0.2",
                               facecolor='#e74c3c', edgecolor='black', linewidth=2)
ax.add_patch(box)
ax.text(7.5, 3.4, 'Critic V(s)', ha='center', va='center', fontsize=12, fontweight='bold', color='white')
ax.annotate('', xy=(6.5, 3.8), xytext=(5.5, 5), arrowprops=dict(arrowstyle='->', lw=2, color='gray'))

# Outputs
ax.text(2.5, 2.3, 'Action\nprobabilities', ha='center', va='center', fontsize=10,
        color='#3498db', style='italic')
ax.text(7.5, 2.3, 'State\nvalue', ha='center', va='center', fontsize=10,
        color='#e74c3c', style='italic')

# Advantage
box = mpatches.FancyBboxPatch((3.5, 1), 3, 0.8, boxstyle="round,pad=0.2",
                               facecolor='#2ecc71', edgecolor='black', linewidth=2)
ax.add_patch(box)
ax.text(5, 1.4, 'Advantage A(s,a)', ha='center', va='center', fontsize=12, fontweight='bold', color='white')

ax.annotate('', xy=(4, 1.8), xytext=(2.5, 2.3), arrowprops=dict(arrowstyle='->', lw=1.5, color='gray'))
ax.annotate('', xy=(6, 1.8), xytext=(7.5, 2.3), arrowprops=dict(arrowstyle='->', lw=1.5, color='gray'))

ax.text(5, 0.3, 'A = r + γV(s\') - V(s)', ha='center', fontsize=11, style='italic', color='#2c3e50')

plt.tight_layout()
plt.show()

---

## 8. The Entropy Bonus: Encouraging Exploration

Notice the `entropy_coef` term in A2C. The **entropy** of the policy measures how "spread out" the action distribution is:

$$H(\pi) = -\sum_a \pi(a|s) \log \pi(a|s)$$

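- **High entropy**: Policy is close to uniform (maximum exploration)
- **Low entropy**: Policy is close to deterministic (maximum exploitation)

By adding an entropy bonus to the objective, we discourage the policy from collapsing to a near-deterministic distribution too early. In the A2C loss this appears as `entropy_loss = -entropies.mean()`: minimizing the negative entropy is the same as maximizing entropy, weighted by `entropy_coef`.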

**F1 analogy:** Without the entropy bonus, a team's strategy would quickly converge to always making the same call — "always one-stop, always medium-hard" — even if the optimal strategy varies by circuit. The entropy bonus keeps the strategy distribution "open-minded" during training, like a team principal saying "I know the one-stop usually works, but keep the two-stop and three-stop options on the table — we haven't raced enough circuits to be certain." It prevents premature commitment to a single strategy before enough data has been collected.
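To see the bonus at work in isolation, the sketch below takes one gradient step on the entropy term alone and checks that the policy becomes more spread out. The logits, the step size of 0.5, and the helper name `policy_entropy` are illustrative assumptions, not part of the A2C agent above.

```python
import torch

# A sharply peaked 4-action policy (illustrative logits, not from the agent)
logits = torch.tensor([3.0, 0.0, 0.0, 0.0], requires_grad=True)


def policy_entropy(logits):
    """Entropy of the categorical policy defined by these logits."""
    return torch.distributions.Categorical(logits=logits).entropy()


h_before = policy_entropy(logits)

# Same sign convention as the A2C loss: minimize the negative entropy
entropy_loss = -h_before
entropy_loss.backward()

# One plain gradient-descent step on the entropy term only
with torch.no_grad():
    updated_logits = logits - 0.5 * logits.grad

h_after = policy_entropy(updated_logits)
max_entropy = torch.log(torch.tensor(4.0))  # entropy of the uniform policy

print(f"Entropy before step: {h_before.item():.3f}")
print(f"Entropy after step:  {h_after.item():.3f}")
print(f"Maximum possible:    {max_entropy.item():.3f}")
```

In the full A2C loss this pull toward uniformity is scaled by `entropy_coef` and competes with the actor term, which pulls probability toward high-advantage actions.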

In [None]:
# Visualize entropy's effect on policy distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4.5))

# Three example policies with different entropies
policies = [
    ([0.97, 0.01, 0.01, 0.01], 'Nearly Deterministic'),
    ([0.5, 0.3, 0.15, 0.05], 'Moderate Entropy'),
    ([0.25, 0.25, 0.25, 0.25], 'Maximum Entropy (Uniform)'),
]

actions = ['A₁', 'A₂', 'A₃', 'A₄']
colors_map = ['#e74c3c', '#f39c12', '#2ecc71']

for ax, (probs, title), color in zip(axes, policies, colors_map):
    entropy = -sum(p * np.log(p + 1e-8) for p in probs)
    ax.bar(actions, probs, color=color, edgecolor='black', alpha=0.8)
    ax.set_ylim(0, 1.1)
    ax.set_ylabel('π(a|s)')
    ax.set_title(f'{title}\nH = {entropy:.3f}', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')

plt.suptitle('Policy Entropy: From Deterministic to Uniform', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("The entropy bonus prevents premature convergence to a deterministic policy.")
print("Without it, the agent might stop exploring before finding the best strategy.")

---

## 9. The Trust Region Problem

Policy gradient methods have a subtle but critical problem: **large policy updates can be catastrophic**.

If we take too large a step in parameter space, the policy can change drastically — an action that was chosen 80% of the time might suddenly be chosen 5% of the time. This can destroy good behavior that took many episodes to learn.

### Why This Matters

In supervised learning, a bad gradient step loses a bit of accuracy, and the next batch corrects it. In RL, a bad policy update changes the *data distribution itself* — the agent starts visiting different states, getting different rewards, leading to further bad updates. This feedback loop can cause complete collapse.

**F1 analogy:** Imagine a team that's been running a successful medium-hard strategy all season. After one bad race (Abu Dhabi, unusual conditions), the strategy model over-corrects: "medium-hard is terrible, switch everything to soft-soft-medium." At the next race, the two-stop fails spectacularly, generating even worse data, causing another wild swing. This is the policy collapse feedback loop. In F1, experienced teams avoid this by making *conservative, incremental* strategy adjustments: they don't throw away a proven approach based on one race. PPO (Notebook 20) formalizes this wisdom mathematically.

### The Solution: Constrain Policy Updates

**TRPO** (Trust Region Policy Optimization) constrains the KL divergence between old and new policies:

$$\text{maximize } L(\theta) \quad \text{subject to } D_{KL}(\pi_{\theta_{old}} \| \pi_\theta) \leq \delta$$

**PPO** (Proximal Policy Optimization) achieves a similar effect much more simply using a clipped objective. This is what we'll build in Notebook 20 — stable strategy updates that don't swing too wildly between races.
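As a rough illustration of the constraint, the sketch below measures $D_{KL}(\pi_{\theta_{old}} \| \pi_\theta)$ for a small and a large change to a two-action policy, reusing the 80% to 5% swing described earlier. The threshold `delta = 0.01` and the specific probabilities are assumptions chosen for illustration. This only checks the constraint; it is not TRPO's actual optimization procedure.

```python
import torch
from torch.distributions import Categorical, kl_divergence

delta = 0.01  # assumed trust-region radius, chosen only for illustration

old_policy = Categorical(probs=torch.tensor([0.80, 0.20]))
small_step = Categorical(probs=torch.tensor([0.75, 0.25]))  # mild adjustment
large_step = Categorical(probs=torch.tensor([0.05, 0.95]))  # 80% -> 5% swing

# KL(old || new), the direction used in the TRPO constraint above
kl_small = kl_divergence(old_policy, small_step).item()
kl_large = kl_divergence(old_policy, large_step).item()

for name, kl in [("small step", kl_small), ("large step", kl_large)]:
    verdict = "inside" if kl <= delta else "outside"
    print(f"{name}: KL = {kl:.4f} ({verdict} the trust region)")
```

The large step lands far outside the region, so a trust-region method would reject or shrink it rather than apply it.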

In [None]:
# Demonstrate the trust region problem
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Policy before and after a large update
ax = axes[0]
x = np.linspace(-3, 3, 200)
old_policy = np.exp(-0.5 * (x - 0.5)**2) / np.sqrt(2 * np.pi)
# Small update
small_update = np.exp(-0.5 * (x - 0.8)**2) / np.sqrt(2 * np.pi)
# Large update
large_update = np.exp(-0.5 * (x + 1.5)**2 / 0.3) / np.sqrt(2 * np.pi * 0.3)

ax.plot(x, old_policy, 'b-', linewidth=2.5, label='Old policy')
ax.plot(x, small_update, 'g--', linewidth=2.5, label='Small update (safe)')
ax.plot(x, large_update, 'r--', linewidth=2.5, label='Large update (dangerous!)')
ax.fill_between(x, old_policy, large_update, alpha=0.1, color='red')
ax.set_xlabel('Action', fontsize=12)
ax.set_ylabel('π(a|s)', fontsize=12)
ax.set_title('Large Policy Updates Are Dangerous', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

# Right: The collapse feedback loop
ax = axes[1]
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')
ax.set_title('Policy Collapse Feedback Loop', fontsize=13, fontweight='bold')

steps = [
    (5, 7, 'Large policy update'),
    (8.5, 5.5, 'Visit different states'),
    (7, 3, 'Get different rewards'),
    (3, 3, 'Worse gradient estimates'),
    (1.5, 5.5, 'Even larger bad update'),
]

for i, (cx, cy, text) in enumerate(steps):  # cx, cy avoid shadowing the x array above
    color = '#e74c3c' if i > 0 else '#f39c12'
    circle = plt.Circle((cx, cy), 0.8, color=color, alpha=0.3)
    ax.add_patch(circle)
    ax.text(cx, cy, f'{i+1}. {text}', ha='center', va='center', fontsize=8, fontweight='bold')

# Arrows forming a cycle
for i in range(len(steps) - 1):
    x1, y1, _ = steps[i]
    x2, y2, _ = steps[i + 1]
    ax.annotate('', xy=(x2, y2), xytext=(x1, y1),
               arrowprops=dict(arrowstyle='->', lw=2, color='gray',
                              connectionstyle='arc3,rad=0.2'))
# Close the loop
ax.annotate('', xy=(steps[0][0], steps[0][1]), xytext=(steps[-1][0], steps[-1][1]),
           arrowprops=dict(arrowstyle='->', lw=2, color='gray',
                          connectionstyle='arc3,rad=0.2'))

ax.text(5, 0.5, 'PPO solves this with clipped updates (Notebook 20)',
       ha='center', fontsize=11, style='italic', color='#2ecc71',
       bbox=dict(boxstyle='round', facecolor='#ecf0f1', alpha=0.8))

plt.tight_layout()
plt.show()

---

## 10. Continuous Action Spaces (Preview)

One of the biggest advantages of policy gradient methods is that they naturally handle **continuous actions**. Instead of outputting a probability for each discrete action, the policy outputs the **parameters of a distribution** (e.g., the mean and standard deviation of a Gaussian).

$$\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$$

This is essential for robotics (joint torques), autonomous driving (steering angle), and any task where actions are continuous.

**F1 analogy:** This is the difference between discrete strategy calls ("pit" or "don't pit") and continuous driver inputs (steering angle: 12.7 degrees, throttle: 83%, brake pressure: 42%). DQN can decide *whether* to pit, but a continuous policy gradient method can learn the steering trajectory through a corner: it outputs a Gaussian distribution over steering angles, centered on the ideal line with some variance for uncertainty. This is the kind of continuous control policy you would train for a simulated driver.
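For a diagonal Gaussian policy, the log-probability that feeds the policy gradient has a closed form, and its gradient with respect to the mean is $(a - \mu)/\sigma^2$. The sketch below checks both against `torch.distributions.Normal`. The numeric values of `mean`, `log_std`, and `action` are arbitrary assumptions for illustration.

```python
import math
import torch

# Arbitrary illustrative values for a 2-dimensional action
mean = torch.tensor([0.2, -0.5], requires_grad=True)
log_std = torch.tensor([0.0, -1.0], requires_grad=True)
action = torch.tensor([0.7, -0.4])

# Log-probability from the library, summed over independent action dimensions
dist = torch.distributions.Normal(mean, log_std.exp())
log_prob = dist.log_prob(action).sum(-1)

# The same quantity from the Gaussian density written out by hand
std = log_std.exp()
manual = (-0.5 * ((action - mean) / std) ** 2
          - log_std
          - 0.5 * math.log(2 * math.pi)).sum()

# Score function with respect to the mean: (a - mu) / sigma^2
log_prob.backward()
expected_grad_mean = (action - mean.detach()) / std.detach() ** 2

print(f"log_prob (library): {log_prob.item():.4f}")
print(f"log_prob (manual):  {manual.item():.4f}")
print(f"grad wrt mean:      {mean.grad.tolist()}")
print(f"(a - mu) / sigma^2: {expected_grad_mean.tolist()}")
```

The gradient points from the mean toward the sampled action, so multiplying it by a positive advantage shifts the policy toward actions that worked well.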

In [None]:
class ContinuousPolicyNetwork(nn.Module):
    """Policy network for continuous action spaces.
    Outputs mean and log_std of a Gaussian distribution."""
    
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # Learnable, state-independent log std (std starts at 1)
    
    def forward(self, state):
        features = self.shared(state)
        mean = self.mean_head(features)
        std = torch.exp(self.log_std)
        return mean, std
    
    def get_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0)
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)
        return action.squeeze(0).numpy(), log_prob


# Demonstrate
policy = ContinuousPolicyNetwork(state_dim=4, action_dim=2)
state = np.random.randn(4)
action, log_prob = policy.get_action(state)

print("Continuous Policy Network:")
print(f"  State: {state}")
print(f"  Sampled action: {action}")
print(f"  Log probability: {log_prob.item():.4f}")

# Visualize the action distribution
fig, ax = plt.subplots(1, 1, figsize=(8, 5))
state_tensor = torch.FloatTensor(state).unsqueeze(0)
mean, std = policy(state_tensor)
mean, std = mean.detach().numpy()[0], std.detach().numpy()

x = np.linspace(-4, 4, 200)
for i, (m, s, label) in enumerate(zip(mean, std, ['Action dim 1', 'Action dim 2'])):
    pdf = np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))
    ax.plot(x, pdf, linewidth=2.5, label=f'{label}: μ={m:.2f}, σ={s:.2f}')
    ax.fill_between(x, pdf, alpha=0.2)

ax.set_xlabel('Action value', fontsize=12)
ax.set_ylabel('Probability density', fontsize=12)
ax.set_title('Continuous Policy: Gaussian Action Distribution', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

## Exercises

### Exercise 1: REINFORCE on a Harder Problem — Tighter Margins at Monaco

Modify the CartPole environment to have a tighter angle threshold (6 degrees instead of 12). Train REINFORCE with and without a baseline. This is like making the "track" narrower — Monaco's street circuit gives almost no margin for error compared to Silverstone's wide runoffs. How much does the baseline help on this harder, less-forgiving problem?

In [None]:
# Exercise 1: Your code here
# Hint: Create a HarderCartPole class with theta_threshold = 6 * np.pi / 180


### Exercise 2: n-Step Returns for A2C — How Many Laps of Hindsight?

Our A2C uses full episode returns. Implement **n-step returns** where the agent updates every n steps:

$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})$$

Compare n=1, n=5, n=20, and full episode. In F1 terms, n=1 is updating strategy after every single lap (fast feedback, but can't see multi-lap patterns). n=20 is waiting 20 laps to assess (sees longer-term tire degradation trends). Full episode is waiting until the checkered flag. Which n works best — how many laps of hindsight does the strategist need?
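As a starting point, the formula for a single time step can be computed as follows. This is a minimal sketch: the helper name `n_step_return` and the example numbers are assumptions, and wiring it into `A2CAgent.update()` is left as the exercise.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of the given rewards plus a discounted bootstrap value.

    rewards:         the n rewards r_t, ..., r_{t+n-1}
    bootstrap_value: the critic's estimate V(s_{t+n}); pass 0.0 if the
                     episode terminated within these n steps
    """
    n = len(rewards)
    discounted_rewards = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma ** n * bootstrap_value


# Three steps of reward 1.0, then the critic estimates V(s_{t+3}) = 10.0
target = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.9)

# Episode ended after two steps, so there is nothing to bootstrap from
terminal_target = n_step_return([1.0, 1.0], bootstrap_value=0.0, gamma=0.9)

print(f"3-step target with bootstrap: {target:.3f}")
print(f"2-step target at termination: {terminal_target:.3f}")
```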

In [None]:
# Exercise 2: Your code here
# Hint: Modify A2CAgent.update() to compute n-step returns
# instead of full-episode returns


### Exercise 3: A2C vs DQN Head-to-Head — Instinct vs. Lookup Table

Import the DQNAgent from Notebook 18 (or recreate it) and run a fair comparison against A2C on CartPole. Use the same number of environment interactions for both. This is the ultimate question: does a driver who learns strategy by instinct (A2C/policy gradients) outperform one who memorizes a Q-value lookup table (DQN)? Which method:
- Converges faster? (Who learns the track quicker?)
- Achieves higher final performance? (Who's faster at the end of the season?)
- Has more stable training? (Who's more consistent race-to-race?)

In [None]:
# Exercise 3: Your code here
# Hint: Make sure both agents get the same total number of env.step() calls


---

## Summary

### Key Concepts

| Concept | What It Means | F1 Parallel |
|---------|--------------|-------------|
| **Policy gradient theorem** | Optimize policy directly: increase prob of high-return actions | Reinforce winning race decisions, suppress losing ones |
| **REINFORCE** | Simplest policy gradient, but high variance | Learn from full race outcomes — noisy but unbiased |
| **Baselines** | Subtract expected return to reduce variance | "How much better than expected was this decision?" |
| **Advantage function** | A(s,a) = Q(s,a) - V(s): how much better than average | "Was pitting NOW specifically better than our average action here?" |
| **Actor-Critic** | Actor (policy) + Critic (value function) | Driver (decisions) + Strategist (evaluation) |
| **Entropy bonus** | Prevent premature convergence to one strategy | Keep multiple strategy options open during learning |
| **Trust region problem** | Large updates cause policy collapse | Don't overhaul proven strategy based on one bad race |
| **Continuous actions** | Output distribution parameters instead of discrete probs | Steering angle, throttle percentage, brake pressure |

### Fundamental Insight

The policy gradient theorem elegantly solves the problem of differentiating through stochastic sampling. By using $\nabla \log \pi$ as a "score function," we can optimize any reward signal — not just differentiable losses. This generality is why policy gradients power RLHF: the reward model's output doesn't need to be differentiable with respect to the generated text. In F1 terms, you can optimize for "championship points" — a non-differentiable, delayed, sparse reward — by reinforcing the strategic decisions that led to them.
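The claim can be checked numerically on a tiny example. The sketch below uses a three-action softmax policy with black-box rewards and compares the score-function gradient $\sum_a \pi(a)\, \nabla_\theta \log \pi(a)\, R(a)$ against a finite-difference gradient of the expected reward. The reward values, the parameters `theta`, and the helper names are assumptions for illustration; the expectation is computed exactly over all actions rather than sampled, to keep the comparison deterministic.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


# Black-box rewards: we only ever evaluate them, never differentiate them
rewards = np.array([25.0, 0.0, 10.0])
theta = np.array([0.1, -0.3, 0.5])


def expected_reward(theta):
    return softmax(theta) @ rewards


def score_function_gradient(theta):
    """Exact expectation of grad log pi(a) * R(a) under the policy."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for a in range(len(theta)):
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0  # d log pi(a) / d theta = onehot(a) - pi
        grad += pi[a] * grad_log_pi * rewards[a]
    return grad


def finite_difference_gradient(theta, eps=1e-6):
    """Central-difference gradient of the expected reward, for comparison."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (expected_reward(theta + step)
                   - expected_reward(theta - step)) / (2 * eps)
    return grad


score_grad = score_function_gradient(theta)
fd_grad = finite_difference_gradient(theta)
print("Score-function gradient:   ", np.round(score_grad, 4))
print("Finite-difference gradient:", np.round(fd_grad, 4))
```

In practice the sum over actions is replaced by sampled trajectories, which is where the variance discussed earlier in this notebook comes from.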

---

## Next Steps

We now have all the building blocks for the algorithm that aligns modern language models. In **Notebook 20: PPO and Modern RL**, we'll:

- Implement PPO's clipped surrogate objective — stable strategy updates that don't swing too wildly between races
- Build Generalized Advantage Estimation (GAE) for better advantage estimates
- Train a full PPO agent from scratch
- Connect everything to **RLHF**: reward models, KL penalties, and the complete pipeline
- See how RL makes language models helpful, harmless, and honest

PPO is the finish line of our RL journey — and the starting grid for understanding how modern AI assistants are aligned with human preferences.