# Part 6.1: Reinforcement Learning Fundamentals — The Formula 1 Edition

We've spent 16 notebooks learning how to build models that learn from **labeled data** (supervised) or **unlabeled data** (self-supervised). But there's a third paradigm — one that learns from **experience and rewards**, just like a child learning to walk by falling down and getting back up.

**Reinforcement Learning (RL)** is about an agent interacting with an environment, taking actions, receiving rewards, and learning a strategy to maximize long-term success. It's the foundation of game-playing AI (AlphaGo), robotics, and, critically, **RLHF**, a technique used to make language models like ChatGPT more helpful and safer.

**The F1 Connection:** Every Formula 1 race is a reinforcement learning problem in disguise. The race engineer and driver together form an *agent* that must make real-time decisions — when to pit, how hard to push, when to conserve tires — in a stochastic *environment* (weather changes, safety cars, tire degradation). The *reward* is championship points. A race strategy is literally a *policy*: a mapping from the car's current situation to the optimal action. In this notebook, we'll build the mathematical framework behind these decisions.

## Learning Objectives

- [ ] Understand the agent-environment loop and how RL differs from supervised learning
- [ ] Define Markov Decision Processes (MDPs) and their components
- [ ] Derive and implement the Bellman equations for value functions
- [ ] Distinguish between state-value functions V(s) and action-value functions Q(s,a)
- [ ] Implement policy evaluation and policy iteration from scratch
- [ ] Understand the exploration vs. exploitation tradeoff
- [ ] Implement value iteration to solve a gridworld environment
- [ ] Compare policy-based vs. value-based methods at a high level
- [ ] Build intuition for temporal difference learning
- [ ] Connect RL concepts to the RLHF pipeline introduced in Notebook 16

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.colors import LinearSegmentedColormap
from collections import defaultdict
import random

# For reproducibility
np.random.seed(42)
random.seed(42)

print("Part 6.1: Reinforcement Learning Fundamentals")
print("=" * 50)

---

## 1. The Reinforcement Learning Paradigm

In supervised learning, we have input-output pairs and minimize a loss. In RL, there are no labels — the agent must **discover** which actions lead to rewards through trial and error.

### The Agent-Environment Loop

The core RL cycle works like this:

1. The **agent** observes the current **state** $s_t$
2. The agent selects an **action** $a_t$ based on its **policy** $\pi$
3. The **environment** transitions to a new state $s_{t+1}$
4. The environment returns a **reward** $r_t$
5. Repeat

The goal: find a policy $\pi^*$ that maximizes the **expected cumulative reward** over time.
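
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the environment used later in the notebook: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names, and the rewards are made-up values. The only thing borrowed from the rest of the notebook is the `(next_state, reward, done)` return convention for `step`.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: states 0..4, start at 2, goal at 4, pit at 0."""

    def reset(self):
        self.state = 2
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); moves are deterministic here
        self.state += action
        if self.state == 4:
            return self.state, 1.0, True     # reached the goal
        if self.state == 0:
            return self.state, -1.0, True    # fell into the pit
        return self.state, -0.04, False      # ordinary step


def random_policy(state, rng):
    """Placeholder policy: ignores the state and picks a direction uniformly."""
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=100):
    """Run one episode; return the trajectory as (state, action, reward) tuples."""
    trajectory = []
    state = env.reset()                               # 1. observe s_0
    for _ in range(max_steps):
        action = policy(state, rng)                   # 2. choose a_t from the policy
        next_state, reward, done = env.step(action)   # 3-4. get s_{t+1} and r_t
        trajectory.append((state, action, reward))
        state = next_state                            # 5. repeat from the new state
        if done:
            break
    return trajectory


rng = random.Random(0)
trajectory = run_episode(CorridorEnv(), random_policy, rng)
total_reward = sum(r for _, _, r in trajectory)
print(f"{len(trajectory)} steps, total reward {total_reward:.2f}")
```

A random policy makes no attempt to maximize reward; everything that follows in this notebook is about replacing `random_policy` with something better.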

**F1 analogy:** The agent is the race strategist. The state is the car's current situation — track position P3, tire age 15 laps, gap to leader 4.2 seconds, medium compound, fuel load 60%. The actions are: pit now, push hard, conserve tires, defend position, use DRS. The reward is positions gained (or championship points at race end). The policy is the strategy: "If tires are older than 20 laps AND gap to car ahead is under 1 second, pit for fresh softs."

| Concept | Supervised Learning | Reinforcement Learning | F1 Parallel |
|---------|-------------------|----------------------|-------------|
| **Feedback** | Correct labels provided | Scalar reward signal | Points scored at end of race |
| **Timing** | Immediate | Can be delayed | Pit stop pain now, position gain later |
| **Data** | Fixed dataset | Generated by agent's actions | Each race is unique — new data from new decisions |
| **Goal** | Minimize loss | Maximize cumulative reward | Maximize championship points |
| **Exploration** | Not needed | Critical for learning | Trying an aggressive undercut vs. known overcut |

### Visualization: The RL Loop

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')
ax.set_title('The Reinforcement Learning Loop', fontsize=16, fontweight='bold')

# Agent box
agent_box = mpatches.FancyBboxPatch((1, 5), 3, 2, boxstyle="round,pad=0.3",
                                     facecolor='#3498db', edgecolor='black', linewidth=2)
ax.add_patch(agent_box)
ax.text(2.5, 6, 'AGENT', ha='center', va='center', fontsize=14,
        fontweight='bold', color='white')
ax.text(2.5, 5.4, 'Policy π(a|s)', ha='center', va='center', fontsize=10, color='white')

# Environment box
env_box = mpatches.FancyBboxPatch((6, 5), 3, 2, boxstyle="round,pad=0.3",
                                   facecolor='#2ecc71', edgecolor='black', linewidth=2)
ax.add_patch(env_box)
ax.text(7.5, 6, 'ENVIRONMENT', ha='center', va='center', fontsize=14,
        fontweight='bold', color='white')
ax.text(7.5, 5.4, 'P(s\'|s,a), R(s,a)', ha='center', va='center', fontsize=10, color='white')

# Action arrow (agent -> environment)
ax.annotate('', xy=(6, 6.5), xytext=(4, 6.5),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='#e74c3c'))
ax.text(5, 7.1, 'Action $a_t$', ha='center', va='center', fontsize=12,
        color='#e74c3c', fontweight='bold')

# State arrow (environment -> agent, bottom)
ax.annotate('', xy=(4, 5.3), xytext=(6, 5.3),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='#9b59b6'))
ax.text(5, 4.6, 'State $s_{t+1}$', ha='center', va='center', fontsize=12,
        color='#9b59b6', fontweight='bold')

# Reward arrow (environment -> agent, further below)
ax.annotate('', xy=(2.5, 3.5), xytext=(7.5, 3.5),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='#f39c12',
                           connectionstyle='arc3,rad=0.3'))
ax.text(5, 2.5, 'Reward $r_t$', ha='center', va='center', fontsize=12,
        color='#f39c12', fontweight='bold')

# Time step indicator
ax.text(5, 1.2, 'At each timestep t = 0, 1, 2, ...', ha='center', va='center',
        fontsize=11, style='italic', color='gray')

plt.tight_layout()
plt.show()

### Key RL Terminology

| Term | Symbol | Definition | F1 Parallel |
|------|--------|------------|-------------|
| **State** | $s \in \mathcal{S}$ | Current situation of the agent | Position, tire condition, gap to rivals, weather, laps remaining |
| **Action** | $a \in \mathcal{A}$ | Decision the agent can make | Pit stop, push hard, conserve tires, defend position, use DRS |
| **Policy** | $\pi(a|s)$ | Strategy mapping states to actions | Race strategy: "given THIS situation, do THIS" |
| **Reward** | $r_t$ | Immediate feedback signal | Positions gained, time advantage, championship points |
| **Return** | $G_t$ | Cumulative discounted future reward | Total value of remaining race outcome |
| **Discount factor** | $\gamma \in [0,1]$ | How much we value future vs. present rewards | How much a position gain on lap 50 matters vs. lap 1 |
| **Episode** | — | One complete sequence from start to terminal state | One full race, lights out to checkered flag |
| **Trajectory** | $\tau$ | Sequence of (state, action, reward) tuples | Full race log: every decision and its outcome |

---

## 2. Markov Decision Processes (MDPs)

An MDP formalizes the RL problem mathematically. It's defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

- $\mathcal{S}$: Set of states
- $\mathcal{A}$: Set of actions
- $P(s'|s,a)$: **Transition probability** — probability of reaching state $s'$ from state $s$ after taking action $a$
- $R(s,a)$: **Reward function** — expected reward for taking action $a$ in state $s$
- $\gamma$: **Discount factor** — balances immediate vs. future rewards
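
For a small, finite MDP, the tuple can be written down directly as arrays. The sketch below assumes a hypothetical two-state, two-action tire-management problem; the state and action names and every number in it are invented for illustration and are not taken from real race data.

```python
import numpy as np

# Hypothetical MDP: states 0 = "fresh tires", 1 = "worn tires";
# actions 0 = "push", 1 = "pit". All numbers are made up.
n_states, n_actions = 2, 2

# P[s, a, s'] = probability of landing in s' after taking action a in state s
P = np.array([
    [[0.7, 0.3],    # fresh, push -> stays fresh 70%, becomes worn 30%
     [1.0, 0.0]],   # fresh, pit  -> fresh
    [[0.0, 1.0],    # worn, push  -> stays worn
     [1.0, 0.0]],   # worn, pit   -> fresh
])

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([
    [1.0, -2.0],    # fresh: pushing pays off, pitting costs time
    [0.2, -2.0],    # worn: pushing pays less
])

gamma = 0.95        # discount factor

# Each (s, a) slice of P must be a probability distribution over s'
assert P.shape == (n_states, n_actions, n_states)
assert np.allclose(P.sum(axis=2), 1.0)

# Sample one transition: worn tires (s=1), action "pit" (a=1)
rng = np.random.default_rng(0)
next_state = rng.choice(n_states, p=P[1, 1])
print(next_state, R[1, 1])
```

The array layout `P[s, a, s']` is one common choice; the `GridWorld` class below instead returns a list of `(probability, next_state, reward)` triples per state-action pair, which is more convenient when most transitions have zero probability.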

### The Markov Property

The key assumption: the future depends only on the **current state**, not the history:

$$P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} | s_t, a_t)$$

This is powerful because it means we can make optimal decisions using only the current state — no need to remember the entire history.

### Intuitive Explanation

Think of chess: the current board position contains everything you need to make your next move. It doesn't matter *how* you got to that position — the optimal strategy depends only on where the pieces are *right now*.

**F1 analogy:** A race can be modeled as a Markov decision process. The state is your car's *current* situation: P4, lap 32 of 57, medium tires at 60% life, 2.1 seconds behind P3, fuel load nominal. Your optimal strategy decision (pit now? push? conserve?) depends only on THIS snapshot, not on whether you gained two positions on lap 1 or started P4. The transitions are stochastic: you choose "push hard," but there's a probability your tires degrade faster, or a safety car changes everything. The Markov property holds only approximately in F1 unless the state records everything relevant (how the tires were treated earlier in the stint, for example), but it is a useful approximation for reasoning about strategy.

### The F1 Race as an MDP

| MDP Component | F1 Mapping |
|---------------|-----------|
| $\mathcal{S}$ (states) | {(position, tire_age, tire_compound, gap_ahead, gap_behind, laps_remaining, weather)} |
| $\mathcal{A}$ (actions) | {pit_for_softs, pit_for_mediums, pit_for_hards, push, conserve, defend} |
| $P(s'|s,a)$ | Probability of new gaps/positions given current state and action — stochastic due to rivals, weather, safety cars |
| $R(s,a)$ | Positions gained, time advantage, championship points at race end |
| $\gamma$ | How much future laps matter vs. this lap (close to 1 in F1 — every lap counts) |

### Building a Gridworld MDP

Let's build a simple gridworld, the "hello world" of RL. Our agent navigates a 4×4 grid, trying to reach a goal while avoiding a trap and steering around a wall.

In [None]:
class GridWorld:
    """A simple gridworld MDP environment."""
    
    # Cell types
    EMPTY = 0
    WALL = 1
    GOAL = 2
    TRAP = 3
    
    # Actions: up, down, left, right
    ACTIONS = ['up', 'down', 'left', 'right']
    ACTION_DELTAS = {
        'up': (-1, 0),
        'down': (1, 0),
        'left': (0, -1),
        'right': (0, 1)
    }
    
    def __init__(self, grid_size=4, slip_prob=0.1):
        self.grid_size = grid_size
        self.slip_prob = slip_prob  # Probability of slipping perpendicular to the intended direction
        
        # Define the grid
        self.grid = np.zeros((grid_size, grid_size), dtype=int)
        self.grid[0, 3] = self.GOAL   # Goal at top-right
        self.grid[1, 1] = self.WALL   # Wall
        self.grid[2, 3] = self.TRAP   # Trap
        
        # State and action spaces
        self.states = [(i, j) for i in range(grid_size) for j in range(grid_size)
                       if self.grid[i, j] != self.WALL]
        self.terminal_states = [(i, j) for i in range(grid_size) for j in range(grid_size)
                                if self.grid[i, j] in [self.GOAL, self.TRAP]]
        self.n_states = len(self.states)
        self.n_actions = len(self.ACTIONS)
        
        # Starting position
        self.start = (3, 0)
        self.agent_pos = self.start
    
    def reset(self):
        """Reset to starting position."""
        self.agent_pos = self.start
        return self.agent_pos
    
    def _is_valid(self, pos):
        """Check if a position is valid (in bounds and not a wall)."""
        r, c = pos
        return (0 <= r < self.grid_size and 0 <= c < self.grid_size 
                and self.grid[r, c] != self.WALL)
    
    def get_transitions(self, state, action):
        """Return list of (probability, next_state, reward) for a state-action pair."""
        if state in self.terminal_states:
            return [(1.0, state, 0.0)]  # Terminal states loop with zero reward
        
        transitions = []
        intended_delta = self.ACTION_DELTAS[action]
        intended_next = (state[0] + intended_delta[0], state[1] + intended_delta[1])
        
        # Intended action succeeds with probability (1 - slip_prob)
        if self._is_valid(intended_next):
            next_state = intended_next
        else:
            next_state = state  # Bounce off wall/boundary
        
        reward = self._get_reward(next_state)
        transitions.append((1.0 - self.slip_prob, next_state, reward))
        
        # With slip_prob, agent moves in a random perpendicular direction
        if self.slip_prob > 0:
            perpendicular = []
            if action in ['up', 'down']:
                perpendicular = ['left', 'right']
            else:
                perpendicular = ['up', 'down']
            
            for perp_action in perpendicular:
                perp_delta = self.ACTION_DELTAS[perp_action]
                perp_next = (state[0] + perp_delta[0], state[1] + perp_delta[1])
                if self._is_valid(perp_next):
                    perp_state = perp_next
                else:
                    perp_state = state
                perp_reward = self._get_reward(perp_state)
                transitions.append((self.slip_prob / 2, perp_state, perp_reward))
        
        return transitions
    
    def _get_reward(self, state):
        """Reward function."""
        if self.grid[state[0], state[1]] == self.GOAL:
            return +1.0
        elif self.grid[state[0], state[1]] == self.TRAP:
            return -1.0
        else:
            return -0.04  # Small step penalty to encourage efficiency
    
    def step(self, action):
        """Take an action, return (next_state, reward, done)."""
        transitions = self.get_transitions(self.agent_pos, action)
        probs = [t[0] for t in transitions]
        idx = np.random.choice(len(transitions), p=probs)
        _, next_state, reward = transitions[idx]
        
        self.agent_pos = next_state
        done = next_state in self.terminal_states
        return next_state, reward, done


# Create and display the gridworld
env = GridWorld(grid_size=4, slip_prob=0.1)

print("GridWorld MDP")
print(f"States: {env.n_states} (excluding walls)")
print(f"Actions: {env.n_actions} ({', '.join(env.ACTIONS)})")
print(f"Terminal states: {env.terminal_states}")
print(f"Slip probability: {env.slip_prob}")
print(f"Start: {env.start}")

### Visualization: The Gridworld

In [None]:
def visualize_gridworld(env, values=None, policy=None, title='GridWorld'):
    """Visualize the gridworld with optional value function and policy overlays."""
    fig, ax = plt.subplots(1, 1, figsize=(7, 7))
    n = env.grid_size
    
    # Color map for cell types
    colors = {
        GridWorld.EMPTY: '#f0f0f0',
        GridWorld.WALL: '#2c3e50',
        GridWorld.GOAL: '#2ecc71',
        GridWorld.TRAP: '#e74c3c'
    }
    
    # Draw cells
    for i in range(n):
        for j in range(n):
            cell_type = env.grid[i, j]
            color = colors[cell_type]
            
            # If we have values, shade empty cells by value
            if values is not None and cell_type == GridWorld.EMPTY:
                v = values.get((i, j), 0)
                # Clip to [-1, 1], then map onto the colormap's [0, 1] range
                intensity = np.clip(v, -1, 1)
                color = plt.cm.RdYlGn(0.5 + intensity * 0.5)
            
            rect = plt.Rectangle((j, n - 1 - i), 1, 1, facecolor=color,
                                  edgecolor='black', linewidth=2)
            ax.add_patch(rect)
            
            # Labels
            if cell_type == GridWorld.GOAL:
                ax.text(j + 0.5, n - 1 - i + 0.7, 'GOAL', ha='center', va='center',
                        fontsize=11, fontweight='bold', color='white')
                ax.text(j + 0.5, n - 1 - i + 0.4, '+1.0', ha='center', va='center',
                        fontsize=10, color='white')
            elif cell_type == GridWorld.TRAP:
                ax.text(j + 0.5, n - 1 - i + 0.7, 'TRAP', ha='center', va='center',
                        fontsize=11, fontweight='bold', color='white')
                ax.text(j + 0.5, n - 1 - i + 0.4, '-1.0', ha='center', va='center',
                        fontsize=10, color='white')
            elif cell_type == GridWorld.WALL:
                ax.text(j + 0.5, n - 1 - i + 0.5, 'WALL', ha='center', va='center',
                        fontsize=11, fontweight='bold', color='white')
            
            # Show values
            if values is not None and (i, j) in values and cell_type not in [GridWorld.WALL]:
                v = values[(i, j)]
                ax.text(j + 0.5, n - 1 - i + 0.15, f'{v:.3f}', ha='center', va='center',
                        fontsize=9, color='black', style='italic')
            
            # Show policy arrows
            if policy is not None and (i, j) in policy and cell_type == GridWorld.EMPTY:
                action = policy[(i, j)]
                arrow_map = {
                    'up': (0, 0.25),
                    'down': (0, -0.25),
                    'left': (-0.25, 0),
                    'right': (0.25, 0)
                }
                dx, dy = arrow_map[action]
                ax.annotate('', xy=(j + 0.5 + dx, n - 1 - i + 0.5 + dy),
                           xytext=(j + 0.5, n - 1 - i + 0.5),
                           arrowprops=dict(arrowstyle='->', lw=2.5, color='#2c3e50'))
    
    # Mark start
    si, sj = env.start
    ax.text(sj + 0.5, n - 1 - si + 0.85, 'START', ha='center', va='center',
            fontsize=8, fontweight='bold', color='#3498db')
    
    ax.set_xlim(0, n)
    ax.set_ylim(0, n)
    ax.set_aspect('equal')
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.set_xticks(range(n))
    ax.set_yticks(range(n))
    ax.set_xticklabels(range(n))
    ax.set_yticklabels(range(n-1, -1, -1))
    ax.set_xlabel('Column')
    ax.set_ylabel('Row')
    plt.tight_layout()
    plt.show()


visualize_gridworld(env, title='4×4 GridWorld Environment')

The agent starts at the bottom-left and must navigate to the GOAL (+1.0) while avoiding the TRAP (-1.0). Every step that does not end in a terminal cell earns a reward of -0.04, which encourages the agent to reach the goal quickly. There's a 10% chance of slipping perpendicular to the intended direction (5% to each side); this stochasticity is what makes the problem interesting.
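
The slip rule can be spelled out as a probability distribution over the direction that is actually executed. `executed_direction_probs` is a hypothetical helper written for this illustration; it is not part of the `GridWorld` class and only mirrors the perpendicular-slip rule described above, ignoring walls and boundaries.

```python
def executed_direction_probs(action, slip_prob=0.1):
    """Distribution over the direction actually executed for an intended action."""
    perpendicular = {
        'up': ('left', 'right'),
        'down': ('left', 'right'),
        'left': ('up', 'down'),
        'right': ('up', 'down'),
    }
    # The intended direction keeps most of the probability mass
    probs = {action: 1.0 - slip_prob}
    # The remainder is split evenly between the two perpendicular directions
    for side in perpendicular[action]:
        probs[side] = slip_prob / 2
    return probs


probs = executed_direction_probs('up')
print(probs)
```

Note that the agent never moves opposite to its intended direction: choosing `'up'` can result in `'up'`, `'left'`, or `'right'`, but not `'down'`.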

**F1 analogy:** Think of the gridworld as a simplified race. The GOAL is the podium finish, the TRAP is a DNF (retirement), and the small step penalty is tire degradation — every lap costs you something, so you need to reach the finish efficiently. The 10% slip probability mirrors the unpredictability of racing: you plan to push through Turn 3, but you might get understeer and lose time. The wall is like a track limit violation that nullifies your move.

---

## 3. Returns and Discounting

The agent doesn't just want the next reward — it wants to maximize the **total reward over time**. We call this the **return**:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

The **discount factor** $\gamma$ determines how much we value future rewards:

- $\gamma = 0$: Only care about immediate reward (greedy)
- $\gamma = 1$: Value all future rewards equally (far-sighted)
- $\gamma = 0.9$: A reward 10 steps away is worth $0.9^{10} \approx 0.35$ of an immediate reward
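
A short sketch makes the definition concrete. The return also obeys the recursion $G_t = r_t + \gamma G_{t+1}$, which follows from factoring $\gamma$ out of every term after the first. The function names and the three-step reward sequence below are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """G_0 = sum over k of gamma^k * rewards[k]."""
    return sum(r * gamma**k for k, r in enumerate(rewards))


def returns_to_go(rewards, gamma):
    """[G_0, G_1, ...] computed backward via G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [-0.04, -0.04, 1.0]   # made-up three-step episode
gamma = 0.9

G0 = discounted_return(rewards, gamma)
Gs = returns_to_go(rewards, gamma)
print(round(G0, 3))                    # 0.734
print([round(g, 3) for g in Gs])       # [0.734, 0.86, 1.0]
print(round(0.9 ** 10, 3))             # 0.349
```

The backward pass computes every $G_t$ in a single sweep, which is the same idea the Bellman equation later applies to expected returns.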

### Why discount?

1. **Mathematical convenience**: With $\gamma < 1$ and bounded rewards, the infinite sum is guaranteed to converge
2. **Uncertainty**: The further into the future, the less certain we are
3. **Human-like behavior**: We prefer rewards sooner rather than later

**F1 analogy:** The discount factor captures how much future laps matter compared to this one. With $\gamma$ close to 1 (say 0.99), a position gain on lap 50 is almost as valuable as one on lap 1 — which is how F1 strategists actually think. But with $\gamma = 0.5$, you'd heavily prioritize immediate gains, like a driver who burns through tires to lead lap 1 but fades by mid-race. A safety car on lap 40 is uncertain — discounting reflects that future rewards are less predictable. Teams running Monte Carlo race simulations with thousands of scenarios are implicitly reasoning about discounted returns across different possible futures.

### Visualization: Effect of Discount Factor

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Discount curves
steps = np.arange(0, 20)
gammas = [0.0, 0.5, 0.9, 0.99, 1.0]
colors = plt.cm.viridis(np.linspace(0.1, 0.9, len(gammas)))

for gamma, color in zip(gammas, colors):
    weights = [gamma**k for k in steps]
    axes[0].plot(steps, weights, 'o-', label=f'γ = {gamma}', color=color, markersize=4)

axes[0].set_xlabel('Steps into the future (k)', fontsize=12)
axes[0].set_ylabel('Discount weight (γᵏ)', fontsize=12)
axes[0].set_title('How Much We Value Future Rewards', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Right: Cumulative return example
rewards = [-0.04, -0.04, -0.04, -0.04, -0.04, 1.0]  # Five penalized steps, then the step into the goal
gamma_vals = [0.5, 0.9, 0.99]
bar_width = 0.25
x = np.arange(len(rewards))

for i, gamma in enumerate(gamma_vals):
    discounted = [rewards[k] * gamma**k for k in range(len(rewards))]
    axes[1].bar(x + i * bar_width, discounted, bar_width,
               label=f'γ = {gamma} (Return = {sum(discounted):.3f})',
               alpha=0.8)

axes[1].set_xlabel('Time step', fontsize=12)
axes[1].set_ylabel('Discounted reward', fontsize=12)
axes[1].set_title('Same Path, Different Discount Factors', fontsize=13, fontweight='bold')
axes[1].set_xticks(x + bar_width)
axes[1].set_xticklabels([f't={k}' for k in range(len(rewards))])
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Numeric example
print("Example: Agent takes 5 steps then reaches goal")
print(f"Rewards: {rewards}")
for gamma in [0.5, 0.9, 0.99]:
    G = sum(r * gamma**k for k, r in enumerate(rewards))
    print(f"  γ = {gamma}: Return G₀ = {G:.4f}")

---

## 4. Value Functions and the Bellman Equation

Value functions answer the question: **"How good is it to be in a particular state (or to take a particular action in a state)?"**

### State-Value Function V(s)

The **state-value function** $V^\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi[G_t | s_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\right]$$

**F1 analogy:** V(s) answers: "How valuable is this race position given current conditions?" If you're P3 with fresh mediums and 20 laps to go, V(s) is high — you have a great shot at the podium. If you're P15 with worn hards and 5 laps left, V(s) is low.

### Action-Value Function Q(s, a)

The **action-value function** $Q^\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, then following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi[G_t | s_t = s, a_t = a]$$

**F1 analogy:** Q(s, a) answers: "How good is pitting NOW vs. next lap, given we're P3 with worn tires?" Q(P3_worn_tires, pit_now) might be 0.7 (likely P4 finish after losing track position). Q(P3_worn_tires, push_one_more_lap) might be 0.6 (risk of tire failure, but could maintain position if tires hold). The strategist compares Q-values to make the call.
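
The two value functions are tied together by two standard identities, written here in the notation already introduced. The state value is the policy-weighted average of the action values, and an action value is the expected immediate reward plus the discounted value of wherever the action leads:

$$V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s, a)$$

$$Q^\pi(s, a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^\pi(s')$$

For a deterministic policy the first sum has a single nonzero term, so $V^\pi(s) = Q^\pi(s, \pi(s))$. Substituting the second identity into the first, and using $\sum_{s'} P(s'|s,a) = 1$ to move $R(s,a)$ inside the inner sum, yields a recursion in $V^\pi$ alone.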

### The Bellman Equation

The key insight: we can express the value of a state **recursively** in terms of the values of successor states:

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[R(s,a) + \gamma V^\pi(s')\right]$$

This is the **Bellman expectation equation** — the foundation of almost every RL algorithm.

**In words**: The value of a state equals the expected immediate reward plus the discounted value of the next state, averaged over all actions and transitions.
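
A single backup can be worked through numerically. The sketch below assumes a deterministic policy, so the outer sum over actions collapses to the one action the policy picks. The state names `'A'`, `'B'`, `'C'`, the value estimates, and the transition probabilities are all made up; the `(probability, next_state, reward)` format matches what `GridWorld.get_transitions` returns.

```python
def bellman_backup(transitions, V, gamma):
    """One-step backup for a fixed action: sum of prob * (reward + gamma * V[s'])."""
    return sum(prob * (reward + gamma * V[s_next])
               for prob, s_next, reward in transitions)


# Made-up current value estimates for three neighboring states
V = {'A': 0.0, 'B': 0.5, 'C': -0.2}

# (probability, next_state, reward) triples for the chosen action
transitions = [
    (0.90, 'B', -0.04),   # intended move
    (0.05, 'A', -0.04),   # slip one way
    (0.05, 'C', -0.04),   # slip the other way
]

value = bellman_backup(transitions, V, gamma=0.9)
print(round(value, 3))    # 0.356
```

Term by term: $0.90 \times (-0.04 + 0.9 \times 0.5) = 0.369$, $0.05 \times (-0.04 + 0) = -0.002$, and $0.05 \times (-0.04 + 0.9 \times (-0.2)) = -0.011$, which sum to $0.356$.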

**F1 analogy:** The Bellman equation says: "The value of being P3 on lap 30 = the immediate reward from this lap's action + the discounted value of wherever we end up on lap 31." Today's position value = immediate reward + future race value. This is exactly how a strategy engineer thinks: "If we pit now, we lose 2 seconds (immediate cost) but gain tire advantage for the remaining laps (future value)."

### Deep Dive: Why the Bellman Equation Matters

For a fixed policy, the Bellman equation turns an intractable problem (computing expected infinite sums) into a system of $|\mathcal{S}|$ linear equations, one per state, that can be solved directly or iteratively. This is dynamic programming applied to sequential decision-making: a hard problem is broken into overlapping subproblems.
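
To make the "system of linear equations" view concrete: under a fixed policy the Bellman expectation equation can be written in matrix form as $V = R_\pi + \gamma P_\pi V$, so $V = (I - \gamma P_\pi)^{-1} R_\pi$. The sketch below assumes a hypothetical three-state chain with invented numbers; `P_pi` and `R_pi` are illustrative names for the policy-induced transition matrix and expected reward vector.

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state chain under a fixed policy; state 2 is absorbing
P_pi = np.array([
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 1.0],
])
R_pi = np.array([-0.04, 1.0, 0.0])   # expected one-step reward in each state

# Direct solve of (I - gamma * P_pi) V = R_pi
V_direct = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Iterative solve: repeatedly apply V <- R_pi + gamma * P_pi V
V_iter = np.zeros(3)
for _ in range(500):
    V_iter = R_pi + gamma * P_pi @ V_iter

print(np.round(V_direct, 4))
print(np.allclose(V_direct, V_iter))
```

The direct solve costs roughly cubic time in the number of states, which is why the iterative form is usually preferred once the state space grows beyond a few thousand states.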

### Visualization: Bellman Equation Backup Diagram

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: V(s) backup
ax = axes[0]
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 4)
ax.axis('off')
ax.set_title('V(s) Bellman Backup', fontsize=13, fontweight='bold')

# Root state
ax.plot(0, 3.5, 'o', markersize=25, color='#3498db', zorder=5)
ax.text(0, 3.5, 's', ha='center', va='center', fontsize=12, fontweight='bold', color='white')

# Action nodes
action_x = [-1.2, 0, 1.2]
action_labels = ['a₁', 'a₂', 'a₃']
for x, label in zip(action_x, action_labels):
    ax.plot(x, 2, 's', markersize=15, color='#e74c3c', zorder=5)
    ax.text(x, 2, label, ha='center', va='center', fontsize=9, color='white', fontweight='bold')
    ax.plot([0, x], [3.2, 2.2], '-', color='gray', lw=1.5)
    ax.text((0 + x)/2 - 0.15, 2.7, 'π(a|s)', fontsize=7, color='gray', ha='center')

# Next states from action a2
next_x = [-0.5, 0.5]
for x in next_x:
    ax.plot(x, 0.5, 'o', markersize=20, color='#2ecc71', zorder=5)
    ax.text(x, 0.5, "s'", ha='center', va='center', fontsize=10, fontweight='bold', color='white')
    ax.plot([0, x], [1.8, 0.7], '-', color='gray', lw=1.5)

ax.text(0.5, 1.3, 'P(s\'|s,a)', fontsize=8, color='gray', ha='center')
ax.text(0, -0.3, 'r + γV(s\')', ha='center', fontsize=10, style='italic', color='#2c3e50')

# Right: Q(s,a) backup
ax = axes[1]
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 4)
ax.axis('off')
ax.set_title('Q(s,a) Bellman Backup', fontsize=13, fontweight='bold')

# Root action
ax.plot(0, 3.5, 's', markersize=20, color='#e74c3c', zorder=5)
ax.text(0, 3.5, 'a', ha='center', va='center', fontsize=12, fontweight='bold', color='white')

# Next states
next_x = [-1, 0, 1]
for x in next_x:
    ax.plot(x, 2, 'o', markersize=22, color='#2ecc71', zorder=5)
    ax.text(x, 2, "s'", ha='center', va='center', fontsize=10, fontweight='bold', color='white')
    ax.plot([0, x], [3.2, 2.2], '-', color='gray', lw=1.5)

ax.text(0.7, 2.7, 'P(s\'|s,a)', fontsize=8, color='gray', ha='center')

# Next actions from s'
for base_x in [-1, 1]:
    offsets = [-0.3, 0.3]
    for off in offsets:
        x = base_x + off
        ax.plot(x, 0.5, 's', markersize=12, color='#e74c3c', zorder=5)
        ax.text(x, 0.5, "a'", ha='center', va='center', fontsize=7, color='white', fontweight='bold')
        ax.plot([base_x, x], [1.8, 0.7], '-', color='gray', lw=1)

ax.text(0.7, 1.3, 'π(a\'|s\')', fontsize=8, color='gray', ha='center')
ax.text(0, -0.3, 'r + γ Σ π(a\'|s\')Q(s\',a\')', ha='center', fontsize=10, style='italic', color='#2c3e50')

plt.tight_layout()
plt.show()

---

## 5. Policy Evaluation

Given a policy $\pi$, **policy evaluation** computes $V^\pi(s)$ for every state. We do this by repeatedly applying the Bellman equation until convergence:

$$V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[R(s,a) + \gamma V_k(s')\right]$$

Starting from $V_0(s) = 0$ for all states, this iteration is guaranteed to converge to $V^\pi$ when $\gamma < 1$.
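
One way to see why the iteration converges, assuming $\gamma < 1$: subtract the Bellman expectation equation for $V^\pi$ from the update rule. The reward terms cancel, leaving

$$V_{k+1}(s) - V^\pi(s) = \gamma \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[V_k(s') - V^\pi(s')\right]$$

The right-hand side is $\gamma$ times a weighted average of errors at successor states, and an average cannot exceed the largest error, so

$$\max_s \left|V_{k+1}(s) - V^\pi(s)\right| \le \gamma \max_s \left|V_k(s) - V^\pi(s)\right|$$

The worst-case error therefore shrinks by at least a factor of $\gamma$ per sweep. With $\gamma = 0.9$, the error after $k$ sweeps is at most $0.9^k$ times the initial error.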

In [None]:
def policy_evaluation(env, policy, gamma=0.9, theta=1e-6, max_iterations=1000):
    """Evaluate a policy by iteratively applying the Bellman expectation equation.
    
    Args:
        env: GridWorld environment
        policy: dict mapping state -> action (deterministic policy)
        gamma: discount factor
        theta: convergence threshold
        max_iterations: safety limit
    
    Returns:
        V: dict mapping state -> value
        history: list of value dicts at each iteration (for visualization)
    """
    # Initialize values to zero
    V = {s: 0.0 for s in env.states}
    history = [V.copy()]
    
    for iteration in range(max_iterations):
        delta = 0
        V_new = V.copy()
        
        for s in env.states:
            if s in env.terminal_states:
                # Terminal states have value 0: the +1/-1 reward is already
                # collected on the transition *into* them (see get_transitions),
                # so assigning it here as well would double-count it.
                V_new[s] = 0.0
                continue
            
            # Bellman expectation equation for deterministic policy
            action = policy.get(s, 'right')  # Default action
            transitions = env.get_transitions(s, action)
            
            v = sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in transitions)
            
            delta = max(delta, abs(v - V[s]))
            V_new[s] = v
        
        V = V_new
        history.append(V.copy())
        
        if delta < theta:
            print(f"Policy evaluation converged after {iteration + 1} iterations (Δ < {theta})")
            break
    
    return V, history


# Evaluate a simple policy: always go right
simple_policy = {s: 'right' for s in env.states}
V_simple, history_simple = policy_evaluation(env, simple_policy, gamma=0.9)

print("\nValues under 'always go right' policy:")
for i in range(env.grid_size):
    row_vals = []
    for j in range(env.grid_size):
        if (i, j) in V_simple:
            row_vals.append(f"{V_simple[(i,j)]:7.3f}")
        else:
            row_vals.append("  WALL ")
    print(" | ".join(row_vals))

In [None]:
# Visualize the values
visualize_gridworld(env, values=V_simple, policy=simple_policy,
                    title='Policy Evaluation: "Always Go Right"')

### Visualization: Convergence of Policy Evaluation

In [None]:
# Show how values converge over iterations
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

last_it = len(history_simple) - 1
iterations_to_show = [0, min(2, last_it), min(10, last_it), last_it]  # Clamp in case of early convergence
n = env.grid_size

for ax, it in zip(axes, iterations_to_show):
    V_it = history_simple[it]
    grid_vals = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if (i, j) in V_it:
                grid_vals[i, j] = V_it[(i, j)]
            else:
                grid_vals[i, j] = np.nan
    
    im = ax.imshow(grid_vals, cmap='RdYlGn', vmin=-1, vmax=1)
    ax.set_title(f'Iteration {it}', fontsize=12, fontweight='bold')
    
    for i in range(n):
        for j in range(n):
            if (i, j) in V_it:
                ax.text(j, i, f'{V_it[(i,j)]:.2f}', ha='center', va='center',
                       fontsize=9, fontweight='bold')
            elif env.grid[i, j] == GridWorld.WALL:
                ax.text(j, i, 'W', ha='center', va='center', fontsize=12, fontweight='bold')
    
    ax.set_xticks(range(n))
    ax.set_yticks(range(n))

plt.suptitle('Policy Evaluation Convergence', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

Notice how the values propagate backward from the goal and trap states, iteration by iteration. This is the Bellman equation at work — each iteration, information about future rewards flows one more step backward through the state space.

**F1 analogy:** This is exactly how race strategy propagates backward from the finish. On the last lap, the only thing that matters is crossing the line. On the second-to-last lap, value depends on whether you're positioned to gain or lose a place at the flag. Each earlier lap's value builds on the laps that follow — just like the Bellman backup propagating from the GOAL cell outward through the grid.

---

## 6. Policy Improvement and Policy Iteration

Policy evaluation tells us *how good* a policy is. But we want the *best* policy. **Policy improvement** uses the value function to greedily select better actions:

$$\pi'(s) = \arg\max_a \sum_{s'} P(s'|s,a) \left[R(s,a) + \gamma V^\pi(s')\right]$$

**Policy iteration** alternates between:
1. **Evaluate**: Compute $V^\pi$ for the current policy
2. **Improve**: Update the policy greedily with respect to $V^\pi$

For a finite MDP, this is guaranteed to converge to an optimal policy $\pi^*$ after a finite number of improvement steps.
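
The reason behind the convergence guarantee can be sketched with the policy improvement theorem. This is a standard result, stated here in the notation above rather than derived in this notebook. Write $Q^\pi(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a) + \gamma V^\pi(s')\right]$. If $\pi$ is deterministic and $\pi'$ is greedy with respect to $V^\pi$, then for every state $s$:

$$Q^\pi(s, \pi'(s)) = \max_a Q^\pi(s,a) \;\ge\; Q^\pi(s, \pi(s)) = V^\pi(s) \quad\Longrightarrow\quad V^{\pi'}(s) \ge V^\pi(s)$$

A finite MDP has only finitely many deterministic policies, and each step either improves the policy or leaves its values unchanged. In the second case $V^\pi$ already satisfies the Bellman optimality equation, so the loop has to stop at an optimal policy.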

**F1 analogy:** Policy iteration is how teams improve their race strategy over a season. First, they *evaluate* a strategy — "our one-stop medium-hard plan at Silverstone scored 12 points" (policy evaluation). Then they *improve* — "given what we know about tire degradation, switching to a two-stop soft-medium plan would have scored 18 points" (policy improvement). They test the new strategy at the next race, evaluate it again, improve again. Over a season, the strategy converges toward the optimal approach for each circuit. This evaluate-improve cycle is exactly policy iteration.

In [None]:
def policy_improvement(env, V, gamma=0.9):
    """Improve policy greedily based on value function."""
    policy = {}
    
    for s in env.states:
        if s in env.terminal_states:
            continue
        
        best_action = None
        best_value = float('-inf')
        
        for action in env.ACTIONS:
            transitions = env.get_transitions(s, action)
            q_sa = sum(prob * (reward + gamma * V[s_next])
                       for prob, s_next, reward in transitions)
            
            if q_sa > best_value:
                best_value = q_sa
                best_action = action
        
        policy[s] = best_action
    
    return policy


def policy_iteration(env, gamma=0.9):
    """Full policy iteration algorithm."""
    # Start with random policy
    policy = {s: np.random.choice(env.ACTIONS) for s in env.states
              if s not in env.terminal_states}
    
    iteration = 0
    while True:
        # Policy evaluation
        V, _ = policy_evaluation(env, policy, gamma)
        
        # Policy improvement
        new_policy = policy_improvement(env, V, gamma)
        
        # Check for convergence
        if new_policy == policy:
            print(f"\nPolicy iteration converged after {iteration + 1} improvement steps!")
            break
        
        policy = new_policy
        iteration += 1
    
    return policy, V


optimal_policy, optimal_V = policy_iteration(env, gamma=0.9)

In [None]:
# Visualize the optimal policy
visualize_gridworld(env, values=optimal_V, policy=optimal_policy,
                    title='Optimal Policy (Policy Iteration, γ=0.9)')

The arrows show the optimal action in each state. Notice how the agent learns to navigate around the wall, move toward the goal, and stay away from the trap. The values decrease as we move further from the goal, reflecting the discounted future reward.

**F1 analogy:** The optimal policy arrows are like the "strategy map" a team builds for every possible race situation. From P8 with fresh tires (high value), the arrow points toward "push" — close the gap and overtake. From a position near the trap (running on worn tires near a rival), the arrow says "conserve" — avoid the risk. A great strategist has an arrow for every situation before the race even starts.

---

## 7. Value Iteration

**Value iteration** combines evaluation and improvement into a single step, updating values directly with the Bellman **optimality** equation:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a) \left[R(s,a) + \gamma V_k(s')\right]$$

Instead of fully evaluating a policy before improving it, value iteration takes the max over actions at every step — essentially doing greedy improvement as part of the evaluation.

| Method | Steps per Iteration | Convergence | F1 Parallel |
|--------|-------------------|-------------|-------------|
| **Policy Iteration** | Full evaluation + improvement | Fewer outer iterations | Full season review then strategy overhaul |
| **Value Iteration** | Single Bellman optimality update | More iterations, but each one is cheaper | Lap-by-lap real-time strategy adjustments |
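
To see the Bellman optimality update in isolation, here is a minimal sketch on a made-up three-state chain, separate from the gridworld used in this notebook. The arrays `P_chain` and `R_chain`, the chain layout, and the tolerance are all assumptions chosen for illustration. State 2 is absorbing, and moving right from state 1 pays a reward of 1.

```python
import numpy as np

n_states, n_actions, gamma_chain = 3, 2, 0.9

# P_chain[a, s, s'] = probability of landing in s' after action a in state s
P_chain = np.zeros((n_actions, n_states, n_states))
P_chain[0, 0, 0] = 1.0  # left from 0: stay at 0
P_chain[0, 1, 0] = 1.0  # left from 1: go to 0
P_chain[0, 2, 2] = 1.0  # state 2 is absorbing
P_chain[1, 0, 1] = 1.0  # right from 0: go to 1
P_chain[1, 1, 2] = 1.0  # right from 1: go to 2
P_chain[1, 2, 2] = 1.0  # state 2 is absorbing

# R_chain[s, a] = expected immediate reward for action a in state s
R_chain = np.zeros((n_states, n_actions))
R_chain[1, 1] = 1.0

V_chain = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    Q_chain = R_chain + gamma_chain * np.einsum('ast,t->sa', P_chain, V_chain)
    V_next = Q_chain.max(axis=1)  # max over actions
    converged = np.max(np.abs(V_next - V_chain)) < 1e-10
    V_chain = V_next
    if converged:
        break

greedy_chain = Q_chain.argmax(axis=1)
print("V* =", V_chain)
print("greedy actions =", greedy_chain)
```

The values settle at 0.9, 1.0, and 0.0: one discount step separates state 0 from the reward, which is the same backward propagation visible in the gridworld plots.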

In [None]:
def value_iteration(env, gamma=0.9, theta=1e-6, max_iterations=1000):
    """Value iteration: combine evaluation and improvement in one step."""
    V = {s: 0.0 for s in env.states}
    history = []
    
    for iteration in range(max_iterations):
        delta = 0
        V_new = V.copy()
        
        for s in env.states:
            if s in env.terminal_states:
                if env.grid[s[0], s[1]] == GridWorld.GOAL:
                    V_new[s] = 1.0
                elif env.grid[s[0], s[1]] == GridWorld.TRAP:
                    V_new[s] = -1.0
                continue
            
            # Bellman optimality equation: take the MAX over actions
            action_values = []
            for action in env.ACTIONS:
                transitions = env.get_transitions(s, action)
                q = sum(prob * (reward + gamma * V[s_next])
                        for prob, s_next, reward in transitions)
                action_values.append(q)
            
            best_value = max(action_values)
            delta = max(delta, abs(best_value - V[s]))
            V_new[s] = best_value
        
        V = V_new
        history.append({'iteration': iteration, 'delta': delta, 'V': V.copy()})
        
        if delta < theta:
            print(f"Value iteration converged after {iteration + 1} iterations")
            break
    
    # Extract policy from final values
    policy = policy_improvement(env, V, gamma)
    
    return V, policy, history


V_vi, policy_vi, history_vi = value_iteration(env, gamma=0.9)

# Compare with policy iteration
print("\nValue Iteration vs Policy Iteration values match:",
      all(abs(V_vi[s] - optimal_V[s]) < 1e-4 for s in env.states))
print("Policies match:",
      all(policy_vi.get(s) == optimal_policy.get(s) for s in env.states
          if s not in env.terminal_states))

In [None]:
# Visualize convergence speed
deltas = [h['delta'] for h in history_vi]

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax.semilogy(range(len(deltas)), deltas, 'b-o', markersize=3)
ax.axhline(y=1e-6, color='r', linestyle='--', label='Convergence threshold')
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('Max value change (Δ)', fontsize=12)
ax.set_title('Value Iteration Convergence', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

## 8. Exploration vs. Exploitation

One of the fundamental challenges in RL: should the agent **exploit** what it already knows works, or **explore** new actions that might lead to better outcomes?

- **Exploitation**: Choose the action with the highest estimated value
- **Exploration**: Try less-visited or uncertain actions

Too much exploitation → stuck in local optima (never discovers the best path)  
Too much exploration → wastes time on suboptimal actions

**F1 analogy:** This is the tension every race strategist lives with. *Exploitation* is sticking with the proven one-stop strategy that has worked all season. *Exploration* is trying an aggressive undercut, an untested tire compound, or a radically different pit window. Red Bull in 2021 often explored novel strategies against Mercedes — sometimes they found gold (Abu Dhabi), sometimes they lost out. A team that never explores gets predictable and loses; a team that always explores never capitalizes on what works. The sweet spot is exploring early in the season (high epsilon) and exploiting your best strategies for the championship-deciding races (low epsilon).

### Common Exploration Strategies

| Strategy | How it Works | Tradeoff | F1 Parallel |
|----------|-------------|----------|-------------|
| **epsilon-greedy** | With prob epsilon, random action; otherwise, best action | Simple, widely used | 10% of the time, try something unconventional |
| **epsilon-decay** | Start with high epsilon, decrease over time | Explores early, exploits later | Experiment in practice/early races, lock strategy for title fight |
| **Softmax/Boltzmann** | Sample actions with probability proportional to $\exp(Q/\tau)$, where $\tau$ is the temperature | Smooth exploration; $\tau$ needs tuning | Weight new strategies by estimated value, not purely random |
| **UCB** | Bonus for under-explored actions | Principled, optimistic | "We haven't tried the hard compound at this track — give it a bonus" |
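
The effect of the temperature in softmax/Boltzmann exploration is easiest to see with numbers. The sketch below uses a hypothetical helper `boltzmann_probs` and made-up Q-values. It is an illustration only, not part of the gridworld code.

```python
import numpy as np

def boltzmann_probs(q_values, temperature):
    """Return action probabilities proportional to exp(Q / temperature)."""
    q = np.asarray(q_values, dtype=float)
    scaled = (q - q.max()) / temperature  # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

q_demo = [1.0, 2.0, 3.0]  # made-up value estimates for three actions
for tau in (0.1, 1.0, 10.0):
    print(f"temperature={tau:>4}: {np.round(boltzmann_probs(q_demo, tau), 3)}")
```

A low temperature behaves almost greedily, while a high temperature approaches uniform random choice. In that sense the temperature plays the same role that epsilon plays for epsilon-greedy.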

In [None]:
def epsilon_greedy(Q, state, epsilon, actions):
    """Select action using epsilon-greedy strategy."""
    if np.random.random() < epsilon:
        return np.random.choice(actions)  # Explore
    else:
        q_values = [Q.get((state, a), 0.0) for a in actions]
        return actions[np.argmax(q_values)]  # Exploit


def softmax_action(Q, state, temperature, actions):
    """Select action using softmax (Boltzmann) exploration."""
    q_values = np.array([Q.get((state, a), 0.0) for a in actions])
    # Numerical stability
    q_values = q_values - np.max(q_values)
    probs = np.exp(q_values / temperature)
    probs = probs / probs.sum()
    return np.random.choice(actions, p=probs)


# Demonstrate the multi-armed bandit problem — the simplest explore/exploit scenario
class MultiArmedBandit:
    """A simple multi-armed bandit with Gaussian rewards."""
    def __init__(self, n_arms=5):
        self.n_arms = n_arms
        self.true_means = np.random.randn(n_arms)  # True reward means
    
    def pull(self, arm):
        """Pull an arm, get noisy reward."""
        return self.true_means[arm] + np.random.randn() * 0.5


def run_bandit_experiment(n_steps=1000, n_runs=200, n_arms=10):
    """Compare epsilon-greedy settings on a bandit problem, averaged over runs."""
    strategies = {
        'ε=0 (pure greedy)': 0.0,
        'ε=0.01': 0.01,
        'ε=0.1': 0.1,
        'ε=0.5': 0.5,
    }
    
    results = {name: np.zeros(n_steps) for name in strategies}
    
    for run in range(n_runs):
        # Every strategy faces the same bandit within a run
        bandit = MultiArmedBandit(n_arms=n_arms)
        
        for name, epsilon in strategies.items():
            Q = np.zeros(n_arms)  # Estimated values
            N = np.zeros(n_arms)  # Action counts
            
            for t in range(n_steps):
                # Epsilon-greedy selection
                if np.random.random() < epsilon:
                    arm = np.random.randint(n_arms)
                else:
                    arm = np.argmax(Q)
                
                reward = bandit.pull(arm)
                N[arm] += 1
                Q[arm] += (reward - Q[arm]) / N[arm]  # Incremental sample mean
                
                results[name][t] += reward
    
    # Average over runs
    for name in results:
        results[name] /= n_runs
    
    return results


bandit_results = run_bandit_experiment()
print("Bandit experiment complete!")

### Visualization: Exploration vs. Exploitation

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Average reward over time
colors = ['#e74c3c', '#f39c12', '#2ecc71', '#3498db']
for (name, rewards), color in zip(bandit_results.items(), colors):
    # Smooth with running average
    window = 50
    smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
    axes[0].plot(smoothed, label=name, color=color, linewidth=2)

axes[0].set_xlabel('Step', fontsize=12)
axes[0].set_ylabel('Average Reward', fontsize=12)
axes[0].set_title('Multi-Armed Bandit: Exploration Strategies', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Right: Epsilon decay schedule
steps = np.arange(1000)
decay_schedules = {
    'Constant ε=0.1': np.ones(1000) * 0.1,
    'Linear decay': np.maximum(0.01, 1.0 - steps / 500),
    'Exponential decay': np.maximum(0.01, np.exp(-steps / 200)),
}

for (name, schedule), color in zip(decay_schedules.items(), ['#e74c3c', '#2ecc71', '#3498db']):
    axes[1].plot(steps, schedule, label=name, color=color, linewidth=2)

axes[1].set_xlabel('Step', fontsize=12)
axes[1].set_ylabel('Epsilon (exploration rate)', fontsize=12)
axes[1].set_title('Common ε-Decay Schedules', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key insight: ε=0.1 finds a good balance — enough exploration to find")
print("the best arm, but not so much that it wastes pulls on bad arms.")
print("Pure greedy (ε=0) often gets stuck on a suboptimal arm early on.")

---

## 9. Temporal Difference Learning

So far, our methods require knowing the environment's transition dynamics $P(s'|s,a)$. In practice, the agent often doesn't have this information — it must learn from experience.

**Temporal Difference (TD) learning** bridges the gap between dynamic programming (which requires a model) and Monte Carlo methods (which require complete episodes).

### TD(0) Update Rule

$$V(s_t) \leftarrow V(s_t) + \alpha \left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]$$

The term in brackets is the **TD error** $\delta_t$:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

**Intuition**: The TD error measures how "surprised" the agent is. If the actual reward plus estimated future value is higher than expected, the TD error is positive, and we increase the value estimate.
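
A single update with concrete numbers makes the rule tangible. The two-state table `V_demo`, the reward, and the step size below are made up for illustration.

```python
gamma_demo, alpha_demo = 0.9, 0.1
V_demo = {"A": 0.5, "B": 0.8}  # current value estimates (made up)

# One observed transition: from A to B with reward 0.1
reward_demo = 0.1
td_target = reward_demo + gamma_demo * V_demo["B"]  # 0.1 + 0.9 * 0.8, about 0.82
td_error = td_target - V_demo["A"]                  # about 0.32: better than expected
V_demo["A"] += alpha_demo * td_error                # moves from 0.5 to about 0.532

print(f"target={td_target:.3f}, error={td_error:.3f}, new V(A)={V_demo['A']:.3f}")
```

The estimate for A moves only a fraction `alpha_demo` of the way toward the target, which is what keeps a single noisy transition from overwriting everything learned so far.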

**F1 analogy:** The TD error is the gap between what the strategist *expected* and what *actually happened*. Before a pit stop, they predicted: "We'll lose 22 seconds, come out P5, and the value of P5 at this tire age is X." After the stop, they got an undercut and came out P4 — the TD error is positive. The strategist updates their model: "Pitting at that tire age is better than we thought." Over many races, these lap-by-lap surprises refine the strategy model. This is learning from experience, not from a simulator.

| Method | Updates | Requires | F1 Parallel |
|--------|---------|----------|-------------|
| **Dynamic Programming** | After full sweep of all states | Model of environment | Pre-race simulator with full track model |
| **Monte Carlo** | After complete episode | Complete episodes | Post-race review: "How did the whole race go?" |
| **TD Learning** | After each step | Only current transition | Lap-by-lap strategy updates during the race |
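
The "after complete episode" entry for Monte Carlo comes from the fact that its target is the full discounted return $G_t$, which is only known once the episode ends. A minimal sketch of that computation follows, with a hypothetical helper `discounted_returns` and a made-up reward sequence.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one finished episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Work backward from the final step so each return reuses the next one
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_rewards = [0.0, 0.0, 1.0]  # made-up episode: reward only at the end
G_demo = discounted_returns(episode_rewards, gamma=0.9)
print(G_demo)
```

TD(0) replaces the unknown tail of this sum with the current estimate $V(s_{t+1})$, which is why it can update after every step instead of waiting for the end of the episode.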

In [None]:
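def td_zero(env, n_episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """TD(0) prediction: estimate V(s) for an ε-greedy behavior policy.

    The value update uses only sampled transitions. The greedy branch of
    action selection still does a one-step lookahead with env.get_transitions.
    """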
    V = defaultdict(float)
    visit_counts = defaultdict(int)
    td_errors = []
    
    for episode in range(n_episodes):
        state = env.reset()
        episode_errors = []
        
        for _ in range(100):  # Max steps per episode
            # ε-greedy action selection (using current V to estimate Q)
            if np.random.random() < epsilon:
                action = np.random.choice(env.ACTIONS)
            else:
                # Greedy: pick action that leads to highest-value next state
                action_values = []
                for a in env.ACTIONS:
                    transitions = env.get_transitions(state, a)
                    q = sum(p * (r + gamma * V[s_]) for p, s_, r in transitions)
                    action_values.append(q)
                action = env.ACTIONS[np.argmax(action_values)]
            
            next_state, reward, done = env.step(action)
            visit_counts[state] += 1
            
            # TD(0) update
            td_target = reward + gamma * V[next_state] * (0 if done else 1)
            td_error = td_target - V[state]
            V[state] += alpha * td_error
            
            episode_errors.append(abs(td_error))
            
            if done:
                # Update terminal state values
                if env.grid[next_state[0], next_state[1]] == GridWorld.GOAL:
                    V[next_state] = 1.0
                elif env.grid[next_state[0], next_state[1]] == GridWorld.TRAP:
                    V[next_state] = -1.0
                break
            
            state = next_state
        
        if episode_errors:
            td_errors.append(np.mean(episode_errors))
    
    return dict(V), td_errors


V_td, td_errors = td_zero(env, n_episodes=10000, alpha=0.1, gamma=0.9)

# Compare TD-learned values with exact values
print("TD(0) learned values vs. exact (value iteration):")
print(f"{'State':<10} {'TD(0)':>8} {'Exact':>8} {'Diff':>8}")
print("-" * 36)
for s in sorted(env.states):
    td_val = V_td.get(s, 0)
    exact_val = V_vi.get(s, 0)
    print(f"{str(s):<10} {td_val:8.3f} {exact_val:8.3f} {abs(td_val - exact_val):8.3f}")

In [None]:
# Visualize TD learning convergence
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: TD error over episodes
window = 100
smoothed_errors = np.convolve(td_errors, np.ones(window)/window, mode='valid')
axes[0].plot(smoothed_errors, color='#3498db', linewidth=1.5)
axes[0].set_xlabel('Episode', fontsize=12)
axes[0].set_ylabel('Average |TD Error|', fontsize=12)
axes[0].set_title('TD(0) Learning: Error Convergence', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Right: Comparison scatter plot
td_vals = [V_td.get(s, 0) for s in env.states]
exact_vals = [V_vi.get(s, 0) for s in env.states]
axes[1].scatter(exact_vals, td_vals, s=100, color='#2ecc71', edgecolor='black', zorder=5)
axes[1].plot([-1, 1], [-1, 1], 'r--', label='Perfect agreement', linewidth=2)
axes[1].set_xlabel('Exact V(s) (Value Iteration)', fontsize=12)
axes[1].set_ylabel('Learned V(s) (TD(0))', fontsize=12)
axes[1].set_title('TD(0) vs. Exact Values', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
axes[1].set_aspect('equal')

plt.tight_layout()
plt.show()

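print("TD(0) estimates values from sampled transitions: the update never uses P(s'|s,a).")
print("(The greedy action selection above still consults env.get_transitions.)")
print("Any remaining gap partly reflects that TD(0) evaluates the ε-greedy policy,")
print("while value iteration returns the optimal values.")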

---

## 10. Value-Based vs. Policy-Based Methods

RL methods fall into two broad categories:

### Value-Based Methods
Learn a value function $V(s)$ or $Q(s,a)$, then derive a policy from it.
- Examples: Q-learning, DQN, SARSA
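- **Pros**: Often sample efficient (off-policy variants such as Q-learning and DQN can reuse past experience); tabular versions have convergence guarantees
- **Cons**: Acting greedily needs an argmax over actions, which is only straightforward for discrete action spaces (without extensions)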

**F1 analogy:** Like building a massive lookup table — "for every possible race situation, here's how valuable each action is." Then the strategist just picks the highest-value action. Works great when the situations are enumerable, but F1 has effectively infinite states.
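
"Derive a policy from it" is a one-liner once the Q-values sit in a table. The sketch below uses a made-up table `Q_table` with three states and two actions; the numbers are placeholders, not values learned in this notebook.

```python
import numpy as np

# Rows are states, columns are actions (made-up estimates)
Q_table = np.array([
    [0.2, 0.7],
    [0.5, 0.1],
    [0.0, 0.0],
])

greedy_actions = Q_table.argmax(axis=1)  # best action index per state
state_values = Q_table.max(axis=1)       # V(s) = max_a Q(s, a) under the greedy policy

print("greedy actions:", greedy_actions)
print("state values:  ", state_values)
```

Note that `argmax` breaks ties by picking the first action, so a state with identical Q-values (the last row here) always maps to action 0.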

### Policy-Based Methods
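Learn the policy $\pi(a|s)$ directly, without requiring a value function.
- Examples: REINFORCE, evolution strategies (PPO and A2C also learn a critic, so they are listed under actor-critic below)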
- **Pros**: Handle continuous actions, can learn stochastic policies
- **Cons**: Higher variance, less sample efficient

**F1 analogy:** Like training a driver's instincts directly — "in this situation, do this." The driver doesn't compute values; they've internalized the optimal response. This works for continuous decisions like steering angle and throttle modulation.
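
As a small illustration of learning a policy directly, the sketch below runs gradient ascent on a softmax policy for a three-armed bandit. To keep it deterministic it uses the exact expected gradient $\partial J / \partial \theta_k = \pi_k (r_k - J)$, which REINFORCE would estimate from sampled actions. The arm rewards, step size, and iteration count are assumptions picked for the demo.

```python
import numpy as np

def softmax_policy(logits):
    """Turn a vector of logits into action probabilities."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

arm_rewards = np.array([0.1, 0.5, 0.9])  # made-up expected reward per arm
theta_demo = np.zeros(3)                 # policy parameters: one logit per arm
step_size = 0.1

for _ in range(2000):
    pi = softmax_policy(theta_demo)
    expected_reward = pi @ arm_rewards   # J(theta) = sum_a pi(a) * r(a)
    # Exact policy gradient for a softmax policy: pi_k * (r_k - J)
    theta_demo += step_size * pi * (arm_rewards - expected_reward)

final_pi = softmax_policy(theta_demo)
print(np.round(final_pi, 3))
```

No value table appears anywhere: the parameters of the policy itself are adjusted until almost all probability sits on the best arm.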

### Actor-Critic Methods
Combine both: an **actor** (policy) and a **critic** (value function).
- Examples: A2C, PPO, SAC
- **Pros**: Lower variance than pure policy methods, more flexible than pure value methods

**F1 analogy:** The driver (actor) makes real-time decisions, while the strategist on the pit wall (critic) evaluates how those decisions affect the race outcome. The driver learns from the strategist's feedback, and the strategist refines their model from the driver's results.

In [None]:
# Visualization: taxonomy of RL methods
fig, ax = plt.subplots(1, 1, figsize=(12, 7))
ax.set_xlim(0, 12)
ax.set_ylim(0, 8)
ax.axis('off')
ax.set_title('Taxonomy of RL Methods', fontsize=16, fontweight='bold')

# RL root
root = mpatches.FancyBboxPatch((4.5, 6.5), 3, 1, boxstyle="round,pad=0.2",
                                facecolor='#2c3e50', edgecolor='black', linewidth=2)
ax.add_patch(root)
ax.text(6, 7, 'Reinforcement Learning', ha='center', va='center',
        fontsize=12, fontweight='bold', color='white')

# Three branches
branches = [
    (1, 4, 'Value-Based', '#3498db', ['Q-Learning', 'DQN', 'SARSA']),
    (4.5, 4, 'Actor-Critic', '#9b59b6', ['A2C/A3C', 'PPO', 'SAC']),
    (8, 4, 'Policy-Based', '#e74c3c', ['REINFORCE', 'TRPO', 'ES']),
]

for x, y, label, color, methods in branches:
    box = mpatches.FancyBboxPatch((x, y), 3, 1, boxstyle="round,pad=0.2",
                                   facecolor=color, edgecolor='black', linewidth=2)
    ax.add_patch(box)
    ax.text(x + 1.5, y + 0.5, label, ha='center', va='center',
            fontsize=11, fontweight='bold', color='white')
    
    # Connect to root
    ax.plot([6, x + 1.5], [6.5, y + 1], '-', color='gray', lw=1.5)
    
    # Method labels
    for i, method in enumerate(methods):
        my = y - 0.7 - i * 0.6
        ax.text(x + 1.5, my, f'• {method}', ha='center', va='center',
                fontsize=10, color=color)

# Annotations
ax.text(2.5, 1.0, 'Learn Q(s,a)\nDerive policy', ha='center', va='center',
        fontsize=9, style='italic', color='gray',
        bbox=dict(boxstyle='round', facecolor='#ecf0f1', alpha=0.8))
ax.text(6, 1.0, 'Learn both V(s)\nand π(a|s)', ha='center', va='center',
        fontsize=9, style='italic', color='gray',
        bbox=dict(boxstyle='round', facecolor='#ecf0f1', alpha=0.8))
ax.text(9.5, 1.0, 'Learn π(a|s)\ndirectly', ha='center', va='center',
        fontsize=9, style='italic', color='gray',
        bbox=dict(boxstyle='round', facecolor='#ecf0f1', alpha=0.8))

plt.tight_layout()
plt.show()

print("In the next notebooks:")
print("  NB18: Q-Learning & DQN (value-based)")
print("  NB19: REINFORCE & Actor-Critic (policy-based)")
print("  NB20: PPO & Modern RL (actor-critic, RLHF)")

---

## 11. Connecting RL to LLM Alignment

In Notebook 16, we introduced **RLHF** (Reinforcement Learning from Human Feedback). Now you can see how it fits the RL framework:

| RL Concept | RLHF for LLMs | F1 Parallel |
|-----------|----------------|-------------|
| **Agent** | The language model | Driver + strategist |
| **State** | The prompt + tokens generated so far | Current race situation |
| **Action** | Choosing the next token | Pit now, push, conserve |
| **Policy** | The model's probability distribution over tokens | Race strategy mapping |
| **Reward** | Score from a trained reward model (human preferences) | Championship points, positions gained |
| **Environment** | The token generation process | The race: track, rivals, weather |

The **PPO** algorithm (Notebook 20) is a widely used method for this optimization: it updates the LLM's policy to maximize the reward model's scores while staying close to the original model, typically through a KL penalty, so that the model does not drift into degenerate outputs that merely exploit the reward model.

This is the bridge between everything we've learned about language models and the RL techniques we'll explore in this part of the curriculum.
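
One common way to express "staying close to the original model" is to subtract a KL-divergence penalty from the reward model's score. The sketch below shows that idea on small categorical distributions. The helper names, the coefficient `beta`, and the example distributions are assumptions for illustration, and real RLHF pipelines apply the penalty over token sequences rather than a single distribution.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same outcomes."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def penalized_reward(reward_score, policy_probs, reference_probs, beta=0.1):
    """Reward model score minus beta times the drift from the reference policy."""
    return reward_score - beta * kl_divergence(policy_probs, reference_probs)

reference_probs = np.array([0.25, 0.25, 0.25, 0.25])  # original model (made up)
drifted_probs = np.array([0.70, 0.10, 0.10, 0.10])    # policy after some RL updates

print("no drift:", penalized_reward(1.0, reference_probs, reference_probs))
print("drifted: ", penalized_reward(1.0, drifted_probs, reference_probs))
```

With the same raw score, the drifted policy ends up with a lower effective reward, so the optimizer only moves away from the reference when the reward gain outweighs the penalty.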

In [None]:
# Quick simulation: how RL improves a "language model"
# Simplified example with discrete token choices

def simulate_rlhf_analogy():
    """Simulate how RL can steer a policy toward higher-reward outputs."""
    # Pretend we have 5 possible response "styles" with different reward scores
    styles = ['Verbose & Vague', 'Concise & Clear', 'Rude & Brief', 
              'Helpful & Detailed', 'Off-topic']
    true_rewards = [-0.2, 0.7, -0.8, 0.9, -0.5]  # Human preference scores
    
    # Initial policy: uniform over styles
    policy = np.ones(5) / 5
    policy_history = [policy.copy()]
    
    # Heuristic update, not the exact REINFORCE gradient: nudge the sampled
    # action's logit by (learning_rate * reward), then renormalize.
    learning_rate = 0.3
    for step in range(20):
        # Sample an action from policy
        action = np.random.choice(5, p=policy)
        reward = true_rewards[action] + np.random.randn() * 0.1
        
        # Positive reward raises the sampled action's logit; negative lowers it
        gradient = np.zeros(5)
        gradient[action] = reward
        
        # Renormalize with a softmax over the updated logits
        logits = np.log(policy + 1e-8) + learning_rate * gradient
        exp_logits = np.exp(logits - logits.max())
        policy = exp_logits / exp_logits.sum()
        policy_history.append(policy.copy())
    
    return styles, true_rewards, np.array(policy_history)


styles, rewards, policy_hist = simulate_rlhf_analogy()

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
colors = ['#e74c3c', '#2ecc71', '#e67e22', '#3498db', '#95a5a6']
for i, (style, color) in enumerate(zip(styles, colors)):
    ax.plot(policy_hist[:, i], label=f'{style} (r={rewards[i]})', 
            color=color, linewidth=2)

ax.set_xlabel('RL Update Step', fontsize=12)
ax.set_ylabel('Policy Probability', fontsize=12)
ax.set_title('How RL Steers a Model Toward Better Outputs', fontsize=14, fontweight='bold')
ax.legend(fontsize=9, loc='center left', bbox_to_anchor=(1, 0.5))
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("The policy tends to shift probability toward 'Helpful & Detailed'")
print("and 'Concise & Clear' — the responses humans prefer.")
print("This is the core idea behind RLHF!")

---

## Exercises

### Exercise 1: Custom Gridworld (The Monaco Grand Prix)

Create a 5x5 gridworld that represents a simplified Monaco street circuit — multiple goals (podium positions) and traps (barriers/DNF zones). Run value iteration and visualize the optimal policy. Experiment with different discount factors (gamma = 0.5, 0.9, 0.99) and observe how the policy changes. How does a short-sighted agent (low gamma) differ from a far-sighted one (high gamma) — does the far-sighted agent take longer detours to avoid traps?

In [None]:
# Exercise 1: Your code here
# Hint: Modify the GridWorld class to accept a custom grid layout
# Then run value_iteration with different gamma values and compare


### Exercise 2: Monte Carlo vs. TD(0) — Post-Race Review vs. Lap-by-Lap Learning

Implement first-visit Monte Carlo prediction alongside TD(0) for the same gridworld. Compare their convergence rates and final value estimates. Monte Carlo is like doing a full post-race debrief — you wait until the checkered flag, then review the whole race. TD(0) is like the strategist updating their model after every single lap. Which converges faster? Which has lower variance?

In [None]:
# Exercise 2: Your code here
# Hint: Monte Carlo waits until the end of an episode to update V(s)
# using the actual return G_t, while TD(0) updates after each step


### Exercise 3: UCB Exploration — The Untested Tire Compound

Implement Upper Confidence Bound (UCB) exploration for the multi-armed bandit problem and compare it against epsilon-greedy. In F1 terms, UCB adds a "curiosity bonus" for strategies that haven't been tried much — like giving extra optimism to a tire compound you've never raced with at this circuit. The less data you have, the bigger the bonus.

UCB selects:

$$a_t = \arg\max_a \left[Q(a) + c\sqrt{\frac{\ln t}{N(a)}}\right]$$

where $c$ controls exploration strength and $N(a)$ is the number of times action $a$ has been selected.

In [None]:
# Exercise 3: Your code here
# Hint: With N(a) = 0 the UCB bonus is undefined (division by zero);
# treat it as infinite so that every action is tried at least once


---

## Summary

### Key Concepts

| Concept | What It Means | F1 Parallel |
|---------|--------------|-------------|
| **Reinforcement Learning** | Agent learns to maximize cumulative reward through interaction | Driver/strategist learns optimal decisions across races |
| **MDPs** | Formalize RL with states, actions, transitions, rewards, discount | Race as a Markov process: position, tires, gaps, weather |
| **Bellman equation** | Value = immediate reward + discounted future value | Today's position value = this lap's gain + remaining race value |
| **Policy evaluation** | Compute how good a policy is | Assess: "How many points does a one-stop strategy average?" |
| **Policy improvement** | Greedily make the policy better | Switch to two-stop when evaluation shows it scores higher |
| **Value iteration** | Combine evaluation and improvement in one step | Lap-by-lap strategy optimization |
| **Exploration vs. exploitation** | Try new things or stick with what works | New undercut strategy vs. proven conservative approach |
| **TD learning** | Learn from experience without a model | Update strategy beliefs after each lap, not just post-race |
| **Value-based vs. policy-based** | Learn Q-values vs. learn policy directly | Lookup table vs. driver instinct |

### Fundamental Insight

The Bellman equation transforms the RL problem from "predict the infinite future" into "look one step ahead and use your current estimate." This simple recursive trick — combined with sufficient exploration — is powerful enough to learn optimal behavior in complex environments. In F1 terms, the Bellman equation says you don't need to simulate the entire remaining race — just evaluate the next lap and trust your value estimates for the rest.
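
The "one step ahead" structure follows directly from how the discounted return unrolls. In the notation used for the TD update, the return from time $t$ satisfies:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = r_t + \gamma G_{t+1}$$

Taking the expectation under a policy $\pi$ turns this identity into the Bellman equation for the value function:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid s_t = s\right] = \mathbb{E}_\pi\left[r_t + \gamma V^\pi(s_{t+1}) \mid s_t = s\right]$$

The infinite sum on the left collapses into one reward plus one value lookup on the right, which is the recursion that every method in this notebook exploits.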

---

## Next Steps

Now that we understand the RL framework, value functions, and the Bellman equation, we're ready to build **practical RL agents**. In **Notebook 18: Q-Learning & Deep Q-Networks**, we'll:

- Implement tabular Q-learning (model-free control) — learning optimal pit stop timing from experience alone
- Scale to function approximation with neural networks (DQN) — a deep network learning strategy from thousands of simulated races
- Learn key techniques: experience replay and target networks
- Train a DQN agent to solve a control task from raw observations

The journey from Bellman equations to DQN is one of the most elegant progressions in all of machine learning — and it mirrors how F1 strategy has evolved from gut instinct to simulation-driven decision-making.