# Temporal Difference Learning: The Best of Both Worlds

Welcome to TD learning: it combines the best of Monte Carlo (learning from sampled experience) and Dynamic Programming (bootstrapping from current estimates)!

## What You'll Learn

By the end of this notebook, you'll understand:
- Why waiting until the end of an episode is inefficient (with a weather analogy!)
- The TD update rule: bootstrapping from estimates
- TD(0) prediction: learning every step
- TD error: the surprise signal
- Comparing TD vs Monte Carlo
- Why TD is the foundation for Q-learning!

**Prerequisites:** Notebook 1 (Monte Carlo Methods)

**Time:** ~35 minutes

---
## The Big Picture: The Weather Forecast Analogy

```
    ┌────────────────────────────────────────────────────────────────┐
    │          TD LEARNING: THE WEATHER FORECAST ANALOGY             │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  You want to predict tomorrow's temperature.                   │
    │                                                                │
    │  MONTE CARLO APPROACH:                                        │
    │    Wait until tomorrow ends                                   │
    │    → "Tomorrow was 75°F"                                      │
    │    → Update your model                                        │
    │    PROBLEM: You can only update ONCE per day!                 │
    │                                                                │
    │  TD APPROACH:                                                 │
    │    At noon, check the current temperature (72°F)              │
    │    → "If it's 72°F now, tomorrow will probably be ~74°F"     │
    │    → Update your model based on this ESTIMATE                 │
    │    ADVANTAGE: You can update multiple times per day!          │
    │                                                                │
    │  KEY INSIGHT:                                                 │
    │    TD uses ESTIMATES to update ESTIMATES                      │
    │    This is called BOOTSTRAPPING                               │
    │                                                                │
    │    "I don't know the final outcome, but I can guess          │
    │     based on what I know so far!"                            │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Rectangle, Circle, FancyArrowPatch
from matplotlib.colors import LinearSegmentedColormap
from collections import defaultdict

# Visualize MC vs TD
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Monte Carlo
ax1 = axes[0]
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 10)
ax1.axis('off')
ax1.set_title('Monte Carlo\n"Wait until the end"', fontsize=14, fontweight='bold', color='#d32f2f')

# Episode timeline
for i, (label, color) in enumerate([('s₀', '#bbdefb'), ('s₁', '#bbdefb'), ('s₂', '#bbdefb'), 
                                     ('s₃', '#bbdefb'), ('END', '#c8e6c9')]):
    x = 1 + i * 1.8
    box = FancyBboxPatch((x - 0.4, 6), 0.8, 0.8, boxstyle="round,pad=0.05",
                          facecolor=color, edgecolor='black', linewidth=2)
    ax1.add_patch(box)
    ax1.text(x, 6.4, label, ha='center', va='center', fontsize=11, fontweight='bold')
    if i < 4:
        ax1.annotate('', xy=(x + 0.6, 6.4), xytext=(x + 0.4, 6.4),
                    arrowprops=dict(arrowstyle='->', lw=2, color='#666'))

# Arrow showing update only at end
ax1.annotate('', xy=(5, 4.5), xytext=(8.6, 5.9),
            arrowprops=dict(arrowstyle='->', lw=3, color='#d32f2f',
                           connectionstyle='arc3,rad=0.2'))
ax1.text(5, 4, 'Update V(s₀), V(s₁),\nV(s₂), V(s₃) at END', 
         ha='center', fontsize=11, color='#d32f2f')

ax1.text(5, 2, '❌ Must wait for episode to finish\n❌ High variance (noisy returns)\n❌ Can\'t use for continuing tasks',
         ha='center', fontsize=10, color='#666')

# Right: TD
ax2 = axes[1]
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)
ax2.axis('off')
ax2.set_title('Temporal Difference\n"Learn every step"', fontsize=14, fontweight='bold', color='#388e3c')

# Episode timeline
for i, (label, color) in enumerate([('s₀', '#bbdefb'), ('s₁', '#bbdefb'), ('s₂', '#bbdefb'), 
                                     ('s₃', '#bbdefb'), ('END', '#c8e6c9')]):
    x = 1 + i * 1.8
    box = FancyBboxPatch((x - 0.4, 6), 0.8, 0.8, boxstyle="round,pad=0.05",
                          facecolor=color, edgecolor='black', linewidth=2)
    ax2.add_patch(box)
    ax2.text(x, 6.4, label, ha='center', va='center', fontsize=11, fontweight='bold')
    if i < 4:
        ax2.annotate('', xy=(x + 0.6, 6.4), xytext=(x + 0.4, 6.4),
                    arrowprops=dict(arrowstyle='->', lw=2, color='#666'))

# Arrows showing updates at each step
for i in range(4):
    x = 1 + i * 1.8
    ax2.annotate('', xy=(x, 5.3), xytext=(x, 5.9),
                arrowprops=dict(arrowstyle='->', lw=2, color='#388e3c'))
    ax2.text(x, 4.8, 'Update\nnow!', ha='center', fontsize=8, color='#388e3c')

ax2.text(5, 2, '✓ Update after EVERY step\n✓ Lower variance\n✓ Works for continuing tasks',
         ha='center', fontsize=10, color='#388e3c')

plt.tight_layout()
plt.show()

print("\n" + "="*70)
print("THE KEY DIFFERENCE")
print("="*70)
print("""
MONTE CARLO: Uses actual complete return G
    V(s) ← V(s) + α × (G - V(s))
    ↑ Must wait for G (episode must end)

TEMPORAL DIFFERENCE: Uses estimated return (r + γV(s'))
    V(s) ← V(s) + α × (r + γV(s') - V(s))
    ↑ Can update immediately (bootstrapping!)
""")
print("="*70)

---
## The TD Update Rule

The heart of TD learning is this update rule:

```
    ┌────────────────────────────────────────────────────────────────┐
    │                   THE TD(0) UPDATE RULE                        │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  V(s) ← V(s) + α × ( r + γ×V(s') - V(s) )                     │
    │                     └─────┬─────┘  └──┬──┘                     │
    │                      TD target    current estimate             │
    │                                                                │
    │  Where:                                                       │
    │    α = learning rate (how much to update)                     │
    │    r = immediate reward                                       │
    │    γ = discount factor                                        │
    │    V(s') = estimated value of NEXT state (bootstrapping!)     │
    │                                                                │
    │  The TD ERROR (δ):                                            │
    │                                                                │
    │    δ = r + γ×V(s') - V(s)                                     │
    │        └────┬────┘   └──┬──┘                                  │
    │         better      what we                                   │
    │        estimate    predicted                                  │
    │                                                                │
    │    δ > 0: "Things are better than expected!" (surprise!)     │
    │    δ < 0: "Things are worse than expected" (disappointment)  │
    │    δ = 0: "Exactly as expected" (no update needed)           │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

**Analogy:** The TD error is like a "surprise" signal. Dopamine neurons in the brain produce a similar signal!
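
To make the rule concrete, here is a minimal sketch of a single TD(0) update written as a plain Python function. The name `td_update` and its arguments are illustrative choices (not part of any library), and the input numbers are arbitrary.

```python
def td_update(v_s, reward, v_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to a single state value.

    Returns the updated value and the TD error (illustrative helper).
    """
    td_target = reward + gamma * v_next   # one-step estimate of the return
    td_error = td_target - v_s            # delta: target minus prediction
    new_v = v_s + alpha * td_error        # move a fraction alpha toward the target
    return new_v, td_error


# Positive surprise: the target (2 + 0.9 * 8 = 9.2) exceeds the prediction 5.0
new_v, delta = td_update(v_s=5.0, reward=2.0, v_next=8.0)
print(f"delta = {delta:+.2f}, V(s): 5.00 -> {new_v:.2f}")

# Negative surprise: the target (-1 + 0.9 * 2 = 0.8) is below the prediction 5.0
new_v, delta = td_update(v_s=5.0, reward=-1.0, v_next=2.0)
print(f"delta = {delta:+.2f}, V(s): 5.00 -> {new_v:.2f}")
```

Note that the value moves only a fraction α of the way toward the target, so a single surprising transition cannot overwrite everything learned so far.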

In [None]:
# Visualize the TD update components

fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')

ax.text(7, 9.5, 'The TD(0) Update: Breaking It Down', ha='center', fontsize=16, fontweight='bold')

# Current state
s_box = FancyBboxPatch((1, 5.5), 2, 2, boxstyle="round,pad=0.1",
                        facecolor='#bbdefb', edgecolor='#1976d2', linewidth=3)
ax.add_patch(s_box)
ax.text(2, 6.8, 'State s', ha='center', fontsize=12, fontweight='bold', color='#1976d2')
ax.text(2, 6.0, 'V(s) = 5.0', ha='center', fontsize=11)

# Action and reward
ax.annotate('', xy=(5, 6.5), xytext=(3.2, 6.5),
            arrowprops=dict(arrowstyle='->', lw=3, color='#666'))
ax.text(4.1, 7.2, 'Take action a', ha='center', fontsize=10)
ax.text(4.1, 5.8, 'Get reward r = 2', ha='center', fontsize=10, color='#388e3c', fontweight='bold')

# Next state
sp_box = FancyBboxPatch((5.5, 5.5), 2, 2, boxstyle="round,pad=0.1",
                         facecolor='#c8e6c9', edgecolor='#388e3c', linewidth=3)
ax.add_patch(sp_box)
ax.text(6.5, 6.8, "State s'", ha='center', fontsize=12, fontweight='bold', color='#388e3c')
ax.text(6.5, 6.0, "V(s') = 8.0", ha='center', fontsize=11)

# TD Target calculation
calc_box = FancyBboxPatch((9, 5.5), 4.5, 2, boxstyle="round,pad=0.1",
                           facecolor='#fff3e0', edgecolor='#f57c00', linewidth=2)
ax.add_patch(calc_box)
ax.text(11.25, 7.0, 'TD Target (γ=0.9):', ha='center', fontsize=11, fontweight='bold')
ax.text(11.25, 6.2, 'r + γ×V(s\') = 2 + 0.9×8', ha='center', fontsize=10)
ax.text(11.25, 5.6, '= 9.2', ha='center', fontsize=12, fontweight='bold', color='#f57c00')

ax.annotate('', xy=(8.9, 6.5), xytext=(7.6, 6.5),
            arrowprops=dict(arrowstyle='->', lw=2, color='#f57c00'))

# TD Error
error_box = FancyBboxPatch((4, 2), 6, 2.5, boxstyle="round,pad=0.1",
                            facecolor='#e3f2fd', edgecolor='#1976d2', linewidth=2)
ax.add_patch(error_box)
ax.text(7, 4.0, 'TD Error δ = TD Target - V(s)', ha='center', fontsize=12, fontweight='bold')
ax.text(7, 3.2, 'δ = 9.2 - 5.0 = 4.2', ha='center', fontsize=11)
ax.text(7, 2.5, '"Things are BETTER than expected!"', ha='center', fontsize=10, 
        style='italic', color='#388e3c')

# Update
ax.text(7, 0.8, 'Update (α=0.1): V(s) ← 5.0 + 0.1 × 4.2 = 5.42', 
        ha='center', fontsize=12, fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='#c8e6c9', edgecolor='#388e3c'))

plt.tight_layout()
plt.show()

print("\nThe value increased because we got a better-than-expected outcome!")
print("TD error > 0 → Increase value (positive surprise)")
print("TD error < 0 → Decrease value (disappointment)")

---
## Setting Up the Environment

In [None]:
class GridWorld:
    """
    4x4 Grid World for TD learning.
    
    Same as Monte Carlo notebook for comparison.
    """
    
    def __init__(self):
        self.size = 4
        self.goal = (3, 3)
        self.action_names = ['UP', 'RIGHT', 'DOWN', 'LEFT']
        self.action_symbols = ['↑', '→', '↓', '←']
        self.reset()
    
    def reset(self):
        self.pos = (0, 0)
        return self.pos
    
    def step(self, action):
        row, col = self.pos
        
        if action == 0: row = max(0, row - 1)
        elif action == 1: col = min(3, col + 1)
        elif action == 2: row = min(3, row + 1)
        elif action == 3: col = max(0, col - 1)
        
        self.pos = (row, col)
        done = self.pos == self.goal
        reward = 10 if done else -1
        
        return self.pos, reward, done


def random_policy(state):
    return np.random.randint(0, 4)


env = GridWorld()
print("Grid World Environment Created!")
print(f"Goal: {env.goal}")
print("Rewards: -1 per step, +10 at goal")

---
## TD(0) Prediction: The Algorithm

```
    ┌────────────────────────────────────────────────────────────────┐
    │                   TD(0) PREDICTION ALGORITHM                   │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  Initialize V(s) = 0 for all states                           │
    │                                                                │
    │  For each episode:                                            │
    │      s ← initial state                                        │
    │                                                                │
    │      For each step in episode:                                │
    │          a ← action from policy                               │
    │          r, s' ← take action a                                │
    │                                                                │
    │          # THE KEY UPDATE (happens EVERY step!)               │
    │          V(s) ← V(s) + α × (r + γ×V(s') - V(s))               │
    │                                                                │
    │          s ← s'                                               │
    │                                                                │
    │  Repeat until convergence                                     │
    │                                                                │
    │  KEY DIFFERENCE FROM MC:                                      │
    │    • MC: Update after ENTIRE episode                          │
    │    • TD: Update after EACH step                               │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
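
Before running the algorithm on the grid, it can help to try it on a problem small enough to check by hand. The sketch below assumes a hypothetical three-state corridor (not the notebook's grid world) where the policy always moves right, each step costs -1, and reaching the goal pays +10. It also spells out one detail the box leaves implicit: a terminal state has value 0, so the target is just `r` on the final step.

```python
from collections import defaultdict


def td0_corridor(n_states=3, n_episodes=500, alpha=0.1, gamma=0.9):
    """TD(0) on a toy corridor: states 0..n_states-1, goal at index n_states.

    The policy always moves right; each step costs -1, reaching the goal pays +10.
    (Hypothetical environment chosen so the true values are easy to compute.)
    """
    V = defaultdict(float)
    goal = n_states
    for _ in range(n_episodes):
        s = 0
        while s != goal:
            s_next = s + 1
            done = s_next == goal
            r = 10.0 if done else -1.0
            # Terminal states have value 0, so the target is just r when done
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
            s = s_next
    return dict(V)


V = td0_corridor()
for s in sorted(V):
    print(f"V({s}) = {V[s]:.3f}")
```

Because this toy policy is deterministic, the estimates should settle at the values given by the Bellman equation: V(2) = 10, V(1) = -1 + 0.9 × 10 = 8, and V(0) = -1 + 0.9 × 8 = 6.2.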

In [None]:
def td_0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9, verbose=False):
    """
    TD(0) Prediction: Estimate V(s) for a given policy.
    
    Update rule: V(s) ← V(s) + α × (r + γ×V(s') - V(s))
    
    Args:
        env: The environment
        policy: Function mapping state -> action
        n_episodes: Number of episodes
        alpha: Learning rate
        gamma: Discount factor
        verbose: Print progress
    
    Returns:
        V: Dictionary mapping state -> value
        history: V[(0,0)] at each episode
        td_errors: List of TD errors for analysis
    """
    V = defaultdict(float)
    history = []
    td_errors = []
    
    for episode in range(n_episodes):
        state = env.reset()
        episode_errors = []
        
        for step in range(100):  # Max steps per episode
            # Take action from policy
            action = policy(state)
            next_state, reward, done = env.step(action)
            
            # ========================================
            # THE TD(0) UPDATE - happens EVERY step!
            # ========================================
            
            # TD Target: r + γV(s'); a terminal s' has value 0 by definition
            td_target = reward + (0.0 if done else gamma * V[next_state])
            
            # TD Error: δ = target - prediction
            td_error = td_target - V[state]
            episode_errors.append(abs(td_error))
            
            # Update value
            V[state] = V[state] + alpha * td_error
            
            # Move to next state
            state = next_state
            
            if done:
                break
        
        history.append(V[(0, 0)])
        td_errors.append(np.mean(episode_errors))
        
        if verbose and (episode + 1) % 500 == 0:
            print(f"Episode {episode+1:5d} | V(start) = {V[(0,0)]:.2f} | Avg TD Error: {td_errors[-1]:.3f}")
    
    return dict(V), history, td_errors


# Run TD(0) prediction
print("TD(0) PREDICTION")
print("="*60)
print("\nEvaluating the random policy with TD(0)...\n")

V_td, history_td, errors_td = td_0_prediction(
    env, random_policy, n_episodes=5000, alpha=0.1, verbose=True
)

print("\n" + "="*60)
print("Estimated Value Function V(s):")
print("-"*40)
for row in range(4):
    values = [V_td.get((row, col), 0.0) for col in range(4)]
    print(" ".join([f"{v:8.2f}" for v in values]))
print("-"*40)

In [None]:
# Visualize TD learning

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Left: Learning curve
ax1 = axes[0]
ax1.plot(history_td, alpha=0.4, color='#2196f3', label='Raw')

window = 100
smoothed = np.convolve(history_td, np.ones(window)/window, mode='valid')
ax1.plot(range(window-1, len(history_td)), smoothed, color='#2196f3', 
         linewidth=2, label='Smoothed')

ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('V(start state)', fontsize=12)
ax1.set_title('TD(0) Learning Curve', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Middle: TD Errors over time
ax2 = axes[1]
ax2.plot(errors_td, alpha=0.4, color='#f57c00')
smoothed_errors = np.convolve(errors_td, np.ones(window)/window, mode='valid')
ax2.plot(range(window-1, len(errors_td)), smoothed_errors, color='#f57c00', 
         linewidth=2, label='Smoothed')

ax2.set_xlabel('Episode', fontsize=12)
ax2.set_ylabel('Average |TD Error|', fontsize=12)
ax2.set_title('Average |TD Error| per Episode\n(Shrinks as V improves)', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Right: Value heatmap
ax3 = axes[2]
V_grid = np.array([[V_td.get((r, c), 0.0) for c in range(4)] for r in range(4)])

im = ax3.imshow(V_grid, cmap='RdYlGn')
for i in range(4):
    for j in range(4):
        value = V_grid[i, j]
        color = 'white' if value < np.mean(V_grid) else 'black'
        if (i, j) == env.goal:
            ax3.text(j, i, 'GOAL\n0', ha='center', va='center', 
                    fontsize=10, fontweight='bold', color='white')
        else:
            ax3.text(j, i, f'{value:.1f}', ha='center', va='center', 
                    fontsize=11, fontweight='bold', color=color)

ax3.set_title('Estimated V(s)', fontsize=14, fontweight='bold')
ax3.set_xticks(range(4))
ax3.set_yticks(range(4))
plt.colorbar(im, ax=ax3)

plt.tight_layout()
plt.show()

print("\nThe average |TD error| shrinks as the value function converges.")
print("It levels off above 0 because the policy is random: individual")
print("transitions still differ from their expected value.")

---
## Comparing TD vs Monte Carlo

Let's see which learns faster!

In [None]:
# Monte Carlo for comparison

def generate_episode(env, policy, max_steps=100):
    """Generate a complete episode."""
    episode = []
    state = env.reset()
    
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    
    return episode


def mc_prediction(env, policy, n_episodes=1000, gamma=0.9):
    """
    First-visit Monte Carlo prediction.
    """
    V = defaultdict(float)
    returns = defaultdict(list)
    history = []
    
    for ep in range(n_episodes):
        episode = generate_episode(env, policy)
        
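        # Record the first time step at which each state appears
        first_visit = {}
        for t, (state, _, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        
        # Walk backward so G accumulates the discounted return from step t
        G = 0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            
            # Only the return following the FIRST visit is recorded
            if first_visit[state] == t:
                returns[state].append(G)
                V[state] = np.mean(returns[state])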
        
        history.append(V[(0, 0)])
    
    return dict(V), history


# Compare learning curves
print("COMPARING TD(0) vs MONTE CARLO")
print("="*60)

n_episodes = 3000

# Run both algorithms
print("Running TD(0)...")
V_td, history_td, _ = td_0_prediction(env, random_policy, n_episodes)

print("Running Monte Carlo...")
V_mc, history_mc = mc_prediction(env, random_policy, n_episodes)

print("Done!\n")

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Learning curves
ax1 = axes[0]

# Raw
ax1.plot(history_td, alpha=0.2, color='#2196f3')
ax1.plot(history_mc, alpha=0.2, color='#f44336')

# Smoothed
window = 100
td_smooth = np.convolve(history_td, np.ones(window)/window, mode='valid')
mc_smooth = np.convolve(history_mc, np.ones(window)/window, mode='valid')

ax1.plot(range(window-1, len(history_td)), td_smooth, color='#2196f3', 
         linewidth=3, label='TD(0)')
ax1.plot(range(window-1, len(history_mc)), mc_smooth, color='#f44336', 
         linewidth=3, label='Monte Carlo')

ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('V(start state)', fontsize=12)
ax1.set_title('Learning Speed: TD(0) vs Monte Carlo', fontsize=14, fontweight='bold')
ax1.legend(fontsize=12)
ax1.grid(True, alpha=0.3)

# Right: Variance comparison
ax2 = axes[1]

# Calculate rolling variance
var_window = 100
td_vars = [np.var(history_td[max(0,i-var_window):i+1]) for i in range(len(history_td))]
mc_vars = [np.var(history_mc[max(0,i-var_window):i+1]) for i in range(len(history_mc))]

ax2.plot(td_vars, alpha=0.6, color='#2196f3', linewidth=2, label='TD(0) variance')
ax2.plot(mc_vars, alpha=0.6, color='#f44336', linewidth=2, label='MC variance')

ax2.set_xlabel('Episode', fontsize=12)
ax2.set_ylabel('Variance (rolling window)', fontsize=12)
ax2.set_title('Variance: TD(0) vs Monte Carlo\n(Lower is better)', fontsize=14, fontweight='bold')
ax2.legend(fontsize=12)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

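print("\nOBSERVATIONS (typical; your plot depends on the random seed):")
print("-"*60)
print("1. TD(0) often converges faster (bootstrapping helps!)")
print("2. TD targets are less noisy than full returns (one reward + an estimate)")
print("3. MC targets are unbiased but high variance")
print("4. TD targets are biased while V is inaccurate, but lower variance")
print("5. The rolling variance above also reflects step size: MC averages all")
print("   returns (shrinking step), while TD uses a constant alpha")
print("-"*60)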

---
## Why TD Works: Bootstrapping Explained

```
    ┌────────────────────────────────────────────────────────────────┐
    │              WHY BOOTSTRAPPING WORKS                           │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  The BELLMAN EQUATION says:                                   │
    │    V(s) = E[r + γV(s')]                                       │
    │                                                                │
    │  TD uses this relationship!                                   │
    │    TD Target = r + γV(s') ← an ESTIMATE of V(s)              │
    │                                                                │
    │  Even though V(s') is itself an estimate, it works because:   │
    │                                                                │
    │  1. Nearby states have similar values                         │
    │     → V(s') is often a good estimate                          │
    │                                                                │
    │  2. Errors cancel out over many updates                       │
    │     → Overestimates and underestimates average out            │
    │                                                                │
    │  3. Information propagates backward                           │
    │     → Terminal state values → nearby states → all states     │
    │                                                                │
    │  ANALOGY: Grading by Comparison                               │
    │    Instead of waiting for the "true" answer,                 │
    │    compare your answer to a classmate's                      │
    │    → You both improve over time!                             │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
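
The Bellman equation in the box can also be solved exactly for a small problem, which gives a ground truth to compare TD estimates against. The sketch below assumes the same dynamics as the `GridWorld` class defined earlier (moves into a wall leave the agent in place, -1 per step, +10 on entering the goal) and a uniform random policy. The function name `exact_values_random_policy` is made up for this example.

```python
import numpy as np


def exact_values_random_policy(size=4, goal=(3, 3), gamma=0.9):
    """Solve V = R + gamma * P V for the uniform-random policy on the grid.

    Assumes the dynamics described in this notebook: moves into a wall leave
    the agent in place, -1 per step, +10 on entering the goal, and the goal
    is terminal with value 0.
    """
    n = size * size

    def idx(r, c):
        return r * size + c

    P = np.zeros((n, n))   # transition probabilities between states
    R = np.zeros(n)        # expected immediate reward per state
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # UP, RIGHT, DOWN, LEFT

    for r in range(size):
        for c in range(size):
            if (r, c) == goal:
                continue  # terminal: row of zeros, so V(goal) = 0
            for dr, dc in moves:
                nr = min(size - 1, max(0, r + dr))
                nc = min(size - 1, max(0, c + dc))
                if (nr, nc) == goal:
                    R[idx(r, c)] += 0.25 * 10.0   # no bootstrap from terminal
                else:
                    R[idx(r, c)] += 0.25 * -1.0
                    P[idx(r, c), idx(nr, nc)] += 0.25

    # (I - gamma * P) V = R is the Bellman equation in matrix form
    V = np.linalg.solve(np.eye(n) - gamma * P, R)
    return V.reshape(size, size)


V_exact = exact_values_random_policy()
print(np.round(V_exact, 2))
```

TD(0) never builds the matrices `P` and `R`; it approaches the same fixed point using only sampled transitions, which is why it still works when the dynamics are unknown.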

In [None]:
# Visualize value propagation

def td_with_snapshots(env, policy, n_episodes=500, alpha=0.1, gamma=0.9):
    """TD(0) that saves value function snapshots."""
    V = defaultdict(float)
    snapshots = []
    
    snapshot_episodes = [0, 10, 50, 100, 200, 500]
    
    for episode in range(n_episodes + 1):
        if episode in snapshot_episodes:
            # Save snapshot
            V_grid = np.array([[V.get((r, c), 0.0) for c in range(4)] for r in range(4)])
            snapshots.append((episode, V_grid.copy()))
        
        if episode == n_episodes:
            break
            
        state = env.reset()
        for _ in range(100):
            action = policy(state)
            next_state, reward, done = env.step(action)
            
            td_target = reward + gamma * V[next_state]
            V[state] += alpha * (td_target - V[state])
            
            state = next_state
            if done:
                break
    
    return snapshots


# Get snapshots
snapshots = td_with_snapshots(env, random_policy)

# Plot snapshots
fig, axes = plt.subplots(2, 3, figsize=(14, 9))

for idx, (ax, (ep, V_grid)) in enumerate(zip(axes.flat, snapshots)):
    im = ax.imshow(V_grid, cmap='RdYlGn', vmin=-10, vmax=5)
    
    for i in range(4):
        for j in range(4):
            color = 'white' if V_grid[i, j] < 0 else 'black'
            ax.text(j, i, f'{V_grid[i, j]:.1f}', ha='center', va='center', 
                    fontsize=10, fontweight='bold', color=color)
    
    ax.set_title(f'Episode {ep}', fontsize=12, fontweight='bold')
    ax.set_xticks([])
    ax.set_yticks([])

plt.suptitle('TD(0): Values Propagate Backward from Goal\n(Watch how values spread over episodes!)', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nNotice how:")
print("• Goal-adjacent states are the first to reflect the +10 reward")
print("• Values propagate backward through the grid")
print("• Eventually all states have reasonable values")

---
## The TD Error as a Learning Signal

The TD error δ is remarkably similar to dopamine signals in the brain!

In [None]:
# Visualize TD error patterns

def run_episodes_with_errors(env, policy, n_episodes=5, alpha=0.1, gamma=0.9):
    """Run episodes and collect TD errors."""
    V = defaultdict(float)
    
    # Pre-train for reasonable values
    for _ in range(500):
        state = env.reset()
        for _ in range(100):
            action = policy(state)
            next_state, reward, done = env.step(action)
            td_target = reward + gamma * V[next_state]
            V[state] += alpha * (td_target - V[state])
            state = next_state
            if done:
                break
    
    # Collect errors from a few episodes
    episode_data = []
    
    for _ in range(n_episodes):
        errors = []
        rewards = []
        states = []
        
        state = env.reset()
        for _ in range(50):  # Limit for visualization
            action = policy(state)
            next_state, reward, done = env.step(action)
            
            td_target = reward + gamma * V[next_state]
            td_error = td_target - V[state]
            
            errors.append(td_error)
            rewards.append(reward)
            states.append(state)
            
            state = next_state
            if done:
                break
        
        episode_data.append({'errors': errors, 'rewards': rewards, 'states': states})
    
    return episode_data, V


episode_data, V_learned = run_episodes_with_errors(env, random_policy, n_episodes=3)

# Plot TD errors for episodes
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, (ax, data) in enumerate(zip(axes, episode_data)):
    errors = data['errors']
    rewards = data['rewards']
    
    x = range(len(errors))
    colors = ['#4caf50' if e > 0 else '#f44336' for e in errors]
    
    ax.bar(x, errors, color=colors, edgecolor='black', alpha=0.7)
    ax.axhline(y=0, color='black', linewidth=1)
    
    # Mark when we hit the goal
    if rewards[-1] == 10:
        ax.axvline(x=len(errors)-1, color='gold', linewidth=2, linestyle='--', label='Reached Goal!')
        ax.scatter([len(errors)-1], [errors[-1]], s=200, color='gold', zorder=5, marker='*')
    
    ax.set_xlabel('Step', fontsize=11)
    ax.set_ylabel('TD Error δ', fontsize=11)
    ax.set_title(f'Episode {idx+1}', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')
    if rewards[-1] == 10:
        ax.legend()

plt.suptitle('TD Errors: Green = Positive (better than expected), Red = Negative (worse than expected)',
             fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nTD Error Interpretation:")
print("-"*50)
print("• δ > 0 (green): \"This is BETTER than I expected!\"")
print("• δ < 0 (red): \"This is WORSE than I expected\"")
print("• δ ≈ 0: \"Exactly as expected\"")
print("\nNotice the big positive spike when reaching the goal!")
print("The brain's dopamine works similarly - reward prediction error!")

---
## TD vs MC: A Detailed Comparison

```
    ┌────────────────────────────────────────────────────────────────┐
    │              TD vs MONTE CARLO: DETAILED COMPARISON            │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  PROPERTY              │  TD(0)           │  MONTE CARLO       │
    │  ─────────────────────────────────────────────────────────────│
    │  When to update        │  Every step      │  Episode end      │
    │  What it uses          │  Estimates       │  Actual returns   │
    │  Bootstraps            │  Yes             │  No               │
    │  Bias                  │  Some (initial)  │  None             │
    │  Variance              │  Lower           │  Higher           │
    │  Continuing tasks      │  Yes             │  No               │
    │  Sample efficiency     │  Better          │  Worse            │
    │  Sensitivity to init V │  More sensitive  │  Less sensitive   │
    │                                                                │
    │  WHEN TO USE EACH:                                            │
    │                                                                │
    │  TD(0):                                                       │
    │    • Continuing tasks (no clear episode end)                  │
    │    • Need fast learning                                       │
    │    • Can tolerate some bias                                   │
    │                                                                │
    │  Monte Carlo:                                                 │
    │    • Episodic tasks with clear endings                        │
    │    • Need unbiased estimates                                  │
    │    • Short episodes (otherwise too slow)                      │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
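
The "Variance" row can be illustrated by sampling both kinds of target for the same state. This sketch assumes a hypothetical five-state random walk (not the notebook's grid world): episodes start in the middle, stepping off the right end pays +1, every other reward is 0, and γ = 1. For that chain the true state values are 1/6, 2/6, ..., 5/6. The TD target here bootstraps from those true values, so it isolates the variance difference; in real training V(s') is itself an estimate, which is where TD's bias comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five-state random walk A..E (indices 0..4); episodes start in the middle.
# Stepping off the right end pays +1, everything else pays 0, gamma = 1.
# For this chain the true values are known: 1/6, 2/6, ..., 5/6.
TRUE_V = np.arange(1, 6) / 6.0
START = 2


def sample_mc_return():
    """Play one full episode from START and return the total reward."""
    s = START
    while True:
        s += rng.choice([-1, 1])
        if s < 0:
            return 0.0
        if s > 4:
            return 1.0


def sample_td_target():
    """Take one step from START and bootstrap from the true next-state value."""
    s_next = START + rng.choice([-1, 1])
    return 0.0 + TRUE_V[s_next]   # reward is 0 for any step from the middle


mc = np.array([sample_mc_return() for _ in range(5000)])
td = np.array([sample_td_target() for _ in range(5000)])

print(f"MC return: mean={mc.mean():.3f}, variance={mc.var():.3f}")
print(f"TD target: mean={td.mean():.3f}, variance={td.var():.3f}")
```

Both targets average to roughly the true value of the middle state (0.5), but the Monte Carlo return is always 0 or 1 while the TD target is always 1/3 or 2/3, so the TD samples are much more tightly clustered.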

In [None]:
# Visualize the comparison table

fig, ax = plt.subplots(figsize=(12, 8))
ax.set_xlim(0, 12)
ax.set_ylim(0, 10)
ax.axis('off')

ax.text(6, 9.5, 'TD(0) vs Monte Carlo', ha='center', fontsize=18, fontweight='bold')

# Headers
ax.text(1.5, 8.5, 'Property', ha='center', fontsize=12, fontweight='bold')
ax.text(5, 8.5, 'TD(0)', ha='center', fontsize=12, fontweight='bold', color='#2196f3')
ax.text(9, 8.5, 'Monte Carlo', ha='center', fontsize=12, fontweight='bold', color='#f44336')

# Draw header line
ax.axhline(y=8.2, xmin=0.05, xmax=0.95, color='black', linewidth=2)

# Properties
properties = [
    ('Updates', 'Every step ✓', 'Episode end'),
    ('Uses', 'Estimates', 'Actual returns'),
    ('Bias', 'Some bias', 'Unbiased ✓'),
    ('Variance', 'Lower ✓', 'Higher'),
    ('Continuing tasks', 'Yes ✓', 'No'),
    ('Sample efficiency', 'Better ✓', 'Worse'),
]

for i, (prop, td_val, mc_val) in enumerate(properties):
    y = 7.5 - i * 1.0
    ax.text(1.5, y, prop, ha='center', fontsize=11)
    
    td_color = '#388e3c' if '✓' in td_val else 'black'
    mc_color = '#388e3c' if '✓' in mc_val else 'black'
    
    ax.text(5, y, td_val, ha='center', fontsize=11, color=td_color)
    ax.text(9, y, mc_val, ha='center', fontsize=11, color=mc_color)

# Summary boxes
td_box = FancyBboxPatch((3, 0.3), 3, 1.5, boxstyle="round,pad=0.1",
                         facecolor='#e3f2fd', edgecolor='#2196f3', linewidth=2)
ax.add_patch(td_box)
ax.text(4.5, 1.3, 'TD(0)', ha='center', fontsize=11, fontweight='bold', color='#2196f3')
ax.text(4.5, 0.7, 'Fast, flexible', ha='center', fontsize=10)

mc_box = FancyBboxPatch((7, 0.3), 3, 1.5, boxstyle="round,pad=0.1",
                         facecolor='#ffebee', edgecolor='#f44336', linewidth=2)
ax.add_patch(mc_box)
ax.text(8.5, 1.3, 'Monte Carlo', ha='center', fontsize=11, fontweight='bold', color='#f44336')
ax.text(8.5, 0.7, 'Simple, unbiased', ha='center', fontsize=10)

plt.tight_layout()
plt.show()

---
## TD Learning is the Foundation for Q-Learning!

```
    ┌────────────────────────────────────────────────────────────────┐
    │              TD → Q-LEARNING → DQN → ChatGPT                   │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  TD(0) for V(s):                                              │
    │    V(s) ← V(s) + α × (r + γV(s') - V(s))                      │
    │                                                                │
    │  Q-LEARNING for Q(s,a):                                       │
    │    Q(s,a) ← Q(s,a) + α × (r + γ×max Q(s',a') - Q(s,a))       │
    │                                                                │
    │  The key insight: Same update structure!                      │
    │    TD uses V(s') for next state value                         │
    │    Q-learning uses max Q(s',a') for next state value         │
    │                                                                │
    │  Evolution:                                                   │
    │    TD(0) → Q-Learning → DQN → PPO → RLHF → ChatGPT           │
    │                                                                │
    │  TD-style value estimation shows up at every stage!            │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
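
To make the shared structure concrete, here is a minimal sketch of both updates on small tables. The function names `td0_update` and `q_learning_update`, the `done` flag for terminal transitions, and the toy numbers are assumptions for illustration, not code from earlier in this notebook. Plain Python lists are used so the sketch has no dependencies.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one TD(0) update to the state-value table V in place; return the TD error."""
    # Bootstrapped target: reward plus discounted estimate of the next state.
    # At a terminal transition there is no next state, so the target is just r.
    target = r if done else r + gamma * V[s_next]
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one Q-learning update to the table Q in place; return the TD error."""
    # Same structure, but the next-state value is the best action-value in s_next.
    target = r if done else r + gamma * max(Q[s_next])
    td_error = target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Toy example (hypothetical numbers): 3 states, 2 actions.
V = [0.0, 1.0, 0.0]
delta_v = td0_update(V, s=0, r=0.5, s_next=1)

Q = [[0.0, 0.0], [0.2, 1.0], [0.0, 0.0]]
delta_q = q_learning_update(Q, s=0, a=1, r=0.5, s_next=1)

print(f"TD(0):      delta = {delta_v:.2f}, V[0]    = {V[0]:.2f}")
print(f"Q-learning: delta = {delta_q:.2f}, Q[0][1] = {Q[0][1]:.2f}")
```

Both calls give the same TD error here because the toy numbers were chosen so that the best action-value in the next state equals V(s'). The only difference between the two updates is how the next-state value is obtained.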

---
## Summary: Key Takeaways

### The TD Update

```
V(s) ← V(s) + α × (r + γV(s') - V(s))
                  └────────┬────────┘
                      TD error δ
```

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Bootstrapping** | Using estimates to update estimates |
| **TD Target** | r + γV(s'): the reward plus the discounted estimate of the next state's value |
| **TD Error** | δ = target - prediction (surprise signal) |
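
As a worked example of these three concepts, the sketch below computes the TD target and TD error for two transitions and shows one positive and one negative surprise. The state names, starting values, and the settings α = 0.1 and γ = 0.9 are made up for illustration.

```python
alpha, gamma = 0.1, 0.9

# Hypothetical current estimates for three states.
V = {"A": 0.5, "B": 1.0, "C": 0.0}

# Two observed transitions: (state, reward, next_state).
transitions = [("A", 0.0, "B"), ("B", 0.0, "C")]

log = []
for s, r, s_next in transitions:
    td_target = r + gamma * V[s_next]  # bootstrapped: uses the estimate V[s_next]
    td_error = td_target - V[s]        # surprise: target minus prediction
    V[s] += alpha * td_error           # move the prediction toward the target
    log.append((s, td_target, td_error, V[s]))
    print(f"{s}: target={td_target:.2f}, delta={td_error:+.2f}, new V={V[s]:.2f}")
```

From A, the next state B looks better than A's own estimate, so δ is positive and V(A) rises from 0.50 to 0.54. From B, the next state C is estimated at zero, so δ is negative and V(B) drops from 1.00 to 0.90.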

### TD vs Monte Carlo

| Property | TD | MC |
|----------|-------|--------|
| Updates | Every step | Episode end |
| Variance | Lower | Higher |
| Bias | Some | None |
| Continuing tasks | Yes | No |
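
The first three rows of this table can be seen on a single episode. The sketch below applies a constant-α Monte Carlo update and a TD(0) update to the same three-step episode. The helper names, the episode, and the settings α = 0.5 and γ = 0.9 are assumptions for illustration; in a real agent the TD update would run while the episode is still being generated.

```python
def mc_update(V, episode, alpha=0.5, gamma=0.9):
    """Constant-alpha Monte Carlo: can only run once the episode has ended."""
    G = 0.0
    # Walk backward so each state's return G is built from the rewards after it.
    for s, r, _ in reversed(episode):
        G = r + gamma * G
        V[s] += alpha * (G - V[s])


def td0_update_episode(V, episode, alpha=0.5, gamma=0.9):
    """TD(0): one update per step, bootstrapping from the next state's estimate."""
    for s, r, s_next in episode:
        next_value = 0.0 if s_next is None else V[s_next]  # terminal value is 0
        V[s] += alpha * (r + gamma * next_value - V[s])


# One hypothetical episode: 0 -> 1 -> 2 -> terminal, reward 1 only at the end.
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]

V_mc = [0.0, 0.0, 0.0]
V_td = [0.0, 0.0, 0.0]
mc_update(V_mc, episode)
td0_update_episode(V_td, episode)

print("MC after one episode:", [round(v, 3) for v in V_mc])
print("TD after one episode:", [round(v, 3) for v in V_td])
```

After one episode, MC has moved all three states toward their actual returns, while TD has only moved state 2: the targets for states 0 and 1 bootstrapped from next-state estimates that were still zero. That dependence on current estimates is the bias listed in the table. Replaying more episodes lets TD propagate the value backward one step at a time.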

### Why TD Matters

TD is the foundation for:
- Q-learning
- SARSA
- DQN
- Much of modern deep RL!

---
## Test Your Understanding

**1. What is bootstrapping in TD learning?**
<details>
<summary>Click to reveal answer</summary>
Bootstrapping means using estimates to update estimates. In TD, we use V(s') (an estimate) to update V(s) (another estimate). We don't wait for the true return - we use our current best guess of future values.
</details>

**2. What is the TD error and what does it represent?**
<details>
<summary>Click to reveal answer</summary>
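TD error δ = r + γV(s') - V(s). It represents the "surprise": how much the one-step target (the reward plus the discounted estimate of the next state) differs from our prediction V(s). If δ > 0, things went better than expected. If δ < 0, things went worse. Neuroscientists have found that dopamine neuron activity resembles this kind of reward prediction error!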
</details>

**3. Why can TD learn from continuing (non-episodic) tasks but MC cannot?**
<details>
<summary>Click to reveal answer</summary>
MC needs complete returns (the sum of all discounted rewards until the episode ends). In continuing tasks there is no episode end, so the complete return can never be computed. TD only needs one step of experience (r + γV(s')), so it can update at every step whether or not the task ever terminates.
</details>

**4. What's the trade-off between bias and variance in TD vs MC?**
<details>
<summary>Click to reveal answer</summary>
MC has zero bias (it uses actual returns) but high variance (a return depends on every random reward and transition until the episode ends). TD has some bias (its target uses the estimated V(s')) but lower variance (its target depends on only one random reward and one transition). TD's bias decreases as the value estimates improve.
</details>

**5. How does TD relate to Q-learning?**
<details>
<summary>Click to reveal answer</summary>
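Q-learning applies the TD update to action-values Q(s,a) instead of state-values V(s), and takes the maximum over next actions in the target:
- TD(0): V(s) ← V(s) + α(r + γV(s') - V(s))
- Q-learning: Q(s,a) ← Q(s,a) + α(r + γ × max over a' of Q(s',a') - Q(s,a))

Same TD update structure, just for Q instead of V! The max is what lets Q-learning learn optimal action-values rather than the values of the policy being followed.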
</details>

---
## What's Next?

Excellent work! You now understand TD learning - the foundation of modern RL!

In the next notebook, we'll learn **Q-Learning** - applying TD to learn optimal action-values directly!

**Continue to:** [Notebook 3: Q-Learning](03_q_learning.ipynb)

---

*TD learning: "Don't wait for the end - learn from every step!"*