# Q-Learning: Learning the Best Actions

Welcome to Q-Learning, one of the most famous and influential RL algorithms! It is the foundation for Deep Q-Networks (DQN), which famously learned to play Atari games.

## What You'll Learn

By the end of this notebook, you'll understand:
- What Q-values are and why they're useful (with a restaurant analogy!)
- The Q-Learning update rule, step by step
- Off-policy learning (what makes Q-learning special)
- How to implement Q-Learning from scratch
- How to visualize what the agent learns as it trains

**Prerequisites:** Notebook 2 (TD Learning)

**Time:** ~35 minutes

---
## The Big Picture: The Restaurant Guide Analogy

Imagine you're new to a city and want to find the best restaurants:

```
    ┌────────────────────────────────────────────────────────┐
    │         Q-LEARNING: THE RESTAURANT GUIDE               │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  You have a NOTEBOOK (Q-table) with ratings:           │
    │                                                         │
    │  ┌─────────────────────────────────────────────┐       │
    │  │ Location       │ Pizza │ Sushi │ Tacos │   │       │
    │  ├─────────────────────────────────────────────┤       │
    │  │ Downtown       │  8.5  │  7.2  │  9.1  │   │       │
    │  │ Uptown         │  6.3  │  9.0  │  7.8  │   │       │
    │  │ Suburb         │  7.1  │  5.5  │  8.2  │   │       │
    │  └─────────────────────────────────────────────┘       │
    │                                                         │
    │  Q(state, action) = "Expected enjoyment if I eat       │
    │                      this food in this location"       │
    │                                                         │
    │  HOW YOU LEARN:                                        │
    │    1. Try a restaurant                                 │
    │    2. Rate your experience                             │
    │    3. Update your notebook                             │
    │    4. Repeat!                                          │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```

This is exactly what Q-learning does:
- **Q-table:** A lookup table of (state, action) → expected value
- **Learning:** Update estimates based on actual experiences
- **Goal:** Find the best action for every state!
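
As a rough sketch of the notebook idea, the ratings table above can be stored as a dictionary that maps each location (state) to an array of ratings, one per food (action). The names `q_table`, `foods`, and `best_food` are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

# Actions: the column order of the ratings table above
foods = ["Pizza", "Sushi", "Tacos"]

# Q-table: state -> array of Q-values, one entry per action
q_table = {
    "Downtown": np.array([8.5, 7.2, 9.1]),
    "Uptown":   np.array([6.3, 9.0, 7.8]),
    "Suburb":   np.array([7.1, 5.5, 8.2]),
}

def best_food(location):
    """Return the highest-rated food (the greedy action) for a location."""
    return foods[int(np.argmax(q_table[location]))]

for location in q_table:
    print(f"{location}: eat {best_food(location)}")
```

Picking the best entry in a row is all it takes to turn a Q-table into a policy, which is why the rest of the notebook focuses on learning good table entries.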

---
## What is a Q-Value?

**Q(s, a)** = "How good is it to take action **a** in state **s**?"

More precisely:
> **Q(s, a) = Expected total discounted future reward if I take action a in state s, and then act optimally forever after.**
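
The same definition can be written in symbols. This is a sketch in standard notation that the notebook does not otherwise introduce: $Q^*$ is the optimal action-value function, $r_{t+1}$ is the reward received after acting at time $t$, $\gamma$ is the discount factor, and $s'$ is the next state.

$$
\begin{aligned}
Q^*(s, a) &= \mathbb{E}\left[\, r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots \;\middle|\; s_t = s,\ a_t = a,\ \text{optimal actions afterwards} \right] \\
          &= \mathbb{E}\left[\, r_{t+1} + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s_t = s,\ a_t = a \right]
\end{aligned}
$$

The second line is the Bellman optimality equation. The Q-learning update moves Q(s, a) toward a sampled version of its right-hand side.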

```
    ┌────────────────────────────────────────────────────────┐
    │                   Q-VALUE INTUITION                     │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  State: You're at position (1, 1) in a grid            │
    │                                                         │
    │           Q(state, UP)    = 5.2                        │
    │           Q(state, RIGHT) = 8.7  ← BEST!               │
    │           Q(state, DOWN)  = 3.1                        │
    │           Q(state, LEFT)  = 4.5                        │
    │                                                         │
    │  Interpretation:                                       │
    │    "If I go RIGHT, I expect to eventually get 8.7      │
    │     total reward. That's the best option!"             │
    │                                                         │
    │  Optimal Policy:                                       │
    │    π*(s) = argmax_a Q(s, a) = RIGHT                    │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Rectangle
from collections import defaultdict
from IPython.display import clear_output
import time

# Visualize Q-values for a single state
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Central state
center = (5, 5)
state_circle = plt.Circle(center, 1.5, facecolor='#bbdefb', edgecolor='#1976d2', linewidth=3)
ax.add_patch(state_circle)
ax.text(5, 5, 'State\n(1,1)', ha='center', va='center', fontsize=14, fontweight='bold')

# Q-values for each action
actions = [
    {'name': 'UP', 'pos': (5, 8), 'q': 5.2, 'color': '#4caf50'},
    {'name': 'RIGHT', 'pos': (8, 5), 'q': 8.7, 'color': '#ff9800'},  # Best!
    {'name': 'DOWN', 'pos': (5, 2), 'q': 3.1, 'color': '#f44336'},
    {'name': 'LEFT', 'pos': (2, 5), 'q': 4.5, 'color': '#9c27b0'},
]

# Highlight whichever action has the highest Q-value (no hard-coded number)
best_q = max(act['q'] for act in actions)

for act in actions:
    x, y = act['pos']
    is_best = act['q'] == best_q
    # Draw action box, highlighted if this is the best action
    color = '#ff9800' if is_best else '#e0e0e0'
    box = FancyBboxPatch((x-0.8, y-0.5), 1.6, 1, boxstyle="round,pad=0.1",
                          facecolor=color, edgecolor='black', linewidth=2)
    ax.add_patch(box)
    ax.text(x, y, f'{act["name"]}\nQ={act["q"]}', ha='center', va='center', 
            fontsize=11, fontweight='bold' if is_best else 'normal')
    
    # Draw arrow from state to action
    dx, dy = (x - 5) * 0.4, (y - 5) * 0.4
    ax.annotate('', xy=(5 + dx * 2.5, 5 + dy * 2.5), xytext=(5 + dx, 5 + dy),
               arrowprops=dict(arrowstyle='->', lw=2, color=act['color']))

# Highlight best action
ax.text(8, 6.5, '← BEST ACTION!', fontsize=12, color='#ff9800', fontweight='bold')

ax.set_title('Q-Values: Expected Reward for Each Action', fontsize=16, fontweight='bold', pad=20)
ax.text(5, 0.5, 'Policy: π*(s) = argmax Q(s, a) = RIGHT', ha='center', fontsize=12, 
        style='italic', color='#666')

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("KEY INSIGHT")
print("="*60)
print("\nQ-values tell us the expected TOTAL FUTURE reward.")
print("The best policy simply picks the action with highest Q-value!")
print("\n  π*(s) = argmax_a Q(s, a)")
print("\n" + "="*60)

---
## The Q-Learning Update Rule

Q-learning updates Q-values using this rule:

```
Q(s, a) ← Q(s, a) + α × [r + γ × max Q(s', a') - Q(s, a)]
           ───────       ─   ─   ────────────   ─────────
           old value     │   │   best future   current estimate
                         │   │   value
                         │   └── discount
                         └── immediate reward
```

Let's break this down:

```
    ┌────────────────────────────────────────────────────────┐
    │              THE Q-LEARNING UPDATE                      │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  TD Target = r + γ × max_a' Q(s', a')                  │
    │              ─   ─────────────────────                  │
    │              │   "best possible future"                 │
    │              └── "what I got now"                       │
    │                                                         │
    │  TD Error = TD Target - Q(s, a)                        │
    │           = "how wrong was my prediction?"             │
    │                                                         │
    │  New Q = Old Q + α × TD Error                          │
    │        = "nudge my estimate toward the truth"          │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```
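
The three lines in the box translate almost directly into code. The function below is a minimal sketch, not the implementation used later in this notebook; the name `q_learning_update` and the `done` flag (which zeroes the future value when the next state is terminal) are assumptions added for illustration.

```python
def q_learning_update(q_sa, reward, next_q_values, alpha, gamma, done=False):
    """Return the updated Q(s, a) after one observed transition."""
    # Assumption: a terminal next state has no future value to bootstrap from
    best_next = 0.0 if done else max(next_q_values)
    td_target = reward + gamma * best_next   # "what I got now" + "best possible future"
    td_error = td_target - q_sa              # "how wrong was my prediction?"
    return q_sa + alpha * td_error           # nudge the estimate toward the target

# Illustrative numbers: target = -1 + 0.5 * 4.0 = 1.0, error = 1.0 - 2.0 = -1.0
new_q = q_learning_update(q_sa=2.0, reward=-1, next_q_values=[1.0, 4.0, 0.0, 3.0],
                          alpha=0.5, gamma=0.5)
print(new_q)  # 1.5
```

Because the TD error is negative here, the old estimate of 2.0 was too optimistic and the update pulls it down halfway (α = 0.5) toward the target of 1.0.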

In [None]:
# Step-by-step demonstration of ONE Q-learning update

print("="*70)
print("STEP-BY-STEP Q-LEARNING UPDATE")
print("="*70)

# Set up the scenario
state = (1, 1)
action = 1  # RIGHT
next_state = (1, 2)
reward = -1  # Step cost

# Current Q-values (before update)
Q_current = 3.0
Q_next_state = [2.5, 5.0, 1.5, 2.0]  # Q-values for next state (UP, RIGHT, DOWN, LEFT)

# Hyperparameters
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor

print(f"\nScenario:")
print(f"  State:       {state}")
print(f"  Action:      RIGHT")
print(f"  Next State:  {next_state}")
print(f"  Reward:      {reward}")
print(f"\nHyperparameters:")
print(f"  α (learning rate): {alpha}")
print(f"  γ (discount):      {gamma}")

print("\n" + "-"*70)
print("THE UPDATE CALCULATION:")
print("-"*70)

# Step 1: Find max Q for next state
max_Q_next = max(Q_next_state)
print(f"\n1. Find BEST action in next state:")
print(f"   Q-values at {next_state}: {Q_next_state}")
print(f"   max Q(s', a') = {max_Q_next}")

# Step 2: Calculate TD Target
td_target = reward + gamma * max_Q_next
print(f"\n2. Calculate TD TARGET:")
print(f"   TD Target = r + γ × max Q(s', a')")
print(f"             = {reward} + {gamma} × {max_Q_next}")
print(f"             = {td_target}")

# Step 3: Calculate TD Error
td_error = td_target - Q_current
print(f"\n3. Calculate TD ERROR:")
print(f"   TD Error = TD Target - Q(s, a)")
print(f"            = {td_target} - {Q_current}")
print(f"            = {td_error}")
if td_error > 0:
    print(f"   Interpretation: We UNDERESTIMATED! Should increase Q.")
elif td_error < 0:
    print(f"   Interpretation: We OVERESTIMATED! Should decrease Q.")
else:
    print(f"   Interpretation: Prediction matched the target. No change needed.")

# Step 4: Update Q-value
Q_new = Q_current + alpha * td_error
print(f"\n4. UPDATE Q-value:")
print(f"   Q_new = Q_old + α × TD Error")
print(f"         = {Q_current} + {alpha} × {td_error}")
print(f"         = {Q_new}")

print("\n" + "="*70)
print(f"RESULT: Q({state}, RIGHT) updated from {Q_current} → {Q_new}")
print("="*70)

---
## Off-Policy Learning: The Secret Sauce

What makes Q-learning special is that it's **off-policy**:

```
    ┌────────────────────────────────────────────────────────┐
    │            OFF-POLICY vs ON-POLICY                      │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  ON-POLICY (e.g., SARSA):                              │
    │    "Learn about the policy I'm currently following"    │
    │    Update uses: Q(s', action_I_actually_took)          │
    │                                                         │
    │  OFF-POLICY (Q-Learning):                              │
    │    "Learn the OPTIMAL policy, regardless of what I do" │
    │    Update uses: max Q(s', a')  (best possible action)  │
    │                                                         │
    │  Why this matters:                                     │
    │    • I can EXPLORE (try random actions)                │
    │    • But still LEARN the optimal policy                │
    │    • Separates exploration from optimization!          │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```

**Analogy:** You're learning the best routes through a city. You can take random detours and explore (behavior policy), but your map still records the actual shortest paths (target policy).
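
The difference between the two update targets is easiest to see with numbers. The sketch below uses made-up Q-values for the next state and assumes the agent's exploratory next action is DOWN; none of these values come from the training run later in this notebook.

```python
import numpy as np

gamma = 0.9
reward = -1.0

# Hypothetical Q-values in the next state s' for (UP, RIGHT, DOWN, LEFT)
q_next = np.array([2.5, 5.0, 1.5, 2.0])

# Assumption: exploration makes the agent take DOWN (index 2) in s'
next_action = 2

# On-policy (SARSA): bootstrap from the action actually taken next
sarsa_target = reward + gamma * q_next[next_action]

# Off-policy (Q-learning): bootstrap from the best action, whatever is taken next
q_learning_target = reward + gamma * np.max(q_next)

print(f"SARSA target:      {sarsa_target:.2f}")
print(f"Q-learning target: {q_learning_target:.2f}")
```

The SARSA target drops when the exploratory action is poor, while the Q-learning target is unaffected by which action the agent happens to try next.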

---
## Let's Build a Grid World Environment!

In [None]:
class GridWorld:
    """
    A simple 4x4 Grid World for demonstrating Q-Learning.
    
    Layout:
        ┌───┬───┬───┬───┐
        │ S │   │   │   │   S = Start (0,0)
        ├───┼───┼───┼───┤
        │   │   │   │   │
        ├───┼───┼───┼───┤
        │   │   │   │   │
        ├───┼───┼───┼───┤
        │   │   │   │ G │   G = Goal (3,3)
        └───┴───┴───┴───┘
    
    Actions: 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT
    Rewards: -1 per step, +10 (instead of -1) on the step that reaches the goal
    """
    
    def __init__(self, size=4):
        self.size = size
        self.goal = (size-1, size-1)
        self.action_names = ['UP', 'RIGHT', 'DOWN', 'LEFT']
        self.action_symbols = ['↑', '→', '↓', '←']
        self.reset()
    
    def reset(self):
        """Reset agent to start position."""
        self.pos = (0, 0)
        return self.pos
    
    def step(self, action):
        """
        Take an action.
        
        Returns: (next_state, reward, done)
        """
        row, col = self.pos
        
        if action == 0:    # UP
            row = max(0, row - 1)
        elif action == 1:  # RIGHT
            col = min(self.size - 1, col + 1)
        elif action == 2:  # DOWN
            row = min(self.size - 1, row + 1)
        elif action == 3:  # LEFT
            col = max(0, col - 1)
        
        self.pos = (row, col)
        done = self.pos == self.goal
        reward = 10 if done else -1
        
        return self.pos, reward, done
    
    def render(self, Q=None, path=None):
        """Visualize the grid with optional Q-values and path."""
        fig, ax = plt.subplots(figsize=(8, 8))
        
        # Draw grid
        for row in range(self.size):
            for col in range(self.size):
                # Color cells
                if (row, col) == self.goal:
                    color = '#c8e6c9'
                elif (row, col) == (0, 0):
                    color = '#bbdefb'
                else:
                    color = 'white'
                
                rect = Rectangle((col, self.size - 1 - row), 1, 1, 
                                   facecolor=color, edgecolor='black', linewidth=2)
                ax.add_patch(rect)
                
                # Add Q-value arrows if provided
                if Q is not None and (row, col) in Q:
                    q_vals = Q[(row, col)]
                    best_action = np.argmax(q_vals)
                    cx, cy = col + 0.5, self.size - 1 - row + 0.5
                    
                    # Draw best action arrow
                    dx, dy = [(0, 0.3), (0.3, 0), (0, -0.3), (-0.3, 0)][best_action]
                    ax.arrow(cx - dx/2, cy - dy/2, dx, dy, head_width=0.15, 
                            head_length=0.1, fc='#f44336', ec='#f44336', linewidth=2)
        
        # Draw path if provided
        if path is not None:
            path_x = [col + 0.5 for (row, col) in path]
            path_y = [self.size - 1 - row + 0.5 for (row, col) in path]
            ax.plot(path_x, path_y, 'b-o', linewidth=3, markersize=10, alpha=0.5)
        
        # Labels
        ax.text(0.5, self.size - 0.5, 'START', ha='center', va='center', 
                fontsize=10, fontweight='bold', color='#1976d2')
        ax.text(self.size - 0.5, 0.5, 'GOAL', ha='center', va='center', 
                fontsize=10, fontweight='bold', color='#388e3c')
        
        ax.set_xlim(0, self.size)
        ax.set_ylim(0, self.size)
        ax.set_aspect('equal')
        ax.axis('off')
        ax.set_title('Grid World', fontsize=14, fontweight='bold')
        
        plt.tight_layout()
        plt.show()

# Create and display environment
env = GridWorld(size=4)
print("Grid World Environment Created!")
print("="*40)
print(f"Grid size: {env.size}x{env.size}")
print(f"Start: (0, 0)")
print(f"Goal: {env.goal}")
print(f"Actions: {env.action_names}")
print(f"Rewards: -1 per step, +10 (instead of -1) on the step that reaches the goal")

env.render()

---
## The Complete Q-Learning Algorithm

In [None]:
def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1, verbose=False):
    """
    Q-Learning Algorithm.
    
    Args:
        env: The environment
        n_episodes: Number of episodes to train
        alpha: Learning rate (how much to update Q-values)
        gamma: Discount factor (how much to value future rewards)
        epsilon: Exploration rate (probability of random action)
        verbose: Whether to print progress
    
    Returns:
        Q: Learned Q-values
        rewards_history: Total reward per episode
    """
    # Initialize Q-table with zeros
    # Q[state] = [Q(s,UP), Q(s,RIGHT), Q(s,DOWN), Q(s,LEFT)]
    Q = defaultdict(lambda: np.zeros(4))
    
    rewards_history = []
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        
        for step in range(100):  # Max steps per episode
            # ==========================================
            # STEP 1: Choose action (epsilon-greedy)
            # ==========================================
            if np.random.random() < epsilon:
                # EXPLORE: Random action
                action = np.random.randint(0, 4)
            else:
                # EXPLOIT: Best known action
                action = np.argmax(Q[state])
            
            # ==========================================
            # STEP 2: Take action, observe result
            # ==========================================
            next_state, reward, done = env.step(action)
            total_reward += reward
            
            # ==========================================
            # STEP 3: Q-Learning update (OFF-POLICY!)
            # ==========================================
            # Uses max Q(s', a') regardless of what action we'll actually take.
            # A terminal next state has no future value, so bootstrap from 0 there.
            best_next_q = 0.0 if done else np.max(Q[next_state])
            td_target = reward + gamma * best_next_q
            td_error = td_target - Q[state][action]
            Q[state][action] += alpha * td_error
            
            # ==========================================
            # STEP 4: Move to next state
            # ==========================================
            state = next_state
            
            if done:
                break
        
        rewards_history.append(total_reward)
        
        # Progress printing
        if verbose and (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode+1:4d} | Avg Reward (last 100): {avg_reward:.2f}")
    
    return dict(Q), rewards_history

# Display the algorithm structure
print("Q-LEARNING ALGORITHM")
print("="*60)
print("""
For each episode:
    state = start
    
    While not done:
        1. CHOOSE ACTION (epsilon-greedy)
           - With prob epsilon: random action (explore)
           - With prob 1-epsilon: best action (exploit)
        
        2. TAKE ACTION
           - Get reward and next_state
        
        3. UPDATE Q-VALUE (the key step!)
           - TD Target = r + γ × max Q(s', a')
           - TD Error = TD Target - Q(s, a)
           - Q(s, a) += α × TD Error
        
        4. MOVE TO NEXT STATE
           - state = next_state
""")
print("="*60)

---
## Training the Agent - Watch It Learn!

In [None]:
# Train the Q-learning agent
env = GridWorld(size=4)

print("Training Q-Learning Agent...")
print("="*60)

Q, rewards_history = q_learning(
    env,
    n_episodes=2000,
    alpha=0.1,
    gamma=0.99,
    epsilon=0.1,
    verbose=True
)

print("\n" + "="*60)
print("Training Complete!")
print("="*60)

In [None]:
# Visualize learning curve
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Raw rewards
ax1 = axes[0]
ax1.plot(rewards_history, alpha=0.3, color='blue')
ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('Total Reward', fontsize=12)
ax1.set_title('Learning Curve (Raw)', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Right: Smoothed rewards
ax2 = axes[1]
window = 50
smoothed = np.convolve(rewards_history, np.ones(window)/window, mode='valid')
ax2.plot(smoothed, color='blue', linewidth=2)
ax2.axhline(y=5, color='green', linestyle='--', linewidth=2, label='Optimal (5)')
ax2.set_xlabel('Episode', fontsize=12)
ax2.set_ylabel('Total Reward (Smoothed)', fontsize=12)
ax2.set_title(f'Learning Curve (Smoothed, window={window})', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nFinal average reward (last 100 episodes): {np.mean(rewards_history[-100:]):.2f}")
print(f"Optimal reward: 5 (6 steps: 5 × -1, then +10 on the step that reaches the goal)")

---
## Visualizing the Learned Q-Values

In [None]:
def visualize_q_table(Q, env):
    """Create a detailed visualization of the Q-table."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    
    # Left: Grid with best actions (arrows)
    ax1 = axes[0]
    ax1.set_xlim(0, env.size)
    ax1.set_ylim(0, env.size)
    ax1.set_aspect('equal')
    ax1.axis('off')
    ax1.set_title('Learned Policy (Best Actions)', fontsize=14, fontweight='bold')
    
    for row in range(env.size):
        for col in range(env.size):
            # Draw cell
            if (row, col) == env.goal:
                color = '#c8e6c9'
            elif (row, col) == (0, 0):
                color = '#bbdefb'
            else:
                color = 'white'
            
            y = env.size - 1 - row
            rect = Rectangle((col, y), 1, 1, facecolor=color, 
                               edgecolor='black', linewidth=2)
            ax1.add_patch(rect)
            
            # Draw arrow for best action
            if (row, col) in Q and (row, col) != env.goal:
                q_vals = Q[(row, col)]
                best_action = np.argmax(q_vals)
                cx, cy = col + 0.5, y + 0.5
                
                # Arrow directions
                arrows = [(0, 0.3), (0.3, 0), (0, -0.3), (-0.3, 0)]
                dx, dy = arrows[best_action]
                
                ax1.arrow(cx - dx/2, cy - dy/2, dx, dy, 
                         head_width=0.15, head_length=0.1, 
                         fc='#f44336', ec='#f44336', linewidth=2)
    
    ax1.text(0.5, env.size - 0.2, 'START', ha='center', fontsize=9, color='#1976d2')
    ax1.text(env.size - 0.5, 0.2, 'GOAL', ha='center', fontsize=9, color='#388e3c')
    
    # Right: Q-value heatmap for best Q-value in each cell
    ax2 = axes[1]
    q_max_grid = np.zeros((env.size, env.size))
    
    for row in range(env.size):
        for col in range(env.size):
            if (row, col) in Q:
                q_max_grid[row, col] = np.max(Q[(row, col)])
    
    im = ax2.imshow(q_max_grid, cmap='RdYlGn', origin='upper')
    ax2.set_title('Maximum Q-Value per State', fontsize=14, fontweight='bold')
    
    # Add text annotations
    for row in range(env.size):
        for col in range(env.size):
            value = q_max_grid[row, col]
            color = 'white' if value < np.mean(q_max_grid) else 'black'
            ax2.text(col, row, f'{value:.1f}', ha='center', va='center', 
                    fontsize=12, color=color, fontweight='bold')
    
    plt.colorbar(im, ax=ax2, label='Q-value')
    ax2.set_xticks(range(env.size))
    ax2.set_yticks(range(env.size))
    
    plt.tight_layout()
    plt.show()

# Visualize the learned Q-values
visualize_q_table(Q, env)

In [None]:
# Print the Q-table in detail
print("DETAILED Q-TABLE")
print("="*70)
print(f"{'State':<10} {'UP':>8} {'RIGHT':>8} {'DOWN':>8} {'LEFT':>8} {'Best':>8}")
print("-"*70)

action_symbols = ['↑', '→', '↓', '←']

for row in range(env.size):
    for col in range(env.size):
        state = (row, col)
        if state in Q:
            q_vals = Q[state]
            best = np.argmax(q_vals)
            print(f"{str(state):<10} {q_vals[0]:>8.2f} {q_vals[1]:>8.2f} "
                  f"{q_vals[2]:>8.2f} {q_vals[3]:>8.2f} {action_symbols[best]:>8}")

print("="*70)

---
## Watch the Trained Agent Play!

In [None]:
def run_episode(env, Q, render_steps=True):
    """Run a single episode using the learned Q-values (greedy policy)."""
    state = env.reset()
    path = [state]
    total_reward = 0
    actions_taken = []
    
    for step in range(20):  # Max 20 steps
        # Choose best action (greedy)
        if state in Q:
            action = np.argmax(Q[state])
        else:
            action = np.random.randint(0, 4)
        
        actions_taken.append(env.action_names[action])
        
        # Take action
        next_state, reward, done = env.step(action)
        total_reward += reward
        path.append(next_state)
        
        if render_steps:
            print(f"Step {step+1}: {state} → {env.action_names[action]} → {next_state} "
                  f"(reward: {reward:+d})")
        
        state = next_state
        
        if done:
            break
    
    return path, total_reward, actions_taken

# Run and visualize an episode
print("WATCHING THE TRAINED AGENT")
print("="*60)
print("\nStep-by-step execution:")
print("-"*60)

env = GridWorld(size=4)
path, total_reward, actions = run_episode(env, Q, render_steps=True)

print("-"*60)
print(f"\nTotal steps: {len(path) - 1}")
print(f"Total reward: {total_reward}")
print(f"Actions: {' → '.join(actions)}")
print(f"\nOptimal path length: 6 steps")
print(f"Optimal reward: 5 (5 × -1 + 10 = 5)")

In [None]:
# Visualize the path taken
fig, ax = plt.subplots(figsize=(8, 8))

# Draw grid
for row in range(env.size):
    for col in range(env.size):
        if (row, col) == env.goal:
            color = '#c8e6c9'
        elif (row, col) == (0, 0):
            color = '#bbdefb'
        else:
            color = 'white'
        
        y = env.size - 1 - row
        rect = Rectangle((col, y), 1, 1, facecolor=color, 
                           edgecolor='black', linewidth=2)
        ax.add_patch(rect)

# Draw path
path_x = [col + 0.5 for (row, col) in path]
path_y = [env.size - 1 - row + 0.5 for (row, col) in path]

ax.plot(path_x, path_y, 'b-', linewidth=4, alpha=0.5)

for i, ((row, col), action) in enumerate(zip(path[:-1], actions)):
    x, y = col + 0.5, env.size - 1 - row + 0.5
    ax.scatter(x, y, s=200, c='blue', zorder=5)
    ax.text(x, y, str(i+1), ha='center', va='center', fontsize=10, 
            color='white', fontweight='bold')

# Mark start and end
ax.scatter(path_x[0], path_y[0], s=300, c='green', marker='s', zorder=6, label='Start')
ax.scatter(path_x[-1], path_y[-1], s=300, c='red', marker='*', zorder=6, label='Goal')

ax.set_xlim(0, env.size)
ax.set_ylim(0, env.size)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title(f'Agent\'s Path ({len(path)-1} steps, reward={total_reward})', 
             fontsize=14, fontweight='bold')
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

---
## Exploring Hyperparameters

Let's see how different hyperparameters affect learning:

In [None]:
# Compare different learning rates
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Learning rate comparison
alphas = [0.01, 0.1, 0.5]
ax = axes[0]

for alpha in alphas:
    env = GridWorld()
    _, rewards = q_learning(env, n_episodes=1000, alpha=alpha, gamma=0.99, epsilon=0.1)
    smoothed = np.convolve(rewards, np.ones(50)/50, mode='valid')
    ax.plot(smoothed, label=f'α={alpha}')

ax.set_xlabel('Episode')
ax.set_ylabel('Reward')
ax.set_title('Learning Rate (α) Comparison', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Discount factor comparison
gammas = [0.5, 0.9, 0.99]
ax = axes[1]

for gamma in gammas:
    env = GridWorld()
    _, rewards = q_learning(env, n_episodes=1000, alpha=0.1, gamma=gamma, epsilon=0.1)
    smoothed = np.convolve(rewards, np.ones(50)/50, mode='valid')
    ax.plot(smoothed, label=f'γ={gamma}')

ax.set_xlabel('Episode')
ax.set_ylabel('Reward')
ax.set_title('Discount Factor (γ) Comparison', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Exploration rate comparison
epsilons = [0.01, 0.1, 0.3]
ax = axes[2]

for epsilon in epsilons:
    env = GridWorld()
    _, rewards = q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=epsilon)
    smoothed = np.convolve(rewards, np.ones(50)/50, mode='valid')
    ax.plot(smoothed, label=f'ε={epsilon}')

ax.set_xlabel('Episode')
ax.set_ylabel('Reward')
ax.set_title('Exploration Rate (ε) Comparison', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nHYPERPARAMETER INSIGHTS:")
print("="*60)
print("\nα (Learning Rate):")
print("  - Small (0.01): Slow learning")
print("  - Large (0.5): Learns fast in this deterministic grid, but can")
print("    oscillate when rewards or transitions are noisy")
print("  - Moderate (0.1): A common, stable default")
print("\nγ (Discount Factor):")
print("  - Low (0.5): Short-sighted, distant rewards barely count")
print("  - High (0.99): Considers future, better planning")
print("\nε (Exploration Rate):")
print("  - Low (0.01): Little exploration, may get stuck in a suboptimal policy")
print("  - High (0.3): Many random moves, so reward during training is lower")
print("  - Moderate (0.1): Good exploration-exploitation balance")

---
## Summary: Key Takeaways

### What is Q-Learning?

Q-learning learns **action values Q(s, a)** - how good each action is in each state.

### The Update Rule

```
Q(s, a) ← Q(s, a) + α × [r + γ × max Q(s', a') - Q(s, a)]
```

| Component | Meaning |
|-----------|----------|
| α | Learning rate: how fast to update |
| r | Immediate reward received |
| γ | Discount: how much to value future |
| max Q(s', a') | Best possible future value |

### Off-Policy Learning

- The agent can **explore** (random actions)
- But still **learns the optimal policy** (uses max Q)
- Separates behavior from learning!

### Epsilon-Greedy

- With probability ε: take random action (explore)
- With probability 1-ε: take best action (exploit)
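
As a standalone sketch, the two bullets above fit in a few lines. The helper name `epsilon_greedy` and the use of a seeded NumPy `Generator` are choices made for this illustration; the training loop in this notebook calls `np.random` directly instead.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: best known action

rng = np.random.default_rng(seed=0)
q_values = np.array([5.2, 8.7, 3.1, 4.5])  # UP, RIGHT, DOWN, LEFT

picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = picks.count(1) / len(picks)
print(f"RIGHT chosen in {greedy_share:.0%} of 1000 picks")
```

With ε = 0.1 and four actions, RIGHT should be chosen roughly 92.5% of the time: 90% from exploiting plus a quarter of the 10% random picks.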

---
## Test Your Understanding

**1. What does Q(s, a) represent?**
<details>
<summary>Click to reveal answer</summary>
Q(s, a) represents the expected total discounted future reward if we take action a in state s, and then follow the optimal policy thereafter. It tells us how "good" an action is in a given state.
</details>

**2. Why does Q-learning use `max Q(s', a')` in its update?**
<details>
<summary>Click to reveal answer</summary>
It uses max because Q-learning is off-policy - it always estimates the value assuming we'll take the BEST action in the future, regardless of what action we actually take for exploration. This allows us to learn the optimal policy while following an exploratory policy.
</details>

**3. What is the TD error in Q-learning?**
<details>
<summary>Click to reveal answer</summary>
TD Error = (r + γ × max Q(s', a')) - Q(s, a)

It's the difference between our target [r + γ × max Q(s',a')], built from the reward we actually observed plus our current estimate of the best next value, and what we predicted [Q(s,a)]. If positive, we underestimated; if negative, we overestimated.
</details>

**4. Why do we need epsilon-greedy exploration?**
<details>
<summary>Click to reveal answer</summary>
Without exploration, the agent might get stuck in a suboptimal policy because it never tries actions that might be better. Epsilon-greedy ensures the agent sometimes takes random actions to discover potentially better strategies, while mostly exploiting what it has learned.
</details>

**5. What happens if γ = 0?**
<details>
<summary>Click to reveal answer</summary>
If γ = 0, the agent becomes completely myopic (short-sighted). It only cares about immediate rewards and ignores all future rewards. The update becomes Q(s,a) ← Q(s,a) + α(r - Q(s,a)), which only considers the immediate reward r.
</details>
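
To make the effect of γ concrete, here is a small sketch that sums one reward sequence under different discount factors. The function name `discounted_return` and the three-step reward sequence are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical episode: two step costs, then a goal reward
rewards = [-1, -1, 10]

for gamma in (0.0, 0.5, 1.0):
    print(f"gamma={gamma}: return={discounted_return(rewards, gamma)}")
```

With γ = 0 only the first reward (-1) counts, so the goal reward two steps away is invisible to the agent. With γ = 1 every reward counts fully and the return is 8.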

---
## What's Next?

Excellent work! You've learned one of the most important algorithms in RL!

In the next notebook, we'll learn **SARSA** - the on-policy cousin of Q-learning:
- How SARSA differs from Q-learning
- When to use each algorithm
- The cliff walking problem

**Continue to:** [Notebook 4: SARSA](04_sarsa.ipynb)

---

*Q-learning is the foundation for Deep Q-Networks (DQN), which achieved superhuman performance on many Atari games. You're now ready to understand how that works!*