# Monte Carlo Methods: Learning from Experience

Welcome to Monte Carlo methods - the intuitive approach to learning by trial and error!

## What You'll Learn

By the end of this notebook, you'll understand:
- What Monte Carlo means (with a casino analogy!)
- How to learn from complete episodes
- First-visit vs every-visit MC
- Monte Carlo prediction (evaluating policies)
- Monte Carlo control (finding optimal policies)
- Epsilon-greedy exploration
- Why exploration is crucial!

**Prerequisites:** Fundamentals section (especially Bellman equations)

**Time:** ~35 minutes

---
## The Big Picture: The Casino Analogy

Monte Carlo methods are named after the famous casino in Monaco. Here's why:

```
    ┌────────────────────────────────────────────────────────────────┐
    │            THE MONTE CARLO ANALOGY: THE GAMBLER                │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  Imagine you're a gambler trying to figure out:                │
    │  "What's my expected return at this casino?"                   │
    │                                                                │
    │  METHOD 1: Mathematics (Dynamic Programming)                   │
    │    - Study all the rules and probabilities                     │
    │    - Calculate expected values mathematically                  │
    │    - Requires knowing all the rules!                           │
    │                                                                │
    │  METHOD 2: Experience (Monte Carlo)                            │
    │    - Just PLAY many games                                      │
    │    - Track your winnings/losses                                │
    │    - Average your results                                      │
    │    - No need to know the rules!                                │
    │                                                                │
    │  MONTE CARLO = "Learn by trying many times and averaging"     │
    │                                                                │
    │  Key insight: If you play enough games, your average           │
    │  will converge to the TRUE expected value!                     │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

**Monte Carlo RL: Learn value functions by averaging returns from many episodes!**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Rectangle, Circle, FancyArrowPatch
from matplotlib.colors import LinearSegmentedColormap
from collections import defaultdict

# Visualize the Monte Carlo idea
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Simulating many trials
ax1 = axes[0]
np.random.seed(42)

# Simulate dice rolls to estimate expected value (should be 3.5)
n_trials = [10, 100, 500, 1000, 5000]
estimates = []
for n in n_trials:
    rolls = np.random.randint(1, 7, n)
    estimates.append(np.mean(rolls))

ax1.bar(range(len(n_trials)), estimates, color=['#f44336', '#ff9800', '#ffeb3b', '#8bc34a', '#4caf50'],
        edgecolor='black', linewidth=2)
ax1.axhline(y=3.5, color='blue', linestyle='--', linewidth=2, label='True Expected Value (3.5)')
ax1.set_xticks(range(len(n_trials)))
ax1.set_xticklabels([f'{n} trials' for n in n_trials])
ax1.set_ylabel('Estimated Expected Value', fontsize=12)
ax1.set_title('Monte Carlo: More Trials = Better Estimate\n(Estimating E[dice roll])', fontsize=14, fontweight='bold')
ax1.legend()
ax1.set_ylim(2.5, 4.5)
ax1.grid(True, alpha=0.3)

# Add value labels
for i, est in enumerate(estimates):
    ax1.text(i, est + 0.1, f'{est:.2f}', ha='center', fontsize=11, fontweight='bold')

# Right: Convergence visualization
ax2 = axes[1]
n_samples = 2000
rolls = np.random.randint(1, 7, n_samples)
running_avg = np.cumsum(rolls) / np.arange(1, n_samples + 1)

ax2.plot(running_avg, color='#2196f3', linewidth=2, label='Running Average')
ax2.axhline(y=3.5, color='red', linestyle='--', linewidth=2, label='True Value (3.5)')
ax2.fill_between(range(n_samples), running_avg, 3.5, alpha=0.2, color='blue')
ax2.set_xlabel('Number of Trials', fontsize=12)
ax2.set_ylabel('Running Average', fontsize=12)
ax2.set_title('Monte Carlo Convergence\n(More samples → Closer to truth)', fontsize=14, fontweight='bold')
ax2.legend()
ax2.set_xlim(0, n_samples)
ax2.set_ylim(2.5, 4.5)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "="*70)
print("THE MONTE CARLO PRINCIPLE")
print("="*70)
print("""
To estimate the EXPECTED VALUE of something:
  1. Sample many times (run many trials/episodes)
  2. Compute the average
  3. As samples → ∞, average → true expected value!

For RL:
  V(s) = E[Return | starting from s]
       ≈ Average of returns observed from state s across many episodes
""")
print("="*70)

---
## Monte Carlo for RL: The Core Idea

In RL, Monte Carlo estimates value functions by averaging **actual returns** from **complete episodes**:

```
    ┌────────────────────────────────────────────────────────────────┐
    │              MONTE CARLO FOR REINFORCEMENT LEARNING            │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  STEP 1: Generate a complete episode                          │
    │                                                                │
    │    s₀ → a₀ → r₁ → s₁ → a₁ → r₂ → ... → sT (terminal)         │
    │                                                                │
    │  STEP 2: Calculate returns for each state                     │
    │                                                                │
    │    G(s₀) = r₁ + γr₂ + γ²r₃ + ...                              │
    │    G(s₁) = r₂ + γr₃ + γ²r₄ + ...                              │
    │    ...                                                        │
    │                                                                │
    │  STEP 3: Update value estimates                               │
    │                                                                │
    │    V(s) ← average of all returns observed from s              │
    │                                                                │
    │  REPEAT many episodes!                                        │
    │                                                                │
    │  KEY REQUIREMENT: Need COMPLETE episodes (must reach end)     │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
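
To make Step 2 concrete, here is a minimal sketch of the backward return calculation on a made-up three-step episode. The reward list and the helper name `discounted_returns` are illustrative assumptions, not part of the notebook's own code.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return [G_0, G_1, ...] where G_t = r_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    G = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in range(len(rewards) - 1, -1, -1):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


# Hypothetical episode: two -1 steps, then +10 on reaching the goal.
example_returns = discounted_returns([-1, -1, 10], gamma=0.9)
print([round(g, 2) for g in example_returns])  # [6.2, 8.0, 10.0]
```

Working backwards means each return costs one multiply and one add, instead of re-summing the rest of the episode for every state.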

In [None]:
class GridWorld:
    """
    4x4 Grid World for demonstrating Monte Carlo methods.
    
    Layout:
        ┌───┬───┬───┬───┐
        │ S │   │   │   │   S = Start (0,0)
        ├───┼───┼───┼───┤
        │   │   │   │   │
        ├───┼───┼───┼───┤
        │   │   │   │   │
        ├───┼───┼───┼───┤
        │   │   │   │ G │   G = Goal (3,3)
        └───┴───┴───┴───┘
    
    Actions: 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT
    Rewards: -1 per step, +10 at goal
    """
    
    def __init__(self):
        self.size = 4
        self.goal = (3, 3)
        self.action_names = ['UP', 'RIGHT', 'DOWN', 'LEFT']
        self.action_symbols = ['↑', '→', '↓', '←']
        self.reset()
    
    def reset(self):
        """Reset to start position."""
        self.pos = (0, 0)
        return self.pos
    
    def step(self, action):
        """Take an action and return (next_state, reward, done)."""
        row, col = self.pos
        
        if action == 0:    # UP
            row = max(0, row - 1)
        elif action == 1:  # RIGHT
            col = min(3, col + 1)
        elif action == 2:  # DOWN
            row = min(3, row + 1)
        elif action == 3:  # LEFT
            col = max(0, col - 1)
        
        self.pos = (row, col)
        done = self.pos == self.goal
        reward = 10 if done else -1
        
        return self.pos, reward, done


def random_policy(state):
    """Random policy: pick any action with equal probability."""
    return np.random.randint(0, 4)


def generate_episode(env, policy, max_steps=100):
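    """
    Generate an episode by following the policy.
    
    The episode stops at the goal or after max_steps, whichever comes
    first, so a very long random walk may be cut off before it ends.
    
    Returns:
        episode: List of (state, action, reward) tuples
    """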
    episode = []
    state = env.reset()
    
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        
        if done:
            break
    
    return episode


# Generate and visualize an example episode
env = GridWorld()
np.random.seed(42)
episode = generate_episode(env, random_policy)

print("EXAMPLE EPISODE")
print("="*70)
print("\nFollowing a random policy from Start (0,0) to Goal (3,3):\n")
print(f"{'Step':<6} {'State':<12} {'Action':<10} {'Reward':<8}")
print("-"*40)

for i, (s, a, r) in enumerate(episode[:15]):
    print(f"{i:<6} {str(s):<12} {env.action_names[a]:<10} {r:<8}")

if len(episode) > 15:
    print(f"  ... ({len(episode) - 15} more steps)")

print(f"\nTotal steps: {len(episode)}")
print(f"Total reward: {sum(r for _, _, r in episode)}")
print("="*70)

In [None]:
# Visualize the episode as a path on the grid

fig, ax = plt.subplots(figsize=(10, 10))

# Draw grid
for row in range(4):
    for col in range(4):
        if (row, col) == env.goal:
            color = '#c8e6c9'
        elif (row, col) == (0, 0):
            color = '#bbdefb'
        else:
            color = 'white'
        
        y = 3 - row
        rect = Rectangle((col, y), 1, 1, facecolor=color, 
                           edgecolor='black', linewidth=2)
        ax.add_patch(rect)

# Labels
ax.text(0.5, 3.5, 'START', ha='center', va='center', fontsize=11, 
        fontweight='bold', color='#1976d2')
ax.text(3.5, 0.5, 'GOAL', ha='center', va='center', fontsize=11, 
        fontweight='bold', color='#388e3c')

# Draw arrows for the episode path (only the first 20 steps, to avoid clutter)
max_draw = min(20, len(episode))
for i in range(max_draw):
    s, a, _ = episode[i]
    row, col = s
    x1, y1 = col + 0.5, 3 - row + 0.5
    
    # Get direction
    dx, dy = [(0, 0.25), (0.25, 0), (0, -0.25), (-0.25, 0)][a]
    
    # Draw arrow
    alpha = 1.0 - (i / max_draw) * 0.6  # Fade older arrows
    ax.arrow(x1, y1, dx, dy, head_width=0.1, head_length=0.05,
            fc='#f44336', ec='#f44336', alpha=alpha, linewidth=2)

ax.set_xlim(0, 4)
ax.set_ylim(0, 4)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title(f'Episode Path (showing first {max_draw} steps)\n'
             f'Random policy took {len(episode)} steps to reach goal',
             fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("The random policy wanders around before finding the goal!")
print("MC will learn that states closer to the goal have higher values.")

---
## First-Visit vs Every-Visit Monte Carlo

When a state appears multiple times in an episode, how do we count it?

```
    ┌────────────────────────────────────────────────────────────────┐
    │          FIRST-VISIT vs EVERY-VISIT MONTE CARLO                │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  Example episode: s₁ → s₂ → s₃ → s₂ → s₃ → s₄ (goal)          │
    │                    ↑         ↑                                 │
    │                 1st visit  2nd visit                          │
    │                                                                │
    │  FIRST-VISIT MC:                                              │
    │    Only count the FIRST time we visit each state              │
    │    V(s₂) ← average of returns from first visits only          │
    │                                                                │
    │    ✓ Simpler to understand                                    │
    │    ✓ Unbiased estimate                                        │
    │    ✓ Most commonly used                                       │
    │                                                                │
    │  EVERY-VISIT MC:                                              │
    │    Count EVERY time we visit each state                       │
    │    V(s₂) ← average of returns from ALL visits                 │
    │                                                                │
    │    ✓ More data per episode                                    │
    │    ✓ Also converges to correct value                          │
    │    ✗ Slightly biased until many episodes are averaged         │
    │                                                                │
    │  BOTH converge to true V(s) as episodes → ∞                   │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
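
As a rough worked example, the sketch below computes both targets for s₂ in the episode from the box. The rewards (-1 per step, +10 on the step into the goal), the discount of 0.9, and the `toy_` names are assumptions made for illustration.

```python
toy_gamma = 0.9
# (state, reward received after acting in that state)
toy_episode = [("s1", -1), ("s2", -1), ("s3", -1), ("s2", -1), ("s3", 10)]

# Return following each time step, computed backwards.
G = 0.0
returns_at = [0.0] * len(toy_episode)
for t in range(len(toy_episode) - 1, -1, -1):
    G = toy_episode[t][1] + toy_gamma * G
    returns_at[t] = G

# Time steps at which s2 is visited (here: steps 1 and 3).
s2_steps = [t for t, (state, _) in enumerate(toy_episode) if state == "s2"]

first_visit_target = returns_at[s2_steps[0]]
every_visit_target = sum(returns_at[t] for t in s2_steps) / len(s2_steps)

print(f"first-visit return for s2:  {first_visit_target:.2f}")  # 4.58
print(f"every-visit average for s2: {every_visit_target:.2f}")  # 6.29
```

The second visit to s₂ is closer to the goal, so its return is higher and pulls the every-visit average above the first-visit return for this one episode.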

In [None]:
# Visualize First-Visit vs Every-Visit

fig, ax = plt.subplots(figsize=(14, 6))
ax.set_xlim(0, 14)
ax.set_ylim(0, 8)
ax.axis('off')

ax.text(7, 7.5, 'First-Visit vs Every-Visit Monte Carlo', 
        ha='center', fontsize=16, fontweight='bold')

# Example episode
states = ['A', 'B', 'C', 'B', 'C', 'D']
rewards = [-1, -1, -1, -1, -1, 10]
gamma = 0.9

# Draw episode as boxes
ax.text(7, 6.5, 'Episode: A → B → C → B → C → D(goal)', ha='center', fontsize=12)

for i, (s, r) in enumerate(zip(states, rewards)):
    x = 2 + i * 2
    color = '#c8e6c9' if s == 'D' else '#bbdefb' if s in ['B', 'C'] else '#fff3e0'
    
    box = FancyBboxPatch((x - 0.5, 5.2), 1, 1, boxstyle="round,pad=0.05",
                          facecolor=color, edgecolor='black', linewidth=2)
    ax.add_patch(box)
    ax.text(x, 5.7, s, ha='center', va='center', fontsize=14, fontweight='bold')
    ax.text(x, 5.0, f'r={r}', ha='center', fontsize=9, color='#666')
    
    if i < len(states) - 1:
        ax.annotate('', xy=(x + 0.7, 5.7), xytext=(x + 0.5, 5.7),
                   arrowprops=dict(arrowstyle='->', lw=2, color='#666'))

# Mark first vs second visits
ax.text(4, 4.4, '1st visit', ha='center', fontsize=9, color='#388e3c', fontweight='bold')
ax.text(6, 4.4, '1st visit', ha='center', fontsize=9, color='#388e3c', fontweight='bold')
ax.text(8, 4.4, '2nd visit', ha='center', fontsize=9, color='#d32f2f', fontweight='bold')
ax.text(10, 4.4, '2nd visit', ha='center', fontsize=9, color='#d32f2f', fontweight='bold')

# First-visit explanation
fv_box = FancyBboxPatch((0.5, 1.5), 6, 2.5, boxstyle="round,pad=0.1",
                         facecolor='#e8f5e9', edgecolor='#388e3c', linewidth=2)
ax.add_patch(fv_box)
ax.text(3.5, 3.5, 'FIRST-VISIT MC', ha='center', fontsize=12, fontweight='bold', color='#388e3c')
ax.text(3.5, 2.9, 'For state B:', ha='center', fontsize=10)
ax.text(3.5, 2.4, 'Only use return from step 1', ha='center', fontsize=10)
ax.text(3.5, 1.9, 'G = -1 + 0.9×(-1) + 0.9²×(-1) + ...', ha='center', fontsize=9, color='#666')

# Every-visit explanation
ev_box = FancyBboxPatch((7.5, 1.5), 6, 2.5, boxstyle="round,pad=0.1",
                         facecolor='#fff3e0', edgecolor='#f57c00', linewidth=2)
ax.add_patch(ev_box)
ax.text(10.5, 3.5, 'EVERY-VISIT MC', ha='center', fontsize=12, fontweight='bold', color='#f57c00')
ax.text(10.5, 2.9, 'For state B:', ha='center', fontsize=10)
ax.text(10.5, 2.4, 'Use returns from BOTH visits', ha='center', fontsize=10)
ax.text(10.5, 1.9, 'Average both returns together', ha='center', fontsize=9, color='#666')

plt.tight_layout()
plt.show()

---
## Monte Carlo Prediction: Evaluating a Policy

**Goal:** Given a policy π, estimate V^π(s) for all states.

```
    ┌────────────────────────────────────────────────────────────────┐
    │                 MONTE CARLO PREDICTION                         │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  ALGORITHM (First-Visit MC Prediction):                        │
    │                                                                │
    │  Initialize:                                                   │
    │    V(s) = 0 for all states                                    │
    │    Returns(s) = [] for all states (to store observed returns) │
    │                                                                │
    │  For each episode:                                            │
    │    1. Generate episode: s₀,a₀,r₁, s₁,a₁,r₂, ... sT           │
    │    2. G ← 0 (return accumulator)                              │
    │    3. For t = T-1, T-2, ... 0 (backwards!):                   │
    │         G ← γ×G + r_{t+1}                                     │
    │         If s_t is FIRST visit in episode:                     │
    │           Append G to Returns(s_t)                            │
    │           V(s_t) ← average(Returns(s_t))                      │
    │                                                                │
    │  Return V                                                      │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
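
Storing every return in Returns(s) is not strictly necessary: the same average can be maintained incrementally with V ← V + (G − V)/N, where N counts the returns seen so far. A minimal sketch, using made-up return values and a hypothetical helper name:

```python
import numpy as np


def incremental_mean_update(value, count, new_return):
    """Fold one more return into a running average without storing history."""
    count += 1
    value += (new_return - value) / count
    return value, count


sample_returns = [4.0, -2.5, 7.0, 1.5]  # made-up returns for a single state
v, n = 0.0, 0
for g in sample_returns:
    v, n = incremental_mean_update(v, n, g)

# The running average matches the batch average (up to floating-point error).
print(v, np.mean(sample_returns))
```

This form needs only one value and one counter per state, which matters when the number of episodes is large.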

In [None]:
def first_visit_mc_prediction(env, policy, n_episodes=1000, gamma=0.9, verbose=False):
    """
    First-visit Monte Carlo prediction.
    
    Estimates V(s) by averaging returns from first visits to each state.
    
    Args:
        env: The environment
        policy: Function mapping state -> action
        n_episodes: Number of episodes to run
        gamma: Discount factor
        verbose: Whether to print progress
    
    Returns:
        V: Dictionary mapping state -> value estimate
        history: List of V[(0,0)] at each episode (for visualization)
    """
    V = defaultdict(float)
    returns = defaultdict(list)  # Store all returns for each state
    history = []
    
    for ep in range(n_episodes):
        # ========================================
        # STEP 1: Generate a complete episode
        # ========================================
        episode = generate_episode(env, policy)
        
        # ========================================
        # STEP 2: Calculate returns backwards
        # ========================================
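        G = 0  # Return accumulator
        
        # Record the time step at which each state FIRST appears.
        # The loop below runs backwards, so a plain "seen" set would
        # pick out the LAST visit to each state instead of the first.
        first_visit_step = {}
        for t, (state, _, _) in enumerate(episode):
            first_visit_step.setdefault(state, t)
        
        # Go backwards through the episode
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward  # Discounted return
            
            # ========================================
            # STEP 3: First-visit check
            # ========================================
            if first_visit_step[state] == t:
                returns[state].append(G)
                V[state] = np.mean(returns[state])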
        
        # Track progress
        history.append(V[(0, 0)])
        
        if verbose and (ep + 1) % 500 == 0:
            print(f"Episode {ep+1:5d} | V(start) = {V[(0,0)]:.2f}")
    
    return dict(V), history


# Run Monte Carlo prediction
env = GridWorld()

print("FIRST-VISIT MONTE CARLO PREDICTION")
print("="*60)
print("\nEvaluating the random policy...\n")

V_random, history = first_visit_mc_prediction(
    env, random_policy, n_episodes=5000, verbose=True
)

print("\n" + "="*60)
print("Estimated Value Function V^π(s):")
print("-"*40)
for row in range(4):
    values = [V_random.get((row, col), 0.0) for col in range(4)]
    print(" ".join([f"{v:8.2f}" for v in values]))
print("-"*40)
print("(Higher values = closer to goal = better!)")

In [None]:
# Visualize the learning process and value function

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Learning curve
ax1 = axes[0]
ax1.plot(history, alpha=0.4, color='blue', label='Raw')

# Smoothed version
window = 100
smoothed = np.convolve(history, np.ones(window)/window, mode='valid')
ax1.plot(range(window-1, len(history)), smoothed, color='blue', linewidth=2, label=f'Smoothed (window={window})')

ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('V(start state)', fontsize=12)
ax1.set_title('Monte Carlo Learning Curve\n(Value of starting state over time)', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right: Value function heatmap
ax2 = axes[1]
V_grid = np.array([[V_random.get((r, c), 0.0) for c in range(4)] for r in range(4)])

im = ax2.imshow(V_grid, cmap='RdYlGn')
for i in range(4):
    for j in range(4):
        value = V_grid[i, j]
        color = 'white' if value < np.mean(V_grid) else 'black'
        if (i, j) == env.goal:
            ax2.text(j, i, 'GOAL\n0', ha='center', va='center', 
                    fontsize=10, fontweight='bold', color='white')
        else:
            ax2.text(j, i, f'{value:.1f}', ha='center', va='center', 
                    fontsize=11, fontweight='bold', color=color)

ax2.set_title('Estimated V(s) for Random Policy\n(Green = Higher Value)', fontsize=14, fontweight='bold')
ax2.set_xticks(range(4))
ax2.set_yticks(range(4))
plt.colorbar(im, ax=ax2, label='Value')

plt.tight_layout()
plt.show()

print("\nObservations:")
print("• States closer to the goal have higher values")
print("• The random policy results in relatively low values")
print("• MC converges but has high variance (noisy learning curve)")

---
## Monte Carlo Control: Finding the Optimal Policy

Now let's use MC to **find** the best policy, not just evaluate a given one!

```
    ┌────────────────────────────────────────────────────────────────┐
    │                    MONTE CARLO CONTROL                         │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  THE EXPLORATION PROBLEM:                                      │
    │                                                                │
    │    If we always take the "best" action, we might miss         │
    │    better actions we haven't tried!                           │
    │                                                                │
    │    ANALOGY: Always ordering the same dish at a restaurant    │
    │    You might miss your new favorite dish!                     │
    │                                                                │
    │  SOLUTION: EPSILON-GREEDY EXPLORATION                          │
    │                                                                │
    │    With probability (1-ε): Take the BEST action (exploit)     │
    │    With probability ε:     Take a RANDOM action (explore)     │
    │                                                                │
    │    Example with ε = 0.1:                                      │
    │    • 90% of the time: Best action                             │
    │    • 10% of the time: Random action                           │
    │                                                                │
    │  MC CONTROL LEARNS Q(s,a) instead of V(s)!                    │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
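
The selection rule in the box can be sketched as an explicit probability vector. The Q-values below are made up, and `epsilon_greedy_probs` is a hypothetical helper rather than a function defined elsewhere in this notebook.

```python
import numpy as np


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under epsilon-greedy for a single state."""
    n_actions = len(q_values)
    # Every action gets an equal share of the exploration probability...
    probs = np.full(n_actions, epsilon / n_actions)
    # ...and the greedy action gets the remaining (1 - epsilon) on top.
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs


rng = np.random.default_rng(0)
q_example = np.array([0.2, 1.5, -0.3, 0.0])  # made-up values; action 1 is best
probs = epsilon_greedy_probs(q_example, epsilon=0.1)
action = rng.choice(len(q_example), p=probs)  # sample one action

print(probs)  # [0.025 0.925 0.025 0.025]
```

Note that the random branch can also pick the best action, so with ε = 0.1 and four actions the best action is chosen 92.5% of the time rather than exactly 90%.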

In [None]:
# Visualize epsilon-greedy exploration

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Epsilon-greedy distribution
ax1 = axes[0]
epsilon = 0.1
n_actions = 4

# Assume action 1 (RIGHT) is best
probs = np.ones(n_actions) * (epsilon / n_actions)
probs[1] += (1 - epsilon)  # Best action gets extra probability

colors = ['#90caf9', '#4caf50', '#90caf9', '#90caf9']
bars = ax1.bar(['UP', 'RIGHT\n(best)', 'DOWN', 'LEFT'], probs * 100, 
              color=colors, edgecolor='black', linewidth=2)

ax1.set_ylabel('Probability (%)', fontsize=12)
ax1.set_title(f'ε-Greedy Action Selection (ε={epsilon})\n"Mostly exploit, sometimes explore"', 
              fontsize=14, fontweight='bold')
ax1.set_ylim(0, 100)

for i, p in enumerate(probs):
    ax1.text(i, p*100 + 2, f'{p*100:.1f}%', ha='center', fontsize=12, fontweight='bold')

# Add annotations
ax1.annotate('EXPLOIT\n(best action)', xy=(1, 92), xytext=(2.5, 85),
            fontsize=10, ha='center', arrowprops=dict(arrowstyle='->', color='#388e3c'))
ax1.annotate('EXPLORE\n(try others)', xy=(0, 2.5), xytext=(-0.5, 30),
            fontsize=10, ha='center', arrowprops=dict(arrowstyle='->', color='#1976d2'))

# Right: Effect of epsilon
ax2 = axes[1]
epsilons = np.linspace(0, 1, 100)
exploit_probs = 1 - epsilons + epsilons/4  # Prob of taking best action
explore_probs = epsilons * 3/4  # Prob of taking non-best action

ax2.plot(epsilons, exploit_probs * 100, label='Best action probability', 
         color='#4caf50', linewidth=3)
ax2.plot(epsilons, explore_probs * 100, label='Other actions probability',
         color='#f44336', linewidth=3)

ax2.axvline(x=0.1, color='gray', linestyle='--', alpha=0.5)
ax2.text(0.12, 80, 'ε=0.1\n(typical)', fontsize=10)

ax2.set_xlabel('Epsilon (ε)', fontsize=12)
ax2.set_ylabel('Probability (%)', fontsize=12)
ax2.set_title('Effect of Epsilon Value\n(Trade-off: Exploration vs Exploitation)', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nEpsilon controls the exploration-exploitation trade-off:")
print("• ε = 0: Pure exploitation (greedy, might get stuck)")
print("• ε = 1: Pure exploration (always random, never uses what it learns)")
print("• ε = 0.1: Good balance (mostly exploit, occasionally explore)")

In [None]:
def mc_control_epsilon_greedy(env, n_episodes=10000, gamma=0.9, epsilon=0.1, verbose=False):
    """
    Monte Carlo Control with epsilon-greedy exploration.
    
    Estimates Q(s,a) from first visits to each (state, action) pair while
    acting epsilon-greedily, then extracts the greedy policy from Q.
    
    Args:
        env: The environment
        n_episodes: Number of episodes
        gamma: Discount factor
        epsilon: Exploration rate
        verbose: Print progress
    
    Returns:
        Q: Action-value function (dictionary)
        policy: Greedy policy with respect to the learned Q (dictionary)
        history: Total (undiscounted) reward of each episode
    """
    # Initialize Q(s,a) = 0 for all state-action pairs
    Q = defaultdict(lambda: np.zeros(4))
    returns = defaultdict(list)  # Store returns for each (s,a)
    episode_rewards = []  # Track performance
    
    def epsilon_greedy_policy(state):
        """Choose action using epsilon-greedy strategy."""
        if np.random.random() < epsilon:
            return np.random.randint(0, 4)  # Explore
        else:
            return np.argmax(Q[state])  # Exploit
    
    for ep in range(n_episodes):
        # ========================================
        # STEP 1: Generate episode with epsilon-greedy
        # ========================================
        episode = generate_episode(env, epsilon_greedy_policy)
        episode_rewards.append(sum(r for _, _, r in episode))
        
        # ========================================
        # STEP 2: Calculate returns and update Q
        # ========================================
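        G = 0
        
        # Record the time step at which each (state, action) pair FIRST
        # appears; the backward loop alone would select the last visit.
        first_visit_step = {}
        for t, (state, action, _) in enumerate(episode):
            first_visit_step.setdefault((state, action), t)
        
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            
            sa_pair = (state, action)
            if first_visit_step[sa_pair] == t:
                returns[sa_pair].append(G)
                Q[state][action] = np.mean(returns[sa_pair])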
        
        if verbose and (ep + 1) % 2000 == 0:
            avg_reward = np.mean(episode_rewards[-500:])
            print(f"Episode {ep+1:5d} | Avg Reward (last 500): {avg_reward:.1f}")
    
    # ========================================
    # STEP 3: Extract greedy policy from Q
    # ========================================
    policy = {s: np.argmax(Q[s]) for s in Q}
    
    return dict(Q), policy, episode_rewards


# Run MC Control
print("MONTE CARLO CONTROL")
print("="*60)
print("\nLearning the optimal policy with ε-greedy exploration...\n")

Q, policy, rewards = mc_control_epsilon_greedy(
    env, n_episodes=10000, epsilon=0.1, verbose=True
)

print("\n" + "="*60)
print("LEARNED OPTIMAL POLICY:")
print("-"*30)
for row in range(4):
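    # The goal is terminal and never acted from, so show it as 'G' instead of an arrow
    actions = ['G' if (row, col) == env.goal else env.action_symbols[policy.get((row, col), 0)]
               for col in range(4)]
    print("     ".join(actions))
print("-"*30)
print("(Arrows show the best action in each state; G = goal)")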
print("="*60)

In [None]:
# Visualize learning progress and learned policy

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Learning curve
ax1 = axes[0]
ax1.plot(rewards, alpha=0.3, color='blue', label='Raw')

# Smoothed
window = 200
smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
ax1.plot(range(window-1, len(rewards)), smoothed, color='blue', linewidth=2, 
         label=f'Smoothed (window={window})')

ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('Episode Reward', fontsize=12)
ax1.set_title('MC Control Learning Curve\n(Reward improves as policy improves)', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right: Policy visualization
ax2 = axes[1]

# Draw grid
for row in range(4):
    for col in range(4):
        color = '#c8e6c9' if (row, col) == env.goal else '#e3f2fd' if (row, col) == (0, 0) else 'white'
        rect = Rectangle((col, 3 - row), 1, 1, facecolor=color, edgecolor='black', linewidth=2)
        ax2.add_patch(rect)

# Draw arrows for policy
arrow_dx = [0, 0.35, 0, -0.35]  # up, right, down, left
arrow_dy = [0.35, 0, -0.35, 0]

for row in range(4):
    for col in range(4):
        if (row, col) == env.goal:
            ax2.text(col + 0.5, 3 - row + 0.5, 'GOAL', ha='center', va='center', 
                     fontsize=10, fontweight='bold', color='#388e3c')
        else:
            a = policy.get((row, col), 0)
            cx, cy = col + 0.5, 3 - row + 0.5
            ax2.arrow(cx - arrow_dx[a]/2, cy - arrow_dy[a]/2, 
                      arrow_dx[a], arrow_dy[a],
                      head_width=0.15, head_length=0.1, 
                      fc='#f44336', ec='#f44336', linewidth=2)

ax2.set_xlim(0, 4)
ax2.set_ylim(0, 4)
ax2.set_aspect('equal')
ax2.axis('off')
ax2.set_title('Learned Optimal Policy\n(Arrows = Best Actions)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nThe learned policy moves directly toward the goal!")
print("Much better than the random policy we evaluated earlier.")

---
## Comparing Random vs Optimal Policy

Let's see the difference in performance:

In [None]:
def evaluate_policy(env, policy_func, n_episodes=1000):
    """Evaluate a policy by running episodes and averaging returns."""
    returns = []
    steps = []
    
    for _ in range(n_episodes):
        episode = generate_episode(env, policy_func)
        returns.append(sum(r for _, _, r in episode))
        steps.append(len(episode))
    
    return np.mean(returns), np.std(returns), np.mean(steps)


# Create a greedy policy from learned Q
def learned_policy(state):
    return policy.get(state, 0)


print("POLICY COMPARISON")
print("="*60)

# Evaluate random policy
rand_return, rand_std, rand_steps = evaluate_policy(env, random_policy)
print(f"\nRandom Policy:")
print(f"  Average Return: {rand_return:.1f} ± {rand_std:.1f}")
print(f"  Average Steps:  {rand_steps:.1f}")

# Evaluate learned policy
learn_return, learn_std, learn_steps = evaluate_policy(env, learned_policy)
print(f"\nLearned (Optimal) Policy:")
print(f"  Average Return: {learn_return:.1f} ± {learn_std:.1f}")
print(f"  Average Steps:  {learn_steps:.1f}")

print(f"\nImprovement:")
print(f"  Return: {learn_return - rand_return:+.1f} ({(learn_return - rand_return)/abs(rand_return)*100:+.0f}%)")
print(f"  Steps:  {rand_steps - learn_steps:.1f} fewer steps on average")
print("="*60)

---
## Why Exploration Matters: A Demonstration

Without exploration, the agent might get stuck with a suboptimal policy:

In [None]:
# Compare different epsilon values

epsilons = [0.0, 0.05, 0.1, 0.2, 0.5]
results = []

print("EFFECT OF EXPLORATION (ε)")
print("="*60)

for eps in epsilons:
    _, pol, rewards = mc_control_epsilon_greedy(env, n_episodes=5000, epsilon=eps)
    # Average over the last 500 of the 5000 entries, after learning has had time to settle
    final_reward = np.mean(rewards[-500:])
    results.append(final_reward)
    print(f"ε = {eps:.2f}: Final avg reward = {final_reward:.1f}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))

colors = ['#f44336' if e == 0 else '#4caf50' if e == 0.1 else '#2196f3' for e in epsilons]
bars = ax.bar([f'ε={e}' for e in epsilons], results, color=colors, edgecolor='black', linewidth=2)

ax.set_ylabel('Final Average Reward', fontsize=12)
ax.set_title('Effect of Exploration Rate on Learning\n(Higher is better)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

# Add value labels
for i, v in enumerate(results):
    ax.text(i, v + 0.2, f'{v:.1f}', ha='center', fontsize=11, fontweight='bold')

# Annotate
if results[0] < max(results):
    ax.annotate('No exploration!\n(might get stuck)', xy=(0, results[0]), xytext=(0.5, results[0] - 2),
                fontsize=10, ha='center', arrowprops=dict(arrowstyle='->', color='#d32f2f'))

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("LESSON: Some exploration is essential!")
print("• ε = 0: May never discover better actions")
print("• ε = 0.1: Good balance for this problem")
print("• ε too high: Wastes time on random actions")
print("="*60)

---
## Monte Carlo: Pros and Cons

```
    ┌────────────────────────────────────────────────────────────────┐
    │           MONTE CARLO: ADVANTAGES & DISADVANTAGES              │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  ADVANTAGES:                                                   │
    │  ✓ No need for environment model (model-free)                 │
    │  ✓ Unbiased estimates (averages actual returns)               │
    │  ✓ Simple to understand and implement                         │
    │  ✓ Works well for episodic tasks with clear endings           │
    │                                                                │
    │  DISADVANTAGES:                                                │
    │  ✗ Needs COMPLETE episodes (can't learn mid-episode)          │
    │  ✗ High variance (noisy returns)                              │
    │  ✗ Slow learning (many episodes needed)                       │
    │  ✗ Can't be used for continuing (non-episodic) tasks          │
    │                                                                │
    │  BEST FOR:                                                     │
    │  • Card games (Blackjack, Poker)                              │
    │  • Board games (episodes end with win/loss)                   │
    │  • Tasks with short episodes                                  │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

---
## Summary: Key Takeaways

### The Monte Carlo Principle

**Estimate by averaging**: V(s) ≈ average of returns observed from state s
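
To make the averaging idea concrete, here is a minimal sketch of a running average that folds in one return at a time, so no list of past returns needs to be stored. The function name `update_mean` and the example returns are made up for illustration and are not taken from the code earlier in this notebook.

```python
def update_mean(old_mean, count, new_return):
    """Fold one more observed return into a running average.

    `count` is the number of returns seen so far, including the new one.
    """
    return old_mean + (new_return - old_mean) / count


# Hypothetical returns observed from the same state across four episodes.
observed_returns = [4.0, 6.0, 5.0, 9.0]

v_estimate = 0.0
for n, g in enumerate(observed_returns, start=1):
    v_estimate = update_mean(v_estimate, n, g)

# Matches the plain average sum(observed_returns) / len(observed_returns)
print(v_estimate)
```

The incremental form gives the same number as the plain average; it is simply cheaper to maintain when returns arrive one episode at a time.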

### Two Variants

| Method | What it counts | Properties |
|--------|---------------|------------|
| **First-Visit MC** | Only first visit to each state | Unbiased estimates, most common |
| **Every-Visit MC** | All visits to each state | More data per episode; biased but still converges |
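
The difference between the two variants is easiest to see on an episode that revisits a state. The sketch below assumes episodes are lists of `(state, action, reward)` tuples, as in the cells above; the helper `returns_per_state` and the toy episode are hypothetical and only meant to illustrate the bookkeeping.

```python
from collections import defaultdict


def returns_per_state(episode, gamma=1.0, first_visit=True):
    """Collect the returns credited to each state in one episode.

    `episode` is a list of (state, action, reward) tuples.
    """
    # Walk backward so each step's return is G_t = r_t + gamma * G_{t+1}.
    g = 0.0
    step_returns = []
    for state, _, reward in reversed(episode):
        g = reward + gamma * g
        step_returns.append((state, g))
    step_returns.reverse()  # back to chronological order

    credited = defaultdict(list)
    seen = set()
    for state, g in step_returns:
        if first_visit and state in seen:
            continue  # first-visit: ignore repeat visits in this episode
        seen.add(state)
        credited[state].append(g)
    return dict(credited)


# Hypothetical episode that visits state "A" twice; reward is -1 per step.
episode = [("A", 0, -1), ("B", 1, -1), ("A", 0, -1), ("C", 1, -1)]

print(returns_per_state(episode, first_visit=True))
print(returns_per_state(episode, first_visit=False))
```

With first-visit, state `"A"` is credited only with the return from its first occurrence (-4.0); with every-visit it is also credited with the return from the second occurrence (-2.0).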

### Prediction vs Control

| Task | Goal | Learns |
|------|------|--------|
| **MC Prediction** | Evaluate a policy | V(s) |
| **MC Control** | Find optimal policy | Q(s,a) |
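
Learning Q(s,a) pays off when it is time to act: the policy can be read straight off the table without a model of the environment. Here is a small sketch, assuming Q is stored as a dict keyed by `(state, action)`. That storage layout, the helper name `greedy_policy_from_q`, and the toy values are assumptions for illustration and may differ from how Q is stored elsewhere in this notebook.

```python
def greedy_policy_from_q(Q, n_actions):
    """Return {state: best action} given Q stored as {(state, action): value}."""
    states = {s for s, _ in Q}
    return {
        # Unseen (state, action) pairs default to 0.0 in this sketch.
        s: max(range(n_actions), key=lambda a: Q.get((s, a), 0.0))
        for s in states
    }


# Hypothetical Q-values for two states and two actions.
Q = {
    (0, 0): -3.0, (0, 1): -1.5,
    (1, 0): -0.5, (1, 1): -2.0,
}

greedy = greedy_policy_from_q(Q, n_actions=2)
print(greedy)  # state 0 prefers action 1, state 1 prefers action 0
```

Doing the same with V(s) alone would require knowing which next state each action leads to, which is exactly the model that MC avoids.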

### Key Concepts

| Concept | Description |
|---------|-------------|
| **ε-Greedy** | Explore with probability ε, exploit otherwise |
| **Returns** | Sum of discounted rewards from a state to the end of the episode |
| **Model-Free** | No need to know transition probabilities |
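
As a quick illustration of the ε-greedy rule, the sketch below picks actions from a list of Q-values for a single state and measures how often the greedy action is chosen. The function `epsilon_greedy_action` and the Q-values are hypothetical; this is not the selection code used inside `mc_control_epsilon_greedy`.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values for one state."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)  # seeded so the experiment is repeatable
q_values = [0.2, 1.0, -0.5, 0.1]  # hypothetical values; action 1 is greedy
picks = [epsilon_greedy_action(q_values, 0.1, rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(f"greedy action chosen {greedy_share:.1%} of the time")
```

Because a random exploratory draw can also land on the greedy action, the greedy action is chosen with probability 1 - ε + ε/|A|, which is a little more than 1 - ε. With ε = 0.1 and four actions that is 0.925.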

---
## Test Your Understanding

**1. What does "Monte Carlo" mean in the context of RL?**
<details>
<summary>Click to reveal answer</summary>
Monte Carlo means estimating values by averaging many random samples. In RL, we estimate V(s) or Q(s,a) by running many episodes and averaging the actual returns observed. The name comes from the Monte Carlo casino, emphasizing the use of randomness and repetition.
</details>

**2. Why does MC require complete episodes?**
<details>
<summary>Click to reveal answer</summary>
MC computes returns by summing rewards from a state to the end of the episode: G = r₁ + γr₂ + γ²r₃ + ... We need the episode to end to know all these future rewards. This is different from TD learning, which replaces the unknown future rewards with its current value estimate of the next state, so it can update after every step.
</details>

**3. What's the difference between First-Visit and Every-Visit MC?**
<details>
<summary>Click to reveal answer</summary>
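First-Visit MC only counts the return from the FIRST time we visit a state in each episode. Every-Visit MC counts returns from EVERY visit. Both converge to the true value as the number of episodes grows. First-Visit gives unbiased estimates and is the more commonly used variant; Every-Visit uses more data per episode, but its estimates are biased for any finite number of episodes.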
</details>

**4. Why is epsilon-greedy exploration important?**
<details>
<summary>Click to reveal answer</summary>
Without exploration, the agent might never try actions that seem worse initially but are actually better. ε-greedy ensures we occasionally try random actions, discovering potentially better strategies. With ε=0.1, we take the greedy action 90% of the time and a uniformly random action (which may happen to be the greedy one) the other 10%.
</details>

**5. Why does MC Control learn Q(s,a) instead of V(s)?**
<details>
<summary>Click to reveal answer</summary>
With Q(s,a), we can directly pick the best action: π*(s) = argmax_a Q(s,a). With V(s) alone, we'd need to know the transition probabilities to figure out which action leads to high-value states. Q makes the policy extraction model-free.
</details>

---
## What's Next?

Great work! You've learned Monte Carlo methods - the foundation of learning from experience!

**The main limitation of MC:** We must wait until the episode ends to learn.

In the next notebook, we'll learn **Temporal Difference (TD) Learning** - which can learn from **every step**, not just at the end!

**Continue to:** [Notebook 2: Temporal Difference Learning](02_temporal_difference_learning.ipynb)

---

*Monte Carlo: "Try many times, average the results." Simple but powerful!*