# Implementing Monte Carlo Methods for Estimating Value Functions

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand Monte Carlo methods for value estimation
- Implement first-visit and every-visit Monte Carlo
- Estimate state value functions using Monte Carlo
- Compare Monte Carlo with other methods
- Apply Monte Carlo to RL environments

## 🔗 Prerequisites

- ✅ Understanding of value functions (V(s), Q(s,a))
- ✅ Understanding of episodes and returns
- ✅ Python knowledge (functions, dictionaries, loops)
- ✅ NumPy, Matplotlib knowledge
- ✅ Basic RL concepts (states, actions, rewards, policies)

---

## Official Structure Reference

This notebook covers practical activities from **Course 09, Unit 2**:
- Implementing Monte Carlo methods for estimating value functions
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 2 Practical Content

---

## Introduction

**Monte Carlo methods** learn value functions from experience, in the form of sampled episodes. They do not require a model of the environment; instead, they average the actual returns (discounted sums of rewards) observed in those episodes.
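
As a minimal sketch of what "return" means here, the snippet below computes the discounted return G_t for every time step of a single episode by sweeping backwards over its rewards. The helper name `discounted_returns` and the example rewards are illustrative assumptions, not part of the course material.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, ..., G_{T-1}] where rewards[t] is the reward R_{t+1}.

    Hypothetical helper for illustration: uses the recursion
    G_t = R_{t+1} + gamma * G_{t+1}, starting from G_T = 0.
    """
    returns = [0.0] * len(rewards)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


# A three-step episode whose only reward arrives at the end
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

The backward sweep is the same trick the implementations later in this notebook rely on: each return is obtained from the next one in constant time.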

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
import random

print("✅ Libraries imported!")
print("\nImplementing Monte Carlo Methods for Value Estimation")
print("=" * 60)

## Part 1: Understanding Monte Carlo Methods


In [None]:
print("=" * 60)
print("Part 1: Understanding Monte Carlo Methods")
print("=" * 60)

print("\nMonte Carlo Key Concepts:")
print(" 1. Learn from complete episodes (must wait until episode ends)")
print(" 2. Use actual returns: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ...")
print(" 3. Update value estimates using: V(s) = average of returns")
print(" 4. No model required (model-free method)")

print("\nTwo Approaches:")
print(" - First-visit MC: Average returns only for first occurrence of state in episode")
print(" - Every-visit MC: Average returns for every occurrence of state in episode")

print("\nAlgorithm:")
print(" 1. Generate episode following policy π")
print(" 2. For each state s in episode:")
print("    - Calculate return G from that state")
print("    - Append G to Returns(s)")
print("    - V(s) = average(Returns(s))")

print("\n✅ Monte Carlo concepts understood!")
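
To make the difference between the two approaches concrete, here is a small sketch on a hand-written episode in which state `"A"` occurs twice. The function `collect_returns` and the toy episode are assumptions made for illustration; episodes here are simplified to `(state, reward)` pairs.

```python
from collections import defaultdict


def collect_returns(episode, gamma=1.0, first_visit=True):
    """Collect the returns recorded for each state in one episode.

    Hypothetical helper: `episode` is a list of (state, reward) pairs,
    where the reward is the one received after leaving that state.
    """
    # Earliest time step at which each state occurs
    first_index = {}
    for t, (state, _) in enumerate(episode):
        first_index.setdefault(state, t)

    collected = defaultdict(list)
    G = 0.0
    for t in reversed(range(len(episode))):
        state, reward = episode[t]
        G = reward + gamma * G
        # Every-visit records all occurrences; first-visit only the earliest
        if not first_visit or first_index[state] == t:
            collected[state].append(G)
    return dict(collected)


toy_episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
print("first-visit:", collect_returns(toy_episode, first_visit=True))
print("every-visit:", collect_returns(toy_episode, first_visit=False))
```

With gamma = 1, the returns from time steps 0, 1, and 2 are 3.0, 2.0, and 2.0. First-visit records only 3.0 for `"A"`, while every-visit records both 3.0 and 2.0.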

## Part 2: First-Visit Monte Carlo Implementation


In [None]:
print("\n" + "=" * 60)
print("Part 2: First-Visit Monte Carlo Implementation")
print("=" * 60)

def generate_episode(policy, env, max_steps=100):
    """Generate an episode as a list of (state, action, reward) tuples."""
    episode = []
    # Call reset() once; Gymnasium-style envs return (observation, info)
    reset_result = env.reset()
    state = reset_result[0] if isinstance(reset_result, tuple) else reset_result

    for step in range(max_steps):
        # Choose action based on policy (dict lookup or callable)
        if isinstance(policy, dict):
            action = policy.get(state, random.choice(range(env.action_space.n)))
        else:
            action = policy(state)

        # Call step() exactly once: every call advances the environment
        result = env.step(action)
        if len(result) == 5:
            # Gymnasium-style: (obs, reward, terminated, truncated, info)
            next_state, reward, terminated, truncated, _ = result
            done = terminated or truncated
        else:
            # Classic style: (obs, reward, done, ...)
            next_state, reward, done = result[:3]

        episode.append((state, action, reward))
        state = next_state

        if done:
            break

    return episode

def first_visit_mc(policy, env, n_episodes=1000, gamma=0.99):
    """
    First-visit Monte Carlo for estimating state values.
    """
    returns = defaultdict(list)
    V = defaultdict(float)

    for episode_num in range(n_episodes):
        episode = generate_episode(policy, env)

        # Record the time step at which each state first occurs.
        # (A "seen" set filled during the backward pass would pick
        # the LAST occurrence instead.)
        first_visit_step = {}
        for t, (state, _, _) in enumerate(episode):
            first_visit_step.setdefault(state, t)

        # Process episode backwards, accumulating the return G
        G = 0
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward

            # First-visit: record G only at the state's earliest occurrence
            if first_visit_step[state] == t:
                returns[state].append(G)
                V[state] = np.mean(returns[state])

    return V, returns

# Simple example: Random walk
print("\nExample: Simple Random Walk")
print(" States: [0, 1, 2, 3, 4]")
print(" Actions: Move left (-1) or right (+1)")
print(" Goal: Estimate state values")

# Simplified environment simulation
class SimpleRandomWalk:
    def __init__(self):
        self.state = 2
        self.n_states = 5

    def reset(self):
        self.state = 2
        return self.state

    def step(self, action):
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state in [0, 4]
        return self.state, reward, done

    @property
    def action_space(self):
        class Space:
            n = 2
        return Space()

# Random policy
def random_policy(state):
    return random.choice([-1, 1])

env_simple = SimpleRandomWalk()
V_mc, returns_mc = first_visit_mc(random_policy, env_simple, n_episodes=100, gamma=1.0)

print(f"\nEstimated State Values (First-Visit MC):")
for state in sorted(V_mc.keys()):
    print(f" V({state}) = {V_mc[state]:.4f} (from {len(returns_mc[state])} visits)")

print("\n✅ First-visit Monte Carlo implemented!")
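
The implementation above stores every return and recomputes the mean after each append. One possible alternative, sketched here under the assumption that only the current estimate and a visit count are needed, is the incremental mean update V ← V + (G − V) / N. The function name `incremental_mean_update` is hypothetical.

```python
def incremental_mean_update(value, count, new_return):
    """Fold one new return into a running average.

    Hypothetical helper: gives the same result as averaging all returns
    seen so far, without keeping them in a list.
    """
    count += 1
    value += (new_return - value) / count
    return value, count


value, count = 0.0, 0
observed_returns = [1.0, 0.0, 1.0, 1.0]  # example returns for one state
for G in observed_returns:
    value, count = incremental_mean_update(value, count, G)

print(f"running mean = {value:.4f} after {count} returns")
print(f"batch mean   = {sum(observed_returns) / len(observed_returns):.4f}")
```

Both lines should report the same average. The trade-off is that the per-state list of returns is no longer available for inspection, for example to report visit counts as the cell above does.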

## Part 3: Every-Visit Monte Carlo Implementation


In [None]:
print("\n" + "=" * 60)
print("Part 3: Every-Visit Monte Carlo Implementation")
print("=" * 60)

def every_visit_mc(policy, env, n_episodes=1000, gamma=0.99):
    """
    Every-visit Monte Carlo for estimating state values.
    """
    returns = defaultdict(list)
    V = defaultdict(float)

    for episode_num in range(n_episodes):
        episode = generate_episode(policy, env)

        # Calculate returns
        G = 0

        # Process episode backwards
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward

            # Every-visit: update for every occurrence
            returns[state].append(G)
            V[state] = np.mean(returns[state])

    return V, returns

# Compare first-visit vs every-visit
env_simple2 = SimpleRandomWalk()
V_every, returns_every = every_visit_mc(random_policy, env_simple2, n_episodes=100, gamma=1.0)

print(f"\nEstimated State Values (Every-Visit MC):")
for state in sorted(V_every.keys()):
    print(f" V({state}) = {V_every[state]:.4f} (from {len(returns_every[state])} visits)")

print("\nComparison:")
print(" First-visit: at most one return per state per episode; samples are independent across episodes")
print(" Every-visit: one return per occurrence; more samples per state, but correlated within an episode")

print("\n✅ Every-visit Monte Carlo implemented!")
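
To judge how close the Monte Carlo estimates are, it helps to have reference values. The sketch below computes them for the 5-state random walk with iterative policy evaluation, a dynamic-programming method that, unlike Monte Carlo, assumes the transition model is known. The function name `random_walk_true_values` is an assumption; the dynamics mirror `SimpleRandomWalk` (terminal states 0 and 4, reward 1.0 on reaching state 4, uniform random policy).

```python
def random_walk_true_values(n_states=5, gamma=1.0, tol=1e-10):
    """Iterative policy evaluation for the random walk under a uniform random policy.

    Hypothetical helper: uses the known model, so it is a reference
    for comparison rather than a model-free method.
    """
    V = [0.0] * n_states  # terminal states (first and last) stay at 0
    while True:
        delta = 0.0
        for s in range(1, n_states - 1):
            total = 0.0
            for action in (-1, 1):
                s_next = s + action
                reward = 1.0 if s_next == n_states - 1 else 0.0
                total += 0.5 * (reward + gamma * V[s_next])
            delta = max(delta, abs(total - V[s]))
            V[s] = total
        if delta < tol:
            return V


true_values = random_walk_true_values()
print([round(v, 4) for v in true_values])  # [0.0, 0.25, 0.5, 0.75, 0.0]
```

With gamma = 1.0, the value of each non-terminal state is the probability of finishing at state 4, which is s/4. Estimates from 100 episodes should land near these values but will fluctuate from run to run.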

## Summary

### Key Concepts:
1. **Monte Carlo Methods**: Learn value functions from sample episodes
2. **Returns**: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ...
3. **First-Visit MC**: Average returns only for first occurrence in episode
4. **Every-Visit MC**: Average returns for every occurrence in episode
5. **Model-Free**: Don't require environment dynamics model

### Advantages:
- Simple and intuitive
- No model required
- Works well with function approximation
- Can focus on specific states

### Disadvantages:
- Requires complete episodes (no value update until the episode terminates)
- High variance in estimates
- Slow convergence
- Only works for episodic tasks
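
The variance and convergence points can be illustrated with a small experiment. This sketch repeats the Monte Carlo estimate of V(2) for the 5-state random walk many times at several episode budgets and reports the spread across repetitions. The helper names, the seed, and the budgets are illustrative assumptions.

```python
import random
import statistics


def sample_return(rng):
    """Run one random-walk episode from state 2; return 1.0 if it ends at state 4."""
    state = 2
    while state not in (0, 4):
        state += rng.choice((-1, 1))
    return 1.0 if state == 4 else 0.0


def mc_estimate(n_episodes, rng):
    """Average the undiscounted return from the start state over n_episodes."""
    return sum(sample_return(rng) for _ in range(n_episodes)) / n_episodes


rng = random.Random(0)  # fixed seed so the run is repeatable
spread = {}
for n in (10, 100, 1000):
    estimates = [mc_estimate(n, rng) for _ in range(50)]
    spread[n] = statistics.stdev(estimates)
    print(f"{n:>5} episodes: std of V(2) estimate = {spread[n]:.3f}")
```

The spread should shrink roughly in proportion to 1/sqrt(n), which is why halving the error takes about four times as many episodes.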

### Applications:
- Policy evaluation
- Game playing (episodic)
- Episodic control problems

### Next Steps:
- Monte Carlo control (policy improvement)
- Compare with TD methods
- Apply to more complex environments

**Reference:** Course 09, Unit 2: "Prediction and Control without a Model" - Monte Carlo methods practical content