# Policies and Value Functions

Welcome back! Now we'll learn about two central concepts in RL: **policies** (what to do) and **value functions** (how good a state or action is).

## What You'll Learn

By the end of this notebook, you'll understand:
- What a policy is (with a GPS navigation analogy!)
- Deterministic vs stochastic policies
- State-value function V(s) - "How good is being here?"
- Action-value function Q(s,a) - "How good is this action?"
- Optimal policies and why we want them
- How to estimate values with Monte Carlo

**Prerequisites:** Notebooks 1-3

**Time:** ~30 minutes

---
## The Big Picture: GPS and Property Values

Two simple analogies explain policies and value functions:

```
    ┌────────────────────────────────────────────────────────┐
    │                    POLICY = GPS                         │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  GPS Navigation:                                       │
    │    "At this location, go THIS direction"               │
    │                                                         │
    │  Policy:                                               │
    │    "In this state, take THIS action"                   │
    │                                                         │
    │  Both map a current situation to a decision!           │
    │                                                         │
    └────────────────────────────────────────────────────────┘

    ┌────────────────────────────────────────────────────────┐
    │            VALUE FUNCTION = PROPERTY VALUES             │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  Property Values:                                      │
    │    "This neighborhood is worth more money"             │
    │    → Better location = higher value                    │
    │                                                         │
    │  State-Value V(s):                                     │
    │    "This state is worth more reward"                   │
    │    → Better state = higher value                       │
    │                                                         │
    │  Both measure how "good" a location/state is!          │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```

---
## What is a Policy?

A **policy** π (pi) is the agent's strategy - it tells the agent what to do in each situation.

```
    ┌────────────────────────────────────────────────────────┐
    │                      POLICY                             │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  A policy is a MAPPING from states to actions:         │
    │                                                         │
    │       State ────────> Action                           │
    │                                                         │
    │  Examples:                                             │
    │                                                         │
    │  Chess:                                                │
    │    Board position → Best move                          │
    │                                                         │
    │  Self-driving car:                                     │
    │    Camera image + sensors → Steering/acceleration      │
    │                                                         │
    │  Robot vacuum:                                         │
    │    Current location + dirt sensors → Movement direction│
    │                                                         │
    │  THE GOAL OF RL: Find the BEST policy!                 │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```

---
## Two Types of Policies

### 1. Deterministic Policy

Always picks the **same action** for a given state.

```
π(s) = a      "In state s, ALWAYS do action a"
```

**Example:** A GPS that always tells you the shortest route.

### 2. Stochastic Policy

Picks actions with **probabilities**.

```
π(a|s) = P(action = a | state = s)    "In state s, do action a with probability π(a|s)"
```

**Example:** A GPS that sometimes suggests scenic routes (exploration!).

```
    ┌────────────────────────────────────────────────────────┐
    │       DETERMINISTIC vs STOCHASTIC POLICY                │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  DETERMINISTIC:                                        │
    │    State (1,1) → Action: RIGHT (always)               │
    │                                                         │
    │  STOCHASTIC (epsilon-greedy with ε=0.1):               │
    │    State (1,1) → Action probabilities:                 │
    │      • RIGHT: 92.5%  ← Best action (high probability)  │
    │      • UP:     2.5%                                    │
    │      • DOWN:   2.5%                                    │
    │      • LEFT:   2.5%                                    │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```
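
The percentages in the box follow from the ε-greedy rule: each of the 4 actions receives ε/4 from the random branch, and the greedy action receives an extra 1 − ε on top. As a quick sanity check, the sketch below computes those probabilities and compares them with sampled frequencies. The helper name `epsilon_greedy_probs`, the seed, and the sample size are illustrative assumptions, not part of this notebook's own code.

```python
import numpy as np

def epsilon_greedy_probs(best_action, n_actions=4, epsilon=0.1):
    """Analytic action probabilities for epsilon-greedy (hypothetical helper)."""
    probs = np.full(n_actions, epsilon / n_actions)  # share from the random branch
    probs[best_action] += 1.0 - epsilon              # extra mass on the greedy action
    return probs

RIGHT = 1  # action indices assumed here: 0=Up, 1=Right, 2=Down, 3=Left
probs = epsilon_greedy_probs(RIGHT)

# Sample many actions and compare empirical frequencies with the analytic values
rng = np.random.default_rng(seed=0)
samples = rng.choice(4, size=100_000, p=probs)
freqs = np.bincount(samples, minlength=4) / samples.size

for name, p, f in zip(['Up', 'Right', 'Down', 'Left'], probs, freqs):
    print(f"{name:5s}  analytic={p:.3f}  sampled={f:.3f}")
```

The sampled frequencies should land close to 92.5% / 2.5% / 2.5% / 2.5%, with small deviations that shrink as the sample size grows.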

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Rectangle

# Demonstrate deterministic vs stochastic policies

def deterministic_policy(state):
    """
    Deterministic policy: Always takes the same action.
    
    Strategy: Go toward the goal at (3, 3)
    - If not at goal column, go RIGHT
    - If at goal column, go DOWN
    """
    row, col = state
    if col < 3:
        return 1  # RIGHT
    elif row < 3:
        return 2  # DOWN
    else:
        return 0  # At goal, any action


def stochastic_policy(state, epsilon=0.1):
    """
    Stochastic policy (epsilon-greedy):
    - With probability (1-epsilon): take best action
    - With probability epsilon: take a uniformly random action
      (which may also be the best one, so P(best) = 1 - epsilon + epsilon/4)
    """
    if np.random.random() < epsilon:
        return np.random.randint(0, 4)  # Random action
    else:
        return deterministic_policy(state)  # Best action


def get_policy_distribution(state, epsilon=0.1):
    """
    Get the probability distribution over actions for a stochastic policy.
    """
    optimal_action = deterministic_policy(state)
    probs = np.ones(4) * (epsilon / 4)  # Base probability for exploration
    probs[optimal_action] += (1 - epsilon)  # High prob for optimal action
    return probs


# Visualize policy distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
action_names = ['Up', 'Right', 'Down', 'Left']
colors = ['#f44336', '#4caf50', '#ff9800', '#2196f3']

# Left: Deterministic policy
ax1 = axes[0]
state = (1, 1)
det_probs = np.zeros(4)
det_probs[deterministic_policy(state)] = 1.0

bars = ax1.bar(action_names, det_probs, color=colors, edgecolor='black', linewidth=2)
ax1.set_ylabel('Probability', fontsize=12)
ax1.set_ylim(0, 1.1)
ax1.set_title(f'Deterministic Policy at State {state}\n"Always go RIGHT"', 
              fontsize=14, fontweight='bold')

for i, p in enumerate(det_probs):
    ax1.text(i, p + 0.03, f'{p:.0%}', ha='center', fontsize=12, fontweight='bold')

# Right: Stochastic policy
ax2 = axes[1]
stoch_probs = get_policy_distribution(state, epsilon=0.1)

bars = ax2.bar(action_names, stoch_probs, color=colors, edgecolor='black', linewidth=2)
ax2.set_ylabel('Probability', fontsize=12)
ax2.set_ylim(0, 1.1)
ax2.set_title(f'Stochastic Policy at State {state}\nε-greedy (ε=0.1)', 
              fontsize=14, fontweight='bold')

for i, p in enumerate(stoch_probs):
    ax2.text(i, p + 0.03, f'{p:.1%}', ha='center', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("WHY USE STOCHASTIC POLICIES?")
print("="*60)
print("\n1. EXPLORATION: Sometimes try new things to find better actions")
print("2. GAME THEORY: Unpredictable behavior is harder to exploit")
print("3. UNCERTAINTY: When unsure, hedging bets can be optimal")
print("="*60)

---
## State-Value Function V(s): "How Good Is Being Here?"

The **state-value function** V^π(s) tells us the expected total (discounted) future reward starting from state s, when following policy π.

```
    ┌────────────────────────────────────────────────────────┐
    │            STATE-VALUE FUNCTION V(s)                    │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  V^π(s) = Expected Return starting from state s        │
    │         = E[G_t | s_t = s, following policy π]         │
    │         = E[r_t + γr_{t+1} + γ²r_{t+2} + ... | s]      │
    │                                                         │
    │  In plain English:                                     │
    │    "How much total reward can I expect to get          │
    │     if I start HERE and follow THIS strategy?"         │
    │                                                         │
    │  ANALOGY: Property Values                              │
    │    V(nice neighborhood) = HIGH   (good location)       │
    │    V(bad neighborhood)  = LOW    (bad location)        │
    │                                                         │
    │    The "value" depends on what you can get from there! │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```

### Key Insight

V(s) depends on TWO things:
1. **The state s** - some states are naturally better
2. **The policy π** - your strategy affects your outcomes
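
To make the return G_t concrete, here is a minimal sketch that sums discounted rewards for a single trajectory. It assumes the reward scheme used in this notebook's grid world (-1 per step, +10 on the step that reaches the goal, γ = 0.9); the function name `discounted_return` is a hypothetical helper introduced only for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * r_k over one reward sequence (hypothetical helper)."""
    total = 0.0
    discount = 1.0
    for r in rewards:
        total += discount * r   # weight this reward by the current discount
        discount *= gamma       # later rewards count a little less
    return total

# Three steps from the goal: two -1 step penalties, then +10 on arrival
rewards = [-1, -1, 10]
g = discounted_return(rewards, gamma=0.9)
print(f"G = {g:.2f}")  # -1 - 0.9 + 8.1 = 6.20
```

V^π(s) is the average of this quantity over all the trajectories the policy can produce from s. With a deterministic policy in a deterministic environment there is only one such trajectory, so the average equals this single return.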

In [None]:
class GridWorld:
    """
    Simple 4x4 Grid World for demonstrating value functions.
    
    Layout:
        ┌───┬───┬───┬───┐
        │ S │   │   │   │   S = Start (0,0)
        ├───┼───┼───┼───┤
        │   │   │   │   │
        ├───┼───┼───┼───┤
        │   │   │   │   │
        ├───┼───┼───┼───┤
        │   │   │   │ G │   G = Goal (3,3)
        └───┴───┴───┴───┘
    
    Rewards: -1 per step, +10 for the step that reaches the goal
    """
    
    def __init__(self):
        self.size = 4
        self.goal = (3, 3)
        self.gamma = 0.9  # Discount factor
        self.action_names = ['Up', 'Right', 'Down', 'Left']
    
    def get_next_state(self, state, action):
        """Get the next state given current state and action."""
        row, col = state
        
        if action == 0:    # Up
            row = max(0, row - 1)
        elif action == 1:  # Right
            col = min(3, col + 1)
        elif action == 2:  # Down
            row = min(3, row + 1)
        elif action == 3:  # Left
            col = max(0, col - 1)
        
        return (row, col)
    
    def get_reward(self, state, next_state):
        """Get the reward for a transition."""
        if next_state == self.goal:
            return 10  # Big reward at goal!
        return -1  # Small penalty for each step


def estimate_value_monte_carlo(env, policy, state, n_episodes=1000, max_steps=50):
    """
    Estimate V(s) using Monte Carlo simulation.
    
    Method: Run many episodes, average the returns.
    
    This is like: "Try the strategy many times, see what you get on average"
    """
    returns = []
    
    for episode in range(n_episodes):
        current_state = state
        episode_return = 0
        discount = 1.0
        
        for step in range(max_steps):
            # Get action from policy
            action = policy(current_state)
            
            # Take action
            next_state = env.get_next_state(current_state, action)
            reward = env.get_reward(current_state, next_state)
            
            # Accumulate discounted reward
            episode_return += discount * reward
            discount *= env.gamma
            
            # Check if done
            if next_state == env.goal:
                break
            
            current_state = next_state
        
        returns.append(episode_return)
    
    return np.mean(returns)


# Create environment and compute values
env = GridWorld()

print("COMPUTING STATE VALUES V(s)")
print("="*60)
print("\nUsing Monte Carlo simulation with optimal policy...")
print("(Running 500 episodes per state)")

V = np.zeros((4, 4))
for row in range(4):
    for col in range(4):
        state = (row, col)
        if state == env.goal:
            V[row, col] = 0  # Terminal state has no future rewards
        else:
            V[row, col] = estimate_value_monte_carlo(
                env, deterministic_policy, state, n_episodes=500
            )

print("\nState Values V(s):")
print("-"*60)
print("         Col 0    Col 1    Col 2    Col 3")
for row in range(4):
    values = " ".join([f"{V[row, col]:8.2f}" for col in range(4)])
    print(f"Row {row}:  {values}")
print("-"*60)
print("(Goal at (3,3) has value 0 - it's the terminal state)")

In [None]:
# Visualize the value function as a heatmap

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Heatmap
ax1 = axes[0]
im = ax1.imshow(V, cmap='RdYlGn', origin='upper')

# Add value annotations
for i in range(4):
    for j in range(4):
        value = V[i, j]
        color = 'white' if value < np.mean(V) else 'black'
        if (i, j) == env.goal:
            ax1.text(j, i, 'GOAL\n0', ha='center', va='center', 
                    fontsize=11, fontweight='bold', color='white')
        else:
            ax1.text(j, i, f'{value:.1f}', ha='center', va='center', 
                    fontsize=12, fontweight='bold', color=color)

ax1.set_xticks(range(4))
ax1.set_yticks(range(4))
ax1.set_xlabel('Column', fontsize=12)
ax1.set_ylabel('Row', fontsize=12)
ax1.set_title('State Values V(s)\n(Higher = Better State)', fontsize=14, fontweight='bold')
plt.colorbar(im, ax=ax1, label='Value')

# Right: Interpretation
ax2 = axes[1]
ax2.axis('off')
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)

interpretation = """
INTERPRETATION:

• States CLOSER to goal have HIGHER values
  (Fewer -1 steps, and the +10 is discounted less)

• States (3,2) and (2,3) have highest value (10.0)
  (One step from goal: that step earns +10)

• State (0,0) has lowest non-terminal value (~1.8)
  (Farthest from goal: 6 steps needed)

• Values tell us: "How good is it to BE here?"

ANALOGY:
• Goal = Beach
• Value = "How nice is this location?"
• High value = Close to beach (desirable)
• Low value = Far from beach (less desirable)
"""

ax2.text(0.1, 0.95, interpretation, transform=ax2.transAxes,
         fontsize=11, verticalalignment='top', family='monospace',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

---
## Action-Value Function Q(s, a): "How Good Is This Action?"

The **action-value function** Q^π(s, a) tells us the expected return of taking action a in state s, then following policy π.

```
    ┌────────────────────────────────────────────────────────┐
    │            ACTION-VALUE FUNCTION Q(s, a)                │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  Q^π(s, a) = Expected Return starting from (s, a)      │
    │            = E[G_t | s_t = s, a_t = a, then follow π]  │
    │                                                         │
    │  In plain English:                                     │
    │    "If I'm in state s and take action a,               │
    │     how much reward will I get in total?"              │
    │                                                         │
    │  ANALOGY: Restaurant Ratings                           │
    │    Q(downtown, pizza) = 8.5  "Pizza downtown is great!"│
    │    Q(downtown, sushi) = 7.2  "Sushi downtown is okay"  │
    │    Q(downtown, tacos) = 9.1  "Tacos downtown are best!"│
    │                                                         │
    │    Best action = argmax Q(s, a) = TACOS!               │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```

### Why Q is More Useful Than V

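- **V(s)** tells you how good a state is, but not which action to take; to choose an action from V alone you also need to know where each action leads
- **Q(s, a)** tells you how good each action is, so actions can be compared directly

With Q-values, choosing the best action is easy: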
```
Best action = argmax_a Q(s, a)
```
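
In code, the argmax is a one-liner. The sketch below uses made-up Q-values for a single state (they are assumptions for illustration, not outputs of this notebook) and also shows how to detect ties, since `np.argmax` returns only the first maximum.

```python
import numpy as np

action_names = ['Up', 'Right', 'Down', 'Left']

# Hypothetical Q-values for one state, in the order of action_names
q_values = np.array([1.8, 4.6, 4.6, 1.8])

best = int(np.argmax(q_values))          # index of the first maximum
print(action_names[best])                # Right (ties go to the lower index)

# All actions tied for best, in case ties should be broken some other way
tied = np.flatnonzero(np.isclose(q_values, q_values.max()))
print([action_names[i] for i in tied])   # ['Right', 'Down']
```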

In [None]:
def estimate_q_value(env, policy, state, action, n_episodes=500, max_steps=50):
    """
    Estimate Q(s, a) using Monte Carlo simulation.
    
    Method: Start with action a in state s, then follow policy.
    """
    returns = []
    
    for episode in range(n_episodes):
        # First step: take the specified action
        current_state = state
        next_state = env.get_next_state(current_state, action)
        episode_return = env.get_reward(current_state, next_state)
        discount = env.gamma
        current_state = next_state
        
        # Then follow the policy
        for step in range(max_steps - 1):
            if current_state == env.goal:
                break
            
            action_t = policy(current_state)
            next_state = env.get_next_state(current_state, action_t)
            reward = env.get_reward(current_state, next_state)
            
            episode_return += discount * reward
            discount *= env.gamma
            current_state = next_state
        
        returns.append(episode_return)
    
    return np.mean(returns)


# Compute Q-values for a specific state
state = (1, 1)

print(f"Q-VALUES AT STATE {state}")
print("="*60)
print("\n'How good is each action from this state?'\n")

q_values = []
for action in range(4):
    q = estimate_q_value(env, deterministic_policy, state, action, n_episodes=500)
    q_values.append(q)
    print(f"  Q({state}, {env.action_names[action]:5s}) = {q:6.2f}")

best_action = np.argmax(q_values)
print(f"\n  Best action: {env.action_names[best_action]} (highest Q-value!)")
print("="*60)

In [None]:
# Visualize Q-values for a state

fig, ax = plt.subplots(figsize=(10, 8))
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Title
ax.text(5, 9.5, f'Q-Values at State {state}', 
        ha='center', fontsize=18, fontweight='bold')

# Central state box
state_box = FancyBboxPatch((3.5, 4), 3, 2, boxstyle="round,pad=0.1",
                            facecolor='#bbdefb', edgecolor='#1976d2', linewidth=3)
ax.add_patch(state_box)
ax.text(5, 5.2, f'State', ha='center', fontsize=14, fontweight='bold', color='#1976d2')
ax.text(5, 4.5, f'{state}', ha='center', fontsize=12, color='#1976d2')

# Action Q-values around the state
actions_info = [
    {'name': 'UP', 'pos': (5, 7.5), 'q': q_values[0], 'color': '#f44336'},
    {'name': 'RIGHT', 'pos': (8, 5), 'q': q_values[1], 'color': '#4caf50'},
    {'name': 'DOWN', 'pos': (5, 2), 'q': q_values[2], 'color': '#ff9800'},
    {'name': 'LEFT', 'pos': (2, 5), 'q': q_values[3], 'color': '#9c27b0'},
]

best_q = max(q_values)

for info in actions_info:
    x, y = info['pos']
    
    # Highlight best action
    if info['q'] == best_q:
        face_color = '#c8e6c9'
        edge_color = '#388e3c'
        text = f"{info['name']}\nQ = {info['q']:.2f}\n★ BEST"
    else:
        face_color = '#fff3e0'
        edge_color = info['color']
        text = f"{info['name']}\nQ = {info['q']:.2f}"
    
    box = FancyBboxPatch((x-1, y-0.8), 2, 1.6, boxstyle="round,pad=0.1",
                          facecolor=face_color, edgecolor=edge_color, linewidth=2)
    ax.add_patch(box)
    ax.text(x, y, text, ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Draw arrow from state to action
    dx = (x - 5) * 0.4
    dy = (y - 5) * 0.4
    ax.annotate('', xy=(5 + dx * 2, 5 + dy * 2), xytext=(5 + dx * 0.8, 5 + dy * 0.8),
               arrowprops=dict(arrowstyle='->', lw=2, color=edge_color))

ax.text(5, 0.5, 'Optimal policy: Always pick the action with highest Q-value!',
        ha='center', fontsize=12, style='italic', color='#388e3c')

plt.tight_layout()
plt.show()

---
## The Relationship Between V and Q

V and Q are related by:

```
    ┌────────────────────────────────────────────────────────┐
    │              V(s) AND Q(s,a) RELATIONSHIP               │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  V^π(s) = Σ_a π(a|s) × Q^π(s, a)                       │
    │                                                         │
    │  In plain English:                                     │
    │    "The value of a state = weighted average of         │
    │     Q-values for all actions"                          │
    │                                                         │
    │  Example (ε-greedy with ε=0.1):                        │
    │    V(s) = 0.925×Q(s,best) + 0.025×Q(s,a₁)             │
    │         + 0.025×Q(s,a₂) + 0.025×Q(s,a₃)               │
    │                                                         │
    │  For deterministic policy:                             │
    │    V(s) = Q(s, π(s))   (just the Q-value of chosen    │
    │                         action)                        │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```
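
The weighted average in the box is just a dot product between the policy's action probabilities and the Q-values. The sketch below uses made-up numbers to show this, and to show that a deterministic policy is the special case of a one-hot distribution. Keep in mind that the identity only holds when the Q-values belong to the same policy π whose V is being computed.

```python
import numpy as np

# Hypothetical Q^pi(s, a) for one state, ordered Up, Right, Down, Left
q = np.array([2.0, 5.0, 4.0, 1.0])

# Epsilon-greedy probabilities with Right as the greedy action (epsilon = 0.1)
pi_stochastic = np.array([0.025, 0.925, 0.025, 0.025])
v_stochastic = float(pi_stochastic @ q)

# A deterministic policy puts all probability on one action (one-hot)
pi_deterministic = np.array([0.0, 1.0, 0.0, 0.0])
v_deterministic = float(pi_deterministic @ q)

print(f"V (stochastic)    = {v_stochastic:.3f}")     # 0.05 + 4.625 + 0.1 + 0.025 = 4.800
print(f"V (deterministic) = {v_deterministic:.3f}")  # equals Q(s, Right) = 5.000
```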

In [None]:
# Demonstrate the V-Q relationship

state = (1, 1)

print("V(s) AND Q(s,a) RELATIONSHIP")
print("="*60)
print(f"\nState: {state}")

# Q-values we computed earlier
print(f"\nQ-values:")
for i, name in enumerate(env.action_names):
    print(f"  Q(s, {name:5s}) = {q_values[i]:6.2f}")

# For deterministic policy: V(s) = Q(s, π(s))
best_action = deterministic_policy(state)
v_deterministic = q_values[best_action]
print(f"\nFor DETERMINISTIC policy (always RIGHT):")
print(f"  V(s) = Q(s, RIGHT) = {v_deterministic:.2f}")

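# For epsilon-greedy policy: V(s) = sum of π(a|s) * Q(s,a)
# Caveat: q_values were estimated under the DETERMINISTIC policy, so this sum is
# the value of one ε-greedy step followed by the deterministic policy. The true
# ε-greedy V would need Q-values estimated under the ε-greedy policy itself.
epsilon = 0.1
probs = get_policy_distribution(state, epsilon)
v_stochastic = sum(p * q for p, q in zip(probs, q_values))

print(f"\nFor STOCHASTIC policy (ε-greedy, ε={epsilon}):")
print(f"  V(s) = Σ π(a|s) × Q(s,a)")
terms = " + ".join(f"{p:.3f}×{q:.2f}" for p, q in zip(probs, q_values))
print(f"       = {terms}")
print(f"       = {v_stochastic:.2f}")

print(f"\nNote: Stochastic V < Deterministic V because exploration sometimes picks worse actions!")
print("(Approximation: only the first step is ε-greedy here; the Q-values assume the deterministic policy afterwards.)")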
print("="*60)

---
## Optimal Value Functions: The Best Possible

The **optimal value functions** are the best values achievable by ANY policy:

```
    ┌────────────────────────────────────────────────────────┐
    │              OPTIMAL VALUE FUNCTIONS                    │
    ├────────────────────────────────────────────────────────┤
    │                                                         │
    │  V*(s) = max over all policies π of V^π(s)             │
    │        = "Best possible value from state s"            │
    │                                                         │
    │  Q*(s, a) = max over all policies π of Q^π(s, a)       │
    │           = "Best possible value for action a in s"    │
    │                                                         │
    │  An OPTIMAL POLICY π* is any policy that achieves V*   │
    │                                                         │
    │  Key insight: Once you have Q*, the optimal policy is: │
    │                                                         │
    │       π*(s) = argmax_a Q*(s, a)                        │
    │                                                         │
    │  "Just pick the action with the highest Q-value!"      │
    │                                                         │
    └────────────────────────────────────────────────────────┘
```

### The Goal of RL

The goal of reinforcement learning is to find:
1. The optimal value functions (V* and Q*), OR
2. The optimal policy π* directly
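
For the small grid world in this notebook, V* can be written down by hand, which gives a useful reference when checking Monte Carlo estimates. The sketch below assumes the same setup as the `GridWorld` class (4x4 grid, goal at (3, 3), -1 per step, +10 on the step that reaches the goal, γ = 0.9) and assumes that walking a shortest path is optimal, which holds here because every extra step adds a penalty and delays the +10. The names `optimal_value` and `V_star` are introduced only for illustration.

```python
import numpy as np

GAMMA = 0.9        # discount factor, as in the GridWorld class
GOAL = (3, 3)
SIZE = 4

def optimal_value(state):
    """Return of a shortest path from `state`: -1 per step, +10 on arrival."""
    d = abs(GOAL[0] - state[0]) + abs(GOAL[1] - state[1])  # Manhattan distance
    if d == 0:
        return 0.0  # terminal state: no future reward
    # The first d-1 steps each cost -1; the d-th step earns +10
    step_penalties = -sum(GAMMA ** k for k in range(d - 1))
    return step_penalties + 10 * GAMMA ** (d - 1)

V_star = np.array([[optimal_value((r, c)) for c in range(SIZE)]
                   for r in range(SIZE)])
print(np.round(V_star, 2))
```

Under these assumptions the states next to the goal have value 10, and the start state (0, 0), six steps away, has value of about 1.81.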

In [None]:
# Compare optimal vs random policy values

def random_policy(state):
    """Random policy: pick any action with equal probability."""
    return np.random.randint(0, 4)

print("OPTIMAL vs RANDOM POLICY")
print("="*60)
print("\nComparing values for the same states under different policies...\n")

# Compute values for a few states
test_states = [(0, 0), (1, 1), (2, 2), (0, 3)]

print(f"{'State':<10} {'V(optimal)':<15} {'V(random)':<15} {'Difference':<15}")
print("-"*55)

for state in test_states:
    v_optimal = estimate_value_monte_carlo(env, deterministic_policy, state, n_episodes=500)
    v_random = estimate_value_monte_carlo(env, random_policy, state, n_episodes=500)
    diff = v_optimal - v_random
    print(f"{str(state):<10} {v_optimal:<15.2f} {v_random:<15.2f} {diff:<15.2f}")

print("\n" + "="*60)
print("KEY INSIGHT:")
print("  Optimal policy gives MUCH higher values!")
print("  This is why finding the optimal policy matters!")
print("="*60)

In [None]:
# Visualize the optimal policy as arrows on the grid

fig, ax = plt.subplots(figsize=(8, 8))

# Draw grid
for row in range(4):
    for col in range(4):
        if (row, col) == env.goal:
            color = '#c8e6c9'
        elif (row, col) == (0, 0):
            color = '#bbdefb'
        else:
            color = 'white'
        
        y = 3 - row
        rect = Rectangle((col, y), 1, 1, facecolor=color, 
                           edgecolor='black', linewidth=2)
        ax.add_patch(rect)
        
        # Add policy arrows (except at goal)
        if (row, col) != env.goal:
            action = deterministic_policy((row, col))
            cx, cy = col + 0.5, y + 0.5
            
            # Arrow direction based on action
            arrows = [(0, 0.3), (0.3, 0), (0, -0.3), (-0.3, 0)]  # UP, RIGHT, DOWN, LEFT
            dx, dy = arrows[action]
            
            ax.arrow(cx - dx/2, cy - dy/2, dx, dy,
                    head_width=0.15, head_length=0.1,
                    fc='#f44336', ec='#f44336', linewidth=2)

# Labels
ax.text(0.5, 3.5, 'START', ha='center', va='center', fontsize=9, 
        fontweight='bold', color='#1976d2')
ax.text(3.5, 0.5, 'GOAL', ha='center', va='center', fontsize=10, 
        fontweight='bold', color='#388e3c')

ax.set_xlim(0, 4)
ax.set_ylim(0, 4)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('Optimal Policy π*(s)\n(Arrows show best action in each state)', 
             fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("The optimal policy always moves toward the goal!")
print("• When not at goal column: go RIGHT")
print("• When at goal column: go DOWN")

---
## Summary: Key Takeaways

### What is a Policy?

| Type | Symbol | Description | Example |
|------|--------|-------------|----------|
| Deterministic | π(s) = a | Same action every time | GPS: "Turn right" |
| Stochastic | π(a\|s) | Probability of each action | ε-greedy exploration |

### Value Functions

| Function | Symbol | Question It Answers |
|----------|--------|---------------------|
| State-Value | V(s) | "How good is being in state s?" |
| Action-Value | Q(s,a) | "How good is action a in state s?" |

### Optimal = Best Possible

- **V*(s)**: Best value achievable from state s
- **Q*(s,a)**: Best value achievable starting with action a in state s
- **π*(s)**: An optimal policy, i.e. any policy that achieves V* and Q*

### The Key Insight

Once you have Q*, the optimal policy is simple:
```
π*(s) = argmax_a Q*(s, a)
```
"Just pick the action with the highest Q-value!"

---
## Test Your Understanding

**1. What is a policy?**
<details>
<summary>Click to reveal answer</summary>
A policy is a mapping from states to actions - it tells the agent what to do in each situation. It can be deterministic (always the same action) or stochastic (probability distribution over actions).
</details>

**2. What does V(s) represent?**
<details>
<summary>Click to reveal answer</summary>
V(s) is the state-value function. It represents the expected total future reward (return) starting from state s and following a particular policy. It answers: "How good is it to be in this state?"
</details>

**3. What's the difference between V(s) and Q(s,a)?**
<details>
<summary>Click to reveal answer</summary>
V(s) tells you how good a state is overall (averaging over the actions the policy might choose). Q(s,a) tells you how good a specific action is in that state. Q is more directly useful because you can pick the best action by finding argmax_a Q(s,a).
</details>

**4. How do you get the optimal policy from Q*?**
<details>
<summary>Click to reveal answer</summary>
π*(s) = argmax_a Q*(s, a). In other words, in each state, just pick the action with the highest Q-value!
</details>

**5. Why might you use a stochastic policy?**
<details>
<summary>Click to reveal answer</summary>
Three main reasons: (1) Exploration - random actions help discover better strategies. (2) Game theory - unpredictable behavior is harder for opponents to exploit. (3) Uncertainty - when the state is partially observable, hedging bets can be optimal.
</details>

---
## What's Next?

Excellent work! Now you understand policies and value functions.

In the next notebook, we'll learn about **Bellman Equations** - the mathematical foundation that allows us to COMPUTE value functions efficiently:
- The Bellman expectation equation
- The Bellman optimality equation
- How these equations connect current and future values

**Continue to:** [Notebook 5: Bellman Equations](05_bellman_equations.ipynb)

---

*Policies tell us what to do, value functions tell us how good it is. Together, they form the core of reinforcement learning!*