# Comparing Policy Iteration vs Value Iteration

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand the Policy Iteration algorithm
- Understand the Value Iteration algorithm
- Compare Policy Iteration with Value Iteration
- Implement both algorithms
- Analyze convergence and computational efficiency

## 🔗 Prerequisites

- ✅ Understanding of MDPs (states, actions, rewards, transitions)
- ✅ Understanding of Bellman equations
- ✅ Understanding of policies and value functions
- ✅ Python knowledge (functions, loops, NumPy)
- ✅ Dynamic Programming basics

---

## Official Structure Reference

This notebook covers practical activities from **Course 09, Unit 2**:
- Comparing policy iteration vs value iteration through code-based experiments
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 2 Practical Content

---

## Introduction

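**Policy Iteration** and **Value Iteration** are two fundamental Dynamic Programming algorithms for solving MDPs whose transition probabilities and rewards are known. For a finite MDP with a discount factor below 1, both converge to an optimal policy, but they organize the computation differently: Policy Iteration does more work per iteration and usually needs fewer iterations, while Value Iteration does less work per iteration and usually needs more.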
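
The functions in this notebook index the model as `P[s][a][s_next]` and `R[s][a][s_next]`. As a minimal sketch of one way to store such a model, the block below uses NumPy arrays of shape `(n_states, n_actions, n_states)`. The two-state, two-action numbers and the helper name `check_mdp` are illustrative assumptions, not part of the course material.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
# P[s, a, s_next] = probability of landing in s_next after taking action a in s.
P = np.array([
    [[0.9, 0.1],   # state 0, action 0
     [0.2, 0.8]],  # state 0, action 1
    [[1.0, 0.0],   # state 1, action 0
     [0.0, 1.0]],  # state 1, action 1
])

# R[s, a, s_next] = reward received on that transition.
R = np.array([
    [[0.0, 1.0],
     [0.0, 1.0]],
    [[0.0, 0.0],
     [0.0, 2.0]],
])


def check_mdp(P, R):
    """Return (n_states, n_actions) after basic sanity checks on the model."""
    n_states, n_actions, n_next = P.shape
    assert n_next == n_states, "last axis of P must index next states"
    assert R.shape == P.shape, "R must have the same shape as P"
    assert np.all(P >= 0), "probabilities must be non-negative"
    # Each (s, a) pair must define a probability distribution over s_next.
    assert np.allclose(P.sum(axis=2), 1.0), "each P[s, a, :] must sum to 1"
    return n_states, n_actions


n_states, n_actions = check_mdp(P, R)
print(n_states, n_actions)
```

NumPy arrays accept both `P[s][a][s_next]` and `P[s, a, s_next]`, so code written with the chained form works on arrays like these without changes.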

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy

print("✅ Libraries imported!")
print("\nComparing Policy Iteration vs Value Iteration")
print("=" * 60)

## Part 1: Understanding the Algorithms
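
Both algorithms are built from the same one-step lookahead and differ only in which action they back up. The sketch below computes that lookahead for a single state, with made-up numbers and an assumed discount factor of 0.9, and then shows the two backups side by side.

```python
import numpy as np

gamma = 0.9  # discount factor (assumed for this example)

# Model for one state s with 2 actions and 2 possible next states.
# Rows are actions, columns are next states; numbers are illustrative only.
P_s = np.array([[0.5, 0.5],
                [0.0, 1.0]])
R_s = np.array([[1.0, 0.0],
                [0.0, 2.0]])
V = np.array([10.0, 20.0])  # current value estimates of the next states

# One-step lookahead: Q(s, a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
Q_s = (P_s * (R_s + gamma * V)).sum(axis=1)

# Policy evaluation backs up the action the current policy prescribes...
action_under_policy = 0
v_policy_eval = Q_s[action_under_policy]

# ...whereas value iteration backs up the best action.
v_value_iter = Q_s.max()
greedy_action = int(Q_s.argmax())

print(Q_s, v_policy_eval, v_value_iter, greedy_action)
```

With these numbers the lookahead gives Q(s, 0) = 14 and Q(s, 1) = 20. A policy that picks action 0 is therefore evaluated at 14, while the value-iteration backup takes 20, and the greedy improvement step would switch the policy to action 1.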


In [None]:
print("=" * 60)
print("Part 1: Understanding the Algorithms")
print("=" * 60)

print("\nPolicy Iteration:")
print("  1. Policy Evaluation: Compute V^π(s) for current policy π")
print("  2. Policy Improvement: Update policy to be greedy w.r.t. V^π")
print("  3. Repeat until policy no longer changes")
print("  - Alternates between evaluation and improvement")
print("  - Typically converges in a few iterations")

print("\nValue Iteration:")
print("  1. Initialize V(s) = 0 for all states")
print("  2. Update: V(s) ← max_a Σ P(s'|s,a)[R + γV(s')]")
print("  3. Repeat until convergence")
print("  4. Extract optimal policy from V*")
print("  - Combines evaluation and improvement in each step")
print("  - May require more iterations but each is faster")

print("\nKey Differences:")
print("  - Policy Iteration: Two-phase (evaluate then improve)")
print("  - Value Iteration: Single-phase (combine evaluate + improve)")
print("  - Policy Iteration: Fewer iterations, more computation per iteration")
print("  - Value Iteration: More iterations, less computation per iteration")

print("\n✅ Algorithm concepts understood!")

## Part 2: Policy Iteration Implementation
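
The `policy_evaluation` function in this part computes the value of a policy by repeated sweeps. For small models, a common alternative is to solve the linear system (I − γ P_π) V = r_π directly. The sketch below shows that substitute technique; it is not the method used in this notebook, and the name `policy_evaluation_exact` and the toy model are assumptions made for illustration.

```python
import numpy as np


def policy_evaluation_exact(policy, P, R, gamma=0.99):
    """Solve (I - gamma * P_pi) V = r_pi for a deterministic policy.

    Assumes P and R are arrays of shape (n_states, n_actions, n_states)
    and gamma < 1, so the system has a unique solution.
    """
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    policy = np.asarray(policy)
    n_states = P.shape[0]
    states = np.arange(n_states)

    # Transition matrix and expected one-step reward under the policy.
    P_pi = P[states, policy]                        # shape (S, S)
    r_pi = (P_pi * R[states, policy]).sum(axis=1)   # shape (S,)

    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


# Toy check (made-up model with a single action): state 0 always moves to
# state 1 with reward 1, and state 1 loops on itself with reward 0.
P = np.array([[[0.0, 1.0]], [[0.0, 1.0]]])
R = np.array([[[0.0, 1.0]], [[0.0, 0.0]]])
V = policy_evaluation_exact(np.array([0, 0]), P, R, gamma=0.9)
print(V)
```

In the toy model the only reward is the single step from state 0 to state 1, so the solution is V = [1, 0] for any discount factor below 1. A direct solve avoids choosing a threshold and an iteration cap, at the price of cubic cost in the number of states.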


In [None]:
print("\n" + "=" * 60)
print("Part 2: Policy Iteration Implementation")
print("=" * 60)

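def policy_evaluation(policy, P, R, gamma=0.99, theta=1e-6, max_iterations=10_000):
    """Evaluate a deterministic policy by iterating its Bellman expectation backup."""
    n_states = len(policy)
    V = np.zeros(n_states)

    # With gamma close to 1, reaching theta can take well over 1,000 sweeps,
    # so the cap is generous; it only guards against a non-terminating loop.
    for _ in range(max_iterations):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            a = policy[s]
            # V(s) = Σ P(s'|s,a)[R(s,a,s') + γV(s')]
            V_new[s] = sum(P[s][a][s_next] * (R[s][a][s_next] + gamma * V[s_next])
                           for s_next in range(n_states))

        # Keep the newest estimate before testing for convergence,
        # so the returned V includes the final sweep.
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            break

    return V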

def policy_improvement(V, P, R, gamma=0.99):
    """Improve policy to be greedy w.r.t. current value function."""
    n_states, n_actions = len(V), len(P[0])
    policy = np.zeros(n_states, dtype=int)
    
    for s in range(n_states):
        # Choose action that maximizes: Σ P(s'|s,a)[R + γV(s')]
        action_values = [sum(P[s][a][s_next] * (R[s][a][s_next] + gamma * V[s_next])
                            for s_next in range(n_states))
                        for a in range(n_actions)]
        policy[s] = np.argmax(action_values)
    
    return policy

def policy_iteration(P, R, gamma=0.99, theta=1e-6, max_iterations=1000):
    """Policy Iteration: alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = len(P), len(P[0])
    policy = np.random.randint(0, n_actions, n_states)  # Random initial policy

    iterations = 0
    policy_stable = False

    # The cap guards against cycling, which approximate evaluation could cause.
    while not policy_stable and iterations < max_iterations:
        # Policy evaluation
        V = policy_evaluation(policy, P, R, gamma, theta)

        # Policy improvement
        policy_new = policy_improvement(V, P, R, gamma)

        # Check if policy changed
        policy_stable = np.array_equal(policy, policy_new)
        policy = policy_new
        iterations += 1

    return policy, V, iterations

print("\n✅ Policy Iteration implemented!")
print("  Algorithm: Evaluate → Improve → Repeat until stable")

## Part 3: Value Iteration Implementation
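
The `value_iteration` function in this part loops over states, actions, and next states in Python. As a sketch of an equivalent formulation, the block below performs each sweep with a single NumPy expression. It assumes `P` and `R` are arrays of shape `(n_states, n_actions, n_states)`; the name `value_iteration_vectorized` and the toy model are hypothetical.

```python
import numpy as np


def value_iteration_vectorized(P, R, gamma=0.99, theta=1e-6, max_iterations=10_000):
    """Value iteration with one NumPy expression per sweep.

    Returns (greedy policy, value estimates, number of sweeps performed).
    """
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    # Expected immediate reward r(s, a) does not depend on V; compute it once.
    r = (P * R).sum(axis=2)
    V = np.zeros(P.shape[0])

    sweeps = 0
    for sweeps in range(1, max_iterations + 1):
        # Q[s, a] = r(s, a) + gamma * sum_s' P[s, a, s'] * V[s']
        Q = r + gamma * (P @ V)
        V_new = Q.max(axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            break

    # Greedy policy with respect to the final value estimates.
    policy = (r + gamma * (P @ V)).argmax(axis=1)
    return policy, V, sweeps


# Toy model (made up): in state 0, action 0 stays put with reward 0 and
# action 1 moves to state 1 with reward 1; state 1 is absorbing with reward 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[[0.0, 0.0], [0.0, 1.0]],
              [[0.0, 0.0], [0.0, 0.0]]])
policy, V, sweeps = value_iteration_vectorized(P, R, gamma=0.9)
print(policy, V, sweeps)
```

On this deterministic toy model the values settle at [1, 0] after two sweeps: the first sweep finds the reward, and the second confirms that nothing changes. The greedy action in state 0 is action 1.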


In [None]:
print("\n" + "=" * 60)
print("Part 3: Value Iteration Implementation")
print("=" * 60)

def value_iteration(P, R, gamma=0.99, theta=1e-6, max_iterations=10_000):
    """Value Iteration: repeat Bellman optimality backups, then extract a greedy policy."""
    n_states, n_actions = len(P), len(P[0])
    V = np.zeros(n_states)

    iterations = 0
    for _ in range(max_iterations):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            # V(s) ← max_a Σ P(s'|s,a)[R + γV(s')]
            action_values = [sum(P[s][a][s_next] * (R[s][a][s_next] + gamma * V[s_next])
                                 for s_next in range(n_states))
                             for a in range(n_actions)]
            V_new[s] = max(action_values)

        # Count every sweep, including the one that meets the threshold,
        # and keep its result so the returned V is the newest estimate.
        delta = np.max(np.abs(V_new - V))
        V = V_new
        iterations += 1
        if delta < theta:
            break

    # Extract a greedy policy from the converged values
    policy = policy_improvement(V, P, R, gamma)

    return policy, V, iterations

print("\n✅ Value Iteration implemented!")
print("  Algorithm: Update V(s) ← max_a[...] until convergence, then extract policy")

## Part 4: Comparison and Analysis
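
The trade-off discussed in this part can be checked on a concrete model. The sketch below is self-contained rather than reusing the functions from Parts 2 and 3. The "slippery corridor" MDP, the function names, and the choice of a discount factor of 0.9 are all illustrative assumptions.

```python
import numpy as np


def q_values(P, r, V, gamma):
    """Q[s, a] = r[s, a] + gamma * sum over s' of P[s, a, s'] * V[s']."""
    return r + gamma * (P @ V)


def run_value_iteration(P, r, gamma, theta, max_sweeps=10_000):
    V = np.zeros(P.shape[0])
    sweeps = 0
    while sweeps < max_sweeps:
        V_new = q_values(P, r, V, gamma).max(axis=1)
        delta = np.max(np.abs(V_new - V))
        V, sweeps = V_new, sweeps + 1
        if delta < theta:
            break
    return q_values(P, r, V, gamma).argmax(axis=1), V, sweeps


def run_policy_iteration(P, r, gamma, theta, max_sweeps=10_000):
    n_states = P.shape[0]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)  # fixed start: always action 0
    iterations, eval_sweeps = 0, 0
    while iterations < 1_000:
        # Iterative policy evaluation, restarted from zero for each policy.
        V = np.zeros(n_states)
        for _ in range(max_sweeps):
            V_new = q_values(P, r, V, gamma)[states, policy]
            delta = np.max(np.abs(V_new - V))
            V, eval_sweeps = V_new, eval_sweeps + 1
            if delta < theta:
                break
        new_policy = q_values(P, r, V, gamma).argmax(axis=1)
        iterations += 1
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V, iterations, eval_sweeps


# Hypothetical "slippery corridor": states 0..3, state 3 is an absorbing goal.
# Action 0 = left, action 1 = right; a move succeeds with probability 0.8,
# otherwise the agent stays where it is. Entering the goal pays reward 1.
n_states, n_actions, goal, p_move = 4, 2, 3, 0.8
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))
for s in range(goal):
    for a, step in enumerate((-1, +1)):
        target = min(max(s + step, 0), goal)
        P[s, a, target] += p_move
        P[s, a, s] += 1.0 - p_move
P[goal, :, goal] = 1.0
R[goal - 1, 1, goal] = 1.0   # reward for stepping from state 2 into the goal
r = (P * R).sum(axis=2)      # expected immediate reward r(s, a)

gamma, theta = 0.9, 1e-8
vi_policy, vi_V, vi_sweeps = run_value_iteration(P, r, gamma, theta)
pi_policy, pi_V, pi_iters, pi_eval_sweeps = run_policy_iteration(P, r, gamma, theta)

print("VI:", vi_policy, "sweeps:", vi_sweeps)
print("PI:", pi_policy, "iterations:", pi_iters, "evaluation sweeps:", pi_eval_sweeps)
```

Both runs should agree on moving right in every non-goal state. Policy Iteration reports fewer outer iterations than Value Iteration reports sweeps, but `pi_eval_sweeps` counts the inner evaluation sweeps those iterations hide. Comparing that count with `vi_sweeps` is a fairer measure of total work than comparing iteration counts alone.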


In [None]:
print("\n" + "=" * 60)
print("Part 4: Comparison and Analysis")
print("=" * 60)

# No concrete MDP is built in this cell; the comparison below is qualitative.
print("\nQualitative comparison (no concrete MDP is solved in this cell)")

print("\nPolicy Iteration Characteristics:")
print("  - Two phases: Policy Evaluation + Policy Improvement")
print("  - Fewer iterations (policy changes)")
print("  - More computation per iteration (full policy evaluation)")
print("  - Policy converges, then stops")

print("\nValue Iteration Characteristics:")
print("  - Single phase: Combined evaluation + improvement")
print("  - More iterations (until value convergence)")
print("  - Less computation per iteration (one sweep)")
print("  - Values converge, then extract policy")

print("\nComparison Table:")
print("  Feature              | Policy Iteration | Value Iteration")
print("  ---------------------|------------------|----------------")
print("  Iterations           | Fewer            | More")
print("  Computation/iteration| More             | Less")
print("  Convergence criterion| Policy stable    | Value stable")
print("  Convergence speed    | Fast (few iter)  | Moderate")
print("  Best when            | Small state space| Large state space")

print("\n✅ Comparison complete!")

print("\nNote: Full implementation requires proper MDP definition")
print("(states, actions, transition probabilities, rewards)")
print("This notebook demonstrates the algorithmic concepts and differences.")

## Summary

### Key Concepts:
1. **Policy Iteration**: Two-phase algorithm
   - Phase 1: Policy Evaluation (compute V^π)
   - Phase 2: Policy Improvement (update π to be greedy)
   - Repeats until policy converges

2. **Value Iteration**: Single-phase algorithm
   - Updates: V(s) ← max_a Σ P(s'|s,a)[R + γV(s')]
   - Continues until value function converges
   - Extracts optimal policy at the end

### Comparison:
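- **Iterations**: Policy Iteration typically needs fewer iterations, because each one includes a full policy evaluation
- **Computation**: Each Value Iteration sweep is cheaper than a Policy Iteration step; which algorithm costs less overall depends on the problem
- **Convergence**: Policy Iteration stops when the policy is stable; Value Iteration stops when the values change by less than a threshold
- **Use Case**: Policy Iteration for small problems; Value Iteration for larger problems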
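
One way to see why Value Iteration can need many iterations: each sweep is a contraction with factor γ in the max norm, so in the worst case the error bound shrinks by only a factor of γ per sweep. The sketch below turns that bound into a sweep count; the helper name `sweeps_to_shrink` is hypothetical, and the result is a worst-case estimate that many MDPs beat in practice.

```python
import math


def sweeps_to_shrink(gamma, factor):
    """Smallest k with gamma**k <= factor, for 0 < gamma < 1 and 0 < factor < 1."""
    return math.ceil(math.log(factor) / math.log(gamma))


# Worst-case sweeps needed to shrink the error bound by a factor of one million.
for gamma in (0.5, 0.9, 0.99):
    print(gamma, sweeps_to_shrink(gamma, 1e-6))
```

The count grows quickly as γ approaches 1: 20 sweeps for γ = 0.5, 132 for γ = 0.9, and 1375 for γ = 0.99. The same bound applies to iterative policy evaluation, which is worth keeping in mind when choosing an iteration cap for either algorithm.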

### Advantages:
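- **Policy Iteration**: Reaches an exact optimal policy in a finite number of iterations on a finite MDP (given exact policy evaluation); clear policy updates
- **Value Iteration**: Simpler, with cheaper iterations and no inner evaluation loop; works well with approximations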

### Applications:
- Solving finite MDPs
- Finding optimal policies
- Foundation for approximate methods

### Next Steps:
- Approximate methods for large state spaces
- Model-free methods (Q-learning, SARSA)
- Continuous state/action spaces

**Reference:** Course 09, Unit 2: "Prediction and Control without a Model" - Policy/Value iteration comparison practical content