# Markov Decision Processes (MDPs) and Value Iteration

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand Markov Decision Processes (MDPs) fundamentals
- Implement simple MDPs and value iteration algorithms
- Set up RL environments and agents
- Apply value iteration to solve decision-making problems

## 🔗 Prerequisites

- ✅ Understanding of probability and Markov chains
- ✅ Python 3.8+ installed

---

## Official Structure Reference

This notebook covers practical activities from **Course 02, Unit 3**:
- Introduction to reinforcement learning: setting up environments and agents
- Implementing simple MDPs and value iteration algorithms
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 3 Practical Content

---

## Introduction to MDPs

**Markov Decision Processes (MDPs)** are a mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision maker (the agent).
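
In the standard notation (an assumption here, chosen to match the `SimpleMDP` constructor used later in this notebook), an MDP can be written as a tuple of states, actions, transition probabilities, rewards, and a discount factor:

$$
\mathcal{M} = (S, A, P, R, \gamma), \qquad P(s' \mid s, a) \ge 0, \qquad \sum_{s' \in S} P(s' \mid s, a) = 1, \qquad 0 \le \gamma < 1
$$

The "Markov" part means that the next state depends only on the current state and action, not on the earlier history.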


## 📥 Inputs & 📤 Outputs

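**Inputs:** What we use in this notebook

- NumPy (imported in the first code cell).
- A hand-specified three-state MDP: states, actions, transition probabilities, rewards, and a discount factor of 0.9.

**Outputs:** What you'll see when you run the cells

- A printed summary of the MDP (states, actions, discount factor).
- The value function and policy returned by value iteration, printed per state.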

---


In [None]:
import numpy as np

print("✅ Libraries imported!")
print("Ready to work with MDPs and Value Iteration!")


## Part 1: Simple MDP Implementation

Let's create a simple grid-world-style MDP: three states in a row (`S0`, `S1`, `S2`) and two actions (`Left`, `Right`).


In [None]:
class SimpleMDP:
    """Simple Markov Decision Process implementation"""
    
    def __init__(self, states, actions, transitions, rewards, gamma=0.9):
        """
        Parameters:
        - states: List of states
        - actions: List of actions
        - transitions: Dict[state][action][next_state] = probability
        - rewards: Dict[state][action] = reward
        - gamma: Discount factor
        """
        self.states = states
        self.actions = actions
        self.transitions = transitions
        self.rewards = rewards
        self.gamma = gamma
    
    def get_reward(self, state, action):
        """Get reward for state-action pair"""
        return self.rewards.get(state, {}).get(action, 0.0)
    
    def get_transition_prob(self, state, action, next_state):
        """Get transition probability P(next_state | state, action)"""
        return self.transitions.get(state, {}).get(action, {}).get(next_state, 0.0)

# Example: Simple 3-state MDP
states = ['S0', 'S1', 'S2']
actions = ['Left', 'Right']

# Transition probabilities: P(next_state | current_state, action)
transitions = {
    'S0': {
        'Left': {'S0': 0.8, 'S1': 0.2},
        'Right': {'S1': 0.9, 'S2': 0.1}
    },
    'S1': {
        'Left': {'S0': 0.7, 'S1': 0.3},
        'Right': {'S1': 0.5, 'S2': 0.5}
    },
    'S2': {
        'Left': {'S1': 1.0},
        'Right': {'S2': 1.0}  # Self-loop: S2 is absorbing only under 'Right'
    }
}

# Rewards: R(state, action)
rewards = {
    'S0': {'Left': -1, 'Right': 0},
    'S1': {'Left': 0, 'Right': 5},
    'S2': {'Left': 0, 'Right': 10}  # 'Right' pays 10 on every step spent in S2
}

mdp = SimpleMDP(states, actions, transitions, rewards)

print("=" * 60)
print("Simple MDP: Grid World")
print("=" * 60)
print(f"States: {states}")
print(f"Actions: {actions}")
print(f"Discount factor (gamma): {mdp.gamma}")


## Part 2: Value Iteration Algorithm

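Value iteration computes the optimal value function V*(s) for every state by repeatedly applying the Bellman optimality update until no value changes by more than a threshold `theta`. The optimal policy then picks, in each state, the action with the highest Q-value.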
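
Written as an equation, one sweep of value iteration updates every state as follows. This is a sketch that assumes rewards depend on the state-action pair, $R(s, a)$, as in the `SimpleMDP` class above:

$$
V_{k+1}(s) = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V_k(s') \Big]
$$

Sweeps stop once $\max_s |V_{k+1}(s) - V_k(s)| < \theta$, and the policy is read off as $\pi(s) = \arg\max_a \big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \big]$.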


In [None]:
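def value_iteration(mdp, theta=1e-6, max_iterations=1000):
    """
    Value iteration algorithm to find the optimal value function.
    
    Parameters:
    - mdp: MDP object
    - theta: Convergence threshold on the largest per-state value change
    - max_iterations: Maximum number of sweeps (with gamma=0.9 and theta=1e-6
      the example MDP needs roughly 150 sweeps, so a cap of 100 stops early)
    
    Returns:
    - V: Value function at termination (approximately optimal if converged)
    - policy: Greedy policy with respect to V
    """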
    # Initialize value function
    V = {state: 0.0 for state in mdp.states}
    
    for iteration in range(max_iterations):
        V_old = V.copy()
        
        # Update value for each state
        for state in mdp.states:
            # Compute Q-value for each action
            Q_values = []
            for action in mdp.actions:
                # Q(s,a) = R(s,a) + gamma * sum(P(s'|s,a) * V(s'))
                q_value = mdp.get_reward(state, action)
                for next_state in mdp.states:
                    prob = mdp.get_transition_prob(state, action, next_state)
                    q_value += mdp.gamma * prob * V_old[next_state]
                Q_values.append(q_value)
            
            # Value is maximum Q-value (optimal action)
            V[state] = max(Q_values) if Q_values else 0.0
        
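        # Check convergence
        max_diff = max(abs(V[state] - V_old[state]) for state in mdp.states)
        if max_diff < theta:
            print(f"Converged after {iteration + 1} iterations")
            break
    else:
        # The loop ended without a break: the threshold was never reached
        print(f"Did not converge within {max_iterations} iterations")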
    
    # Extract optimal policy
    policy = {}
    for state in mdp.states:
        Q_values = []
        for action in mdp.actions:
            q_value = mdp.get_reward(state, action)
            for next_state in mdp.states:
                prob = mdp.get_transition_prob(state, action, next_state)
                q_value += mdp.gamma * prob * V[next_state]
            Q_values.append((q_value, action))
        # Choose action with highest Q-value
        policy[state] = max(Q_values, key=lambda x: x[0])[1]
    
    return V, policy

# Run value iteration
print("=" * 60)
print("Running Value Iteration:")
print("=" * 60)

V_star, optimal_policy = value_iteration(mdp)

print("\nOptimal Value Function V*(s):")
for state, value in V_star.items():
    print(f"  V*({state}) = {value:.4f}")

print("\nOptimal Policy π*(s):")
for state, action in optimal_policy.items():
    print(f"  π*({state}) = {action}")
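
As a cross-check, the same update can be written with NumPy arrays. The sketch below is an alternative formulation, not the implementation above: the array layout `[state, action, next_state]` and the names `P`, `R`, and `greedy` are assumptions made for this example, and the numbers are copied from the three-state MDP defined earlier.

```python
import numpy as np

# Array form of the 3-state example. Axis order: [state, action, next_state],
# with states (S0, S1, S2) and actions (Left, Right).
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.9, 0.1]],  # from S0
    [[0.7, 0.3, 0.0], [0.0, 0.5, 0.5]],  # from S1
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # from S2
])
R = np.array([[-1.0, 0.0], [0.0, 5.0], [0.0, 10.0]])  # R[state, action]
gamma = 0.9
theta = 1e-6

V = np.zeros(3)
for sweep in range(1000):
    # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)
    converged = np.max(np.abs(V_new - V)) < theta
    V = V_new
    if converged:
        break

greedy = Q.argmax(axis=1)  # index 0 = Left, 1 = Right
print("V:", np.round(V, 4))
print("Greedy action index per state:", greedy)
```

Solving the Bellman equations by hand for the policy that always moves `Right` gives V*(S2) = 10 / (1 - 0.9) = 100, V*(S1) = 50 / 0.55 ≈ 90.91, and V*(S0) = 9 + 0.81 · V*(S1) ≈ 82.64. The array version should land close to these values, and so should the dictionary-based version when it is run to convergence.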


## Summary

### Key Concepts:
1. **MDP Components**: States, actions, transition probabilities, rewards, discount factor
2. **Value Function V(s)**: Expected cumulative discounted reward from state s when following a given policy (V* when following the optimal policy)
3. **Value Iteration**: Algorithm to compute optimal value function
4. **Policy**: Mapping from states to actions
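
To make the value function concrete, the sketch below computes the discounted return of a single trajectory. `discounted_return` is a hypothetical helper written for this illustration, and the reward sequence corresponds to one possible path through the example MDP (taking `Right` in S0, then in S1, then in S2).

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a finite reward sequence.

    Hypothetical helper for illustration; works backwards so each step
    is just r + gamma * (return of the remaining steps).
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Rewards collected along S0 -Right-> S1 -Right-> S2 -Right-> S2
trajectory_rewards = [0, 5, 10]
G = discounted_return(trajectory_rewards)
print(f"Discounted return: {G:.2f}")
```

Here the return is 0 + 0.9 · 5 + 0.81 · 10 = 12.6. V(s) is the expectation of this quantity over all trajectories that start in s, continued over an infinite horizon.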

### Applications:
- Robotics (path planning)
- Game AI
- Resource allocation
- Autonomous systems

**Reference:** Course 02, Unit 3: "Implementing simple MDPs and value iteration algorithms"
