# Lesson 20 - Reinforcement Learning and Value Iteration


## Objectives
- Define an MDP with states, actions, rewards, and transitions.
- Implement value iteration for a small MDP.
- Visualize value function convergence.


## From the notes

**MDP**
- States $S$, actions $A$, transition $P(s'|s,a)$, rewards $R(s,a)$, discount $\gamma$.
- Bellman optimality: $V^*(s) = \max_a \sum_{s'} P(s'|s,a)(R(s,a)+\gamma V^*(s'))$.
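
Turning the optimality equation into an update rule gives the value iteration backup. The sketch below writes it through an action-value function; the symbols $Q_k$, $V_k$, and $\pi$ are notation assumed here for illustration, not taken from the notes. Because $\sum_{s'} P(s'|s,a) = 1$, pulling $R(s,a)$ out of the sum leaves the equation above unchanged.

$$
Q_k(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V_k(s'), \qquad
V_{k+1}(s) = \max_a Q_k(s,a).
$$

Once $V_k$ has stopped changing, a greedy policy can be read off as $\pi(s) \in \arg\max_a Q_k(s,a)$.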

_TODO: Validate RL notation with the CS229 main notes PDF._


## Intuition
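Value iteration starts from an arbitrary value function (here, all zeros) and repeatedly applies the Bellman optimality operator. For $\gamma < 1$ the operator is a contraction in the max norm, so the iterates converge to $V^*$ regardless of the starting point.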
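
One way to make "until the value function stabilizes" concrete is to stop when the largest per-state change falls below a tolerance. The following is a minimal sketch under that assumption; the function name `value_iteration_tol`, the parameters `tol` and `max_iters`, and the two-state toy MDP (`P_toy`, `R_toy`) are all hypothetical and chosen only for illustration.

```python
import numpy as np

def value_iteration_tol(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Apply Bellman backups until the max-norm change drops below `tol`."""
    V = np.zeros(P.shape[0])
    k = 0
    for k in range(1, max_iters + 1):
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    return V, k

# Hypothetical toy MDP: in state 0, action 0 stays (reward 1) and
# action 1 moves to state 1 (reward 0); state 1 always stays (reward 2).
P_toy = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
])
R_toy = np.array([
    [1.0, 0.0],
    [2.0, 2.0],
])

V_toy, n_iters = value_iteration_tol(P_toy, R_toy, gamma=0.9)
print(V_toy, n_iters)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever is worth $2/(1-0.9) = 20$, and moving there from state 0 is worth $0.9 \cdot 20 = 18$, which beats staying in state 0 (worth 10).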


## Data
We define a 3-state MDP with two actions. State 2 is absorbing under both actions and yields zero reward, so its value is 0.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

states = [0, 1, 2]
actions = [0, 1]
gamma = 0.9

# Transition probabilities P[s, a, s']
P = np.array([
    [[0.7, 0.3, 0.0], [0.4, 0.6, 0.0]],
    [[0.1, 0.6, 0.3], [0.0, 0.2, 0.8]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])

# Rewards R[s, a]
R = np.array([
    [1.0, 0.5],
    [0.0, 0.2],
    [0.0, 0.0],
])

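def value_iteration(P, R, gamma=0.9, iters=50):
    """Run `iters` synchronous Bellman backups starting from V = 0.

    Returns the final values and the per-iteration history, shape (iters, n_states).
    """
    V = np.zeros(P.shape[0])
    history = []
    for _ in range(iters):
        # P @ V has shape (n_states, n_actions): expected next-state value per (s, a)
        V_new = np.max(R + gamma * (P @ V), axis=1)
        history.append(V_new.copy())
        V = V_new
    return V, np.array(history)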

V, history = value_iteration(P, R, gamma)


## Experiments


In [None]:
V
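
Beyond inspecting the values, a natural follow-up is to read off the greedy policy they imply. The sketch below repeats the MDP definition so it runs on its own; the names `V_star`, `Q`, and `policy`, and the choice of 200 iterations, are assumptions made for illustration.

```python
import numpy as np

gamma = 0.9

# Same transition tensor P[s, a, s'] and rewards R[s, a] as in the Data section
P = np.array([
    [[0.7, 0.3, 0.0], [0.4, 0.6, 0.0]],
    [[0.1, 0.6, 0.3], [0.0, 0.2, 0.8]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
R = np.array([
    [1.0, 0.5],
    [0.0, 0.2],
    [0.0, 0.0],
])

# Iterate the Bellman backup long enough for the values to settle
V_star = np.zeros(P.shape[0])
for _ in range(200):
    V_star = np.max(R + gamma * (P @ V_star), axis=1)

# Action values under the converged V, then the greedy action per state
Q = R + gamma * (P @ V_star)
policy = np.argmax(Q, axis=1)

print("V:", np.round(V_star, 4))
print("greedy policy:", policy)
```

In state 2 both actions have identical action values, so `np.argmax` simply returns the first one; either choice is optimal there.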


## Visualizations


In [None]:
plt.figure(figsize=(6,4))
for s in range(history.shape[1]):
    plt.plot(history[:, s], label=f"state {s}")
plt.title("Value iteration convergence")
plt.xlabel("iteration")
plt.ylabel("V(s)")
plt.legend()
plt.show()

plt.figure(figsize=(6,4))
plt.bar(states, V)
plt.xticks(states)  # label only the integer state indices
plt.title("Final value function")
plt.xlabel("state")
plt.ylabel("V(s)")
plt.show()


## Takeaways
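- Value iteration applies Bellman optimality updates repeatedly; this notebook runs a fixed 50 iterations rather than testing for convergence.
- The discount factor $\gamma$ controls how much future rewards count relative to immediate ones: values near 0 make the agent myopic, values near 1 make it far-sighted.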
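
The effect of the discount factor can be checked directly by solving the same MDP for several values of $\gamma$. This is a rough sketch: the helper `solve`, the dictionary `values_by_gamma`, the particular discount values, and the fixed 500 iterations are all assumptions for illustration.

```python
import numpy as np

# Same transition tensor P[s, a, s'] and rewards R[s, a] as in the Data section
P = np.array([
    [[0.7, 0.3, 0.0], [0.4, 0.6, 0.0]],
    [[0.1, 0.6, 0.3], [0.0, 0.2, 0.8]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
R = np.array([
    [1.0, 0.5],
    [0.0, 0.2],
    [0.0, 0.0],
])

def solve(P, R, gamma, iters=500):
    """Fixed-iteration value iteration starting from V = 0."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(R + gamma * (P @ V), axis=1)
    return V

values_by_gamma = {g: solve(P, R, g) for g in (0.0, 0.5, 0.9)}
for g, V_g in values_by_gamma.items():
    print(f"gamma={g}: V={np.round(V_g, 3)}")
```

With $\gamma = 0$ the values reduce to the best immediate reward in each state, $\max_a R(s,a)$; as $\gamma$ grows, the value of state 0 rises because future rewards are weighted more heavily.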


## Explain it in an interview
- Explain the Bellman optimality equation.
- Describe the effect of changing gamma.


## Exercises
- Add a terminal state with reward and recompute values.
- Compare value iteration to policy iteration.
- Change the transition probabilities and observe how the converged values and the speed of convergence respond.
