In [None]:
import numpy as np

# Define the MDP parameters
num_states = 3
num_actions = 2
gamma = 0.9  # discount factor

# Define the transition probabilities and rewards
# Transitions are represented as a 3D array: P[state, action, next_state]
P = np.array([[[0.7, 0.3, 0.0], [0.0, 0.8, 0.2]],  # state 0
              [[0.0, 1.0, 0.0], [0.1, 0.0, 0.9]],  # state 1
              [[0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]])  # state 2

# Rewards are represented as a 2D array: R[state, action]
R = np.array([[1.0, -1.0],  # state 0
              [2.0, 0.0],    # state 1
              [0.0, 1.0]])   # state 2

# Initialize the value function and policy
V = np.zeros(num_states)
policy = np.zeros(num_states, dtype=int)

# Policy Improvement
def policy_improvement(V, policy):
    """Make `policy` greedy with respect to V, in place.

    Returns True if no state's action changed.
    """
    policy_stable = True
    for s in range(num_states):
        old_action = policy[s]
        action_values = np.zeros(num_actions)
        for a in range(num_actions):
            # Q(s, a) = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
            action_values[a] = R[s, a] + gamma * np.dot(P[s, a, :], V)
        policy[s] = np.argmax(action_values)
        if old_action != policy[s]:
            policy_stable = False
    return policy_stable

# Value Iteration
def value_iteration():
    """Update the global V in place until the largest change in a sweep is below epsilon."""
    epsilon = 1e-6
    while True:
        delta = 0
        for s in range(num_states):
            v = V[s]
            action_values = np.zeros(num_actions)
            for a in range(num_actions):
                # The reward is indexed by the current state s, matching R[state, action]
                action_values[a] = R[s, a] + gamma * np.dot(P[s, a, :], V)
            V[s] = np.max(action_values)
            delta = max(delta, np.abs(v - V[s]))
        if delta < epsilon:
            break

# Run Value Iteration
value_iteration()

# Run Policy Improvement
policy_stable = policy_improvement(V, policy)

# Display the results (values rounded to 3 decimals)
print("Optimal Value Function:")
print(np.round(V, 3))
print("\nOptimal Policy:")
print(policy)

if policy_stable:
    print("\nPolicy is stable.")
else:
    print("\nPolicy is not stable.")


Optimal Value Function:
[17.297 20.    17.027]

Optimal Policy:
[0 0 0]

Policy is stable.


This code implements value iteration for a small Markov Decision Process (MDP) with 3 states and 2 actions, then applies one greedy policy-improvement step to read the optimal policy off the converged value function. Let's break it down:

1. **MDP Parameters**:
   - The code defines the parameters of the MDP: the number of states (\( S \)), the number of actions (\( A \)), and the discount factor (\( \gamma \)).

2. **Transition Probabilities and Rewards**:
   - The transition probabilities (\( P \)) and rewards (\( R \)) are defined as NumPy arrays.
   - Transitions are represented as a 3D array where \( P[state, action, next\_state] \) gives the probability of moving from a state to the next state given an action. Each \( P[state, action, :] \) row sums to 1.
   - Rewards are represented as a 2D array where \( R[state, action] \) gives the immediate reward for taking an action in a particular state.

3. **Initialization**:
   - The value function (\( V \)) is initialized to zeros, and the policy (\( \pi \)) starts with action 0 in every state.

4. **Policy Improvement Function**:
   - The policy_improvement() function makes the policy greedy with respect to the current value function, updating it in place.
   - For each state, it scores every action with a one-step lookahead (the immediate reward plus the discounted expected value of the next state) and selects the highest-scoring action.
   - It returns True if no state's action changed, which means the policy was already greedy (stable).

5. **Value Iteration Function**:
   - The value_iteration() function computes the optimal value function by repeatedly applying the Bellman optimality backup to every state.
   - Values are updated in place, and the loop stops once the largest change to any state's value in a sweep falls below \( \epsilon = 10^{-6} \).

6. **Run Value Iteration**:
   - The value_iteration() function is called to compute the optimal value function.

7. **Run Policy Improvement**:
   - The policy_improvement() function is called once to extract the greedy policy from the converged value function.

8. **Display Results**:
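   - The optimal value function (\( V \)) and policy (\( \pi \)) are printed.
   - The stability flag is printed as well. Because the policy starts as all zeros, "stable" here means the greedy policy coincides with that initial policy.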
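
As a worked check on the steps above, and assuming rewards depend only on the current state and action (as the \( R[state, action] \) layout indicates), the Bellman optimality equation that value iteration applies can be written as:

\[
V^*(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]
\]

For state 1 under action 0, the arrays give \( R(1, 0) = 2 \) and a self-transition with probability 1. Action 0 is the one the printed policy selects for state 1, so the value satisfies:

\[
V^*(1) = 2 + 0.9\, V^*(1) \quad\Longrightarrow\quad V^*(1) = \frac{2}{1 - 0.9} = 20
\]

Starting from zeros, the iteration approaches this value from below and stops slightly short of it once the per-sweep change falls under the tolerance.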

This code provides a complete implementation of the Value Iteration algorithm for finding the optimal policy in an MDP and demonstrates its usage with a simple example.
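
The per-state loops can also be written as a single array expression. The following is a minimal sketch rather than the original implementation: it assumes the same `P`, `R`, and `gamma` as above, treats rewards as depending on the current state and action, and uses synchronous updates (every state is backed up from the previous sweep's values) instead of the in-place updates used earlier. The helper name `bellman_backup` is hypothetical.

```python
import numpy as np

gamma = 0.9  # discount factor, as in the example above

# P[state, action, next_state] and R[state, action], copied from the example
P = np.array([[[0.7, 0.3, 0.0], [0.0, 0.8, 0.2]],
              [[0.0, 1.0, 0.0], [0.1, 0.0, 0.9]],
              [[0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]])
R = np.array([[1.0, -1.0],
              [2.0, 0.0],
              [0.0, 1.0]])

def bellman_backup(V):
    """One synchronous Bellman optimality backup (hypothetical helper).

    Returns the updated values and the greedy action for each state.
    """
    # P @ V has shape (num_states, num_actions): expected next-state value
    Q = R + gamma * (P @ V)
    return Q.max(axis=1), Q.argmax(axis=1)

V = np.zeros(P.shape[0])
for _ in range(10_000):  # safety cap; the tolerance check ends the loop much earlier
    V_new, greedy_policy = bellman_backup(V)
    converged = np.max(np.abs(V_new - V)) < 1e-6
    V = V_new
    if converged:
        break

print(np.round(V, 3))
print(greedy_policy)
```

Synchronous updates typically need a few more sweeps than in-place updates, but both converge to the same fixed point because the backup is a contraction for \( \gamma < 1 \).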