import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iter=100, tol=1e-6):
    # P[s, a, s'] is the probability of reaching s' from s under action a,
    # so P[s, a, :] must sum to 1. R[s, a] is the immediate reward.
    S, A = R.shape  # Number of states and actions

    # Initialize value function and policy (one action index per state)
    V = np.zeros(S)
    pi = np.zeros(S, dtype=int)

    for _ in range(max_iter):
        # Policy evaluation (one in-place sweep, taking the max over actions)
        delta = 0
        for i in range(S):
            v_old = V[i]
            V[i] = max(R[i, a] + gamma * np.dot(P[i, a, :], V) for a in range(A))
            delta = max(delta, abs(v_old - V[i]))

        # Policy improvement: greedy action with respect to the updated V
        for i in range(S):
            pi[i] = np.argmax(R[i, :] + gamma * np.dot(P[i, :, :], V))

        if delta < tol:
            break  # Converged

    return V, pi

# Example usage with a simple MDP (adapt matrices as needed)
P = np.array([[[0.5, 0.5], [0.8, 0.2]], [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1, 0], [0, 2]])
gamma = 0.9

V, pi = policy_iteration(P, R, gamma)

print("Value function:", V)
print("Optimal policy:", pi)



Value function: [4179380.69718659 3962077.1204357 ]
Optimal policy: [[1 1]
 [0 0]]


"""
    Policy iteration algorithm for solving an MDP.

    Args:
        P: Transition probabilities (S x A x S), with P[s, a, s'] = P(s' | s, a).
        R: Reward matrix (S x A).
        gamma: Discount factor (0 < gamma < 1).
        max_iter: Maximum number of iterations.
        tol: Tolerance for convergence.

    Returns:
        V: Value function (array of length S).
        pi: Greedy policy, giving the chosen action index for each state.
    """

This Python code implements the Policy Iteration algorithm for solving Markov Decision Processes (MDPs). Here's a breakdown of the code:

1. **Importing Libraries**: The code imports NumPy, a library for numerical computing in Python.

2. **policy_iteration() Function**: This function performs policy iteration to find the optimal policy for the given MDP. It takes transition probabilities (P), rewards (R), discount factor (gamma), maximum number of iterations (max_iter), and tolerance for convergence (tol) as inputs.

3. **Initialization**: It initializes the value function (V) and policy (pi) arrays with zeros.

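4. **Policy Evaluation**: On each outer iteration it makes one in-place sweep over the states, setting V[i] to the maximum over actions (a) of the immediate reward plus the discounted expected value of the next state. Taking the maximum over all actions is the Bellman optimality backup used in value iteration; textbook policy iteration would instead evaluate only the action chosen by the current policy (pi) at this step.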

5. **Policy Improvement**: It updates the policy based on the updated value function. For each state (i), it selects the action that maximizes the expected future reward.

6. **Convergence Check**: It checks for convergence by comparing delta, the largest change in any state's value during the sweep, with the specified tolerance (tol). If delta is below the tolerance, the loop exits; otherwise the function stops after max_iter iterations.

7. **Return**: The function returns the optimal value function (V) and policy (pi).

8. **Example Usage**: An example usage of the policy_iteration() function is provided with a simple MDP defined by transition probabilities (P) and rewards (R).

9. **Print Results**: The optimal value function (V) and policy (pi) are printed.

Overall, this code demonstrates how to apply the Policy Iteration algorithm to find the optimal policy for a given Markov Decision Process.
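
For comparison, textbook policy iteration alternates a full evaluation of the current policy with a greedy improvement step, and stops when the policy no longer changes. The following is a minimal sketch of that variant, not the code above. It assumes P is indexed as P[s, a, s'] and R as R[s, a], and it evaluates each policy exactly by solving a linear system. The name policy_iteration_exact is hypothetical.

```python
import numpy as np

def policy_iteration_exact(P, R, gamma=0.9, max_iter=100):
    """Policy iteration sketch; assumes P[s, a, s'] and R[s, a]."""
    S, A = R.shape
    states = np.arange(S)
    pi = np.zeros(S, dtype=int)              # start with action 0 in every state
    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi for the current policy.
        P_pi = P[states, pi]                 # (S, S) transitions under pi
        R_pi = R[states, pi]                 # (S,) rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: choose the greedy action for each state.
        Q = R + gamma * (P @ V)              # (S, A) action values
        new_pi = np.argmax(Q, axis=1)
        if np.array_equal(new_pi, pi):       # policy is stable, so stop
            break
        pi = new_pi
    return V, pi

# Same example MDP as above, read as P[s, a, s'].
P = np.array([[[0.5, 0.5], [0.8, 0.2]], [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1, 0], [0, 2]])
V, pi = policy_iteration_exact(P, R, gamma=0.9)
print("Value function:", V)
print("Policy:", pi)
```

Solving the linear system is practical only for small state spaces; for larger problems the evaluation step is usually replaced by a few iterative sweeps.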


A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making problems in which an agent interacts with an environment over a series of discrete time steps.

In an MDP:
- At each time step, the agent occupies a state \( s \) from a set of states \( S \) and chooses an action \( a \) from the set of actions \( A \) available in that state.
- The environment transitions the agent from its current state to a new state based on the chosen action. These transitions are probabilistic and governed by transition probabilities \( P(s'|s, a) \), which represent the probability of transitioning to state \( s' \) from state \( s \) when action \( a \) is taken.
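- Upon transitioning to a new state, the agent receives a reward \( R(s, a, s') \), which represents the immediate benefit or cost associated with the transition. The code uses the simpler form \( R(s, a) \), in which the reward depends only on the state and the action.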
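
Because each \( P(\cdot|s, a) \) is a probability distribution, its entries must be non-negative and sum to 1 over the next states \( s' \). A small check along these lines can catch an array stored with its axes in the wrong order. This is only a sketch: the helper name check_transitions is hypothetical, and it assumes the array is indexed as P[s, a, s'].

```python
import numpy as np

def check_transitions(P, atol=1e-8):
    """Return True if P[s, a, :] is a probability distribution for every (s, a)."""
    P = np.asarray(P, dtype=float)
    nonnegative = np.all(P >= 0)
    normalized = np.allclose(P.sum(axis=-1), 1.0, atol=atol)
    return bool(nonnegative and normalized)

# The example array from the code above.
P = np.array([[[0.5, 0.5], [0.8, 0.2]], [[0.7, 0.3], [0.1, 0.9]]])

print(check_transitions(P))                     # read as P[s, a, s']: True
print(check_transitions(np.swapaxes(P, 1, 2)))  # read as P[s, s', a]: False
```

The example array is therefore a valid set of transition probabilities only when its last axis is the next state.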

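The objective in an MDP is to find a policy \( \pi \), which is a mapping from states to actions, that maximizes the expected cumulative discounted reward, in which a reward received \( t \) steps in the future is weighted by \( \gamma^t \) (gamma in the code).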
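
Written out, this objective and the optimality condition it leads to take the following standard form. Here \( \gamma \) is the discount factor (gamma in the code), and \( V^{\pi} \) and \( V^{*} \) are notation introduced for this sketch, denoting the value of a policy \( \pi \) and the optimal value.

```latex
\[
V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1}) \;\middle|\; s_0 = s,\ a_t = \pi(s_t) \right]
\]
\[
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right]
\]
```

When the reward depends only on the state and action, the second equation reduces to \( V^{*}(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^{*}(s') \right] \).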

In the code:
- The transition probabilities \( P \) and rewards \( R \) are provided as input to the policy_iteration() function, representing the dynamics of the MDP.
- The policy_iteration() function computes the optimal policy \( \pi \) and value function \( V \) for the given MDP, which allows the agent to make decisions that maximize its long-term rewards.
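
To illustrate what acting on a policy means, the sketch below rolls out one trajectory and accumulates the discounted reward. It is an illustration under stated assumptions rather than part of the code above: simulate_return is a hypothetical helper, P is assumed to be indexed as P[s, a, s'], and the policy array is chosen by hand for the example.

```python
import numpy as np

def simulate_return(P, R, policy, start_state, gamma=0.9, steps=50, seed=0):
    """Roll out `policy` once and return the discounted sum of rewards."""
    rng = np.random.default_rng(seed)
    state, total, discount = start_state, 0.0, 1.0
    for _ in range(steps):
        action = policy[state]
        total += discount * R[state, action]
        # Sample the next state from P[state, action, :] (assumed layout).
        state = rng.choice(P.shape[-1], p=P[state, action])
        discount *= gamma
    return total

P = np.array([[[0.5, 0.5], [0.8, 0.2]], [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1, 0], [0, 2]])
policy = np.array([0, 1])  # hypothetical policy, chosen only for illustration

print(simulate_return(P, R, policy, start_state=0))
```

A single rollout is a noisy sample; averaging over many seeds gives an estimate of the value of the start state under that policy.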