# Markov Decision Process

An **MDP** (Markov Decision Process) is a mathematical framework for making decisions under uncertainty. It is defined by its states, actions, transition probabilities, rewards, and a discount factor. The goal is to find a policy that maximizes the expected discounted future reward.

This notebook performs policy evaluation and policy improvement for a given MDP.

In [1]:
import numpy as np

In [2]:
num_states = 4
num_actions = 2

# P[state][action] is a list of (probability, next_state, reward) tuples
P = {
    0: {
        0: [(1.0, 1, 10)],
        1: [(1.0, 0, 0)]
    },
    1: {
        0: [(1.0, 2, -10)],
        1: [(1.0, 0, 10)]
    },
    2: {
        0: [(1.0, 3, 10)],
        1: [(1.0, 1, -10)]
    },
    3: {
        0: [(1.0, 3, 0)],
        1: [(1.0, 3, 0)]
    }
}

gamma = 0.9  # Discount factor
theta = 1e-4  # Convergence threshold

# Start from a uniform random policy: each action has probability 1 / num_actions
policy = np.ones([num_states, num_actions]) / num_actions

### Inference

The cell above defines an MDP with four states and two actions, specifies the transition and reward for every state-action pair, and initializes the parameters and a uniform random policy for policy iteration. The parameters are gamma and theta. Gamma is the discount factor: a value of 0.9 means a reward loses 10% of its weight for every time step it lies in the future. Theta is the convergence threshold: with a value of 1e-4, policy evaluation stops once the largest value update in a sweep falls below it.
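To make the effect of gamma concrete, here is a minimal sketch that is not part of the original notebook. It computes the discounted return of a constant reward stream. The helper name `discounted_return` and the choice of a reward of 10 over 200 steps are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


gamma = 0.9
# A reward of 10 on every step, truncated at 200 steps (illustrative choice).
truncated = discounted_return([10] * 200, gamma)
# Closed form of the infinite geometric series: r / (1 - gamma).
limit = 10 / (1 - gamma)
print(truncated, limit)
```

With gamma = 0.9, the truncated sum is already very close to the closed-form limit of 100, which is the most any state can be worth if it earns at most 10 per step.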

## Policy Evaluation

In [4]:
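def policy_evaluation(policy, P, gamma, theta):
    """Iteratively estimate the state values of `policy` using in-place sweeps."""
    V = np.zeros(num_states)
    while True:
        delta = 0  # Largest value change seen in this sweep
        for s in range(num_states):
            v = 0
            # Expected return of state s, averaged over actions and outcomes
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward in P[s][a]:
                    v += action_prob * prob * (reward + gamma * V[next_state])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        # Stop once every state value changed by less than theta
        if delta < theta:
            break
    return V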

### Inference

The policy_evaluation function assesses how good a policy is by computing the expected discounted return from each state when that policy is followed. This sets the stage for policy improvement, which makes the policy better and ultimately leads to an optimal policy.

Take the example of a grid world where an agent moves between states and earns rewards. The policy_evaluation function calculates the expected value of each position in the grid, telling the agent how beneficial each position is when following a specific policy. This helps the agent learn which paths are better to take to maximize rewards over time.
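For a fixed policy, the state values can also be obtained without iterating, by solving the linear system (I - gamma * P_pi) V = r_pi. The sketch below is an assumption-labeled cross-check, not part of the original notebook. It copies the transition model and the uniform starting policy from the cells above, and the names `P_pi`, `r_pi`, and `V_exact` are introduced here for illustration.

```python
import numpy as np

# Transition model copied from the notebook:
# P[state][action] = [(probability, next_state, reward)]
P = {
    0: {0: [(1.0, 1, 10)], 1: [(1.0, 0, 0)]},
    1: {0: [(1.0, 2, -10)], 1: [(1.0, 0, 10)]},
    2: {0: [(1.0, 3, 10)], 1: [(1.0, 1, -10)]},
    3: {0: [(1.0, 3, 0)], 1: [(1.0, 3, 0)]},
}
num_states, num_actions, gamma = 4, 2, 0.9
policy = np.ones((num_states, num_actions)) / num_actions  # uniform random policy

# Build the policy-averaged transition matrix and expected one-step reward.
P_pi = np.zeros((num_states, num_states))
r_pi = np.zeros(num_states)
for s in range(num_states):
    for a in range(num_actions):
        for prob, next_state, reward in P[s][a]:
            P_pi[s, next_state] += policy[s, a] * prob
            r_pi[s] += policy[s, a] * prob * reward

# Solve (I - gamma * P_pi) V = r_pi directly instead of sweeping until convergence.
V_exact = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
print(V_exact)
```

The iterative routine should agree with `V_exact` to within roughly the theta tolerance. The direct solve is practical only for small state spaces like this one.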

## Policy Improvement

In [5]:
def policy_improvement(V, P, gamma):
    """Make the module-level `policy` greedy with respect to V (updated in place)."""
    policy_stable = True
    for s in range(num_states):
        old_action = np.argmax(policy[s])

        # Finding the best action based on current value function
        action_values = np.zeros(num_actions)
        for a in range(num_actions):
            for prob, next_state, reward in P[s][a]:
                action_values[a] += prob * (reward + gamma * V[next_state])

        new_action = np.argmax(action_values)
        if old_action != new_action:
            policy_stable = False
        # Put all probability on the greedy action
        policy[s] = np.eye(num_actions)[new_action]
    return policy, policy_stable

### Inference

Suppose the agent is navigating a simple grid with rewards scattered throughout. After evaluating a policy, it might find that moving up is beneficial in a certain state because of a nearby reward. The policy improvement function then adjusts the policy to choose moving up in that state.

Once every state has been assigned the action that maximizes the expected return under the latest state values, the policy has been improved. Repeating the evaluation and improvement cycle optimizes the policy step by step until it no longer changes, at which point it is the optimal policy for maximizing rewards in the environment.
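The greedy step can be illustrated on a single state of this notebook's MDP. The sketch below is not part of the original cells. It performs the one-step lookahead for state 2, assuming value estimates rounded from the final output reported later in the notebook. The name `P_state2` is introduced here for illustration.

```python
import numpy as np

gamma = 0.9
# Transitions for state 2 only, copied from the notebook's P.
P_state2 = {0: [(1.0, 3, 10)], 1: [(1.0, 1, -10)]}
# Assumed value estimates, rounded from the notebook's final output.
V = np.array([100.0, 100.0, 80.0, 0.0])

# One-step lookahead: immediate reward plus discounted value of the next state.
action_values = np.zeros(2)
for a, outcomes in P_state2.items():
    for prob, next_state, reward in outcomes:
        action_values[a] += prob * (reward + gamma * V[next_state])

print(action_values)             # [10. 80.]
print(np.argmax(action_values))  # 1
```

Action 0 collects 10 immediately but lands in state 3, where nothing more can be earned. Action 1 costs 10 up front yet returns to the high-value state 1, so the greedy step picks action 1.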

## Policy Iteration

In [6]:
def policy_iteration(P, gamma, theta):
    global policy
    while True:
        # Policy Evaluation
        V = policy_evaluation(policy, P, gamma, theta)

        # Policy Improvement
        policy, policy_stable = policy_improvement(V, P, gamma)

        if policy_stable:
            return policy, V

### Inference

The policy_iteration function alternates between evaluating the policy and improving it until the improvement step leaves the policy unchanged. The returned policy is then the optimal policy, and V is the optimal state-value function (accurate to within the theta tolerance), which gives the maximum expected discounted reward obtainable from each state. This iterative approach makes policy iteration a fundamental method for solving MDPs in reinforcement learning.
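To see the alternation in action, the following sketch records the greedy action chosen for each state after every round. It is a hypothetical variant rather than the notebook's own implementation: the helpers `evaluate`, `greedy`, and `policy_iteration_trace` are names introduced here, and the policy is passed explicitly instead of being kept in a module-level variable.

```python
import numpy as np

# Transition model copied from the notebook.
P = {
    0: {0: [(1.0, 1, 10)], 1: [(1.0, 0, 0)]},
    1: {0: [(1.0, 2, -10)], 1: [(1.0, 0, 10)]},
    2: {0: [(1.0, 3, 10)], 1: [(1.0, 1, -10)]},
    3: {0: [(1.0, 3, 0)], 1: [(1.0, 3, 0)]},
}


def evaluate(policy, P, gamma, theta):
    """Iterative policy evaluation with in-place sweeps."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            v = sum(policy[s, a] * prob * (reward + gamma * V[ns])
                    for a in P[s] for prob, ns, reward in P[s][a])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V


def greedy(V, P, gamma, num_actions):
    """Return the greedy action index for every state."""
    actions = []
    for s in P:
        q = [sum(prob * (reward + gamma * V[ns]) for prob, ns, reward in P[s][a])
             for a in range(num_actions)]
        actions.append(int(np.argmax(q)))
    return actions


def policy_iteration_trace(P, gamma, theta, num_actions):
    """Policy iteration that also returns the greedy actions after each round."""
    policy = np.ones((len(P), num_actions)) / num_actions  # uniform start
    history = []
    while True:
        V = evaluate(policy, P, gamma, theta)
        old_actions = [int(a) for a in np.argmax(policy, axis=1)]
        new_actions = greedy(V, P, gamma, num_actions)
        history.append(new_actions)
        policy = np.eye(num_actions)[new_actions]  # one-hot rows
        if new_actions == old_actions:
            return policy, V, history


policy, V, history = policy_iteration_trace(P, gamma=0.9, theta=1e-4, num_actions=2)
for round_number, actions in enumerate(history, start=1):
    print(round_number, actions)
```

On this MDP the trace should show state 2 switching from action 0 to action 1 in the second round, after which a third round confirms that nothing changes and the loop stops.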

## Optimal Policy and Optimal Value

In [7]:
optimal_policy, optimal_value = policy_iteration(P, gamma, theta)

print("Optimal Policy (state-action probabilities):")
print(optimal_policy)
print("\nOptimal State-Value Function:")
print(optimal_value)

Optimal Policy (state-action probabilities):
[[1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]]

Optimal State-Value Function:
[99.99964119 99.99967708 79.99970937  0.        ]


### Inference

State 0 has a value of approximately 99.9996

State 1 has a value of approximately 99.9997

State 2 has a value of approximately 79.9997

State 3 has a value of 0

## Further Clarification

These values represent the expected cumulative discounted reward from each state if the agent follows the optimal policy.

States 0 and 1 have high values (~100) because the optimal policy cycles between them and collects a reward of 10 on every step. With gamma = 0.9 that stream is worth 10 / (1 - 0.9) = 100.

State 2 has a lower value (~80). Its optimal action pays a reward of -10 to move to state 1, after which the agent earns the ~100 available from state 1, discounted once: -10 + 0.9 * 100 = 80.

State 3 has a value of 0 because it is an absorbing state: both actions lead back to state 3 with a reward of 0, so no further rewards can be obtained.
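These numbers can be sanity-checked by simulating the optimal policy and summing the discounted rewards. The sketch below is an illustrative check, not part of the original notebook. The helper `rollout_return`, the list `greedy_action` (read off the printed optimal policy), and the 500-step horizon are assumptions made here.

```python
# Transition model copied from the notebook.
P = {
    0: {0: [(1.0, 1, 10)], 1: [(1.0, 0, 0)]},
    1: {0: [(1.0, 2, -10)], 1: [(1.0, 0, 10)]},
    2: {0: [(1.0, 3, 10)], 1: [(1.0, 1, -10)]},
    3: {0: [(1.0, 3, 0)], 1: [(1.0, 3, 0)]},
}
gamma = 0.9
# Greedy action per state, read off the optimal policy printed in the notebook.
greedy_action = [0, 1, 1, 0]


def rollout_return(start, steps=500):
    """Discounted return of following greedy_action for a fixed number of steps.

    Every transition in this MDP is deterministic, so a single rollout gives
    the exact return up to the truncation at `steps`.
    """
    state, total, discount = start, 0.0, 1.0
    for _ in range(steps):
        _, next_state, reward = P[state][greedy_action[state]][0]
        total += discount * reward
        discount *= gamma
        state = next_state
    return total


returns = [rollout_return(s) for s in range(4)]
print(returns)
```

The simulated returns should land close to 100, 100, 80, and 0. The small gap to the values printed by policy iteration comes from the theta stopping tolerance.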

# Conclusion

Policy evaluation and policy improvement have been performed successfully for the given MDP. MDPs are very useful when the outcome of a decision is uncertain. The notebook first defined the states, actions, transitions, rewards, and discount factor, then implemented the policy evaluation and policy improvement functions, and finally obtained the optimal policy and the optimal value function.