
# Markov Decision Process (MDP) Simulation and Value Iteration

**MDP** is a mathematical framework for making decisions under uncertainty. The goal is to choose actions that maximize the expected discounted future reward. An MDP is defined by its states, actions, transition probabilities, rewards, and a discount factor.

### Performing Value Iteration for a given MDP

Value Iteration is a dynamic programming algorithm. It iteratively computes the optimal state-value function by replacing each state's value with the maximum expected return over its actions. Once the values have converged, an optimal policy is obtained by picking the best action in each state.
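
As a sketch of the update described above, the standard Bellman optimality backup can be written as follows. The symbols $V_k$, $P(s' \mid s, a)$ and $R(s, a, s')$ are notation assumed here for illustration; they correspond to the value estimate at sweep $k$ and to the `(probability, next_state, reward)` tuples used later in this notebook.

$$
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V_k(s') \bigr]
$$

$$
\pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V(s') \bigr]
$$

The first line is the synchronous form of the update. An implementation may also overwrite values in place during a sweep, which typically converges to the same fixed point.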

### Loading the Libraries

In [1]:
import numpy as np

### MDP Parameters

In [4]:
num_states = 4
num_actions = 2

P = {
    0: {
        0: [(1.0, 1, 10)],
        1: [(1.0, 0, 0)]
    },
    1: {
        0: [(1.0, 2, -10)],
        1: [(1.0, 0, 10)]
    },
    2: {
        0: [(1.0, 3, 10)],
        1: [(1.0, 1, -10)]
    },
    3: {
        0: [(1.0, 3, 0)],
        1: [(1.0, 3, 0)]
    }
}

gamma = 0.9  # Discount factor
theta = 1e-4 # Convergence threshold

### Inference

This cell defines the MDP: four states and two actions, with a transition table `P` that maps each state-action pair to a list of `(probability, next_state, reward)` tuples. It also sets the two parameters used by value iteration, `gamma` and `theta`. `gamma` is the discount factor; a value of 0.9 means that rewards are discounted by 10% for each time step. `theta` is the convergence threshold; a value of 1e-4 stops the iteration once the largest value update in a sweep falls below it.
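
A minimal sketch of how the transition table could be sanity-checked before running the algorithm. The helper `check_mdp` and the list `weights` are hypothetical names introduced for illustration and are not part of the original notebook; the table `P` is copied from the cell above.

```python
# Transition table from the notebook: P[state][action] = [(prob, next_state, reward), ...]
P = {
    0: {0: [(1.0, 1, 10)], 1: [(1.0, 0, 0)]},
    1: {0: [(1.0, 2, -10)], 1: [(1.0, 0, 10)]},
    2: {0: [(1.0, 3, 10)], 1: [(1.0, 1, -10)]},
    3: {0: [(1.0, 3, 0)], 1: [(1.0, 3, 0)]},
}


def check_mdp(P, num_states, num_actions, tol=1e-9):
    """Hypothetical helper: verify probabilities sum to 1 and next states are valid."""
    for s in range(num_states):
        for a in range(num_actions):
            outcomes = P[s][a]
            total = sum(prob for prob, _, _ in outcomes)
            assert abs(total - 1.0) < tol, f"P[{s}][{a}] sums to {total}"
            for _, next_state, _ in outcomes:
                assert 0 <= next_state < num_states, f"bad next state {next_state}"
    return True


check_mdp(P, num_states=4, num_actions=2)

# Weight applied to a reward received t steps in the future, for t = 0..3
gamma = 0.9
weights = [gamma ** t for t in range(4)]
print(weights)
```

Every state-action pair here has a single outcome with probability 1.0, so this particular MDP is deterministic; the check matters more once stochastic transitions are added.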

### Value Iteration

In [2]:
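def value_iteration(P, gamma, theta):
    """Return (policy, V) for the MDP in P; uses the globals num_states and num_actions."""
    V = np.zeros(num_states)                  # Value estimates, initialized to zero
    policy = np.zeros(num_states, dtype=int)  # Best action found so far for each state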

    while True:
        delta = 0
        for s in range(num_states):
            # Calculating the value for each action
            action_values = np.zeros(num_actions)
            for a in range(num_actions):
                action_value = 0
                for prob, next_state, reward in P[s][a]:
                    action_value += prob * (reward + gamma * V[next_state])
                action_values[a] = action_value

            max_value = np.max(action_values)
            delta = max(delta, abs(max_value - V[s]))
            V[s] = max_value
            policy[s] = np.argmax(action_values)  # Storing the best action in the policy

        if delta < theta:
            break

    return policy, V

### Inference

This function computes the optimal state-value function and a greedy policy that maximizes the expected discounted reward. It repeatedly sweeps over the states, computing the expected return of each action, keeping the best action for each state, and updating that state's value, until the largest change in a sweep falls below `theta`.
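
To make a single pass of the loop concrete, here is a sketch of one in-place sweep starting from all-zero values. The function `sweep` is a hypothetical helper written for this illustration; it mirrors the inner loop of `value_iteration` but is not the notebook's own code.

```python
import numpy as np

# Transition table from the notebook: P[state][action] = [(prob, next_state, reward), ...]
P = {
    0: {0: [(1.0, 1, 10)], 1: [(1.0, 0, 0)]},
    1: {0: [(1.0, 2, -10)], 1: [(1.0, 0, 10)]},
    2: {0: [(1.0, 3, 10)], 1: [(1.0, 1, -10)]},
    3: {0: [(1.0, 3, 0)], 1: [(1.0, 3, 0)]},
}
gamma = 0.9


def sweep(V, P, gamma):
    """Hypothetical helper: one in-place Bellman sweep; returns the largest change."""
    delta = 0.0
    for s in sorted(P):
        # Expected return of each action, using the current (partly updated) V
        q = [
            sum(prob * (reward + gamma * V[nxt]) for prob, nxt, reward in P[s][a])
            for a in sorted(P[s])
        ]
        best = max(q)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    return delta


V = np.zeros(4)
first_delta = sweep(V, P, gamma)
print(V, first_delta)
```

After this first sweep the values are approximately `[10, 19, 10, 0]` with a largest change of about 19, far above `theta`, so the outer loop continues. State 1 already benefits from the updated value of state 0 because the update is done in place.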

### Optimal Policy and Optimal Value

In [6]:
optimal_policy, optimal_value = value_iteration(P, gamma, theta)

print("Optimal Policy (state-action pairs):")
print(optimal_policy)
print("\nOptimal State-Value Function:")
print(optimal_value)

Optimal Policy (state-action pairs):
[0 1 1 0]

Optimal State-Value Function:
[99.99964119 99.99967708 79.99970937  0.        ]


### Inference

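In the optimal policy, the element at index `s` is the best action for state `s`: action 0 in state 0, action 1 in states 1 and 2, and action 0 in state 3. State 3 is absorbing with zero reward, so both actions tie there and `np.argmax` returns the first one. The optimal state-value function gives the maximum expected discounted return from each state under the optimal policy. The printed values sit slightly below 100, 100 and 80 because the iteration stops as soon as the update falls below `theta`.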
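
One way to cross-check these numbers is to follow the printed policy and add up the discounted rewards directly. The sketch below assumes deterministic transitions (true for this table, where every state-action pair has a single outcome) and truncates the rollout at 200 steps; `discounted_return` is a hypothetical helper, not part of the original notebook.

```python
# Transition table from the notebook: P[state][action] = [(prob, next_state, reward), ...]
P = {
    0: {0: [(1.0, 1, 10)], 1: [(1.0, 0, 0)]},
    1: {0: [(1.0, 2, -10)], 1: [(1.0, 0, 10)]},
    2: {0: [(1.0, 3, 10)], 1: [(1.0, 1, -10)]},
    3: {0: [(1.0, 3, 0)], 1: [(1.0, 3, 0)]},
}


def discounted_return(P, policy, start, gamma, steps=200):
    """Hypothetical helper: discounted reward collected by following `policy`."""
    total, discount, state = 0.0, 1.0, start
    for _ in range(steps):
        # Assumes a single outcome per state-action pair (deterministic MDP)
        _, next_state, reward = P[state][policy[state]][0]
        total += discount * reward
        discount *= gamma
        state = next_state
    return total


policy = [0, 1, 1, 0]  # Policy printed by the notebook
returns = [discounted_return(P, policy, s, gamma=0.9) for s in range(4)]
print(returns)
```

Under this policy the agent cycles between states 0 and 1 and collects a reward of 10 at every step, so the return from either state approaches 10 / (1 - 0.9) = 100. From state 2 it first pays -10 to reach state 1, giving -10 + 0.9 × 100 = 80, which matches the values above.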

# Conclusion

Value iteration for a Markov decision process has been implemented successfully. MDPs are useful when the outcome of an action is uncertain. We first defined the states, actions, transitions, rewards, and discount factor, then implemented the value iteration function, and finally obtained the optimal policy and the optimal state-value function.