# MDP
**A Finite Markov Decision Process (MDP)** is a framework used in Reinforcement Learning to model decision-making problems where outcomes are partly random and partly under the control of a decision maker (the agent). It gives a mathematical model for situations where the agent must choose an action in each state so as to maximize the expected cumulative reward over time.

**Key Concepts of MDP:**

- States (S): A finite set of all possible situations the agent can be in.
- Actions (A): A finite set of all possible actions the agent can take.
- Transition Probability (P): The probability of moving to a next state, given the current state and the action taken.
- Rewards (R): The immediate reward received after taking an action in one state and transitioning to another.
- Discount Factor (γ): A number between 0 and 1 that sets how much future rewards are valued compared to immediate rewards. Values near 0 favor immediate rewards; values near 1 weight future rewards almost as heavily.
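
To make the role of the discount factor concrete, here is a minimal sketch that computes the discounted return, i.e. the sum of γ^t · r_t over a finite sequence of rewards. The helper name `discounted_return` and the sample reward sequence are illustrative assumptions, not part of the original notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite sequence (hypothetical helper)."""
    total = 0.0
    # Walk backwards so each earlier step multiplies the running total by gamma once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Illustrative episode: three small step penalties, then a goal reward.
rewards = [-0.1, -0.1, -0.1, 1.0]
print(discounted_return(rewards, gamma=1.0))  # plain sum, no discounting
print(discounted_return(rewards, gamma=0.9))  # the final reward counts for less
```

With γ below 1, the same goal reward is worth less the later it arrives, which is what pushes an agent toward shorter paths.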

## Example: Gridworld

Let’s assume a simple grid where an agent starts at the top-left corner and wants to reach the bottom-right corner. It can move UP, DOWN, LEFT, or RIGHT. The agent receives a reward of +1 when it reaches the goal and a small penalty of -0.1 for every other move, which discourages wandering. A move that would take the agent outside the grid leaves it in the same cell.

Let’s break this down into an MDP:

States (S): Grid cells.

Actions (A): {UP, DOWN, LEFT, RIGHT}.

Rewards (R): +1 for reaching the goal, -0.1 for every other move.

Transition (P): Probability of moving to the next cell based on an action (we can assume deterministic moves for simplicity).
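
The components above can also be written out explicitly. The sketch below enumerates the states of a small grid and builds a deterministic transition table, so that P(s' | s, a) is 1 for exactly one successor and 0 for all others. The names `MOVES` and `build_transitions` are hypothetical, and clamping at the grid edge is an assumption that matches the deterministic-moves simplification.

```python
from itertools import product

# Row/column offsets for each action (illustrative encoding).
MOVES = {'UP': (-1, 0), 'DOWN': (1, 0), 'LEFT': (0, -1), 'RIGHT': (0, 1)}

def build_transitions(grid_size):
    """Map (state, action) -> next state for a deterministic grid (hypothetical helper)."""
    states = list(product(range(grid_size), repeat=2))
    transitions = {}
    for (row, col), (action, (d_row, d_col)) in product(states, MOVES.items()):
        # Clamp to the grid so a move into a wall leaves the agent in place.
        new_row = min(max(row + d_row, 0), grid_size - 1)
        new_col = min(max(col + d_col, 0), grid_size - 1)
        transitions[((row, col), action)] = (new_row, new_col)
    return states, transitions

states, transitions = build_transitions(grid_size=4)
print(len(states), len(transitions))
print(transitions[((0, 0), 'UP')], transitions[((0, 0), 'DOWN')])
```

A stochastic environment would instead map each (state, action) pair to a distribution over next states; the table form is only practical because the grid is small and finite.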

In [None]:
import numpy as np

# Gridworld setup
class GridworldMDP:
    def __init__(self, grid_size):
        self.grid_size = grid_size
        self.state = (0, 0)  # Start state at the top-left corner (0, 0)
        # Goal state: the bottom-right corner, which the agent is trying to reach
        self.goal_state = (grid_size - 1, grid_size - 1)
        self.actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']

    def get_actions(self):
        return self.actions

    def step(self, action):
        row, col = self.state

        # Determine new state based on action
        if action == 'UP':
            new_state = (max(row - 1, 0), col)
        elif action == 'DOWN':
            new_state = (min(row + 1, self.grid_size - 1), col)
        elif action == 'LEFT':
            new_state = (row, max(col - 1, 0))
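        elif action == 'RIGHT':
            new_state = (row, min(col + 1, self.grid_size - 1))
        else:
            # Without this branch an unknown action would leave new_state undefined
            raise ValueError(f"Unknown action: {action!r}")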

        # Reward structure
        if new_state == self.goal_state:
            reward = 1  # Reached the goal
            done = True
        else:
            reward = -0.1  # Small penalty for each move
            done = False

        self.state = new_state
        return new_state, reward, done

    def reset(self):
        self.state = (0, 0)
        return self.state

In [None]:
# Simulating the agent interaction with the MDP
gridworld = GridworldMDP(grid_size=4)
state = gridworld.reset()

for _ in range(500):  # Cap the episode at 500 steps
    action = np.random.choice(gridworld.get_actions())  # Randomly choose an action
    next_state, reward, done = gridworld.step(action)
    print(f"Action: {action}, Next State: {next_state}, Reward: {reward}")

    if done:
        print("Reached the goal!")
        break

Action: UP, Next State: (0, 0), Reward: -0.1
Action: LEFT, Next State: (0, 0), Reward: -0.1
Action: DOWN, Next State: (1, 0), Reward: -0.1
Action: DOWN, Next State: (2, 0), Reward: -0.1
Action: RIGHT, Next State: (2, 1), Reward: -0.1
Action: DOWN, Next State: (3, 1), Reward: -0.1
Action: UP, Next State: (2, 1), Reward: -0.1
Action: RIGHT, Next State: (2, 2), Reward: -0.1
Action: LEFT, Next State: (2, 1), Reward: -0.1
Action: RIGHT, Next State: (2, 2), Reward: -0.1
Action: LEFT, Next State: (2, 1), Reward: -0.1
Action: RIGHT, Next State: (2, 2), Reward: -0.1
Action: LEFT, Next State: (2, 1), Reward: -0.1
Action: RIGHT, Next State: (2, 2), Reward: -0.1
Action: DOWN, Next State: (3, 2), Reward: -0.1
Action: DOWN, Next State: (3, 2), Reward: -0.1
Action: RIGHT, Next State: (3, 3), Reward: 1
Reached the goal!


## Mathematical Explanation

In reinforcement learning (RL), an environment has the Markov property if its next state depends only on the current state and the action taken, not on the sequence of states and actions that came before.

This is a crucial property because it simplifies the decision-making process for the agent. Instead of having to consider the entire history of states and actions, the agent can focus solely on the current situation to predict the potential outcomes of its actions.

To put it more formally:

```P(s_(t+1) | s_t, a_t) = P(s_(t+1) | s_1, a_1, s_2, a_2, ..., s_t, a_t) ```

where:

- s_t: current state

- a_t: action taken

- s_(t+1): next state

This equation states that the probability of transitioning to the next state (s_(t+1)) depends only on the current state (s_t) and the action taken (a_t), not on the previous states and actions (s_1, a_1, s_2, a_2, ...).
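
The property can be checked on a deterministic grid with a small sketch: two different histories that end in the same state must produce the same next state for the same action. The standalone `step` and `rollout` functions below are hypothetical helpers written for this illustration, assuming a 4x4 grid with moves clamped at the edges.

```python
def step(state, action, grid_size=4):
    """Deterministic grid move that depends only on (state, action)."""
    moves = {'UP': (-1, 0), 'DOWN': (1, 0), 'LEFT': (0, -1), 'RIGHT': (0, 1)}
    d_row, d_col = moves[action]
    row, col = state
    return (min(max(row + d_row, 0), grid_size - 1),
            min(max(col + d_col, 0), grid_size - 1))

def rollout(start, actions):
    """Apply a sequence of actions and return the final state."""
    state = start
    for action in actions:
        state = step(state, action)
    return state

# Two different histories that end in the same state (1, 1).
state_a = rollout((0, 0), ['DOWN', 'RIGHT'])
state_b = rollout((0, 0), ['RIGHT', 'RIGHT', 'DOWN', 'LEFT'])
assert state_a == state_b == (1, 1)

# Same current state and same action give the same next state, whatever the history.
print(step(state_a, 'DOWN') == step(state_b, 'DOWN'))
```

Because `step` takes only the current state and the action as inputs, the history cannot influence the outcome; that is the Markov property expressed as a function signature.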

## Limitations
A key limitation of MDPs is the assumption of the Markov property: the next state depends only on the current state and the action taken, not on the past history. In many real-world scenarios this assumption fails because the state the agent observes leaves out information that still matters. For instance, in chess the board position alone is not a Markov state: castling rights, en passant captures, and draws by repetition all depend on earlier moves, so that information has to be included in the state for the assumption to hold.
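
A toy illustration of this limitation, under made-up assumptions: in the hypothetical corridor below, a door pays a reward only if a key was collected earlier. If the state were the position alone, the same position and action could yield different rewards depending on history. Adding a `has_key` flag to the state restores the Markov property. The cell numbers, the `step` function, and the reward values are all invented for this sketch.

```python
KEY_CELL, DOOR_CELL = 1, 3  # hypothetical cells on a corridor numbered 0..3

def step(position, has_key, action):
    """Move along the corridor; the door pays out only if the key was collected."""
    offset = 1 if action == 'RIGHT' else -1
    new_position = min(max(position + offset, 0), DOOR_CELL)
    has_key = has_key or new_position == KEY_CELL
    reward = 1.0 if (new_position == DOOR_CELL and has_key) else 0.0
    return new_position, has_key, reward

# History 1: start at cell 2 without the key and walk straight to the door.
_, _, reward_without_key = step(2, False, 'RIGHT')

# History 2: detour to the key cell first, then come back to cell 2.
pos, key, _ = step(2, False, 'LEFT')   # reach cell 1 and pick up the key
pos, key, _ = step(pos, key, 'RIGHT')  # back at cell 2, now holding the key
_, _, reward_with_key = step(pos, key, 'RIGHT')

# Same position (2) and same action (RIGHT), but different rewards:
# position alone is not a Markov state, while (position, has_key) is.
print(reward_without_key, reward_with_key)
```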