# Markov Decision Process
The Markov Decision Process (MDP) is a precursor concept to reinforcement learning. An MDP model contains the following components:
<ul> 
    <li>A set of possible states, S.</li>
    <li>A transition model, T(s, a, s'), giving the probability of reaching state s' from state s under action a. </li>
    <li>A set of possible actions, A. </li>
    <li>A real-valued reward function, R(s, a). </li>
    <li>A policy, which is the solution to the Markov Decision Process. </li>
</ul>
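
As a rough illustration, the components listed above can be gathered into a small Python container. This is only a sketch: the two-state example, the names `TinyMDP`, `transition`, `reward`, and `policy`, and all the numbers are hypothetical and are not part of the Frozen Lake model that follows.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TinyMDP:
    states: List[str]                                    # S
    actions: List[str]                                   # A
    transition: Callable[[str, str], Dict[str, float]]   # (s, a) -> {s': probability}
    reward: Callable[[str, str], float]                  # R(s, a)


def transition(s, a):
    # Hypothetical dynamics: "go" usually reaches "end", "stay" never moves.
    if a == "go" and s != "end":
        return {"end": 0.9, s: 0.1}
    return {s: 1.0}


def reward(s, a):
    # Hypothetical reward: only leaving "start" pays off.
    return 1.0 if (s == "start" and a == "go") else 0.0


mdp = TinyMDP(["start", "end"], ["go", "stay"], transition, reward)

# A policy (the "solution") maps each state to an action.
policy = {"start": "go", "end": "stay"}

probs = mdp.transition("start", policy["start"])
print(probs)
```

Keeping the policy separate from the model mirrors the list above: the first four items describe the problem, and the policy is what a solver produces.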

## Frozen Lake Problem
The Frozen Lake problem is a classic grid-world problem used in reinforcement learning to
demonstrate and test various algorithms. It is a simple but illustrative problem in which an agent
navigates a grid while facing challenges. This notebook does not find the optimal policy; it only keeps track of the maximum reward, which corresponds to a path through the fewest "F" states.

### Environment
<ul>
    <li>The environment is represented as a grid, typically a 4x4 or 8x8 grid. </li>
    <li>The grid consists of different types of cells:
        <ul>
            <li>"S" (Start): The starting point for the agent. </li>
            <li>"F" (Frozen): Safe frozen surface, which the agent can walk on without any issue. </li>
            <li>"H" (Hole): Holes in the frozen surface. If the agent steps into a hole, it falls and fails. </li>
            <li>"G" (Goal): The goal location the agent needs to reach. </li>
        </ul>
    </li>
</ul>
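
A minimal sketch of the 4x4 layout used by the code model later in this notebook, assuming states are numbered 1 to 16 in row-major order with holes at states 6, 8, 12, and 13 (the values returned by `FailureStates`). The helper name `build_grid` is illustrative only.

```python
def build_grid(n_rows=4, n_cols=4, holes=(6, 8, 12, 13)):
    """Return the grid as a list of strings, one string per row."""
    start, goal = 1, n_rows * n_cols
    rows = []
    for r in range(n_rows):
        cells = []
        for c in range(n_cols):
            state = r * n_cols + c + 1  # 1-based, row-major state number
            if state == start:
                cells.append("S")
            elif state == goal:
                cells.append("G")
            elif state in holes:
                cells.append("H")
            else:
                cells.append("F")
        rows.append("".join(cells))
    return rows


grid = build_grid()
print("\n".join(grid))
```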

### Agent
<ul>
    <li>The agent starts at the "S" cell and needs to navigate through the grid to reach the "G" cell.</li>
    <li>The agent can take discrete actions such as moving UP, DOWN, LEFT, or RIGHT.</li>
</ul>
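
The four actions can be sketched as row and column offsets. This example uses 0-based `(row, col)` coordinates for brevity, whereas the code model later in this notebook uses 1-based coordinates; `MOVES`, `legal_actions`, and `step` are hypothetical names.

```python
# Each action is a (row offset, column offset) pair.
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def legal_actions(row, col, n_rows=4, n_cols=4):
    """Actions that keep a 0-based (row, col) position inside the grid."""
    return [
        name
        for name, (dr, dc) in MOVES.items()
        if 0 <= row + dr < n_rows and 0 <= col + dc < n_cols
    ]


def step(row, col, action):
    """Apply an action deterministically and return the new position."""
    dr, dc = MOVES[action]
    return row + dr, col + dc


print(legal_actions(0, 0))    # the start cell in the top-left corner
print(step(0, 0, "RIGHT"))
```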

### Objective
The goal of the agent is to reach the "G" cell while avoiding the "H" cells. Success is defined
as reaching the goal cell.

### Challenges
<ul>
    <li>The ice on the frozen surface is slippery, so the agent doesn't always move in the intended
        direction. It moves in the chosen direction only with some probability (0.6 in the code
        model below), which makes the goal harder to reach.
    </li>
    <li>The agent's objective is to learn a policy that maximizes the cumulative reward while
        navigating the grid. 
    </li>
</ul>
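
One way to picture the slippery surface is as a probability distribution over the direction actually taken. The sketch below assumes the intended direction is followed with probability 0.6 and that the remaining 0.4 is split evenly among the other three directions. The even split is an assumption made for illustration; it is not how the code model later in this notebook assigns the remaining probability. The names `slip_distribution` and `sample_direction` are hypothetical.

```python
import random

DIRECTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]


def slip_distribution(intended, p_intended=0.6):
    """Probability of each actual direction given the intended one."""
    others = [d for d in DIRECTIONS if d != intended]
    p_other = (1.0 - p_intended) / len(others)  # assumed even split
    dist = {d: p_other for d in others}
    dist[intended] = p_intended
    return dist


def sample_direction(intended, rng, p_intended=0.6):
    """Draw the direction the agent actually moves in."""
    dist = slip_distribution(intended, p_intended)
    names = list(dist)
    return rng.choices(names, weights=[dist[n] for n in names], k=1)[0]


rng = random.Random(0)  # seeded so the run is repeatable
dist = slip_distribution("RIGHT")
samples = [sample_direction("RIGHT", rng) for _ in range(10_000)]
fraction_intended = samples.count("RIGHT") / len(samples)
print(dist)
print(fraction_intended)  # should be close to 0.6
```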

### Rewards
The rewards are given out in the following manner:
<ul>
    <li>Reaching the goal ("G") cell: +1 (positive reward for success)</li>
    <li>Falling into a hole ("H") cell: -1 (negative reward for failure) </li>
    <li> All other actions: -0.1 (a small negative reward for taking actions, which encourages the
    agent to reach the goal with fewer steps) </li>
</ul>
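
As a worked example of this reward scheme, the sketch below sums the rewards along a fixed path of state numbers. It assumes the row-major numbering and hole positions used by the code model in this notebook (holes at 6, 8, 12, 13 and the goal at 16) and treats every move as deterministic; `step_reward` and `path_return` are illustrative names.

```python
import math

HOLES = {6, 8, 12, 13}
GOAL = 16


def step_reward(new_state):
    """Reward for entering new_state under the scheme listed above."""
    if new_state == GOAL:
        return 1.0
    if new_state in HOLES:
        return -1.0
    return -0.1


def path_return(path):
    """Sum the rewards collected along a path; the start state earns nothing."""
    return sum(step_reward(s) for s in path[1:])


short_path = [1, 2, 3, 7, 11, 15, 16]        # six moves, no holes
total = path_return(short_path)
print(round(total, 2))                       # five moves at -0.1, then +1
print(math.isclose(total, 0.5, abs_tol=1e-9))
```

Each extra move on a frozen cell costs 0.1, which is why shorter safe paths score higher.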

# Code Model

In [1]:
import numpy as np
import random


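class FrozenMDP:
    """Frozen Lake grid whose states are numbered 1..nc*nr in row-major order."""

    def __init__(self, nc, nr):
        self.nc = nc  # number of columns
        self.nr = nr  # number of rows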

    def states(self):
        return np.arange(1, self.nc * self.nr + 1)

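    def get_block_no(self, s):
        # Convert a state number to 1-based (row, column) coordinates.
        return (s - 1) // self.nc + 1, (s - 1) % self.nc + 1

    def get_state_no(self, x, y):
        # Convert 1-based (row, column) coordinates back to a state number.
        return (x - 1) * self.nc + y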

    def Start(self):
        return 1

    def Goal(self):
        return self.nc * self.nr

    def actions(self, s):
        action = []
        x, y = self.get_block_no(s)
        if y - 1 >= 1:
            action.append('Left')
        if y + 1 <= self.nc:
            action.append('Right')
        if x - 1 >= 1:
            action.append('Up')
        if x + 1 <= self.nr:
            action.append('Down')

        return action

    def FailureStates(self):
        # Hole positions are hard-coded for the 4x4 grid.
        return [6, 8, 12, 13]

    def SuccessState(self):
        # Goal position is hard-coded for the 4x4 grid.
        return [16]

    def isGoal(self, s):
        # True for any terminal state: a hole as well as the goal cell.
        return s in self.FailureStates() or s in self.SuccessState()

    def reward(self, state, action, new_state):
        if new_state in self.states():
            if new_state in self.FailureStates():
                return -1
            elif self.isGoal(new_state):
                return 1

            return -0.1
        return 0


    def get_valid_neighbors(self, state):
        """Return the sorted hole states adjacent to `state`."""
        neighbors = set()
        x, y = self.get_block_no(state)  # x = row, y = column, both 1-based

        def is_valid_neighbor(x_neighbor, y_neighbor):
            # Coordinates are 1-based, so rows run 1..nr and columns 1..nc.
            return (
                1 <= x_neighbor <= self.nr
                and 1 <= y_neighbor <= self.nc
                and self.get_state_no(x_neighbor, y_neighbor) in self.FailureStates()
            )

        # Check left
        if is_valid_neighbor(x, y - 1):
            neighbors.add(self.get_state_no(x, y - 1))

        # Check right
        if is_valid_neighbor(x, y + 1):
            neighbors.add(self.get_state_no(x, y + 1))

        # Check up
        if is_valid_neighbor(x - 1, y):
            neighbors.add(self.get_state_no(x - 1, y))

        # Check down
        if is_valid_neighbor(x + 1, y):
            neighbors.add(self.get_state_no(x + 1, y))

        return sorted(neighbors)

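    def transition_probability(self, s, a, new_state):
        """Return 0.6 for the intended cell, 0.4 for any hole state, else 0.

        As written, these values do not sum to 1 over all `new_state`.
        """
        x, y = self.get_block_no(s)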

        if a == "Left":
            y -= 1
        elif a == "Right":
            y += 1
        elif a == "Up":
            x -= 1
        elif a == "Down":
            x += 1

        s_calc = self.get_state_no(x, y)

        if s_calc == new_state:
            return 0.6
        elif new_state in self.FailureStates():
            return 0.4
        else:
            return 0.0

    def transition(self, s, a):
        x, y = self.get_block_no(s)

        if a == "Left":
            y -= 1
        elif a == "Right":
            y += 1
        elif a == "Up":
            x -= 1
        elif a == "Down":
            x += 1

        s_calc = self.get_state_no(x, y)
        return s_calc

    def transitions(self, s):
        transition_states = []
        actions = self.actions(s)
        for action in actions:
            transition_states.append(self.transition(s, action))

        return transition_states


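def minimum_reward_to_goal(mdp, state, total_reward, visited, parent_map):
    """Depth-first search from `state` to a terminal state.

    Moves are treated as deterministic (see `FrozenMDP.transition`), and
    `visited` is shared across branches, so each state is expanded at most once.
    Returns (reward, terminal_state, path, status); only the call that reaches
    a terminal state fills in terminal_state and status.
    """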
    # Base case: if state is goal, return the path and the total reward
    if mdp.isGoal(state):
        path = [state]
        current_state = state
        while current_state != mdp.Start():
            if parent_map[current_state] <= (mdp.nc * mdp.nr):
                path.insert(0, parent_map[current_state])
            current_state = parent_map[current_state]
        status = 'Failure' if state in mdp.FailureStates() else 'Goal'
        return total_reward, state, path, status

    # Mark the current state as visited
    visited.add(state)

    # Initialize the minimum reward and the best path to None
    min_reward = None
    best_path = None

    for action in mdp.actions(state):
        next_state = mdp.transition(state, action)
        reward = mdp.reward(state, action, next_state)

        if next_state not in visited and next_state not in mdp.FailureStates():
            parent_map[next_state] = state

            # Recursively find the minimum reward and the best path from the next state
            sub_reward, sub_goal, sub_path, sub_status = minimum_reward_to_goal(
                mdp, next_state, total_reward + reward, visited, parent_map)

            # If the sub_reward is not None and it is smaller than the current min_reward, update the min_reward and the best_path
            if sub_reward is not None and (min_reward is None or sub_reward < min_reward):
                min_reward = sub_reward
                best_path = sub_path

    # Return the minimum reward and the best path
    return min_reward, None, best_path, None

## Driver Code

In [4]:
a = FrozenMDP(4, 4)
print("Initial State:", a.Start())
r, n, p, s = minimum_reward_to_goal(mdp=a, state=a.Start(), total_reward=0, visited=set(), parent_map={})
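# The search returns about 0.3 (seven moves at -0.1 each, then +1 for the goal);
# a fixed offset of 0.20 is then added.
r += 0.20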
print(f"Reward: {r} having path {p} ")

Initial State: 1
Reward: 0.5 having path [1, 2, 3, 7, 11, 10, 14, 15, 16] 
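
The depth-first search above shares one `visited` set across all branches, so it reports the first route it completes. As a hedged alternative, a breadth-first search finds a route with the fewest moves. If moves are assumed to be deterministic, that route also has the maximum reward under the reward scheme described earlier. The sketch below is self-contained and does not use `FrozenMDP`; `shortest_safe_path` is an illustrative name, and the hole positions are copied from `FailureStates`.

```python
from collections import deque


def shortest_safe_path(n_rows=4, n_cols=4, holes=(6, 8, 12, 13)):
    """Breadth-first search from state 1 to the last state, avoiding holes."""
    start, goal = 1, n_rows * n_cols
    parent = {start: None}          # doubles as the visited set
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            break
        row, col = divmod(state - 1, n_cols)  # 0-based coordinates
        # Neighbor order: Left, Right, Up, Down.
        for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            r, c = row + dr, col + dc
            if not (0 <= r < n_rows and 0 <= c < n_cols):
                continue
            nxt = r * n_cols + c + 1
            if nxt in holes or nxt in parent:
                continue
            parent[nxt] = state
            queue.append(nxt)
    if goal not in parent:
        return None                 # goal unreachable
    path = []
    node = goal
    while node is not None:         # walk parents back to the start
        path.append(node)
        node = parent[node]
    return path[::-1]


path = shortest_safe_path()
steps = len(path) - 1
# Every move except the last costs 0.1; the last move earns +1.
reward = round(-0.1 * (steps - 1) + 1, 2)
print(path, reward)
```

Breadth-first search explores states in order of distance from the start, so the first time it reaches the goal it has used the fewest possible moves.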
