#### Exercise 1

_1. Formalize the above instance of the Lake MDP mathematically, i.e., explicitly give the set of states, the set of actions, the probability distribution, and the reward function._

*State space (𝑆):*

The environment is a 4×4 grid.
The agent can occupy any non-hole, non-wall cell.
There are special states:
    Goal state (bottom-right corner).
    Hole states (absorbing states with a large negative reward).

Therefore, 𝑆 consists of all possible positions (𝑖,𝑗) in the grid:

    S = {(1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), …, (4,4)}

*Action space (A):*

    A = {up, right, down, left}
    a ∈ A(s), since some states do not have all actions available

*Transition probabilities(P):*

    P = p(s' | s, a)
    The intended move succeeds with probability 0.8.
    If both perpendicular neighbors are unblocked, the remaining 0.2 is split equally between them (0.1 each).
    If one of them is blocked, the agent slips to the other with probability 0.1 and stays in place with probability 0.1.
    If both are blocked, the agent stays in place with probability 0.2.
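
The three cases above can be sketched in a few lines of Python. This is a minimal illustration, not part of the exercise solution: the function name `outcome_probabilities` and the labels `side_a`, `side_b`, and `stay` are hypothetical, and the sketch assumes the intended target cell itself is unblocked.

```python
import math

def outcome_probabilities(side_a_blocked: bool, side_b_blocked: bool) -> dict:
    """Outcome probabilities of one move, following the rule described above.

    Assumption: the intended target cell is unblocked, so the intended
    move always keeps its 0.8 probability.
    """
    probs = {"intended": 0.8, "side_a": 0.0, "side_b": 0.0, "stay": 0.0}
    for side, blocked in (("side_a", side_a_blocked), ("side_b", side_b_blocked)):
        # A blocked perpendicular neighbor turns its 0.1 slip into "stay in place"
        probs["stay" if blocked else side] += 0.1
    return probs

# The three cases: no side blocked, one side blocked, both sides blocked
for blocked in [(False, False), (True, False), (True, True)]:
    probs = outcome_probabilities(*blocked)
    assert math.isclose(sum(probs.values()), 1.0)  # each case is a valid distribution
    print(blocked, probs)
```

In every case the probabilities sum to 1, which is a quick way to confirm that the verbal description defines a proper distribution.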

*Reward:*

    R(s,a) = -0.1 for all non-terminal states
    R(hole) = -1,000 for holes
    R(goal) = 0

_2. Does there exist a policy for the above instance of the Lake MDP that can surely arrive at the goal without running the risk of falling into a hole? Explain_

No. In the current example the agent always risks falling into a hole: every route to the goal passes through at least one floor tile that is adjacent to a hole, and since a move can slip sideways, the agent may end up in that hole.

_3. Is a policy that keeps the agent on the lake forever reasonable? How would a reasonable policy for the above instances of the Lake MDP look like? Write it down explicitly._

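No. Since each state has a reward of -1, the accumulated reward keeps decreasing if the agent stays on the lake indefinitely, and it eventually becomes worse than falling into a hole.

A reasonable policy follows a short path from the start to the goal: down, down, right, down, right, right.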
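
Strictly speaking, a policy maps states to actions rather than listing moves. The sketch below writes the sequence above as such a mapping and checks that it reaches the goal when every intended move succeeds. It rests on several assumptions: the grid is the one used in the code cell later in this notebook (0 = viable, 1 = hole), states are 0-indexed `(row, col)` pairs, and only the states on the nominal path are listed. The names `GRID`, `POLICY`, and `follow_policy` are hypothetical.

```python
# Grid copied from the usage example in this notebook: 0 = viable cell, 1 = hole
GRID = [
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

# Policy as a state -> action mapping; only states on the nominal path are listed
POLICY = {
    (0, 0): "DOWN",
    (1, 0): "DOWN",
    (2, 0): "RIGHT",
    (2, 1): "DOWN",
    (3, 1): "RIGHT",
    (3, 2): "RIGHT",
}

def follow_policy(start, goal, policy, grid, max_steps=20):
    """Follow the policy assuming every intended move succeeds (no slipping)."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        dr, dc = MOVES[policy[state]]
        state = (state[0] + dr, state[1] + dc)
        # The nominal path must never step onto a hole
        assert grid[state[0]][state[1]] == 0, f"policy steps into a hole at {state}"
        path.append(state)
    return path

path = follow_policy((0, 0), (3, 3), POLICY, GRID)
print(path)
```

A complete policy would also assign an action to the states the agent can reach by slipping off this path; those entries are left out here.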


In [None]:
# python modules autoreload setup



In [10]:
%load_ext autoreload
%autoreload 2

from src.core.base_mdp import MDP

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [22]:
import numpy as np

class LakeMDP(MDP):
    def __init__(self, lake_grid: np.ndarray):
        """
        Initializes the LakeMDP with a given binary grid.
        
        :param lake_grid: A numpy array where 0 represents viable fields and 1 represents holes.
        """
        assert isinstance(lake_grid, np.ndarray), "lake_grid must be a NumPy array"
        # Accept any integer dtype (int32, int64, ...), not only the platform default
        assert np.issubdtype(lake_grid.dtype, np.integer), "lake_grid must be a binary integer array"
        assert lake_grid.shape[0] > 1 and lake_grid.shape[1] > 1, "lake_grid must be at least 2x2"
        
        # Enforce constraints on a copy so the caller's array is not modified:
        # start (0,0) and goal (-1,-1) must be viable
        lake_grid = lake_grid.copy()
        lake_grid[0, 0] = 0
        lake_grid[-1, -1] = 0

        self.lake_grid = lake_grid
        self.n_rows, self.n_cols = lake_grid.shape
        self.goal_state = (self.n_rows - 1, self.n_cols - 1)

        # Define possible actions (up, down, left, right)
        self._actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']

    @property
    def init_states(self) -> list:
        """Returns the starting position (0,0) as the only initial state."""
        return [(0, 0)]

    @property
    def states(self) -> list:
        """Returns all valid states (where grid value is 0)."""
        return [(r, c) for r in range(self.n_rows) for c in range(self.n_cols) if self.lake_grid[r, c] == 0]

    def get_actions_in_state(self, s) -> list:
        """Returns valid actions in state `s`, ensuring no moves go out of bounds or into holes."""
        r, c = s
        actions = []
        
        if r > 0 and self.lake_grid[r - 1, c] == 0:
            actions.append('UP')
        if r < self.n_rows - 1 and self.lake_grid[r + 1, c] == 0:
            actions.append('DOWN')
        if c > 0 and self.lake_grid[r, c - 1] == 0:
            actions.append('LEFT')
        if c < self.n_cols - 1 and self.lake_grid[r, c + 1] == 0:
            actions.append('RIGHT')

        return actions

    def get_reward(self, s) -> float:
        """Returns reward: -1 for normal states, 0 for holes, +10 for goal."""
        if s == self.goal_state:
            return 10  # Reward for reaching goal
        elif self.lake_grid[s] == 1:
            return 0  # No reward for falling into a hole
        return -1  # Default step penalty

    def get_transition_distribution(self, s, a) -> dict:
        """Returns the next state distribution with stochastic transitions."""
        if s == self.goal_state:
            return {s: 1.0}  # If in goal state, stay there

        r, c = s
        main_move = s  # Default to staying in place

        # Define movement directions
        moves = {
            'UP': (-1, 0),
            'DOWN': (1, 0),
            'LEFT': (0, -1),
            'RIGHT': (0, 1)
        }
        perpendicular_moves = {
            'UP': [('LEFT', (0, -1)), ('RIGHT', (0, 1))],
            'DOWN': [('LEFT', (0, -1)), ('RIGHT', (0, 1))],
            'LEFT': [('UP', (-1, 0)), ('DOWN', (1, 0))],
            'RIGHT': [('UP', (-1, 0)), ('DOWN', (1, 0))]
        }

        # Compute primary move
        if a in moves:
            dr, dc = moves[a]
            new_r, new_c = r + dr, c + dc
            if 0 <= new_r < self.n_rows and 0 <= new_c < self.n_cols and self.lake_grid[new_r, new_c] == 0:
                main_move = (new_r, new_c)

        # Compute stochastic perpendicular moves
        transitions = {main_move: 0.8}  # 80% chance to move as intended
        for _, (dr, dc) in perpendicular_moves[a]:
            new_r, new_c = r + dr, c + dc
            if 0 <= new_r < self.n_rows and 0 <= new_c < self.n_cols and self.lake_grid[new_r, new_c] == 0:
                transitions[(new_r, new_c)] = 0.1  # 10% chance to move sideways
            else:
                transitions[s] = transitions.get(s, 0) + 0.1  # Stay in place if move blocked

        return transitions

    def __repr__(self):
        return f"LakeMDP({self.n_rows}x{self.n_cols})"


# Usage
lake = np.array([[0, 0, 0, 0],
                 [0, 1, 0, 1],
                 [0, 0, 0, 1],
                 [1, 0, 0, 0]])

lake_mdp = LakeMDP(lake)

print("Initial States:", lake_mdp.init_states)
print("All States:", lake_mdp.states)
print("Actions in (2,1):", lake_mdp.get_actions_in_state((2,1)))
print("Reward for (3,3):", lake_mdp.get_reward((3,3)))
print("Transition from (2, 1) taking 'RIGHT':", lake_mdp.get_transition_distribution((2,1), 'RIGHT'))


Initial States: [(0, 0)]
All States: [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 2), (2, 0), (2, 1), (2, 2), (3, 1), (3, 2), (3, 3)]
Actions in (2,1): ['DOWN', 'LEFT', 'RIGHT']
Reward for (3,3): 10
Transition from (2, 1) taking 'RIGHT': {(2, 2): 0.8, (2, 1): 0.1, (3, 1): 0.1}
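
A quick sanity check on any transition distribution is that its probabilities are non-negative and sum to 1. The helper below is a small standalone sketch; `is_valid_distribution` is a hypothetical name and not part of the `MDP` base class. It is applied to the distribution printed above for state (2, 1) and action 'RIGHT'.

```python
import math

def is_valid_distribution(dist: dict, tol: float = 1e-9) -> bool:
    """True if all probabilities are non-negative and sum to 1 (within tol)."""
    non_negative = all(p >= 0 for p in dist.values())
    sums_to_one = math.isclose(sum(dist.values()), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

# Distribution copied from the output above: state (2, 1), action 'RIGHT'
dist = {(2, 2): 0.8, (2, 1): 0.1, (3, 1): 0.1}
print(is_valid_distribution(dist))
```

The same check could be run over every state and available action of a `LakeMDP` instance by passing the result of `get_transition_distribution(s, a)` to the helper.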
