### Key Components of the Code

1. **Environment Initialization**
The gird size and terminal stated are configurable:

```bash
env = ModelBasedGridWorld(grid_size=5, terminal_states=[(4,4)])
```

- `grid_size=5`: Create a 5x5 grid.
- `terminal_states=[(4, 4)]`:  Sets the bottom-right corner as the goal (terminal state).
- The agent starts at the top-left corner `(0, 0)`.

2. **Actions**
The agent has 4 discrete actions:

- **Up (0)**: Move up by one cell.
- **Right (1)**: Move right by one cell.
- **Down (2)**: Move down by one cell.
- **Left (3)**: Move left by one cell.

Each action transitions the agent to a new state, unless:
- It would move the agent out of bounds (e.g., moving up from the top row does nothing).

3. **State Transition**
The function step(action) defines how the agent moves within the grid:

```bash
# Movement definitions: {0: Up, 1: Right, 2: Down, 3: Left}
moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
```

- Each action modifies the agent's row and column coordinates.
- Boundary conditions ensure the agent doesn’t move outside the grid:

```bash
next_row = max(0, min(self.grid_size - 1, next_row))
next_col = max(0, min(self.grid_size - 1, next_col))
```

4. **Rewards**
Rewards guide the agent’s behavior:
- Reaching a terminal state (goal) yields a positive reward (e.g., +10).
- Each step taken without reaching the goal incurs a negative reward (e.g., -1).
This reward structure incentivizes the agent to reach the terminal state in as few steps as possible.

5. **Terminal States**
The episode ends when the agent reaches one of the **terminal states**:

```bash
done = next_state in self.terminal_states
```

6. **Rendering**
The `render()` method visualizes the current state of the grid:

The grid displays:
- `A`: The agent’s current position.
- `T`: Terminal states (goals).
- `.`: Empty cells.

Example visualization:
```bash
A . . . .
. . . . .
. . . . .
. . . . .
. . . . T
```

In [4]:
import numpy as np

class ModelBasedGridWorld:
    def __init__(self, grid_size=5, terminal_states=None, random_seed=None, stochastic=False):
        """
        Custom GridWorld environment for model-based reinforcement learning.
        :param grid_size: Size of the grid (grid_size x grid_size).
        :param terminal_states: List of terminal state positions (row, col).
        :param random_seed: Random seed for reproducibility.
        :param stochastic: Whether to use stochastic transitions.
        """
        self.grid_size = grid_size
        self.terminal_states = terminal_states or [(grid_size - 1, grid_size - 1)]
        self.state = (0, 0)  # Starting state
        self.transition_model = {}  # Transition probabilities P(s'|s, a)
        self.reward_model = {}  # Reward function R(s, a)
        self.stochastic = stochastic

        if random_seed is not None:
            np.random.seed(random_seed)

        self._build_models()

    def _build_models(self):
        """Construct the transition and reward models."""
        for row in range(self.grid_size):
            for col in range(self.grid_size):
                for action in range(4):  # Actions: 0=Up, 1=Right, 2=Down, 3=Left
                    state = (row, col)
                    self.transition_model[(state, action)] = self._compute_transition(state, action)
                    self.reward_model[(state, action)] = -1  # Default step penalty
                    
                    if state in self.terminal_states:
                        self.reward_model[(state, action)] = 10  # Reward for terminal state

    def _compute_transition(self, state, action):
        """
        Compute transition dynamics for a given state and action.
        If stochastic=True, return a set of possible outcomes with probabilities.
        """
        row, col = state
        moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # Up, Right, Down, Left

        if self.stochastic:
            # Define stochastic probabilities for actions
            probabilities = [0.8, 0.1, 0.05, 0.05]  # Action, +slight noise
            outcomes = []

            for i, (dr, dc) in moves.items():
                next_row, next_col = row + dr, col + dc
                next_row = max(0, min(self.grid_size - 1, next_row))
                next_col = max(0, min(self.grid_size - 1, next_col))
                outcomes.append(((next_row, next_col), probabilities[i]))
            return outcomes
        else:
            # Deterministic movement
            dr, dc = moves[action]
            next_row, next_col = row + dr, col + dc
            next_row = max(0, min(self.grid_size - 1, next_row))
            next_col = max(0, min(self.grid_size - 1, next_col))
            return (next_row, next_col)

    def reset(self):
        """Reset the environment to the initial state."""
        self.state = (0, 0)
        return self.state

    def step(self, action):
        """
        Take an action and return the next state, reward, done, and info.
        """
        if self.stochastic:
            outcomes = self.transition_model[(self.state, action)]
            next_state = outcomes[np.random.choice(len(outcomes), p=[p for _, p in outcomes])][0]
        else:
            next_state = self.transition_model[(self.state, action)]

        reward = self.reward_model[(self.state, action)]
        self.state = next_state
        done = self.state in self.terminal_states
        return next_state, reward, done, {}

    def render(self):
        """Render the current state of the environment."""
        grid = np.full((self.grid_size, self.grid_size), '.', dtype=str)
        for r, c in self.terminal_states:
            grid[r, c] = 'T'
        row, col = self.state
        grid[row, col] = 'A'
        print("\n".join([" ".join(row) for row in grid]))
        print()

    def get_transition_model(self):
        """Return the transition dynamics for model-based RL."""
        return self.transition_model

    def get_reward_model(self):
        """Return the reward function for model-based RL."""
        return self.reward_model


In [5]:
if __name__ == "__main__":
    # Initialize the environment
    env = ModelBasedGridWorld(grid_size=5, terminal_states=[(4, 4)], random_seed=42, stochastic=False)

    # Reset the environment
    state = env.reset()
    env.render()

    # Simulate an episode with a maximum step count
    done = False
    total_reward = 0
    max_steps = 100  # Limit the number of steps per episode
    steps = 0

    while not done and steps < max_steps:
        # Implement a simple policy: Move toward the terminal state
        row, col = state
        terminal_row, terminal_col = env.terminal_states[0]

        if row < terminal_row:
            action = 2  # Move Down
        elif row > terminal_row:
            action = 0  # Move Up
        elif col < terminal_col:
            action = 1  # Move Right
        elif col > terminal_col:
            action = 3  # Move Left

        # Take the action
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1

        print(f"Step: {steps}, Action: {action}, Next State: {next_state}, Reward: {reward}, Done: {done}")
        env.render()

    if steps == max_steps:
        print("Episode terminated due to maximum step limit.")

    print(f"Total Reward: {total_reward}")

A . . . .
. . . . .
. . . . .
. . . . .
. . . . T

Step: 1, Action: 2, Next State: (1, 0), Reward: -1, Done: False
. . . . .
A . . . .
. . . . .
. . . . .
. . . . T

Step: 2, Action: 2, Next State: (2, 0), Reward: -1, Done: False
. . . . .
. . . . .
A . . . .
. . . . .
. . . . T

Step: 3, Action: 2, Next State: (3, 0), Reward: -1, Done: False
. . . . .
. . . . .
. . . . .
A . . . .
. . . . T

Step: 4, Action: 2, Next State: (4, 0), Reward: -1, Done: False
. . . . .
. . . . .
. . . . .
. . . . .
A . . . T

Step: 5, Action: 2, Next State: (4, 0), Reward: -1, Done: False
. . . . .
. . . . .
. . . . .
. . . . .
A . . . T

Step: 6, Action: 2, Next State: (4, 0), Reward: -1, Done: False
. . . . .
. . . . .
. . . . .
. . . . .
A . . . T

Step: 7, Action: 2, Next State: (4, 0), Reward: -1, Done: False
. . . . .
. . . . .
. . . . .
. . . . .
A . . . T

Step: 8, Action: 2, Next State: (4, 0), Reward: -1, Done: False
. . . . .
. . . . .
. . . . .
. . . . .
A . . . T

Step: 9, Action: 2, Next Stat