### Key Components of the Code

1. **Environment Initialization**
The grid size and terminal states are configurable:

```python
env = ModelBasedGridWorld(grid_size=5, terminal_states=[(4, 4)])
```

- `grid_size=5`: Creates a 5x5 grid.
- `terminal_states=[(4, 4)]`: Sets the bottom-right corner as the goal (terminal state).
- The agent starts at the top-left corner `(0, 0)`.

2. **Actions**
The agent has 4 discrete actions:

- **Up (0)**: Move up by one cell.
- **Right (1)**: Move right by one cell.
- **Down (2)**: Move down by one cell.
- **Left (3)**: Move left by one cell.

Each action moves the agent to an adjacent cell unless the move would take it out of bounds, in which case the agent stays where it is (e.g., moving up from the top row does nothing).

3. **State Transition**
The `step(action)` method defines how the agent moves within the grid:

```python
# Movement definitions: {0: Up, 1: Right, 2: Down, 3: Left}
moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
```

- Each action modifies the agent's row and column coordinates.
- Boundary conditions ensure the agent doesn’t move outside the grid:

```python
next_row = max(0, min(self.grid_size - 1, next_row))
next_col = max(0, min(self.grid_size - 1, next_col))
```
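
The move table and the clamping step can be combined into one small standalone function. This is a minimal sketch that mirrors the logic described above; `apply_action` and `MOVES` are hypothetical names used for illustration and are not part of the environment class.

```python
# Movement definitions: {0: Up, 1: Right, 2: Down, 3: Left}
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def apply_action(state, action, grid_size=5):
    """Hypothetical helper: return the cell reached from `state` via `action`."""
    row, col = state
    dr, dc = MOVES[action]
    # Clamp each coordinate to [0, grid_size - 1] so the agent stays on the grid
    next_row = max(0, min(grid_size - 1, row + dr))
    next_col = max(0, min(grid_size - 1, col + dc))
    return (next_row, next_col)


print(apply_action((0, 0), 0))  # Up from the top row: stays at (0, 0)
print(apply_action((0, 0), 1))  # Right: moves to (0, 1)
print(apply_action((4, 4), 2))  # Down from the bottom row: stays at (4, 4)
```

Because out-of-bounds moves are clamped rather than rejected, such a move still counts as a step and still incurs the step reward.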

4. **Rewards**
Rewards guide the agent’s behavior:
- Reaching a terminal state (goal) yields a positive reward (e.g., +10).
- Each step taken without reaching the goal incurs a negative reward (e.g., -1).

This reward structure incentivizes the agent to reach the terminal state in as few steps as possible.
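
To make the incentive concrete, the total reward of an episode can be worked out from its length. The sketch below assumes the example values above (+10 at the goal, -1 per other step) and no discounting; `episode_return` is a hypothetical helper, not part of the environment.

```python
def episode_return(num_steps, step_reward=-1, goal_reward=10):
    """Undiscounted return of an episode that reaches the goal on its last step."""
    if num_steps < 1:
        raise ValueError("An episode that reaches the goal needs at least one step.")
    # Every step except the last earns step_reward; the last one earns goal_reward
    return (num_steps - 1) * step_reward + goal_reward


# On a 5x5 grid the shortest path from (0, 0) to (4, 4) takes 8 moves
print(episode_return(8))   # 7 * (-1) + 10 = 3
print(episode_return(12))  # 11 * (-1) + 10 = -1
```

Every extra step lowers the return by one, so the shortest path yields the highest total reward.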

5. **Terminal States**
The episode ends when the agent reaches one of the **terminal states**:

```python
done = next_state in self.terminal_states
```
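
The membership test compares the next state, a `(row, col)` tuple, against the entries of `terminal_states`. In Python a tuple never compares equal to a list, so the check only succeeds when the terminal states are also given as tuples. The sketch below illustrates this; converting the entries into a set of tuples is an optional, hypothetical addition and not something the environment does.

```python
next_state = (4, 4)

print(next_state in [(4, 4)])  # True: tuple compared with tuple
print(next_state in [[4, 4]])  # False: a tuple never equals a list

# Hypothetical normalization: convert every entry to a tuple and keep them in a set
terminal_states = {tuple(s) for s in [[4, 4], (0, 4)]}
done = next_state in terminal_states
print(done)  # True
```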

6. **Rendering**
The `render()` method prints the current state of the grid using the following symbols:
- `A`: The agent’s current position.
- `T`: Terminal states (goals).
- `.`: Empty cells.

Example visualization:

```text
A . . . .
. . . . .
. . . . .
. . . . .
. . . . T
```
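
The same layout can be produced without NumPy by building the grid as a string. This is a minimal sketch under the symbol conventions above; `render_grid` is a hypothetical helper and not the environment's `render()` method.

```python
def render_grid(agent, terminal_states, grid_size=5):
    """Hypothetical helper: return the grid as a multi-line string."""
    rows = []
    for r in range(grid_size):
        cells = []
        for c in range(grid_size):
            if (r, c) == agent:
                cells.append("A")  # the agent takes precedence over a terminal marker
            elif (r, c) in terminal_states:
                cells.append("T")
            else:
                cells.append(".")
        rows.append(" ".join(cells))
    return "\n".join(rows)


print(render_grid(agent=(0, 0), terminal_states=[(4, 4)]))
```

Returning a string instead of printing directly makes the output easy to compare in a test.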

In [None]:
import numpy as np
import gym
from gym import spaces

In [None]:
class ModelBasedGridWorld(gym.Env):
    def __init__(self, grid_size=5, terminal_states=None, random_seed=None):
        """
        Custom GridWorld environment for model-based reinforcement learning.
        :param grid_size: Size of the grid (grid_size x grid_size).
        :param terminal_states: List of terminal state positions (row, col).
        :param random_seed: Random seed for reproducibility.
        """
        super().__init__()
        self.grid_size = grid_size
        self.terminal_states = terminal_states or [(grid_size - 1, grid_size - 1)]
        self.random_seed = random_seed
        self._setup_environment()

    def _setup_environment(self):
        # Initialize state and action spaces
        self.action_space = spaces.Discrete(4)  # Actions: 0=Up, 1=Right, 2=Down, 3=Left
        self.observation_space = spaces.MultiDiscrete([self.grid_size, self.grid_size])
        self.state = (0, 0)  # Start at top-left corner
        self.reward_model = {}  # Reward function R(s, a)
        self.transition_model = {}  # Transition dynamics P(s'|s, a)
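        if self.random_seed is not None:  # 'is not None' so that a seed of 0 is honored
            np.random.seed(self.random_seed)
            # action_space.sample() draws from the space's own RNG, so seed it as well
            self.action_space.seed(self.random_seed)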

    def step(self, action):
        """Take an action and return the next state, reward, done, and info."""
        if action < 0 or action >= self.action_space.n:
            raise ValueError("Invalid action.")

        row, col = self.state
        if self.state in self.terminal_states:
            return self.state, 0, True, {}

        moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # Up, Right, Down, Left
        dr, dc = moves[action]
        next_row, next_col = row + dr, col + dc

        # Ensure the next state stays within bounds
        next_row = max(0, min(self.grid_size - 1, next_row))
        next_col = max(0, min(self.grid_size - 1, next_col))
        next_state = (next_row, next_col)

        # Reward is +10 for reaching a terminal state and -1 for every other step
        reward = -1 if next_state not in self.terminal_states else 10
        done = next_state in self.terminal_states

        self.state = next_state

        return self.state, reward, done, {}

    def reset(self):
        """Reset the environment to the initial state."""
        self.state = (0, 0)
        return self.state

    def render(self, mode='human'):
        """Render the current state of the environment."""
        grid = np.full((self.grid_size, self.grid_size), '.')  # Fill every cell with '.'
        for r, c in self.terminal_states:
            grid[r, c] = 'T'  # Mark terminal states
        row, col = self.state
        grid[row, col] = 'A'  # Mark agent position
        print("\n".join([" ".join(row) for row in grid]))
        print()

    def get_transition_model(self):
        """Return the transition dynamics for model-based RL."""
        return self.transition_model

    def get_reward_model(self):
        """Return the reward function for model-based RL."""
        return self.reward_model
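
In the class above, `transition_model` and `reward_model` are initialized as empty dictionaries and never filled, so both getters return `{}`. One possible way to populate them is to enumerate every state-action pair using the same deterministic dynamics as `step()`. The sketch below is an assumption about how that could look, not the author's implementation; `build_models` and the `(state, action)` key layout are hypothetical.

```python
def build_models(grid_size=5, terminal_states=((4, 4),)):
    """Hypothetical helper: tabulate P(s'|s, a) and R(s, a) for the grid."""
    moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # Up, Right, Down, Left
    terminals = set(terminal_states)
    transition_model, reward_model = {}, {}
    for row in range(grid_size):
        for col in range(grid_size):
            state = (row, col)
            for action, (dr, dc) in moves.items():
                if state in terminals:
                    # step() keeps the agent in place with reward 0 once terminal
                    next_state, reward = state, 0
                else:
                    next_state = (
                        max(0, min(grid_size - 1, row + dr)),
                        max(0, min(grid_size - 1, col + dc)),
                    )
                    reward = 10 if next_state in terminals else -1
                # Deterministic dynamics: a single successor with probability 1.0
                transition_model[(state, action)] = {next_state: 1.0}
                reward_model[(state, action)] = reward
    return transition_model, reward_model


P, R = build_models()
print(len(P))          # 25 states * 4 actions = 100 entries
print(P[((0, 0), 0)])  # {(0, 0): 1.0}
print(R[((4, 3), 1)])  # 10
```

Tables of this shape could be assigned to `self.transition_model` and `self.reward_model` in `_setup_environment()` so that the getters return usable models.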

In [None]:
env = ModelBasedGridWorld(grid_size=5, terminal_states=[(4, 4)])

state = env.reset()
env.render()

# Simulate a random episode
done = False
while not done:
    action = env.action_space.sample()  # Random action
    next_state, reward, done, _ = env.step(action)
    env.render()
    print(f"Action: {action}, Reward: {reward}")

A . . . .
. . . . .
. . . . .
. . . . .
. . . . T

A . . . .
. . . . .
. . . . .
. . . . .
. . . . T

Action: 0, Reward: -1
. A . . .
. . . . .
. . . . .
. . . . .
. . . . T

Action: 1, Reward: -1
. . . . .
. A . . .
. . . . .
. . . . .
. . . . T

Action: 2, Reward: -1
. . . . .
A . . . .
. . . . .
. . . . .
. . . . T

Action: 3, Reward: -1
. . . . .
. . . . .
A . . . .
. . . . .
. . . . T

Action: 2, Reward: -1
. . . . .
. . . . .
A . . . .
. . . . .
. . . . T

Action: 3, Reward: -1
. . . . .
. . . . .
. . . . .
A . . . .
. . . . T

Action: 2, Reward: -1
. . . . .
. . . . .
. . . . .
. A . . .
. . . . T

Action: 1, Reward: -1
. . . . .
. . . . .
. . . . .
. . A . .
. . . . T

Action: 1, Reward: -1
. . . . .
. . . . .
. . A . .
. . . . .
. . . . T

Action: 0, Reward: -1
. . . . .
. . . . .
. . . A .
. . . . .
. . . . T

Action: 1, Reward: -1
. . . . .
. . . . .
. . . . .
. . . A .
. . . . T

Action: 2, Reward: -1
. . . . .
. . . . .
. . . . .
. . . . .
. . . A T

Action: 2, Reward: -1
