# Self-study try-it activity 24.1: Grid world experiment in Python

A grid world is a simplified environment used to study how intelligent agents make decisions over time. It’s a grid-based map where each cell represents a state, and the agent can move in four directions: up, down, left, or right. Some cells offer rewards, others penalties, and some are terminal states where the episode ends.

This set-up is ideal for understanding key concepts in Markov decision processes (MDPs) and reinforcement learning, such as:

- Value iteration and policy iteration

- Discounting future rewards

- Stochastic transitions (noise)

- Optimal policy computation

In [None]:
#Install aima3 if it is not already installed
!pip install aima3

In [None]:
#Import the necessary libraries
import aima3
import numpy as np
import matplotlib.pyplot as plt
import inspect
import aima3.mdp

print(inspect.getfile(aima3.mdp.value_iteration))

In [None]:
from aima3 import mdp

A grid world environment is created using the aima3 library to model sequential decision-making. The agent navigates a 3 × 4 grid, aiming to maximise cumulative rewards while avoiding penalties.

In [None]:
#Sequential decision environment
grid_world = mdp.GridMDP([[-0.04, -0.04, -0.04, +1],
                          [-0.04, None, -0.04, -1],
                          [-0.04, -0.04, -0.04, -0.04]],
                          terminals=[(3, 2), (3, 1)])

Once the grid world environment is defined, you can apply two classic algorithms to compute optimal strategies for the agent.

Value iteration computes the value of each state by repeatedly updating its expected discounted reward until the values stop changing. Policy iteration computes the optimal policy, which is the best action to take in each state, by alternating between evaluating the current policy and improving it. Both methods allow the agent to make informed decisions that maximise cumulative rewards while navigating the grid.
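To make the update rule concrete, here is a minimal, library-free sketch of value iteration on the same 3 × 4 grid. It is not the `aima3` implementation. It assumes that rewards are attached to states and that a move succeeds with probability 0.8 and slips to either side with probability 0.1. All helper names (`move`, `transitions`, `sketch_value_iteration`, `greedy_policy`) are hypothetical.

```python
# Library-free sketch of value iteration on the 3 x 4 grid above.
# Assumed transition model: the intended move succeeds with probability 0.8
# and the agent slips to each side with probability 0.1.
GRID = [[-0.04, -0.04, -0.04, +1],
        [-0.04, None, -0.04, -1],
        [-0.04, -0.04, -0.04, -0.04]]
TERMINALS = {(3, 2), (3, 1)}
ACTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # right, up, left, down

# States are (x, y) with y = 0 on the bottom row; None marks a wall.
REWARD = {(x, y): r
          for y, row in enumerate(reversed(GRID))
          for x, r in enumerate(row) if r is not None}


def move(state, action):
    """Return the neighbouring state, or stay put if the move is blocked."""
    target = (state[0] + action[0], state[1] + action[1])
    return target if target in REWARD else state


def transitions(state, action):
    """Return (probability, next_state) pairs for a noisy move."""
    dx, dy = action
    slip_right, slip_left = (dy, -dx), (-dy, dx)
    return [(0.8, move(state, action)),
            (0.1, move(state, slip_right)),
            (0.1, move(state, slip_left))]


def expected_value(values, state, action):
    """Expected value of the next state when taking action in state."""
    return sum(p * values[s2] for p, s2 in transitions(state, action))


def sketch_value_iteration(gamma=0.9, tolerance=1e-6):
    """Repeat the Bellman update until the largest change is below tolerance."""
    values = {s: 0.0 for s in REWARD}
    while True:
        updated = {}
        for s in REWARD:
            future = 0.0 if s in TERMINALS else max(
                expected_value(values, s, a) for a in ACTIONS)
            updated[s] = REWARD[s] + gamma * future
        change = max(abs(updated[s] - values[s]) for s in REWARD)
        values = updated
        if change < tolerance:
            return values


def greedy_policy(values):
    """Pick, in every non-terminal state, the action with the best expected value."""
    return {s: (None if s in TERMINALS else
                max(ACTIONS, key=lambda a: expected_value(values, s, a)))
            for s in REWARD}


sketch_values = sketch_value_iteration()
sketch_policy = greedy_policy(sketch_values)
print(sketch_policy[(2, 2)])  # action chosen in the cell next to the +1 terminal
```

The sketch keeps the reward, transition model and update loop separate, which is the same decomposition the library functions rely on.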

In [None]:
values = mdp.value_iteration(grid_world)

In [None]:
policy = mdp.policy_iteration(grid_world)

When working with grid world environments, `value_iteration` and `policy_iteration` return dictionaries keyed by `(x, y)` state. To visualise these results as a grid, you need to convert them into a matrix that matches the layout of the grid.

In [None]:
def convert_to_grid(policy, base_grid, dtype=float):
    #The matrix has one row per grid row and one column per grid column
    grid_shape = len(base_grid), len(base_grid[0])
    mat = np.full(grid_shape, fill_value=np.nan).astype(dtype)
    for (x, y), v in policy.items():
        #Keys are (column, row) pairs, whereas the matrix is indexed as [row, column]
        mat[y, x] = v
    #GridMDP puts y = 0 on the bottom row, so flip vertically to draw that row last
    return np.flipud(mat)
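The indexing convention is easy to get wrong, so here is a small sketch of the same idea without NumPy. It assumes, as in `GridMDP`, that states are `(x, y)` pairs with `y = 0` on the bottom row. The helper `dict_to_rows` and the example labels are hypothetical.

```python
def dict_to_rows(cell_values, n_rows, n_cols, fill=None):
    """Lay out {(x, y): value} as a list of rows, top row first."""
    rows = [[fill] * n_cols for _ in range(n_rows)]
    for (x, y), value in cell_values.items():
        # y = 0 is the bottom row, so it becomes the last row of the list
        rows[n_rows - 1 - y][x] = value
    return rows


example = {(0, 0): 'start', (3, 2): '+1', (3, 1): '-1'}
layout = dict_to_rows(example, n_rows=3, n_cols=4, fill='.')
for row in layout:
    print(' '.join(f'{cell:>5}' for cell in row))
```

Printing the rows in order shows the `+1` cell in the top-right corner and the start cell in the bottom-left corner, as in the grid definition.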

After computing the values or policy for a grid world, it's helpful to visualise them as a heat map. The `plot_grid()` function uses Matplotlib to display a grid with colour-coded values and numeric labels.

In [None]:
def plot_grid(grid):
    fig, ax = plt.subplots()
    ax.axis('off')
    img = ax.imshow(grid, cmap=plt.get_cmap("viridis"), animated=True)
    for (i, j), z in np.ndenumerate(grid):
        ax.text(j, i, '{:0.2f}'.format(z), ha='center', va='center')
    cbar = fig.colorbar(img)
    plt.show()

After computing the state values using value iteration, you convert the resulting dictionary into a 2D grid format for visualisation. This helps align the values with the spatial layout of the environment.

In [None]:
values_mat = convert_to_grid(values, grid_world.grid, dtype=float);

In [None]:
plot_grid(values_mat)

After computing the optimal policy using policy iteration, convert the result into a 2D grid layout that matches the environment. This makes it easier to interpret and visualise the agent’s recommended action for each state.



In [None]:
policy_mat = convert_to_grid(policy, grid_world.grid, dtype=object)
policy_mat.tolist()

Recall that your grid conventions are:
- Move up: `(0, 1)`
- Move down: `(0, -1)`
- Move left: `(-1, 0)`
- Move right: `(1, 0)`
- Do nothing: `None`
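As a quick illustration of these conventions, the sketch below adds action vectors to an `(x, y)` state. It ignores walls, grid boundaries and noise, and the names `ACTION_VECTORS`, `apply_action` and `follow` are hypothetical helpers, not part of `aima3`.

```python
# Actions are (dx, dy) offsets added to an (x, y) state.
ACTION_VECTORS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}


def apply_action(state, action):
    """Return the state reached by an unobstructed move; None means stay put."""
    if action is None:
        return state
    return (state[0] + action[0], state[1] + action[1])


def follow(start, action_names):
    """Apply a sequence of named actions, one after another."""
    state = start
    for name in action_names:
        state = apply_action(state, ACTION_VECTORS[name])
    return state


end_state = follow((0, 0), ['up', 'up', 'right', 'right', 'right'])
print(end_state)
```

Starting in the bottom-left corner of the 3 × 4 grid, this sequence of moves ends in the top-right cell, which is the `+1` terminal.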

In [None]:
from aima3.mdp import value_iteration
from IPython.core.getipython import get_ipython
get_ipython().run_line_magic('psource', 'value_iteration')

### To-do 1:

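- Define the cliff world MDP, which is a 4 × 12 grid with rewards of:
	- `−1` for each step
	- `−100` for falling off the cliff (states in the third row from the bottom, `y = 2`, excluding the start and goal)
	- `+100` for reaching the goal at `(11, 2)`
- Provide a start state of `(0, 2)`

Note that `GridMDP` addresses states as `(x, y)` with `y = 0` on the bottom row, so in a four-row grid the row `y = 2` is the second row of the list you pass in.

In [None]:
#Write your code here
#Rows are written top to bottom; GridMDP reverses them so that y = 0 is the bottom row
cliff_world_big = mdp.GridMDP(
    [[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
     [-1, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 100],  #y = 2: start, cliff, goal
     [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
     [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]],
    terminals=[(11, 2)],  #Goal state
    init=(0, 2)  #Start state
)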


Once the cliff world environment is defined, you use value iteration to compute the optimal value for each state. This tells you how good it is for the agent to be in a particular state, assuming it follows the best possible policy.

- Use `mdp.value_iteration()` to compute `cliff_values_big`.
- Use `mdp.policy_iteration()` to compute `policy_values_big`.
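
For reference, the quantity that value iteration converges to can be written as the Bellman equation. The form below assumes, as in the AIMA textbook convention, that the reward $R(s)$ is attached to the state, with $P(s' \mid s, a)$ denoting the transition model and $\gamma$ the discount factor:

```latex
U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')
\qquad
\pi^{*}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')
```

The first expression defines the optimal value of a state, and the second reads the optimal policy off those values.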


In [None]:
#Write your code here
cliff_values_big = mdp.value_iteration(cliff_world_big)
policy_values_big = mdp.policy_iteration(cliff_world_big)

### To-do 2:

- Use `convert_to_grid()` to convert `cliff_values_big` into a 2D grid layout that matches the environment.
- Assign it to `values_mat_big`.


In [None]:
#Write your code here
values_mat_big = convert_to_grid(cliff_values_big, cliff_world_big.grid, dtype=float)


## Part 1: varying discount rates

You’ll test how different discount rates affect the agent’s trajectory. Try `gamma = [0.9, 0.95, 0.99]`.

In [None]:
gammas = [0.9, 0.95, 0.99]
for gamma in gammas:
    cliff_world_big.gamma = gamma #Update the discount factor on the existing MDP
    values = mdp.value_iteration(cliff_world_big)
    policy = mdp.best_policy(cliff_world_big, values)
    print(f"\nDiscount Rate: {gamma}")
    print(cliff_world_big.to_arrows(policy))

The discount factor `γ` (gamma) determines how much the agent values future rewards compared to immediate rewards.

- A lower `γ` (e.g. 0.9) makes the agent more short-sighted, prioritising immediate gains.

- A higher `γ` (e.g. 0.99) makes the agent more far-sighted, planning for long-term rewards.
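
The effect of `γ` can be checked with a little arithmetic. The sketch below compares two hypothetical reward sequences, a short route and a detour four steps longer, both ending with the `+100` goal reward. The sequences and the helper `discounted_return` are illustrative assumptions, not output from the cliff world above.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * reward_t over a sequence of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


short_route = [-1] * 11 + [100]  # hypothetical: 11 steps, then the goal
long_route = [-1] * 15 + [100]   # hypothetical: a detour of 4 extra steps

detour_cost = {}
for gamma in (0.9, 0.95, 0.99):
    short = discounted_return(short_route, gamma)
    long = discounted_return(long_route, gamma)
    detour_cost[gamma] = short - long
    print(f"gamma={gamma}: short={short:.2f}, long={long:.2f}, "
          f"cost of detour={detour_cost[gamma]:.2f}")
```

Under these assumptions the detour costs less at `γ = 0.99` than at `γ = 0.9`, which is one reason a far-sighted agent is more willing to take a longer route.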

## Part 2: adding noise to transitions

In real-world environments, actions do not always lead to predictable outcomes. To simulate this uncertainty, define a custom MDP class called `NoisyCliffMDP`, which overrides the transition model of the cliff world: the intended move succeeds with probability `1 − noise`, and the agent slips to either side with probability `noise / 2` each. Try noise levels of 0.1 and 0.25.

In [None]:
class NoisyCliffMDP(mdp.GridMDP):
    def __init__(self, grid, terminals, init=(0, 0), gamma=0.9, noise=0.1):
        #Store the noise level before the superclass constructor runs
        self.noise = noise
        #grid is a plain list of rows, written top row first
        super().__init__(grid, terminals, init, gamma)

    def T(self, state, action):
        #Return (probability, next_state) pairs for taking action in state
        if action is None:
            return [(0.0, state)]
        return [(1 - self.noise, self.go(state, action)),
                (self.noise / 2, self.go(state, mdp.turn_right(action))),
                (self.noise / 2, self.go(state, mdp.turn_left(action)))]

noise_levels = [0.1, 0.25]
for noise in noise_levels:
    #cliff_world_big.grid is stored bottom row first and GridMDP reverses its
    #input in place, so pass a reversed copy to leave cliff_world_big unchanged
    noisy_mdp = NoisyCliffMDP(cliff_world_big.grid[::-1], terminals=[(11, 2)],
                              init=(0, 2), noise=noise)
    values = mdp.value_iteration(noisy_mdp)
    policy = mdp.best_policy(noisy_mdp, values)
    print(f"\nNoise Level: {noise}")
    print(noisy_mdp.to_arrows(policy))

As noise increases:

- The agent becomes more risk-averse, avoiding paths near the cliff.

- Policies shift to favour safer routes, even if they are longer.

This demonstrates how uncertainty affects decision-making in sequential environments.
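
A back-of-the-envelope calculation shows why the edge of the cliff becomes unattractive. The sketch below assumes that every move made alongside the cliff slips into it with probability `noise / 2`, independently of earlier moves. The helpers `slip_outcomes` and `survival_probability` are hypothetical.

```python
def slip_outcomes(noise):
    """Probabilities of the intended move and of each sideways slip."""
    return {'intended': 1 - noise, 'slip_right': noise / 2, 'slip_left': noise / 2}


def survival_probability(noise, steps):
    """Chance of never slipping towards the cliff in `steps` moves along its edge."""
    return (1 - noise / 2) ** steps


for noise in (0.1, 0.25):
    # The three outcomes form a probability distribution
    assert abs(sum(slip_outcomes(noise).values()) - 1) < 1e-12
    print(f"noise={noise}: P(no fall in 10 steps) = {survival_probability(noise, 10):.3f}")
```

With ten steps along the edge, the chance of staying safe drops from roughly 0.60 at a noise level of 0.1 to roughly 0.26 at 0.25, so paying a few extra `−1` steps to keep away from the cliff becomes worthwhile.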

### To-do 3:

In a cliff world environment, how does increasing the noise parameter influence the agent’s optimal policy?

Choose the most accurate explanation:

A. The agent becomes more aggressive, taking riskier paths to reach the goal faster.

B. The agent ignores the cliff and treats all paths equally.

C. The agent becomes more cautious, preferring longer but safer routes that avoid the cliff.

D. The agent always chooses the shortest path, regardless of noise.

In [None]:
#Input your choice (A/B/C/D)

**The correct answer is: C.**

The agent becomes more cautious, preferring longer but safer routes that avoid the cliff.

In a cliff world environment, the agent must navigate near dangerous cliff edges that result in large negative rewards (or termination) if entered. The noise parameter introduces stochasticity — meaning the agent might not move exactly in the intended direction.

Higher noise means a greater chance of unintended movement (e.g. slipping into the cliff).

To avoid falling off, the agent learns to favour safer paths, even if they are longer.

This behaviour reflects risk aversion due to increased uncertainty.

### Visualising value grids

In [None]:
#Plot the cliff world values computed earlier with value iteration
values_mat_big = convert_to_grid(cliff_values_big, cliff_world_big.grid, dtype=float)
plot_grid(values_mat_big)

`plot_policy_grid()` visualises a grid world policy by mapping directional actions to arrows and displaying them on a Matplotlib axis.


In [None]:


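def plot_policy_grid(policy_mat, title, ax):
    n_rows, n_cols = policy_mat.shape
    ax.set_title(title)
    ax.axis('off')
    #ax.text() does not rescale the axes, so set the limits to cover every cell
    #(the y-axis is inverted so that row 0 is drawn at the top)
    ax.set_xlim(-0.5, n_cols - 0.5)
    ax.set_ylim(n_rows - 0.5, -0.5)
    arrows = {(1, 0): '→', (-1, 0): '←', (0, 1): '↑', (0, -1): '↓'}
    for (i, j), action in np.ndenumerate(policy_mat):
        if isinstance(action, tuple):
            arrow = arrows.get(action, '.')
        elif action is None:
            arrow = 'G'  #Terminal state: no action to take
        else:
            arrow = '.'  #Cell without a policy entry
        ax.text(j, i, arrow, ha='center', va='center', fontsize=12)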

In [None]:
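#Rows are written top to bottom; GridMDP puts y = 0 on the bottom row,
#so the cliff row (y = 2) is the second of the four rows
cliff_grid = [[-1]*12 for _ in range(4)]
cliff_grid[1][1:11] = [-100]*10  #Cliff cells between the start (0, 2) and the goal
cliff_grid[1][11] = 100          #Goal at (11, 2)
terminals = [(11, 2)]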


### Policy comparison in a cliff world MDP

Let's visualise how different discount rates (`gamma`) and noise levels affect the optimal policy in a cliff world environment using value iteration. The cell below creates a 2 × 2 subplot grid:

- **Top left**: low noise (0.1), moderate discount (0.9)
- **Top right**: high noise (0.25), moderate discount (0.9)
- **Bottom left**: low noise (0.1), high discount (0.99)
- **Bottom right**: high noise (0.25), high discount (0.99)

Each subplot shows the agent’s preferred action in each state, revealing how increased noise leads to more cautious policies and how a higher discount factor encourages long-term planning. The custom `NoisyCliffMDP` class overrides the transition model to simulate stochastic movement.


In [None]:
class NoisyCliffMDP(mdp.GridMDP):
    def __init__(self, grid, terminals, init=(0, 0), gamma=0.9, noise=0.25):
        self.noise = noise
        super().__init__(grid, terminals, init, gamma)

    def T(self, state, action):
        #Return (probability, next_state) pairs for taking action in state
        if action is None:
            return [(0.0, state)]
        return [(1 - self.noise, self.go(state, action)),
                (self.noise / 2, self.go(state, mdp.turn_right(action))),
                (self.noise / 2, self.go(state, mdp.turn_left(action)))]

#(gamma, noise) pairs in the order top left, top right, bottom left, bottom right
configs = [(0.9, 0.1), (0.9, 0.25), (0.99, 0.1), (0.99, 0.25)]

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, (gamma, noise) in zip(axes.flat, configs):
    #GridMDP reverses the list it receives in place, so give each MDP its own copy
    noisy_mdp = NoisyCliffMDP(list(cliff_grid), terminals, init=(0, 2),
                              gamma=gamma, noise=noise)
    noisy_values = mdp.value_iteration(noisy_mdp)
    noisy_policy = mdp.best_policy(noisy_mdp, noisy_values)
    #dtype=object lets the matrix hold action tuples and None
    policy_grid = convert_to_grid(noisy_policy, cliff_grid, dtype=object)
    plot_policy_grid(policy_grid, f"Discount rate: {gamma}, noise level: {noise}", ax)

plt.tight_layout()
plt.show()

In this notebook, you explored how intelligent agents make decisions in uncertain environments using grid world and the aima3 library. You:

- Defined deterministic and stochastic MDPs, including a cliff world scenario
- Applied **value iteration** and **policy iteration** to compute optimal strategies
- Visualised state values and policies using heat maps and directional arrows
- Investigated how varying the **discount factor** (γ) affects short-term vs long-term planning
- Observed how increasing **noise** in transitions makes the agent more risk-averse, preferring safer paths

These experiments deepen your understanding of how agents balance reward, risk and uncertainty — core ideas in reinforcement learning. You're now equipped to explore more advanced topics such as Q-learning, policy gradients and real-world applications in robotics and recommendation systems.