# Exercise: Move on a grid

## Problem description
We move on a `grid` of size $n \times n$, where each pair $(i, j)$ with $0 \leq i < n$ and $0 \leq j < n$ is a `location`.

We always start from $(0,0)$ and need to reach one of the final locations $(\cdot, n-1)$ as fast as possible.

On the grid we can:
- `walk`: always possible; we move East, North, West, or South to the adjacent location in that direction.
- `bus`: always possible; we move two locations East, North, West, or South. However, with probability $b$ we have to wait for the bus, which means we do not move.
- `train`: available only at locations $(i, j)$ where $i \bmod 2 = 0$ and $j \bmod 2 = 0$. Where a train is available, we can take it to one of the next stations. However, with probability $c$ we have to wait, which means we do not move.
- Every move takes the same amount of time.
- We can never move outside the grid.
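The `walk` and `bus` rules can be sketched as a small transition function. This is only an illustration and not the code inside `grid_move`: the names `transitions` and `DIRECTIONS` are hypothetical, and treating a move that would leave the grid as simply unavailable is an assumption. States are read as $(x, y)$ with East as $+x$ and North as $+y$, and action labels combine `W` or `B` with a direction letter, matching the transitions table printed by the notebook. Trains are left out because the text does not pin down which stations count as "next".

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]

# Unit steps per direction letter; a state is (x, y), East is +x and North is +y.
DIRECTIONS: Dict[str, State] = {"E": (1, 0), "N": (0, 1), "W": (-1, 0), "S": (0, -1)}


def transitions(state: State, n: int, b: float) -> Dict[str, List[Tuple[State, float]]]:
    """Return {action: [(next_state, probability), ...]} for walk and bus moves."""
    x, y = state
    result: Dict[str, List[Tuple[State, float]]] = {}
    for letter, (dx, dy) in DIRECTIONS.items():
        for prefix, step in (("W", 1), ("B", 2)):
            target = (x + step * dx, y + step * dy)
            # Assumption: a move that would leave the grid is not offered at all.
            if not (0 <= target[0] < n and 0 <= target[1] < n):
                continue
            if prefix == "W":
                # Walking always succeeds.
                result[prefix + letter] = [(target, 1.0)]
            else:
                # With probability b we wait (stay put); otherwise the bus moves us two cells.
                result[prefix + letter] = [(target, 1.0 - b), (state, b)]
    return result


moves = transitions((0, 1), n=4, b=0.5)
print(moves["WE"])
print(moves["BE"])
```

For the state `(0, 1)` with `b=0.5`, the `WE` and `BE` entries correspond to rows 2 to 4 of the transitions table.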

In [1]:
from grid_move import GridMoveMDP

In [2]:
mdp = GridMoveMDP(size=(4, 4), b=.5, c=.4)

## Model

In [3]:
from utils import transitions_table, mdp_to_graph, plot_mdp

In [4]:
T = transitions_table(mdp)
T

| | from_state | action | to_state | reward | probability |
|---|---|---|---|---|---|
| 0 | (0, 1) | WS | (0, 0) | -1 | 1.0 |
| 1 | (0, 1) | WN | (0, 2) | -1 | 1.0 |
| 2 | (0, 1) | WE | (1, 1) | -1 | 1.0 |
| 3 | (0, 1) | BE | (2, 1) | -1 | 0.5 |
| 4 | (0, 1) | BE | (0, 1) | -1 | 0.5 |
| ... | ... | ... | ... | ... | ... |
| 131 | (1, 3) | BS | (1, 3) | -1 | 0.5 |
| 132 | (1, 3) | WE | (2, 3) | -1 | 1.0 |
| 133 | (1, 3) | BE | (3, 3) | -1 | 0.5 |
| 134 | (1, 3) | BE | (1, 3) | -1 | 0.5 |


In [5]:
net = plot_mdp(mdp_to_graph(mdp))
net.show('grid-move.html')

## Value iteration

In [6]:
from IPython.display import display, clear_output
from algorithms import value_iteration
from utils import show_value_iterations

In [7]:
optimal_value, optimal_policy, value_history, policy_history = value_iteration(mdp=mdp, epsilon=1e-10)

In [8]:
optimal_value, optimal_policy

({(0, 1): -1.8181818181100367,
  (1, 2): -1.0,
  (2, 1): -1.0,
  (0, 0): -2.6363636362200733,
  (3, 1): 0,
  (1, 1): -1.8181818181100367,
  (0, 3): 0.0,
  (2, 0): -1.0,
  (3, 0): 0,
  (2, 3): 0.0,
  (0, 2): -1.0,
  (3, 3): 0.0,
  (2, 2): -1.0,
  (1, 0): -1.8181818181100367,
  (3, 2): 0,
  (1, 3): 0.0},
 {(0, 1): 'BN',
  (1, 2): 'WN',
  (2, 1): 'WE',
  (0, 0): 'WN',
  (3, 1): 'WE',
  (1, 1): 'BN',
  (0, 3): None,
  (2, 0): 'WE',
  (3, 0): 'WE',
  (2, 3): None,
  (0, 2): 'WN',
  (3, 3): None,
  (2, 2): 'WN',
  (1, 0): 'BE',
  (3, 2): 'WE',
  (1, 3): None})
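The printed values can be checked by hand. They are consistent with a reward of -1 per move (as in the transitions table) and a discount factor of 0.9. The discount is inferred from the numbers and is not shown anywhere in the notebook, so `gamma = 0.9` below is an assumption. Under it, taking the bus north from `(0, 1)` either reaches `(0, 3)`, which has value 0, or waits in place, and walking north from `(0, 0)` leads to `(0, 1)`:

```python
gamma = 0.9    # assumed discount factor, inferred from the printed values
b = 0.5        # bus waiting probability, as passed to GridMoveMDP
reward = -1.0  # per-move reward, as in the transitions table

# (0, 1) with 'BN': V = reward + gamma * (b * V + (1 - b) * 0), solved for V.
v_01 = reward / (1.0 - gamma * b)

# (0, 0) with 'WN': a deterministic step to (0, 1).
v_00 = reward + gamma * v_01

print(round(v_01, 6), round(v_00, 6))
```

Both numbers agree with the entries for `(0, 1)` and `(0, 0)` in `optimal_value` to the precision shown.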

In [9]:
show_value_iterations(value_history, policy_history)

| | S | V | A |
|---|---|---|---|
| 0 | (0, 1) | -1.818182 | BN |
| 1 | (1, 2) | -1.0 | WN |
| 2 | (2, 1) | -1.0 | WE |
| 3 | (0, 0) | -2.636364 | WN |
| 4 | (3, 1) | 0.0 | WE |
| 5 | (1, 1) | -1.818182 | BN |
| 6 | (0, 3) | 0.0 | |
| 7 | (2, 0) | -1.0 | WE |
| 8 | (3, 0) | 0.0 | WE |
| 9 | (2, 3) | 0.0 | |





## Policy iteration

In [10]:
from algorithms import policy_iteration
from utils import show_policy_iterations

In [11]:
pi, pi_history = policy_iteration(mdp)

In [12]:
pi.actions

{(0, 1): 'BN',
 (1, 2): 'WN',
 (2, 1): 'WE',
 (0, 0): 'WE',
 (3, 1): 'WE',
 (1, 1): 'BN',
 (0, 3): 'WS',
 (2, 0): 'WE',
 (3, 0): 'WE',
 (2, 3): 'WS',
 (0, 2): 'WN',
 (3, 3): 'WE',
 (2, 2): 'WN',
 (1, 0): 'BE',
 (3, 2): 'WE',
 (1, 3): 'WW'}
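The policy found by policy iteration is not identical to the one returned by value iteration. A quick way to see where they differ is to compare the two dictionaries directly. In the sketch below both are copied from the outputs above, and the names `vi_policy` and `pi_policy` are introduced only for this illustration.

```python
# Policy from value_iteration, copied from the Out[8] listing.
vi_policy = {
    (0, 1): "BN", (1, 2): "WN", (2, 1): "WE", (0, 0): "WN",
    (3, 1): "WE", (1, 1): "BN", (0, 3): None, (2, 0): "WE",
    (3, 0): "WE", (2, 3): None, (0, 2): "WN", (3, 3): None,
    (2, 2): "WN", (1, 0): "BE", (3, 2): "WE", (1, 3): None,
}

# Policy from policy_iteration, copied from the pi.actions listing.
pi_policy = {
    (0, 1): "BN", (1, 2): "WN", (2, 1): "WE", (0, 0): "WE",
    (3, 1): "WE", (1, 1): "BN", (0, 3): "WS", (2, 0): "WE",
    (3, 0): "WE", (2, 3): "WS", (0, 2): "WN", (3, 3): "WE",
    (2, 2): "WN", (1, 0): "BE", (3, 2): "WE", (1, 3): "WW",
}

# Collect the states where the two policies pick different actions.
differences = {
    state: (vi_policy[state], pi_policy[state])
    for state in sorted(vi_policy)
    if vi_policy[state] != pi_policy[state]
}

for state, (vi_action, pi_action) in differences.items():
    print(state, vi_action, pi_action)
```

The differences are confined to `(0, 0)` and the four states with $j = 3$. Those four have value 0 and no action under value iteration, so they appear to be the final locations, where the chosen action does not matter. At `(0, 0)`, `WN` leads to `(0, 1)` and `WE` leads to `(1, 0)`, and both have value -1.818182, so the two actions tie.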

In [13]:
show_policy_iterations(pi_history)

| | S | V | A |
|---|---|---|---|
| 0 | (0, 1) | -1.818182 | BN |
| 1 | (1, 2) | -1.0 | WN |
| 2 | (2, 1) | -1.0 | WE |
| 3 | (0, 0) | -2.636364 | WE |
| 4 | (3, 1) | 0.0 | WE |
| 5 | (1, 1) | -1.818182 | BN |
| 6 | (0, 3) | 0.0 | WS |
| 7 | (2, 0) | -1.0 | WE |
| 8 | (3, 0) | 0.0 | WE |
| 9 | (2, 3) | 0.0 | WS |



