# Exercise: transportation

## Problem description
You need to travel from place $1$ to place $N$, always moving forward. At any place $i$ with $1 \leq i < N$ you can choose one of the following means of transportation:
- `walk`: walking always takes $w$ minutes and brings you to $i + 1$;
- `bus`: the bus takes $b$ minutes to bring you to $i + 2$, but with probability $\alpha$ you have to wait for it, spending the $b$ minutes while staying at $i$. No bus goes to $j$ if $j > N$;
- `train`: the train takes $t$ minutes to bring you to $2i$, but with probability $\beta$ you have to wait for it, spending the $t$ minutes while staying at $i$. No train goes to $j$ if $j > N$.

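Your goal is to reach $N$ as fast as possible. In the code, travel times enter as negative rewards (`w=-4`, `b=-1`, `t=-1`) discounted by `gamma`, so maximizing the expected discounted return means reaching $N$ quickly.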
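The rules above can be written as a transition function that lists, for each place and action, the possible next places with their rewards and probabilities. The sketch below uses hypothetical helpers `available_actions` and `transitions`, which are not part of `TransportationMDP`; the default arguments mirror the instance used in this notebook, and waiting is modeled as paying the fare while staying in place, as the transition table in the Model section shows.

```python
def available_actions(i, n=6):
    """Actions allowed at place i (1 <= i < n); place n is the destination."""
    actions = ["walk"]           # walking to i + 1 is always possible before n
    if i + 2 <= n:
        actions.append("bus")    # no bus goes beyond place n
    if 2 * i <= n:
        actions.append("train")  # no train goes beyond place n
    return actions


def transitions(i, action, alpha=0.4, beta=0.3, w=-4, b=-1, t=-1):
    """Return (next_place, reward, probability) triples; rewards are negative minutes."""
    if action == "walk":
        return [(i + 1, w, 1.0)]
    if action == "bus":
        # With probability alpha you wait: you pay b minutes and stay at i.
        return [(i + 2, b, 1 - alpha), (i, b, alpha)]
    if action == "train":
        # With probability beta you wait: you pay t minutes and stay at i.
        return [(2 * i, t, 1 - beta), (i, t, beta)]
    raise ValueError(f"unknown action: {action!r}")


for a in available_actions(1):
    print(a, transitions(1, a))
```

Keeping the action set separate from the outcome distribution makes the boundary rules ($j > N$) explicit in one place.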

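```python
import sys

# Make the local rlcoding modules importable
sys.path.append('../../rlcoding/')

from transportation import TransportationMDP

# N = 6 places; you wait for the bus with probability alpha and for the train with probability beta.
# Rewards are negative travel times: walking costs 4 minutes, bus and train 1 minute each.
mdp = TransportationMDP(n=6, alpha=.4, beta=.3, gamma=.9, w=-4, b=-1, t=-1)
```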

## Model

```python
from utils import transitions_table, mdp_to_graph, plot_mdp

# One row per (from_state, action, to_state) with its reward and probability
T = transitions_table(mdp)
T
```

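| from_state | action | to_state | reward | probability |
|---:|:---|---:|---:|---:|
| 1 | walk | 2 | -4 | 1.0 |
| 1 | train | 2 | -1 | 0.7 |
| 1 | train | 1 | -1 | 0.3 |
| 1 | bus | 3 | -1 | 0.6 |
| 1 | bus | 1 | -1 | 0.4 |
| 2 | walk | 3 | -4 | 1.0 |
| 2 | train | 4 | -1 | 0.7 |
| 2 | train | 2 | -1 | 0.3 |
| 2 | bus | 4 | -1 | 0.6 |
| 2 | bus | 2 | -1 | 0.4 |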
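A quick consistency check on a table like this is that, for every `(from_state, action)` pair, the outgoing probabilities sum to 1. The sketch below does this in plain Python on the rows shown above, copied by hand; with the actual DataFrame one could instead use pandas, e.g. `T.groupby(["from_state", "action"])["probability"].sum()`.

```python
from collections import defaultdict
from math import isclose

# Rows copied from the transition table: (from_state, action, to_state, reward, probability)
rows = [
    (1, "walk", 2, -4, 1.0),
    (1, "train", 2, -1, 0.7),
    (1, "train", 1, -1, 0.3),
    (1, "bus", 3, -1, 0.6),
    (1, "bus", 1, -1, 0.4),
    (2, "walk", 3, -4, 1.0),
    (2, "train", 4, -1, 0.7),
    (2, "train", 2, -1, 0.3),
    (2, "bus", 4, -1, 0.6),
    (2, "bus", 2, -1, 0.4),
]

# Accumulate the probability mass leaving each (state, action) pair
totals = defaultdict(float)
for from_state, action, _to_state, _reward, probability in rows:
    totals[(from_state, action)] += probability

for key, total in sorted(totals.items()):
    status = "ok" if isclose(total, 1.0) else "BAD"
    print(key, round(total, 12), status)
```

`isclose` is used instead of `==` because sums such as `0.7 + 0.3` are computed in floating point.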


```python
# Build a graph of the MDP and render it to an HTML file
net = plot_mdp(mdp_to_graph(mdp))
net.show('transportation.html')
```

## Value iteration

```python
from IPython.display import display, clear_output
from algorithms import value_iteration
from utils import show_value_iterations

# epsilon is the convergence tolerance; the two history lists keep the intermediate iterates
optimal_value, optimal_policy, value_history, policy_history = value_iteration(mdp=mdp, epsilon=1e-10)
```

```python
optimal_value, optimal_policy
```

```
({1: -2.718321917719863,
  2: -2.718321917719863,
  3: -1.3698630136986218,
  4: -1.5624999999873705,
  5: -4.0,
  6: 0.0},
 {1: 'bus', 2: 'train', 3: 'train', 4: 'bus', 5: 'walk', 6: None})
```
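Some of these values can be checked by hand with the Bellman equation $V(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$, taking $V(6) = 0$ at the destination. Following the actions found above (walk at 5, bus at 4, train at 3), a missed bus or train leaves you at the same place, so $V(4)$ and $V(3)$ each solve a linear equation in one unknown:

$$
\begin{aligned}
V(5) &= w + \gamma V(6) = -4, \\
V(4) &= b + \gamma\left[(1-\alpha)\,V(6) + \alpha\,V(4)\right]
  \;\Longrightarrow\; V(4) = \frac{b}{1-\gamma\alpha} = \frac{-1}{1 - 0.9 \cdot 0.4} = -1.5625, \\
V(3) &= t + \gamma\left[(1-\beta)\,V(6) + \beta\,V(3)\right]
  \;\Longrightarrow\; V(3) = \frac{t}{1-\gamma\beta} = \frac{-1}{1 - 0.9 \cdot 0.3} \approx -1.369863.
\end{aligned}
$$

Walking from 4 would give $w + \gamma V(5) = -7.6$, well below $-1.5625$, which confirms the bus there. The tiny gap between the printed $-1.5624999999873705$ and $-1.5625$ is consistent with value iteration stopping at the tolerance `epsilon=1e-10` rather than at the exact fixed point.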

```python
show_value_iterations(value_history, policy_history)
```

| State | $V$ | Action |
|---:|---:|:---|
| 1 | -2.718322 | bus |
| 2 | -2.718322 | train |
| 3 | -1.369863 | train |
| 4 | -1.5625 | bus |
| 5 | -4.0 | walk |
| 6 | 0.0 | None |
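For comparison, here is a minimal, self-contained value iteration sketch for this exercise. It does not use the `algorithms` module: the helper names (`successors`, `q_values`, `value_iteration_sketch`) are hypothetical, the sweeps update values in place (Gauss-Seidel style), and the stopping rule (largest change in a sweep below `epsilon`) is an assumption that may differ from the notebook's `value_iteration`. Waiting is modeled as in the transition table: you pay the fare and stay where you are.

```python
# Instance used in this notebook; rewards are negative travel times in minutes.
N, ALPHA, BETA, GAMMA = 6, 0.4, 0.3, 0.9
R_WALK, R_BUS, R_TRAIN = -4, -1, -1


def successors(i):
    """Map each action available at place i to (next_place, reward, probability) triples.

    A missed bus or train leaves you at place i but still costs the fare.
    """
    options = {"walk": [(i + 1, R_WALK, 1.0)]}
    if i + 2 <= N:
        options["bus"] = [(i + 2, R_BUS, 1 - ALPHA), (i, R_BUS, ALPHA)]
    if 2 * i <= N:
        options["train"] = [(2 * i, R_TRAIN, 1 - BETA), (i, R_TRAIN, BETA)]
    return options


def q_values(i, V):
    """One-step lookahead: expected reward plus discounted next value, per action."""
    return {a: sum(p * (r + GAMMA * V[j]) for j, r, p in outcomes)
            for a, outcomes in successors(i).items()}


def value_iteration_sketch(epsilon=1e-10):
    V = {i: 0.0 for i in range(1, N + 1)}  # place N is terminal and keeps value 0
    while True:
        delta = 0.0
        for i in range(1, N):
            best = max(q_values(i, V).values())
            delta = max(delta, abs(best - V[i]))
            V[i] = best  # in-place update: later states in the sweep see it immediately
        if delta < epsilon:
            break
    # Extract the greedy policy from the converged values
    policy = {}
    for i in range(1, N):
        q = q_values(i, V)
        policy[i] = max(q, key=q.get)
    policy[N] = None
    return V, policy


V, policy = value_iteration_sketch()
print({i: round(v, 6) for i, v in V.items()})
print(policy)
```

The values and policy it produces should agree with the table above up to rounding; since $\gamma < 1$ the update is a contraction, so in-place sweeps converge as well as synchronous ones.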





## Policy iteration

```python
from algorithms import policy_iteration
from utils import show_policy_iterations

pi, pi_history = policy_iteration(mdp)
```

```python
pi.actions
```

```
{1: 'bus', 2: 'train', 3: 'train', 4: 'bus', 5: 'walk', 6: None}
```

```python
show_policy_iterations(pi_history)
```

| State | $V$ | Action |
|---:|---:|:---|
| 1 | -2.718322 | bus |
| 2 | -2.718322 | train |
| 3 | -1.369863 | train |
| 4 | -1.5625 | bus |
| 5 | -4.0 | walk |
| 6 | 0.0 | None |
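Policy iteration returns the same policy and values as value iteration here. The sketch below shows the two alternating steps in a self-contained form; it is not the `algorithms.policy_iteration` implementation. The names are hypothetical, evaluation is done by iterative sweeps rather than an exact linear solve, and the starting policy "always walk" is an arbitrary choice (walking is the only action allowed everywhere).

```python
# Instance used in this notebook; rewards are negative travel times in minutes.
N, ALPHA, BETA, GAMMA = 6, 0.4, 0.3, 0.9
R_WALK, R_BUS, R_TRAIN = -4, -1, -1


def successors(i):
    """(next_place, reward, probability) triples for each action available at place i."""
    options = {"walk": [(i + 1, R_WALK, 1.0)]}
    if i + 2 <= N:
        options["bus"] = [(i + 2, R_BUS, 1 - ALPHA), (i, R_BUS, ALPHA)]
    if 2 * i <= N:
        options["train"] = [(2 * i, R_TRAIN, 1 - BETA), (i, R_TRAIN, BETA)]
    return options


def q_value(i, a, V):
    """Expected reward plus discounted next value of taking action a at place i."""
    return sum(p * (r + GAMMA * V[j]) for j, r, p in successors(i)[a])


def evaluate(policy, tol=1e-12):
    """Iterative policy evaluation: sweep the Bellman expectation update until stable."""
    V = {i: 0.0 for i in range(1, N + 1)}  # terminal place N keeps value 0
    while True:
        delta = 0.0
        for i in range(1, N):
            v = q_value(i, policy[i], V)
            delta = max(delta, abs(v - V[i]))
            V[i] = v
        if delta < tol:
            return V


def policy_iteration_sketch():
    policy = {i: "walk" for i in range(1, N)}  # arbitrary start: walking is always allowed
    while True:
        V = evaluate(policy)
        stable = True
        for i in range(1, N):
            q = {a: q_value(i, a, V) for a in successors(i)}
            best = max(q, key=q.get)
            # Switch only on a strict improvement so that ties cannot cause cycling
            if q[best] > q[policy[i]] + 1e-9:
                policy[i] = best
                stable = False
        if stable:
            return policy, V


pi_sketch, V_pi = policy_iteration_sketch()
print(pi_sketch)
```

The loop stops once the policy is greedy with respect to its own value function, which for a finite MDP with $\gamma < 1$ means it is optimal.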



