## EXERCISE 1: What to do at the airport?

You are travelling and have some time to kill at the airport. There are three things you could spend your time doing:
  
1) You could have a coffee.

This has a probability of $0.8$ of giving you time to relax with a tasty beverage, an outcome with a utility of $10$.
It also has a probability of $0.2$ of providing you with a nasty cup from over-roasted beans that annoys you,
an outcome with a utility of $-5$.

2) You could shop for clothes.

This has a probability of $0.1$ that you will find a great outfit at a good price, utility $20$. However, it 
has a probability of $0.9$ that you end up wasting money on over-priced junk, utility $-10$.

3) You could have a bite to eat.

This has a probability of $0.8$ that you find something rather mediocre that prevents you from being too hungry 
during your flight, utility $2$, and a probability of $0.2$ that you find something filling and tasty, utility $5$.

> __QUESTION 1(a):__ What should you do if you take the principle of maximum expected utility to be your decision criterion?

> __QUESTION 1(b):__ What should you do if you take the maximax principle to be your decision criterion?

> __QUESTION 1(c):__ What should you do if you take the maximin principle to be your decision criterion?
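
For reference, the three criteria can be written out as follows. The notation is an assumption introduced here, not part of the exercise: $a$ ranges over the actions (coffee, clothes, eat), $o$ over the possible outcomes of an action, $P(o \mid a)$ is the probability of an outcome and $U(o)$ is its utility.

$$
\begin{aligned}
\text{MEU:} \quad & a^* = \arg\max_a \sum_o P(o \mid a)\, U(o) \\
\text{maximax:} \quad & a^* = \arg\max_a \max_o U(o) \\
\text{maximin:} \quad & a^* = \arg\max_a \min_o U(o)
\end{aligned}
$$

For example, the expected utility of having a coffee is $0.8 \cdot 10 + 0.2 \cdot (-5) = 7$.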
    

In [4]:
import numpy as np

coffee_outcomes = ['nice', 'not nice']
prob_coffee_outcomes = np.array([0.8, 0.2])
util_coffee_outcomes = np.array([10, -5])

clothes_outcomes = ['great', 'not great']
prob_clothes_outcomes = np.array([0.1, 0.9])
util_clothes_outcomes = np.array([20, -10])

eat_outcomes = ['alright', 'tasty']
prob_eat_outcomes = np.array([0.8, 0.2])
util_eat_outcomes = np.array([2, 5])

# MEU
eu_coffee_outcomes = prob_coffee_outcomes * util_coffee_outcomes
eu_coffee = np.sum(eu_coffee_outcomes)
eu_clothes_outcomes = prob_clothes_outcomes * util_clothes_outcomes
eu_clothes = np.sum(eu_clothes_outcomes)
eu_eat_outcomes = prob_eat_outcomes * util_eat_outcomes
eu_eat = np.sum(eu_eat_outcomes)

eu = np.array([eu_coffee, eu_clothes, eu_eat])
max_eu = np.max(eu)
action = np.argmax(eu)
print("MEU: With an EU of", max_eu, "the best choice is", end=' ')
if action==0:
    print('coffee')
elif action==1:
    print('clothes')
elif action==2:
    print('eat')

# maximax
max_u_coffee_outcome = np.max(util_coffee_outcomes)
max_u_clothes_outcome = np.max(util_clothes_outcomes)
max_u_eat_outcome = np.max(util_eat_outcomes)
u_max = np.array([max_u_coffee_outcome, max_u_clothes_outcome, max_u_eat_outcome])
max_u = np.max(u_max)
action = np.argmax(u_max)
print("Maximax: With a maximum utility of", max_u, "the best choice is", end=' ')
if action==0:
    print('coffee')
elif action==1:
    print('clothes')
elif action==2:
    print('eat')

# maximin
min_u_coffee_outcome = np.min(util_coffee_outcomes)
min_u_clothes_outcome = np.min(util_clothes_outcomes)
min_u_eat_outcome = np.min(util_eat_outcomes)
u_min = np.array([min_u_coffee_outcome, min_u_clothes_outcome, min_u_eat_outcome])
max_u = np.max(u_min)
action = np.argmax(u_min)
print("Maximin: With a minimum utility of", max_u, "the best choice is", end=' ')
if action==0:
    print('coffee')
elif action==1:
    print('clothes')
elif action==2:
    print('eat')

MEU: With an EU of 7.0 the best choice is coffee
Maximax: With a maximum utility of 20 the best choice is clothes
Maximin: With a minimum utility of 2 the best choice is eat
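
The three criteria can also be computed in one pass over a table of actions, which avoids repeating the selection logic for each criterion. The following is a minimal sketch rather than the reference solution; the `options` dictionary and the `best_action` helper are names introduced here for illustration.

```python
import numpy as np

# Each action maps to (outcome probabilities, outcome utilities), as given in the exercise.
options = {
    'coffee':  (np.array([0.8, 0.2]), np.array([10, -5])),
    'clothes': (np.array([0.1, 0.9]), np.array([20, -10])),
    'eat':     (np.array([0.8, 0.2]), np.array([2, 5])),
}

def best_action(score):
    """Return (action, value) for the action that maximises score(prob, util)."""
    scores = {name: score(p, u) for name, (p, u) in options.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]

meu = best_action(lambda p, u: float(np.dot(p, u)))  # expected utility
maximax = best_action(lambda p, u: int(np.max(u)))   # best-case utility
maximin = best_action(lambda p, u: int(np.min(u)))   # worst-case utility

print('MEU:', meu)
print('Maximax:', maximax)
print('Maximin:', maximin)
```

Adding a fourth activity then only requires one more entry in `options`.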


## EXERCISE 2: Solving an MDP with MDP toolbox

We have four states and four actions.

The actions are: 0 is Right, 1 is Left, 2 is Up and 3 is Down.

The states are 0, 1, 2, 3, and they are arranged like this:
    
$$
\begin{array}{cc}
2 & 3\\
0 & 1\\
\end{array}
$$

The motion model provides:
*   0.8 probability of moving in the direction of the action,
*   0.1 probability of moving in each of the directions perpendicular to that of the action.

Directions refer to the grid above, so 2 is Up from 0, 1 is Right of 0, and so on. Every action (in any state) has a cost of 0.04, that is, a reward of -0.04.

If a movement is "infeasible" (it would take the agent off the grid), the agent remains in its current state.

The reward for state 3 is 1 and the reward for state 1 is -1. Once the agent reaches either of those states, it does not leave.

Set the discount factor to 0.99.
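
The motion model above can be turned into a transition array mechanically instead of by hand. This is a sketch under one reading of the description: the array is indexed as [action, start state, end state], states 1 and 3 are treated as absorbing under every action, and any off-grid move (intended or perpendicular) leaves the agent where it is. The names `coords`, `moves`, `perpendicular` and `step` are introduced here for illustration.

```python
import numpy as np

# Grid coordinates (row, col) of each state; row 0 is the bottom row.
coords = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
state_at = {rc: s for s, rc in coords.items()}

# Actions: 0 Right, 1 Left, 2 Up, 3 Down, expressed as (d_row, d_col).
moves = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}
# Directions perpendicular to each action.
perpendicular = {0: (2, 3), 1: (2, 3), 2: (0, 1), 3: (0, 1)}
absorbing = {1, 3}  # assumption: these states are never left

def step(state, direction):
    """Deterministic move; an infeasible move leaves the state unchanged."""
    r, c = coords[state]
    dr, dc = moves[direction]
    return state_at.get((r + dr, c + dc), state)

n_states, n_actions = 4, 4
P = np.zeros((n_actions, n_states, n_states))
for a in range(n_actions):
    for s in range(n_states):
        if s in absorbing:
            P[a, s, s] = 1.0
            continue
        P[a, s, step(s, a)] += 0.8          # intended direction
        for side in perpendicular[a]:
            P[a, s, step(s, side)] += 0.1   # each perpendicular direction

# Every row must be a probability distribution over end states.
assert np.allclose(P.sum(axis=2), 1.0)
print(P[0, 0])  # distribution over end states for Right from state 0
```

Generating the array this way makes it easy to check each row against the stated probabilities, which is harder to do by eye with hand-typed matrices.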

> __QUESTION 2(a):__ What is the policy based on the Value iteration algorithm?

> __QUESTION 2(b):__ What is the policy based on the Policy iteration algorithm?

> __QUESTION 2(c):__ What is the policy based on the Q-Learning algorithm?

> __QUESTION 2(d):__ Look at the **setVerbose**() method and the time attribute of the MDP objects in MDPToolbox, and use them to compare the Value iteration and Policy iteration algorithms in terms of the number of iterations (hint: see the iter attribute) and the CPU time used to reach a solution (hint: see the time attribute).


In [5]:
import mdptoolbox
import numpy as np

# P1 is indexed as [action, start state, end state]
# actions: 0 1 2 3 (right, left, up, down)
# states: 0 1 2 3
P1 = np.array([
    [[0.1,0.8,0.1,0], [0,1,0,0], [0.1,0,0.1,0.8], [0,0,0,1]],
    [[1,0,0,0], [0.8,0.1,0,0.1], [0,0,1,0], [0,0.1,0.8,0.1]],
    [[0.1,0.1,0.8,0], [0.1,0.1,0,0.8], [0,0,1,0], [0,0,0,1]],
    [[1,0,0,0], [0,1,0,0], [0.8,0,0.1,0.1], [0,0.8,0.1,0.1]]
])

# R1 is indexed as [state, action] (reward for taking the action in that state)
R1 = np.array([[-1,-0.04,-0.04,-0.04], [-1,-0.04,1,-1], [1,-0.04,-0.04,-0.04], [1,-0.04,1,-1]])

mdptoolbox.util.check(P1, R1)

vi1 = mdptoolbox.mdp.ValueIteration(P1, R1, 0.99)
#vi1.setVerbose()
vi1.run()
# We can then display the values (utilities) computed, and look at the policy:
print("VALUE ITERATION")
print('Values:\n', vi1.V)
print('Policy:\n', vi1.policy)

pi1 = mdptoolbox.mdp.PolicyIteration(P1, R1, 0.99)
#pi1.setVerbose()
pi1.run()
# We can then display the values (utilities) computed, and look at the policy:
print("POLICY ITERATION")
print('Values:\n', pi1.V)
print('Policy:\n', pi1.policy)

ql1 = mdptoolbox.mdp.QLearning(P1, R1, 0.99)
ql1.run()
# We can then display the values (utilities) computed, and look at the policy:
print("QLEARNING")
print('Values:\n', ql1.V)
print('Policy:\n', ql1.policy)

print("\nITERATIONS AND CPU:")
print('Value Iteration:', vi1.iter, ';', vi1.time)
print('Policy Iteration:', pi1.iter, ';', pi1.time)

VALUE ITERATION
Values:
 (9.171222983159206, 10.323895223290556, 10.323895223290556, 10.466174574128356)
Policy:
 (2, 2, 0, 0)
POLICY ITERATION
Values:
 (98.70501608641325, 99.85770986965021, 99.85770986965022, 99.99999999999991)
Policy:
 (2, 2, 0, 0)
QLEARNING
Values:
 (42.06368027504703, 68.42538946292196, 65.64772891492544, 79.02605048260297)
Policy:
 (0, 2, 0, 0)

ITERATIONS AND CPU:
Value Iteration: 11 ; 0.0010170936584472656
Policy Iteration: 2 ; 0.0019989013671875
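
To make what a value iteration solver computes more concrete, here is a minimal NumPy sketch of the repeated Bellman update. It is not MDPToolbox's implementation, and its stopping rule (a fixed tolerance on the largest change in value) is an assumption, so its values and iteration counts need not match the output above. The function `value_iteration` and the two-state toy MDP are hypothetical and only serve as an illustration.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """P: (A, S, S) transitions, R: (S, A) rewards. Returns (V, policy, iterations)."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for it in range(1, max_iter + 1):
        # Q[s, a] = R[s, a] + gamma * sum over s' of P[a, s, s'] * V[s']
        Q = R + gamma * (P @ V).T
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=1), it

# Toy MDP (hypothetical): action 0 stays put, action 1 switches state.
P_toy = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch
])
# Only staying in state 1 is rewarded.
R_toy = np.array([[0.0, 0.0], [1.0, 0.0]])

V, policy, n_iter = value_iteration(P_toy, R_toy, gamma=0.9)
print('Values:', V)
print('Policy:', policy)
print('Iterations:', n_iter)
```

In this toy problem the agent should switch out of state 0 and then stay in state 1, so the values approach $0.9 \cdot 10 = 9$ and $1 / (1 - 0.9) = 10$.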
