**Laboratory Lecture 10**

## EXAMPLE: Time for coffee?

I have half an hour to spare in my busy schedule, and I have a choice between working quietly in my office and
going out for a coffee.

If I stay in my office, three things can happen: I can get some work done (Utility = 8), or I can get distracted
looking at the latest news (Utility = 1), or a colleague might stop by to talk about some work we are doing (Utility = 5).

If I go out for coffee, I will most likely enjoy a good cup of smooth caffeination (Utility = 10), but there is
also a chance I will end up spilling coffee all over myself (Utility = −20).

The probability of getting work done if I choose to stay in the office is 0.5, while the probabilities of getting
distracted, and a colleague stopping by are 0.3 and 0.2 respectively.

If I go out for a coffee, my chance of enjoying my beverage is 0.95, and the chance of spilling my drink is 0.05.


> __QUESTION 1(a):__ What is the expected utility of staying in my office?
    
> __QUESTION 1(b):__ What is the expected utility of going out for a coffee?

> __QUESTION 1(c):__ By the principle of maximum expected utility, what should I do?

### Solution 1(a)

Staying in the $office$ means that I will either $work$, get $distracted$, or talk with a $colleague$. These states 
have the following utilities:
\begin{align*}
U(work) & = 8\\
U(distracted) & = 1\\
U(colleague) & = 5
\end{align*}
and the probabilities of these happening, given I stay in the $office$ are:
\begin{align*}
P(work|office) & = 0.5\\
P(distracted|office) & = 0.3\\
P(colleague|office) & = 0.2
\end{align*}
Let's declare these as NumPy arrays:

In [None]:
import numpy as np
# Set up arrays with: symbolic names for outcomes (not currently used), utilities of outcomes, and 
# probabilities of those outcomes
office_outcomes = ["work", "distracted", "colleague"] 
print('office_outcomes = ', office_outcomes)
u_office_outcomes = np.array([8, 1, 5]) 
print('U(office_outcomes) = ', u_office_outcomes) 
p_office_outcomes_office = np.array([0.5, 0.3, 0.2]) 
print('P(office_outcomes|office) =', p_office_outcomes_office)


office_outcomes =  ['work', 'distracted', 'colleague']
U(office_outcomes) =  [8 1 5]
P(office_outcomes|office) = [0.5 0.3 0.2]


<br>
The expected utility of staying in the office is:
    \begin{align*}
      EU(office) & = 0.5\times 8 + 0.3\times 1 + 0.2\times 5\\
                 & = 5.3
    \end{align*}
    
Now let's implement this in Python:

In [None]:
# The weighted utility of each outcome is easy to compute by elementwise multiplication
eu_office_outcomes = u_office_outcomes * p_office_outcomes_office
print('EU by outcome =', eu_office_outcomes)
# Summing the weighted utilities gets us the expected utility
eu_office = np.sum(eu_office_outcomes)
eu_office_2 = np.dot(u_office_outcomes, p_office_outcomes_office) # alternative: the dot product multiplies and sums in one step
print(eu_office_2)
print('EU(office) = ', eu_office)

EU by outcome = [4.  0.3 1. ]
5.3
EU(office) =  5.3


So the expected utility of staying in the office is 5.3.
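
The same calculation can be wrapped in a small helper that also checks its inputs. This is only a minimal sketch: the function name `expected_utility` and the validation rules are our own illustrative choices and are not part of the lab code above.

```python
import numpy as np

def expected_utility(utilities, probabilities):
    """Hypothetical helper: return sum_i P(outcome_i) * U(outcome_i)."""
    u = np.asarray(utilities, dtype=float)
    p = np.asarray(probabilities, dtype=float)
    if u.shape != p.shape:
        raise ValueError("utilities and probabilities must have the same shape")
    if np.any(p < 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("probabilities must be non-negative and sum to 1")
    return float(np.dot(u, p))

# Numbers from the office example
eu_office_check = expected_utility([8, 1, 5], [0.5, 0.3, 0.2])
print('EU(office) =', round(eu_office_check, 2))
```

Checking that the probabilities sum to 1 catches a common slip when the numbers are edited by hand.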

### Solution 1(b)

This time, we'll jump straight to the Python code, using the same notation as before:

In [None]:
# The coffee calculation is the same as the office calculation, first set up arrays
coffee_outcomes = ["caffeination", "spillage"] 
print('coffee_outcomes = ', coffee_outcomes)
u_coffee_outcomes = np.array([10, -20]) 
print('U(coffee_outcomes) = ', u_coffee_outcomes) 
p_coffee_outcomes_coffee = np.array([0.95, 0.05]) 
print('P(coffee_outcomes|coffee) =', p_coffee_outcomes_coffee)
print('\n')
# Then compute the expected utility
eu_coffee_outcomes = u_coffee_outcomes * p_coffee_outcomes_coffee
print('EU by outcome =', eu_coffee_outcomes)
eu_coffee = np.sum(eu_coffee_outcomes)
print('EU(coffee) = ', eu_coffee)

coffee_outcomes =  ['caffeination', 'spillage']
U(coffee_outcomes) =  [ 10 -20]
P(coffee_outcomes|coffee) = [0.95 0.05]


EU by outcome = [ 9.5 -1. ]
EU(coffee) =  8.5


So the expected utility of going out for coffee is 8.5.
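
One way to build intuition for this number is to read it as a long-run average: if we went out for coffee many times, the mean utility should settle close to the expected utility. The sketch below simulates that. The seed and the number of simulated trips are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for repeatability
u_coffee = np.array([10, -20])        # utilities of caffeination and spillage
p_coffee = np.array([0.95, 0.05])     # their probabilities

n_trips = 100_000
# Draw one outcome utility per simulated coffee trip
sampled_utilities = rng.choice(u_coffee, size=n_trips, p=p_coffee)
mean_utility = sampled_utilities.mean()
print('Average utility over', n_trips, 'simulated trips =', mean_utility)
```

The printed average will not equal the expected utility exactly, but it should be close to it for a sample of this size.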

### Solution 1(c)

The MEU criterion chooses the option with the maximum expected utility. With the numbers in this example,
EU(coffee) = 8.5 exceeds EU(office) = 5.3, so going out for $coffee$ is the MEU choice.

However, we will also program it in Python so that we can see what happens as the probabilities of the outcomes 
vary:

In [None]:
if eu_office > eu_coffee:
    print('Office is the MEU choice')
else: 
    print('Coffee is the MEU choice')

Coffee is the MEU choice
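
To see what happens as the probabilities vary, we can sweep the chance of spilling the coffee while keeping everything else fixed. This is a sketch under our own assumptions: only $P(spillage)$ changes, and the range and step of the sweep are arbitrary.

```python
import numpy as np

eu_office_fixed = 5.3                      # EU(office) from Solution 1(a)
p_spill = np.linspace(0.0, 0.5, 51)        # candidate spill probabilities (step 0.01)
eu_coffee_sweep = 10 * (1 - p_spill) + (-20) * p_spill

# Print every fifth point of the sweep with the resulting MEU choice
for p, eu in zip(p_spill[::5], eu_coffee_sweep[::5]):
    choice = 'Office' if eu_office_fixed > eu else 'Coffee'
    print(f'P(spillage) = {p:.2f}  EU(coffee) = {eu:6.2f}  MEU choice: {choice}')

# EU(coffee) = 10 - 30 p, so the two choices tie when 10 - 30 p = 5.3
p_break_even = (10 - eu_office_fixed) / 30
print('Break-even P(spillage) =', round(p_break_even, 4))
```

Below the break-even probability coffee remains the MEU choice; above it, staying in the office has the higher expected utility.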


---

## EXAMPLE: Time for coffee? Maximax and Maximin

Revisit the decision for the coffee example using the maximax and maximin decision criteria.

### Solution 2

The maximax decision criterion rates each choice by the utility of its best outcome,
and then picks the choice with the best such rating.

In Python we would do this calculation as follows:

In [None]:
# The utility of each choice is the max utility of its outcomes
max_u_office = np.max(u_office_outcomes)
print('MaxU(office) =', max_u_office)
max_u_coffee = np.max(u_coffee_outcomes)
print('MaxU(coffee) =', max_u_coffee)
print('\n')
# The decision criterion is then to pick the choice with the highest best-case utility:
if max_u_office > max_u_coffee:
    print('Office is the Maximax choice')
else: 
    print('Coffee is the Maximax choice')

MaxU(office) = 8
MaxU(coffee) = 10


Coffee is the Maximax choice


The maximin decision criterion rates each choice by the utility of its worst outcome, and then picks the choice with the best such rating.

In Python:

In [None]:
# The utility of each choice is the min utility of its outcomes
min_u_office = np.min(u_office_outcomes)
print('MinU(office) =', min_u_office)
min_u_coffee = np.min(u_coffee_outcomes)
print('MinU(coffee) =', min_u_coffee)
print('\n')
# The decision criterion is then to pick the choice with the highest worst-case utility:
if min_u_office > min_u_coffee:
    print('Office is the Maximin choice')
else: 
    print('Coffee is the Maximin choice')

MinU(office) = 1
MinU(coffee) = -20


Office is the Maximin choice
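
The three criteria differ only in how they turn a choice's outcomes into a single rating, so they can be compared side by side. The sketch below does this; the dictionary layout and the names `choices`, `criteria` and `best` are our own illustrative choices.

```python
import numpy as np

# Utilities and probabilities from the coffee example
choices = {
    'office': (np.array([8, 1, 5]), np.array([0.5, 0.3, 0.2])),
    'coffee': (np.array([10, -20]), np.array([0.95, 0.05])),
}

# Each criterion maps (utilities, probabilities) to a single rating
criteria = {
    'MEU': lambda u, p: float(np.dot(u, p)),
    'Maximax': lambda u, p: float(np.max(u)),
    'Maximin': lambda u, p: float(np.min(u)),
}

best = {}
for name, rate in criteria.items():
    ratings = {choice: rate(u, p) for choice, (u, p) in choices.items()}
    best[name] = max(ratings, key=ratings.get)  # choice with the highest rating
    print(name, ratings, '->', best[name])
```

Note that maximax and maximin ignore the probabilities entirely, which is why a very unlikely spill is enough to make the office the maximin choice.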


## EXAMPLE: Solving a simple MDP using the MDP toolbox

The MDP Toolbox is an implementation of some MDP algorithms in Python. You will need to install this using: 

pip install pymdptoolbox

Documentation is at: https://pymdptoolbox.readthedocs.io/en/latest/index.html

Let's start with a really simple problem. 

We have four states and two actions.

The actions are 0, "Stay", and 1, "Right". **0 always succeeds** and leaves the agent in the same state. **1 moves the agent right** with probability **0.8**, and leaves it in place with probability **0.2**.

The states are 0, 1, 2, 3. 0 is left of 1, which is left of 2 and so on. (Thus the states **are in a line** which runs 0, 1, 2, 3 from left to right).

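Once in **state 3**, the agent remains there with **probability 1**, whichever action it takes.

Any action taken in state 3 has a **reward of 1**, and any action taken in the other states has a **reward of -0.04** (that is, a cost of 0.04).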

In [1]:
import mdptoolbox
import numpy as np

# The MDP Toolbox defines MDPs through a probability array and a reward array.

# The probability array has shape (A, S, S), where A is the number of actions
# and S is the number of states. For each action, entry [s, s'] is the probability
# of reaching state s' by applying that action in state s.

# So, to implement the action model described above, we need:
P1 = np.array([[[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1]],
               [[0.2, 0.8, 0,   0],
                [0,   0.2, 0.8, 0],
                [0,   0,   0.2, 0.8],
                [0,   0,   0,   1]]])
# The first matrix is that for the action "Stay" (when executed in a given
# state the agent stays there) and the second is for the action "Right" 
# (which shifts the agent right with probability 0.8 except in state 3 
# when the agent remains in state 3 with probability 1).

# The reward array has shape (S, A), so there is a set of S vectors,
# one for each state, and each is a vector with one element for each of
# the actions. Each element is the reward for executing the relevant
# action in the state (so this is really modelling cost of the action).
R1 = np.array([[-0.04, -0.04], [-0.04, -0.04], [-0.04, -0.04], [1, 1]])
# R1 says that executing either action in states 0, 1, or 2 has a reward
# of -0.04, and executing either action in state 3 has reward 1.

# The util.check() function checks that the reward and probability matrices 
# are well-formed, and match.
# 
# Success is silent, failure provides somewhat useful error messages.
mdptoolbox.util.check(P1, R1)
# To run value iteration we create a value iteration object, and run it. Note that 
# the discount factor is 0.9
vi1 = mdptoolbox.mdp.ValueIteration(P1, R1, 0.9)
vi1.run()
# We can then display the values (utilities) computed, and look at the policy:
print('Values:\n', vi1.V)
print('Policy:\n', vi1.policy)

Values:
 (2.766226988084275, 3.7438891127976524, 4.857502678650809, 6.12579511)
Policy:
 (1, 1, 1, 0)


This says that the optimum policy is to go Right in every state until reaching state 3, then Stay.
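
To make the computation less of a black box, here is a minimal hand-written value iteration in plain NumPy for the same model. It is a sketch of the textbook algorithm under our own assumptions (zero initial values, a stopping tolerance of `1e-10`, a cap of 10,000 sweeps) and is not the toolbox's implementation, so its values need not match the toolbox output above.

```python
import numpy as np

# Same model as P1 and R1 above: P[a, s, s'] and R[s, a]
P = np.array([[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
              [[0.2, 0.8, 0, 0], [0, 0.2, 0.8, 0], [0, 0, 0.2, 0.8], [0, 0, 0, 1]]])
R = np.array([[-0.04, -0.04], [-0.04, -0.04], [-0.04, -0.04], [1, 1]])
gamma = 0.9

V = np.zeros(4)
for sweep in range(10_000):
    # Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:   # stopping tolerance is our own choice
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)   # ties are broken in favour of the lower action index
print('Values:', np.round(V, 4))
print('Policy:', policy)
```

A useful sanity check is state 3: staying there forever collects $1 + 0.9 + 0.9^2 + \dots = 10$, so a fully converged value for state 3 should be 10.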

Although we have been looking at the policy, we got it through value iteration.

Solving the same problem using **policy iteration** is easy with the MDP Toolbox:

In [None]:
# To run policy iteration we create a policy iteration object, and run it. Note that 
# the discount factor is 0.9
pi1 = mdptoolbox.mdp.PolicyIteration(P1, R1, 0.9)
pi1.run()
# We can then display the values (utilities) computed, and look at the policy:
print('Values:\n', pi1.V)
print('Policy:\n', pi1.policy)

Values:
 (6.6402692938291725, 7.6180844735276665, 8.731707317073173, 10.000000000000002)
Policy:
 (1, 1, 1, 0)


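Note that the two methods disagree on the values while agreeing on the policy. The value-iteration numbers are lower than the policy-iteration numbers by almost the same amount (about 3.87) in every state, so the ranking of the actions in each state, and hence the policy, is unchanged.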
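
One way to check which set of values is the exact one for this policy is to evaluate the policy directly. For a fixed policy $\pi$ the values satisfy $V = R_\pi + \gamma P_\pi V$, which is a linear system. The sketch below solves it with NumPy; the names `P_pi`, `R_pi` and `V_pi` are our own.

```python
import numpy as np

# Same model as P1 and R1 above: P[a, s, s'] and R[s, a]
P = np.array([[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
              [[0.2, 0.8, 0, 0], [0, 0.2, 0.8, 0], [0, 0, 0.2, 0.8], [0, 0, 0, 1]]])
R = np.array([[-0.04, -0.04], [-0.04, -0.04], [-0.04, -0.04], [1, 1]])
gamma = 0.9
policy = np.array([1, 1, 1, 0])   # the policy both methods returned

states = np.arange(4)
P_pi = P[policy, states, :]       # row s is the transition row of action policy[s]
R_pi = R[states, policy]          # reward of the action chosen in each state

# Solve (I - gamma * P_pi) V = R_pi for the exact value of the policy
V_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, R_pi)
print('V_pi =', V_pi)
```

The result should agree with the policy-iteration values, which suggests that the value-iteration run stopped before its values had fully converged.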

Solving a problem using reinforcement learning (well, the **Q-learning** kind of RL) is also easy using the MDP Toolbox:

In [None]:
# To run Q-learning we create a Q-learning object, and run it. Note that 
# the discount factor is 0.9
ql1 = mdptoolbox.mdp.QLearning(P1, R1, 0.9)
ql1.run()
# We can then display the values (utilities) computed, and look at the policy:
print('Values:\n', ql1.V)
print('Policy:\n', ql1.policy)

Values:
 (0.26457661943006405, 2.0695643950241327, 6.421037194032352, 9.999999762119485)
Policy:
 (1, 1, 1, 0)


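Note that Q-learning also disagrees with the other two methods on the values while agreeing on the policy. Q-learning learns from randomly sampled transitions, so its values will typically change from run to run.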
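
For comparison, here is a minimal tabular Q-learning loop written by hand. It is a sketch of the standard update rule and not the toolbox's implementation: the learning rate, the number of steps, the seed, and the choice to sample states and actions uniformly at random are all our own assumptions.

```python
import numpy as np

# Same model as P1 and R1 above: P[a, s, s'] and R[s, a]
P = np.array([[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
              [[0.2, 0.8, 0, 0], [0, 0.2, 0.8, 0], [0, 0, 0.2, 0.8], [0, 0, 0, 1]]])
R = np.array([[-0.04, -0.04], [-0.04, -0.04], [-0.04, -0.04], [1, 1]])
gamma = 0.9

rng = np.random.default_rng(0)   # fixed seed, an arbitrary choice
alpha = 0.1                      # learning rate (our choice)
n_steps = 20_000                 # number of sampled transitions (our choice)

Q = np.zeros((4, 2))
for _ in range(n_steps):
    s = rng.integers(4)                  # sample a state uniformly
    a = rng.integers(2)                  # sample an action uniformly (pure exploration)
    s_next = rng.choice(4, p=P[a, s])    # sample the next state from the model
    # Move Q[s, a] towards the one-step target
    target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

print('Values:', np.round(Q.max(axis=1), 2))
print('Policy:', Q.argmax(axis=1))
```

In state 3 both actions have the same effect, so either one may be reported there. In the other states the learned policy should be to go Right.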