<img src="figures/RabbitMDP.png" width="50%" align="right" />

# Policy iteration

This notebook explains how the "Policy iteration" framework works.

Let's first import the necessary components that were also used in the previous notebook.

In [1]:
from mdp import RabbitMDP
import numpy as np

And create a new instance with which we can interact.

In [2]:
mdp = RabbitMDP()

## Policy

A policy determines the probability that an agent takes a certain action $a$ when the environment is in state $s$. It is denoted with the function:

$$\pi(a|s)$$

A policy is stochastic when, in at least one state, multiple actions have a non-zero probability. It is deterministic when in every state exactly one action has probability 1.

In our example MDP the state-action space is small so we can put the entire policy in a simple dictionary. Let's define a completely random policy, where all actions allowed in a state have an equal probability.

In [3]:
pi = {
    'idle': {'wakeup': 1.0},
    'hungry': {'go eat': 0.5, 'stay': 0.5},
    'eating': { 'go eat': 0.5, 'go home': 0.5},
    'dead': {}
}
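As a quick sanity check, the probabilities in each non-terminal state should sum to 1. The sketch below repeats the policy table so it runs on its own; the helper names `is_valid` and `is_deterministic` are illustrative and not part of the `mdp` module.

```python
import math

# The random policy from the cell above, repeated so this sketch is self-contained
pi = {
    'idle': {'wakeup': 1.0},
    'hungry': {'go eat': 0.5, 'stay': 0.5},
    'eating': {'go eat': 0.5, 'go home': 0.5},
    'dead': {}
}

def is_valid(pi):
    # every state that allows actions must have probabilities summing to 1
    return all(math.isclose(sum(probs.values()), 1.0)
               for probs in pi.values() if probs)

def is_deterministic(pi):
    # deterministic: in every state with actions, one action has probability 1
    return all(max(probs.values()) == 1.0
               for probs in pi.values() if probs)

print(is_valid(pi), is_deterministic(pi))
```

The terminal state `dead` has no actions, so both helpers skip it.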

Now let's walk through a trajectory and see how the agent interacts with the environment.
The environment starts in the initial state.

In [4]:
s = mdp.STATES[0]
s

'idle'

In this state there is only one action allowed, and the policy should give that action a probability of 1.

In [5]:
pi[s]

{'wakeup': 1.0}

If the agent takes this action, what are the possible transitions the environment could make?

In [6]:
a = 'wakeup'

In [7]:
for s_next in mdp.STATES:
    for r in mdp.REWARDS:
        p = mdp.p(s, a, s_next, r)
        if p > 0: print(f'{s_next}, {r}: {p}')

hungry, 0: 1


There is only one transition, with probability 1, which puts the environment in state `hungry`.

In this state the agent should take the next action. The policy now returns the following probabilities.

In [8]:
s = 'hungry'
pi[s]

{'go eat': 0.5, 'stay': 0.5}

This is a completely random choice between two actions. We could take one action and see what happens, but let's continue with the next step: evaluating this policy.
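For completeness, this is roughly how an agent could draw an action from such a policy table. It is a minimal sketch using the standard library; the helper name `sample_action` is an assumption and not part of the `mdp` module.

```python
import random

# The random policy from above, repeated so this sketch is self-contained
pi = {
    'idle': {'wakeup': 1.0},
    'hungry': {'go eat': 0.5, 'stay': 0.5},
    'eating': {'go eat': 0.5, 'go home': 0.5},
    'dead': {}
}

def sample_action(pi, s):
    # draw one action in state s, weighted by the policy probabilities
    actions = list(pi[s].keys())
    weights = list(pi[s].values())
    return random.choices(actions, weights=weights, k=1)[0]

print(sample_action(pi, 'hungry'))
```

In state `idle` this always returns `wakeup`. In state `hungry` it returns `go eat` or `stay` with equal probability.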

## Policy evaluation

How good is a policy? That can be determined by computing the value of each state, and in particular of the start states. The higher the state values of the start states, the higher the expected return when following this policy.

We don't know the values yet, so let's start by initializing them to 0 and see where we get from there.

In [9]:
V = {state: 0 for state in mdp.STATES}
V

{'idle': 0, 'hungry': 0, 'eating': 0, 'dead': 0}

### Action value function $q$

We have seen the computation of action values in the previous notebook. With the following function we can compute the value of taking action $a$ when in state $s$.

$$ q(s,a) = \sum_{s'} \sum_r p(s',r|s,a)[r + \gamma \mathop{v_\pi}(s')] $$

Let's define a Python function that computes this using the table of state values.

In [10]:
def q(s, a, V):
    value = 0
    for s_next in mdp.STATES:
        for r in mdp.REWARDS:
            p = mdp.p(s, a, s_next, r)            
            value += p * (r + gamma * V[s_next])
    return value

It uses $\gamma$ as a discount factor, so let's define that.

In [11]:
gamma = 0.9
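To get a feeling for what the discount factor does: a reward received $t$ steps in the future is weighted by $\gamma^t$. The sketch below illustrates this with a made-up reward sequence; the helper name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    # weight the reward at step t by gamma**t and sum everything
    return sum(gamma**t * r for t, r in enumerate(rewards))

# three rewards of 1 with gamma = 0.9 give 1 + 0.9 + 0.81
print(discounted_return([1, 1, 1], 0.9))
```

With $\gamma$ close to 1 future rewards count almost as much as immediate ones, and with $\gamma = 0$ only the immediate reward counts.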

Let's see what happens with the initial state and action.

In [12]:
q('idle', 'wakeup', V)

0.0

It returns 0, because the action does not give a reward and the value of the next state is still 0.

In [13]:
q('hungry', 'go eat', V)

0.6000000000000001

It returns 0.6, because the `go eat` action can result in two transitions: one with probability 0.8 and reward 1, and one with probability 0.2 and reward -1. Both end up in a state whose value is still 0, so we get the following computation.

In [14]:
0.8 * (1 + gamma * 0) + 0.2 * (-1 + gamma * 0)

0.6000000000000001

In [15]:
q('hungry', 'stay', V)

-0.1

### State value function $v$

With this action value function $q$ we can compute the value of a state, by inspecting all actions that are allowed in a state.

$$ v(s) = \sum_a \mathop{\pi}(a|s) q(s,a) $$

Let's implement this in Python.

In [16]:
def v(s, pi, V):
    value = 0
    for a in mdp.A(s):
        value += pi[s][a] * q(s, a, V)
    return value

Again, let's check what happens in the initial state.

In [17]:
v('idle', pi, V)

0.0

The result is again 0, because only one action is allowed, and it yields no reward and leads to a state whose value is still 0.

In [18]:
v('hungry', pi, V)

0.25000000000000006

This state has a non-zero value because two actions are possible, each with a non-zero expected reward.

`go eat` has an action value of 0.6 and `stay` has an action value of -0.1. Both actions have equal probability of 0.5.

In [19]:
0.5 * 0.6 + 0.5 * -0.1

0.25
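Substituting the definition of $q$ into $v$ gives a single expression for the state value, usually called the Bellman expectation equation. It is written here in the same notation as the two formulas above:

$$ v(s) = \sum_a \mathop{\pi}(a|s) \sum_{s'} \sum_r p(s',r|s,a)[r + \gamma \mathop{v_\pi}(s')] $$

For state `hungry`, with all state values still at 0, the inner sums are the action values computed earlier, so this reduces to:

$$ v(\text{hungry}) = 0.5 \cdot 0.6 + 0.5 \cdot (-0.1) = 0.25 $$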

Let's compute the state values for each state.

In [20]:
V_new = {s: v(s, pi, V) for s in mdp.STATES}
V_new

{'idle': 0.0, 'hungry': 0.25000000000000006, 'eating': -0.1, 'dead': 0}

### Evaluation convergence

We now have a table of values. Let's see what happens if we use these values in our computations of $q$ and $v$.

In [21]:
V = {s: 0 for s in mdp.STATES}
V

{'idle': 0, 'hungry': 0, 'eating': 0, 'dead': 0}

In [22]:
V = V_new
V_new = {s: v(s, pi, V) for s in mdp.STATES}
V_new

{'idle': 0.22500000000000006,
 'hungry': 0.31525000000000003,
 'eating': -0.1225,
 'dead': 0}

The values have changed again, but by a smaller amount. Let's see what happens if we iterate 150 times.

In [23]:
for _ in range(150):
    V_new = {s: v(s, pi, V) for s in mdp.STATES}
    V = V_new
V

{'idle': 0.41213695806784034,
 'hungry': 0.4579299534087115,
 'eating': 0.06241200632828714,
 'dead': 0}

And if we run it one more time we see that it doesn't change anymore. The values have converged.

In [24]:
V_new = {s: v(s, pi, V) for s in mdp.STATES}
V_new

{'idle': 0.41213695806784034,
 'hungry': 0.4579299534087115,
 'eating': 0.06241200632828714,
 'dead': 0}

We can use this to create a function that keeps computing until the maximum difference is below a certain threshold.

In [25]:
def evaluate(pi):
    # initialize the values of all states to 0
    V = {s: 0 for s in mdp.STATES}
    while True:
        # compute new state values
        V_new = {s: v(s, pi, V) for s in mdp.STATES}
        # compare with previous values
        diff = np.abs([V_new[s] - V[s] for s in mdp.STATES]).max()
        V = V_new
        # stop when difference is below threshold
        if diff < 1e-9:
            break
    return V   

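It should return nearly the same values as before. Small differences remain because the loop stops as soon as the change drops below the threshold.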

In [26]:
V = evaluate(pi)
V

{'idle': 0.41213695568355024,
 'hungry': 0.4579299514857902,
 'eating': 0.062412004614500693,
 'dead': 0}

This table now contains, for each state, the expected return when starting in this state and following policy $\pi$.
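As a cross-check on the iterative approach: for a fixed policy the state values satisfy the linear system $V = R + \gamma P V$, so they can also be solved for directly. The sketch below compares both approaches on a small hypothetical model. The matrix `P` and vector `R` are made-up numbers, not the RabbitMDP dynamics, and `evaluate_iteratively` is an illustrative name.

```python
import numpy as np

gamma = 0.9

# Hypothetical policy-induced model (NOT the RabbitMDP numbers):
# P[i, j] = probability of moving from state i to state j under the policy
P = np.array([
    [0.0, 1.0, 0.0],
    [0.5, 0.3, 0.2],
    [0.0, 0.0, 0.0],  # terminal state: no outgoing transitions
])
# R[i] = expected immediate reward in state i under the policy
R = np.array([0.0, 0.25, 0.0])

def evaluate_iteratively(P, R, gamma, tol=1e-9):
    # same idea as evaluate(): repeat the update until the values stop changing
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new

V_iter = evaluate_iteratively(P, R, gamma)
# direct solution of (I - gamma * P) V = R
V_exact = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V_iter, V_exact)
```

The direct solve is practical for small state spaces like this one. The iterative version is the one that policy iteration is usually described with.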

## Policy improvement

Now that we have accurate state values (and therefore action values), we can improve our policy. We know, in each state, which actions result in the highest expected return. Let's take a look.

In [27]:
for s in mdp.STATES:
    print(s)
    for a in mdp.A(s):
        print('  ', a, q(s, a, V))

idle
   wakeup 0.4121369563372112
hungry
   go eat 0.6449366433224406
   stay 0.27092326070349004
eating
   go eat 0.028085402076525323
   go home 0.09673860809215618
dead


You see that in state `hungry` the action `go eat` has a higher state-action value, and in state `eating` the action `go home` has a higher value.

So, if we want to maximize the expected return, i.e. get a better policy, we should change the probabilities so the actions with higher values are chosen. We can do this by acting greedily on the action values. Let's always take the action with the highest value.

In [28]:
pi['hungry']['go eat'] = 1
pi['hungry']['stay'] = 0
pi['eating']['go eat'] = 0
pi['eating']['go home'] = 1
pi

{'idle': {'wakeup': 1.0},
 'hungry': {'go eat': 1, 'stay': 0},
 'eating': {'go eat': 0, 'go home': 1},
 'dead': {}}

But now our state values are no longer correct. They represent the values when following the previous policy. This policy is no longer the same, so we should re-compute our state values.

In [29]:
V = evaluate(pi)
V

{'idle': 0.7693461298894951,
 'hungry': 0.8548290332105501,
 'eating': 0.3539292137503331,
 'dead': 0}

Compared to the previous policy the values have changed a lot. Let's take a look at the action values.

In [30]:
for s in mdp.STATES:
    print(s)
    for a in mdp.A(s):
        print('  ', a, q(s,a, V))

idle
   wakeup 0.7693461298894951
hungry
   go eat 0.85482903390024
   stay 0.5924115169005456
eating
   go eat 0.15926814618764995
   go home 0.35392921352043644
dead


The absolute values have also changed a lot. But for each state, the action with the highest value has not changed.

If we want to get the highest expected return, we should still act greedily, and that means no change to the current policy.

We are done: the policy no longer changes under greedy improvement, so it is the optimal policy.
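The two steps above, evaluation and greedy improvement, can be combined into one loop that repeats until the policy stops changing. The sketch below does this for a small hypothetical model so that it runs without the `mdp` module. The `model` table, its states and rewards, and the function names `improve` and `policy_iteration` are all assumptions for illustration.

```python
gamma = 0.9

# Hypothetical toy model, NOT the RabbitMDP dynamics:
# model[s][a] is a list of (probability, next_state, reward) triples
model = {
    'A': {'left': [(1.0, 'A', 0.0)],
          'right': [(0.8, 'B', 1.0), (0.2, 'end', -1.0)]},
    'B': {'left': [(1.0, 'A', 0.0)],
          'right': [(1.0, 'end', 2.0)]},
    'end': {},  # terminal state, no actions
}

def q(s, a, V):
    # action value: expected reward plus discounted value of the next state
    return sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])

def evaluate(pi, tol=1e-9):
    # iterative policy evaluation, starting from all zeros
    V = {s: 0.0 for s in model}
    while True:
        V_new = {s: sum(pi[s][a] * q(s, a, V) for a in model[s]) for s in model}
        diff = max(abs(V_new[s] - V[s]) for s in model)
        V = V_new
        if diff < tol:
            return V

def improve(V):
    # greedy policy: probability 1 for the action with the highest value
    pi = {}
    for s in model:
        actions = list(model[s])
        if not actions:
            pi[s] = {}
            continue
        best = max(actions, key=lambda a: q(s, a, V))
        pi[s] = {a: 1.0 if a == best else 0.0 for a in actions}
    return pi

def policy_iteration():
    # start from the uniformly random policy
    pi = {s: {a: 1.0 / len(model[s]) for a in model[s]} for s in model}
    while True:
        V = evaluate(pi)
        new_pi = improve(V)
        if new_pi == pi:  # policy is stable, so we stop
            return pi, V
        pi = new_pi

pi_star, V_star = policy_iteration()
print(pi_star)
print(V_star)
```

In this toy model the loop settles on `right` in both states after one improvement step, mirroring what happened by hand above.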

## Greedy with NumPy

How can you act greedy on action values programmatically?

Let's see for a single state what the action values are.

In [31]:
s = 'hungry'
{a: q(s, a, V) for a in mdp.A(s)}

{'go eat': 0.85482903390024, 'stay': 0.5924115169005456}

Let's only pick the values, and not the action names.

In [32]:
[q(s,a,V) for a in mdp.A(s)]

[0.85482903390024, 0.5924115169005456]

With NumPy we can find the maximum value of this list using `np.max`.

In [33]:
np.max([q(s,a,V) for a in mdp.A(s)])

0.85482903390024

And with `np.argmax` we can find the index in the list where the maximum value occurs.

In [34]:
max_a = np.argmax([q(s,a,V) for a in mdp.A(s)])
max_a

0

Let's see if it works in another state.

In [35]:
s = 'eating'
max_a = np.argmax([q(s,a,V) for a in mdp.A(s)])
max_a

1

Which action is this?

In [36]:
mdp.A(s)[max_a]

'go home'

And what is the value of that action?

In [37]:
np.max([q(s,a,V) for a in mdp.A(s)])

0.35392921352043644

Since our policy is greedy, i.e. it always takes the action with the highest expected return, the value of this state should equal the value of taking this action, up to the convergence threshold used in `evaluate`.

In [38]:
V[s]

0.3539292137503331
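The `np.argmax` steps of this section can be wrapped in a helper that builds a complete greedy policy from a table of action values. The sketch below uses the action values printed earlier, rounded to three decimals, so that it runs on its own. The table `Q` and the name `greedy_policy` are illustrative and not part of the `mdp` module.

```python
import numpy as np

# Action values copied from the printed output above (rounded)
Q = {
    'idle': {'wakeup': 0.769},
    'hungry': {'go eat': 0.855, 'stay': 0.592},
    'eating': {'go eat': 0.159, 'go home': 0.354},
    'dead': {},
}

def greedy_policy(Q):
    pi = {}
    for s, action_values in Q.items():
        actions = list(action_values)
        if not actions:  # terminal state: nothing to choose
            pi[s] = {}
            continue
        # index of the action with the highest value
        best = int(np.argmax([action_values[a] for a in actions]))
        pi[s] = {a: 1.0 if i == best else 0.0 for i, a in enumerate(actions)}
    return pi

print(greedy_policy(Q))
```

Note that `np.argmax` returns the first index when several actions share the maximum value, so ties are broken in favor of the action listed first.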