# Value Functions

In this notebook we will show you how the state value function $V(s)$ and action value function $Q(s,a)$ of an MDP can be computed.

<img src="figures/RabbitMDP.png" width="50%" align="right" />

## Explore MDP

Let's start with exploring the interface of the Rabbit MDP that was discussed in the lecture. It is depicted on the right.

We first need to import the class that implements it.

In [1]:
from mdp import RabbitMDP

And create a new instance with which we can interact.

In [2]:
mdp = RabbitMDP()

Now, let's see what states and actions are available in this MDP by looking at its public fields.

In [3]:
mdp.STATES

['idle', 'hungry', 'eating', 'dead']

In [4]:
mdp.ACTIONS

['wakeup', 'go eat', 'stay', 'go home']

As you can see there are 4 states, and also 4 actions.

Each action results in a reward. The different reward values that are possible are listed in the field `REWARDS`.

In [5]:
mdp.REWARDS

[0, 1, -1]

However, not all actions are allowed in all states. The function `A(s)` returns the list of actions allowed in state `s`.

In [6]:
for s in mdp.STATES:
    print(s)
    for a in mdp.A(s):
        print(f'  {a}')

idle
  wakeup
hungry
  go eat
  stay
eating
  go eat
  go home
dead


## p function

The other very important component of an MDP is its state-transition function $p(s', r | s, a)$. It returns the probability of ending in state $s'$ with reward $r$ when starting from state $s$ and taking action $a$.

In the MDP class this is implemented with function `p(state, action, next_state, reward)`. Let's see if it returns what we expect.

In [7]:
mdp.p('hungry', 'go eat', 'eating', 1)

0.8

In [8]:
mdp.p('hungry', 'go eat', 'dead', -1)

0.2

Yup, these transitions match what we see in the diagram above.

Transitions that are not allowed have a probability of 0.

In [9]:
mdp.p('hungry', 'go eat', 'eating', 0)

0

In [10]:
mdp.p('hungry', 'go eat', 'idle', 1)

0

The probabilities of all transitions from state $s$ when taking an action $a$ should sum to 1.

In [11]:
total = 0
for next_s in mdp.STATES:
    for r in mdp.REWARDS:
        p = mdp.p('hungry', 'go eat', next_s, r)
        total += p
        print(p)
total

0
0
0
0
0
0
0
0.8
0
0
0
0.2


1.0
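The same check can be run for every state-action pair, not just one. The sketch below is not the `RabbitMDP` class: `TRANSITIONS` and the stand-in `p` are hypothetical, filled in only with probabilities that this notebook prints (here and further down), so the `eating` state is left out.

```python
# Hypothetical stand-in for RabbitMDP.p, holding only transitions printed in this notebook.
# Keys are (state, action); values map (next_state, reward) to a probability.
TRANSITIONS = {
    ('idle', 'wakeup'): {('hungry', 0): 1.0},
    ('hungry', 'go eat'): {('eating', 1): 0.8, ('dead', -1): 0.2},
    ('hungry', 'stay'): {('hungry', 0): 0.9, ('dead', -1): 0.1},
}
STATES = ['idle', 'hungry', 'eating', 'dead']
REWARDS = [0, 1, -1]

def p(state, action, next_state, reward):
    """Return the transition probability, or 0 for anything not listed."""
    return TRANSITIONS.get((state, action), {}).get((next_state, reward), 0)

def total_probability(state, action):
    """Sum p over every (next_state, reward) combination."""
    return sum(p(state, action, s_next, r) for s_next in STATES for r in REWARDS)

totals = {sa: total_probability(*sa) for sa in TRANSITIONS}
for (s, a), total in totals.items():
    print(f'{s}, {a}: {total}')
```

With the real class, the same loop would iterate over `mdp.STATES` and `mdp.A(s)` and call `mdp.p` instead.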

## Interacting

Interacting with the environment (MDP) goes as follows.

The environment starts in a certain state, in this case `idle`.

In [12]:
s = mdp.STATES[0]
s

'idle'

Then there are a number of actions possible in that state.
In this case only one.

In [13]:
mdp.A(s)

['wakeup']

The agent can now determine which action to take using its policy $\pi(a|s)$.

This policy will return the probability of taking each action. Since we have only one action in this state, the probability is `1` for action `wakeup`.

In [14]:
a = mdp.A(s)[0]
a

'wakeup'
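A policy can be pictured as a function that maps a state to a probability for each allowed action. The following is a minimal sketch of a uniformly random policy; `ALLOWED` and `uniform_policy` are hypothetical names, and the action lists are copied from the output of `mdp.A(s)` shown earlier.

```python
# Allowed actions per state, copied from the output of mdp.A(s) earlier in this notebook.
ALLOWED = {
    'idle': ['wakeup'],
    'hungry': ['go eat', 'stay'],
    'eating': ['go eat', 'go home'],
    'dead': [],
}

def uniform_policy(state):
    """Hypothetical helper: pi(a|s) as a dict, uniform over the allowed actions."""
    actions = ALLOWED[state]
    return {a: 1 / len(actions) for a in actions}

print(uniform_policy('idle'))    # {'wakeup': 1.0}
print(uniform_policy('hungry'))  # {'go eat': 0.5, 'stay': 0.5}
```

A state without actions, such as `dead`, simply yields an empty dictionary.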

Then the environment will tell us which state we will end up in using its `p` function. For each possible next state and reward combination it gives a probability of that transition.

In [15]:
for s_next in mdp.STATES:
    for r in mdp.REWARDS:
        p = mdp.p(s, a, s_next, r)
        if p > 0:
            print(f'{s_next}, {r}: {p}')

hungry, 0: 1


Only one transition has a nonzero probability: it goes to next state `hungry` with a reward of `0`.

So now the current state is updated and we have another set of allowed actions.

In [16]:
s = 'hungry'
mdp.A(s)

['go eat', 'stay']

Now the agent can choose between two actions. Let's assume its policy is completely random, so it has a 50% chance of taking each action. In this case it chooses the `stay` action.

In [17]:
a = 'stay'

Let's see again what transitions are possible by the environment.

In [18]:
def show_transitions(s, a):
    for s_next in mdp.STATES:
        for r in mdp.REWARDS:
            p = mdp.p(s, a, s_next, r)
            if p > 0:
                print(f'{s_next}, {r}: {p}')
show_transitions(s, a)

hungry, 0: 0.9
dead, -1: 0.1


In this case the environment remains in the same state, without a reward, with 90% probability. But in 10% of the cases the environment ends up in the `dead` state with a negative reward.
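The `p` function only reports probabilities; to actually step the environment, one outcome has to be drawn according to them. The sketch below shows one way to do that with the standard library. `OUTCOMES` and `sample_transition` are hypothetical and not part of the `RabbitMDP` class; the probabilities are copied from the output above.

```python
import random

# Outcomes of taking 'stay' in 'hungry', copied from the output above.
OUTCOMES = {('hungry', 0): 0.9, ('dead', -1): 0.1}

def sample_transition(outcomes, rng):
    """Hypothetical helper: draw one (next_state, reward) pair according to its probability."""
    pairs = list(outcomes)
    weights = [outcomes[pair] for pair in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_transition(OUTCOMES, rng) for _ in range(10_000)]
died = sum(1 for s_next, _ in samples if s_next == 'dead')
print(died / len(samples))  # fraction of draws ending in 'dead', expected to be near 0.1
```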

## Trajectory

A trajectory is a sequence of such interactions with the environment.

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, \dots, R_T, S_T$$

For example we can take the following trajectory:

    idle, wakeup, 0, hungry, stay, 0, hungry, go eat, +1, eating, go home, 0, idle

Over this trajectory the return $G$ is equal to

$$G = R_1 + R_2 + R_3 + R_4 = 0 + 0 + 1 + 0 = 1$$
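As a small sketch, the same sum can be written as a function over the list of rewards. The helper `discounted_return` is hypothetical; its optional `gamma` argument weights later rewards less, and the default of `1.0` reproduces the plain sum used here.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of rewards R_1, R_2, ... weighted by gamma**0, gamma**1, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0, 0, 1, 0]  # R_1 .. R_4 from the trajectory above
G = discounted_return(rewards)
print(G)  # 1.0
```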

## Transition values

We are going to compute the value of each state with the given functions.

But, we are not there yet, so let's assume we know the values of each state upfront.

In [19]:
V = {'idle': 0.41213695806784034, 'hungry': 0.4579299534087115, 'eating': 0.06241200632828714, 'dead': 0}

Again let's start from the initial state.

In [20]:
s = mdp.STATES[0]
s

'idle'

The value of this state, $V(\text{idle})$, is the expected return when following policy $\pi$. It can be computed by taking the value of each possible transition from this state, multiplying it by the probability that the transition occurs, and summing the results.

In this state it is easy. As seen above, there is only one possible transition. When taking action `wakeup` there is a 100% probability that the environment ends up in state `hungry` with `0` reward.

In [21]:
a = 'wakeup'
s_next = 'hungry'
r = 0

The value of this transition is the expected return of this transition. In other words, the reward you get immediately plus the expected return from the next state. However, this environment is continuing, so we have to apply discounting to the future rewards.

$$G = R_1 + \gamma R_2 + \gamma^2 R_3 + \dots$$

For this example we use a discount factor $\gamma$ of 0.9, which more or less limits our horizon to 10 steps into the future.

In [22]:
gamma = 0.9
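One common way to make the 10-step figure concrete is to look at a constant reward of 1 at every step. Its discounted return is a geometric series:

$$\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma} = \frac{1}{1 - 0.9} = 10$$

This equals the undiscounted return over 10 steps. Rewards further away are not ignored, but each one is weighted by at most $0.9^{10} \approx 0.35$.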

Assuming we know the value of the next state (`V['hungry']`), we can compute the value of this transition.

In [23]:
r + gamma * V[s_next]

0.41213695806784034

As you can see this is the same value as was already in the table.

In [24]:
V['idle']

0.41213695806784034

## Action values

Now let's continue with the next state, `hungry`. From this state two actions are possible.

In [25]:
s = 'hungry'
mdp.A(s)

['go eat', 'stay']

For each action we can take a look at the possible transitions.

In [26]:
a = 'go eat'
print(a)
show_transitions(s, a)

a = 'stay'
print(a)
show_transitions(s, a)

go eat
eating, 1: 0.8
dead, -1: 0.2
stay
hungry, 0: 0.9
dead, -1: 0.1


Let's compute the value for the first action, `go eat`. Two transitions are possible, so let's compute the value of each.

In [27]:
v1 = 1 + gamma * V['eating']
print(v1)
v2 = -1 + gamma * V['dead']
print(v2)

1.0561708056954584
-1.0


The transition to state `eating` has 80% probability and to state `dead` 20%. So the expected return of taking action `go eat` from state `hungry` is equal to:

In [28]:
v_go_eat = 0.8 * v1 + 0.2 * v2
v_go_eat

0.6449366445563667

The other action, `stay`, also has two possible transitions, so the expected return for taking action `stay` in state `hungry` is equal to:

In [29]:
v1 = 0 + gamma * V['hungry']
v2 = -1 + gamma * V['dead']
v_stay = 0.9 * v1 + 0.1 * v2
v_stay

0.27092326226105634

These values are actually the action values $Q(\text{hungry}, \text{go eat})$ and $Q(\text{hungry}, \text{stay})$. In other words, $Q(s, a)$ is the expected return when taking action $a$ from state $s$.

In [30]:
q_go_eat = v_go_eat
q_stay = v_stay
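The two hand computations above follow one pattern: $Q(s, a) = \sum_{s', r} p(s', r | s, a) \, [r + \gamma V(s')]$. The sketch below writes that pattern as a single function. `TRANSITIONS` is a hypothetical stand-in for `mdp.p` that holds only the transitions out of `hungry` printed above; `V` and `gamma` repeat the values used in this notebook.

```python
# Hypothetical stand-in for mdp.p: transitions out of 'hungry', copied from the output above.
TRANSITIONS = {
    ('hungry', 'go eat'): {('eating', 1): 0.8, ('dead', -1): 0.2},
    ('hungry', 'stay'): {('hungry', 0): 0.9, ('dead', -1): 0.1},
}
V = {'idle': 0.41213695806784034, 'hungry': 0.4579299534087115,
     'eating': 0.06241200632828714, 'dead': 0}
gamma = 0.9

def q(state, action):
    """Expected return of taking `action` in `state`, given the state values V."""
    return sum(prob * (r + gamma * V[s_next])
               for (s_next, r), prob in TRANSITIONS[(state, action)].items())

print(q('hungry', 'go eat'))  # approximately 0.6449
print(q('hungry', 'stay'))    # approximately 0.2709
```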

## State values

The policy of the agent determines the probability of taking each action. Because the policy determines what happens next from the current state, we can compute the expected return for state `hungry` when we follow the current policy.

Our policy is completely random, so the probability of taking the two actions is 50% for each. So in 50% of the cases the return will be the expected return of taking action `go eat`, and in the other 50% of the cases the return will be the expected return of taking action `stay`.

In [31]:
v_hungry = 0.5 * q_go_eat + 0.5 * q_stay
v_hungry

0.4579299534087115

This is the value of state `hungry` and is indeed equal to the value in the table.

In [32]:
V['hungry']

0.4579299534087115
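Written out in general, the state value is the policy-weighted average of the action values: $V(s) = \sum_a \pi(a|s) \, Q(s, a)$. The sketch below checks this against the table for both `idle` and `hungry`. It is a self-contained illustration rather than the `RabbitMDP` API: `TRANSITIONS`, `POLICY`, `q` and `v` are hypothetical names, filled in with the probabilities and values shown in this notebook.

```python
# Hypothetical stand-ins, copied from the outputs in this notebook.
TRANSITIONS = {
    ('idle', 'wakeup'): {('hungry', 0): 1.0},
    ('hungry', 'go eat'): {('eating', 1): 0.8, ('dead', -1): 0.2},
    ('hungry', 'stay'): {('hungry', 0): 0.9, ('dead', -1): 0.1},
}
POLICY = {
    'idle': {'wakeup': 1.0},
    'hungry': {'go eat': 0.5, 'stay': 0.5},  # completely random policy
}
V = {'idle': 0.41213695806784034, 'hungry': 0.4579299534087115,
     'eating': 0.06241200632828714, 'dead': 0}
gamma = 0.9

def q(state, action):
    """Action value: probability-weighted sum of r + gamma * V[next_state]."""
    return sum(prob * (r + gamma * V[s_next])
               for (s_next, r), prob in TRANSITIONS[(state, action)].items())

def v(state):
    """State value: policy-weighted sum of the action values."""
    return sum(pi * q(state, a) for a, pi in POLICY[state].items())

for s in POLICY:
    # True when the recomputed value agrees with the table entry
    print(s, abs(v(s) - V[s]) < 1e-9)
```

Both lines print `True`, which is what makes the table consistent: each value equals the expected one-step reward plus the discounted value of the next state.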