In [15]:
import numpy as np
from numpy.random import choice


This article discusses policy improvement, building on the previous discussion of how to solve the gridworld problem.

Recall that the value function for the "random walk" policy can be found either by solving the linear system below exactly or by approximating it with iterative policy evaluation.

$$ V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s^\prime} P_{s{s^\prime}}^{a} 
   (R_{s{s^\prime}}^{a} + \gamma V^{\pi}(s^\prime)) $$
   
$$ V^{\pi}(s) = \sum_{a \in (E,S,W,N)} \frac{1}{4} \sum_{s^\prime}(R_{s{s^\prime}}^{a} + \gamma V^{\pi}(s^\prime)) $$

$$ 4V_{00}=\gamma V_{10}+0+\gamma V_{01}+0+\gamma V_{00}+-1+\gamma V_{00}+-1 $$
$$ 4V_{01}=\gamma V_{11}+0+\gamma V_{02}+0+\gamma V_{01}+-1+\gamma V_{00}+0 $$
$$ 4V_{02}=\gamma V_{12}+0+\gamma V_{03}+0+\gamma V_{02}+-1+\gamma V_{01}+0 $$
$$ 4V_{03}=\gamma V_{13}+0+\gamma V_{04}+0+\gamma V_{03}+-1+\gamma V_{02}+0 $$
$$ 4V_{04}=\gamma V_{14}+0+\gamma V_{04}+-1+\gamma V_{04}+-1+\gamma V_{03}+0 $$
$$ 4V_{10}=\gamma V_{14}+10+\gamma V_{14}+10+\gamma V_{14}+10+\gamma V_{14}+10 $$
$$ 4V_{11}=\gamma V_{21}+0+\gamma V_{12}+0+\gamma V_{01}+0+\gamma V_{10}+0 $$
$$ 4V_{12}=\gamma V_{22}+0+\gamma V_{13}+0+\gamma V_{02}+0+\gamma V_{11}+0 $$
$$ 4V_{13}=\gamma V_{23}+0+\gamma V_{14}+0+\gamma V_{03}+0+\gamma V_{12}+0 $$
$$ 4V_{14}=\gamma V_{24}+0+\gamma V_{14}+-1+\gamma V_{04}+0+\gamma V_{13}+0 $$
$$ 4V_{20}=\gamma V_{30}+0+\gamma V_{21}+0+\gamma V_{10}+0+\gamma V_{20}+-1 $$
$$ 4V_{21}=\gamma V_{31}+0+\gamma V_{22}+0+\gamma V_{11}+0+\gamma V_{20}+0 $$
$$ 4V_{22}=\gamma V_{32}+0+\gamma V_{23}+0+\gamma V_{12}+0+\gamma V_{21}+0 $$
$$ 4V_{23}=\gamma V_{33}+0+\gamma V_{24}+0+\gamma V_{13}+0+\gamma V_{22}+0 $$
$$ 4V_{24}=\gamma V_{34}+0+\gamma V_{24}+-1+\gamma V_{14}+0+\gamma V_{23}+0 $$
$$ 4V_{30}=\gamma V_{33}+5+\gamma V_{33}+5+\gamma V_{33}+5+\gamma V_{33}+5 $$
$$ 4V_{31}=\gamma V_{41}+0+\gamma V_{32}+0+\gamma V_{21}+0+\gamma V_{30}+0 $$
$$ 4V_{32}=\gamma V_{42}+0+\gamma V_{33}+0+\gamma V_{22}+0+\gamma V_{31}+0 $$
$$ 4V_{33}=\gamma V_{43}+0+\gamma V_{34}+0+\gamma V_{23}+0+\gamma V_{32}+0 $$
$$ 4V_{34}=\gamma V_{44}+0+\gamma V_{34}+-1+\gamma V_{24}+0+\gamma V_{33}+0 $$
$$ 4V_{40}=\gamma V_{40}+-1+\gamma V_{41}+0+\gamma V_{30}+0+\gamma V_{40}+-1 $$
$$ 4V_{41}=\gamma V_{41}+-1+\gamma V_{42}+0+\gamma V_{31}+0+\gamma V_{40}+0 $$
$$ 4V_{42}=\gamma V_{42}+-1+\gamma V_{43}+0+\gamma V_{32}+0+\gamma V_{41}+0 $$
$$ 4V_{43}=\gamma V_{43}+-1+\gamma V_{44}+0+\gamma V_{33}+0+\gamma V_{42}+0 $$
$$ 4V_{44}=\gamma V_{44}+-1+\gamma V_{44}+-1+\gamma V_{34}+0+\gamma V_{43}+0 $$
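
As a cross-check on the system above, here is a minimal sketch of the iterative alternative: sweep over all cells and apply the Bellman equation for the random walk until the values stop changing. The helper names (`step`, `evaluate_random_walk`) and the stopping tolerance are assumptions made for this illustration. The teleport targets ([1,0] to [1,4] with reward 10, [3,0] to [3,3] with reward 5) are taken from the solver code later in this notebook.

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
# action -> (dx, dy); x is the column and y is the row, as in the rest of the notebook
MOVES = {"E": (1, 0), "S": (0, 1), "W": (-1, 0), "N": (0, -1)}
# teleport cell -> (landing cell, reward), same targets as the linear system in this notebook
TELEPORTS = {(1, 0): ((1, 4), 10.0), (3, 0): ((3, 3), 5.0)}


def step(x, y, action):
    """Return ((next_x, next_y), reward) for one deterministic move."""
    if (x, y) in TELEPORTS:
        return TELEPORTS[(x, y)]
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < SIZE and 0 <= ny < SIZE:
        return (nx, ny), 0.0
    return (x, y), -1.0  # hit the wall: stay in place, reward -1


def evaluate_random_walk(tol=1e-10, max_sweeps=10_000):
    """Iterative policy evaluation for the uniform random policy."""
    v = np.zeros((SIZE, SIZE))  # indexed v[x, y]
    for _ in range(max_sweeps):
        new_v = np.zeros_like(v)
        for x in range(SIZE):
            for y in range(SIZE):
                for action in MOVES:
                    (nx, ny), r = step(x, y, action)
                    new_v[x, y] += 0.25 * (r + GAMMA * v[nx, ny])
        if np.max(np.abs(new_v - v)) < tol:
            return new_v
        v = new_v
    return v


v_iter = evaluate_random_walk()
# transpose so that rows are y and columns are x, like the tables in this article
print(np.round(v_iter.T, 3))
```

The sweep builds a fresh array on every pass (a "synchronous" update). Updating in place usually converges a little faster, but the synchronous form maps more directly onto the equations.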

The book mentions that one way to look for a better policy at a given state is to select a different action there and follow the existing policy afterwards. The value of doing so is the action-value function:

$$ Q^{\pi}(s, a) = \sum_{s'} P_{s{s^\prime}}^{a}(R_{s{s^\prime}}^{a} + \gamma V^{\pi}(s^\prime)) $$
   
If the environment dynamics are fully known and deterministic, taking a given action leads to a single next state $s^\prime$ that is known for certain, so this simplifies further to:

$$ Q^{\pi}(s, a) = R_{s{s^\prime}}^{a} + \gamma V^{\pi}(s^\prime) $$


This gets interesting when you compare the definitions of $V^\pi(s)$ and $Q^\pi(s, a)$. The state value is the policy-weighted average of the action values:

$$ V^{\pi}(s) = \sum_{a} \pi(s,a) Q^\pi(s, a)  $$
   

Let's pick a cell and check that the equation above holds. Take cell (0, 2), the middle cell of the leftmost column.

Each possible action leads to the following next state:

- S = [0,2] ---> a = E, r = 0  ---> S' = [1,2]
- S = [0,2] ---> a = S, r = 0  ---> S' = [0,3]
- S = [0,2] ---> a = W, r = -1  ---> S' = [0,2]   (hit the wall, bounce back)
- S = [0,2] ---> a = N, r = 0  ---> S' = [0,1]

We know that $Q^{\pi}(s, a) = R_{s{s^\prime}}^{a} + \gamma V^{\pi}(s^\prime)$

Below is the value function under the random walk policy. State $[x, y]$ sits in column c$x$ and row r$y$:

||c0|c1|c2|c3|c4|
|-|-|-|-|-|-|
|r0|3.282|8.762|4.097|4.403|0.998|
|r1|1.483|2.922|2.058|1.558|0.258|
|r2|0.015|0.682|0.569|0.204|-0.556|
|r3|-1.002|-0.475|-0.414|-0.663|-1.266|
|r4|-1.882|-1.375|-1.27|-1.473|-2.029|

Calculating the Q function for every possible action in state (0,2) gives:

- $Q^{\pi}(s=[0,2], a=E) = R_{s=[0,2]{s^\prime=[1,2]}}^{a=E} + \gamma V^{\pi}(s^\prime=[1,2]) = 0 + 0.9 * V^{\pi}(s^\prime=[1,2]) = 0 + 0.9 * 0.682 = 0.614 $
- $Q^{\pi}(s=[0,2], a=S) = R_{s=[0,2]{s^\prime=[0,3]}}^{a=S} + \gamma V^{\pi}(s^\prime=[0,3]) = 0 + 0.9 * V^{\pi}(s^\prime=[0,3]) = 0 + 0.9 * (-1.002) = -0.902 $
- $Q^{\pi}(s=[0,2], a=W) = R_{s=[0,2]{s^\prime=[0,2]}}^{a=W} + \gamma V^{\pi}(s^\prime=[0,2]) = -1 + 0.9 * V^{\pi}(s^\prime=[0,2]) = -1 + 0.9 * 0.015 = -0.987$
- $Q^{\pi}(s=[0,2], a=N) = R_{s=[0,2]{s^\prime=[0,1]}}^{a=N} + \gamma V^{\pi}(s^\prime=[0,1]) = 0 + 0.9 * V^{\pi}(s^\prime=[0,1]) = 0 + 0.9 * 1.483 = 1.335$


Under the random policy, each of the four actions is taken with equal probability, so we have

$$ V^{\pi}(s=[0,2]) = \frac{Q^{\pi}(s=[0,2], a=E) + Q^{\pi}(s=[0,2], a=S) + Q^{\pi}(s=[0,2], a=W) + Q^{\pi}(s=[0,2], a=N)}{4} = 0.01505 \approx 0.015 $$
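
The arithmetic above can be reproduced in a few lines. This is a small sketch that reads the values from the random walk table, computes the four Q values for state [0,2], and averages them. The names `v_table`, `v_at` and `transitions` are made up for this example.

```python
import numpy as np

gamma = 0.9

# random walk value table from above, laid out as v_table[row y][column x]
v_table = np.array([
    [ 3.282,  8.762,  4.097,  4.403,  0.998],
    [ 1.483,  2.922,  2.058,  1.558,  0.258],
    [ 0.015,  0.682,  0.569,  0.204, -0.556],
    [-1.002, -0.475, -0.414, -0.663, -1.266],
    [-1.882, -1.375, -1.270, -1.473, -2.029],
])


def v_at(x, y):
    """Value of state [x, y] (x = column, y = row)."""
    return v_table[y, x]


# action -> (reward, next state) when starting from s = [0, 2]
transitions = {
    "E": (0, (1, 2)),
    "S": (0, (0, 3)),
    "W": (-1, (0, 2)),  # hit the wall, bounce back
    "N": (0, (0, 1)),
}

# Q(s, a) = R + gamma * V(s') for a deterministic environment
q = {a: r + gamma * v_at(*s_next) for a, (r, s_next) in transitions.items()}

# V(s) = sum_a pi(s, a) * Q(s, a), with pi(s, a) = 1/4 for the random walk
v_from_q = sum(0.25 * q_a for q_a in q.values())
best_action = max(q, key=q.get)

for a, q_a in q.items():
    print(a, round(q_a, 3))
print("V from Q:", round(v_from_q, 5), "best action:", best_action)
```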

Now looking at the Q value of all the possible actions, we can tell that heading North has the highest Q value of 1.335. So what does this mean? 

First, keep in mind that a state value function is always defined with respect to a given policy. All the values above were calculated under the "random walk" policy. Suppose we change the policy to "whenever you are on cell (0,2), head North instead of walking randomly, because North has the highest Q value". This is now a different policy overall, so its value function is different too.

Let's first start by looking at two scenarios:

### Scenario 1
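The only difference between the two policies is: "the **FIRST** time you land on S=(0,2), go North; on any later visit to (0,2), keep walking randomly, as you do for the rest of the game".

Keep in mind what determines the value function: the policy and the dynamics of the environment. Here everything after that first step still follows the random walk, so the value of that first visit to (0,2) is exactly $Q^{\pi}(s=[0,2], a=N) = 1.335$ instead of $V^{\pi}(s=[0,2]) = 0.015$.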


### Scenario 2
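The only difference between the two policies is: "**EVERY TIME** you land on S=(0,2), go North". This change in the policy changes the linear system that we created in the first notebook: the equation for $V_{02}$ becomes $V_{02} = 0 + \gamma V_{01}$, and all the other equations stay the same.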

In [9]:
size = 5
coef = np.zeros(shape=(size*size,size*size))
res = [0 for i in range(size*size)]
gamma = 0.9

# loop through each column
for x in range(size):
    
    # loop through each row
    for y in range(size):
        
        equation_num = x*size+y
        # skip the special states for now, deal with them later
        # the two teleport points S=[1,0] and S=[3,0], plus S=[0,2], where the policy is now "always go North"
        if equation_num in (5, 15, 2):
            continue
        
        # top-left:0, bot-left:4, top-right:20, bot-right:24
        cell_num = equation_num 
        coef[equation_num][cell_num] = 4  
        
        # E: x+1, y
        r = 0
        x_ = x+1
        y_ = y 
        if x_ > size - 1:
            x_ = x
            r = -1 
        coef[equation_num][x_*size+y_] -= gamma 
        res[equation_num] += r
        
        # S: x, y+1
        r = 0
        x_ = x
        y_ = y+1 
        if y_ > size - 1:
            y_ = y
            r = -1
        coef[equation_num][x_*size+y_] -= gamma 
        res[equation_num] += r


        # W: x-1, y
        r = 0
        x_ = x-1
        y_ = y 
        if x_ < 0:
            x_ = x
            r = -1
        coef[equation_num][x_*size+y_] -= gamma 
        res[equation_num] += r
        # N: x, y-1
        r = 0
        x_ = x
        y_ = y-1 
        if y_ < 0:
            y_ = y
            r = -1
        coef[equation_num][x_*size+y_] -= gamma 
        res[equation_num] += r
        

# V(s=[1,0])
equation_num = 1 * size + 0 
coef[equation_num][1 * size + 0] = 1
coef[equation_num][1 * size + 4] -= gamma
res[equation_num] = 10

# V(s=[3,0])
equation_num = 3 * size + 0 
coef[equation_num][3 * size + 0] = 1 
coef[equation_num][3 * size + 3] -= gamma 
res[equation_num] = 5

# V(s=[0,2])
equation_num = 0 * size + 2 
coef[equation_num][0 * size + 2] = 1 
coef[equation_num][0 * size + 1] -= gamma 
res[equation_num] = 0

v = np.linalg.solve(coef, res)

np.set_printoptions(precision=3, suppress=True)

t = v.copy()
t.shape=(5,5)
print(t.transpose())

[[ 3.808  9.111  4.326  4.553  1.106]
 [ 2.419  3.458  2.347  1.723  0.372]
 [ 2.177  1.493  0.924  0.385 -0.437]
 [-0.057  0.074 -0.118 -0.497 -1.151]
 [-1.336 -0.987 -1.028 -1.323 -1.921]]


In [14]:
v_random_walk = [[ 3.282,  8.762,  4.097,  4.403,  0.998],
 [ 1.483,  2.922,  2.058,  1.558,  0.258],
 [ 0.015,  0.682,  0.569,  0.204, -0.556],
 [-1.002, -0.475, -0.414, -0.663, -1.266],
 [-1.882, -1.375, -1.27,  -1.473, -2.029]]

print(">>> Difference between the new policy and old policy")
print(">>> Positive means the value function became bigger in the new policy")
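# t is indexed [x][y]; transpose it so it lines up with v_random_walk, which is laid out [y][x]
print(t.transpose()-v_random_walk)

>>> Difference between the new policy and old policy
>>> Positive means the value function became bigger in the new policy
[[ 0.526  0.349  0.229  0.15   0.108]
 [ 0.936  0.536  0.289  0.165  0.114]
 [ 2.162  0.811  0.355  0.181  0.119]
 [ 0.945  0.549  0.296  0.166  0.115]
 [ 0.546  0.388  0.242  0.15   0.108]]


Here we only changed the action at a single state, yet the value of every state changed. Every difference is positive: the value function increased everywhere, most of all at (0,2) itself (from 0.015 to 2.177), with the gain shrinking toward the far side of the grid. This is what the policy improvement theorem predicts: acting greedily with respect to $Q^\pi$ at one state and following $\pi$ everywhere else never makes any state worse. The differences are computed from the rounded tables above, so the last digit may be off by one.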
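
One way to double-check a comparison like this without hand-editing equations is to build the linear system directly from a policy table. Below is a sketch under that assumption: `solve_value` forms $P_\pi$ and $r_\pi$ from a `(25, 4)` table of action probabilities and solves $(I - \gamma P_\pi) v = r_\pi$. The function and variable names are illustrative and not part of the original notebook. The grid rules and teleport targets are the ones used in the solver above.

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
ACTIONS = ["E", "S", "W", "N"]
MOVES = {"E": (1, 0), "S": (0, 1), "W": (-1, 0), "N": (0, -1)}
# teleport state -> (next state, reward); state = x*SIZE + y
TELEPORTS = {1 * SIZE + 0: (1 * SIZE + 4, 10.0), 3 * SIZE + 0: (3 * SIZE + 3, 5.0)}


def step(state, action):
    """Return (next_state, reward) for one deterministic move."""
    if state in TELEPORTS:
        return TELEPORTS[state]
    x, y = divmod(state, SIZE)
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < SIZE and 0 <= ny < SIZE:
        return nx * SIZE + ny, 0.0
    return state, -1.0  # bounce off the wall


def solve_value(policy):
    """Solve (I - gamma * P_pi) v = r_pi for a policy table of shape (25, 4)."""
    n = SIZE * SIZE
    p_pi = np.zeros((n, n))
    r_pi = np.zeros(n)
    for s in range(n):
        for a_idx, action in enumerate(ACTIONS):
            s_next, r = step(s, action)
            p_pi[s, s_next] += policy[s, a_idx]
            r_pi[s] += policy[s, a_idx] * r
    return np.linalg.solve(np.eye(n) - GAMMA * p_pi, r_pi)


random_policy = np.full((SIZE * SIZE, len(ACTIONS)), 0.25)
north_policy = random_policy.copy()
north_policy[0 * SIZE + 2] = [0, 0, 0, 1]  # always go N at [0,2]

# reshape gives [x][y]; transpose so both arrays are laid out [row y][column x]
v_old = solve_value(random_policy).reshape(SIZE, SIZE).T
v_new = solve_value(north_policy).reshape(SIZE, SIZE).T
diff = v_new - v_old

np.set_printoptions(precision=3, suppress=True)
print(diff)
print("improved (or equal) everywhere:", bool((diff >= -1e-9).all()))
```

Because both arrays come out of the same function and go through the same reshape and transpose, there is no risk of subtracting a `[x][y]` array from a `[y][x]` one.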

In [176]:
actions = ['E', 'S', 'W', 'N']
# a table where each row is a unique state, and each column represents an action
policy_probability_table = np.zeros(shape=(size*size, len(actions))) + 1/len(actions)
qtable = np.zeros(shape=(size*size, len(actions)))
v_copy = v.copy()

# the value function will change after you change your policy
# however, the reward and the next state depend solely on the state and action
# so both can be calculated as functions of (state, action)

def calc_reward(state, action):
    # state = x*size + y, with x the column and y the row
    x, y = divmod(state, size)
    # if A = [1,0] teleport
    if state == 5:
        return 10
    # if B = [3,0] teleport
    elif state == 15:
        return 5
    # if edge
    elif (
        # east [4,0] 20, [4,1] 21, ..., [4,4] 24
        (x == size-1 and action == 'E') or
        # west [0,0] 0, [0,1] 1, ..., [0,4] 4
        (x == 0 and action == 'W') or
        # south [0,4] 4, [1,4] 9, ..., [4,4] 24
        (y == size-1 and action == 'S') or
        # north [0,0] 0, [1,0] 5, ..., [4,0] 20
        (y == 0 and action == 'N')
    ):
        return -1
    else:
        return 0


def calc_next_state(state, action):
    # teleports land on the same cells as in the linear system above
    if state == 5:
        return 1*size + 4
    if state == 15:
        return 3*size + 3
    x, y = divmod(state, size)
    dx, dy = {'E': (1, 0), 'S': (0, 1), 'W': (-1, 0), 'N': (0, -1)}[action]
    x_, y_ = x + dx, y + dy
    # moving off the grid bounces back to the same cell
    if x_ < 0 or x_ > size-1 or y_ < 0 or y_ > size-1:
        return state
    return x_*size + y_


num_iterations = 1
for _ in range(num_iterations):
    for state in range(size*size):

        for action_idx, action in enumerate(actions):
            # R
            reward = calc_reward(state, action)
            # Q = R + gamma * V(s')
            q = reward + gamma * v_copy[calc_next_state(state, action)]
            qtable[state][action_idx] = q

        max_q = qtable[state].max()

        # set policy_probability_table to be the greedy
        # the probability of each optimal action is 1/(# of optimal actions)
        # np.isclose instead of == so floating-point noise does not break ties
        qs = np.isclose(qtable[state], max_q).astype(float)
        qs = qs / qs.sum()
        # choose greedy action, update policy for all states
        policy_probability_table[state] = qs

    # new V
    # generate the linear system, solve it in order to get V

    break
    #repeat
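
The `# new V` and `#repeat` steps left open in the cell above can be sketched as a full policy iteration loop: evaluate the current policy, make it greedy with respect to the resulting values, and stop once the values no longer change. This is one possible way to close the loop, not necessarily the way the notebook was meant to continue. The array names (`NEXT`, `REWARD`), the tie tolerance and the stopping rule are all assumptions made for this example.

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
N_STATES = SIZE * SIZE
MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # E, S, W, N as (dx, dy)

# precompute next state and reward for every (state, action); state = x*SIZE + y
NEXT = np.zeros((N_STATES, len(MOVES)), dtype=int)
REWARD = np.zeros((N_STATES, len(MOVES)))
for s in range(N_STATES):
    x, y = divmod(s, SIZE)
    for a, (dx, dy) in enumerate(MOVES):
        if s == 1 * SIZE + 0:      # teleport A = [1,0] -> [1,4], reward 10
            NEXT[s, a], REWARD[s, a] = 1 * SIZE + 4, 10
        elif s == 3 * SIZE + 0:    # teleport B = [3,0] -> [3,3], reward 5
            NEXT[s, a], REWARD[s, a] = 3 * SIZE + 3, 5
        elif 0 <= x + dx < SIZE and 0 <= y + dy < SIZE:
            NEXT[s, a] = (x + dx) * SIZE + (y + dy)
        else:                      # bounce off the wall
            NEXT[s, a], REWARD[s, a] = s, -1


def evaluate(policy):
    """Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly."""
    p_pi = np.zeros((N_STATES, N_STATES))
    rows = np.arange(N_STATES)
    for a in range(len(MOVES)):
        # for a fixed action every (row, next state) pair is unique, so += is safe
        p_pi[rows, NEXT[:, a]] += policy[:, a]
    r_pi = (policy * REWARD).sum(axis=1)
    return np.linalg.solve(np.eye(N_STATES) - GAMMA * p_pi, r_pi)


def greedy(v):
    """Policy improvement: spread probability evenly over the maximising actions."""
    q = REWARD + GAMMA * v[NEXT]  # shape (25, 4)
    best = np.isclose(q, q.max(axis=1, keepdims=True), atol=1e-9)
    return best / best.sum(axis=1, keepdims=True)


policy = np.full((N_STATES, len(MOVES)), 0.25)  # start from the random walk
v = evaluate(policy)
for iteration in range(100):
    policy = greedy(v)
    v_next = evaluate(policy)
    if np.max(np.abs(v_next - v)) < 1e-9:
        break
    v = v_next

np.set_printoptions(precision=3, suppress=True)
print(v.reshape(SIZE, SIZE).T)  # rows are y, columns are x
```

Stopping on the values rather than on the policy table avoids a subtle issue: two near-tied actions can flip in and out of the greedy set because of floating-point noise, even though the values have already converged.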