# Iterative Policy Evaluation
This solves the prediction problem, i.e. given a policy, find its value function.<br>
First initialise V(s) to zero for all states, then sweep over all the states and update V(s) repeatedly until the largest change in a sweep falls below a small threshold.
- Prediction problem: given a policy, find the value function
- Control problem: find the optimal policy
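
The update applied to each state during a sweep is the Bellman expectation backup. As a sketch, with $\pi(a | s)$ for the policy, $p(s', r | s, a)$ for the transition model, $\gamma$ for the discount factor, and $k$ as an assumed sweep index:

$$ V_{k+1}(s) = \sum_{a} \pi(a | s) \sum_{s'}\sum_{r} p(s', r | s, a)[r + \gamma V_k(s')]$$

When transitions are deterministic, the inner sums collapse to the single $(s', r)$ pair that follows from taking $a$ in $s$. The code in this notebook updates V in place, so some values on the right-hand side already come from the current sweep; the iteration still converges.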

We are going to implement Iterative Policy Evaluation for two different policies, a random and a deterministic policy.

### Random Policy
The probability of an action is 1 divided by the number of actions.
$$\pi(a | s) = \frac{1}{|A|}$$
$p(s', r | s, a)$ is only relevant when state transitions are stochastic.
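
To make the uniform policy concrete, here is a minimal, self-contained sketch that does not use `grid_world`. It assumes a hypothetical 5-state corridor (states 0 and 4 terminal, reward +1 only for entering state 4, $\gamma = 1$); the names `evaluate_uniform_policy` and `V_walk` are illustrative, not part of the original code.

```python
def evaluate_uniform_policy(n_states=5, gamma=1.0, tol=1e-8):
    """Iterative policy evaluation for a uniform random policy on a corridor."""
    goal = n_states - 1
    terminals = {0, goal}
    moves = (-1, +1)               # two actions: step left, step right
    prob = 1 / len(moves)          # pi(a | s) = 1 / |A|
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            if s in terminals:
                continue
            new_v = 0.0
            for move in moves:
                s_prime = s + move                     # deterministic transition
                r = 1.0 if s_prime == goal else 0.0    # reward only at the goal
                new_v += prob * (r + gamma * V[s_prime])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

V_walk = evaluate_uniform_policy()
print([round(v, 3) for v in V_walk])   # approximately [0.0, 0.25, 0.5, 0.75, 0.0]
```

With $\gamma = 1$ each value is the probability of reaching the goal before the other end, which is why the non-terminal values come out evenly spaced.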

### Deterministic Policy
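The policy is completely deterministic. With a reward of +1 for reaching the goal and -1 for reaching the losing state, the value of each state is the discounted terminal reward: $\gamma^{k}$ on the path to the goal and $-\gamma^{k}$ on paths that end in the losing state, where $k$ is the number of steps taken before the final transition. With $\gamma = 1$ these values would be exactly 1 and -1.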
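
As a small check of how discounting shapes these values, the sketch below evaluates a deterministic "always move right" policy on a hypothetical 4-state chain with a +1 reward for entering the last state. The chain, `GAMMA_DEMO`, `always_right` and `V_chain` are assumptions made for illustration only.

```python
GAMMA_DEMO = 0.9
GOAL_STATE = 3                        # hypothetical terminal state
always_right = {0: 1, 1: 2, 2: 3}     # deterministic policy: state -> next state

V_chain = {s: 0.0 for s in range(GOAL_STATE + 1)}
while True:
    delta = 0.0
    for s, s_prime in always_right.items():
        r = 1.0 if s_prime == GOAL_STATE else 0.0
        # single action, single outcome: no sums over a, s' or r are needed
        new_v = r + GAMMA_DEMO * V_chain[s_prime]
        delta = max(delta, abs(new_v - V_chain[s]))
        V_chain[s] = new_v
    if delta < 1e-8:
        break

for s in always_right:
    steps_before_last = GOAL_STATE - 1 - s
    # evaluated value next to the closed form gamma ** k
    print(s, round(V_chain[s], 2), round(GAMMA_DEMO ** steps_before_last, 2))
```

The two printed columns should agree, since a state that is $k$ steps away from the final transition has value $\gamma^{k}$.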

In [1]:
%qtconsole

In [2]:
%matplotlib notebook
from grid_world import standard_grid
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15,6)

In [3]:
THRESHOLD = 10e-4  # convergence threshold; note that 10e-4 == 1e-3

def print_values(V, g):
    for i in range(g.width):
        print('-----------------------')
        for j in range(g.height):
            v = V.get((i, j), 0)
            if v >= 0:
                print(' {0:.2f}'.format(v), end = ' ')
            else:
                print('{0:.2f}'.format(v), end = ' ')
        print()
    print('-----------------------')

def print_policy(P, g):
    for i in range(g.width):
        print('---------------')
        for j in range(g.height):
            p = P.get((i, j), ' ')
            print(' ' + p + ' ', end = ' ')
        print()
    print('---------------')

# Random Policy - Iterative Policy Evaluation

In [4]:
grid = standard_grid()
states = grid.all_states()
V = {s: 0 for s in states}
# discount factor gamma
gamma = 1

while True:
    delta = 0
    for s in states:
        old_v = V[s]
        if s in grid.actions:
            actions = grid.actions[s]
            v = 0
            for a in actions:
                grid.set_state(s)
                r = grid.move(a)
                s_prime = grid.current_state()
                v += 1 / len(actions) * (r + gamma * V[s_prime])
            V[s] = v
            delta = max(delta, abs(v - old_v))
    if delta < THRESHOLD:
        break
print('\nValue function for uniformly random actions:')
print_values(V, grid)
print('\n')


Value function for uniformly random actions:
-----------------------
-0.03  0.09  0.22  0.00 
-----------------------
-0.16  0.00 -0.44  0.00 
-----------------------
-0.29 -0.41 -0.54 -0.77 
-----------------------




# Deterministic Policy - Iterative Policy Evaluation

In [5]:
policy = {(2,0): 'U',
         (1,0): 'U',
         (0,0): 'R',
         (0,1): 'R',
         (0,2): 'R',
         (1,2): 'R',
         (2,1): 'R',
         (2,2): 'R',
         (2,3): 'U'}
print_policy(policy, grid)
grid.policy = policy

V_det = {s: 0 for s in states}
gamma = 0.9

while True:
    delta = 0
    for s in states:
        old_v = V_det[s]
        if not grid.is_terminal(s):
            grid.set_state(s)
            a = grid.policy[s]
            r = grid.move(a)
            s_prime = grid.current_state()
            V_det[s] = 1 * (r + gamma * V_det[s_prime])
            delta = max(delta, abs(V_det[s] - old_v))
    if delta < THRESHOLD:
        break
print('\nValue function for deterministic policy:')
print_values(V_det, grid)
print('\n')

---------------
 R   R   R      
---------------
 U       R      
---------------
 U   R   R   U  
---------------

Value function for deterministic policy:
-----------------------
 0.81  0.90  1.00  0.00 
-----------------------
 0.73  0.00 -1.00  0.00 
-----------------------
 0.66 -0.81 -0.90 -1.00 
-----------------------




# Policy Iteration
We alternate Iterative Policy Evaluation and Policy Improvement.<br>
First, we initialise both V and $\pi$. Then we run Policy Evaluation to find V(s) for the current policy, followed by Policy Improvement. If the policy changed in the improvement step, V(s) no longer matches it, so we run Policy Evaluation again, then Policy Improvement, and so on until the policy stops changing.
- Initialise V(s) and $\pi(s)$
- Policy Evaluation to find V(s)
- Policy Improvement to find a better policy $\pi(s)$
- If the policy changed, go back to step 2
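
The steps above can be sketched without `grid_world` on a hypothetical 5-state corridor: states 0 and 4 are terminal, entering 4 gives +1, entering 0 gives -1, and every other move gives 0. All names here (`step`, `q_value`, `evaluate`, `improve`, `toy_policy`, `toy_V`) are illustrative assumptions, not the notebook's API.

```python
GAMMA = 0.9
ACTIONS = {'L': -1, 'R': +1}

def step(s, a):
    """Deterministic transition: return (next_state, reward)."""
    s_prime = s + ACTIONS[a]
    if s_prime == 4:
        return s_prime, 1.0
    if s_prime == 0:
        return s_prime, -1.0
    return s_prime, 0.0

def q_value(s, a, V):
    s_prime, r = step(s, a)
    return r + GAMMA * V[s_prime]

def evaluate(policy, V, tol=1e-8):
    """Policy Evaluation: sweep until V stops changing for this policy."""
    while True:
        delta = 0.0
        for s in policy:
            new_v = q_value(s, policy[s], V)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return

def improve(policy, V):
    """Policy Improvement: act greedily w.r.t. V; report whether anything changed."""
    changed = False
    for s in policy:
        best_a = max(ACTIONS, key=lambda a: q_value(s, a, V))
        if best_a != policy[s]:
            policy[s] = best_a
            changed = True
    return changed

toy_policy = {s: 'L' for s in range(1, 4)}   # deliberately bad starting policy
toy_V = {s: 0.0 for s in range(5)}
changed = True
while changed:
    evaluate(toy_policy, toy_V)
    changed = improve(toy_policy, toy_V)
print(toy_policy)
```

Starting from "always left", one improvement step is enough here to switch every state to `'R'`; a second round confirms that the policy is stable.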

In [6]:
from grid_world import negative_grid

grid = negative_grid()
print('Rewards:\n')
print_values(grid.rewards, grid)
states = grid.all_states()

policy = {s: grid.actions[s][np.random.choice(len(grid.actions[s]))] for s in states if s in grid.actions}
V = {s: 0 for s in states}
print('Random policy:')
print_policy(policy, grid)
print()

gamma = 0.9
changed = True
while changed:
    # Perform Policy Evaluation
    while True:
        delta = 0
        for s in states:
            old_v = V[s]
            if s in grid.actions:
                grid.set_state(s)
                r = grid.move(policy[s])
                s_prime = grid.current_state()
                V[s] = 1 * (r + gamma * V[s_prime])
                # use the absolute change, otherwise decreases in V are ignored
                delta = max(delta, abs(V[s] - old_v))
        if delta < THRESHOLD:
            break
    
    # Perform Policy Improvement
    changed = False
    for s in states:
        if s in grid.actions:
            old_a = policy[s]
            possible_actions = grid.actions[s]
            best_a = old_a
            best_value = float('-inf')
            # pick the action with the highest one-step lookahead value
            for a in possible_actions:
                grid.set_state(s)
                r = grid.move(a)
                s_prime = grid.current_state()
                v = r + gamma * V[s_prime]
                if v > best_value:
                    best_value = v
                    best_a = a
            policy[s] = best_a
            if best_a != old_a:
                changed = True

print('\nValue function: ')
print_values(V, grid)
print('\nOptimal Policy:')
print_policy(policy, grid)

Rewards:

-----------------------
-0.10 -0.10 -0.10  1.00 
-----------------------
-0.10  0.00 -0.10 -1.00 
-----------------------
-0.10 -0.10 -0.10 -0.10 
-----------------------
Random policy:
---------------
 D   L   R      
---------------
 D       R      
---------------
 U   L   U   L  
---------------


Value function: 
-----------------------
 0.62  0.80  1.00  0.00 
-----------------------
 0.46  0.00  0.80  0.00 
-----------------------
 0.31  0.46  0.62  0.46 
-----------------------

Optimal Policy:
---------------
 R   R   R      
---------------
 U       U      
---------------
 U   R   U   L  
---------------


# Policy Iteration in Windy Gridworld
Now the state transitions are not deterministic: when the agent wants to go up, it goes up with probability 0.5 and goes left, right or down with probability $\frac{0.5}{3}$ each.<br>
So we have to take $p(s', r | s, a)$ into account.
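
As a sketch of what taking $p(s', r | s, a)$ into account means for a single state, the hypothetical helper below weights the outcome of the intended action by 0.5 and spreads the remaining 0.5 evenly over the other actions. The function name, the state labels and the numbers are made up for illustration.

```python
def windy_action_value(intended, outcomes, V, gamma=0.9, p_intended=0.5):
    """Expected one-step return when `intended` is chosen under windy transitions.

    `outcomes` maps each action to the (next_state, reward) pair it produces
    when it is the action that actually gets executed.
    """
    p_other = (1 - p_intended) / (len(outcomes) - 1)
    value = 0.0
    for executed, (s_prime, r) in outcomes.items():
        p = p_intended if executed == intended else p_other
        value += p * (r + gamma * V[s_prime])
    return value

# Hypothetical neighbourhood of one state: moving right reaches the goal
V_example = {'goal': 0.0, 'a': 0.2, 'b': -0.4, 'c': 0.1}
outcomes = {
    'U': ('a', -0.3),
    'D': ('b', -0.3),
    'L': ('c', -0.3),
    'R': ('goal', 1.0),
}
q_right = windy_action_value('R', outcomes, V_example)
q_down = windy_action_value('D', outcomes, V_example)
print(round(q_right, 3), round(q_down, 3))
```

Even when the agent intends to go right, half of the probability mass lands on the other three outcomes, so the value of the action is well below the +1 goal reward.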

In [9]:
from grid_world import negative_grid

POSSIBLE_ACTIONS = ['U', 'D', 'L', 'R']

grid = negative_grid(step_cost=-0.3)
print('Rewards:\n')
print_values(grid.rewards, grid)
states = grid.all_states()

policy = {s: grid.actions[s][np.random.choice(len(grid.actions[s]))] for s in states if s in grid.actions}
V = {s: 0 for s in states}
print('Random policy:')
print_policy(policy, grid)
print()

gamma = 0.9
changed = True
while changed:
    # Perform Policy Evaluation
    while True:
        delta = 0
        for s in states:
            old_v = V[s]
            new_v = 0
            if s in grid.actions:
                for a in POSSIBLE_ACTIONS:
                    if a == policy[s]:
                        p = 0.5
                    else:
                        p = 0.5 / 3
                    grid.set_state(s)
                    r = grid.move(a)
                    s_prime = grid.current_state()
                    new_v += p * (r + gamma * V[s_prime])
                V[s] = new_v
                # use the absolute change, otherwise decreases in V are ignored
                delta = max(delta, abs(V[s] - old_v))
        if delta < THRESHOLD:
            break
    
    # Perform Policy Improvement
    changed = False
    for s in states:
        if s in grid.actions:
            old_a = policy[s]
            possible_actions = grid.actions[s]
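            best_a = old_a
            best_value = float('-inf')
            for a in possible_actions:
                # expected one-step return when a is the intended action
                v = 0
                for a2 in POSSIBLE_ACTIONS:
                    if a2 == a:
                        p = 0.5
                    else:
                        p = 0.5 / 3
                    grid.set_state(s)
                    r = grid.move(a2)
                    s_prime = grid.current_state()
                    v += p * (r + gamma * V[s_prime])
                # keep the action with the highest expected return
                if v > best_value:
                    best_value = v
                    best_a = a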
            policy[s] = best_a
            if best_a != old_a:
                changed = True

print('\nValue function: ')
print_values(V, grid)
print('\nOptimal Policy:')
print_policy(policy, grid)

Rewards:

-----------------------
-0.30 -0.30 -0.30  1.00 
-----------------------
-0.30  0.00 -0.30 -1.00 
-----------------------
-0.30 -0.30 -0.30 -0.30 
-----------------------
Random policy:
---------------
 R   L   D      
---------------
 U       D      
---------------
 R   L   U   L  
---------------


Value function: 
-----------------------
-1.08 -0.52  0.22  0.00 
-----------------------
-1.49  0.00 -0.57  0.00 
-----------------------
-1.72 -1.53 -1.13 -1.17 
-----------------------

Optimal Policy:
---------------
 R   R   R      
---------------
 U       U      
---------------
 U   R   U   U  
---------------


# Value Iteration
We merge the policy evaluation and policy improvement steps into a single update: one Bellman equation that takes the maximum, over all actions, of the expected return.
$$ V_{k+1}(s) = \max_a{\sum_{s'}\sum_{r} p(s', r | s, a)[r + \gamma V_k(s')]}$$
It's more efficient because we no longer have an iterative algorithm (Iterative Policy Evaluation) nested inside another iterative algorithm.
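
A minimal sketch of this update, again on a hypothetical 5-state corridor rather than `grid_world` (states 0 and 4 terminal, +1 for entering 4, -1 for entering 0, 0 otherwise). The names `corridor_step`, `lookahead` and `value_iteration` are assumptions for illustration.

```python
GAMMA_VI = 0.9
MOVES = {'L': -1, 'R': +1}
NON_TERMINAL = [1, 2, 3]

def corridor_step(s, a):
    """Deterministic transition: return (next_state, reward)."""
    s_prime = s + MOVES[a]
    if s_prime == 4:
        return s_prime, 1.0
    if s_prime == 0:
        return s_prime, -1.0
    return s_prime, 0.0

def lookahead(s, a, V):
    s_prime, r = corridor_step(s, a)
    return r + GAMMA_VI * V[s_prime]

def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in range(5)}
    while True:
        delta = 0.0
        for s in NON_TERMINAL:
            # evaluation and improvement in one step: back up the best action
            best = max(lookahead(s, a, V) for a in MOVES)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # the greedy policy is read off once, after V has converged
    policy = {s: max(MOVES, key=lambda a: lookahead(s, a, V)) for s in NON_TERMINAL}
    return V, policy

V_star, pi_star = value_iteration()
print(V_star, pi_star)
```

Note that the running maximum is taken directly over the action values rather than starting from 0, so states whose action values are all negative are still handled correctly.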

In [25]:
from grid_world import negative_grid

POSSIBLE_ACTIONS = ['U', 'D', 'L', 'R']
gamma = 0.9
THRESHOLD = 10e-4

grid = negative_grid()
print('Rewards:\n')
print_values(grid.rewards, grid)
states = grid.all_states()

policy = {s: grid.actions[s][np.random.choice(len(grid.actions[s]))] for s in states if s in grid.actions}
V = {s: 0 for s in states}
print('Random policy:')
print_policy(policy, grid)
print()

while True:
    delta = 0
    for s in states:
        old_v = V[s]
        # start from -inf so states whose action values are all negative are handled correctly
        max_v = float('-inf')
        if s in grid.actions:
            for a in grid.actions[s]:
                grid.set_state(s)
                r = grid.move(a)
                s_prime = grid.current_state()
                new_v = r + gamma * V[s_prime]
                max_v = max(max_v, new_v)
            V[s] = max_v
            # use the absolute change, otherwise decreases in V are ignored
            delta = max(delta, abs(V[s] - old_v))
    if delta < THRESHOLD:
        break
        
# Find the optimal policy from the optimal value function
for s in policy.keys():
    max_a = None
    max_v = float('-inf')
    for a in grid.actions[s]:
        grid.set_state(s)
        r = grid.move(a)
        s_prime = grid.current_state()
        v = 1 * (r + gamma * V[s_prime])
        if v > max_v:
            max_a = a
            max_v = v
    policy[s] = max_a

print('\nValue function: ')
print_values(V, grid)
print('\nOptimal Policy:')
print_policy(policy, grid)

Rewards:

-----------------------
-0.10 -0.10 -0.10  1.00 
-----------------------
-0.10  0.00 -0.10 -1.00 
-----------------------
-0.10 -0.10 -0.10 -0.10 
-----------------------
Random policy:
---------------
 D   L   R      
---------------
 D       R      
---------------
 U   L   R   U  
---------------


Value function: 
-----------------------
 0.62  0.80  1.00  0.00 
-----------------------
 0.46  0.00  0.80  0.00 
-----------------------
 0.31  0.46  0.62  0.46 
-----------------------

Optimal Policy:
---------------
 R   R   R      
---------------
 U       U      
---------------
 U   R   U   L  
---------------
