# Value Iteration 




#### Introduction 

The purpose of this tutorial is to walk-through how to implement Value Iteration, a dynamic programming method. We will use frozenlake, an existing environment developed by openai gym. This tutotial will be broken up into four parts (see below). We will also walk-through the deterministic and stochastic cases and a few discount factors to gain intuition for how this algorithm works.  

#### 4 Parts:

- Policy Evaluation (prediction)
- Policy Improvement
- Policy Iteration 
- Value Iteration

Reference: Sutton & Barto, 2018, Reinforcement Learning: An Introduction. 

### Environment 

Reference: https://gym.openai.com/envs/FrozenLake-v0/

A 4x4 gridworld with several states: 

- S: starting point, safe
- F: frozen surface, safe
- H: hole, fall to your doom
- G: goal, where the frisbee is located

When you run the environment, the gridworld will be represented as: 

SFFF       
FHFH       
FFFH       
HFFG  

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.

In [1]:
import numpy as np
import gym
import time
from lake_envs import *

np.set_printoptions(precision=3)

### Rendering Function -- Do NOT need to modify 

In [2]:
def render_single(env, policy, max_steps=100):
  """
    This function does not need to be modified
    Renders policy once on environment. Watch your agent play!

    Parameters
    ----------
    env: gym.core.Environment
      Environment to play on. Must have nS, nA, and P as
      attributes.
    Policy: np.array of shape [env.nS]
      The action to take at a given state
  """

  episode_reward = 0
  ob = env.reset()
  for t in range(max_steps):
    env.render()
    time.sleep(0.25)
    a = policy[ob]
    ob, rew, done, _ = env.step(a)
    episode_reward += rew
    if done:
      break
  env.render();
  if not done:
    print("The agent didn't reach a terminal state in {} steps.".format(max_steps))
  else:
  	print("Episode reward: %f" % episode_reward)

### Examine & Initialize Environments

In [3]:
# Inspect deterministic environment
env_d = gym.make("Deterministic-4x4-FrozenLake-v0")
print('probability, nextstate, reward, terminal')
env_d.P[0]

probability, nextstate, reward, terminal


{0: [(1.0, 0, 0.0, False)],
 1: [(1.0, 4, 0.0, False)],
 2: [(1.0, 1, 0.0, False)],
 3: [(1.0, 0, 0.0, False)]}

In [4]:
# Inspect stochastic environment
env_s = gym.make("Stochastic-4x4-FrozenLake-v0")
env_s.P[0]

{0: [(0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 4, 0.0, False)],
 1: [(0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 4, 0.0, False),
  (0.3333333333333333, 1, 0.0, False)],
 2: [(0.3333333333333333, 4, 0.0, False),
  (0.3333333333333333, 1, 0.0, False),
  (0.3333333333333333, 0, 0.0, False)],
 3: [(0.3333333333333333, 1, 0.0, False),
  (0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 0, 0.0, False)]}

###### Why did the deterministic environment have 1 cell per paramter, but the stochastic environment had 3?

### Examine & Initialize A Policy

In [5]:
# Randomly initialize a policy
policy = np.zeros(env_d.nS, dtype=int )
print(type(policy))

<class 'numpy.ndarray'>


In [6]:
# For example, our policy currently yields
for s in range(env_d.nS): 
    print('Policy for state',s,': ',policy[s])

Policy for state 0 :  0
Policy for state 1 :  0
Policy for state 2 :  0
Policy for state 3 :  0
Policy for state 4 :  0
Policy for state 5 :  0
Policy for state 6 :  0
Policy for state 7 :  0
Policy for state 8 :  0
Policy for state 9 :  0
Policy for state 10 :  0
Policy for state 11 :  0
Policy for state 12 :  0
Policy for state 13 :  0
Policy for state 14 :  0
Policy for state 15 :  0


### Policy Evaluation (prediction): 

The process for evaluating a policy is to successively approximate and update the value of a state using the Bellman equation. The old state's value is replaced with a new value. The new value is obtained by using the old value of the successor state, s', and the rewards we expect to get at the next state. Then, the value function is computer by summing the expectations for all of the successeeding next states.  


### Algorithm: Iterative Policy Evaluation for estimating V~v_pi [p. 75]

Input pi, the policy we are evaluating

Set theta > 0, a small threshold that will determine the accuracy of our policy's estimation

Initialize V(s) arbitrarily, the initial state values, for all states except the terminal state set to zero 

Loop:

    delta <- 0
    Loop for each state in S:
        v <- V(s)
        V(s) <- sum over all actions, pi(a|s) 
                * sum over all next states and their corresponding 
                rewards, p(s',r|s,a)[r + gamma * V(s')]
        delta <- max(delta, |v - V(s)|) 
    until delta < theta


### Try Policy Evaluation for a discount value of 1

In [9]:
def policy_evaluation(P, nS, nA, policy, gamma=1, tol=1e-3):
    
    # Initialize value function & delta
    V = np.zeros(nS)
    delta = np.inf
    episode = 0
    
    # Policy eval. will terminate when the value function's change is below the threshold
    while delta >= tol:
        
        print('Episode: ',episode)
        print('value_function: ',V)
        
        # Why do we loop through all the states? 
        for s in range(nS):   
            v = V[s]
            a = policy[s]
            
            for prob, nextstate, reward, done in P[s][a]:   
                V[s] = prob * (reward + gamma * nextstate) 
                
                #print('state:',s)
                #print('action: ',a)
                #print('prob: ',prob)
                #print('nextstate: ',nextstate)
                #print('reward: ',reward)
                #print('done: ',done) 
                #print('value: ',V[s])
                      
                # Compute the change in value functions across states
                delta = max(delta, np.abs(v - V[s]))

        episode+=1
        if episode < 10:
            print('Episode: ',episode)
            print('value_function: ',V)
        if episode % 10 == 0:
            print('10th Episode: ',episode)
            print('value_function: ',V)
        
    # Final value function
    value_function = np.array(V)
        
    return value_function

In [10]:
print("\n" + "-"*25 + "\nBeginning Policy Iteration\n" + "-"*25)
state_values = policy_evaluation(env_d.P, env_d.nS, env_d.nA, policy, gamma=0.9, tol=1e-3)
print('Policy: ',policy)
print('State Value Fcn: ',state_values)


-------------------------
Beginning Policy Iteration
-------------------------
Episode:  0
value_function:  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode:  1
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  4
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  4
value_function:  [ 0.   0.   0.9  1.8  3.6  4

 11.7 13.5]
Episode:  411
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  412
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  413
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  414
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  415
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  416
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  417
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  418
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  419
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2

 11.7 13.5]
Episode:  849
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
10th Episode:  850
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  850
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  851
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  852
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  853
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  854
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  855
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  856
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3

 11.7 13.5]
Episode:  1289
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
10th Episode:  1290
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1290
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1291
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1292
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1293
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1294
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1295
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1296
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5 

 11.7 13.5]
Episode:  1700
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1701
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1702
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1703
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1704
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1705
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1706
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1707
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  1708
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5 

Episode:  2118
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2119
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
10th Episode:  2120
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2120
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2121
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2122
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2123
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2124
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2125
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7

 11.7 13.5]
Episode:  2530
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2531
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2532
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2533
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2534
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2535
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2536
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2537
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2538
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5 

 11.7 13.5]
Episode:  2915
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2916
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2917
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2918
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2919
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
10th Episode:  2920
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2920
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2921
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  2922
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5 

 11.7 13.5]
Episode:  3335
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3336
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3337
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3338
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3339
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
10th Episode:  3340
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3340
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3341
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3342
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5 

 11.7 13.5]
Episode:  3763
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3764
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3765
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3766
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3767
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3768
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3769
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
10th Episode:  3770
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode:  3770
value_function:  [ 0.   0.   0.9  1.8  3.6  4.5 

KeyboardInterrupt: 

In [34]:
render_single(env_d, policy, max_steps=100)


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mF

KeyboardInterrupt: 

In [12]:
## Policy Evaluation 
"""
                    # ------------ Deviation from 4.1 algorithm ------------ #
                    # Loop through set of possible next actions
                    
                    for a, action_prob in enumerate(policy[s]):
                        # For each action, look at its possible next state
                        for prob, next_state, reward, done in P[s][a]:

                            # Calculate the expected value using equation 4.6
                            v += action_prob * prob * (reward + gamma * ) 
                            
                    # ------------ Deviation from 4.1 algorithm ------------ #
                    """

'\n                    # ------------ Deviation from 4.1 algorithm ------------ #\n                    # Loop through set of possible next actions\n                    \n                    for a, action_prob in enumerate(policy[s]):\n                        # For each action, look at its possible next state\n                        for prob, next_state, reward, done in P[s][a]:\n\n                            # Calculate the expected value using equation 4.6\n                            v += action_prob * prob * (reward + gamma * ) \n                            \n                    # ------------ Deviation from 4.1 algorithm ------------ #\n                    '

#### Policy Improvement

#### Policy Iteration for Deterministic Environments