# Policy Evaluation 




### Introduction 

The purpose of this tutorial is to walk through how to implement Value Iteration, a dynamic programming method. We will use FrozenLake, an existing environment from OpenAI Gym. This tutorial is broken up into four parts (see below). We will also walk through the deterministic and stochastic cases and a few discount factors to gain intuition for how the algorithm works.

### 4 Parts:

- Policy Evaluation (prediction)
- Policy Improvement
- Policy Iteration 
- Value Iteration

Reference: Sutton & Barto, 2018, Reinforcement Learning: An Introduction. 

### Environment 

Reference: https://gym.openai.com/envs/FrozenLake-v0/

A 4x4 gridworld with several states: 

- S: starting point, safe
- F: frozen surface, safe
- H: hole, fall to your doom
- G: goal, where the frisbee is located

When you run the environment, the gridworld will be represented as: 

SFFF       
FHFH       
FFFH       
HFFG  

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.
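
To make the layout concrete, here is a minimal sketch that maps a state index to its cell on the map. It assumes states are numbered row by row from the top-left corner, which is consistent with the transition tables inspected later in this notebook. `LAKE_MAP`, `state_to_cell`, and `is_terminal` are hypothetical helpers for illustration, not part of gym.

```python
# The map as printed by the environment, one string per row.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(LAKE_MAP[0])


def state_to_cell(state):
    """Return (row, col, letter) for a state index numbered row by row (assumed)."""
    row, col = divmod(state, N_COLS)
    return row, col, LAKE_MAP[row][col]


def is_terminal(state):
    """Holes (H) and the goal (G) end the episode."""
    return state_to_cell(state)[2] in "HG"


# Collect every state index whose cell is a hole or the goal.
terminal_states = [s for s in range(16) if is_terminal(s)]
print(terminal_states)
```

Keeping the list of terminal states in mind is useful later, when we ask whether a policy ever ends an episode.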

In [1]:
import numpy as np
import gym
import time
from lake_envs import *

np.set_printoptions(precision=3)

### Rendering Function -- Do NOT need to modify 

In [2]:
def render_single(env, policy, max_steps=100):
  """
    This function does not need to be modified
    Renders policy once on environment. Watch your agent play!

    Parameters
    ----------
    env: gym.core.Environment
      Environment to play on. Must have nS, nA, and P as
      attributes.
    policy: np.array of shape [env.nS]
      The action to take at a given state
  """

  episode_reward = 0
  ob = env.reset()
  for t in range(max_steps):
    env.render()
    time.sleep(0.25)
    a = policy[ob]
    ob, rew, done, _ = env.step(a)
    episode_reward += rew
    if done:
      break
  env.render()
  if not done:
    print("The agent didn't reach a terminal state in {} steps.".format(max_steps))
  else:
    print('# Max Steps: ',max_steps)
    print("Episode reward: %f" % episode_reward)

### Examine & Initialize Environments

In [3]:
# Inspect the deterministic environment
env_d = gym.make("Deterministic-4x4-FrozenLake-v0")
print('probability, nextstate, reward, terminal')
env_d.P[0]

probability, nextstate, reward, terminal


{0: [(1.0, 0, 0.0, False)],
 1: [(1.0, 4, 0.0, False)],
 2: [(1.0, 1, 0.0, False)],
 3: [(1.0, 0, 0.0, False)]}

In [4]:
# Inspect the stochastic environment
env_s = gym.make("Stochastic-4x4-FrozenLake-v0")
print('probability, nextstate, reward, terminal')
env_s.P[0]

probability, nextstate, reward, terminal


{0: [(0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 4, 0.0, False)],
 1: [(0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 4, 0.0, False),
  (0.3333333333333333, 1, 0.0, False)],
 2: [(0.3333333333333333, 4, 0.0, False),
  (0.3333333333333333, 1, 0.0, False),
  (0.3333333333333333, 0, 0.0, False)],
 3: [(0.3333333333333333, 1, 0.0, False),
  (0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 0, 0.0, False)]}
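
To make the structure of `P` concrete, here is a small sketch that summarizes one transition list. The two lists are copied from the outputs above for state 0 and action 1; `summarize` is a hypothetical helper for illustration, not part of gym.

```python
# Transition lists copied from the outputs above for state 0, action 1.
# Each tuple is (probability, nextstate, reward, terminal).
deterministic = [(1.0, 4, 0.0, False)]
stochastic = [
    (0.3333333333333333, 0, 0.0, False),
    (0.3333333333333333, 4, 0.0, False),
    (0.3333333333333333, 1, 0.0, False),
]


def summarize(transitions):
    """Return (total probability, sorted successor states, expected reward)."""
    total_prob = sum(prob for prob, _, _, _ in transitions)
    successors = sorted({nextstate for _, nextstate, _, _ in transitions})
    expected_reward = sum(prob * reward for prob, _, reward, _ in transitions)
    return total_prob, successors, expected_reward


for name, transitions in [("deterministic", deterministic), ("stochastic", stochastic)]:
    print(name, summarize(transitions))
```

In both cases the probabilities for a single (state, action) pair should add up to one; what differs is how many successor states share that probability.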

#### Check for Understanding

Notice that in the deterministic environment each action maps to a single transition tuple, but in the stochastic environment each action maps to three. Why is that?

### Policy Evaluation (prediction): 

Policy evaluation successively approximates the value of each state using the Bellman equation. On every sweep, a state's old value is replaced with a new value. The new value is built from the reward we expect on the next transition and the old value of each successor state, s'. Each of these terms is weighted by the probability of the transition, and the value of the state is computed by summing over all of the possible successor states.


### Algorithm: Iterative Policy Evaluation for estimating V ≈ v_pi [p. 75]

Input pi, the policy we are evaluating

Set theta > 0, a small threshold that will determine the accuracy of our policy's estimation

Initialize V(s), the state values, arbitrarily for all states, except that the value of the terminal state must be set to zero

Loop:

    delta <- 0
    Loop for each state in S:
        v <- V(s)
        V(s) <- sum over all actions, pi(a|s) 
                * sum over all next states and their corresponding 
                rewards, p(s',r|s,a)[r + gamma * V(s')]
        delta <- max(delta, |v - V(s)|) 
    until delta < theta
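
As a worked instance of the update line in the loop above, here is a minimal sketch of a single backup for one state under a deterministic policy, so pi(a|s) is 1 for the chosen action and the outer sum over actions drops out. The value estimates and the transition list are hypothetical numbers chosen only for illustration.

```python
import numpy as np


def backup(transitions, V, gamma):
    """One Bellman expectation backup for a single (state, action) pair."""
    return sum(prob * (reward + gamma * V[nextstate])
               for prob, nextstate, reward, done in transitions)


# Hypothetical current estimates for three states; state 2 is terminal, so V[2] = 0.
V = np.array([0.0, 0.5, 0.0])

# Hypothetical transitions: half the time we land in state 1 with no reward,
# otherwise we reach terminal state 2 and collect a reward of 1.
transitions = [(0.5, 1, 0.0, False),
               (0.5, 2, 1.0, True)]

new_value = backup(transitions, V, gamma=0.9)
print(new_value)  # 0.5 * (0 + 0.9 * 0.5) + 0.5 * (1 + 0.9 * 0) = 0.725
```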


In [13]:
def policy_evaluation(P, nS, nA, policy, gamma=1, tol=1e-3):
    
    # Initialize value function & delta
    V = np.zeros(nS)
    print('Initial Value Function',V)
    delta = np.inf
    episode = 0
    
    # Policy eval. will terminate when the value function's change is below the threshold
    while episode < 10:
    #while delta >= tol:
        
        print('Episode:',episode, ' &  Value Function',V)
        
        # Why do we loop through all the states? 
        for s in range(nS):   
            v = V[s]
            a = policy[s]
            
            for prob, nextstate, reward, done in P[s][a]:   
                V[s] = prob * (reward + gamma * nextstate) 
                
                #print('state:',s)
                #print('action: ',a)
                #print('prob: ',prob)
                #print('nextstate: ',nextstate)
                #print('reward: ',reward)
                #print('done: ',done) 
                #print('value: ',V[s])
                      
                # Compute the change in value functions across states
                delta = max(delta, np.abs(v - V[s]))

        episode+=1
        #if episode < 10:
            #print('Episode: ',episode)
            #print('Working Value Fcn: ',V)
        
        """
        if episode % 10 == 0:
            print('Episode: ',episode)
            print('value_function: ',V)
        """
        
    # Final value function
    print('Final # of Episodes: ',episode)
    value_function = np.array(V)
        
    return value_function
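
For comparison with the pseudocode, here is a minimal sketch of an evaluation loop that accumulates the sum over successor states, bootstraps from `V[nextstate]`, and stops once a full sweep changes no value by more than `tol`. Note that the cell above multiplies `gamma` by the successor's index `nextstate` and overwrites `V[s]` on every transition, so the outputs recorded below come from that cell, not from this sketch. The name `policy_evaluation_sketch`, the `max_sweeps` safeguard, and the three-state chain used to exercise it are assumptions made for illustration.

```python
import numpy as np


def policy_evaluation_sketch(P, nS, policy, gamma=0.9, tol=1e-3, max_sweeps=1000):
    """Iterative policy evaluation for a deterministic policy (in-place sweeps)."""
    V = np.zeros(nS)
    for sweep in range(max_sweeps):
        delta = 0.0  # reset at the start of every sweep
        for s in range(nS):
            v_old = V[s]
            a = policy[s]
            # Expected one-step return, bootstrapping from the current estimates.
            V[s] = sum(prob * (reward + gamma * V[nextstate])
                       for prob, nextstate, reward, done in P[s][a])
            delta = max(delta, abs(v_old - V[s]))
        if delta < tol:
            break
    return V


# Hypothetical chain: 0 -> 1 -> 2, reward 1 on entering the absorbing state 2.
P_toy = {
    0: {0: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)]},
}
V_toy = policy_evaluation_sketch(P_toy, nS=3, policy=np.zeros(3, dtype=int), gamma=0.9)
print(V_toy)
```

The sketch relies on terminal states looping back to themselves with zero reward, so their value stays at zero without special handling.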

## Evaluate Deterministic Policies

### Examine Deterministic Policies for a Discount Value of 1

In [14]:
# Initialize a Deterministic Zeros Policy
policy = np.zeros(env_d.nS, dtype=int)
print('Policy: ',policy)

Policy:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [15]:
# Evaluate a Deterministic Zeros Policy 
print("\n" + "-"*31 + "\nBeginning Zero Policy Iteration\n" + "-"*31)
state_values = policy_evaluation(env_d.P, env_d.nS, env_d.nA, policy, gamma=1, tol=1e-3)


-------------------------------
Beginning Zero Policy Iteration
-------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 0.  0.  1.  2.  4.  5.  5.  7.  8.  8.  9. 11. 12. 12. 13. 15.]
Episode: 2  &  Value Function [ 0.  0.  1.  2.  4.  5.  5.  7.  8.  8.  9. 11. 12. 12. 13. 15.]
Episode: 3  &  Value Function [ 0.  0.  1.  2.  4.  5.  5.  7.  8.  8.  9. 11. 12. 12. 13. 15.]
Episode: 4  &  Value Function [ 0.  0.  1.  2.  4.  5.  5.  7.  8.  8.  9. 11. 12. 12. 13. 15.]
Episode: 5  &  Value Function [ 0.  0.  1.  2.  4.  5.  5.  7.  8.  8.  9. 11. 12. 12. 13. 15.]
Episode: 6  &  Value Function [ 0.  0.  1.  2.  4.  5.  5.  7.  8.  8.  9. 11. 12. 12. 13. 15.]
Episode: 7  &  Value Function [ 0.  0.  1.  2.  4.  5.  5.  7.  8.  8.  9. 11. 12. 12. 13. 15.]
Episode: 8  &  Value Function [ 0.  0.  1.  2.  4.  5.  5.  7.  8.  8.  9. 11.

In [16]:
# Examine a Deterministic Zeros Policy & Values
print('Policy: ',policy)
print('Values: ',state_values)

Policy:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Values:  [ 0.  0.  1.  2.  4.  5.  5.  7.  8.  8.  9. 11. 12. 12. 13. 15.]


#### Check for Understanding -- Zero Policy

Does this value function make sense to you? Why or why not?

In [17]:
# Inspect Behavior for a Deterministic Zeros Policy
render_single(env_d, policy, max_steps=3)


[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
The agent didn't reach a terminal state in 3 steps.


#### Check for Understanding -- Zero Policy

Our agent doesn't reach a terminal state in 3 steps. It turns out that if we instead tried for 100 steps, our agent still wouldn't reach a terminal state. In an environment of only 16 states and 4 actions (64 state-action pairs), it seems like we should reach a terminal state, so why don't we? (Hint: Observe the behavior of the highlighted state.)

If the values of our value function didn't make sense to you before, do they now? What new realization did you have?

In [20]:
# Initialize a Ones Policy
policy = np.ones(env_d.nS, dtype=int)

# Evaluate a Ones Policy 
print("\n" + "-"*31 + "\nBeginning Ones Policy Iteration\n" + "-"*31)
state_values = policy_evaluation(env_d.P, env_d.nS, env_d.nA, policy, gamma=1, tol=1e-3)

# Examine a Ones Policy & Values
print('\nPolicy: ',policy)
print('Final Values: ',state_values)

# Inspect the Behavior of a Ones Policy
render_single(env_d, policy, max_steps=3)


-------------------------------
Beginning Ones Policy Iteration
-------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 4.  5.  6.  7.  8.  5. 10.  7. 12. 13. 14. 11. 12. 13. 14. 15.]
Episode: 2  &  Value Function [ 4.  5.  6.  7.  8.  5. 10.  7. 12. 13. 14. 11. 12. 13. 14. 15.]
Episode: 3  &  Value Function [ 4.  5.  6.  7.  8.  5. 10.  7. 12. 13. 14. 11. 12. 13. 14. 15.]
Episode: 4  &  Value Function [ 4.  5.  6.  7.  8.  5. 10.  7. 12. 13. 14. 11. 12. 13. 14. 15.]
Episode: 5  &  Value Function [ 4.  5.  6.  7.  8.  5. 10.  7. 12. 13. 14. 11. 12. 13. 14. 15.]
Episode: 6  &  Value Function [ 4.  5.  6.  7.  8.  5. 10.  7. 12. 13. 14. 11. 12. 13. 14. 15.]
Episode: 7  &  Value Function [ 4.  5.  6.  7.  8.  5. 10.  7. 12. 13. 14. 11. 12. 13. 14. 15.]
Episode: 8  &  Value Function [ 4.  5.  6.  7.  8.  5. 10.  7. 12. 13. 14. 11.

#### Check for Understanding -- Ones Policy

This policy does reach a terminal state, after only 3 steps. Additionally, the agent behaves differently under a policy of all ones. What is that behavior? What does this mean? (Hint: If you're unsure, test policies of all twos, threes, fours, and so on until you've figured it out.)
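
The hint can also be explored without rendering. The sketch below follows a constant policy from the start state on a hand-built deterministic model of the map. It assumes the action numbering implied by the outputs above (0 Left, 1 Down, 2 Right, 3 Up) and that bumping into a wall leaves the agent in place; `step` and `trace` are hypothetical helpers, not gym functions.

```python
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

# Assumed action numbering: 0 Left, 1 Down, 2 Right, 3 Up, as (row change, col change).
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(state, action):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    return row * 4 + col


def trace(action, max_steps=10):
    """Follow a constant policy from state 0 and record the visited states."""
    state, path = 0, [0]
    for _ in range(max_steps):
        state = step(state, action)
        path.append(state)
        row, col = divmod(state, 4)
        if LAKE_MAP[row][col] in "HG":  # hole or goal ends the episode
            break
    return path


for action in range(4):
    print("action", action, "visits", trace(action, max_steps=5))
```

Comparing the visited states across the four actions shows which constant policies get stuck and which ones end the episode.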

In [21]:
# Initialize a Twos Policy
policy = np.array([2]*env_d.nS, dtype=int)

# Evaluate a Twos Policy 
print("\n" + "-"*31 + "\nBeginning Twos Policy Iteration\n" + "-"*31)
state_values = policy_evaluation(env_d.P, env_d.nS, env_d.nA, policy, gamma=1, tol=1e-3)

# Examine a Twos Policy & Values
print('\nPolicy: ',policy)
print('Final Values: ',state_values)

# Inspect the Behavior of a Twos Policy
render_single(env_d, policy, max_steps=3)


-------------------------------
Beginning Twos Policy Iteration
-------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 1.  2.  3.  3.  5.  5.  7.  7.  9. 10. 11. 11. 12. 14. 16. 15.]
Episode: 2  &  Value Function [ 1.  2.  3.  3.  5.  5.  7.  7.  9. 10. 11. 11. 12. 14. 16. 15.]
Episode: 3  &  Value Function [ 1.  2.  3.  3.  5.  5.  7.  7.  9. 10. 11. 11. 12. 14. 16. 15.]
Episode: 4  &  Value Function [ 1.  2.  3.  3.  5.  5.  7.  7.  9. 10. 11. 11. 12. 14. 16. 15.]
Episode: 5  &  Value Function [ 1.  2.  3.  3.  5.  5.  7.  7.  9. 10. 11. 11. 12. 14. 16. 15.]
Episode: 6  &  Value Function [ 1.  2.  3.  3.  5.  5.  7.  7.  9. 10. 11. 11. 12. 14. 16. 15.]
Episode: 7  &  Value Function [ 1.  2.  3.  3.  5.  5.  7.  7.  9. 10. 11. 11. 12. 14. 16. 15.]
Episode: 8  &  Value Function [ 1.  2.  3.  3.  5.  5.  7.  7.  9. 10. 11. 11.

#### Check for Understanding -- Twos Policy

Describe the agent's behavior. Does it make sense that the agent didn't reach a terminal state? Why? Do the values of our value function make sense? 

In [23]:
# Initialize a Threes Policy
policy = np.array([3]*env_d.nS, dtype=int)

# Evaluate a Threes Policy 
print("\n" + "-"*33 + "\nBeginning Threes Policy Iteration\n" + "-"*33)
state_values = policy_evaluation(env_d.P, env_d.nS, env_d.nA, policy, gamma=1, tol=1e-3)

# Examine a Threes Policy & Values
print('\nPolicy: ',policy)
print('Final Values: ',state_values)

# Inspect the Behavior of a Threes Policy
render_single(env_d, policy, max_steps=3)


---------------------------------
Beginning Threes Policy Iteration
---------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 0.  1.  2.  3.  0.  5.  2.  7.  4.  5.  6. 11. 12.  9. 10. 15.]
Episode: 2  &  Value Function [ 0.  1.  2.  3.  0.  5.  2.  7.  4.  5.  6. 11. 12.  9. 10. 15.]
Episode: 3  &  Value Function [ 0.  1.  2.  3.  0.  5.  2.  7.  4.  5.  6. 11. 12.  9. 10. 15.]
Episode: 4  &  Value Function [ 0.  1.  2.  3.  0.  5.  2.  7.  4.  5.  6. 11. 12.  9. 10. 15.]
Episode: 5  &  Value Function [ 0.  1.  2.  3.  0.  5.  2.  7.  4.  5.  6. 11. 12.  9. 10. 15.]
Episode: 6  &  Value Function [ 0.  1.  2.  3.  0.  5.  2.  7.  4.  5.  6. 11. 12.  9. 10. 15.]
Episode: 7  &  Value Function [ 0.  1.  2.  3.  0.  5.  2.  7.  4.  5.  6. 11. 12.  9. 10. 15.]
Episode: 8  &  Value Function [ 0.  1.  2.  3.  0.  5.  2.  7.  4.  5.  

In [24]:
# Initialize a Fours Policy
policy = np.array([4]*env_d.nS, dtype=int)

# Evaluate a Fours Policy 
print("\n" + "-"*33 + "\nBeginning Fours Policy Iteration\n" + "-"*33)
state_values = policy_evaluation(env_d.P, env_d.nS, env_d.nA, policy, gamma=1, tol=1e-3)


---------------------------------
Beginning Fours Policy Iteration
---------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


KeyError: 4

#### Check for Understanding

Why do we get an error when we try to implement a Fours policy? 
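
One way to surface this kind of problem before evaluation starts is to check the policy against the number of available actions. The sketch below is a hypothetical helper, `check_policy`, and assumes actions are indexed from 0 to nA - 1 as in the `P` tables above.

```python
import numpy as np


def check_policy(policy, nA):
    """Raise a descriptive error if any action index is outside 0..nA-1."""
    policy = np.asarray(policy)
    bad_states = np.flatnonzero((policy < 0) | (policy >= nA))
    if bad_states.size:
        raise ValueError(
            "policy has actions outside 0..{} at states {}".format(
                nA - 1, bad_states.tolist()))
    return policy


# A Threes policy passes the check when there are four actions.
checked = check_policy(np.array([3] * 16, dtype=int), nA=4)
print(checked)
```

Calling `check_policy` at the top of an evaluation function would turn the bare `KeyError` into a message that names the offending states.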

### Examine Deterministic Policies for a Discount Value of 0.9
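
Before rerunning the policies, it may help to see what the discount factor does to a delayed reward. The sketch below computes a discounted return for a hypothetical episode in which the only reward arrives on the sixth step; `discounted_return` and the reward sequence are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a reward sequence, with k starting at 0."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical episode: five steps without reward, then a reward of 1.
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

for gamma in (1.0, 0.9):
    print("gamma", gamma, "return", discounted_return(rewards, gamma))
```

With a discount of 1 the delay does not matter, while with 0.9 the same reward is scaled by 0.9 for every step that precedes it.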

In [26]:
# Initialize a Deterministic Zeros Policy
policy = np.zeros(env_d.nS, dtype=int)

# Evaluate a Deterministic Zeros Policy 
print("\n" + "-"*31 + "\nBeginning Zero Policy Iteration\n" + "-"*31)
state_values = policy_evaluation(env_d.P, env_d.nS, env_d.nA, policy, gamma=0.9, tol=1e-3)

# Examine a Deterministic Zeros Policy & Values
print('\nPolicy: ',policy)
print('Final Values: ',state_values)

# Inspect Behavior for a Deterministic Zeros Policy
render_single(env_d, policy, max_steps=3)


-------------------------------
Beginning Zero Policy Iteration
-------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode: 2  &  Value Function [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode: 3  &  Value Function [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode: 4  &  Value Function [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode: 5  &  Value Function [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode: 6  &  Value Function [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  7.2  7.2  8.1  9.9 10.8 10.8
 11.7 13.5]
Episode: 7  &  Value Function [ 0.   0.   0.9  1.8  3.6  4.5  4.5  6.3  

In [27]:
# Initialize a Deterministic Ones Policy
policy = np.ones(env_d.nS, dtype=int)

# Evaluate a Deterministic Ones Policy 
print("\n" + "-"*31 + "\nBeginning Ones Policy Iteration\n" + "-"*31)
state_values = policy_evaluation(env_d.P, env_d.nS, env_d.nA, policy, gamma=0.9, tol=1e-3)

# Examine a Deterministic Ones Policy & Values
print('\nPolicy: ',policy)
print('Values: ',state_values)

# Inspect Behavior for a Deterministic Ones Policy
render_single(env_d, policy, max_steps=3)


-------------------------------
Beginning Ones Policy Iteration
-------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 3.6  4.5  5.4  6.3  7.2  4.5  9.   6.3 10.8 11.7 12.6  9.9 10.8 11.7
 12.6 13.5]
Episode: 2  &  Value Function [ 3.6  4.5  5.4  6.3  7.2  4.5  9.   6.3 10.8 11.7 12.6  9.9 10.8 11.7
 12.6 13.5]
Episode: 3  &  Value Function [ 3.6  4.5  5.4  6.3  7.2  4.5  9.   6.3 10.8 11.7 12.6  9.9 10.8 11.7
 12.6 13.5]
Episode: 4  &  Value Function [ 3.6  4.5  5.4  6.3  7.2  4.5  9.   6.3 10.8 11.7 12.6  9.9 10.8 11.7
 12.6 13.5]
Episode: 5  &  Value Function [ 3.6  4.5  5.4  6.3  7.2  4.5  9.   6.3 10.8 11.7 12.6  9.9 10.8 11.7
 12.6 13.5]
Episode: 6  &  Value Function [ 3.6  4.5  5.4  6.3  7.2  4.5  9.   6.3 10.8 11.7 12.6  9.9 10.8 11.7
 12.6 13.5]
Episode: 7  &  Value Function [ 3.6  4.5  5.4  6.3  7.2  4.5  9.   6.3 1

In [28]:
# Initialize a Deterministic Twos Policy
policy = np.array([2]*env_d.nS, dtype=int)

# Evaluate a Deterministic Twos Policy 
print("\n" + "-"*31 + "\nBeginning Twos Policy Iteration\n" + "-"*31)
state_values = policy_evaluation(env_d.P, env_d.nS, env_d.nA, policy, gamma=0.9, tol=1e-3)

# Examine a Deterministic Twos Policy & Values
print('\nPolicy: ',policy)
print('Values: ',state_values)

# Inspect Behavior for a Deterministic Twos Policy
render_single(env_d, policy, max_steps=10)


-------------------------------
Beginning Twos Policy Iteration
-------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 0.9  1.8  2.7  2.7  4.5  4.5  6.3  6.3  8.1  9.   9.9  9.9 10.8 12.6
 14.5 13.5]
Episode: 2  &  Value Function [ 0.9  1.8  2.7  2.7  4.5  4.5  6.3  6.3  8.1  9.   9.9  9.9 10.8 12.6
 14.5 13.5]
Episode: 3  &  Value Function [ 0.9  1.8  2.7  2.7  4.5  4.5  6.3  6.3  8.1  9.   9.9  9.9 10.8 12.6
 14.5 13.5]
Episode: 4  &  Value Function [ 0.9  1.8  2.7  2.7  4.5  4.5  6.3  6.3  8.1  9.   9.9  9.9 10.8 12.6
 14.5 13.5]
Episode: 5  &  Value Function [ 0.9  1.8  2.7  2.7  4.5  4.5  6.3  6.3  8.1  9.   9.9  9.9 10.8 12.6
 14.5 13.5]
Episode: 6  &  Value Function [ 0.9  1.8  2.7  2.7  4.5  4.5  6.3  6.3  8.1  9.   9.9  9.9 10.8 12.6
 14.5 13.5]
Episode: 7  &  Value Function [ 0.9  1.8  2.7  2.7  4.5  4.5  6.3  6.3  

#### Check for Understanding

Why don't we reach a terminal state in this case?

In [33]:
# Initialize a Deterministic Threes Policy
policy = np.array([3]*env_d.nS, dtype=int)

# Evaluate a Deterministic Threes Policy 
print("\n" + "-"*33 + "\nBeginning Threes Policy Iteration\n" + "-"*33)
state_values = policy_evaluation(env_d.P, env_d.nS, env_d.nA, policy, gamma=0.9, tol=1e-3)

# Examine a Threes Policy & Values
print('\nPolicy: ',policy)
print('Final Values: ',state_values)

# Inspect Behavior for a Threes Policy
render_single(env_d, policy, max_steps=3)


---------------------------------
Beginning Threes Policy Iteration
---------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 0.   0.9  1.8  2.7  0.   4.5  1.8  6.3  3.6  4.5  5.4  9.9 10.8  8.1
  9.  13.5]
Episode: 2  &  Value Function [ 0.   0.9  1.8  2.7  0.   4.5  1.8  6.3  3.6  4.5  5.4  9.9 10.8  8.1
  9.  13.5]
Episode: 3  &  Value Function [ 0.   0.9  1.8  2.7  0.   4.5  1.8  6.3  3.6  4.5  5.4  9.9 10.8  8.1
  9.  13.5]
Episode: 4  &  Value Function [ 0.   0.9  1.8  2.7  0.   4.5  1.8  6.3  3.6  4.5  5.4  9.9 10.8  8.1
  9.  13.5]
Episode: 5  &  Value Function [ 0.   0.9  1.8  2.7  0.   4.5  1.8  6.3  3.6  4.5  5.4  9.9 10.8  8.1
  9.  13.5]
Episode: 6  &  Value Function [ 0.   0.9  1.8  2.7  0.   4.5  1.8  6.3  3.6  4.5  5.4  9.9 10.8  8.1
  9.  13.5]
Episode: 7  &  Value Function [ 0.   0.9  1.8  2.7  0.   4.5  1.8  6.3  

## Evaluate Stochastic Policies

### Examine Stochastic Policies for Discount Value of 1

In [34]:
# Recall we initialized a stochastic environment
env_s = gym.make("Stochastic-4x4-FrozenLake-v0")

In [35]:
# Initialize a Stochastic Zeros Policy
policy = np.zeros(env_s.nS, dtype=int)

# Evaluate a Stochastic Zeros Policy 
print("\n" + "-"*31 + "\nBeginning Zeros Policy Iteration\n" + "-"*31)
state_values = policy_evaluation(env_s.P, env_s.nS, env_s.nA, policy, gamma=1, tol=1e-3)

# Examine a Stochastic Zeros Policy & Values
print('\nPolicy: ',policy)
print('Values: ',state_values)

# Inspect Behavior for a Stochastic Zeros Policy
render_single(env_s, policy, max_steps=5)


-------------------------------
Beginning Zeros Policy Iteration
-------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 1.333  1.667  2.     2.333  2.667  5.     3.333  7.     4.     4.333
  4.667 11.    12.     4.333  4.667 15.   ]
Episode: 2  &  Value Function [ 1.333  1.667  2.     2.333  2.667  5.     3.333  7.     4.     4.333
  4.667 11.    12.     4.333  4.667 15.   ]
Episode: 3  &  Value Function [ 1.333  1.667  2.     2.333  2.667  5.     3.333  7.     4.     4.333
  4.667 11.    12.     4.333  4.667 15.   ]
Episode: 4  &  Value Function [ 1.333  1.667  2.     2.333  2.667  5.     3.333  7.     4.     4.333
  4.667 11.    12.     4.333  4.667 15.   ]
Episode: 5  &  Value Function [ 1.333  1.667  2.     2.333  2.667  5.     3.333  7.     4.     4.333
  4.667 11.    12.     4.333  4.667 15.   ]
Episode: 6  &  Value Fun

In [37]:
# Initialize a Stochastic Ones Policy
policy = np.ones(env_s.nS, dtype=int)

# Evaluate a Stochastic Ones Policy 
print("\n" + "-"*31 + "\nBeginning Ones Policy Iteration\n" + "-"*31)
state_values = policy_evaluation(env_s.P, env_s.nS, env_s.nA, policy, gamma=1, tol=1e-3)

# Examine a Stochastic Ones Policy & Values
print('\nPolicy: ',policy)
print('Values: ',state_values)

# Inspect Behavior for a Stochastic Ones Policy
render_single(env_s, policy, max_steps=3)


-------------------------------
Beginning Ones Policy Iteration
-------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 0.333  0.667  1.     1.     1.667  5.     2.333  7.     3.     3.333
  3.667 11.    12.     4.667  5.333 15.   ]
Episode: 2  &  Value Function [ 0.333  0.667  1.     1.     1.667  5.     2.333  7.     3.     3.333
  3.667 11.    12.     4.667  5.333 15.   ]
Episode: 3  &  Value Function [ 0.333  0.667  1.     1.     1.667  5.     2.333  7.     3.     3.333
  3.667 11.    12.     4.667  5.333 15.   ]
Episode: 4  &  Value Function [ 0.333  0.667  1.     1.     1.667  5.     2.333  7.     3.     3.333
  3.667 11.    12.     4.667  5.333 15.   ]
Episode: 5  &  Value Function [ 0.333  0.667  1.     1.     1.667  5.     2.333  7.     3.     3.333
  3.667 11.    12.     4.667  5.333 15.   ]
Episode: 6  &  Value Func

In [38]:
# Initialize a Stochastic Twos Policy
policy = np.array([2]*env_s.nS, dtype=int)

# Evaluate a Stochastic Twos Policy 
print("\n" + "-"*31 + "\nBeginning Twos Policy Iteration\n" + "-"*31)
state_values = policy_evaluation(env_s.P, env_s.nS, env_s.nA, policy, gamma=1, tol=1e-3)

# Examine a Stochastic Twos Policy & Values
print('\nPolicy: ',policy)
print('Values: ',state_values)

# Inspect Behavior for a Stochastic Twos Policy
render_single(env_s, policy, max_steps=3)


-------------------------------
Beginning Twos Policy Iteration
-------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 0.     0.333  0.667  1.     0.     5.     0.667  7.     1.333  1.667
  2.    11.    12.     3.     3.333 15.   ]
Episode: 2  &  Value Function [ 0.     0.333  0.667  1.     0.     5.     0.667  7.     1.333  1.667
  2.    11.    12.     3.     3.333 15.   ]
Episode: 3  &  Value Function [ 0.     0.333  0.667  1.     0.     5.     0.667  7.     1.333  1.667
  2.    11.    12.     3.     3.333 15.   ]
Episode: 4  &  Value Function [ 0.     0.333  0.667  1.     0.     5.     0.667  7.     1.333  1.667
  2.    11.    12.     3.     3.333 15.   ]
Episode: 5  &  Value Function [ 0.     0.333  0.667  1.     0.     5.     0.667  7.     1.333  1.667
  2.    11.    12.     3.     3.333 15.   ]
Episode: 6  &  Value Func

In [39]:
# Initialize a Stochastic Threes Policy
policy = np.array([3]*env_s.nS, dtype=int)

# Evaluate a Stochastic Threes Policy 
print("\n" + "-"*31 + "\nBeginning Threes Policy Iteration\n" + "-"*31)
state_values = policy_evaluation(env_s.P, env_s.nS, env_s.nA, policy, gamma=1, tol=1e-3)

# Examine a Stochastic Threes Policy & Values
print('\nPolicy: ',policy)
print('Values: ',state_values)

# Inspect Behavior for a Stochastic Threes Policy
render_single(env_s, policy, max_steps=3)


-------------------------------
Beginning Threes Policy Iteration
-------------------------------
Initial Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 0  &  Value Function [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Episode: 1  &  Value Function [ 0.     0.     0.333  0.667  1.333  5.     1.667  7.     2.667  2.667
  3.    11.    12.     4.     4.333 15.   ]
Episode: 2  &  Value Function [ 0.     0.     0.333  0.667  1.333  5.     1.667  7.     2.667  2.667
  3.    11.    12.     4.     4.333 15.   ]
Episode: 3  &  Value Function [ 0.     0.     0.333  0.667  1.333  5.     1.667  7.     2.667  2.667
  3.    11.    12.     4.     4.333 15.   ]
Episode: 4  &  Value Function [ 0.     0.     0.333  0.667  1.333  5.     1.667  7.     2.667  2.667
  3.    11.    12.     4.     4.333 15.   ]
Episode: 5  &  Value Function [ 0.     0.     0.333  0.667  1.333  5.     1.667  7.     2.667  2.667
  3.    11.    12.     4.     4.333 15.   ]
Episode: 6  &  Value Fu

# Policy Iteration 

Policy Iteration (using iterative policy evaluation) for estimating pi ≈ pi*

1. Initialization
V(s) is an element of the real numbers for all states
pi(s) is an element of A(s) for all states 

2. Policy Evaluation (implemented above)
Loop: 
    delta <- 0
    Loop for each s in the set of States:
        v <- V(s)
        V(s) <- sum over all next states & rewards p(s',r|s,pi(s)) * [r + gamma * V(s')]
        delta <- max(delta, |v-V(s)|)
    until delta < theta 
    
3. Policy Improvement (still need to implement)
policy-stable <- true
For each s in the set of States:
    old-action <- pi(s)
    pi(s) <- argmax over actions of the sum over all next states and rewards, p(s',r|s,a)[r + gamma * V(s')]
    If old-action =/= pi(s), then policy-stable <- false
If policy-stable, then stop and return V ≈ v* and pi ≈ pi*; else go to 2

In [26]:
## Policy Evaluation 
"""
                    # ------------ Deviation from 4.1 algorithm ------------ #
                    # Loop through set of possible next actions
                    
                    for a, action_prob in enumerate(policy[s]):
                        # For each action, look at its possible next state
                        for prob, next_state, reward, done in P[s][a]:

                            # Calculate the expected value using equation 4.6
                            v += action_prob * prob * (reward + gamma * V[next_state])
                            
                    # ------------ Deviation from 4.1 algorithm ------------ #
                    """


#### Policy Improvement
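
The improvement step from the pseudocode above can be sketched as a greedy one-step lookahead: for every state, compute the expected return of each action under the current value estimates and keep the best one. This is a minimal sketch under stated assumptions, not a finished implementation; `policy_improvement_sketch`, the toy transition table, and the value estimates are hypothetical.

```python
import numpy as np


def policy_improvement_sketch(P, nS, nA, V, gamma=0.9):
    """Return the policy that is greedy with respect to the value estimates V."""
    new_policy = np.zeros(nS, dtype=int)
    for s in range(nS):
        q = np.zeros(nA)
        for a in range(nA):
            # Expected return of taking action a in state s, then following V.
            q[a] = sum(prob * (reward + gamma * V[nextstate])
                       for prob, nextstate, reward, done in P[s][a])
        new_policy[s] = int(np.argmax(q))  # ties go to the lowest action index
    return new_policy


# Hypothetical 3-state problem with 2 actions; state 2 is absorbing.
P_toy = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
V_toy = np.array([0.9, 1.0, 0.0])  # hypothetical value estimates
greedy = policy_improvement_sketch(P_toy, nS=3, nA=2, V=V_toy, gamma=0.9)
print(greedy)
```

Policy iteration would alternate this step with policy evaluation until the greedy policy stops changing.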

#### Policy Iteration for Deterministic Environments