
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Dynamic Programming: Policy Evaluation Lab

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you learn:<br>
 - Policy Evaluation
  
## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) References
* Sutton and Barto, *Reinforcement Learning: An Introduction*, Chapter 4 (Dynamic Programming)

### Policy Evaluation ###
<br>
In this lab, we are going to evaluate a policy. Keep in mind that we assume the MDP is given and that we know the dynamics of the environment, i.e. we are going to solve a **planning problem**. Consider the grid below. We created an environment for this problem earlier; we are going to reuse that environment and create an agent. Use the following steps to evaluate a **random policy**.

0. Initialize all states with value 0. Set \\(\gamma = 0.9 \\). What does it mean to pick \\(\gamma = 0.9 \\)?
0. Assume a random policy, i.e. a 25% chance of moving right, left, up, or down no matter where you are. The state does not change if the agent tries to move off the grid. However, the reward is still given.
0. Update the state values with the policy evaluation update (the Bellman expectation backup) for three iterations, i.e. k = 0, 1, 2, 3. What happens to the terminal states' values? Do they change?
0. Observe how the policy changes through the iterations.
0. Implement the algorithm above in Python.
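
The update in step 3 can be read as the iterative policy evaluation backup. As a sketch, assuming the standard form from dynamic programming, where \\(\pi(a \mid s)\\) is the policy and \\(p(s', r \mid s, a)\\) is the environment dynamics:

\\[ v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_k(s') \right] \\]

Under the random policy of step 2, \\(\pi(a \mid s) = 0.25\\) for every action. If each action leads to exactly one next state \\(s'\\) with reward \\(r\\), as the solution code in this lab assumes by setting the transition probability to 1, the update reduces to

\\[ v_{k+1}(s) = 0.25 \sum_{a} \left[ r + \gamma \, v_k(s') \right] \\]

with \\(v_0(s) = 0\\) for all states.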

![gridenv](https://miro.medium.com/max/1000/1*iX-Fu5YzUZ8CNEZ86BvfKA.png)
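
For the question about \\(\gamma = 0.9 \\) in step 1, a small numeric illustration may help. The snippet below is only a sketch that is independent of the lab environment: it shows how much weight a reward receives when it arrives `k` steps in the future, and that a constant reward of 1 per step adds up to roughly \\(1 / (1 - \gamma)\\). The variable names are illustrative and do not come from the lab.

```python
# Illustrative only: how a discount factor weights future rewards.
gamma = 0.9

# Weight applied to a reward received k steps in the future.
discount_weights = [gamma ** k for k in range(5)]

# Discounted return of a constant reward of 1 per step over a long horizon.
# The geometric series approaches 1 / (1 - gamma).
horizon = 1000
discounted_return = sum(gamma ** k for k in range(horizon))
limit = 1.0 / (1.0 - gamma)

print(discount_weights)
print(discounted_return, limit)
```

A smaller \\(\gamma\\) makes the agent more short-sighted, while a value closer to 1 makes distant rewards count almost as much as immediate ones.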

Recreate the Dynamic Programming environment from the first lab.

In [5]:
%run "./helper/GridWorldAdvancedEnvironment"

In [6]:
environment = GridWorldAdvancedEnvironment()

In [7]:
# ANSWER 
import numpy as np

def evaluate_policy(environment, gamma=0.9, iterations=200):
    """Evaluate the random policy using full knowledge of the environment.

    This is planning with a given MDP, not the full RL problem.
    """
    # Number of sweeps over the state space completed so far
    count = 0
    # Synchronous updates: each sweep reads old_values and writes new_values
    new_values = np.zeros(environment.ns)
    old_values = np.zeros(environment.ns)

    # Probability of choosing each of the four actions in any state
    action_probability = 0.25

    # Probability of reaching state s' from s once an action is taken
    probability = 1

    while count < iterations:
        for state in range(environment.ns):
            # Place the agent in the state being backed up so every state is covered
            environment.set_state(state)
            value = 0
            for action in range(4):
                # Take the action and observe the successor state and reward
                next_state, reward, is_done, _ = environment.step(action)
                # Accumulate this action's contribution to the backed-up value
                value += action_probability * (reward + gamma * probability * old_values[next_state])
                # Put the agent back before trying the next action
                environment.set_state(state)

            # Store the new value for this state
            new_values[state] = value
        # One full sweep is done
        count += 1
        # Freeze the results of this sweep as the inputs to the next one
        old_values = new_values.copy()
    return np.array(new_values)

In [8]:
# ANSWER
initial_value = evaluate_policy(environment, gamma=0.9, iterations=0)
one_iteration_value = evaluate_policy(environment, gamma=0.9, iterations=1)
two_iterations_value = evaluate_policy(environment, gamma=0.9, iterations=2)
three_iterations_value = evaluate_policy(environment, gamma=0.9, iterations=3)
two_hundred_iterations_value = evaluate_policy(environment, gamma=0.9, iterations=200)

In [9]:
print(f"Initial state values: {initial_value}")
print(f"State values after 1 iteration: {one_iteration_value}")
print(f"State values after 2 iterations: {two_iterations_value}")
print(f"State values after 3 iterations: {three_iterations_value}")
print(f"State values after 200 iterations: {two_hundred_iterations_value}")

In [10]:
# Test your code
value_expected = np.array([3.30, 8.78,  4.42,  5.32,  1.49,  1.52, 2.99, 2.25, 1.90, 0.54, 0.05, 0.73, 0.67, 0.35, -0.40, -0.97, -0.43, -0.35, -0.58, -1.18, -1.85, -1.34, -1.22, -1.42, -1.97])
np.testing.assert_array_almost_equal(two_hundred_iterations_value, value_expected, err_msg= "The values are incorrect", decimal=2)

### Questions ###
0. Calculate: 
 - \\(q(0,up)\\)
 - \\(q(1,down)\\)
 - \\(q(5,left)\\)
0. How does the policy change in each iteration?
0. What is the optimal policy?
0. What are the state values under the given policy?
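
Questions 1 to 3 all rest on a one-step lookahead from the state values. The sketch below illustrates the idea on a hypothetical 2x2 grid. It does not use `GridWorldAdvancedEnvironment`, and the action numbering, rewards, and state values are assumptions made up for illustration, so its numbers are not the answers to this lab. For deterministic moves, \\(q(s, a) = r + \gamma \, v(s')\\), and a greedy policy picks the action with the largest \\(q\\) in each state.

```python
import numpy as np

# Hypothetical 2x2 deterministic grid, NOT the lab's environment.
# States are numbered row by row:  0 1
#                                  2 3
# Assumed action numbering: 0 = up, 1 = right, 2 = down, 3 = left.
N_ROWS, N_COLS = 2, 2
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def step_model(state, action):
    """Return (next_state, reward) for the toy grid.

    Assumed rewards: -1 for bumping into a wall (state unchanged), 0 otherwise.
    """
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N_ROWS and 0 <= new_col < N_COLS:
        return new_row * N_COLS + new_col, 0.0
    return state, -1.0


def action_values(values, gamma=0.9):
    """One-step lookahead: q(s, a) = r + gamma * v(s') for deterministic moves."""
    n_states = N_ROWS * N_COLS
    q = np.zeros((n_states, len(MOVES)))
    for state in range(n_states):
        for action in MOVES:
            next_state, reward = step_model(state, action)
            q[state, action] = reward + gamma * values[next_state]
    return q


# Made-up state values, standing in for the output of a policy evaluation run.
values = np.array([1.0, 2.0, 0.0, -1.0])
q = action_values(values)

# Greedy policy: in each state, pick the action with the largest q-value.
greedy_policy = q.argmax(axis=1)

print(q)
print(greedy_policy)
```

Applying the same lookahead to the values after 1, 2, 3, and 200 iterations is one way to watch the greedy policy settle as the values converge.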

In [12]:
# ANSWER

# Answer to question 1
print(f"q(0, up) is {2.3} ")
print(f"q(1, down) is {8.7} ")
print(f"q(5, left) is {19.4} ")

# Answer to question 2
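# The random policy being evaluated stays fixed; what changes is the policy that acts greedily on the current values, which evolves toward the optimal policy as the values evolve. To see this in action, run the evaluation for 1, 2, 3, and 4 iterations and compare the values.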

# Answer to question 3
# The optimal policy is the stable one at the end: in each state, move toward the states with higher values.

# Answer to question 4
print(f"State values are {two_hundred_iterations_value}" )


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>