
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Dynamic Programming: Asynchronous Lab

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you will learn:<br>
 - How asynchronous (in-place) updates work in policy iteration
  
## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) References
* Sutton Chapter 3

### Asynchronous Policy Iteration
<br>
In this lab we find an optimal policy by implementing the **asynchronous policy iteration** algorithm. Refer to the demo notebook for the formulas used in this lab.

![gridenv](https://miro.medium.com/max/1000/1*iX-Fu5YzUZ8CNEZ86BvfKA.png)
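
To make the idea of an asynchronous update concrete, here is a minimal sketch that contrasts a synchronous sweep (every state backed up from a frozen copy of the old values) with an in-place sweep (each state backed up from whatever values are current). The 4-state chain, its `next_state` and `reward` arrays, and the function names are hypothetical and are not part of the lab's grid environment.

```python
import numpy as np

# Hypothetical 4-state chain: every state moves one step right; state 3 is absorbing.
# Moving from state 2 into state 3 pays reward 1, every other move pays 0.
next_state = np.array([1, 2, 3, 3])
reward = np.array([0.0, 0.0, 1.0, 0.0])
gamma = 0.9

def synchronous_sweep(values):
    """Back up every state from a frozen copy of the previous values."""
    old = values.copy()
    return reward + gamma * old[next_state]

def in_place_sweep(values, order):
    """Back up states one at a time, reusing values updated earlier in the same sweep."""
    values = values.copy()
    for s in order:
        values[s] = reward[s] + gamma * values[next_state[s]]
    return values

v0 = np.zeros(4)
v_sync = synchronous_sweep(v0)
v_async = in_place_sweep(v0, order=[3, 2, 1, 0])

print("synchronous:", v_sync)   # only state 2 has changed after one sweep
print("in-place:   ", v_async)  # the reward has already propagated back to state 0
```

With this sweep order the in-place version reaches the fixed point of this tiny chain in a single pass, while the synchronous version needs one sweep per step of distance from the reward. In general the benefit depends on the order in which states are visited.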

Recreate the Dynamic Programming environment from the first lab.

In [5]:
%run "./helper/GridWorldAdvancedEnvironment"

In [6]:
environment = GridWorldAdvancedEnvironment()

In [7]:
# ANSWER 
def evaluate_policy_in_place(environment, policy, values, gamma=0.9): 
    """Run one in-place evaluation sweep of a policy using full knowledge of the environment.

    Because the environment's dynamics are known, this is planning, not full RL.
    """
    
    # Probability of taking the policy's action in a given state (1 because the policy is deterministic)
    action_probability = 1
    
    # Probability of reaching state s' from s after taking the action (taken to be 1 here)
    probability = 1
    
    for state in range(environment.ns):
        # Make sure we cover all states
        environment.set_state(state)
    
        # Take the action the policy prescribes
        next_state, reward, is_done, _ = environment.step(policy[state])
        # Back up the value in place, so later states in this sweep see the updated value
        values[state] = action_probability * (reward + gamma * probability * values[next_state])
           

    return values
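
The function above performs a single sweep per call. As a rough illustration of how repeated in-place sweeps settle on the value of a fixed policy, the sketch below runs the same kind of update until the largest change in a sweep falls below a tolerance. `TinyChainEnvironment` is a hypothetical stand-in that only mimics the `ns`, `set_state`, and `step` interface used in this lab; it is not `GridWorldAdvancedEnvironment`, and `sweep_in_place` is an illustrative name.

```python
import numpy as np

class TinyChainEnvironment:
    """Hypothetical stand-in: 3 states in a row, action 0 = LEFT, action 1 = RIGHT."""
    ns = 3

    def __init__(self):
        self.state = 0

    def set_state(self, state):
        self.state = state

    def step(self, action):
        # Deterministic move, clipped at both ends of the chain.
        move = 1 if action == 1 else -1
        next_state = min(max(self.state + move, 0), self.ns - 1)
        # Reward 1 for landing on the right-most state, 0 otherwise.
        reward = 1.0 if next_state == self.ns - 1 else 0.0
        self.state = next_state
        return next_state, reward, False, {}

def sweep_in_place(environment, policy, values, gamma=0.9):
    """One in-place evaluation sweep for a deterministic policy."""
    for state in range(environment.ns):
        environment.set_state(state)
        next_state, reward, _, _ = environment.step(int(policy[state]))
        values[state] = reward + gamma * values[next_state]
    return values

env = TinyChainEnvironment()
policy = np.ones(env.ns, dtype=int)  # always go RIGHT
values = np.zeros(env.ns)

sweeps = 0
while True:
    before = values.copy()
    values = sweep_in_place(env, policy, values)
    sweeps += 1
    # Stop once a full sweep changes no value by more than the tolerance.
    if np.max(np.abs(values - before)) < 1e-8:
        break

print(f"Converged after {sweeps} sweeps: {values}")
```

For this chain the values approach 9, 10, and 10: the right-most state collects reward 1 on every step, which sums to 1 / (1 - 0.9) = 10 under the discount.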

In [8]:
import copy
import numpy as np

def improve_policy(environment, policy_evaluate_function=evaluate_policy_in_place, iterations=20000, gamma=0.9):
  """Alternate policy evaluation and greedy action selection for a fixed number of iterations."""

  # Define a deterministic policy. For example, assume that in every state you can only go DOWN (action 1)
  policy = np.ones(environment.ns)
  
  # Initialize the values
  values = np.zeros(environment.ns)
  count = 0
  
  
  while count < iterations:
     
    # Evaluate the policy first
    values = policy_evaluate_function(environment, policy, values, gamma)
    
    # Copy the values so the greedy search below can track the best neighbor without modifying `values`
    values_copy = copy.deepcopy(values)
      
    for state in range(environment.ns):
      # Place the agent in the state being improved
      environment.set_state(state)
      for action in range(4):
        # Take an action
        next_state, reward, is_done, _ = environment.step(action)
        # Act greedily: prefer the action whose next state has the highest value seen so far
        if values[next_state] >= values_copy[state]:
          policy[state] = action
          values_copy[state] = values[next_state]
        # Return to the same state before trying the next action
        environment.set_state(state)
    
    count += 1
   
  
  return values, policy
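
The loop above always runs for the full `iterations` count. A common alternative is to stop as soon as the greedy policy stops changing. The sketch below shows that variant on a hypothetical 3-state chain described by lookup tables (`NEXT_STATE`, `REWARD`) instead of the lab's environment. It also selects actions by one-step lookahead (`reward + gamma * value of next state`), which is a different rule from the next-state value comparison used in this lab. All names here are illustrative assumptions.

```python
import numpy as np

# Hypothetical 3-state chain; columns are actions 0 = LEFT, 1 = RIGHT.
# NEXT_STATE[s, a] is the state reached from s under action a.
NEXT_STATE = np.array([[0, 1], [0, 2], [1, 2]])
# REWARD[s, a] is 1 when the move lands on the right-most state, else 0.
REWARD = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

def policy_iteration_until_stable(next_state, reward, gamma=0.9,
                                  eval_sweeps=50, max_iterations=1000):
    """Alternate in-place evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = next_state.shape
    policy = np.zeros(n_states, dtype=int)  # start with "always LEFT"
    values = np.zeros(n_states)

    for iteration in range(1, max_iterations + 1):
        # Approximate policy evaluation with a fixed number of in-place sweeps.
        for _ in range(eval_sweeps):
            for s in range(n_states):
                a = policy[s]
                values[s] = reward[s, a] + gamma * values[next_state[s, a]]

        # Greedy improvement by one-step lookahead over all actions.
        action_values = reward + gamma * values[next_state]
        new_policy = np.argmax(action_values, axis=1)

        # Stop when improvement no longer changes any action.
        if np.array_equal(new_policy, policy):
            return values, policy, iteration
        policy = new_policy

    return values, policy, max_iterations

stable_values, stable_policy, n_iterations = policy_iteration_until_stable(NEXT_STATE, REWARD)
print(f"Policy {stable_policy} was stable after {n_iterations} iterations")
```

On this chain the policy settles on "always RIGHT" after a handful of iterations, far fewer than a fixed budget of thousands. Whether early stopping is appropriate for the lab's grid depends on how ties between equally good actions are handled.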
    

In [9]:
optimal_values, optimal_policy = improve_policy(environment)
print(f"Optimal Policy (0=UP, 1=DOWN, 2=RIGHT, 3=LEFT):\n\n {optimal_policy.reshape(5,5)}\n")
print(f"State optimal values:\n\n {optimal_values.reshape(5,5)}")

In [10]:
# Test your code
value_expected = np.array([
    [21.97, 24.41, 21.97, 19.41, 17.47],
    [19.77, 21.97, 19.77, 17.80, 16.02],
    [17.80, 19.77, 17.80, 16.02, 14.41],
    [16.02, 17.80, 16.02, 14.41, 12.97],
    [14.41, 16.02, 14.41, 12.97, 11.67],
])
np.testing.assert_array_almost_equal(optimal_values.reshape(5, 5), value_expected, decimal=2, err_msg="The values are incorrect")


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>