
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# First-visit MC Prediction - Gridworld Problem

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you will learn:<br>
 - Policy Evaluation

### GridWorld Problem V2 ###
Consider the following environment:

1. There is a 4 by 4 grid.
2. The goal is to reach the top-left or bottom-right corner of the grid. These two cells are the terminal states.
3. There are 4 actions: UP, DOWN, LEFT and RIGHT. Each action results in a move and earns a reward of -1 until you reach a terminal state. The reward for the terminal states is 0. If an action would take you off the grid, you stay in the same state but still receive the reward. For example, if you are at the top-right corner and choose RIGHT, you remain in place and receive a reward of -1.
4. To make your job easier, the environment for this problem has already been built. Familiarize yourself with it [here]($./helper/GridWorldEnvironment).

<br>
![gridenv](https://files.training.databricks.com/images/rl/gridenv.png)
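
The rules above can be sketched as a small transition function. This is only an illustration, not the code in the helper notebook: the name `step`, the action encoding (0 = UP, 1 = DOWN, 2 = LEFT, 3 = RIGHT), the row-major state numbering and the choice to return a reward of 0 on the move that reaches a terminal state are all assumptions.

```python
N = 4                 # the grid is N by N
TERMINALS = {0, 15}   # top-left and bottom-right cells (row-major numbering assumed)

# Assumed action encoding; the helper environment may order actions differently.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # UP, DOWN, LEFT, RIGHT

def step(state, action):
    """Hypothetical transition function for the gridworld described above."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    # Clamp to the grid: a move off the edge leaves the state unchanged
    new_row = min(max(row + d_row, 0), N - 1)
    new_col = min(max(col + d_col, 0), N - 1)
    next_state = new_row * N + new_col
    is_done = next_state in TERMINALS
    # Assumption: 0 on reaching a terminal state, -1 for every other move
    reward = 0 if is_done else -1
    return next_state, reward, is_done

# Top-right corner (state 3), action RIGHT: we stay in place and still pay -1
print(step(3, 3))
```

In the lab itself, keep using `GridWorldEnvironment` from the helper notebook; the sketch only makes the edge and terminal rules concrete.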

### Problem Statement ###

In this lab we are going to evaluate a policy in a full RL problem setting, i.e. we do NOT know the dynamics of the environment, nor are we given the MDP. We are going to use the already-built environment to develop the **first-visit MC** algorithm and evaluate a random policy.
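
As a warm-up, here is a minimal sketch of what "first-visit" means for a single episode. The helper name `first_visit_returns` and the toy episode are made up for illustration. The sketch assumes no discounting and that `rewards[t]` is the reward received after acting in `states[t]`.

```python
def first_visit_returns(states, rewards):
    """Return {state: undiscounted return from its FIRST visit} for one episode."""
    returns = {}
    for t, state in enumerate(states):
        # Later visits to the same state are ignored
        if state not in returns:
            returns[state] = sum(rewards[t:])
    return returns

# Toy episode: 5 -> 4 -> 5 -> 1 -> terminal. State 5 is visited twice.
states = [5, 4, 5, 1]
rewards = [-1, -1, -1, 0]
print(first_visit_returns(states, rewards))  # {5: -3, 4: -2, 1: 0}
```

State 5 is credited with the return from step 0 only. Its second visit at step 2 does not contribute. Averaging these first-visit returns over many episodes gives the value estimate for each state.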

In [4]:
%run "./helper/GridWorldEnvironment"

In [5]:
environment = GridWorldEnvironment()

In [6]:
#ANSWER
import random
import numpy as np

def monte_carlo(number_episodes):
  """Generate episodes for the gridworld problem under a random policy.

  Returns the visited states and the immediate rewards, one list per episode.
  """
  
  # Seed the `random` module, which draws the start states and actions below
  random.seed(1234)
  # One entry per episode is appended to each list, so both end up as lists of lists.
  visited_states = []
  immediate_rewards = []
  
  # Generate the episodes
  for _ in range(number_episodes):
    
    # Randomly pick a non-terminal starting state (1 to 14)
    start_index = random.randint(1, 14)
    
    # Start of the episode: empty lists for this episode's states and rewards
    realized_states = []
    realized_rewards = []
    realized_states.append(start_index)
    environment.set_state(start_index)
    while True:
      # Randomly pick one of the 4 actions. Remember this is a random policy
      action = random.randint(0, 3)
      # Observe the next state, the reward, and whether we have reached a terminal state
      next_state, reward, is_done, _ = environment.step(action)
     
      # Keep the immediate reward
      realized_rewards.append(reward)
      
      # Stop once a terminal state is reached; the terminal state itself is not recorded
      if is_done:
        break
      # Record the next state
      realized_states.append(next_state)
    
    # Append this episode to the final lists. visited_states is a list of lists, one list of states per episode.
    # immediate_rewards is a list of lists, one list of immediate rewards per episode.
    visited_states.append(realized_states)
    immediate_rewards.append(realized_rewards)
    
  return visited_states, immediate_rewards


In [7]:
#ANSWER
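def mc_first_visit(visited_states, immediate_rewards):
  """Accumulate first-visit returns and visit counts per state.

  visited_states[k] holds the states of episode k and immediate_rewards[k]
  holds the reward received after leaving each of those states.
  """
  
  # Initialize. Counts start at 1 rather than 0 so that states that are never
  # visited (the terminal states 0 and 15) do not cause a division by zero.
  # With many episodes the extra count has a negligible effect on the average.
  cumulative_rewards = np.zeros(16)
  cumulative_counts = np.ones(16)
  
  for states, rewards in zip(visited_states, immediate_rewards):
    # Index of the first occurrence of each distinct state in this episode
    first_occurrences = sorted(states.index(state) for state in set(states))
    for i in first_occurrences:
      # The undiscounted return from step i is the sum of all rewards from i onward
      cumulative_rewards[states[i]] += np.sum(rewards[i:])
      cumulative_counts[states[i]] += 1
      
  return cumulative_rewards, cumulative_counts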

In [8]:
visited_states, immediate_rewards = monte_carlo(100000)
cumulative_rewards, cumulative_counts = mc_first_visit(visited_states,immediate_rewards)
value = cumulative_rewards/cumulative_counts

In [9]:
# Test your code
value_expected = [0.0, -14.0, -19.9, -21.0, -14.0, -18.0, -19.9, -19.0, -20.0, -19.0, -17.0, -13.0, -21.0, -19.0, -13.0, 0.0 ]
np.testing.assert_array_almost_equal(value, value_expected, err_msg = "The values are incorrect", decimal = 0)
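
To compare the estimates with the picture of the grid, it can help to view the 16 values as a 4 by 4 array. The sketch below only shows the layout, using the state indices as stand-in data. It assumes states are numbered row by row from the top-left corner, which matches the terminal states being 0 and 15 but is otherwise not confirmed by the helper notebook.

```python
import numpy as np

# Stand-in data: each cell holds its own state index, so the layout is easy to read
state_indices = np.arange(16)
grid = state_indices.reshape(4, 4)
print(grid)
```

The same call applied to the estimates, e.g. `np.round(value).reshape(4, 4)`, would place each state's value in the cell it belongs to under that numbering assumption.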


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>