# Lab 9: Q-learning
Welcome to the ninth DS102 lab! 

The goal of this lab is to implement and gain a better understanding of Q-learning.
The code you need to write is marked with a "TODO: Fill in" comment. Additional documentation for each part is provided as you go along.


## Course Policies

**Collaboration Policy**

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others, please **include their names** in the cell below.

**Submission**: To submit this assignment, rerun the notebook from scratch (by selecting Kernel > Restart & Run All), then print it as a PDF (File > Download as > PDF) and submit it to Gradescope.


**This assignment should be completed and submitted before Tuesday November 19, 2019 at 11:59 PM.** 

In [None]:
import copy
import sys
import time

import numpy as np
from IPython.display import clear_output

## Defining the GridWorld class.
We begin by defining a class for the environment in which we will run Q-learning. This is the grid world from class, which can be either stochastic or deterministic. In the stochastic case, the robot has a probability of 0.8 of going in the direction it is told to go and a probability of 0.1 of going in each of the two directions orthogonal to it.

In [None]:
FORWARD_PROB = 0.8
LEFT_PROB = 0.1
RIGHT_PROB = 0.1
BACKWARD_PROB = 0.0
FIXED_PROB = 0.0


class GridWorld():
    """The grid world class.
    
    Parameters
    ----------
    grid : list of list of str
        The starting representation of the world. A single element
        must be "R" which represents the starting location of the robot.
        Any element that is "" represents a cell on which the robot can travel,
        "X" represents a rock which the robot cannot travel on, and any
        cell with a string that can be converted to a number represents
        a terminal state with its corresponding reward.
    stochastic : bool
        Whether the environment is stochastic or deterministic.

    """
    def __init__(self, grid, stochastic):
        self._grid = grid
        self.num_rows = len(grid)
        self.num_cols = len(grid[0])
        self._stochastic = stochastic
        # Determine the starting location of the robot.
        for i in range(self.num_rows):
            for j in range(self.num_cols):
                if self._grid[i][j] == "R":
                    self._grid[i][j] = ""
                    self._row_pos = i
                    self._col_pos = j
                    self._start_row_pos = i
                    self._start_col_pos = j

    def reset(self):
        """Reset the environment to its original state."""
        self._row_pos = self._start_row_pos
        self._col_pos = self._start_col_pos
        return (self._row_pos, self._col_pos)
        
    def step(self, action):
        """Move the robot a single step in the world.
        
        Parameters
        ----------
        action : str
            The desired direction to travel in. One of
            "north", "west", "east", or "south".
            
        Returns
        -------
        pos : tuple of int
            The location the robot ends up at after taking a step.
            The first element represents the row and the second element
            represents the column.
        reward : float
            The reward from taking this step.
        done : bool
            Whether the robot has reached a terminal state or not.

        """
        # Determine the transition probabilities based on the action and
        # whether the environment is stochastic or deterministic.
        if self._stochastic:
            if action == "north":
                transition_probs = {
                    "north": FORWARD_PROB,
                    "west": LEFT_PROB,
                    "east": RIGHT_PROB,
                    "south": BACKWARD_PROB,
                    "fixed": FIXED_PROB
                }
            if action == "west":
                transition_probs = {
                    "north": RIGHT_PROB,
                    "west": FORWARD_PROB,
                    "east": BACKWARD_PROB,
                    "south": LEFT_PROB,
                    "fixed": FIXED_PROB
                }
            if action == "east":
                transition_probs = {
                    "north": LEFT_PROB,
                    "west": BACKWARD_PROB,
                    "east": FORWARD_PROB,
                    "south": RIGHT_PROB,
                    "fixed": FIXED_PROB
                }
            if action == "south":
                transition_probs = {
                    "north": BACKWARD_PROB,
                    "west": RIGHT_PROB,
                    "east": LEFT_PROB,
                    "south": FORWARD_PROB,
                    "fixed": FIXED_PROB
                }
        else:
            transition_probs = {
                "north": 0.0,
                "west": 0.0,
                "east": 0.0,
                "south": 0.0,
                "fixed": 0.0
            }
            transition_probs[action] = 1.0
            
        # Account for the cases where we are on the boundaries or
        # next to a rock.
        row = self._row_pos
        col = self._col_pos
        if row == 0 or self._grid[row - 1][col] == "X":
            transition_probs["fixed"] += transition_probs["north"]
            transition_probs["north"] = 0.0
        if col == 0 or self._grid[row][col - 1] == "X":
            transition_probs["fixed"] += transition_probs["west"]
            transition_probs["west"] = 0.0
        if row == self.num_rows - 1 or self._grid[row + 1][col] == "X":
            transition_probs["fixed"] += transition_probs["south"]
            transition_probs["south"] = 0.0
        if col == self.num_cols - 1 or self._grid[row][col + 1] == "X":
            transition_probs["fixed"] += transition_probs["east"]
            transition_probs["east"] = 0.0

        # Decide which direction the robot will go.
        directions = list(transition_probs.keys())
        probs = list(transition_probs.values())
        move = np.random.choice(directions, p=probs)
        if move == "north":
            self._row_pos -= 1
        elif move == "west":
            self._col_pos -= 1
        elif move == "east":
            self._col_pos += 1
        elif move == "south":
            self._row_pos += 1

        # Check if we are on a final state and determine the reward.
        if self._grid[self._row_pos][self._col_pos] != "":
            reward = float(self._grid[self._row_pos][self._col_pos])
            done = True
        else:
            reward = 0.0
            done = False
            
        return (self._row_pos, self._col_pos), reward, done
            
    def render(self):
        """Print an ASCII visualization of the world."""
        for i, row in enumerate(self._grid):
            row_strs = []
            for j, elt in enumerate(row):
                sys.stdout.write(" -----")
                if i == self._row_pos and j == self._col_pos:
                    elt = "R"
                row_strs.append(elt.center(5))
            sys.stdout.write("\n")
            sys.stdout.write("|" + "|".join(row_strs) + "|")
            sys.stdout.write("\n")
        for _ in range(self.num_cols):
            sys.stdout.write(" -----")
        sys.stdout.flush()
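
The boundary handling in `step` can be easier to follow in isolation. The sketch below is a minimal, standalone illustration of the same idea: any probability mass assigned to a blocked direction is moved onto "fixed" (the robot stays put). The helper name `fold_blocked` and the corner scenario are assumptions made for this example and are not part of the lab code.

```python
def fold_blocked(transition_probs, blocked):
    """Move the probability of each blocked direction onto "fixed".

    Hypothetical helper mirroring the boundary handling in GridWorld.step.
    """
    probs = dict(transition_probs)  # copy so the input is left untouched
    for direction in blocked:
        probs["fixed"] += probs[direction]
        probs[direction] = 0.0
    return probs


# Intended action "north" in the stochastic setting.
north_probs = {"north": 0.8, "west": 0.1, "east": 0.1, "south": 0.0, "fixed": 0.0}

# Assumed scenario: the robot is in the top-left corner, so it cannot
# move north or west.
corner_probs = fold_blocked(north_probs, blocked=["north", "west"])
print(corner_probs)
```

In this scenario the robot stays in place with probability about 0.9 and moves east with probability 0.1, and the probabilities still sum to 1.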


## Q-Learning
In this section we will implement Q-learning for the grid world environment defined above. Recall that the optimal Q-function at a given state $s$ for an action $a$ is defined as
$$Q(s, a) = \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma \max_{a'} Q(s', a')\right]$$
where $\gamma$ is the discount factor, $T(s, a, s')$ is the state transition probability function, and $R(s, a, s')$ is the reward function.
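
To make the sum over $s'$ concrete, here is a small worked example for a single state-action pair. All numbers (the three successor states, their rewards, and their best next-state Q-values) are made up for illustration; only the 0.8/0.1/0.1 transition probabilities echo the stochastic grid world above.

```python
gamma = 0.9

# Each tuple is (T(s, a, s'), R(s, a, s'), max_a' Q(s', a')) for one
# hypothetical successor state s'.
outcomes = [
    (0.8, 0.0, 10.0),   # the robot moves in the intended direction
    (0.1, 0.0, 2.0),    # the robot slips to one side
    (0.1, -10.0, 0.0),  # the robot slips into a terminal penalty state
]

# Expected value of "reward plus discounted best future value".
q_sa = sum(t * (r + gamma * best_next) for t, r, best_next in outcomes)
print(q_sa)
```

Here the three terms are 0.8 * 9.0, 0.1 * 1.8, and 0.1 * (-10.0), so the result is approximately 6.38.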

Furthermore, recall that we can learn the Q-function from experience by updating our estimate of the optimal Q-function with a running average over the transitions we observe. For example, say we have some estimate $\hat{Q}_k$ of the Q-function after observing $k$ samples, and we observe a new sample consisting of $s$, the state we were in; $a$, the action we performed; $s'$, the state we ended up in; and $r$, the reward we received. Then our updated Q-function is given by
$$\hat{Q}_{k + 1}(s, a) \leftarrow (1 - \alpha)\hat{Q}_k(s, a) + \alpha \left[r + \gamma\max_{a'} \hat{Q}_k(s', a')\right]$$
where $\alpha$ is a parameter between $0$ and $1$ that we set.
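
As a sanity check on the arithmetic, the sketch below applies the update rule once using plain numbers. The values of `alpha`, `q_old`, `reward`, and `next_q` are assumptions chosen for illustration and do not come from the lab's grid.

```python
# Assumed illustrative numbers (not from the lab).
alpha = 0.5
gamma = 0.9
q_old = 2.0   # current estimate of Q_k(s, a)
reward = 1.0  # observed reward r
next_q = {"north": 0.0, "west": 4.0, "east": 3.0, "south": 1.0}  # Q_k(s', .)

# New sample of "reward plus discounted best future value".
target = reward + gamma * max(next_q.values())

# Blend the old estimate with the new sample.
q_new = (1 - alpha) * q_old + alpha * target
print(target, q_new)
```

With these numbers the target is about 4.6 and the new estimate is about 3.3, halfway between the old estimate and the target because `alpha` is 0.5.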

Given this update rule, fill in the function below.

In [None]:
def update_Q(Q_values, old_state, action, new_state, reward, gamma, alpha):
    """Given an old estimate of the Q-function, compute a new estimate
    in place using an observed sample.
    
    Parameters
    ----------
    Q_values : dict of dict
        The estimate of the optimal Q values. The first index is over states
        while the second index is over actions. So for example
        Q_values[(1, 2)]["north"] is the Q-value for the state at position
        (1, 2) and with action "north".
    old_state : tuple of int
        The state we were previously at before making the given action. The
        first index represents the row while the second index represents the
        column of the state.
    action : string
        The action we made. One of "north", "east", "west", or "south".
    new_state : tuple of int
        The state we transitioned to after making the given action.
    reward : float
        The reward we obtained after making our action.
    gamma : float
        The discount factor for the Q-function.
    alpha : float
        The learning rate: the weight given to each new incoming estimate of Q.

    """
    # First compute the maximum Q-value at the new state.
    max_Q = # TODO: Fill in.

    # Now update the new Q value estimates.
    Q_values[old_state][action] = # TODO: Fill in.

We will also define two types of agents: ones that always pick the action with the highest estimated Q-value, and ones that, with probability $\epsilon$, pick a random action instead. The former is called a greedy agent, while the latter is called an $\epsilon$-greedy agent. We will explore why $\epsilon$-greedy agents are sometimes useful. Fill in the function below.
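
One detail worth keeping in mind: a random action is drawn uniformly from all four actions, so an exploring step can still happen to pick the greedy action. The sketch below computes how often an $\epsilon$-greedy agent ends up taking the greedy action, assuming there is a single best action (no ties). The function name `greedy_action_probability` is hypothetical and used only for this illustration.

```python
NUM_ACTIONS = 4  # "north", "south", "east", "west"


def greedy_action_probability(epsilon, num_actions=NUM_ACTIONS):
    """Probability of taking the greedy action, assuming a unique best action.

    With probability 1 - epsilon the agent acts greedily; with probability
    epsilon it picks uniformly at random, which still selects the greedy
    action one time in num_actions.
    """
    return (1.0 - epsilon) + epsilon / num_actions


for eps in [0.0, 0.1, 0.5, 1.0]:
    print(eps, greedy_action_probability(eps))
```

For example, with $\epsilon = 0.5$ the greedy action is still chosen 62.5% of the time, and each of the other three actions 12.5% of the time.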

In [None]:
def random_action():
    """Return a random action."""
    return np.random.choice(["north", "south", "east", "west"])

def init_Q(env):
    """Return initial Q-value estimates with all values 0."""
    Q_values = {}
    for i in range(env.num_rows):
        for j in range(env.num_cols):
            Q_values[i, j] = {
                "north": 0.0,
                "west": 0.0,
                "east": 0.0,
                "south": 0.0
            }
    return Q_values
    
def run_agent(Q_values, env, num_rollouts, gamma=0.9, alpha=0.1, epsilon=0.0, render=False):
    """Run a Q-learning agent in a given environment.
    
    Parameters
    ----------
    Q_values : dict of dict
        The Q value estimates to start from.
    env : GridWorld
        The environment in which to run the agent.
    num_rollouts : int
        The number of times we wish to reset the environment
        to its original state.
    gamma : float
        The discount factor for the Q-function.
    alpha : float
        The learning rate: the weight given to each new incoming estimate of Q.
    epsilon : float
        The proportion of times the agent will randomly pick an action instead
        of making the optimal move according to the current estimate of the
        Q-function. If epsilon is set to 0 this corresponds to a greedy agent.
    render : bool
        Whether to print the environment as it goes through each iteration.
    
    Returns
    -------
    Q_values : dict of dict
        The learned Q values. The first index is over states
        while the second index is over actions. So for example
        Q_values[(1, 2)]["north"] is the Q-value for the state at position
        (1, 2) and with action "north".

    """
    for i in range(num_rollouts):
        state = env.reset()
        if render:
            time.sleep(0.4)
            clear_output(wait=True)
            env.render()
        done = False
        samples = []
        while not done:
            if np.random.binomial(1, epsilon):
                action = random_action()
            else:
                # Take the best action according to the Q-value estimate.
                # If multiple values are equal, randomly choose between them.
                best_actions = # TODO: Fill in.
                action = np.random.choice(best_actions)
            old_state = state
            state, reward, done = env.step(action)
            samples.append((old_state, action, state, reward))
            if render:
                time.sleep(0.4)
                clear_output(wait=True)
                env.render()
        # Update the Q-function using samples from this rollout.
        # It is much more efficient to use samples in reverse chronological order.
        for old_state, action, state, reward in reversed(samples):
            update_Q(Q_values, old_state, action, state, reward, gamma, alpha)
    return Q_values

## Q-learning in Practice
Let's begin by considering the deterministic setting. We consider a two-path setting where the agent can either receive a small reward by following a short path or a large reward by following a long path.
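
To get a feel for the trade-off between the two paths, the sketch below compares the discounted value of each terminal reward as seen from the starting cell. It assumes the grid defined in the next cell (2 moves to reach the reward of 1, 5 moves to reach the reward of 1000), deterministic moves, the default `gamma=0.9`, and that the reward is collected on the final move, so it is discounted once per earlier move. The helper `discounted_value` is hypothetical.

```python
def discounted_value(reward, num_steps, gamma=0.9):
    """Value at the start of a deterministic path ending in `reward`.

    The reward arrives on the last of `num_steps` moves, so it is
    discounted by gamma once for each of the earlier moves.
    """
    return gamma ** (num_steps - 1) * reward


short_path = discounted_value(1.0, num_steps=2)    # path to the reward of 1
long_path = discounted_value(1000.0, num_steps=5)  # path to the reward of 1000
print(short_path, long_path)
```

Under these assumptions the long path is worth roughly 656 from the start, against 0.9 for the short path. Whether the agent actually discovers the long path is a separate question, which the experiments below explore.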

In [None]:
np.random.seed(0)

# Initialize the world.
env = GridWorld([["",  "",  "1000",  "X"],
                 ["",  "X", "X", "X"],
                 ["",  "R", "",  "1"]], stochastic=False)

# Initialize the Q-value estimates.
Q_values = init_Q(env)

# Learn the Q-value for 100 rollouts.
Q_values = run_agent(Q_values, env, 100, alpha=1.0, render=False)

# Now let's see what the agent plays after learning the Q-values.
Q_values = run_agent(Q_values, env, 1, render=True)

Which reward did the agent go for? Why is that the case?

TODO: Fill in.

Now let's run a simple $\epsilon$-greedy agent in this setting, with $\epsilon$ set to 0.5.

In [None]:
np.random.seed(0)

# Initialize the Q-value estimates.
Q_values = init_Q(env)

# Learn the Q-value for 100 rollouts.
Q_values = # TODO: Fill in.
# Now let's see what the agent plays after learning the Q-values.
Q_values = run_agent(Q_values, env, 1, render=True)

What did epsilon do here? Explain why it caused the observed behavior.

TODO: Fill in.

Now let's consider the stochastic setting. We have a bridge that leads to a high-value end state. However, crossing carries a risk of falling off the side of the bridge. Try to find a value of gamma that leads the robot to try to cross the bridge, and a value of gamma that leads the robot to take the lower-valued option.
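
The size of the risk can be estimated with a quick calculation. The sketch below assumes the grid defined in the next cell, where the bridge consists of two cells flanked above and below by -10 states, and assumes the robot is already on the first bridge cell and always chooses "east". It ignores discounting and anything that happens before the robot steps onto the bridge.

```python
FORWARD_PROB = 0.8  # probability of moving in the intended direction
SIDE_PROB = 0.1     # probability of slipping to each side

# Assumed: two moves are made from bridge cells, and a sideways slip on
# either of them lands in a -10 terminal state.
RISKY_STEPS = 2

p_cross = FORWARD_PROB ** RISKY_STEPS  # both risky moves go forward
p_fall = 1.0 - p_cross                 # at least one sideways slip
print(p_cross, p_fall)
```

Under these assumptions the robot crosses successfully about 64% of the time and falls about 36% of the time.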

In [None]:
np.random.seed(0)

# Initialize the world.
env = GridWorld([["",  "", "", "", "", "", "",  "-10", "-10", ""],
                 ["100", "", "", "", "", "", "R", "",     "",     "2000"],
                 ["",  "", "", "", "", "", "",  "-10", "-10", ""]], stochastic=True)

# Initialize the Q-value estimates.
Q_values = init_Q(env)

# Learn the Q-value for 1000 rollouts.
Q_values = run_agent(Q_values,
                     env,
                     1000,
                     epsilon=0.1,
                     gamma=# TODO: Fill in
                     alpha=0.1,
                     render=False)

# Now let's see what the agent plays after learning the Q-values.
Q_values = run_agent(Q_values, env, 1, render=True)

What values of gamma did you use to achieve each behavior? Why did these values work?

TODO: Fill in. You don't need to include both gammas you used in the code cell above; just mention them here.

## Efficient Q-learning
In `run_agent` we wait until the end of each rollout before we update the Q-function, and we apply the updates in reverse chronological order. Would updating the Q-function each time we see a new sample be better? Why or why not?
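
As a starting point for thinking about this, the sketch below applies the same three samples in two different orders on a toy chain of states 0 -> 1 -> 2 -> 3. It is a simplified stand-in for the lab's setup: there is a single action per state (so the max over actions is trivial), `alpha` is 1, and the function `apply_updates` is hypothetical.

```python
def apply_updates(samples, gamma=0.9, alpha=1.0):
    """Apply tabular updates on a one-action chain, in the given order.

    Each sample is (state, next_state, reward). Unseen states have value 0.
    """
    values = {}
    for state, next_state, reward in samples:
        target = reward + gamma * values.get(next_state, 0.0)
        values[state] = (1 - alpha) * values.get(state, 0.0) + alpha * target
    return values


# One rollout along the chain, with a reward of 1 on the final step.
rollout = [(0, 1, 0.0), (1, 2, 0.0), (2, 3, 1.0)]

forward = apply_updates(rollout)                   # chronological order
backward = apply_updates(list(reversed(rollout)))  # reverse chronological order
print(forward)
print(backward)
```

In chronological order only state 2 receives a nonzero value after this single rollout, while in reverse order the reward reaches every state on the path (about 0.81, 0.9, and 1.0 for states 0, 1, and 2).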

TODO: Fill in.