### Problem Set 05: Reinforcement Learning

In this problem set, you will implement model-free approaches for reinforcement learning.


0. [Credit for Contributors (required)](#contributors)

1. [Passive Reinforcement Learning (60 points)](#problem1)
    1. [Direct Evaluation (20 points)](#direct_evaluation)
    2. [Sample Sensitivity (10 points)](#sample_sensitivity)
    3. [Temporal Difference Learning (20 points)](#temporal_difference)
    4. [Learning Rate Sensitivity (10 points)](#learning_rate)
2. [Active Reinforcement Learning (35 points)](#problem2)
    1. [Q-Learning (20 points)](#Qlearning)
    2. [Epsilon-Greedy Q-Learning (10 points)](#epsilon_greedy)
    3. [Exploration vs. Exploitation (5 points)](#exploration)
3. [Time Spent on Pset (5 points)](#part4)
    
**100 points** total for Problem Set 5

## <a name="contributors"></a> Credit for Contributors

List the students, lecture notes, or online resources that helped you complete this problem set:

Ex: I worked with Bob on the cat activity planning problem.

<div class="alert alert-info">
Write your answer in the cell below this one.
</div>

--> *(double click on this cell to delete this text and type your answer here)*

In [2]:
# Be sure to run this cell to import the code needed for this assignment.
from __future__ import division

%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import copy
import random  # used by the Q-learning starter code (random.choice, random.uniform)
from grid import MDPGrid, generate_mdp_plot, generate_grid_plot
from mdp_utils import *

# imports for autograder
from principles_of_autonomy.grader import Grader
from principles_of_autonomy.notebook_tests.pset_5 import TestPSet5

## <a name="problem1"></a> Problem 1: Passive Reinforcement Learning (60 points)

In this problem, you will implement the Passive Reinforcement Learning approaches we saw in class: Direct Evaluation and Temporal Difference Learning.

The problems in this problem set are based on the simple MDP class from Problem Set 4 defined in the `mdp_utils.py` file. Feel free to open Problem Set 4 for a reminder of the MDP definition and the grid world our robot operates in. This week, however, we assume that we don't know the transition function T or the reward function R. Instead, we will implement model-free reinforcement learning methods to calculate the Values and/or Q-Values.

We provide you with a `generate_episodes` function that generates episodes from an agent acting according to a policy in the MDP:

```python
def generate_episodes(mdp, policy, num_episodes=5, max_steps=15)
```

Let's build the simple MDP from last week and visualize it again:

In [None]:
n = 3
goal = (2, 2)
obstacles = [(0, 1)]

# Build MDP with p=0.8 and gamma=0.8. Use default rewards.
mdp = build_mdp(n, 0.8, obstacles, goal, 0.8)

# Visualize the MDP:
# 1. Create grid for plotting.
g = MDPGrid(n, n)
axes = g.draw()
# 2. Draw goal and obstacle cells.
g.draw_cell_circle(axes, goal, color='g')
for ob in obstacles:
    g.draw_cell_circle(axes, ob, color='k')

Now let's create a policy to follow when we generate episodes:

In [4]:
# Create simple policy that always moves right.
policy = {s: 'right' for s in mdp.S}

Now let's generate some episodes according to this policy in the MDP:

In [5]:
episodes = generate_episodes(mdp, policy, num_episodes=10, max_steps=100)

Run the following print block to look at the generated episodes. Notice how they have different lengths because they reach terminal states at different times. Notice also how the stochasticity of the MDP shows up in the samples of an episode: sometimes you take an action and end up where you intended; other times the action takes you to a different state. The ratio of these outcomes is determined by the transition probabilities of the environment.

In [None]:
# Each episode is a list of (state, action, state', reward) tuples
for i, episode in enumerate(episodes):
    print(f"Episode {i+1} of length {len(episode)}: ", episode, "\n")
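
To make the point about outcome ratios concrete, here is a minimal sketch that counts how often each next state follows a given (state, action) pair. The helper name `empirical_transition_frequencies` and the toy episodes are made up for illustration (they are not output of `generate_episodes`); the sketch only assumes the `(state, action, state', reward)` tuple format described above.

```python
from collections import Counter, defaultdict

def empirical_transition_frequencies(episodes):
    """Estimate how often each next state follows a (state, action) pair."""
    counts = defaultdict(Counter)
    for episode in episodes:
        for state, action, next_state, _reward in episode:
            counts[(state, action)][next_state] += 1
    freqs = {}
    for sa, next_counts in counts.items():
        total = sum(next_counts.values())
        freqs[sa] = {s2: c / total for s2, c in next_counts.items()}
    return freqs

# Hand-written toy episodes (illustrative values only).
toy_episodes = [
    [((0, 0), 'right', (1, 0), -1), ((1, 0), 'right', (2, 0), -1)],
    [((0, 0), 'right', (0, 0), -1), ((0, 0), 'right', (1, 0), -1)],
    [((0, 0), 'right', (1, 0), -1)],
]
freqs = empirical_transition_frequencies(toy_episodes)
print(freqs[((0, 0), 'right')])
```

With many real episodes, these frequencies should approach the environment's transition probabilities, even though the learning methods in this problem set never compute them explicitly.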

### <a name="direct_evaluation"></a> Direct Evaluation (20 points)

In this part of the problem, you will implement **Direct Evaluation** (also called Monte-Carlo Evaluation). Your function should have the following signature:
```python
def direct_evaluation(episodes, gamma)
```

The function takes in a list of episodes representing different experiences of the agent acting according to a fixed policy in the MDP, and the discount factor gamma. The function should return a Python dictionary with the estimated value for each state in the MDP.

In general, Direct Evaluation has two variants: First-Visit and Every-Visit. In First-Visit, we add a sample of discounted rewards to the value estimate only the first time we encounter a state in an episode. In Every-Visit, we add a sample *every* time we encounter a state (even if it's repeated). You should implement the First-Visit variant of Direct Evaluation that we talked about in class. For this, you'll find it useful to keep track of the states visited so far in the episode with a `visited_states` list (which you reset whenever you move on to the next episode).
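
The two ingredients of First-Visit evaluation can be sketched in isolation on toy data. The helper names `discounted_return` and `first_visit_indices` are hypothetical, and the sketch uses a set rather than the `visited_states` list suggested above; it is not a full `direct_evaluation` implementation.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], accumulated back to front."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

def first_visit_indices(states):
    """Time steps at which each state appears for the first time in an episode."""
    seen = set()
    first = []
    for t, s in enumerate(states):
        if s not in seen:
            seen.add(s)
            first.append(t)
    return first

rewards = [1.0, 0.0, 2.0]
G = discounted_return(rewards, gamma=0.5)
print(G)  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5

states = ['A', 'B', 'A', 'C']
first = first_visit_indices(states)
print(first)  # [0, 1, 3]: the second visit to 'A' at t=2 is skipped
```

In First-Visit evaluation, the sample recorded for a state would be the discounted return of the rewards from its first-visit time step to the end of the episode.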

<div class="alert alert-info">
Implement the function `direct_evaluation(episodes, gamma)` below.
</div>

In [7]:
def direct_evaluation(episodes, gamma):
    """
    Direct Evaluation method to estimate the value function for each state for a fixed policy.
    
    Args:
        episodes (list): A list of episodes. Each episode is a list of (state, action, state', reward) tuples.
        gamma (float): The discount factor.
        
    Returns:
        V: A dictionary mapping states to their estimated value V(s).
    """
    raise NotImplementedError()

Check out how your code performs on the initial grid world we defined:

In [None]:
# Test your code for direct evaluation with the example from before and a policy that always goes right.

# Build MDP with p=0.8 and gamma=0.8. Use default rewards.
n = 3
goal = (2, 2)
obstacles = [(0, 1)]
mdp = build_mdp(n, 0.8, obstacles, goal, 0.8)

# Create simple policy that always moves right.
policy = {s: 'right' for s in mdp.S}

# Generate some episodes.
episodes = generate_episodes(mdp, policy, num_episodes=100000, max_steps=100)

# Perform direct evaluation with these episodes.
V = direct_evaluation(episodes, mdp.gamma)

# Visualize values:
# 1. Create grid for plotting.
g = MDPGrid(n, n)
axes = g.draw()
# 2. Plot values with colors and numbers.
g.plot_V(axes, V, print_numbers=True)
# 3. Draw goal and obstacle cells
g.draw_cell_circle(axes, goal, color='g')
for ob in obstacles:
    g.draw_cell_circle(axes, ob, color='k')

You can convince yourself this is correct by running Policy Evaluation in the PSet 4 notebook with this always-go-right policy and comparing the values.

In [None]:
"""Test your direct evaluation code here."""
Grader.run_single_test_inline(TestPSet5, "test_1_direct_evaluation", locals())

### <a name="sample_sensitivity"></a> Sample Sensitivity (10 points)

In class, we discussed that Direct Evaluation, while simple, has a number of limitations. We now want to observe its performance with varying sample sizes. Run the MDP example above with a varying number of samples; concretely, run the block with different values for `num_episodes` inside `generate_episodes`. Try 100, 1000, 10000, and 100000. For each `num_episodes` value run the block a few times, then answer the following:

- What trends do you observe in the value estimates as you increase the number of episodes?
- How does the number of samples affect the convergence of the value function?
- Why does Direct Evaluation require a large number of samples to provide accurate estimates?


<div class="alert alert-info">
**Discuss your results in the cell below**
</div>

--> *(double click on this cell to delete this text and type your answer here)*

### <a name="temporal_difference"></a> Temporal Difference Learning (20 points)

In this part of the problem, you will implement **Temporal Difference Learning**, which loops over the episode samples and performs the TD update to the value function one sample at a time. Your function should have the following signature:
```python
def td_learning(episodes, gamma, alpha)
```

The function takes in a list of episodes experienced by following the fixed policy in the MDP, the discount factor gamma, and the learning rate alpha for blending past values with the new sample. The function should return a Python dictionary with the estimated value for each state in the MDP.
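
As a worked example of a single TD update, the sketch below blends the old estimate with one new sample, `reward + gamma * V(next_state)`. The helper name `td_update`, the state labels, and the choice to treat unseen states as having value 0 are assumptions for illustration only.

```python
def td_update(V, state, next_state, reward, gamma, alpha):
    """Apply one TD update to V in place and return the new V[state]."""
    # The sample is the one-step reward plus the discounted current estimate of the next state.
    sample = reward + gamma * V.get(next_state, 0.0)
    # Blend the old estimate and the new sample using the learning rate alpha.
    V[state] = (1 - alpha) * V.get(state, 0.0) + alpha * sample
    return V[state]

V = {'A': 0.0, 'B': 10.0}
new_value = td_update(V, 'A', 'B', reward=1.0, gamma=0.8, alpha=0.5)
print(new_value)  # sample = 1 + 0.8 * 10 = 9, so V['A'] = 0.5 * 0 + 0.5 * 9 = 4.5
```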

<div class="alert alert-info">
Implement the function `td_learning(episodes, gamma, alpha)` below.
</div>

In [10]:
def td_learning(episodes, gamma, alpha):
    """
    Temporal-Difference Learning to estimate the value function for a fixed policy.
    
    Args:
        episodes (list): A list of episodes. Each episode is a list of (state, action, state', reward) tuples.
        gamma (float): The discount factor of the MDP.
        alpha (float): The learning rate.
        
    Returns:
        V: A dictionary mapping states to their estimated value V(s).
    """
    raise NotImplementedError()

Check out how your code performs on the initial grid world we defined:

In [11]:
# Test your code for TD Learning with the example from before and a policy that always goes right.

# Build MDP with p=0.8 and gamma=0.8. Use default rewards.
n = 3
goal = (2, 2)
obstacles = [(0, 1)]
mdp = build_mdp(n, 0.8, obstacles, goal, 0.8)

# Create simple policy that always moves right.
policy = {s: 'right' for s in mdp.S}

# Generate some episodes.
episodes = generate_episodes(mdp, policy, num_episodes=100000, max_steps=100)

In [None]:
# Perform TD learning with these episodes.
alpha = 0.01
V = td_learning(episodes, mdp.gamma, alpha)

# Visualize values:
# 1. Create grid for plotting.
g = MDPGrid(n, n)
axes = g.draw()
# 2. Plot values with colors and numbers.
g.plot_V(axes, V, print_numbers=True)
# 3. Draw goal and obstacle cells
g.draw_cell_circle(axes, goal, color='g')
for ob in obstacles:
    g.draw_cell_circle(axes, ob, color='k')

In [None]:
"""Test your TD learning code here."""
Grader.run_single_test_inline(TestPSet5, "test_2_TD_learning", locals())

### <a name="learning_rate"></a> Learning Rate Sensitivity (10 points)

In the code above, we ran TD Learning with alpha = 0.01. Try playing with more values of alpha: 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5. Answer the following based on what you observe:

- How does a very low learning rate (e.g., alpha = 0.00001) affect the value function compared to a moderate learning rate (e.g., alpha = 0.01)? What about a very high learning rate (e.g., alpha = 0.5)?
- Identify a learning rate that seems to work well for your specific MDP setup and explain why you think it strikes the right balance.
- Reflecting on your results, how do you think the choice of learning rate might affect the performance of TD Learning in more complex environments?

<div class="alert alert-info">
**Discuss your results in the cell below**
</div>

--> *(double click on this cell to delete this text and type your answer here)*

## <a name="problem2"></a> Problem 2: Active Reinforcement Learning (35 points)

Passive Reinforcement Learning is useful for evaluating fixed policies without explicit knowledge of T and R. However, it can't be used for turning these values into a new (improved) policy. For that, we need active interactions with the environment (not just passive reflection on past episodes). In this problem, we switch over to Active Reinforcement Learning, specifically to Q-Learning, and we will investigate the exploration-exploitation tradeoff and how that can affect the algorithm's efficiency.

Since we are now interacting with the world in real time, we will no longer use `generate_episodes` to generate experiences to learn from. Instead, we will use a function `sample_environment` which, given a state and an action, queries the MDP environment for the resulting next state and reward:

```python
def sample_environment(mdp, state, action)
```

You can try running the following block to see how to call `sample_environment` and what it returns. Try running it a few times (and possibly with different states and actions) to convince yourself that there is stochasticity involved. (Note: if you run the following cell after running the Q-learning code below, you may find that it is no longer stochastic, because we set a random seed in that part for grading.)

In [None]:
# Build MDP with p=0.8 and gamma=0.8. Use default rewards.
n = 3
goal = (2, 2)
obstacles = [(0, 1)]
mdp = build_mdp(n, 0.8, obstacles, goal, 0.8)

state = (1, 2)
action = "right"
next_state, reward = sample_environment(mdp, state, action)
print(state, action, next_state, reward)

Before we dive into implementing Q-Learning, let's first see some Q-values. We will store Q-values as a Python dictionary that maps each state in the MDP to another dictionary that maps each action to the corresponding Q-value. To see an example, here's what a random Q-Value function looks like:

In [None]:
# A random Q-value function:
def random_Q(states, actions):
    # Initialize empty Q-values dictionary.
    Q = {}
    # Generate random Q-values between minQ and maxQ.
    minQ, maxQ = -100, 100
    for state in states:
        Q[state] = {}
        for action in actions:
            Q[state][action] = np.random.uniform(minQ, maxQ)
    return Q

# Print values generated with the random Q-Value function.
print("Q_random is a valid Q-Value function, although definitely not optimal (it's just random!)")
n = 3
mdp = build_mdp(n, 0.8, [], (1, 1), 0.8)
Q_random = random_Q(mdp.S, mdp.A)
Q_random

Let's now visualize these Q-values in the grid:

In [None]:
# Create grid for plotting.
g = MDPGrid(n, n)
axes = g.draw()

# Plot Q in the grid.
g.plot_Q(axes, Q_random, print_numbers = True)

### <a name="Qlearning"></a> Q-Learning (20 points)

You will now implement **Q-Learning**. While Direct Evaluation and TD Learning were on-policy (they relied on samples coming from a fixed policy), Q-Learning is off-policy (it doesn't depend on samples from one fixed policy): it can choose its own actions as it goes and can converge to the optimal policy even if the agent is acting suboptimally. You should thus make use of `sample_environment` in your code to try out actions in the environment. For this problem, you should use a random action selection strategy, meaning you should choose uniformly at random from the available actions when sampling the environment.

Your function should have the following signature:
```python
def q_learning(num_episodes, mdp, gamma, alpha, max_steps=100)
```

The function takes in the number of episodes your agent can attempt in the MDP, the MDP itself (passed in only so that you can sample from it; do not use its T or R), the discount factor gamma, the learning rate alpha (similar to TD learning), and a maximum length for each episode (to avoid infinite episodes). The function should return a Python dictionary with the estimated Q-value for each (state, action) pair in the MDP, as well as a policy that results from maximizing the final Q-values at each state.
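
The following sketch shows one Q-learning update and a greedy policy extraction on a hand-built Q dictionary in the nested format described above. The helper names `q_update` and `greedy_policy` and the two-state example are hypothetical; this is not the graded function, and it does not address how terminal states are handled.

```python
def q_update(Q, state, action, next_state, reward, gamma, alpha):
    """Apply one Q-learning update to Q[state][action] in place."""
    # The sample uses the best Q-value available from the next state.
    best_next = max(Q[next_state].values())
    sample = reward + gamma * best_next
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * sample

def greedy_policy(Q):
    """Map each state to the action with the highest Q-value."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}

Q = {
    'A': {'left': 0.0, 'right': 0.0},
    'B': {'left': 2.0, 'right': 5.0},
}
q_update(Q, 'A', 'right', 'B', reward=1.0, gamma=0.8, alpha=0.5)
print(Q['A']['right'])  # sample = 1 + 0.8 * 5 = 5, so 0.5 * 0 + 0.5 * 5 = 2.5

policy = greedy_policy(Q)
print(policy)  # {'A': 'right', 'B': 'right'}
```

Note that the update uses the maximum over next-state actions regardless of which action the agent actually takes next, which is what makes the method off-policy.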

To help you out, we provide some starter code that initializes an empty Q-value dictionary, loops through episodes, and starts an episode at a random state. The function `mdp.is_sink_state(state)` checks whether a state is terminal or not; the episode ends right after we enter a terminal state or when we reach the maximum allowable length of the episode max_steps.

<div class="alert alert-info">
Implement the function `q_learning(num_episodes, mdp, gamma, alpha, max_steps=100)` below.

Complete the two TO DOs.
</div>

In [17]:
def q_learning(num_episodes, mdp, gamma, alpha, max_steps=100):
    """
    Active Q-Learning with random action selection.
    
    Args:
        num_episodes (int): The number of episodes to run.
        mdp: The MDP environment to sample from (do not use mdp.R or mdp.T).
        gamma (float): The discount factor of the MDP.
        alpha (float): The learning rate.
        max_steps (int): The maximum number of steps per episode to avoid infinite loops.
        
    Returns:
        Q: A dictionary mapping states to another dictionary mapping actions to their estimated Q-values.
        policy: A dictionary mapping states to actions to take.
    """
    # in real life, you won't have access to mdp. We are including it as a parameter for testing purposes.
    # DO NOT USE mdp.R or mdp.T!

    # Initialize Q-value dictionary.
    Q = {state: {action: 0.0 for action in mdp.A} for state in mdp.S}
    
    # Loop through episodes.
    for episode in range(num_episodes):
        # Start episode in a random state.
        state = random.choice(list(mdp.S))
        # Track number of steps in episode.
        steps = 0
        # While the episode is not done.
        while not mdp.is_sink_state(state) and steps < max_steps:
            steps += 1

            # Pick an action randomly
            action = random.choice(mdp.A)

            ### TO DO 1: sample the environment, update Q-values using the Q-learning update rule, and transition to the next state.
            raise NotImplementedError()

    ### TO DO 2: Derive policy from Q-values by maximizing over actions at each state.
    raise NotImplementedError()
    
    return Q, policy

Check out how your code performs on the initial grid world we defined:

In [None]:
# Test your code for Q Learning with the example from before.

# Build MDP with p=0.8 and gamma=0.8. Use default rewards.
n = 3
goal = (2, 2)
obstacles = [(0, 1)]
mdp = build_mdp(n, 0.8, obstacles, goal, 0.8)

# Perform Q-learning for the given number of episodes.
alpha = 0.01
num_episodes = 10000
Q, policy = q_learning(num_episodes, mdp, mdp.gamma, alpha)

# Visualize values:
# 1. Create grid for plotting.
g = MDPGrid(n, n)
axes = g.draw()
# 2. Plot values with colors and numbers.
g.plot_Q(axes, Q, print_numbers=True)
# 3. Draw goal and obstacle cells
g.draw_cell_circle(axes, goal, color='g')
for ob in obstacles:
    g.draw_cell_circle(axes, ob, color='k')   

Convince yourself that the learned policy matches the one we got with Policy Extraction when we knew the ground truth T and R values in Problem Set 4.

In [None]:
# Visualize the result.
# 1. Create grid for plotting
g = MDPGrid(n, n)
axes = g.draw()
# 2. Plot the policy.
g.plot_policy(axes, policy)
# 3. Draw goal and obstacle cells
g.draw_cell_circle(axes, goal, color='g')
for ob in obstacles:
    g.draw_cell_circle(axes, ob, color='k') 

In [None]:
"""Test your Q learning code here."""
# test_3_Q_learning(q_learning)
Grader.run_single_test_inline(TestPSet5, "test_3_Q_learning", locals())

### <a name="epsilon_greedy"></a> Epsilon-Greedy Q-Learning (10 points)

We saw in class that an agent that selects actions randomly will take longer to explore the space and converge than one that has a more clever exploration strategy. We will now implement an **Epsilon-Greedy** variant of Q-Learning, meaning it chooses random actions an epsilon fraction of the time, and follows its current best Q-values otherwise. Note that choosing a random action may result in choosing the best action - that is, you should not choose a random sub-optimal action, but rather any random legal action.
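
A minimal sketch of epsilon-greedy action selection for a single state is shown below. The helper name `epsilon_greedy_action`, the `rng` parameter, and the toy Q-values are assumptions made for illustration; the sketch is not the graded function.

```python
import random

def epsilon_greedy_action(q_values, actions, epsilon, rng=random):
    """With probability epsilon pick any legal action at random; otherwise pick the greedy one."""
    if rng.uniform(0, 1) < epsilon:
        # The random choice may also land on the current best action.
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values[a])

rng = random.Random(0)  # seeded generator so the illustration is repeatable
q_values = {'up': 0.1, 'down': -0.3, 'left': 0.0, 'right': 0.7}
actions = list(q_values)

picks = [epsilon_greedy_action(q_values, actions, 0.2, rng) for _ in range(10000)]
greedy_fraction = picks.count('right') / len(picks)
print(greedy_fraction)  # close to 1 - 0.2 + 0.2 / 4 = 0.85
```

Because the random branch can still return the best action, the greedy action is chosen with probability `1 - epsilon + epsilon / len(actions)`, not `1 - epsilon`.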

Your function should have the following signature:
```python
def q_learning_epsilon_greedy(num_episodes, mdp, gamma, alpha, epsilon, max_steps=100)
```

The function takes in the number of episodes your agent can attempt in the MDP, the MDP itself (for sampling only), the discount factor gamma, the learning rate alpha, the exploration probability epsilon, and the maximum number of steps in an episode. The function should return a Python dictionary with the estimated Q-value for each state-action pair in the MDP, as well as a policy that results from maximizing the final Q-values at each state.

A good place to start is to just copy your code for `q_learning` below. 

**Important Notes**

To pass the test, make sure you do the following:
- Use `action = random.choice(mdp.A)` to generate the action at each step, as provided above in the Q-learning starter code.
- For the **Epsilon-Greedy** part, use `if random.uniform(0, 1) < epsilon:` to decide when to choose a random action; otherwise, act greedily with respect to the current Q-values.
- You should need fewer than 5 lines of changes to your `q_learning` code.

<div class="alert alert-info">
Implement the function `q_learning_epsilon_greedy(num_episodes, mdp, gamma, alpha, epsilon, max_steps=100)` below.
</div>

In [23]:
def q_learning_epsilon_greedy(num_episodes, mdp, gamma, alpha, epsilon, max_steps=100):
    """
    Active Q-Learning with epsilon-greedy action selection.
    
    Args:
        num_episodes (int): The number of episodes to run.
        mdp: The MDP environment to sample from (do not use mdp.R or mdp.T).
        gamma (float): The discount factor.
        alpha (float): The learning rate.
        epsilon (float): Exploration probability.
        max_steps (int): The maximum number of steps per episode.
        
    Returns:
        Q: A dictionary mapping states to another dictionary mapping actions to their estimated Q-values.
        policy: A dictionary mapping states to the optimal actions.
    """
    # in real life, you won't have access to mdp. We are including it as a parameter for testing purposes.
    # BEGIN HERE!
    raise NotImplementedError()


Check out how your code performs on the initial grid world we defined:

In [None]:
# Test your code for Epsilon Greedy Q Learning with the example from before.

# Build MDP with p=0.8 and gamma=0.8. Use default rewards.
n = 3
goal = (2, 2)
obstacles = [(0, 1)]
mdp = build_mdp(n, 0.8, obstacles, goal, 0.8)

# Perform epsilon-greedy Q-learning for the given number of episodes.
alpha = 0.01
epsilon = 0.5
num_episodes = 10000
Q, policy = q_learning_epsilon_greedy(num_episodes, mdp, mdp.gamma, alpha, epsilon)

# Visualize values:
# 1. Create grid for plotting.
g = MDPGrid(n, n)
axes = g.draw()
# 2. Plot values with colors and numbers.
g.plot_Q(axes, Q, print_numbers=True)
# 3. Draw goal and obstacle cells
g.draw_cell_circle(axes, goal, color='g')
for ob in obstacles:
    g.draw_cell_circle(axes, ob, color='k')   

In [None]:
"""Test your Epsilon Greedy Q learning code here."""
# test_4_Q_learning(q_learning_epsilon_greedy)
Grader.run_single_test_inline(TestPSet5, "test_4_Q_learning", locals())

### <a name="exploration"></a> Exploration vs. Exploitation (5 points)

We will now run your code on a larger grid world with more obstacles. Execute the following code:


In [None]:
n = 10
goal = (5,8)
obstacles = [(1,3), (9,0), (8,8)] + \
            [(4, 2), (4, 3), (4, 6)] + \
            [(6, 2), (6, 3), (6, 5), (6, 6)]
mdp = build_mdp(n, p=0.8, obstacles=obstacles, goal=goal, gamma=0.8, goal_reward=100, obstacle_reward=-500)

# Perform epsilon-greedy Q-learning for the given number of episodes.
alpha = 0.01
epsilon = 0.1
num_episodes = 50000
Q, policy = q_learning_epsilon_greedy(num_episodes, mdp, mdp.gamma, alpha, epsilon)

# Visualize values:
# 1. Create grid for plotting.
g = MDPGrid(n, n)
axes = g.draw()
# 2. Plot values with colors and numbers.
g.plot_Q(axes, Q, print_numbers=False)
g.plot_policy(axes, policy)
# 3. Draw goal and obstacle cells
g.draw_cell_circle(axes, goal, color='g')
for ob in obstacles:
    g.draw_cell_circle(axes, ob, color='k')   

Now try running the same code block with different values of epsilon: 0 (pure exploitation), 0.1, 0.5, 1.0 (pure exploration). Try also varying num_episodes to get an idea of how many samples the algorithm needs to converge with different epsilon values. Answer the following questions:

- How does changing the epsilon parameter affect the performance and policy learned by the agent?
- Can you find a balance between exploration and exploitation that leads to both fast convergence and a good policy?

<div class="alert alert-info">
**Discuss your results in the cell below**
</div>

--> *(double click on this cell to delete this text and type your answer here)*

## <a name="part4"></a> Time Spent on Pset (5 points)

Please use [this form](https://forms.gle/iRvW9zKmmJ8eFiPX7) to tell us how long you spent on this pset. After you submit the form, the form will give you a confirmation word. Please enter that confirmation word below to get an extra 5 points. 

In [28]:
form_confirmation_word = "" #"ENTER THE CONFIRMATION WORD HERE"

In [None]:
# Run all tests
Grader.grade_output([TestPSet5], [locals()], "results.json")
Grader.print_test_results("results.json")