### TODO
* Explain: Q-learning
* Explain: SARSA
* Explain: n-step methods
* Explain: cliff env.


### DONE
* code: Run and compare (on cliff)
* Explain: tabular


### NOTES
* Example 6.6: Cliff Walking


The goal of reinforcement learning is to achieve goal-directed learning from interactions with an environment.
At each time step, $t$, the agent receives a state observation $S_t$ and a reward $R_t$, and performs an action $A_t$, like so:

![Image of environment-agent interaction](img/env-agent.png)

The agent must learn to pick actions that maximize the total expected future reward, called the **return**.

For simple cases, where the action and observation spaces are small and discrete, it is possible to use tabular approaches, where each possible state-action pair is enumerated (in a table).
Tabular approaches rarely scale to real-world problems, but they are useful for understanding and illuminating the fundamentals of reinforcement learning.
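
To make the idea of a table concrete, here is a minimal sketch of an action-value table stored as a NumPy array. The grid size and number of actions are assumptions chosen for illustration, and this is not necessarily how the `agents` module stores its values.

```python
import numpy as np

# Hypothetical sizes: a small grid world with four possible moves.
state_shape = (4, 12)
num_actions = 4

# One entry per (row, column, action) triple, initialised to zero.
q_table = np.zeros(state_shape + (num_actions,))

# Overwrite the estimate for a single state-action pair.
state, action = (3, 0), 1
q_table[state + (action,)] = -1.0

print(q_table.shape)  # one value per state-action pair
print(q_table[3, 0])  # the action values of state (3, 0)
```

The table grows with the product of the state and action counts, which is why this representation only works for small, discrete problems.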

This notebook describes tabular versions of two of the classical reinforcement learning algorithms: Q-learning and SARSA.
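
As a rough sketch of how the two algorithms differ, the one-step updates can be written as below. The function names, the step size `alpha`, and the integer state indexing are assumptions for illustration; the classes in `agents` may be implemented differently. Terminal states are assumed to keep a value of zero.

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha=0.5, gamma=1.0):
    """Off-policy update: bootstrap from the best action in the next state."""
    target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """On-policy update: bootstrap from the action actually taken next."""
    target = r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (target - q[s, a])

# Two states, two actions; state 1 has one good and one bad action.
q_ql = np.zeros((2, 2))
q_ql[1] = [-1.0, -5.0]
q_sarsa = q_ql.copy()

q_learning_update(q_ql, s=0, a=0, r=-1.0, s_next=1)
sarsa_update(q_sarsa, s=0, a=0, r=-1.0, s_next=1, a_next=1)

print(q_ql[0, 0])     # -1.0: assumes the best next action is taken
print(q_sarsa[0, 0])  # -3.0: accounts for the bad action that was taken
```

The only difference is the bootstrap term: Q-learning assumes the greedy action follows, while SARSA uses whatever action the behaviour policy actually picked, exploratory or not.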

## Background

**TODO**: Value function approximation

In practice we often use the discounted return instead of the undiscounted return.
The discounted return weights imminent rewards more heavily than rewards in the far future.
This is controlled by the discount factor $\gamma \in [0, 1]$, like so:

$$
G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$
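
For a finite episode the sum simply stops at the last reward. A minimal sketch of the computation, assuming `rewards` holds $R_{t+1}, R_{t+2}, \dots$ in order (the helper name is made up for this example):

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

episode_rewards = [-1.0, -1.0, -1.0]
g_undiscounted = discounted_return(episode_rewards, gamma=1.0)
g_discounted = discounted_return(episode_rewards, gamma=0.9)
g_myopic = discounted_return(episode_rewards, gamma=0.0)

print(g_undiscounted, g_discounted, g_myopic)
```

With $\gamma = 1$ every reward counts equally, while $\gamma = 0$ keeps only the immediate reward.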


**TODO**: greedy actions

**TODO**: epsilon greedy

**TODO**: one step bootstrapping - n step boot strapping

# Environment: Cliff walker

**TODO**: Describe environment

In [None]:
%matplotlib inline

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import matplotlib.pyplot as plt

import utils
from cliff import Cliff
from agents import TabularNStepSARSA, TabularNStepQLearning

# Experiments

Below we train tabular Q-learning agents and then tabular SARSA agents.

During training we monitor
 * the highest action value for each possible state (left plot), i.e. the value of the agent's greedy action in that state.
 * a heatmap of the agent's movement (right plot), i.e. the number of times the agent has visited each state.
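
The two monitored quantities could be computed from a value table roughly as follows. This is a sketch under assumed array shapes and with made-up data, not the code used inside `utils.run_loop`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical action-value table: 4 x 12 states, 4 actions per state.
q_table = rng.normal(size=(4, 12, 4))

# Left plot: value of the greedy action in every state.
greedy_values = q_table.max(axis=-1)
# The greedy action itself, if it is needed for drawing a policy.
greedy_actions = q_table.argmax(axis=-1)

# Right plot: count how often each state was visited.
visit_counts = np.zeros((4, 12), dtype=int)
trajectory = [(3, 0), (2, 0), (2, 1), (2, 0)]  # made-up sequence of states
for state in trajectory:
    visit_counts[state] += 1

print(greedy_values.shape, visit_counts[2, 0])
```

Both arrays have the same shape as the state grid, so each can be shown directly as an image.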

In [None]:
## Run settings
num_runs = 10  # Number of runs to average rewards over
eps_per_run = 500  # Number of episodes (terminations) per run
n = 1  # n parameter in n-step Bootstrapping

In [None]:
TN_QLearning_rewards = []
env = Cliff()
for i in range(num_runs):
    TN_QLearning = TabularNStepQLearning(env.state_shape, env.num_actions, n=n)
    _, rewards = utils.run_loop(env, TN_QLearning, 'QLearning, n='+str(n), max_e=eps_per_run)
    TN_QLearning_rewards.append(rewards)

TN_QLearning_rewards = np.array(TN_QLearning_rewards)

In [None]:
# Run the last agent using visualizations.
utils.run_loop(env, TN_QLearning, 'QLearning, n='+str(n), max_e=1, render=True)

In [None]:
TN_SARSA_rewards = []
env = Cliff()
for i in range(num_runs):
    try:
        TN_SARSA = TabularNStepSARSA(env.state_shape, env.num_actions, n=n)
        _, rewards = utils.run_loop(env, TN_SARSA, 'SARSA, n='+str(n), max_e=eps_per_run)
        TN_SARSA_rewards.append(rewards)
    except KeyboardInterrupt:
        break

TN_SARSA_rewards = np.array(TN_SARSA_rewards)

In [None]:
# Run the last agent using visualizations.
utils.run_loop(env, TN_SARSA, 'SARSA, n='+str(n), max_e=1, render=True)

For $n=1$ we see that the Q-learning agent learns the shortest path, right by the cliff, whereas the SARSA agent learns a longer but safer path further away from the cliff.


# Discussion

The code cell below plots the (smoothed) average reward for Q-learning and SARSA as a function of episodes.
In the $n=1$ case we can clearly see that the risky 'shortest-path strategy' of the Q-learning agent doesn't pay off, and it generally receives a lower reward.
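
One plausible way to obtain such a curve is to average the per-episode rewards over runs and then apply a moving average. The sketch below assumes a `(num_runs, num_episodes)` reward array; the helper name and the `window` parameter are made up here, and `utils.reward_plotter` may define its `smooth_factor` differently.

```python
import numpy as np

def smoothed_mean(rewards, window=10):
    """Average rewards over runs, then smooth over episodes."""
    mean_per_episode = np.asarray(rewards, dtype=float).mean(axis=0)
    kernel = np.ones(window) / window
    # 'valid' keeps only positions where the whole window fits.
    return np.convolve(mean_per_episode, kernel, mode='valid')

# Two made-up runs of four episodes each.
toy_rewards = np.array([[0, 2, 4, 6],
                        [2, 4, 6, 8]])
curve = smoothed_mean(toy_rewards, window=2)
print(curve)  # three smoothed values for four episodes
```

Averaging over runs removes most of the run-to-run noise, and the moving average removes the episode-to-episode noise that remains.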

In [None]:
plt.figure()
include_sd = False # include standard deviation in plot
utils.reward_plotter(TN_QLearning_rewards, 'QLearning', 'r', include_sd=include_sd, smooth_factor=2)
utils.reward_plotter(TN_SARSA_rewards, 'SARSA', 'b', include_sd=include_sd, smooth_factor=2)

axes = plt.gca()
axes.set_ylim([-100, 0])

plt.show()


**TODO**: Describe: Why does Q learn the bad way?

If we anneal $\epsilon$ to zero over time, both agents will asymptotically converge to the optimal policy: once exploratory actions disappear, the shortest path along the cliff is no longer risky.
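
A minimal sketch of $\epsilon$-greedy action selection with a linearly annealed $\epsilon$ is shown below. The schedule, the starting value of 0.1, and both function names are assumptions for illustration, not the settings used by the agents in this notebook.

```python
import numpy as np

def annealed_epsilon(episode, num_episodes, eps_start=0.1, eps_end=0.0):
    """Linearly interpolate from eps_start to eps_end over the run."""
    fraction = min(episode / max(num_episodes - 1, 1), 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def epsilon_greedy(action_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(action_values)))
    return int(np.argmax(action_values))

rng = np.random.default_rng(0)
eps_first = annealed_epsilon(0, 500)
eps_last = annealed_epsilon(499, 500)
# With epsilon at zero the choice is always greedy.
final_action = epsilon_greedy([0.0, 5.0, 1.0], eps_last, rng)
print(eps_first, eps_last, final_action)
```

By the last episode no exploratory actions remain, so the agent's behaviour coincides with its greedy policy.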

Another way of helping the Q-learning agent is to increase the bootstrapping parameter $n$, which lets it account for consequences further into the future.
For any $n>1$ the 
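
To illustrate what $n$ controls, here is a sketch of an $n$-step target: $n$ observed rewards followed by a single bootstrapped value. The helper is hypothetical. For SARSA the bootstrap value would be $Q(S_{t+n}, A_{t+n})$; for an $n$-step Q-learning variant, one common choice is $\max_a Q(S_{t+n}, a)$, though the `agents` module may do something different.

```python
def n_step_target(rewards, bootstrap_value, gamma=1.0):
    """Discounted sum of n observed rewards plus gamma**n * bootstrap_value."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# n = 1: one observed reward, then the current estimate.
one_step = n_step_target([-1.0], bootstrap_value=-10.0)
# n = 5: five observed rewards before bootstrapping.
five_step = n_step_target([-1.0] * 5, bootstrap_value=-10.0)
# With discounting: -1 + 0.5 * (-1) + 0.25 * (-4)
discounted = n_step_target([-1.0, -1.0], bootstrap_value=-4.0, gamma=0.5)
print(one_step, five_step, discounted)
```

The larger $n$ is, the more of the target comes from rewards that were actually experienced rather than from the current value estimate.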


**TODO**: Describe: re-run everything, but with $n=5$

In [None]:
## Run settings
n = 5
# We leave the other settings as before

In [None]:
TN_QLearning_rewards = []
env = Cliff()
for i in range(num_runs):
    TN_QLearning = TabularNStepQLearning(env.state_shape, env.num_actions, n=n)
    _, rewards = utils.run_loop(env, TN_QLearning, 'QLearning, n='+str(n), max_e=eps_per_run)
    TN_QLearning_rewards.append(rewards)
TN_QLearning_rewards = np.array(TN_QLearning_rewards)



In [None]:
TN_SARSA_rewards = []
env = Cliff()
for i in range(num_runs):
    try:
        TN_SARSA = TabularNStepSARSA(env.state_shape, env.num_actions, n=n)
        _, rewards = utils.run_loop(env, TN_SARSA, 'SARSA, n='+str(n), max_e=eps_per_run)
        TN_SARSA_rewards.append(rewards)
    except KeyboardInterrupt:
        break
TN_SARSA_rewards = np.array(TN_SARSA_rewards)

In [None]:
plt.figure()
include_sd = False # include standard deviation in plot
utils.reward_plotter(TN_QLearning_rewards, 'QLearning', 'r', include_sd=include_sd, smooth_factor=2)
utils.reward_plotter(TN_SARSA_rewards, 'SARSA', 'b', include_sd=include_sd, smooth_factor=2)

axes = plt.gca()
axes.set_ylim([-100, 0])

plt.show()
