In [1]:
# automatically reload python modules if there is a change
# See https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

# matplotlib plots are embedded inside of the notebook
%matplotlib inline 

$\DeclareMathOperator*{\argmax}{arg\,max}$

# Udacity Banana Collector

This project demonstrates how to train an agent to collect bananas in a room using the Deep Q-Network (DQN) algorithm.

## Environment

This project uses a version of the Unity Banana Collector environment modified by Udacity.

The following code cell loads the environment.

If `UnityEnvironment` fails to load the Mono library with the error shown below, you may need to install Mono on your Ubuntu machine:

    Unable to load mono library from ./Banana_Linux/Banana_Data/MonoBleedingEdge/x86_64/libmonobdwgc-2.0.so
    Failed to load mono
    
In that case, follow the instructions at https://www.mono-project.com/download/stable/#download-lin to install Mono. The shell commands for Ubuntu 18.04 are:

    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 3FA7E0328081BFF6A14DA29AA6A19B38D3D831EF
    echo "deb https://download.mono-project.com/repo/ubuntu stable-bionic main" | sudo tee /etc/apt/sources.list.d/mono-official-stable.list
    sudo apt update
    sudo apt install mono-devel

In [2]:
from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name='Banana_Linux/Banana.x86')

UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
	 The environment does not need user interaction to launch
	 The Academy and the External Brain(s) are attached to objects in the Scene
	 The environment and the Python interface have compatible versions.

## DQN

The DQN loss is defined as:

\begin{equation}
L_{DQN} = (R_{t+1} + \gamma_{t+1} \max_{a'}{q_{\bar{\theta}}}(S_{t+1},a') - q_\theta(S_t,A_t))^2,
\end{equation}

where
  * $t$ : a time step randomly picked from the replay memory
  * $\theta$ : the parameters of the _online network_
  * $\bar{\theta}$ : the parameters of the _target network_

Notes:
  * The gradient of the loss is back-propagated only into $\theta$.
  * $\theta$ is periodically copied to $\bar{\theta}$.
  * Mini-batches are sampled uniformly from the experience replay.
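
The loss above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the arrays stand in for network outputs, the function name `dqn_loss` is hypothetical, and averaging the squared TD error over the mini-batch is an assumption (the equation defines the per-transition loss). It is not the code in the `agent` module.

```python
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, gammas):
    """Mean squared TD error over a mini-batch (hypothetical helper).

    q_online:      (B, A) values q_theta(S_t, .) from the online network
    q_target_next: (B, A) values q_theta_bar(S_{t+1}, .) from the target network
    actions:       (B,) integer actions A_t
    rewards:       (B,) rewards R_{t+1}
    gammas:        (B,) discounts gamma_{t+1} (0 where the episode ended)
    """
    batch = np.arange(len(actions))
    # Bootstrap target, treated as a constant: no gradient flows into theta_bar.
    target = rewards + gammas * q_target_next.max(axis=1)
    td_error = target - q_online[batch, actions]
    return np.mean(td_error ** 2)

# Toy mini-batch of two transitions with two actions each.
q_online = np.array([[1.0, 2.0], [0.5, 0.0]])
q_target_next = np.array([[0.0, 3.0], [1.0, 2.0]])
loss = dqn_loss(q_online, q_target_next,
                actions=np.array([1, 0]),
                rewards=np.array([1.0, 0.0]),
                gammas=np.array([0.9, 0.0]))
```

In a real training loop the same expression would be written in an autodiff framework so that the gradient reaches only the online parameters $\theta$.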

## Double Q-learning

Double Q-learning addresses the overestimation bias of DQN by decoupling action selection from action evaluation in the maximization performed for the bootstrap target: the online network selects the action and the target network evaluates it.

Double Q-learning defines the loss as:

$$
L_{DDQN} = (R_{t+1} + \gamma_{t+1} q_{\bar{\theta}}(S_{t+1},\argmax_{a'}{q_\theta (S_{t+1},a')}) - q_\theta(S_t,A_t))^2
$$
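
As a minimal sketch of the difference from the DQN target, the snippet below computes both bootstrap targets on a toy mini-batch. The arrays stand in for network outputs and the function name `double_dqn_target` is hypothetical.

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, rewards, gammas):
    """Double Q-learning bootstrap target (hypothetical helper).

    q_online_next: (B, A) values q_theta(S_{t+1}, .)
    q_target_next: (B, A) values q_theta_bar(S_{t+1}, .)
    """
    batch = np.arange(len(rewards))
    # Selection: the online network picks the greedy action.
    best_actions = q_online_next.argmax(axis=1)
    # Evaluation: the target network scores that action.
    return rewards + gammas * q_target_next[batch, best_actions]

q_online_next = np.array([[1.0, 5.0], [2.0, 0.0]])
q_target_next = np.array([[4.0, 3.0], [1.0, 6.0]])
rewards = np.array([1.0, 0.0])
gammas = np.array([0.5, 0.5])

ddqn_target = double_dqn_target(q_online_next, q_target_next, rewards, gammas)
# Plain DQN target for comparison: max over the target network's own values.
dqn_target = rewards + gammas * q_target_next.max(axis=1)
```

Because the evaluated action is not necessarily the target network's own maximizer, the Double Q-learning target is never larger than the DQN target for the same inputs.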

## Prioritized replay

Prioritized experience replay samples transitions with probability $p_t$ relative to the last encountered absolute _TD error_:

$$
p_t \propto |R_{t+1} + \gamma_{t+1} \max_{a'} q_{\bar{\theta}}(S_{t+1},a') - q_\theta(S_t,A_t)|^w,
$$

where $w$ is a hyper-parameter that determines the shape of the distribution.
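
The effect of $w$ can be seen in a small sketch that turns stored absolute TD errors into sampling probabilities. The function name `sampling_probabilities` and the small constant `eps` are assumptions added here (`eps` keeps transitions with zero TD error sampleable); neither comes from the formula above.

```python
import numpy as np

def sampling_probabilities(td_errors, w, eps=1e-6):
    """Normalize |TD error|^w into a probability distribution.

    eps is an assumed small offset, not part of the formula in the text.
    """
    priorities = (np.abs(td_errors) + eps) ** w
    return priorities / priorities.sum()

td_errors = np.array([1.0, -2.0, 1.0])

# w = 1: probabilities proportional to the absolute TD error.
p_proportional = sampling_probabilities(td_errors, w=1.0, eps=0.0)
# w = 0: every priority becomes 1, which recovers uniform sampling.
p_uniform = sampling_probabilities(td_errors, w=0.0, eps=0.0)

# Draw a mini-batch of transition indices according to the priorities.
rng = np.random.default_rng(0)
indices = rng.choice(len(td_errors), size=2, p=p_proportional)
```

Setting $w = 0$ gives the uniform sampling used by plain DQN, while larger $w$ concentrates sampling on transitions with large TD error.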

## Dueling networks

## Multi-step learning

A multi-step variant of DQN uses forward-view _multi-step_ targets, which gives the alternative loss:

$$
L_{\text{multi-step}} = (R_t^{(n)} + \gamma_t^{(n)} \max_{a'} q_{\bar{\theta}}(S_{t+n},a') - q_\theta(S_t,A_t))^2,
$$
where
$$
R_t^{(n)} \equiv \sum_{k=0}^{n-1} \gamma_t^{(k)} R_{t+k+1}.
$$
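
A minimal sketch of the truncated return and the resulting target is given below. It assumes a constant discount, so that $\gamma_t^{(k)} = \gamma^k$; the text does not define $\gamma_t^{(k)}$, so this is an assumption. The function names are hypothetical.

```python
def n_step_return(rewards, gamma):
    """Truncated n-step return R_t^(n) for rewards [R_{t+1}, ..., R_{t+n}].

    Assumes a constant discount, i.e. gamma_t^(k) = gamma ** k.
    """
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def n_step_target(rewards, gamma, max_q_next):
    """Multi-step bootstrap target R_t^(n) + gamma^n * max_a' q(S_{t+n}, a')."""
    n = len(rewards)
    return n_step_return(rewards, gamma) + gamma ** n * max_q_next

# Three-step example: rewards R_{t+1}, R_{t+2}, R_{t+3}.
rewards = [1.0, 0.0, 2.0]
ret = n_step_return(rewards, gamma=0.5)                     # 1 + 0 + 0.25 * 2
target = n_step_target(rewards, gamma=0.5, max_q_next=4.0)  # ret + 0.125 * 4
```

With $n = 1$ the target reduces to the one-step DQN target.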

In [1]:
import agent