In [1]:
# automatically reload python modules if there is a change
# See https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

# matplotlib plots are embedded inside of the notebook
%matplotlib inline 

$\DeclareMathOperator*{\argmax}{arg\,max}$

# Udacity Banana Collector

This project demonstrates how to train an agent to collect bananas in a room using the Deep Q-Network (DQN) algorithm.

## Exploring the environment

In [2]:
from unityagents import UnityEnvironment

### Environment

The environment is a modified version of Unity ML-Agents [Banana-Collector][banana-collector].

[banana-collector]: https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#banana-collector

### Goal

The goal of the agent is to collect yellow bananas while avoiding blue bananas. The environment is considered solved when the average return over 100 consecutive episodes exceeds 13.
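
The solved criterion can be checked with a running window over episode returns. The following is a minimal sketch, not the notebook's actual training code: the function name `solved_episode` and its parameters are hypothetical, and it assumes "over 13" means strictly greater than 13.

```python
from collections import deque


def solved_episode(returns, window=100, threshold=13.0):
    """Return the 1-based episode at which the mean of the last `window`
    returns first exceeds `threshold`, or None if that never happens."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` returns
    for episode, episode_return in enumerate(returns, start=1):
        recent.append(episode_return)
        # Only test once a full window of episodes is available.
        if len(recent) == window and sum(recent) / window > threshold:
            return episode
    return None


# Hypothetical returns: 50 poor episodes followed by 100 good ones.
example_returns = [0.0] * 50 + [14.0] * 100
first_solved = solved_episode(example_returns)
```

A `deque` with `maxlen` drops the oldest return automatically, so the window never has to be sliced by hand.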

### Reward

The agent receives a reward of +1 when it reaches a yellow banana and -1 when it reaches a blue one.

* +1 - yellow banana
* -1 - blue banana

### Observation Space

The observation space has 37 dimensions and contains the agent's velocity plus ray-based perception of objects around the agent's forward direction.

### Action Space

Based on the observation, the agent needs to learn how to best select actions. Four discrete actions are available:

* 0 - move forward
* 1 - move backward
* 2 - turn left
* 3 - turn right
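
A common way to choose among these four actions while learning is an epsilon-greedy rule. The notebook does not state which exploration strategy the agent uses, so the sketch below is an assumption, and `epsilon_greedy` and its arguments are hypothetical names.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability `epsilon`,
    otherwise the action with the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform over actions
    # Exploit: index of the largest Q-value (first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for actions 0 (forward), 1 (backward), 2 (left), 3 (right).
greedy_action = epsilon_greedy([0.1, 0.5, 0.2, 0.0], epsilon=0.0)
```

With `epsilon=0.0` the choice is purely greedy. Larger values trade some return for exploration.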


In [3]:
# Creating an environment
env = UnityEnvironment('Banana_Windows_x86_64/Banana.exe')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we list the available brains and select the only one, `BananaBrain`, as the brain we will control from Python.

In [4]:
env.brains

{'BananaBrain': <unityagents.brain.BrainParameters at 0x213ea1f1048>}

In [5]:
env.brain_names

['BananaBrain']

In [7]:
brain_name = 'BananaBrain'
brain = env.brains[brain_name]
brain

<unityagents.brain.BrainParameters at 0x213ea1f1048>

## DQN

The DQN loss for a single transition is defined as:

\begin{equation}
L_{DQN} = (R_{t+1} + \gamma_{t+1} \max_{a'}{q_{\bar{\theta}}}(S_{t+1},a') - q_\theta(S_t,A_t))^2,
\end{equation}

where
  * $t$ : a time step randomly picked from the replay memory
  * $\theta$ : the parameters of the _online network_
  * $\bar{\theta}$ : the parameters of the _target network_

Notes:
  * The gradient of the loss is back-propagated only into $\theta$.
  * $\theta$ is periodically copied to $\bar{\theta}$.
  * Mini-batches are sampled uniformly from the experience replay.
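
The loss for one sampled transition can be sketched in plain Python. This is an illustration under stated assumptions, not the project's `agent` module: `dqn_loss`, `q_online`, and `q_target` are hypothetical names, the networks are stand-in functions returning one value per action, and $\gamma_{t+1}$ is assumed to be a constant `gamma` that drops to zero when the episode terminates.

```python
def dqn_loss(transition, q_online, q_target, gamma=0.99):
    """Squared TD error for one transition (S_t, A_t, R_{t+1}, S_{t+1}, done)."""
    state, action, reward, next_state, done = transition
    # Bootstrap from the target network; no bootstrap after a terminal step.
    bootstrap = 0.0 if done else gamma * max(q_target(next_state))
    td_target = reward + bootstrap
    # Only the online estimate q_online(S_t)[A_t] would receive gradients.
    td_error = td_target - q_online(state)[action]
    return td_error ** 2


# Stand-in "networks" that ignore the state and return fixed Q-values.
def q_online(state):
    return [1.0, 2.0, 3.0, 4.0]


def q_target(state):
    return [0.5, 1.5, 2.5, 3.5]


loss = dqn_loss(("s", 1, 1.0, "s_next", False), q_online, q_target, gamma=0.9)
```

In a real implementation the target term is treated as a constant during back-propagation, which matches the note that gradients flow only into $\theta$.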

## Double Q-learning

Double Q-learning addresses the overestimation of DQN by decoupling, in the maximization performed for the bootstrap target, the selection of the action from its evaluation.

Double Q-learning defines the loss as:

$$
L_{DDQN} = (R_{t+1} + \gamma_{t+1} q_{\bar{\theta}}(S_{t+1},\argmax_{a'}{q_\theta (S_{t+1},a')}) - q_\theta(S_t,A_t))^2
$$
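
The decoupling of selection and evaluation can be sketched as follows. This is a hedged illustration: `double_dqn_target` and the stand-in value functions are hypothetical, and the discount is assumed to be a constant `gamma` that becomes zero at termination.

```python
def double_dqn_target(reward, next_state, done, q_online, q_target, gamma=0.99):
    """Bootstrap target where the online network selects the action
    and the target network evaluates it."""
    if done:
        return reward
    online_next = q_online(next_state)
    # Selection: argmax over the online network's estimates.
    best_action = max(range(len(online_next)), key=lambda a: online_next[a])
    # Evaluation: value of that action under the target network.
    return reward + gamma * q_target(next_state)[best_action]


# Stand-in networks that disagree about which action is best.
def online_values(state):
    return [1.0, 5.0, 2.0, 0.0]


def target_values(state):
    return [10.0, 2.0, 3.0, 4.0]


ddqn_target = double_dqn_target(1.0, "s_next", False, online_values, target_values, gamma=0.5)
dqn_target = 1.0 + 0.5 * max(target_values("s_next"))
```

In this toy case the plain DQN target bootstraps from the target network's largest value (10.0), while the Double Q-learning target uses the value of the action preferred by the online network (2.0), which illustrates how the overestimation is reduced.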

## Prioritized replay

Prioritized experience replay samples transitions with probability $p_t$ relative to the last encountered absolute _TD error_:

$$
p_t \propto |R_{t+1} + \gamma_{t+1} \max_{a'} q_{\bar{\theta}}(S_{t+1},a') - q_\theta(S_t,A_t)|^w,
$$

where $w$ is a hyper-parameter that determines the shape of the distribution.
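
The effect of $w$ on the sampling distribution can be sketched with the standard library. The function names are hypothetical, and the small constant `eps` is an assumption added here so that transitions with zero TD error keep a non-zero probability; it is not part of the formula above.

```python
import random


def sampling_probabilities(td_errors, w=0.5, eps=1e-6):
    """Normalize |TD error|^w into a probability distribution."""
    priorities = [(abs(error) + eps) ** w for error in td_errors]
    total = sum(priorities)
    return [priority / total for priority in priorities]


def sample_indices(td_errors, batch_size, w=0.5, rng=random):
    """Draw replay-memory indices (with replacement) according to priority."""
    probabilities = sampling_probabilities(td_errors, w)
    return rng.choices(range(len(td_errors)), weights=probabilities, k=batch_size)


# Hypothetical last-seen TD errors of four stored transitions.
errors = [0.1, -1.0, 2.0, 0.0]
uniform = sampling_probabilities(errors, w=0.0)      # w = 0 recovers uniform sampling
prioritized = sampling_probabilities(errors, w=1.0)  # w = 1 is proportional to |TD error|
```

Setting `w=0` gives every transition the same probability, and increasing `w` concentrates sampling on transitions with large errors.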

## Dueling networks

## Multi-step learning

A multi-step variant of DQN uses forward-view _multi-step_ targets and minimizes an alternative loss, defined as:

$$
L_{\text{multi-step}} = (R_t^{(n)} + \gamma_t^{(n)} \max_{a'} q_{\bar{\theta}}(S_{t+n},a') - q_\theta(S_t,A_t))^2,
$$
where
$$
R_t^{(n)} \equiv \sum_{k=0}^{n-1} \gamma_t^{(k)} R_{t+k+1}.
$$
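
The truncated return and the resulting target can be sketched as below. This assumes a constant discount, so that $\gamma_t^{(k)} = \gamma^k$, and that bootstrapping is skipped when the episode ends within the $n$ steps. The function names are hypothetical.

```python
def n_step_return(rewards, gamma=0.99):
    """Discounted sum of the n rewards R_{t+1}, ..., R_{t+n}."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


def n_step_target(rewards, bootstrap_values, gamma=0.99, done=False):
    """n-step return plus the discounted maximum target-network value at S_{t+n}."""
    n = len(rewards)
    bootstrap = 0.0 if done else (gamma ** n) * max(bootstrap_values)
    return n_step_return(rewards, gamma) + bootstrap


# Three rewards (n = 3) and hypothetical target-network values at S_{t+3}.
three_step_return = n_step_return([1.0, 0.0, 1.0], gamma=0.5)
three_step_target = n_step_target([1.0, 0.0, 1.0], [0.0, 4.0], gamma=0.5)
```

With `n = 1` this reduces to the one-step DQN target.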

In [1]:
import agent