In [1]:
# automatically reload python modules if there is a change
# See https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

# matplotlib plots are embedded inside of the notebook
%matplotlib inline 

In [2]:
# Imports
import numpy as np
from unityagents import UnityEnvironment

$\DeclareMathOperator*{\argmax}{arg\,max}$

# Udacity Banana Collector

This project demonstrates how to train an agent to collect bananas in a room using the Deep Q-Network (DQN) algorithm.

## Exploring the environment

In [3]:
# Create an environment
env = UnityEnvironment(file_name='Banana_Linux/Banana.x86')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain brains which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [4]:
env.brain_names

['BananaBrain']

In [5]:
env.brains

{'BananaBrain': <unityagents.brain.BrainParameters at 0x7fc83909e550>}

In [6]:
brain_name = 'BananaBrain'
brain = env.brains[brain_name]
brain

<unityagents.brain.BrainParameters at 0x7fc83909e550>

### Action Space

The agent can take one of 4 possible actions.

- 0 - walk forward
- 1 - walk backward
- 2 - turn left
- 3 - turn right

In [8]:
brain.vector_action_space_type

'discrete'

In [18]:
# The number of possible actions
n_actions = brain.vector_action_space_size
n_actions

4

### State Space

The state space has 37 dimensions and contains the agent's velocity along with ray-based perception of objects around the agent's forward direction.

In [10]:
brain.vector_observation_space_type

'continuous'

In [19]:
# The number of state dimensions
n_state_dims = brain.vector_observation_space_size
n_state_dims

37

In [12]:
# Reset the environment
env_info = env.reset(train_mode=True)
env_info

{'BananaBrain': <unityagents.brain.BrainInfo at 0x7fc83909e048>}

In [15]:
# There is only one available brain
env_info = env.reset(train_mode=True)[brain_name]
env_info

<unityagents.brain.BrainInfo at 0x7fc83909eef0>

In [16]:
env_info.vector_observations

array([[0.        , 1.        , 0.        , 0.        , 0.16895212,
        0.        , 1.        , 0.        , 0.        , 0.20073597,
        1.        , 0.        , 0.        , 0.        , 0.12865657,
        0.        , 1.        , 0.        , 0.        , 0.14938059,
        1.        , 0.        , 0.        , 0.        , 0.58185619,
        0.        , 1.        , 0.        , 0.        , 0.16089135,
        0.        , 1.        , 0.        , 0.        , 0.31775284,
        0.        , 0.        ]])

In [17]:
env_info.vector_observations.shape

(1, 37)

### Explore the environment using a random agent

In [20]:
# Reset the environment
env_info = env.reset(train_mode=False)[brain_name]

# Get the current state
state = env_info.vector_observations[0]

# Initialize the score
score = 0

# Keep exploring until done == True
done = False
while not done:
    
    # Select a random action
    action = np.random.randint(n_actions)
    
    # Take the action
    env_info = env.step(action)[brain_name]
    
    # Observe the result
    next_state = env_info.vector_observations[0]
    reward = env_info.rewards[0]
    done = env_info.local_done[0]
    
    # Accumulate the reward to the score
    score += reward
    
    # Prepare the state for the next action
    state = next_state
    
score

1.0

In [21]:
env_info.vector_observations

array([[1.        , 0.        , 0.        , 0.        , 0.04178659,
        0.        , 0.        , 1.        , 0.        , 0.21819802,
        0.        , 0.        , 0.        , 1.        , 0.        ,
        1.        , 0.        , 0.        , 0.        , 0.39524487,
        0.        , 0.        , 1.        , 0.        , 0.58369923,
        0.        , 0.        , 0.        , 1.        , 0.        ,
        1.        , 0.        , 0.        , 0.        , 0.21722141,
        0.        , 0.        ]])

In [22]:
env_info.rewards

[0.0]

In [23]:
env_info.local_done

[True]

## DQN

The DQN loss is defined as:

\begin{equation}
L_{DQN} = (R_{t+1} + \gamma_{t+1} \max_{a'}{q_{\bar{\theta}}}(S_{t+1},a') - q_\theta(S_t,A_t))^2,
\end{equation}

where
  * $t$ : a time step randomly picked from the replay memory
  * $\theta$ : the parameters of the _online network_
  * $\bar{\theta}$ : the parameters of the _target network_

Notes:
  * The gradient of the loss is back-propagated only into $\theta$.
  * $\theta$ is periodically copied to $\bar{\theta}$.
  * Mini-batches are sampled uniformly from the experience replay.
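
The loss above can be sketched in NumPy for a single mini-batch. This is a minimal illustration rather than the implementation in the `agent` module: the function name `dqn_loss`, the argument names, and averaging the squared error over the batch are all assumptions made for the example. The Q-value arrays stand in for the outputs of the online and target networks.

```python
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, gammas):
    """Mean squared TD error over a mini-batch (hypothetical helper).

    q_online:      (batch, n_actions) online-network values q_theta(S_t, .)
    q_target_next: (batch, n_actions) target-network values q_theta_bar(S_{t+1}, .)
    actions:       (batch,) integer actions A_t
    rewards:       (batch,) rewards R_{t+1}
    gammas:        (batch,) discounts gamma_{t+1} (0 where the episode ended)
    """
    batch = np.arange(len(actions))
    # Bootstrap target: greedy value under the target network.
    targets = rewards + gammas * q_target_next.max(axis=1)
    # Value of the action that was actually taken.
    predictions = q_online[batch, actions]
    # Averaging over the batch is an assumption; the formula is per sample.
    return np.mean((targets - predictions) ** 2)

# Toy values standing in for network outputs.
q_online = np.array([[1.0, 2.0], [0.5, 0.0]])
q_target_next = np.array([[0.0, 3.0], [1.0, 1.0]])
loss = dqn_loss(
    q_online,
    q_target_next,
    actions=np.array([1, 0]),
    rewards=np.array([1.0, 0.0]),
    gammas=np.array([0.9, 0.0]),
)
```

In a real training loop only `q_online` would carry gradients; `q_target_next` is treated as a constant, which matches the note that the loss is back-propagated only into $\theta$.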

## Double Q-learning

Double Q-learning addresses the overestimation bias of DQN by decoupling the selection of the action from its evaluation in the maximization performed for the bootstrap target: the online network selects the action and the target network evaluates it.

Double Q-learning defines the loss as:

$$
L_{DDQN} = (R_{t+1} + \gamma_{t+1} q_{\bar{\theta}}(S_{t+1},\argmax_{a'}{q_\theta (S_{t+1},a')}) - q_\theta(S_t,A_t))^2
$$
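
The difference from the DQN target can be sketched as follows. The helper name `double_dqn_targets` and its arguments are hypothetical, and the arrays are toy values in place of real network outputs; the point is only to show the online network choosing the action and the target network scoring it.

```python
import numpy as np

def double_dqn_targets(q_online_next, q_target_next, rewards, gammas):
    """Bootstrap targets for Double Q-learning (hypothetical helper).

    q_online_next: (batch, n_actions) online-network values at S_{t+1}
    q_target_next: (batch, n_actions) target-network values at S_{t+1}
    """
    batch = np.arange(len(rewards))
    # Selection: the online network picks the greedy action.
    best_actions = q_online_next.argmax(axis=1)
    # Evaluation: the target network scores that action.
    return rewards + gammas * q_target_next[batch, best_actions]

q_online_next = np.array([[1.0, 2.0], [3.0, 0.0]])
q_target_next = np.array([[5.0, 0.5], [1.0, 4.0]])
rewards = np.array([1.0, 0.0])
gammas = np.array([0.9, 0.9])

ddqn_targets = double_dqn_targets(q_online_next, q_target_next, rewards, gammas)
# Plain DQN target for comparison: max over the target network's own values.
dqn_targets = rewards + gammas * q_target_next.max(axis=1)
```

Because the target network's value of any single action can never exceed its maximum over all actions, `ddqn_targets` is element-wise no larger than `dqn_targets`.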

## Prioritized replay

Prioritized experience replay samples transitions with probability $p_t$ relative to the last encountered absolute _TD error_:

$$
p_t \propto |R_{t+1} + \gamma_{t+1} \max_{a'} q_{\bar{\theta}}(S_{t+1},a') - q_\theta(S_t,A_t)|^w,
$$

where $w$ is a hyper-parameter that determines the shape of the distribution.
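
A minimal sketch of turning stored TD errors into sampling probabilities is shown below. The function name `sampling_probabilities` is hypothetical, and the small constant `eps`, which keeps transitions with zero error from never being sampled, is an assumption that does not appear in the formula above.

```python
import numpy as np

def sampling_probabilities(td_errors, w, eps=1e-6):
    """Normalize |TD error|^w into a probability distribution.

    eps is an assumed small offset so that no transition has zero priority.
    """
    priorities = (np.abs(td_errors) + eps) ** w
    return priorities / priorities.sum()

# Last encountered TD errors for four stored transitions (toy values).
td_errors = np.array([0.1, -2.0, 0.5, 0.0])
probs = sampling_probabilities(td_errors, w=0.5)

# Draw a mini-batch of indices according to the priorities.
rng = np.random.default_rng(0)
batch_indices = rng.choice(len(td_errors), size=2, p=probs)
```

With $w = 0$ every priority equals 1, so sampling falls back to the uniform case used by plain DQN; larger values of $w$ concentrate sampling on transitions with large errors.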

## Dueling networks

## Multi-step learning

A multi-step variant of DQN uses forward-view _multi-step_ targets and minimizes the alternative loss:

$$
L_{\text{multi-step}} = (R_t^{(n)} + \gamma_t^{(n)} \max_{a'} q_{\bar{\theta}}(S_{t+n},a') - q_\theta(S_t,A_t))^2,
$$
where
$$
R_t^{(n)} \equiv \sum_{k=0}^{n-1} \gamma_t^{(k)} R_{t+k+1}.
$$
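
As a worked example, the truncated return and the resulting target can be computed as below. The sketch assumes a constant discount $\gamma$, so that $\gamma_t^{(k)} = \gamma^k$; the function names `n_step_return` and `n_step_target` are hypothetical.

```python
def n_step_return(rewards, gamma):
    """Discounted sum of [R_{t+1}, ..., R_{t+n}], assuming a constant gamma."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def n_step_target(rewards, gamma, next_q_values):
    """Multi-step target: n-step return plus the discounted bootstrap value.

    next_q_values holds the target-network values q_theta_bar(S_{t+n}, .).
    """
    n = len(rewards)
    return n_step_return(rewards, gamma) + gamma ** n * max(next_q_values)

# Three-step example with gamma = 0.5:
# return = 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
ret = n_step_return([1.0, 0.0, 2.0], gamma=0.5)
# target = 1.5 + 0.125 * 8.0 = 2.5
target = n_step_target([1.0, 0.0, 2.0], gamma=0.5, next_q_values=[4.0, 8.0])
```

With a single reward (`n = 1`) the target reduces to the one-step DQN target.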

In [1]:
import agent