# Continuous Deep Reinforcement Learning on Slot Car Racing

In this notebook we provide a quick introduction to the "Deep Deterministic Policy Gradient" (DDPG) algorithm presented in the 2016 paper "Continuous Control with Deep Reinforcement Learning" by Lillicrap et al. We demonstrate the algorithm in a Slot Car Racing (also known by the brand name Carrera) environment.

The following consists of multiple parts:

  - The custom `Carrera` environment -- a Python class which allows agents to create, reset and perform "steps" on a slot car racing track. It also provides abstractions for visualizing the algorithm's performance.
  - An `Agent` class which interacts with a provided environment.
  - The DDPG algorithm implemented in TensorFlow.
  
**TODO:** Fix introduction.

## Environment: Lunar Lander

> Landing pad is always at coordinates $(0, 0)$. Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about $100..140$ points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is $+10$. Firing main engine is $-0.3$ points each frame. Solved is 200 points. Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Action is two real values vector from $-1$ to $+1$. First controls main engine, $-1..0$ off, $0..+1$ throttle from $50\%$ to $100\%$ power. Engine can't work with less than $50\%$ power. Second value $-1.0..-0.5$ fire left engine, $+0.5..+1.0$ fire right engine, $-0.5..0.5$ off.

In order to run the environment's visualizations in a notebook, we implemented a small wrapper around it.

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
import gym


class Lander:
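    def __init__(self):
        self.gym = gym.make('LunarLanderContinuous-v2')
        self.episode_reward = 0
        self._fig = None

    def reset(self):
        # Start a new episode and clear the reward accumulated so far.
        self.episode_reward = 0
        return self.gym.reset()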

    def step(self, action):
        state, reward, terminal, info = self.gym.step(action)
        self.episode_reward += reward
        return state, reward, terminal, info

    def render(self):
        if self._fig is None:
            self._fig = plt.figure()
            self._img = plt.imshow(self.gym.render(mode='rgb_array'))
        else:
            self._img.set_data(self.gym.render(mode='rgb_array'))
        self._fig.canvas.draw()

In [None]:
# env = Lander()
# state = env.reset()
# terminal = False
# while not terminal:
#     action = env.gym.action_space.sample()
#     _, _, terminal, _ = env.step(action)
#     env.render()

## Memory
While the paper takes note of prioritized replay methods, it only samples experiences uniformly from a fixed-size buffer. The straightforward way to implement this would be `collections.deque`, but random access into the middle of a deque is [expensive](https://wiki.python.org/moin/TimeComplexity), which makes sampling slow. We therefore implement a custom memory class backed by a plain list; once the list is full, a pointer marks the element that the next insert overwrites.

In [None]:
import random


class Memory:
    """Uniform replay memory with maximum size."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._buffer = []
        self._pointer = 0

    def __len__(self):
        return len(self._buffer)

    def add(self, experience):
        if len(self) < self.max_size:
            self._buffer.append(experience)
        else:
            self._buffer[self._pointer] = experience
            self._pointer = (self._pointer + 1) % self.max_size

    def sample(self, n):
        return random.sample(self._buffer, n)
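
To see the overwrite behaviour in isolation, here is a minimal, self-contained sketch. `RingBuffer` is a hypothetical stand-in that mirrors the `Memory` class above, and plain integers stand in for experience tuples.

```python
import random


class RingBuffer:
    """Hypothetical stand-in mirroring the Memory class above."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._buffer = []
        self._pointer = 0

    def add(self, item):
        if len(self._buffer) < self.max_size:
            self._buffer.append(item)  # Still filling up.
        else:
            self._buffer[self._pointer] = item  # Overwrite the oldest entry.
            self._pointer = (self._pointer + 1) % self.max_size

    def sample(self, n):
        return random.sample(self._buffer, n)  # Uniform, without replacement.


buffer = RingBuffer(max_size=3)
for experience in range(5):  # Integers stand in for experience tuples.
    buffer.add(experience)

contents = list(buffer._buffer)
print(contents)  # The two oldest entries (0 and 1) have been overwritten.
batch = buffer.sample(2)
```

Insertion stays constant-time once the buffer is full, and `random.sample` can index the underlying list directly.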

## Agent

This is a fairly abstract implementation of a reinforcement learning agent. The actual RL model is plug and play, as long as it provides the `train` and `get_action` methods used below.

In [None]:
from itertools import count
from typing import Tuple
import numpy as np


class Agent:
    """A reinforcement learning agent."""
    theta = 0.15
    sigma = 0.2
    batchsize = 64
    eps = 1
    eps_min = 0
    eps_rate = 1 / 100000

    def __init__(self, env, model, render=False, memory_size=100000):
        """Create a new reinforcement learning agent."""
        self.memory = Memory(memory_size)
        self.env = env
        self.model = model
        self.render = render

    def train(self, episodes: int):
        """Training loop."""
        stats = []
        total_steps = 0
        for episode in range(1, episodes + 1):
            state = self.env.reset()
            for step in count():
                # Perform action, store new experience, train model.
                action = self._get_action(state)
                state_, reward, terminal, _ = self.env.step(action)
                self.memory.add((state, action, reward, state_, terminal))
                state = state_  # Next state becomes current state.
                if self.render:
                    self.env.render()
                if len(self.memory) >= self.batchsize:
                    self.model.train(self.memory.sample(self.batchsize))
                if terminal:  # Start new episode if in terminal state.
                    stats.append((self.env.episode_reward, step))
                    break
            total_steps += step
            if episode % 100 == 0:
                stats = np.asarray(stats)
                print('Episode {}, max reward/steps {:.2f}/{:.2f}, average reward/steps {:.2f}/{:.2f}'
                      .format(episode, *stats.max(0), *stats.mean(0)))
                print('total_steps', total_steps)
                stats = []

    def _get_action(self, state) -> Tuple[float]:
        if self.eps > self.eps_min:
            self.eps -= self.eps_rate
        return self.model.get_action(state, self.eps)

## DDPG Model

While we will again create a class for the Deep Deterministic Policy Gradient model, we implement some of its parts as functions outside of the class in order to better walk through them in this notebook. When implementing this as a script one would want to integrate them all into the class.

In [None]:
import tensorflow as tf
from tensorflow.contrib.framework import get_variables

### Actor and Critic Networks
Our model consists of two different networks -- an actor and a critic network. The problem with that approach is that during training we not only optimize those networks, we also use them to control our agent. Manipulating the online policy creates a feedback loop, which leads to instability. While we already use a big memory buffer to mitigate this problem, the authors propose to additionally use two sets of parameters for each network.

In the implementation this leads to four networks in total, two actor and two critic networks, an online and a target version of each. While the online networks will be used for online predictions and will be updated at every timestep, the target networks will be used for determining the directions in which the online networks should be updated. The target networks are in turn updated using the online networks' weights -- more on that below.

#### Abstract Networks
In order to easily model our networks we use the `namedtuple` collection. Similar to class instances, named tuples allow dot access to their members. For the target networks we only need the network outputs and internal variables, because we will directly manipulate them using the corresponding online network's variables. For the online networks we additionally need a reference to their gradient descent optimizers (for the online training) and direct access to the gradients themselves -- more on that in the Actor section.

**TODO:** Refresh this section.

In [None]:
from collections import namedtuple

Network = namedtuple('Network', ['y', 'vars', 'ops'])

**TODO**: Explain `dense` layer function or get rid of it.

In [None]:
def dense(x, units, activation, bound=None, decay=None):
    if bound is None:
        bound = float(x.shape[1].value) ** -.5
    if decay is not None:
        decay = tf.contrib.layers.l2_regularizer(0.001)
    kernel = tf.random_uniform_initializer(-bound, bound)
    bias = tf.random_uniform_initializer(-bound, bound)
    return tf.layers.dense(x, units, activation=activation,
                           kernel_initializer=kernel,
                           bias_initializer=bias,
                           kernel_regularizer=decay)
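
When no `bound` is given, `dense` above draws weights and biases uniformly from $[-1/\sqrt{f}, 1/\sqrt{f}]$, where $f$ is the layer's fan-in (its number of inputs). As a rough illustration, the plain-Python sketch below computes that bound for layer sizes used later in this notebook. `uniform_bound` is a hypothetical helper, and the fan-ins assume the 8 state and 2 action dimensions of the Lunar Lander setup.

```python
def uniform_bound(fan_in):
    """Half-width of the uniform init range, 1 / sqrt(fan_in)."""
    return float(fan_in) ** -0.5


# Fan-ins of the critic layers: 8 state inputs, 400 hidden units,
# and 400 hidden units plus 2 action inputs after the concatenation.
bounds = {fan_in: uniform_bound(fan_in) for fan_in in (8, 400, 402)}
for fan_in, bound in bounds.items():
    print('fan-in {:3d}: values drawn from [-{:.4f}, {:.4f}]'
          .format(fan_in, bound, bound))
```

Wider layers therefore start with smaller weights. The output layers below pass an explicit bound of `3e-4` instead, which keeps the initial outputs close to zero.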

#### Critic
The critic is the Q value function or Bellman equation approximator. A Q value describes the expected cumulative reward of taking an action in a given state, which would normally be determined through dynamic programming. The critic maps a state/action pair to a single scalar value. This stands in contrast to Deep Q-Networks (Mnih et al., 2015), where the Q network maps the environment's state to multiple Q values, one for each action. This is because in our case the Q network is not used to determine which action to take, but only to *criticize* whatever action the actor network decided on taking.

For the critic network we stick strictly to the structure described in the paper:

  - Two hidden layers with ReLU activation and 400 and 300 neurons respectively.
  - Batch normalization applied to the input and first hidden layer.
  - Actions enter the network after the first hidden layer.

As is common in Deep Q-Networks, the single output neuron uses a linear activation.

In [None]:
def critic(x, actions, name='online'):
    """Build a critic network q, the value function approximator."""
    in_dim = x.shape[1].value
    with tf.variable_scope(name) as scope:
        training = tf.shape(x)[0] > 1
        norm_0 = tf.layers.batch_normalization(x, training=training)
        hidden_1 = dense(norm_0, 400, tf.nn.relu, decay=True)
        norm_1 = tf.layers.batch_normalization(hidden_1, training=training)
        hidden_1_ = tf.concat([norm_1, actions], axis=1)
        hidden_2 = dense(hidden_1_, 300, tf.nn.relu, decay=True)
        y = dense(hidden_2, 1, tf.identity, 3e-4, True)
        q_values = tf.squeeze(y)
        batch_ops = get_variables(scope, collection=tf.GraphKeys.UPDATE_OPS)
        variables = get_variables(scope)
    return Network(q_values, variables, batch_ops)

**TODO:** Explain critic optimization.

In [None]:
def train_critic(critic: Network, qtargets: tf.Tensor):
    """Build critic network optimizer minimizing MSE."""
    with tf.variable_scope('critic'):
        optimizer = tf.train.AdamOptimizer(1e-3)
        mse = tf.reduce_mean(tf.square(qtargets - critic.y))
        with tf.control_dependencies(critic.ops):
            return optimizer.minimize(mse)
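
The `qtargets` passed to `train_critic` are the Bellman targets $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}))$ from the paper, with the bootstrap term dropped for terminal transitions. The following plain-Python sketch walks through that arithmetic. All numbers (rewards, target Q values, terminal flags and predictions) are made up for illustration and do not come from a real run.

```python
gamma = 0.99  # Discount factor.

# Illustrative batch of three transitions (made-up values).
rewards = [1.0, -0.3, 100.0]
next_qvalues = [10.0, 5.0, 42.0]  # Target critic's estimate for the next state.
terminals = [False, False, True]

# y = r + gamma * Q'(s', mu'(s')), without bootstrapping past a terminal state.
qtargets = [r + gamma * q * (not done)
            for r, q, done in zip(rewards, next_qvalues, terminals)]

# Mean squared error against some (made-up) online critic predictions.
predictions = [10.0, 5.0, 90.0]
mse = sum((y - p) ** 2 for y, p in zip(qtargets, predictions)) / len(qtargets)
print(qtargets, mse)
```

Note how the third transition ignores its next-state estimate entirely: the episode ended there, so only the immediate reward counts.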

#### Actor

**TODO:** Discuss gradient computation/application.

In [None]:
def actor(x, dim_out, name='online'):
    """Build an actor network mu, the policy function approximator."""
    with tf.variable_scope(name) as scope:
        training = tf.shape(x)[0] > 1
        norm_0 = tf.layers.batch_normalization(x, training=training)
        hidden_1 = dense(norm_0, 400, tf.nn.relu)
        norm_1 = tf.layers.batch_normalization(hidden_1, training=training)
        hidden_2 = dense(norm_1, 300, tf.nn.relu)  # Feed normalized activations.
        norm_2 = tf.layers.batch_normalization(hidden_2, training=training)
        actions = dense(norm_2, dim_out, tf.nn.tanh, 3e-4)
        batch_ops = get_variables(scope, collection=tf.GraphKeys.UPDATE_OPS)
        variables = get_variables(scope)
    return Network(actions, variables, batch_ops)

**TODO**: Explain actor optimization. Discuss (action/policy) gradient computation/application thoroughly.

In [None]:
def train_actor(actor: Network, critic: Network, actions: tf.Tensor):
    """Build actor network optimizier performing action gradient ascent."""
    with tf.variable_scope('actor'):
        optimizer = tf.train.AdamOptimizer(1e-4)
        action_gradient, = tf.gradients(critic.y, actions)
        policy_gradients = tf.gradients(actor.y, actor.vars, -action_gradient)
        gradient_pairs = zip(policy_gradients, actor.vars)
        with tf.control_dependencies(actor.ops):
            return optimizer.apply_gradients(gradient_pairs)
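
The two `tf.gradients` calls above apply the chain rule $\nabla_\theta J \approx \nabla_a Q(s, a)|_{a=\mu(s)} \, \nabla_\theta \mu(s)$, and the minus sign turns the optimizer's descent step into ascent on $Q$. As a toy illustration under strong assumptions, the same update can be written out by hand. The scalar linear policy $\mu(s) = w s$ and the hand-written critic $Q(s, a) = -(a - 2s)^2$ are hypothetical and appear nowhere else in this notebook.

```python
def action_gradient(state, action):
    """dQ/da for the toy critic Q(s, a) = -(a - 2 * s) ** 2."""
    return -2.0 * (action - 2.0 * state)


def policy_gradient(state):
    """d(mu)/dw for the toy policy mu(s) = w * s."""
    return state


w = 0.0  # Policy parameter; the toy critic is maximized at w = 2.
learning_rate = 0.05
states = [0.5, 1.0, 1.5, 2.0]

for _ in range(200):
    # Chain rule, averaged over the batch: dQ/dw = dQ/da * da/dw.
    grad = sum(action_gradient(s, w * s) * policy_gradient(s)
               for s in states) / len(states)
    w += learning_rate * grad  # Gradient ascent on Q.

print(round(w, 4))
```

The policy parameter moves towards the value the critic rates highest, without the policy ever seeing a reward directly. This is the role the critic plays for the actor in DDPG.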

#### Target Network Updates
While the online networks are trained directly, the target networks are only ever updated from the online networks' parameters. For this, the paper describes a process named *soft updates*, which moves the target network's parameters only slowly in the direction of the online network. The original Deep Q-Network and the Double Deep Q-Network approaches instead copy the parameters over directly at fixed intervals.

##### Initial Hard Update
To ensure that the online and target networks start out equal, we first implement hard parameter copying. This function is only used once, right after variable initialization, so that both networks start off from the same foundation.

In [None]:
def hard_updates(src: Network, dst: Network):
    """Overwrite target with online network parameters."""
    return [target.assign(online)
            for online, target in zip(src.vars, dst.vars)]

##### Soft Update
The soft update uses the same assign operation as above, but instead of overwriting the target network's parameters it blends the online and target parameters together. `tau` determines how strongly the new values influence the old ones.

NOTE: *This could also be implemented using moving averages over the online networks. Might be more efficient?*

In [None]:
def soft_updates(src: Network, dst: Network, tau):
    """Soft update the dst net's parameters using those of the src net."""
    return [target.assign(tau * online + (1 - tau) * target)
            for online, target in zip(src.vars, dst.vars)]
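
Each soft update computes $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$. The plain-Python sketch below uses scalars in place of weight tensors and keeps the online weight frozen (a simplification; during training the online weights keep moving) to show how slowly the target follows for $\tau = 10^{-3}$.

```python
def soft_update(online, target, tau):
    """Blend the online parameter into the target parameter."""
    return tau * online + (1 - tau) * target


tau = 1e-3
online, target = 1.0, 0.0  # Scalars standing in for weight tensors.

history = []
for step in range(1, 5001):
    target = soft_update(online, target, tau)
    if step in (1, 1000, 5000):
        history.append((step, target))

for step, value in history:
    print('after {:4d} updates: target = {:.4f}'.format(step, value))

# With a fixed online value the remaining gap shrinks by (1 - tau) per update.
gap = online - target
```

After 1000 updates the target has covered roughly 63% of the distance, so the targets used for the critic change only gradually between training steps.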

### Noise

**TODO:** Explain Ornstein-Uhlenbeck process noise and RL exploration strategies.

Quote from Lillicrap et al.:

> For the exploration noise process we used temporally correlated noise in order to explore well in physical environments that have momentum. We used an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) with θ = 0.15 and σ = 0.2. The Ornstein-Uhlenbeck process models the velocity of a Brownian particle with friction, which results in temporally correlated values centered around 0.

In [None]:
def noise(n, theta=.15, sigma=.2):
    state = tf.Variable(tf.zeros((n,)))
    noise = -theta * state + sigma * tf.random_normal((n,))
    return state.assign_add(noise)
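
Every evaluation of the op above advances the process by one step: $x_{t+1} = x_t - \theta x_t + \sigma \epsilon_t$ with $\epsilon_t \sim \mathcal{N}(0, 1)$. The same recurrence in plain Python, as a sketch, makes the temporal correlation visible by comparing it with uncorrelated Gaussian noise. `ou_noise`, `lag1_autocorrelation` and the fixed seeds are assumptions for illustration.

```python
import random


def ou_noise(steps, theta=0.15, sigma=0.2, seed=0):
    """Sample a 1-D Ornstein-Uhlenbeck path centered around 0."""
    rng = random.Random(seed)
    state, path = 0.0, []
    for _ in range(steps):
        state += -theta * state + sigma * rng.gauss(0, 1)
        path.append(state)
    return path


def lag1_autocorrelation(xs):
    """Correlation between consecutive samples."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


rng = random.Random(1)
white = [0.2 * rng.gauss(0, 1) for _ in range(10000)]
correlated = ou_noise(10000)

ou_corr = lag1_autocorrelation(correlated)
white_corr = lag1_autocorrelation(white)
print('OU: {:.2f}, white: {:.2f}'.format(ou_corr, white_corr))
```

For $\theta = 0.15$ consecutive samples should correlate at roughly $1 - \theta = 0.85$, while the white noise stays near 0. The correlated noise therefore pushes the agent in one direction for several steps in a row instead of jittering around the policy's action.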

### Bringing it all together

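  - Initializes the networks and the session.
  - Resets the TensorFlow graph, because re-running a notebook cell would otherwise leave nodes from the last run in it.
  - Copies the initial online parameters to the target networks.
  - Provides a `train` function which performs one optimization step on each online network.
  - Target networks are soft-updated after every such training step.
  - Provides `get_action`.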

In [None]:
from itertools import chain
import tensorflow as tf


class DDPG:
    """Deep Deterministic Policy Gradient RL Model."""
    gamma = 0.99  # Discount factor
    theta = 0.15  # Ornstein-Uhlenbeck process theta
    sigma = 0.2  # Ornstein-Uhlenbeck process sigma

    def __init__(self, dim_in=2, dim_out=1, tau=1e-3):
        """Create a new DDPG model."""
        tf.reset_default_graph()  # Graph might contain nodes from last run.

        self.states = tf.placeholder(tf.float32, (None, dim_in), 'inputs')
        self.actions = tf.placeholder(tf.float32, (None, dim_out), 'actions')
        self.qtargets = tf.placeholder(tf.float32, (None,), 'qtargets')
        self.rewards = tf.placeholder(tf.float32, (None,), 'rewards')
        self.terminals = tf.placeholder(tf.bool, (None,), 'terminals')

        with tf.variable_scope('actor'):
            self.actor = actor(self.states, dim_out)
            self.actor_ = actor(self.states, dim_out, 'target')
            self.noise = noise(dim_out, self.theta, self.sigma)

        with tf.variable_scope('critic'):
            self.critic = critic(self.states, self.actions)
            self.critic_ = critic(self.states, self.actions, 'target')

        with tf.variable_scope('training'):
            self.critic_op = train_critic(self.critic, self.qtargets)
            self.actor_op = train_actor(self.actor, self.critic, self.actions)
            self.targets_op = (soft_updates(self.critic, self.critic_, tau) +
                               soft_updates(self.actor, self.actor_, tau))

        self.session = tf.Session()
        self.session.run(tf.global_variables_initializer())
        self.session.run(hard_updates(self.critic, self.critic_) +
                         hard_updates(self.actor, self.actor_))

    def train(self, batch):
        """Train the online and update the target networks."""
        states, actions, rewards, states_, terminals = zip(*batch)

        # Train critic.
        # 1. Get the target actor's actions for the next states.
        # 2. Get expected rewards for those actions from the target critic.
        # 3. Approximate bellman/dynamic programming equation.
        # 4. Update critic network minimizing MSE.
        actions_next = self.session.run(self.actor_.y, {self.states: states_})
        qvalues = self.session.run(self.critic_.y, {self.states: states_,
                                                    self.actions: actions_next})
        bellman = rewards + self.gamma * qvalues * np.invert(terminals)
        self.session.run(self.critic_op, {self.states: states,
                                          self.actions: actions,
                                          self.qtargets: bellman})

        # Train actor.
        # 1. Get most up-to-date actions for current state from online actor.
        # 2. Update actor network using policy gradient.
        actions_ = self.session.run(self.actor.y, {self.states: states})
        self.session.run(self.actor_op, {self.states: states,
                                         self.actions: actions_})

        # Update target networks using soft updates.
        self.session.run(self.targets_op)

    def get_action(self, state, exploration=0):
        actions, noise = self.session.run([self.actor.y, self.noise],
                                          {self.states: [state]})
        action = actions[0] + exploration * noise
        action[action < -1] = -1
        action[action > 1] = 1
        return action

## Let's Play

In [None]:
env = Lander()
print(env.gym.action_space.shape,
      env.gym.action_space.low,
      env.gym.action_space.high)
model = DDPG(dim_in=8, dim_out=2)
agent = Agent(env, model, render=True)
agent.train(1000)

In [None]:
env = Lander()
terminal = False
state = env.reset()
env.render()
while not terminal:
    action = model.get_action(state, False)
    state, _, terminal, _ = env.step(action)
    env.render()