# Continuous Deep Reinforcement Learning on Slot Car Racing

In this notebook we provide a quick introduction to the "Deep Deterministic Policy Gradient" (DDPG) algorithm presented in the 2016 paper "Continuous Control with Deep Reinforcement Learning" by Lillicrap et al. We demonstrate the algorithm in a slot car racing environment (also known by the brand name Carrera).

The following consists of multiple parts:

  - The custom `Carrera` environment -- a Python class which allows agents to create, reset and perform "steps" on a slot car racing track. It also provides abstractions for visualizing the algorithm's performance.
  - An `Agent` class which interacts with a provided environment.
  - The DDPG algorithm implemented in TensorFlow.

## Environment

In [None]:
%matplotlib notebook
import math
from typing import Tuple

import numpy as np
import matplotlib.pyplot as plt


class Carrera:
    """A simple carrera track, modeled by a maximum velocity function."""

    def __init__(self):
        """Create new carrera track environment."""
        self._track_len = 2 * math.pi
        self._position = 0
        self._velocity = 0
        self._positions = [0]
        self._velocities = [0]
        self._terminal = False
        self._episode_reward = 0
        self._max_velocities = np.vectorize(self._max_velocity)
        self._fig = None

    def _max_velocity(self, position: float):
        """Returns the maximum velocity for any position on the track."""
        return (math.sin(position) + 1.2) / 2.2

    def reset(self):
        """Reset environment."""
        self._position = 0
        self._velocity = 0
        self._terminal = False
        self._episode_reward = 0
        return (self._position, self._velocity)

    def step(self, acceleration=0.) -> Tuple[Tuple[float, float], float, bool]:
        """Perform a step in the environment.

        Returns new observation tuple (position, velocity), the reward and
        whether the episode has terminated.
        """
        # Accept plain floats as well as one-element arrays from the model.
        acceleration = min(1., float(np.squeeze(acceleration)))
        self._velocity = max(0.8 * self._velocity, acceleration)
        self._position += self._velocity / 100
        self._position %= self._track_len
        max_velocity = self._max_velocity(self._position)
        if self._velocity > max_velocity:  # Car flew off the track
            self._terminal = True
            return (self._position, self._velocity), -1, self._terminal
        reward = ((self._velocity - max_velocity) + 1) * acceleration
        self._episode_reward += reward
        return (self._position, self._velocity), reward, self._terminal

    def render(self) -> plt.Figure:
        """Render current state."""
        if self._position < self._positions[-1]:
            self._positions = [0]
            self._velocities = [0]
        if self._fig is None:
            self._fig = plt.figure()
            x = np.linspace(0, self._track_len, 1000)
            y = self._max_velocities(x)
            plt.plot(x, y)
            self.plot, = plt.plot([0], [0])
        self._velocities.append(self._velocity)
        self._positions.append(self._position)
        self.plot.set_data(self._positions, self._velocities)
        self._fig.canvas.draw()
        return self._fig

    @property
    def episode_reward(self) -> float:
        """Get cumulative reward for the whole episode."""
        return self._episode_reward

In [None]:
env = Carrera()
env.render()

## Memory
While the paper takes note of prioritized replay methods, it only samples experiences uniformly from a buffer of limited size. The straightforward way to implement this would be `collections.deque`, but sampling from such a queue is [expensive](https://wiki.python.org/moin/TimeComplexity), because indexed access to elements in the middle of a deque takes linear time. We therefore implement a custom memory class based on a plain list. It enforces the size limit through a pointer which dictates which element is overwritten on insert.
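
The cost difference can be checked with a small timing sketch. It uses only the standard library. The buffer size and the number of repetitions are arbitrary values chosen for illustration, the helper `sample_by_index` is hypothetical, and absolute timings will vary between machines.

```python
import random
import timeit
from collections import deque

SIZE = 100_000   # illustrative buffer size, not taken from the paper
BATCH = 64       # batch size, matching the agent's `batchsize`

as_list = list(range(SIZE))
as_deque = deque(as_list, maxlen=SIZE)


def sample_by_index(buffer, n):
    """Draw n distinct elements by picking random indices."""
    indices = random.sample(range(len(buffer)), n)
    return [buffer[i] for i in indices]


# Indexing into the middle of a deque is O(n), indexing a list is O(1).
list_time = timeit.timeit(lambda: sample_by_index(as_list, BATCH), number=200)
deque_time = timeit.timeit(lambda: sample_by_index(as_deque, BATCH), number=200)
print(f"list: {list_time:.4f}s, deque: {deque_time:.4f}s")
```

The gap grows with the buffer size, which matters for a replay memory holding up to a million experiences.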

In [None]:
import random


class Memory:
    """Uniform replay memory with maximum size."""

    def __init__(self, max_size):
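        # Cast to int: sizes like 1e6 are floats and would otherwise turn
        # the overwrite pointer into a float, which cannot index a list.
        self.max_size = int(max_size)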
        self._buffer = []
        self._pointer = 0

    def __len__(self):
        return len(self._buffer)

    def add(self, experience):
        if len(self) < self.max_size:
            self._buffer.append(experience)
        else:
            self._buffer[self._pointer] = experience
            self._pointer = (self._pointer + 1) % self.max_size

    def sample(self, n):
        return random.sample(self._buffer, n)

## Agent

A rather abstract implementation of a reinforcement learning agent. The actual RL model is plug and play, as long as it provides the `get_action` and `train` methods used below.

**TODO:** Implement Ornstein-Uhlenbeck noising for exploration. Quote from Lillicrap et al:

> For the exploration noise process we used temporally correlated noise in order to explore well in physical environments that have momentum. We used an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) with θ = 0.15 and σ = 0.2. The Ornstein-Uhlenbeck process models the velocity of a Brownian particle with friction, which results in temporally correlated values centered around 0.
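
A minimal sketch of such a noise process, assuming a simple Euler-Maruyama discretization. The values of θ and σ are those quoted above. The class name `OrnsteinUhlenbeckNoise`, the mean `mu=0`, the step size `dt=1` and the scalar (one-dimensional) state are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise (Euler-Maruyama discretization)."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.theta = theta  # pull strength towards mu (value from the paper)
        self.sigma = sigma  # noise scale (value from the paper)
        self.mu = mu        # long-run mean; assumed to be 0
        self.dt = dt        # step size; assumed, not given in the quote
        self._rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. when an episode starts."""
        self.state = self.mu

    def sample(self):
        """Advance the process by one step and return the new value."""
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self._rng.gauss(0.0, 1.0)
        self.state += drift + diffusion
        return self.state


noise = OrnsteinUhlenbeckNoise(seed=0)
print([round(noise.sample(), 3) for _ in range(5)])
```

In the agent, the value returned by `sample()` could take the place of the uncorrelated Gaussian term that is currently added to the action, with `reset()` called at the start of each episode.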

In [None]:
from itertools import count
from typing import Tuple
import numpy as np


class Agent:
    """A reinforcement learning agent."""
    theta = 0.15
    sigma = 0.2
    min_memory = 1e3
    batchsize = 64

    def __init__(self, env, model, render=False, memory_size=1e6):
        """Create a new reinforcement learning agent."""
        self.memory = Memory(memory_size)
        self.env = env
        self.model = model
        self.render = render

    def train(self, episodes: int):
        """Training loop."""
        total_steps = 0
        for episode in range(1, episodes + 1):
            state = self.env.reset()
            for step in count():
                # Perform action, store new experience, train model.
                action = self._get_action(state, True)
                state_, reward, terminal = self.env.step(action)
                self.memory.add((state, action, reward, state_, terminal))
                state = state_  # Next state becomes current state.
                if self.render and step % 10 == 0:
                    self.env.render()
                if len(self.memory) > self.min_memory:
                    self.model.train(self.memory.sample(self.batchsize))
                if terminal:  # Start new episode if in terminal state.
                    break
            total_steps += step

    def _get_action(self, state, exploration=False) -> Tuple[float]:
        action = self.model.get_action(state)
        if not exploration:
            return action
        return action + np.random.normal(0, 0.1, action.shape)

## DDPG Model

While we will again create a class for the Deep Deterministic Policy Gradient model, we implement some of its parts as functions outside of the class in order to walk through them more easily in this notebook. When implementing this as a script, one would want to integrate them all into the class.

In [None]:
import tensorflow as tf
from tensorflow.contrib.framework import get_variables

### Actor and Critic Networks
Our model consists of two different networks: an actor and a critic. The problem with this approach is that during training we not only optimize those networks, we also use them to steer our agent. Manipulating the online policy creates a feedback loop which leads to instability. While we already use a large memory buffer to mitigate this problem, the authors additionally propose using two sets of parameters for each network.

In the implementation this leads to four networks in total: an online and a target version of both the actor and the critic. The online networks are used for online predictions and are updated at every timestep, while the target networks are used to determine the direction in which the online networks should be updated. The target networks themselves are updated using the online networks' weights -- more on that below.

#### Abstract Networks
In order to easily model our networks we use the `namedtuple` collection. Similar to class instances, named tuples allow dot access to their members. For the target networks we only need the network outputs and internal variables, because we will directly manipulate them using the corresponding online network's variables. For the online networks we additionally need a reference to their gradient descent optimizers (for the online training) and direct access to the gradients themselves -- more on that in the Actor section.

In [None]:
from collections import namedtuple

Network = namedtuple('Network', ['y', 'variables'])
OptimizableNetwork = namedtuple('OptimizableNetwork',
                                ['y', 'variables', 'optimizer', 'gradients'])

#### Critic
The critic approximates the Q value function, which is defined by the Bellman equation. A Q value describes the expected return, i.e. the expected discounted sum of future rewards, for taking an action in a state; without function approximation it would normally be determined through dynamic programming. The critic maps a state and an action to a single scalar value. This stands in contrast to Deep Q-Networks (Mnih et al., 2015), where the Q network maps the environment's state to multiple Q values, one for each action. This is because in our case the Q network is not used to determine which action to take, but only to *criticize* whatever action the actor network decided on.

For the critic network we stick strictly to the structure described in the paper:

  - Two hidden layers with ReLU activation and 400 and 300 neurons respectively.
  - Batch normalization applied to the input and the first hidden layer.
  - Actions enter the network after the first hidden layer.

As is common in Deep Q-Networks, the single output neuron uses a linear activation.

**TODO:** Variable initialization

> The final layer weights and biases of both the actor and critic were initialized from a uniform distribution $[-3 \times 10^{-3}, 3 \times 10^{-3}]$ and $[3 \times 10^{-4}, 3 \times 10^{-4}]$ [...]. The other layers were initialized from uniform distributions $[-\frac{1}{\sqrt{f}}, \frac{1}{\sqrt{f}}]$
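
A minimal sketch of this initialization scheme in plain Python, assuming that $f$ denotes the fan-in (number of inputs) of a layer and using the $3 \times 10^{-3}$ bound for the final layer. The helper names `fan_in_uniform` and `final_layer_uniform` are hypothetical, and the layer sizes are only examples.

```python
import math
import random


def fan_in_uniform(fan_in, fan_out, rng):
    """Hidden layer weights drawn from U(-1/sqrt(f), 1/sqrt(f))."""
    bound = 1.0 / math.sqrt(fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]


def final_layer_uniform(fan_in, fan_out, rng, bound=3e-3):
    """Output layer weights drawn from U(-bound, bound)."""
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)
hidden = fan_in_uniform(400, 300, rng)     # e.g. a 400 -> 300 hidden layer
output = final_layer_uniform(300, 1, rng)  # e.g. a 300 -> 1 output layer
print(max(abs(w) for row in hidden for w in row) <= 1 / math.sqrt(400))
```

Small final-layer weights keep the initial outputs close to zero. In the TensorFlow code, the same bounds could presumably be passed to the dense layers through an initializer such as `tf.random_uniform_initializer`.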

In [None]:
def critic_network(x, actions, training, name):
    """Build a critic network q, the value function approximator."""
    with tf.variable_scope(name) as scope:
        decay = tf.contrib.layers.l2_regularizer(1e-2)
        l2 = {'kernel_regularizer': decay, 'bias_regularizer': decay}
        norm_0 = tf.layers.batch_normalization(x, training=training)
        hidden_1 = tf.layers.dense(norm_0, 400, activation=tf.nn.relu, **l2)
        norm_1 = tf.layers.batch_normalization(hidden_1, training=training)
        hidden_1_ = tf.concat([norm_1, actions], axis=1)
        hidden_2 = tf.layers.dense(hidden_1_, 300, activation=tf.nn.relu, **l2)
        y = tf.layers.dense(hidden_2, 1, activation=tf.identity, **l2)
        batch_ops = get_variables(scope, collection=tf.GraphKeys.UPDATE_OPS)
        variables = get_variables(scope)
    return Network(y, variables), batch_ops


def critic(x, actions, targets, training=False):
    """Build critic online and target network pair."""
    with tf.variable_scope('critic'):
        online, ops = critic_network(x, actions, training, 'online')
        target, ops_ = critic_network(x, actions, training, 'target')
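        # Squeeze the (batch, 1) outputs so they match the (batch,) targets;
        # otherwise the difference would broadcast to (batch, batch).
        loss = tf.reduce_mean(tf.squared_difference(
            targets, tf.squeeze(online.y, axis=1)))
        # tf.gradients returns a list; keep the single dQ/da tensor.
        gradients = tf.gradients(online.y, actions)[0]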
        with tf.control_dependencies(ops + ops_):
            optimizer = tf.train.AdamOptimizer(1e-3).minimize(loss)
    online = OptimizableNetwork(*online, optimizer, gradients)
    return online, target

#### Actor

**TODO:** Discuss gradient computation/application.
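
As a rough illustration of the gradient computation: the deterministic policy gradient applies the chain rule, multiplying the critic's gradient with respect to the action by the gradient of the action with respect to the actor's parameters. The `actor` function in this notebook does this by passing the negated critic gradients as `grad_ys` to `tf.gradients`, so that a minimizing optimizer ascends on Q. The sketch below replaces both networks with hypothetical stand-ins (a linear policy and a quadratic toy critic, neither taken from the paper) so that the update can be followed by hand.

```python
def policy(weights, state):
    """Toy linear policy mu(s) = w . s (stand-in for the actor network)."""
    return sum(w * s for w, s in zip(weights, state))


def dq_da(action, best_action):
    """Gradient of a toy critic Q(s, a) = -(a - best_action)**2 w.r.t. a."""
    return -2.0 * (action - best_action)


def actor_update(weights, state, best_action, learning_rate=0.1):
    """One ascent step: dQ/dw = dQ/da * dmu/dw, where dmu/dw = state."""
    action = policy(weights, state)
    grad_a = dq_da(action, best_action)
    return [w + learning_rate * grad_a * s for w, s in zip(weights, state)]


weights = [0.0, 0.0]
state = (1.0, 0.5)  # e.g. (position, velocity)
for _ in range(50):
    weights = actor_update(weights, state, best_action=0.8)
print(round(policy(weights, state), 3))
```

Each step moves the policy's action towards the action the toy critic rates highest, without the actor ever seeing a reward directly.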

**TODO:** Variable initialization

> The final layer weights and biases of both the actor and critic were initialized from a uniform distribution $[-3 \times 10^{-3}, 3 \times 10^{-3}]$ and $[3 \times 10^{-4}, 3 \times 10^{-4}]$ [...]. The other layers were initialized from uniform distributions $[-\frac{1}{\sqrt{f}}, \frac{1}{\sqrt{f}}]$

In [None]:
def actor_network(x, training, name):
    """Build an actor network mu, the policy function approximator."""
    with tf.variable_scope(name) as scope:
        norm_0 = tf.layers.batch_normalization(x, training=training)
        hidden_1 = tf.layers.dense(norm_0, 400, activation=tf.nn.relu)
        norm_1 = tf.layers.batch_normalization(hidden_1, training=training)
        # Feed the batch-normalized activations into the next layer.
        hidden_2 = tf.layers.dense(norm_1, 300, activation=tf.nn.relu)
        norm_2 = tf.layers.batch_normalization(hidden_2, training=training)
        y = tf.layers.dense(norm_2, 1, activation=tf.nn.tanh)
        batch_ops = get_variables(scope, collection=tf.GraphKeys.UPDATE_OPS)
        variables = get_variables(scope)
    return Network(y, variables), batch_ops


def actor(x, critic_gradients, training=False):
    """Build actor online and target network pair."""
    with tf.variable_scope('actor'):
        online, ops = actor_network(x, training, 'online')
        target, ops_ = actor_network(x, training, 'target')
        inverse_gradients = tf.multiply(-1., critic_gradients)
        # Chain rule: backpropagate -dQ/da through the actor (as grad_ys).
        gradients = tf.gradients(online.y, online.variables, inverse_gradients)
        pairs = zip(gradients, online.variables)
        with tf.control_dependencies(ops + ops_):
            optimizer = tf.train.AdamOptimizer(1e-4).apply_gradients(pairs)
    online = OptimizableNetwork(*online, optimizer, gradients)
    return online, target

#### Target Network Updates
While the online networks are trained directly (thus the *OptimizableNetwork* name), the target networks are only updated indirectly, using the online networks' parameters. For this the paper describes a process named *soft updates*, which only slowly moves the target networks' parameters in the direction of the online networks' parameters. The original Deep Q-Network and the Double Deep Q-Network approaches instead periodically copy the parameters over directly.

##### Initial Hard Update
To ensure that the online and target networks start out equal, we first implement hard parameter copying. This function is only used once, right after variable initialization, so that the online and target networks start off from the same foundation.

In [None]:
def hard_updates(src: Network, dst: Network):
    """Overwrite target with online network parameters."""
    return [target.assign(online)
            for online, target in zip(src.variables, dst.variables)]

##### Soft Update
The soft update uses the same assign operation as above. Instead of overwriting the target network's parameters, however, it blends the online and target parameters: $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$. Here `tau` describes how strongly the new values influence the old values.
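
The effect of `tau` can be followed on plain numbers. The following is a small sketch in pure Python, independent of TensorFlow. The function name `soft_update` and the parameter values are made up for illustration.

```python
def soft_update(online, target, tau=0.001):
    """Return target parameters moved a fraction tau towards the online ones."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online, target)]


online = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(1000):
    target = soft_update(online, target)
print([round(t, 3) for t in target])
```

With `tau = 0.001` the target parameters have covered roughly 63% of the distance to the (here fixed) online parameters after 1000 updates, since the remaining gap shrinks by a factor of `1 - tau` per update. Setting `tau = 1` reduces the soft update to the hard update.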

In [None]:
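# NOTE: tau is a plain float. A tf.constant default would be created once at
# definition time and belong to a graph that is later reset.
def soft_updates(src: Network, dst: Network, tau=0.001):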
    """Soft update the dst net's parameters using those of the src net."""
    return [target.assign(tau * online + (1 - tau) * target)
            for online, target in zip(src.variables, dst.variables)]

### Bringing it all together

  - Resets the TensorFlow graph, so that the notebook cell can be re-run without name clashes.
  - Initializes the networks and the session.
  - Copies the initial parameters to the target networks.
  - Provides a `train` function which performs one SGD step on a batch.
  - Soft-updates the target networks after every SGD step.
  - Provides `get_action`.

In [None]:
import numpy as np
import tensorflow as tf


class DDPG:
    """Deep Deterministic Policy Gradient RL Model."""
    gamma = 0.99  # Discount factor

    def __init__(self, din=2, dout=1):
        """Create a new DDPG model."""
        tf.reset_default_graph()

        self.states = tf.placeholder(tf.float32, (None, din), 'states')
        self.actions = tf.placeholder(tf.float32, (None, dout), 'actions')
        self.targets = tf.placeholder(tf.float32, (None,), 'targets')
        self.training = tf.placeholder_with_default(False, None, 'training')

        self.critic, self.critic_ = critic(self.states, self.actions,
                                           self.targets, self.training)
        self.actor, self.actor_ = actor(self.states, self.critic.gradients,
                                        self.training)

        self.soft_updates = (soft_updates(self.critic, self.critic_) +
                             soft_updates(self.actor, self.actor_))

        self.session = tf.Session()
        self.session.run(tf.global_variables_initializer())

        self.session.run(hard_updates(self.critic, self.critic_) +
                         hard_updates(self.actor, self.actor_))

    def train(self, batch):
        """Train the online, maybe update the target networks.

        NOTE: The whole target computation could be moved to TensorFlow by
        connecting the target actor outputs directly to the gradients instead
        of requesting them from the session and feeding them back in. This
        should be implemented, but might require some refactoring. It would
        reduce this whole block to a single session run.

        NOTE: Currently this completely ignores terminals -- not sure if
        that's desired. DQN normally only takes future rewards into
        consideration for states which are not terminal states. Lillicrap
        et al. do not make this distinction.
        """
        states, actions, rewards, states_, terminals = zip(*batch)
        actions_ = self.session.run(self.actor_.y, {self.states: states_})
        q_values = self.session.run(self.critic_.y, {self.states: states_,
                                                     self.actions: actions_})
        q_values = np.squeeze(q_values)
        targets = np.asarray(rewards) + self.gamma * q_values
        # DQN variant (no future rewards for terminal states):
        # targets = rewards + self.gamma * q_values * np.invert(terminals)
        self.session.run([self.critic.optimizer, self.actor.optimizer],
                         {self.states: states, self.targets: targets,
                          self.actions: actions, self.training: True})

        # Update target networks.
        self.session.run(self.soft_updates)

    def get_action(self, state):
        action, = self.session.run(self.actor.y, {self.states: [state]})
        return action
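
The NOTE on terminal states in `train` can be made concrete with a small sketch in plain Python. It computes the targets $y = r + \gamma Q'(s', \mu'(s'))$ both with and without masking the bootstrap term for terminal transitions. The function name `td_targets`, its `mask_terminals` flag and the example numbers are hypothetical.

```python
def td_targets(rewards, next_q_values, terminals, gamma=0.99,
               mask_terminals=True):
    """Compute y = r + gamma * Q'(s', mu'(s')) for a batch.

    With mask_terminals=True the bootstrap term is dropped for terminal
    transitions (the DQN convention); with False it is always kept.
    """
    targets = []
    for reward, next_q, terminal in zip(rewards, next_q_values, terminals):
        bootstrap = 0.0 if (mask_terminals and terminal) else gamma * next_q
        targets.append(reward + bootstrap)
    return targets


rewards = [0.5, -1.0]      # the second transition is a crash (reward -1)
next_q = [2.0, 3.0]        # made-up target critic outputs
terminals = [False, True]
print(td_targets(rewards, next_q, terminals))
print(td_targets(rewards, next_q, terminals, mask_terminals=False))
```

Without masking, the crash transition still receives credit for rewards that can never be collected, which may soften the penalty for leaving the track.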

## Let's Play

In [None]:
env = Carrera()
model = DDPG()
agent = Agent(env, model, render=True)
agent.train(1000)