# Continuous Deep Reinforcement Learning on Slot Car Racing

In this notebook we provide a quick introduction to the "Deep Deterministic Policy Gradient" (DDPG) algorithm presented in the 2016 paper "Continuous Control with Deep Reinforcement Learning" by Lillicrap et al. We demonstrate the algorithm in a slot car racing (also known by the brand name Carrera) environment.

The following consists of multiple parts:

  - The custom `Carrera` environment -- a Python class which allows agents to create, reset and perform "steps" on a slot car racing track. It also provides abstractions for visualizing the algorithm's performance.
  - An `Agent` class which interacts with a provided environment.
  - The DDPG algorithm implemented in TensorFlow.

## Environment

In [None]:
import math
from typing import Tuple

import numpy as np
import matplotlib.pyplot as plt


class Carrera:
    """A simple carrera track, modeled by a maximum velocity function."""

    def __init__(self):
        """Create new carrera track environment."""
        self._track_len = 2 * math.pi
        self._position = 0
        self._velocity = 0
        self._terminal = False
        self._episode_reward = 0

    def _max_velocity(self, position: float):
        """Returns the maximum velocity for any position on the track."""
        return (math.sin(position) + 1) / 2

    def reset(self):
        """Reset environment."""
        self._position = 0
        self._velocity = 0
        self._terminal = False
        self._episode_reward = 0
        return (self._position, self._velocity)

    def step(self, acceleration=0.) -> Tuple[Tuple[float, float], float, bool]:
        """Perform a step in the environment.

        Returns the new observation tuple (position, velocity), the reward
        and the terminal flag.
        """
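        # Accept plain floats as well as one-element arrays from the agent.
        acceleration = float(np.squeeze(acceleration))
        self._velocity = max(0.8 * self._velocity, min(1., acceleration))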
        self._position = (self._position + (self._velocity /
                                            self._track_len)) % self._track_len
        max_velocity = self._max_velocity(self._position)
        if self._velocity > max_velocity:  # Car flew off the track
            self._terminal = True
            return (self._position, self._velocity), -1, self._terminal
        reward = (self._velocity - max_velocity) + 1
        self._episode_reward += reward
        return (self._position, self._velocity), reward, self._terminal

    def render(self, fig=None) -> plt.Figure:
        """Render current state."""
        if fig is None:
            fig = plt.figure()
        x = np.linspace(0, self._track_len, 1000)
        y = np.vectorize(self._max_velocity)(x)
        plt.plot(x, y)
        fig.canvas.draw()
        return fig

    @property
    def episode_reward(self) -> float:
        """Get cumulative reward for the whole episode."""
        return self._episode_reward

In [None]:
env = Carrera()
env.render()
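
As a compact summary, the dynamics can be read off the `step` method above as follows. The symbols are our own shorthand and do not appear in the code: $p_t$ is the position, $v_t$ the velocity, $a_t$ the chosen acceleration, $r_{t+1}$ the reward and $L = 2\pi$ the track length.

$$v_{t+1} = \max\bigl(0.8\, v_t,\ \min(1, a_t)\bigr), \qquad p_{t+1} = \Bigl(p_t + \frac{v_{t+1}}{L}\Bigr) \bmod L, \qquad v_{\max}(p) = \frac{\sin(p) + 1}{2}$$

$$r_{t+1} = \begin{cases} -1 & \text{if } v_{t+1} > v_{\max}(p_{t+1}) \text{ (terminal)} \\ v_{t+1} - v_{\max}(p_{t+1}) + 1 & \text{otherwise} \end{cases}$$

The closer the car drives to the local maximum velocity without exceeding it, the closer the reward gets to 1.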

## Memory
While the paper takes note of prioritized replay methods, it only samples experiences uniformly from a buffer of limited size. The straightforward approach would be to use `collections.deque`, but sampling from such a queue is [expensive](https://wiki.python.org/moin/TimeComplexity), because indexing into the middle of a deque takes linear time. We therefore implement a custom memory class backed by a plain list, which enforces the size limit through a pointer that marks the next element to overwrite.

In [None]:
import random


class Memory:
    """Uniform replay memory with maximum size."""

    def __init__(self, max_size):
        self.max_size = int(max_size)  # Accept floats such as 1e6.
        self._buffer = []
        self._pointer = 0

    def __len__(self):
        return len(self._buffer)

    def add(self, experience):
        if len(self) < self.max_size:
            self._buffer.append(experience)
        else:
            self._buffer[self._pointer] = experience
            self._pointer = (self._pointer + 1) % self.max_size

    def sample(self, n):
        return random.sample(self._buffer, n)
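
The overwrite behaviour can be sketched in isolation. The helper `ring_buffer_add` below is hypothetical and not part of the notebook; it only mirrors the logic of `Memory.add` on a plain list, with integers standing in for experience tuples.

```python
def ring_buffer_add(buffer, pointer, item, max_size):
    """Append until full, then overwrite the oldest slot.

    Returns the new pointer position.
    """
    if len(buffer) < max_size:
        buffer.append(item)
        return pointer
    buffer[pointer] = item
    return (pointer + 1) % max_size


buffer, pointer = [], 0
for experience in range(5):  # Integers stand in for experience tuples.
    pointer = ring_buffer_add(buffer, pointer, experience, max_size=3)

print(buffer)   # [3, 4, 2]
print(pointer)  # 2
```

Once the list is full, the two oldest entries (0 and 1) have been replaced and the pointer marks the slot holding the oldest remaining entry.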

## Agent

A rather abstract implementation of a reinforcement learning agent. The actual RL model is plug and play, as long as it provides the `get_action` and `train` methods used below.

**TODO:** Implement Ornstein-Uhlenbeck noise for exploration; the agent currently adds uncorrelated Gaussian noise. Quote from Lillicrap et al.:

> For the exploration noise process we used temporally correlated noise in order to explore well in physical environments that have momentum. We used an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) with θ = 0.15 and σ = 0.2. The Ornstein-Uhlenbeck process models the velocity of a Brownian particle with friction, which results in temporally correlated values centered around 0.
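
A minimal sketch of such a noise process, assuming a simple Euler-Maruyama discretization, could look as follows. The class name `OrnsteinUhlenbeckNoise`, the step size `dt` and the mean `mu` are our own assumptions; only `theta` and `sigma` come from the quote above. It is not wired into the `Agent` class below.

```python
import numpy as np


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise centered around `mu` (hypothetical helper)."""

    def __init__(self, size=1, theta=0.15, sigma=0.2, mu=0.0, dt=1.0,
                 seed=None):
        self.size = size
        self.theta = theta
        self.sigma = sigma
        self.mu = mu
        self.dt = dt  # Assumed step size; the quote does not specify one.
        self._rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = np.full(self.size, self.mu, dtype=float)

    def sample(self):
        """Advance the process by one discrete step and return the new state."""
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = (self.sigma * np.sqrt(self.dt)
                     * self._rng.standard_normal(self.size))
        self.state = self.state + drift + diffusion
        return self.state


noise = OrnsteinUhlenbeckNoise(size=1, seed=0)
samples = np.array([noise.sample() for _ in range(5)])
print(samples.shape)  # (5, 1)
```

Because each sample starts from the previous state, consecutive values are correlated, unlike the independent draws from `np.random.normal`.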

In [None]:
from itertools import count
import numpy as np


class Agent:
    """A reinforcement learning agent."""
    theta = 0.15
    sigma = 0.2
    min_memory = 1e3

    def __init__(self, env, model, memory_size=1e6):
        """Create a new reinforcement learning agent."""
        self.memory = Memory(memory_size)
        self.env = env
        self.model = model
        self.noise = .1

    def train(self, episodes: int):
        """Training loop."""
        total_steps = 0
        for episode in range(1, episodes + 1):
            state = self.env.reset()
            for step in count():
                # Perform action, store new experience, train model.
                action = self._get_action(state, True)
                state_, reward, terminal = self.env.step(action)
                self.memory.add((state, action, reward, state_, terminal))
                state = state_  # Next state becomes current state.
                if len(self.memory) > self.min_memory:
                    self.model.train(self.memory.sample(64))
                if terminal:  # Start new episode if in terminal state.
                    break
            total_steps += step + 1  # count() starts at zero.

    def _get_action(self, state, exploration=False):
        action = self.model.get_action(state)
        if not exploration:
            return action
        return action + np.random.normal(0, self.sigma, action.shape)

## DDPG Model

While we will again create a class for the Deep Deterministic Policy Gradient model, we implement some of its parts as functions outside of the class so that we can walk through them more easily in this notebook. When implementing this as a script, one would want to integrate them all into the class.

In [None]:
import tensorflow as tf

### Actor and Critic Networks
Our model consists of two different networks -- an actor and a critic network. The problem with that approach is that during training we not only optimize those networks, we also use them to steer our agent. Manipulating the online policy creates a feedback loop, which leads to instability. While we already use a big memory buffer to mitigate this problem, the authors propose to additionally use two sets of parameters for each network.

In the implementation this leads to four networks in total, two actor and two critic networks, an online and a target version for each. While the online networks will be used for online predictions and will be updated at every timestep, the target networks will be used for determining the directions in which the online networks should be updated. From time to time the target networks will be updated using the online networks' weights -- more on that below.

#### Abstract Networks
In order to easily model our networks we use typed named tuples (`typing.NamedTuple`). Similar to class instances, named tuples allow dot-access to their members. For the target networks we only need the network outputs and internal variables, because we will directly manipulate them using the corresponding online network's variables. For the online networks we additionally need a reference to their gradient descent optimizers (for the online training) and direct access to the gradients themselves -- more on that in the Actor section.

In [None]:
from typing import NamedTuple, List


class Network(NamedTuple):
    y: tf.Tensor
    variables: List[tf.Tensor]


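# NamedTuple does not support multiple inheritance, so the fields of
# Network are repeated here instead of being inherited.
class OptimizableNetwork(NamedTuple):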
    y: tf.Tensor
    variables: List[tf.Tensor]
    optimizer: tf.Operation
    gradients: List[tf.Tensor]

#### Critic
The critic approximates the Q value function, which satisfies the Bellman equation. Q values describe the expected cumulative reward of taking an action in a given state, which would normally be determined through dynamic programming. The critic maps state and action to a single scalar value. This stands in contrast to Deep Q-Networks (Mnih et al., 2015), where the Q network maps the environment's state to multiple Q values, one for each action. This is because in our case the Q network is not used to determine which action to take, but only to *criticize* whatever action the actor network decided on taking.

For the critic network we stick strictly to the structure described in the paper:

  - Two hidden layers with ReLU activation and 400 and 300 neurons respectively.
  - Batch normalization applied to the input and first hidden layer.
  - Actions enter the network after the first hidden layer.
 
As is common in Deep Q-Networks, the single output neuron uses a linear activation.

In [None]:
def critic_network(states, actions, name):
    """Build a critic network q, the value function approximator.

    TODO: L2 weight decay with 1e-2
    TODO: Batchnorm
    """
    with tf.variable_scope(name) as scope:
        hidden_1 = tf.layers.dense(states, 400, activation=tf.nn.relu)
        hidden_1_ = tf.concat([hidden_1, actions], axis=1)
        hidden_2 = tf.layers.dense(hidden_1_, 300, activation=tf.nn.relu)
        outputs = tf.layers.dense(hidden_2, 1, activation=tf.identity)
        variables = tf.contrib.framework.get_variables(scope)
    return Network(outputs, variables)


def critic(states, actions, targets):
    """Build critic online and target network pair."""
    with tf.variable_scope('critic'):
        target = critic_network(states, actions, 'target')
        outputs, variables = critic_network(states, actions, 'online')
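        # Squeeze the (N, 1) outputs to (N,) so they match the targets' shape
        # instead of broadcasting to an (N, N) matrix.
        loss = tf.reduce_mean(
            tf.squared_difference(targets, tf.squeeze(outputs, axis=1)))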
        optimizer = tf.train.AdamOptimizer(1e-3).minimize(loss)
        gradients = tf.gradients(outputs, actions)
    online = OptimizableNetwork(outputs, variables, optimizer, gradients)
    return online, target

#### Actor
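
For reference, the actor update in Lillicrap et al. follows the sampled deterministic policy gradient, stated here in the paper's notation with $N$ the batch size:

$$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\Big|_{s=s_i,\, a=\mu(s_i)} \; \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\Big|_{s=s_i}$$

The `actor` function below appears to implement this chain rule by passing the critic's action gradients as `grad_ys` to `tf.gradients`. The sign is flipped because the optimizer minimizes, while the policy gradient is to be ascended. Note that the paper evaluates the critic gradient at $a = \mu(s_i)$, whereas the code as written evaluates it at whatever is fed into the `actions` placeholder.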

In [None]:
def actor_network(states, name):
    """Build an actor network mu, the policy function approximator."""
    with tf.variable_scope(name) as scope:
        hidden_1 = tf.layers.dense(states, 400, activation=tf.nn.relu)
        hidden_2 = tf.layers.dense(hidden_1, 300, activation=tf.nn.relu)
        outputs = tf.layers.dense(hidden_2, 1, activation=tf.nn.tanh)
        variables = tf.contrib.framework.get_variables(scope)
    return Network(outputs, variables)


def actor(states, critic_gradients):
    """Build actor online and target network pair."""
    with tf.variable_scope('actor'):
        target = actor_network(states, 'target')
        outputs, variables = actor_network(states, 'online')
        inverse_gradients = tf.multiply(-1., critic_gradients)
        gradients = tf.gradients(outputs, variables, inverse_gradients)
        pairs = zip(gradients, variables)
        optimizer = tf.train.AdamOptimizer(1e-4).apply_gradients(pairs)
    online = OptimizableNetwork(outputs, variables, optimizer, gradients)
    return online, target

#### Target Network Updates
While the online networks are trained directly (hence the *OptimizableNetwork* name), the target networks are only updated periodically using the online networks' parameters. For this, the paper describes a process named *soft updates*, which only slowly moves the target networks' parameters in the direction of the online networks. The original Deep Q-Network and the Double Deep Q-Network approaches instead copy the parameters over directly.

##### Initial Hard Update
To ensure that the online and target networks start out equal, we first implement hard parameter copying in `copy_network_parameters`. This function is only used once, right after variable initialization, to make sure each online and target network pair starts off from the same foundation. The fun part here is that we can do this directly in TensorFlow by assigning one variable to another. The function therefore consists of just two lines of code: one to pair up the variables and one to run the assign operations in the session.

In [None]:
def copy_network_parameters(session, src: Network, dst: Network):
    """Overwrite target with online network parameters."""
    pairs = zip(src.variables, dst.variables)
    # Distinct loop names avoid shadowing the src and dst arguments.
    session.run([dst_var.assign(src_var) for src_var, dst_var in pairs])

##### Soft Update
The soft update also consists of the assign operation, but we first need to blend the parameters using the `soft_update` function. Here `tau` describes how strongly the online values influence the previous target values. Note the notation: from now on, a trailing underscore marks variables related to one of our target networks, similar to the prime ($Q'$ or $\mu'$) used in the paper.

Because we actually need to manipulate the variables' values, the `update_network_parameters` function gets a bit more complex. It first needs to fetch the tensors' current values from the TensorFlow session, then match the correct triplets of `(target_tensor, online_value, target_value)` together and assign the result of our `soft_update` function to the target tensor back in the TensorFlow session.

A possible enhancement here would be to move the soft update value calculation to the TensorFlow graph itself.

In [None]:
def soft_update(val, val_, tau=0.001):
    """Calculate the soft update values."""
    return tau * val + (1 - tau) * val_


def update_network_parameters(session, src: Network, dst: Network, tau=0.001):
    """Soft update the dst net's parameters using those of the src net."""
    count = len(src.variables)
    values = session.run(src.variables + dst.variables)
    triplets = zip(dst.variables, values[:count], values[count:])
    session.run([tensor.assign(soft_update(val, val_, tau))
                 for tensor, val, val_ in triplets])
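
The effect of `tau` can be checked numerically without TensorFlow. The sketch below repeats the `soft_update` function from the cell above so that it runs on its own; the two weight vectors are made-up stand-ins for network parameters.

```python
import numpy as np


def soft_update(val, val_, tau=0.001):
    """Same blend as above: move val_ a fraction tau toward val."""
    return tau * val + (1 - tau) * val_


online = np.array([1.0, -2.0])  # Stand-in for online network weights.
target = np.zeros(2)            # Stand-in for target network weights.
tau, steps = 0.001, 1000

initial_gap = target - online
for _ in range(steps):
    target = soft_update(online, target, tau)

# The remaining gap to the online weights shrinks by (1 - tau) per update.
expected = online + (1 - tau) ** steps * initial_gap
print(np.allclose(target, expected))  # True
print((1 - tau) ** steps)             # roughly 0.37
```

With `tau = 0.001` and fixed online weights, roughly a third of the initial gap is still left after 1000 soft updates, which illustrates how slowly the target networks follow.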

### Bringing it all together

The `DDPG` class below:

  - Resets the TensorFlow graph, so that the cell can be re-run within the notebook.
  - Initializes the networks and the session.
  - Copies the initial parameters to the target networks.
  - Provides a `train` method which counts SGD steps.
  - Soft-updates the target networks every `update_frequency` SGD steps.
  - Provides `get_action`.

In [None]:
from itertools import chain
import tensorflow as tf


class DDPG:
    gamma = 0.99  # Discount factor
    update_frequency = 100

    def __init__(self, din=2, dout=1):
        tf.reset_default_graph()

        self.states = tf.placeholder(tf.float32, (None, din), 'states')
        self.actions = tf.placeholder(tf.float32, (None, dout), 'actions')
        self.targets = tf.placeholder(tf.float32, (None,), 'targets')

        self.critics = critic(self.states, self.actions, self.targets)
        self.critic, self.critic_ = self.critics

        self.actors = actor(self.states, self.critic.gradients)
        self.actor, self.actor_ = self.actors

        self.session = tf.Session()
        self.session.run(tf.global_variables_initializer())

        copy_network_parameters(self.session, self.critic, self.critic_)
        copy_network_parameters(self.session, self.actor, self.actor_)

        self.update_count = 0

    def train(self, batch):
        """Train the online, maybe update the target networks.

        NOTE: The whole target computation could be moved to TensorFlow by
        connecting the target actor outputs directly to the gradients instead
        of requesting them from the session and feeding them back in. This
        should be implemented, but might require some refactoring. It would
        reduce this whole block to a single session run.

        NOTE: Currently this completely ignores terminals -- not sure if
        that's desired. DQN normally only takes future rewards into
        consideration for states which are not terminal states. Lillicrap
        et al. do not make this distinction.
        """
        states, actions, rewards, states_, _ = zip(*batch)
        actions_ = self.session.run(self.actor_.y, {self.states: states_})
        q_values = self.session.run(self.critic_.y, {self.states: states_,
                                                     self.actions: actions_})
        targets = rewards + self.gamma * np.squeeze(q_values)
        # targets = rewards + self.gamma * ys * np.invert(terminals)  # DQN
        self.session.run([self.critic.optimizer, self.actor.optimizer],
                         {self.states: states, self.targets: targets,
                          self.actions: actions})

        self.update_count += 1
        if self.update_count % self.update_frequency == 0:
            update_network_parameters(self.session, self.critic, self.critic_)
            update_network_parameters(self.session, self.actor, self.actor_)

    def get_action(self, state):
        action, = self.session.run(self.actor.y, {self.states: [state]})
        return action
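
The difference between the two target variants mentioned in the `train` docstring can be illustrated with NumPy alone. The batch values below are made up for illustration; only the formulas mirror the code and the commented-out DQN line above.

```python
import numpy as np

gamma = 0.99
rewards = np.array([0.5, 0.8, -1.0])
q_values = np.array([[2.0], [3.0], [4.0]])  # Target critic output, shape (N, 1).
terminals = np.array([False, False, True])

next_q = np.squeeze(q_values, axis=1)  # Shape (N,), matching the rewards.

# Variant used in `train`: bootstrap from every next state.
targets_all = rewards + gamma * next_q

# DQN-style variant: drop the bootstrap term where the episode ended.
targets_masked = rewards + gamma * next_q * ~terminals

print(targets_all)     # approximately 2.48, 3.77, 2.96
print(targets_masked)  # approximately 2.48, 3.77, -1.0
```

Only the last entry differs: with masking, the target for the terminal transition is just its reward.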

## Let's Play

In [None]:
env = Carrera()
model = DDPG()
agent = Agent(env, model)
agent.train(100)