# ATARI-BREAKOUT WITH A3C IN TENSORFLOW

***

## 1. Introduction

The focus of this notebook is the deep reinforcement learning algorithm **Asynchronous Advantage Actor-Critic (A3C)** (https://arxiv.org/pdf/1602.01783.pdf). In particular, we'll be focussing on:
* A high-level theoretical explanation of A3C in the context of deep reinforcement learning generally; and
* An implementation of the algorithm for OpenAI's Atari Breakout environment (https://gym.openai.com/envs/Breakout-v0/) using TensorFlow.

I've included a quick Background section on reinforcement learning, but this should be treated more like an appendix (it runs over some theory and doesn't contain any of the implementation). The real substance of this notebook is in the third section (3. Asynchronous Advantage Actor-Critic).

***

## 2. Background

This section is included for the benefit of any reader who needs a quick primer on deep reinforcement learning. If you're already comfortable with the topic (i.e. you've got working definitions of the terms "agent", "environment", "reward", "policy", and "value function"), feel free to skip it. Otherwise, it's worth reading for the definitions of these keywords and a high-level intuition of deep reinforcement learning algorithms. We'll need those before we move on to A3C.

### Reinforcement Learning

Reinforcement learning is an area within machine learning concerned with how *agents* learn to *act* in an *environment* in order to maximize some cumulative *reward*. In our Atari Breakout example, you can think of the agent as being the player (or computer); the possible actions as being move-left, move-right, or do-nothing; the environment as being the game itself; and rewards as being the points you accumulate (or some penalty you receive for losing).

Translating this interaction into pseudocode, it would look something like this:

    player starts game
    initialize points at 0
    while the game is still running:
        player looks at the screen
        player chooses action (move-left, move-right, do-nothing) by evaluating the current screen
        player executes the chosen action
        player receives reward (0, 1 point, or loses a life)
        increment points by reward

And translating this to agent-environment interactions generally:
    
    initialize agent and environment
    initialize accumulated_reward at 0
    for each timestep:
        agent observes current state of the environment
        agent chooses an action based on current state
        agent executes action, causing environment to transition to next state
        agent receives reward (can be zero or negative) for reaching next state
        increment accumulated_reward by reward
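
As a rough illustration, the generic loop above can be written as runnable Python. The `CoinFlipEnv` environment and the random agent are hypothetical stand-ins invented for this sketch (they are not part of Gym, nor of the A3C implementation later on); only the shape of the loop matters.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the "state" is simply the current timestep

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done


def run_episode(env, choose_action):
    """Runs one episode and returns the accumulated reward."""
    state = env.reset()
    accumulated_reward = 0.0
    done = False
    while not done:
        action = choose_action(state)           # agent chooses an action
        state, reward, done = env.step(action)  # environment transitions to the next state
        accumulated_reward += reward            # increment accumulated reward
    return accumulated_reward


agent_rng = random.Random(1)
total = run_episode(CoinFlipEnv(), lambda state: agent_rng.randint(0, 1))
print("accumulated reward:", total)
```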
        
One major complication associated with reinforcement learning is delayed rewards. Taking a particular action in a given state may not generate a reward immediately; the reward may lag by several transitions. Because of this, reinforcement learning relies on the notion of "discounted reward": think of the value associated with a given state as the sum of all rewards that will be accumulated between that state and the terminal state, where each reward in the sequence is discounted by some factor (conventionally called "gamma") raised to the number of steps until that reward is received.
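
A small sketch of the discounted sum described above. The function name `discounted_return` and the reward sequence are illustrative choices for this example, not something defined elsewhere in the notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # iterate backwards so that each step needs only one multiply and one add
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# a reward of 1 that only arrives three transitions after the current state
delayed = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(delayed, gamma=0.99))
```

With a gamma below 1, the further away a reward is, the less it contributes to the value of the current state.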

### Problem: Choosing an Action

You might already have the question on your mind: **how does the agent choose which action to execute in the current environment state?** This is the **core problem in reinforcement learning**: what we're trying to do is come up with some algorithm that will - over time - let agents figure out what actions they should take in each possible state.

If we could somehow come up with a **function that takes the current environment state as input, and outputs a numerical valuation of each of the possible actions** (let's call these "valuation functions"), then the solution would be trivial. The agent could use the valuation function and current state to estimate the value of each possible action, then simply **choose whichever action has the highest valuation**.
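
If such a function existed, action selection might look like the following sketch. The `toy_valuation` function and its numbers are made up purely for illustration.

```python
def choose_greedy_action(valuation_function, state, actions):
    """Returns the action with the highest valuation in the given state."""
    valuations = valuation_function(state)
    return max(actions, key=lambda action: valuations[action])


# hypothetical, hand-written valuation function for a single made-up state
def toy_valuation(state):
    return {"move-left": 0.1, "move-right": 0.7, "do-nothing": 0.2}


actions = ["move-left", "move-right", "do-nothing"]
best = choose_greedy_action(toy_valuation, state=None, actions=actions)
print(best)
```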

The bad news is that it's really difficult to actually come up with these valuation functions. So how do we get around that?

### Solution: Neural Networks

This is where the "deep" part of reinforcement learning comes in: **without even knowing the definition of a given valuation function, we can approximate it using a neural network**. This is the approach that we take with A3C. Our problem is now reduced to defining:

1. The Network Architecture (a neural network which will function as a good approximator); and
2. The Training Regime (the loss function and optimizer, and the method for passing training samples through the network).

***

## 3. Asynchronous Advantage Actor-Critic (A3C)

### Conceptual Overview

Before we dive into the code, let's start off by briefly elaborating how A3C fits into the deep reinforcement learning paradigm, and identify what makes it unique from other reinforcement learning algorithms. A good way to do this is to talk about each of the four terms in its name.

* **Actor**. Recall that agents in reinforcement learning problems choose which action to take in each given state by calling some kind of "valuation function", which assigns a numerical value to each of the actions they could take. The A3C network approximates two different kinds of valuation functions. The first is a "policy", $ \pi(s) $. This is a function which takes a state $s$ as input and outputs a probability distribution over the action space. A3C construes the probabilities as valuations, such that the action which is assigned the highest probability is the action expected to generate the highest reward. Agents use this function to determine which action they should take in the current state. It is hence referred to as the "actor".


* **Critic**. The same A3C network is also used to approximate a "value function", $V(s)$ (yep, that's right, one network is being used to approximate two different functions). A value function takes a state $s$ as input and outputs a numeric estimate of the value of being in that state. This is used in the training stage to evaluate the actions suggested by the policy (the "actor"). The policy recommends an action which causes the environment to transition to a new state, then the value function estimates how valuable it is to actually be in that new state (that is, it "critiques" the choice made by the policy).


* **Advantage**. This is also used in the training stage for the network. Defining this precisely requires a more in-depth discussion about value functions, policies, and Q-values (which we don't touch on here), as well as the relationship between the three. Here, suffice it to say that "advantage" is (loosely) defined as the difference between the actual and expected values of being in a particular state. We can estimate the actual value by computing the sum of the discounted stream of rewards between any given state and the terminal state (more on this later); the expected value is generated by the "critic".


* **Asynchronous**. A3C is asynchronous in that it generates training samples by running multiple agents/environments on different threads simultaneously. These samples are used to train the single global neural network approximating the policy and value functions. This has the advantage of diversifying the training set, since the experiences of each of those agents will be independent of each other (highly correlated training samples were a problem in predecessors of A3C).
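
To make the "advantage" term a little more concrete, here is a minimal sketch under the loose definition given above (actual value minus the critic's expected value). The reward sequence and the critic estimate are made-up numbers, and `advantage` is a helper written for this example only.

```python
GAMMA = 0.99  # discount factor (the same value is used later in this notebook)


def discounted_return(rewards, gamma=GAMMA):
    """Discounted sum of a sequence of rewards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def advantage(rewards_to_terminal, critic_estimate, gamma=GAMMA):
    """Actual (observed) value of a state minus the critic's expected value."""
    return discounted_return(rewards_to_terminal, gamma) - critic_estimate


observed_rewards = [0.0, 1.0, 0.0, 1.0]  # hypothetical rewards until the episode ended
critic_estimate = 1.5                    # hypothetical V(s) predicted by the critic

adv = advantage(observed_rewards, critic_estimate)
print("advantage:", adv)
```

A positive advantage means the state turned out better than the critic expected, so training pushes up the probability of the action that was taken; a negative advantage does the opposite.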

We should also clarify exactly what the interaction is between the global network and any one of the independent agents. Say, for example, that an agent is in an environment at state $s_0$. The agent takes an observation of $s_0$, and uses that observation in a call to the global network (policy approximator) to get an estimate of the action valuations in $s_0$. It then chooses an action $a$ based on those valuations (the simplest rule is to pick the action with the highest valuation; our implementation samples from the distribution instead), executes the action, and transitions to the next state $s_1$. The agent then receives a reward $r$ for reaching $s_1$. The four-tuple $(s_0, a, r, s_1)$ (called a "transition") constitutes a single training datum. Once an agent has collected enough transitions, it sends them in a training batch over to the global network, where they are used to update the network weights.
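
One way to picture a transition and the batching step is the sketch below. `Transition`, `TransitionBuffer`, and the batch size of 3 are hypothetical names and values chosen for this example; the actual implementation further down stores transitions as plain tuples.

```python
from collections import namedtuple

# (state, action, reward, next state): one training datum
Transition = namedtuple("Transition", ["s", "a", "r", "s_next"])

BATCH_SIZE = 3  # hypothetical batch size, chosen for this example only


class TransitionBuffer:
    """Collects transitions and hands them over in fixed-size batches."""

    def __init__(self, batch_size=BATCH_SIZE):
        self.batch_size = batch_size
        self.items = []

    def add(self, transition):
        """Stores a transition; returns a full batch once enough are collected, else None."""
        self.items.append(transition)
        if len(self.items) < self.batch_size:
            return None
        batch, self.items = self.items, []
        return batch


buffer = TransitionBuffer()
batches = []
for t in range(7):
    batch = buffer.add(Transition(s=t, a=0, r=1.0, s_next=t + 1))
    if batch is not None:
        batches.append(batch)  # a real agent would send this to the global network

print(len(batches), "batches sent;", len(buffer.items), "transition still waiting")
```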

### Implementation Overview

At this point, we have enough to write out the scaffold for the major components of our implementation. The system should be comprised of:

* **A single master node**, which (1) hosts the neural network, (2) receives and caches training samples from workers, (3) trains the neural network at intervals, and (4) supports policy queries from the workers; and

        class Master:
            initialize_architecture()
            cache_training_sample()
            train_network()
            predict_policy()
            

* **Several drone nodes**, each of which (1) collects training samples by running an agent in an independent copy of the environment, (2) chooses which actions to take in each possible environment state, and (3) sends the training samples it has collected over to the master node at regular intervals. The drones should be implemented as Threads to support arbitrarily many asynchronous duplicates.

        class Drone: Thread
            run_episode()
            choose_action()
            send_sample_to_master()
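
Before filling in the real classes, the threading pattern behind this scaffold can be sketched on its own. This is a stripped-down illustration only: the samples are placeholder tuples, each drone stops after a fixed number of steps, and `num_samples` is a parameter invented for the example.

```python
from threading import Thread, Lock


class Master:
    """Holds a shared sample cache, guarded by a lock."""

    def __init__(self):
        self.cache = []
        self.lock = Lock()

    def cache_training_sample(self, sample):
        with self.lock:  # only one thread may modify the cache at a time
            self.cache.append(sample)


class Drone(Thread):
    """Produces a fixed number of placeholder samples and sends them to the master."""

    def __init__(self, master, drone_id, num_samples):
        super().__init__()
        self.master = master
        self.drone_id = drone_id
        self.num_samples = num_samples

    def run(self):
        for step in range(self.num_samples):
            # a real drone would run the agent in its own environment here
            self.master.cache_training_sample((self.drone_id, step))


master = Master()
drones = [Drone(master, drone_id=i, num_samples=100) for i in range(4)]
for drone in drones:
    drone.start()
for drone in drones:
    drone.join()  # wait for every drone to finish

print("samples cached:", len(master.cache))
```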
            
            

### Implementation

**Imports**. Let's start off by importing the packages that we're going to need.

In [None]:
import tensorflow as tf # to define the neural network
import numpy as np # manipulate tensors and arrays of training samples
from threading import Thread, Lock # to make each of the drones asynchronous
import gym # the OpenAI API to run the Atari Breakout emulator
import cv2 # image manipulation functions for preprocessing
import time, random

**Global Constants**. The next thing we'll do is define a bunch of global constants.

In [None]:
# save/load location for our neural networks
SAVE_DIR = "saved-networks"
SAVE_FILE = "network"

# the save frequency (in tf global_steps)
SAVE_FREQ = 100

# number of independent drones
NUM_DRONES = 4

# pause (in seconds) at each timestep, lets a drone yield control to other threads
THREAD_DELAY = 0.001

# environmental details
ENVIRONMENT = "Breakout-v0" # used to select the Atari Breakout emulator
tmp = gym.make(ENVIRONMENT) # create tmp environment to collect metadata
RAW_STATE_SHAPE = tmp.observation_space.shape # dimensions of the raw Atari Breakout (state) screen
NUM_ACTIONS = tmp.action_space.n # number of actions that can be taken in the game
STATE_SHAPE = (84, 84, 1) # the dimensions of the preprocessed states
NONE_STATE = np.zeros(STATE_SHAPE) # a dummy state, used as a filler for terminal states

# discount factor
GAMMA = 0.99

# number of lags over which discounted reward will be computed
N_STEP = 8
GAMMA_N = GAMMA ** N_STEP

# penalty (negative reward) applied when the agent loses a game 
FAILURE_PENALTY = -5.0

# the reward received by the agent for simply keeping the game going (i.e. not losing)
LIVENESS_REWARD = 0.01

# epsilon-greedy training parameters
# to support trial-and-error, the agent will - at each timestep - choose a random action with probability epsilon
EPSILON_INIT = 0.50
EPSILON_STOP = 0.15

# decrement in epsilon at each training step (slowly decreases the likelihood of choosing a random action)
EPSILON_STEP = 0.000005

# minimum size of a training batch
MIN_TRAINING_BATCH = 32

# the update rate for network weights during back-propagation
LEARNING_RATE = 5e-3

# weights over different components of the loss function used in training (more on this later)
COEF_LOSS_V = 0.5
COEF_LOSS_ENT = 0.01

# flags to identify whether a given transition is into a terminal state
MASK_NON_TERMINAL = 1.0
MASK_TERMINAL = 0.0
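
To get a feel for the epsilon-greedy constants, the sketch below assumes epsilon is meant to decay linearly from `EPSILON_INIT` down to a floor of `EPSILON_STOP`, dropping by `EPSILON_STEP` each time a random action is taken. The helper names `epsilon_after` and `epsilon_greedy` are made up for this example.

```python
import random

EPSILON_INIT = 0.50
EPSILON_STOP = 0.15
EPSILON_STEP = 0.000005


def epsilon_after(num_random_actions):
    """Epsilon after the given number of random actions (linear decay with a floor)."""
    return max(EPSILON_STOP, EPSILON_INIT - EPSILON_STEP * num_random_actions)


def epsilon_greedy(policy_action, num_actions, epsilon, rng=random):
    """With probability epsilon return a random action, otherwise the policy's action."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return policy_action


# number of random actions a drone must take before epsilon reaches its floor
steps_to_floor = round((EPSILON_INIT - EPSILON_STOP) / EPSILON_STEP)
print(steps_to_floor)
print(epsilon_after(0), epsilon_after(steps_to_floor * 2))
```

Under these assumptions each drone keeps exploring fairly heavily for tens of thousands of random actions before settling at the minimum exploration rate.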

**TensorFlow Helpers**. Define a few helper functions that we'll use to help write certain tensorflow variables and operations when it comes to defining the network.

In [None]:
def weight_variable(shape):
    """
    Creates a tensorflow weight variable, initialized from a truncated normal distribution.
    """
    return tf.Variable(tf.truncated_normal(shape, stddev=0.01))

def bias_variable(shape):
    """
    Creates a tensorflow bias variable.
    """
    return tf.Variable(tf.constant(0.01, shape=shape))

def conv_2d(x, W, stride):
    """
    Creates a convolution over the given input tensor (x), using the given weight variable (W).
    """
    return tf.nn.conv2d(x, W, strides=[1,stride,stride,1], padding="VALID")

def max_pool(x, stride):
    """
    Max pooling over the given input tensor (x).
    """
    return tf.nn.max_pool(x, ksize=[1,stride,stride,1], strides=[1,stride,stride,1], padding="SAME")
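
The `padding` argument in these helpers determines how much each layer shrinks its input. The sketch below traces an 84 x 84 frame through the layer sizes used by the network defined further down (8x8 convolution with stride 4, 2x2 max pooling, 4x4 convolution with stride 2). The two helper functions are written for this example and rely on the standard output-size formulas for "VALID" and "SAME" padding.

```python
import math


def valid_output_size(input_size, kernel_size, stride):
    """Output width/height of a convolution with "VALID" padding (no padding)."""
    return (input_size - kernel_size) // stride + 1


def same_output_size(input_size, stride):
    """Output width/height of an operation with "SAME" padding."""
    return math.ceil(input_size / stride)


size = 84                             # preprocessed frames are 84 x 84
size = valid_output_size(size, 8, 4)  # convolution 1: 8x8 kernel, stride 4
size = same_output_size(size, 2)      # 2x2 max pooling, stride 2
size = valid_output_size(size, 4, 2)  # convolution 2: 4x4 kernel, stride 2
flat_units = size * size * 32         # convolution 2 has 32 output channels
print(size, flat_units)
```

This is where the 512 input units of the fully connected layer come from.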

**Preprocessing**. So far we haven't really talked about preprocessing. This is what we'll do here. The raw Atari Breakout screen is a 210 (height) x 160 (width) x 3 (RGB) pixel array. This is pretty computationally strenuous to pass through a neural network; on top of that, it contains a lot of redundant information (like the different colours of pixels, and the border). We preprocess the raw images by converting them to binary (black and white), downsampling, and cropping to 84 (height) x 84 (width).

In [None]:
def preprocess(state):
    """
    Converts a raw RGB frame into an 84 x 84 x 1 binary image.
    """
    # convert to grayscale (gym returns frames in RGB channel order)
    state = cv2.cvtColor(state, cv2.COLOR_RGB2GRAY)
    
    # then convert to binary
    _, state = cv2.threshold(state, 1, 255, cv2.THRESH_BINARY)
    
    # resize to 84 (width) x 110 (height); cv2.resize expects (width, height)
    state = cv2.resize(state, (84,110))
    
    # then crop to 84 x 84
    state = state[26:110, :]
    state = np.reshape(state, (84,84,1))
    
    #cv2.imshow("img",state)
    return state
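
For readers without OpenCV at hand, the shape changes in `preprocess()` can be followed with NumPy alone. This is only an approximation of the cell above: the random frame is a stand-in for a real screen, the grayscale conversion is a plain channel average (OpenCV uses a weighted sum), and the resize is a crude nearest-neighbour sampling.

```python
import numpy as np

# stand-in for a raw frame: 210 (height) x 160 (width) x 3 (RGB), random pixel values
raw = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

# grayscale via a plain channel average
gray = raw.mean(axis=2)

# binary: any pixel brighter than 1 becomes 255, everything else becomes 0
binary = np.where(gray > 1, 255, 0).astype(np.uint8)

# crude nearest-neighbour resize to 110 (height) x 84 (width) by index sampling
rows = np.arange(110) * 210 // 110
cols = np.arange(84) * 160 // 84
resized = binary[rows][:, cols]

# crop away the top 26 rows, then add a trailing channel axis
cropped = resized[26:110, :]
state = cropped.reshape(84, 84, 1)
print(state.shape)
```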

**Master**. Fill out the scaffold for the Master node that we've defined above. Pay close attention to the `initialize_architecture()` function, which defines both the network architecture of the policy/value approximator as well as the loss and optimization operations for training the network.

In [None]:
class Master(object):
    """
    Implementation of the A3C Master node. Responsible for:
        (1) Hosting the neural network
        (2) Receiving and caching training samples from workers
        (3) Training the neural network
        (4) Support policy queries from workers
    """
    
    
    def __init__(self):
        # training cache, stores (s, a, r, s_, mask) transitions received from workers
        self.cache = [ [],[],[],[],[] ]
        
        # used to lock access to the training cache
        self.lock = Lock()

        # initialize tf network
        self.sess = tf.Session()
        self.initialize_architecture()
        self.saver = tf.train.Saver()
        self.sess.run(tf.global_variables_initializer())
        
        # load an existing network, if one exists
        self.load_network()

        
    def initialize_architecture(self):
        """
        Initializes the neural network that approximates the policy and value functions.
        """
        
        # (1) NETWORK ARCHITECTURE
        # Adapted from Mnih et al. 2015 (https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)
        #    * State Input : 84 x 84 x 1 pixel matrix
        #    * Convolutional Layer 1 : 8x8 convolution, 16 output channels, 4 stride, ReLU activation, 2x2 max pooling
        #    * Convolutional Layer 2 : 4x4 convolution, 32 output channels, 2 stride, ReLU activation
        #    * Fully Connected Layer : 512 input units, 256 output units, ReLU activation
        #    * LSTM Layer : 256 input cells, 256 output cells
        #    * Policy Output : 256 input cells (from LSTM), NUM_ACTIONS output cells
        #    * Value Output : 256 input cells (from LSTM), 1 output cell
        
        # input placeholders
        with tf.name_scope("input"):
            # state tensor, input to the policy and value functions
            self.s = tf.placeholder(tf.float32, shape=[None,84,84,1], name="state")
            
            # action and reward tensor, used during training
            self.a = tf.placeholder(tf.float32, shape=[None,NUM_ACTIONS], name="action")
            self.r = tf.placeholder(tf.float32, shape=[None,1], name="reward")

        # first convolutional layer
        with tf.name_scope("convolutional-1"):
            # 8x8 convolution, 16 output channels, 4 strides
            self.conv1_W = weight_variable([8,8,1,16])
            self.conv1_b = bias_variable([16])
            self.conv1 = conv_2d(self.s, self.conv1_W, 4) + self.conv1_b
            
            # passed through ReLU layer
            self.conv1 = tf.nn.relu(self.conv1)
            
            # passed through 2x2 max pooling
            self.conv1 = max_pool(self.conv1, 2)

        # second convolutional layer
        with tf.name_scope("convolutional-2"):
            # 4x4 convolution, 32 output channels, 2 strides
            self.conv2_W = weight_variable([4,4,16,32])
            self.conv2_b = bias_variable([32])
            self.conv2 = conv_2d(self.conv1, self.conv2_W, 2) + self.conv2_b
            
            # passed through ReLU layer
            self.conv2 = tf.nn.relu(self.conv2)

        # fully-connected layer
        with tf.name_scope("fully-connected"):
            # flatten the 4x4x32 output from the second convolutional layer (4*4*32 = 16*32 = 512)
            self.conv2_flat = tf.reshape(self.conv2, [-1,16*32])
            
            # 512 input cells, 256 outputs
            self.fc_W = weight_variable([16*32,256])
            self.fc_b = bias_variable([256])
            self.fc = tf.matmul(self.conv2_flat, self.fc_W) + self.fc_b
            
            # passed through ReLU layer
            self.fc = tf.nn.relu(self.fc)

        # lstm cell
        with tf.name_scope("lstm"):
            # 256 input cells
            self.lstm_cell = tf.contrib.rnn.BasicLSTMCell(256)
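            # note: with time_major=True the leading (batch) axis is read as the time axis,
            #   so the rows of a batch are processed as one sequence with batch size 1
            self.fc_seq = tf.expand_dims(self.fc, axis=1)
            init_state = self.lstm_cell.zero_state(batch_size=1, dtype=tf.float32)
            self.lstm, _ = tf.nn.dynamic_rnn(self.lstm_cell, self.fc_seq, initial_state=init_state, time_major=True)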
            
            # flatten lstm output
            self.lstm_flat = tf.reshape(self.lstm, [-1,256])

        # output layer for policy approximator
        with tf.name_scope("output-policy"):
            # 256 input cells, NUM_ACTIONS outputs (the probability distribution)
            self.policy_W = weight_variable([256,NUM_ACTIONS])
            self.policy_b = bias_variable([NUM_ACTIONS])
            self.policy = tf.matmul(self.lstm_flat, self.policy_W) + self.policy_b
            
            # apply softmax to normalize into a probability distribution
            self.policy = tf.nn.softmax(self.policy)

        # output layer for value function approximator
        with tf.name_scope("output-value"):
            # 256 input cells, single output (the state value)
            self.value_W = weight_variable([256,1])
            self.value_b = bias_variable([1])
            self.value = tf.matmul(self.lstm_flat, self.value_W) + self.value_b

        
        # (2) LOSS OPERATION
        # Adapted from Mnih et al. 2016 (https://arxiv.org/pdf/1602.01783.pdf)
        
        # policy loss operation
        with tf.name_scope("policy-loss"):
            self.log_ap = tf.log(tf.reduce_sum(self.policy*self.a, axis=1, keepdims=True) + 1e-10)
            self.advantage = self.r - self.value
            self.loss_p = - self.log_ap * tf.stop_gradient(self.advantage)

        # value loss operation
        with tf.name_scope("value-loss"):
            self.loss_v = COEF_LOSS_V * tf.square(self.advantage)

        # entropy loss operation
        # Mnih et al. 2016 found that adding an entropy term to the loss discourages
        #   premature convergence of the network to a suboptimal policy
        with tf.name_scope("entropy-loss"):
            self.loss_e = COEF_LOSS_ENT * tf.reduce_sum(self.policy * tf.log(self.policy + 1e-10),
                                                        axis=1, keepdims=True)

        # combined loss operation
        with tf.name_scope("total-loss"):
            self.loss_t = tf.reduce_mean(self.loss_p + self.loss_v + self.loss_e)

        # global step variable, keeps track of number of training runs
        with tf.name_scope("global-step"):
            self.global_step = tf.Variable(0, name="step", trainable=False)

            
        # (3) OPTIMIZER OPERATION
        # Adapted from Mnih et al. 2016 (https://arxiv.org/pdf/1602.01783.pdf)
        
        # optimizer operation
        with tf.name_scope("optimizer"):
            self.optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE, decay=0.99)
            self.minimize = self.optimizer.minimize(self.loss_t, global_step=self.global_step)

            
    def cache_training_sample(self, s, a, r, s_):
        """
        Used by drones to add training samples to the master cache.
        """
        # make sure that no other thread can touch the cache - necessary to prevent race conditions
        with self.lock:
            # add s, a, and r to their respective queues in the cache
            self.cache[0].append(preprocess(s))
            self.cache[1].append(a)
            self.cache[2].append(r)
            
            # if s_ is a terminal state ...
            if s_ is None:
                # ... add the dummy state to the cache
                self.cache[3].append(NONE_STATE)
                self.cache[4].append(MASK_TERMINAL)
            else:
                # otherwise, add s_ to the cache
                self.cache[3].append(preprocess(s_))
                self.cache[4].append(MASK_NON_TERMINAL)

                
    def train_network(self):
        """
        Training pass over the neural network.
            (1) Pulls all training samples from the master cache
            (2) Trains the network on these samples
        """
        # check that the cache has sufficient data for a training pass
        if len(self.cache[0]) < MIN_TRAINING_BATCH:
            time.sleep(0) # yield control to other threads rather than spinning
            return

        # pull all data from the cache
        with self.lock:
            s, a, r, s_, mask = self.cache
            self.cache = [ [],[],[],[],[] ]
        s = np.array(s)
        a = np.vstack(a)
        r = np.vstack(r)
        s_ = np.array(s_)
        mask = np.vstack(mask)

        # generate the value predictions for every state s_
        v = self.sess.run(self.value, feed_dict={self.s : s_})
        
        # add the discounted value estimate of s_ to the n-step reward (bootstrapping)
        #   notice how we use mask here, mask will be 0.0 for terminal states, 1.0 otherwise
        #   this ensures that no future value is added for terminal states
        r = r + GAMMA_N * v * mask

        # call the train operation
        self.sess.run(self.minimize, feed_dict={self.s : s, self.a : a, self.r : r})

        # save the network
        if self.sess.run(self.global_step) % SAVE_FREQ == 0:
            self.save_network()
            
            
    def predict_policy(self, s):
        """
        Used by drones to fetch the probability distribution over actions for a given state.
        """
        # predict the policy distribution of a given state
        policy = self.sess.run(self.policy, feed_dict={self.s : [s]})
        # drop the batch dimension, np.random.choice needs a 1-D probability vector
        return policy[0]

    
    def load_network(self):
        """
        Load network weights from external file.
        """
        checkpoint = tf.train.latest_checkpoint(
            checkpoint_dir=SAVE_DIR)
        if not (checkpoint is None):
            print("Existing network found at " + SAVE_DIR + ". Loading ...")
            self.saver.restore(self.sess, checkpoint)
            print("... loaded.")
        else:
            print("No network found. New network initialized.")

            
    def save_network(self):
        """
        Save network weights to external file.
        """
        print("Saving network ...")
        self.saver.save(self.sess, SAVE_DIR + "/" + SAVE_FILE,
            global_step=self.global_step)
        print("... saved.")
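
To see what the loss operations compute, here is a NumPy sketch that mirrors them on a tiny hand-made batch. It is an illustration under assumptions, not the network itself: the policy and value arrays are made-up numbers standing in for network outputs, and `a3c_loss` is a helper named for this example. The training target is built the same way as in `train_network()`.

```python
import numpy as np

COEF_LOSS_V = 0.5
COEF_LOSS_ENT = 0.01
GAMMA_N = 0.99 ** 8


def a3c_loss(policy, value, action_onehot, target):
    """Total loss for a batch, mirroring the TensorFlow operations in NumPy."""
    # log-probability of the action that was actually taken
    log_ap = np.log(np.sum(policy * action_onehot, axis=1, keepdims=True) + 1e-10)
    advantage = target - value
    loss_p = -log_ap * advantage  # the advantage is treated as a constant here
    loss_v = COEF_LOSS_V * np.square(advantage)
    # sum(p * log p) is the negative entropy, so minimizing it keeps the policy spread out
    loss_e = COEF_LOSS_ENT * np.sum(policy * np.log(policy + 1e-10), axis=1, keepdims=True)
    return float(np.mean(loss_p + loss_v + loss_e))


# made-up batch of two transitions with two possible actions
policy = np.array([[0.5, 0.5], [0.9, 0.1]])  # actor output for s
value = np.array([[1.0], [0.0]])             # critic output for s
action = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot actions taken
n_step_reward = np.array([[0.5], [-5.0]])
next_value = np.array([[2.0], [3.0]])        # critic output for s_
mask = np.array([[1.0], [0.0]])              # the second transition is terminal

# bootstrapped training target: the mask zeroes out the value of terminal states
target = n_step_reward + GAMMA_N * next_value * mask
loss = a3c_loss(policy, value, action, target)
print(target.ravel(), loss)
```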

**Drone**. And now we'll implement the scaffold for the Drone class.

In [None]:
class Drone(Thread):
    """
    Implementation of a Drone. Each Drone encapsulates an independent environment/agent.
        (1) Collects training samples by running the agent in the environment;
        (2) Chooses actions in each state by querying the Master node;
        (3) Sends training samples to the Master node at regular intervals.
    """
    
    def __init__(self, master, drone_id, exemplar=False):
        super(Drone, self).__init__()
        
        # reference to the master node
        self.master = master
        self.sess = self.master.sess
        self.drone_id=drone_id
        
        # this is to mark whether we should render the environment for this Drone
        #   only one of the Drones can be an exemplar (i.e. render the Atari emulator at once)
        self.exemplar = exemplar
        
        # launch the Atari environment
        self.env = gym.make(ENVIRONMENT)
        
        # likelihood of choosing a random action and ignoring the policy
        self.epsilon = EPSILON_INIT
        
        # local training cache, regularly passed to master
        self.training_samples = []
        
        # used to accumulate discounted reward over a sequence of transitions
        self.R = 0.0

        
    def run(self):
        """
        Implementation of run() method required by Thread. Repeatedly runs training episodes.
        """
        while True:
            self.run_episode()

    def run_episode(self):
        """
        Runs a single training episode.
        """
        # reset the emulator, and take an observation of the start screen
        s = self.env.reset()
        
        # accumulator for the total reward collected over the episode
        episode_R = 0
        while True:
            time.sleep(THREAD_DELAY) # yield control to other threads
            
            # render the emulator if this drone is the exemplar
            # NOTE: this doesn't work in Notebook
            #if self.exemplar:
            #    self.env.render()

            # choose an action
            a = self.act(s)
            
            # execute the action, and collect the transition variables
            s_, r, end, _ = self.env.step(a)
            if end:
                s_ = None
                r = FAILURE_PENALTY
            elif r < 0.001:
                r = LIVENESS_REWARD
            
            # convert the action to one-hot representation (necessary for tensorflow)
            a_onehot = np.zeros(NUM_ACTIONS)
            a_onehot[a] = 1

            # save the transition in local memory
            transition = (s, a_onehot, r, s_)
            self.training_samples.append(transition)

            # update the accumulated discounted reward
            self.R = (self.R + r * GAMMA_N) / GAMMA

            # if we've hit the terminal state, flush every remaining sample to the master
            #   (distinct names are used so that this step's s_ and r are not overwritten)
            if s_ is None:
                while len(self.training_samples) > 0:
                    n = len(self.training_samples)
                    s_0, a_0, R_n, s_n = self.sample_memory(n)
                    self.master.cache_training_sample(s_0, a_0, R_n, s_n)
                    self.R = (self.R - self.training_samples[0][2]) / GAMMA
                    self.training_samples.pop(0)
                self.R = 0

            # if the local cache has reached capacity
            if len(self.training_samples) >= N_STEP:
                s_0, a_0, R_n, s_n = self.sample_memory(N_STEP)
                self.master.cache_training_sample(s_0, a_0, R_n, s_n)
                self.R = self.R - self.training_samples[0][2]
                self.training_samples.pop(0)

            # transition to the next state
            s = s_
            
            # accumulate total episode reward
            episode_R += r

            # end the episode once we've reached a terminal state
            if end:
                break

        # record episode reward for the first drone (the exemplar)
        if self.drone_id == 0:
            with open("log-a3c-"+str(self.drone_id)+".txt", "a") as log:
                log.write(str(episode_R) + "\n")
            print("Episode Reward: " + str(episode_R))
     
    
    def sample_memory(self, n):
        """
        Fetches an n-step transition from local memory.
        """
        s, a, _, _ = self.training_samples[0]
        _, _, _, s_ = self.training_samples[n-1]
        return s, a, self.R, s_
    
    
    def act(self, s):
        """
        Chooses an action in a given state, by querying the Master for a policy.
        """
        # with probability epsilon, choose a random action
        if random.random() < self.epsilon:
            # ... and if we do that, make sure we decrement epsilon (down to EPSILON_STOP)
            if self.epsilon > EPSILON_STOP:
                self.epsilon -= EPSILON_STEP
            return random.randint(0, NUM_ACTIONS-1)
        
        # otherwise, ask the master for a policy
        else:
            policy = self.master.predict_policy(preprocess(s))
            # choose an action in accordance with the policy
            # Note, an alternative would be to choose the action with the highest probability
            return np.random.choice(NUM_ACTIONS, p=policy)
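
The update `self.R = (self.R + r * GAMMA_N) / GAMMA` is the least obvious line in the Drone. The sketch below isolates that rolling update for the steady-state case (a full window of `N_STEP` rewards) and checks it against an explicit discounted sum. The terminal-state flush is left out, and the reward sequence and function names are made up for the example.

```python
GAMMA = 0.99
N_STEP = 8
GAMMA_N = GAMMA ** N_STEP


def rolling_n_step_returns(rewards):
    """Yields the N_STEP-step discounted return for each full window of rewards."""
    R = 0.0
    window = []
    for r in rewards:
        window.append(r)
        R = (R + r * GAMMA_N) / GAMMA  # fold the newest reward into the accumulator
        if len(window) >= N_STEP:
            yield R                    # return for the oldest state in the window
            R = R - window.pop(0)      # drop the oldest (undiscounted) reward


def direct_n_step_return(rewards, start):
    """Reference computation: explicit discounted sum over N_STEP rewards."""
    return sum(GAMMA ** k * rewards[start + k] for k in range(N_STEP))


rewards = [0.01, 0.01, 1.0, 0.01, 0.01, 0.01, 4.0, 0.01, 0.01, 1.0, 0.01, 0.01]
rolling = list(rolling_n_step_returns(rewards))
direct = [direct_n_step_return(rewards, i) for i in range(len(rewards) - N_STEP + 1)]
print(max(abs(x - y) for x, y in zip(rolling, direct)))
```

The two computations agree up to floating-point error, and the rolling version needs only constant work per timestep instead of re-summing the whole window.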

In [None]:
class Optimizer(Thread):
    """
    Simple optimizer class. All this does is repeatedly call train_network() in master.
    """
    def __init__(self, master):
        super(Optimizer, self).__init__()
        self.master = master
    
    def run(self):
        while True:
            self.master.train_network()

In [None]:
def main():
    
    # initialize master
    master = Master()
    
    # initialize the drones
    drones = [Drone(master, drone_id=i) for i in range(NUM_DRONES)]
    
    # make the first drone an exemplar
    drones[0].exemplar = True
    
    for drone in drones:
        drone.start()
        
    optimizer = Optimizer(master)
    optimizer.run()

In [None]:
main()

A few things to note about running this code in the Notebook:
* Jupyter Notebook doesn't support rendering for the Atari emulator, so - unfortunately - I've had to disable the rendering functionality. This means that you won't be able to watch A3C actually playing Atari Breakout unless you export the code into a .py file and run it from the terminal.
* I've run into a few bugs that I haven't implemented fixes for. The most annoying one is that if you run `main()` once, then force stop it, you won't be able to start it again unless you restart the Notebook kernel.
* Because A3C can take several hours to converge, you won't be able to make out any trend in the episode reward figures that are reported when running `main()`. Check out https://www.youtube.com/watch?v=V1eYniJ0Rnk (only 1:30 mins) for how the network should perform as it starts to converge. The video is actually for DQN (A3C's predecessor), but the behaviour should be the same.