<a href="https://colab.research.google.com/github/Yawapi/100-Days-of-machine-learning-challenge/blob/master/SLM_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
# SLM Lab
  

## A Modular Deep Reinforcement Learning Framework in PyTorch

### Laura Graesser, Wah Loon Keng, Siraj Raval

### September 24, 2018

####  Abstract

SLM Lab is a modular deep reinforcement learning framework in PyTorch. It implements numerous canonical algorithms using modular and reusable components, provides an automated experiment and analytics framework focusing on reproducibility, and integrates with OpenAI Gym and Unity ML-Agents. This whitepaper discusses the features and design principles of SLM Lab.

![alt text](https://camo.githubusercontent.com/4f7b42837ee4a737ef2f43deb23ebebc212697d4/68747470733a2f2f6d656469612e67697068792e636f6d2f6d656469612f6c30444149796d75694d533348795739472f67697068792e676966)
  




# 1 - Introduction
Reinforcement learning (RL) is a subfield of artificial intelligence that dates back to optimal control theory
and Markov decision processes (MDP). RL is concerned with sequential decision making in the context of an
unknown objective function. Unlike supervised learning, algorithms which try to solve RL problems do not
have access to the “correct answer” during the learning phase. Instead, an RL system consists of an agent
and an environment, subject to alternating interactions in succession. The environment provides a signal,
known as the state, to the agent, who then observes it, deliberates, and performs an action. The action is
sent back to the environment. It accepts the action, transitions into the next state, and provides feedback
to the agent in the form of a reward and the next state. The cycle then repeats until the environment
terminates. The objective or goal for the agent in this system is to maximize the discounted sum of rewards
received over time. For this reason, RL has been referred to as “learning with a critic” (the rewards received)
in contrast to the supervised learning setup of “learning with a teacher” (the correct action to take in every
state) [[31]](https://doi.org/10.1109/TSMC.1973.4309272).
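The interaction cycle described above can be sketched in a few lines of Python. This is only an illustration: `CoinFlipEnv`, `copy_policy`, and `run_episode` are hypothetical names invented for this example and are not part of SLM Lab or OpenAI Gym.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the state is a coin face (0 or 1)."""

    def __init__(self, max_steps=10, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0
        self.state = 0

    def reset(self):
        self.t = 0
        self.state = self.rng.randint(0, 1)
        return self.state

    def step(self, action):
        # Reward 1.0 when the action matches the state the agent just observed.
        reward = 1.0 if action == self.state else 0.0
        self.t += 1
        self.state = self.rng.randint(0, 1)  # transition to the next state
        done = self.t >= self.max_steps      # the environment terminates
        return self.state, reward, done


def copy_policy(state):
    """A trivial policy that always answers with the observed state."""
    return state


def run_episode(env, policy, gamma=0.99):
    """Run one episode and return the discounted sum of rewards."""
    state = env.reset()
    discounted_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                  # agent observes and acts
        state, reward, done = env.step(action)  # environment responds
        discounted_return += discount * reward
        discount *= gamma
    return discounted_return


# Three steps, all rewarded: 1 + 0.5 + 0.25
print(run_episode(CoinFlipEnv(max_steps=3), copy_policy, gamma=0.5))  # 1.75
```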


![alt text](https://www.ibm.com/developerworks/library/cc-models-machine-learning/figure01.png)


Deep learning is a method for approximating highly complex non-linear functions by learning from data.
Recently, deep learning has been combined with RL to great effect. It is being applied to robotics [[1, 8, 22]](https://arxiv.org/abs/1808.00177),
navigation [[23]](http://arxiv.org/abs/1611.04201), datacenter cooling [[9]](https://www.technologyreview.com/s/611902/google-just-gave-control-over-data-center-cooling-to-an-ai/), natural language understanding [[19]](http://arxiv.org/abs/1705.04304), and has led to expert level or
superhuman play in a number of games: Backgammon [[29]](http://doi.acm.org/10.1145/203330.203343), Atari video games [[13, 12]](http://arxiv.org/abs/1312.5602), Go [[27, 28]](https://www.nature.com/articles/nature16961), Dota
[[16, 17, 18]](https://blog.openai.com/more-on-dota-2/), and chess [[26]](https://arxiv.org/abs/1712.01815). However, a number of challenges remain.


![alt text](https://cdn-images-1.medium.com/max/693/1*ZX05x1xYgaVoa4Vn2kKS9g.png)


It is well known that RL algorithms are hard to make work [[6]](https://www.alexirpan.com/2018/02/14/rl-hard.html), can be unstable [[5]](http://arxiv.org/abs/1709.06560), and that their
results are hard to reproduce [[21, 5]](http://amid.fish/reproducing-deep-rl). The combination of two factors makes learning difficult: first, complex
non-linear function approximation with neural networks; second, non-stationary, potentially stochastic
environments with unknown dynamics. Consequently, getting RL to work requires many tricks and many hyperparameters to tune. This contributes to the challenge of reproducibility and also
makes deep RL hard to learn.

SLM Lab is a modular deep reinforcement learning (DRL) framework in PyTorch[[20]](https://github.com/pytorch/pytorch) created for DRL
research and applications. It emerged from the authors facing these problems and wanting to make a
contribution towards solving them. The design was guided by four principles: modularity, simplicity,
analytical clarity, and reproducibility.


**Modularity:** the modularity of component design in SLM Lab makes research easier and more accessible.
It reuses well-tested components and enables users to focus only on the relevant work. It also makes it easier
to learn deep RL by breaking down the complex Deep RL algorithms into more manageable, digestible
components. Moreover, when components are maximally reused, there is less code, more tests, and fewer
bugs.


```python
class PPO(ActorCritic):
    '''
    Implementation of PPO
    This is actually just ActorCritic with a custom loss function
    Original paper: "Proximal Policy Optimization Algorithms"
    https://arxiv.org/pdf/1707.06347.pdf
    Adapted from OpenAI baselines, CPU version https://github.com/openai/baselines/tree/master/baselines/ppo1
    Algorithm:
    for iteration = 1, 2, 3, ... do
        for actor = 1, 2, 3, ..., N do
            run policy pi_old in env for T timesteps
            compute advantage A_1, ..., A_T
        end for
        optimize surrogate L wrt theta, with K epochs and minibatch size M <= NT
    end for
    '''
```

*Example - Proximal Policy Optimization algorithm is implemented as an extension of the actor-critic algorithm* (PPO.py)
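The idea that PPO is "just ActorCritic with a custom loss function" can be illustrated with a small, dependency-free sketch. The class and method names here (`ActorCriticSketch`, `PPOSketch`, `calc_policy_loss`, `train_step`) are assumptions made for illustration and do not reproduce SLM Lab's actual code; the clipped surrogate follows the objective in the PPO paper.

```python
import math


class ActorCriticSketch:
    """Hypothetical stand-in for an actor-critic algorithm."""

    def calc_policy_loss(self, log_probs, old_log_probs, advantages):
        # Vanilla policy-gradient loss: -mean(log_prob * advantage).
        terms = [lp * adv for lp, adv in zip(log_probs, advantages)]
        return -sum(terms) / len(terms)

    def train_step(self, log_probs, old_log_probs, advantages):
        # Shared training logic; subclasses only swap the loss.
        return self.calc_policy_loss(log_probs, old_log_probs, advantages)


class PPOSketch(ActorCriticSketch):
    """Reuses everything from the parent except the policy loss."""

    clip_eps = 0.2

    def calc_policy_loss(self, log_probs, old_log_probs, advantages):
        # Clipped surrogate objective, negated to form a loss.
        terms = []
        for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
            ratio = math.exp(lp - old_lp)
            clipped = max(1 - self.clip_eps, min(ratio, 1 + self.clip_eps))
            terms.append(min(ratio * adv, clipped * adv))
        return -sum(terms) / len(terms)


# A probability ratio of 2 with a positive advantage is clipped to 1 + clip_eps.
print(PPOSketch().train_step([math.log(0.8)], [math.log(0.4)], [1.0]))
```

Because `train_step` lives in the parent class, any improvement or test written for it applies to both algorithms.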


**Simplicity:** the lab components are designed to closely correspond to the way papers or books discuss RL.
This minimizes the burden of translating ideas to implementations, or understanding them. We understand
that a modular library is not necessarily simple, as it may be over-engineered. Simplicity serves to balance
modularity to prevent overly complex abstractions that are difficult to understand and use.

```python
class Algorithm(ABC):
    '''
    Abstract class ancestor to all Algorithms,
    specifies the necessary design blueprint for agent to work in Lab.
    Mostly, implement just the abstract methods and properties.
    '''
```
*Example - All reinforcement learning algorithms in SLM Lab inherit from the Algorithm class* (base.py)
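As a rough illustration of how an abstract base class can serve as a blueprint, the sketch below uses Python's `abc` module. The method names `act` and `train` and the classes `AlgorithmSketch` and `RandomAlgorithm` are hypothetical; they are chosen for this example and should not be read as the lab's real interface.

```python
import random
from abc import ABC, abstractmethod


class AlgorithmSketch(ABC):
    """Hypothetical blueprint: subclasses must implement act() and train()."""

    @abstractmethod
    def act(self, state):
        """Return an action for the given state."""

    @abstractmethod
    def train(self, batch):
        """Update parameters from a batch and return the loss."""


class RandomAlgorithm(AlgorithmSketch):
    """Smallest possible concrete algorithm: acts uniformly at random."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)

    def act(self, state):
        return self.rng.randrange(self.num_actions)

    def train(self, batch):
        return 0.0  # nothing to learn


# The abstract class itself cannot be instantiated.
try:
    AlgorithmSketch()
except TypeError as err:
    print('cannot instantiate:', err)

print(RandomAlgorithm(num_actions=2).act(state=None))
```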

**Analytical clarity**: experiment results are automatically analyzed and presented hierarchically in increasingly
granular detail. A big experiment with multiple trials contains a lot of data. With hierarchical
results, one can quickly look at the experiment graph (Figure 2) to determine if it was a success overall, then
pick the best trials and drill down to their trial (Figure 1b) and session (Figure 1a) graphs, which contain
lower-level details such as reward and loss.

![alt text](https://i.imgur.com/KCXIpnx.png)

**Reproducibility:** a long term aspiration in deep RL is that all research is reproducible. To do so, one
needs the source code, hyperparameters, and result data. Using SLM Lab, an experiment can be fully
reproduced using a spec file, which exposes all the hyperparameters, and a git SHA, which allows one to
check out the version of the code used to run it. These and the experiment results are submitted to the lab
via a Pull Request. The complete experiment data is uploaded to a public cloud storage for open access,
and the result is tracked in the lab’s benchmark page.

```JSON
{
  "dqn_cartpole": {
    "agent": [{
      "name": "DQN",
      "algorithm": {
        "name": "DQN",
        "action_pdtype": "Argmax",
        "action_policy": "epsilon_greedy",
        "action_policy_update": "linear_decay",
        "explore_var_start": 1.0,
        "explore_var_end": 0.1,
        "explore_anneal_epi": 20,
        "gamma": 0.99,
        "training_batch_epoch": 10,
        "training_epoch": 4,
        "training_frequency": 8,
        "training_min_timestep": 32,
        "normalize_state": true
      },
      "memory": {
        "name": "Replay",
        "batch_size": 32,
        "max_size": 10000,
        "use_cer": true
      },
      "net": {
        "type": "MLPNet",
        "hid_layers": [64],
        "hid_layers_activation": "selu",
        "clip_grad": false,
        "clip_grad_val": 1.0,
        "loss_spec": {
          "name": "MSELoss"
        },
        "optim_spec": {
          "name": "Adam",
          "lr": 0.002
        },
        "lr_decay": "rate_decay",
        "lr_decay_frequency": 1000,
        "lr_decay_min_timestep": 1000,
        "lr_anneal_timestep": 100000,
        "update_type": "polyak",
        "update_frequency": 1,
        "polyak_coef": 0,
        "gpu": false
      }
    }],
    "env": [{
      "name": "CartPole-v0",
      "max_timestep": null,
      "max_episode": 150,
      "save_epi_frequency": 50
    }],
    "body": {
      "product": "outer",
      "num": 1
    },
    "meta": {
      "distributed": false,
      "graph_x": "epi",
      "max_trial": 4,
      "max_session": 4,
      "search": "RandomSearch",
      "resources": {
        "num_cpus": 4,
        "num_gpus": 0
      }
    },
    "search": {
      "agent": [{
        "algorithm": {
          "gamma__choice": [0.95, 0.99]
        },
        "net": {
          "optim_spec": {
            "lr__choice": [0.001, 0.01]
          }
        }
      }]
    }
  }
}
```
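To make the `search` section of the spec more concrete, here is a minimal sketch of how keys ending in `__choice` could be turned into concrete hyperparameter values for one trial. It assumes that `__choice` means "pick one value from this list"; the helpers `sample_search` and `merge` are hypothetical and are not the lab's own search code, which relies on Ray Tune.

```python
import copy
import json
import random

# A trimmed-down version of the spec above, kept small for illustration.
spec_json = '''
{
  "agent": [{"algorithm": {"gamma": 0.99}, "net": {"optim_spec": {"lr": 0.002}}}],
  "search": {
    "agent": [{
      "algorithm": {"gamma__choice": [0.95, 0.99]},
      "net": {"optim_spec": {"lr__choice": [0.001, 0.01]}}
    }]
  }
}
'''


def sample_search(node, rng):
    """Recursively replace every `<key>__choice` entry with one sampled value."""
    if isinstance(node, list):
        return [sample_search(item, rng) for item in node]
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key.endswith('__choice'):
                out[key[:-len('__choice')]] = rng.choice(value)
            else:
                out[key] = sample_search(value, rng)
        return out
    return node


def merge(base, override):
    """Deep-merge sampled values into a copy of the base spec."""
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, value in override.items():
            merged[key] = merge(base.get(key), value)
        return merged
    if isinstance(base, list) and isinstance(override, list):
        return [merge(b, o) for b, o in zip(base, override)]
    return copy.deepcopy(override)


spec = json.loads(spec_json)
sampled = sample_search(spec['search'], random.Random(0))
trial_spec = merge({'agent': spec['agent']}, sampled)
print(json.dumps(trial_spec, indent=2))
```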

The main contributions of the lab can be grouped into three parts. First, the lab implements most of
the canonical algorithms in deep RL. This is enabled by a standardized modular design with well-tested,
reusable components. Second, SLM Lab provides an automated experiment and analytics framework. The
framework is organized into three tiers: experiment, trial, and session. It incorporates hyperparameter
search, distributed training, replication with random seeds, automated analysis, and introduces a fitness
metric. Third, it integrates with OpenAI Gym and Unity ML-Agents, and has an API for adding more
environments in the future.



# 2 - Features

## 2.1 - Algorithms

SLM Lab implements many reinforcement learning algorithms, listed below:
• SARSA, DQN[[13]](http://arxiv.org/abs/1312.5602), DDQN[[3]](http://arxiv.org/abs/1509.06461), DuelingDQN[[30]](http://arxiv.org/abs/1511.06581), and their asynchronous versions[[11]](http://dl.acm.org/citation.cfm?id=3045390.3045594)

• Prioritized Experience Replay[[24]](http://arxiv.org/abs/1511.05952), Combined Experience Replay[[33]](http://arxiv.org/abs/1712.01275)

• REINFORCE[[32]](https://doi.org/10.1007/BF00992696), A2C, PPO[[25]](http://arxiv.org/abs/1707.06347), and their asynchronous versions including A3C[[11]](http://dl.acm.org/citation.cfm?id=3045390.3045594) and DPPO[[4]](http://arxiv.org/abs/1707.02286)

• (A2C)SIL and PPOSIL[[15]](https://arxiv.org/abs/1806.05635)

Implementations are organized into standardized modular components that closely correspond to how they
are discussed in deep RL papers and books: algorithm, network, replay memory, and policy. The modular
design does not compromise simplicity, making it efficient to translate between the code and theory.

These components are grouped under an Agent class, which handles interaction with the environments.
The Algorithm class is the primary component: it controls the other components and their interactions, computes the algorithm-specific
loss functions, and runs the training step. Under it, the Net class serves as the function approximator for
the algorithm, while the Memory class provides the necessary data storage and retrieval for training. Policy
is a generalized abstraction which takes a network output layer to construct a probability distribution used
for sampling actions and calculating log probabilities and entropies.
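The role of the Policy abstraction can be sketched for the discrete case without any deep learning library. In this hypothetical example, `CategoricalSketch` turns raw network outputs (logits) into a probability distribution that supports sampling, log probabilities, and entropy. A PyTorch implementation would more likely use a distribution class from `torch.distributions`; the pure-Python version is used here only to keep the example self-contained.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class CategoricalSketch:
    """Hypothetical discrete action distribution built from network outputs."""

    def __init__(self, logits):
        self.probs = softmax(logits)

    def sample(self, rng):
        actions = range(len(self.probs))
        return rng.choices(actions, weights=self.probs, k=1)[0]

    def log_prob(self, action):
        return math.log(self.probs[action])

    def entropy(self):
        return -sum(p * math.log(p) for p in self.probs if p > 0)


dist = CategoricalSketch([2.0, 1.0, 0.1])  # pretend these came from a network
action = dist.sample(random.Random(0))
print(action, dist.log_prob(action), dist.entropy())
```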

The implemented Net classes are MLPNet, HydraMLPNet, DuelingMLPNet, RecurrentNet, ConvNet, and
DuelingConvNet. There are two types of Memory classes: off-policy replay (Replay, SeqReplay, SILReplay, SILSeqReplay,
ConcatReplay, AtariReplay, PrioritizedReplay, and AtariPrioritizedReplay)
and on-policy replay (OnPolicyReplay, OnPolicySeqReplay, OnPolicyBatchReplay, OnPolicySeqBatchReplay, and OnPolicyAtariReplay).
The policy module handles discrete and continuous control, with
exploration methods such as epsilon-greedy and Boltzmann.
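For readers unfamiliar with the two exploration methods named above, the following sketch shows one common way to write them for discrete actions, together with a linear decay schedule for the exploration variable. The function names and signatures are hypothetical, and the assumption that the spec fields `explore_var_start`, `explore_var_end`, and `explore_anneal_epi` feed a schedule like `linear_decay` is an interpretation of their names rather than something confirmed by the lab's code.

```python
import math
import random


def linear_decay(start, end, anneal_steps, step):
    """Linearly anneal from `start` to `end` over `anneal_steps`, then hold `end`."""
    frac = min(step / anneal_steps, 1.0)
    return (1 - frac) * start + frac * end


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon act randomly, otherwise pick the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann(q_values, tau, rng):
    """Sample actions with probability proportional to exp(q / tau)."""
    shift = max(q_values)  # shift by the max for numerical stability
    weights = [math.exp((q - shift) / tau) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
for episode in (0, 10, 20):
    epsilon = linear_decay(1.0, 0.1, 20, episode)
    print(episode, epsilon, epsilon_greedy(q_values, epsilon, rng))
print(boltzmann(q_values, tau=1.0, rng=rng))
```

A high temperature `tau` makes Boltzmann sampling close to uniform, while a very low one makes it nearly greedy.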

Most of the above are implemented concisely using class inheritance. Furthermore, the lab provides
a standard API for them to interact and integrate. These together make it easy to implement a new
component and have it immediately apply to all relevant algorithms. As a result, components are maximally
reused and enjoy the benefits of shorter code, more tests, and fewer bugs. With this, the workflow becomes
lightweight and fast with the lab, as one can trust the components and focus only on those under research
or development.

To illustrate, the hogwild asynchronous training of A3C[[11]](http://dl.acm.org/citation.cfm?id=3045390.3045594) is implemented by constructing and updating
the global networks using the Net API, so it applies to all the algorithms in the lab. For example, one can
turn A2C into A3C by specifying `"distributed": true` in the spec file, as shown in appendix listing 2. PPO + SIL is implemented using multiple inheritance, by symbolically calling the parent classes SIL and PPO without any newly written code. Likewise, listing 1 shows how AtariPrioritizedReplay inherits
from the existing PrioritizedReplay and AtariReplay, again without new code.
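The "no new code" composition can be illustrated with Python's multiple inheritance and method resolution order. The classes below are simplified, hypothetical stand-ins (reward clipping is used as a placeholder for Atari-specific preprocessing) and are not the lab's actual memory classes.

```python
class ReplaySketch:
    """Hypothetical base memory that stores preprocessed transitions."""

    def __init__(self):
        self.transitions = []

    def preprocess(self, state, reward):
        return state, reward

    def add(self, state, reward):
        self.transitions.append(self.preprocess(state, reward))


class PrioritizedSketch(ReplaySketch):
    """Adds a priority per transition; storage is delegated to the parent."""

    def __init__(self):
        super().__init__()
        self.priorities = []

    def add(self, state, reward, priority=1.0):
        super().add(state, reward)
        self.priorities.append(priority)


class AtariSketch(ReplaySketch):
    """Changes only preprocessing: clip rewards to [-1, 1]."""

    def preprocess(self, state, reward):
        return state, max(-1.0, min(1.0, reward))


class AtariPrioritizedSketch(PrioritizedSketch, AtariSketch):
    """Combines both behaviours without writing any new code."""


memory = AtariPrioritizedSketch()
memory.add('frame', reward=5.0, priority=2.0)
print(memory.transitions, memory.priorities)  # [('frame', 1.0)] [2.0]
print([cls.__name__ for cls in AtariPrioritizedSketch.__mro__])
```

The combined class picks up `add` from the prioritized parent and `preprocess` from the Atari parent, which is why the class body can stay empty.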

Such implementation techniques have saved tremendous time and effort when developing with the lab,
and have maximized the benefits of existing reliable components. This has been the main enabling factor
for expanding the lab to include this many algorithms and components.




```python
# Example Demo - A Deep Q Network learning to balance using the CartPole environment
# (Colab/IPython cell: lines starting with ! run shell commands, % runs a magic)

# Step 1 - Download repository
!git clone https://github.com/theschoolai/SLM-Lab

# Step 2 - Install Anaconda
!wget -c https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
!chmod +x Anaconda3-5.1.0-Linux-x86_64.sh
!bash ./Anaconda3-5.1.0-Linux-x86_64.sh -b -f -p /usr/local
!conda install

# Step 3 - Run the install script
%cd SLM-Lab
!bin/setup

# Step 4 - Install remaining dependencies
!pip install torch
!pip install pydash
!pip install regex
!pip install ujson
!pip install pyyaml  # the PyPI package that provides the `yaml` module
!pip install colorlog
!pip install pandas
!pip install gym
!pip install unityagents
!pip install plotly
!pip install colorlover
!pip install deap
!pip install ray==0.3.1
!pip install box2d
!apt-get -qq install -y libsm6 libxext6 && pip install -q -U opencv-python

# Step 5 - Run the Demo
# Each ! line runs in its own subshell, so activate the environment and run in a single command
!bash -c "source activate lab && python run_lab.py"
```

## 2.2 - Experimentation & Analysis

SLM Lab provides an automated experimentation and analytics framework. The main features include:

• hyperparameter search using Ray Tune[[14, 10]](http://arxiv.org/abs/1712.05889)

• distributed training which automatically scales to the available CPU and GPU resources

• automatic replication of training with different random seeds for the same hyperparameter settings

• automatic analysis and plotting of results. These are presented at multiple levels of granularity. Summarizing
results in this way makes it easy and fast to get an overview of an experiment and drill down
into detailed results if desired

• agent fitness metric. Instead of considering just the total reward, the lab also measures the convergence
speed, stability and consistency to provide a richer set of metrics

**Framework components** The experiment framework is organized hierarchically into three components:


**1. Session:** A session is the lowest level of SLM Lab. It initializes the agents and environments, and
runs the control loop in a single process. Each session runs for some defined maximum episodes; each
episode runs for some maximum number of timesteps before resetting the environments. A session is
associated with an algorithm plus a particular set of hyperparameter values and a distinct random seed.
The output of a session is an instance of a trained agent, with a particular set of learned parameters.

```python
class Session:
    '''
    The base unit of instantiated RL system.
    Given a spec,
    session creates agent(s) and environment(s),
    run the RL system and collect data, e.g. fitness metrics, till it ends,
    then return the session data.
    '''
```


**2. Trial:** A trial holds a fixed configuration (a spec) which is an algorithm plus a particular set of
hyperparameter values. A trial uses its spec to run multiple sessions with different random seeds to
measure reproducibility. It then analyzes the sessions and takes the average. If a trial’s configuration
is a stable solution, then all the sessions should solve the environment with similar performance. This
is the basic idea behind the “consistency” aspect of the fitness score.

```python
class Trial:
    '''
    The base unit of an experiment.
    Given a spec and number s,
    trial creates and runs s sessions,
    gather and aggregate data from sessions as trial data,
    then return the trial data.
    '''
```

**3. Experiment:** An experiment is the highest level of SLM Lab’s experiment framework. It can be
thought of as a study, e.g. “what values of gamma and learning rate provide the fastest, most stable
solution, while the other variables are held constant?” The input variables are encoded as the hyperparameters,
and the outcome is measured by the fitness vector. An experiment runs a trial for every
set of parameters generated by a hyperparameter search algorithm (random search for example), and
a trial runs multiple replicated, seeded sessions to average the results. The search then looks for the
hyperparameters that yield maximum fitness for a given experiment. The hyperparameters to search
over, their range, and sampling method, are specified in a spec file.

```python
class Experiment:
    '''
    The core high level unit of Lab.
    Given a spec-space/generator of cardinality t,
    a number s,
    a hyper-optimization algorithm hopt(spec, fitness-metric) -> spec_next/null
    experiment creates and runs up to t trials of s sessions each to optimize (maximize) the fitness metric,
    gather the trial data,
    then return the experiment data for analysis and use in evolution graph.
    Experiment data will include the trial data, notes on design, hypothesis, conclusion, analysis data, e.g. fitness metric, evolution link of ancestors to potential descendants.
    An experiment then forms a node containing its data in the evolution graph with the evolution link and suggestion at the adjacent possible new experiments
    On the evolution graph level, an experiment and its neighbors could be seen as test/development of traits.
    '''
```
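The three tiers can be summarized in a toy sketch: an experiment samples specs, each trial runs several seeded sessions for one spec, and each session produces a score. Everything here is hypothetical, including the function names and the made-up scoring rule in `run_session`, which merely stands in for actually training an agent.

```python
import random
import statistics


def run_session(spec, seed):
    """Hypothetical session: one seeded run that returns a noisy score."""
    rng = random.Random(seed)
    # Toy stand-in for training: the score peaks when gamma is 0.99.
    return 1.0 - abs(spec['gamma'] - 0.99) * 10 + rng.uniform(-0.05, 0.05)


def run_trial(spec, num_sessions):
    """Run several sessions with different seeds and aggregate them."""
    scores = [run_session(spec, seed) for seed in range(num_sessions)]
    return {
        'spec': spec,
        'mean': statistics.mean(scores),
        'stdev': statistics.pstdev(scores),  # spread across seeds
    }


def run_experiment(search_space, num_trials, num_sessions, seed=0):
    """Random search: sample a spec per trial, then rank trials by mean score."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        spec = {key: rng.choice(values) for key, values in search_space.items()}
        trials.append(run_trial(spec, num_sessions))
    return sorted(trials, key=lambda trial: trial['mean'], reverse=True)


results = run_experiment({'gamma': [0.95, 0.99]}, num_trials=4, num_sessions=4)
for trial in results:
    print(trial['spec'], round(trial['mean'], 3), round(trial['stdev'], 3))
```

A small spread across seeds within a trial corresponds to the "consistency" idea mentioned for trials above.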

The organization above, when combined with the standard modular component design, helps expose all
the hyperparameters of an algorithm using a JSON spec file. This allows for wider search options and
reduces the number of hidden hyperparameters, which have contributed to the problem of reproducibility. As
a result, an experiment can be fully reproduced in SLM Lab using just the JSON spec file and the correct
git commit of the code repository, which is specified by the git SHA that is recorded automatically in the
generated spec during an experiment.

**Experiment data**

Experiments generate data which is automatically saved to the data/(experiment-id) folder. Each session generates a plot of the rewards and loss per episode, a table of metrics per episode
(e.g., reward, loss, exploration variable value, episode length), a session’s fitness, and checkpoints of the
algorithm models. See Figure 1a for an example.

![alt text](https://i.imgur.com/5Oc5KCP.png)

Figure 1: Session and trial graphs for A2C in the CartPole-v0[[2]](https://arxiv.org/abs/1606.01540) environment

Each trial generates a plot of the average reward per episode plus error bars, a spec file detailing the
specific hyperparameter values, and a trial’s fitness. See Figure 1b for an example.

Each experiment generates a summary plot of each trial’s fitness for each of the different hyperparameters
searched over. This plot makes it clear which parameters an algorithm is sensitive to, and which values
work best. The multi-dimensional fitness metric also allows one to measure the effect on convergence speed,
stability and consistency. It is immediately clear from this graph whether any trials succeeded. See Figure
2 for an example.

The experiment df CSV file contains a summary of each trial sorted from highest to lowest fitness. It is
intended to clearly show the hyperparameter value ranges and combinations from the most successful to the
least. Finally, each experiment saves a spec file detailing the search specifications and other hyperparameter
settings, along with the SHA of the git commit used to run it. Together, these encapsulate all the information needed
to reproduce an experiment.

All of these reproduction instructions and experiment data can be contributed back to the SLM Lab via a
GitHub Pull Request, which follows the format of a scientific report to encourage rigor and reproducibility.
The complete experiment data is uploaded to a public cloud storage for open access, and the result is tracked
in the lab’s benchmark page. Both of them are linked from the SLM Lab website.

![alt text](https://i.imgur.com/yZnvUWs.png)

Figure 2: Experiment graph: A2C performance on CartPole-v0[[2]](https://arxiv.org/abs/1606.01540), with a high level view of the effects of
the searched hyperparameters on various fitness metrics

## 2.3 Fitness metric

The lab introduces a vectorial fitness metric as a richer measurement of an agent’s performance. It measures
an agent’s strength (in terms of rewards), speed (how fast it attains its strength), stability (once learned, is
the solution stable), and consistency (how reproducible is the trial result). It serves a dual purpose in the
lab:

1. provide a standard metric to compare algorithms and environments
2. provide a value for the hyperparameter optimizer to maximize

The fitness metric was motivated by the fact that it is common to perform hundreds or thousands of hyperparameter
searches in the course of DRL research, even when working on a single algorithm or environment. However, it is not always straightforward to conclude which set of hyperparameter values is best. Which is “best” depends on what aspect of performance you are looking at. Does the agent achieve high reward?
Does it learn fast? Is learning stable? Is learning reproducible? Reinforcement learning research typically
reports on the first two metrics: the total rewards achieved and the speed of learning (number of time steps
to achieve a result). Increasingly, the variance of the rewards over time is also reported. However, it is
not common to measure the stability during training. Furthermore, measuring results using un-normalized
rewards makes it difficult to compare agents across environments because the units are not comparable. It
also acts as an obstacle to developing a measure of environment difficulty.

What then are the desirable properties of a performance, or fitness, metric? The authors propose the
following properties:

• **multi-dimensional:** to measure multiple metrics

• **standardizable:** to allow comparison among different experiments

• **extensible, backward compatible:** when new metrics are added, the old vector becomes a special
case, similar to how x,y is a restriction of x,y,z space. It is not necessary to re-run experiments when
fitness is updated.

• **reducible to a single number:** fitness = L2 norm of the fitness vector.

• **magnification:** under L2 norm, higher fitness vectors have bigger differences than lower fitness vectors,
e.g.

>$2^2 - 1.9^2 > 0.9^2 - 0.8^2$

• **easy to understand:** with familiar, simple definitions, and a nice geometric interpretation

A *fitness vector* satisfies all of the above. Given a fitness vector, we can also ensure that every dimension
(and the resultant norm fitness) has these properties:

• the value is **0 for a random agent**

• the value is **1 for a standard, universal baseline**

• the value is **higher for better** fitness; the scale should not be too extreme

SLM Lab proposes a metric that satisfies these properties, the fitness metric. Appendix A details how
SLM Lab’s fitness vector and the resulting fitness metric are calculated.
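Some of the properties listed in this section can be checked with a few lines of arithmetic. The sketch below is not the calculation from Appendix A: the `normalize` helper is an assumed linear rescaling that merely satisfies the "0 for a random agent, 1 for a baseline" property, and the reward numbers passed to it are made up for illustration.

```python
import math


def normalize(value, random_value, baseline_value):
    """Assumed rescaling: a random agent maps to 0 and the baseline to 1."""
    return (value - random_value) / (baseline_value - random_value)


def fitness(vector):
    """Reduce a fitness vector to a single number with the L2 norm."""
    return math.sqrt(sum(x * x for x in vector))


# Reducible to a single number.
print(fitness([3.0, 4.0]))  # 5.0

# Extensible and backward compatible: an old 2D vector is the special case
# of a 3D vector whose new dimension is 0, so its fitness is unchanged.
print(fitness([3.0, 4.0, 0.0]))  # 5.0

# Magnification: the same 0.1 step is worth more at higher values.
print(2**2 - 1.9**2 > 0.9**2 - 0.8**2)  # True

# Normalization with made-up rewards: random agent 20, baseline 195.
print(normalize(20.0, 20.0, 195.0), normalize(195.0, 20.0, 195.0))  # 0.0 1.0
```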

## 2.4 Environments

Currently, SLM Lab integrates with environments from OpenAI Gym[[2]](https://arxiv.org/abs/1606.01540) and Unity ML-Agents[[7]](https://arxiv.org/abs/1809.02627). Even
though the interface methods of these two are different, they are standardized into the environment API
of the lab using a lightweight Env wrapper class. The same technique can also be applied to add new
environments into the lab in the future.

The lab makes use of the environment properties to automatically infer the cardinality and shape of
the state and action spaces, as well as the type of actions, such as discrete, continuous, or a hybrid from
multi-environments. This information is further used by the aforementioned policy module to construct
the proper probability distribution for sampling actions.
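A minimal sketch of this kind of inference is shown here. The space classes `DiscreteSpace` and `BoxSpace` are hypothetical stand-ins defined in the example itself (loosely modelled on the discrete and box spaces found in Gym-style environments), and the mapping to the distribution names `Categorical` and `Normal` is an assumption about what a policy module might choose.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DiscreteSpace:
    """Hypothetical discrete space with `n` possible actions."""
    n: int


@dataclass
class BoxSpace:
    """Hypothetical continuous space described by its shape."""
    shape: Tuple[int, ...]


def infer_action_info(space):
    """Infer action type, dimension, and a suitable distribution name."""
    if isinstance(space, DiscreteSpace):
        return {'type': 'discrete', 'dim': space.n, 'pdtype': 'Categorical'}
    if isinstance(space, BoxSpace):
        dim = 1
        for size in space.shape:
            dim *= size  # total number of continuous action components
        return {'type': 'continuous', 'dim': dim, 'pdtype': 'Normal'}
    raise TypeError(f'unsupported action space: {space!r}')


print(infer_action_info(DiscreteSpace(n=2)))
print(infer_action_info(BoxSpace(shape=(3,))))
```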

Integrating multiple environments greatly expands the testing ground for all the algorithms in the lab,
allowing us to study their behaviors under greater variation. An interesting consequence of standardizing
the environment APIs is that the lab can create composite environments, such as one from OpenAI Gym
and one from Unity ML-Agents, to let a multitask agent solve both simultaneously. Moreover, the lab has a
Body class which handles the interface between agent and environment. This is especially useful for keeping
track of the interfaces in multi-agent, multi-environment settings, which the lab also supports.

## 2.5 Python Package

Another feature of SLM Lab is portability. Its algorithms and components can be imported and used outside
of the framework. This makes it possible to develop with SLM Lab then deploy the agent elsewhere in a
software application, with minimal effort. To do so, one simply needs to install SLM Lab as a pip module by
running `pip install -e .` at the repository root, and import any of its components as a normal Python
module. This is further detailed in the documentation of the lab.

# 3 - Benchmarks

TODO these numbers will come in soon

# 4 - Future work

We hope to continue the work on SLM Lab in several directions:

1. Expand the benchmark results. All the algorithms will gradually be benchmarked on more environments,
including the Atari suite and robotics tasks.

2. Implement new algorithms and components. Having more components to build with while maintaining
ease of use is a strength of the lab, and it will encourage the development of more advanced algorithms.

3. Research and application. We hope to support lab users in publishing research and deploying DRL in
industrial applications.

# 5 - Acknowledgements

We thank the School of AI for their advice and support. We also thank Michael Cvitkovic from Caltech for
his help and contributions.

# References

[1] Marcin Andrychowicz et al. “Learning Dexterous In-Hand Manipulation”. In: (2018). url: https://arxiv.org/abs/1808.00177.

[2] Greg Brockman et al. OpenAI Gym. 2016. eprint: arXiv:1606.01540.

[3] Hado van Hasselt, Arthur Guez, and David Silver. “Deep Reinforcement Learning with Double Q-learning”.
In: CoRR abs/1509.06461 (2015). arXiv: 1509.06461. url: http://arxiv.org/abs/1509.06461.

[4] Nicolas Heess et al. “Emergence of Locomotion Behaviours in Rich Environments”. In: CoRR abs/1707.02286
(2017). arXiv: 1707.02286. url: http://arxiv.org/abs/1707.02286.

[5] Peter Henderson et al. “Deep Reinforcement Learning that Matters”. In: CoRR abs/1709.06560 (2017).
arXiv: 1709.06560. url: http://arxiv.org/abs/1709.06560.

[6] Alex Irpan. Deep Reinforcement Learning Doesn’t Work Yet. https://www.alexirpan.com/2018/02/14/rl-hard.html. 2018.

[7] Arthur Juliani et al. Unity: A General Platform for Intelligent Agents. 2018. eprint: arXiv:1809.02627.

[8] Dmitry Kalashnikov et al. “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic
Manipulation”. In: CoRR abs/1806.10293 (2018). arXiv: 1806.10293. url: http://arxiv.org/abs/1806.10293.

[9] Will Knight. Google just gave control over data center cooling to an AI. https://www.technologyreview.com/s/611902/google-just-gave-control-over-data-center-cooling-to-an-ai/. 2018.

[10] Richard Liaw et al. “Tune: A Research Platform for Distributed Model Selection and Training”. In:
arXiv preprint arXiv:1807.05118 (2018).

[11] Volodymyr Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. In: Proceedings
of the 33rd International Conference on Machine Learning - Volume 48.
ICML’16. New York, NY, USA: JMLR.org, 2016, pp. 1928–1937. url: http://dl.acm.org/citation.cfm?id=3045390.3045594.

[12] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540
(Feb. 2015), pp. 529–533. issn: 00280836. url: http://dx.doi.org/10.1038/nature14236.

[13] Volodymyr Mnih et al. “Playing Atari with Deep Reinforcement Learning”. In: CoRR abs/1312.5602
(2013). arXiv: 1312.5602. url: http://arxiv.org/abs/1312.5602.

[14] Philipp Moritz et al. “Ray: A Distributed Framework for Emerging AI Applications”. In: CoRR
abs/1712.05889 (2017). arXiv: 1712.05889. url: http://arxiv.org/abs/1712.05889.

[15] Junhyuk Oh et al. “Self-Imitation Learning”. In: (2018). url: https://arxiv.org/abs/1806.05635.

[16] OpenAI. More on DotA 2. 2017. url: https://blog.openai.com/more-on-dota-2/.

[17] OpenAI. OpenAI Five. 2018. url: https://blog.openai.com/openai-five/.

[18] OpenAI. OpenAI Five Benchmark: Results. 2018. url: https://blog.openai.com/openai-five-benchmark-results/.

[19] Romain Paulus, Caiming Xiong, and Richard Socher. “A Deep Reinforced Model for Abstractive
Summarization”. In: CoRR abs/1705.04304 (2017). arXiv: 1705.04304. url: http://arxiv.org/abs/1705.04304.

[20] PyTorch. 2018. url: https://github.com/pytorch/pytorch.

[21] Matthew Rahtz. Lessons Learned Reproducing a Deep Reinforcement Learning Paper. http://amid.fish/reproducing-deep-rl. 2018.

[22] Fereshteh Sadeghi. Teaching Uncalibrated Robots to Visually Self-Adapt. https://ai.googleblog.com/2018/06/teaching-uncalibrated-robots-to_22.html. 2018.

[23] Fereshteh Sadeghi and Sergey Levine. “CAD2RL: Real Single-Image Flight without a Single Real
Image”. In: CoRR abs/1611.04201 (2016). arXiv: 1611.04201. url: http://arxiv.org/abs/1611.04201.

[24] Tom Schaul et al. “Prioritized Experience Replay”. In: CoRR abs/1511.05952 (2015). arXiv: 1511.05952. url: http://arxiv.org/abs/1511.05952.

[25] John Schulman et al. “Proximal Policy Optimization Algorithms”. In: CoRR abs/1707.06347 (2017).
arXiv: 1707.06347. url: http://arxiv.org/abs/1707.06347.

[26] David Silver et al. “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning
Algorithm”. In: arXiv preprint arXiv:1712.01815 (2017).

[27] David Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature
529.7587 (2016), pp. 484–489.

[28] David Silver et al. “Mastering the game of Go without human knowledge”. In: Nature 550.7676 (2017),
p. 354.

[29] Gerald Tesauro. “Temporal Difference Learning and TD-Gammon”. In: Commun. ACM 38.3 (Mar.
1995), pp. 58–68. issn: 0001-0782. doi: 10.1145/203330.203343. url: http://doi.acm.org/10.1145/203330.203343.

[30] Ziyu Wang, Nando de Freitas, and Marc Lanctot. “Dueling Network Architectures for Deep Reinforcement
Learning”. In: CoRR abs/1511.06581 (2015). arXiv: 1511.06581. url: http://arxiv.org/abs/1511.06581.

[31] B. Widrow, N. K. Gupta, and S. Maitra. “Punish/Reward: Learning with a Critic in Adaptive Threshold
Systems”. In: IEEE Transactions on Systems, Man, and Cybernetics SMC-3.5 (1973), pp. 455–465.
issn: 0018-9472. doi: 10.1109/TSMC.1973.4309272.

[32] Ronald J. Williams. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement
Learning”. In: Mach. Learn. 8.3-4 (May 1992), pp. 229–256. issn: 0885-6125. doi: 10.1007/BF00992696.
url: https://doi.org/10.1007/BF00992696.

[33] Shangtong Zhang and Richard S. Sutton. “A Deeper Look at Experience Replay”. In: CoRR abs/1712.01275
(2017). arXiv: 1712.01275. url: http://arxiv.org/abs/1712.01275.