A collection of Reinforcement Learning agents



pip install --user git+https://github.com/eleurent/rl-agents


Most experiments can be run from scripts/experiments.py:

Usage:
  experiments evaluate <environment> <agent> (--train|--test)
                                             [--episodes <count>]
                                             [--seed <str>]
  experiments benchmark <benchmark> (--train|--test)
                                    [--processes <count>]
                                    [--episodes <count>]
                                    [--seed <str>]
  experiments -h | --help

Options:
  -h --help            Show this screen.
  --analyze            Automatically analyze the experiment results.
  --episodes <count>   Number of episodes [default: 5].
  --processes <count>  Number of running processes [default: 4].
  --seed <str>         Seed the environments and agents.
  --train              Train the agent.
  --test               Test the agent.

The evaluate command evaluates a given agent on a given environment. For instance:

# Train a DQN agent on the CartPole-v0 environment
$ python3 experiments.py evaluate envs/cartpole.json agents/dqn.json --train --episodes=200

The environments are described by their gym registration id (for example, CartPole-v0).


The agents are described by their class and a configuration dictionary.

{
    "__class__": "<class 'rl_agents.agents.dqn.pytorch.DQNAgent'>",
    "model": {
        "type": "DuelingNetwork",
        "layers": [512, 512]
    },
    "gamma": 0.99,
    "n_steps": 1,
    "batch_size": 32,
    "memory_capacity": 50000,
    "target_update": 1,
    "exploration": {
        "method": "EpsilonGreedy",
        "tau": 50000,
        "temperature": 1.0,
        "final_temperature": 0.1
    }
}

If keys are missing from these configurations, default values will be used instead.
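
The fallback to default values can be pictured as a recursive dictionary merge. The snippet below is a minimal sketch of that idea, not the library's actual code: the helper name merge_with_defaults and the default values are hypothetical and chosen only for illustration.

```python
import copy


def merge_with_defaults(defaults, config):
    """Return a copy of `defaults` overridden, key by key, by `config`.

    Hypothetical helper for illustration. Nested dictionaries are merged
    recursively, so a partially specified sub-dictionary keeps the
    defaults of the keys it omits.
    """
    merged = copy.deepcopy(defaults)
    for key, value in config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_with_defaults(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative defaults (assumed values, not taken from the library).
DEFAULTS = {
    "gamma": 0.99,
    "batch_size": 32,
    "exploration": {"method": "EpsilonGreedy", "tau": 50000},
}

# A user configuration that overrides only some of the keys.
user_config = {"gamma": 0.95, "exploration": {"tau": 10000}}

config = merge_with_defaults(DEFAULTS, user_config)
print(config)
```

Here "batch_size" and "exploration.method" are absent from the user configuration and therefore keep their default values.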

Finally, a batch of experiments can be scheduled in a benchmark. All experiments are then executed in parallel on several processes.

# Run a benchmark of several agents interacting with environments
$ python3 experiments.py benchmark cartpole_benchmark.json --test --processes=4

A benchmark configuration file contains a list of environment configurations and a list of agent configurations.

{
    "environments": ["envs/cartpole.json"],
    "agents": ["agents/dqn.json", "agents/mcts.json"]
}

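One plausible reading of such a file is that every environment is paired with every agent, and each pair becomes one experiment. The sketch below enumerates these pairs; the function name list_experiments is hypothetical, and the assumption that the benchmark takes the full Cartesian product is ours rather than something stated above.

```python
import itertools
import json

# Benchmark configuration with the two lists described above.
benchmark_json = """
{
    "environments": ["envs/cartpole.json"],
    "agents": ["agents/dqn.json", "agents/mcts.json"]
}
"""


def list_experiments(benchmark):
    """Pair every environment configuration with every agent configuration.

    Hypothetical helper: assumes a benchmark is the Cartesian product
    of its two lists.
    """
    return list(itertools.product(benchmark["environments"], benchmark["agents"]))


experiments = list_experiments(json.loads(benchmark_json))
for environment, agent in experiments:
    # Each pair could then be handed to one of the worker processes.
    print(f"evaluate {environment} {agent}")
```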

The following agents are currently implemented:


Value Iteration

Performs Value Iteration to compute the state-action value function, and acts greedily with respect to it.

Only compatible with finite-mdp environments, or environments that provide an env.to_finite_mdp() conversion method.

Reference: Dynamic Programming, Bellman R., Princeton University Press (1957).
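
As an illustration of the idea, here is a minimal pure-Python sketch of Value Iteration on a toy two-state MDP. The data layout (nested lists indexed by state and action) and the function names are assumptions made for this example and do not reflect the agent's actual interface.

```python
def value_iteration(transition, reward, gamma=0.9, iterations=500):
    """Compute state-action values q[s][a] of a finite MDP.

    transition[s][a] is a list of probabilities over next states,
    reward[s][a] is the immediate reward for taking action a in state s.
    """
    n_states = len(transition)
    n_actions = len(transition[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iterations):
        # State values of the policy that is greedy for the current estimate.
        v = [max(q_s) for q_s in q]
        # Bellman optimality backup for every state-action pair.
        q = [
            [
                reward[s][a]
                + gamma * sum(p * v[s2] for s2, p in enumerate(transition[s][a]))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return q


def greedy_action(q, state):
    """Act greedily with respect to the state-action values."""
    return max(range(len(q[state])), key=lambda a: q[state][a])


# Toy MDP: action 0 stays in place, action 1 switches state.
# A reward of 1 is received whenever the next state is state 1.
transition = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1
]
reward = [
    [0.0, 1.0],  # state 0: stay -> 0, switch -> 1
    [1.0, 0.0],  # state 1: stay -> 1, switch -> 0
]

q = value_iteration(transition, reward)
print(greedy_action(q, 0), greedy_action(q, 1))
```

The greedy policy switches out of state 0 and then stays in state 1, which is the behaviour one would expect from the reward definition.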

Monte-Carlo Tree Search

Implemented as Upper Confidence Trees (UCT). A world transition model is leveraged for trajectory search. A search tree is expanded through a selection rule tailored to focus the search around the most promising moves, and leaves are evaluated by sampling.
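
The selection rule in UCT is commonly the UCB1 score: the mean value of a child plus an exploration bonus that shrinks as the child is visited more often. The snippet below is a sketch of that standard rule under assumed names and data layout (a dictionary mapping each action to its accumulated value and visit count); the agent's own implementation and exploration constant may differ.

```python
import math


def ucb_score(total_value, visits, parent_visits, exploration=math.sqrt(2)):
    """UCB1 score of a child node: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    mean_value = total_value / visits
    bonus = exploration * math.sqrt(math.log(parent_visits) / visits)
    return mean_value + bonus


def select_child(children, exploration=math.sqrt(2)):
    """Return the action whose child maximizes the UCB1 score.

    children maps each action to a (total_value, visits) pair.
    """
    parent_visits = sum(visits for _, visits in children.values())
    return max(
        children,
        key=lambda a: ucb_score(*children[a], parent_visits, exploration),
    )


children = {"left": (8.0, 10), "right": (1.5, 2)}
# "right" has a slightly lower mean but far fewer visits, so its bonus wins.
print(select_child(children))
# With no exploration bonus the choice is purely greedy.
print(select_child(children, exploration=0.0))
```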


Deterministic Optimistic Planning

Reference: Optimistic Planning for Deterministic Systems, Hren J., Munos R. (2008).

Open Loop Optimistic Planning

Reference: Open Loop Optimistic Planning, Bubeck S., Munos R. (2010).


Trailblazer

Reference: Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning, Grill J. B., Valko M., Munos R. (2017).

Robust planning

Robust Value Iteration

A list of possible finite-mdp models is provided in the agent configuration. The MDP ambiguity set is constrained to be rectangular: different models can be selected at every transition. The corresponding robust state-action value is computed so as to maximize the worst-case total reward.
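
A minimal sketch of this max-min backup is given below. Rectangularity appears as a minimum over models taken independently for every state-action pair. The function name, the nested-list layout, and the assumption that all models share one reward table are illustrative choices, not the agent's actual interface.

```python
def robust_value_iteration(models, reward, gamma=0.9, iterations=500):
    """Worst-case state-action values over a finite list of models.

    models is a list of transition tables, where model[s][a] is a list of
    probabilities over next states. reward[s][a] is assumed to be shared
    by all models.
    """
    n_states = len(reward)
    n_actions = len(reward[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iterations):
        v = [max(q_s) for q_s in q]  # the agent maximizes over actions
        q = [
            [
                reward[s][a]
                # The adversary picks the worst model for this (s, a) pair.
                + gamma * min(
                    sum(p * v[s2] for s2, p in enumerate(model[s][a]))
                    for model in models
                )
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return q


# Toy problem: state 0 is the start, state 1 is a good absorbing state
# (reward 1 per step), state 2 is a bad absorbing state (reward 0).
# In state 0, action 0 is "safe" (reward 0.5, stay) and action 1 is "risky".
model_a = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # risky always reaches state 1
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
model_b = [
    [[1.0, 0.0, 0.0], [0.0, 0.2, 0.8]],  # risky mostly reaches state 2
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
reward = [[0.5, 0.0], [1.0, 1.0], [0.0, 0.0]]

q_nominal = robust_value_iteration([model_a], reward)
q_robust = robust_value_iteration([model_a, model_b], reward)
print(q_nominal[0], q_robust[0])
```

Under model_a alone the risky action looks best in state 0, whereas the robust values prefer the safe action because model_b makes the risky one much worse in the worst case.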


Discrete Robust Optimistic Planning

The MDP ambiguity set is assumed to be finite, and is constructed from a list of modifiers to the true environment. The corresponding robust value is approximately computed by Deterministic Optimistic Planning so as to maximize the worst-case total reward.

Interval-based Robust Planning




Deep Q-Network (DQN)

A neural-network model is used to estimate the state-action value function, and the agent follows the policy that is greedy with respect to this estimate.

Implemented variants:

  • Double DQN
  • Dueling architecture
  • N-step targets
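
The three variants can be summarized by how they change the learning target or the network output. The snippet below is a framework-free sketch that follows the common formulations from the literature; the function names are hypothetical, and details such as the mean-based dueling aggregation are assumptions rather than a description of this repository's PyTorch code.

```python
def n_step_return(rewards, gamma):
    """Discounted sum of the n rewards collected along a transition."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def double_dqn_target(rewards, next_q_online, next_q_target, gamma, terminal=False):
    """n-step Double DQN target.

    The online network selects the next action and the target network
    evaluates it, which reduces the overestimation of a plain max.
    """
    target = n_step_return(rewards, gamma)
    if not terminal:
        best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
        target += gamma ** len(rewards) * next_q_target[best_action]
    return target


def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean of A(s, .)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


# One-step transition with reward 1 and gamma = 0.99.
target = double_dqn_target(
    rewards=[1.0],
    next_q_online=[0.2, 0.7],  # online network prefers action 1
    next_q_target=[0.5, 0.3],  # target network evaluates action 1 at 0.3
    gamma=0.99,
)
print(target)
print(dueling_q_values(1.0, [0.5, -0.5]))
```

With these numbers a plain DQN target would bootstrap from the larger target-network value 0.5, while the Double DQN target bootstraps from 0.3, the value of the action chosen by the online network.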