A collection of Reinforcement Learning agents


pip install --user git+


Most experiments can be run from scripts/

  experiments evaluate <environment> <agent> (--train|--test)
                                             [--episodes <count>]
                                             [--seed <str>]
  experiments benchmark <benchmark> (--train|--test)
                                    [--processes <count>]
                                    [--episodes <count>]
                                    [--seed <str>]
  experiments -h | --help

  -h --help            Show this screen.
  --analyze            Automatically analyze the experiment results.
  --episodes <count>   Number of episodes [default: 5].
  --processes <count>  Number of running processes [default: 4].
  --seed <str>         Seed the environments and agents.
  --train              Train the agent.
  --test               Test the agent.

The evaluate command evaluates a given agent on a given environment. For instance:

# Train a DQN agent on the CartPole-v0 environment
$ python3 evaluate envs/cartpole.json agents/dqn.json --train --episodes=200

The environments are described by their gym registration id


Agents are described by their class and a configuration dictionary.

{
    "__class__": "<class 'rl_agents.agents.dqn.pytorch.DQNAgent'>",
    "model": {
        "type": "DuelingNetwork",
        "layers": [512, 512]
    },
    "gamma": 0.99,
    "n_steps": 1,
    "batch_size": 32,
    "memory_capacity": 50000,
    "target_update": 1,
    "exploration": {
        "method": "EpsilonGreedy",
        "tau": 50000,
        "temperature": 1.0,
        "final_temperature": 0.1
    }
}

If keys are missing from these configurations, default values will be used instead.
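As an illustration of this fallback behaviour, here is a minimal sketch of a recursive merge between a user configuration and a set of defaults. The function `merge_with_defaults` and the `DEFAULTS` dictionary are hypothetical names chosen for this example; they are not taken from the library, whose actual mechanism may differ.

```python
import copy


def merge_with_defaults(config, defaults):
    """Return a copy of `defaults` recursively overridden by `config`."""
    merged = copy.deepcopy(defaults)
    for key, value in config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dictionaries are merged key by key
            merged[key] = merge_with_defaults(value, merged[key])
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults, for illustration only
DEFAULTS = {
    "gamma": 0.99,
    "batch_size": 32,
    "model": {"type": "DuelingNetwork", "layers": [512, 512]},
}

# Only two keys are given: every other key falls back to its default
user_config = {"gamma": 0.95, "model": {"layers": [64, 64]}}
config = merge_with_defaults(user_config, DEFAULTS)
print(config)
```

With such a merge, a configuration file only needs to list the keys that differ from the defaults.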

Finally, a batch of experiments can be scheduled in a benchmark. All experiments are then executed in parallel on several processes.

# Run a benchmark of several agents interacting with environments
$ python3 benchmark cartpole_benchmark.json --test --processes=4

A benchmark configuration file contains a list of environment configurations and a list of agent configurations.

{
    "environments": ["envs/cartpole.json"],
    "agents": ["agents/dqn.json", "agents/mcts.json"]
}

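The sketch below shows one plausible way to expand such a benchmark file into individual experiments. It assumes that every environment is paired with every agent, which the description above suggests but does not state explicitly; the function `list_experiments` is a hypothetical helper, not part of the scripts.

```python
import itertools
import json

benchmark_json = """
{
    "environments": ["envs/cartpole.json"],
    "agents": ["agents/dqn.json", "agents/mcts.json"]
}
"""


def list_experiments(benchmark):
    """Pair every environment configuration with every agent configuration."""
    return list(itertools.product(benchmark["environments"], benchmark["agents"]))


experiments = list_experiments(json.loads(benchmark_json))
for environment, agent in experiments:
    # Each pair is an independent experiment that a worker process could run
    print(environment, agent)
```

Since the experiments are independent of each other, they can be distributed over a pool of processes.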

The following agents are currently implemented:


Value Iteration

Performs Value Iteration to compute the state-action value function, and acts greedily with respect to it.

Only compatible with finite-mdp environments, or environments that provide an env.to_finite_mdp() conversion method.

Reference: Dynamic Programming, Bellman R., Princeton University Press (1957).
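A minimal sketch of Value Iteration on a toy two-state MDP follows. The data layout (`P[s][a]` as a list of next-state probabilities, `R[s][a]` as a reward) and the toy MDP itself are assumptions made for this example, not the format used by the finite-mdp environments.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Compute the state-action values of a finite MDP.

    P[s][a] is a list of next-state probabilities, R[s][a] is a reward.
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    while True:
        V = [max(q) for q in Q]  # greedy state values
        new_Q = [[R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                  for a in range(n_actions)]
                 for s in range(n_states)]
        delta = max(abs(new_Q[s][a] - Q[s][a])
                    for s in range(n_states) for a in range(n_actions))
        Q = new_Q
        if delta < tol:
            return Q


# Toy MDP: in each state, action 0 stays and action 1 moves to the other state.
# Staying in state 1 is the only rewarded transition.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]

Q = value_iteration(P, R)
policy = [max(range(len(q)), key=q.__getitem__) for q in Q]
print(policy)  # [1, 0]: move to state 1, then stay there
```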

MCTS Monte-Carlo Tree Search

A world transition model is leveraged for trajectory search. A look-ahead tree is expanded so as to explore the trajectory space and quickly focus around the most promising moves.


UCT Upper Confidence bounds applied to Trees

The tree is traversed by iteratively applying an optimistic selection rule at each depth, and the value at leaves is estimated by sampling. Empirical evidence shows that this popular algorithm performs well in many applications, but it has been proved theoretically to achieve a much worse performance (doubly-exponential) than uniform planning in some problems.
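The optimistic selection rule can be sketched as follows, assuming the common UCB1 form of the bonus and an exploration constant of sqrt(2). The node representation (dictionaries with `value` and `visits` keys) is hypothetical and chosen only for this example.

```python
import math


def ucb_score(total_value, visits, parent_visits, exploration=math.sqrt(2)):
    """Mean value plus an exploration bonus that shrinks with the visit count."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = total_value / visits
    return mean + exploration * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, exploration=math.sqrt(2)):
    """Return the index of the child with the highest optimistic score."""
    parent_visits = sum(child["visits"] for child in children)
    return max(range(len(children)),
               key=lambda i: ucb_score(children[i]["value"], children[i]["visits"],
                                       parent_visits, exploration))


children = [
    {"value": 9.0, "visits": 10},  # mean 0.9
    {"value": 4.0, "visits": 5},   # mean 0.8
    {"value": 0.5, "visits": 1},   # mean 0.5, but rarely visited
]
print(select_child(children))                   # optimistic choice
print(select_child(children, exploration=0.0))  # purely greedy choice
```

Applying this rule at each depth from the root yields the path to the leaf that is expanded and evaluated next.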


OPD Optimistic Planning for Deterministic systems

This algorithm is tailored for systems with deterministic dynamics and rewards. It exploits the reward structure to achieve a polynomial rate on regret, and behaves efficiently in numerical experiments with dense rewards.

Reference: Optimistic Planning for Deterministic Systems, Hren J., Munos R. (2008).
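The following is a minimal sketch of optimistic planning under deterministic dynamics, written from the description above rather than from the agent's code. It assumes rewards in [0, 1], so that a sequence of depth d with discounted return u has the upper bound u + gamma^d / (1 - gamma). The `step` function stands for a hypothetical deterministic model, and returning the first action of the sequence with the highest lower bound u is one possible recommendation rule.

```python
import heapq
import itertools


def optimistic_planning(state, step, n_actions, gamma=0.9, budget=50):
    """Plan with a deterministic model step(state, action) -> (next_state, reward).

    Rewards are assumed to lie in [0, 1].
    """
    tie_breaker = itertools.count()  # avoids comparing states inside the heap
    # Leaf entries: (-upper_bound, tie, lower_bound, depth, state, first_action)
    leaves = [(-1.0 / (1.0 - gamma), next(tie_breaker), 0.0, 0, state, None)]
    best_value, best_action = -1.0, 0
    for _ in range(budget):
        _, _, value, depth, leaf_state, first = heapq.heappop(leaves)
        # Expand the leaf with the highest upper bound
        for action in range(n_actions):
            next_state, reward = step(leaf_state, action)
            child_value = value + gamma ** depth * reward
            child_first = action if first is None else first
            if child_value > best_value:
                best_value, best_action = child_value, child_first
            upper_bound = child_value + gamma ** (depth + 1) / (1.0 - gamma)
            heapq.heappush(leaves, (-upper_bound, next(tie_breaker), child_value,
                                    depth + 1, next_state, child_first))
    return best_action


def step(position, action):
    """Toy chain: action 0 moves left, action 1 moves right; reward beyond 3."""
    new_position = position + (1 if action == 1 else -1)
    return new_position, (1.0 if new_position >= 3 else 0.0)


print(optimistic_planning(0, step, n_actions=2))  # 1: move right
```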

OLOP Open Loop Optimistic Planning

Reference: Open Loop Optimistic Planning, Bubeck S., Munos R. (2010).


Reference: Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning, Grill J. B., Valko M., Munos R. (2017).

Robust planning

Robust Value Iteration

A list of possible finite-mdp models is provided in the agent configuration. The MDP ambiguity set is constrained to be rectangular: different models can be selected at every transition. The corresponding robust state-action value is computed so as to maximize the worst-case total reward.
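A minimal sketch of the robust backup under the rectangularity assumption: the worst-case model is chosen independently for each state-action pair. The model format (a list of `(P, R)` pairs with `P[s][a]` a list of next-state probabilities) and the toy example are assumptions made for illustration.

```python
def robust_value_iteration(models, gamma=0.9, iterations=500):
    """Worst-case state-action values over a finite list of (P, R) models."""
    n_states = len(models[0][1])
    n_actions = len(models[0][1][0])
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iterations):
        # Rectangularity: the adversary picks a model separately for each (s, a)
        Q = [[min(R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                  for P, R in models)
              for a in range(n_actions)]
             for s in range(n_states)]
        V = [max(q) for q in Q]
    return Q


# Toy problem with a single state: action 0 is safe, action 1 is risky
P = [[[1.0], [1.0]]]
model_a = (P, [[0.5, 1.0]])  # the risky action pays off
model_b = (P, [[0.5, 0.0]])  # the risky action fails

Q = robust_value_iteration([model_a, model_b])
robust_action = max(range(2), key=Q[0].__getitem__)
print(robust_action)  # 0: the safe action maximizes the worst case
```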


Discrete Robust Optimistic Planning

The MDP ambiguity set is assumed to be finite, and is constructed from a list of modifiers to the true environment. The corresponding robust value is approximately computed by Deterministic Optimistic Planning so as to maximize the worst-case total reward.


Interval-based Robust Planning

We assume that the MDP is a parametrized dynamical system, whose parameter is uncertain and lies in a continuous ambiguity set. We use interval prediction to compute the set of states that can be reached at any time t, given that uncertainty, and leverage it to evaluate and improve a robust policy.

If the system is Linear Parameter-Varying (LPV) with polytopic uncertainty, a fast and stable interval predictor can be designed. Otherwise, sampling-based approaches can be used instead, with an increased computational load.
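To give an idea of what an interval prediction looks like, here is a sketch for a scalar system x_{t+1} = a x_t + u_t whose parameter a is only known to lie in an interval. It uses naive interval arithmetic, which is not the LPV interval predictor mentioned above; the system and all names are assumptions chosen for this example.

```python
def interval_product(a, x):
    """Tightest interval containing a_value * x_value for values in a and x."""
    products = [a[0] * x[0], a[0] * x[1], a[1] * x[0], a[1] * x[1]]
    return (min(products), max(products))


def predict_intervals(x0, a_interval, controls):
    """Propagate state bounds through x_{t+1} = a * x_t + u_t, a in a_interval."""
    x = (x0, x0)
    trajectory = [x]
    for u in controls:
        low, high = interval_product(a_interval, x)
        x = (low + u, high + u)
        trajectory.append(x)
    return trajectory


trajectory = predict_intervals(x0=1.0, a_interval=(0.5, 0.75), controls=[0.0, 0.0])
print(trajectory)  # [(1.0, 1.0), (0.5, 0.75), (0.25, 0.5625)]
```

A robust policy can then be evaluated against the worst reward over each predicted interval. For higher-dimensional systems, naive interval arithmetic tends to produce overly conservative bounds, which is one motivation for dedicated predictors.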




A neural-network model is used to estimate the state-action value function and produce a greedy optimal policy.

Implemented variants:

  • Double DQN
  • Dueling architecture
  • N-step targets
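As an illustration of how two of these variants combine, the sketch below computes a Double DQN target with an n-step return for a single transition. It is written with plain Python lists rather than tensors and is not taken from the agent's implementation; the function name and arguments are hypothetical.

```python
def double_dqn_target(rewards, next_q_online, next_q_target, gamma=0.99,
                      terminal=False):
    """n-step Double DQN target.

    rewards: the n rewards collected from the current state.
    next_q_online, next_q_target: Q-values of the state reached after n steps,
    as predicted by the online and target networks respectively.
    """
    n_step_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    if terminal:
        return n_step_return
    # Double DQN: the online network selects the action,
    # the target network evaluates it
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return n_step_return + gamma ** len(rewards) * next_q_target[best_action]


one_step = double_dqn_target([1.0], [1.0, 2.0], [0.5, 0.25], gamma=0.5)
two_step = double_dqn_target([1.0, 1.0], [1.0, 2.0], [0.5, 0.25], gamma=0.5)
print(one_step, two_step)  # 1.125 1.5625
```

With a single reward, this reduces to the usual one-step Double DQN target.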



A Q-function model is trained by performing each step of Value Iteration as a supervised learning procedure applied to a batch of transitions covering most of the state-action space.

Reference: Tree-Based Batch Mode Reinforcement Learning, Ernst D. et al (2005).
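The procedure can be sketched as below. To keep the example self-contained, the tree-based regressor of the reference is swapped for a lookup table fitted by averaging, which is the least-squares solution for a tabular model; the transition format and function names are assumptions made for this example.

```python
from collections import defaultdict


def fit_table(inputs, targets):
    """Least-squares fit of a lookup table: the mean target of each input."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(inputs, targets):
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


def fitted_q_iteration(transitions, n_actions, gamma=0.9, iterations=300):
    """transitions: list of (state, action, reward, next_state, done)."""
    q = {}
    inputs = [(s, a) for s, a, _, _, _ in transitions]
    for _ in range(iterations):
        # Regression targets: one step of Value Iteration on the batch
        targets = [r + (0.0 if done else
                        gamma * max(q.get((s2, b), 0.0) for b in range(n_actions)))
                   for _, _, r, s2, done in transitions]
        # Supervised learning step
        q = fit_table(inputs, targets)
    return q


# Batch covering every state-action pair of a toy two-state problem:
# action 0 stays, action 1 switches state, staying in state 1 is rewarded
batch = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 1.0, 1, False),
    (1, 1, 0.0, 0, False),
]
q = fitted_q_iteration(batch, n_actions=2)
print(round(q[(1, 0)], 3), round(q[(0, 1)], 3))  # 10.0 9.0
```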
