
facebookresearch/DeepRL-continuing-tasks

Deep Reinforcement Learning in Continuing Tasks

This repository provides the codebase for the paper An Empirical Study of Deep Reinforcement Learning in Continuing Tasks. The paper explores challenges that continuing tasks present to current deep reinforcement learning (RL) algorithms using a suite of continuing task testbeds. It empirically demonstrates the effectiveness of several reward-centering techniques that improve the performance of all studied algorithms on these continuing testbeds.

The code is based on the existing RL package Pearl, which itself is built with PyTorch. The testbeds are based on Mujoco and Atari environments provided by Gymnasium. Experiments use AlphaEx for configuration sweeping.

Why study deep RL in continuing tasks?

Continuing tasks are tasks in which the agent-environment interaction is ongoing and cannot be broken into episodes. They are the right model when environment resets are unavailable, agent-controlled, or predefined but all rewards, including those received after resets, still matter. Such scenarios occur frequently in real-world applications and cannot be modeled as episodic tasks. While modern deep RL algorithms have been studied extensively and are well understood in episodic tasks, their behavior in continuing tasks remains underexplored.

Why study reward centering in these testbeds?

Recent research shows that discounted RL methods for solving continuing tasks can perform significantly better if they center their rewards by subtracting the rewards' empirical average. That work's empirical analysis focused primarily on a temporal-difference-based reward-centering method used with Q-learning. Our paper extends those findings by demonstrating that this method is effective across a broader range of algorithms, scales to larger tasks, and outperforms two other reward-centering approaches.
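To make the idea concrete, here is a minimal sketch of the simplest form of reward centering: maintaining a running empirical average of observed rewards and subtracting it before learning. This is an illustration only; the TD-based variant studied in the paper updates the average using the TD error rather than a plain running mean.

```python
class RewardCenterer:
    """Minimal sketch of reward centering: subtract a running empirical
    average of the rewards. Illustrative only; the paper's TD-based
    variant updates the average from the TD error instead."""

    def __init__(self):
        self.avg = 0.0   # running empirical average of rewards
        self.count = 0   # number of rewards seen so far

    def center(self, reward):
        # Incrementally update the running mean, then return the
        # centered reward that would be fed to the learning update.
        self.count += 1
        self.avg += (reward - self.avg) / self.count
        return reward - self.avg

centerer = RewardCenterer()
centered = [centerer.center(r) for r in [1.0, 1.0, 4.0]]
```

Because the average tracks the long-run reward rate, centering removes a constant offset from all value estimates, which is what makes discounted methods behave better in continuing tasks.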

Audience

Those replicating the results in the paper.

Those evaluating new RL algorithms in our testbeds.

Those extending our suite of testbeds to a larger set.

Those seeking deeper insights into deep RL algorithms and reward-centering techniques in continuing tasks.

Testbeds

Overall we have 21 testbeds:

  • 5 continuous control testbeds without any environment resets, based on five Mujoco environments: Swimmer, HumanoidStandup, Reacher, Pusher, and Ant. The episodic versions of these testbeds are included in Gymnasium (https://github.com/Farama-Foundation/Gymnasium/). The continuing testbeds are the same as the episodic ones except for the following differences. The continuing testbeds involve no time-based or state-based resets. For Reacher, the target position is resampled every 50 steps while the robot's arm is left untouched, so the robot must learn to reach a new position every 50 steps. Similarly, for Pusher, the object's position is randomly resampled every 100 steps. For Ant, we widen the range of angles at which its legs can move, so the ant robot can recover when it flips over.

  • 5 continuous control testbeds with predefined environment resets built upon five Mujoco environments: HalfCheetah, Ant, Hopper, Humanoid, and Walker2d. The corresponding existing episodic testbeds involve time-based truncation of the agent’s experience followed by an environment reset. In the continuing testbeds, we remove this time-based truncation and reset. We retain state-based resets, such as when the robot is about to fall (in Hopper, Humanoid, and Walker2d) or when it flips its body (in Ant). In addition, we add a reset condition for HalfCheetah when it flips, which is not available in the existing episodic testbeds. Each reset incurs a penalty to the reward, punishing the agent for falling or flipping.

  • 6 discrete control testbeds adapted from Atari environments: Breakout, Pong, Space Invaders, BeamRider, Seaquest, and Ms. PacMan. Like the Mujoco environments, the episodic versions include time-based resets, which we omit in the continuing testbeds. In these Atari environments, the agent has multiple lives, and the environment is reset when all lives are lost. Upon losing a life, a reward of -1 is issued as a penalty. Furthermore, in existing algorithmic solutions to episodic Atari testbeds, the rewards are transformed into -1, 0, or 1 by taking their sign for stable learning, though performance is evaluated based on the original rewards. We treat the transformed rewards as the actual rewards in our continuing testbeds, removing such inconsistency.

  • 5 Mujoco testbeds with agent-controlled resets, based on five Mujoco environments: HalfCheetah, Ant, Hopper, Humanoid, and Walker2d. In these testbeds, the agent can choose to reset the environment at any time step. This is achieved by augmenting the environment's action space with one additional dimension, whose range of [0, 1] represents the probability of a reset.
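A sketch of how such an action-space augmentation might work is below. The wrapper class and dummy environment are illustrative assumptions, not the repository's actual code: the last action dimension is read as a reset probability, and with that probability the environment is reset instead of stepped.

```python
import random

class AgentControlledResetWrapper:
    """Illustrative sketch: the last action dimension, in [0, 1], is
    interpreted as the probability of resetting the environment at this
    step. Not the repository's actual implementation."""

    def __init__(self, env, rng=None):
        self.env = env
        self.rng = rng or random.Random()

    def reset(self):
        return self.env.reset()

    def step(self, action):
        *ctrl, reset_prob = action  # split off the extra dimension
        if self.rng.random() < reset_prob:
            # The agent chose to reset; the control action is dropped.
            obs = self.env.reset()
            return obs, 0.0, True
        return self.env.step(ctrl)

class DummyEnv:
    """Tiny stand-in environment counting steps since the last reset."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        return self.t, 1.0, False

env = AgentControlledResetWrapper(DummyEnv())
env.reset()
obs1, _, _ = env.step([0.3, 0.0])           # reset probability 0: never resets
obs2, _, reset_flag = env.step([0.3, 1.0])  # reset probability 1: always resets
```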

Because our testbeds are simple modifications of existing episodic Mujoco and Atari testbeds available from Gymnasium, we do not provide a separate package that implements these testbeds.
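As one example of how small these modifications are, the sign-based reward transform described for the Atari testbeds can be expressed as a short wrapper along the following lines. This is an illustrative sketch under assumed interfaces, not the repository's actual code:

```python
class SignRewardWrapper:
    """Illustrative sketch: replace each reward with its sign (-1, 0, or 1)
    and treat the transformed reward as the task's actual reward, as in the
    continuing Atari testbeds described above."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        sign = (reward > 0) - (reward < 0)  # -1, 0, or 1
        return obs, float(sign), done

class ScoreEnv:
    """Dummy environment emitting raw game scores."""
    def __init__(self, rewards):
        self.rewards = iter(rewards)
    def reset(self):
        return 0
    def step(self, action):
        return 0, next(self.rewards), False

env = SignRewardWrapper(ScoreEnv([200.0, 0.0, -50.0]))
env.reset()
rewards = [env.step(0)[1] for _ in range(3)]
```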

Tested algorithms

Continuous control: DDPG, TD3, SAC, PPO

Discrete control: DQN, SAC, PPO

Results

Here we show the learning curves of tested algorithms in the testbeds, using the hyperparameters that achieve the best overall average reward rate. More results can be found in our paper.

Mujoco testbeds without resets: learning curves for HumanoidStandup, Pusher, Reacher, SpecialAnt, and Swimmer.

Mujoco testbeds with predefined resets: learning curves for HalfCheetah, Ant, Hopper, Humanoid, and Walker2d.

Mujoco testbeds with agent-controlled resets: learning curves for HalfCheetah, Ant, Hopper, Humanoid, and Walker2d.

Atari testbeds with predefined resets: learning curves for Breakout, BeamRider, MsPacman, Pong, Seaquest, and SpaceInvaders.

Performance improvement when applying reward centering to the tested algorithms: one figure for the Mujoco testbeds and one for the Atari testbeds.

How to use the codebase

Set up the conda environment and dependencies

  • Create a conda environment: `conda create --name pearl python==3.10`
  • Activate it: `conda activate pearl`
  • Install Pearl's dependencies: `./setup.sh`
  • For Mujoco games, copy `user_envs/special_ant.xml` to `CONDA_DIR_PATH/envs/pearl/lib/python3.10/site-packages/gymnasium/envs/mujoco/assets/`. This is the XML file for the Ant task with a wider range of angles at which its legs can move. Replace `CONDA_DIR_PATH` with the path to the conda directory on your machine.
  • For Atari games, manually increase the default maximum episode length in `CONDA_DIR_PATH/envs/pearl/lib/python3.10/site-packages/ale_py/registration.py`. The default is 108000; change it to any number larger than the number of training steps so that the maximum episode length is never reached during training.

Experiment configurations

The codebase contains several experiment folders, each of which includes a file `inputs.json` specifying a set of experiment configurations. These configuration files are compatible with AlphaEx's sweeper for configuration sweeping; https://github.com/AmiiThinks/AlphaEx?tab=readme-ov-file#sweeper explains how to read them. The table below maps each folder to the figures and tables in the paper that summarize its experiment results.

| Experiment folder | Description | Figures/Tables in the paper |
| --- | --- | --- |
| experiments/no_resets_mujoco/ | Mujoco tasks without resets | Figure 1 (row 1), Table 1, Figure 2 |
| experiments/predefined_resets_mujoco/ | Mujoco tasks with predefined resets | Figure 1 (row 2), Tables 2, 3 |
| experiments/agent_resets_mujoco/ | Mujoco tasks with agent-controlled resets | Figure 1 (row 3), Table 4, Table 14 |
| experiments/rc_mujoco/ | reward-centered algorithms in Mujoco tasks without resets or with predefined resets | Tables 5, 15, 17 (first two groups of each), Figures 4, 5 |
| experiments/rc_mujoco_agent_reset/ | reward-centered algorithms in Mujoco tasks with agent-controlled resets | Tables 5, 15, 17 (third group of each), Figure 6 |
| experiments/rc_mujoco_offset/ | reward-centered algorithms in Mujoco tasks without resets or with predefined resets, with reward offsets | Table 16 (first two groups) |
| experiments/rc_mujoco_agent_reset_offset/ | reward-centered algorithms in Mujoco tasks with agent-controlled resets, with reward offsets | Table 16 (third group) |
| experiments/predefined_resets_atari/ | Atari tasks with predefined resets | Figure 3, Table 13 |
| experiments/rc_atari/ | reward-centered algorithms in Atari tasks | Table 6, Table 18, Figure 7 |

Running experiments

  • Suppose we want to perform all experiments specified in experiments/no_resets_mujoco/inputs.json for ten runs each. There are 168 experiment configurations in that file, so there are 168 * 10 = 1680 experiments in total. One could run them sequentially using `for i in {0..1679}; do ./run.sh run.py --config-file experiments/no_resets_mujoco/inputs.json --out-dir=experiments/no_resets_mujoco --base-id=$i; done`. Alternatively, one could run them in parallel using tools like Slurm, depending on the available infrastructure.
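The 168 * 10 bookkeeping above can be sketched as follows. The exact mapping from a flat base-id to a (configuration, run) pair is handled by run.py and AlphaEx's sweeper, so the convention below is only an assumed illustration of how 1680 ids can cover 168 configurations with 10 runs each:

```python
NUM_CONFIGS = 168  # configurations in experiments/no_resets_mujoco/inputs.json
NUM_RUNS = 10      # independent runs per configuration

def split_base_id(base_id):
    """Assumed illustrative mapping of a flat id in [0, NUM_CONFIGS * NUM_RUNS)
    to (config_index, run_index); check AlphaEx's sweeper for the real one."""
    assert 0 <= base_id < NUM_CONFIGS * NUM_RUNS
    return base_id % NUM_CONFIGS, base_id // NUM_CONFIGS

first = split_base_id(0)    # first configuration, first run
last = split_base_id(1679)  # last configuration, last run
```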

Post-processing experiment results and generating figures and tables

  • The first step of post-processing is to evaluate the final learned policies from the predefined_resets_mujoco and predefined_resets_atari experiments. Run `for i in {0..1599}; do ./run.sh run.py --config-file experiments/predefined_resets_mujoco/inputs.json --out-dir=experiments/predefined_resets_mujoco --base-id=$i --eval-agent; done` and `for i in {0..719}; do ./run.sh run.py --config-file experiments/predefined_resets_atari/inputs.json --out-dir=experiments/predefined_resets_atari --base-id=$i --eval-agent; done`.

  • Then generate the learning curves and the LaTeX code for the tables shown in the paper using `./run_post_processing.sh`.

Cite us

@article{wan2024continuingtasks,
  title  = {An Empirical Study of Deep Reinforcement Learning in Continuing Tasks},
  author = {Yi Wan and Dmytro Korenkevych and Zheqing Zhu},
  year   = {2024}
}

License

Pearl is MIT licensed, as found in the LICENSE file.
