This repository provides the codebase for the paper An Empirical Study of Deep Reinforcement Learning in Continuing Tasks. The paper explores challenges that continuing tasks present to current deep reinforcement learning (RL) algorithms using a suite of continuing task testbeds. It empirically demonstrates the effectiveness of several reward-centering techniques that improve the performance of all studied algorithms on these continuing testbeds.
The code is based on the existing RL package Pearl, which itself is built with PyTorch. The testbeds are based on Mujoco and Atari environments provided by Gymnasium. Experiments use AlphaEx for configuration sweeping.
Continuing tasks refer to tasks where the agent-environment interaction is ongoing and can not be broken down into episodes. These tasks are suitable when environment resets are unavailable, agent-controlled, or predefined but where all rewards—including those beyond resets—are critical. These scenarios frequently occur in real-world applications and can not be modeled by episodic tasks. While modern deep RL algorithms have been extensively studied and well understood in episodic tasks, their behavior in continuing tasks remains underexplored.
Recent research (source) shows that discounted RL methods for solving continuing tasks can perform significantly better if they center their rewards by subtracting out the rewards’ empirical average. Empirical analysis of the paper primarily focused on a temporal-difference-based reward centering method in conjunction with Q-learning. Our paper extends their findings by demonstrating that this method is effective across a broader range of algorithms, scales to larger scale tasks, and outperforms two other reward-centering approaches.
Those replicating the results in the paper.
Those evaluating new RL algorithms in our testbeds.
Those extending our suite of testbeds to a larger set.
Those seeking deeper insights into deep RL algorithms and reward-centering techniques in continuing tasks.
Overall we have 21 testbeds, including
-
5 continuous control testbeds without any environment resets, based on five Mujoco environments: Swimmer, HumanoidStandup, Reacher, Pusher, and Ant. The episodic versions of these testbeds are included in Gymnasium (https://github.com/Farama-Foundation/Gymnasium/). The continuing testbeds are the same as the episodic ones except for the following differences. First, the continuing testbeds do not involve time-based or state-based resets. For Reacher, we resample the target position every 50 steps while leaving the robot's arm untouched, so that the robot needs to learn to reach a new position every 50 steps. Similarly, for Pusher, everything remains the same except that the object's position is randomly sampled every 100 steps. As for Ant, we increase the range of the angles at which its legs can move, so that the ant robot can recover when it flips over.
-
5 continuous control testbeds with predefined environment resets built upon five Mujoco environments: HalfCheetah, Ant, Hopper, Humanoid, and Walker2d. The corresponding existing episodic testbeds involve time-based truncation of the agent’s experience followed by an environment reset. In the continuing testbeds, we remove this time-based truncation and reset. We retain state-based resets, such as when the robot is about to fall (in Hopper, Humanoid, and Walker2d) or when it flips its body (in Ant). In addition, we add a reset condition for HalfCheetah when it flips, which is not available in the existing episodic testbeds. Each reset incurs a penalty to the reward, punishing the agent for falling or flipping.
-
6 discrete control testbeds adapted from Atari environments: Breakout, Pong, Space Invaders, BeamRider, Seaquest, and Ms. PacMan. Like the Mujoco environments, the episodic versions include time-based resets, which we omit in the continuing testbeds. In these Atari environments, the agent has multiple lives, and the environment is reset when all lives are lost. Upon losing a life, a reward of -1 is issued as a penalty. Furthermore, in existing algorithmic solutions to episodic Atari testbeds, the rewards are transformed into -1, 0, or 1 by taking their sign for stable learning, though performance is evaluated based on the original rewards. We treat the transformed rewards as the actual rewards in our continuing testbeds, removing such inconsistency.
-
5 Mujoco testbeds with agent-controlled resets, based on five Mujoco environments: HalfCheetah, Ant, Hopper, Humanoid, and Walker2d. In these testbeds, the agent can choose to reset the environment at any time step. This is achieved by augmenting the environment's action space in these testbeds by adding one more dimension. This additional dimension has a range of [0, 1], representing the probability of reset.
Because our testbeds are simple modifications of existing episodic Mujoco and Atari testbeds available from Gymnasium, we do not provide a separate package that implements these testbeds.
Continuous control: DDPG, TD3, SAC, PPO
Discrete control: DQN, SAC, PPO
Here we show the learning curves of tested algorithms in the testbeds, using the hyperparameters that achieve the best overall average reward rate. More results can be found in our paper.
![]() HumanoidStandup |
![]() Pusher |
![]() Reacher |
![]() SpecialAnt |
![]() Swimmer |
![]() HalfCheetah |
![]() Ant |
![]() Hopper |
![]() Humanoid |
![]() Walker2d |
![]() HalfCheetah |
![]() Ant |
![]() Hopper |
![]() Humanoid |
![]() Walker2d |
![]() Breakout |
![]() BeamRider |
![]() MsPacman |
![]() Pong |
![]() Seaquest |
![]() SpaceInvader |
![]() Mujoco Testbeds |
![]() Atari Testbeds |
- Create a conda environment
conda create --name pearl python==3.10 conda activate pearl- Install pearl dependencies using
./setup.sh - For mujoco games, copy
user_envs/special_ant.xmltoCONDA_DIR_PATH/envs/pearl/lib/python3.10/site-packages/gymnasium/envs/mujoco/assets/. This is the xml file of the Ant task with a wider range of the angles at which its legs can move. ReplaceCONDA_DIR_PATHby the path to the conda directory in your machine. - For Atari games, we have to manually increase the default maximum episode length in
CONDA_DIR_PATH/envs/pearl/lib/python3.10/site-packages/ale_py/registration.py. The default is 108000. You may change it to any number that is larger than the training steps so that the maximum episode length is not reached during training.
The codebase has several experiment folders, each of which includes a file inputs.json, which specifies a set of experiment configurations. This configuration file is compatible with AlphaEx's sweeper for configuration sweeping. https://github.com/AmiiThinks/AlphaEx?tab=readme-ov-file#sweeper explains how to understand the configuration file. Running experiments given these configurations gives experiment results. The table below shows the correspondence between these folders and the figures/tables in the paper summarizing the experiment results.
| Experiment folder | Description | Figures/Tables in the paper |
|---|---|---|
experiments/no_resets_mujoco/ |
mujoco tasks without resets | Figure 1 (row 1), Table 1, Figure 2, |
experiments/predefined_resets_mujoco/ |
mujoco tasks with predefined resets | Figure 1 (row 2), Tables 2, 3 |
experiments/agent_resets_mujoco/ |
mujoco tasks with agent resets | Figure 1 (row 3), Table 4, Table 14 |
experiments/rc_mujoco/ |
reward centered algorithms in mujoco tasks without resets or with predefined resets | Table 5 (first two groups), Table 15 (first two groups), Table 17 (first two groups), Table 17 (first two groups), Figures 4, 5 |
experiments/rc_mujoco_agent_reset/ |
reward centered algorithms in Mujoco tasks with agent controlled resets | Table 5 (third group), Table 15 (third groups), Table 17 (third group), Figure 6 |
experiments/rc_mujoco_offset/ |
reward centered algorithms in Mujoco tasks without resets or with predefined resets with reward offsets | Table 16 (first two groups) |
experiments/rc_mujoco_agent_reset_offset/ |
reward centered algorithms in Mujoco tasks with agent-controlled resets with reward offsets | Table 16 (third group) |
experiments/predefined_resets_atari/ |
atari tasks with predefined resets | Figure 3, Table 13 |
experiments/rc_atari/ |
reward centered algorithms in atari tasks | Table 6, Table 18, Figure 7 |
- Suppose we want to perform all experiments specified in
experiments/no_resets_mujoco/inputs.jsonfor ten runs. Note that there are overall 168 experiment configurations inexperiments/no_resets_mujoco/inputs.json. Therefore, overall there will be 168 * 10 = 1680 experiments. One could run these experiments sequentially usingfor i in {0..1679}; do ./run.sh run.py --config-file experiments/no_resets_mujoco/inputs.json --out-dir=experiments/no_resets_mujoco --base-id=i; done. Alternatively, one could run them in parallel using tools like Slurm, depending on the infrastructure available.
-
The first step of post-processing experiment results is to evaluate the final learned policies in predefined_resets_mujoco and predefined_resets_atari experiments. Run
for i in {0..1599}; do run.sh run.py --config-file experiments/predefined_resets_mujoco/inputs.json --out-dir=experiments/predefined_resets_mujoco --base-id=i --eval-agentandfor i in {0..719}; do run.sh run.py --config-file experiments/predefined_resets_atari/inputs.json --out-dir=experiments/predefined_resets_atari --base-id=i --eval-agent. -
Then generate learning curves and latex code of tables shown in the paper using
./run_post_processing.sh.
@article{wan2024continuingtasks, title = {An Empirical Study of Deep Reinforcement Learning in Continuing Tasks}, author = {Yi Wan, Dmytro Korenkevych, Zheqing Zhu}, year = {2024} }
Pearl is MIT licensed, as found in the LICENSE file.






















