Important
This repository contains the code for D-ARL, our paper on asynchronous reinforcement learning for language reasoning, accepted at ICML 2026.
D-ARL is built on top of verl, and focuses on one central challenge in async RL for LLM post-training:
distribution mismatch between stale behavior policies and the current policy.
- 2026.05: D-ARL was accepted to ICML 2026.
Asynchronous RL is attractive for LLM post-training because it decouples rollout generation from policy optimization and improves hardware utilization.
However, this efficiency comes with a price:
- rollouts are produced by stale behavior policies,
- replayed data may no longer match the current policy,
- and naive async training can become unstable.
D-ARL addresses this problem with:
- Distribution-matched replay over the most recent K behavior policies,
- Variance-guided sample selection for better-aligned asynchronous data,
- Multi-behavior policy optimization to exploit multi-source replay.
In short, D-ARL aims to make async RL both fast and stable.
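As a rough illustration of what "multi-behavior" optimization can mean, the sketch below computes a PPO-style clipped surrogate in which each sample is weighted against the behavior policy that actually generated it, rather than against a single shared reference. This is not the objective from the paper: the function name, the dictionary keys, and the `clip_eps` parameter are all hypothetical and chosen only for illustration.

```python
import math


def multi_behavior_surrogate(samples, clip_eps=0.2):
    """Mean clipped surrogate over samples from several behavior policies.

    Each sample is a dict with (hypothetical) keys:
      logp_current  - log-probability under the current policy
      logp_behavior - log-probability under the sample's own behavior policy
      advantage     - advantage estimate for the sample
    """
    total = 0.0
    for s in samples:
        # Per-sample ratio uses the behavior policy that produced this sample.
        ratio = math.exp(s["logp_current"] - s["logp_behavior"])
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        adv = s["advantage"]
        # Standard PPO pessimistic bound: take the smaller of the two terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(samples)
```

The only point of the sketch is that the denominator of the ratio is tracked per sample, so replayed data from different policy versions is not treated as if it came from one source.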
Most async RL pipelines use stale data because it is available.
D-ARL uses stale data only when it is still useful.
- Keep a replay buffer over recent behavior policies
- Measure how well samples match the current policy
- Select distribution-matched high-quality samples
- Optimize with a multi-behavior objective instead of collapsing all replay data into one source
This is the key difference between D-ARL and plain async RL.
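The steps above can be sketched as a small replay buffer that keeps samples from the K most recent behavior policies and filters them by how far they have drifted from the current policy. This is a minimal sketch under stated assumptions, not the repository's implementation: the class name, the dictionary keys, and the use of the absolute log-ratio as a mismatch measure are all hypothetical stand-ins (the paper's variance-guided criterion is not reproduced here).

```python
from collections import deque


class RecentPolicyReplay:
    """Keeps one batch of samples per behavior policy, for the last k policies."""

    def __init__(self, k=4):
        # deque(maxlen=k) silently evicts the oldest policy's batch.
        self.batches = deque(maxlen=k)

    def add_policy_batch(self, version, samples):
        """Store samples generated by the behavior policy with this version."""
        self.batches.append((version, list(samples)))

    def select(self, current_logp_fn, max_abs_log_ratio=0.7):
        """Return samples whose log-ratio to the current policy is small.

        current_logp_fn maps a sample to its log-probability under the
        current policy; each sample must carry 'logp_behavior'.
        """
        selected = []
        for version, samples in self.batches:
            for s in samples:
                log_ratio = current_logp_fn(s) - s["logp_behavior"]
                if abs(log_ratio) <= max_abs_log_ratio:
                    selected.append(
                        {**s, "version": version, "log_ratio": log_ratio}
                    )
        return selected
```

Selected samples keep their originating policy version, so a downstream objective can still distinguish between sources instead of merging them.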
- Stable and fast async RL for language reasoning
- Replay buffer over multiple behavior policies
- Distribution-aware sample selection
- Multi-behavior policy optimization
- Implemented on top of verl
- Ready-to-run scripts for GRPO/PPO-style LLM post-training
- shells/: launch scripts for async, decoupled, and D-ARL experiments
- recipe/: algorithm entrypoints and replay-related training recipes
- docs/: extended documentation from the verl ecosystem
- examples/: example usages inherited from verl
- tests/: tests
We recommend Python >= 3.10 in a clean conda environment.
conda create -n darl python=3.10 -y
conda activate darl
pip install -r requirements.txt
pip install -e .
Additional dependency files are also provided:
For a direct D-ARL-style run:
bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_pmis_sg.sh
Use the following scripts as a minimal comparison suite:
bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_async.sh
bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_decoupled.sh
bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_pmis_sg.sh
Conceptually:
- *_async.sh: naive asynchronous baseline
- *_decoupled.sh: decoupled baseline
- *_pmis_sg.sh: D-ARL configuration with replay-based selection
If you are using Slurm:
bash shells/run.sh
You will likely need to adapt:
- environment name
- CUDA / NCCL setup
- model path
- data path
- GPU allocation
The D-ARL logic is exposed directly in the shell configs, e.g.:
+replay.enable=True
+replay.k=4
+replay.pmis=True
+replay.warmup=5
Commonly used async stability options include:
algorithm.rollout_is=True
algorithm.rollout_is_threshold=2.0
These options control:
- whether replay is enabled,
- how many recent behavior policies are retained,
- whether distribution-aware replay selection is used,
- and how aggressively off-policy mismatch is controlled.
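To give a feel for what a threshold such as `algorithm.rollout_is_threshold=2.0` can do, here is a minimal sketch of truncated importance weights. It assumes the threshold acts as an upper cap on the ratio between the training policy and the rollout policy; whether verl applies it this way, and at token or sequence level, is an assumption here and should be checked against the verl documentation. The function name and arguments are hypothetical.

```python
import math


def truncated_is_weights(logp_current, logp_rollout, threshold=2.0):
    """Importance weights exp(logp_current - logp_rollout), capped at threshold.

    Both arguments are equal-length sequences of log-probabilities for the
    same actions, under the current policy and the rollout policy.
    """
    weights = []
    for lc, lr in zip(logp_current, logp_rollout):
        ratio = math.exp(lc - lr)
        # Capping bounds the influence of badly mismatched samples.
        weights.append(min(ratio, threshold))
    return weights
```

A lower threshold controls off-policy mismatch more aggressively, at the cost of more bias in the weighted estimate.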
We evaluate D-ARL on six public reasoning benchmarks:
- Mathematical reasoning
  - GSM8K
  - LightEval
  - MATH-500
  - AIME24
- Code reasoning
  - LiveCodeBench
  - HumanEval
Backbone models in the paper include:
- Qwen3-1.7B
- Qwen3-4B
For more implementation details, see:
These are particularly useful for understanding:
- one-step off-policy training,
- fully asynchronous training,
- rollout importance sampling,
- and system-level async RL design.
If you find D-ARL helpful for your research, please cite:
@inproceedings{darl2026,
title = {D-ARL: A Distribution-Matched Asynchronous Reinforcement Learning Framework for Language Reasoning},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2026}
}
You can replace this with the full camera-ready BibTeX once the official proceedings metadata is available.
This repository is built on top of verl, an excellent open-source RL training framework for LLMs.
D-ARL extends this foundation with a research focus on:
- asynchronous policy staleness,
- distribution-matched replay,
- and robust optimization for language reasoning.

