D-ARL: A Distribution-Matched Asynchronous Reinforcement Learning Framework for Language Reasoning

ICML 2026

Stable and efficient asynchronous RL for LLM reasoning

Important

This repository contains the code for D-ARL, our paper on asynchronous reinforcement learning for language reasoning, accepted at ICML 2026.

D-ARL is built on top of verl and focuses on one central challenge in async RL for LLM post-training: the distribution mismatch between stale behavior policies and the current policy.

News 🔥

2026.05: D-ARL was accepted to ICML 2026.

Introduction 📖

Asynchronous RL is attractive for LLM post-training because it decouples rollout generation from policy optimization and improves hardware utilization.

However, this efficiency comes at a price:

rollouts are produced by stale behavior policies,
replayed data may no longer match the current policy,
and naive async training can become unstable.

D-ARL addresses this problem with:

Distribution-matched replay over the most recent K behavior policies,
Variance-guided sample selection for better-aligned asynchronous data,
Multi-behavior policy optimization to exploit multi-source replay.

In short, D-ARL aims to make async RL both fast and stable.
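
As a rough illustration of variance-guided selection, the sketch below scores each replayed sample by the variance of its per-token log importance ratios (current policy vs. the behavior policy that generated it) and keeps the lowest-variance samples. This is only one plausible reading of "variance-guided"; the criterion used in the paper may differ, and the names `log_ratio_variance`, `select_low_variance`, and the sample dictionary layout are hypothetical and not part of this repository.

```python
from statistics import pvariance


def log_ratio_variance(current_logprobs, behavior_logprobs):
    """Variance of per-token log importance ratios for one sample."""
    log_ratios = [c - b for c, b in zip(current_logprobs, behavior_logprobs)]
    return pvariance(log_ratios)


def select_low_variance(samples, keep):
    """Return the `keep` samples whose log-ratio variance is smallest."""
    ranked = sorted(
        samples,
        key=lambda s: log_ratio_variance(
            s["current_logprobs"], s["behavior_logprobs"]
        ),
    )
    return ranked[:keep]


# Toy data (hypothetical): "a" matches the current policy exactly,
# "c" is slightly off, and "b" has drifted the most.
samples = [
    {"id": "a", "current_logprobs": [-1.0, -2.0, -0.5],
     "behavior_logprobs": [-1.0, -2.0, -0.5]},
    {"id": "b", "current_logprobs": [-1.0, -2.0, -0.5],
     "behavior_logprobs": [-3.0, -0.5, -0.6]},
    {"id": "c", "current_logprobs": [-1.0, -2.0, -0.5],
     "behavior_logprobs": [-1.1, -2.1, -0.4]},
]

selected = select_low_variance(samples, keep=2)
print([s["id"] for s in selected])
```

With this toy data the most drifted sample ("b") is the one dropped.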

Why D-ARL?

Most async RL pipelines use stale data because it is available.
D-ARL uses stale data only when it is still useful.

Core idea

Keep a replay buffer over recent behavior policies
Measure how well samples match the current policy
Select high-quality, distribution-matched samples
Optimize with a multi-behavior objective instead of collapsing all replay data into one source

This is the key difference between D-ARL and plain async RL.
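
The first step of the core idea, a replay buffer limited to the K most recent behavior policies, can be sketched as follows. This is a minimal illustration and not the buffer implemented in `recipe/`; the class name `RecentPolicyReplay` and its methods are hypothetical, and it assumes policy versions are added in non-decreasing order.

```python
from collections import OrderedDict


class RecentPolicyReplay:
    """Keeps rollouts from only the K most recent behavior-policy versions."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._by_version = OrderedDict()  # policy version -> list of rollouts

    def add(self, version, rollout):
        """Store a rollout under the policy version that generated it."""
        self._by_version.setdefault(version, []).append(rollout)
        # Evict the oldest versions once more than k are stored.
        while len(self._by_version) > self.k:
            self._by_version.popitem(last=False)

    def versions(self):
        """Policy versions currently retained, oldest first."""
        return list(self._by_version.keys())

    def samples(self):
        """Yield (version, rollout) pairs so the source policy is never lost."""
        for version, rollouts in self._by_version.items():
            for rollout in rollouts:
                yield version, rollout


buffer = RecentPolicyReplay(k=4)
for version in range(6):
    buffer.add(version, f"rollout-from-policy-{version}")

print(buffer.versions())
```

Keeping the version next to each rollout is what lets a later stage treat replayed data as coming from several behavior policies rather than one merged source.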

Highlights

Stable and fast async RL for language reasoning
Replay buffer over multiple behavior policies
Distribution-aware sample selection
Multi-behavior policy optimization
Implemented on top of verl
Ready-to-run scripts for GRPO/PPO-style LLM post-training

Repository Structure

shells/: launch scripts for async, decoupled, and D-ARL experiments
recipe/: algorithm entrypoints and replay-related training recipes
docs/: extended documentation from the verl ecosystem
examples/: example usages inherited from verl
tests/: test suite

Quick Start 💻

Step 1: Install

We recommend Python >= 3.10 in a clean conda environment.

conda create -n darl python=3.10 -y
conda activate darl

pip install -r requirements.txt
pip install -e .

Additional dependency files are also provided:

requirements-cuda.txt
requirements-npu.txt
requirements_sglang.txt
requirements_transferqueue.txt

Step 2: Run D-ARL

For a direct D-ARL-style run:

bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_pmis_sg.sh

Step 3: Compare baselines

Use the following scripts as a minimal comparison suite:

bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_async.sh
bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_decoupled.sh
bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_pmis_sg.sh

Conceptually:

*_async.sh: naive asynchronous baseline
*_decoupled.sh: decoupled baseline
*_pmis_sg.sh: D-ARL configuration with replay-based selection
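
If you prefer to launch the comparison suite from one place, a small driver like the one below can build and run the three commands in sequence. It is a convenience sketch, not a script shipped with the repository: `build_commands` and `run_all` are hypothetical helpers, and it assumes you run it from the repository root. By default it only prints the commands.

```python
import subprocess
from pathlib import Path

# Shared prefix of the three comparison scripts listed above.
PREFIX = "one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4"
VARIANTS = ["async", "decoupled", "pmis_sg"]


def build_commands(shell_dir="shells"):
    """Return one `bash <script>` command per variant."""
    return [
        ["bash", (Path(shell_dir) / f"{PREFIX}_{variant}.sh").as_posix()]
        for variant in VARIANTS
    ]


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the suite at the first failing run.
            subprocess.run(cmd, check=True)


run_all(dry_run=True)
```

Call `run_all(dry_run=False)` to actually start the runs one after another.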

Step 4: Cluster launch

If you are using Slurm:

bash shells/run.sh

You will likely need to adapt:

environment name
CUDA / NCCL setup
model path
data path
GPU allocation

Main D-ARL Knobs

The D-ARL logic is exposed directly in the shell configs, e.g.:

+replay.enable=True
+replay.k=4
+replay.pmis=True
+replay.warmup=5

Commonly used async stability options include:

algorithm.rollout_is=True
algorithm.rollout_is_threshold=2.0

These options control:

whether replay is enabled,
how many recent behavior policies are retained,
whether distribution-aware replay selection is used,
and how aggressively off-policy mismatch is controlled.
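
To make the last point concrete, the sketch below shows truncated importance sampling, a common way to bound off-policy mismatch. It assumes that `algorithm.rollout_is_threshold` acts as an upper cap on the importance ratio between the current policy and the rollout policy; the exact semantics (token vs. sequence level, any lower bound) are defined by the training code and are not reproduced here. The function name `truncated_is_weights` is hypothetical.

```python
import math


def truncated_is_weights(current_logprobs, rollout_logprobs, threshold=2.0):
    """Per-token importance weights, capped at `threshold`."""
    weights = []
    for cur, old in zip(current_logprobs, rollout_logprobs):
        ratio = math.exp(cur - old)  # pi_current / pi_rollout for this token
        weights.append(min(ratio, threshold))
    return weights


# Toy log-probabilities (hypothetical): the second token is far more likely
# under the current policy than under the rollout policy, so it gets capped.
weights = truncated_is_weights(
    current_logprobs=[-1.0, -0.2, -3.0],
    rollout_logprobs=[-1.0, -2.0, -2.5],
    threshold=2.0,
)
print(weights)
```

A lower threshold caps more weights, which reduces variance at the cost of more bias; a higher threshold does the opposite.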

Benchmarks

We evaluate D-ARL on six public reasoning benchmarks:

Mathematical reasoning:
- GSM8K
- LightEval
- MATH-500
- AIME24

Code reasoning:
- LiveCodeBench
- HumanEval

Backbone models in the paper include:

Qwen3-1.7B
Qwen3-4B

Documentation

For more implementation details, see the documentation under docs/. It is particularly useful for understanding:

one-step off-policy training,
fully asynchronous training,
rollout importance sampling,
and system-level async RL design.

Citation

If you find D-ARL helpful for your research, please cite:

@inproceedings{darl2026,
  title     = {D-ARL: A Distribution-Matched Asynchronous Reinforcement Learning Framework for Language Reasoning},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}

This entry will be updated with the full camera-ready BibTeX once the official proceedings metadata is available.

Acknowledgement

This repository is built on top of verl, an excellent open-source RL training framework for LLMs.

D-ARL extends this foundation with a research focus on:

asynchronous policy staleness,
distribution-matched replay,
and robust optimization for language reasoning.
