YinqiBai962/D-ARL

D-ARL: A Distribution-Matched Asynchronous Reinforcement Learning Framework for Language Reasoning

ICML 2026

Stable and efficient asynchronous RL for LLM reasoning


Important

This repository contains the code for D-ARL, our ICML 2026 accepted paper on asynchronous reinforcement learning for language reasoning.

D-ARL is built on top of verl, and focuses on one central challenge in async RL for LLM post-training: distribution mismatch between stale behavior policies and the current policy.


News 🔥

  • 2026.05 D-ARL is accepted to ICML 2026.

Introduction 📖

Asynchronous RL is attractive for LLM post-training because it decouples rollout generation from policy optimization and improves hardware utilization.

However, this efficiency comes at a price:

  • rollouts are produced by stale behavior policies,
  • replayed data may no longer match the current policy,
  • and naive async training can become unstable.

D-ARL addresses this problem with:

  1. Distribution-matched replay over the most recent K behavior policies,
  2. Variance-guided sample selection for better-aligned asynchronous data,
  3. Multi-behavior policy optimization to exploit multi-source replay.

In short, D-ARL aims to make async RL both fast and stable.
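
To make the third component more concrete, here is a minimal sketch of an objective in which every replayed sample keeps its own behavior-policy log-probability, so the importance ratio is computed per source instead of against one shared stale policy. It uses a generic PPO-style clipped surrogate as a stand-in. The function name `multi_behavior_surrogate`, the tuple layout, and `clip_eps` are assumptions made for illustration, not the objective defined in the paper or the code in this repository.

```python
import math


def multi_behavior_surrogate(samples, clip_eps=0.2):
    """Mean clipped surrogate over samples drawn from several behavior policies.

    Each sample is a tuple (current_logprob, behavior_logprob, advantage),
    where behavior_logprob comes from whichever policy generated that sample.
    """
    total = 0.0
    for current_logprob, behavior_logprob, advantage in samples:
        # Per-sample importance ratio against the sample's own behavior policy.
        ratio = math.exp(current_logprob - behavior_logprob)
        # Standard PPO-style clipping of the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        total += min(ratio * advantage, clipped * advantage)
    return total / len(samples)


# Two samples from different behavior policies: one on-policy, one stale.
objective = multi_behavior_surrogate([
    (0.0, 0.0, 1.0),          # ratio 1.0, contributes 1.0
    (math.log(2), 0.0, 1.0),  # ratio about 2.0, clipped to 1.2
])
```

The point of the sketch is only that the ratio is formed sample by sample; how D-ARL weights or combines the sources is described in the paper.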

[Figure: D-ARL framework]


Why D-ARL?

Most async RL pipelines use stale data because it is available.
D-ARL uses stale data only when it is still useful.

Core idea

  • Keep a replay buffer over recent behavior policies
  • Measure how well samples match the current policy
  • Select distribution-matched high-quality samples
  • Optimize with a multi-behavior objective instead of collapsing all replay data into one source

This is the key difference between D-ARL and plain async RL.
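
The core idea above can be sketched in a few lines. The class below keeps rollouts from only the K most recent behavior policies and picks the samples whose log-probabilities are closest under the current policy. All names (`Sample`, `RecentPolicyReplay`, `select`) are hypothetical, and the absolute log-ratio used as the mismatch score is a simple stand-in chosen for illustration; it is not the variance-guided criterion from the paper.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    policy_version: int      # behavior policy that generated the rollout
    behavior_logprob: float  # sequence log-prob under that behavior policy
    current_logprob: float   # sequence log-prob under the current policy


class RecentPolicyReplay:
    """Hypothetical buffer over the K most recent behavior policies."""

    def __init__(self, k: int):
        # One entry per behavior policy; the oldest entry is evicted
        # automatically once more than k policies have been added.
        self.batches = deque(maxlen=k)

    def add(self, samples):
        self.batches.append(list(samples))

    def select(self, n: int):
        # Rank by |log pi_current - log pi_behavior| (assumed mismatch proxy)
        # and keep the n best-matched samples.
        pool = [s for batch in self.batches for s in batch]
        pool.sort(key=lambda s: abs(s.current_logprob - s.behavior_logprob))
        return pool[:n]


buf = RecentPolicyReplay(k=2)
buf.add([Sample(0, -10.0, -14.0)])                          # evicted below
buf.add([Sample(1, -10.0, -11.0), Sample(1, -9.0, -9.2)])
buf.add([Sample(2, -8.0, -8.1)])
chosen = buf.select(2)
```

With `k=2`, the rollouts from policy version 0 are dropped when version 2 arrives, and `select(2)` returns the two samples with the smallest log-probability gap.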

[Figure: D-ARL two-stage framework]


Highlights

  • Stable and fast async RL for language reasoning
  • Replay buffer over multiple behavior policies
  • Distribution-aware sample selection
  • Multi-behavior policy optimization
  • Implemented on top of verl
  • Ready-to-run scripts for GRPO/PPO-style LLM post-training

Repository Structure

  • shells/: launch scripts for async, decoupled, and D-ARL experiments
  • recipe/: algorithm entrypoints and replay-related training recipes
  • docs/: extended documentation from the verl ecosystem
  • examples/: example usages inherited from verl
  • tests/: test suite

Quick Start 💻

Step 1: Install

We recommend Python >= 3.10 in a clean conda environment.

conda create -n darl python=3.10 -y
conda activate darl

pip install -r requirements.txt
pip install -e .

Additional dependency files are also provided:

Step 2: Run D-ARL

For a direct D-ARL-style run:

bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_pmis_sg.sh

Step 3: Compare baselines

Use the following scripts as a minimal comparison suite:

bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_async.sh
bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_decoupled.sh
bash shells/one_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_pmis_sg.sh

Conceptually:

  • *_async.sh: naive asynchronous baseline
  • *_decoupled.sh: decoupled baseline
  • *_pmis_sg.sh: D-ARL configuration with replay-based selection

Step 4: Cluster launch

If you are using Slurm:

bash shells/run.sh

You will likely need to adapt:

  • environment name
  • CUDA / NCCL setup
  • model path
  • data path
  • GPU allocation

Main D-ARL Knobs

The D-ARL logic is exposed directly in the shell configs, e.g.:

+replay.enable=True
+replay.k=4
+replay.pmis=True
+replay.warmup=5

Commonly used async stability options include:

algorithm.rollout_is=True
algorithm.rollout_is_threshold=2.0

These options control:

  • whether replay is enabled,
  • how many recent behavior policies are retained,
  • whether distribution-aware replay selection is used,
  • and how aggressively off-policy mismatch is controlled.
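
As a rough illustration of the last point, the sketch below computes truncated importance weights between the current policy and the behavior policy. It assumes that `algorithm.rollout_is_threshold` acts as an upper cap on the importance ratio; the exact semantics in this repository and in verl may differ, and the function name `truncated_is_weights` is hypothetical.

```python
import math


def truncated_is_weights(current_logprobs, behavior_logprobs, threshold=2.0):
    """Importance weights exp(log pi_current - log pi_behavior), capped at threshold."""
    weights = []
    for current, behavior in zip(current_logprobs, behavior_logprobs):
        ratio = math.exp(current - behavior)
        # Capping bounds the influence of samples the stale policy found unlikely.
        weights.append(min(ratio, threshold))
    return weights


weights = truncated_is_weights(
    current_logprobs=[0.0, 0.0, -1.0],
    behavior_logprobs=[0.0, -5.0, 0.0],
    threshold=2.0,
)
```

Here the first sample is on-policy (weight 1.0), the second has a ratio of exp(5) and is capped at 2.0, and the third has a ratio below 1 and passes through unchanged. A lower threshold controls mismatch more aggressively at the cost of more bias.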

Benchmarks

We evaluate D-ARL on six public reasoning benchmarks:

  • Mathematical reasoning
    • GSM8K
    • LightEval
    • MATH-500
    • AIME24
  • Code reasoning
    • LiveCodeBench
    • HumanEval

Backbone models in the paper include:

  • Qwen3-1.7B
  • Qwen3-4B

Documentation

For more implementation details, see:

These are particularly useful for understanding:

  • one-step off-policy training,
  • fully asynchronous training,
  • rollout importance sampling,
  • and system-level async RL design.

Citation

If you find D-ARL helpful for your research, please cite:

@inproceedings{darl2026,
  title     = {D-ARL: A Distribution-Matched Asynchronous Reinforcement Learning Framework for Language Reasoning},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}

The full camera-ready BibTeX entry will replace this one once the official proceedings metadata is available.


Acknowledgement

This repository is built on top of verl, an excellent open-source RL training framework for LLMs.

D-ARL extends this foundation with a research focus on:

  • asynchronous policy staleness,
  • distribution-matched replay,
  • and robust optimization for language reasoning.
