When On-Policy Distillation (OPD) is applied to Tool-Integrated Reasoning (TIR), it suffers from cascading error propagation: incorrect tool calls inject out-of-distribution observations that progressively amplify the student-teacher distribution shift, rendering the teacher's token-level supervision unreliable or even harmful.
SOD (Step-wise On-policy Distillation) addresses this by introducing an adaptive step-level weighting mechanism that:
- Suppresses distillation loss on steps where the student has drifted far from the teacher (erroneous pattern)
- Restores full supervision when the student recovers alignment (recovery pattern)
- Maintains dense token-level guidance on well-aligned steps (stable pattern)
All at negligible additional computational cost: the divergence metric reuses log-probabilities already computed in the OPD forward pass.
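To make the mechanism concrete, here is a minimal PyTorch sketch of step-level weighting under stated assumptions: the exponential mapping, the temperature `tau`, and all function names are illustrative, not the paper's exact formulation. The only grounded ingredient is that the step divergence is derived from the student/teacher log-probabilities already produced for the OPD loss.

```python
import torch

def sod_step_weights(student_logps, teacher_logps, step_mask, tau=1.0):
    """Hypothetical sketch of SOD's adaptive step-level weighting.

    student_logps / teacher_logps: (seq_len,) log-probs of the student's
        sampled tokens under each model (already computed for OPD).
    step_mask: (num_steps, seq_len) 0/1 float mask assigning tokens to steps.
    Returns one weight per step in (0, 1].
    """
    # Per-token divergence on the sampled trajectory: the gap between
    # student and teacher log-probs of the same tokens.
    token_div = student_logps - teacher_logps            # (seq_len,)

    # Average divergence per step.
    tokens_per_step = step_mask.sum(dim=-1).clamp(min=1)
    step_div = (step_mask * token_div).sum(dim=-1) / tokens_per_step

    # Map divergence to a weight: aligned steps (low divergence) keep a
    # weight near 1 (stable pattern); drifted steps are suppressed
    # (erroneous pattern); a step that re-aligns has its weight restored
    # (recovery pattern).
    return torch.exp(-step_div.clamp(min=0) / tau)

def sod_loss(per_token_kl, student_logps, teacher_logps, step_mask, tau=1.0):
    """Weight the dense token-level OPD loss by detached step weights."""
    w = sod_step_weights(student_logps, teacher_logps, step_mask, tau).detach()
    per_step_loss = (step_mask * per_token_kl).sum(dim=-1)
    return (w * per_step_loss).sum() / step_mask.sum().clamp(min=1)
```

Detaching the weights keeps gradients flowing only through the token-level KL term, so the weighting acts purely as a per-step gate.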
Experiments on challenging math, science, and code benchmarks show that SOD achieves up to a 20.86% improvement over the second-best baseline. Notably, our 0.6B student reaches 26.13% average@32 on AIME 2025.
```bash
git clone https://github.com/YoungZ365/SOD.git
conda create -n SOD python=3.11
conda activate SOD
cd SOD
bash scripts/install_vllm_sglang_mcore.sh
pip install -e .[vllm]
```

Download the following datasets:
| Dataset | Link | Usage |
|---|---|---|
| 3K Agentic SFT Data | 🤗 HuggingFace | Cold-start SFT |
| 30K Agentic RL Data | 🤗 HuggingFace | RL / Distillation Training |
| Evaluation Benchmarks | 🤗 HuggingFace | AIME2024/2025, GPQA-Diamond, LiveCodeBench |
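The datasets ship as `.parquet` files. As a quick sanity check before training, you can inspect a downloaded split with pandas; the local path below is a placeholder for wherever you saved the file.

```python
import pandas as pd

# Placeholder path: point this at your downloaded SFT split.
df = pd.read_parquet("data/SFT.parquet")

print(len(df))                  # number of training examples
print(df.columns.tolist())      # inspect the schema before training
print(df.iloc[0])               # peek at one example
```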
Configure SandboxFusion for code execution:
- Local Deployment: Refer to SandboxFusion deployment docs
- Cloud Service: Use Volcano Engine Code Sandbox
After obtaining an API endpoint, configure it in:
- `recipe/demystify/sandbox_fusion_tool_config.yaml`
- The function `check_correctness` in `verl/utils/reward_score/livecodebench/code_math.py`
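Before wiring the endpoint into those two places, it can help to smoke-test it. The sketch below assumes SandboxFusion's documented `/run_code` REST route and a placeholder localhost URL; adjust both to your deployment.

```python
import requests

SANDBOX_URL = "http://localhost:8080"  # placeholder endpoint

# Assumed /run_code route per SandboxFusion's docs: submit a snippet
# and a language, get back the execution result as JSON.
resp = requests.post(
    f"{SANDBOX_URL}/run_code",
    json={"code": "print('sandbox ok')", "language": "python"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```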
Configure `examples/SOD/run_sft.sh` with your paths:
- `MODEL_PATH`: Base model path (e.g., Qwen3-1.7B or Qwen3-0.6B)
- `TRAIN_DATA`: Path to the SFT `.parquet` file
- `SAVE_PATH`: Directory to save SFT checkpoints
```bash
bash examples/SOD/run_sft.sh
```

After SFT, merge the model checkpoint:

```bash
python3 -m verl.model_merger merge --backend fsdp \
    --local_dir <checkpoint_dir>/global_step_xxx \
    --target_dir <checkpoint_dir>/global_step_xxx/huggingface
```

Configure `examples/SOD/run_sod.sh` with your paths:
- `MODEL_PATH`: Path to the SFT student model
- `TEACHER_MODEL_PATH`: Path to the teacher model (e.g., a GRPO-trained 4B model)
- `TRAIN_DATA`: Path to the RL `.parquet` file (30K dataset)
- Evaluation data paths for AIME2024/2025
```bash
bash examples/SOD/run_sod.sh
```

Training Resources: 8× NVIDIA H20 96GB GPUs, batch size 64.
You can monitor training dynamics and evaluation results via Weights & Biases (wandb).
We support evaluation on AIME 2024/2025, GPQA-Diamond, and LiveCodeBench-v6.
Taking AIME as an example:
```bash
bash examples/SOD/eval/run_eval_aime.sh
```

You can view average@32 / pass@32 / maj@32 metrics in your wandb project.
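For reference, the three metrics reduce to simple statistics over the k = 32 sampled solutions per problem. The sketch below uses illustrative variable names and assumes identical answer strings share the same correctness label.

```python
from collections import Counter

def avg_pass_maj(answers, correct):
    """answers: the k extracted final answers for one problem.
    correct: per-sample booleans against the ground truth."""
    k = len(answers)
    avg_at_k = sum(correct) / k                          # average@k
    pass_at_k = float(any(correct))                      # pass@k
    majority, _ = Counter(answers).most_common(1)[0]     # most frequent answer
    maj_at_k = float(correct[answers.index(majority)])   # maj@k
    return avg_at_k, pass_at_k, maj_at_k

print(avg_pass_maj(["7", "7", "5"], [True, True, False]))  # (0.666..., 1.0, 1.0)
```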
Coming soon.
Our implementation builds upon the excellent codebases of VeRL, Open-AgentRL, and ReTool. We sincerely thank these projects for their valuable contributions to the community.

