SOD: Step-wise On-policy Distillation for Small Language Model Agents

Paper on arXiv | Datasets for Agent RL

Introduction

Applying On-Policy Distillation (OPD) to Tool-Integrated Reasoning (TIR) suffers from cascading error propagation: incorrect tool calls inject out-of-distribution observations that progressively amplify the student-teacher distribution shift, rendering the teacher's token-level supervision unreliable or even harmful.

SOD (Step-wise On-policy Distillation) addresses this by introducing an adaptive step-level weighting mechanism that:

  • Suppresses distillation loss on steps where the student has drifted far from the teacher (erroneous pattern)
  • Restores full supervision when the student recovers alignment (recovery pattern)
  • Maintains dense token-level guidance on well-aligned steps (stable pattern)

All of this comes at negligible additional computational cost: the divergence metric reuses log-probabilities already computed in the OPD forward pass.
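As a concrete illustration, here is a minimal sketch of such a weighting rule. The exact metric and schedule are not specified here, so everything below (the function name step_weights, the temperature tau, and the exponential suppression rule) is a hypothetical rendering of the suppress/restore/maintain behavior, using only the per-token log-probabilities the OPD forward pass already produces:

import torch

def step_weights(student_logps, teacher_logps, step_ids, tau=1.0):
    """Hypothetical per-step weights for gating the distillation loss.

    student_logps, teacher_logps: (T,) log-probabilities of the student's
        sampled rollout tokens under each model (reused from the OPD pass).
    step_ids: (T,) integer id of the tool-call step each token belongs to.
    tau: illustrative temperature controlling how sharply drift is punished.
    """
    # Per-token drift proxy: how much less likely the teacher finds the
    # student's own tokens compared to the student itself.
    drift = student_logps - teacher_logps
    weights = torch.ones_like(drift)
    for s in step_ids.unique():
        mask = step_ids == s
        step_drift = drift[mask].mean()
        # Erroneous pattern: large positive drift -> weight near 0.
        # Stable / recovery pattern: drift near or below 0 -> weight near 1.
        weights[mask] = torch.exp(-torch.clamp(step_drift, min=0.0) / tau)
    return weights

A step whose tokens the teacher still assigns high probability keeps full weight; a step dominated by out-of-distribution tool output is suppressed, and supervision returns automatically once later steps realign.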

Experiments on challenging math, science, and code benchmarks show that SOD achieves up to a 20.86% improvement over the second-best baseline. Notably, our 0.6B student reaches 26.13% average@32 on AIME 2025.

Framework

(Overview figure of the SOD framework appears here in the repository.)

🚀 Get Started

Environment Setup

git clone https://github.com/YoungZ365/SOD.git
conda create -n SOD python=3.11
conda activate SOD
cd SOD
bash scripts/install_vllm_sglang_mcore.sh
pip install -e .[vllm]

Data Preparation

Download the following datasets:

Dataset                  Link             Usage
3K Agentic SFT Data      🤗 HuggingFace   Cold-start SFT
30K Agentic RL Data      🤗 HuggingFace   RL / distillation training
Evaluation Benchmarks    🤗 HuggingFace   AIME 2024/2025, GPQA-Diamond, LiveCodeBench
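Before wiring a split into training, it can help to peek at its schema. A minimal sketch, assuming pandas is installed; the file name below is a placeholder for whichever split you downloaded:

import pandas as pd

# Placeholder path: point at the downloaded .parquet split.
df = pd.read_parquet("data/agentic_rl_30k.parquet")
print(df.columns.tolist())  # the fields the training scripts will read
print(df.iloc[0])           # one full example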

Sandbox Configuration

Configure SandboxFusion for code execution:

  1. Local Deployment: Refer to SandboxFusion deployment docs
  2. Cloud Service: Use Volcano Engine Code Sandbox

After obtaining an API endpoint, configure it in:

  • recipe/demystify/sandbox_fusion_tool_config.yaml
  • The function check_correctness in verl/utils/reward_score/livecodebench/code_math.py
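To verify the endpoint before training, a quick smoke test helps. This assumes a SandboxFusion deployment reachable at the URL below; adjust the host and port to your setup (/run_code is the endpoint documented by SandboxFusion):

import requests

# Adjust to your SandboxFusion deployment (local or Volcano Engine).
ENDPOINT = "http://localhost:8080/run_code"

resp = requests.post(
    ENDPOINT,
    json={"code": "print(40 + 2)", "language": "python"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # a healthy sandbox reports a successful run with stdout "42"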

🔧 Training

Step 1: Cold-Start SFT

Configure examples/SOD/run_sft.sh with your paths:

  • MODEL_PATH: Base model path (e.g., Qwen3-1.7B or Qwen3-0.6B)
  • TRAIN_DATA: Path to the SFT .parquet file
  • SAVE_PATH: Directory to save SFT checkpoints

Then launch:

bash examples/SOD/run_sft.sh

After SFT, merge the model checkpoint:

python3 -m verl.model_merger merge --backend fsdp \
    --local_dir <checkpoint_dir>/global_step_xxx \
    --target_dir <checkpoint_dir>/global_step_xxx/huggingface
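The merger consolidates the sharded FSDP checkpoint into a standard Hugging Face folder. A quick load check, using the standard transformers API (the path mirrors the --target_dir above):

from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "<checkpoint_dir>/global_step_xxx/huggingface"  # your --target_dir

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)
print(model.config.model_type)  # e.g. "qwen3" for a Qwen3-0.6B/1.7B student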

Step 2: SOD Training (Step-wise On-policy Distillation)

Configure examples/SOD/run_sod.sh with your paths:

  • MODEL_PATH: Path to the SFT student model
  • TEACHER_MODEL_PATH: Path to the teacher model (e.g., a GRPO-trained 4B model)
  • TRAIN_DATA: Path to the RL .parquet file (30K dataset)
  • Evaluation data paths for AIME 2024/2025

Then launch:

bash examples/SOD/run_sod.sh

Training Resources: 8× NVIDIA H20 96GB GPUs, batch size 64.

You can monitor training dynamics and evaluation results via Weights & Biases (wandb).
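For intuition on how the step weights enter training, here is a hedged sketch of a step-weighted, token-level distillation objective. SOD's actual loss in run_sod.sh may differ; reverse KL over the student's own rollout is shown only as one common on-policy distillation choice, and sod_loss is a hypothetical name:

import torch
import torch.nn.functional as F

def sod_loss(student_logits, teacher_logits, weights):
    """Illustrative step-weighted token-level distillation loss.

    student_logits, teacher_logits: (T, V) logits at each position of the
        student's rollout; weights: (T,) per-step weights such as those
        from the step_weights() sketch above.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # Per-position reverse KL(student || teacher), summed over the vocab.
    per_token_kl = (s_logp.exp() * (s_logp - t_logp.detach())).sum(-1)
    # Drifted steps contribute little (their teacher signal is unreliable);
    # well-aligned steps keep dense token-level guidance.
    return (weights.detach() * per_token_kl).mean()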

📊 Evaluation

We support evaluation on AIME 2024/2025, GPQA-Diamond, and LiveCodeBench-v6.

Taking AIME as an example:

bash examples/SOD/eval/run_eval_aime.sh

You can track the average@32 / pass@32 / maj@32 metrics in your wandb project.
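For reference, all three metrics are simple statistics over the 32 samples drawn per problem. A sketch of the standard definitions (function name ours):

from collections import Counter

def k_sample_metrics(answers, gold):
    """average@k / pass@k / maj@k for one problem's k sampled answers."""
    hits = [a == gold for a in answers]
    average_at_k = sum(hits) / len(answers)           # mean accuracy
    pass_at_k = any(hits)                             # at least one correct
    majority = Counter(answers).most_common(1)[0][0]  # most frequent answer
    return average_at_k, pass_at_k, majority == gold  # maj@k

The 26.13% average@32 quoted in the Introduction is average_at_k with k = 32, averaged over all AIME 2025 problems.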

📝 Citation

Coming soon.

🙏 Acknowledgements

Our implementation builds upon the excellent codebases of VeRL, Open-AgentRL, and ReTool. We sincerely thank these projects for their valuable contributions to the community.

About

PyTorch-based open-source code for paper "SOD: Step-wise On-policy Distillation for Small Language Model Agents"
