When the reward signal is lost on hard prompts during RL (all sampled trajectories are wrong), the LLM self-generates a hint to guide sampling, improving both prompt usage and LLM performance.
- [02/08/2026] SAGE reproduction code is released! A Hugging Face collection of datasets and models is also released.
- [02/03/2026] SAGE paper is released on arXiv!
When an LLM cannot sample any correct trajectory for a hard prompt, it self-generates a hint from the prompt's reference solution. The hint is then fed to the LLM together with the hard prompt, avoiding advantage collapse and ensuring that correct trajectories are sampled to update the policy model.
Without a hint, some hard prompts are never used for GRPO, whereas SAGE increases the prompt usage rate by 10% for the weaker LLM. Using hard prompts during RL encourages the LLM's exploration, leading to consistently better performance.
Among all methods, SAGE retains the on-policy property of GRPO, with a similar entropy scale. Learning from hard prompts also promotes exploration, with response length growing steadily across various LLMs.
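The mechanism above can be sketched in a few lines. This is an illustrative toy, not the repository's actual implementation; `sample`, `make_hint`, and `grade` are hypothetical stand-ins for the rollout engine, the self-hint generation, and the response grader.

```python
# Toy sketch of SAGE's self-hinting loop (illustrative only; the function
# names below are stand-ins, not the repo's real API).

def sample(prompt, n):
    # Stand-in for an LLM rollout returning n candidate answers. We simulate
    # a hard prompt that the "model" only solves once a hint is present.
    if "HINT:" in prompt:
        return ["42"] * (n // 2) + ["wrong"] * (n - n // 2)
    return ["wrong"] * n

def make_hint(prompt, reference_solution):
    # Stand-in for self-generating a hint from the reference solution.
    return f"HINT: the key step leads to {reference_solution}"

def grade(answer, reference_solution):
    # Binary reward, as in verifiable-reward RL setups.
    return 1.0 if answer == reference_solution else 0.0

def sage_rollout(prompt, reference_solution, n=8):
    """If every trajectory is wrong (the GRPO advantage collapses to zero),
    prepend a self-generated hint and resample."""
    answers = sample(prompt, n)
    rewards = [grade(a, reference_solution) for a in answers]
    if max(rewards) == 0.0:  # all wrong -> no learning signal without SAGE
        hinted = make_hint(prompt, reference_solution) + "\n" + prompt
        answers = sample(hinted, n)
        rewards = [grade(a, reference_solution) for a in answers]
    return answers, rewards

answers, rewards = sage_rollout("What is 6 * 7?", "42")
print(max(rewards))  # -> 1.0: the hinted resample yields correct trajectories
```

The point of the sketch: only the all-wrong case triggers hinting, so easy prompts stay fully on-policy while hard prompts still contribute a non-zero advantage.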
Our code is based on verl. If you already have a verl environment, you can use it and install the extra packages when prompted.
Note: SAGE has been tested on NVIDIA B200 (Blackwell, sm_100) GPUs. If you use older GPUs (e.g., A100, H100), you may use `cu124` or `cu126` index URLs instead of `cu128`, and adjust the torch/vllm versions accordingly.
- Create a new environment

  ```bash
  python -m venv ~/.python/sage
  source ~/.python/sage/bin/activate
  # Or use conda
  # conda create -n sage python==3.10
  # conda activate sage
  ```
- Install PyTorch and build tools

  ```bash
  pip install --upgrade pip
  pip install uv
  python -m uv pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128
  python -m uv pip install torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
  python -m uv pip install -U pip setuptools wheel packaging psutil
  ```
- Install flash-attn (compiled from source, may take ~10 minutes)

  ```bash
  python -m uv pip install flash-attn==2.8.0.post2 --no-build-isolation
  ```
- Install SAGE and dependencies

  ```bash
  git clone https://github.com/BaohaoLiao/SAGE.git
  cd ./SAGE
  python -m uv pip install -r requirements.txt
  python -m uv pip install -e .
  ```
- Install vllm (must match torch 2.8.0)

  ```bash
  python -m uv pip install vllm==0.10.2
  ```
- Pin compatible transformers version

  vllm may upgrade transformers to a version that removes `AutoModelForVision2Seq`. Pin it back:

  ```bash
  python -m uv pip install "transformers>=4.45.0,<4.54.0"
  ```

- Rebuild flash-attn if torch was changed by vllm

  If vllm modified the torch version, rebuild flash-attn:

  ```bash
  python -m uv pip install flash-attn==2.8.0.post2 --no-build-isolation --force-reinstall --no-cache-dir --no-deps
  ```
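A quick way to decide whether a rebuild is needed is to compare the installed torch release against the one flash-attn was compiled for. The helper below is an illustration, not part of the repo; `needs_flash_attn_rebuild` is a hypothetical name, and the comparison deliberately ignores local version tags like `+cu128`.

```python
def needs_flash_attn_rebuild(installed, expected="2.8.0"):
    """Return True if the installed torch release differs from the one
    flash-attn was built against (local tags like '+cu128' are ignored)."""
    return installed.split("+")[0] != expected

# Compare against the version reported by `python -c "import torch; print(torch.__version__)"`:
print(needs_flash_attn_rebuild("2.8.0+cu128"))  # -> False: no rebuild needed
print(needs_flash_attn_rebuild("2.7.1"))        # -> True: rebuild flash-attn
```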
- Install additional dependencies

  ```bash
  python -m uv pip install json5
  ```
- Verify the installation

  ```bash
  python -c "
  import torch; print('torch', torch.__version__)
  from vllm import LLM, SamplingParams; print('vllm OK')
  from flash_attn import flash_attn_func; print('flash-attn OK')
  from transformers import AutoModelForVision2Seq; print('transformers OK')
  x = torch.randn(2, 2, device='cuda'); print('CUDA OK:', torch.cuda.get_device_name())
  "
  ```
- Prepare training set

  ```bash
  bash scripts/prepare_data.sh
  ```
- Train with SAGE / SAGE-light. The key code is located in `recipe/hint`.

  ```bash
  bash scripts/run_sage.sh
  ```
- Baselines (optional)

  - GRPO:

    ```bash
    bash scripts/run_grpo.sh
    ```
  - LUFFY: We use LUFFY's open-sourced code. The training set is already preprocessed to LUFFY's style.
  - SFT: We use LUFFY's open-sourced code for SFT. The training set is already preprocessed to LUFFY's style.
  - Scaf-GRPO: We use Scaf-GRPO's open-sourced code. The training set is already preprocessed to Scaf-GRPO's style.
| Model name | Link |
|---|---|
| SAGE_Llama_3.2-3B-Instruct | https://huggingface.co/baohao/SAGE_Llama-3.2-3B-Instruct |
| SAGE-light_Llama-3.2-3B-Instruct | https://huggingface.co/baohao/SAGE-light_Llama-3.2-3B-Instruct |
| SAGE_Qwen2.5-7B-Instruct | https://huggingface.co/baohao/SAGE_Qwen2.5-7B-Instruct |
| SAGE-light_Qwen2.5-7B-Instruct | https://huggingface.co/baohao/SAGE-light_Qwen2.5-7B-Instruct |
| SAGE_Qwen3-4B-Instruct-2507 | https://huggingface.co/baohao/SAGE_Qwen3-4B-Instruct-2507 |
| SAGE-light_Qwen3-4B-Instruct-2507 | https://huggingface.co/baohao/SAGE-light_Qwen3-4B-Instruct-2507 |
```bash
bash scripts/eval.sh
```

If you find SAGE useful, please cite as:
```bibtex
@misc{liao2026selfhintinglanguagemodelsenhance,
      title={Self-Hinting Language Models Enhance Reinforcement Learning},
      author={Baohao Liao and Hanze Dong and Xinxing Xu and Christof Monz and Jiang Bian},
      year={2026},
      eprint={2602.03143},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.03143},
}
```

Our code is based on verl for training, vllm for sampling, and oat for the response grader. We really appreciate their contributions to the RL community.



