When the reward signal is lost on hard prompts during RL (all sampled trajectories are wrong), the LLM self-generates a hint to guide sampling, improving both prompt usage and LLM performance.
- [02/08/2026] SAGE reproduction code is released! A Hugging Face collection of datasets and models is also released.
- [02/03/2026] SAGE paper is released on arXiv!
When an LLM cannot sample any correct trajectory for a hard prompt, it self-generates a hint from the prompt's reference solution. The hint is then fed to the LLM together with the hard prompt, avoiding advantage collapse and ensuring that correct trajectories are sampled to update the policy model.
Without a hint, some hard prompts are never used for GRPO, whereas SAGE increases the prompt usage rate by 10% for the weaker LLM. Using hard prompts during RL encourages the LLM's exploration, leading to consistently better performance.
Among all methods, SAGE retains the on-policy property of GRPO, with a similar entropy scale. Learning from hard prompts also promotes exploration, with response length growing steadily across various LLMs.
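The mechanism above can be sketched in a few lines. This is an illustrative toy, not the repository's actual implementation; `sample`, `make_hint`, and `grade` are hypothetical stand-ins for the rollout engine, the self-hint generation, and the response grader.

```python
# Toy sketch of SAGE's self-hinting loop (illustrative only; the function
# names below are stand-ins, not the repo's real API).

def sample(prompt, n):
    # Stand-in for an LLM rollout returning n candidate answers. We simulate
    # a hard prompt that the "model" only solves once a hint is present.
    if "HINT:" in prompt:
        return ["42"] * (n // 2) + ["wrong"] * (n - n // 2)
    return ["wrong"] * n

def make_hint(prompt, reference_solution):
    # Stand-in for self-generating a hint from the reference solution.
    return f"HINT: the key step leads to {reference_solution}"

def grade(answer, reference_solution):
    # Binary reward, as in verifiable-reward RL setups.
    return 1.0 if answer == reference_solution else 0.0

def sage_rollout(prompt, reference_solution, n=8):
    """If every trajectory is wrong (the GRPO advantage collapses to zero),
    prepend a self-generated hint and resample."""
    answers = sample(prompt, n)
    rewards = [grade(a, reference_solution) for a in answers]
    if max(rewards) == 0.0:  # all wrong -> no learning signal without SAGE
        hinted = make_hint(prompt, reference_solution) + "\n" + prompt
        answers = sample(hinted, n)
        rewards = [grade(a, reference_solution) for a in answers]
    return answers, rewards

answers, rewards = sage_rollout("What is 6 * 7?", "42")
print(max(rewards))  # -> 1.0: the hinted resample yields correct trajectories
```

The point of the sketch: only the all-wrong case triggers hinting, so easy prompts stay fully on-policy while hard prompts still contribute a non-zero advantage.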
Our code is based on verl. If you already have a verl environment, you can use it and install the extra packages when prompted.
Note: SAGE has been tested on NVIDIA B200 (Blackwell, sm_100) GPUs. If you use older GPUs (e.g., A100, H100), you may use `cu124` or `cu126` index URLs instead of `cu128`, and adjust the torch/vllm versions accordingly.
- Create a new environment

  ```bash
  python -m venv ~/.python/sage
  source ~/.python/sage/bin/activate
  # Or use conda
  # conda create -n sage python==3.10
  # conda activate sage
  ```
- Install PyTorch and build tools

  ```bash
  pip install --upgrade pip
  pip install uv
  python -m uv pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128
  python -m uv pip install torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
  python -m uv pip install -U pip setuptools wheel packaging psutil
  ```
- Install flash-attn (compiled from source, may take ~10 minutes)

  ```bash
  python -m uv pip install flash-attn==2.8.0.post2 --no-build-isolation
  ```
- Install SAGE and dependencies

  ```bash
  git clone https://github.com/BaohaoLiao/SAGE.git
  cd ./SAGE
  python -m uv pip install -r requirements.txt
  python -m uv pip install -e .
  ```
- Install vllm (must match torch 2.8.0)

  ```bash
  python -m uv pip install vllm==0.10.2
  ```
- Pin compatible transformers version

  vllm may upgrade transformers to a version that removes `AutoModelForVision2Seq`. Pin it back:

  ```bash
  python -m uv pip install "transformers>=4.45.0,<4.54.0"
  ```

- Rebuild flash-attn if torch was changed by vllm

  If vllm modified the torch version, rebuild flash-attn:

  ```bash
  python -m uv pip install flash-attn==2.8.0.post2 --no-build-isolation --force-reinstall --no-cache-dir --no-deps
  ```
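A quick way to decide whether a rebuild is needed is to compare the installed torch release against the one flash-attn was compiled for. The helper below is an illustration, not part of the repo; `needs_flash_attn_rebuild` is a hypothetical name, and the comparison deliberately ignores local version tags like `+cu128`.

```python
def needs_flash_attn_rebuild(installed, expected="2.8.0"):
    """Return True if the installed torch release differs from the one
    flash-attn was built against (local tags like '+cu128' are ignored)."""
    return installed.split("+")[0] != expected

# Compare against the version reported by `python -c "import torch; print(torch.__version__)"`:
print(needs_flash_attn_rebuild("2.8.0+cu128"))  # -> False: no rebuild needed
print(needs_flash_attn_rebuild("2.7.1"))        # -> True: rebuild flash-attn
```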
- Install additional dependencies

  ```bash
  python -m uv pip install json5
  ```
- Verify the installation

  ```bash
  python -c "
  import torch; print('torch', torch.__version__)
  from vllm import LLM, SamplingParams; print('vllm OK')
  from flash_attn import flash_attn_func; print('flash-attn OK')
  from transformers import AutoModelForVision2Seq; print('transformers OK')
  x = torch.randn(2, 2, device='cuda'); print('CUDA OK:', torch.cuda.get_device_name())
  "
  ```
- Prepare training set

  ```bash
  bash scripts/prepare_data.sh
  ```
- Train with SAGE / SAGE-light. The key code is located in `recipe/hint`.

  ```bash
  bash scripts/run_sage.sh
  ```
- Baselines (optional)

  - GRPO:

    ```bash
    bash scripts/run_grpo.sh
    ```
  - LUFFY: We use LUFFY's open-sourced code. The training set is already preprocessed to LUFFY's style.
  - SFT: We use LUFFY's open-sourced code for SFT. The training set is already preprocessed to LUFFY's style.
  - Scaf-GRPO: We use Scaf-GRPO's open-sourced code. The training set is already preprocessed to Scaf-GRPO's style.
| Model name | Link |
|---|---|
| SAGE_Llama_3.2-3B-Instruct | https://huggingface.co/baohao/SAGE_Llama-3.2-3B-Instruct |
| SAGE-light_Llama-3.2-3B-Instruct | https://huggingface.co/baohao/SAGE-light_Llama-3.2-3B-Instruct |
| SAGE_Qwen2.5-7B-Instruct | https://huggingface.co/baohao/SAGE_Qwen2.5-7B-Instruct |
| SAGE-light_Qwen2.5-7B-Instruct | https://huggingface.co/baohao/SAGE-light_Qwen2.5-7B-Instruct |
| SAGE_Qwen3-4B-Instruct-2507 | https://huggingface.co/baohao/SAGE_Qwen3-4B-Instruct-2507 |
| SAGE-light_Qwen3-4B-Instruct-2507 | https://huggingface.co/baohao/SAGE-light_Qwen3-4B-Instruct-2507 |
```bash
bash scripts/eval.sh
```

If you find SAGE useful, please cite as:
```bibtex
@misc{liao2026selfhintinglanguagemodelsenhance,
      title={Self-Hinting Language Models Enhance Reinforcement Learning},
      author={Baohao Liao and Hanze Dong and Xinxing Xu and Christof Monz and Jiang Bian},
      year={2026},
      eprint={2602.03143},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.03143},
}
```

Our code is based on verl for training, vllm for sampling, and oat for the response grader. We really appreciate their contributions to the RL community.



