Code for *A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning*.
SePT (Self-Evolving Post-Training) is a self-help, reward-free method that improves LLM reasoning by finetuning the model on its own generated responses.
> [!NOTE]
> This codebase is not fully optimized (e.g., unnecessary logging and computation; it has not been tested on the newest versions of verl and transformers), and the LoRA implementation has not been fully checked for correctness yet. Feel free to raise any issues you encounter.
```bash
git clone https://github.com/ElementQi/SePT.git
cd SePT/sept
conda create -n sept python=3.10
conda activate sept
# to keep pkg_resources alive
pip install "setuptools<81"
# some machines need this
# conda install -c conda-forge pyzmq
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1 --no-build-isolation # choose a suitable version for your own machine
pip install -e . --no-dependencies
```

> [!NOTE]
> The installation process may differ across machines and systems, but the script above has been tested on our A800 and 3090 clusters.
> If you run into any issues while installing the packages, refer to the installation tutorial from verl and check our runnable conda environment.
Before running the scripts below, please ensure you are in the project root directory (`SePT/sept`).
> [!WARNING]
> Configuration is required before running!
>
> You MUST modify the scripts inside the `examples/` directory to fit your environment. Key parameters to change include:
>
> - Model Path: update the path to your base model.
> - Dataset Selection: choose between `sept/data` and `sept/data_user_reasoning` based on your model's behavior. `sept/data_user_reasoning` appends `Please reason step by step, and put the final answer within \boxed{}.` to each question, which can help some models but hurt others.
> - Chat Template and Prompting: there is no general system-prompt version that works for all models, and some models do not even ship with a chat template in their model folder. You may need to modify the chat template yourself, and also adjust any system prompt or reasoning prompt settings to match your model.
> - Logger: the default logger is `swanlab`. If you prefer `wandb` or just `console`, modify `trainer.logger=['console','swanlab']`.
> - Validation Workload: for faster iteration, you can remove some benchmark files from `VAL_FILE_LIST` in scripts such as `examples/sept_1e7_dsr.sh`.
> - Validation Sampling: `actor_rollout_ref.rollout.val_kwargs.n=32` is fairly expensive for pass@k evaluation. For smoke tests, you can lower it to `4` or less. Likewise, in the sweep scripts, you can reduce `N=32`.
> - Shared Memory: the example scripts default to `model.use_shm=False`. Only enable shared memory if you have verified it works well on your machine.
> - Other Hyperparameters: rollout number, batch size, and so on. A sketch of what these overrides look like appears after this warning.
```bash
conda activate sept

# for SePT training
bash examples/sept_1e7_dsr.sh

# for GRPO training
bash examples/grpo_5e7_dsr.sh

# for EM-FT training
bash examples/em_1e7_dsr.sh

# for SePT (Offline) training
bash examples/generate_solutions.sh
bash examples/offline_train.sh
```

For evaluation, we rewrote the evaluation code in the `_validate` function inside the Trainer. If you want to evaluate a specific checkpoint or a base model, use `examples/trained_model_sweep.sh` and `examples/base_model_sweep.sh`.
If you want to convert the training checkpoints via verl, follow the instructions from the verl official tutorial for model converting. The model merge script is located here.
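As a rough sketch, merging an FSDP checkpoint into a HuggingFace-format model typically looks like the command below. The script path and flags are assumptions based on common verl versions, so verify them against your checkout first:

```bash
# Assumed invocation -- verify the script path and flags for your verl
# version (e.g. python scripts/model_merger.py --help).
python scripts/model_merger.py \
    --backend fsdp \
    --local_dir checkpoints/<project>/<experiment>/global_step_<N>/actor \
    --target_dir /path/to/merged_hf_model
```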
The original datasets are located in the `sept/data` folder. There are two training sets, DSR (DeepScaleR) and OTM (OpenThoughts-Math), and there are six benchmark files inside its `benchmarks` folder.
We also provide a second version in `sept/data_user_reasoning`. It has the same dataset structure as `sept/data`, but each question is appended with `Please reason step by step, and put the final answer within \boxed{}.`
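To sanity-check a dataset before training, you can inspect it directly. The snippet below assumes verl-style parquet files; the file name is hypothetical, so list the folder to find the real ones:

```python
# Assumes verl-style parquet data files; the file name is hypothetical --
# run `ls sept/data` to see the actual files.
import pandas as pd

df = pd.read_parquet("sept/data/your_train_file.parquet")
print(df.columns.tolist())  # typically a prompt field plus metadata
print(df.iloc[0])           # inspect the first question record
```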
- `recipe/sept/sept_trainer`: SePT training logic.
- `recipe/sept/dp_actor`: cross-entropy computation.
- `verl/trainer/ppo/metric_utils.py`: added pass@k calculation logic during validation (see the sketch after this list).
- `verl/trainer/ppo/ray_trainer.py`: modified the validation logic to select 16 different question-response pairs per validation dataset in the `_validate` function.
- `sept/verl/utils/reward_score/__init__.py`: reward scoring, where the verifier described below is wired in.
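For reference, pass@k is commonly computed with the unbiased estimator from the Codex paper (Chen et al., 2021). The sketch below shows that formula; it is not necessarily the exact code in `metric_utils.py`:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples drawn per question, c: correct samples among them, k <= n.
    """
    if n - c < k:
        # Too few incorrect samples: every size-k subset contains a correct one.
        return 1.0
    # 1 - C(n-c, k)/C(n, k): probability that a random size-k subset
    # contains at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```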
This repository is built on VERL at commit hash `38d9a88170786a45cb189a08290c4651e6d6f671`.
For the verifier, we used the verifier from *The Entropy Mechanism of Reinforcement Learning for Large Language Model Reasoning*, which uses HuggingFace Math-Verify. The source code can be found in the VERL entropy recipe.
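Math-Verify's core API is small; here is a minimal usage sketch, showing the library directly rather than how the entropy-recipe verifier wraps it:

```python
# Minimal Math-Verify usage; shows the library's parse/verify API directly,
# not necessarily how the entropy-recipe verifier wraps it.
from math_verify import parse, verify

gold = parse("$\\frac{1}{2}$")
pred = parse("The final answer is \\boxed{0.5}")
print(verify(gold, pred))  # True: 0.5 is equivalent to 1/2
```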
If you find our work useful, please consider citing our paper:
```bibtex
@article{li2026modelhelpitselfrewardfree,
    title   = {A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning},
    author  = {Mengqi Li and Lei Zhao and Anthony Man-Cho So and Ruoyu Sun and Xiao Li},
    journal = {arXiv preprint arXiv:2510.18814},
    year    = {2026}
}
```