SePT

Code for "A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning".

SePT (Self-Evolving Post-Training) is a reward-free self-training method that improves LLM reasoning by fine-tuning the model on its own generated responses.

Note

This codebase is not fully optimized (e.g. it contains unnecessary logging and computation, and it has not been tested on the newest versions of verl and transformers), and the LoRA implementation has not been fully checked for correctness yet. Feel free to raise an issue for any problems you encounter.

Setup

git clone https://github.com/ElementQi/SePT.git
cd SePT/sept

conda create -n sept python=3.10
conda activate sept

# to keep pkg_resources alive
pip install "setuptools<81"

# some machines need this
# conda install -c conda-forge pyzmq
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1 --no-build-isolation  # choose a suitable version for your own machine
pip install -e . --no-dependencies
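
As a quick smoke test (a minimal sketch; adjust the module list if your requirements.txt differs), you can confirm the core dependencies import cleanly:

# Verify that the main packages are importable before training.
python -c "import verl, flash_attn, transformers; print('environment OK')"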

Note

The installation process may differ across machines and systems, but the script above has been tested on our A800 and 3090 clusters.

If you run into issues while installing the packages, you can refer to the installation tutorial from verl and check against our runnable conda environment.

Reproduce Experiments

Before running the scripts below, please ensure you are in the project root directory (SePT/sept).

Warning

Configuration is required before running! You MUST modify the scripts inside the examples/ directory to fit your environment (a hedged override sketch follows this list). Key parameters to change include:

  • Model Path: Update the path to your base model.
  • Dataset Selection: Choose between sept/data and sept/data_user_reasoning based on your model's behavior. sept/data_user_reasoning appends "Please reason step by step, and put the final answer within \boxed{}." to each question, which can help some models but hurt others.
  • Chat Template and Prompting: There is no general system-prompt version that works for all models. Some models do not even ship with a chat template in their model folder. You may need to modify the chat template yourself, and also adjust any system prompt or reasoning prompt settings to match your model.
  • Logger: The default logger is swanlab. If you prefer wandb or just console, modify trainer.logger=['console','swanlab'].
  • Validation Workload: For faster iteration, you can remove some benchmark files from VAL_FILE_LIST in scripts such as examples/sept_1e7_dsr.sh.
  • Validation Sampling: actor_rollout_ref.rollout.val_kwargs.n=32 is fairly expensive for pass@k evaluation. For smoke tests, you can lower it to 4 or less. Likewise, in the sweep scripts, you can reduce N=32.
  • Shared Memory: The example scripts default to model.use_shm=False. Only enable shared memory if you have verified it works well on your machine.
  • Other Hyperparameters: e.g. the rollout number, batch size, and similar settings.
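
As a concrete illustration, here is a hedged sketch of the kind of overrides you would edit in examples/sept_1e7_dsr.sh. The variable names and the recipe.sept.main_sept entry module are assumptions for illustration, not the script's actual contents; check the real script for the exact structure.

# Hypothetical excerpt of an edited examples/sept_1e7_dsr.sh;
# every name below is illustrative -- compare against the real script.
MODEL_PATH=/path/to/your/base_model
VAL_FILE_LIST="['sept/data/benchmarks/<benchmark>.parquet']"  # trimmed for a smoke test

python3 -m recipe.sept.main_sept \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.model.use_shm=False \
    actor_rollout_ref.rollout.val_kwargs.n=4 \
    trainer.logger="['console']" \
    data.val_files="$VAL_FILE_LIST"
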
conda activate sept

# for SePT training
bash examples/sept_1e7_dsr.sh

# for GRPO training
bash examples/grpo_5e7_dsr.sh

# for EM-FT training
bash examples/em_1e7_dsr.sh

# for SePT (Offline) training
bash examples/generate_solutions.sh
bash examples/offline_train.sh

Validation on specific checkpoints

For evaluation, we rewrote the evaluation code in the _validate function inside the trainer. If you want to evaluate a specific checkpoint or a base model, use examples/trained_model_sweep.sh and examples/base_model_sweep.sh.

If you want to convert the training checkpoints via verl, follow the instructions from the official verl tutorial on model converting. The model merge script is located here.
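
For example (both sweep scripts need the same environment edits described in the warning above):

# Sweep the benchmarks with a trained checkpoint or with the base model.
bash examples/trained_model_sweep.sh
bash examples/base_model_sweep.sh

# Merging FSDP shards into a HuggingFace checkpoint; the module path and
# flags below follow verl's current tutorial and are assumptions -- check
# the merge script's --help at the pinned commit for the exact interface.
python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir checkpoints/<run>/global_step_<N>/actor \
    --target_dir checkpoints/<run>/global_step_<N>/hf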

Dataset

The original datasets are located in the sept/data folder. There are two training sets, DSR (DeepScaleR) and OTM (Openthoughts-Math), and six benchmark files inside its benchmarks folder.

We also provide a second version in sept/data_user_reasoning. It has the same dataset structure as sept/data, but each question is appended with "Please reason step by step, and put the final answer within \boxed{}."
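
To sanity-check which variant a script points at, you can peek at one row. This is a sketch: the parquet file name and the printed fields are assumptions, so list the data folders to find the actual files and schema.

# Inspect one example from the training set (hypothetical file name).
python - <<'EOF'
import pandas as pd

df = pd.read_parquet("sept/data/dsr.parquet")  # hypothetical file name
print(df.columns.tolist())                     # discover the real schema
print(df.iloc[0])                              # first question record
EOF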

What we mainly modified

Training logic

  • recipe/sept/sept_trainer: SePT training logic

  • recipe/sept/dp_actor: Cross-entropy computation

Validation logic

  • verl/trainer/ppo/metric_utils.py: Added pass@k calculation logic during validation (see the estimator note after this list).

  • verl/trainer/ppo/ray_trainer.py: Modified the validation logic in the _validate function to select 16 different question-response pairs for each validation dataset.
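
For reference, we assume the added pass@k logic follows the standard unbiased estimator (check metric_utils.py for the exact implementation): with n sampled responses per question, of which c are correct,

\text{pass@}k = \mathbb{E}_{\text{questions}}\left[\, 1 - \binom{n-c}{k} \Big/ \binom{n}{k} \,\right]

which is why validation defaults to a relatively large actor_rollout_ref.rollout.val_kwargs.n=32.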

Verifier

sept/verl/utils/reward_score/__init__.py

Source Acknowledgement

This repository is built on top of verl at commit 38d9a88170786a45cb189a08290c4651e6d6f671.

For the verifier, we used the one from "The Entropy Mechanism of Reinforcement Learning for Large Language Model Reasoning", which uses HuggingFace Math-Verify. The source code can be found in the verl entropy recipe.
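
If you want to sanity-check the verifier in isolation, here is a minimal sketch assuming the math-verify package is importable in the environment; the example strings are illustrative.

# Check that Math-Verify matches two mathematically equivalent answers.
python - <<'EOF'
from math_verify import parse, verify

gold = parse("\\boxed{\\frac{1}{2}}")    # ground-truth answer
pred = parse("The final answer is 0.5")  # model output to verify
print(verify(gold, pred))                # expected: True (1/2 == 0.5)
EOF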

Citation

If you find our work useful, please consider citing our paper:

@article{li2026modelhelpitselfrewardfree,
  title   = {A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning},
  author  = {Mengqi Li and Lei Zhao and Anthony Man-Cho So and Ruoyu Sun and Xiao Li},
  journal = {arXiv preprint arXiv:2510.18814},
  year    = {2026}
}
