- 2026.05: ReCrit is available on arXiv: 2605.18799.
- 2026.04: Initial anonymous research-code release for ReCrit.
- 2026.04: Added asynchronous rollout, four-quadrant transition reward, and GRPO-style training utilities.
ReCrit studies a practical failure mode of critic reasoning: a model may revise a correct initial answer after receiving misleading feedback, while still needing to recover from genuinely wrong answers. Instead of optimizing only the final answer, ReCrit models the transition from the Initial solution to the Critic solution.
The central idea is a four-quadrant transition reward:
- Correction: the initial solution is wrong and the critic solution becomes correct.
- Robustness: the initial solution is correct and remains correct after critic feedback.
- Sycophancy: the initial solution is correct but becomes wrong after critic feedback.
- Boundary: both the initial and critic solutions are wrong.
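The four cases above amount to a lookup on two correctness labels. The following is a minimal sketch of that mapping, not the code in reward.py: the names `Quadrant`, `classify_transition`, `transition_reward`, and the numeric values in `EXAMPLE_WEIGHTS` are hypothetical placeholders chosen for illustration, since the weights ReCrit actually uses are not stated here.

```python
from enum import Enum


class Quadrant(Enum):
    CORRECTION = "correction"  # initial wrong   -> critic correct
    ROBUSTNESS = "robustness"  # initial correct -> critic correct
    SYCOPHANCY = "sycophancy"  # initial correct -> critic wrong
    BOUNDARY = "boundary"      # initial wrong   -> critic wrong


def classify_transition(initial_correct: bool, critic_correct: bool) -> Quadrant:
    """Map the pair of judge labels to one of the four transition quadrants."""
    if initial_correct:
        return Quadrant.ROBUSTNESS if critic_correct else Quadrant.SYCOPHANCY
    return Quadrant.CORRECTION if critic_correct else Quadrant.BOUNDARY


# Placeholder weights for illustration only (assumption, not ReCrit's values).
EXAMPLE_WEIGHTS = {
    Quadrant.CORRECTION: 1.0,
    Quadrant.ROBUSTNESS: 1.0,
    Quadrant.SYCOPHANCY: -1.0,
    Quadrant.BOUNDARY: 0.0,
}


def transition_reward(initial_correct: bool, critic_correct: bool,
                      weights: dict = EXAMPLE_WEIGHTS) -> float:
    """Return the weight assigned to the quadrant of this transition."""
    return weights[classify_transition(initial_correct, critic_correct)]
```

Keeping the classification separate from the weights makes it possible to change how strongly each quadrant is rewarded or penalized without touching the labeling logic.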
This repository contains the core training code for ReCrit, including asynchronous vLLM rollout, transition-aware rewards, and a GRPO/PPO-clip style trainer.
- Transition-aware objective: ReCrit rewards or penalizes correctness transitions instead of collapsing all trajectories into final-answer accuracy.
- Four-quadrant reward design: correction, robustness, sycophancy, and boundary cases are separated and weighted explicitly.
- Asynchronous critic rollout: vLLM continuous batching keeps fast samples moving without waiting for the slowest sample in each turn.
- GRPO-style optimization: the trainer supports grouped advantages, PPO clipping, reference-model KL, auxiliary format rewards, and multi-GPU execution.
- Scientific reasoning focus: the code is designed for scientific multiple-choice and short-answer settings with judge-based correctness signals.
For each question, ReCrit first samples an Initial solution. It then injects a critic-style feedback prompt and samples a Critic solution. A judge maps both solutions to correctness labels, and the pair is assigned to one of the four transition quadrants. The training reward is computed from this transition rather than only from the final answer.
The training loop follows a sample-judge-update pipeline: sample trajectories from the current policy, assign quadrant rewards with a correctness judge, normalize rewards within GRPO groups, and update the policy with PPO-style clipping and KL regularization.
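The normalize-and-clip part of this pipeline can be sketched in a few lines of plain Python. This is an illustration of the general GRPO-style recipe, not the code in trainer.py: it assumes mean and standard-deviation normalization within each group and a standard PPO clipped surrogate, it omits the KL term, and the names `group_advantages`, `ppo_clip_term`, `eps`, and `clip_eps` are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def ppo_clip_term(logp_new: float, logp_old: float, advantage: float,
                  clip_eps: float = 0.2) -> float:
    """Per-sample clipped surrogate objective (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to push the ratio past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With this normalization, a group whose trajectories all receive the same reward yields zero advantages, so it contributes no policy gradient through the clipped term.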
Synchronous multi-turn rollout wastes GPU time because every sample must wait for the slowest generation before moving to the next stage. ReCrit instead uses asynchronous rollout so each sample can advance as soon as its current generation is complete. The dynamic variant further reduces tail waiting by stopping once the required fraction of samples has completed the critic interaction.
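The scheduling idea can be illustrated with asyncio alone. In the sketch below, `generate` is a stand-in that sleeps instead of calling a real engine, and `two_turn`, `dynamic_rollout`, and `required_fraction` are hypothetical names; it shows the per-sample advance and the early stop, not the vLLM integration in rollout.py.

```python
import asyncio
import math


async def generate(sample_id: int, stage: str, delay: float) -> str:
    """Stand-in for one generation request (a real system would call the engine)."""
    await asyncio.sleep(delay)
    return f"{stage}-{sample_id}"


async def two_turn(sample_id: int, delays: tuple) -> tuple:
    """Run the initial turn, then the critic turn, for a single sample."""
    initial = await generate(sample_id, "initial", delays[0])
    # This sample moves on immediately; it does not wait for other samples.
    critic = await generate(sample_id, "critic", delays[1])
    return sample_id, initial, critic


async def dynamic_rollout(all_delays: list, required_fraction: float) -> list:
    """Collect finished samples until the required fraction has completed."""
    tasks = [asyncio.create_task(two_turn(i, d)) for i, d in enumerate(all_delays)]
    # Small tolerance guards against float error such as 0.7 * 10 > 7.
    needed = max(1, math.ceil(required_fraction * len(tasks) - 1e-9))
    finished = []
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
        if len(finished) >= needed:
            break
    # Cancel stragglers so they stop consuming resources.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return finished
```

Calling `asyncio.run(dynamic_rollout(delays, 0.5))` with a list of per-sample `(initial_delay, critic_delay)` pairs returns as soon as half of the samples have finished both turns, which is the tail-trimming behavior the dynamic variant describes.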
The examples below illustrate the target behavior: the model first gives an incorrect scientific answer, receives a verification-style critic prompt, then revises its reasoning and returns the correct final answer. These cases are shown only as qualitative examples; the training objective is still defined by the transition reward above.
git clone <repository-url>
cd ReCrit
conda create -n recrit python=3.10 -y
conda activate recrit
pip install -r requirements.txt
The training code expects:
- PyTorch with CUDA support.
- vLLM compatible with the target model.
- Transformers and tokenizer support for the chosen chat model.
- A judge endpoint exposed through LLM_API_KEY and LLM_BASE_URL.
Prepare a JSONL training file and run:
export MODEL_PATH=/path/to/model
export TRAIN_DATASET=/path/to/examples.jsonl
export OUTPUT_DIR=output/recrit
export LLM_API_KEY=YOUR_LLM_API_KEY
export LLM_BASE_URL=https://your-judge-endpoint/v1
export CUDA_VISIBLE_DEVICES=0,1
bash run.sh
For a single GPU, set:
export CUDA_VISIBLE_DEVICES=0
bash run.sh
ReCrit supports a simple JSONL format:
{"question": "Which option is correct?", "answer": "B", "judge_mode": "close"}
It also supports a chat-style format:
{
  "messages": [
    {"role": "user", "content": "Question text and answer choices"}
  ],
  "solution": "B",
  "judge_mode": "close"
}
The judge_mode field can be:
- close: short-answer or multiple-choice judging.
- open: open-ended judging.
- both: read the judging mode from each example.
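Both record formats described above carry the same three pieces of information, so a loader can map them to one internal shape. The sketch below is an assumption about how that could be done, not the code in dataset.py: `normalize_record`, `parse_jsonl`, the choice of the last user message as the question, and the fallback `default_mode` are all hypothetical.

```python
import json

VALID_MODES = {"close", "open", "both"}


def normalize_record(record: dict, default_mode: str = "close") -> dict:
    """Convert a simple or chat-style record to {question, answer, judge_mode}."""
    if "messages" in record:
        # Chat-style: take the last user message as the question (assumption).
        user_turns = [m["content"] for m in record["messages"]
                      if m.get("role") == "user"]
        if not user_turns:
            raise ValueError("chat-style record has no user message")
        question, answer = user_turns[-1], record["solution"]
    else:
        question, answer = record["question"], record["answer"]
    mode = record.get("judge_mode", default_mode)
    if mode not in VALID_MODES:
        raise ValueError(f"unknown judge_mode: {mode!r}")
    return {"question": question, "answer": answer, "judge_mode": mode}


def parse_jsonl(lines) -> list:
    """Parse an iterable of JSONL lines, skipping blank lines."""
    return [normalize_record(json.loads(line)) for line in lines if line.strip()]
```

Passing an open file handle, for example `parse_jsonl(open(path, encoding="utf-8"))`, works because file objects iterate line by line.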
ReCrit/
├── assets/             # Figures used by the README
├── tests/              # Lightweight unit and rollout tests
├── config.py           # Training configuration and CLI arguments
├── dataset.py          # JSONL dataset loader
├── reward.py           # Four-quadrant reward and auxiliary rewards
├── rollout.py          # vLLM asynchronous critic rollout
├── train.py            # Main training loop
├── trainer.py          # GRPO/PPO-clip training step
├── utils.py            # DDP, model loading, checkpointing, logging
├── requirements.txt    # Python dependencies
└── run.sh              # Example launch script
Syntax-only checks can be run with:
python -m py_compile config.py dataset.py reward.py rollout.py train.py trainer.py utils.py
Some rollout tests require a local GPU and a compatible chat-model tokenizer:
python -m tests.test_bridge_alignment --model_path /path/to/model
python -m tests.test_async_rollout --model_path /path/to/model
- This is research code and may require adapting model paths, judge endpoints, and GPU memory settings for a new environment.
- The default script launches one process per visible GPU. Each process sees a single CUDA device so that the training model and vLLM engine share the same local device cleanly.
- Large checkpoints and generated rollout files are intentionally ignored by git.
If you find ReCrit useful, please cite the corresponding paper:
@misc{xu2026recrittransitionawarereinforcementlearning,
title={ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning},
author={Wanghan Xu and Yuhao Zhou and Hengyuan Zhao and Shuo Li and Dianzhi Yu and Zhenfei Yin and Yaowen Hu and Fengli Xu and Wanli Ouyang and Wenlong Zhang and Lei Bai},
year={2026},
eprint={2605.18799},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.18799},
}
This project is released under the MIT License. See LICENSE for details.



