black-yt/ReCrit

ReCrit

Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

[Figure: ReCrit teaser]

🔥 News

  • 2026.05: ReCrit is available on arXiv: 2605.18799.
  • 2026.04: Initial anonymous research-code release for ReCrit.
  • 2026.04: Added asynchronous rollout, four-quadrant transition reward, and GRPO-style training utilities.

✨ Overview

ReCrit studies a practical failure mode of critic reasoning: a model may abandon a correct initial answer after receiving misleading feedback, yet it must still be able to recover from genuinely wrong answers. Instead of optimizing only the final answer, ReCrit models the transition from the Initial solution to the Critic solution.

The central idea is a four-quadrant transition reward:

  • Correction: the initial solution is wrong and the critic solution becomes correct.
  • Robustness: the initial solution is correct and remains correct after critic feedback.
  • Sycophancy: the initial solution is correct but becomes wrong after critic feedback.
  • Boundary: both the initial and critic solutions are wrong.
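The four quadrants above can be expressed as a small lookup on the two correctness labels. The sketch below is illustrative only: the names `Quadrant`, `classify_transition`, and `transition_reward` are hypothetical, and the weights are placeholders rather than the values used in reward.py.

```python
from enum import Enum


class Quadrant(Enum):
    CORRECTION = "correction"   # wrong -> correct
    ROBUSTNESS = "robustness"   # correct -> correct
    SYCOPHANCY = "sycophancy"   # correct -> wrong
    BOUNDARY = "boundary"       # wrong -> wrong


def classify_transition(initial_correct: bool, critic_correct: bool) -> Quadrant:
    """Map the (Initial, Critic) correctness labels to one quadrant."""
    if initial_correct:
        return Quadrant.ROBUSTNESS if critic_correct else Quadrant.SYCOPHANCY
    return Quadrant.CORRECTION if critic_correct else Quadrant.BOUNDARY


# Placeholder weights for illustration; NOT the weights used by ReCrit.
ILLUSTRATIVE_WEIGHTS = {
    Quadrant.CORRECTION: 1.0,
    Quadrant.ROBUSTNESS: 1.0,
    Quadrant.SYCOPHANCY: -1.0,
    Quadrant.BOUNDARY: 0.0,
}


def transition_reward(initial_correct: bool, critic_correct: bool,
                      weights=ILLUSTRATIVE_WEIGHTS) -> float:
    """Look up the reward assigned to the observed transition."""
    return weights[classify_transition(initial_correct, critic_correct)]
```

Keeping the classification separate from the weights makes it easy to reweight one quadrant (for example, penalizing sycophancy more heavily) without touching the labeling logic.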

This repository contains the core training code for ReCrit, including asynchronous vLLM rollout, transition-aware rewards, and a GRPO/PPO-clip style trainer.

🌟 Highlights

  • Transition-aware objective: ReCrit rewards or penalizes correctness transitions instead of collapsing all trajectories into final-answer accuracy.
  • Four-quadrant reward design: Correction, robustness, sycophancy, and boundary cases are separated and weighted explicitly.
  • Asynchronous critic rollout: vLLM continuous batching keeps fast samples moving without waiting for the slowest sample in each turn.
  • GRPO-style optimization: the trainer supports grouped advantages, PPO clipping, reference-model KL, auxiliary format rewards, and multi-GPU execution.
  • Scientific reasoning focus: the code is designed for scientific multiple-choice and short-answer settings with judge-based correctness signals.

🧠 Method

For each question, ReCrit first samples an Initial solution. It then injects a critic-style feedback prompt and samples a Critic solution. A judge maps both solutions to correctness labels, and the pair is assigned to one of the four transition quadrants. The training reward is computed from this transition rather than only from the final answer.

The training loop follows a sample-judge-update pipeline: sample trajectories from the current policy, assign quadrant rewards with a correctness judge, normalize rewards within GRPO groups, and update the policy with PPO-style clipping and KL regularization.

[Figure: ReCrit training pipeline]
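The normalize-and-clip part of the sample-judge-update pipeline can be sketched in plain Python. This is a minimal illustration, assuming the common GRPO convention of standardizing rewards within a group and the standard PPO clipped surrogate; the function names, the use of the population standard deviation, and the default `clip_eps=0.2` are assumptions, and the KL term against the reference model is omitted.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize the rewards of one group (rollouts of the same question)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation may use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for a single sample (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to push the ratio past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no policy gradient, which is why reward designs that separate the four quadrants matter for the learning signal.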

⚡ Dynamic Asynchronous Rollout

Synchronous multi-turn rollout wastes GPU time because every sample must wait for the slowest generation before moving to the next stage. ReCrit instead uses asynchronous rollout so each sample can advance as soon as its current generation is complete. The dynamic variant further reduces tail waiting by stopping once the required fraction of samples has completed the critic interaction.

[Figure: Dynamic asynchronous rollout]
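The scheduling idea can be illustrated with `asyncio` alone. The sketch below does not use vLLM: `fake_generate` is a hypothetical stand-in that sleeps to mimic uneven generation times, and cancelling the unfinished samples once the required fraction is reached is an assumption about how the stragglers are handled.

```python
import asyncio
import math


async def fake_generate(delay: float, text: str) -> str:
    """Hypothetical stand-in for an engine call with uneven latency."""
    await asyncio.sleep(delay)
    return text


async def rollout_one(sample_id: int, delay: float) -> dict:
    # The critic turn starts as soon as this sample's initial turn is done,
    # without waiting for any other sample.
    initial = await fake_generate(delay, f"initial-{sample_id}")
    critic = await fake_generate(delay, f"critic-{sample_id}")
    return {"id": sample_id, "initial": initial, "critic": critic}


async def dynamic_rollout(delays, required_fraction=0.75):
    tasks = [asyncio.create_task(rollout_one(i, d)) for i, d in enumerate(delays)]
    needed = max(1, math.ceil(required_fraction * len(tasks)))
    finished = []
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
        if len(finished) >= needed:
            break  # dynamic variant: stop once enough samples are complete
    # Assumption: remaining slow samples are cancelled rather than awaited.
    for task in tasks:
        if not task.done():
            task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return finished
```

Calling `asyncio.run(dynamic_rollout([0.01, 0.02, 0.03, 0.5]))` returns the three fast samples and drops the slow one, so the total wait is set by the third-fastest sample rather than the slowest.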

🔬 Case Study

The examples below illustrate the target behavior: the model first gives an incorrect scientific answer, receives a verification-style critic prompt, then revises its reasoning and returns the correct final answer. These cases are shown only as qualitative examples; the training objective is still defined by the transition reward above.

[Figure: ReCrit case study]

πŸ› οΈ Installation

git clone <repository-url>
cd ReCrit

conda create -n recrit python=3.10 -y
conda activate recrit
pip install -r requirements.txt

The training code expects:

  • PyTorch with CUDA support.
  • vLLM compatible with the target model.
  • Transformers and tokenizer support for the chosen chat model.
  • A judge endpoint exposed through LLM_API_KEY and LLM_BASE_URL.

🚀 Quick Start

Prepare a JSONL training file and run:

export MODEL_PATH=/path/to/model
export TRAIN_DATASET=/path/to/examples.jsonl
export OUTPUT_DIR=output/recrit
export LLM_API_KEY=YOUR_LLM_API_KEY
export LLM_BASE_URL=https://your-judge-endpoint/v1
export CUDA_VISIBLE_DEVICES=0,1

bash run.sh

For a single GPU, set:

export CUDA_VISIBLE_DEVICES=0
bash run.sh
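Before launching, it can help to confirm that the variables exported above are actually set. The helper below is a hypothetical preflight check, not part of the repository; it assumes that run.sh needs every variable listed in `REQUIRED_VARS`, which may not hold if the script provides its own defaults.

```python
import os

# Assumption: these are the variables run.sh reads, taken from the exports above.
REQUIRED_VARS = (
    "MODEL_PATH",
    "TRAIN_DATASET",
    "OUTPUT_DIR",
    "LLM_API_KEY",
    "LLM_BASE_URL",
)


def missing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("Environment looks complete.")
```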

📦 Data Format

ReCrit supports a simple JSONL format:

{"question": "Which option is correct?", "answer": "B", "judge_mode": "close"}

It also supports a chat-style format:

{
  "messages": [
    {"role": "user", "content": "Question text and answer choices"}
  ],
  "solution": "B",
  "judge_mode": "close"
}

The judge_mode field can be:

  • close: short-answer or multiple-choice judging.
  • open: open-ended judging.
  • both: read the judging mode from each example.
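Both record shapes carry the same three pieces of information, so a loader can map them to one internal form. The function below is a sketch, not the code in dataset.py: the name `normalize_record`, the choice of the last user message as the question, and the `close` fallback when `judge_mode` is absent are all assumptions.

```python
import json

ALLOWED_MODES = {"close", "open", "both"}


def normalize_record(line: str, default_mode: str = "close") -> dict:
    """Parse one JSONL line into {'question', 'answer', 'judge_mode'}."""
    record = json.loads(line)
    if "messages" in record:
        # Chat-style format: the question is taken from the last user turn.
        user_turns = [m["content"] for m in record["messages"]
                      if m.get("role") == "user"]
        if not user_turns:
            raise ValueError("chat-style record has no user message")
        question, answer = user_turns[-1], record["solution"]
    else:
        # Simple format: explicit question/answer keys.
        question, answer = record["question"], record["answer"]
    mode = record.get("judge_mode", default_mode)
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown judge_mode: {mode!r}")
    return {"question": question, "answer": answer, "judge_mode": mode}
```

Running every line of a training file through such a function before a long job is a cheap way to catch malformed records early.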

πŸ“ Repository Structure

ReCrit/
├── assets/              # Figures used by the README
├── tests/               # Lightweight unit and rollout tests
├── config.py            # Training configuration and CLI arguments
├── dataset.py           # JSONL dataset loader
├── reward.py            # Four-quadrant reward and auxiliary rewards
├── rollout.py           # vLLM asynchronous critic rollout
├── train.py             # Main training loop
├── trainer.py           # GRPO/PPO-clip training step
├── utils.py             # DDP, model loading, checkpointing, logging
├── requirements.txt     # Python dependencies
└── run.sh               # Example launch script

✅ Tests

Syntax-only checks can be run with:

python -m py_compile config.py dataset.py reward.py rollout.py train.py trainer.py utils.py

Some rollout tests require a local GPU and a compatible chat-model tokenizer:

python -m tests.test_bridge_alignment --model_path /path/to/model
python -m tests.test_async_rollout --model_path /path/to/model

📌 Notes

  • This is research code and may require adapting model paths, judge endpoints, and GPU memory settings for a new environment.
  • The default script launches one process per visible GPU. Each process sees a single CUDA device so that the training model and vLLM engine share the same local device cleanly.
  • Large checkpoints and generated rollout files are intentionally ignored by git.
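The one-process-per-GPU layout mentioned in the notes can be pictured as building a separate environment for each worker. The sketch below is an assumption about how such a launcher might be structured, not a description of run.sh; `per_process_envs` is a hypothetical helper, and `RANK` and `WORLD_SIZE` are the conventional torch.distributed variable names.

```python
import os


def per_process_envs(visible_devices: str, base_env=None):
    """Build one environment per GPU so each worker sees exactly one device."""
    base = dict(os.environ if base_env is None else base_env)
    devices = [d.strip() for d in visible_devices.split(",") if d.strip()]
    envs = []
    for rank, device in enumerate(devices):
        env = dict(base)
        env["CUDA_VISIBLE_DEVICES"] = device  # worker-local device appears as cuda:0
        env["RANK"] = str(rank)
        env["WORLD_SIZE"] = str(len(devices))
        envs.append(env)
    return envs
```

Each resulting dictionary could be passed as the `env` argument of `subprocess.Popen` when starting a worker, so the training model and the rollout engine inside that worker both resolve to the same physical GPU.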

📖 Citation

If you find ReCrit useful, please cite the corresponding paper:

@misc{xu2026recrittransitionawarereinforcementlearning,
      title={ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning},
      author={Wanghan Xu and Yuhao Zhou and Hengyuan Zhao and Shuo Li and Dianzhi Yu and Zhenfei Yin and Yaowen Hu and Fengli Xu and Wanli Ouyang and Wenlong Zhang and Lei Bai},
      year={2026},
      eprint={2605.18799},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.18799},
}

📜 License

This project is released under the MIT License. See LICENSE for details.
