This folder hosts the scripts to run and post-process the SAT method across models and datasets.
- `run_think_control_V5_qwen3.py`: run SAT on Qwen3 models for the math datasets (`math500`, `gsm8k`, `aime2024`, `aime2025`, `amc`).
- `run_think_control_V5_llama.py`: SAT runner for the Llama family on the math datasets above.
- `run_think_control_V5_ds_qwen.py`: SAT runner for the ds_qwen family on the math datasets above.
- `run_think_control_V5_qwen3_gpqa.py`: SAT runner for Qwen3 models on the `gpqa` dataset.
- `run_think_control_V5_qwen3_human_eval.py`: SAT runner for Qwen3 models on the `human_eval` dataset.
- `run_think_control_V5_qwq.py`: SAT runner for QwQ models on the math datasets.
- `evaluate_r_detector_outputs.py`: evaluate the generated `r_*.jsonl` results against the gold answers (LLM-based grading; configure API/paths inside the script).
- `extract_item_done_data.py`: parse `r_*.jsonl` files to extract `item_done` events and token statistics (configure input/output paths inside the script).
- `tool.py`: lightweight utilities shared by the runners (feature definitions and z-score stats).
The code is built on Python 3.10.16.
- Install dependencies: `pip install -r requirements.txt`
- Run from the `Code` directory so relative paths resolve correctly.
Example: Qwen3-8B on math500, writing `r_math_Qwen3_8B_seed3407_sat.jsonl`.
```shell
python run_think_control_V5_qwen3.py \
  --model ../qwen3-8b \
  --input ../math500_test.jsonl \
  --out r_math_Qwen3_8B_seed3407_sat.jsonl
```

- `--model`: model checkpoint directory.
- `--input`: dataset JSONL (use the `<dataset>_test.jsonl` file, e.g., `math500_test.jsonl`, `gsm8k_test.jsonl`, `aime2024_test.jsonl`, `aime2025_test.jsonl`, `amc_test.jsonl`).
- `--out`: output result file (convention: `r_<dataset>_<Model>_seedXXXX_sat.jsonl`).
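The `--input` files are plain JSONL, one problem per line. A minimal loader sketch (the per-record field names are whatever your dataset files actually contain; nothing here is specific to the runners):

```python
import json

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, one record per non-empty line."""
    items = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                items.append(json.loads(line))
    return items

# e.g. items = load_jsonl("../math500_test.jsonl")
```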
Switch scripts for other model families/datasets:
- Qwen3 math: `run_think_control_V5_qwen3.py`.
- QwQ math: `run_think_control_V5_qwq.py`.
- Llama math: `run_think_control_V5_llama.py`.
- Qwen3 gpqa: `run_think_control_V5_qwen3_gpqa.py` with `--input ../gpqa_test.jsonl`.
- Qwen3 human-eval: `run_think_control_V5_qwen3_human_eval.py` with the human-eval JSONL.
`evaluate_r_detector_outputs.py` is currently configured via constants at the top (`API_KEY`, `BASE_URL`, `LLM_MODEL`, `INPUT_FILE`, `TEST_JSONL_FILE`, and output/checkpoint paths). To evaluate:
- Replace `API_KEY` with your own key, set `INPUT_FILE` to the generated result file (e.g., `r_math_Qwen3_8B_seed3407_sat.jsonl`), and set `TEST_JSONL_FILE` to the matching gold file (e.g., `math500_test.jsonl`).
- Run `python evaluate_r_detector_outputs.py`.
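Before spending API calls on LLM grading, a crude exact-match pass can confirm the result and gold files line up. This is a hypothetical sanity check, not part of the script: the `pred`/`answer` field names and the line-by-line alignment are assumptions about the file schema.

```python
import json

def exact_match_rate(results_path, gold_path, pred_key="pred", gold_key="answer"):
    """Crude exact-match accuracy between a result file and its gold file.
    Assumes the two JSONL files are aligned line by line (hypothetical schema)."""
    with open(results_path, encoding="utf-8") as f:
        preds = [json.loads(l) for l in f if l.strip()]
    with open(gold_path, encoding="utf-8") as f:
        golds = [json.loads(l) for l in f if l.strip()]
    hits = sum(
        str(p.get(pred_key, "")).strip() == str(g.get(gold_key, "")).strip()
        for p, g in zip(preds, golds)
    )
    return hits / max(len(golds), 1)
```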
> **Important Note on Accuracy:** The automated evaluation script relies on LLM-based judging and may not be 100% accurate. For the most precise accuracy, we recommend manually verifying the model's raw outputs for cases marked as `is_correct: false`. (Please note that all experimental results reported in our paper have undergone manual verification to ensure correctness.)
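To collect the records flagged for manual checking, a small helper like the following works; only the `is_correct` field comes from the note above, and everything else about the record layout is an assumption:

```python
import json

def flagged_for_review(eval_path):
    """Return all evaluated records marked is_correct: false."""
    flagged = []
    with open(eval_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec.get("is_correct") is False:  # JSON false -> Python False
                flagged.append(rec)
    return flagged
```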
`extract_item_done_data.py` uses hard-coded paths near the top (`input_file`, `output_file`, `stats_file`, `json_output_file`). Point `input_file` to the result you want to inspect, then run:

```shell
python extract_item_done_data.py
```

- All runners assume the dataset files follow the `<dataset>_test.jsonl` naming and live at the paths you pass via `--input`.
- Result `r_*.jsonl` files can be further analyzed with the evaluation and token-statistics scripts after editing their path/API settings.
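The `item_done` extraction can be sketched roughly as below. This is not the script's actual implementation: the record layout (an `event` field and a per-record `tokens` count) is an assumption about the `r_*.jsonl` format.

```python
import json
from statistics import mean

def item_done_token_stats(result_path):
    """Collect records whose event field is 'item_done' and summarize token counts
    (assumed record layout, for illustration only)."""
    done = []
    with open(result_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec.get("event") == "item_done":
                done.append(rec)
    tokens = [r["tokens"] for r in done if "tokens" in r]
    stats = {
        "count": len(done),
        "mean_tokens": mean(tokens) if tokens else 0.0,
        "max_tokens": max(tokens, default=0),
    }
    return done, stats
```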