This folder hosts the scripts to run and post-process the SAT method across models and datasets.
- `run_think_control_V5_qwen3.py`: run SAT on Qwen3 models for the math datasets (`math500`, `gsm8k`, `aime2024`, `aime2025`, `amc`).
- `run_think_control_V5_llama.py`: SAT runner for the Llama family on the math datasets above.
- `run_think_control_V5_ds_qwen.py`: SAT runner for the ds_qwen family on the math datasets above.
- `run_think_control_V5_qwen3_gpqa.py`: SAT runner for Qwen3 models on the `gpqa` dataset.
- `run_think_control_V5_qwen3_human_eval.py`: SAT runner for Qwen3 models on the `human_eval` dataset.
- `run_think_control_V5_qwq.py`: SAT runner for QwQ models on the math datasets.
- `evaluate_r_detector_outputs.py`: evaluate the generated `r_*.jsonl` results against the gold answers (LLM-based grading; configure API/paths inside the script).
- `extract_item_done_data.py`: parse `r_*.jsonl` files to extract `item_done` events and token statistics (configure input/output paths inside the script).
- `tool.py`: lightweight utilities shared by the runners (feature definitions and z-score stats).
The code is built on Python 3.10.16.
- Install dependencies: `pip install -r requirements.txt`
- Run from the `Code` directory so relative paths resolve correctly.
Example: Qwen3-8B on math500, writing `r_math_Qwen3_8B_seed3407_sat.jsonl`.
```shell
python run_think_control_V5_qwen3.py \
  --model ../qwen3-8b \
  --input ../math500_test.jsonl \
  --out r_math_Qwen3_8B_seed3407_sat.jsonl
```

- `--model`: model checkpoint directory.
- `--input`: dataset JSONL (use the `<dataset>_test.jsonl` file, e.g., `math500_test.jsonl`, `gsm8k_test.jsonl`, `aime2024_test.jsonl`, `aime2025_test.jsonl`, `amc_test.jsonl`).
- `--out`: output result file (convention: `r_<dataset>_<Model>_seedXXXX_sat.jsonl`).
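The `--input` files are plain JSONL, one problem per line. A minimal loader sketch (the per-record field names are whatever your dataset files actually contain; nothing here is specific to the runners):

```python
import json

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, one record per non-empty line."""
    items = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                items.append(json.loads(line))
    return items

# e.g. items = load_jsonl("../math500_test.jsonl")
```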
Switch scripts for other model families/datasets:
- Qwen3 math: `run_think_control_V5_qwen3.py`.
- QwQ math: `run_think_control_V5_qwq.py`.
- Llama math: `run_think_control_V5_llama.py`.
- Qwen3 gpqa: `run_think_control_V5_qwen3_gpqa.py` with `--input ../gpqa_test.jsonl`.
- Qwen3 human-eval: `run_think_control_V5_qwen3_human_eval.py` with the human-eval JSONL.
`evaluate_r_detector_outputs.py` is currently configured via constants at the top (`API_KEY`, `BASE_URL`, `LLM_MODEL`, `INPUT_FILE`, `TEST_JSONL_FILE`, and output/checkpoint paths). To evaluate:
- Replace `API_KEY` with your own key, set `INPUT_FILE` to the generated result file (e.g., `r_math_Qwen3_8B_seed3407_sat.jsonl`), and set `TEST_JSONL_FILE` to the matching gold file (e.g., `math500_test.jsonl`).
- Run `python evaluate_r_detector_outputs.py`.
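Before spending API calls on LLM grading, a crude exact-match pass can confirm the result and gold files line up. This is a hypothetical sanity check, not part of the script: the `pred`/`answer` field names and the line-by-line alignment are assumptions about the file schema.

```python
import json

def exact_match_rate(results_path, gold_path, pred_key="pred", gold_key="answer"):
    """Crude exact-match accuracy between a result file and its gold file.
    Assumes the two JSONL files are aligned line by line (hypothetical schema)."""
    with open(results_path, encoding="utf-8") as f:
        preds = [json.loads(l) for l in f if l.strip()]
    with open(gold_path, encoding="utf-8") as f:
        golds = [json.loads(l) for l in f if l.strip()]
    hits = sum(
        str(p.get(pred_key, "")).strip() == str(g.get(gold_key, "")).strip()
        for p, g in zip(preds, golds)
    )
    return hits / max(len(golds), 1)
```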
> **Important Note on Accuracy:** The automated evaluation script relies on LLM-based judging and may not be 100% accurate. For the most precise accuracy, we recommend manually verifying the model's raw outputs for cases marked as `is_correct: false`. (Please note that all experimental results reported in our paper have undergone manual verification to ensure correctness.)
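To collect the records flagged for manual checking, a small helper like the following works; only the `is_correct` field comes from the note above, and everything else about the record layout is an assumption:

```python
import json

def flagged_for_review(eval_path):
    """Return all evaluated records marked is_correct: false."""
    flagged = []
    with open(eval_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec.get("is_correct") is False:  # JSON false -> Python False
                flagged.append(rec)
    return flagged
```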
`extract_item_done_data.py` uses hard-coded paths near the top (`input_file`, `output_file`, `stats_file`, `json_output_file`). Point `input_file` to the result you want to inspect, then run:

```shell
python extract_item_done_data.py
```

- All runners assume the dataset files follow the `<dataset>_test.jsonl` naming and live at the paths you pass via `--input`.
- Result `r_*.jsonl` files can be further analyzed with the evaluation and token-statistics scripts after editing their path/API settings.
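The `item_done` extraction can be sketched roughly as below. This is not the script's actual implementation: the record layout (an `event` field and a per-record `tokens` count) is an assumption about the `r_*.jsonl` format.

```python
import json
from statistics import mean

def item_done_token_stats(result_path):
    """Collect records whose event field is 'item_done' and summarize token counts
    (assumed record layout, for illustration only)."""
    done = []
    with open(result_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec.get("event") == "item_done":
                done.append(rec)
    tokens = [r["tokens"] for r in done if "tokens" in r]
    stats = {
        "count": len(done),
        "mean_tokens": mean(tokens) if tokens else 0.0,
        "max_tokens": max(tokens, default=0),
    }
    return done, stats
```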