Paper: Are Large Reasoning Models Interruptible?
Authors: Tsung-Han Wu*, Mihran Miroyan*, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Additional resources: 🤗 Dataset, 💻 Project page
Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model's final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information.
Unlike static "frozen world" settings that assume users wait for completion, real-world scenarios often demand mid-inference updates, as LRM reasoning can be time-consuming. We introduce a public evaluation suite to assess how LRMs handle interruptions across math and coding tasks. We define two types of interruptions: time-constrained (hard: produce an immediate answer; soft: speed up reasoning) and update-driven (task specifications change mid-reasoning).
We find that LRMs have three common failure modes: reasoning leakage can produce up to 10x longer answers after hard interrupts, "moving" the reasoning tokens into the answer segment; over 90% of new errors under speed-up instructions arise from panic, where models prematurely terminate their reasoning process; and roughly 80% of update-driven interrupt errors stem from self-doubt, where models fail to validate and incorporate new information. Results are reported at 30% interruption points; detailed results are provided in the paper.
- Python: 3.11
- Installation: `uv sync`
- vLLM is used for inference; install the optional dependency with `uv sync --extra vllm`.
- Environment variables (e.g., API keys) are loaded via `python-dotenv` (see the sketch below).
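A minimal sketch of the environment-variable loading, assuming a `.env` file at the repository root; the variable name below is a placeholder, not one this repo necessarily requires:

```python
# Sketch: load secrets from a local .env file with python-dotenv.
# EXAMPLE_API_KEY is a placeholder variable name, not something this repo defines.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
print("API key present:", os.getenv("EXAMPLE_API_KEY") is not None)
```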
| Directory | Functionality |
|---|---|
| `data/loading/` | Loads source math and code benchmarks. |
| `data/augmentation/` | Interruptible benchmark construction scripts. |
| `src/` | Core experiment script (`run.py`) and helpers for parsing args, managing vLLM inference, and prompt formatting. |
| `scripts/` | Shell wrappers for multi-GPU runs. |
| `eval/math/` | Math grading script (`run_math_tests.py`) and utilities. |
| `eval/code/` | Code grading script (`run_code_tests.py`) and utilities. |
Each loader writes JSONL files with `id`, `problem`, and benchmark-specific metadata.
```bash
python data/loading/load_data_math.py
python data/loading/load_data_code.py
```
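To sanity-check a loader's output, it is enough to read a few records and confirm the fields above are present. A minimal sketch, assuming an illustrative output path (each loader script determines its actual filename):

```python
# Sketch: inspect the first few records of a loader output file.
import json

path = "data/loading/math_benchmark.jsonl"  # illustrative path; use your loader's actual output

with open(path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # Each record should carry at least `id` and `problem`,
        # plus benchmark-specific metadata.
        print(record["id"], record["problem"][:80])
        if i == 4:
            break
```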
Prompts used to augment the original source datasets are provided in `data/augmentation/prompts.py`. The output files from `data/loading` are used as inputs to the augmentation scripts.
- Math: `data/augmentation/math_intervene.py`
- Code: Code interventions run in three passes (`code_intervene_stage{1,2,3}.py`) that (1) break down the original problem, (2) revise the original problem specifications, and (3) revise the starter code.
The outputs of the augmentation scripts have `original_problem`, `revised_problem`, and `update` fields (with additional problem- and task-specific metadata), used for running the main evaluation experiments.
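For orientation, one augmented record is expected to look roughly like the sketch below; the values and any extra metadata keys are invented here, and only the three documented fields are guaranteed by the description above.

```python
# Illustrative shape of a single augmented record (values are invented).
augmented_record = {
    "id": "aime2024-12",  # carried over from the loader output
    "original_problem": "Find the number of integer solutions to ...",
    "revised_problem": "Find the number of integer solutions to ..., where n must be prime.",
    "update": "Update: the constraint on n has changed; n must now be prime.",
    # ... additional problem- and task-specific metadata
}
```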
`parser_utils.py` defines a full CLI for controlling inference. Important arguments:
| Argument | Functionality |
|---|---|
| `task` | `math` or `code`. Determines prompt formatting and dataset slicing. |
| `mode` | `initial` or `subsequent_interrupt_*`. Controls whether we run a first-pass generation or inject interruptions. |
| `interrupt_pos` | Fraction (0–1] of reasoning tokens preserved before injecting an update. |
| `interrupt_role` | `assistant` (append to the reasoning trace within the assistant turn) or `user` (add a new user turn). |
| `custom_prompt_file` | JSON file with the system prompt, prefix/suffix hooks, and update strings. |
Modes:

- `initial`: Full thinking mode without any interruptions.
- `subsequent_interrupt_hard`: Hard interrupt (terminate the thinking block).
- `subsequent_interrupt_extreme`: Hard interrupt (force the model to generate the final answer).
- `subsequent_interrupt_soft`: Provide a speed-up instruction without terminating the thinking block.
- `subsequent_interrupt_update`: Provide an update instruction without terminating the thinking block.
Typical workflow:

- Initial thinking pass:

  ```bash
  python src/run.py \
    --model_name Qwen/Qwen3-8B \
    --input_file <input_file> \
    --output_dir <output_dir> \
    --mode initial \
    --task math
  ```

- Interrupt round:

  ```bash
  python src/run.py \
    --model_name Qwen/Qwen3-8B \
    --input_file <initial_pass_output> \
    --output_dir <output_dir> \
    --mode subsequent_interrupt_hard \
    --interrupt_pos 0.3 \
    --task math
  ```
`inference_utils.run_subsequent_intervene` trims the thinking tokens according to `interrupt_pos`, applies custom updates from the dataset, and "resumes" the generation process.
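Conceptually, an update-driven interruption amounts to truncating the reasoning trace at `interrupt_pos` and splicing in the update before resuming generation. The sketch below illustrates that idea only; it is not the repository's implementation, and `tokenizer` here stands in for whatever tokenizer the model uses.

```python
# Illustrative sketch of an update-driven interruption.
def interrupt_and_resume(thinking_text: str, update_text: str, interrupt_pos: float, tokenizer) -> str:
    tokens = tokenizer.encode(thinking_text)
    keep = int(len(tokens) * interrupt_pos)          # e.g. 0.3 keeps 30% of the trace
    partial_trace = tokenizer.decode(tokens[:keep])  # truncated reasoning
    # Splice in the update (as assistant-side text or a new user turn,
    # depending on --interrupt_role), then let the model continue from here.
    return partial_trace + "\n\n" + update_text
```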
`common_config.sh` centralises defaults:

- Temperature/max token presets per model family (`QWEN_MAX_TOKENS`, `GPTOSS_MAX_TOKENS`, etc.).
- GPU orchestration helpers (`setup_gpu_config`, `run_inference_instances`, `run_interrupt_experiments`) to fan out jobs across `CUDA_VISIBLE_DEVICES`.
- `merge_results` concatenates `output_{rank}.jsonl` shards and cleans temporary directories (a minimal merge sketch follows this list).
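The merge step is essentially a concatenation of per-rank shards. A minimal sketch of that idea (the real logic lives in `common_config.sh`; the paths below are illustrative):

```python
# Sketch: merge output_{rank}.jsonl shards into one file.
import glob

shard_paths = sorted(glob.glob("tmp_out/output_*.jsonl"))  # illustrative shard directory

with open("merged_output.jsonl", "w") as merged:
    for path in shard_paths:
        with open(path) as shard:
            for line in shard:
                merged.write(line)
```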
To launch experiments:

- Duplicate either `run_full_thinking.sh` (single round) or `run_interrupt.sh` (multiple interrupt positions).
- Fill the `INSERT` placeholders for `MODEL`, `INPUT_FILE`, `OUT_DIR`, `CUSTOM_PROMPT_FILE`, etc.

`run_interrupt.sh` automatically iterates over `INTERRUPT_POS_LIST`, calling `run_interrupt_experiments` to execute each intervention and merge the results into `OUT_DIR/interrupted{pos}.jsonl`.
Script for evaluating model outputs on math benchmarks: `eval/math/run_math_tests.py`.
Arguments:

- `--input_file`: JSONL with `source`, `answer`, and `output` fields (produced by `src/run.py`).
- `--output_file`: Destination JSONL storing per-problem `pass@1` (see the aggregation sketch below).
- `--add_math_block`: Optionally wraps predictions with `\boxed{...}` before grading.
The scorer uses:

- `score_response_with_timeout` to evaluate per benchmark (`gsm8k`, `math500`, `aime2024`, `aime2025`).
- `math_parsing_util.py` to normalize LaTeX and compare symbolic answers robustly.
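Once the grader has written the per-problem scores, overall accuracy is a simple average. A minimal sketch, assuming the `--output_file` JSONL stores each problem's score under a `pass@1` key as described above:

```python
# Sketch: aggregate per-problem pass@1 scores from the math grader's output.
import json

scores = []
with open("math_eval_results.jsonl") as f:  # illustrative --output_file path
    for line in f:
        scores.append(json.loads(line)["pass@1"])

print(f"pass@1 over {len(scores)} problems: {sum(scores) / len(scores):.3f}")
```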
Script for evaluating model outputs on LiveCodeBench: `eval/code/run_code_tests.py` (most of the evaluation functionality is borrowed from the LCB repo).
Arguments:

- `--input_predictions`: JSONL from the experiments (must include `id` and the final `output` list).
- `--input_test_cases`: JSONL aligned with the LiveCodeBench schema (`public_test_cases`, `private_test_cases`, `metadata`).
- `--add_code_block`: Forces code fences when generations omit triple-backtick delimiters.
`run_code_tests.py` extracts Python snippets, merges them with the official tests, and calls `compute_code_generation_metrics.codegen_metrics` to obtain pass@k metrics (`pass@1` by default). Execution isolation relies on a sandboxed process with per-test timeouts (`testing_util.py`), consistent with the LCB evaluation suite.
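The snippet-extraction step corresponds roughly to pulling the last fenced code block out of each generation. A minimal sketch of that idea (not the repository's exact logic):

````python
# Sketch: extract the last fenced Python snippet from a model generation.
import re

def extract_code(generation: str) -> str:
    # Match fenced blocks of the form ```python ... ``` (or bare ``` ... ```).
    blocks = re.findall(r"```(?:python)?\n(.*?)```", generation, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else generation.strip()

sample = "Final solution:\n```python\nprint(int(input()) * 2)\n```"
print(extract_code(sample))
````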
```bibtex
@misc{wu2025largereasoningmodelsinterruptible,
      title={Are Large Reasoning Models Interruptible?},
      author={Tsung-Han Wu and Mihran Miroyan and David M. Chan and Trevor Darrell and Narges Norouzi and Joseph E. Gonzalez},
      year={2025},
      eprint={2510.11713},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.11713},
}
```