Paper: Are Large Reasoning Models Interruptible?
Authors: Tsung-Han Wu*, Mihran Miroyan*, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Additional resources: 🤗 Dataset, 💻 Project page
Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model's final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information.
Unlike static "frozen world" settings that assume users wait for completion, real-world scenarios often demand mid-inference updates, as LRM reasoning can be time-consuming. We introduce a public evaluation suite to assess how LRMs handle interruptions across math and coding tasks. We define two types of interruptions: time-constrained (hard: produce an immediate answer; soft: speed up reasoning) and update-driven (task specifications change mid-reasoning).
We find that LRMs have three common failure modes: reasoning leakage can produce up to 10x longer answers after hard interrupts, "moving" the reasoning tokens into the answer segment; over 90% of new errors under speed-up instructions arise from panic, where models prematurely terminate their reasoning process; and roughly 80% of update-driven interrupt errors stem from self-doubt, where models fail to validate and incorporate new information. Results are reported at 30% interruption points; detailed results are provided in the paper.
- Python: 3.11
- Installation: `uv sync`
- vLLM is used for inference; install the optional dependency with `uv sync --extra vllm`.
- Environment variables (e.g., API keys) are loaded via `python-dotenv` (see the sketch below).
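A minimal sketch of the environment-variable loading, assuming a `.env` file at the repository root; the variable name below is a placeholder, not one this repo necessarily requires:

```python
# Sketch: load secrets from a local .env file with python-dotenv.
# EXAMPLE_API_KEY is a placeholder variable name, not something this repo defines.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
print("API key present:", os.getenv("EXAMPLE_API_KEY") is not None)
```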
| Directory | Functionality |
|---|---|
| `data/loading/` | Loads source math and code benchmarks. |
| `data/augmentation/` | Interruptible benchmark construction scripts. |
| `src/` | Core experiment script (`run.py`) and helpers for parsing args, managing vLLM inference, and prompt formatting. |
| `scripts/` | Shell wrappers for multi-GPU runs. |
| `eval/math/` | Math grading script (`run_math_tests.py`) and utilities. |
| `eval/code/` | Code grading script (`run_code_tests.py`) and utilities. |
Each loader writes JSONL files with `id`, `problem`, and benchmark-specific metadata.
```bash
python data/loading/load_data_math.py
python data/loading/load_data_code.py
```
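To sanity-check a loader's output, it is enough to read a few records and confirm the fields above are present. A minimal sketch, assuming an illustrative output path (each loader script determines its actual filename):

```python
# Sketch: inspect the first few records of a loader output file.
import json

path = "data/loading/math_benchmark.jsonl"  # illustrative path; use your loader's actual output

with open(path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # Each record should carry at least `id` and `problem`,
        # plus benchmark-specific metadata.
        print(record["id"], record["problem"][:80])
        if i == 4:
            break
```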
Prompts used to augment the original source datasets are provided in `data/augmentation/prompts.py`. The output files from `data/loading` are used as inputs to the augmentation scripts.
- Math: `data/augmentation/math_intervene.py`
- Code: Code interventions run in three passes (`code_intervene_stage{1,2,3}.py`) that (1) break down the original problem, (2) revise the original problem specifications, and (3) revise the starter code.
The outputs of the augmentation scripts have `original_problem`, `revised_problem`, and `update` fields (with additional problem- and task-specific metadata), used for running the main evaluation experiments.
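For orientation, one augmented record is expected to look roughly like the sketch below; the values and any extra metadata keys are invented here, and only the three documented fields are guaranteed by the description above.

```python
# Illustrative shape of a single augmented record (values are invented).
augmented_record = {
    "id": "aime2024-12",  # carried over from the loader output
    "original_problem": "Find the number of integer solutions to ...",
    "revised_problem": "Find the number of integer solutions to ..., where n must be prime.",
    "update": "Update: the constraint on n has changed; n must now be prime.",
    # ... additional problem- and task-specific metadata
}
```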
`parser_utils.py` defines a full CLI for controlling inference. Important arguments:
| Argument | Functionality |
|---|---|
| `task` | `math` or `code`. Determines prompt formatting and dataset slicing. |
| `mode` | `initial` or `subsequent_interrupt_*`. Controls whether we run a first-pass generation or inject interruptions. |
| `interrupt_pos` | Fraction (0–1] of reasoning tokens preserved before injecting an update. |
| `interrupt_role` | `assistant` (append to the reasoning trace within the assistant turn) or `user` (add a new user turn). |
| `custom_prompt_file` | JSON file with the system prompt, prefix/suffix hooks, and update strings. |
Modes:

- `initial`: Full thinking mode without any interruptions.
- `subsequent_interrupt_hard`: Hard interrupt (terminate the thinking block).
- `subsequent_interrupt_extreme`: Hard interrupt (force the model to generate the final answer).
- `subsequent_interrupt_soft`: Provide a speed-up instruction without terminating the thinking block.
- `subsequent_interrupt_update`: Provide an update instruction without terminating the thinking block.
Typical workflow:

- Initial thinking pass:

  ```bash
  python src/run.py \
    --model_name Qwen/Qwen3-8B \
    --input_file <input_file> \
    --output_dir <output_dir> \
    --mode initial \
    --task math
  ```

- Interrupt round:

  ```bash
  python src/run.py \
    --model_name Qwen/Qwen3-8B \
    --input_file <initial_pass_output> \
    --output_dir <output_dir> \
    --mode subsequent_interrupt_hard \
    --interrupt_pos 0.3 \
    --task math
  ```
`inference_utils.run_subsequent_intervene` trims the thinking tokens according to `interrupt_pos`, applies custom updates from the dataset, and "resumes" the generation process.
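Conceptually, an update-driven interruption amounts to truncating the reasoning trace at `interrupt_pos` and splicing in the update before resuming generation. The sketch below illustrates that idea only; it is not the repository's implementation, and `tokenizer` here stands in for whatever tokenizer the model uses.

```python
# Illustrative sketch of an update-driven interruption.
def interrupt_and_resume(thinking_text: str, update_text: str, interrupt_pos: float, tokenizer) -> str:
    tokens = tokenizer.encode(thinking_text)
    keep = int(len(tokens) * interrupt_pos)          # e.g. 0.3 keeps 30% of the trace
    partial_trace = tokenizer.decode(tokens[:keep])  # truncated reasoning
    # Splice in the update (as assistant-side text or a new user turn,
    # depending on --interrupt_role), then let the model continue from here.
    return partial_trace + "\n\n" + update_text
```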
`common_config.sh` centralises defaults:

- Temperature/max token presets per model family (`QWEN_MAX_TOKENS`, `GPTOSS_MAX_TOKENS`, etc.).
- GPU orchestration helpers (`setup_gpu_config`, `run_inference_instances`, `run_interrupt_experiments`) to fan out jobs across `CUDA_VISIBLE_DEVICES`.
- `merge_results` concatenates `output_{rank}.jsonl` shards and cleans temporary directories (a minimal merge sketch follows this list).
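The merge step is essentially a concatenation of per-rank shards. A minimal sketch of that idea (the real logic lives in `common_config.sh`; the paths below are illustrative):

```python
# Sketch: merge output_{rank}.jsonl shards into one file.
import glob

shard_paths = sorted(glob.glob("tmp_out/output_*.jsonl"))  # illustrative shard directory

with open("merged_output.jsonl", "w") as merged:
    for path in shard_paths:
        with open(path) as shard:
            for line in shard:
                merged.write(line)
```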
To launch experiments:

- Duplicate either `run_full_thinking.sh` (single round) or `run_interrupt.sh` (multiple interrupt positions).
- Fill the `INSERT` placeholders for `MODEL`, `INPUT_FILE`, `OUT_DIR`, `CUSTOM_PROMPT_FILE`, etc.

`run_interrupt.sh` automatically iterates over `INTERRUPT_POS_LIST`, calling `run_interrupt_experiments` to execute each intervention and merge the results into `OUT_DIR/interrupted{pos}.jsonl`.
Script for evaluating model outputs on math benchmarks: `eval/math/run_math_tests.py`.
Arguments:

- `--input_file`: JSONL with `source`, `answer`, and `output` fields (produced by `src/run.py`).
- `--output_file`: Destination JSONL storing per-problem `pass@1` (see the aggregation sketch below).
- `--add_math_block`: Optionally wraps predictions with `\boxed{...}` before grading.
The scorer uses:

- `score_response_with_timeout` to evaluate per benchmark (`gsm8k`, `math500`, `aime2024`, `aime2025`).
- `math_parsing_util.py` to normalize LaTeX and compare symbolic answers robustly.
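Once the grader has written the per-problem scores, overall accuracy is a simple average. A minimal sketch, assuming the `--output_file` JSONL stores each problem's score under a `pass@1` key as described above:

```python
# Sketch: aggregate per-problem pass@1 scores from the math grader's output.
import json

scores = []
with open("math_eval_results.jsonl") as f:  # illustrative --output_file path
    for line in f:
        scores.append(json.loads(line)["pass@1"])

print(f"pass@1 over {len(scores)} problems: {sum(scores) / len(scores):.3f}")
```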
Script for evaluating model outputs on LiveCodeBench: `eval/code/run_code_tests.py` (most of the evaluation functionality is borrowed from the LCB repo).
Arguments:

- `--input_predictions`: JSONL from the experiments (must include `id` and the final `output` list).
- `--input_test_cases`: JSONL aligned with the LiveCodeBench schema (`public_test_cases`, `private_test_cases`, `metadata`).
- `--add_code_block`: Forces code fences when generations omit triple-backtick delimiters.
`run_code_tests.py` extracts Python snippets, merges them with the official tests, and calls `compute_code_generation_metrics.codegen_metrics` to obtain pass@k metrics (`pass@1` by default). Execution isolation relies on a sandboxed process with per-test timeouts (`testing_util.py`), consistent with the LCB evaluation suite.
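The snippet-extraction step corresponds roughly to pulling the last fenced code block out of each generation. A minimal sketch of that idea (not the repository's exact logic):

````python
# Sketch: extract the last fenced Python snippet from a model generation.
import re

def extract_code(generation: str) -> str:
    # Match fenced blocks of the form ```python ... ``` (or bare ``` ... ```).
    blocks = re.findall(r"```(?:python)?\n(.*?)```", generation, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else generation.strip()

sample = "Final solution:\n```python\nprint(int(input()) * 2)\n```"
print(extract_code(sample))
````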
```bibtex
@misc{wu2025largereasoningmodelsinterruptible,
      title={Are Large Reasoning Models Interruptible?},
      author={Tsung-Han Wu and Mihran Miroyan and David M. Chan and Trevor Darrell and Narges Norouzi and Joseph E. Gonzalez},
      year={2025},
      eprint={2510.11713},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.11713},
}
```