Skip to content

refactor(context): make ToolContext.abort_controller non-optional#140

Merged
ericleepi314 merged 1 commit into
mainfrom
fix/esc-cancel-subagent-propagation
May 15, 2026
Merged

refactor(context): make ToolContext.abort_controller non-optional#140
ericleepi314 merged 1 commit into
mainfrom
fix/esc-cancel-subagent-propagation

Conversation

@ericleepi314
Copy link
Copy Markdown
Collaborator

@ericleepi314 ericleepi314 commented May 15, 2026

Summary

  • Eliminates the "field is None" hazard class that masked the original ESC-into-subagent regression PR fix(esc): propagate cancel signal into tool_context for subagents #135 fixed — a forgotten plumbing call no longer silently degrades ESC propagation
  • ToolContext.abort_controller: AbortController = field(default_factory=AbortController) — every fresh context carries an untripped controller; bridge/engine still overwrite with their per-run controller, but tests, SDK callers, and any forgotten call site land on a safe default
  • Drops the or AbortController() / if abort_ctrl and … defensive guards in 6 downstream readers (sibling+per-tool controllers, abort fast-paths, hook gates)

Why

PR #135 wired tool_context.abort_controller so ESC actually reaches subagents and Bash. But the field stayed typed Any | None with a None default — meaning a future caller that forgets to plumb the controller would re-introduce the exact regression we just fixed, with the same silent-failure shape. This refactor tightens the contract so that's impossible.

(Originally merged stacked-on-top-of #135 as PR #137; this PR re-applies the same content on top of current main after #135 squash-merged and #136 landed.)

Changes

  • src/tool_system/context.pyabort_controller: AbortController = field(default_factory=AbortController)
  • src/agent/subagent_context.py — drop the impossible "parent has no controller → mint fresh" branch; the chain collapses to explicit override > share parent's > child of parent's
  • src/agent/run_agent.py — same; sync subagents now reliably share with parent (which is exactly the path that broke ESC propagation before)
  • src/services/tool_execution/streaming_executor.py — drop or AbortController() defensive wrapper; simplify truthiness guards in _get_abort_reason and _on_tool_abort; drop now-unused AbortController import
  • src/services/tool_execution/tool_execution.py — drop truthiness guard in the dispatch fast-abort
  • src/services/tool_execution/tool_hooks.py — same, for PreToolUse hook cancellation
  • src/tool_system/tools/bash/bash_tool.py — drop getattr(..., None) indirection in _get_abort_signal
  • src/tool_system/tools/tasks_v2.py — drop getattr(..., None) indirection in the TaskOutput abort fast-path
  • src/tui/agent_bridge.py_finish() installs a fresh controller instead of clearing to None (the field is no longer optional)
  • tests/test_esc_cancel_propagation.py — updated three assertions for the new contract (None → "fresh, untripped controller")
  • tests/test_tool_context_abort_default.py (new) — pins three contract invariants: default exists, per-context isolation, explicit override wins

Hook executor (src/hooks/hook_executor.py) and stop hooks (src/query/stop_hooks.py) keep their getattr(..., None) defensive pattern because their tool_use_context: Any signatures accept mock contexts from tests and external callers; tightening them would expand scope.

Test plan

🤖 Generated with Claude Code

The previous ``Any | None = None`` default papered over a hazard class
that masked the ESC-into-subagent regression PR #135 fixed: every
caller that owns the per-run cancellation lifecycle (TUI bridge, REPL
engine) had to remember to plumb the controller onto the context, and a
forgotten plumbing call silently degraded ESC propagation rather than
failing loudly.

Use ``field(default_factory=AbortController)`` so every fresh context
carries a real (untripped) controller. Downstream readers (subagent
inheritance, streaming executor, Bash supervisor, tool execution, tool
hooks, tasks_v2 abort fast-path) drop the defensive ``or
AbortController()`` and ``if abort_ctrl and …`` guards that existed only
to compensate for the impossible-now ``None`` state. The TUI bridge's
``_finish()`` now installs a fresh controller instead of clearing to
``None`` so the dataclass invariant holds across runs.

Three new contract tests pin the invariants: default exists, default is
per-context (not shared), explicit override still wins.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@ericleepi314 ericleepi314 force-pushed the fix/esc-cancel-subagent-propagation branch from ed5c9d7 to a0f9e4c Compare May 15, 2026 09:11
@ericleepi314 ericleepi314 changed the title Fix/esc cancel subagent propagation refactor(context): make ToolContext.abort_controller non-optional May 15, 2026
@ericleepi314 ericleepi314 merged commit e6168ec into main May 15, 2026
ericleepi314 added a commit that referenced this pull request May 15, 2026
Even after PRs #135/#140/#141/#143 wired ESC through subagents, headless
SIGINT, the non-optional ``tool_context.abort_controller`` contract, and
ripgrep mid-search, the user-visible ESC latency was still 20+ seconds
when the model emitted a multi-tool_use response. Root cause: the
provider's streaming HTTP read had no way to observe the abort
controller. ``query.py`` passed no abort plumbing to
``chat_stream_response``; the only existing interrupt point was the
``on_text_chunk`` callback, which never fires for a turn that emits
tool_use blocks without intervening text. The model could spend ~20s
generating eight parallel ``Write`` blocks, and only after the stream
returned naturally would the outer query loop check the abort signal
and yield "Interrupted by user" for all eight.

Thread an ``AbortSignal`` through ``chat_stream_response``. In the
Anthropic provider, register a listener that calls
``stream.response.close()`` (the same close pattern the existing
``StreamWatchdog`` uses for idle timeout). Closing the underlying HTTP
response causes the SDK's blocking socket read to raise in the
consumer thread; the provider catches it, detects the abort via the
signal state (not the exception class — different SDK versions raise
different classes), and re-raises ``AbortError``.

Three defenses against the registration race:
* Pre-call fast-path bails when the signal was already tripped at
  call entry — skips the round-trip entirely.
* Register-then-recheck after entering the stream context — closes
  the sub-microsecond window between an ``aborted`` read and a
  ``add_listener`` append where ``_fire`` could snapshot the
  listener list and silently drop our newly-appended listener.
* Post-with-block recheck — surfaces a signal that fires between
  ``__exit__`` and ``return``.

Plumbing:
* ``query.py``: ``_call_model_sync`` grows ``abort_signal`` keyword
  and forwards to the provider; call site passes
  ``params.abort_controller.signal``. New
  ``except AbortError: raise`` in ``_call_model_sync`` and
  ``except AbortError: pass`` in the query loop route the cancel
  to the existing post-API abort-handling block in one place.
* ``agent_loop.py``: ``_call_provider_for_turn`` grows
  ``cancel_signal`` and forwards as ``abort_signal=`` to the
  provider; ``run_agent_loop`` passes through.
* ``minimax_provider.py`` / ``openai_compatible.py``: pre-call
  fast-path for parity (mid-stream listener for these providers is
  a future PR — same underlying anthropic SDK for Minimax makes it
  a small lift).

Five regression tests pin the contract: pre-aborted fast-path skips
``_ensure_client``, mid-stream abort closes the stream and raises
``AbortError`` within <1s (vs the 20s+ symptom this PR fixes),
unaborted streams return normally, ``abort_signal=None`` is a
no-op for legacy callers, and the listener detaches after normal
completion.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant