Skip to content

feat(grep,glob): abort ripgrep mid-search instead of waiting out timeout#143

Merged
ericleepi314 merged 1 commit into
mainfrom
feat/per-tool-abort-polling
May 15, 2026
Merged

feat(grep,glob): abort ripgrep mid-search instead of waiting out timeout#143
ericleepi314 merged 1 commit into
mainfrom
feat/per-tool-abort-polling

Conversation

@ericleepi314
Copy link
Copy Markdown
Collaborator

Summary

  • Before: Glob/Grep used subprocess.run(timeout=20) for ripgrep — a SIGINT/ESC that tripped the abort controller had to wait out the full 20s timeout before the agent loop noticed
  • After: Popen + 50ms poll loop (mirrors bash_tool's pattern from PR fix(esc): propagate cancel signal into tool_context for subagents #135) watches both the deadline and the abort signal; SIGTERM → 2s grace → SIGKILL on abort, returning within ~150ms instead of 20s
  • New RipgrepAbortedError is distinct from RipgrepTimeoutError: timeout surfaces partial results (useful to the agent), abort drops them (user already cancelled — surfacing them would be noise)

Why

PR #135 wired tool_context.abort_controller so subagents and Bash honor ESC. PR #141 brought the same to headless. Other long-running tools (Read, Glob, Grep, WebFetch, WebSearch, MCP) still don't poll the signal — they return promptly only at natural completion. This PR closes the gap for the search tools, the most visible offenders: searching a large monorepo can pin the user's terminal for 20s after they ESC.

Changes

  • src/tool_system/utils/ripgrep.py:
    • New _run_rg_with_abort(argv, *, timeout_s, abort_signal) — Popen-based supervisor mirroring bash_tool._run_bash_with_abort (50ms poll, SIGTERM → 2s grace → SIGKILL, start_new_session=True / Windows CREATE_NEW_PROCESS_GROUP for process-group teardown)
    • New RipgrepAbortedError(message, partial_results=...) raised when the signal trips
    • ripgrep(...) grows an abort_signal: AbortSignal | None = None keyword param; default-None preserves previous semantics exactly for SDK consumers
  • src/tool_system/tools/glob.py + src/tool_system/tools/grep.py:
  • tests/test_ripgrep_abort.py (new) — 6 regression tests:
    • Bare helper sanity: never-tripped signal preserves existing semantics
    • Pre-tripped signal raises RipgrepAbortedError fast
    • Deterministic abort-mid-subprocess test using sleep 60 (no real ripgrep required — also serves as CI coverage on machines without rg installed)
    • Glob abort → AbortError
    • Grep abort → AbortError
    • Unavailable-rg path: RipgrepUnavailableError still raised correctly

Test plan

  • 19 directly related tests pass (6 new + headless + abort + esc-cancel)
  • Critic subagent review: APPROVE (after applying recommended deterministic test variant + minor type/comment cleanups)
  • Manual: in a real Claude Code session, kick off Grep against a large monorepo, press ESC — should unwind within ~1s instead of 20s

Out of scope (deferred to separate follow-ups)

  • Read: typical reads complete in <1s; agent loop's pre-dispatch _check_cancel already short-circuits abort fired between tools.
  • WebFetch / WebSearch: bounded by urllib.request.urlopen(timeout=15). Cleanest abort requires a watchdog thread that closes the socket — separate PR.
  • MCP tools: out-of-process JSON-RPC; transport-level cancellation is a cross-cutting change across the entire MCP client.

Originally PR #142 stacked on PR #141; re-opened against main after #141 merged but #142's content stayed on the stacked branch.

🤖 Generated with Claude Code

…out (#142)

Before this fix, the ripgrep wrapper used ``subprocess.run(timeout=20)``.
A SIGINT/ESC that tripped the abort controller mid-search had to wait
out the full 20-second timeout before the subprocess returned and the
agent loop could observe the cancellation. On a large repo Glob and
Grep felt exactly like the pre-PR-#135 Bash supervisor — "ESC is
ignored for 20+ seconds."

Add ``_run_rg_with_abort`` that mirrors bash_tool's pattern: Popen +
50ms poll loop watching both the deadline and an ``AbortSignal``,
SIGTERM → 2s grace → SIGKILL. New ``RipgrepAbortedError`` is distinct
from ``RipgrepTimeoutError`` because the two outcomes warrant
different downstream handling — timeout surfaces partial results
(useful to the agent), abort drops them (the user already cancelled).

Glob and Grep pass ``context.abort_controller.signal`` to the helper
and re-raise ``AbortError`` on ``RipgrepAbortedError`` so the agent
loop's ``except AbortError: raise`` branch from PR #135 unwinds to the
outer cancel boundary. ``ripgrep()``'s ``abort_signal=None`` default
keeps SDK consumers of the bare helper working without changes.

Six regression tests pin the contract: bare helper sanity / pre-call
trip / abort-tripped-subprocess-returns-promptly (deterministic event
handshake against ``sleep`` — also exercises the supervisor without
requiring ripgrep installed) / Glob abort → AbortError / Grep abort →
AbortError / unavailable-rg path unchanged.

Read, WebFetch, WebSearch and MCP are deliberately deferred:
* Read: ``open().read()`` returns in <1s for typical files; the agent
  loop's pre-dispatch ``_check_cancel`` already short-circuits abort
  fired between tools.
* WebFetch/WebSearch: ``urllib.request.urlopen(timeout=15)`` is bounded
  at 15s; a watchdog-thread approach is bigger surgery best done as a
  separate PR.
* MCP: out-of-process JSON-RPC needs transport-level cancellation —
  cross-cutting change, separate PR.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@ericleepi314 ericleepi314 merged commit fd30385 into main May 15, 2026
ericleepi314 added a commit that referenced this pull request May 15, 2026
Even after PRs #135/#140/#141/#143 wired ESC through subagents, headless
SIGINT, the non-optional ``tool_context.abort_controller`` contract, and
ripgrep mid-search, the user-visible ESC latency was still 20+ seconds
when the model emitted a multi-tool_use response. Root cause: the
provider's streaming HTTP read had no way to observe the abort
controller. ``query.py`` passed no abort plumbing to
``chat_stream_response``; the only existing interrupt point was the
``on_text_chunk`` callback, which never fires for a turn that emits
tool_use blocks without intervening text. The model could spend ~20s
generating eight parallel ``Write`` blocks, and only after the stream
returned naturally would the outer query loop check the abort signal
and yield "Interrupted by user" for all eight.

Thread an ``AbortSignal`` through ``chat_stream_response``. In the
Anthropic provider, register a listener that calls
``stream.response.close()`` (the same close pattern the existing
``StreamWatchdog`` uses for idle timeout). Closing the underlying HTTP
response causes the SDK's blocking socket read to raise in the
consumer thread; the provider catches it, detects the abort via the
signal state (not the exception class — different SDK versions raise
different classes), and re-raises ``AbortError``.

Three defenses against the registration race:
* Pre-call fast-path bails when the signal was already tripped at
  call entry — skips the round-trip entirely.
* Register-then-recheck after entering the stream context — closes
  the sub-microsecond window between an ``aborted`` read and a
  ``add_listener`` append where ``_fire`` could snapshot the
  listener list and silently drop our newly-appended listener.
* Post-with-block recheck — surfaces a signal that fires between
  ``__exit__`` and ``return``.

Plumbing:
* ``query.py``: ``_call_model_sync`` grows ``abort_signal`` keyword
  and forwards to the provider; call site passes
  ``params.abort_controller.signal``. New
  ``except AbortError: raise`` in ``_call_model_sync`` and
  ``except AbortError: pass`` in the query loop route the cancel
  to the existing post-API abort-handling block in one place.
* ``agent_loop.py``: ``_call_provider_for_turn`` grows
  ``cancel_signal`` and forwards as ``abort_signal=`` to the
  provider; ``run_agent_loop`` passes through.
* ``minimax_provider.py`` / ``openai_compatible.py``: pre-call
  fast-path for parity (mid-stream listener for these providers is
  a future PR — same underlying anthropic SDK for Minimax makes it
  a small lift).

Five regression tests pin the contract: pre-aborted fast-path skips
``_ensure_client``, mid-stream abort closes the stream and raises
``AbortError`` within <1s (vs the 20s+ symptom this PR fixes),
unaborted streams return normally, ``abort_signal=None`` is a
no-op for legacy callers, and the listener detaches after normal
completion.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant