fix(think): event-driven auto-continuation barrier + stream-active gate (fixes #1649)#1667
Conversation
…#1650) The #1649 barrier made auto-continuation wait for all of a step's parallel client-tool results before firing, but bounded the wait with a fixed 60s timeout that fired through on expiry. That timeout was the wrong primary mechanism: - A human-in-the-loop tool with no `execute` (an `ask_user`/`display_ui`-style prompt) emitted in parallel with a fast tool legitimately parks at `input-available` for as long as the user takes to answer. The barrier would fire through after 60s and repair the still-open tool to errored mid-answer. - A true orphan (client disconnects mid-batch) pinned the isolate alive via `keepAlive` for the full 60s before falling through to the repair backstop. Auto-continuation is only ever triggered by a tool-result/approval event, so the barrier is now purely event-driven: - When the coalesce timer fires on an incomplete batch, Think drains the in-flight applies (`_drainInteractionApplies` — awaits the apply tail and re-reads it so a sibling enqueued mid-drain is still awaited), re-checks, and — if a sibling is still unanswered — returns WITHOUT firing and WITHOUT holding the isolate, leaving `_continuation.pending` in place. - The next sibling's result re-arms the coalesce timer and re-runs the check; the continuation fires exactly once when the final sibling lands. - If the in-memory pending state is lost to eviction between siblings, the final result re-creates it from the persisted transcript and fires with a complete batch — self-healing across hibernation. - A true orphan never auto-continues and never pins the isolate, which is correct: there is nothing valid to continue, and a later user turn / chat recovery repairs the dangling tool. This removes `AUTO_CONTINUATION_PENDING_TOOL_TIMEOUT_MS` (and its `console.warn`) from the Think path entirely. Errored-sibling completion: because the barrier now keys off events rather than polling message state, a result that COMPLETES a parallel batch but carries `autoContinue: false` (the client sends this for errored tool results) would no longer re-trigger the check, stranding the continuation a successful sibling already requested. `_rearmPendingAutoContinuationForBatch` re-arms the barrier for such a result ONLY when a pending continuation already exists, so the batch still continues exactly once — without ever creating a continuation for a standalone errored tool (documented single-tool error behavior is preserved). Double-fire safety is preserved via `_continuationBarrierActive` (a sibling that re-arms the timer mid-drain is absorbed by the in-progress drain) and by making the fire/return decision synchronous in the drain's `.finally`, so a macrotask timer cannot interleave. `@cloudflare/ai-chat` keeps the bounded-wait barrier for now (its barrier runs inside the queued continuation turn and can't return-and-wait without occupying the chat-turn queue); making it event-driven requires moving the batch gate before queueing, tracked alongside the think<->ai-chat unification (#1642). Tests (packages/think/src/tests/client-tools.test.ts): - human-in-the-loop tool parked beside a fast tool: no fire-through, no repair, fires once on the human's answer - errored sibling (autoContinue:false) completes the batch: continues once - standalone errored tool (autoContinue:false, no opted-in sibling): no continue - self-heals when the pending continuation is evicted mid-batch All existing #1649 regression tests remain green (70/70).
🦋 Changeset detectedLatest commit: 68f512b The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
agents
@cloudflare/ai-chat
@cloudflare/codemode
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
commit: |
…line mid-stream race The event-driven barrier alone did not fix #1649's actual reproduction. Per abhagsain's debug-log analysis on the issue: the model emits parallel tool calls SEQUENTIALLY within one step, so a fast client tool can resolve and round-trip its result to the server WHILE the model is still streaming the slower siblings. At that moment the siblings exist nowhere — not in `this.messages`, not in the in-flight `_streamingAssistant` accumulator — so no batch-completeness check can see them. The barrier bypassed, the continuation was enqueued, and when it ran (after the stream persisted all tool parts) it repaired the now-materialized-but-still-pending siblings to errored → MissingToolResultsError / death spiral. The only signal that "more tool calls may still arrive" is that the stream is open. So: - Stream-active gate: `_fireAutoContinuationWhenStable` now returns early while `_streamingAssistant` is non-null (also re-checked in the drain `.finally`). Mid-stream the batch can still grow, so no completeness check is meaningful. - Stream-finalize re-trigger: `_onStreamingTurnFinalized` (called at the two normal stream-finalize sites) clears the accumulator AND re-runs the barrier check. This is essential for an all-fast batch whose every result landed mid-stream — once the stream ends there is no further tool-result event to re-arm the barrier, so without this re-check the held continuation would never fire (deadlock). The abort/recovery paths keep a plain clear; recovery re-runs the turn and its own finalize re-triggers the held barrier. Corrected the prior (incorrect) NOTE that claimed turn-queue ordering made the mid-stream bypass safe — abhagsain's logs disprove it: the continuation runs after persist but still errors the pending siblings. Tests (packages/think/src/tests/client-tools.test.ts) with a new mock model (`createMidStreamParallelToolModel`) that emits a fast tool, holds the stream open, then emits a slow tool: - waits for a sibling emitted LATER in the same stream before continuing (the #1649 headline repro): fast tool resolved mid-stream, no premature continuation, no repair of the slow tool, fires once when the slow tool answers - all-fast batch resolved entirely mid-stream: fires exactly once via the stream-finalize re-check (guards the deadlock) Confidence-checked: both new tests FAIL with the gate + finalize re-trigger disabled (the all-fast case fires 2 continuations), and pass with them. Full client-tools suite 72/72; oxlint clean; all 91 projects typecheck.
#1663 tried to fix #1649 by making `_hasIncompleteToolBatch` read the streaming accumulator (`_partsAreMidBatch`). That approach never landed (it couldn't see a sibling the model hadn't streamed yet), and our fix uses the stream-active gate instead — so #1663's four accumulator-scan UNIT tests target an implementation we deliberately don't have and were not ported. Their behavioral intent is covered end-to-end by the existing stream-active-gate tests. Ported the two scenarios that carry real signal for our design: - Different ordering from the headline race: BOTH parallel tool calls streamed up front and visible in the accumulator, the fast one answered mid-stream, the slow sibling answered only AFTER stream end. Guards the gate across stream finalize (would fire prematurely and error the slow sibling without it). - Approval-path variant: a parallel batch whose slow sibling awaits an APPROVAL response rather than a result. Locks down that `_hasIncompleteToolBatch` treats `approval-requested` as pending so the barrier holds until approval, then fires exactly once (the approval analog of the existing parallel-results barrier test). client-tools suite 74/74; oxlint clean; all 91 projects typecheck.
Ported coverage from the abandoned #1663 attemptReviewed #1663 (threepointone's earlier attempt at #1649). It tried to fix the bug by making Notably, #1663 shipped a unit test, an integration test, and a full wrangler-dev e2e — all claimed to "fail without the fix" and reproduce the customer signature "deterministically across all 3 retries" — yet it didn't fix the customer's bug. The mock models put both tool calls into the accumulator before the fast result was answered, so the slow sibling was always visible. The lesson: a wrangler e2e is only as faithful as its mock model's emission ordering. Our headline test asserts Ported the two scenarios that carry real signal for this design (commit 68f512b):
|
…ery (#1671) Follow-up to #1667. Two findings from review of the event-driven auto-continuation barrier. #1 (fix): The RPC streaming path (`_streamResultToRpcCallback`) re-armed the auto-continuation coalesce timer in its `finally` even on the stream-stall recovery early-returns (`scheduled`/`exhausted`). This is unlike the WebSocket `_streamResult` recovery paths, which deliberately do a plain `this._streamingAssistant = null` WITHOUT re-arming, because the scheduled recovery continuation re-runs the turn and its own stream finalize re-triggers the held barrier. When a parallel tool batch had a pending continuation at the moment the stall watchdog fired, the RPC re-arm scheduled a 50ms coalesce timer that could fire `_fireAutoContinuation` alongside the alarm-scheduled recovery continuation -> a spurious double model invocation on the turn queue. The RPC recovery early-returns now mirror the WebSocket path: a `skipFinalizeRearm` flag makes the `finally` do a plain clear instead of `_onStreamingTurnFinalized()`, so the held barrier is re-triggered exactly once by the recovery continuation. #2 (document): The eviction self-healing path relies on the completing result carrying `autoContinue: true` (it re-creates `_continuation.pending` from the persisted transcript via `_scheduleAutoContinuation`). If the completing result is an errored `autoContinue: false` sibling AFTER eviction, `_rearmPendingAutoContinuationForBatch` finds no pending and no-ops -- the continuation is silently dropped. This is NOT a regression (the old in-memory 60s timer was equally eviction-fragile) and a later user message / chat recovery repairs the transcript; fixing it properly would require persisting the continuation-requested intent across eviction. Pinned the current behavior with an explicit test rather than changing it. Tests: - think-session.test.ts: asserts the coalesce timer is NOT re-armed after an RPC stall routes into recovery (while `_streamingAssistant` is still cleared). Adds `testStallRecoveryDoesNotRearmPendingContinuation`. - client-tools.test.ts: documents the eviction + errored-completing- result gap (0 continuations, both results still applied to transcript). npm run check passes; new + existing stall/self-heal tests green.
Summary
Reworks
@cloudflare/think's parallel-tool auto-continuation barrier (added in #1649) into a purely event-driven mechanism, and adds a stream-active gate that fixes #1649's headlineMissingToolResultsErrorrace.Fixes #1649. Follow-up tracked as #1650. Related to the think↔ai-chat barrier unification (#1642).
Background
When the model emits several tool calls in one step, the client answers each independently and sends a
cf_agent_tool_result(withautoContinue) per result. A fast tool's result must not trigger inference while a slower sibling is still unanswered — that feeds the provider an incomplete tool-result set (MissingToolResultsError) or, via the transcript-repair backstop, errors the in-flight sibling and runs a spurious continuation.#1649 introduced a barrier, but bounded the wait with a fixed 60s timeout that fired through on expiry. This PR replaces that, and then closes the deeper mid-stream race that the timeout-based fix never addressed.
Part 1 — Event-driven barrier (replaces the fixed timer)
A timer is the wrong primary mechanism for "wait until the client finishes answering the batch":
executetool (ask_user/display_ui) parked atinput-availablefor as long as the user takes to answer was errored after 60s, mid-answer.Auto-continuation is only ever triggered by a tool-result/approval event, so the barrier now keys off events:
_drainInteractionApplies— awaits the apply tail and re-reads it so a sibling enqueued mid-drain is still awaited), re-check, and if a sibling is still unanswered return without firing and without holding the isolate, leavingpendingin place.pendingfrom the persisted transcript — self-healing across hibernation.Removes
AUTO_CONTINUATION_PENDING_TOOL_TIMEOUT_MS(and itsconsole.warn).Errored-sibling completion: the client sends
autoContinue: falsefor an errored result, so that event no longer schedules a continuation. When a sibling already opted in,_rearmPendingAutoContinuationForBatchre-arms the check so the batch still continues exactly once — without ever creating a continuation for a standalone errored tool.Part 2 — Stream-active gate (fixes the #1649 headline race)
The event-driven change alone did not fix #1649's actual reproduction. Per @abhagsain's debug-log analysis on the issue: the model emits parallel tool calls sequentially within one step, so a fast client tool can resolve and round-trip its result while the model is still streaming the slower siblings. At that moment the siblings exist nowhere — not in
this.messages, not in the in-flight_streamingAssistantaccumulator — so no completeness check can see them. The barrier bypassed, the continuation was enqueued, and when it ran (after persist) it repaired the now-materialized-but-still-pending siblings to errored → death spiral.The only signal that "more tool calls may still arrive" is that the stream is open:
_fireAutoContinuationWhenStablereturns early while_streamingAssistant != null(also re-checked in the drain.finally)._onStreamingTurnFinalized(at the two normal stream-finalize sites) clears the accumulator and re-runs the barrier. Essential for an all-fast batch whose every result landed mid-stream — there's no later tool-result event to re-arm it, so without this re-check the held continuation would deadlock. Abort/recovery paths keep a plain clear; recovery re-runs the turn and its own finalize re-triggers the held barrier.Also corrected the prior NOTE that wrongly claimed turn-queue ordering made the mid-stream bypass safe — the logs disprove it.
Double-fire safety
_continuationBarrierActiveabsorbs a sibling that re-arms the timer mid-drain; the fire/return decision in the drain's.finallyis synchronous, and_fireAutoContinuationcancels the pending timer.Tests
packages/think/src/tests/client-tools.test.ts(real Workers runtime viavitest-pool-workers):executetool — no fire-through, no repair; fires once on answer.autoContinue: false): continues once.createMidStreamParallelToolModel): fast tool resolved mid-stream, no premature continuation, no repair of the slow tool, fires once when it answers.Confidence-checked: the two mid-stream tests fail with the gate + finalize re-trigger disabled (the all-fast case fires 2 continuations) and pass with them. Full
client-toolssuite 72/72;oxlintclean; all 91 projects typecheck.Scope / follow-ups
@cloudflare/ai-chatkeeps the bounded-wait barrier for now (its barrier runs inside the queued continuation turn); event-driven there requires moving the gate before queueing — tracked with refactor: unify duplicated chat-recovery/repair machinery into the shared agents/chat layer (N3) #1642.console.warnis gone; a debug-level (non-firing) diagnostic for long-pending orphans is best added under refactor: unify duplicated chat-recovery/repair machinery into the shared agents/chat layer (N3) #1642 rather than reintroducing a timer.Changeset
patchfor@cloudflare/think.