fix(think): event-driven auto-continuation barrier + stream-active gate (fixes #1649) by threepointone · Pull Request #1667 · cloudflare/agents

threepointone · 2026-06-03T14:08:28Z

Summary

Reworks @cloudflare/think's parallel-tool auto-continuation barrier (added in #1649) into a purely event-driven mechanism, and adds a stream-active gate that fixes #1649's headline MissingToolResultsError race.

Fixes #1649. Follow-up tracked as #1650. Related to the think↔ai-chat barrier unification (#1642).

Background

When the model emits several tool calls in one step, the client answers each independently and sends a cf_agent_tool_result (with autoContinue) per result. A fast tool's result must not trigger inference while a slower sibling is still unanswered — that feeds the provider an incomplete tool-result set (MissingToolResultsError) or, via the transcript-repair backstop, errors the in-flight sibling and runs a spurious continuation.

#1649 introduced a barrier, but bounded the wait with a fixed 60s timeout that fired through on expiry. This PR replaces that, and then closes the deeper mid-stream race that the timeout-based fix never addressed.

Part 1 — Event-driven barrier (replaces the fixed timer)

A timer is the wrong primary mechanism for "wait until the client finishes answering the batch":

Human-in-the-loop fire-through. A no-execute tool (ask_user / display_ui) parked at input-available for as long as the user takes to answer was errored after 60s, mid-answer.
Orphan isolate hold. A disconnected client pinned the isolate alive for 60s.

Auto-continuation is only ever triggered by a tool-result/approval event, so the barrier now keys off events:

On the coalesce timer, drain in-flight applies (_drainInteractionApplies — awaits the apply tail and re-reads it so a sibling enqueued mid-drain is still awaited), re-check, and if a sibling is still unanswered return without firing and without holding the isolate, leaving pending in place.
The next sibling's result re-arms the check; the continuation fires exactly once when the final sibling lands.
Lost to eviction between siblings? The final result re-creates pending from the persisted transcript — self-healing across hibernation.
A true orphan never auto-continues and never pins the isolate.
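The event-driven behavior described above can be sketched roughly as follows. This is a minimal, synchronous model under stated assumptions: `EventDrivenBarrierSketch`, `emitToolCall`, `onToolResult`, and `check` are hypothetical names used for illustration and are not Think's internals, and the coalesce timer and drain step are omitted.

```typescript
// Hypothetical model of an event-driven barrier. Class and method names are
// illustrative assumptions, not Think's real internals.
type ToolState = "input-available" | "output-available";

class EventDrivenBarrierSketch {
  private tools: Record<string, ToolState> = {};
  private pending = false; // a continuation was requested and has not fired yet
  fired = 0; // number of continuations fired so far

  // The model emitted a client tool call that still needs an answer.
  emitToolCall(id: string): void {
    this.tools[id] = "input-available";
  }

  // A tool result (opting in to auto-continue) arrived from the client.
  onToolResult(id: string): void {
    this.tools[id] = "output-available";
    this.pending = true;
    this.check();
  }

  // No timer and no waiting: either fire now, or return and let the next
  // result event re-run the check.
  private check(): void {
    if (!this.pending) return;
    const incomplete = Object.keys(this.tools).some(
      (id) => this.tools[id] === "input-available"
    );
    if (incomplete) return; // `pending` stays set for the next sibling
    this.pending = false;
    this.fired += 1;
  }
}

const barrier = new EventDrivenBarrierSketch();
barrier.emitToolCall("fast");
barrier.emitToolCall("slow");
barrier.onToolResult("fast");
const firedAfterFast = barrier.fired;
barrier.onToolResult("slow");
const firedAfterSlow = barrier.fired;
```

In this model an orphaned batch (the slow sibling never answers) simply never fires, and nothing is left running while it waits.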

Removes AUTO_CONTINUATION_PENDING_TOOL_TIMEOUT_MS (and its console.warn).

Errored-sibling completion: the client sends autoContinue: false for an errored result, so that event no longer schedules a continuation. When a sibling already opted in, _rearmPendingAutoContinuationForBatch re-arms the check so the batch still continues exactly once — without ever creating a continuation for a standalone errored tool.
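A rough sketch of the errored-sibling rule, assuming a simplified in-memory model: `RearmSketch` and `runBatch` are hypothetical helpers for illustration, and only the "re-arm when a continuation is already pending" decision mirrors the description above.

```typescript
// Hypothetical model of the re-arm rule for results carrying autoContinue: false.
class RearmSketch {
  private answered: Record<string, boolean> = {};
  private pending = false; // set once some sibling opted in to auto-continue
  fired = 0;

  emitToolCall(id: string): void {
    this.answered[id] = false;
  }

  onToolResult(id: string, autoContinue: boolean): void {
    this.answered[id] = true;
    if (autoContinue) {
      this.pending = true; // this result requests a continuation
      this.check();
    } else if (this.pending) {
      this.check(); // re-arm only: a sibling already opted in
    }
    // Otherwise an errored result with no opted-in sibling creates nothing.
  }

  private check(): void {
    if (!this.pending) return;
    const incomplete = Object.keys(this.answered).some(
      (id) => !this.answered[id]
    );
    if (incomplete) return;
    this.pending = false;
    this.fired += 1;
  }
}

// Emits every tool call, then applies the results in order.
function runBatch(results: Array<[string, boolean]>): number {
  const sketch = new RearmSketch();
  for (const [id] of results) sketch.emitToolCall(id);
  for (const [id, autoContinue] of results) sketch.onToolResult(id, autoContinue);
  return sketch.fired;
}

const mixedBatchFired = runBatch([["ok", true], ["errored", false]]);
const standaloneErroredFired = runBatch([["errored", false]]);
```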

Part 2 — Stream-active gate (fixes the #1649 headline race)

The event-driven change alone did not fix #1649's actual reproduction. Per @abhagsain's debug-log analysis on the issue: the model emits parallel tool calls sequentially within one step, so a fast client tool can resolve and round-trip its result while the model is still streaming the slower siblings. At that moment the siblings exist nowhere (not in this.messages, not in the in-flight _streamingAssistant accumulator), so no completeness check can see them. The barrier was bypassed, the continuation was enqueued, and when it ran (after persist) it repaired the now-materialized-but-still-pending siblings to errored → death spiral.

T~250ms: deleteSlides (fast) result arrives → barrier checks
         accumulator = [reasoning(STREAMING), deleteSlides(output-available)]
         → modifySlide x2 NOT EMITTED YET → bypass → continuation enqueued
T~800ms: stream ends, persists all 3 parts (modifySlide x2 input-available)
         → continuation runs → repairs modifySlide x2 to errored

The only signal that "more tool calls may still arrive" is that the stream is open:

Stream-active gate: _fireAutoContinuationWhenStable returns early while _streamingAssistant != null (also re-checked in the drain .finally).
Stream-finalize re-trigger: _onStreamingTurnFinalized (at the two normal stream-finalize sites) clears the accumulator and re-runs the barrier. Essential for an all-fast batch whose every result landed mid-stream — there's no later tool-result event to re-arm it, so without this re-check the held continuation would deadlock. Abort/recovery paths keep a plain clear; recovery re-runs the turn and its own finalize re-triggers the held barrier.
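The gate and the finalize re-trigger can be modeled in a few lines. This is a sketch under assumptions: `StreamGateSketch` and its methods are hypothetical, a boolean `streaming` flag stands in for `_streamingAssistant != null`, and the drain step is left out.

```typescript
// Hypothetical model of the stream-active gate plus stream-finalize re-trigger.
class StreamGateSketch {
  private streaming = false; // stands in for `_streamingAssistant != null`
  private answered: Record<string, boolean> = {};
  private pending = false;
  fired = 0;

  startStream(): void {
    this.streaming = true;
  }

  emitToolCall(id: string): void {
    this.answered[id] = false;
  }

  onToolResult(id: string): void {
    this.answered[id] = true;
    this.pending = true;
    this.fireWhenStable();
  }

  // Stream finalize clears the streaming flag AND re-runs the barrier.
  finalizeStream(): void {
    this.streaming = false;
    this.fireWhenStable();
  }

  private fireWhenStable(): void {
    if (!this.pending) return;
    if (this.streaming) return; // gate: the batch can still grow mid-stream
    const incomplete = Object.keys(this.answered).some(
      (id) => !this.answered[id]
    );
    if (incomplete) return;
    this.pending = false;
    this.fired += 1;
  }
}

// Headline race: the slow sibling is emitted after the fast result arrives.
const headline = new StreamGateSketch();
headline.startStream();
headline.emitToolCall("fast");
headline.onToolResult("fast"); // slow sibling not emitted yet; the gate holds
headline.emitToolCall("slow");
headline.finalizeStream(); // slow is still unanswered, so nothing fires
const headlineBeforeSlow = headline.fired;
headline.onToolResult("slow");
const headlineAfterSlow = headline.fired;

// All-fast batch: every result lands while the stream is still open.
const allFast = new StreamGateSketch();
allFast.startStream();
allFast.emitToolCall("a");
allFast.onToolResult("a");
allFast.emitToolCall("b");
allFast.onToolResult("b");
const allFastBeforeFinalize = allFast.fired;
allFast.finalizeStream(); // the only remaining event that can release it
const allFastAfterFinalize = allFast.fired;
```

If `finalizeStream` only cleared the flag without re-running the check, the all-fast case in this model would stay at zero continuations, which is the deadlock the re-trigger guards against.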

Also corrected the prior NOTE that wrongly claimed turn-queue ordering made the mid-stream bypass safe — the logs disprove it.

Double-fire safety

_continuationBarrierActive absorbs a sibling that re-arms the timer mid-drain; the fire/return decision in the drain's .finally is synchronous, and _fireAutoContinuation cancels the pending timer.

Tests

packages/think/src/tests/client-tools.test.ts (real Workers runtime via vitest-pool-workers):

human-in-the-loop: parked no-execute tool — no fire-through, no repair; fires once on answer.
errored sibling completes the batch (autoContinue: false): continues once.
standalone errored tool: does not auto-continue.
self-heals across eviction mid-batch.
sibling emitted LATER in the same stream (the #1649 headline repro, "the _scheduleAutoContinuation 50ms timer fires before all parallel client-tool results arrive → MissingToolResultsError"; uses the new createMidStreamParallelToolModel): fast tool resolved mid-stream, no premature continuation, no repair of the slow tool, fires once when it answers.
all-fast batch resolved entirely mid-stream: fires once via the stream-finalize re-check (guards the deadlock).

Confidence-checked: the two mid-stream tests fail with the gate + finalize re-trigger disabled (the all-fast case fires 2 continuations) and pass with them. Full client-tools suite 72/72; oxlint clean; all 91 projects typecheck.

Scope / follow-ups

@cloudflare/ai-chat keeps the bounded-wait barrier for now (its barrier runs inside the queued continuation turn); making it event-driven there requires moving the gate before queueing, tracked with #1642 ("refactor: unify duplicated chat-recovery/repair machinery into the shared agents/chat layer (N3)").
Observability: the old console.warn is gone; a debug-level (non-firing) diagnostic for long-pending orphans is best added under #1642 rather than by reintroducing a timer.

Changeset

patch for @cloudflare/think.

…#1650)

The #1649 barrier made auto-continuation wait for all of a step's parallel client-tool results before firing, but bounded the wait with a fixed 60s timeout that fired through on expiry. That timeout was the wrong primary mechanism:

- A human-in-the-loop tool with no `execute` (an `ask_user`/`display_ui`-style prompt) emitted in parallel with a fast tool legitimately parks at `input-available` for as long as the user takes to answer. The barrier would fire through after 60s and repair the still-open tool to errored mid-answer.
- A true orphan (client disconnects mid-batch) pinned the isolate alive via `keepAlive` for the full 60s before falling through to the repair backstop.

Auto-continuation is only ever triggered by a tool-result/approval event, so the barrier is now purely event-driven:

- When the coalesce timer fires on an incomplete batch, Think drains the in-flight applies (`_drainInteractionApplies`, which awaits the apply tail and re-reads it so a sibling enqueued mid-drain is still awaited), re-checks, and, if a sibling is still unanswered, returns WITHOUT firing and WITHOUT holding the isolate, leaving `_continuation.pending` in place.
- The next sibling's result re-arms the coalesce timer and re-runs the check; the continuation fires exactly once when the final sibling lands.
- If the in-memory pending state is lost to eviction between siblings, the final result re-creates it from the persisted transcript and fires with a complete batch, self-healing across hibernation.
- A true orphan never auto-continues and never pins the isolate, which is correct: there is nothing valid to continue, and a later user turn / chat recovery repairs the dangling tool.

This removes `AUTO_CONTINUATION_PENDING_TOOL_TIMEOUT_MS` (and its `console.warn`) from the Think path entirely.

Errored-sibling completion: because the barrier now keys off events rather than polling message state, a result that COMPLETES a parallel batch but carries `autoContinue: false` (the client sends this for errored tool results) would no longer re-trigger the check, stranding the continuation a successful sibling already requested. `_rearmPendingAutoContinuationForBatch` re-arms the barrier for such a result ONLY when a pending continuation already exists, so the batch still continues exactly once, without ever creating a continuation for a standalone errored tool (documented single-tool error behavior is preserved).

Double-fire safety is preserved via `_continuationBarrierActive` (a sibling that re-arms the timer mid-drain is absorbed by the in-progress drain) and by making the fire/return decision synchronous in the drain's `.finally`, so a macrotask timer cannot interleave.

`@cloudflare/ai-chat` keeps the bounded-wait barrier for now (its barrier runs inside the queued continuation turn and can't return-and-wait without occupying the chat-turn queue); making it event-driven requires moving the batch gate before queueing, tracked alongside the think<->ai-chat unification (#1642).

Tests (packages/think/src/tests/client-tools.test.ts):

- human-in-the-loop tool parked beside a fast tool: no fire-through, no repair, fires once on the human's answer
- errored sibling (autoContinue:false) completes the batch: continues once
- standalone errored tool (autoContinue:false, no opted-in sibling): no continue
- self-heals when the pending continuation is evicted mid-batch

All existing #1649 regression tests remain green (70/70).

changeset-bot · 2026-06-03T14:08:32Z

🦋 Changeset detected

Latest commit: 68f512b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@cloudflare/think	Patch


pkg-pr-new · 2026-06-03T14:13:50Z


agents

npm i https://pkg.pr.new/agents@1667

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1667

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1667

hono-agents

npm i https://pkg.pr.new/hono-agents@1667

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1667

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1667

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1667

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1667

commit: 68f512b

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

…line mid-stream race

The event-driven barrier alone did not fix #1649's actual reproduction. Per abhagsain's debug-log analysis on the issue: the model emits parallel tool calls SEQUENTIALLY within one step, so a fast client tool can resolve and round-trip its result to the server WHILE the model is still streaming the slower siblings. At that moment the siblings exist nowhere (not in `this.messages`, not in the in-flight `_streamingAssistant` accumulator), so no batch-completeness check can see them. The barrier was bypassed, the continuation was enqueued, and when it ran (after the stream persisted all tool parts) it repaired the now-materialized-but-still-pending siblings to errored → MissingToolResultsError / death spiral.

The only signal that "more tool calls may still arrive" is that the stream is open. So:

- Stream-active gate: `_fireAutoContinuationWhenStable` now returns early while `_streamingAssistant` is non-null (also re-checked in the drain `.finally`). Mid-stream the batch can still grow, so no completeness check is meaningful.
- Stream-finalize re-trigger: `_onStreamingTurnFinalized` (called at the two normal stream-finalize sites) clears the accumulator AND re-runs the barrier check. This is essential for an all-fast batch whose every result landed mid-stream: once the stream ends there is no further tool-result event to re-arm the barrier, so without this re-check the held continuation would never fire (deadlock). The abort/recovery paths keep a plain clear; recovery re-runs the turn and its own finalize re-triggers the held barrier.

Corrected the prior (incorrect) NOTE that claimed turn-queue ordering made the mid-stream bypass safe. abhagsain's logs disprove it: the continuation runs after persist but still errors the pending siblings.

Tests (packages/think/src/tests/client-tools.test.ts) with a new mock model (`createMidStreamParallelToolModel`) that emits a fast tool, holds the stream open, then emits a slow tool:

- waits for a sibling emitted LATER in the same stream before continuing (the #1649 headline repro): fast tool resolved mid-stream, no premature continuation, no repair of the slow tool, fires once when the slow tool answers
- all-fast batch resolved entirely mid-stream: fires exactly once via the stream-finalize re-check (guards the deadlock)

Confidence-checked: both new tests FAIL with the gate + finalize re-trigger disabled (the all-fast case fires 2 continuations), and pass with them. Full client-tools suite 72/72; oxlint clean; all 91 projects typecheck.

#1663 tried to fix #1649 by making `_hasIncompleteToolBatch` read the streaming accumulator (`_partsAreMidBatch`). That approach never landed (it couldn't see a sibling the model hadn't streamed yet), and our fix uses the stream-active gate instead, so #1663's four accumulator-scan UNIT tests target an implementation we deliberately don't have and were not ported. Their behavioral intent is covered end-to-end by the existing stream-active-gate tests.

Ported the two scenarios that carry real signal for our design:

- Different ordering from the headline race: BOTH parallel tool calls streamed up front and visible in the accumulator, the fast one answered mid-stream, the slow sibling answered only AFTER stream end. Guards the gate across stream finalize (would fire prematurely and error the slow sibling without it).
- Approval-path variant: a parallel batch whose slow sibling awaits an APPROVAL response rather than a result. Locks down that `_hasIncompleteToolBatch` treats `approval-requested` as pending so the barrier holds until approval, then fires exactly once (the approval analog of the existing parallel-results barrier test).

client-tools suite 74/74; oxlint clean; all 91 projects typecheck.

threepointone · 2026-06-03T14:44:53Z

Ported coverage from the abandoned #1663 attempt

Reviewed #1663 (threepointone's earlier attempt at #1649). It tried to fix the bug by making _hasIncompleteToolBatch read the streaming accumulator (_partsAreMidBatch). That approach never landed — it can't see a sibling the model hasn't streamed yet, which is the actual race (per @abhagsain's debug logs on the issue). Its four accumulator-scan unit tests target that implementation, which this PR deliberately does not have (we gate on _streamingAssistant != null and never inspect sibling part-states), so they were not ported; their behavioral intent is already covered end-to-end by the stream-active-gate tests.

Notably, #1663 shipped a unit test, an integration test, and a full wrangler-dev e2e — all claimed to "fail without the fix" and reproduce the customer signature "deterministically across all 3 retries" — yet it didn't fix the customer's bug. The mock models put both tool calls into the accumulator before the fast result was answered, so the slow sibling was always visible. The lesson: a wrangler e2e is only as faithful as its mock model's emission ordering. Our headline test asserts streamingToolCallState("tc-slow") === undefined at the moment the fast tool is answered — the exact condition #1663's tests never created.
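To make the emission-ordering point concrete, here is a small sketch that replays two orderings and reports which tool calls are visible at the moment the fast tool is answered. The event shape and the `visibleAtAnswer` helper are hypothetical and exist only for illustration; they are not the test suite's mock model.

```typescript
// Hypothetical event log for one streamed step.
type StreamEvent =
  | { kind: "tool-call"; id: string } // the model streams a tool call
  | { kind: "client-answer"; id: string }; // the client's result round-trips

// Returns the tool-call ids already in the accumulator at the moment
// `answeredId` is answered.
function visibleAtAnswer(events: StreamEvent[], answeredId: string): string[] {
  const accumulator: string[] = [];
  for (const ev of events) {
    if (ev.kind === "tool-call") {
      accumulator.push(ev.id);
    } else if (ev.id === answeredId) {
      return accumulator.slice();
    }
  }
  return accumulator;
}

// Ordering used by mocks that stream both calls before the fast answer.
const upFrontOrdering: StreamEvent[] = [
  { kind: "tool-call", id: "tc-fast" },
  { kind: "tool-call", id: "tc-slow" },
  { kind: "client-answer", id: "tc-fast" },
];

// Ordering from the debug logs: the fast answer lands before the slow call exists.
const headlineOrdering: StreamEvent[] = [
  { kind: "tool-call", id: "tc-fast" },
  { kind: "client-answer", id: "tc-fast" },
  { kind: "tool-call", id: "tc-slow" },
];

const slowVisibleUpFront =
  visibleAtAnswer(upFrontOrdering, "tc-fast").indexOf("tc-slow") !== -1;
const slowVisibleHeadline =
  visibleAtAnswer(headlineOrdering, "tc-fast").indexOf("tc-slow") !== -1;
```

Only the second ordering produces a state in which no accumulator scan could see the slow sibling, which is the condition the headline test asserts.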

Ported the two scenarios that carry real signal for this design (commit 68f512b):

Different ordering — both parallel calls streamed up front and visible, fast answered mid-stream, slow answered only after stream end. Guards the gate across stream finalize.
Approval-path variant — a parallel batch whose slow sibling awaits an approval response; locks down that _hasIncompleteToolBatch treats approval-requested as pending so the barrier holds until approval, then fires once.
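A minimal sketch of the completeness rule the approval-path test locks down, assuming a simplified part shape. `ToolPart`, `hasIncompleteToolBatchSketch`, and the `output-error` state name are assumptions for illustration; only `input-available`, `output-available`, and `approval-requested` appear in this PR's description.

```typescript
// Hypothetical tool-part shape; "output-error" is an assumed name for an
// errored result.
type ToolPartState =
  | "input-available"
  | "approval-requested"
  | "output-available"
  | "output-error";

interface ToolPart {
  toolCallId: string;
  state: ToolPartState;
}

// A batch is incomplete while any sibling still awaits a result OR an approval.
function hasIncompleteToolBatchSketch(parts: ToolPart[]): boolean {
  return parts.some(
    (part) =>
      part.state === "input-available" || part.state === "approval-requested"
  );
}

const awaitingApproval = hasIncompleteToolBatchSketch([
  { toolCallId: "fast", state: "output-available" },
  { toolCallId: "slow", state: "approval-requested" },
]);

const allSettled = hasIncompleteToolBatchSketch([
  { toolCallId: "fast", state: "output-available" },
  { toolCallId: "slow", state: "output-error" },
]);
```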

client-tools suite now 74/74; lint clean; 91/91 typecheck.

…ery (#1671)

Follow-up to #1667. Two findings from review of the event-driven auto-continuation barrier.

#1 (fix): The RPC streaming path (`_streamResultToRpcCallback`) re-armed the auto-continuation coalesce timer in its `finally` even on the stream-stall recovery early-returns (`scheduled`/`exhausted`). This is unlike the WebSocket `_streamResult` recovery paths, which deliberately do a plain `this._streamingAssistant = null` WITHOUT re-arming, because the scheduled recovery continuation re-runs the turn and its own stream finalize re-triggers the held barrier. When a parallel tool batch had a pending continuation at the moment the stall watchdog fired, the RPC re-arm scheduled a 50ms coalesce timer that could fire `_fireAutoContinuation` alongside the alarm-scheduled recovery continuation -> a spurious double model invocation on the turn queue. The RPC recovery early-returns now mirror the WebSocket path: a `skipFinalizeRearm` flag makes the `finally` do a plain clear instead of `_onStreamingTurnFinalized()`, so the held barrier is re-triggered exactly once by the recovery continuation.

#2 (document): The eviction self-healing path relies on the completing result carrying `autoContinue: true` (it re-creates `_continuation.pending` from the persisted transcript via `_scheduleAutoContinuation`). If the completing result is an errored `autoContinue: false` sibling AFTER eviction, `_rearmPendingAutoContinuationForBatch` finds no pending and no-ops -- the continuation is silently dropped. This is NOT a regression (the old in-memory 60s timer was equally eviction-fragile) and a later user message / chat recovery repairs the transcript; fixing it properly would require persisting the continuation-requested intent across eviction. Pinned the current behavior with an explicit test rather than changing it.

Tests:

- think-session.test.ts: asserts the coalesce timer is NOT re-armed after an RPC stall routes into recovery (while `_streamingAssistant` is still cleared). Adds `testStallRecoveryDoesNotRearmPendingContinuation`.
- client-tools.test.ts: documents the eviction + errored-completing-result gap (0 continuations, both results still applied to transcript).

npm run check passes; new + existing stall/self-heal tests green.
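The `skipFinalizeRearm` pattern from finding #1 can be sketched as a flag consulted in a `finally` block. This is a simplified, synchronous model under assumptions: `RpcFinalizeSketch`, `run`, and the `rearmCount` counter are hypothetical, and the real path is asynchronous.

```typescript
// Hypothetical model of the RPC streaming path's finalize decision.
type StreamOutcome = "completed" | "scheduled" | "exhausted";

class RpcFinalizeSketch {
  streamingAssistant: object | null = null;
  rearmCount = 0; // how many times the barrier was re-run on finalize

  // Normal finalize: clear the accumulator AND re-run the barrier.
  private onStreamingTurnFinalized(): void {
    this.streamingAssistant = null;
    this.rearmCount += 1;
  }

  run(outcome: StreamOutcome): void {
    this.streamingAssistant = {};
    let skipFinalizeRearm = false;
    try {
      if (outcome !== "completed") {
        // Stall recovery early-return: the recovery continuation re-runs the
        // turn, and its own finalize re-triggers the held barrier.
        skipFinalizeRearm = true;
        return;
      }
    } finally {
      if (skipFinalizeRearm) {
        this.streamingAssistant = null; // plain clear, no re-arm
      } else {
        this.onStreamingTurnFinalized();
      }
    }
  }
}

const stalled = new RpcFinalizeSketch();
stalled.run("scheduled");
const stalledRearms = stalled.rearmCount;
const stalledCleared = stalled.streamingAssistant === null;

const normal = new RpcFinalizeSketch();
normal.run("completed");
const normalRearms = normal.rearmCount;
```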

devin-ai-integration Bot reviewed Jun 3, 2026


threepointone changed the title ~~fix(think): make parallel-tool auto-continuation barrier event-driven (#1650)~~ fix(think): event-driven auto-continuation barrier + stream-active gate (fixes #1649) Jun 3, 2026

threepointone merged commit 919bfaa into main Jun 3, 2026
4 checks passed

threepointone deleted the think-event-driven-continuation-barrier branch June 3, 2026 17:41

github-actions Bot mentioned this pull request Jun 3, 2026

Version Packages #1669

Merged

This was referenced Jun 3, 2026

fix(agents): progress-keyed agent-tool re-attach so a deploy can't abandon a still-running child (#1630) #1670

Open

fix(think): don't re-arm auto-continuation barrier on RPC stall recovery #1671

Merged

threepointone merged 3 commits into main from think-event-driven-continuation-barrier
