Skip to content

fix(think): event-driven auto-continuation barrier + stream-active gate (fixes #1649)#1667

Merged
threepointone merged 3 commits into
mainfrom
think-event-driven-continuation-barrier
Jun 3, 2026
Merged

fix(think): event-driven auto-continuation barrier + stream-active gate (fixes #1649)#1667
threepointone merged 3 commits into
mainfrom
think-event-driven-continuation-barrier

Conversation

@threepointone
Copy link
Copy Markdown
Contributor

@threepointone threepointone commented Jun 3, 2026

Summary

Reworks @cloudflare/think's parallel-tool auto-continuation barrier (added in #1649) into a purely event-driven mechanism, and adds a stream-active gate that fixes #1649's headline MissingToolResultsError race.

Fixes #1649. Follow-up tracked as #1650. Related to the think↔ai-chat barrier unification (#1642).

Background

When the model emits several tool calls in one step, the client answers each independently and sends a cf_agent_tool_result (with autoContinue) per result. A fast tool's result must not trigger inference while a slower sibling is still unanswered — that feeds the provider an incomplete tool-result set (MissingToolResultsError) or, via the transcript-repair backstop, errors the in-flight sibling and runs a spurious continuation.

#1649 introduced a barrier, but bounded the wait with a fixed 60s timeout that fired through on expiry. This PR replaces that, and then closes the deeper mid-stream race that the timeout-based fix never addressed.

Part 1 — Event-driven barrier (replaces the fixed timer)

A timer is the wrong primary mechanism for "wait until the client finishes answering the batch":

  • Human-in-the-loop fire-through. A no-execute tool (ask_user / display_ui) parked at input-available for as long as the user takes to answer was errored after 60s, mid-answer.
  • Orphan isolate hold. A disconnected client pinned the isolate alive for 60s.

Auto-continuation is only ever triggered by a tool-result/approval event, so the barrier now keys off events:

  • On the coalesce timer, drain in-flight applies (_drainInteractionApplies — awaits the apply tail and re-reads it so a sibling enqueued mid-drain is still awaited), re-check, and if a sibling is still unanswered return without firing and without holding the isolate, leaving pending in place.
  • The next sibling's result re-arms the check; the continuation fires exactly once when the final sibling lands.
  • Lost to eviction between siblings? The final result re-creates pending from the persisted transcript — self-healing across hibernation.
  • A true orphan never auto-continues and never pins the isolate.

Removes AUTO_CONTINUATION_PENDING_TOOL_TIMEOUT_MS (and its console.warn).

Errored-sibling completion: the client sends autoContinue: false for an errored result, so that event no longer schedules a continuation. When a sibling already opted in, _rearmPendingAutoContinuationForBatch re-arms the check so the batch still continues exactly once — without ever creating a continuation for a standalone errored tool.

Part 2 — Stream-active gate (fixes the #1649 headline race)

The event-driven change alone did not fix #1649's actual reproduction. Per @abhagsain's debug-log analysis on the issue: the model emits parallel tool calls sequentially within one step, so a fast client tool can resolve and round-trip its result while the model is still streaming the slower siblings. At that moment the siblings exist nowhere — not in this.messages, not in the in-flight _streamingAssistant accumulator — so no completeness check can see them. The barrier bypassed, the continuation was enqueued, and when it ran (after persist) it repaired the now-materialized-but-still-pending siblings to errored → death spiral.

T~250ms: deleteSlides (fast) result arrives → barrier checks
         accumulator = [reasoning(STREAMING), deleteSlides(output-available)]
         → modifySlide x2 NOT EMITTED YET → bypass → continuation enqueued
T~800ms: stream ends, persists all 3 parts (modifySlide x2 input-available)
         → continuation runs → repairs modifySlide x2 to errored

The only signal that "more tool calls may still arrive" is that the stream is open:

  • Stream-active gate: _fireAutoContinuationWhenStable returns early while _streamingAssistant != null (also re-checked in the drain .finally).
  • Stream-finalize re-trigger: _onStreamingTurnFinalized (at the two normal stream-finalize sites) clears the accumulator and re-runs the barrier. Essential for an all-fast batch whose every result landed mid-stream — there's no later tool-result event to re-arm it, so without this re-check the held continuation would deadlock. Abort/recovery paths keep a plain clear; recovery re-runs the turn and its own finalize re-triggers the held barrier.

Also corrected the prior NOTE that wrongly claimed turn-queue ordering made the mid-stream bypass safe — the logs disprove it.

Double-fire safety

_continuationBarrierActive absorbs a sibling that re-arms the timer mid-drain; the fire/return decision in the drain's .finally is synchronous, and _fireAutoContinuation cancels the pending timer.

Tests

packages/think/src/tests/client-tools.test.ts (real Workers runtime via vitest-pool-workers):

  • human-in-the-loop: parked no-execute tool — no fire-through, no repair; fires once on answer.
  • errored sibling completes the batch (autoContinue: false): continues once.
  • standalone errored tool: does not auto-continue.
  • self-heals across eviction mid-batch.
  • sibling emitted LATER in the same stream (the _scheduleAutoContinuation 50ms timer fires before all parallel client-tool results arrive → MissingToolResultsError #1649 headline repro, new createMidStreamParallelToolModel): fast tool resolved mid-stream, no premature continuation, no repair of the slow tool, fires once when it answers.
  • all-fast batch resolved entirely mid-stream: fires once via the stream-finalize re-check (guards the deadlock).

Confidence-checked: the two mid-stream tests fail with the gate + finalize re-trigger disabled (the all-fast case fires 2 continuations) and pass with them. Full client-tools suite 72/72; oxlint clean; all 91 projects typecheck.

Scope / follow-ups

Changeset

patch for @cloudflare/think.

…#1650)

The #1649 barrier made auto-continuation wait for all of a step's parallel
client-tool results before firing, but bounded the wait with a fixed 60s
timeout that fired through on expiry. That timeout was the wrong primary
mechanism:

- A human-in-the-loop tool with no `execute` (an `ask_user`/`display_ui`-style
  prompt) emitted in parallel with a fast tool legitimately parks at
  `input-available` for as long as the user takes to answer. The barrier would
  fire through after 60s and repair the still-open tool to errored mid-answer.
- A true orphan (client disconnects mid-batch) pinned the isolate alive via
  `keepAlive` for the full 60s before falling through to the repair backstop.

Auto-continuation is only ever triggered by a tool-result/approval event, so
the barrier is now purely event-driven:

- When the coalesce timer fires on an incomplete batch, Think drains the
  in-flight applies (`_drainInteractionApplies` — awaits the apply tail and
  re-reads it so a sibling enqueued mid-drain is still awaited), re-checks, and
  — if a sibling is still unanswered — returns WITHOUT firing and WITHOUT
  holding the isolate, leaving `_continuation.pending` in place.
- The next sibling's result re-arms the coalesce timer and re-runs the check;
  the continuation fires exactly once when the final sibling lands.
- If the in-memory pending state is lost to eviction between siblings, the
  final result re-creates it from the persisted transcript and fires with a
  complete batch — self-healing across hibernation.
- A true orphan never auto-continues and never pins the isolate, which is
  correct: there is nothing valid to continue, and a later user turn / chat
  recovery repairs the dangling tool.

This removes `AUTO_CONTINUATION_PENDING_TOOL_TIMEOUT_MS` (and its `console.warn`)
from the Think path entirely.

Errored-sibling completion: because the barrier now keys off events rather than
polling message state, a result that COMPLETES a parallel batch but carries
`autoContinue: false` (the client sends this for errored tool results) would no
longer re-trigger the check, stranding the continuation a successful sibling
already requested. `_rearmPendingAutoContinuationForBatch` re-arms the barrier
for such a result ONLY when a pending continuation already exists, so the batch
still continues exactly once — without ever creating a continuation for a
standalone errored tool (documented single-tool error behavior is preserved).

Double-fire safety is preserved via `_continuationBarrierActive` (a sibling
that re-arms the timer mid-drain is absorbed by the in-progress drain) and by
making the fire/return decision synchronous in the drain's `.finally`, so a
macrotask timer cannot interleave.

`@cloudflare/ai-chat` keeps the bounded-wait barrier for now (its barrier runs
inside the queued continuation turn and can't return-and-wait without occupying
the chat-turn queue); making it event-driven requires moving the batch gate
before queueing, tracked alongside the think<->ai-chat unification (#1642).

Tests (packages/think/src/tests/client-tools.test.ts):
- human-in-the-loop tool parked beside a fast tool: no fire-through, no repair,
  fires once on the human's answer
- errored sibling (autoContinue:false) completes the batch: continues once
- standalone errored tool (autoContinue:false, no opted-in sibling): no continue
- self-heals when the pending continuation is evicted mid-batch
All existing #1649 regression tests remain green (70/70).
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Jun 3, 2026

🦋 Changeset detected

Latest commit: 68f512b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/think Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented Jun 3, 2026

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1667

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1667

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1667

hono-agents

npm i https://pkg.pr.new/hono-agents@1667

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1667

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1667

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1667

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1667

commit: 68f512b

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

…line mid-stream race

The event-driven barrier alone did not fix #1649's actual reproduction. Per
abhagsain's debug-log analysis on the issue: the model emits parallel tool
calls SEQUENTIALLY within one step, so a fast client tool can resolve and
round-trip its result to the server WHILE the model is still streaming the
slower siblings. At that moment the siblings exist nowhere — not in
`this.messages`, not in the in-flight `_streamingAssistant` accumulator — so no
batch-completeness check can see them. The barrier bypassed, the continuation
was enqueued, and when it ran (after the stream persisted all tool parts) it
repaired the now-materialized-but-still-pending siblings to errored →
MissingToolResultsError / death spiral.

The only signal that "more tool calls may still arrive" is that the stream is
open. So:

- Stream-active gate: `_fireAutoContinuationWhenStable` now returns early while
  `_streamingAssistant` is non-null (also re-checked in the drain `.finally`).
  Mid-stream the batch can still grow, so no completeness check is meaningful.
- Stream-finalize re-trigger: `_onStreamingTurnFinalized` (called at the two
  normal stream-finalize sites) clears the accumulator AND re-runs the barrier
  check. This is essential for an all-fast batch whose every result landed
  mid-stream — once the stream ends there is no further tool-result event to
  re-arm the barrier, so without this re-check the held continuation would
  never fire (deadlock). The abort/recovery paths keep a plain clear; recovery
  re-runs the turn and its own finalize re-triggers the held barrier.

Corrected the prior (incorrect) NOTE that claimed turn-queue ordering made the
mid-stream bypass safe — abhagsain's logs disprove it: the continuation runs
after persist but still errors the pending siblings.

Tests (packages/think/src/tests/client-tools.test.ts) with a new mock model
(`createMidStreamParallelToolModel`) that emits a fast tool, holds the stream
open, then emits a slow tool:
- waits for a sibling emitted LATER in the same stream before continuing
  (the #1649 headline repro): fast tool resolved mid-stream, no premature
  continuation, no repair of the slow tool, fires once when the slow tool answers
- all-fast batch resolved entirely mid-stream: fires exactly once via the
  stream-finalize re-check (guards the deadlock)

Confidence-checked: both new tests FAIL with the gate + finalize re-trigger
disabled (the all-fast case fires 2 continuations), and pass with them. Full
client-tools suite 72/72; oxlint clean; all 91 projects typecheck.
@threepointone threepointone changed the title fix(think): make parallel-tool auto-continuation barrier event-driven (#1650) fix(think): event-driven auto-continuation barrier + stream-active gate (fixes #1649) Jun 3, 2026
#1663 tried to fix #1649 by making `_hasIncompleteToolBatch` read the streaming
accumulator (`_partsAreMidBatch`). That approach never landed (it couldn't see a
sibling the model hadn't streamed yet), and our fix uses the stream-active gate
instead — so #1663's four accumulator-scan UNIT tests target an implementation
we deliberately don't have and were not ported. Their behavioral intent is
covered end-to-end by the existing stream-active-gate tests.

Ported the two scenarios that carry real signal for our design:

- Different ordering from the headline race: BOTH parallel tool calls streamed
  up front and visible in the accumulator, the fast one answered mid-stream, the
  slow sibling answered only AFTER stream end. Guards the gate across stream
  finalize (would fire prematurely and error the slow sibling without it).
- Approval-path variant: a parallel batch whose slow sibling awaits an APPROVAL
  response rather than a result. Locks down that `_hasIncompleteToolBatch`
  treats `approval-requested` as pending so the barrier holds until approval,
  then fires exactly once (the approval analog of the existing parallel-results
  barrier test).

client-tools suite 74/74; oxlint clean; all 91 projects typecheck.
@threepointone
Copy link
Copy Markdown
Contributor Author

Ported coverage from the abandoned #1663 attempt

Reviewed #1663 (threepointone's earlier attempt at #1649). It tried to fix the bug by making _hasIncompleteToolBatch read the streaming accumulator (_partsAreMidBatch). That approach never landed — it can't see a sibling the model hasn't streamed yet, which is the actual race (per @abhagsain's debug logs on the issue). Its four accumulator-scan unit tests target that implementation, which this PR deliberately does not have (we gate on _streamingAssistant != null and never inspect sibling part-states), so they were not ported; their behavioral intent is already covered end-to-end by the stream-active-gate tests.

Notably, #1663 shipped a unit test, an integration test, and a full wrangler-dev e2e — all claimed to "fail without the fix" and reproduce the customer signature "deterministically across all 3 retries" — yet it didn't fix the customer's bug. The mock models put both tool calls into the accumulator before the fast result was answered, so the slow sibling was always visible. The lesson: a wrangler e2e is only as faithful as its mock model's emission ordering. Our headline test asserts streamingToolCallState("tc-slow") === undefined at the moment the fast tool is answered — the exact condition #1663's tests never created.

Ported the two scenarios that carry real signal for this design (commit 68f512b):

  • Different ordering — both parallel calls streamed up front and visible, fast answered mid-stream, slow answered only after stream end. Guards the gate across stream finalize.
  • Approval-path variant — a parallel batch whose slow sibling awaits an approval response; locks down that _hasIncompleteToolBatch treats approval-requested as pending so the barrier holds until approval, then fires once.

client-tools suite now 74/74; lint clean; 91/91 typecheck.

@threepointone threepointone merged commit 919bfaa into main Jun 3, 2026
4 checks passed
@threepointone threepointone deleted the think-event-driven-continuation-barrier branch June 3, 2026 17:41
@github-actions github-actions Bot mentioned this pull request Jun 3, 2026
threepointone added a commit that referenced this pull request Jun 3, 2026
…ery (#1671)

Follow-up to #1667. Two findings from review of the event-driven
auto-continuation barrier.

#1 (fix): The RPC streaming path (`_streamResultToRpcCallback`) re-armed
the auto-continuation coalesce timer in its `finally` even on the
stream-stall recovery early-returns (`scheduled`/`exhausted`). This is
unlike the WebSocket `_streamResult` recovery paths, which deliberately
do a plain `this._streamingAssistant = null` WITHOUT re-arming, because
the scheduled recovery continuation re-runs the turn and its own stream
finalize re-triggers the held barrier.

When a parallel tool batch had a pending continuation at the moment the
stall watchdog fired, the RPC re-arm scheduled a 50ms coalesce timer
that could fire `_fireAutoContinuation` alongside the alarm-scheduled
recovery continuation -> a spurious double model invocation on the turn
queue. The RPC recovery early-returns now mirror the WebSocket path: a
`skipFinalizeRearm` flag makes the `finally` do a plain clear instead of
`_onStreamingTurnFinalized()`, so the held barrier is re-triggered
exactly once by the recovery continuation.

#2 (document): The eviction self-healing path relies on the completing
result carrying `autoContinue: true` (it re-creates `_continuation.pending`
from the persisted transcript via `_scheduleAutoContinuation`). If the
completing result is an errored `autoContinue: false` sibling AFTER
eviction, `_rearmPendingAutoContinuationForBatch` finds no pending and
no-ops -- the continuation is silently dropped. This is NOT a regression
(the old in-memory 60s timer was equally eviction-fragile) and a later
user message / chat recovery repairs the transcript; fixing it properly
would require persisting the continuation-requested intent across
eviction. Pinned the current behavior with an explicit test rather than
changing it.

Tests:
- think-session.test.ts: asserts the coalesce timer is NOT re-armed
  after an RPC stall routes into recovery (while `_streamingAssistant`
  is still cleared). Adds `testStallRecoveryDoesNotRearmPendingContinuation`.
- client-tools.test.ts: documents the eviction + errored-completing-
  result gap (0 continuations, both results still applied to transcript).

npm run check passes; new + existing stall/self-heal tests green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

_scheduleAutoContinuation 50ms timer fires before all parallel client-tool results arrive → MissingToolResultsError

1 participant