fix(agent-tools): forward proxied sub-agent tool events stuck at input-available (#1589)#1827
Merged
Merged
Conversation
…t-available (#1589) `tailAgentToolRun` (in both `AIChatAgent` and `Think`) drained the stored chunk backlog and only afterwards attached its live forwarder, with `await` boundaries in between. Any chunk the child stored AND broadcast during that window was neither in the drained snapshot nor live-forwarded, so it silently vanished from the parent's stream — leaving tool parts (notably `tool-output-available`) stuck at `input-available` in `useAgentToolEvents`. A network-paced proxied remote stream (a sub-agent returning a remote `toUIMessageStreamResponse()`) hits this window constantly; a fast local child mostly avoids it. The forwarder is now registered BEFORE the backlog is drained, with live chunks buffered and replayed in order and deduped by sequence. Think also realigns its live sequence to the true high-water mark so a post-restart re-attach can't collide. Hardened forwarder teardown so an empty forwarder/ sequence map entry can't keep the broadcast idle-guard hot for the DO's lifetime, and a drain/read failure detaches before surfacing. Also documents why the per-run live sequence counter is deliberately separate from the resumable store's chunk_index: progress/milestone frames (`reportProgress`) are forwarded but NOT durably stored, so they depend on the live counter — sequencing forwards off the stored count would collide them and the tail's dedupe would silently drop them. Tests: attach-window forwarding (#1589) and non-stored progress/milestone forwarding through the replay->live handoff, for both ai-chat and think. Co-authored-by: Cursor <cursoragent@cursor.com>
🦋 Changeset detectedLatest commit: 9e1a835 The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
agents
@cloudflare/ai-chat
@cloudflare/codemode
create-think
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
commit: |
…estart) AIChatAgent's `tailAgentToolRun` drained the stored backlog but never realigned the in-memory live sequence counter afterwards, unlike Think. On a re-attach after the child's Durable Object restarts or wakes from hibernation, `_agentToolLiveSequences` is cold (empty) while the durable backlog sits at N, and chat-recovery resumes the turn via `tailAgentToolRun` without re-running `startAgentToolRun` (which seeds the counter). The broadcast snoop then handed the recovered turn's new chunks sequences from 0 — all <= N — and `emit`'s high-water dedupe silently dropped every one, leaving the parent permanently stuck with no post-restart chunks. Realign the counter to the stored high-water mark after the drain (matching Think). On a warm attach the counter is already in lockstep, so this is a no-op; it only bites the cold post-restart path. Adds a deterministic regression test that seeds a running run with a backlog, wipes the live counter, re-attaches, and asserts a subsequent broadcast forwards at N+1 instead of being dropped. Caught by Devin review on #1827. Co-authored-by: Cursor <cursoragent@cursor.com>
Three defensive gaps in AIChatAgent's `tailAgentToolRun` (all present in the Think version) surfaced now that the forwarder is registered before the drain, which widened the window in which a consumer can detach: - Empty `cancel` handler left a zombie forwarder registered after a consumer cancelled its reader. `closed`/`forward`/cleanup are now hoisted out of `start` so `cancel` can reach them, and `cancel` marks the tail dead and detaches the forwarder. - Unguarded `controller.close()` in `close()` (and the `emit` catch calling `close()`) could throw on an already-cancelled stream; that throw propagated out of `interceptAgentToolBroadcast`'s forward loop and starved the run's sibling tailers of the chunk. `controller.close()` is now guarded and the `emit` catch just detaches instead of closing. - Unguarded `controller.error()` in the catch path could throw if the stream was already torn down; now wrapped. Extracted a shared `detach()` (mirrors Think) so the forwarder-set cleanup is no longer duplicated across close/error paths. Adds a regression test that tails a run from two consumers, cancels one, and asserts the sibling still receives a subsequent broadcast (verified it fails without the fix). Caught by Devin review on #1827. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Awesome @threepointone .. Thanks!! I can see 1 more issue that i've created here -- #1835 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1589. When an agent-tool child proxies a remote
toUIMessageStreamResponse(), the sub-agent's tool invocations got stuck atinput-availablein the parent UI (useAgentToolEvents) —tool-output-availableand other frames never arrived.Root cause is a race in
tailAgentToolRun(present in bothAIChatAgentandThink):getAgentToolChunks), thenwith
awaitboundaries in between. Any chunk the child stored AND broadcast during that drain↔register window was neither in the drained snapshot nor live-forwarded — it silently vanished from the parent's stream. A network-paced proxied remote stream lands chunks in this window constantly (a fast local child mostly skips through it), which is why this only reproduced for the remote-proxy setup in the issue.What changed
sequencededupes the replay→live handoff so each chunk is emitted exactly once. (packages/ai-chat/src/index.ts,packages/think/src/think.ts)packages/agents/src/chat/agent-tools.ts): the per-run live sequence counter is deliberately separate from the resumable store'schunk_index.reportProgress()progress/milestone frames are forwarded on the chat-response wire but are not durably stored, so they depend on the live counter. Sourcing forward sequencing from the stored count would collide these non-stored frames with the last stored chunk and the tail's dedupe would silently drop them. The comment exists to prevent a "simplification" that would reintroduce that regression.Tests
Added to both
@cloudflare/ai-chatand@cloudflare/thinkagent-tool suites:toUIMessageStreamResponse()#1589): deterministically injects a chunk into the drain↔register window and asserts it reaches the parent verbatim. Fails pre-fix (chunk dropped), passes post-fix.reportProgressprogress + milestone frames (no storedchunk_index) in the same window and asserts they survive the replay→live handoff — guarding the dual-counter invariant above.Test plan
ai-chatagent-tool suite (30 tests) passthinkagent-tool suite (33 tests) passMade with Cursor