Fix agent-tool recovery startup wedges #1604
Merged
Move stale agent-tool reconciliation out of the Agent startup gate so parent Durable Objects can boot even when child facets are stuck or recursively recovering. Startup now snapshots only the rows that were already stale before user onStart runs, schedules bounded reconciliation in waitUntil, and terminal-finalizes uninspectable rows as interrupted instead of retrying the same child initialization cascade forever. This also bounds child inspection and recovery chunk replay, reuses the already-resolved child adapter during replay, and adds AIChat/Think coverage for completed, running, stuck, scheduled, single-flight, and startup-ordering recovery scenarios. Co-authored-by: Cursor <cursoragent@cursor.com>
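The startup ordering described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `ToolRunRow` shape and the helpers `snapshotStaleRunIds` and `finalizeSnapshot` are hypothetical names, and the real recovery inspects each child before deciding on a terminal state rather than interrupting every snapshotted row.

```typescript
// Hypothetical row shape; the real cf_agent_tool_runs schema is not shown here.
type RunStatus = "running" | "completed" | "interrupted";

interface ToolRunRow {
  runId: string;
  status: RunStatus;
}

// Step 1: capture the ids of rows that are already non-terminal,
// before user onStart gets a chance to create new runs.
function snapshotStaleRunIds(rows: ToolRunRow[]): string[] {
  return rows.filter((row) => row.status === "running").map((row) => row.runId);
}

// Step 3: deferred recovery only touches rows from the snapshot.
// Simplification: every snapshotted row that is still running is interrupted.
function finalizeSnapshot(rows: ToolRunRow[], snapshot: string[]): void {
  const stale = new Set(snapshot);
  for (const row of rows) {
    if (stale.has(row.runId) && row.status === "running") {
      row.status = "interrupted";
    }
  }
}

const rows: ToolRunRow[] = [{ runId: "old-1", status: "running" }];

const snapshot = snapshotStaleRunIds(rows); // before user onStart
rows.push({ runId: "new-1", status: "running" }); // user onStart starts a fresh run
finalizeSnapshot(rows, snapshot); // later, in the deferred recovery task
```

Because the snapshot is taken before user code runs, the fresh `new-1` run is never a candidate for interruption, while `old-1` reaches a terminal state.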
🦋 Changeset detected
Latest commit: 52fa2e8
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Normalize getAgentToolChunks failures inside the bounded recovery helper so stale-run reconciliation treats chunk replay as best-effort even when the child rejects after inspection. This keeps the timeout helper self-contained and avoids reviewer ambiguity around the raced promise. Co-authored-by: Cursor <cursoragent@cursor.com>
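One way to keep a timeout helper self-contained is to normalize failures before racing, so the promise handed back to the caller can never reject, even if the child rejects after the timeout has already fired. The sketch below assumes a hypothetical helper name, `boundedBestEffort`, and a caller-supplied fallback value; it is not the helper from this PR.

```typescript
// Hypothetical helper: resolves with `fallback` when `work` rejects, throws,
// or takes longer than `timeoutMs`. The returned promise never rejects.
function boundedBestEffort<T>(
  work: () => Promise<T>,
  timeoutMs: number,
  fallback: T
): Promise<T> {
  return new Promise<T>((resolve) => {
    // Timeout path: give up and resolve with the fallback.
    const timer = setTimeout(() => resolve(fallback), timeoutMs);

    Promise.resolve()
      .then(work) // also captures synchronous throws from `work`
      .catch(() => fallback) // normalize failures locally
      .then((value) => {
        clearTimeout(timer);
        // If the timeout already won, this second resolve is a no-op.
        resolve(value);
      });
  });
}
```

A caller replaying recovery chunks could then write something like `boundedBestEffort(() => child.getAgentToolChunks(runId), 5000, [])`, where the `child` object, the argument, and the 5000 ms budget are all assumptions for illustration.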
agents
@cloudflare/ai-chat
@cloudflare/codemode
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
threepointone added a commit that referenced this pull request on May 28, 2026
* Fix agent-tool recovery startup wedges

  Move stale agent-tool reconciliation out of the Agent startup gate so parent Durable Objects can boot even when child facets are stuck or recursively recovering. Startup now snapshots only the rows that were already stale before user onStart runs, schedules bounded reconciliation in waitUntil, and terminal-finalizes uninspectable rows as interrupted instead of retrying the same child initialization cascade forever. This also bounds child inspection and recovery chunk replay, reuses the already-resolved child adapter during replay, and adds AIChat/Think coverage for completed, running, stuck, scheduled, single-flight, and startup-ordering recovery scenarios.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* Handle recovery chunk replay failures locally

  Normalize getAgentToolChunks failures inside the bounded recovery helper so stale-run reconciliation treats chunk replay as best-effort even when the child rejects after inspection. This keeps the timeout helper self-contained and avoids reviewer ambiguity around the raced promise.

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
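- Snapshot the agent-tool rows that are already stale before user onStart runs, then reconcile that snapshot in ctx.waitUntil with a single-flight guard.
- Terminal-finalize uninspectable rows as interrupted so future wakes do not retry the same wedged recovery cascade.

Why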
Issue #1595 reports parent Agent Durable Objects becoming permanently wedged when startup recovery encounters stale cf_agent_tool_runs rows. The old startup path awaited _reconcileAgentToolRuns() inside PartyServer's blockConcurrencyWhile startup gate. Reconciliation synchronously resolved child facets via _cf_resolveSubAgent() / _cf_initAsFacet(), which forces the child through its own startup and can recursively drive more recovery. If that tree exceeds the runtime startup budget, the parent resets before marking rows terminal, so the same stale rows wedge every subsequent wake.

This change keeps parent startup live by making startup recovery asynchronous and bounded. It preserves best-effort recovery of completed child results, but prioritizes liveness by interrupting rows that cannot be inspected promptly.
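The asynchronous path can be pictured as a single-flight task handed to waitUntil, where any row whose child cannot be inspected is finalized as interrupted. The sketch below is an assumption-laden illustration: `StartupRecovery`, `MinimalCtx`, and the `inspect` callback are hypothetical names, `inspect` is assumed to reject when a child is stuck, and the handling of children that are still running is omitted.

```typescript
type TerminalStatus = "completed" | "interrupted";

// Hypothetical stand-in for the part of the Durable Object context used here.
interface MinimalCtx {
  waitUntil(promise: Promise<unknown>): void;
}

class StartupRecovery {
  private inFlight: Promise<void> | null = null;
  private inspect: (runId: string) => Promise<"completed">;
  readonly finalized = new Map<string, TerminalStatus>();
  passes = 0;

  constructor(inspect: (runId: string) => Promise<"completed">) {
    this.inspect = inspect;
  }

  // Single-flight guard: concurrent callers share one reconciliation pass.
  schedule(ctx: MinimalCtx, staleRunIds: string[]): Promise<void> {
    if (this.inFlight !== null) return this.inFlight;
    const pass = this.reconcile(staleRunIds).finally(() => {
      this.inFlight = null;
    });
    this.inFlight = pass;
    ctx.waitUntil(pass); // runs outside the startup gate
    return pass;
  }

  private async reconcile(runIds: string[]): Promise<void> {
    this.passes += 1;
    for (const runId of runIds) {
      try {
        this.finalized.set(runId, await this.inspect(runId));
      } catch {
        // Uninspectable child: mark terminal so later wakes do not retry it.
        this.finalized.set(runId, "interrupted");
      }
    }
  }
}
```

The key design point is that every snapshotted row ends the pass in a terminal state, so a stuck child costs at most one bounded attempt instead of wedging each subsequent wake.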
Behavior changes
- onStart() may now observe stale agent-tool rows that recovery will finalize shortly afterward.
- onAgentToolFinish() hooks now run after startup, not before user onStart() completes.
- Agent-tool runs started during onStart() or immediately after startup are not accidentally interrupted by the startup recovery task.

Test plan
- npx nx run agents:build
- cd packages/ai-chat && npm run test:workers -- src/tests/agent-tools.test.ts
- cd packages/think && npm run test:workers -- src/tests/agent-tools.test.ts
- npm run check

Notes:
- npm run check passes. Sherif still reports the existing warnings for example workspace folders that do not have package.json files:
  - examples/agent-skills/package.json
  - examples/think-tanstack-start/package.json
  - examples/think-react-router/package.json

Coverage added
- runAgentTool(ThinkTestAgent, ...) happy path.
- [...] interrupted.
- [...] interrupted.
- [...] onStart() are not interrupted by startup recovery.

Made with Cursor