Skip to content

Fix agent-tool recovery startup wedges#1604

Merged
threepointone merged 2 commits into
mainfrom
fix-agent-tool-startup-recovery
May 28, 2026
Merged

Fix agent-tool recovery startup wedges#1604
threepointone merged 2 commits into
mainfrom
fix-agent-tool-startup-recovery

Conversation

@threepointone
Copy link
Copy Markdown
Contributor

@threepointone threepointone commented May 28, 2026

Summary

  • Move stale agent-tool run reconciliation out of the Agent startup gate so parent Durable Objects can boot even when child facets are stuck or recursively recovering.
  • Snapshot only rows that were stale before user onStart runs, then reconcile that snapshot in ctx.waitUntil with a single-flight guard.
  • Bound child inspection and recovery chunk replay; timed-out or uninspectable rows are terminal-finalized as interrupted so future wakes do not retry the same wedged recovery cascade.
  • Reuse the already-resolved child adapter for recovery chunk replay to avoid an extra child facet initialization during reconciliation.
  • Add AIChat and Think coverage for completed, running, stuck, scheduled, single-flight, and Think startup-ordering recovery scenarios.

Why

Issue #1595 reports parent Agent Durable Objects becoming permanently wedged when startup recovery encounters stale cf_agent_tool_runs rows. The old startup path awaited _reconcileAgentToolRuns() inside PartyServer's blockConcurrencyWhile startup gate. Reconciliation synchronously resolved child facets via _cf_resolveSubAgent() / _cf_initAsFacet(), which forces the child through its own startup and can recursively drive more recovery. If that tree exceeds the runtime startup budget, the parent resets before marking rows terminal, so the same stale rows wedge every subsequent wake.

This change keeps parent startup live by making startup recovery asynchronous and bounded. It preserves best-effort recovery of completed child results, but prioritizes liveness by interrupting rows that cannot be inspected promptly.

Behavior changes

  • onStart() may now observe stale agent-tool rows that recovery will finalize shortly afterward.
  • Recovered onAgentToolFinish() hooks now run after startup, not before user onStart() completes.
  • Clients may briefly replay a stale running tool before receiving the recovered terminal event.
  • Recovery is scoped to rows that were already stale before user startup began, so tool runs created during onStart() or immediately after startup are not accidentally interrupted by the startup recovery task.

Test plan

  • npx nx run agents:build
  • cd packages/ai-chat && npm run test:workers -- src/tests/agent-tools.test.ts
  • cd packages/think && npm run test:workers -- src/tests/agent-tools.test.ts
  • npm run check

Notes: npm run check passes. Sherif still reports the existing warnings for example workspace folders that do not have package.json files:

  • examples/agent-skills/package.json
  • examples/think-tanstack-start/package.json
  • examples/think-react-router/package.json

Coverage added

  • AIChat parent recovery finalizes stuck child facet startup via timeout.
  • AIChat scheduled recovery finalizes stale rows and remains single-flight.
  • Think parent runAgentTool(ThinkTestAgent, ...) happy path.
  • Think completed child recovery preserves completed terminal state.
  • Think still-running child recovery becomes interrupted.
  • Think stuck child facet startup times out and finalizes as interrupted.
  • Think scheduled recovery finalizes stale rows and remains single-flight.
  • Think startup-ordering test proves startup returns while stale rows are still running, then background recovery finalizes only the pre-startup snapshot.
  • Think startup snapshot test proves rows created during user onStart() are not interrupted by startup recovery.

Made with Cursor


Open in Devin Review

Move stale agent-tool reconciliation out of the Agent startup gate so parent Durable Objects can boot even when child facets are stuck or recursively recovering. Startup now snapshots only the rows that were already stale before user onStart runs, schedules bounded reconciliation in waitUntil, and terminal-finalizes uninspectable rows as interrupted instead of retrying the same child initialization cascade forever.

This also bounds child inspection and recovery chunk replay, reuses the already-resolved child adapter during replay, and adds AIChat/Think coverage for completed, running, stuck, scheduled, single-flight, and startup-ordering recovery scenarios.

Co-authored-by: Cursor <cursoragent@cursor.com>
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 28, 2026

🦋 Changeset detected

Latest commit: 52fa2e8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
agents Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment thread packages/agents/src/index.ts Outdated
Normalize getAgentToolChunks failures inside the bounded recovery helper so stale-run reconciliation treats chunk replay as best-effort even when the child rejects after inspection. This keeps the timeout helper self-contained and avoids reviewer ambiguity around the raced promise.

Co-authored-by: Cursor <cursoragent@cursor.com>
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented May 28, 2026

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1604

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1604

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1604

hono-agents

npm i https://pkg.pr.new/hono-agents@1604

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1604

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1604

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1604

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1604

commit: 52fa2e8

@threepointone threepointone merged commit dfb3ecd into main May 28, 2026
4 checks passed
@threepointone threepointone deleted the fix-agent-tool-startup-recovery branch May 28, 2026 12:00
@github-actions github-actions Bot mentioned this pull request May 28, 2026
threepointone added a commit that referenced this pull request May 28, 2026
* Fix agent-tool recovery startup wedges

Move stale agent-tool reconciliation out of the Agent startup gate so parent Durable Objects can boot even when child facets are stuck or recursively recovering. Startup now snapshots only the rows that were already stale before user onStart runs, schedules bounded reconciliation in waitUntil, and terminal-finalizes uninspectable rows as interrupted instead of retrying the same child initialization cascade forever.

This also bounds child inspection and recovery chunk replay, reuses the already-resolved child adapter during replay, and adds AIChat/Think coverage for completed, running, stuck, scheduled, single-flight, and startup-ordering recovery scenarios.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Handle recovery chunk replay failures locally

Normalize getAgentToolChunks failures inside the bounded recovery helper so stale-run reconciliation treats chunk replay as best-effort even when the child rejects after inspection. This keeps the timeout helper self-contained and avoids reviewer ambiguity around the raced promise.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant