Fix agent-tool recovery startup wedges by threepointone · Pull Request #1604 · cloudflare/agents

threepointone · 2026-05-28T11:30:45Z

Summary

Move stale agent-tool run reconciliation out of the Agent startup gate so parent Durable Objects can boot even when child facets are stuck or recursively recovering.
Snapshot only rows that were stale before user onStart runs, then reconcile that snapshot in ctx.waitUntil with a single-flight guard.
Bound child inspection and recovery chunk replay; timed-out or uninspectable rows are terminal-finalized as interrupted so future wakes do not retry the same wedged recovery cascade.
Reuse the already-resolved child adapter for recovery chunk replay to avoid an extra child facet initialization during reconciliation.
Add AIChat and Think coverage for completed, running, stuck, scheduled, single-flight, and Think startup-ordering recovery scenarios.
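
The single-flight background pass described in the summary could look roughly like the sketch below. It is an illustration only, not the code from this PR: `createRecoveryScheduler`, `reconcileRun`, and the `waitUntil` callback (a stand-in for `ctx.waitUntil`) are hypothetical names.

```typescript
// Hypothetical stand-in for ctx.waitUntil: keeps background work alive.
type WaitUntil = (task: Promise<unknown>) => void;

interface RecoveryDeps {
  // Hypothetical per-row recovery; the real logic lives inside the Agent class.
  reconcileRun: (runId: string) => Promise<void>;
  waitUntil: WaitUntil;
}

function createRecoveryScheduler(deps: RecoveryDeps) {
  let inFlight: Promise<void> | null = null;

  return function schedule(snapshot: readonly string[]): Promise<void> {
    // Single-flight: a second caller joins the running pass instead of
    // starting another one.
    if (inFlight) return inFlight;

    const pass = (async () => {
      for (const runId of snapshot) {
        try {
          await deps.reconcileRun(runId);
        } catch {
          // Best-effort: one failing row must not block the rest.
        }
      }
    })();

    // Clear the guard in a continuation so it also resets when the
    // snapshot is empty and the pass finishes immediately.
    const task = pass.then(() => {
      inFlight = null;
    });
    inFlight = task;
    deps.waitUntil(task); // the pass runs outside the startup gate
    return task;
  };
}
```

Because the guard is a stored promise rather than a boolean, overlapping triggers (for example startup and a scheduled wake) receive the same promise and can await the same pass.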

Why

Issue #1595 reports parent Agent Durable Objects becoming permanently wedged when startup recovery encounters stale cf_agent_tool_runs rows. The old startup path awaited _reconcileAgentToolRuns() inside PartyServer's blockConcurrencyWhile startup gate. Reconciliation synchronously resolved child facets via _cf_resolveSubAgent() / _cf_initAsFacet(), which forces the child through its own startup and can recursively drive more recovery. If that tree exceeds the runtime startup budget, the parent resets before marking rows terminal, so the same stale rows wedge every subsequent wake.

This change keeps parent startup live by making startup recovery asynchronous and bounded. It preserves best-effort recovery of completed child results, but prioritizes liveness by interrupting rows that cannot be inspected promptly.
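
A minimal sketch of what "bounded" inspection might mean in practice, assuming the child is asked for its state through some async call. The names `bounded`, `finalizeStaleRun`, `inspectChild`, and the shapes of `Inspection` and `TerminalState` are hypothetical and are not taken from the PR diff.

```typescript
type Inspection =
  | { status: "completed"; output: unknown }
  | { status: "running" };

type TerminalState =
  | { status: "completed"; output: unknown }
  | { status: "interrupted"; reason: string };

type Bounded<T> =
  | { kind: "ok"; value: T }
  | { kind: "failed" }
  | { kind: "timeout" };

// Run `work` with a deadline. Rejections and synchronous throws are folded
// into the result, so nothing escapes as an unhandled rejection.
function bounded<T>(work: () => Promise<T>, ms: number): Promise<Bounded<T>> {
  const settled: Promise<Bounded<T>> = Promise.resolve()
    .then(work)
    .then(
      (value): Bounded<T> => ({ kind: "ok", value }),
      (): Bounded<T> => ({ kind: "failed" })
    );
  return new Promise<Bounded<T>>((resolve) => {
    const timer = setTimeout(() => resolve({ kind: "timeout" }), ms);
    settled.then((outcome) => {
      clearTimeout(timer);
      resolve(outcome);
    });
  });
}

// Every path ends in a terminal state, so the same row is not retried
// on the next wake.
async function finalizeStaleRun(
  inspectChild: () => Promise<Inspection>,
  timeoutMs: number
): Promise<TerminalState> {
  const outcome = await bounded(inspectChild, timeoutMs);
  if (outcome.kind === "timeout") {
    return { status: "interrupted", reason: "inspection timed out" };
  }
  if (outcome.kind === "failed") {
    return { status: "interrupted", reason: "child could not be inspected" };
  }
  const child = outcome.value;
  if (child.status === "completed") return child; // preserve the completed result
  return { status: "interrupted", reason: "child was still running" };
}
```

The point of the sketch is the liveness trade-off: a slow or broken child costs at most `timeoutMs` and then yields an `interrupted` row, while a child that answers in time keeps its completed result.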

Behavior changes

onStart() may now observe stale agent-tool rows that recovery will finalize shortly afterward.
Recovered onAgentToolFinish() hooks now run after startup, not before user onStart() completes.
Clients may briefly replay a stale running tool before receiving the recovered terminal event.
Recovery is scoped to rows that were already stale before user startup began, so tool runs created during onStart() or immediately after startup are not accidentally interrupted by the startup recovery task.
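
The snapshot scoping in the last point can be illustrated with a small in-memory model. This is a sketch under stated assumptions: `ToolRunRow`, `snapshotStaleRunIds`, and `interruptSnapshot` are hypothetical, "stale" is assumed to mean "still marked running when the parent wakes", and the real recovery inspects the child first rather than interrupting unconditionally.

```typescript
type RunStatus = "running" | "completed" | "interrupted";

interface ToolRunRow {
  runId: string;
  status: RunStatus;
}

// Capture the IDs of rows that are non-terminal right now.
function snapshotStaleRunIds(rows: readonly ToolRunRow[]): Set<string> {
  return new Set(
    rows.filter((row) => row.status === "running").map((row) => row.runId)
  );
}

// Finalize only rows that are in the snapshot and still running;
// anything created after the snapshot is left untouched.
function interruptSnapshot(
  rows: readonly ToolRunRow[],
  snapshot: ReadonlySet<string>
): ToolRunRow[] {
  return rows.map((row): ToolRunRow =>
    row.status === "running" && snapshot.has(row.runId)
      ? { ...row, status: "interrupted" }
      : row
  );
}

const table: ToolRunRow[] = [
  { runId: "old-1", status: "running" },
  { runId: "old-2", status: "completed" },
];

// Taken before user onStart() runs.
const staleSnapshot = snapshotStaleRunIds(table);

// A run created during onStart() is not part of the snapshot.
table.push({ runId: "new-1", status: "running" });

const afterRecovery = interruptSnapshot(table, staleSnapshot);
```

In this model only `old-1` is finalized as interrupted; `old-2` keeps its completed state and `new-1` keeps running.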

Test plan

npx nx run agents:build
cd packages/ai-chat && npm run test:workers -- src/tests/agent-tools.test.ts
cd packages/think && npm run test:workers -- src/tests/agent-tools.test.ts
npm run check

Notes: npm run check passes. Sherif still reports the existing warnings for example workspace folders that do not have package.json files:

examples/agent-skills/package.json
examples/think-tanstack-start/package.json
examples/think-react-router/package.json

Coverage added

AIChat parent recovery finalizes stuck child facet startup via timeout.
AIChat scheduled recovery finalizes stale rows and remains single-flight.
Think parent runAgentTool(ThinkTestAgent, ...) happy path.
Think completed child recovery preserves completed terminal state.
Think still-running child recovery becomes interrupted.
Think stuck child facet startup times out and finalizes as interrupted.
Think scheduled recovery finalizes stale rows and remains single-flight.
Think startup-ordering test proves startup returns while stale rows are still running, then background recovery finalizes only the pre-startup snapshot.
Think startup snapshot test proves rows created during user onStart() are not interrupted by startup recovery.

Made with Cursor

Move stale agent-tool reconciliation out of the Agent startup gate so parent Durable Objects can boot even when child facets are stuck or recursively recovering. Startup now snapshots only the rows that were already stale before user onStart runs, schedules bounded reconciliation in waitUntil, and terminal-finalizes uninspectable rows as interrupted instead of retrying the same child initialization cascade forever. This also bounds child inspection and recovery chunk replay, reuses the already-resolved child adapter during replay, and adds AIChat/Think coverage for completed, running, stuck, scheduled, single-flight, and startup-ordering recovery scenarios.

Co-authored-by: Cursor <cursoragent@cursor.com>

changeset-bot · 2026-05-28T11:30:50Z

🦋 Changeset detected

Latest commit: 52fa2e8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
agents	Patch

devin-ai-integration

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Normalize getAgentToolChunks failures inside the bounded recovery helper so stale-run reconciliation treats chunk replay as best-effort even when the child rejects after inspection. This keeps the timeout helper self-contained and avoids reviewer ambiguity around the raced promise.

Co-authored-by: Cursor <cursoragent@cursor.com>
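
One way to read "normalize failures inside the bounded helper" is to attach the error handling to the promise before it enters the race, so a rejection that arrives after the timeout has already won is still handled. The sketch below assumes that reading; `replayChunksBestEffort` and its `getChunks` parameter are hypothetical, and the real signature of `getAgentToolChunks` is not shown in this conversation.

```typescript
async function replayChunksBestEffort<T>(
  getChunks: () => Promise<T[]>, // hypothetical stand-in for the child call
  timeoutMs: number
): Promise<T[]> {
  // Normalize before racing: a synchronous throw or a rejection (even one
  // that arrives after the timeout has won) becomes an empty replay.
  const chunks: Promise<T[]> = Promise.resolve()
    .then(getChunks)
    .catch((): T[] => []);

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T[]>((resolve) => {
    timer = setTimeout(() => resolve([]), timeoutMs);
  });

  try {
    return await Promise.race([chunks, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

With this shape the caller never needs its own `try`/`catch` around chunk replay: the helper always resolves, either with the chunks or with an empty array.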

pkg-pr-new · 2026-05-28T11:59:40Z

agents

npm i https://pkg.pr.new/agents@1604

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1604

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1604

hono-agents

npm i https://pkg.pr.new/hono-agents@1604

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1604

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1604

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1604

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1604

commit: 52fa2e8

* Fix agent-tool recovery startup wedges

  Move stale agent-tool reconciliation out of the Agent startup gate so parent Durable Objects can boot even when child facets are stuck or recursively recovering. Startup now snapshots only the rows that were already stale before user onStart runs, schedules bounded reconciliation in waitUntil, and terminal-finalizes uninspectable rows as interrupted instead of retrying the same child initialization cascade forever. This also bounds child inspection and recovery chunk replay, reuses the already-resolved child adapter during replay, and adds AIChat/Think coverage for completed, running, stuck, scheduled, single-flight, and startup-ordering recovery scenarios.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* Handle recovery chunk replay failures locally

  Normalize getAgentToolChunks failures inside the bounded recovery helper so stale-run reconciliation treats chunk replay as best-effort even when the child rejects after inspection. This keeps the timeout helper self-contained and avoids reviewer ambiguity around the raced promise.

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

devin-ai-integration Bot reviewed May 28, 2026

Comment thread packages/agents/src/index.ts Outdated

threepointone merged commit dfb3ecd into main May 28, 2026
4 checks passed

threepointone deleted the fix-agent-tool-startup-recovery branch May 28, 2026 12:00

github-actions Bot mentioned this pull request May 28, 2026

Version Packages #1597

Open

threepointone mentioned this pull request May 28, 2026

onStart recovery cascade (_reconcileAgentToolRuns → _cf_initAsFacet) exceeds blockConcurrencyWhile budget and permanently wedges parent DO #1595

Closed

threepointone merged 2 commits into main from fix-agent-tool-startup-recovery
