
fix(chat): orphaned-stream recovery no longer merges a new turn into the previous message (#1691) #1693

Merged: threepointone merged 2 commits into main from fix-orphan-stream-message-merge-1691 on Jun 6, 2026.

@threepointone commented Jun 6, 2026

Summary

Fixes #1691. When an `AIChatAgent` stream is interrupted before its assistant message is persisted (Durable Object hibernation, deploy churn, isolate restart, reconnect), orphan recovery reconstructs the message from stored chunks. If those chunks carry no provider `start.messageId` (the common case with `streamText(...).toUIMessageStreamResponse()`, where the id is assigned client-side), recovery used to fall back to the last assistant message in history.

That fallback is correct for a continuation, but wrong for a normal new turn that follows a later user message: the recovered chunks were appended onto the previous assistant message, corrupting both the persisted transcript and the context the model sees on future turns.

The fix

Core

  • `ResumableStream` now persists the allocated assistant message id in stream metadata (a `message_id` column, added via a one-time, schema-checked migration) and exposes `getStreamMessageId()`.
  • `_persistOrphanedStream` keys recovery on that stored id when the chunks carry no provider `start.messageId`. A new turn therefore becomes its own message, while a continuation still merges into the message it was extending (because it stored the cloned last-assistant id). A provider `start.messageId` still wins when present. Pre-migration rows keep the legacy last-assistant fallback.
  • Dropped the now-unused `is_continuation` metadata column.
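
The id precedence in the bullets above can be sketched as a pure function. This is a minimal sketch under stated assumptions: `resolveRecoveryTarget`, `mergesIntoExisting`, and the input/message shapes are hypothetical names for illustration, not the actual `_persistOrphanedStream` code, and treating "no target" as "append a new message" is an assumption.

```typescript
// Hypothetical, simplified shapes for illustration; not the real
// @cloudflare/ai-chat types.
type Role = "user" | "assistant" | "system";

interface StoredMessage {
  id: string;
  role: Role;
}

interface OrphanRecoveryInput {
  /** `messageId` from a provider `start` chunk, if the stream carried one. */
  providerStartMessageId?: string;
  /** Id read from stream metadata (`message_id`); null for pre-migration rows. */
  storedStreamMessageId: string | null;
  /** Persisted conversation history, oldest first. */
  history: StoredMessage[];
}

type RecoverySource = "provider" | "stored" | "legacy-last-assistant" | "none";

interface RecoveryTarget {
  source: RecoverySource;
  messageId: string | undefined;
}

// Precedence: provider start.messageId, then the stored stream message id,
// then (pre-migration rows only) the last assistant message in history.
function resolveRecoveryTarget(input: OrphanRecoveryInput): RecoveryTarget {
  if (input.providerStartMessageId) {
    return { source: "provider", messageId: input.providerStartMessageId };
  }
  if (input.storedStreamMessageId) {
    // A new turn stored a freshly allocated id; a continuation stored the
    // cloned last-assistant id, so it still merges into the message it extends.
    return { source: "stored", messageId: input.storedStreamMessageId };
  }
  // Legacy fallback for rows written before the message_id column existed.
  for (let i = input.history.length - 1; i >= 0; i--) {
    const msg = input.history[i];
    if (msg && msg.role === "assistant") {
      return { source: "legacy-last-assistant", messageId: msg.id };
    }
  }
  return { source: "none", messageId: undefined };
}

// Merge only when the target id already exists in history; otherwise the
// recovered chunks become a new assistant message (assumed behavior).
function mergesIntoExisting(target: RecoveryTarget, history: StoredMessage[]): boolean {
  return target.messageId !== undefined && history.some((m) => m.id === target.messageId);
}
```

For a history `u1, a1, u2` where the interrupted new turn stored id `a2`, the stored id wins and `mergesIntoExisting` is `false`, so `a1` is left untouched. The same input with `storedStreamMessageId: null` falls back to `a1`, which is exactly the pre-fix merge that pre-migration rows still get.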

Two related variants of the same corruption (found during review, fixed here)

  • Duplicate tool parts: an early persist followed by recovery (e.g. a tool-approval pause) re-appended chunks that had already been stored, duplicating a tool call's parts. Recovery now skips reconstructed parts whose `toolCallId` already exists on the message.
  • Lost-partial new turns: a new turn interrupted before any assistant part was persisted (cut off before the first chunk materialized, or discarded via `onChatRecovery` returning `{ persist: false }`) was "continued" by cloning the previous assistant message and merging into it. `_handleInternalFiberRecovery` now detects that the conversation leaf is still the unanswered user message (so there is no partial to continue) and re-runs the turn fresh, so its response becomes its own message.
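
Both guards reduce to small checks. A minimal sketch, assuming simplified part and message shapes: `appendRecoveredParts`, `shouldRerunFresh`, and the types (and the `tool-getWeather` part type in the example) are hypothetical stand-ins for logic inside the real recovery path, including `_handleInternalFiberRecovery`, not its actual code.

```typescript
// Hypothetical, simplified shapes; real UI message parts carry more fields.
interface MessagePart {
  type: string;
  toolCallId?: string;
  text?: string;
}

interface ChatMessage {
  id: string;
  role: "user" | "assistant" | "system";
}

// Duplicate-tool-part guard: drop reconstructed parts whose toolCallId is
// already present on the persisted message (e.g. stored by an early persist
// during a tool-approval pause). Parts without a toolCallId pass through.
function appendRecoveredParts(existing: MessagePart[], recovered: MessagePart[]): MessagePart[] {
  const persistedToolCallIds = new Set<string>();
  for (const part of existing) {
    if (part.toolCallId !== undefined) persistedToolCallIds.add(part.toolCallId);
  }
  const fresh = recovered.filter(
    (part) => part.toolCallId === undefined || !persistedToolCallIds.has(part.toolCallId),
  );
  return [...existing, ...fresh];
}

// Lost-partial guard: if the newest message (the leaf) is still the user's,
// nothing of the assistant reply survived, so re-run the turn instead of
// cloning and extending the previous assistant message.
function shouldRerunFresh(history: ChatMessage[]): boolean {
  const leaf = history[history.length - 1];
  return leaf !== undefined && leaf.role === "user";
}
```

Keying the dedup on `toolCallId` rather than comparing whole parts means a tool part re-emitted with a different state still counts as already stored.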

`@cloudflare/think` is unaffected: its session-tree recovery already allocates a distinct message id per orphan and never falls back to the last assistant message.

Tests

  • New regression and wiring tests in `durable-chat-recovery`, `resumable-streaming`, and the test worker, covering the fiber-continuation happy path and the two edge cases (empty partial, `persist: false`) that previously merged.
  • The full recovery suites and `pnpm run check` (93 projects) pass.

Verification (real LLMs)

A SIGKILL-mid-stream / restart harness (`wip/issue-1691-live/`, included and documented) drives the exact #1691 sequence against real models:

  • Isolation: across Workers AI, OpenAI, and Anthropic (and Think), the recovered turn always lands as its own message and the previous turn is untouched.
  • Continuation quality (large partials): all three providers continue cleanly (0 resets, 0 duplicates); OpenAI and Anthropic resume a truncated partial to completion. Methodology note: the continuation runs in a scheduled alarm roughly 10–13 s after recovery, so the harness waits for the message to stabilize before measuring.
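
Because the continuation lands roughly 10–13 s after recovery, a measurement taken too early sees only the truncated partial. One way to express "wait for the message to stabilize" is a quiet-period check over sampled snapshots. This is a minimal sketch with hypothetical names (`Sample`, `stabilizedAt`); it is not the code in `wip/issue-1691-live/`.

```typescript
// Hypothetical snapshot of the persisted assistant message text over time.
interface Sample {
  atMs: number; // when the snapshot was taken
  text: string; // persisted message text at that moment
}

// Returns the time of the last change if the text has then stayed unchanged
// for at least quietMs up to the newest sample; otherwise undefined.
function stabilizedAt(samples: readonly Sample[], quietMs: number): number | undefined {
  let lastChangeAt: number | undefined;
  let prevText: string | undefined;
  let lastSeenAt: number | undefined;
  for (const s of samples) {
    // The first sample counts as a change; later ones only if the text differs.
    if (lastChangeAt === undefined || s.text !== prevText) lastChangeAt = s.atMs;
    prevText = s.text;
    lastSeenAt = s.atMs;
  }
  if (lastChangeAt === undefined || lastSeenAt === undefined) return undefined;
  return lastSeenAt - lastChangeAt >= quietMs ? lastChangeAt : undefined;
}
```

With two identical samples at 0 s and 5 s, a 3 s quiet window already reports "stable" at 0 s, before the alarm-driven continuation arrives. A quiet window longer than the alarm delay avoids that false positive.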

Changeset

`@cloudflare/ai-chat` and `agents`: patch.

…the previous message (#1691)

When an AIChatAgent stream is interrupted before its assistant message is
persisted (Durable Object hibernation, deploy churn, isolate restart,
reconnect), orphan recovery reconstructs the message from stored chunks. If the
chunks carry no provider `start.messageId` — the common case with
`streamText(...).toUIMessageStreamResponse()`, where the id is assigned
client-side — recovery used to fall back to the LAST assistant message in
history. That is correct for a continuation, but wrong for a normal new turn
after a later user message: the recovered chunks were appended onto the
PREVIOUS assistant message, corrupting both the persisted transcript and future
model context.

Core fix
- ResumableStream now persists the allocated assistant message id in stream
  metadata (`message_id` column, added via a one-time, schema-checked
  migration) and exposes `getStreamMessageId()`.
- `_persistOrphanedStream` keys recovery on that stored id when the chunks carry
  no provider `start.messageId`, so a new turn becomes its own message and a
  continuation still merges into the message it was extending (it stored the
  cloned last-assistant id). A provider `start.messageId` still wins when
  present. Pre-migration rows keep the legacy last-assistant fallback.
- Dropped the now-unused `is_continuation` metadata column.

Two related variants of the same corruption on the durable (chatRecovery)
continuation path, found during review and fixed here:
- Early-persist + recovery (e.g. a tool-approval pause) re-appended chunks it
  had already stored, duplicating a tool call's parts. Recovery now skips
  reconstructed parts whose `toolCallId` already exists on the message.
- A new turn interrupted before any assistant part was persisted — cut off
  before the first chunk materialized, or discarded via
  `onChatRecovery` returning `{ persist: false }` — was "continued" by cloning
  the previous assistant message and merging into it. `_handleInternalFiberRecovery`
  now detects that the conversation leaf is still the unanswered user message
  (no partial to continue) and re-runs the turn fresh, so it becomes its own
  message.

@cloudflare/think is unaffected — its session-tree recovery already allocates a
distinct message id per orphan and never falls back to the last assistant
message.

Tests
- New regression + wiring tests in durable-chat-recovery, resumable-streaming,
  and the test worker, including the fiber-continuation happy path and the two
  edge cases (empty partial, `persist: false`) that previously merged.

Verification
- Verified live against real LLMs (Workers AI, OpenAI, Anthropic) and Think via
  a SIGKILL-mid-stream / restart harness (wip/issue-1691-live): the recovered
  turn always lands as its own message and the previous turn is untouched.
- Cross-model continuation with large partials is clean (no duplication, no
  restarts); OpenAI and Anthropic resume a truncated partial to completion. The
  harness and its methodology notes are documented in its README.

@changeset-bot commented Jun 6, 2026

🦋 Changeset detected

Latest commit: 63f8da4

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages:

| Name | Type |
| --- | --- |
| @cloudflare/ai-chat | Patch |
| agents | Patch |


@devin-ai-integration left a review

Devin Review found 2 potential issues.

View 3 additional findings in Devin Review.

Review comment threads: `packages/ai-chat/src/tests/worker.ts`, `packages/ai-chat/src/index.ts`

@pkg-pr-new commented Jun 6, 2026

agents

npm i https://pkg.pr.new/agents@1693

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1693

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1693

hono-agents

npm i https://pkg.pr.new/hono-agents@1693

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1693

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1693

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1693

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1693

commit: 63f8da4

- Report `recoveryKind: "retry"` to `onChatRecovery` and the incident record
  for an empty-partial new turn (interrupted before any chunk), since that case
  is deterministically a retry — it's knowable before the hook runs. The
  `persist: false` sibling case still reports "continue" (it only becomes a
  retry based on the hook's own return value) and the comment documents why.
- Await `_persistOrphanedStream` in the `triggerInterruptedStreamCheck` test
  helper so it matches the production fiber-recovery path (latent test-only
  race, harmless in practice but now correct).
- Rename the two `wip/` package.json names to the `@cloudflare/agents-*`
  prefix so changesets' ignore glob excludes them from versioning/release.
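
The reporting rule in the first bullet separates what is known before `onChatRecovery` runs from what depends on its return value. A minimal sketch for an interrupted new turn only, assuming the decision reduces to a count of persisted assistant parts; `reportedRecoveryKind`, `recoveryAction`, and `RecoveryHookResult` are hypothetical names, not the package's API.

```typescript
type RecoveryKind = "continue" | "retry";
type RecoveryAction = "continue-partial" | "rerun-fresh";

// Hypothetical hook result shape, modeled on `{ persist: false }`.
interface RecoveryHookResult {
  persist?: boolean;
}

// Computed BEFORE the hook runs (new turns only): an empty partial is
// deterministically a retry; anything else is reported as "continue".
function reportedRecoveryKind(persistedPartCount: number): RecoveryKind {
  return persistedPartCount === 0 ? "retry" : "continue";
}

// Decided AFTER the hook: a discarded partial ({ persist: false }) also
// leaves nothing to continue, so the turn re-runs fresh, while the kind
// already reported for it stays "continue".
function recoveryAction(persistedPartCount: number, hook: RecoveryHookResult): RecoveryAction {
  if (persistedPartCount === 0 || hook.persist === false) return "rerun-fresh";
  return "continue-partial";
}
```

Splitting the two functions mirrors the distinction drawn above: only the empty-partial case is knowable up front, so only that case changes what the hook and the incident record see.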

@threepointone merged commit 6496c80 into main on Jun 6, 2026 (4 checks passed).
@threepointone deleted the fix-orphan-stream-message-merge-1691 branch on June 6, 2026 23:59.
@github-actions mentioned this pull request on Jun 7, 2026.

Issues this pull request may close: AIChatAgent orphaned stream recovery can merge a new assistant response into the previous assistant message