Skip to content

fix(acp): replace timing-based replay guard with DB-matching state machine#431

Merged
matt2e merged 8 commits intomainfrom
session-chat-out-of-order
Mar 20, 2026
Merged

fix(acp): replace timing-based replay guard with DB-matching state machine#431
matt2e merged 8 commits intomainfrom
session-chat-out-of-order

Conversation

@matt2e
Copy link
Contributor

@matt2e matt2e commented Mar 20, 2026

Summary

  • Replaces the AtomicBool replay guard with a phase-based state machine (Replaying → WaitingForPrompt → Live) that matches replayed notifications against persisted DB messages, preventing out-of-order or duplicate messages in session chat on resumption.
  • Adds Store::get_session_messages() trait method so the ACP driver can read previously persisted messages for replay matching.
  • Applies the same phase-based pattern to the simple driver for consistency, and moves strip_code_fences to acp-client crate for reuse.

Test plan

  • Resume an existing agent session and verify chat messages appear in correct order without duplicates
  • Start a fresh session and verify normal message flow is unaffected
  • Verify tool call results display correctly after session resumption

🤖 Generated with Claude Code

@matt2e matt2e requested review from baxen and wesbillman as code owners March 20, 2026 06:38
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca9c7b3404

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +729 to +732
async fn is_replay_idle(&self, timeout: Duration) -> bool {
let phase = self.phase.lock().await;
if let HandlerPhase::Replaying(buf) = &*phase {
buf.received_any && buf.last_notification_at.elapsed() >= timeout

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat zero-notification resumes as completed replay

If the previous run never persisted any non-user output—for example, the user cancelled before the agent answered—ReplayBuffer::new considers replay complete immediately (target_index = None). run_acp_protocol still waits for replay_done/is_replay_idle, though, and this helper never returns true until at least one notification arrives. When load_session emits nothing, the loop at driver.rs:957-971 never exits, so resume_session leaves the session stuck in running and never sends the new prompt.

Useful? React with 👍 / 👎.

Comment on lines +983 to 987
handler.transition_to_live().await;

connection
.prompt(prompt_request)
.await

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep replay suppression active until the prompt is actually sent

On a resumed session, late replay AgentMessageChunks can still arrive after the role/id matcher decides replay is done. This code switches the handler to Live before awaiting connection.prompt(...), and Live forwards assistant chunks without any replay filter. In that window, a slow load_session replay is appended to the DB as if it belonged to the new turn, duplicating old assistant text in the transcript.

Useful? React with 👍 / 👎.

matt2e and others added 8 commits March 20, 2026 19:17
Add [msg-order] and [replay-guard] tagged debug logs across the full
message lifecycle to help diagnose the race condition where old messages
appear after the user's reply when resuming a session:

- ACP driver & simple driver: log replay guard transitions (set_live)
  and whether notifications are dropped or processed
- Session runner: log the persisted user message ID on resume
- Message writer: log DB row IDs for inserted assistant/tool messages
- Store: log get_session_messages_since results when >1 message returned
- Frontend: log poll results with message IDs and roles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log when tool_call and tool_result dedup fires (in-place update instead
of insert) and when finalize() resets writer state. These silent paths
were previously unlogged, making it hard to diagnose whether tool calls
break the message ordering/deduplication logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When resuming a session, load_session() returns before all replay
notifications have been delivered through the async transport. This
caused old tool calls and assistant text to slip past the replaying
guard, get written to the DB with new IDs, and appear after the
user's message in the chat.

Track tool_call_ids seen during the replay phase. After set_live(),
definitively drop any ToolCall/ToolCallUpdate referencing a replayed
ID, and use temporal proximity (500ms window) to also drop interleaved
old assistant text and AvailableCommandsUpdates that arrive in the
same burst as stale tool-call notifications.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… gate

The previous 500ms temporal window for filtering late-arriving replay
notifications was unreliable — especially over high-latency connections
to remote workstations where buffered notifications could arrive well
after the window expired.

Replace the timing-based approach with a deterministic `prompt_sent`
gate.  Between `set_live()` and `mark_prompt_sent()`, ALL notifications
are definitively stale because the server cannot produce new-turn output
before receiving the prompt.  This eliminates the STALE_PROXIMITY
constant and last_stale_drop tracking entirely.

Keep `replayed_tool_call_ids` as a safety net: tool-call IDs observed
during the replay and pre-prompt phases are used to catch any stale
ToolCall/ToolCallUpdate that arrives after the prompt is sent.

Applied the same pattern to the simple driver.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…chine

The previous approaches (temporal window, prompt_sent gate) relied on
timing assumptions that break on high-latency connections to remote
workstations.

Replace with a phase-based state machine that accumulates replay
notifications into complete messages, matches them against existing
DB messages, and signals completion deterministically:

- Replaying: accumulate text chunks, match tool calls/results against
  DB by role and content prefix.  Signal done when the last non-user
  DB message is matched, or after 1s of idle (fallback).
- WaitingForPrompt: replay done, drop stragglers, record tool-call IDs.
- Live: forward to writer with replayed_tool_call_ids safety net.

The prompt is only sent after replay completion is detected, eliminating
the race between stale replay notifications and new-turn output.

Also adds Store::get_session_messages() to the acp-client Store trait
(with a default empty-vec impl for NoOpStore) and implements it for the
Tauri Store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix finalize_current() return value being discarded in the Replaying
  phase, so replay completion is signalled via content matching instead
  of always falling through to the 1s idle timeout
- Remove all [msg-order] and [replay-guard] diagnostic logs added during
  debugging across writer, session runner, store, driver, simple driver,
  and frontend
- Extract duplicate strip_code_fences to acp-client and import in writer
- Guard transition_to_live() with is_resuming check in both drivers to
  avoid misleading transitions on non-resuming sessions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add 10s absolute timeout to replay-wait loop to prevent indefinite
  hang when server sends zero replay notifications (e.g. remote session
  was garbage-collected but local DB still has messages)
- Remove unused _content parameter from ReplayBuffer::try_match — matching
  is role-based only, so callers no longer compute content for matching
  (strip_code_fences, sanitize_title)
- Log warning on get_session_messages failure instead of silently
  returning empty vec via unwrap_or_default
- Drop phase mutex before writer calls in Live phase to avoid holding
  it across await points, matching the pattern used by the simple driver

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@matt2e matt2e force-pushed the session-chat-out-of-order branch from 72a0511 to b371963 Compare March 20, 2026 08:18
@matt2e matt2e merged commit f1369af into main Mar 20, 2026
6 checks passed
@matt2e matt2e deleted the session-chat-out-of-order branch March 20, 2026 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant