fix(acp): replace timing-based replay guard with DB-matching state machine #431
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ca9c7b3404
```rust
async fn is_replay_idle(&self, timeout: Duration) -> bool {
    let phase = self.phase.lock().await;
    if let HandlerPhase::Replaying(buf) = &*phase {
        buf.received_any && buf.last_notification_at.elapsed() >= timeout
```
Treat zero-notification resumes as completed replay
If the previous run never persisted any non-user output—for example, the user cancelled before the agent answered—ReplayBuffer::new considers replay complete immediately (target_index = None). run_acp_protocol still waits for replay_done/is_replay_idle, though, and this helper never returns true until at least one notification arrives. When load_session emits nothing, the loop at driver.rs:957-971 never exits, so resume_session leaves the session stuck in running and never sends the new prompt.
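The fix Codex suggests can be sketched as a short-circuit in the idle check. This is a minimal illustrative mirror of the handler's replay buffer, not the crate's actual types: `ReplayBuffer`, `target_index`, and `is_idle` are assumed names standing in for the real fields described in the review.

```rust
use std::time::{Duration, Instant};

// Hypothetical mirror of the handler's replay buffer; field names are
// illustrative, based on the review comment above.
struct ReplayBuffer {
    target_index: Option<usize>, // None => nothing to replay
    received_any: bool,
    last_notification_at: Instant,
}

impl ReplayBuffer {
    // Suggested fix: a zero-notification resume counts as idle immediately,
    // instead of waiting for a first notification that will never arrive.
    fn is_idle(&self, timeout: Duration) -> bool {
        if self.target_index.is_none() {
            return true; // replay was complete before it started
        }
        self.received_any && self.last_notification_at.elapsed() >= timeout
    }
}
```

With this guard, a session whose previous run persisted no non-user output unblocks the resume loop right away rather than hanging on the idle timeout.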
```rust
handler.transition_to_live().await;

connection
    .prompt(prompt_request)
    .await
```
Keep replay suppression active until the prompt is actually sent
On a resumed session, late replay AgentMessageChunks can still arrive after the role/id matcher decides replay is done. This code switches the handler to Live before awaiting connection.prompt(...), and Live forwards assistant chunks without any replay filter. In that window, a slow load_session replay is appended to the DB as if it belonged to the new turn, duplicating old assistant text in the transcript.
Add [msg-order] and [replay-guard] tagged debug logs across the full message lifecycle to help diagnose the race condition where old messages appear after the user's reply when resuming a session:
- ACP driver & simple driver: log replay guard transitions (set_live) and whether notifications are dropped or processed
- Session runner: log the persisted user message ID on resume
- Message writer: log DB row IDs for inserted assistant/tool messages
- Store: log get_session_messages_since results when >1 message returned
- Frontend: log poll results with message IDs and roles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log when tool_call and tool_result dedup fires (in-place update instead of insert) and when finalize() resets writer state. These silent paths were previously unlogged, making it hard to diagnose whether tool calls break the message ordering/deduplication logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When resuming a session, load_session() returns before all replay notifications have been delivered through the async transport. This caused old tool calls and assistant text to slip past the replaying guard, get written to the DB with new IDs, and appear after the user's message in the chat.

Track tool_call_ids seen during the replay phase. After set_live(), definitively drop any ToolCall/ToolCallUpdate referencing a replayed ID, and use temporal proximity (a 500ms window) to also drop interleaved old assistant text and AvailableCommandsUpdates that arrive in the same burst as stale tool-call notifications.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
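The replayed-ID safety net this commit describes amounts to a set lookup. A minimal sketch, with `ReplayGuard` and its method names assumed for illustration rather than taken from the crate:

```rust
use std::collections::HashSet;

// Sketch of the replayed-ID safety net; type and method names are
// assumptions, not the crate's actual API.
struct ReplayGuard {
    replayed_tool_call_ids: HashSet<String>,
}

impl ReplayGuard {
    // Called for every tool-call ID observed during the replay phase.
    fn record_replayed(&mut self, id: &str) {
        self.replayed_tool_call_ids.insert(id.to_string());
    }

    // After set_live(), a ToolCall/ToolCallUpdate referencing a replayed ID
    // is stale and must be dropped rather than written to the DB.
    fn is_stale_tool_call(&self, id: &str) -> bool {
        self.replayed_tool_call_ids.contains(id)
    }
}
```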
… gate

The previous 500ms temporal window for filtering late-arriving replay notifications was unreliable, especially over high-latency connections to remote workstations, where buffered notifications could arrive well after the window expired.

Replace the timing-based approach with a deterministic `prompt_sent` gate. Between `set_live()` and `mark_prompt_sent()`, ALL notifications are definitively stale because the server cannot produce new-turn output before receiving the prompt. This eliminates the STALE_PROXIMITY constant and last_stale_drop tracking entirely.

Keep `replayed_tool_call_ids` as a safety net: tool-call IDs observed during the replay and pre-prompt phases are used to catch any stale ToolCall/ToolCallUpdate that arrives after the prompt is sent.

Applied the same pattern to the simple driver.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
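The deterministic gate reduces to two flags and one invariant. A sketch under the assumption that `Gate` and `accept` are illustrative names, not the driver's real types:

```rust
// Minimal sketch of the deterministic gate: between set_live() and
// mark_prompt_sent(), every notification is stale by construction,
// because the server cannot emit new-turn output before it has
// received the prompt.
struct Gate {
    live: bool,
    prompt_sent: bool,
}

impl Gate {
    fn set_live(&mut self) {
        self.live = true;
    }

    fn mark_prompt_sent(&mut self) {
        self.prompt_sent = true;
    }

    // Forward a notification only once both flags are set; anything
    // arriving in the live-but-unprompted window is dropped.
    fn accept(&self) -> bool {
        self.live && self.prompt_sent
    }
}
```

Unlike the 500ms window, this predicate is independent of transport latency, which is why it holds on slow remote connections.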
…chine

The previous approaches (temporal window, prompt_sent gate) relied on timing assumptions that break on high-latency connections to remote workstations. Replace with a phase-based state machine that accumulates replay notifications into complete messages, matches them against existing DB messages, and signals completion deterministically:
- Replaying: accumulate text chunks, match tool calls/results against DB by role and content prefix. Signal done when the last non-user DB message is matched, or after 1s of idle (fallback).
- WaitingForPrompt: replay done, drop stragglers, record tool-call IDs.
- Live: forward to the writer with the replayed_tool_call_ids safety net.

The prompt is only sent after replay completion is detected, eliminating the race between stale replay notifications and new-turn output.

Also adds Store::get_session_messages() to the acp-client Store trait (with a default empty-vec impl for NoOpStore) and implements it for the Tauri Store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
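The phases and the DB-matching step above can be sketched as follows. This is a simplified, role-only illustration: `HandlerPhase` echoes the names in the commit, while `Matcher` and its in-order role comparison are assumptions standing in for the real role-and-content-prefix matching:

```rust
// Sketch of the phase machine described in the commit message.
#[derive(Debug, PartialEq)]
enum HandlerPhase {
    Replaying,        // match replayed output against persisted DB rows
    WaitingForPrompt, // replay done; drop stragglers, record tool-call IDs
    Live,             // forward notifications to the writer
}

// Illustrative matcher: consume persisted non-user messages in order and
// report true once the last one has been matched (replay complete).
struct Matcher {
    db_roles: Vec<String>, // roles of already-persisted non-user messages
    next: usize,
}

impl Matcher {
    fn try_match(&mut self, role: &str) -> bool {
        if self.next < self.db_roles.len() && self.db_roles[self.next] == role {
            self.next += 1;
        }
        self.next == self.db_roles.len()
    }
}
```

With an empty DB history, `try_match` reports completion immediately, which lines up with the zero-notification resume case.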
- Fix the finalize_current() return value being discarded in the Replaying phase, so replay completion is signalled via content matching instead of always falling through to the 1s idle timeout
- Remove all [msg-order] and [replay-guard] diagnostic logs added during debugging across the writer, session runner, store, driver, simple driver, and frontend
- Extract the duplicated strip_code_fences into acp-client and import it in the writer
- Guard transition_to_live() with an is_resuming check in both drivers to avoid misleading transitions on non-resuming sessions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add a 10s absolute timeout to the replay-wait loop to prevent an indefinite hang when the server sends zero replay notifications (e.g. the remote session was garbage-collected but the local DB still has messages)
- Remove the unused _content parameter from ReplayBuffer::try_match; matching is role-based only, so callers no longer compute content for matching (strip_code_fences, sanitize_title)
- Log a warning on get_session_messages failure instead of silently returning an empty vec via unwrap_or_default
- Drop the phase mutex before writer calls in the Live phase to avoid holding it across await points, matching the pattern used by the simple driver

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
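The two-timeout wait described in the first bullet can be sketched as a loop that exits on completion or on an absolute deadline. A synchronous illustration: the real loop awaits handler state rather than polling a closure, and `wait_for_replay` is an assumed name:

```rust
use std::time::{Duration, Instant};

// Sketch of the replay-wait loop: succeed as soon as replay is done,
// but never wait longer than the absolute deadline (10s in the PR),
// covering the zero-replay-notification case.
fn wait_for_replay(
    mut is_done: impl FnMut() -> bool,
    poll_interval: Duration,
    absolute: Duration,
) -> bool {
    let start = Instant::now();
    loop {
        if is_done() {
            return true; // replay matched or went idle
        }
        if start.elapsed() >= absolute {
            return false; // e.g. server sent zero replay notifications
        }
        std::thread::sleep(poll_interval);
    }
}
```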
Force-pushed: 72a0511 to b371963
Summary
- Replace the `AtomicBool` replay guard with a phase-based state machine (Replaying → WaitingForPrompt → Live) that matches replayed notifications against persisted DB messages, preventing out-of-order or duplicate messages in session chat on resumption.
- Add a `Store::get_session_messages()` trait method so the ACP driver can read previously persisted messages for replay matching.
- Move `strip_code_fences` to the `acp-client` crate for reuse.

Test plan
🤖 Generated with Claude Code