fix: safe agent recovery on submit failure and accurate disconnect messaging#7972
fix: safe agent recovery on submit failure and accurate disconnect messaging#7972kyledef wants to merge 1 commit intoblock:mainfrom
Conversation
b7351e4 to
5faabd4
Compare
…ssaging - Keep sseMaxRetryAttempts at 0 for /reply — the endpoint is not idempotent and retrying can duplicate tool calls (shell commands, file edits, MCP calls). Added explicit comment at all 3 call sites. - On submit failure, attempt resumeAgent to restore the session (handles LRU eviction, server restart) so the user can manually retry. Does NOT auto-resend the message. - Update connection-lost toast to warn about possible partial execution and duplicate tool actions on resend. - Remove misleading 'connection restored' / 'reconnected' language. True resumable streaming requires protocol-level changes (idempotent turn submission, reply_id, event replay) tracked separately.
5faabd4 to
ae5bce0
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ae5bce0525
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| await resumeAgent({ | ||
| body: { | ||
| session_id: sessionId, | ||
| load_model_and_extensions: true, | ||
| }, |
There was a problem hiding this comment.
Finish failed submit before attempting reconnect
When reply() fails (for example during the same network/backend outage this change is trying to recover from), this path now awaits resumeAgent() before calling onFinish. Because START_STREAMING has already been dispatched, a slow or hanging /agent/resume call keeps the UI stuck in streaming state (spinner/disabled input) instead of returning to idle with an error; previously the submit error finished immediately. Please call onFinish(...) first (or run reconnect in the background with a timeout), and apply the same fix to the mirrored logic in submitElicitationResponse.
Useful? React with 👍 / 👎.
|
Thanks for the contribution @kyledef! Closing this one for now — a couple of things worth flagging: Multiple PRs opened in quick succession: You've opened both #7972 and #7978 within a few hours of each other as a first-time contributor. Our CONTRIBUTING.md asks that you open one PR at a time and wait for it to land before opening more — this helps us give each change a proper review and keeps the feedback loop manageable. Please pick your preferred one (#7978 looks like the more complete approach) and let's work from there. Unaddressed codex comment: There's an unaddressed P1 comment from codex pointing out a real bug — if the server is down (exactly the failure case this PR targets), Feel free to reopen or incorporate the fix into #7978 — happy to keep the conversation going there! |
Summary
When a
/replyrequest fails (due to agent LRU eviction, server restart, or network loss), silently restore the agent so the session remains usable, and show accurate messaging about what happened.Problem
Multiple users have reported that Goose "just hangs" when sending a message after not interacting for a while, requiring a full restart to recover. The root causes:
resumeAgentwas only called once on mountWhy we do NOT retry
/replyThe
/replyendpoint is a one-shot turn submission, not a resumable stream. There is no protocol support for idempotent turn submission, reconnecting to an in-flight turn, or replaying buffered events. Retrying the SSE connection would re-submit the same turn and potentially duplicate side effects. True resumable streaming requires protocol-level changes (idempotentreply_id, event replay,Last-Event-IDsupport) tracked separately.Changes
// Do not retry /replycomments at all 3 SSE call sites explaining whysseMaxRetryAttemptsmust stay at0resumeAgentto restore the session (handles LRU eviction, server restart, extension reload) so the user can manually retry. Does NOT auto-resend the message.What this does NOT do
/replyrequests (unsafe without idempotency)Testing
sseMaxRetryAttemptscall sites documentedhandleSubmitandsubmitElicitationResponseerror handlersFiles Changed
ui/desktop/src/hooks/useChatStream.ts— agent recovery + messaging + commentsui/desktop/package.json— patch version bump to 1.27.1