Description
When the OpenCode server restarts (or the process crashes) while a session is actively executing tool calls, the session gets permanently stuck in a "Thinking" state. The root cause is that there is no startup recovery that cleans up orphaned assistant messages and tool parts.
What happens
- A session is actively executing tool calls (e.g., bash commands)
- The server restarts or crashes
- The in-memory session state (`SessionStatus`) is lost — the session is no longer "busy"
- But the database state is stale: the last assistant message has `time.completed = undefined` (never completed) and tool parts remain in `status: "running"` forever
- The UI sees the incomplete assistant message and shows a permanent "Thinking" spinner
- The session cannot recover — sending a new message creates a new loop iteration, but the old orphaned message still exists
Root cause analysis
The existing cleanup in processor.ts:402-417 correctly handles the normal case — when the stream ends (normally, via error, or abort), it force-sets any non-terminal tool parts to status: "error". However, this cleanup only runs if the process survives long enough to reach it.
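As a rough sketch, that stream-end cleanup amounts to a `finally` block over the active message's tool parts (the types and function below are assumptions based on this issue's description, not the actual processor.ts code):

```typescript
// Hypothetical part shape, inferred from the issue text (not the real type).
interface ToolPart {
  status: "pending" | "running" | "completed" | "error"
  error?: string
}

// Wraps a streaming run so non-terminal tool parts are force-failed when the
// stream ends normally, errors, or is aborted. If the process dies mid-stream,
// this finally block never runs -- which is exactly the gap described above.
async function runStream(parts: ToolPart[], stream: () => Promise<void>): Promise<void> {
  try {
    await stream()
  } finally {
    for (const part of parts) {
      if (part.status === "pending" || part.status === "running") {
        part.status = "error"
        part.error = "Tool execution was interrupted"
      }
    }
  }
}
```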
There is zero recovery at startup:
- `Session.initialize()` does not scan for orphaned messages
- `SessionStatus` (in-memory map) is empty after restart — no stale detection
- No background watchdog checks for sessions stuck in busy state
The only defense is in toModelMessages() (message-v2.ts:740-746), which converts pending/running tool parts into "[Tool execution was interrupted]" when building the next LLM prompt. This helps contextual recovery if the user sends a new message, but the UI still shows the session as stuck because the orphaned assistant message has no time.completed.
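That prompt-building defense can be illustrated as follows (a simplified sketch; the real `toModelMessages()` handles many more part types and statuses):

```typescript
type ToolStatus = "pending" | "running" | "completed" | "error"

// When rebuilding the LLM prompt, a tool part that never reached a terminal
// state is replaced by an interruption marker instead of its (missing) output.
function toolResultText(status: ToolStatus, output?: string): string {
  if (status === "pending" || status === "running") {
    return "[Tool execution was interrupted]"
  }
  return output ?? ""
}
```

This only fixes the next LLM call; it does not touch the stored message, so the UI keeps spinning.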
Observed in production
- Session `ses_2f4299f5cffeVZfxCt3ViZ7eVJ` stuck for 3+ hours with a `git log` tool part permanently in `"running"` status
- Session `ses_2e9127723ffeKJ1JpjLNS35B4z` shows a similar pattern (this one was actually still running a long k8s test, but it demonstrates the same vulnerability)
Relation to existing issues
This is the backend root cause behind several reported symptoms:
- Web UI shows permanent Thinking spinner after stream interruption or server restart #17680 — Web UI permanent Thinking spinner after stream interruption
- QUEUED tag stuck on all user messages when an assistant message has null time.completed #16856 — TUI stuck QUEUED badges from orphan assistant messages
- Intermittent hang: session stays running forever until manual interrupt #14769 — Intermittent hang: session stays running forever
- Tasks/Subagents with Codex / OpenAI are frequently getting stuck with no timeout/retry, which then hangs the session forever #11865 — Subagents stuck with no timeout/retry
- Explore subagent hangs indefinitely with Anthropic Claude Opus 4.6 -- no timeout or recovery #13841 — Explore subagent hangs indefinitely with no recovery
Open PRs #16907 and #17593 address frontend symptoms (making the UI more defensive about stale state), but neither fixes the backend root cause — orphaned messages and tool parts in the database.
Proposed fix
Startup recovery in Session or app bootstrap:
- On server start, query all messages where `time.completed IS NULL` and `role = "assistant"`
- For each orphaned message:
  - Set `time.completed = Date.now()`
  - Set all tool parts with `status = "running"` or `status = "pending"` to `status = "error"` with `error = "Tool execution was interrupted (server restart)"`
  - Emit Bus events so connected frontends update
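The recovery pass could look roughly like this (a minimal sketch; the message/part shapes are inferred from this issue, and persistence plus Bus event emission are left to the caller rather than modeled on OpenCode's actual API):

```typescript
// Hypothetical shapes, inferred from the issue text.
interface OrphanToolPart {
  status: "pending" | "running" | "completed" | "error"
  error?: string
}

interface StoredMessage {
  role: "user" | "assistant"
  time: { created: number; completed?: number }
  parts: OrphanToolPart[]
}

// Scans for assistant messages that never completed and force-finishes them.
// Returns the repaired messages so the caller can persist them and emit Bus
// events for connected frontends.
function recoverOrphanedMessages(messages: StoredMessage[], now: number = Date.now()): StoredMessage[] {
  const repaired: StoredMessage[] = []
  for (const msg of messages) {
    if (msg.role !== "assistant" || msg.time.completed !== undefined) continue
    msg.time.completed = now
    for (const part of msg.parts) {
      if (part.status === "pending" || part.status === "running") {
        part.status = "error"
        part.error = "Tool execution was interrupted (server restart)"
      }
    }
    repaired.push(msg)
  }
  return repaired
}
```

Running this once from `Session.initialize()` (or app bootstrap) would clear the stuck "Thinking" state before any client connects.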
This is a small, safe change — the cleanup logic already exists in processor.ts:402-417; it just needs to be callable from a recovery path at startup.
Steps to reproduce
- Start `opencode serve`
- Start a session that uses tool calls (e.g., ask it to run tests)
- Kill the server process while tools are executing (`kill -9`)
- Restart the server
- Open the session in the UI — it shows a permanent "Thinking" spinner
- Session status API returns `{}` (idle) but the UI is stuck
Environment
- opencode serve (long-running, multiple sessions)
- macOS / Linux
- Any provider (observed with gpt-5.3-codex via github-copilot)
OpenCode version
Latest dev branch (commit 814a515a8)