feat(session): add session state machine — Phase 1+2 (write + observability)#348
feat(session): add session state machine — Phase 1+2 (write + observability)#348dimakis wants to merge 2 commits into
Conversation
Write 7-state lifecycle (CREATED→STARTING→ACTIVE→DETACHED/SUSPENDED→ CLOSING→ENDED) to EventStore at every transition point. Phase 1 is write-only: existing behavior unchanged, invalid transitions logged as warnings. This lays the groundwork for Phase 2 where reads replace the dual source-of-truth (SessionRegistry + isActive flag). - Add SessionState type and VALID_TRANSITIONS to protocol package - Add setSessionState/getSessionState to EventStore with migration - Write state in chat.ts (start/detach/reattach/closeout) - Write state in query-loop.ts (ACTIVE on first event, ENDED in finally) - Write state in ws-handler-v2.ts and app.ts (SUSPENDED) - 13 new tests covering full lifecycle, edge cases, force flag Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 6 issue(s) (3 warning).
server/query-loop.ts
The state machine schema and EventStore API are solid, but new (non-resume) sessions silently skip CREATED/STARTING/ACTIVE transitions because sessionId isn't resolved yet at the call sites — only resumed sessions get full lifecycle tracking in Phase 1.
- 🟡 bugs (L411): For new (non-resume) sessions,
resolvedSessionIdis undefined at this point (set later at line 475 on the first assistant event), andregistry.get(clientId)?.sessionIdis also undefined (set at line 478 viasetSessionId). Sosidis undefined andsetSessionState('ACTIVE')is silently skipped. New sessions never record CREATED, STARTING, or ACTIVE — only ENDED in the finally block. The state machine only tracks lifecycle for resumed sessions.[fixable]
server/chat.ts
The state machine schema and EventStore API are solid, but new (non-resume) sessions silently skip CREATED/STARTING/ACTIVE transitions because sessionId isn't resolved yet at the call sites — only resumed sessions get full lifecycle tracking in Phase 1.
- 🟡 bugs (L688): For new sessions,
options.resumeis undefined andsession.sessionIdis not yet assigned (it's set later in query-loop.ts line 478 when the first assistant event arrives). SostateSessionIdis undefined/falsy and the CREATED transition is skipped. Same issue at line 925 for STARTING. The Phase 1 state machine captures lifecycle only for resumed sessions.[fixable] - 🔵 missing_tests: No integration tests verify that the server-side call sites (detachChat, reattachChat, closeoutSession, suspend endpoints) actually produce the expected state transitions in the EventStore. The unit tests cover the EventStore API directly, but don't exercise the wiring in chat.ts, query-loop.ts, app.ts, or ws-handler-v2.ts.
[fixable]
packages/protocol/src/event-store.ts
The state machine schema and EventStore API are solid, but new (non-resume) sessions silently skip CREATED/STARTING/ACTIVE transitions because sessionId isn't resolved yet at the call sites — only resumed sessions get full lifecycle tracking in Phase 1.
- 🟡 unsafe_assumptions (L551):
setSessionStateexecutes an UPDATE statement that silently does nothing if the session row doesn't exist in the database (no matchingsession_id). The method still logs 'session state transition' as if it succeeded. For new sessions whereupsertSessionhasn't been called yet, or if called with a stale/typo'd sessionId, the state write is silently lost. Consider checkingchangeson the run result.[fixable] - 🔵 style (L541): Invalid state transitions are logged at
infolevel. In production,infologs are high-volume and these warnings could be lost. Consider usingwarnlevel (would require expanding theEventStoreLoggerinterface) or at minimum adding a distinguishing prefix/field (e.g.,level: 'warn'in the meta object) so they're filterable.[fixable]
packages/protocol/__tests__/event-store.test.ts
The state machine schema and EventStore API are solid, but new (non-resume) sessions silently skip CREATED/STARTING/ACTIVE transitions because sessionId isn't resolved yet at the call sites — only resumed sessions get full lifecycle tracking in Phase 1.
- 🔵 missing_tests: No test for calling
setSessionStateon a session that doesn't exist in the database. The UPDATE is a silent no-op, but the method still logs as if the transition succeeded. A test would document this edge case and catch regressions if the behavior changes.[fixable]
| firstEventReceived = true; | ||
| clearTimeout(firstEventTimer); | ||
| // Session state machine: mark ACTIVE on first SDK event (resume path) | ||
| const sid = resolvedSessionId || registry.get(clientId)?.sessionId; |
There was a problem hiding this comment.
🟡 bugs: For new (non-resume) sessions, resolvedSessionId is undefined at this point (set later at line 475 on the first assistant event), and registry.get(clientId)?.sessionId is also undefined (set at line 478 via setSessionId). So sid is undefined and setSessionState('ACTIVE') is silently skipped. New sessions never record CREATED, STARTING, or ACTIVE — only ENDED in the finally block. The state machine only tracks lifecycle for resumed sessions. [fixable]
| _onSessionChange?.(clientId, 'start'); | ||
|
|
||
| // Session state machine: mark CREATED (Phase 1 — write only, no behavior change) | ||
| const stateSessionId = options.resume ?? session.sessionId; |
There was a problem hiding this comment.
🟡 bugs: For new sessions, options.resume is undefined and session.sessionId is not yet assigned (it's set later in query-loop.ts line 478 when the first assistant event arrives). So stateSessionId is undefined/falsy and the CREATED transition is skipped. Same issue at line 925 for STARTING. The Phase 1 state machine captures lifecycle only for resumed sessions. [fixable]
| } | ||
| } | ||
|
|
||
| this.stmts.setSessionState.run(newState, now, sessionId); |
There was a problem hiding this comment.
🟡 unsafe_assumptions: setSessionState executes an UPDATE statement that silently does nothing if the session row doesn't exist in the database (no matching session_id). The method still logs 'session state transition' as if it succeeded. For new sessions where upsertSession hasn't been called yet, or if called with a stale/typo'd sessionId, the state write is silently lost. Consider checking changes on the run result. [fixable]
| if (fromState && !opts?.force) { | ||
| const allowed = VALID_TRANSITIONS[fromState]; | ||
| if (!allowed?.includes(newState)) { | ||
| this.log.info('invalid session state transition (warn-only)', { |
There was a problem hiding this comment.
🔵 style: Invalid state transitions are logged at info level. In production, info logs are high-volume and these warnings could be lost. Consider using warn level (would require expanding the EventStoreLogger interface) or at minimum adding a distinguishing prefix/field (e.g., level: 'warn' in the meta object) so they're filterable. [fixable]
Add detectStateMismatch() that checks consistency between in-memory SessionRegistry and durable EventStore session state on every send and interrupt. Logs mismatches as errors with full context (storeState, registryHas, details) for debugging. No behavior change — observability only, preparing for Phase 3 state-based routing. - detectStateMismatch() checks registry↔state, attach↔state alignment - Wired into handleSendV2 and handleInterruptV2 with span attributes - State added to all routing decision log lines (storeState field) - 14 new tests covering all mismatch scenarios Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Centaur ReviewFound 8 issue(s) (4 warning).
|
Summary
CREATED → STARTING → ACTIVE → DETACHED/SUSPENDED → CLOSING → ENDED) written to EventStore at every transition point. Existing behavior unchanged, invalid transitions logged as warnings.detectStateMismatch()that checks consistency between in-memory SessionRegistry and durable EventStore state on every send/interrupt. Logs mismatches as errors with full context. No behavior change.Together these lay the groundwork for Phase 3 where state reads replace the dual source-of-truth (
SessionRegistry+isActiveflag) that causes the mobile reattach hang.Changes
Phase 1
packages/protocol/src/types.ts—SessionStatetype, state fields onSessionMetapackages/protocol/src/event-store.ts—VALID_TRANSITIONS, migration,setSessionState()/getSessionState()with warn-only validationserver/chat.ts— State writes instartChat,detachChat,reattachChat,closeoutSessionserver/query-loop.ts—ACTIVEon first SDK event,ENDEDin finally blockserver/ws-handler-v2.ts+server/app.ts—SUSPENDEDon iOS backgroundPhase 2
server/ws-handler-v2.ts—detectStateMismatch(), wired intohandleSendV2+handleInterruptV2,storeStateadded to routing logsserver/__tests__/ws-handler-v2.test.ts— 14 new tests for mismatch detectionDesign doc
See PR #340 for the full design doc and review discussion.
Test plan
🤖 Generated with Claude Code