feat(terminal): wire CDC session-state streaming through /mux WebSocket#105
feat(terminal): wire CDC session-state streaming through /mux WebSocket#105Pritom14 wants to merge 3 commits into
Conversation
Sessions changing state (spawn, PR update, kill) now push live sessionPatch frames to every subscribed dashboard client over the /mux WebSocket, closing the gap where the frontend MuxProvider received thin CDC notifications instead of the full SessionPatch shape it expects. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Greptile SummaryThis PR closes the wire-shape mismatch between the Go
Confidence Score: 4/5Safe to merge after addressing the phantom-subscription path in handleSubscribe; the rest of the change is well-structured and thoroughly tested. When a Manager is constructed without WithSessionSource (the default), handleSubscribe still proceeds past its guard, creates a real CDC subscription, and latches the dedup flag — leaving any client that sends a sessions-subscribe frame permanently subscribed with no data and no ability to re-subscribe. All other paths (DeriveStatus refactor, signal coalescing, writeLoop enrichment) are correct and covered by both unit and functional tests under -race. backend/internal/terminal/manager.go (handleSubscribe guard), backend/internal/daemon/session_source.go (previously flagged silent error discard in toSession) Important Files Changed
Sequence DiagramsequenceDiagram
participant DB as DB (sessions/pr/pr_checks)
participant CL as change_log
participant CP as cdc.Poller
participant BC as cdc.Broadcaster
participant HS as handleSubscribe
participant SE as sessionEvents chan
participant WL as writeLoop
participant SS as SessionSource (daemonSessionSource)
participant WS as WebSocket (client)
Note over HS,SE: On subscribe (ch=subscribe, topics=[sessions])
HS->>BC: Subscribe(callback)
HS->>SE: pushSessionEvent() [initial snapshot signal]
WL->>SE: receive signal
WL->>SE: drainSignals() [coalesce burst]
WL->>SS: AllSessions(ctx)
SS-->>WL: []domain.Session
WL->>WS: WriteJSON(sessions/snapshot)
Note over DB,WS: On CDC event (DB write)
DB->>CL: INSERT trigger
CP->>CL: Poll()
CP->>BC: "Publish(Event{SessionID!=empty})"
BC->>HS: callback fires
HS->>SE: pushSessionEvent() [non-blocking]
WL->>SE: receive signal
WL->>SE: drainSignals()
WL->>SS: AllSessions(ctx)
SS-->>WL: []domain.Session
WL->>WS: WriteJSON(sessions/snapshot)
Reviews (3): Last reviewed commit: "fix(terminal): always send full session ..." | Re-trigger Greptile |
| pr, ok, _ := s.store.GetDisplayPRFactsForSession(ctx, rec.ID) | ||
| if ok { | ||
| return domain.Session{SessionRecord: rec, Status: domain.DeriveStatus(rec, &pr)} | ||
| } | ||
| return domain.Session{SessionRecord: rec, Status: domain.DeriveStatus(rec, nil)} |
There was a problem hiding this comment.
Silent DB error swallows PR facts
GetDisplayPRFactsForSession errors are silently discarded (blank _). When the DB returns a transient error (e.g., a context deadline or busy reader), ok is false and DeriveStatus receives nil for pr, so a session that actually has a PR gets derived as StatusWorking or StatusIdle instead of the correct PR-derived status such as StatusCIFailed or StatusMergeable. There is no log entry, so the wrong status is invisible to operators until the next CDC event corrects it. At minimum, the error should be logged so transient DB failures surface in traces.
| // the session derive to ci_failed and asserts the live patch carries the derived | ||
| // status and its mapped attention level — exercising the per-event enrichment path | ||
| // with a non-trivial status mapping. | ||
| func TestMuxStreamingPatchReflectsPRDerivedStatus(t *testing.T) { | ||
| ctx, cancel := context.WithCancel(context.Background()) | ||
| t.Cleanup(cancel) | ||
|
|
There was a problem hiding this comment.
Test duplicates production
toSession logic without coverage link
testSessionSource.toSession is a verbatim copy of daemonSessionSource.toSession — including the same silent discard of GetDisplayPRFactsForSession errors. If the production implementation changes (e.g., adds logging or falls back differently on error), this test continues to pass while diverging from the actual code path being tested. Consider extracting the shared logic into a package-level helper or accepting the daemon package import via an integration-test build tag so the test exercises the real implementation.
| case ev := <-c.sessionEvents: | ||
| // Resolve state here, off the broadcaster's hot path. Reading live at | ||
| // write time (not enqueue time) means the last frame for any session | ||
| // always reflects current state, whatever the queue interleaving. | ||
| if c.mgr.sessionSrc == nil { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Unreachable nil guard in
writeLoop
The if c.mgr.sessionSrc == nil { continue } branch at line 398 can never execute. Items are pushed to sessionEvents only from pushSessionEvent, which is called exclusively when sessionSrc != nil — both in the CDC callback (if c.mgr.sessionSrc == nil { return }) and in handleSubscribe (if c.mgr.sessionSrc != nil { ... }). The guard is dead code and mildly misleads readers into thinking the sessionSrc could become nil between push and drain.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| Status: string(s.Status), | ||
| Activity: string(s.Activity.State), | ||
| AttentionLevel: attentionLevel(s.Status, s.Activity.State), | ||
| LastActivityAt: s.Activity.LastActivityAt.UTC().Format(time.RFC3339), |
There was a problem hiding this comment.
Zero-time
LastActivityAt serializes to "0001-01-01T00:00:00Z"
For a freshly spawned session that has never received any activity, s.Activity.LastActivityAt is the zero time.Time{}, which Format(time.RFC3339) renders as "0001-01-01T00:00:00Z". The TS MuxProvider consumer likely does not expect this sentinel value and may render the field directly, exposing the year-1 timestamp in the dashboard. Consider omitting lastActivityAt (or serializing it as null/empty string) when the time is zero.
The frontend sends a subscribe frame as {ch:"subscribe", topics:[...]} with
no "type" field, but handleSubscribe required msg.Type == "subscribe" and
clientMsg had no Topics field. The real client's subscribe was silently
dropped, so no snapshot or CDC patches ever reached the dashboard.
Honor the topics array server-side: add Topics to clientMsg and subscribe to
the session feed only when topics contains "sessions". This matches the
contract the frontend already speaks and gates the per-event enrichment so
clients that did not opt into "sessions" do no session work on the hot path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follow-up: subscribe handshake fixWhile verifying the end-to-end path against the frontend, I found that the What was wrongThe frontend subscribes by sending only a channel and a topics list, with no message type: On the server,
The functional tests passed only because they sent a synthetic The fixGate the subscription on the topics array the frontend already sends, not on the message type.
Why gate on topics rather than just sending a typeHonoring topics is strictly more capable than making the client send a Tests
Still open (separate decision needed)Single-session CDC patches are currently sent as |
| if ev.snapshot { | ||
| all, err := c.mgr.sessionSrc.AllSessions(ctx) | ||
| if err != nil { | ||
| c.mgr.log.Warn("terminal: failed to fetch session snapshot", "err", err) | ||
| continue | ||
| } |
There was a problem hiding this comment.
Dropped initial snapshot with no retry leaves client without full session state
When AllSessions fails (e.g., a transient SQLite busy-reader or context deadline), the continue discards the event. The sessionEvent{snapshot: true} was enqueued exactly once in handleSubscribe and is now gone — there is no re-queue path. Subsequent CDC events only deliver per-session patches for sessions that change after connect; sessions that are idle never appear. A user who connects during a transient DB hiccup will see a permanently empty dashboard for all unchanged sessions, with no indication that state is missing.
…patches The frontend replaces its session list wholesale on every sessions/snapshot frame (setSessions(msg.sessions)). The writer was sending single-session updates as a one-element snapshot, which on the client dropped every other session from view on the first CDC event after subscribe. Collapse the writer to always emit a full snapshot of all sessions on any session-state signal. This keeps the existing frontend correct with no client-side change, matching the protocol's stated goal of letting the existing client connect unchanged. Signals are coalesced (drainSignals) so a burst of CDC events collapses into one AllSessions read, bounding the read amplification the full-snapshot approach introduces. Removes the now-dead single-session path: SessionSource.Session and its implementations on the daemon source and test fakes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follow-up: full snapshots instead of single-session patchesThis closes the second issue noted above (single-session updates clobbering the client's session list), without any frontend change. What was wrongThe frontend treats a sessions/snapshot frame as the complete session list: The writer, though, sent single-session CDC updates as a one-element snapshot. On the client that meant the first CDC event after subscribe replaced the entire list with just one session, dropping every other session from view. The fixAlways emit a full snapshot of all sessions on any session-state signal, rather than a single-session patch. This keeps the existing frontend correct with zero client-side change, which matches the protocol's stated goal of letting the existing client connect unchanged. Why this over a "patch" message typeA CleanupRemoved the now-dead single-session path: TestsAll terminal and integration tests pass (race-clean), including the three |
| if c.mgr.events == nil || !slices.Contains(msg.Topics, topicSessions) { | ||
| return | ||
| } |
There was a problem hiding this comment.
When
events != nil but sessionSrc == nil (the default when WithSessionSource is not passed), handleSubscribe still proceeds past this guard: it creates a real CDC subscription (whose callback always returns early) and sets c.unsubEvts, permanently latching the dedup flag. The client receives no snapshot and no error, and any future re-subscribe is silently ignored because c.unsubEvts != nil short-circuits on line 328. Adding sessionSrc == nil to this guard makes the no-op path explicit and prevents the phantom subscription.
| if c.mgr.events == nil || !slices.Contains(msg.Topics, topicSessions) { | |
| return | |
| } | |
| if c.mgr.events == nil || c.mgr.sessionSrc == nil || !slices.Contains(msg.Topics, topicSessions) { | |
| return | |
| } |
neversettle17-101
left a comment
There was a problem hiding this comment.
Leaving the /mux-vs-dedicated-endpoint question aside (taking that separately) — two things about this change that I'd want addressed before merge. Both are about where the new DB work lands.
1. The session snapshot read runs on the terminal's single write goroutine
In writeLoop (backend/internal/terminal/manager.go), the new sessionEvents case calls sessionSrc.AllSessions(ctx) synchronously:
case <-c.sessionEvents:
drainSignals(c.sessionEvents)
all, err := c.mgr.sessionSrc.AllSessions(ctx) // synchronous DB read
...
c.conn.WriteJSON(ctx, serverMsg{Ch: chSessions, Type: msgSnapshot, Sessions: toSessionPatches(all)})writeLoop is the single writer for the whole connection, including PTY output drained from c.out. So this DB read now sits inline on the path that flushes terminal bytes — a slow or contended snapshot read stalls PTY output for that connection.
The PR description's goal was to move DB reads "off the broadcaster's hot path," and it does correctly get them off the poller loop. But it lands them on the write loop, which is the latency-sensitive path for the terminal. Suggest resolving the snapshot off writeLoop (e.g. a small dedicated goroutine that feeds resolved patches into c.out) so a slow read can't block PTY flushing.
2. The snapshot read is N+1 per session, and it fires per CDC event per connection
AllSessions (session_source.go) calls GetDisplayPRFactsForSession once per session, which is a single-session query (backend/internal/storage/sqlite/store/pr_facts.go:16). So each snapshot is O(sessions) queries. Combined with #1 and the full-snapshot-per-event model (every CDC event re-reads and re-ships all sessions to every connected client), this is the part that bites first as session count grows. The coalescing helps batch bursts but doesn't bound the per-snapshot cost. Worth either a batched PR-facts query, or at minimum a comment acknowledging the fan-out as a known scaling limit.
Everything else looks good — the DeriveStatus move into domain is a clean call, the subscribe-before-snapshot ordering is right, and the functional tests are thorough.
|
closing this |
Closes #104
What changed and why
Before this PR, the
/muxWebSocket session-state channel forwarded raw CDC notification fields (seq,projectId,sessionId,eventType) directly onto the wire. The TypeScriptMuxProviderconsumer expects aSessionPatchshape (id,status,activity,attentionLevel,lastActivityAt). The mismatch meant the dashboard never received live session-state updates -- it could only poll or wait for a page refresh.This PR closes that gap end-to-end.
Flow
On subscribe, a full snapshot of all current sessions is queued through the same ordered channel before any per-session patches, so the client always starts with complete state.
Changes
internal/terminal/patches.go(new)Defines
SessionSource(the terminal package's view of session state -- no service-layer import),attentionLevel()(server-side port ofgetDetailedAttentionLevelfrompackages/web/src/lib/types.ts), andtoSessionPatch/toSessionPatcheshelpers.internal/terminal/protocol.goReplaced the thin
sessionUpdatestruct withsessionPatchcarrying the full wire shape the TS consumer expects. UpdatedserverMsg.Sessionsaccordingly.internal/terminal/manager.goThree changes, each fixing a separate problem:
1. Off-hot-path enrichment. Added
sessionEvents chan sessionEventtoconnState. The CDC subscriber callback now pushes only aSessionID(non-blocking); all DB reads happen inwriteLoop, off the broadcaster's serializedPublishloop. This honors the documented contract: "fn is called synchronously from the poller loop, so it must not block."2. No-gap subscribe/snapshot ordering. Subscribe is registered before the initial snapshot is queued. Both the snapshot (full read) and per-session patches (single-session read) flow through the same ordered
sessionEventschannel and are resolved live at write time -- so the last frame the client sees for any session always reflects current state, regardless of interleaving.3. Connection context for DB reads. The writer uses the connection-scoped
ctx, so a disconnect cancels any in-flight query.internal/domain/status.goMoved
DeriveStatusandprPipelineStatusfrominternal/service/session/status.gointo the domain package. They use only domain types; keeping them in the service layer created coupling pressure.internal/service/session/status.gonow delegates to the domain function.internal/daemon/session_source.go(new)daemonSessionSourceimplementsterminal.SessionSourceover the sqlite store. Per session: loads the record, fetches PR facts, callsdomain.DeriveStatus. Wired intoterminal.NewManagerviaWithSessionSourceindaemon.go.internal/httpd/terminal_mux.goExported
TerminalMuxHandler(wasterminalMuxHandler) so the integration test can mount it onhttptest.Serverwithout importing the full httpd stack.Tests
Unit (
internal/terminal/patches_test.go) -- table-driven coverage of all 16attentionLevelbranches (every status constant + activity fallbacks + priority-ordering cases),toSessionPatchfield shape including UTC timestamp normalization, andtoSessionPatchesordering.Unit (
internal/terminal/manager_test.go,protocol_test.go) -- updated wire-shape golden test, subscribe/snapshot/CDC-forward tests with in-memory fake conn.Functional (
internal/integration/mux_streaming_functional_test.go) -- real sqlite store, real CDC poller, realterminal.Manager, real WebSocket viahttptest.Server+coder/websocket.Dial:TestMuxStreamingSessionPatchDelivery: subscribe (empty snapshot) -> spawn -> poll CDC -> exactidle/idle/workingpatch on the wireTestMuxStreamingPatchReflectsPRDerivedStatus: apply CI failure -> poll CDC ->ci_failed/reviewpatch on the wire (per-event enrichment path with non-trivial status mapping)TestMuxStreamingInitialSnapshotContainsExistingSessions: sessions in the store before connect appear in the first snapshotAll tests pass under
-race. No new failures introduced.