feat(orchestrator): persist agent stream events in WebSocket handler #1172
geoffjay merged 2 commits into issue-1159
Conversation
ShawnSunClio
left a comment
Review — feat(orchestrator): persist agent stream events in WebSocket handler
Stack note: This PR bases on issue-1159 (PR #1167, still open) — sibling of #1168 and #1170. All three need #1167 to merge first. Diff is conflict-free relative to the parent.
Blocking issues (2)
1. Race condition at reconnect: events get session_number=0 during async lookup window
register() sets the in-memory session counter to 0 immediately, then spawns an async task to read the real value from storage:
```rust
// Sets to 0 right now
self.session_numbers.write().await.entry(agent_id).or_insert(0);

// Correct value arrives asynchronously — could be seconds later under load
tokio::spawn(async move {
    match storage.get_usage_stats(&agent_id).await {
        Ok(stats) => {
            session_numbers.write().await.insert(agent_id, stats.session_count as i64);
        }
        ...
    }
});
```

Any `assistant`, `result`, or user-message event that arrives before the async task completes will be tagged with `session_number: 0`, regardless of the agent's actual session history. For a reconnecting agent (restart, network blip, crash-recovery) that has already had two or three sessions, every event during this window lands in session 0 — permanently misattributed with no way to correct it after the fact.
The window is typically short, but during Sidekiq load or SQLite contention it could be seconds. Reconnection is the exact scenario where correct session attribution matters most (distinguishing the new run from prior ones).
Options:
- Read the session number synchronously before completing registration (accept the small latency hit at connect time)
- Defer event persistence until the session counter is confirmed (buffer events in a `VecDeque`, flush when the lookup resolves; a sketch follows this list)
- Query the max `session_number` from `conversation_events` directly in the async task (no round-trip through `agent_sessions`) as a faster path
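If the buffering option is chosen, this is a minimal sketch of the state machine it implies; `SessionState`, `ConversationEvent`, and the method names are illustrative, not taken from the PR:

```rust
use std::collections::VecDeque;

/// Illustrative per-agent state: either the session number is confirmed,
/// or events queue up until the async storage lookup resolves.
enum SessionState {
    Pending(VecDeque<ConversationEvent>), // buffered until resolve()
    Ready(i64),                           // confirmed session number
}

/// Simplified stand-in for the PR's event row.
struct ConversationEvent {
    session_number: i64,
    payload: String,
}

impl SessionState {
    /// Tag and return the event if the session is known; otherwise buffer it.
    fn record(&mut self, mut event: ConversationEvent) -> Option<ConversationEvent> {
        match self {
            SessionState::Ready(n) => {
                event.session_number = *n;
                Some(event) // caller persists immediately
            }
            SessionState::Pending(queue) => {
                queue.push_back(event); // held until the lookup resolves
                None
            }
        }
    }

    /// Called when the storage lookup completes: drain the buffer with the
    /// confirmed session number and switch to Ready for all later events.
    fn resolve(&mut self, session_number: i64) -> Vec<ConversationEvent> {
        let drained = match self {
            SessionState::Pending(queue) => queue
                .drain(..)
                .map(|mut e| {
                    e.session_number = session_number;
                    e
                })
                .collect(),
            SessionState::Ready(_) => Vec::new(),
        };
        *self = SessionState::Ready(session_number);
        drained
    }
}
```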
2. No tests for any of the new persistence logic
The PR adds:
- `persist_event()` — fire-and-forget spawn, error logging
- `get_session_number()` — async `RwLock` read
- `persist_context_cleared()` — counter write + persist + stream broadcast
- `register()` — session init + async lookup race
- `unregister()` — session cleanup
- `send_user_message()` — two new persist calls
- `handle_incoming_message()` — six new `persist_event` call sites across three match arms
None of these paths have tests. The existing websocket test suite (mod tests at line 1227) covers connection lifecycle and message routing but has no coverage for the persistence layer.
At minimum the following should be tested before merge:
- `register()` initialises session counter to 0 when storage is absent
- `unregister()` removes the session counter entry
- `persist_context_cleared()` updates the counter and calls `persist_event`
- `send_user_message()` calls `persist_event` twice (`prompt_sent` + `activity_changed`)
- An event persisted in `handle_incoming_message` for each of the three `assistant`/`result` arms
These are all testable with an in-memory SQLite storage instance (the same pattern already used in `storage.rs` tests via `create_test_storage()`).
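A minimal shape for one of those tests, assuming the helper named above (`create_test_storage()` from `storage.rs`) and simplifying the `register()` and `persist_context_cleared()` signatures, which take more arguments in the real handler:

```rust
#[tokio::test]
async fn session_counter_lifecycle() {
    // In-memory SQLite instance, same pattern as the storage.rs tests.
    let storage = create_test_storage().await;
    let registry = ConnectionRegistry::new().with_storage(storage);
    let agent_id = Uuid::new_v4();

    // register() must provide the safe default immediately, before the
    // async storage lookup resolves.
    registry.register(agent_id).await;
    assert_eq!(registry.get_session_number(agent_id).await, 0);

    // persist_context_cleared() advances the in-memory counter.
    registry.persist_context_cleared(agent_id, 1).await;
    assert_eq!(registry.get_session_number(agent_id).await, 1);

    // unregister() removes the entry (no unbounded growth).
    registry.unregister(agent_id).await;
}
```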
Non-blocking observations
3. ActivityChanged events double the per-agent event volume
Every interaction cycle writes two `ActivityChanged` events (busy on `send_user_message`, idle on `result`) on top of the content events. For an active agent producing 20 assistant messages per hour, `ActivityChanged` pairs alone add ~40 rows/hour — roughly 40% of event volume for a typical run. Given the 50k default cap from PR #1168, this reduces effective conversation history retention. Consider whether `ActivityChanged` events need to be persisted at all (vs. just broadcast on the stream), or document the trade-off explicitly.
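If the broadcast-only route were taken, the change is small; in this sketch, `broadcast_stream_event` is a placeholder for whatever the registry's real stream-send helper is called:

```rust
// Sketch: keep activity transitions live on the stream but out of SQLite.
// broadcast_stream_event is hypothetical, standing in for the real helper.
async fn on_activity_changed(registry: &ConnectionRegistry, agent_id: Uuid, busy: bool) {
    registry
        .broadcast_stream_event(agent_id, "agent:activity_changed", busy)
        .await;
    // Intentionally no persist_event() call: the two ActivityChanged rows per
    // interaction cycle would otherwise eat into the 50k event cap (#1168).
}
```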
4. Session semantics at initial connect rely on agent_sessions row count
`register()` initialises the counter to `stats.session_count`, assuming that the count of session rows equals the current session number. This holds as long as the first session is numbered 0 and `clear_context()` is the only thing that creates new session rows. If that invariant ever changes (e.g., seeding, import, a new workflow phase that rotates sessions), the session tagging silently diverges. A more robust approach would be to read `MAX(session_number)` directly from `conversation_events` for the agent, which is always self-consistent.
5. Minor: `(*storage).clone()` in `main.rs`
`AgentStorage` derives `Clone` and wraps a `DatabaseConnection` (SeaORM's Arc-backed pool). The clone shares the underlying connection pool, which is correct. The `(*storage)` deref is needed because `storage` is `Arc<AgentStorage>` — this is fine, just worth a comment that it's intentionally cloning the pool handle, not duplicating the DB connection.
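If a comment is added, something like this would capture the intent (the comment wording and the surrounding construction are suggestions, not the PR's exact code):

```rust
// (*storage) derefs the Arc<AgentStorage>; .clone() then copies the pool
// handle (AgentStorage wraps SeaORM's Arc-backed DatabaseConnection), so
// this shares the connection pool rather than opening a second one.
let registry = ConnectionRegistry::new().with_storage((*storage).clone());
```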
geoffjay
left a comment
Review: feat(orchestrator): persist agent stream events in WebSocket handler
Stack note: This PR is stacked on issue-1159 (PR #1167, needs-rework). It cannot merge until #1167 is fixed. The code review below is independent of that dependency.
🔴 Fix required — session number initialization is off-by-one
websocket.rs, register():
```rust
match storage.get_usage_stats(&agent_id).await {
    Ok(stats) => {
        session_numbers.write().await.insert(agent_id, stats.session_count as i64);
    }
```

`AgentUsageStats::session_count` is documented as "Total number of sessions (including the current one, if any)" — confirmed by the storage tests:
```rust
// After start_new_session:
assert_eq!(stats.session_count, 2); // session 0 ended + session 1 active = 2

// After recording usage (no explicit end):
assert_eq!(stats.session_count, 1); // session 0 active = 1
```

The invariant holds: `session_count = session_number + 1` whenever a current usage session is active. Any agent that has processed at least one message will have an active session, so on reconnect:
| Context clears | Correct `session_number` | `stats.session_count` | Used value |
|---|---|---|---|
| 0 (but has usage) | 0 | 1 | 1 ← wrong |
| 1 | 1 | 2 | 2 ← wrong |
| N | N | N+1 | N+1 ← wrong |
The only case that accidentally works is a brand-new agent with no recorded usage at all (session_count = 0 = session_number).
Consequence: On reconnect, events in the current session get tagged as session_number = N+1 instead of N. The REST API (#1161) and UI (#1162) would show them as belonging to a non-existent session.
Fix — add a storage method that queries `conversation_events` directly:

```rust
// storage.rs — new method on AgentStorage
pub async fn get_max_conversation_session_number(&self, agent_id: Uuid) -> Result<i64> {
    use sea_orm::{ConnectionTrait, Statement, DbBackend};
    let row = self.db.query_one(Statement::from_sql_and_values(
        DbBackend::Sqlite,
        "SELECT COALESCE(MAX(session_number), 0) FROM conversation_events WHERE agent_id = ?",
        [agent_id.to_string().into()],
    )).await?;
    Ok(row.and_then(|r| r.try_get_by_index::<i64>(0).ok()).unwrap_or(0))
}
```

Then in `register()`:
```rust
tokio::spawn(async move {
    match storage.get_max_conversation_session_number(agent_id).await {
        Ok(session) => {
            session_numbers.write().await.insert(agent_id, session);
        }
        Err(e) => {
            warn!(%agent_id, error = %e,
                "Failed to initialize session number from storage; defaulting to 0");
        }
    }
});
```

This is always correct because it reads directly from the same table where session numbers are written — no dependency on usage session semantics.
Alternatively (simpler, no new method), adjust the current approach:

```rust
// session_count includes the current active session, so subtract 1 when active
let session_number = if stats.current_session.is_some() {
    (stats.session_count as i64).saturating_sub(1)
} else {
    stats.session_count as i64
};
session_numbers.write().await.insert(agent_id, session_number);
```

The `get_max_conversation_session_number` approach is preferred — it doesn't rely on the relationship between usage sessions and conversation sessions staying in sync.
✅ What's correct — solid implementation throughout
Fire-and-forget pattern (`persist_event`): spawns a task, logs warn on error, no-op when storage is `None` — exactly right. ✅
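As a point of reference, here is a self-contained sketch of that shape; the types and the `insert_conversation_event` method name are stand-ins, not the PR's actual signatures:

```rust
use tracing::warn;

// Stand-ins for the PR's real types; the method name is hypothetical.
#[derive(Clone)]
struct AgentStorage;
#[derive(Clone)]
struct ConversationEvent;

impl AgentStorage {
    async fn insert_conversation_event(&self, _e: &ConversationEvent) -> Result<(), String> {
        Ok(())
    }
}

fn persist_event(storage: &Option<AgentStorage>, event: ConversationEvent) {
    // No-op when storage is absent (the registry can run without persistence).
    let Some(storage) = storage.clone() else { return };

    // Fire-and-forget: spawn so the WebSocket read loop never blocks on
    // SQLite; errors are logged at warn level and never panic the handler.
    tokio::spawn(async move {
        if let Err(e) = storage.insert_conversation_event(&event).await {
            warn!(error = %e, "Failed to persist conversation event");
        }
    });
}
```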
Session counter lifecycle:
- `new()`: initialised to empty `HashMap` ✅
- `register()`: `or_insert(0)` provides an immediate safe default ✅
- `unregister()`: `remove(agent_id)` prevents unbounded growth ✅
- `persist_context_cleared()`: `insert(agent_id, new_session_number)` updates the counter before persisting the event ✅
`handle_incoming_message()` — session read strategy: the single `let session = registry.get_session_number(agent_id).await` at the top of the NDJSON loop means all events from one message batch share one session number and one lock acquisition. ✅
Event coverage vs acceptance criteria in #1160:
- `output` (per non-empty line) ✅
- `tool_use` (with full `tool_use` JSON as metadata) ✅
- `thinking` ✅
- `result` (with usage metadata including `is_error`) ✅
- `prompt_sent` + `activity_changed` (busy) from `send_user_message` ✅
- `context_cleared` via `persist_context_cleared` ✅
- `activity_changed` (idle) from result arm ✅
Fixed the `session_number: 0` hardcode in the `agent:usage_update` stream event (flagged in pre-implementation analysis) ✅

`main.rs`: `(*storage).clone()` correctly dereferences the `Arc<AgentStorage>` before cloning the inner value (which is `Clone` via `DatabaseConnection`). ✅

`manager.rs`: `persist_context_cleared` is called after `storage.end_session()` completes and before the API returns — no race possible since the user cannot send a message while `clear_context` is in-flight. ✅

New `agent:context_cleared` stream event in `persist_context_cleared`: additive change, existing UI clients ignore unknown event types safely. ✅
This change is part of the following stack. Change managed by git-spice.
Address two blocking issues from PR review:

1. Session number off-by-one on reconnect - replace `get_usage_stats().session_count` (which is N+1 when active) with a new `get_max_conversation_session_number()` storage method that queries `MAX(session_number)` directly from `conversation_events`. This is always correct regardless of usage-session semantics.
2. Add persistence tests - 9 new tests covering: session counter init without storage, session counter removed on unregister, storage-backed session init reads `MAX(session_number)`, `persist_context_cleared` updates counter and writes event, `send_user_message` persists `prompt_sent` + `activity_changed`, `handle_incoming_message` persists `output`/`tool_use`/`thinking`/`result` events.
3. Fix race condition in `register()`: async storage init now uses `max(storage, current)` to avoid overwriting a newer value set by `persist_context_cleared` between `register()` and the spawned lookup completing.
4. Add `#[allow(dead_code)]` to storage query helpers and the `ConversationQuery` struct that are used only in test code (clippy `-D warnings`).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
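Item 3 corresponds to a guard along these lines in the spawned lookup (a sketch against the registry fields this PR adds, not the verbatim diff):

```rust
tokio::spawn(async move {
    match storage.get_max_conversation_session_number(agent_id).await {
        Ok(from_storage) => {
            let mut numbers = session_numbers.write().await;
            // persist_context_cleared() may have advanced the counter while
            // this lookup was in flight; never move the counter backwards.
            let entry = numbers.entry(agent_id).or_insert(0);
            *entry = (*entry).max(from_storage);
        }
        Err(e) => {
            warn!(%agent_id, error = %e, "Failed to initialize session number");
        }
    }
});
```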
Wires `AgentStorage` into `ConnectionRegistry` so that every agent event flowing through the WebSocket handler is persisted to SQLite as a `conversation_event` row, without blocking the WebSocket read loop.

What changed

`websocket.rs`
- `storage: Option<AgentStorage>` and `session_numbers: Arc<RwLock<HashMap<Uuid, i64>>>` added to `ConnectionRegistry`
- `with_storage(storage)` — builder method to attach storage at startup
- `persist_event()` — fire-and-forget helper; spawns a task, logs warn on error, never panics
- `get_session_number()` — reads the in-memory per-agent session counter
- `persist_context_cleared()` — updates the session counter, persists the event, and broadcasts `agent:context_cleared` on the stream
- `register()` — initialises session counter to 0 then fires an async lookup to sync from storage
- `unregister()` — cleans up the session counter entry
- `send_user_message()` — persists `prompt_sent` + `activity_changed` (busy)
- `handle_incoming_message()` — persists `output` (per line), `tool_use`, `thinking`, `result` with usage metadata, `activity_changed` (idle); also fixes the hardcoded `session_number: 0` in the `usage_update` stream event

`manager.rs`
- `clear_context()` calls `registry.persist_context_cleared()` after advancing the storage session

`main.rs`
- `.with_storage((*storage).clone())`

Event types persisted
| Source | Events persisted |
|---|---|
| `assistant` | `output` (per line), `tool_use`, `thinking` |
| `result` | `result` (with usage metadata), `activity_changed` (idle) |
| `send_user_message` | `prompt_sent`, `activity_changed` (busy) |
| `clear_context` | `context_cleared` |
| `system`, `control_request`, `keep_alive` | (none) |

Closes #1160