perf: caching, batched DM resolution, bounded audit, global kind index by tlongwell-block · Pull Request #367 · block/sprout

tlongwell-block · 2026-04-20T15:12:14Z

Summary

Performance improvements from staging profiling (2026-04-17). All findings from the profiling plan are addressed — Phases 1 through 3.

Projected impact:

GET /api/channels: 329ms → ~35ms (batch DM resolution)
Per-request DB queries: ~43 → ~5
channels seq scans: >50% reduction
Global fan-out: O(all_subs) → O(matching_kind_subs)
DB connections held: 57/100 → ~22/100 (room for 4 relay pods)

Changes

Caching (#1, #2)

Wire membership_cache (moka, 10s TTL, 10k cap) into all 10 is_member() call sites with cache-aside pattern.
Add accessible_channels_cache for get_accessible_channel_ids() at 3 call sites (REQ handler, /api/feed, /api/search).
Invalidate on all mutation paths: add_member, remove_member, channel create/delete, DM create/expand, compensation delete, audio auto-join.
Multi-pod: other pods rely on TTL expiry (documented in code).
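
The cache-aside flow described above can be sketched roughly as follows. This is a minimal std-only illustration, not the PR's code: the real change uses moka, and `MembershipCache`, its `load` closure, and the `u64` ids are hypothetical stand-ins. The 10k capacity bound is not modeled here.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the moka cache: (channel, user) -> (answer, stored_at).
struct MembershipCache {
    ttl: Duration,
    entries: HashMap<(u64, u64), (bool, Instant)>,
}

impl MembershipCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Cache-aside read: serve a fresh entry, otherwise run `load`
    /// (the DB query in the real code) and store its result.
    fn is_member(&mut self, channel: u64, user: u64, load: impl FnOnce() -> bool) -> bool {
        let key = (channel, user);
        if let Some((value, stored_at)) = self.entries.get(&key) {
            if stored_at.elapsed() < self.ttl {
                return *value;
            }
        }
        let value = load();
        self.entries.insert(key, (value, Instant::now()));
        value
    }

    /// Called from each mutation path so this pod does not serve a stale answer.
    /// Other pods are not notified and simply wait for the TTL to expire.
    fn invalidate(&mut self, channel: u64, user: u64) {
        self.entries.remove(&(channel, user));
    }
}
```

With a 10-second TTL, a second `is_member` call for the same pair skips the loader, and a call after `invalidate` runs it again.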

Batch DM resolution (#3)

Add get_members_bulk(channel_ids) using WHERE channel_id = ANY($1).
Rewrite /api/channels to resolve all DM participants in 2 queries (one get_members_bulk + one get_users_bulk) instead of 2×N_DMs.
Remove per-DM resolve_dm_participants() function.
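
To illustrate why two queries are enough, here is a hedged sketch of the in-memory step that would follow a bulk membership query. The row shape `(channel_id, user_id)` and the helper names `group_members` and `distinct_users` are assumptions for illustration; the actual return type of `get_members_bulk` is not shown in this description.

```rust
use std::collections::HashMap;

/// Assumed shape of one row from the bulk membership query: (channel_id, user_id).
type MemberRow = (u64, u64);

/// Group the rows of a single `WHERE channel_id = ANY($1)` result by channel,
/// so each DM finds its participants without another round trip.
fn group_members(rows: &[MemberRow]) -> HashMap<u64, Vec<u64>> {
    let mut by_channel: HashMap<u64, Vec<u64>> = HashMap::new();
    for &(channel_id, user_id) in rows {
        by_channel.entry(channel_id).or_default().push(user_id);
    }
    by_channel
}

/// Collect the distinct user ids so one bulk user lookup covers every DM.
fn distinct_users(rows: &[MemberRow]) -> Vec<u64> {
    let mut ids: Vec<u64> = rows.iter().map(|&(_, user_id)| user_id).collect();
    ids.sort_unstable();
    ids.dedup();
    ids
}
```

The query count stays at two regardless of how many DMs the caller has; only the size of the array parameter grows.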

Bounded audit (#7)

Replace unbounded tokio::spawn(audit.log()) with bounded mpsc channel (capacity 1000) + single worker task.
Uses .send().await for backpressure — audit entries must not be silently dropped (SOX-grade tamper-evident chain).
Migrate media upload audit from unbounded spawn to audit_tx too.
Add sprout_audit_log_errors_total counter for DB write failures.
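
The bounded-queue pattern can be sketched with the standard library alone. This is an analogy under stated assumptions, not the PR's implementation: it uses `std::sync::mpsc::sync_channel` and a thread where the PR uses a tokio mpsc channel and a task, and `AuditEntry` and `spawn_audit_worker` are hypothetical names. A blocking `send` on a full queue plays the role of `.send().await`.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical audit entry; the real type is not shown in this description.
struct AuditEntry {
    action: String,
}

/// Start one worker that owns the receiving end of a bounded queue.
/// Returns the sender plus a handle that yields how many entries were processed.
fn spawn_audit_worker(capacity: usize) -> (SyncSender<AuditEntry>, JoinHandle<usize>) {
    let (tx, rx) = sync_channel::<AuditEntry>(capacity);
    let worker = thread::spawn(move || {
        let mut written = 0;
        // The loop ends once every sender is dropped and the buffer is empty.
        for entry in rx {
            // Placeholder for the DB write done by the real worker.
            let _ = entry.action.len();
            written += 1;
        }
        written
    });
    (tx, worker)
}
```

Because `send` waits for space instead of failing, a slow audit database slows producers down rather than dropping entries or spawning unbounded work.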

Graceful audit drain on shutdown

AppState::new() returns (Self, AuditShutdownHandle) so the caller can drain the audit queue during graceful shutdown.
AuditShutdownHandle owns a CancellationToken + JoinHandle. On drain: cancel fires → worker calls audit_rx.close() (atomically rejects future sends) → drains buffered entries via recv().await → exits.
Independent of Arc<AppState> lifetime — works even when background tasks (reaper, pubsub, health server) still hold state clones.
5-second timeout prevents hanging on a stuck audit DB.
Extracted log_audit_entry() helper shared by normal loop and drain loop.
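
The close-then-drain sequence can be approximated in std-only Rust. This sketch swaps in a different mechanism from the PR: instead of a `CancellationToken` and `audit_rx.close()`, it guards the sender with a mutex and drops it to reject new entries. `AuditGate`, `ShutdownHandle`, and `start_audit` are hypothetical names.

```rust
use std::sync::mpsc::{channel, sync_channel, Receiver, SyncSender};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Shared send side. `None` means the queue is closed and new entries are rejected.
#[derive(Clone)]
struct AuditGate(Arc<Mutex<Option<SyncSender<String>>>>);

impl AuditGate {
    /// Returns false once the queue has been closed.
    fn send(&self, entry: String) -> bool {
        // Clone the sender so the lock is not held while a full queue blocks.
        let tx = self.0.lock().unwrap().clone();
        match tx {
            Some(tx) => tx.send(entry).is_ok(),
            None => false,
        }
    }
}

/// Hypothetical counterpart of AuditShutdownHandle.
struct ShutdownHandle {
    gate: AuditGate,
    done: Receiver<usize>,
}

impl ShutdownHandle {
    /// Close the queue, then wait up to `timeout` for the worker to finish.
    /// Returns the number of entries processed, or None on timeout.
    fn drain(self, timeout: Duration) -> Option<usize> {
        // Dropping the stored sender lets the worker exit after the buffer empties.
        drop(self.gate.0.lock().unwrap().take());
        self.done.recv_timeout(timeout).ok()
    }
}

fn start_audit(capacity: usize) -> (AuditGate, ShutdownHandle) {
    let (tx, rx) = sync_channel::<String>(capacity);
    let (done_tx, done_rx) = channel();
    thread::spawn(move || {
        // Yields buffered entries until all senders are gone; counting stands in
        // for the per-entry DB write.
        let written = rx.into_iter().count();
        let _ = done_tx.send(written);
    });
    let gate = AuditGate(Arc::new(Mutex::new(Some(tx))));
    (gate.clone(), ShutdownHandle { gate, done: done_rx })
}
```

As in the PR's design, the handle works independently of how many clones of the send side are still alive, and the timeout bounds how long shutdown can wait on a stuck writer.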

Global kind index (#6)

Add global_kind_index and global_wildcard_index to SubscriptionRegistry for sub-linear fan-out on global events.
Preserves channel/global scoping invariant (no behavior change).
4 new tests: kind routing, wildcard routing, removal cleanup, channel/global isolation.
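
A simplified picture of the two indexes, assuming event kinds are small integers and subscriptions are identified by a numeric id. `GlobalIndex`, `SubId`, and the method names are hypothetical; the real `SubscriptionRegistry` also tracks channel-scoped subscriptions, which are left out here.

```rust
use std::collections::{HashMap, HashSet};

type SubId = u64;

/// Hypothetical, simplified registry covering global subscriptions only.
#[derive(Default)]
struct GlobalIndex {
    /// kind -> subscriptions whose filter names that kind.
    by_kind: HashMap<u32, HashSet<SubId>>,
    /// Subscriptions with no kind filter; they match every global event.
    wildcard: HashSet<SubId>,
}

impl GlobalIndex {
    fn add(&mut self, sub: SubId, kinds: &[u32]) {
        if kinds.is_empty() {
            self.wildcard.insert(sub);
        }
        for &kind in kinds {
            self.by_kind.entry(kind).or_default().insert(sub);
        }
    }

    fn remove(&mut self, sub: SubId) {
        self.wildcard.remove(&sub);
        // Drop buckets that become empty so removed subscriptions leave nothing behind.
        self.by_kind.retain(|_, subs| {
            subs.remove(&sub);
            !subs.is_empty()
        });
    }

    /// Candidates for an event of `kind`: that kind's bucket plus all wildcards,
    /// returned in ascending id order.
    fn matching(&self, kind: u32) -> Vec<SubId> {
        let mut out: Vec<SubId> = self.wildcard.iter().copied().collect();
        if let Some(subs) = self.by_kind.get(&kind) {
            out.extend(subs.iter().copied());
        }
        out.sort_unstable();
        out
    }
}
```

Fan-out then visits only the subscriptions returned by `matching`, instead of testing every registered subscription against the event.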

Pool sizing (#8)

Main pool: max 50→20, min 5→2. Audit pool: max=5, min=1.
Evidence: staging measured 51 idle + 1 active out of 50 — most connections sat unused.

Observability (#9)

Wire sprout_fanout_recipients histogram at all 4 fan_out() sites.

Testing

209 unit tests pass (205 existing + 4 new global index tests)
cargo clippy clean, cargo fmt clean
Crossfire reviewed by codex CLI and opus subagent. Audit drain went through 3 rounds of codex review (4/10 → 6/10 → 9/10) fixing Arc lifetime and late-send race issues.

What's NOT in this PR

PG config tuning (#4): infrastructure, applied at runtime
Redis config hardening (#5): infrastructure, applied at runtime
Multi-pod cache invalidation — future work when CNPG/PgBouncer is deployed

Performance improvements from staging profiling (2026-04-17). Projects GET /api/channels from 329ms to ~35ms, reduces per-request DB queries from ~43 to ~5, and cuts channels seq scans >50%.

Caching (#1, #2)

- Wire membership_cache (moka, 10s TTL, 10k cap) into all 10 is_member() call sites with cache-aside pattern.
- Add accessible_channels_cache for get_accessible_channel_ids() at 3 call sites (REQ handler, /api/feed, /api/search).
- Invalidate on all mutation paths: add_member, remove_member, channel create/delete, DM create/expand, compensation delete, audio auto-join. Multi-pod relies on TTL expiry (documented).

Batch DM resolution (#3)

- Add get_members_bulk(channel_ids) using WHERE channel_id = ANY($1).
- Rewrite /api/channels to resolve all DM participants in 2 queries (one get_members_bulk + one get_users_bulk) instead of 2xN_DMs.
- Remove per-DM resolve_dm_participants() function.

Bounded audit (#7)

- Replace unbounded tokio::spawn(audit.log()) with bounded mpsc channel (capacity 1000) + single worker task in AppState.
- Uses .send().await for backpressure; audit entries must not be silently dropped (SOX-grade tamper-evident chain).
- Migrate media upload audit from unbounded spawn to audit_tx.
- Add sprout_audit_log_errors_total counter for DB write failures.

Global kind index (#6)

- Add global_kind_index and global_wildcard_index to SubscriptionRegistry for sub-linear fan-out on global events.
- Fan-out goes from O(all_subs) to O(matching_kind_subs).
- Preserves channel/global scoping invariant (no behavior change).
- Add 4 tests: kind routing, wildcard routing, removal cleanup, channel/global isolation.

Pool sizing (#8)

- Main pool: max 50->20, min 5->2. Audit pool: max=5, min=1.
- Frees connections for multi-pod (4 pods x 25 = 100 <= PG limit).

Observability (#9)

- Wire sprout_fanout_recipients histogram at all 4 fan_out() sites.

AppState::new() now returns (Self, AuditShutdownHandle). The handle owns a CancellationToken that signals the audit worker to stop accepting new entries, close the receiver, drain buffered entries via recv().await, and exit. This is independent of Arc<AppState> lifetime — works correctly even when background tasks (reaper, pubsub, health server) still hold state clones after axum's graceful shutdown completes. Closing the receiver (audit_rx.close()) rejects future sends atomically, so no entries are lost between the cancel signal and the drain loop. Sequence: SIGTERM → readiness 503 → axum drains connections → main() calls audit_shutdown.drain(5s) → cancel token fires → worker closes receiver → drains buffered entries → exit. 5-second timeout prevents hanging on a stuck audit DB.

…-binding

* origin/main:
  fix(desktop): eliminate agent startup beachball (#374)
  fix(desktop): resolve agent command path for DMG builds (#372)
  fix(desktop): remove stale sprout-admin prereq, add sidecar tooling (#371)
  Add server cross-compile and macOS desktop build CI jobs (#369)
  Fix forum post card bugs on desktop and mobile (#370)
  fix(desktop): kill WebSocket flood and fix Markdown <p><div> nesting (#368)
  perf: caching, batched DM resolution, bounded audit, global kind index (#367)
  fix: staging to generate stubs as needed (#366)
  chore(deps): update rust crate axum to v0.8.9 (#365)
  chore(deps): update dependency @tanstack/react-router to v1.168.22 (#364)
  feat(desktop): autoscroll thread sidebar for new replies (#363)
  fix(desktop): eliminate 10+ second UI freeze on startup (#361)
  feat(desktop): bundle sprout-acp and sprout-mcp-server as Tauri sidecars (#362)
  Remove release pipeline from public repo (#360)

Amp-Thread-ID: https://ampcode.com/threads/T-019dab7a-5979-7401-83a1-509b9adfe4a0
Co-authored-by: Amp <amp@ampcode.com>

# Conflicts:
#	crates/sprout-relay/src/state.rs

tlongwell-block requested a review from wesbillman as a code owner April 20, 2026 15:12

tlongwell-block merged commit b925008 into main Apr 20, 2026
10 checks passed

tlongwell-block deleted the perf/phase1-caching-batching-backpressure branch April 20, 2026 16:59

tlongwell-block merged 2 commits into main from perf/phase1-caching-batching-backpressure
