perf: caching, batched DM resolution, bounded audit, global kind index#367

Merged
tlongwell-block merged 2 commits into main from perf/phase1-caching-batching-backpressure
Apr 20, 2026
@tlongwell-block (Collaborator) commented Apr 20, 2026

Summary

Performance improvements from staging profiling (2026-04-17). All findings from the profiling plan are addressed — Phases 1 through 3.

Projected impact:

  • GET /api/channels: 329ms → ~35ms (batch DM resolution)
  • Per-request DB queries: ~43 → ~5
  • channels seq scans: >50% reduction
  • Global fan-out: O(all_subs) → O(matching_kind_subs)
  • DB connections held: 57/100 → ~22/100 (room for 4 relay pods)

Changes

Caching (#1, #2)

  • Wire membership_cache (moka, 10s TTL, 10k cap) into all 10 is_member() call sites with cache-aside pattern.
  • Add accessible_channels_cache for get_accessible_channel_ids() at 3 call sites (REQ handler, /api/feed, /api/search).
  • Invalidate on all mutation paths: add_member, remove_member, channel create/delete, DM create/expand, compensation delete, audio auto-join.
  • Multi-pod: other pods rely on TTL expiry (documented in code).
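
The cache-aside pattern described above can be sketched roughly as follows. This is a std-only stand-in, not the moka-based code from the PR: `TtlCache`, `get_or_load`, and `invalidate` are hypothetical names, the `(channel_id, user_id)` key shape is an assumption, and the 10k capacity bound is not modeled.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

/// Hypothetical std-only stand-in for a TTL cache such as moka.
/// Unlike the cache in the PR, it has no capacity bound.
pub struct TtlCache<K, V> {
    ttl: Duration,
    entries: HashMap<K, (V, Instant)>,
}

impl<K: Eq + Hash, V: Clone> TtlCache<K, V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Cache-aside read: return the cached value while it is fresh,
    /// otherwise call `load` (standing in for the DB query) and store the result.
    pub fn get_or_load<F: FnOnce() -> V>(&mut self, key: K, load: F) -> V {
        if let Some((value, stored_at)) = self.entries.get(&key) {
            if stored_at.elapsed() < self.ttl {
                return value.clone();
            }
        }
        let value = load();
        self.entries.insert(key, (value.clone(), Instant::now()));
        value
    }

    /// Called from mutation paths (for example add_member or remove_member)
    /// so the next read on this pod goes back to the database.
    pub fn invalidate(&mut self, key: &K) {
        self.entries.remove(key);
    }
}
```

Invalidation here only affects the local cache, which matches the note above that other pods wait for TTL expiry: a membership change can be stale elsewhere for up to the TTL.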

Batch DM resolution (#3)

  • Add get_members_bulk(channel_ids) using WHERE channel_id = ANY($1).
  • Rewrite /api/channels to resolve all DM participants in 2 queries (one get_members_bulk + one get_users_bulk) instead of 2×N_DMs.
  • Remove per-DM resolve_dm_participants() function.
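
As an illustration of the batching step, the flat rows returned by a single `WHERE channel_id = ANY($1)` query can be grouped in memory instead of querying once per DM. The sketch below assumes a `(channel_id, user_id)` row shape and uses `u64` ids for brevity; `MemberRow` and `group_members` are hypothetical names, and the real `get_members_bulk` signature is not shown in this description.

```rust
use std::collections::HashMap;

/// Hypothetical row shape for the bulk membership query: (channel_id, user_id).
pub type MemberRow = (u64, u64);

/// Group the flat result of one bulk query into per-channel member lists.
pub fn group_members(channel_ids: &[u64], rows: &[MemberRow]) -> HashMap<u64, Vec<u64>> {
    // Seed every requested channel so a channel with no rows maps to an empty list.
    let mut by_channel: HashMap<u64, Vec<u64>> =
        channel_ids.iter().map(|id| (*id, Vec::new())).collect();
    for &(channel_id, user_id) in rows {
        by_channel.entry(channel_id).or_default().push(user_id);
    }
    by_channel
}
```

With this shape, the number of round trips stays at two (members, then users) regardless of how many DMs the caller has, rather than growing as 2×N_DMs.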

Bounded audit (#7)

  • Replace unbounded tokio::spawn(audit.log()) with bounded mpsc channel (capacity 1000) + single worker task.
  • Uses .send().await for backpressure — audit entries must not be silently dropped (SOX-grade tamper-evident chain).
  • Migrate media upload audit from unbounded spawn to audit_tx too.
  • Add sprout_audit_log_errors_total counter for DB write failures.
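
The bounded-queue-plus-single-worker shape can be sketched with the standard library alone. This is an analogy, not the PR's code: the PR uses a tokio mpsc channel with `.send().await`, whereas this sketch uses `std::sync::mpsc::sync_channel`, whose `send` blocks the caller when the queue is full. `AuditEntry` and `spawn_audit_worker` are hypothetical names, and the `Vec` stands in for the audit DB write.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical audit entry; the real type is not shown in this PR description.
#[derive(Debug, Clone, PartialEq)]
pub struct AuditEntry(pub String);

/// Spawn one worker that consumes a bounded queue in order.
/// Returns the sender and a handle yielding everything the worker "wrote".
pub fn spawn_audit_worker(
    capacity: usize,
) -> (SyncSender<AuditEntry>, JoinHandle<Vec<AuditEntry>>) {
    let (tx, rx) = sync_channel::<AuditEntry>(capacity);
    let worker = thread::spawn(move || {
        let mut written = Vec::new();
        // The loop ends once every sender is dropped and the queue is empty.
        for entry in rx {
            written.push(entry); // stand-in for the audit DB write
        }
        written
    });
    (tx, worker)
}
```

Because a full queue makes producers wait instead of failing, entries are delayed under load but never dropped, and a single consumer keeps them in send order, which is what a chained audit log needs.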

Graceful audit drain on shutdown

  • AppState::new() returns (Self, AuditShutdownHandle) so the caller can drain the audit queue during graceful shutdown.
  • AuditShutdownHandle owns a CancellationToken + JoinHandle. On drain: cancel fires → worker calls audit_rx.close() (atomically rejects future sends) → drains buffered entries via recv().await → exits.
  • Independent of Arc<AppState> lifetime — works even when background tasks (reaper, pubsub, health server) still hold state clones.
  • 5-second timeout prevents hanging on a stuck audit DB.
  • Extracted log_audit_entry() helper shared by normal loop and drain loop.
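
The "close, then drain" ordering is the part that avoids the late-send race. The sketch below illustrates the invariant with a hypothetical mutex-guarded queue (`ClosableQueue` is not a type from the PR). It collapses tokio's `audit_rx.close()` followed by `recv().await` into one locked step, and it does not model the cancellation token or the 5-second timeout.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical queue illustrating the close-then-drain invariant.
pub struct ClosableQueue<T> {
    inner: Mutex<Inner<T>>,
}

struct Inner<T> {
    closed: bool,
    buffered: VecDeque<T>,
}

impl<T> ClosableQueue<T> {
    pub fn new() -> Self {
        Self { inner: Mutex::new(Inner { closed: false, buffered: VecDeque::new() }) }
    }

    /// Enqueue an entry, or hand it back to the caller if the queue is closed.
    pub fn send(&self, entry: T) -> Result<(), T> {
        let mut inner = self.inner.lock().unwrap();
        if inner.closed {
            return Err(entry);
        }
        inner.buffered.push_back(entry);
        Ok(())
    }

    /// Mark the queue closed and take everything buffered under one lock,
    /// so no send can slip in between the close and the drain.
    pub fn close_and_drain(&self) -> Vec<T> {
        let mut inner = self.inner.lock().unwrap();
        inner.closed = true;
        inner.buffered.drain(..).collect()
    }
}
```

A sender that loses the race gets its entry back as an error rather than having it silently discarded, so the caller can decide how to report it.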

Global kind index (#6)

  • Add global_kind_index and global_wildcard_index to SubscriptionRegistry for sub-linear fan-out on global events.
  • Preserves channel/global scoping invariant (no behavior change).
  • 4 new tests: kind routing, wildcard routing, removal cleanup, channel/global isolation.
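
A minimal sketch of the indexing idea, assuming subscriptions are identified by a `u64` and event kinds by a `u32`. `GlobalIndex` and its methods are hypothetical; the real `SubscriptionRegistry` also tracks channel-scoped subscriptions, which are left out here, and treating an empty kind list as a wildcard is an assumption of this sketch.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified index covering global subscriptions only.
#[derive(Default)]
pub struct GlobalIndex {
    by_kind: HashMap<u32, HashSet<u64>>, // event kind -> subscription ids
    wildcard: HashSet<u64>,              // subscriptions with no kind filter
}

impl GlobalIndex {
    /// Register a subscription; an empty `kinds` slice means "all kinds".
    pub fn add(&mut self, sub_id: u64, kinds: &[u32]) {
        if kinds.is_empty() {
            self.wildcard.insert(sub_id);
            return;
        }
        for kind in kinds {
            self.by_kind.entry(*kind).or_default().insert(sub_id);
        }
    }

    /// Remove a subscription and drop any kind bucket it leaves empty.
    pub fn remove(&mut self, sub_id: u64) {
        self.wildcard.remove(&sub_id);
        self.by_kind.retain(|_, subs| {
            subs.remove(&sub_id);
            !subs.is_empty()
        });
    }

    /// Fan-out candidates for an event: its kind bucket plus every wildcard subscription.
    pub fn candidates(&self, kind: u32) -> HashSet<u64> {
        let mut out = self.wildcard.clone();
        if let Some(subs) = self.by_kind.get(&kind) {
            out.extend(subs);
        }
        out
    }

    /// Number of non-empty kind buckets, useful for checking removal cleanup.
    pub fn kind_bucket_count(&self) -> usize {
        self.by_kind.len()
    }
}
```

Lookup cost depends on the size of the matching bucket and the wildcard set rather than on the total number of subscriptions. Removal in this sketch scans every bucket for simplicity; a production index would more likely remember which kinds each subscription registered.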

Pool sizing (#8)

  • Main pool: max 50→20, min 5→2. Audit pool: max=5, min=1.
  • Evidence: staging measured 51 idle + 1 active out of 50 — most connections sat unused.
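
The pool sizes follow from a simple worst-case budget, assuming every pod fills both of its pools and Postgres allows 100 connections (the limit implied by the "/100" figures in the summary). `max_connections` is a hypothetical helper written only to show the arithmetic.

```rust
/// Worst-case Postgres connections if every pod fills both of its pools.
pub const fn max_connections(pods: u32, main_pool_max: u32, audit_pool_max: u32) -> u32 {
    pods * (main_pool_max + audit_pool_max)
}
```

With the new limits, 4 pods × (20 + 5) = 100 connections, which just fits. With the old main pool maximum of 50, four pods could request 200 connections from the main pool alone.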

Observability (#9)

  • Wire sprout_fanout_recipients histogram at all 4 fan_out() sites.

Testing

  • 209 unit tests pass (205 existing + 4 new global index tests)
  • cargo clippy clean, cargo fmt clean
  • Crossfire reviewed by codex CLI and opus subagent. Audit drain went through 3 rounds of codex review (4/10 → 6/10 → 9/10) fixing Arc lifetime and late-send race issues.

What's NOT in this PR

Performance improvements from staging profiling (2026-04-17). Projects
GET /api/channels from 329ms to ~35ms, reduces per-request DB queries
from ~43 to ~5, and cuts channels seq scans >50%.

Caching (#1, #2)
- Wire membership_cache (moka, 10s TTL, 10k cap) into all 10
  is_member() call sites with cache-aside pattern.
- Add accessible_channels_cache for get_accessible_channel_ids() at
  3 call sites (REQ handler, /api/feed, /api/search).
- Invalidate on all mutation paths: add_member, remove_member,
  channel create/delete, DM create/expand, compensation delete,
  audio auto-join. Multi-pod relies on TTL expiry (documented).

Batch DM resolution (#3)
- Add get_members_bulk(channel_ids) using WHERE channel_id = ANY($1).
- Rewrite /api/channels to resolve all DM participants in 2 queries
  (one get_members_bulk + one get_users_bulk) instead of 2xN_DMs.
- Remove per-DM resolve_dm_participants() function.

Bounded audit (#7)
- Replace unbounded tokio::spawn(audit.log()) with bounded mpsc
  channel (capacity 1000) + single worker task in AppState.
- Uses .send().await for backpressure — audit entries must not be
  silently dropped (SOX-grade tamper-evident chain).
- Migrate media upload audit from unbounded spawn to audit_tx.
- Add sprout_audit_log_errors_total counter for DB write failures.

Global kind index (#6)
- Add global_kind_index and global_wildcard_index to
  SubscriptionRegistry for sub-linear fan-out on global events.
- Fan-out goes from O(all_subs) to O(matching_kind_subs).
- Preserves channel/global scoping invariant (no behavior change).
- Add 4 tests: kind routing, wildcard routing, removal cleanup,
  channel/global isolation.

Pool sizing (#8)
- Main pool: max 50->20, min 5->2. Audit pool: max=5, min=1.
- Frees connections for multi-pod (4 pods x 25 = 100 <= PG limit).

Observability (#9)
- Wire sprout_fanout_recipients histogram at all 4 fan_out() sites.

AppState::new() now returns (Self, AuditShutdownHandle). The handle
owns a CancellationToken that signals the audit worker to stop
accepting new entries, close the receiver, drain buffered entries
via recv().await, and exit.

This is independent of Arc<AppState> lifetime — works correctly even
when background tasks (reaper, pubsub, health server) still hold
state clones after axum's graceful shutdown completes. Closing the
receiver (audit_rx.close()) rejects future sends atomically, so no
entries are lost between the cancel signal and the drain loop.

Sequence: SIGTERM → readiness 503 → axum drains connections →
main() calls audit_shutdown.drain(5s) → cancel token fires →
worker closes receiver → drains buffered entries → exit.
5-second timeout prevents hanging on a stuck audit DB.
@tlongwell-block tlongwell-block merged commit b925008 into main Apr 20, 2026
10 checks passed
@tlongwell-block tlongwell-block deleted the perf/phase1-caching-batching-backpressure branch April 20, 2026 16:59
fsola-sq added a commit that referenced this pull request Apr 20, 2026
…-binding

* origin/main:
  fix(desktop): eliminate agent startup beachball (#374)
  fix(desktop): resolve agent command path for DMG builds (#372)
  fix(desktop): remove stale sprout-admin prereq, add sidecar tooling (#371)
  Add server cross-compile and macOS desktop build CI jobs (#369)
  Fix forum post card bugs on desktop and mobile (#370)
  fix(desktop): kill WebSocket flood and fix Markdown <p><div> nesting (#368)
  perf: caching, batched DM resolution, bounded audit, global kind index (#367)
  fix: staging to generate stubs as needed (#366)
  chore(deps): update rust crate axum to v0.8.9 (#365)
  chore(deps): update dependency @tanstack/react-router to v1.168.22 (#364)
  feat(desktop): autoscroll thread sidebar for new replies (#363)
  fix(desktop): eliminate 10+ second UI freeze on startup (#361)
  feat(desktop): bundle sprout-acp and sprout-mcp-server as Tauri sidecars (#362)
  Remove release pipeline from public repo (#360)

Amp-Thread-ID: https://ampcode.com/threads/T-019dab7a-5979-7401-83a1-509b9adfe4a0
Co-authored-by: Amp <amp@ampcode.com>

# Conflicts:
#	crates/sprout-relay/src/state.rs