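TL;DR
External incident report (kalshi-pulse, 2026-05-21, ongoing 7-day outage): beava HTTP becomes unresponsive for 60-90s on a ~60s cycle, correlated 1:1 with snapshot file mtime. POST /ping times out at the docker healthcheck → restart loop → entire downstream pipeline frozen.
Confirmed root cause (mapped to code)
1. Snapshot encoding holds state lock + deep-clones entire state (P0)
crates/beava-server/src/snapshot_task.rs::do_snapshot, ~lines 91-101: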
    let body = {
        let registry_snap = app_state.dev_agg.registry.snapshot();
        let tables = app_state.dev_agg.state_tables.lock(); // ← parking_lot Mutex held
        SnapshotBody::from_live(&registry_snap, &tables, ...) // ← full deep-clone
    };
SnapshotBody::from_live (crates/beava-core/src/snapshot_body.rs:115-138) explicitly requires the lock and does:
    let entries: Vec<(EntityKey, Vec<AggOp>)> = table
        .iter_sorted()
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
For 507 MB of state (~6M entries × 80B AggOp post-Phase-12.9 boxing) this is multi-second under the lock.
Cascade onto /ping: the apply thread (mio data plane, single-threaded per CLAUDE.md §mio-only Hot-Path Invariant) acquires the same lock on every push (apply_shard.rs:1025, 1413, 1538). While the snapshot holds the lock, the next push parks the apply thread, and every queued request is FIFO-queued behind it, including /ping, which doesn't touch the lock. /ping's constant-time handler (apply_shard.rs:731) cannot run until the snapshot releases the lock.
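One way to keep the deep-clone out of the critical section is to publish the state behind an Arc and let the snapshot take only a pointer clone under the lock. The sketch below is a minimal illustration, not beava's code: EntityKey, AggOp, Table, and SharedState are hypothetical stand-ins, and a std Mutex around the Arc stands in for a true atomic swap.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for the real beava types.
type EntityKey = String;
type AggOp = u64;
type Table = BTreeMap<EntityKey, Vec<AggOp>>;

/// Live state published as an immutable `Arc`; the mutex guards only the pointer.
pub struct SharedState {
    current: Mutex<Arc<Table>>,
}

impl SharedState {
    pub fn new() -> Self {
        Self { current: Mutex::new(Arc::new(Table::new())) }
    }

    /// Apply-thread side: copy-on-write update of the published table.
    pub fn push(&self, key: &str, op: AggOp) {
        let mut guard = self.current.lock().unwrap();
        // `make_mut` mutates in place when no snapshot holds the Arc,
        // and clones the table first when one does.
        Arc::make_mut(&mut *guard)
            .entry(key.to_string())
            .or_default()
            .push(op);
    }

    /// Snapshot side: the lock is held only long enough to clone a pointer.
    pub fn snapshot_handle(&self) -> Arc<Table> {
        let guard = self.current.lock().unwrap();
        Arc::clone(&*guard)
    }
}

/// The expensive deep copy, run with no lock held.
pub fn encode(snapshot: &Table) -> Vec<(EntityKey, Vec<AggOp>)> {
    snapshot.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
}
```

The snapshot task would call snapshot_handle() and then encode() on the returned Arc. One caveat of this simple form: the first push that arrives while a snapshot handle is alive pays for a full table copy inside make_mut, so a real fix would likely need a structure with cheaper copy-on-write (for example per-shard Arcs), which is not shown here.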
2. /ping is not latency-bounded (P0)
POST /ping dispatches on the apply thread (apply_shard.rs:731-734). The handler itself is constant-time (reads registry.version()), but the apply thread serializes it behind blocked pushes. Liveness probes need a path that doesn't share the apply-thread queue.
Two reasonable fixes (orthogonal to #1, both worth doing):
Move /ping to the admin sidecar (http_admin.rs::BoundAdminServer, separate tokio runtime + separate port) — same model as /health already uses.
Or expose a fast-path that short-circuits before dispatch_one's FIFO (atomic counter check, no apply-thread round-trip).
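The second option (answering /ping before the request enters the apply-thread FIFO) could look roughly like the sketch below. It assumes the registry version can be mirrored into an atomic that the I/O side reads directly; RegistryVersion, Request, Routed, and route are hypothetical names, and an mpsc channel stands in for dispatch_one's queue.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::Sender;

/// Hypothetical mirror of the registry version, readable without the apply thread.
pub struct RegistryVersion(AtomicU64);

impl RegistryVersion {
    pub fn new(v: u64) -> Self {
        Self(AtomicU64::new(v))
    }
    /// Called by whichever thread mutates the registry.
    pub fn publish(&self, v: u64) {
        self.0.store(v, Ordering::Release);
    }
    /// Lock-free read; never waits on the apply thread.
    pub fn load(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }
}

pub enum Request {
    Ping,
    Push(Vec<u8>),
}

#[derive(Debug, PartialEq)]
pub enum Routed {
    /// Answered on the spot with the current registry version.
    Immediate(u64),
    /// Handed to the apply thread's queue.
    Queued,
    /// The apply thread's receiver has been dropped.
    ApplyThreadGone,
}

/// Short-circuit /ping; everything else goes through the FIFO as before.
pub fn route(req: Request, version: &RegistryVersion, apply_queue: &Sender<Vec<u8>>) -> Routed {
    match req {
        Request::Ping => Routed::Immediate(version.load()),
        Request::Push(body) => match apply_queue.send(body) {
            Ok(()) => Routed::Queued,
            Err(_) => Routed::ApplyThreadGone,
        },
    }
}
```

With this shape a stalled apply thread (a queue nobody drains) delays pushes but not pings. The trade-off is that /ping then proves only that the I/O side is alive, which is one more reason to treat it as a registry-version probe rather than a health signal.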
3. WAL active segment never rotates → file grows unbounded (P1)
Reporter observed:
wal-0000000000000000.wal 2,430,543,938 bytes
This is despite repeated "snapshot written + WAL truncated" log lines reporting removed: 0.
crates/beava-persistence/src/rotation.rs::truncate_up_to only removes closed segments (std::fs::remove_file on segments whose successor's start_lsn ≤ covered_lsn). The single open segment wal-0000000000000000.wal is never deleted by truncate. There is no size-based rotation in the snapshot/truncate path, so the active segment grows indefinitely. removed: 0 is honest — there are no closed segments to remove.
Two follow-ups needed:
Add size-based segment rotation (or LSN-based, e.g. rotate on snapshot success) so closed segments actually exist for truncate to clean up.
Either implement in-place ftruncate/fallocate(PUNCH_HOLE) on the open segment, or update the log line to be honest (segments_removed: N, active_segment_bytes: M).
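To make the interaction between rotation and truncation concrete, here is a small in-memory model. It is a sketch under assumptions, not the rotation.rs implementation: WalModel, Segment, and the one-LSN-per-record accounting are hypothetical, and only the truncate rule (drop a closed segment when its successor's start_lsn ≤ covered_lsn) is taken from the description above.

```rust
/// One WAL segment; a closed segment covers LSNs up to its successor's start.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub start_lsn: u64,
    pub bytes: u64,
}

/// In-memory model of a WAL with size-based rotation (no file I/O).
pub struct WalModel {
    max_segment_bytes: u64,
    closed: Vec<Segment>, // oldest first
    active: Segment,
    next_lsn: u64,
}

impl WalModel {
    pub fn new(max_segment_bytes: u64) -> Self {
        Self {
            max_segment_bytes,
            closed: Vec::new(),
            active: Segment { start_lsn: 0, bytes: 0 },
            next_lsn: 0,
        }
    }

    /// Append one record, rotating first if it would exceed the size cap.
    pub fn append(&mut self, record_bytes: u64) -> u64 {
        if self.active.bytes > 0 && self.active.bytes + record_bytes > self.max_segment_bytes {
            let fresh = Segment { start_lsn: self.next_lsn, bytes: 0 };
            self.closed.push(std::mem::replace(&mut self.active, fresh));
        }
        self.active.bytes += record_bytes;
        let lsn = self.next_lsn;
        self.next_lsn += 1;
        lsn
    }

    /// Remove closed segments whose successor starts at or before `covered_lsn`.
    /// The active segment is never removed. Returns the number removed.
    pub fn truncate_up_to(&mut self, covered_lsn: u64) -> usize {
        let mut removed = 0;
        while !self.closed.is_empty() {
            let successor_start = self
                .closed
                .get(1)
                .map_or(self.active.start_lsn, |s| s.start_lsn);
            if successor_start > covered_lsn {
                break;
            }
            self.closed.remove(0);
            removed += 1;
        }
        removed
    }

    pub fn closed_segments(&self) -> usize {
        self.closed.len()
    }

    pub fn active_bytes(&self) -> u64 {
        self.active.bytes
    }
}
```

With an effectively unlimited cap the model reproduces the reported behaviour: closed_segments() stays at 0, truncate_up_to always returns 0, and active_bytes() only grows. With a finite cap, truncation starts reclaiming space as soon as a snapshot covers a closed segment.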
4. Snapshot cadence default too aggressive (P1)
crates/beava-server/src/cli.rs:24-25 documents BEAVA_SNAPSHOT_INTERVAL_MS=30000 (30s). Deployment is running at ~60s and writes 507 MB every cycle. Even after the lock fix, half-a-gig of fsync every 60s is aggressive for any non-trivial workload. Worth raising the default and documenting the trade-off (durability window vs IO bandwidth) in the CLI help.
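The IO side of that trade-off is simple arithmetic. Assuming the full 507 MB snapshot is rewritten every cycle and using decimal megabytes, a 60s interval works out to about 8.45 MB/s sustained and roughly 730 GB written per day; the documented 30s default doubles both. The helper names below are hypothetical and exist only to show the calculation.

```rust
/// Sustained write rate in MB/s if a full snapshot is rewritten every interval.
pub fn snapshot_write_rate_mb_s(snapshot_mb: f64, interval_ms: u64) -> f64 {
    snapshot_mb / (interval_ms as f64 / 1000.0)
}

/// Total MB written per day at the given snapshot interval.
pub fn snapshot_mb_per_day(snapshot_mb: f64, interval_ms: u64) -> f64 {
    const MS_PER_DAY: f64 = 86_400_000.0;
    snapshot_mb * (MS_PER_DAY / interval_ms as f64)
}
```

The interval is also the worst-case window of WAL that must be replayed after a crash, so raising it lowers the write rate in direct proportion while lengthening recovery.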
5. Malformed WAL record at LSN 379827 — recurring boot WARN (P2)
recovery.v2_json_decode_failed lsn=379827
"control character (\\u0000-\\u001F) found while parsing a string at line 1 column 108"
Same LSN every restart since 2026-05-14 04:52. Reporter traced it to a Python producer that passed a control byte through into a JSON string field. Two cheap mitigations on the beava side:
Reject POST /push bodies with raw control bytes in string fields at the HTTP boundary (HTTP 400). Stops corrupt records from entering the WAL.
Write a quarantine marker on first decode failure so subsequent restarts don't re-emit the same WARN.
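The first mitigation can be done with a single pass over the request bytes before any JSON parsing. The sketch below flags raw bytes 0x00-0x1F that appear inside a string literal, which JSON requires to be escaped. The function name is hypothetical, and it assumes the body is otherwise well-formed JSON (it is not a validator).

```rust
/// Returns the offset of the first raw control byte (0x00-0x1F) found inside
/// a JSON string literal, or `None` if there is none.
pub fn first_raw_control_in_string(body: &[u8]) -> Option<usize> {
    let mut in_string = false;
    let mut escaped = false;
    for (i, &b) in body.iter().enumerate() {
        if !in_string {
            // Whitespace such as \n or \t is legal between tokens.
            if b == b'"' {
                in_string = true;
            }
            continue;
        }
        if b < 0x20 {
            return Some(i);
        }
        if escaped {
            escaped = false; // this byte was consumed by the preceding backslash
        } else if b == b'\\' {
            escaped = true;
        } else if b == b'"' {
            in_string = false;
        }
    }
    None
}
```

A POST /push handler could answer HTTP 400 whenever this returns Some(offset), and include the offset in the error body so the producer can locate the bad field.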
6. /healthz vs /ping semantics (NTH)
The admin sidecar already serves /health on a separate port. The README/quickstart should explicitly recommend the admin-port /health for docker/k8s healthchecks (which would have entirely avoided this incident downstream) and document /ping's purpose as registry-version probe rather than liveness probe.
Production impact (from report)
| Metric | Value |
| --- | --- |
| Outage duration | 7 days, ongoing since 2026-05-14 04:31 UTC |
| Paper trades (downstream consumer) | 0 in 7 days |
| beava container restarts (docker) | 4+ |
| /ping timeout rate | 35% (14/40) |
| Longest observed stall | ~75s |
| Snapshot size | 507 MB |
| WAL size | 2.43 GB (single segment, never rotated) |
| Process state during stall | R, 5 threads, no log output |
Reproduction
Per report §7:
Run beava with ~500 MB of accumulated state
Healthcheck timeout 3s, interval 5s, retries 10
Loop POST /ping every 0.5s
Observe ~60s timeout bursts aligned with snapshot mtime
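The probing loop from the steps above can be approximated with a small std-only client. This is a sketch under assumptions: the server address, the empty request body, and the Host header value are placeholders not taken from the report, and the timeout applies to each socket operation separately rather than to the whole request.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Send one `POST /ping` and return the time until the first response bytes.
pub fn ping_once(addr: SocketAddr, timeout: Duration) -> std::io::Result<Duration> {
    let started = Instant::now();
    let mut stream = TcpStream::connect_timeout(&addr, timeout)?;
    stream.set_read_timeout(Some(timeout))?;
    stream.set_write_timeout(Some(timeout))?;
    // Assumption: /ping accepts an empty body.
    stream.write_all(
        b"POST /ping HTTP/1.1\r\nHost: beava\r\nContent-Length: 0\r\nConnection: close\r\n\r\n",
    )?;
    let mut buf = [0u8; 64];
    let n = stream.read(&mut buf)?;
    if n == 0 {
        return Err(std::io::Error::new(
            std::io::ErrorKind::UnexpectedEof,
            "connection closed before any response",
        ));
    }
    Ok(started.elapsed())
}

/// Probe `count` times, sleeping `interval` between attempts.
/// Returns how many attempts failed or timed out.
pub fn probe(addr: SocketAddr, count: u32, interval: Duration, timeout: Duration) -> u32 {
    let mut failures = 0;
    for i in 0..count {
        match ping_once(addr, timeout) {
            Ok(elapsed) => println!("ping {i}: ok in {elapsed:?}"),
            Err(err) => {
                failures += 1;
                println!("ping {i}: FAILED ({err})");
            }
        }
        std::thread::sleep(interval);
    }
    failures
}
```

Calling probe with a 0.5s interval and a 3s timeout mirrors the reported setup; failures should then cluster around each snapshot write, and their timestamps can be compared against the snapshot file's mtime.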
Suggested fix order
Arc<StateTables> via atomic swap on push commit; snapshot reads the immutable Arc without locking. Preserves mio-only invariant.
Move /ping to the admin sidecar (or add fast-path before dispatch_one).
Raise the default BEAVA_SNAPSHOT_INTERVAL_MS; add CLI help noting the trade-off.
Control-byte rejection on POST /push + WAL decode-failure quarantine marker.
Document that /health (admin port) is the canonical healthcheck endpoint.
Source files
crates/beava-server/src/snapshot_task.rs:91-101: lock-hold scope
crates/beava-core/src/snapshot_body.rs:115-138: deep-clone under lock
crates/beava-server/src/apply_shard.rs:731-734: /ping on apply thread
crates/beava-server/src/apply_shard.rs:1025,1413,1538: apply-side state_tables lock acquirers
crates/beava-persistence/src/rotation.rs:41-71: truncate_up_to closed-segments-only logic
crates/beava-server/src/cli.rs:24-25: BEAVA_SNAPSHOT_INTERVAL_MS doc/default
Reporter
kalshi-pulse operator, filed 2026-05-21. Full incident report attached to the project as Beava Incident Report.pdf. Reporter offers tcpdump / perf trace / strace from the next observed stall on request; the stall reproduces continuously and is easy to capture on demand. Downstream consumer source: github.com/tk-dg/beava-test (branch claude/add-repo-summary-4Dlp3).