Skip to content

[SEV-1] HTTP wedges 60-90s during snapshot writes — snapshot holds state_tables lock + deep-clones state; wal active segment never rotates #151

@petrpan26

Description

@petrpan26

TL;DR

External incident report (kalshi-pulse, 2026-05-21, ongoing 7-day outage): beava HTTP becomes unresponsive 60-90s on a ~60s cycle, correlated 1:1 with snapshot file mtime. POST /ping times out at the docker healthcheck → restart loop → entire downstream pipeline frozen.

n=40 /ping samples (8s ceiling, in-container):
  min=0.7ms p50=1.7ms p90=8000ms p99=8000ms max=8000ms
  14/40 (35%) timed out

Confirmed root cause (mapped to code)

1. Snapshot encoding holds state lock + deep-clones entire state (P0)

crates/beava-server/src/snapshot_task.rs::do_snapshot ~lines 91-101:

let body = {
    let registry_snap = app_state.dev_agg.registry.snapshot();
    let tables = app_state.dev_agg.state_tables.lock();    // ← parking_lot Mutex held
    SnapshotBody::from_live(&registry_snap, &tables, ...)  // ← full deep-clone
};

SnapshotBody::from_live (crates/beava-core/src/snapshot_body.rs:115-138) explicitly requires the lock and does:

let entries: Vec<(EntityKey, Vec<AggOp>)> = table
    .iter_sorted()
    .map(|(k, v)| (k.clone(), v.clone()))
    .collect();

For 507 MB of state (~6M entries × 80B AggOp post-Phase-12.9 boxing) this is multi-second under the lock.

Cascade onto /ping: the apply thread (mio data plane, single-threaded per CLAUDE.md §mio-only Hot-Path Invariant) acquires the same lock on every push (apply_shard.rs:1025, 1413, 1538). While the snapshot holds the lock, the next push parks the apply thread, and every queued request — including /ping that doesn't touch the lock — is FIFO-queued behind it. /ping's constant-time handler (apply_shard.rs:731) never gets to run until the snapshot releases the lock.

2. /ping is not latency-bounded (P0)

POST /ping dispatches on the apply thread (apply_shard.rs:731-734). The handler itself is constant-time (reads registry.version()), but the apply thread serializes it behind blocked pushes. Liveness probes need a path that doesn't share the apply-thread queue.

Two reasonable fixes (orthogonal to #1, both worth doing):

  • Move /ping to the admin sidecar (http_admin.rs::BoundAdminServer, separate tokio runtime + separate port) — same model as /health already uses.
  • Or expose a fast-path that short-circuits before dispatch_one's FIFO (atomic counter check, no apply-thread round-trip).

3. WAL active segment never rotates → file grows unbounded (P1)

Reporter observed:

wal-0000000000000000.wal  2,430,543,938 bytes

Despite repeated snapshot written + WAL truncated logs with removed: 0.

crates/beava-persistence/src/rotation.rs::truncate_up_to only removes closed segments (std::fs::remove_file on segments whose successor's start_lsn ≤ covered_lsn). The single open segment wal-0000000000000000.wal is never deleted by truncate. There is no size-based rotation in the snapshot/truncate path, so the active segment grows indefinitely. removed: 0 is honest — there are no closed segments to remove.

Two follow-ups needed:

  • Add size-based segment rotation (or LSN-based, e.g. rotate on snapshot success) so closed segments actually exist for truncate to clean up.
  • Either implement in-place ftruncate/fallocate(PUNCH_HOLE) on the open segment, or update the log line to be honest (segments_removed: N, active_segment_bytes: M).

4. Snapshot cadence default too aggressive (P1)

crates/beava-server/src/cli.rs:24-25 documents BEAVA_SNAPSHOT_INTERVAL_MS=30000 (30s). Deployment is running at ~60s and writes 507 MB every cycle. Even after the lock fix, half-a-gig of fsync every 60s is aggressive for any non-trivial workload. Worth raising the default and documenting the trade-off (durability window vs IO bandwidth) in the CLI help.

5. Malformed WAL record at LSN 379827 — recurring boot WARN (P2)

recovery.v2_json_decode_failed lsn=379827
  "control character (\\u0000-\\u001F) found while parsing a string at line 1 column 108"

Same LSN every restart since 2026-05-14 04:52. Reporter traced it to a Python producer that passed a control byte through into a JSON string field. Two cheap mitigations on the beava side:

  • Reject POST /push bodies with raw control bytes in string fields at the HTTP boundary (HTTP 400). Stops corrupt records from entering the WAL.
  • Write a quarantine marker on first decode failure so subsequent restarts don't re-emit the same WARN.

6. /healthz vs /ping semantics (NTH)

The admin sidecar already serves /health on a separate port. The README/quickstart should explicitly recommend the admin-port /health for docker/k8s healthchecks (which would have entirely avoided this incident downstream) and document /ping's purpose as registry-version probe rather than liveness probe.

Production impact (from report)

Metric Value
Outage duration 7 days, ongoing since 2026-05-14 04:31 UTC
Paper trades (downstream consumer) 0 in 7 days
beava container restarts (docker) 4+
/ping timeout rate 35% (14/40)
Longest observed stall ~75s
Snapshot size 507 MB
WAL size 2.43 GB (single segment, never rotated)
Process state during stall R, 5 threads, no log output

Reproduction

Per report §7:

  1. Run beava with ~500 MB of accumulated state
  2. Healthcheck timeout 3s, interval 5s, retries 10
  3. Loop POST /ping every 0.5s
  4. Observe ~60s timeout bursts aligned with snapshot mtime

Suggested fix order

  1. feat: SDK-driven pipeline + plain-HTTP frontend, with PR-gating CI hardening #1 (P0): Decouple snapshot encoding from apply-thread state lock. Smallest viable fix: apply thread publishes Arc<StateTables> via atomic swap on push commit; snapshot reads the immutable Arc without locking. Preserves mio-only invariant.
  2. ci(deploy): chain deploy after publish-edge to kill version-mismatch race #2 (P0): Move /ping to the admin sidecar (or add fast-path before dispatch_one).
  3. fix(register): force=true must replace additive-against-existing descriptors (prod 409 hotfix) #3 (P1): Add size-or-LSN-based WAL segment rotation so truncate has closed segments to remove.
  4. fix+refactor(register): consolidate dual diff systems + force=true honors additive-against-existing #4 (P1): Raise default BEAVA_SNAPSHOT_INTERVAL_MS; add CLI help noting the trade-off.
  5. ci(deploy): pull+restart container before register; route deploy through publish-edge only #5 (P2): HTTP-boundary control-byte rejection on POST /push + WAL decode-failure quarantine marker.
  6. docs: drop stale tally / v0.0.0 / broken-install refs in README + 11 doc pages #6 (NTH): Quickstart docs: /health (admin port) is the canonical healthcheck endpoint.

Source files

  • crates/beava-server/src/snapshot_task.rs:91-101 — lock-hold scope
  • crates/beava-core/src/snapshot_body.rs:115-138 — deep-clone under lock
  • crates/beava-server/src/apply_shard.rs:731-734/ping on apply thread
  • crates/beava-server/src/apply_shard.rs:1025,1413,1538 — apply-side state_tables lock acquirers
  • crates/beava-persistence/src/rotation.rs:41-71 — truncate_up_to closed-segments-only logic
  • crates/beava-server/src/cli.rs:24-25BEAVA_SNAPSHOT_INTERVAL_MS doc/default

Reporter

kalshi-pulse operator, filed 2026-05-21. Full incident report attached to the project as Beava Incident Report.pdf. Reporter offers tcpdump / perf trace / strace from next observed stall on request — stall reproduces continuously and is easy to capture on demand. Downstream consumer source: github.com/tk-dg/beava-test (branch claude/add-repo-summary-4Dlp3).

Metadata

Metadata

Assignees

No one assigned

    Labels

    area: persistenceWAL / snapshot / recovery (beava-persistence)area: serverRust server / core / runtime-core cratesbugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions