[SEV-1] HTTP wedges 60-90s during snapshot writes — snapshot holds state_tables lock + deep-clones state; wal active segment never rotates

## TL;DR

External incident report (kalshi-pulse, 2026-05-21, ongoing 7-day outage): `beava` HTTP becomes unresponsive **60-90s on a ~60s cycle**, correlated 1:1 with snapshot file mtime. `POST /ping` times out at the docker healthcheck → restart loop → entire downstream pipeline frozen.

```
n=40 /ping samples (8s ceiling, in-container):
  min=0.7ms p50=1.7ms p90=8000ms p99=8000ms max=8000ms
  14/40 (35%) timed out
```

## Confirmed root cause (mapped to code)

### 1. Snapshot encoding holds state lock + deep-clones entire state (P0)

`crates/beava-server/src/snapshot_task.rs::do_snapshot` ~lines 91-101:

```rust
let body = {
    let registry_snap = app_state.dev_agg.registry.snapshot();
    let tables = app_state.dev_agg.state_tables.lock();    // ← parking_lot Mutex held
    SnapshotBody::from_live(&registry_snap, &tables, ...)  // ← full deep-clone
};
```

`SnapshotBody::from_live` (`crates/beava-core/src/snapshot_body.rs:115-138`) explicitly requires the lock and does:
```rust
let entries: Vec<(EntityKey, Vec<AggOp>)> = table
    .iter_sorted()
    .map(|(k, v)| (k.clone(), v.clone()))
    .collect();
```

For 507 MB of state (~6M entries × 80B AggOp post-Phase-12.9 boxing) this is multi-second under the lock.

**Cascade onto /ping:** the apply thread (mio data plane, single-threaded per CLAUDE.md §mio-only Hot-Path Invariant) acquires the same lock on every push (`apply_shard.rs:1025, 1413, 1538`). While the snapshot holds the lock, the next push parks the apply thread, and every queued request — including `/ping` that doesn't touch the lock — is FIFO-queued behind it. `/ping`'s constant-time handler (`apply_shard.rs:731`) never gets to run until the snapshot releases the lock.

### 2. `/ping` is not latency-bounded (P0)

`POST /ping` dispatches on the apply thread (`apply_shard.rs:731-734`). The handler itself is constant-time (reads `registry.version()`), but the apply thread serializes it behind blocked pushes. Liveness probes need a path that doesn't share the apply-thread queue.

Two reasonable fixes (orthogonal to #1, both worth doing):
- Move `/ping` to the admin sidecar (`http_admin.rs::BoundAdminServer`, separate tokio runtime + separate port) — same model as `/health` already uses.
- Or expose a fast-path that short-circuits before `dispatch_one`'s FIFO (atomic counter check, no apply-thread round-trip).

### 3. WAL active segment never rotates → file grows unbounded (P1)

Reporter observed:
```
wal-0000000000000000.wal  2,430,543,938 bytes
```
Despite repeated `snapshot written + WAL truncated` logs with `removed: 0`.

`crates/beava-persistence/src/rotation.rs::truncate_up_to` only removes **closed** segments (`std::fs::remove_file` on segments whose successor's `start_lsn ≤ covered_lsn`). The single open segment `wal-0000000000000000.wal` is **never deleted** by truncate. There is no size-based rotation in the snapshot/truncate path, so the active segment grows indefinitely. `removed: 0` is honest — there are no closed segments to remove.

Two follow-ups needed:
- Add size-based segment rotation (or LSN-based, e.g. rotate on snapshot success) so closed segments actually exist for truncate to clean up.
- Either implement in-place `ftruncate`/`fallocate(PUNCH_HOLE)` on the open segment, or update the log line to be honest (`segments_removed: N, active_segment_bytes: M`).

### 4. Snapshot cadence default too aggressive (P1)

`crates/beava-server/src/cli.rs:24-25` documents `BEAVA_SNAPSHOT_INTERVAL_MS=30000` (30s). Deployment is running at ~60s and writes 507 MB every cycle. Even after the lock fix, half-a-gig of fsync every 60s is aggressive for any non-trivial workload. Worth raising the default and documenting the trade-off (durability window vs IO bandwidth) in the CLI help.

### 5. Malformed WAL record at LSN 379827 — recurring boot WARN (P2)

```
recovery.v2_json_decode_failed lsn=379827
  "control character (\\u0000-\\u001F) found while parsing a string at line 1 column 108"
```

Same LSN every restart since 2026-05-14 04:52. Reporter traced it to a Python producer that passed a control byte through into a JSON string field. Two cheap mitigations on the beava side:
- Reject `POST /push` bodies with raw control bytes in string fields at the HTTP boundary (HTTP 400). Stops corrupt records from entering the WAL.
- Write a quarantine marker on first decode failure so subsequent restarts don't re-emit the same WARN.

### 6. `/healthz` vs `/ping` semantics (NTH)

The admin sidecar already serves `/health` on a separate port. The README/quickstart should explicitly recommend the admin-port `/health` for docker/k8s healthchecks (which would have entirely avoided this incident downstream) and document `/ping`'s purpose as registry-version probe rather than liveness probe.

## Production impact (from report)

| Metric | Value |
|--------|-------|
| Outage duration | 7 days, ongoing since 2026-05-14 04:31 UTC |
| Paper trades (downstream consumer) | 0 in 7 days |
| beava container restarts (docker) | 4+ |
| /ping timeout rate | 35% (14/40) |
| Longest observed stall | ~75s |
| Snapshot size | 507 MB |
| WAL size | 2.43 GB (single segment, never rotated) |
| Process state during stall | R, 5 threads, no log output |

## Reproduction

Per report §7:
1. Run beava with ~500 MB of accumulated state
2. Healthcheck timeout 3s, interval 5s, retries 10
3. Loop `POST /ping` every 0.5s
4. Observe ~60s timeout bursts aligned with snapshot mtime

## Suggested fix order

1. **#1 (P0):** Decouple snapshot encoding from apply-thread state lock. Smallest viable fix: apply thread publishes `Arc<StateTables>` via atomic swap on push commit; snapshot reads the immutable Arc without locking. Preserves mio-only invariant.
2. **#2 (P0):** Move `/ping` to the admin sidecar (or add fast-path before `dispatch_one`).
3. **#3 (P1):** Add size-or-LSN-based WAL segment rotation so truncate has closed segments to remove.
4. **#4 (P1):** Raise default `BEAVA_SNAPSHOT_INTERVAL_MS`; add CLI help noting the trade-off.
5. **#5 (P2):** HTTP-boundary control-byte rejection on `POST /push` + WAL decode-failure quarantine marker.
6. **#6 (NTH):** Quickstart docs: `/health` (admin port) is the canonical healthcheck endpoint.

## Source files

- `crates/beava-server/src/snapshot_task.rs:91-101` — lock-hold scope
- `crates/beava-core/src/snapshot_body.rs:115-138` — deep-clone under lock
- `crates/beava-server/src/apply_shard.rs:731-734` — `/ping` on apply thread
- `crates/beava-server/src/apply_shard.rs:1025,1413,1538` — apply-side state_tables lock acquirers
- `crates/beava-persistence/src/rotation.rs:41-71` — truncate_up_to closed-segments-only logic
- `crates/beava-server/src/cli.rs:24-25` — `BEAVA_SNAPSHOT_INTERVAL_MS` doc/default

## Reporter

kalshi-pulse operator, filed 2026-05-21. Full incident report attached to the project as `Beava Incident Report.pdf`. Reporter offers tcpdump / perf trace / strace from next observed stall on request — stall reproduces continuously and is easy to capture on demand. Downstream consumer source: `github.com/tk-dg/beava-test` (branch `claude/add-repo-summary-4Dlp3`).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SEV-1] HTTP wedges 60-90s during snapshot writes — snapshot holds state_tables lock + deep-clones state; wal active segment never rotates #151

TL;DR

Confirmed root cause (mapped to code)

1. Snapshot encoding holds state lock + deep-clones entire state (P0)

2. `/ping` is not latency-bounded (P0)

3. WAL active segment never rotates → file grows unbounded (P1)

4. Snapshot cadence default too aggressive (P1)

5. Malformed WAL record at LSN 379827 — recurring boot WARN (P2)

6. `/healthz` vs `/ping` semantics (NTH)

Production impact (from report)

Reproduction

Suggested fix order

Source files

Reporter

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Metric	Value
Outage duration	7 days, ongoing since 2026-05-14 04:31 UTC
Paper trades (downstream consumer)	0 in 7 days
beava container restarts (docker)	4+
/ping timeout rate	35% (14/40)
Longest observed stall	~75s
Snapshot size	507 MB
WAL size	2.43 GB (single segment, never rotated)
Process state during stall	R, 5 threads, no log output

[SEV-1] HTTP wedges 60-90s during snapshot writes — snapshot holds state_tables lock + deep-clones state; wal active segment never rotates #151

Description

TL;DR

Confirmed root cause (mapped to code)

1. Snapshot encoding holds state lock + deep-clones entire state (P0)

2. /ping is not latency-bounded (P0)

3. WAL active segment never rotates → file grows unbounded (P1)

4. Snapshot cadence default too aggressive (P1)

5. Malformed WAL record at LSN 379827 — recurring boot WARN (P2)

6. /healthz vs /ping semantics (NTH)

Production impact (from report)

Reproduction

Suggested fix order

Source files

Reporter

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

2. `/ping` is not latency-bounded (P0)

6. `/healthz` vs `/ping` semantics (NTH)