
fix(snapshot): reclaim covered WAL bytes#152

Merged
petrpan26 merged 26 commits into main from phase-27.1-snapshot-fork-cow
May 29, 2026

Conversation

@petrpan26 petrpan26 commented May 22, 2026

Reclaim hand-rolled WAL bytes after durable snapshots and fix misleading WAL logs.
Reject control chars on push, quarantine LSN-known WAL decode failures, expose snapshot write metrics, and harden fork/WAL edge cases from review.
Tests: cargo fmt --all --check
Tests: git diff --check
Tests: cargo test -p beava-runtime-core
Tests: cargo test -p beava-persistence
Tests: cargo test -p beava-server --features testing
Tests: cargo clippy -p beava-server --all-targets --features testing -- -D warnings

Hoang Phan added 8 commits May 21, 2026 23:31
- crates/beava-core/tests/snapshot_lock_hold_repro.rs:
    standalone measurement of the parking_lot lock-hold scope inside
    snapshot_task::do_snapshot. Reproduces incident #151 locally:
    linear scaling ~350 ns/entry on Apple M4, projecting to multi-second
    lock-hold at production scale (5-10M entries).

- crates/beava-server/tests/snapshot_fork.rs:
    integration tests for the new fork+COW snapshot path:
    * fork_env_gate            - BEAVA_SNAPSHOT_FORK=1 opts in
    * fork_snapshot_writes_decodable_file
                               - child writes SnapshotReader-decodable file
    * fork_snapshot_parent_state_intact
                               - parent's app_state usable after fork
    * fork_snapshot_with_zero_state
                               - empty state edge case

These tests reference do_snapshot_via_fork + fork_enabled from a
snapshot_fork module landed in the immediately-following feat commit.

Addresses #151. Drops apply-thread state_tables.lock() hold from
~seconds to ~µs (the fork syscall itself) by mirroring Valkey's BGSAVE
pattern.

## Mechanism

snapshot_fork::do_snapshot_via_fork:
  1. parent acquires state_tables.lock() briefly
  2. libc::fork() — child inherits a COW snapshot of parent address space
  3. parent immediately releases the lock — apply thread unblocks
  4. child reads its own (now-frozen) state_tables, serializes via
     bincode, writes the snapshot file via std::fs, libc::_exit(0)
  5. parent waits on the child via spawn_blocking(waitpid) — does not
     block the tokio current-thread runtime
  6. WAL truncate runs only on ChildExit::Success
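
The six steps above can be sketched with nothing but the standard library. This is a minimal illustration, not the code in snapshot_fork.rs: `snapshot_via_fork` is a hypothetical name, the state is a plain `Mutex<Vec<u64>>` instead of state_tables, the encoding is raw little-endian bytes instead of bincode, the libc functions are declared by hand, and the parent waits inline rather than via spawn_blocking. Unix only.

```rust
use std::fs;
use std::path::Path;
use std::sync::Mutex;

// Hand-declared libc bindings so the sketch needs no external crate (Unix only).
unsafe extern "C" {
    fn fork() -> i32;
    fn waitpid(pid: i32, status: *mut i32, options: i32) -> i32;
    fn _exit(code: i32) -> !;
}

/// Hypothetical stand-in for do_snapshot_via_fork. Returns Ok(true) when the
/// child exited normally with code 0.
fn snapshot_via_fork(state: &Mutex<Vec<u64>>, out: &Path) -> std::io::Result<bool> {
    // Step 1: take the lock briefly so the child inherits a consistent view.
    let guard = state.lock().unwrap();
    // Step 2: fork; the child gets a copy-on-write image of the address space.
    let pid = unsafe { fork() };
    if pid < 0 {
        return Err(std::io::Error::last_os_error());
    }
    if pid == 0 {
        // Step 4 (child): serialize the frozen state, write the file, _exit.
        let bytes: Vec<u8> = guard.iter().flat_map(|v| v.to_le_bytes()).collect();
        let code = if fs::write(out, bytes).is_ok() { 0 } else { 1 };
        unsafe { _exit(code) }
    }
    // Step 3 (parent): release the lock immediately so writers can proceed.
    drop(guard);
    // Step 5: reap the child, retrying if a signal interrupts the wait.
    let mut status = 0i32;
    loop {
        let r = unsafe { waitpid(pid, &mut status, 0) };
        if r == pid {
            break;
        }
        let err = std::io::Error::last_os_error();
        if err.kind() != std::io::ErrorKind::Interrupted {
            return Err(err);
        }
    }
    // Step 6: only a normal exit with code 0 counts as success.
    let exited_normally = (status & 0x7f) == 0;
    let exit_code = (status >> 8) & 0xff;
    Ok(exited_normally && exit_code == 0)
}
```

A caller would run its WAL truncate only when this returns Ok(true), mirroring the ChildExit::Success gate in step 6.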

## Opt-in

Gated behind BEAVA_SNAPSHOT_FORK=1 env. Default (env unset) preserves
the legacy in-process synchronous path. Flip the default after soak in
a follow-up PR.

## Safety invariants (documented in snapshot_fork.rs module doc)

1. beava's tokio runtime is new_current_thread; total OS threads at fork
   time = tokio main (forker) + mio apply + beava-wal-writer-noop +
   possibly a spawn_blocking worker. All other threads vanish in the
   child per POSIX.
2. System allocator (glibc / macOS libc) is fork-safe via pthread_atfork
   handlers — bincode::serialize allocates safely in the child.
3. Child only reads app_state fields and calls std::fs + libc::_exit.
   It does NOT touch WAL state, tokio runtime, the admin sidecar, or any
   parking_lot::Mutex it didn't already hold at fork time.
4. Child calls libc::_exit (async-signal-safe; skips at_exit handlers)
   rather than std::process::exit which would run destructors that
   could touch parent state.
5. Child error reporting via .error sidecar file (read by parent after
   waitpid) — std{out,err} are inherited from parent and unsafe to
   write to from a forked child.

## Architectural tripwires

All 5 stay green (verified):
- phase12_6_mio_only_dataplane            ok (3/3)
- phase12_6_legacy_axum_killed            ok (6/6)
- phase12_7_no_table_surface              ok (3/3)
- phase12_7_legacy_table_handlers_killed  ok (6/6)
- per_entity_size_dump (AggOp 80B cap)    ok (2/2)

snapshot_fork.rs contains no axum symbols and is not a new caller of
apply_event_to_aggregations.

## TDD Discipline

This is the GREEN commit. Paired with test(27.1) immediately preceding
that exercises every behavioural contract:
- env gate (fork_enabled)
- child writes decodable file
- parent state intact after fork
- empty-state edge case

Test runtime: 0.03s (4/4 pass).

Eight new tests across two files locking in the contract that motivates
this PR.

## snapshot_lock_contention.rs (4 tests)

Direct measurement of the lock-hold scope in each path. Headline result
on Apple M4 / dev build:

  === Lock-hold comparison @ N=100k entities ===
    legacy:  330.20ms (median of 3)
    fork:      0.59ms (median of 3)
    speedup: 556.9×

  === Legacy in-process snapshot: lock-hold scales with N ===
       entries   lock_held_ms
          1000        3.01ms
         50000      165.82ms
         200000     683.84ms

  === fork() snapshot: lock-hold is O(1) (fork syscall only) ===
       entries   lock_held_ms
          1000        0.98ms
         50000        0.40ms
         200000      0.36ms

Tests:
- legacy_lock_hold_scales_with_state_size       — O(N) confirmed; 1k→200k → 3ms→684ms
- fork_lock_hold_is_microseconds_regardless_of_state_size  — O(1) confirmed
- fork_vs_legacy_lock_hold_at_same_state_size   — asserts >=5× speedup floor
- fork_full_path_apply_lock_available_during_child_work
                                                — end-to-end: probe thread
                                                  acquires lock <100ms after
                                                  fork() returns, while child
                                                  is still serializing

Methodology: bypass SnapshotBody::from_live for the legacy measurement
because that function iterates registry.compiled_aggregations (empty in
test harness). Instead time the inner clone-collect directly — it's
byte-for-byte the operation from_live runs once per registered agg.

CI margins: 5× speedup floor (empirical ~500-700×); 50ms ceiling for
fork lock-hold (empirical <1ms); 100ms for end-to-end apply-lock-available.
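
A rough sketch of the legacy-side measurement described above, assuming a `HashMap<u64, u64>` behind a std `Mutex` as a stand-in for the real state tables (the tests use parking_lot and the actual aggregation state). `clone_under_lock` and `median` are hypothetical helper names.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Legacy-style path: clone every entry while the lock is held, so the
/// hold time grows with the number of entries.
fn clone_under_lock(state: &Mutex<HashMap<u64, u64>>) -> (Vec<(u64, u64)>, Duration) {
    let start = Instant::now();
    let guard = state.lock().unwrap();
    let copy: Vec<(u64, u64)> = guard.iter().map(|(k, v)| (*k, *v)).collect();
    drop(guard); // lock released here; elapsed covers acquire + clone-collect
    (copy, start.elapsed())
}

/// Median of an odd number of samples, as in the "median of 3" figures.
fn median(mut samples: Vec<Duration>) -> Duration {
    samples.sort();
    samples[samples.len() / 2]
}
```

Calling `clone_under_lock` three times and passing the durations to `median` gives one row of the scaling table; the absolute numbers will differ by machine and build profile.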

## snapshot_recovery_time.rs (4 tests)

Measures recovery cost — the time to load a snapshot back into state on
beava boot.

  === Snapshot recovery time vs state size ===
       entries     encoded_KB        open_ms        decode_ms        MB/s_decode
          1000        67.5 KB        0.07ms          0.98ms            67.2
         10000       674.0 KB        0.43ms          9.75ms            67.5
         100000     6738.4 KB        3.97ms         97.96ms            67.2

Decode throughput is ~constant 67 MB/s in debug build (release will be
significantly faster). Projection: 507 MB snapshot decode ≈ 7.5s
on this hardware in debug — recovery is bottlenecked on bincode
deserialize.
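
The throughput column and the 7.5 s projection follow from simple arithmetic. The sketch below assumes the table's KB and MB are binary units (1024-based), which is what reproduces the printed 67.2 MB/s; the function names are illustrative only.

```rust
/// Decode throughput in MB/s from an encoded size in KB and a decode time in ms.
/// Assumes 1 MB = 1024 KB, which matches the table's printed values.
fn decode_mb_per_s(encoded_kb: f64, decode_ms: f64) -> f64 {
    (encoded_kb / 1024.0) / (decode_ms / 1000.0)
}

/// Projected decode time in seconds for a snapshot of `size_mb` at `mb_per_s`.
fn projected_decode_secs(size_mb: f64, mb_per_s: f64) -> f64 {
    size_mb / mb_per_s
}
```

For the 100000-entry row, `decode_mb_per_s(6738.4, 97.96)` is about 67.2, and `projected_decode_secs(507.0, 67.0)` is about 7.6, in line with the ≈ 7.5s figure.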

Tests:
- snapshot_round_trip_byte_identical            — write→read bytes match,
                                                   decoded body matches input
- snapshot_recovery_time_scaling                — sizes 1k/10k/100k printed
- snapshot_decode_deterministic                 — same body in → same bytes
                                                   out (no nondeterminism)
- fork_and_in_process_produce_identical_format  — fork path doesn't change
                                                   the on-disk schema

## Gate status

12 fork-snapshot tests now pass (4 contract + 4 lock-contention + 4 recovery).
All 5 architectural tripwires green. cargo fmt clean. clippy clean on
all three new test files.

Eight new tests covering lock-hold + recovery-time at the scale that
triggered incident #151. All #[ignore]'d by default — runtime ~20s
release / ~5 min debug, memory peak ~1.5 GB.

## Run

    cargo test --release -p beava-server --test snapshot_big_state \
        -- --ignored --nocapture --test-threads=1

## Measured numbers (Apple M4, release)

### Lock-hold (the kalshi-pulse incident root cause)

  legacy lock-hold @ N=1M:   396.5ms   ← extrapolates past the 3s healthcheck at ~7.5M
  legacy lock-hold @ N=5M:   2281.1ms (2.3s)  ← THE SMOKING GUN

  fork lock-hold   @ N=1M:     0.82ms   ← O(1)
  fork lock-hold   @ N=5M:     7.89ms   ← still O(1); the variance is
                                          fork()'s page-table-copy cost,
                                          NOT state-size dependence

  Speedup @ 1M:    legacy 395ms / fork 9ms = 43× (depends on OS memory
                                                  state; floor asserted 20×)
  Speedup @ 5M:    legacy 2281ms / fork 7.89ms ≈ 289× (informational)

### Recovery (boot-time decode)

  N=1M / 65.8 MB:    open 16.5ms, decode 72ms, throughput 914 MB/s
  N=5M / 329 MB:     open 88ms,   decode 367ms, throughput 897 MB/s
                                            ← decode throughput is nearly
                                              constant; recovery is bound by
                                              bincode deserialize at ~1 GB/s

### Full fork snapshot @ N=1M

  parent wall-clock 18.1ms (fork + child serialize + waitpid)
  child exit success
  apply lock held only across fork syscall (~0.5ms)

## Tests added

- legacy_lock_hold_at_1m_entries           — asserts ≥100ms
- legacy_lock_hold_at_5m_entries           — asserts ≥500ms (smoking gun)
- fork_lock_hold_at_1m_entries             — asserts <100ms
- fork_lock_hold_at_5m_entries             — asserts <100ms
- fork_speedup_at_1m_entries               — asserts ≥20× speedup
- recovery_decode_at_1m_entries            — informational throughput
- recovery_decode_at_5m_entries            — incident-scale recovery cost
- fork_full_snapshot_at_1m_entries         — end-to-end fork path

## Interpretation

These numbers confirm everything claimed in #151's diagnosis:

1. At incident scale (5M entries, ~329 MB encoded), the legacy snapshot
   holds state_tables.lock() for **2.3 seconds**. Slightly larger state
   pushes that past the docker healthcheck 3s default, which is exactly
   the bug the kalshi-pulse operator reported.

2. The fork path's lock-hold is **independent of state size** as designed.
   Even at 5M entries it's <10ms (the 7.89ms is fork's page-table copy
   over ~750 MB of working set, NOT clone work).

3. Recovery decode is ~1 GB/s — a 507 MB production snapshot deserializes
   in ~500ms. Boot-time recovery is not the bottleneck; the snapshot-write
   phase was. With this PR, snapshot-write no longer blocks the apply
   thread either.

The 20× speedup floor for fork_speedup_at_1m_entries is loose to absorb
OS-memory-state variance from earlier large-state tests in the same
process (fork's page-table-copy cost depends on prior allocation churn).
Empirically 50-1000× depending on test ordering and runner state.
…ong?'

Empirical chart of fork() lock-hold cost vs beava process VM size.

## Headline: TWO regimes

Fresh process (just booted, never touched large VM):
  fork @ 1M entries (691 MB RSS):    0.67 ms median
  fork @ 5M entries (2.1 GB RSS):    ~4 ms median  (linear extrapolation)

Long-lived process (after big alloc/free churn):
  fork @ 5M entries:    33 ms median
  fork @ 10M entries:   15 ms median
  fork @ 20M entries:   16 ms median

Reason: macOS libmalloc / Linux glibc don't aggressively return freed
memory to the OS — process virtual address space (the thing fork()
copies page tables for) does not shrink back. Once beava has touched
N gigabytes of VM in its lifetime, fork pays the N-GB page-table-copy
cost on every snapshot.

## Is 10ms too long?

For #151 (the kalshi-pulse incident): no — 2300 ms → 15 ms is the fix.
A 15 ms /ping spike once per 60 s snapshot cycle does not trip any
docker healthcheck. SEV-1 resolved.

For beava's 3M EPS/core target: borderline. 15 ms × 3M events/s = 45k
events queued per snapshot, drained in ~15 ms after release. With one
snapshot per 60 s cycle, that is 0.025% of wall-clock time blocked.
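
The two figures in this paragraph can be checked with a few lines of arithmetic. The helper names are hypothetical; the inputs (15 ms stall, 3M events/s, 60 s cycle) come from the text above.

```rust
/// Events that arrive while the apply lock is held for `stall_ms`.
fn events_queued(stall_ms: f64, events_per_sec: f64) -> f64 {
    stall_ms / 1000.0 * events_per_sec
}

/// Share of wall-clock time blocked, in percent, for one stall per cycle.
fn blocked_percent(stall_ms: f64, cycle_secs: f64) -> f64 {
    stall_ms / (cycle_secs * 1000.0) * 100.0
}
```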

For sub-millisecond fork: requires either (a) keeping process VM tiny
(impractical for long-running production) or (b) moving to ArcSwap
per-table (eliminates fork entirely; #151 issue lists this as the
'safe-Rust equivalent').

## Tests added

- fork_syscall_vs_process_rss              — scaling chart (informational,
                                              no hard assertion)
- fork_at_1m_fresh_process_under_5ms       — best-case floor; <5ms
                                              (must run alone — see comment)
- fork_lock_hold_at_5m_entries_under_50ms  — long-lived realistic ceiling
- fork_lock_hold_at_10m_entries_under_100ms
- fork_lock_hold_at_20m_entries_under_200ms

All #[ignore]'d. Run:

    cargo test --release -p beava-server --test snapshot_fork_scaling \
        -- --ignored --nocapture --test-threads=1

The 'fresh' test asserts <5ms but only passes when run alone (any
test that allocates >1GB earlier in the same process will bloat the
allocator's pool and bias subsequent forks to ~15-40ms).

## Production guidance

- Up to ~5M entries / 2 GB working set: fork stays under 50ms. ✓
- 5M-20M entries: fork is 15-30ms. Tolerable for fraud-serving SLAs.
- Past 20M (~6 GB RSS): consider ArcSwap or sharding state across
  multiple beava instances (Redis-cluster pattern, matches the
  project_no_sharded_apply commitment).
… comparison

Pushes fork() measurement to the upper bound of what a single beava
instance is sized for (50M entries → 7.6 GB RSS). Includes inline
comparison against Redis's published latest_fork_usec numbers across
infrastructure types.

## Empirical results (Apple M4 release)

   entries    RSS    fork_median    fork_max    ms/GB
   30M        6771MB  0.67ms         1.04ms      0.10
   40M        6955MB  0.47ms         1.17ms      0.07
   50M        7611MB  0.44ms         0.54ms      0.06

**Sub-millisecond fork at 7.6 GB RSS.** This is the steady-state regime:
process built up monotonically to 50M entries, never had a larger working
set in the past. The earlier 15-30 ms numbers in snapshot_fork_scaling.rs
were the transient regime: a process that previously held LARGER state,
then shrunk, and is now paying the page-table cost for the bloated VM.

## Published Redis fork numbers (for comparison)

| Infrastructure              | ms/GB     | Source |
|-----------------------------|-----------|--------|
| Apple M4 (this beava test)  | ~0.07-0.1 | this PR |
| Linux physical (Xeon, Redis)|  9        | antirez.com |
| Linux VMware VM (Redis)     | 12.8      | antirez.com |
| AWS EC2 HVM modern (Redis)  | ~10       | Redis docs |
| AWS EC2 Xen old (Redis)     | 239       | Redis docs |
| Linode Xen small VM (Redis) | 424       | Redis docs |

Redis user guidance:
  * fork >10ms = worth investigating
  * fork >200ms = a problem

Apple M4 + beava is ~100× faster than Linux physical for fork — likely
because of:
  * Apple Silicon's hardware page-table walker
  * macOS's pmap COW implementation
  * relatively dense page-table layout when process VM is monotonic

Production Linux deployments should expect ~10 ms/GB (matches Redis's
HVM AWS numbers). At incident scale (5 GB working set) that's ~50ms
fork — well under any docker healthcheck timeout, fully resolves #151.

## Two regimes, summarized

1. **Fresh / monotonic-growth process**: fork is fast (~0.1-10 ms/GB)
   - Apple M4: ~0.1 ms/GB (measured above)
   - Linux physical: ~9-10 ms/GB (Redis's published numbers above; not measured in this PR)
   - This is the steady-state production regime when state grows.

2. **Long-lived process with VM bloat** (state grew then shrunk):
   - Apple M4: ~15-30 ms regardless of current state
   - Cause: allocator keeps VM mapped after free; fork copies the full
     virtual-address-space page table.
   - For beava with Phase 12.8 cold-entity TTL eviction, this regime
     applies — state DOES shrink, so production beava will see the
     slower path in long-running processes.

Run: cargo test --release -p beava-server --test snapshot_fork_extreme \
       fork_at_extreme_scale -- --ignored --nocapture

Runtime ~150 seconds; peak ~7.6 GB RAM. Requires 32 GB+ host.

Adds a non-#[ignore]'d test that builds 1M entries (~700 MB RSS) and
asserts fork() lock-hold stays under 10 ms. Runs on every cargo test
to catch regressions that re-introduce O(N) work under
state_tables.lock().

## Why 1M (not 5M+)

1M is the largest scale that fits comfortably in default CI:
  - ~700 MB peak RSS
  - ~10s build in debug, ~1s in release
  - Empirically fork is 0.7-0.8ms on Apple M4 (well under 10ms)

5M+ entries (~2 GB RSS, ~5-10s build) requires the --ignored opt-in
tests already present in snapshot_big_state.rs / snapshot_fork_scaling.rs
/ snapshot_fork_extreme.rs.

## Why a separate test binary

cargo spawns each test binary in its own process. Putting this test in
its own file guarantees a fresh process VM — no sibling tests can
bloat the allocator's reserved range and slow down fork().

## Measured

Apple M4 release:  median 0.70ms, max 0.77ms; ~14× margin under ceiling (on the median)
Apple M4 debug:    median 0.84ms, max 2.35ms; ~4× margin under ceiling (on the max)

Linux runners are slower for fork than Apple Silicon: Redis's published
latest_fork_usec numbers suggest ~10 ms/GB, or roughly 7 ms at ~700 MB
RSS. The 10ms ceiling should still absorb this, with a thinner margin.

## Regression contract

If anyone re-introduces an iter_sorted-style or clone-collect operation
under state_tables.lock() in the snapshot path, fork lock-hold will
exceed 10ms at 1M entries and this test will fail in CI.

## Conditional snapshot (Redis 'save N M' pattern)

When BEAVA_SNAPSHOT_MIN_EVENTS=N is set (default 0 = off), the periodic
snapshot tick skips snapshotting unless at least N WAL events have
committed since the previous successful snapshot. Mirrors Redis's
'save N M' directive — idle minutes don't write a 500 MB snapshot.

  * New field: SnapshotTaskConfig.min_events_per_snapshot (u64)
  * Tracking: last_snapshot_lsn maintained in spawn_snapshot_task closure
  * Skip log: snapshot.skipped_below_threshold (debug level)
  * Manual trigger (force_snapshot_now) bypasses threshold
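
The skip rule in the bullets above reduces to a small predicate. This is a sketch under stated assumptions: `should_snapshot` is a hypothetical name, and the real logic lives inside the spawn_snapshot_task closure, whose exact shape is not shown in this PR text.

```rust
/// Decide whether a tick should write a snapshot.
/// `min_events == 0` means the threshold is off; `forced` models the manual trigger.
fn should_snapshot(current_lsn: u64, last_snapshot_lsn: u64, min_events: u64, forced: bool) -> bool {
    if forced || min_events == 0 {
        return true;
    }
    // saturating_sub guards against a baseline that is ahead of the current LSN.
    current_lsn.saturating_sub(last_snapshot_lsn) >= min_events
}
```

With `min_events = 1000`, a tick at LSN 1500 after a snapshot at LSN 1000 is skipped, while the same tick with `forced = true` still snapshots.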

## Refactor: env reads centralized to boot site

Per the Phase 13.5.3 architectural rule
(phase13_5_3_no_env_var_pokes_in_tests): env vars must be read once at
boot in server.rs, not by per-tick code or by tests via set_var.

  * SnapshotTaskConfig.use_fork_snapshot replaces the env-on-every-tick
    fork_enabled() call in do_snapshot. server.rs reads
    BEAVA_SNAPSHOT_FORK once at boot.
  * SnapshotTaskConfig.min_events_per_snapshot already followed this
    pattern; min_events_from_env() is the boot-time reader.
  * Removed fork_env_gate and env_parsing tests that poked
    std::env::set_var in beava-server/tests/ (violated tripwire).
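
One way to keep env reads at the boot site while leaving tests free of set_var is to parse from an injected lookup function. The sketch below is an assumption about how that could look, not the code in server.rs: `config_from_lookup` is a hypothetical helper, the struct is cut down to the two fields named above, and the fork flag uses the opt-in `=1` semantics in force at this commit.

```rust
/// Two-field stand-in for the real SnapshotTaskConfig.
#[derive(Debug, PartialEq)]
struct SnapshotTaskConfig {
    use_fork_snapshot: bool,
    min_events_per_snapshot: u64,
}

/// Build the config once from a key lookup. At boot the lookup would be
/// `|k| std::env::var(k).ok()`; tests can pass a closure over fixed values.
fn config_from_lookup(lookup: impl Fn(&str) -> Option<String>) -> SnapshotTaskConfig {
    SnapshotTaskConfig {
        use_fork_snapshot: lookup("BEAVA_SNAPSHOT_FORK").as_deref() == Some("1"),
        min_events_per_snapshot: lookup("BEAVA_SNAPSHOT_MIN_EVENTS")
            .and_then(|v| v.trim().parse::<u64>().ok())
            .unwrap_or(0), // unset or unparsable means "threshold off"
    }
}
```

Per-tick code then only reads fields of the struct, so no test needs to touch the process environment.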

## Bug fix: stale-baseline race in conditional skip

Read last_snapshot_lsn BEFORE the 'skip first immediate tick' await.
Previously the first iv.tick().await could yield to the runtime, letting
concurrent appends advance the LSN before the baseline was captured —
the first real tick would then observe delta=0 and skip even though
events accumulated.
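
The ordering problem can be reproduced without an async runtime. In this sketch an `AtomicU64` stands in for the WAL LSN and a closure stands in for the first `iv.tick().await`, during which concurrent appends may land; both function names are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Old ordering: baseline captured after the first tick, so events that
/// arrived during the tick are invisible to the first real comparison.
fn first_delta_buggy(lsn: &AtomicU64, first_tick: impl FnOnce()) -> u64 {
    first_tick();
    let baseline = lsn.load(Ordering::SeqCst);
    lsn.load(Ordering::SeqCst) - baseline
}

/// Fixed ordering: baseline captured before yielding to the first tick.
fn first_delta_fixed(lsn: &AtomicU64, first_tick: impl FnOnce()) -> u64 {
    let baseline = lsn.load(Ordering::SeqCst);
    first_tick();
    lsn.load(Ordering::SeqCst) - baseline
}
```

If five events are appended during the tick, the old ordering reports a delta of 0 and skips, while the fixed ordering reports 5.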

## CI fixes from PR #152 first run

  1. clippy box_default in crates/beava-core/tests/snapshot_lock_hold_repro.rs:50
     — Box::new(Default::default()) -> Box::default()
  2. phase13_5_3_no_env_var_pokes_in_tests architectural tripwire —
     removed fork_env_gate + env_parsing tests; refactored as above.
  3. unused CountDistinctStateWrap import after the box_default fix.

## Tests

Default cargo test now has 4 conditional-snapshot tests + the previously-
landed 27 fork tests. All green:

  test default_zero_threshold_always_snapshots_on_tick ... ok
  test nonzero_threshold_skips_when_below ... ok
  test nonzero_threshold_fires_when_met ... ok
  test manual_trigger_bypasses_threshold ... ok

Architectural tripwires green:
  phase12_6_mio_only_dataplane (3/3)
  phase13_5_3_no_env_var_pokes_in_tests (2/2)
petrpan26 pushed a commit that referenced this pull request May 22, 2026
Two operator-facing changes that make the fork()+COW snapshot path the
default behaviour on linux/macos and protect it from system-wide
Transparent Huge Pages misconfiguration.

snapshot_fork::fork_enabled():
- Previously opt-in via BEAVA_SNAPSHOT_FORK=1. Now defaults ON whenever
  cfg!(unix). Operators opt out by setting BEAVA_SNAPSHOT_FORK to 0,
  false, no, or empty. Non-unix targets return false (fork(2) absent).
- All four snapshot_conditional.rs tests already pass use_fork_snapshot
  explicitly in their SnapshotTaskConfig, so the default flip changes
  no test behaviour.
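
The default-on rule with its opt-out values could be expressed as a pure function over the raw variable value. This is an illustrative sketch: `fork_enabled_from` is a hypothetical name, and trimming plus case-insensitive matching are assumptions that the text above does not confirm.

```rust
/// `value` is the raw BEAVA_SNAPSHOT_FORK setting, or None when unset.
fn fork_enabled_from(value: Option<&str>) -> bool {
    if !cfg!(unix) {
        return false; // fork(2) is not available off unix
    }
    match value.map(|v| v.trim().to_ascii_lowercase()) {
        None => true, // unset: fork path is the default
        Some(v) => !matches!(v.as_str(), "" | "0" | "false" | "no"),
    }
}
```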

new crate::thp module wired into ServerV18::bind:
- prctl(PR_SET_THP_DISABLE) opts this process out of THP regardless of
  the system-wide setting. THP promotes 4KB pages to 2MB, which makes
  fork()+COW page granularity 2MB — a 500x amplifier on COW overhead
  during BGSAVE-style snapshots.
- Logs a structured WARN if /sys/kernel/mm/transparent_hugepage/enabled
  reads [always], mirroring the Redis startup warning. Non-linux is a
  no-op.
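
The warning check comes down to finding the bracketed token in the sysfs file, whose contents look like `always [madvise] never`. The sketch below covers only that parsing step and the page-size ratio behind the "500x" figure; the prctl call itself is not shown, and the function names and warning text are hypothetical.

```rust
/// Ratio of a 2 MB huge page to a 4 KB base page: the COW granularity blow-up.
const THP_AMPLIFICATION: usize = (2 * 1024 * 1024) / (4 * 1024);

/// Extract the active mode (the token in square brackets) from the contents
/// of /sys/kernel/mm/transparent_hugepage/enabled.
fn active_thp_mode(contents: &str) -> Option<&str> {
    let start = contents.find('[')? + 1;
    let end = start + contents[start..].find(']')?;
    Some(&contents[start..end])
}

/// Return a warning message only when the system-wide mode is "always".
fn thp_warning(contents: &str) -> Option<String> {
    match active_thp_mode(contents) {
        Some("always") => Some(format!(
            "transparent huge pages set to [always]; COW granularity grows {}x",
            THP_AMPLIFICATION
        )),
        _ => None,
    }
}
```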

Full local gate green: cargo fmt --check, cargo build --workspace,
cargo clippy --workspace --all-targets --all-features -- -D warnings,
cargo test --workspace.
@petrpan26 petrpan26 force-pushed the phase-27.1-snapshot-fork-cow branch from 90b72b4 to 02cf812 Compare May 27, 2026 16:32
Avoid taking the registry RwLock in the fork child, and keep
conditional snapshot accounting anchored to the LSN actually written.

Move THP opt-out success to debug and keep the large lock-hold
repro out of default test runs.
@petrpan26 petrpan26 force-pushed the phase-27.1-snapshot-fork-cow branch from 02cf812 to 7beded7 Compare May 27, 2026 16:51
Track data-plane applied LSNs in snapshot progress and avoid locking
parking_lot mutexes in the fork child.

Skip hand-rolled WAL records covered by a snapshot, and rebuild from
WAL when a registry tail changes aggregation ids after the snapshot.
@petrpan26 petrpan26 force-pushed the phase-27.1-snapshot-fork-cow branch from 7beded7 to 33bc047 Compare May 27, 2026 17:41
Hoang Phan added 3 commits May 27, 2026 14:55
Compact the hand-rolled .wal file after durable snapshots by retaining only records above the snapshot LSN.
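
In its simplest form, the retention rule is a filter on record LSNs. `WalRecord` and `compact` are hypothetical names for illustration; the real on-disk record layout is not described in this PR text.

```rust
/// Hypothetical in-memory view of one WAL record.
#[derive(Debug, Clone, PartialEq)]
struct WalRecord {
    lsn: u64,
    payload: Vec<u8>,
}

/// Keep only records strictly above the snapshot LSN; everything at or
/// below it is already covered by the durable snapshot.
fn compact(records: Vec<WalRecord>, snapshot_lsn: u64) -> Vec<WalRecord> {
    records.into_iter().filter(|r| r.lsn > snapshot_lsn).collect()
}
```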

Reject control chars on push, quarantine LSN-known WAL decode failures, and publish snapshot write metrics.
@petrpan26 petrpan26 changed the title feat(27.1): fork()+COW snapshot — drop apply-thread lock-hold from seconds to microseconds (fixes #151) fix(snapshot): reclaim covered WAL bytes May 28, 2026
Hoang Phan added 11 commits May 28, 2026 14:38
Open the replacement append fd before swapping the compacted WAL into place so a failed open cannot leave the writer on an unlinked inode.
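
The ordering matters because a rename keeps the inode: a handle opened on the temporary file before the rename ends up pointing at the live WAL afterwards. A minimal sketch of that sequence, assuming a Unix filesystem; `swap_compacted_wal` and the `.wal.compact` temp name are hypothetical, and directory fsync is omitted.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Write the compacted bytes to a temp file, open the append handle on it,
/// and only then rename it over the live WAL.
fn swap_compacted_wal(wal: &Path, compacted: &[u8]) -> std::io::Result<File> {
    let tmp = wal.with_extension("wal.compact");
    let mut tmp_file = File::create(&tmp)?;
    tmp_file.write_all(compacted)?;
    tmp_file.sync_all()?;
    // If this open fails, the old WAL is still in place and untouched.
    let appender = OpenOptions::new().append(true).open(&tmp)?;
    // The rename swaps the path; `appender` follows the same inode.
    fs::rename(&tmp, wal)?;
    Ok(appender)
}
```

Appends through the returned handle land in the file now visible at the WAL path, so the writer is never left on an unlinked inode.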

Treat torn hand-rolled WAL tails like recovery EOF, add control-character coverage, and make fork child waits killable.

After a snapshot child times out, poll waitpid with WNOHANG for a bounded grace window instead of falling back to an unbounded reap.

Cover the timeout path with a fork test that kills and reaps a wedged child.

Check the deadline before retrying interrupted waitpid calls so signal storms cannot bypass the fork child wait budget.
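
The bounded wait can be sketched as a generic polling loop in which the closure stands in for a `waitpid(pid, WNOHANG)` call that returns `Some(status)` once the child has exited. `poll_until` is a hypothetical helper, not the function in this PR.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Poll until `poll` yields a value or the deadline passes.
fn poll_until<T>(
    deadline: Instant,
    interval: Duration,
    mut poll: impl FnMut() -> Option<T>,
) -> Option<T> {
    loop {
        if let Some(v) = poll() {
            return Some(v);
        }
        // Check the deadline before every retry so repeated interruptions
        // or empty polls cannot extend the wait budget.
        let now = Instant::now();
        if now >= deadline {
            return None;
        }
        // Never sleep past the deadline.
        thread::sleep(interval.min(deadline - now));
    }
}
```

A `None` result is the caller's cue to kill the child and run one more bounded reap rather than block indefinitely.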
Use the snapshot body applied LSN for data-plane replay and repair torn hand-rolled WAL tails before append. Make compaction directory sync failures visible while keeping the writer on the replacement file.

Tighten fork snapshot sidecars and lock scope, and reject escaped TCP control characters before event lookup.
Homepage run-for-real blocks used bv.App("0.0.0.0:6400").register(...).serve():
bare host:port (SDK requires a URL scheme), a .serve() method that does not
exist, and the wrong port. Point them at http://localhost:8080 and drop
.serve(). Fix bv.col("amount").cast(int) -> cast("int") in the streams
concept page (cast accepts only the string forms).

bv.test.fixture is a pytest-shaped generator yielding a bv.App, not a
context manager with advance_time(); rewrite the testing example to the
real API. The field guides curled /pipelines, /events/{Name} and
/features/{name}?key=, none of which exist. Point them at the real
/register, /push and /get/{feature}/{key} routes.

Two grid blowouts forced the homepage to ~967px on a 390px phone, clipping
the live-reflexes feature column and bleeding the pipeline cards off-screen.
Collapse the live-reflexes rows to a single stacked column under 760px, and
switch the pipelines grid to minmax(0,1fr) so the code panel's long <pre>
lines scroll instead of widening the track.

The live query panel curled POST /batch/{table} with a bare ["id",...] body
and rendered a uid-keyed object, none of which the engine speaks. Use the
real POST /batch_get with {requests:[{table,key},...]} and render the
{results:[...]} list the server actually returns.

The chapter-1 query widget's mock results tacked a user_id onto each row;
the real /batch_get returns a flat feature map per result in request order
(no key wrapping), so drop it. Bump footer nav link tap targets from ~17px
to ~31px (inline-block + vertical padding) for comfortable mobile tapping.

@petrpan26 petrpan26 merged commit 2a38e03 into main May 29, 2026
16 checks passed
@petrpan26 petrpan26 deleted the phase-27.1-snapshot-fork-cow branch May 29, 2026 12:50
petrpan26 added a commit that referenced this pull request May 29, 2026
Cuts **0.0.6**: the Phase 27.1 snapshot fork-COW + WAL hardening (SEV-1
incident fix) and docs alignment, both already merged to main via #152.

- Bump `Cargo.toml` (workspace) + `python/pyproject.toml` to 0.0.6
- Refresh `Cargo.lock`
- CHANGELOG `[0.0.6]` entry

Tag `v0.0.6` will be pushed after merge to fire the release workflows
(GitHub Release, wheels, Docker, Homebrew).

---------

Co-authored-by: Hoang Phan <hoang@beava.dev>
Co-authored-by: dosubot[bot] <131922026+dosubot[bot]@users.noreply.github.com>