Skip to content

test: harden record log with property and failure tests (#21)#25

Merged
robertoberto merged 2 commits into
mainfrom
harden/proptest-lock-diskfull
May 21, 2026
Merged

test: harden record log with property and failure tests (#21)#25
robertoberto merged 2 commits into
mainfrom
harden/proptest-lock-diskfull

Conversation

@robertoberto
Copy link
Copy Markdown
Contributor

Summary

Fixes the partial-tail bug reported in #21 by introducing a writer-poison state on RecordLog, and lands the Tier 1 hardening track promised in umbrella #22 (property tests, multi-process lock test, disk-full simulation).

Closes #21. Refs #22 hardening umbrella PR-3.

Writer poisoning (the fix for #21)

After any I/O failure in append_record, fsync, or rotate, the live handle transitions to poisoned. Subsequent mutating calls return a deterministic error with stable wording:

datawal: writer poisoned: <reason>; drop handle and reopen

The wording is part of the contract and is asserted by poison_message_is_stable_format. Read-only operations (scan_iter, recovery_report, active_segment, dir, plus the new is_poisoned()) remain available so the caller can inspect before dropping. Reopen runs the existing longest-valid-prefix recovery and starts fresh.

A typed error variant (e.g. Error::Poisoned) is out of scope for this PR; that would mean expanding the public surface and shipping a typed error story for the whole crate. Tracked separately.

What ships

  • crates/datawal-core/src/record_log.rs — private poisoned: Option<&'static str> field, is_poisoned(), internal check_poisoned() + poison_error() helpers, write paths in append_record / fsync / rotate rewritten to set poison on failure. Rustdoc updated with a "Failure model" section.
  • crates/datawal-core/src/lib.rs#[doc(hidden)] pub mod testing with a single helper poison_record_log_for_test, used by the unit tests to inject synthetic poison without needing a real disk failure.
  • tests/poison_writer.rs (6 tests) — documents the failure model end-to-end.
  • tests/proptest_recovery.rs (5 properties × 64 cases each) — random sequences of put / delete / fsync / reopen / rotate against an in-memory BTreeMap oracle:
    • reopen_preserves_kv_state
    • delete_then_put_resurrects_only_new_value
    • compact_to_preserves_kv_state
    • scan_iter_equals_scan
    • rotate_is_kv_transparent
  • tests/multi_process_lock.rs (1 unix-only test) — env-var bifurcation pattern (same idea as the existing crash_injection.rs): a child opens the store first, the parent observes the contended error containing lock, kills the child, verifies the lock becomes acquirable again.
  • tests/disk_full.rs (1 auto + 1 ignored manual, linux-only) — fills /dev/shm tmpfs with appends until ENOSPC, then asserts the poison contract. Skips cleanly when /dev/shm is absent or unwritable. The ignored manual_disk_full variant lets an operator point DATAWAL_DISK_FULL_DIR at a deliberately-small dedicated tmpfs for pre-release rehearsal.

What stays the same

New dependency

proptest = { version = "1.5", default-features = false, features = ["std"] } as a dev-dependency only. Workspace-level so cargo doc and cargo publish of the library do not see it. 1.5+ supports MSRV 1.66, so the 1.75.0 job stays green.

Public surface delta

Additive only: one new method RecordLog::is_poisoned(&self) -> bool. No new types, no breaking change. The #[doc(hidden)] pub mod testing is documented as a test-only escape hatch and is not part of the contract.

Local validation

  • cargo fmt --all -- --check — clean
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo test --workspace --all-targets — green (123 tests across both crates)
  • RUSTDOCFLAGS='-D warnings' cargo doc --workspace --no-deps — clean
  • cargo bench --workspace --no-run — all benches build
  • cargo publish --dry-run -p datawal --allow-dirty — clean

Tier 2 explicitly deferred

dm-flakey, differential test against a second implementation, memory-leak instrumentation, concurrent fuzz — none of these land here.

Closes #21. Refs #22 hardening umbrella PR-3.

Fixes the partial-tail bug reported in #21 by adding a writer-poison
state to RecordLog. After any I/O failure in append_record, fsync, or
rotate, the handle transitions to poisoned: subsequent mutating calls
return a deterministic error with the stable wording

    datawal: writer poisoned: <reason>; drop handle and reopen

Read-only operations (scan_iter, recovery_report, active_segment, dir)
remain available so callers can inspect before dropping. Reopen runs
the existing longest-valid-prefix recovery and starts fresh; the wire
format is unchanged. The error remains anyhow-wrapped for now;
promoting it to a typed Error variant is a separate PR (out of scope
here to avoid touching the public surface).

Hardening tests added under crates/datawal-core/tests/:

  * poison_writer.rs (6 tests) -- documents the failure model:
    mutating calls poison; read-only calls remain available; reopen
    clears poison; the exact wording is locked in.

  * proptest_recovery.rs (5 properties x 64 cases) -- random
    sequences of put / delete / fsync / reopen / rotate against an
    in-memory BTreeMap oracle. Properties: reopen preserves KV
    state, delete-then-put resurrects only the new value, compact_to
    preserves KV state, scan_iter equals eager scan, rotate is
    KV-transparent. Uses proptest 1.5+ with default-features off
    (std only), failure_persistence disabled to avoid polluting the
    tree.

  * multi_process_lock.rs (1 unix-only test) -- env-var bifurcation
    in the same binary: a child process opens the store first
    (taking the fs2 advisory lock), the parent observes the
    contended error containing "lock", kills the child, and verifies
    the lock becomes acquirable again. Pattern mirrors the existing
    crash_injection.rs tests.

  * disk_full.rs (1 auto + 1 ignored manual, linux-only) -- fills
    /dev/shm tmpfs (64 MiB on standard containers) with appends
    until the underlying write_all fails with ENOSPC, then asserts
    the documented poison contract. Skips cleanly when /dev/shm is
    absent or unwritable. The ignored manual_disk_full variant lets
    an operator point DATAWAL_DISK_FULL_DIR at a deliberately-small
    dedicated tmpfs for stronger pre-release rehearsal.

Tier 2 hardening (dm-flakey, differential test, memory leak, concurrent
fuzz) is intentionally deferred. CI surface unchanged: the new tests
run inside the existing rust(stable) and rust(1.75.0) jobs. proptest
1.5+ supports MSRV 1.66, so the MSRV job stays green. WIRE_VERSION
remains 1; corpus fixtures untouched; formal models untouched.
`proptest >= 1.9` declares `rust-version = "1.82"`, and the resolver
picked up `1.11.0` (rust-version 1.85) on the MSRV 1.75 job, breaking
`cargo check` with:

  package `proptest v1.11.0` cannot be built because it requires
  rustc 1.85 or newer, while the currently active rustc version is
  1.75.0

`1.8.0` is the highest `proptest` line whose `rust-version` (1.74) is
still <= MSRV 1.75. Pin lives only in `Cargo.lock`; the workspace
caret `^1.5` is unchanged so fresh resolvers on stable still pick the
latest. Same pattern as the `getrandom` lock-pin documented in
`AGENTS.md` § "Toolchain pinning".

The 5 proptest properties in `tests/proptest_recovery.rs` were re-run
against 1.8.0 locally and all pass.
@robertoberto robertoberto merged commit ce74484 into main May 21, 2026
7 checks passed
@robertoberto robertoberto deleted the harden/proptest-lock-diskfull branch May 21, 2026 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: failed append may leave partial tail and stale active_size

1 participant