test: harden record log with property and failure tests (#21)#25
Merged
Conversation
Closes #21. Refs #22 hardening umbrella PR-3. Fixes the partial-tail bug reported in #21 by adding a writer-poison state to RecordLog. After any I/O failure in append_record, fsync, or rotate, the handle transitions to poisoned: subsequent mutating calls return a deterministic error with the stable wording datawal: writer poisoned: <reason>; drop handle and reopen Read-only operations (scan_iter, recovery_report, active_segment, dir) remain available so callers can inspect before dropping. Reopen runs the existing longest-valid-prefix recovery and starts fresh; the wire format is unchanged. The error remains anyhow-wrapped for now; promoting it to a typed Error variant is a separate PR (out of scope here to avoid touching the public surface). Hardening tests added under crates/datawal-core/tests/: * poison_writer.rs (6 tests) -- documents the failure model: mutating calls poison; read-only calls remain available; reopen clears poison; the exact wording is locked in. * proptest_recovery.rs (5 properties x 64 cases) -- random sequences of put / delete / fsync / reopen / rotate against an in-memory BTreeMap oracle. Properties: reopen preserves KV state, delete-then-put resurrects only the new value, compact_to preserves KV state, scan_iter equals eager scan, rotate is KV-transparent. Uses proptest 1.5+ with default-features off (std only), failure_persistence disabled to avoid polluting the tree. * multi_process_lock.rs (1 unix-only test) -- env-var bifurcation in the same binary: a child process opens the store first (taking the fs2 advisory lock), the parent observes the contended error containing "lock", kills the child, and verifies the lock becomes acquirable again. Pattern mirrors the existing crash_injection.rs tests. * disk_full.rs (1 auto + 1 ignored manual, linux-only) -- fills /dev/shm tmpfs (64 MiB on standard containers) with appends until the underlying write_all fails with ENOSPC, then asserts the documented poison contract. Skips cleanly when /dev/shm is absent or unwritable. The ignored manual_disk_full variant lets an operator point DATAWAL_DISK_FULL_DIR at a deliberately-small dedicated tmpfs for stronger pre-release rehearsal. Tier 2 hardening (dm-flakey, differential test, memory leak, concurrent fuzz) is intentionally deferred. CI surface unchanged: the new tests run inside the existing rust(stable) and rust(1.75.0) jobs. proptest 1.5+ supports MSRV 1.66, so the MSRV job stays green. WIRE_VERSION remains 1; corpus fixtures untouched; formal models untouched.
`proptest >= 1.9` declares `rust-version = "1.82"`, and the resolver picked up `1.11.0` (rust-version 1.85) on the MSRV 1.75 job, breaking `cargo check` with: package `proptest v1.11.0` cannot be built because it requires rustc 1.85 or newer, while the currently active rustc version is 1.75.0 `1.8.0` is the highest `proptest` line whose `rust-version` (1.74) is still <= MSRV 1.75. Pin lives only in `Cargo.lock`; the workspace caret `^1.5` is unchanged so fresh resolvers on stable still pick the latest. Same pattern as the `getrandom` lock-pin documented in `AGENTS.md` § "Toolchain pinning". The 5 proptest properties in `tests/proptest_recovery.rs` were re-run against 1.8.0 locally and all pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the partial-tail bug reported in #21 by introducing a writer-poison state on
RecordLog, and lands the Tier 1 hardening track promised in umbrella #22 (property tests, multi-process lock test, disk-full simulation).Closes #21. Refs #22 hardening umbrella PR-3.
Writer poisoning (the fix for #21)
After any I/O failure in
append_record,fsync, orrotate, the live handle transitions to poisoned. Subsequent mutating calls return a deterministic error with stable wording:The wording is part of the contract and is asserted by
poison_message_is_stable_format. Read-only operations (scan_iter,recovery_report,active_segment,dir, plus the newis_poisoned()) remain available so the caller can inspect before dropping. Reopen runs the existing longest-valid-prefix recovery and starts fresh.A typed error variant (e.g.
Error::Poisoned) is out of scope for this PR; that would mean expanding the public surface and shipping a typed error story for the whole crate. Tracked separately.What ships
crates/datawal-core/src/record_log.rs— privatepoisoned: Option<&'static str>field,is_poisoned(), internalcheck_poisoned()+poison_error()helpers, write paths inappend_record/fsync/rotaterewritten to set poison on failure. Rustdoc updated with a "Failure model" section.crates/datawal-core/src/lib.rs—#[doc(hidden)] pub mod testingwith a single helperpoison_record_log_for_test, used by the unit tests to inject synthetic poison without needing a real disk failure.tests/poison_writer.rs(6 tests) — documents the failure model end-to-end.tests/proptest_recovery.rs(5 properties × 64 cases each) — random sequences ofput / delete / fsync / reopen / rotateagainst an in-memoryBTreeMaporacle:reopen_preserves_kv_statedelete_then_put_resurrects_only_new_valuecompact_to_preserves_kv_statescan_iter_equals_scanrotate_is_kv_transparenttests/multi_process_lock.rs(1 unix-only test) — env-var bifurcation pattern (same idea as the existingcrash_injection.rs): a child opens the store first, the parent observes the contended error containinglock, kills the child, verifies the lock becomes acquirable again.tests/disk_full.rs(1 auto + 1 ignored manual, linux-only) — fills/dev/shmtmpfs with appends until ENOSPC, then asserts the poison contract. Skips cleanly when/dev/shmis absent or unwritable. The ignoredmanual_disk_fullvariant lets an operator pointDATAWAL_DISK_FULL_DIRat a deliberately-small dedicated tmpfs for pre-release rehearsal.What stays the same
WIRE_VERSION = 1. No corpus fixture mutation.RecordIter(added in feat: add RecordLog::scan_iter for record-level lazy scanning #23) unchanged.rust (stable)andrust (1.75.0)jobs.New dependency
proptest = { version = "1.5", default-features = false, features = ["std"] }as a dev-dependency only. Workspace-level socargo docandcargo publishof the library do not see it. 1.5+ supports MSRV 1.66, so the 1.75.0 job stays green.Public surface delta
Additive only: one new method
RecordLog::is_poisoned(&self) -> bool. No new types, no breaking change. The#[doc(hidden)] pub mod testingis documented as a test-only escape hatch and is not part of the contract.Local validation
cargo fmt --all -- --check— cleancargo clippy --workspace --all-targets -- -D warnings— cleancargo test --workspace --all-targets— green (123 tests across both crates)RUSTDOCFLAGS='-D warnings' cargo doc --workspace --no-deps— cleancargo bench --workspace --no-run— all benches buildcargo publish --dry-run -p datawal --allow-dirty— cleanTier 2 explicitly deferred
dm-flakey, differential test against a second implementation, memory-leak instrumentation, concurrent fuzz — none of these land here.