fix(moarstats): retry on stats coverage mismatch + fsync joined CSV parent dir#3838
Merged
Merged
Conversation
…arent dir Two conservative guards for the Linux-only flake on test_moarstats::moarstats_join_type_full_runs_and_writes_bivariate (CI run 25545197594, surfaced via PR #3834). The macOS fix (#3830) and Windows fix (#3831) hardened the parent-side fsync. Linux still surfaces the same symptom — bivariate output missing secondary columns — at a much lower frequency. The most plausible remaining gap is the `qsv stats`-on-joined-CSV subprocess handoff: the spawned child has been observed to produce stats records for fewer columns than the joined CSV actually has. Two guards: 1. Retry once on stats-coverage mismatch. After `qsv stats` runs on the joined CSV, cross-check that every column in the joined CSV's header appears in the stats output's `field` column. On mismatch, re-fsync the joined CSV and re-run `qsv stats` once. Only fail loud if the mismatch persists. Self-heals the rare race; surfaces a precise error if a different root cause ever appears. 2. Fsync the joined CSV's parent directory after `join_datasets_internal` returns. `fsync(file)` on Linux doesn't flush parent-dir metadata, and rare FS/timing edge cases (overlayfs, fuse, network mounts) can affect read-after-write visibility for the follow-up subprocess. New helper `util::sync_directory()` is a no-op on Windows, where dir-handle `FlushFileBuffers` would error. Also includes `tools/repro_moarstats_linux_flake.sh` — a parallel-worker harness that replicates the failing test as a shell loop, optionally adds page-cache pressure (DD_LOAD=1), and classifies failures into "retry caught it" / "retry didn't catch it" / "different bug" buckets to confirm root cause on a Linux VM. Verified on macOS: `cargo build --locked --bin qsv -F all_features` and `cargo clippy --locked --bin qsv -F all_features -- -D warnings` clean; `cargo t -F all_features --test tests test_moarstats::` 80/80 pass at --test-threads=8; failing test in a 30x stress loop all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds additional durability/visibility guards around moarstats’ joined-CSV → qsv stats subprocess flow to mitigate a rare Linux CI flake, and introduces a Linux stress/repro harness.
Changes:
- Add directory fsync helper and use it after join output creation.
- Validate
qsv statscoverage vs joined CSV headers and retry once on mismatch. - Add a parallel worker shell harness to reproduce/classify the flake on Linux.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tools/repro_moarstats_linux_flake.sh | Adds a parallel stress harness to reproduce and classify the Linux-only flake. |
| src/util.rs | Introduces sync_directory to fsync parent directories (Unix) for durability/visibility. |
| src/cmd/moarstats.rs | Adds parent-dir fsync after join and a retry-on-stats-coverage-mismatch guard for the qsv stats subprocess. |
…ipt hardening All six Copilot comments on PR #3838 applied: src/util.rs - sync_directory: errors no longer propagated. Directory fsync is not uniformly supported across Linux filesystems (FAT, some FUSE/overlay/ network mounts return EINVAL/EOPNOTSUPP); a defensive guard layered on top of file-level fsync must NOT fail the caller. Errors are now logged at debug level and swallowed. Signature changed from CliResult<()> to () to make the contract explicit. - New unit test test_sync_directory_smoke covering both the existing-dir and missing-dir paths (Unix-only). src/cmd/moarstats.rs - Call site updated for the new sync_directory signature. - Stats-output reader now propagates csv parse errors instead of silently dropping rows via filter_map(Result::ok). A malformed/ truncated stats row now surfaces as "Failed to parse joined-stats row N" rather than as a misleading downstream "missing columns" assertion — which is exactly the diagnostic mode this fix is trying to eliminate. tools/repro_moarstats_linux_flake.sh - FIFO is now created inside a real mktemp -d directory, replacing the inherently racy mktemp -u. - cleanup trap now removes LOAD_DIR (when DD_LOAD=1) and FIFO_DIR so temp resources are not leaked on interrupts/early exits. - bivariate awk check now distinguishes "header malformed" (exit 2 -> FAIL_OTHER) from "coverage failed" (exit 1 -> FAIL_OLD_BIVARIATE), so a truncated bivariate header is not silently misclassified as a column-drop bug. Verified on macOS: cargo build --locked clean, cargo clippy -- -D warnings clean, all 80 test_moarstats tests pass at --test-threads=8, the new test_sync_directory_smoke unit test passes, smoke-run of repro script green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two conservative guards for the Linux-only flake on
test_moarstats::moarstats_join_type_full_runs_and_writes_bivariate(CI run 25545197594, surfaced via PR #3834 — that PR only touchesdescribegptand is a bystander).The macOS fix in #3830 and Windows fix in #3831 hardened the parent-side fsync. Linux still surfaces the same symptom — bivariate output missing secondary columns — at a much lower frequency. The most plausible remaining gap is the
qsv stats-on-joined-CSV subprocess handoff: the spawned child has been observed to produce stats records for fewer columns than the joined CSV actually has.Guard 1 — retry once on stats-coverage mismatch (
src/cmd/moarstats.rs)After
qsv statsruns on the joined CSV:fieldcolumn.qsv statsonce.This self-heals the rare race silently (with a
log::warn!for visibility in CI logs) and surfaces a precise error if a different root cause ever appears in the future.Guard 2 — fsync the joined CSV's parent directory (
src/util.rs)fsync(file)on Linux doesn't flush parent-directory metadata. While that primarily matters for crash safety, rare FS/timing edge cases (overlayfs, fuse, network mounts) can affect read-after-write visibility for newly-created files when a follow-up process opens them via path lookup.New helper
util::sync_directory(path):sync_all().FlushFileBufferserrors).Called once on the joined CSV's parent dir right after
join_datasets_internalreturns.Repro harness —
tools/repro_moarstats_linux_flake.shParallel-worker shell harness that replicates the failing test as a shell loop, optionally adds page-cache pressure (
DD_LOAD=1), and classifies failures into:FAIL_NEW_DIAGNOSTIC— strict coverage check fired after retry → confirms subprocess CSV→stats handoff is the root cause.FAIL_OLD_BIVARIATE— bivariate output dropped columns and the strict check did NOT fire → points elsewhere.FAIL_OTHER— qsv error / missing output.Usage on an Ubuntu VM:
QSV_BIN=$PWD/target/debug/qsv WORKERS=4 ITERS=500 DD_LOAD=1 \ ./tools/repro_moarstats_linux_flake.shTest plan
cargo build --locked --bin qsv -F all_featuresclean (macOS).cargo clippy --locked --bin qsv -F all_features -- -D warningsclean.cargo t -F all_features --test tests test_moarstats::— 80/80 pass at--test-threads=8.tools/repro_moarstats_linux_flake.shon an Ubuntu VM (4 vCPU,DD_LOAD=1, ≥2000 runs) — expect either no failures, orFAIL_NEW_DIAGNOSTIC = 0withlog::warn!messages indicating the retry path triggered and self-healed.🤖 Generated with Claude Code