fix(moarstats): close fsync race that silently dropped joined columns on macOS#3830
Merged
Conversation
… on macOS `join_datasets_internal` spawned `qsv join --output <temp>` and then handed that path to a `qsv stats` subprocess without fsyncing. Under heavy parallel test load on macos-26, the follow-up open() could see a short/empty file — producing a "primary-only" bivariate output with no failure signal. The test `moarstats_join_type_full_runs_and_writes_bivariate` flaked on this on macOS CI (run 25488066023). Compounding the issue, the function used `NamedTempFile` + `drop(temp_file)` to "close the file so join can write to it" — but NamedTempFile's Drop deletes the file from disk, not just closes the handle. The path was left as a dangling reservation that `qsv join` re-created with O_CREAT. Three layers of hardening: 1. Replace `NamedTempFile` + `drop` with `into_temp_path().keep()` so the path is owned as a normal file (same fix applied to the joined-stats subprocess temp path). 2. Add `util::sync_subprocess_output(path)` — opens read-only, calls `sync_all()`, validates non-empty. Called after every join/stats subprocess that writes via `--output`. The fsync is the load-bearing barrier on macOS APFS; nearly free on Linux. 3. After each `qsv join` succeeds, parse the joined header and validate every column from the secondary input is present. If a future regression produces silently-corrupt joined output, it now fails with a precise `CliError` instead of confusing downstream stats results. Also harden the four moarstats join-type tests with a new `assert_bivariate_columns_present` helper that pre-checks expected columns before n_pairs assertions, so the next flake (if any) surfaces as "missing columns [value2a, value2b]" rather than "no row found". Verified: 10× stress loop on `test_moarstats::moarstats_join` with `--test-threads=8` — all green; full `test_moarstats::` module 80/80. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens moarstats’s multi-dataset join flow against flaky, platform-specific filesystem timing issues by forcing subprocess output synchronization and by failing fast when join output is structurally incomplete (missing expected columns), improving both reliability and diagnostics.
Changes:
- Added
util::sync_subprocess_outputto fsync and validate non-empty outputs produced via--output. - Updated
moarstatsjoin/stats subprocess pipelines to persist temp paths, fsync outputs, and validate key headers/columns. - Strengthened
test_moarstatsjoin-type tests with a helper that asserts expected bivariate fields are present before checkingn_pairs.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
tests/test_moarstats.rs |
Adds a bivariate-field presence assertion helper and applies it to join-related tests to surface missing-column failures clearly. |
src/util.rs |
Introduces a reusable helper to sync subprocess-written output files and ensure they’re non-empty. |
src/cmd/moarstats.rs |
Reworks temp output handling for join/stats subprocesses, adds fsync barriers, and validates join/stats headers to fail loudly on corrupt output. |
…ookup, debug-level header dump Three Copilot suggestions on PR #3830, all applied: - Intermediate join temp files are now retained as `Vec<TempPath>` (without `.keep()`) so they auto-delete when `join_datasets_internal` returns, rather than accumulating in TEMP_FILE_DIR. The final temp path keeps `.keep()` because the caller owns its lifetime. - Joined-header membership check now uses a `HashSet<&str>` built once per iteration, replacing the O(n*m) `iter().any(...)` loop — relevant for wide CSVs and multi-join chains. - The full joined-header vector is now logged at `debug` only; `info`-level output is the compact column count + byte size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
warning: called `map(<f>).unwrap_or(<a>)` on a `Result` value
--> src/cmd/moarstats.rs:708:20
|
708 | let size = std::fs::metadata(output_path).map(|m| m.len()).unwrap_or(0);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#map_unwrap_or
= note: `-W clippy::map-unwrap-or` implied by `-W clippy::pedantic`
= help: to override `-W clippy::pedantic` add `#[allow(clippy::map_unwrap_or)]`
help: use `map_or(<a>, <f>)` instead
|
708 - let size = std::fs::metadata(output_path).map(|m| m.len()).unwrap_or(0);
708 + let size = std::fs::metadata(output_path).map_or(0, |m| m.len());
…rop note Re-introduces `#[allow(clippy::collection_is_never_read)]` (which the original code had) and replaces the comment with a sharper one that flags the Vec as a Drop-guard collection. Surfaced by `cargo +nightly clippy -- -W clippy::nursery`. The Vec's contents are never read — it exists purely so each `TempPath` stays alive until the function returns, at which point Drop removes its intermediate join file from disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 7, 2026
jqnatividad
added a commit
that referenced
this pull request
May 9, 2026
…arent dir (#3838) * fix(moarstats): retry on stats coverage mismatch + fsync joined CSV parent dir Two conservative guards for the Linux-only flake on test_moarstats::moarstats_join_type_full_runs_and_writes_bivariate (CI run 25545197594, surfaced via PR #3834). The macOS fix (#3830) and Windows fix (#3831) hardened the parent-side fsync. Linux still surfaces the same symptom — bivariate output missing secondary columns — at a much lower frequency. The most plausible remaining gap is the `qsv stats`-on-joined-CSV subprocess handoff: the spawned child has been observed to produce stats records for fewer columns than the joined CSV actually has. Two guards: 1. Retry once on stats-coverage mismatch. After `qsv stats` runs on the joined CSV, cross-check that every column in the joined CSV's header appears in the stats output's `field` column. On mismatch, re-fsync the joined CSV and re-run `qsv stats` once. Only fail loud if the mismatch persists. Self-heals the rare race; surfaces a precise error if a different root cause ever appears. 2. Fsync the joined CSV's parent directory after `join_datasets_internal` returns. `fsync(file)` on Linux doesn't flush parent-dir metadata, and rare FS/timing edge cases (overlayfs, fuse, network mounts) can affect read-after-write visibility for the follow-up subprocess. New helper `util::sync_directory()` is a no-op on Windows, where dir-handle `FlushFileBuffers` would error. Also includes `tools/repro_moarstats_linux_flake.sh` — a parallel-worker harness that replicates the failing test as a shell loop, optionally adds page-cache pressure (DD_LOAD=1), and classifies failures into "retry caught it" / "retry didn't catch it" / "different bug" buckets to confirm root cause on a Linux VM. Verified on macOS: `cargo build --locked --bin qsv -F all_features` and `cargo clippy --locked --bin qsv -F all_features -- -D warnings` clean; `cargo t -F all_features --test tests test_moarstats::` 80/80 pass at --test-threads=8; failing test in a 30x stress loop all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address copilot review: best-effort dir fsync, error propagation, script hardening All six Copilot comments on PR #3838 applied: src/util.rs - sync_directory: errors no longer propagated. Directory fsync is not uniformly supported across Linux filesystems (FAT, some FUSE/overlay/ network mounts return EINVAL/EOPNOTSUPP); a defensive guard layered on top of file-level fsync must NOT fail the caller. Errors are now logged at debug level and swallowed. Signature changed from CliResult<()> to () to make the contract explicit. - New unit test test_sync_directory_smoke covering both the existing-dir and missing-dir paths (Unix-only). src/cmd/moarstats.rs - Call site updated for the new sync_directory signature. - Stats-output reader now propagates csv parse errors instead of silently dropping rows via filter_map(Result::ok). A malformed/ truncated stats row now surfaces as "Failed to parse joined-stats row N" rather than as a misleading downstream "missing columns" assertion — which is exactly the diagnostic mode this fix is trying to eliminate. tools/repro_moarstats_linux_flake.sh - FIFO is now created inside a real mktemp -d directory, replacing the inherently racy mktemp -u. - cleanup trap now removes LOAD_DIR (when DD_LOAD=1) and FIFO_DIR so temp resources are not leaked on interrupts/early exits. - bivariate awk check now distinguishes "header malformed" (exit 2 -> FAIL_OTHER) from "coverage failed" (exit 1 -> FAIL_OLD_BIVARIATE), so a truncated bivariate header is not silently misclassified as a column-drop bug. Verified on macOS: cargo build --locked clean, cargo clippy -- -D warnings clean, all 80 test_moarstats tests pass at --test-threads=8, the new test_sync_directory_smoke unit test passes, smoke-run of repro script green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
test_moarstats::moarstats_join_type_full_runs_and_writes_bivariate(run 25488066023) whereqsv join's output was not fsynced before being read by a follow-upqsv statssubprocess, producing a silent "primary-only" bivariate output under heavy parallel test load on APFS.NamedTempFile+droppattern injoin_datasets_internal(Drop deletes the file from disk, not just closes the handle) withinto_temp_path().keep(), and adds a genericutil::sync_subprocess_outputhelper that fsyncs and validates non-emptiness for every join/stats subprocess that writes via--output.qsv join, the secondary input's columns must appear in the joined output, or moarstats fails with a preciseCliErrorinstead of feeding silently-corrupt data downstream. Mirrors this with a newassert_bivariate_columns_presenttest helper used in all four join-type tests so any future flake surfaces as "missing columns [value2a, value2b]" rather than a confusing n_pairs mismatch.Test plan
cargo build --locked --bin qsv -F all_featurescargo test --locked -F all_features test_moarstats::— 80/80 passtest_moarstats::moarstats_joinwith--test-threads=8— all green every iterationcargo clippy— no new warnings in touched code (pre-existing PI-constant errors insrc/cmd/excel.rs/tests/test_luau.rsare unrelated and present on master)🤖 Generated with Claude Code