Skip to content

fix(moarstats): close fsync race that silently dropped joined columns on macOS#3830

Merged
jqnatividad merged 5 commits into
masterfrom
fix-flaky-moarstats-test-202605
May 7, 2026
Merged

fix(moarstats): close fsync race that silently dropped joined columns on macOS#3830
jqnatividad merged 5 commits into
masterfrom
fix-flaky-moarstats-test-202605

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

@jqnatividad jqnatividad commented May 7, 2026

Summary

  • Fixes a flaky macOS CI failure in test_moarstats::moarstats_join_type_full_runs_and_writes_bivariate (run 25488066023) where qsv join's output was not fsynced before being read by a follow-up qsv stats subprocess, producing a silent "primary-only" bivariate output under heavy parallel test load on APFS.
  • Replaces a misleading NamedTempFile + drop pattern in join_datasets_internal (Drop deletes the file from disk, not just closes the handle) with into_temp_path().keep(), and adds a generic util::sync_subprocess_output helper that fsyncs and validates non-emptiness for every join/stats subprocess that writes via --output.
  • Adds joined-header validation: after each qsv join, the secondary input's columns must appear in the joined output, or moarstats fails with a precise CliError instead of feeding silently-corrupt data downstream. Mirrors this with a new assert_bivariate_columns_present test helper used in all four join-type tests so any future flake surfaces as "missing columns [value2a, value2b]" rather than a confusing n_pairs mismatch.

Test plan

  • cargo build --locked --bin qsv -F all_features
  • cargo test --locked -F all_features test_moarstats:: — 80/80 pass
  • 10× stress loop on test_moarstats::moarstats_join with --test-threads=8 — all green every iteration
  • cargo clippy — no new warnings in touched code (pre-existing PI-constant errors in src/cmd/excel.rs / tests/test_luau.rs are unrelated and present on master)
  • CI macOS run on this PR

🤖 Generated with Claude Code

… on macOS

`join_datasets_internal` spawned `qsv join --output <temp>` and then handed
that path to a `qsv stats` subprocess without fsyncing. Under heavy parallel
test load on macos-26, the follow-up open() could see a short/empty file —
producing a "primary-only" bivariate output with no failure signal. The
test `moarstats_join_type_full_runs_and_writes_bivariate` flaked on this on
macOS CI (run 25488066023).

Compounding the issue, the function used `NamedTempFile` + `drop(temp_file)`
to "close the file so join can write to it" — but NamedTempFile's Drop
deletes the file from disk, not just closes the handle. The path was left as
a dangling reservation that `qsv join` re-created with O_CREAT.

Three layers of hardening:

1. Replace `NamedTempFile` + `drop` with `into_temp_path().keep()` so the
   path is owned as a normal file (same fix applied to the joined-stats
   subprocess temp path).
2. Add `util::sync_subprocess_output(path)` — opens read-only, calls
   `sync_all()`, validates non-empty. Called after every join/stats
   subprocess that writes via `--output`. The fsync is the load-bearing
   barrier on macOS APFS; nearly free on Linux.
3. After each `qsv join` succeeds, parse the joined header and validate
   every column from the secondary input is present. If a future regression
   produces silently-corrupt joined output, it now fails with a precise
   `CliError` instead of confusing downstream stats results.

Also harden the four moarstats join-type tests with a new
`assert_bivariate_columns_present` helper that pre-checks expected
columns before n_pairs assertions, so the next flake (if any) surfaces as
"missing columns [value2a, value2b]" rather than "no row found".

Verified: 10× stress loop on `test_moarstats::moarstats_join` with
`--test-threads=8` — all green; full `test_moarstats::` module 80/80.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad requested a review from Copilot May 7, 2026 10:44
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented May 7, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens moarstats’s multi-dataset join flow against flaky, platform-specific filesystem timing issues by forcing subprocess output synchronization and by failing fast when join output is structurally incomplete (missing expected columns), improving both reliability and diagnostics.

Changes:

  • Added util::sync_subprocess_output to fsync and validate non-empty outputs produced via --output.
  • Updated moarstats join/stats subprocess pipelines to persist temp paths, fsync outputs, and validate key headers/columns.
  • Strengthened test_moarstats join-type tests with a helper that asserts expected bivariate fields are present before checking n_pairs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/test_moarstats.rs Adds a bivariate-field presence assertion helper and applies it to join-related tests to surface missing-column failures clearly.
src/util.rs Introduces a reusable helper to sync subprocess-written output files and ensure they’re non-empty.
src/cmd/moarstats.rs Reworks temp output handling for join/stats subprocesses, adds fsync barriers, and validates join/stats headers to fail loudly on corrupt output.

Comment thread src/cmd/moarstats.rs Outdated
Comment thread src/cmd/moarstats.rs
Comment thread src/cmd/moarstats.rs Outdated
jqnatividad and others added 4 commits May 7, 2026 06:54
…ookup, debug-level header dump

Three Copilot suggestions on PR #3830, all applied:

- Intermediate join temp files are now retained as `Vec<TempPath>` (without
  `.keep()`) so they auto-delete when `join_datasets_internal` returns,
  rather than accumulating in TEMP_FILE_DIR. The final temp path keeps
  `.keep()` because the caller owns its lifetime.
- Joined-header membership check now uses a `HashSet<&str>` built once per
  iteration, replacing the O(n*m) `iter().any(...)` loop — relevant for
  wide CSVs and multi-join chains.
- The full joined-header vector is now logged at `debug` only; `info`-level
  output is the compact column count + byte size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
warning: called `map(<f>).unwrap_or(<a>)` on a `Result` value
   --> src/cmd/moarstats.rs:708:20
    |
708 |         let size = std::fs::metadata(output_path).map(|m| m.len()).unwrap_or(0);
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#map_unwrap_or
    = note: `-W clippy::map-unwrap-or` implied by `-W clippy::pedantic`
    = help: to override `-W clippy::pedantic` add `#[allow(clippy::map_unwrap_or)]`
help: use `map_or(<a>, <f>)` instead
    |
708 -         let size = std::fs::metadata(output_path).map(|m| m.len()).unwrap_or(0);
708 +         let size = std::fs::metadata(output_path).map_or(0, |m| m.len());
…rop note

Re-introduces `#[allow(clippy::collection_is_never_read)]` (which the
original code had) and replaces the comment with a sharper one that flags
the Vec as a Drop-guard collection. Surfaced by
`cargo +nightly clippy -- -W clippy::nursery`.

The Vec's contents are never read — it exists purely so each `TempPath`
stays alive until the function returns, at which point Drop removes its
intermediate join file from disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit f822a13 into master May 7, 2026
14 of 17 checks passed
@jqnatividad jqnatividad deleted the fix-flaky-moarstats-test-202605 branch May 7, 2026 11:05
@jqnatividad jqnatividad restored the fix-flaky-moarstats-test-202605 branch May 7, 2026 11:30
@jqnatividad jqnatividad deleted the fix-flaky-moarstats-test-202605 branch May 7, 2026 11:31
jqnatividad added a commit that referenced this pull request May 9, 2026
…arent dir (#3838)

* fix(moarstats): retry on stats coverage mismatch + fsync joined CSV parent dir

Two conservative guards for the Linux-only flake on
test_moarstats::moarstats_join_type_full_runs_and_writes_bivariate
(CI run 25545197594, surfaced via PR #3834).

The macOS fix (#3830) and Windows fix (#3831) hardened the parent-side
fsync. Linux still surfaces the same symptom — bivariate output missing
secondary columns — at a much lower frequency. The most plausible
remaining gap is the `qsv stats`-on-joined-CSV subprocess handoff: the
spawned child has been observed to produce stats records for fewer
columns than the joined CSV actually has.

Two guards:

1. Retry once on stats-coverage mismatch. After `qsv stats` runs on the
   joined CSV, cross-check that every column in the joined CSV's header
   appears in the stats output's `field` column. On mismatch, re-fsync
   the joined CSV and re-run `qsv stats` once. Only fail loud if the
   mismatch persists. Self-heals the rare race; surfaces a precise error
   if a different root cause ever appears.

2. Fsync the joined CSV's parent directory after `join_datasets_internal`
   returns. `fsync(file)` on Linux doesn't flush parent-dir metadata, and
   rare FS/timing edge cases (overlayfs, fuse, network mounts) can affect
   read-after-write visibility for the follow-up subprocess. New helper
   `util::sync_directory()` is a no-op on Windows, where dir-handle
   `FlushFileBuffers` would error.

Also includes `tools/repro_moarstats_linux_flake.sh` — a parallel-worker
harness that replicates the failing test as a shell loop, optionally
adds page-cache pressure (DD_LOAD=1), and classifies failures into
"retry caught it" / "retry didn't catch it" / "different bug" buckets
to confirm root cause on a Linux VM.

Verified on macOS: `cargo build --locked --bin qsv -F all_features` and
`cargo clippy --locked --bin qsv -F all_features -- -D warnings` clean;
`cargo t -F all_features --test tests test_moarstats::` 80/80 pass at
--test-threads=8; failing test in a 30x stress loop all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address copilot review: best-effort dir fsync, error propagation, script hardening

All six Copilot comments on PR #3838 applied:

src/util.rs
- sync_directory: errors no longer propagated. Directory fsync is not
  uniformly supported across Linux filesystems (FAT, some FUSE/overlay/
  network mounts return EINVAL/EOPNOTSUPP); a defensive guard layered on
  top of file-level fsync must NOT fail the caller. Errors are now logged
  at debug level and swallowed. Signature changed from CliResult<()> to
  () to make the contract explicit.
- New unit test test_sync_directory_smoke covering both the existing-dir
  and missing-dir paths (Unix-only).

src/cmd/moarstats.rs
- Call site updated for the new sync_directory signature.
- Stats-output reader now propagates csv parse errors instead of
  silently dropping rows via filter_map(Result::ok). A malformed/
  truncated stats row now surfaces as "Failed to parse joined-stats
  row N" rather than as a misleading downstream "missing columns"
  assertion — which is exactly the diagnostic mode this fix is trying
  to eliminate.

tools/repro_moarstats_linux_flake.sh
- FIFO is now created inside a real mktemp -d directory, replacing the
  inherently racy mktemp -u.
- cleanup trap now removes LOAD_DIR (when DD_LOAD=1) and FIFO_DIR so
  temp resources are not leaked on interrupts/early exits.
- bivariate awk check now distinguishes "header malformed" (exit 2 ->
  FAIL_OTHER) from "coverage failed" (exit 1 -> FAIL_OLD_BIVARIATE), so
  a truncated bivariate header is not silently misclassified as a
  column-drop bug.

Verified on macOS: cargo build --locked clean, cargo clippy -- -D warnings
clean, all 80 test_moarstats tests pass at --test-threads=8, the new
test_sync_directory_smoke unit test passes, smoke-run of repro script
green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants