Skip to content

fix(moarstats): retry on stats coverage mismatch + fsync joined CSV parent dir#3838

Merged
jqnatividad merged 2 commits into
masterfrom
fix/moarstats-linux-stats-validation
May 9, 2026
Merged

fix(moarstats): retry on stats coverage mismatch + fsync joined CSV parent dir#3838
jqnatividad merged 2 commits into
masterfrom
fix/moarstats-linux-stats-validation

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

Two conservative guards for the Linux-only flake on test_moarstats::moarstats_join_type_full_runs_and_writes_bivariate (CI run 25545197594, surfaced via PR #3834 — that PR only touches describegpt and is a bystander).

The macOS fix in #3830 and Windows fix in #3831 hardened the parent-side fsync. Linux still surfaces the same symptom — bivariate output missing secondary columns — at a much lower frequency. The most plausible remaining gap is the qsv stats-on-joined-CSV subprocess handoff: the spawned child has been observed to produce stats records for fewer columns than the joined CSV actually has.

Guard 1 — retry once on stats-coverage mismatch (src/cmd/moarstats.rs)

After qsv stats runs on the joined CSV:

  • Cross-check that every column in the joined CSV's header appears in the stats output's field column.
  • On mismatch, re-fsync the joined CSV and re-run qsv stats once.
  • Only fail loud if the mismatch persists after the retry.

This self-heals the rare race silently (with a log::warn! for visibility in CI logs) and surfaces a precise error if a different root cause ever appears in the future.

Guard 2 — fsync the joined CSV's parent directory (src/util.rs)

fsync(file) on Linux doesn't flush parent-directory metadata. While that primarily matters for crash safety, rare FS/timing edge cases (overlayfs, fuse, network mounts) can affect read-after-write visibility for newly-created files when a follow-up process opens them via path lookup.

New helper util::sync_directory(path):

  • Unix: opens the directory and calls sync_all().
  • Windows: no-op (dir-handle FlushFileBuffers errors).

Called once on the joined CSV's parent dir right after join_datasets_internal returns.

Repro harness — tools/repro_moarstats_linux_flake.sh

Parallel-worker shell harness that replicates the failing test as a shell loop, optionally adds page-cache pressure (DD_LOAD=1), and classifies failures into:

  • FAIL_NEW_DIAGNOSTIC — strict coverage check fired after retry → confirms subprocess CSV→stats handoff is the root cause.
  • FAIL_OLD_BIVARIATE — bivariate output dropped columns and the strict check did NOT fire → points elsewhere.
  • FAIL_OTHER — qsv error / missing output.

Usage on an Ubuntu VM:

QSV_BIN=$PWD/target/debug/qsv WORKERS=4 ITERS=500 DD_LOAD=1 \
    ./tools/repro_moarstats_linux_flake.sh

Test plan

  • cargo build --locked --bin qsv -F all_features clean (macOS).
  • cargo clippy --locked --bin qsv -F all_features -- -D warnings clean.
  • cargo t -F all_features --test tests test_moarstats:: — 80/80 pass at --test-threads=8.
  • Failing test in a 30× serial stress loop on macOS — all pass.
  • CI green on Linux + Windows + macOS workflows.
  • (Optional) tools/repro_moarstats_linux_flake.sh on an Ubuntu VM (4 vCPU, DD_LOAD=1, ≥2000 runs) — expect either no failures, or FAIL_NEW_DIAGNOSTIC = 0 with log::warn! messages indicating the retry path triggered and self-healed.

🤖 Generated with Claude Code

…arent dir

Two conservative guards for the Linux-only flake on
test_moarstats::moarstats_join_type_full_runs_and_writes_bivariate
(CI run 25545197594, surfaced via PR #3834).

The macOS fix (#3830) and Windows fix (#3831) hardened the parent-side
fsync. Linux still surfaces the same symptom — bivariate output missing
secondary columns — at a much lower frequency. The most plausible
remaining gap is the `qsv stats`-on-joined-CSV subprocess handoff: the
spawned child has been observed to produce stats records for fewer
columns than the joined CSV actually has.

Two guards:

1. Retry once on stats-coverage mismatch. After `qsv stats` runs on the
   joined CSV, cross-check that every column in the joined CSV's header
   appears in the stats output's `field` column. On mismatch, re-fsync
   the joined CSV and re-run `qsv stats` once. Only fail loud if the
   mismatch persists. Self-heals the rare race; surfaces a precise error
   if a different root cause ever appears.

2. Fsync the joined CSV's parent directory after `join_datasets_internal`
   returns. `fsync(file)` on Linux doesn't flush parent-dir metadata, and
   rare FS/timing edge cases (overlayfs, fuse, network mounts) can affect
   read-after-write visibility for the follow-up subprocess. New helper
   `util::sync_directory()` is a no-op on Windows, where dir-handle
   `FlushFileBuffers` would error.

Also includes `tools/repro_moarstats_linux_flake.sh` — a parallel-worker
harness that replicates the failing test as a shell loop, optionally
adds page-cache pressure (DD_LOAD=1), and classifies failures into
"retry caught it" / "retry didn't catch it" / "different bug" buckets
to confirm root cause on a Linux VM.

Verified on macOS: `cargo build --locked --bin qsv -F all_features` and
`cargo clippy --locked --bin qsv -F all_features -- -D warnings` clean;
`cargo t -F all_features --test tests test_moarstats::` 80/80 pass at
--test-threads=8; failing test in a 30x stress loop all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented May 9, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds additional durability/visibility guards around moarstats’ joined-CSV → qsv stats subprocess flow to mitigate a rare Linux CI flake, and introduces a Linux stress/repro harness.

Changes:

  • Add directory fsync helper and use it after join output creation.
  • Validate qsv stats coverage vs joined CSV headers and retry once on mismatch.
  • Add a parallel worker shell harness to reproduce/classify the flake on Linux.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
tools/repro_moarstats_linux_flake.sh Adds a parallel stress harness to reproduce and classify the Linux-only flake.
src/util.rs Introduces sync_directory to fsync parent directories (Unix) for durability/visibility.
src/cmd/moarstats.rs Adds parent-dir fsync after join and a retry-on-stats-coverage-mismatch guard for the qsv stats subprocess.

Comment thread src/util.rs Outdated
Comment thread src/cmd/moarstats.rs Outdated
Comment thread tools/repro_moarstats_linux_flake.sh Outdated
Comment thread tools/repro_moarstats_linux_flake.sh Outdated
Comment thread tools/repro_moarstats_linux_flake.sh
Comment thread src/util.rs Outdated
…ipt hardening

All six Copilot comments on PR #3838 applied:

src/util.rs
- sync_directory: errors no longer propagated. Directory fsync is not
  uniformly supported across Linux filesystems (FAT, some FUSE/overlay/
  network mounts return EINVAL/EOPNOTSUPP); a defensive guard layered on
  top of file-level fsync must NOT fail the caller. Errors are now logged
  at debug level and swallowed. Signature changed from CliResult<()> to
  () to make the contract explicit.
- New unit test test_sync_directory_smoke covering both the existing-dir
  and missing-dir paths (Unix-only).

src/cmd/moarstats.rs
- Call site updated for the new sync_directory signature.
- Stats-output reader now propagates csv parse errors instead of
  silently dropping rows via filter_map(Result::ok). A malformed/
  truncated stats row now surfaces as "Failed to parse joined-stats
  row N" rather than as a misleading downstream "missing columns"
  assertion — which is exactly the diagnostic mode this fix is trying
  to eliminate.

tools/repro_moarstats_linux_flake.sh
- FIFO is now created inside a real mktemp -d directory, replacing the
  inherently racy mktemp -u.
- cleanup trap now removes LOAD_DIR (when DD_LOAD=1) and FIFO_DIR so
  temp resources are not leaked on interrupts/early exits.
- bivariate awk check now distinguishes "header malformed" (exit 2 ->
  FAIL_OTHER) from "coverage failed" (exit 1 -> FAIL_OLD_BIVARIATE), so
  a truncated bivariate header is not silently misclassified as a
  column-drop bug.

Verified on macOS: cargo build --locked clean, cargo clippy -- -D warnings
clean, all 80 test_moarstats tests pass at --test-threads=8, the new
test_sync_directory_smoke unit test passes, smoke-run of repro script
green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 27eb472 into master May 9, 2026
17 checks passed
@jqnatividad jqnatividad deleted the fix/moarstats-linux-stats-validation branch May 9, 2026 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants