Skip to content

fix(moarstats): harden bivariate header read against joined-CSV page-cache race#3873

Merged
jqnatividad merged 3 commits into
masterfrom
fix/moarstats-bivariate-header-race
May 20, 2026
Merged

fix(moarstats): harden bivariate header read against joined-CSV page-cache race#3873
jqnatividad merged 3 commits into
masterfrom
fix/moarstats-bivariate-header-race

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

Fixes a flaky CI failure in test_moarstats::moarstats_join_type_left_runs_and_writes_bivariate (example run):

left join must merge primary+secondary numeric columns into bivariate output:
bivariate output is missing expected columns ["value2"] (saw {"id", "value1"});
likely silent join corruption

Root cause

The moarstats --join-inputs --bivariate pipeline shells out to qsv join and qsv stats subprocesses that write temp CSVs, then re-reads them. Under heavy parallel CI load a follow-up read intermittently sees a short/stale view of the joined temp file, dropping a secondary column.

join_datasets_internal and the joined-stats block already guard their reads with sync_subprocess_output + header-coverage validation + a retry. The bivariate header read (csv_headers) was the one link in that chain with no guard — and when a stats-CSV column was missing from that header, the field-pair loop silently continued past it, producing "primary-only" join-corrupt bivariate output.

Change

src/cmd/moarstats.rs — for the joined-input path:

  • fsync the joined CSV before reading its header;
  • validate the header covers every column the stats subprocess produced records for;
  • on a mismatch, log a warning, re-sync, and re-read the header once;
  • if a column is still missing, fail loud with a clear "silent join corruption" diagnostic instead of silently emitting primary-only output.

The non-joined path is unchanged (no new failure mode for users with a pre-existing stats cache).

Test plan

  • cargo test -F all_features test_moarstats::moarstats_join — 6/6 pass, including the previously-failing test.
  • cargo clippy -F all_features --tests — no new warnings in the edited region.
  • cargo check --features=lite — compiles clean (the feature set of the failing CI job).
  • cargo +nightly fmt — applied via the repo's auto-format hook.

🤖 Generated with Claude Code

…cache race

The `moarstats --join-inputs --bivariate` pipeline shells out to `qsv join`
and `qsv stats` subprocesses that write temp CSVs, then re-reads them. Under
heavy parallel CI load a follow-up read can intermittently see a short/stale
view of the joined temp file, dropping a secondary column.

`join_datasets_internal` and the joined-stats block already guard their reads
with `sync_subprocess_output` + coverage validation + retry, but the bivariate
header read (`csv_headers`) had no such guard. When a stats-CSV column was
missing from that header, the field-pair loop silently `continue`d past it,
producing "primary-only" join-corrupt bivariate output.

Harden the joined-input path: fsync the joined CSV, validate its header covers
every column the stats subprocess produced, and retry once after a re-sync. If
a column is still missing, fail loud with a clear diagnostic instead of
silently emitting primary-only output. The non-joined path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented May 20, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

The first fix hardened the bivariate header read, but CI showed the test
still failing with value2 missing — and the new fail-loud check never fired,
proving the joined CSV header read was fine. The truncation was upstream.

Two real gaps in the joined --bivariate path:

1. The joined-stats coverage check validated `qsv stats` output against a
   FRESH re-read of the joined CSV header. Under load that re-read can come
   back short, weakening the check to a trivial pass.
2. After the joined-stats block validated the stats CSV, `stats_csv_content`
   re-read the same temp file with an independent, unvalidated
   `fs::read_to_string` — which could observe a short file.

Fixes:

- `join_datasets_internal` now returns its already-validated joined header.
  The coverage check validates `qsv stats` output against that trusted
  header instead of a fragile re-read.
- `run_stats_subprocess` reads the stats CSV content once and derives the
  field-coverage set from that same string, so the validated set and the
  content fed downstream are the same snapshot. The validated content is
  carried forward — no second unvalidated read.
- Retry bumped from 1 to up to 3 attempts before failing loud.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Hardens moarstats --join-inputs --bivariate against intermittent short/stale reads of joined/stats temp CSVs under heavy parallel load, turning previously-silent join corruption into validated retries and (if still inconsistent) a clear failure.

Changes:

  • join_datasets_internal now returns the validated final joined CSV header to avoid downstream re-reads of the joined temp file for coverage checks.
  • Joined-input stats generation now reads and reuses a single coverage-validated stats CSV snapshot (string) instead of performing later unvalidated re-reads.
  • Bivariate header read for joined inputs now fsyncs + validates header coverage against stats fields, retries once, and fails loudly if columns are still missing.

Comment thread src/cmd/moarstats.rs Outdated
Comment thread src/cmd/moarstats.rs Outdated
Two Copilot review findings on PR #3873:
- The `stats_field_names` comment said "numeric/date/string field names" but
  the code collects every `field` value from the stats CSV (one row per
  column, any type). Corrected the comment.
- `missing_cols` did an O(n*m) membership scan (`hdrs.iter().any(...)` per
  stats field). Build a `HashSet` of header names once for O(1) lookups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 70d8732 into master May 20, 2026
18 checks passed
@jqnatividad jqnatividad deleted the fix/moarstats-bivariate-header-race branch May 20, 2026 04:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants