fix(moarstats): harden bivariate header read against joined-CSV page-cache race by jqnatividad · Pull Request #3873 · dathere/qsv

jqnatividad · 2026-05-20T03:25:04Z

Summary

Fixes a flaky CI failure in test_moarstats::moarstats_join_type_left_runs_and_writes_bivariate (example run):

left join must merge primary+secondary numeric columns into bivariate output:
bivariate output is missing expected columns ["value2"] (saw {"id", "value1"});
likely silent join corruption

Root cause

The moarstats --join-inputs --bivariate pipeline shells out to qsv join and qsv stats subprocesses that write temp CSVs, then re-reads them. Under heavy parallel CI load a follow-up read intermittently sees a short/stale view of the joined temp file, dropping a secondary column.

join_datasets_internal and the joined-stats block already guard their reads with sync_subprocess_output + header-coverage validation + a retry. The bivariate header read (csv_headers) was the one link in that chain with no guard — and when a stats-CSV column was missing from that header, the field-pair loop silently continued past it, producing "primary-only" join-corrupt bivariate output.
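The failure mode described above can be illustrated with a small sketch. It is not the qsv code: `pair_indices_silent` and `pair_indices_strict` are hypothetical helpers that stand in for the field-pair loop, and headers are plain string slices rather than CSV records. The first skips any stats field that is absent from the header, which is how a short header read turns into "primary-only" output; the second reports the missing columns instead.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the field-pair loop: pairs up the header
/// positions of every stats field, skipping fields the header lacks.
fn pair_indices_silent(header: &[&str], stats_fields: &[&str]) -> Vec<(usize, usize)> {
    let pos: HashMap<&str, usize> =
        header.iter().enumerate().map(|(i, h)| (*h, i)).collect();
    let mut pairs = Vec::new();
    for (a_idx, a) in stats_fields.iter().enumerate() {
        // A column missing from the header is dropped without any signal.
        let Some(&ia) = pos.get(a) else { continue };
        for b in &stats_fields[a_idx + 1..] {
            let Some(&ib) = pos.get(b) else { continue };
            pairs.push((ia, ib));
        }
    }
    pairs
}

/// Same pairing, but a stats field missing from the header is an error.
fn pair_indices_strict(
    header: &[&str],
    stats_fields: &[&str],
) -> Result<Vec<(usize, usize)>, String> {
    let missing: Vec<&str> = stats_fields
        .iter()
        .copied()
        .filter(|f| !header.contains(f))
        .collect();
    if !missing.is_empty() {
        return Err(format!("header is missing expected columns {missing:?}"));
    }
    Ok(pair_indices_silent(header, stats_fields))
}
```

With a header truncated to `id,value1` and stats fields `id,value1,value2`, the silent variant still returns one pair, so nothing downstream notices that `value2` is gone; the strict variant returns an error naming it.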

Change

src/cmd/moarstats.rs — for the joined-input path:

fsync the joined CSV before reading its header;
validate the header covers every column the stats subprocess produced records for;
on a mismatch, log a warning, re-sync, and re-read the header once;
if a column is still missing, fail loud with a clear "silent join corruption" diagnostic instead of silently emitting primary-only output.

The non-joined path is unchanged (no new failure mode for users with a pre-existing stats cache).
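
The four steps listed under Change can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/cmd/moarstats.rs: `read_header` and `read_validated_header` are hypothetical names, the header is split naively on commas (no quoting support, unlike a real CSV reader), and the sync step simply reopens the file and calls `sync_all`, which may differ from what `sync_subprocess_output` does.

```rust
use std::collections::HashSet;
use std::fs::{self, OpenOptions};
use std::io;
use std::path::Path;

/// Naive header read: first line, split on commas (assumes no quoted fields).
fn read_header(path: &Path) -> io::Result<Vec<String>> {
    let content = fs::read_to_string(path)?;
    let first = content.lines().next().unwrap_or("");
    Ok(first.split(',').map(|s| s.trim().to_string()).collect())
}

/// Sync, read the header, and check it covers every stats field.
/// Retries up to `retries` extra times, then fails with a diagnostic.
fn read_validated_header(
    path: &Path,
    stats_fields: &[&str],
    retries: usize,
) -> io::Result<Vec<String>> {
    let mut attempt = 0;
    loop {
        // Ask the OS to flush the file before reading it back.
        OpenOptions::new().write(true).open(path)?.sync_all()?;
        let header = read_header(path)?;
        let seen: HashSet<&str> = header.iter().map(String::as_str).collect();
        let missing: Vec<String> = stats_fields
            .iter()
            .filter(|f| !seen.contains(**f))
            .map(|f| f.to_string())
            .collect();
        if missing.is_empty() {
            return Ok(header);
        }
        if attempt >= retries {
            // Fail loud rather than continue with a short header.
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("joined CSV header is missing columns {missing:?}"),
            ));
        }
        attempt += 1;
        eprintln!("warning: header missing {missing:?}; re-syncing (retry {attempt})");
    }
}
```

Calling `read_validated_header(path, &["id", "value1", "value2"], 1)` mirrors the "re-read the header once" behaviour: one initial read plus one retry before the error is returned.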

Test plan

cargo test -F all_features test_moarstats::moarstats_join — 6/6 pass, including the previously-failing test.
cargo clippy -F all_features --tests — no new warnings in the edited region.
cargo check --features=lite — compiles clean (the feature set of the failing CI job).
cargo +nightly fmt — applied via the repo's auto-format hook.

🤖 Generated with Claude Code

…cache race

The `moarstats --join-inputs --bivariate` pipeline shells out to `qsv join` and `qsv stats` subprocesses that write temp CSVs, then re-reads them. Under heavy parallel CI load a follow-up read can intermittently see a short/stale view of the joined temp file, dropping a secondary column.

`join_datasets_internal` and the joined-stats block already guard their reads with `sync_subprocess_output` + coverage validation + retry, but the bivariate header read (`csv_headers`) had no such guard. When a stats-CSV column was missing from that header, the field-pair loop silently `continue`d past it, producing "primary-only" join-corrupt bivariate output.

Harden the joined-input path: fsync the joined CSV, validate its header covers every column the stats subprocess produced, and retry once after a re-sync. If a column is still missing, fail loud with a clear diagnostic instead of silently emitting primary-only output. The non-joined path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-05-20T03:25:43Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


The first fix hardened the bivariate header read, but CI showed the test still failing with value2 missing, and the new fail-loud check never fired, proving the joined CSV header read was fine. The truncation was upstream.

Two real gaps in the joined --bivariate path:

1. The joined-stats coverage check validated `qsv stats` output against a FRESH re-read of the joined CSV header. Under load that re-read can come back short, weakening the check to a trivial pass.
2. After the joined-stats block validated the stats CSV, `stats_csv_content` re-read the same temp file with an independent, unvalidated `fs::read_to_string`, which could observe a short file.

Fixes:

- `join_datasets_internal` now returns its already-validated joined header. The coverage check validates `qsv stats` output against that trusted header instead of a fragile re-read.
- `run_stats_subprocess` reads the stats CSV content once and derives the field-coverage set from that same string, so the validated set and the content fed downstream are the same snapshot. The validated content is carried forward, with no second unvalidated read.
- Retry bumped from 1 to up to 3 attempts before failing loud.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
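
The "same snapshot" idea in the second fix can be sketched as below. The names `StatsSnapshot`, `snapshot_from_content`, and `uncovered` are hypothetical and not taken from the PR. The sketch assumes the stats CSV has a `field` column with one row per input column, and it splits rows naively on commas. The point is only that the field set and the content handed downstream come from one string, and that coverage is checked against a header the caller already trusts.

```rust
use std::collections::HashSet;

/// Stats CSV content plus the field names derived from that same string.
struct StatsSnapshot {
    content: String,
    fields: HashSet<String>,
}

/// Derives the field set from `content` itself, so no second file read
/// is needed. Returns None if there is no header or no `field` column.
fn snapshot_from_content(content: String) -> Option<StatsSnapshot> {
    let mut lines = content.lines();
    let header = lines.next()?;
    let field_idx = header.split(',').position(|h| h.trim() == "field")?;
    let fields: HashSet<String> = lines
        .filter_map(|row| row.split(',').nth(field_idx))
        .map(|f| f.trim().to_string())
        .filter(|f| !f.is_empty())
        .collect();
    Some(StatsSnapshot { content, fields })
}

impl StatsSnapshot {
    /// Columns of the trusted joined header that have no stats row.
    fn uncovered(&self, trusted_header: &[&str]) -> Vec<String> {
        trusted_header
            .iter()
            .filter(|h| !self.fields.contains(**h))
            .map(|h| h.to_string())
            .collect()
    }
}
```

Checked against the full header `id,value1,value2`, a stats snapshot that only has rows for `id` and `value1` reports `value2` as uncovered. Checked against a header that itself came back short as `id,value1`, the same snapshot reports nothing, which is the trivial pass the first gap describes.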

Copilot

Pull request overview

Hardens moarstats --join-inputs --bivariate against intermittent short/stale reads of joined/stats temp CSVs under heavy parallel load, turning previously-silent join corruption into validated retries and (if still inconsistent) a clear failure.

Changes:

join_datasets_internal now returns the validated final joined CSV header to avoid downstream re-reads of the joined temp file for coverage checks.
Joined-input stats generation now reads and reuses a single coverage-validated stats CSV snapshot (string) instead of performing later unvalidated re-reads.
Bivariate header read for joined inputs now fsyncs + validates header coverage against stats fields, retries once, and fails loudly if columns are still missing.

Two Copilot review findings on PR #3873:

- The `stats_field_names` comment said "numeric/date/string field names" but the code collects every `field` value from the stats CSV (one row per column, any type). Corrected the comment.
- `missing_cols` did an O(n*m) membership scan (`hdrs.iter().any(...)` per stats field). Build a `HashSet` of header names once for O(1) lookups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
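
A minimal sketch of the second finding, assuming headers and stats field names are available as `String` slices. The function name `missing_cols` echoes the variable mentioned above, but the signature is an assumption made for illustration. The header set is built once, so each stats field costs one hash lookup instead of a scan over all header names.

```rust
use std::collections::HashSet;

/// Returns the stats fields that do not appear in the header.
/// Building the set is O(n); each lookup is O(1) on average.
fn missing_cols<'a>(hdrs: &[String], stats_fields: &'a [String]) -> Vec<&'a str> {
    let header_set: HashSet<&str> = hdrs.iter().map(String::as_str).collect();
    stats_fields
        .iter()
        .map(String::as_str)
        .filter(|f| !header_set.contains(*f))
        .collect()
}
```

The result keeps the order of `stats_fields`, which keeps a diagnostic listing the missing columns stable from run to run.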

jqnatividad requested a review from Copilot May 20, 2026 03:53

Copilot started reviewing on behalf of jqnatividad May 20, 2026 03:54

Copilot AI reviewed May 20, 2026

jqnatividad merged commit 70d8732 into master May 20, 2026
18 checks passed

jqnatividad deleted the fix/moarstats-bivariate-header-race branch May 20, 2026 04:32

jqnatividad merged 3 commits into master from fix/moarstats-bivariate-header-race
