fix(moarstats): instrument bivariate field_pairs to catch the recurring primary-only flake#3892
Merged
jqnatividad merged 5 commits intoMay 23, 2026
Merged
Conversation
…ng primary-only flake
The recurring CI flake on moarstats join_type_{left,right,full}
bivariate tests (most recently #3891 build 26307747761) keeps producing
a "primary-only" bivariate output — only pairs whose columns are both
from the primary input survive, every pair touching a secondary column
silently vanishes. Five recent commits have shipped guards against
this; the failure persists.
The diagnostic gap is in src/cmd/moarstats.rs:4461-4566 (field_pairs
construction): six silent `continue` branches and zero logging. When
the flake fires, we have no signal as to which filter dropped the
pairs.
This change:
- Adds log::warn! at every `continue` in the field_pairs loop,
including the surviving field name, type string, cardinalities,
stddev/variance and record_count. Each line is sufficient on its
own to identify the dropped pair and the reason.
- Logs an info-level summary after the loop (built/skipped counts per
reason) whenever any pair was dropped. Surfaces in CI logs without
needing RUST_LOG.
- In joined-inputs mode, adds a hard-fail guard right after
field_pairs construction: if no surviving pair covers a column from
one of the `--join-inputs` files, return a fail_clierror! with all
the relevant state (stats_field_names, csv_headers, record_count)
rather than silently writing a misleading bivariate CSV. This is
the precise failure mode of the flake — moving the detection into
the binary makes future occurrences point at moarstats instead of
surfacing as a confusing assertion in the test helper.
- Replaces the HashSet-based stats-coverage check (lines 3503-3546)
with a multiset (HashMap<name, count>) comparison. qsv `join` keeps
the join key on both sides, so the joined CSV header has duplicate
column names. The set-based check silently accepts a 3-record stats
output for a 4-column joined CSV with a duplicate header — exactly
the kind of off-by-one corruption that downstream silently drops
pairs over. The multiset comparison catches it.
- Derives Debug on FieldType so the new logging can print it.
A healthy run is unchanged: 80 moarstats tests pass; the local
left-join repro produces the same 5-pair bivariate output and no
warnings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
…tion Codex review (job 2365, MEDIUM) flagged that the new joined-inputs guard could spuriously fail valid `--join-inputs --bivariate` runs: it failed whenever ANY secondary header was absent from field_pairs, but field_pairs intentionally excludes join keys (duplicate names alias to the primary's csv_headers position) and columns that legitimately filter via variance/cardinality/type. Rewrite the guard to fire only on the actual corruption signature: at least one secondary column should have produced a pair, and none did. - Read the primary's header and compute its csv_headers positions, so the guard can identify which indices are exclusively from secondary inputs. - For each secondary input, build a list of eligible secondary-only columns: position is not in primary_positions, the name's first-match in csv_headers is THIS position (so aliased duplicate join keys are skipped — they collapse to primary's index anyway), and the stats record has a type that passes the bivariate type filter. - Fire only when eligible_secondary_only is non-empty AND none of its positions appears in any field_pairs key. This is the exact condition under which a healthy run would have contributed secondary-side pairs. Residual false-positive surface: a secondary input whose every type-eligible column happens to also be filtered by variance, cardinality, or all-constant. We accept that narrow edge case in exchange for catching the recurring flake the moment it fires instead of after a confusing test assertion downstream. All 80 moarstats tests pass; the local left-join repro still produces a 5-pair bivariate output with no warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…level filters Codex review (job 2369, MEDIUM) flagged that the previous guard still treated every type-eligible secondary column as expected to appear in field_pairs, but the construction loop also filters pairs for zero variance, both-constant, and `cardinality == rowcount`. A valid joined run whose secondary columns are all-unique or zero-variance would error spuriously. Rebuild the guard's pairability check to mirror the loop's column-level filters (the ones that disqualify a column from ALL pairs, not pair-specific ones): - type recognized AND passes the bivariate type filter (numeric / date / string) - stddev/variance is non-zero (when reported) - cardinality != record_count (when both known) The both-constant (cardinality == 1) filter is intentionally omitted: it's genuinely pair-level (only fires when BOTH columns have card==1), so a card==1 column may still legitimately pair with a higher-cardinality partner. Also gate the guard on primary having at least one column that passes the same pairability check — if the primary side can't pair either, an empty/sparse output is a property of the data, not corruption. All 80 moarstats tests pass; local left-join repro still produces a 5-pair output with no warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review (job 2370, MEDIUM) flagged that the previous column-level `column_pairable` was still stricter than the loop's actual filters: the loop's zero-stddev/variance check only fires when BOTH sides have stddev (or both have variance). A zero-stddev numeric column paired with a TString partner (no stddev/variance) is NOT skipped — so treating zero stddev as an unconditional column-level disqualifier would mask corruption in mixed string/numeric joined datasets. Replace the column-level check with a partner-aware one that mirrors the construction loop's exact pair-level rules: - Extract a small ColInfo struct + read_col_info helper to read each column's (type, stddev, variance, cardinality) from its stats record. - A new pair_compatible(a, b) helper exactly mirrors the loop's pair filters: two-branch zero stddev/variance check, both- constant (BOTH card==1), either card==record_count, and the pair type filter. - column_pairable(name) now returns true iff at least one OTHER column in stats_field_names forms a compatible pair with it. This correctly recognizes the "zero-stddev numeric pairs with a string column" case the prior column-level check missed. All 80 moarstats tests pass; local left-join repro still produces a 5-pair output with no warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds diagnostics and stronger validation around moarstats bivariate field_pairs construction—primarily to make the recurring joined-inputs “primary-only bivariate output” CI flake fail loudly and/or leave enough evidence to pinpoint the drop reason.
Changes:
- Adds per-branch skip counters + per-skip logging during
field_pairsconstruction, plus a summary line of built/skipped counts. - Adds a joined-inputs-mode guard that hard-fails when no surviving pair touches any “pairable” secondary-side column.
- Fixes joined-stats coverage validation by switching from set membership to a multiset (counts) check to detect missing duplicate column names in joined CSVs.
…d alias check Four Copilot findings on PR #3892: - :4502 — log::warn!/log::info! are invisible by default (qsv's default log level is `off` unless QSV_LOG_LEVEL is set), so the instrumentation trail never reaches CI logs. Switched all per-skip warnings and the post-loop summary to wwarn!/winfo!, which write to stderr unconditionally. - :4512 — per-skip warnings embedded the full csv_headers vec for every miss; on wide CSVs with many missing fields this floods logs. Per-skip messages now carry only the field name + reason + (where relevant) `len=N`. The full csv_headers is logged ONCE in the post-loop summary winfo!. - :4825 — the joined-inputs guard did repeated stats_field_names.iter().position() scans inside nested loops via read_col_info/column_pairable, plus two CSV header reads from disk. Precomputed once: primary_headers (read input_path once), csv_name_to_pos (HashMap), col_infos (Vec<Option<ColInfo>>), and stats_name_to_idx (HashMap). The guard is now O(n^2) instead of superlinear in column count. - :4852 — the redundant `if csv_headers.iter().position(|jh| jh == h) != Some(idx)` check could never fire: idx came from the same first-match lookup, so the second call returned Some(idx) by construction. Removed it. The join-key alias case it tried to handle is already covered by `primary_positions.contains(&idx)`. All 80 moarstats tests pass; the local left-join repro still produces 5 pairs with no warnings emitted on the healthy path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 tasks
jqnatividad
added a commit
that referenced
this pull request
May 23, 2026
…warn! (#3894) Codex review (job 2383, MEDIUM) flagged that PR #3892's instrumentation turned every per-pair skip into an unconditional wwarn! stderr write. Zero-variance, both-constant, cardinality==rowcount, and type-filter skips are routine for many datasets, and the field_pairs loop visits O(n²) pairs — so a healthy moarstats --bivariate run on a wide CSV would now flood stderr. Switch all nine per-pair skip warnings from wwarn! back to log::warn!: - field1_bad_type, field1_missing_in_csv - field2_bad_type, field2_missing_in_csv - zero_stddev, zero_variance (the two-branch case) - both_constant, card_eq_rowcount, type_filter These are now gated behind QSV_LOG_LEVEL — set QSV_LOG_LEVEL=warn to surface them in the qsv log file when actually debugging a flake. The post-loop summary stays as winfo!: it's a single line that already carries the aggregate per-reason skip counts and the full csv_headers — enough signal to spot the corruption mode without per-pair noise. The hard-fail guard for joined-input corruption is unaffected. Healthy left-join repro: stderr is now clean of per-pair 'skipping' lines, output still produces 5 pairs. All 80 moarstats tests pass. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
log::warn!at every silentcontinuein the bivariate field_pairs construction loop (src/cmd/moarstats.rs:4461-4566) so the recurring CI flake leaves an actionable trail instead of producing a confusing test assertion.--join-inputsfiles, fail loud with all the relevant state (stats_field_names,csv_headers,record_count, cardinalities) rather than silently writing a misleading bivariate CSV.HashSet-based stats-coverage check with a multiset (HashMap<name, count>) comparison.qsv joinkeeps the join key on both sides, so a joined CSV has duplicate column names — the set-based check silently accepts a 3-record stats output for a 4-column joined CSV.Why
The recurring
moarstats_join_type_{left,right,full}bivariate flake (most recently https://github.com/dathere/qsv/actions/runs/26307747761) keeps producing "primary-only" bivariate output — only pairs whose columns are both from the primary input survive. Five recent commits have shipped retry/fsync guards against this; the failure persists. The diagnostic gap is that the field_pairs construction loop has six silentcontinuebranches with zero logging, so we have no signal as to which filter dropped the pairs when CI flakes.This change does not attempt yet another fix — it instruments the suspected region so the next flake captures enough data to identify the root cause definitively.
Test plan
cargo test -F all_features test_moarstats::— 80/80 passingcargo clippy --bin qsv -F all_features— no new warnings (one pre-existing collapsible_if warning unrelated to this PR)🤖 Generated with Claude Code