Skip to content

fix(moarstats): instrument bivariate field_pairs to catch the recurring primary-only flake#3892

Merged
jqnatividad merged 5 commits into
masterfrom
fix/moarstats-bivariate-flake-instrumentation
May 23, 2026
Merged

fix(moarstats): instrument bivariate field_pairs to catch the recurring primary-only flake#3892
jqnatividad merged 5 commits into
masterfrom
fix/moarstats-bivariate-flake-instrumentation

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

  • Adds log::warn! at every silent continue in the bivariate field_pairs construction loop (src/cmd/moarstats.rs:4461-4566) so the recurring CI flake leaves an actionable trail instead of producing a confusing test assertion.
  • Adds a hard-fail guard in joined-inputs mode: if no constructed pair covers a column from one of the --join-inputs files, fail loud with all the relevant state (stats_field_names, csv_headers, record_count, cardinalities) rather than silently writing a misleading bivariate CSV.
  • Replaces the HashSet-based stats-coverage check with a multiset (HashMap<name, count>) comparison. qsv join keeps the join key on both sides, so a joined CSV has duplicate column names — the set-based check silently accepts a 3-record stats output for a 4-column joined CSV.

Why

The recurring moarstats_join_type_{left,right,full} bivariate flake (most recently https://github.com/dathere/qsv/actions/runs/26307747761) keeps producing "primary-only" bivariate output — only pairs whose columns are both from the primary input survive. Five recent commits have shipped retry/fsync guards against this; the failure persists. The diagnostic gap is that the field_pairs construction loop has six silent continue branches with zero logging, so we have no signal as to which filter dropped the pairs when CI flakes.

This change does not attempt yet another fix — it instruments the suspected region so the next flake captures enough data to identify the root cause definitively.

Test plan

  • cargo test -F all_features test_moarstats:: — 80/80 passing
  • cargo clippy --bin qsv -F all_features — no new warnings (one pre-existing collapsible_if warning unrelated to this PR)
  • Local repro of left-join scenario: produces the expected 5-pair bivariate output with zero warning lines (healthy run is silent)
  • Next CI run that flakes will produce an actionable summary line + per-pair warnings

🤖 Generated with Claude Code

…ng primary-only flake

The recurring CI flake on moarstats join_type_{left,right,full}
bivariate tests (most recently #3891 build 26307747761) keeps producing
a "primary-only" bivariate output — only pairs whose columns are both
from the primary input survive, every pair touching a secondary column
silently vanishes. Five recent commits have shipped guards against
this; the failure persists.

The diagnostic gap is in src/cmd/moarstats.rs:4461-4566 (field_pairs
construction): six silent `continue` branches and zero logging. When
the flake fires, we have no signal as to which filter dropped the
pairs.

This change:

- Adds log::warn! at every `continue` in the field_pairs loop,
  including the surviving field name, type string, cardinalities,
  stddev/variance and record_count. Each line is sufficient on its
  own to identify the dropped pair and the reason.
- Logs an info-level summary after the loop (built/skipped counts per
  reason) whenever any pair was dropped. Surfaces in CI logs without
  needing RUST_LOG.
- In joined-inputs mode, adds a hard-fail guard right after
  field_pairs construction: if no surviving pair covers a column from
  one of the `--join-inputs` files, return a fail_clierror! with all
  the relevant state (stats_field_names, csv_headers, record_count)
  rather than silently writing a misleading bivariate CSV. This is
  the precise failure mode of the flake — moving the detection into
  the binary makes future occurrences point at moarstats instead of
  surfacing as a confusing assertion in the test helper.
- Replaces the HashSet-based stats-coverage check (lines 3503-3546)
  with a multiset (HashMap<name, count>) comparison. qsv `join` keeps
  the join key on both sides, so the joined CSV header has duplicate
  column names. The set-based check silently accepts a 3-record stats
  output for a 4-column joined CSV with a duplicate header — exactly
  the kind of off-by-one corruption that downstream silently drops
  pairs over. The multiset comparison catches it.
- Derives Debug on FieldType so the new logging can print it.

A healthy run is unchanged: 80 moarstats tests pass; the local
left-join repro produces the same 5-pair bivariate output and no
warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented May 23, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

jqnatividad and others added 3 commits May 23, 2026 07:29
…tion

Codex review (job 2365, MEDIUM) flagged that the new
joined-inputs guard could spuriously fail valid `--join-inputs
--bivariate` runs: it failed whenever ANY secondary header was
absent from field_pairs, but field_pairs intentionally excludes
join keys (duplicate names alias to the primary's csv_headers
position) and columns that legitimately filter via
variance/cardinality/type.

Rewrite the guard to fire only on the actual corruption
signature: at least one secondary column should have produced a
pair, and none did.

- Read the primary's header and compute its csv_headers
  positions, so the guard can identify which indices are
  exclusively from secondary inputs.
- For each secondary input, build a list of eligible
  secondary-only columns: position is not in primary_positions,
  the name's first-match in csv_headers is THIS position (so
  aliased duplicate join keys are skipped — they collapse to
  primary's index anyway), and the stats record has a type that
  passes the bivariate type filter.
- Fire only when eligible_secondary_only is non-empty AND none
  of its positions appears in any field_pairs key. This is the
  exact condition under which a healthy run would have
  contributed secondary-side pairs.

Residual false-positive surface: a secondary input whose every
type-eligible column happens to also be filtered by variance,
cardinality, or all-constant. We accept that narrow edge case in
exchange for catching the recurring flake the moment it fires
instead of after a confusing test assertion downstream.

All 80 moarstats tests pass; the local left-join repro still
produces a 5-pair bivariate output with no warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…level filters

Codex review (job 2369, MEDIUM) flagged that the previous guard
still treated every type-eligible secondary column as expected to
appear in field_pairs, but the construction loop also filters
pairs for zero variance, both-constant, and
`cardinality == rowcount`. A valid joined run whose secondary
columns are all-unique or zero-variance would error spuriously.

Rebuild the guard's pairability check to mirror the loop's
column-level filters (the ones that disqualify a column from ALL
pairs, not pair-specific ones):

- type recognized AND passes the bivariate type filter
  (numeric / date / string)
- stddev/variance is non-zero (when reported)
- cardinality != record_count (when both known)

The both-constant (cardinality == 1) filter is intentionally
omitted: it's genuinely pair-level (only fires when BOTH columns
have card==1), so a card==1 column may still legitimately pair
with a higher-cardinality partner.

Also gate the guard on primary having at least one column that
passes the same pairability check — if the primary side can't
pair either, an empty/sparse output is a property of the data,
not corruption.

All 80 moarstats tests pass; local left-join repro still produces
a 5-pair output with no warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review (job 2370, MEDIUM) flagged that the previous
column-level `column_pairable` was still stricter than the loop's
actual filters: the loop's zero-stddev/variance check only fires
when BOTH sides have stddev (or both have variance). A
zero-stddev numeric column paired with a TString partner (no
stddev/variance) is NOT skipped — so treating zero stddev as an
unconditional column-level disqualifier would mask corruption in
mixed string/numeric joined datasets.

Replace the column-level check with a partner-aware one that
mirrors the construction loop's exact pair-level rules:

- Extract a small ColInfo struct + read_col_info helper to read
  each column's (type, stddev, variance, cardinality) from its
  stats record.
- A new pair_compatible(a, b) helper exactly mirrors the loop's
  pair filters: two-branch zero stddev/variance check, both-
  constant (BOTH card==1), either card==record_count, and the
  pair type filter.
- column_pairable(name) now returns true iff at least one OTHER
  column in stats_field_names forms a compatible pair with it.
  This correctly recognizes the "zero-stddev numeric pairs with a
  string column" case the prior column-level check missed.

All 80 moarstats tests pass; local left-join repro still produces
a 5-pair output with no warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds diagnostics and stronger validation around moarstats bivariate field_pairs construction—primarily to make the recurring joined-inputs “primary-only bivariate output” CI flake fail loudly and/or leave enough evidence to pinpoint the drop reason.

Changes:

  • Adds per-branch skip counters + per-skip logging during field_pairs construction, plus a summary line of built/skipped counts.
  • Adds a joined-inputs-mode guard that hard-fails when no surviving pair touches any “pairable” secondary-side column.
  • Fixes joined-stats coverage validation by switching from set membership to a multiset (counts) check to detect missing duplicate column names in joined CSVs.

Comment thread src/cmd/moarstats.rs
Comment thread src/cmd/moarstats.rs
Comment thread src/cmd/moarstats.rs Outdated
Comment thread src/cmd/moarstats.rs
…d alias check

Four Copilot findings on PR #3892:

- :4502 — log::warn!/log::info! are invisible by default (qsv's
  default log level is `off` unless QSV_LOG_LEVEL is set), so the
  instrumentation trail never reaches CI logs. Switched all
  per-skip warnings and the post-loop summary to wwarn!/winfo!,
  which write to stderr unconditionally.
- :4512 — per-skip warnings embedded the full csv_headers vec
  for every miss; on wide CSVs with many missing fields this
  floods logs. Per-skip messages now carry only the field name +
  reason + (where relevant) `len=N`. The full csv_headers is
  logged ONCE in the post-loop summary winfo!.
- :4825 — the joined-inputs guard did repeated
  stats_field_names.iter().position() scans inside nested loops
  via read_col_info/column_pairable, plus two CSV header reads
  from disk. Precomputed once: primary_headers (read input_path
  once), csv_name_to_pos (HashMap), col_infos
  (Vec<Option<ColInfo>>), and stats_name_to_idx (HashMap). The
  guard is now O(n^2) instead of superlinear in column count.
- :4852 — the redundant `if
  csv_headers.iter().position(|jh| jh == h) != Some(idx)` check
  could never fire: idx came from the same first-match lookup,
  so the second call returned Some(idx) by construction. Removed
  it. The join-key alias case it tried to handle is already
  covered by `primary_positions.contains(&idx)`.

All 80 moarstats tests pass; the local left-join repro still
produces 5 pairs with no warnings emitted on the healthy path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 32a58d1 into master May 23, 2026
18 checks passed
@jqnatividad jqnatividad deleted the fix/moarstats-bivariate-flake-instrumentation branch May 23, 2026 14:08
jqnatividad added a commit that referenced this pull request May 23, 2026
…warn! (#3894)

Codex review (job 2383, MEDIUM) flagged that PR #3892's
instrumentation turned every per-pair skip into an unconditional
wwarn! stderr write. Zero-variance, both-constant,
cardinality==rowcount, and type-filter skips are routine for many
datasets, and the field_pairs loop visits O(n²) pairs — so a
healthy moarstats --bivariate run on a wide CSV would now flood
stderr.

Switch all nine per-pair skip warnings from wwarn! back to
log::warn!:
- field1_bad_type, field1_missing_in_csv
- field2_bad_type, field2_missing_in_csv
- zero_stddev, zero_variance (the two-branch case)
- both_constant, card_eq_rowcount, type_filter

These are now gated behind QSV_LOG_LEVEL — set QSV_LOG_LEVEL=warn
to surface them in the qsv log file when actually debugging a
flake.

The post-loop summary stays as winfo!: it's a single line that
already carries the aggregate per-reason skip counts and the
full csv_headers — enough signal to spot the corruption mode
without per-pair noise. The hard-fail guard for joined-input
corruption is unaffected.

Healthy left-join repro: stderr is now clean of per-pair
'skipping' lines, output still produces 5 pairs. All 80
moarstats tests pass.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants