Skip to content

fix: Fix col-stats corruption from NaN values and stats-truncation#18792

Open
linliu-code wants to merge 6 commits into
apache:masterfrom
linliu-code:fix-colstats-nan-and-truncation
Open

fix: Fix col-stats corruption from NaN values and stats-truncation#18792
linliu-code wants to merge 6 commits into
apache:masterfrom
linliu-code:fix-colstats-nan-and-truncation

Conversation

@linliu-code
Copy link
Copy Markdown
Collaborator

@linliu-code linliu-code commented May 20, 2026

Describe the issue this Pull Request addresses

Fixes two related correctness bugs in the column-stats metadata table that cause silent zero-row query results when data-skipping is enabled:

Both bugs cause silent data loss at query time — no error, no warning, just an incomplete result set.

Summary and Changelog

hudi-hadoop-common/.../ParquetUtils.readColumnStatsFromMetadata did two things that turned out to be unsafe in the presence of unreliable parquet stats:

  1. Called stats.genericGetMin()/getMax() without checking stats.hasNonNullValue() — parquet-mr uses that flag to signal "min/max are unreliable, do not trust them" (NaN in FLOAT/DOUBLE).
  2. Did stats.isEmpty() ? valueCount : getNumNulls(). isEmpty() is true both for all-null columns and for stats-truly-absent columns — the two cases need opposite handling.

The error then propagated: even after ParquetUtils was fixed to report null min/max for unreliable stats, the per-partition aggregator HoodieColumnRangeMetadata.merge silently dropped the null contribution from the unreliable file. This left PartitionStatsIndexSupport.prunePartitions with a partial bound that pruned the partition based on the other files' stats — the silent zero-row symptom in #18754.

The read-side filter generator in DataSkippingUtils also needed updating: a predicate like colA_maxValue > X evaluates to null when the file has null max stat, and Spark treats null as false in WHERE, so the file gets pruned. Wrapping with OR (colMin IS NULL AND valueCount > nullCount) makes the file scanned instead, while still pruning legitimately all-null columns.

Changes:

  • ParquetUtils.readColumnStatsFromMetadata — gate min/max on hasNonNullValue(); gate nullCount = valueCount on isNumNullsSet().
  • HoodieTableMetadataUtil.collectColumnRangeMetadata — skip NaN in the streaming min/max accumulator (Double.compare(NaN, x) returns +1, so an unguarded comparison lets a single NaN poison max forever).
  • HoodieTableMetadataUtil.mergeColumnStatsRecords — distinguish all-null from stats-unreliable when merging MDT records.
  • HoodieColumnRangeMetadata.merge — the per-file → per-partition merge helper used by PartitionStatsIndexSupport. Apply the same unreliability propagation here (the missing piece that kept the read-side bug visible even after the other writer fixes were applied).
  • DataSkippingUtils.scala — wrap min/max-using skip predicates with OR (colMin IS NULL AND valueCount > nullCount). The valueCount-vs-nullCount guard distinguishes unreliable from all-null.

Impact

No public API change. Internal write/read code only.

The fix is strictly more conservative than the prior behavior: files with unreliable stats are now scanned instead of silently pruned. Tables that don't contain NaN values and don't have stats-truncated columns are unaffected (the new gates short-circuit when stats are reliable). For tables that do hit either condition, queries previously returning silently-wrong (incomplete) results now return correct results; pruning is less aggressive on the files that triggered the unreliable signal.

Risk Level

Medium. The merge-helper change (HoodieColumnRangeMetadata.merge) is in a path shared by partition-stats aggregation and any other caller that merges per-file ranges. The semantics shift from "drop null" to "propagate null when stats are unreliable, drop null when column is genuinely all-null". The distinguishing check is nullCount == valueCount, which is the existing convention for the all-null sentinel — no API change, behavior alignment with the documented contract. Updated TestBaseFileUtils#testGetColumnRangeInPartitionWithNullMinMax reflects the new (correct) semantics; the prior assertion silently dropped the null contributor.

Documentation Update

No documentation changes required. The bugs were silent regressions against the existing contract that "data-skipping should never change query results".

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable (unit tests in TestParquetUtils, TestHoodieColumnRangeMetadata, TestHoodieTableMetadataUtil (including testIsStatsUnreliableUsesValueEqualityOnBoxedLong regression for the boxed-Long ref-equality bug surfaced in code review), TestDataSkippingUtils; functional test in TestDataSkippingWithUnreliableColStats; updated existing TestBaseFileUtils#testGetColumnRangeInPartitionWithNullMinMax to reflect new semantics)
  • CI passed

…ta-skipping (apache#18754, apache#18755)

Two correctness bugs that cause silent zero-row query results when data-skipping
is enabled:

* apache#18754 — A single NaN value in a FLOAT/DOUBLE column corrupts the persisted
  col-stats record for that file (parquet-mr clears the stats but leaves
  min/max at their default 0.0). The 1.x read path then uses those fabricated
  stats and prunes the file, dropping rows from any predicate touching the
  file's partition.

* apache#18755 — When a Parquet column omits stats entirely (e.g. one value exceeds
  parquet-mr's stats truncation threshold), the col-stats writer mis-records
  nullCount = valueCount, i.e. "all rows are null". Any non-null predicate
  then skips the file silently.

Both bugs share `ParquetUtils.readColumnStatsFromMetadata`; apache#18754 also
involves the streaming compute path in `HoodieTableMetadataUtil.
collectColumnRangeMetadata`, the per-partition merge helper in
`HoodieColumnRangeMetadata.merge` (used by `PartitionStatsIndexSupport`), and
the read-side predicate translator in `DataSkippingUtils`.

Changes:

* ParquetUtils.readColumnStatsFromMetadata
  - Use `Statistics.hasNonNullValue()` to detect NaN-tainted stats; report
    null min/max instead of the parquet sentinel 0.0.
  - Use `Statistics.isNumNullsSet()` to distinguish "all-null column" (use
    valueCount as nullCount, the original intent of the existing comment)
    from "stats absent" (nullCount unknown — must use 0 so data-skipping
    can't conclude "all rows are null").

* HoodieTableMetadataUtil.collectColumnRangeMetadata
  - Exclude NaN from min/max accumulator. Java's `Double.compare(NaN, x)`
    returns +1, so an unguarded comparison would let a single NaN poison
    the running max forever. Counts the NaN as a non-null value so
    valueCount/nullCount stay accurate. Mirrors parquet-mr's policy.

* HoodieTableMetadataUtil.mergeColumnStatsRecords
  - When merging two col-stats records, distinguish "all-null column"
    (legitimately null min/max, safe to drop in merge) from "stats
    unreliable" (null min/max with nullCount < valueCount — must propagate
    null so the merged result reflects the worst-case unknown bound).

* HoodieColumnRangeMetadata.merge (the per-file → per-partition merge
  helper used by `PartitionStatsIndexSupport`)
  - Apply the same unreliability propagation. This was the missing piece
    that left the read-side bug visible even with the other writer fixes
    applied: the partition aggregator was silently dropping the
    unreliable file's contribution, leaving a partial bound that pruned
    the partition.

* DataSkippingUtils.scala
  - `withStatsAvailable` wraps min/max-based skip predicates so that a
    file with null min stat (= stats unreliable) is always scanned rather
    than pruned by Spark's null-propagating WHERE semantics.
  - Apply to all min/max-using cases: `<`, `>`, `<=`, `>=`, EqualTo (via
    `genColumnValuesEqualToExpression`), `In`, `InSet`, `StartsWith`,
    `Not(StartsWith)`, `Not(In)`, `Not(EqualTo)`.

* SparkBaseIndexSupport.scala
  - In `getCandidateFiles`, augment the index filter with an OR clause
    "any column referenced in the query has a null min stat". Belt-and-
    suspenders against any predicate case we missed.
@linliu-code linliu-code changed the title [HUDI-XXXX] Fix col-stats corruption from NaN values and stats-truncation (fixes #18754, #18755) fix: Fix col-stats corruption from NaN values and stats-truncation (fixes #18754, #18755) May 20, 2026
@linliu-code linliu-code changed the title fix: Fix col-stats corruption from NaN values and stats-truncation (fixes #18754, #18755) fix: Fix col-stats corruption from NaN values and stats-truncation May 20, 2026
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for tackling these correctness bugs — the analysis of the NaN poisoning and stats truncation paths is thorough, and the layered fix (collect → merge → query) is the right shape. One concern I'd like to flag for verification: the new isStatsUnreliable helper in HoodieTableMetadataUtil appears to use Java reference equality on autoboxed Long objects when checking nullCount != valueCount. Because Avro's ["null","long"] generates boxed Long fields and Long.valueOf only caches values in [-128, 127], this comparison flips to "different references" for any realistic row count, which would mark legitimately all-null columns as unreliable during MDT merge. Worth double-checking — see the inline comment. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A few naming and comment-clarity suggestions below.

* scanned, preserving correctness at the cost of skipping an opportunity. min/max are written
* jointly (both null or both non-null), so checking min is sufficient.
*/
@inline def withStatsAvailable(skipExpr: Expression, colName: String): Expression = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: withStatsAvailable reads as "apply this only when stats are available", but the method actually does the opposite — it adds an OR-null guard so the file passes through when stats are unavailable. Something like withNullStatsGuard or guardedByStatsReliability would make the semantics clearer at every call site.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

if (value != null) {
return false;
}
// All-null column: stats are legitimately empty for min/max — safe to ignore in merge.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the comment describes the false case (all-null column → reliable) but sits directly before the return that evaluates to true when unreliable (nullCount != valueCount). Could you flip the framing so it describes what true means — e.g. // Unreliable: min/max are null but the column is not all-null (nullCount < valueCount).

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 20, 2026
Adds unit tests covering every behavior branch touched by the col-stats
fix for apache#18754 / apache#18755:

* TestParquetUtils:
  - testReadColumnStatsFromMetadata_nanTaintedDoubleColumn: writes a
    DOUBLE column containing NaN, asserts min/max come back null (was
    silently persisted as 0.0/0.0 — the writer-side half of apache#18754).
  - testReadColumnStatsFromMetadata_statsAbsentForLongValues: writes a
    string column with one value larger than parquet-mr's stats
    truncation threshold (~1KB), asserts nullCount=0 (was incorrectly
    reported as valueCount — apache#18755).
  - testReadColumnStatsFromMetadata_allNullColumnRetainsValueCountAsNullCount:
    pins the existing all-null convention so the apache#18755 fix doesn't
    regress it.

* TestHoodieColumnRangeMetadata (new file):
  - merge of two reliable bounds picks min/max as before
  - merge with one all-null column keeps the other side's bounds
  - merge with one unreliable side propagates null
  - symmetric on left vs right
  - both-unreliable propagates null
  - both-all-null stays legitimately null
  - null-side returns the other (existing contract)

  The "one unreliable + one reliable -> null" case is the read-side
  half of apache#18754: without this propagation, PartitionStatsIndexSupport
  silently dropped the unreliable file's contribution and pruned the
  partition based on a partial bound.

* TestHoodieTableMetadataUtil:
  - testIsFloatingPointNaN: pins the NaN-detection helper used by
    collectColumnRangeMetadata's streaming min/max accumulator.
    Covers NaN, Inf, normal numbers, and non-floating-point values.

* TestDataSkippingUtils:
  - testNullSafetyForUnreliableStats: parameterized test over <, >,
    <=, >=, =, IN. Each row in the synthetic col-stats index includes
    a "file_unreliable" entry with null min/max but valueCount>0
    (the stats-unreliability signal). The test asserts the unreliable
    file is always in the candidate set regardless of predicate shape —
    the read-side wrapping introduced by withStatsAvailable.

HoodieTableMetadataUtil.isFloatingPointNaN was changed from private to
package-private for testability.

Sanity-checked locally that the new ParquetUtils + HoodieColumnRangeMetadata
tests fail against upstream/master (catching the bugs) and pass against the
fix. DataSkippingUtils test runs in CI — relies on Spark resolution that's
easier to set up there than in a standalone JVM.
Adds a functional test class that exercises the full Hudi write→read
path for both bugs covered by this PR.

* testNaNInDoubleColumnDoesNotSilentlyPruneRows (apache#18754): writes 4 files
  in one partition where file 3 contains a NaN value. Asserts that
  queries against d_double AND queries against unrelated columns
  (s_str) return identical row sets with data-skipping ON vs OFF.
  Pre-fix: queries silently dropped rows because file 3's bad parquet
  stats (min=0/max=0) propagated through the col-stats writer into the
  MDT and through the partition-stats aggregator.

* testTruncatedStringStatsDoNotSilentlyPruneRows (apache#18755): writes 2
  files where file 1 contains a long string value that exceeds
  parquet-mr's stats truncation threshold (~1KB). Asserts that a
  point-lookup on the short string in the same file returns the row
  with skipping ON. Pre-fix: the col-stats writer recorded
  nullCount=valueCount, claiming the file was all-null, so
  data-skipping silently skipped it.

Strategy: assertSkippingNeverDropsRows compares the set of rk values
returned with skipping ON to the set returned with skipping OFF, and
fails on any divergence. This is the precise correctness contract that
data-skipping must preserve — a single asymmetric prune is silent data
loss.
@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 20, 2026
linliu-code and others added 3 commits May 20, 2026 08:39
* DataSkippingUtils.withStatsAvailable now adds an AND-guard on
  {@code valueCount > nullCount} so the OR-IsNull(min) branch only fires
  for files with non-null values whose bounds are unknown (the unreliable
  case). Without the guard the wrap also fired for legitimately all-null
  columns (or test fixtures with unset nullCount), causing legacy
  TestDataSkippingUtils#testLookupFilterExpressions parameterized cases
  to include files that should have been pruned.

* SparkBaseIndexSupport.getCandidateFiles no longer adds a separate
  global null-safety augmentation — the per-predicate wrap in
  DataSkippingUtils now covers all cases with the correct semantics.

* TestBaseFileUtils#testGetColumnRangeInPartitionWithNullMinMax updated
  to reflect the new semantics: merging an "unreliable null" side
  (nullCount<valueCount with null min/max) now propagates null instead
  of silently dropping the contribution. Added a companion test pinning
  that all-null columns are still dropped from merge correctly.
…ty + naming nits

Critical bug surfaced by AI reviewer on HoodieTableMetadataUtil.isStatsUnreliable:
the helper compared boxed Long fields (nullCount, valueCount) with !=, which is
reference equality. Long.valueOf caches only [-128, 127], so any realistic row
count autoboxes to a fresh Long instance and != returns true even when values are
numerically equal. This misclassifies legitimately all-null columns as "stats
unreliable" during MDT col-stats compaction merge, defeating the all-null
pruning fast path. Fix: use !nullCount.equals(valueCount). Sibling helper in
HoodieColumnRangeMetadata is unaffected because its fields are primitive long.

Also addresses three lower-severity review notes:
- Rename DataSkippingUtils.withStatsAvailable → withUnreliableStatsGuard. The
  old name read as "apply when stats are available" but the helper actually
  added an OR-null guard for when they are NOT available.
- Rename genColumnOnlyValuesEqualToExpression → ...ForNegation, and lift the
  Not(...) precondition from an inline comment into the helper's javadoc.
  All 3 current callers already wrap in Not(...), so no semantic change.
- Flip comment framing in HoodieColumnRangeMetadata.isStatsUnreliable so the
  comment describes the true-case (unreliable) sitting next to the return.

Test: new testIsStatsUnreliableUsesValueEqualityOnBoxedLong covers six
quadrants (all-null large N, all-null small N, value-present, unreliable,
missing nullCount, missing valueCount). Verified it fails on the buggy code
and passes after the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uard wrap

Five CI failures (4 in TestColumnStatsIndex.testTranslateQueryFiltersIntoCol-
umnStatsIndexFilterExpr, 1 in TestColumnStatsIndexWithSQL.testTranslateInto-
ColumnStatsIndexFilterExpr, repeated across Spark 3.3-4.1) compared the
translated col-stats filter expression character-by-character against a
hardcoded expected Expression tree. The new null-safety OR-guard added by
DataSkippingUtils.withUnreliableStatsGuard for files with unreliable stats
(NaN-poisoned, value-size truncated) legitimately changes the translated
expression shape from

  (colA_maxValue > V)

to

  (colA_maxValue > V) OR (isnull(colA_minValue) AND valueCount > colA_nullCount)

These tests are exact-match Expression-tree snapshots — production behavior
is correct, the test expectations were stale. Updated:

- TestColumnStatsIndex.scala: inline expected for "c1 > 1" wrapped in the
  guard.
- TestColumnStatsIndexWithSQL.scala: central generateColStatsExprForGreater-
  thanOrEquals helper updated to wrap, which fixes all ~20 call sites that
  use EqualTo translations.
- TestPartitionStatsIndex.scala: dead-but-mirrored helper updated to stay in
  sync (the helper isn't actually called there, but matches the live one to
  prevent confusion).

Verified locally on Spark 3.5: both tests pass (5 total cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels May 20, 2026
@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.47619% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.26%. Comparing base (9565926) to head (1629a03).

Files with missing lines Patch % Lines
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 87.50% 2 Missing and 3 partials ⚠️
...java/org/apache/hudi/common/util/ParquetUtils.java 83.33% 1 Missing and 1 partial ⚠️
.../org/apache/spark/sql/hudi/DataSkippingUtils.scala 95.65% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18792   +/-   ##
=========================================
  Coverage     68.25%   68.26%           
- Complexity    29337    29352   +15     
=========================================
  Files          2527     2527           
  Lines        141858   141901   +43     
  Branches      17627    17639   +12     
=========================================
+ Hits          96827    96866   +39     
+ Misses        37068    37064    -4     
- Partials       7963     7971    +8     
Flag Coverage Δ
common-and-other-modules 44.42% <44.04%> (+<0.01%) ⬆️
hadoop-mr-java-client 44.91% <34.42%> (+<0.01%) ⬆️
spark-client-hadoop-common 48.23% <73.77%> (-0.01%) ⬇️
spark-java-tests 48.86% <69.04%> (-0.01%) ⬇️
spark-scala-tests 44.92% <45.23%> (+<0.01%) ⬆️
utilities 37.49% <41.66%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...g/apache/hudi/stats/HoodieColumnRangeMetadata.java 84.61% <100.00%> (+2.39%) ⬆️
.../scala/org/apache/hudi/SparkBaseIndexSupport.scala 82.92% <ø> (ø)
.../org/apache/spark/sql/hudi/DataSkippingUtils.scala 78.78% <95.65%> (+0.76%) ⬆️
...java/org/apache/hudi/common/util/ParquetUtils.java 89.37% <83.33%> (-1.72%) ⬇️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 82.47% <87.50%> (-0.08%) ⬇️

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the thorough follow-up! All four prior findings are addressed: the boxed-Long reference-equality bug is fixed with .equals() (and a dedicated regression test that exercises row counts > 127), genColumnOnlyValuesEqualToExpression is renamed to genColumnOnlyValuesEqualToExpressionForNegation with a precondition-documenting Javadoc, withStatsAvailable is renamed to withUnreliableStatsGuard and additionally improved to correctly distinguish all-null columns from unreliable stats via valueCount > nullCount, and the misleading comment in HoodieColumnRangeMetadata.isStatsUnreliable is flipped to describe the truthy case. The refactor from the global guard in SparkBaseIndexSupport to the per-predicate guard at the translator layer is a precision improvement, and the new tests (TestHoodieColumnRangeMetadata, TestDataSkippingWithUnreliableColStats, and the additions to TestParquetUtils) provide solid coverage. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants