fix: Fix col-stats corruption from NaN values and stats-truncation by linliu-code · Pull Request #18792 · apache/hudi

linliu-code · 2026-05-20T13:02:43Z

Describe the issue this Pull Request addresses

Fixes two related correctness bugs in the column-stats metadata table that cause silent zero-row query results when data-skipping is enabled:

[SUPPORT] Single NaN value corrupts col-stats and silently drops query rows across whole table (regression in 1.1.1) #18754 — A single NaN value in a FLOAT/DOUBLE column corrupts the persisted col-stats record for that file. parquet-mr clears the underlying parquet stats (hasNonNullValue=false) but leaves min/max at their default 0.0. Hudi was reading those sentinel values and persisting min=0.0, max=0.0 to the MDT. The 1.x read path then used those fabricated bounds, pruning the file at the partition-stats level and dropping rows from any predicate involving that partition — including predicates on completely unrelated columns of files that contain no NaN at all.
[SUPPORT] Col-stats MDT mis-records nullCount=valueCount when Parquet column stats are absent (e.g. due to long string values), causing silent zero-row query results #18755 — When a Parquet column omits stats entirely (e.g. one value exceeds parquet-mr's stats truncation threshold ~1KB for binary/string), the col-stats writer was mis-recording nullCount = valueCount, i.e. "all rows in this file are null". Any non-null predicate then silently skips that file.

Both bugs cause silent data loss at query time — no error, no warning, just an incomplete result set.

Summary and Changelog

hudi-hadoop-common/.../ParquetUtils.readColumnStatsFromMetadata did two things that turned out to be unsafe in the presence of unreliable parquet stats:

Called stats.genericGetMin()/getMax() without checking stats.hasNonNullValue() — parquet-mr uses that flag to signal "min/max are unreliable, do not trust them" (NaN in FLOAT/DOUBLE).
Did stats.isEmpty() ? valueCount : getNumNulls(). isEmpty() is true both for all-null columns and for stats-truly-absent columns — the two cases need opposite handling.

The error then propagated: even after ParquetUtils was fixed to report null min/max for unreliable stats, the per-partition aggregator HoodieColumnRangeMetadata.merge silently dropped the null contribution from the unreliable file. This left PartitionStatsIndexSupport.prunePartitions with a partial bound that pruned the partition based on the other files' stats — the silent zero-row symptom in #18754.

The read-side filter generator in DataSkippingUtils also needed updating: a predicate like colA_maxValue > X evaluates to null when the file has null max stat, and Spark treats null as false in WHERE, so the file gets pruned. Wrapping with OR (colMin IS NULL AND valueCount > nullCount) makes the file scanned instead, while still pruning legitimately all-null columns.

Changes:

ParquetUtils.readColumnStatsFromMetadata — gate min/max on hasNonNullValue(); gate nullCount = valueCount on isNumNullsSet().
HoodieTableMetadataUtil.collectColumnRangeMetadata — skip NaN in the streaming min/max accumulator (Double.compare(NaN, x) returns +1, so an unguarded comparison lets a single NaN poison max forever).
HoodieTableMetadataUtil.mergeColumnStatsRecords — distinguish all-null from stats-unreliable when merging MDT records.
HoodieColumnRangeMetadata.merge — the per-file → per-partition merge helper used by PartitionStatsIndexSupport. Apply the same unreliability propagation here (the missing piece that kept the read-side bug visible even after the other writer fixes were applied).
DataSkippingUtils.scala — wrap min/max-using skip predicates with OR (colMin IS NULL AND valueCount > nullCount). The valueCount-vs-nullCount guard distinguishes unreliable from all-null.

Impact

No public API change. Internal write/read code only.

The fix is strictly more conservative than the prior behavior: files with unreliable stats are now scanned instead of silently pruned. Tables that don't contain NaN values and don't have stats-truncated columns are unaffected (the new gates short-circuit when stats are reliable). For tables that do hit either condition, queries previously returning silently-wrong (incomplete) results now return correct results; pruning is less aggressive on the files that triggered the unreliable signal.

Risk Level

Medium. The merge-helper change (HoodieColumnRangeMetadata.merge) is in a path shared by partition-stats aggregation and any other caller that merges per-file ranges. The semantics shift from "drop null" to "propagate null when stats are unreliable, drop null when column is genuinely all-null". The distinguishing check is nullCount == valueCount, which is the existing convention for the all-null sentinel — no API change, behavior alignment with the documented contract. Updated TestBaseFileUtils#testGetColumnRangeInPartitionWithNullMinMax reflects the new (correct) semantics; the prior assertion silently dropped the null contributor.

Documentation Update

No documentation changes required. The bugs were silent regressions against the existing contract that "data-skipping should never change query results".

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable (unit tests in TestParquetUtils, TestHoodieColumnRangeMetadata, TestHoodieTableMetadataUtil (including testIsStatsUnreliableUsesValueEqualityOnBoxedLong regression for the boxed-Long ref-equality bug surfaced in code review), TestDataSkippingUtils; functional test in TestDataSkippingWithUnreliableColStats; updated existing TestBaseFileUtils#testGetColumnRangeInPartitionWithNullMinMax to reflect new semantics)
CI passed

…ta-skipping (apache#18754, apache#18755) Two correctness bugs that cause silent zero-row query results when data-skipping is enabled: * apache#18754 — A single NaN value in a FLOAT/DOUBLE column corrupts the persisted col-stats record for that file (parquet-mr clears the stats but leaves min/max at their default 0.0). The 1.x read path then uses those fabricated stats and prunes the file, dropping rows from any predicate touching the file's partition. * apache#18755 — When a Parquet column omits stats entirely (e.g. one value exceeds parquet-mr's stats truncation threshold), the col-stats writer mis-records nullCount = valueCount, i.e. "all rows are null". Any non-null predicate then skips the file silently. Both bugs share `ParquetUtils.readColumnStatsFromMetadata`; apache#18754 also involves the streaming compute path in `HoodieTableMetadataUtil. collectColumnRangeMetadata`, the per-partition merge helper in `HoodieColumnRangeMetadata.merge` (used by `PartitionStatsIndexSupport`), and the read-side predicate translator in `DataSkippingUtils`. Changes: * ParquetUtils.readColumnStatsFromMetadata - Use `Statistics.hasNonNullValue()` to detect NaN-tainted stats; report null min/max instead of the parquet sentinel 0.0. - Use `Statistics.isNumNullsSet()` to distinguish "all-null column" (use valueCount as nullCount, the original intent of the existing comment) from "stats absent" (nullCount unknown — must use 0 so data-skipping can't conclude "all rows are null"). * HoodieTableMetadataUtil.collectColumnRangeMetadata - Exclude NaN from min/max accumulator. Java's `Double.compare(NaN, x)` returns +1, so an unguarded comparison would let a single NaN poison the running max forever. Counts the NaN as a non-null value so valueCount/nullCount stay accurate. Mirrors parquet-mr's policy. * HoodieTableMetadataUtil.mergeColumnStatsRecords - When merging two col-stats records, distinguish "all-null column" (legitimately null min/max, safe to drop in merge) from "stats unreliable" (null min/max with nullCount < valueCount — must propagate null so the merged result reflects the worst-case unknown bound). * HoodieColumnRangeMetadata.merge (the per-file → per-partition merge helper used by `PartitionStatsIndexSupport`) - Apply the same unreliability propagation. This was the missing piece that left the read-side bug visible even with the other writer fixes applied: the partition aggregator was silently dropping the unreliable file's contribution, leaving a partial bound that pruned the partition. * DataSkippingUtils.scala - `withStatsAvailable` wraps min/max-based skip predicates so that a file with null min stat (= stats unreliable) is always scanned rather than pruned by Spark's null-propagating WHERE semantics. - Apply to all min/max-using cases: `<`, `>`, `<=`, `>=`, EqualTo (via `genColumnValuesEqualToExpression`), `In`, `InSet`, `StartsWith`, `Not(StartsWith)`, `Not(In)`, `Not(EqualTo)`. * SparkBaseIndexSupport.scala - In `getCandidateFiles`, augment the index filter with an OR clause "any column referenced in the query has a null min stat". Belt-and- suspenders against any predicate case we missed.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for tackling these correctness bugs — the analysis of the NaN poisoning and stats truncation paths is thorough, and the layered fix (collect → merge → query) is the right shape. One concern I'd like to flag for verification: the new isStatsUnreliable helper in HoodieTableMetadataUtil appears to use Java reference equality on autoboxed Long objects when checking nullCount != valueCount. Because Avro's ["null","long"] generates boxed Long fields and Long.valueOf only caches values in [-128, 127], this comparison flips to "different references" for any realistic row count, which would mark legitimately all-null columns as unreliable during MDT merge. Worth double-checking — see the inline comment. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A few naming and comment-clarity suggestions below.

hudi-agent · 2026-05-20T13:14:58Z

+   * scanned, preserving correctness at the cost of skipping an opportunity. min/max are written
+   * jointly (both null or both non-null), so checking min is sufficient.
+   */
+  @inline def withStatsAvailable(skipExpr: Expression, colName: String): Expression = {


🤖 nit: withStatsAvailable reads as "apply this only when stats are available", but the method actually does the opposite — it adds an OR-null guard so the file passes through when stats are unavailable. Something like withNullStatsGuard or guardedByStatsReliability would make the semantics clearer at every call site.

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent · 2026-05-20T13:14:58Z

+    if (value != null) {
+      return false;
+    }
+    // All-null column: stats are legitimately empty for min/max — safe to ignore in merge.


🤖 nit: the comment describes the false case (all-null column → reliable) but sits directly before the return that evaluates to true when unreliable (nullCount != valueCount). Could you flip the framing so it describes what true means — e.g. // Unreliable: min/max are null but the column is not all-null (nullCount < valueCount).

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

Adds unit tests covering every behavior branch touched by the col-stats fix for apache#18754 / apache#18755: * TestParquetUtils: - testReadColumnStatsFromMetadata_nanTaintedDoubleColumn: writes a DOUBLE column containing NaN, asserts min/max come back null (was silently persisted as 0.0/0.0 — the writer-side half of apache#18754). - testReadColumnStatsFromMetadata_statsAbsentForLongValues: writes a string column with one value larger than parquet-mr's stats truncation threshold (~1KB), asserts nullCount=0 (was incorrectly reported as valueCount — apache#18755). - testReadColumnStatsFromMetadata_allNullColumnRetainsValueCountAsNullCount: pins the existing all-null convention so the apache#18755 fix doesn't regress it. * TestHoodieColumnRangeMetadata (new file): - merge of two reliable bounds picks min/max as before - merge with one all-null column keeps the other side's bounds - merge with one unreliable side propagates null - symmetric on left vs right - both-unreliable propagates null - both-all-null stays legitimately null - null-side returns the other (existing contract) The "one unreliable + one reliable -> null" case is the read-side half of apache#18754: without this propagation, PartitionStatsIndexSupport silently dropped the unreliable file's contribution and pruned the partition based on a partial bound. * TestHoodieTableMetadataUtil: - testIsFloatingPointNaN: pins the NaN-detection helper used by collectColumnRangeMetadata's streaming min/max accumulator. Covers NaN, Inf, normal numbers, and non-floating-point values. * TestDataSkippingUtils: - testNullSafetyForUnreliableStats: parameterized test over <, >, <=, >=, =, IN. Each row in the synthetic col-stats index includes a "file_unreliable" entry with null min/max but valueCount>0 (the stats-unreliability signal). The test asserts the unreliable file is always in the candidate set regardless of predicate shape — the read-side wrapping introduced by withStatsAvailable. HoodieTableMetadataUtil.isFloatingPointNaN was changed from private to package-private for testability. Sanity-checked locally that the new ParquetUtils + HoodieColumnRangeMetadata tests fail against upstream/master (catching the bugs) and pass against the fix. DataSkippingUtils test runs in CI — relies on Spark resolution that's easier to set up there than in a standalone JVM.

Adds a functional test class that exercises the full Hudi write→read path for both bugs covered by this PR. * testNaNInDoubleColumnDoesNotSilentlyPruneRows (apache#18754): writes 4 files in one partition where file 3 contains a NaN value. Asserts that queries against d_double AND queries against unrelated columns (s_str) return identical row sets with data-skipping ON vs OFF. Pre-fix: queries silently dropped rows because file 3's bad parquet stats (min=0/max=0) propagated through the col-stats writer into the MDT and through the partition-stats aggregator. * testTruncatedStringStatsDoNotSilentlyPruneRows (apache#18755): writes 2 files where file 1 contains a long string value that exceeds parquet-mr's stats truncation threshold (~1KB). Asserts that a point-lookup on the short string in the same file returns the row with skipping ON. Pre-fix: the col-stats writer recorded nullCount=valueCount, claiming the file was all-null, so data-skipping silently skipped it. Strategy: assertSkippingNeverDropsRows compares the set of rk values returned with skipping ON to the set returned with skipping OFF, and fails on any divergence. This is the precise correctness contract that data-skipping must preserve — a single asymmetric prune is silent data loss.

* DataSkippingUtils.withStatsAvailable now adds an AND-guard on {@code valueCount > nullCount} so the OR-IsNull(min) branch only fires for files with non-null values whose bounds are unknown (the unreliable case). Without the guard the wrap also fired for legitimately all-null columns (or test fixtures with unset nullCount), causing legacy TestDataSkippingUtils#testLookupFilterExpressions parameterized cases to include files that should have been pruned. * SparkBaseIndexSupport.getCandidateFiles no longer adds a separate global null-safety augmentation — the per-predicate wrap in DataSkippingUtils now covers all cases with the correct semantics. * TestBaseFileUtils#testGetColumnRangeInPartitionWithNullMinMax updated to reflect the new semantics: merging an "unreliable null" side (nullCount<valueCount with null min/max) now propagates null instead of silently dropping the contribution. Added a companion test pinning that all-null columns are still dropped from merge correctly.

…ty + naming nits Critical bug surfaced by AI reviewer on HoodieTableMetadataUtil.isStatsUnreliable: the helper compared boxed Long fields (nullCount, valueCount) with !=, which is reference equality. Long.valueOf caches only [-128, 127], so any realistic row count autoboxes to a fresh Long instance and != returns true even when values are numerically equal. This misclassifies legitimately all-null columns as "stats unreliable" during MDT col-stats compaction merge, defeating the all-null pruning fast path. Fix: use !nullCount.equals(valueCount). Sibling helper in HoodieColumnRangeMetadata is unaffected because its fields are primitive long. Also addresses three lower-severity review notes: - Rename DataSkippingUtils.withStatsAvailable → withUnreliableStatsGuard. The old name read as "apply when stats are available" but the helper actually added an OR-null guard for when they are NOT available. - Rename genColumnOnlyValuesEqualToExpression → ...ForNegation, and lift the Not(...) precondition from an inline comment into the helper's javadoc. All 3 current callers already wrap in Not(...), so no semantic change. - Flip comment framing in HoodieColumnRangeMetadata.isStatsUnreliable so the comment describes the true-case (unreliable) sitting next to the return. Test: new testIsStatsUnreliableUsesValueEqualityOnBoxedLong covers six quadrants (all-null large N, all-null small N, value-present, unreliable, missing nullCount, missing valueCount). Verified it fails on the buggy code and passes after the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…uard wrap Five CI failures (4 in TestColumnStatsIndex.testTranslateQueryFiltersIntoCol- umnStatsIndexFilterExpr, 1 in TestColumnStatsIndexWithSQL.testTranslateInto- ColumnStatsIndexFilterExpr, repeated across Spark 3.3-4.1) compared the translated col-stats filter expression character-by-character against a hardcoded expected Expression tree. The new null-safety OR-guard added by DataSkippingUtils.withUnreliableStatsGuard for files with unreliable stats (NaN-poisoned, value-size truncated) legitimately changes the translated expression shape from (colA_maxValue > V) to (colA_maxValue > V) OR (isnull(colA_minValue) AND valueCount > colA_nullCount) These tests are exact-match Expression-tree snapshots — production behavior is correct, the test expectations were stale. Updated: - TestColumnStatsIndex.scala: inline expected for "c1 > 1" wrapped in the guard. - TestColumnStatsIndexWithSQL.scala: central generateColStatsExprForGreater- thanOrEquals helper updated to wrap, which fixes all ~20 call sites that use EqualTo translations. - TestPartitionStatsIndex.scala: dead-but-mirrored helper updated to stay in sync (the helper isn't actually called there, but matches the live one to prevent confusion). Verified locally on Spark 3.5: both tests pass (5 total cases). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hudi-bot · 2026-05-20T19:49:18Z

CI report:

2399c77 UNKNOWN
d8845a8 UNKNOWN
ef03ae1 UNKNOWN
1629a03 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

codecov-commenter · 2026-05-20T20:10:29Z

Codecov Report

❌ Patch coverage is 90.47619% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.26%. Comparing base (9565926) to head (1629a03).

Files with missing lines	Patch %	Lines
.../apache/hudi/metadata/HoodieTableMetadataUtil.java	87.50%	2 Missing and 3 partials ⚠️
...java/org/apache/hudi/common/util/ParquetUtils.java	83.33%	1 Missing and 1 partial ⚠️
.../org/apache/spark/sql/hudi/DataSkippingUtils.scala	95.65%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18792   +/-   ##
=========================================
  Coverage     68.25%   68.26%           
- Complexity    29337    29352   +15     
=========================================
  Files          2527     2527           
  Lines        141858   141901   +43     
  Branches      17627    17639   +12     
=========================================
+ Hits          96827    96866   +39     
+ Misses        37068    37064    -4     
- Partials       7963     7971    +8

Flag	Coverage Δ
common-and-other-modules	`44.42% <44.04%> (+<0.01%)`	⬆️
hadoop-mr-java-client	`44.91% <34.42%> (+<0.01%)`	⬆️
spark-client-hadoop-common	`48.23% <73.77%> (-0.01%)`	⬇️
spark-java-tests	`48.86% <69.04%> (-0.01%)`	⬇️
spark-scala-tests	`44.92% <45.23%> (+<0.01%)`	⬆️
utilities	`37.49% <41.66%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...g/apache/hudi/stats/HoodieColumnRangeMetadata.java	`84.61% <100.00%> (+2.39%)`	⬆️
.../scala/org/apache/hudi/SparkBaseIndexSupport.scala	`82.92% <ø> (ø)`
.../org/apache/spark/sql/hudi/DataSkippingUtils.scala	`78.78% <95.65%> (+0.76%)`	⬆️
...java/org/apache/hudi/common/util/ParquetUtils.java	`89.37% <83.33%> (-1.72%)`	⬇️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java	`82.47% <87.50%> (-0.08%)`	⬇️

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the thorough follow-up! All four prior findings are addressed: the boxed-Long reference-equality bug is fixed with .equals() (and a dedicated regression test that exercises row counts > 127), genColumnOnlyValuesEqualToExpression is renamed to genColumnOnlyValuesEqualToExpressionForNegation with a precondition-documenting Javadoc, withStatsAvailable is renamed to withUnreliableStatsGuard and additionally improved to correctly distinguish all-null columns from unreliable stats via valueCount > nullCount, and the misleading comment in HoodieColumnRangeMetadata.isStatsUnreliable is flipped to describe the truthy case. The refactor from the global guard in SparkBaseIndexSupport to the per-predicate guard at the translator layer is a precision improvement, and the new tests (TestHoodieColumnRangeMetadata, TestDataSkippingWithUnreliableColStats, and the additions to TestParquetUtils) provide solid coverage. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

linliu-code changed the title ~~[HUDI-XXXX] Fix col-stats corruption from NaN values and stats-truncation (fixes #18754, #18755)~~ fix: Fix col-stats corruption from NaN values and stats-truncation (fixes #18754, #18755) May 20, 2026

linliu-code changed the title ~~fix: Fix col-stats corruption from NaN values and stats-truncation (fixes #18754, #18755)~~ fix: Fix col-stats corruption from NaN values and stats-truncation May 20, 2026

hudi-agent reviewed May 20, 2026

View reviewed changes

github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 20, 2026

linliu-code added 2 commits May 20, 2026 07:00

github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 20, 2026

linliu-code and others added 3 commits May 20, 2026 08:39

github-actions Bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels May 20, 2026

hudi-agent reviewed May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Fix col-stats corruption from NaN values and stats-truncation#18792

fix: Fix col-stats corruption from NaN values and stats-truncation#18792
linliu-code wants to merge 6 commits into
apache:masterfrom
linliu-code:fix-colstats-nan-and-truncation

linliu-code commented May 20, 2026 •

edited

Loading

Uh oh!

hudi-agent left a comment

Uh oh!

Uh oh!

Uh oh!

hudi-agent May 20, 2026

Uh oh!

hudi-agent May 20, 2026

Uh oh!

hudi-bot commented May 20, 2026

Uh oh!

codecov-commenter commented May 20, 2026

Uh oh!

hudi-agent left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

linliu-code commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

hudi-agent May 20, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-agent May 20, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-bot commented May 20, 2026

CI report:

Uh oh!

codecov-commenter commented May 20, 2026

Codecov Report

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

linliu-code commented May 20, 2026 •

edited

Loading