fix: Fix col-stats corruption from NaN values and stats-truncation#18792
fix: Fix col-stats corruption from NaN values and stats-truncation#18792linliu-code wants to merge 6 commits into
Conversation
…ta-skipping (apache#18754, apache#18755) Two correctness bugs that cause silent zero-row query results when data-skipping is enabled: * apache#18754 — A single NaN value in a FLOAT/DOUBLE column corrupts the persisted col-stats record for that file (parquet-mr clears the stats but leaves min/max at their default 0.0). The 1.x read path then uses those fabricated stats and prunes the file, dropping rows from any predicate touching the file's partition. * apache#18755 — When a Parquet column omits stats entirely (e.g. one value exceeds parquet-mr's stats truncation threshold), the col-stats writer mis-records nullCount = valueCount, i.e. "all rows are null". Any non-null predicate then skips the file silently. Both bugs share `ParquetUtils.readColumnStatsFromMetadata`; apache#18754 also involves the streaming compute path in `HoodieTableMetadataUtil. collectColumnRangeMetadata`, the per-partition merge helper in `HoodieColumnRangeMetadata.merge` (used by `PartitionStatsIndexSupport`), and the read-side predicate translator in `DataSkippingUtils`. Changes: * ParquetUtils.readColumnStatsFromMetadata - Use `Statistics.hasNonNullValue()` to detect NaN-tainted stats; report null min/max instead of the parquet sentinel 0.0. - Use `Statistics.isNumNullsSet()` to distinguish "all-null column" (use valueCount as nullCount, the original intent of the existing comment) from "stats absent" (nullCount unknown — must use 0 so data-skipping can't conclude "all rows are null"). * HoodieTableMetadataUtil.collectColumnRangeMetadata - Exclude NaN from min/max accumulator. Java's `Double.compare(NaN, x)` returns +1, so an unguarded comparison would let a single NaN poison the running max forever. Counts the NaN as a non-null value so valueCount/nullCount stay accurate. Mirrors parquet-mr's policy. * HoodieTableMetadataUtil.mergeColumnStatsRecords - When merging two col-stats records, distinguish "all-null column" (legitimately null min/max, safe to drop in merge) from "stats unreliable" (null min/max with nullCount < valueCount — must propagate null so the merged result reflects the worst-case unknown bound). * HoodieColumnRangeMetadata.merge (the per-file → per-partition merge helper used by `PartitionStatsIndexSupport`) - Apply the same unreliability propagation. This was the missing piece that left the read-side bug visible even with the other writer fixes applied: the partition aggregator was silently dropping the unreliable file's contribution, leaving a partial bound that pruned the partition. * DataSkippingUtils.scala - `withStatsAvailable` wraps min/max-based skip predicates so that a file with null min stat (= stats unreliable) is always scanned rather than pruned by Spark's null-propagating WHERE semantics. - Apply to all min/max-using cases: `<`, `>`, `<=`, `>=`, EqualTo (via `genColumnValuesEqualToExpression`), `In`, `InSet`, `StartsWith`, `Not(StartsWith)`, `Not(In)`, `Not(EqualTo)`. * SparkBaseIndexSupport.scala - In `getCandidateFiles`, augment the index filter with an OR clause "any column referenced in the query has a null min stat". Belt-and- suspenders against any predicate case we missed.
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for tackling these correctness bugs — the analysis of the NaN poisoning and stats truncation paths is thorough, and the layered fix (collect → merge → query) is the right shape. One concern I'd like to flag for verification: the new isStatsUnreliable helper in HoodieTableMetadataUtil appears to use Java reference equality on autoboxed Long objects when checking nullCount != valueCount. Because Avro's ["null","long"] generates boxed Long fields and Long.valueOf only caches values in [-128, 127], this comparison flips to "different references" for any realistic row count, which would mark legitimately all-null columns as unreliable during MDT merge. Worth double-checking — see the inline comment. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A few naming and comment-clarity suggestions below.
| * scanned, preserving correctness at the cost of skipping an opportunity. min/max are written | ||
| * jointly (both null or both non-null), so checking min is sufficient. | ||
| */ | ||
| @inline def withStatsAvailable(skipExpr: Expression, colName: String): Expression = { |
There was a problem hiding this comment.
🤖 nit: withStatsAvailable reads as "apply this only when stats are available", but the method actually does the opposite — it adds an OR-null guard so the file passes through when stats are unavailable. Something like withNullStatsGuard or guardedByStatsReliability would make the semantics clearer at every call site.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| if (value != null) { | ||
| return false; | ||
| } | ||
| // All-null column: stats are legitimately empty for min/max — safe to ignore in merge. |
There was a problem hiding this comment.
🤖 nit: the comment describes the false case (all-null column → reliable) but sits directly before the return that evaluates to true when unreliable (nullCount != valueCount). Could you flip the framing so it describes what true means — e.g. // Unreliable: min/max are null but the column is not all-null (nullCount < valueCount).
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Adds unit tests covering every behavior branch touched by the col-stats fix for apache#18754 / apache#18755: * TestParquetUtils: - testReadColumnStatsFromMetadata_nanTaintedDoubleColumn: writes a DOUBLE column containing NaN, asserts min/max come back null (was silently persisted as 0.0/0.0 — the writer-side half of apache#18754). - testReadColumnStatsFromMetadata_statsAbsentForLongValues: writes a string column with one value larger than parquet-mr's stats truncation threshold (~1KB), asserts nullCount=0 (was incorrectly reported as valueCount — apache#18755). - testReadColumnStatsFromMetadata_allNullColumnRetainsValueCountAsNullCount: pins the existing all-null convention so the apache#18755 fix doesn't regress it. * TestHoodieColumnRangeMetadata (new file): - merge of two reliable bounds picks min/max as before - merge with one all-null column keeps the other side's bounds - merge with one unreliable side propagates null - symmetric on left vs right - both-unreliable propagates null - both-all-null stays legitimately null - null-side returns the other (existing contract) The "one unreliable + one reliable -> null" case is the read-side half of apache#18754: without this propagation, PartitionStatsIndexSupport silently dropped the unreliable file's contribution and pruned the partition based on a partial bound. * TestHoodieTableMetadataUtil: - testIsFloatingPointNaN: pins the NaN-detection helper used by collectColumnRangeMetadata's streaming min/max accumulator. Covers NaN, Inf, normal numbers, and non-floating-point values. * TestDataSkippingUtils: - testNullSafetyForUnreliableStats: parameterized test over <, >, <=, >=, =, IN. Each row in the synthetic col-stats index includes a "file_unreliable" entry with null min/max but valueCount>0 (the stats-unreliability signal). The test asserts the unreliable file is always in the candidate set regardless of predicate shape — the read-side wrapping introduced by withStatsAvailable. HoodieTableMetadataUtil.isFloatingPointNaN was changed from private to package-private for testability. Sanity-checked locally that the new ParquetUtils + HoodieColumnRangeMetadata tests fail against upstream/master (catching the bugs) and pass against the fix. DataSkippingUtils test runs in CI — relies on Spark resolution that's easier to set up there than in a standalone JVM.
Adds a functional test class that exercises the full Hudi write→read path for both bugs covered by this PR. * testNaNInDoubleColumnDoesNotSilentlyPruneRows (apache#18754): writes 4 files in one partition where file 3 contains a NaN value. Asserts that queries against d_double AND queries against unrelated columns (s_str) return identical row sets with data-skipping ON vs OFF. Pre-fix: queries silently dropped rows because file 3's bad parquet stats (min=0/max=0) propagated through the col-stats writer into the MDT and through the partition-stats aggregator. * testTruncatedStringStatsDoNotSilentlyPruneRows (apache#18755): writes 2 files where file 1 contains a long string value that exceeds parquet-mr's stats truncation threshold (~1KB). Asserts that a point-lookup on the short string in the same file returns the row with skipping ON. Pre-fix: the col-stats writer recorded nullCount=valueCount, claiming the file was all-null, so data-skipping silently skipped it. Strategy: assertSkippingNeverDropsRows compares the set of rk values returned with skipping ON to the set returned with skipping OFF, and fails on any divergence. This is the precise correctness contract that data-skipping must preserve — a single asymmetric prune is silent data loss.
* DataSkippingUtils.withStatsAvailable now adds an AND-guard on
{@code valueCount > nullCount} so the OR-IsNull(min) branch only fires
for files with non-null values whose bounds are unknown (the unreliable
case). Without the guard the wrap also fired for legitimately all-null
columns (or test fixtures with unset nullCount), causing legacy
TestDataSkippingUtils#testLookupFilterExpressions parameterized cases
to include files that should have been pruned.
* SparkBaseIndexSupport.getCandidateFiles no longer adds a separate
global null-safety augmentation — the per-predicate wrap in
DataSkippingUtils now covers all cases with the correct semantics.
* TestBaseFileUtils#testGetColumnRangeInPartitionWithNullMinMax updated
to reflect the new semantics: merging an "unreliable null" side
(nullCount<valueCount with null min/max) now propagates null instead
of silently dropping the contribution. Added a companion test pinning
that all-null columns are still dropped from merge correctly.
…ty + naming nits Critical bug surfaced by AI reviewer on HoodieTableMetadataUtil.isStatsUnreliable: the helper compared boxed Long fields (nullCount, valueCount) with !=, which is reference equality. Long.valueOf caches only [-128, 127], so any realistic row count autoboxes to a fresh Long instance and != returns true even when values are numerically equal. This misclassifies legitimately all-null columns as "stats unreliable" during MDT col-stats compaction merge, defeating the all-null pruning fast path. Fix: use !nullCount.equals(valueCount). Sibling helper in HoodieColumnRangeMetadata is unaffected because its fields are primitive long. Also addresses three lower-severity review notes: - Rename DataSkippingUtils.withStatsAvailable → withUnreliableStatsGuard. The old name read as "apply when stats are available" but the helper actually added an OR-null guard for when they are NOT available. - Rename genColumnOnlyValuesEqualToExpression → ...ForNegation, and lift the Not(...) precondition from an inline comment into the helper's javadoc. All 3 current callers already wrap in Not(...), so no semantic change. - Flip comment framing in HoodieColumnRangeMetadata.isStatsUnreliable so the comment describes the true-case (unreliable) sitting next to the return. Test: new testIsStatsUnreliableUsesValueEqualityOnBoxedLong covers six quadrants (all-null large N, all-null small N, value-present, unreliable, missing nullCount, missing valueCount). Verified it fails on the buggy code and passes after the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uard wrap Five CI failures (4 in TestColumnStatsIndex.testTranslateQueryFiltersIntoCol- umnStatsIndexFilterExpr, 1 in TestColumnStatsIndexWithSQL.testTranslateInto- ColumnStatsIndexFilterExpr, repeated across Spark 3.3-4.1) compared the translated col-stats filter expression character-by-character against a hardcoded expected Expression tree. The new null-safety OR-guard added by DataSkippingUtils.withUnreliableStatsGuard for files with unreliable stats (NaN-poisoned, value-size truncated) legitimately changes the translated expression shape from (colA_maxValue > V) to (colA_maxValue > V) OR (isnull(colA_minValue) AND valueCount > colA_nullCount) These tests are exact-match Expression-tree snapshots — production behavior is correct, the test expectations were stale. Updated: - TestColumnStatsIndex.scala: inline expected for "c1 > 1" wrapped in the guard. - TestColumnStatsIndexWithSQL.scala: central generateColStatsExprForGreater- thanOrEquals helper updated to wrap, which fixes all ~20 call sites that use EqualTo translations. - TestPartitionStatsIndex.scala: dead-but-mirrored helper updated to stay in sync (the helper isn't actually called there, but matches the live one to prevent confusion). Verified locally on Spark 3.5: both tests pass (5 total cases). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18792 +/- ##
=========================================
Coverage 68.25% 68.26%
- Complexity 29337 29352 +15
=========================================
Files 2527 2527
Lines 141858 141901 +43
Branches 17627 17639 +12
=========================================
+ Hits 96827 96866 +39
+ Misses 37068 37064 -4
- Partials 7963 7971 +8
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the thorough follow-up! All four prior findings are addressed: the boxed-Long reference-equality bug is fixed with .equals() (and a dedicated regression test that exercises row counts > 127), genColumnOnlyValuesEqualToExpression is renamed to genColumnOnlyValuesEqualToExpressionForNegation with a precondition-documenting Javadoc, withStatsAvailable is renamed to withUnreliableStatsGuard and additionally improved to correctly distinguish all-null columns from unreliable stats via valueCount > nullCount, and the misleading comment in HoodieColumnRangeMetadata.isStatsUnreliable is flipped to describe the truthy case. The refactor from the global guard in SparkBaseIndexSupport to the per-predicate guard at the translator layer is a precision improvement, and the new tests (TestHoodieColumnRangeMetadata, TestDataSkippingWithUnreliableColStats, and the additions to TestParquetUtils) provide solid coverage. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
Describe the issue this Pull Request addresses
Fixes two related correctness bugs in the column-stats metadata table that cause silent zero-row query results when data-skipping is enabled:
[SUPPORT] Single NaN value corrupts col-stats and silently drops query rows across whole table (regression in 1.1.1) #18754 — A single
NaNvalue in aFLOAT/DOUBLEcolumn corrupts the persisted col-stats record for that file. parquet-mr clears the underlying parquet stats (hasNonNullValue=false) but leaves min/max at their default0.0. Hudi was reading those sentinel values and persistingmin=0.0, max=0.0to the MDT. The 1.x read path then used those fabricated bounds, pruning the file at the partition-stats level and dropping rows from any predicate involving that partition — including predicates on completely unrelated columns of files that contain no NaN at all.[SUPPORT] Col-stats MDT mis-records nullCount=valueCount when Parquet column stats are absent (e.g. due to long string values), causing silent zero-row query results #18755 — When a Parquet column omits stats entirely (e.g. one value exceeds parquet-mr's stats truncation threshold ~1KB for binary/string), the col-stats writer was mis-recording
nullCount = valueCount, i.e. "all rows in this file are null". Any non-null predicate then silently skips that file.Both bugs cause silent data loss at query time — no error, no warning, just an incomplete result set.
Summary and Changelog
hudi-hadoop-common/.../ParquetUtils.readColumnStatsFromMetadatadid two things that turned out to be unsafe in the presence of unreliable parquet stats:stats.genericGetMin()/getMax()without checkingstats.hasNonNullValue()— parquet-mr uses that flag to signal "min/max are unreliable, do not trust them" (NaN in FLOAT/DOUBLE).stats.isEmpty() ? valueCount : getNumNulls().isEmpty()istrueboth for all-null columns and for stats-truly-absent columns — the two cases need opposite handling.The error then propagated: even after
ParquetUtilswas fixed to report null min/max for unreliable stats, the per-partition aggregatorHoodieColumnRangeMetadata.mergesilently dropped the null contribution from the unreliable file. This leftPartitionStatsIndexSupport.prunePartitionswith a partial bound that pruned the partition based on the other files' stats — the silent zero-row symptom in #18754.The read-side filter generator in
DataSkippingUtilsalso needed updating: a predicate likecolA_maxValue > Xevaluates to null when the file has null max stat, and Spark treats null as false inWHERE, so the file gets pruned. Wrapping withOR (colMin IS NULL AND valueCount > nullCount)makes the file scanned instead, while still pruning legitimately all-null columns.Changes:
ParquetUtils.readColumnStatsFromMetadata— gate min/max onhasNonNullValue(); gatenullCount = valueCountonisNumNullsSet().HoodieTableMetadataUtil.collectColumnRangeMetadata— skip NaN in the streaming min/max accumulator (Double.compare(NaN, x)returns+1, so an unguarded comparison lets a single NaN poison max forever).HoodieTableMetadataUtil.mergeColumnStatsRecords— distinguish all-null from stats-unreliable when merging MDT records.HoodieColumnRangeMetadata.merge— the per-file → per-partition merge helper used byPartitionStatsIndexSupport. Apply the same unreliability propagation here (the missing piece that kept the read-side bug visible even after the other writer fixes were applied).DataSkippingUtils.scala— wrap min/max-using skip predicates withOR (colMin IS NULL AND valueCount > nullCount). The valueCount-vs-nullCount guard distinguishes unreliable from all-null.Impact
No public API change. Internal write/read code only.
The fix is strictly more conservative than the prior behavior: files with unreliable stats are now scanned instead of silently pruned. Tables that don't contain NaN values and don't have stats-truncated columns are unaffected (the new gates short-circuit when stats are reliable). For tables that do hit either condition, queries previously returning silently-wrong (incomplete) results now return correct results; pruning is less aggressive on the files that triggered the unreliable signal.
Risk Level
Medium. The merge-helper change (
HoodieColumnRangeMetadata.merge) is in a path shared by partition-stats aggregation and any other caller that merges per-file ranges. The semantics shift from "drop null" to "propagate null when stats are unreliable, drop null when column is genuinely all-null". The distinguishing check isnullCount == valueCount, which is the existing convention for the all-null sentinel — no API change, behavior alignment with the documented contract. UpdatedTestBaseFileUtils#testGetColumnRangeInPartitionWithNullMinMaxreflects the new (correct) semantics; the prior assertion silently dropped the null contributor.Documentation Update
No documentation changes required. The bugs were silent regressions against the existing contract that "data-skipping should never change query results".
Contributor's checklist
TestParquetUtils,TestHoodieColumnRangeMetadata,TestHoodieTableMetadataUtil(includingtestIsStatsUnreliableUsesValueEqualityOnBoxedLongregression for the boxed-Long ref-equality bug surfaced in code review),TestDataSkippingUtils; functional test inTestDataSkippingWithUnreliableColStats; updated existingTestBaseFileUtils#testGetColumnRangeInPartitionWithNullMinMaxto reflect new semantics)