Parquet: Fix initial-default rows dropped when filtering on the defaulted column #16692
Open
cbb330 wants to merge 1 commit into
dbc0dff to
bbafc90
pvary
reviewed
Jun 5, 2026
 * @param caseSensitive whether column resolution is case sensitive
 * @return the filter with absent initial-default columns folded to their default value
 */
static Expression replaceMissingColumnDefaults(
Contributor
Why is this not part of the ParquetFilters.convert method?
…lted column

A column added by schema evolution with an initial-default is backfilled with the default at read time, but ParquetMetricsRowGroupFilter evaluates predicates against a column that is absent from a data file as if it were all-null. For a file written before the column existed, a predicate referencing the column -- including the IsNotNull engines infer for null-intolerant predicates -- skipped the row group, silently dropping exactly the rows the default backfills. The full scan was correct, so the case was untested and unobserved.

Make the row-group metrics filter default-aware: when a predicate references a column that is missing from the file but carries an initialDefault, evaluate it against the default value instead of assuming null. The filter already receives the table schema carrying the default, so this is a single check at the predicate dispatch point. Columns present in the file, and missing columns without a default, are unchanged.

Closes apache#16690

Signed-off-by: Christian Bush <chbush@linkedin.com>
bbafc90 to
8326eb7
What
A column added by schema evolution with an initial-default is backfilled with the default at read
time, but ParquetMetricsRowGroupFilter evaluates predicates against a column that is absent from
a data file as if it were all-null. For a file written before the column existed, a predicate
referencing the column (including the IsNotNull that engines infer for null-intolerant predicates)
skipped the row group, silently dropping exactly the rows the default backfills. The full scan
was correct, so the case was untested and unobserved.
Fixes #16690.
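The mismatch described above can be modeled in a few lines. This is a minimal sketch, not Iceberg code: the class `MissingColumnPruningModel` and both of its methods are hypothetical names, the "file" is just a list of ids written before column `c` existed, and a predicate is a plain `java.util.function.Predicate`. It only illustrates why the read path and the pruning path disagree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class MissingColumnPruningModel {

    // Read path (hypothetical model): every row of a file that predates the
    // column reads back with the initial-default, then the predicate is applied.
    public static List<Integer> fullScanThenFilter(
            List<Integer> idsInOldFile, String initialDefault, Predicate<String> pred) {
        List<Integer> out = new ArrayList<>();
        for (int id : idsInOldFile) {
            if (pred.test(initialDefault)) {
                out.add(id);
            }
        }
        return out;
    }

    // Pruning path before the fix (hypothetical model): an absent column is
    // assumed to hold only nulls, so the row group survives only if the
    // predicate accepts null.
    public static boolean rowGroupMightMatchAssumingNull(Predicate<String> pred) {
        return pred.test(null);
    }

    public static void main(String[] args) {
        List<Integer> oldFile = List.of(1);          // row written before "c" was added
        Predicate<String> isNotNull = v -> v != null;

        // The backfilled scan keeps the row ...
        System.out.println(fullScanThenFilter(oldFile, "US", isNotNull)); // prints [1]
        // ... but the null-assuming pruning decision skips the whole row group.
        System.out.println(rowGroupMightMatchAssumingNull(isNotNull));    // prints false
    }
}
```

Because pruning runs before the scan, the `false` decision wins and the row is never read, which is the silent drop this PR reports.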
Repro (Spark 3.5, Parquet, format-version 3)
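| Query | Before fix | With fix |
| --- | --- | --- |
| SELECT id, c | (1,US),(2,US) ✅ | (1,US),(2,US) |
| WHERE c = 'US' | (2) ❌ | (1),(2) ✅ |
| WHERE upper(c) = 'US' | (2) ❌ | (1),(2) ✅ |
| WHERE c IS NOT NULL | (2) ❌ | (1),(2) ✅ |
| WHERE c = 'CA' | () | () ✅ (absent file still excluded) |
| WHERE c IS NULL | () | () ✅ |

Fix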
The drop happens in ParquetMetricsRowGroupFilter, whose per-predicate handlers treat a column
missing from the file as all-null (// the column is not present and is all nulls). The filter is
already constructed with the table read schema, which carries initialDefault, so it has everything
it needs to be correct.
Override the single predicate dispatch point: when a predicate references a column that is missing
from the file but has an initialDefault, evaluate it against the default value
(pred.test(default)) instead of falling through to the null-assuming handlers.
c = 'US' → ROWS_MIGHT_MATCH, c = 'CA' → ROWS_CANNOT_MATCH, IsNotNull(c) → match, IsNull(c) → skip.
Columns present in the file, and missing columns without a default, are unchanged, so present-column
files still prune on real stats, and existing missing-column behavior is preserved.
This mirrors how the filter already gets the schema; it's a fix at the source rather than a rewrite
of the predicate at the call site.
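The dispatch rule above can be sketched as a standalone model. This is an illustration under stated assumptions, not the PR's implementation: `DefaultAwareRowGroupFilter`, its constructor, and `eval` are hypothetical names, predicates are plain `java.util.function.Predicate` objects rather than Iceberg bound predicates, and the present-column branch is reduced to a conservative "might match" because the real filter consults Parquet statistics there.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class DefaultAwareRowGroupFilter {
    public static final boolean ROWS_MIGHT_MATCH = true;
    public static final boolean ROWS_CANNOT_MATCH = false;

    // Hypothetical inputs: columns physically present in the data file, and
    // the initial-default per column as carried by the table read schema.
    private final Set<String> fileColumns;
    private final Map<String, Object> initialDefaults;

    public DefaultAwareRowGroupFilter(Set<String> fileColumns, Map<String, Object> initialDefaults) {
        this.fileColumns = fileColumns;
        this.initialDefaults = initialDefaults;
    }

    // Single dispatch point for a predicate on one column.
    public boolean eval(String column, Predicate<Object> pred) {
        if (!fileColumns.contains(column)) {
            Object initialDefault = initialDefaults.get(column);
            if (initialDefault != null) {
                // Missing from the file but defaulted: every row reads back as
                // the default, so test the predicate against that value.
                return pred.test(initialDefault) ? ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
            }
            // Missing without a default: keep the existing all-null assumption.
            return pred.test(null) ? ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
        }
        // Present in the file: the real filter prunes on column statistics;
        // this sketch has none, so it stays conservative.
        return ROWS_MIGHT_MATCH;
    }

    public static void main(String[] args) {
        DefaultAwareRowGroupFilter filter =
                new DefaultAwareRowGroupFilter(Set.of("id"), Map.of("c", "US"));

        Predicate<Object> isNull = v -> v == null;
        System.out.println(filter.eval("c", "US"::equals));    // prints true  (c = 'US')
        System.out.println(filter.eval("c", "CA"::equals));    // prints false (c = 'CA')
        System.out.println(filter.eval("c", isNull.negate())); // prints true  (IsNotNull(c))
        System.out.println(filter.eval("c", isNull));          // prints false (IsNull(c))
    }
}
```

Keeping the check at the one place where predicates are dispatched means the per-predicate handlers are left alone, which is why columns present in the file and missing columns without a default behave as before.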
Testing
TestMetricsRowGroupFilter#testColumnNotInFileWithInitialDefault (Parquet): unit coverage for the fold (match/non-match/IsNull/IsNotNull/notEqual against the default).
TestDefaultValuesFilteredRead (Spark 3.5): end-to-end; fails on main without the fix.
:iceberg-parquet:test, :iceberg-data:test TestMetricsRowGroupFilter (Parquet + ORC), and :iceberg-arrow:test pass.

Scope / notes
ParquetReader and VectorizedParquetReader both share this filter via ReadConf, so the row and vectorized read paths are both fixed.
no change (verified by the end-to-end test).
injected), so it is already correct. ORC throws UnsupportedOperationException on reading an
initial-default column, so there is no silent drop there.
ParquetFilters.convert) shares the same missing-column assumption but is only reached via the
@Deprecated readSupport reader, not by any engine read; it can get the same treatment as a follow-up if desired.