Parquet: Fix initial-default rows dropped when filtering on the defaulted column by cbb330 · Pull Request #16692 · apache/iceberg

cbb330 · 2026-06-05T09:29:00Z

What

A column added by schema evolution with an initial-default is backfilled with the default at read
time, but ParquetMetricsRowGroupFilter evaluates predicates against a column that is absent from
a data file as if it were all-null. For a file written before the column existed, a predicate
referencing the column — including the IsNotNull engines infer for null-intolerant predicates —
skipped the row group, silently dropping exactly the rows the default backfills. The full scan
was correct, so the case was untested and unobserved.
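
The pruning decision described above can be sketched with a small standalone model. This is an illustration only, not Iceberg code: the class `AbsentColumnAsNull`, the `Op` enum, and `rowsMightMatch` are hypothetical names, and real statistics handling for present columns is reduced to a placeholder.

```java
import java.util.Set;

// Hypothetical, simplified model of the pre-fix behavior; not Iceberg's actual classes.
class AbsentColumnAsNull {
  enum Op { EQ, IS_NULL, NOT_NULL }

  // Mimics a metrics filter that treats a column missing from the file as all nulls.
  static boolean rowsMightMatch(Set<String> fileColumns, String column, Op op) {
    if (!fileColumns.contains(column)) {
      // Every value is assumed null, so only IS NULL can match.
      return op == Op.IS_NULL;
    }
    // Column present: a real filter would consult column stats here.
    return true;
  }

  public static void main(String[] args) {
    Set<String> oldFile = Set.of("id", "name"); // written before column c existed
    System.out.println(rowsMightMatch(oldFile, "c", Op.EQ));       // false: row group skipped
    System.out.println(rowsMightMatch(oldFile, "c", Op.NOT_NULL)); // false: row group skipped
    System.out.println(rowsMightMatch(oldFile, "c", Op.IS_NULL));  // true: row group read
  }
}
```

In this model the first two predicates skip the old file's row group even though every row in it would read back as the default, which is the silent drop this PR addresses.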

Fixes #16690.

Repro (Spark 3.5, Parquet, format-version 3)

CREATE TABLE t (id bigint, name string) USING iceberg TBLPROPERTIES ('format-version'='3');
INSERT INTO t VALUES (1, 'Alice');                          -- written before column c exists
-- table.updateSchema().addColumn("c", StringType, lit("US")).commit();
INSERT INTO t VALUES (2, 'Bob', 'US');                      -- c present

| Query | Before | After |
| --- | --- | --- |
| `SELECT id, c` | `(1,US),(2,US)` ✅ | `(1,US),(2,US)` |
| `WHERE c = 'US'` | `(2)` ❌ | `(1),(2)` ✅ |
| `WHERE upper(c) = 'US'` | `(2)` ❌ | `(1),(2)` ✅ |
| `WHERE c IS NOT NULL` | `(2)` ❌ | `(1),(2)` ✅ |
| `WHERE c = 'CA'` | `()` | `()` ✅ (absent file still excluded) |
| `WHERE c IS NULL` | `()` | `()` ✅ |

Fix

The drop happens in ParquetMetricsRowGroupFilter, whose per-predicate handlers treat a column
missing from the file as all-null (// the column is not present and is all nulls). The filter is
already constructed with the table read schema, which carries initialDefault, so it has everything
it needs to be correct.

Override the single predicate dispatch point: when a predicate references a column that is missing
from the file but has an initialDefault, evaluate it against the default value
(pred.test(default)) instead of falling through to the null-assuming handlers. c = 'US' →
ROWS_MIGHT_MATCH, c = 'CA' → ROWS_CANNOT_MATCH, IsNotNull(c) → match, IsNull(c) → skip.
Columns present in the file, and missing columns without a default, are unchanged — so present-column
files still prune on real stats, and existing missing-column behavior is preserved.

This mirrors how the filter already gets the schema; it's a fix at the source rather than a rewrite
of the predicate at the call site.
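
A minimal sketch of the dispatch rule described in this section, assuming string-typed columns and plain Java predicates in place of Iceberg expressions. The class `DefaultAwareDispatch`, the method `eval`, and the `initialDefaults` map are hypothetical names used for illustration; they are not the PR's actual code.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical, simplified model of the default-aware dispatch; not Iceberg's actual classes.
class DefaultAwareDispatch {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  static boolean eval(
      Set<String> fileColumns,
      Map<String, String> initialDefaults,
      String column,
      Predicate<String> pred) {
    if (fileColumns.contains(column)) {
      // Present column: a real filter would prune on column stats here.
      return ROWS_MIGHT_MATCH;
    }
    String initialDefault = initialDefaults.get(column);
    if (initialDefault != null) {
      // Missing column with an initial default: every row reads back as the default.
      return pred.test(initialDefault) ? ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
    }
    // Missing column without a default: keep the all-null assumption.
    return pred.test(null) ? ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
  }

  public static void main(String[] args) {
    Set<String> oldFile = Set.of("id", "name");       // written before column c existed
    Map<String, String> defaults = Map.of("c", "US"); // c was added with initial default 'US'

    System.out.println(eval(oldFile, defaults, "c", v -> "US".equals(v))); // true
    System.out.println(eval(oldFile, defaults, "c", v -> "CA".equals(v))); // false
    System.out.println(eval(oldFile, defaults, "c", v -> v != null));      // true
    System.out.println(eval(oldFile, defaults, "c", v -> v == null));      // false
    System.out.println(eval(oldFile, defaults, "d", v -> v == null));      // true
  }
}
```

The last call uses a hypothetical column `d` with no default to show that the null-assuming path is left as it was.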

Testing

- TestMetricsRowGroupFilter#testColumnNotInFileWithInitialDefault (Parquet): unit coverage for the
  fold: match/non-match/IsNull/IsNotNull/notEqual against the default.
- TestDefaultValuesFilteredRead (Spark 3.5): end-to-end; fails on main without the fix.
- Full :iceberg-parquet:test, :iceberg-data:test TestMetricsRowGroupFilter (Parquet + ORC), and
  :iceberg-arrow:test pass.

Scope / notes

- ParquetReader and VectorizedParquetReader both share this filter via ReadConf, so the row and
  vectorized read paths are both fixed.
- The dictionary and bloom row-group filters do not skip a column missing from the file, so they need
  no change (verified by the end-to-end test).
- Avro applies no file-level filtering (residual is applied by the engine after the default is
  injected), so it is already correct. ORC throws UnsupportedOperationException on reading an
  initial-default column, so there is no silent drop there.
- The record-level filter path (ParquetFilters.convert) shares the same missing-column assumption
  but is only reached via the @Deprecated readSupport reader, not by any engine read; it can get
  the same treatment as a follow-up if desired.

pvary · 2026-06-05T10:37:12Z

+   * @param caseSensitive whether column resolution is case sensitive
+   * @return the filter with absent initial-default columns folded to their default value
+   */
+  static Expression replaceMissingColumnDefaults(


Why is this not part of the ParquetFilters.convert method?

…lted column

A column added by schema evolution with an initial-default is backfilled with the default at read time, but ParquetMetricsRowGroupFilter evaluates predicates against a column that is absent from a data file as if it were all-null. For a file written before the column existed, a predicate referencing the column -- including the IsNotNull engines infer for null-intolerant predicates -- skipped the row group, silently dropping exactly the rows the default backfills. The full scan was correct, so the case was untested and unobserved.

Make the row-group metrics filter default-aware: when a predicate references a column that is missing from the file but carries an initialDefault, evaluate it against the default value instead of assuming null. The filter already receives the table schema carrying the default, so this is a single check at the predicate dispatch point. Columns present in the file, and missing columns without a default, are unchanged.

Closes apache#16690

Signed-off-by: Christian Bush <chbush@linkedin.com>

github-actions Bot added spark parquet labels Jun 5, 2026

cbb330 force-pushed the fix-16690-default-filter-drop branch from dbc0dff to bbafc90 Compare June 5, 2026 09:40

pvary reviewed Jun 5, 2026

cbb330 force-pushed the fix-16690-default-filter-drop branch from bbafc90 to 8326eb7 Compare June 5, 2026 20:33

github-actions Bot added the data label Jun 5, 2026

cbb330 wants to merge 1 commit into apache:main from cbb330:fix-16690-default-filter-drop
