Parquet: Fix initial-default rows dropped when filtering on the defaulted column #16692

Open
cbb330 wants to merge 1 commit into
apache:main from
cbb330:fix-16690-default-filter-drop

@cbb330 cbb330 commented Jun 5, 2026

What

A column added by schema evolution with an initial-default is backfilled with the default at read
time, but ParquetMetricsRowGroupFilter evaluates predicates against a column that is absent from
a data file as if it were all-null. For a file written before the column existed, a predicate
referencing the column — including the IsNotNull engines infer for null-intolerant predicates —
skipped the row group, silently dropping exactly the rows the default backfills. The full scan
was correct, so the case was untested and unobserved.

Fixes #16690.

Repro (Spark 3.5, Parquet, format-version 3)

CREATE TABLE t (id bigint, name string) USING iceberg TBLPROPERTIES ('format-version'='3');
INSERT INTO t VALUES (1, 'Alice');                          -- written before column c exists
-- table.updateSchema().addColumn("c", StringType, lit("US")).commit();
INSERT INTO t VALUES (2, 'Bob', 'US');                      -- c present

| Query | Before | After |
|---|---|---|
| SELECT id, c | (1,US),(2,US) | (1,US),(2,US) |
| WHERE c = 'US' | (2) | (1),(2) |
| WHERE upper(c) = 'US' | (2) | (1),(2) |
| WHERE c IS NOT NULL | (2) | (1),(2) |
| WHERE c = 'CA' | () | () ✅ (absent file still excluded) |
| WHERE c IS NULL | () | () |
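
The Before/After columns can be reproduced with a small stand-alone model. This is only a sketch under stated assumptions: `ReproScanSketch`, `DataFile`, and `scan` are hypothetical names, each file holds one row, and pruning is reduced to a single boolean per file. It is not the Iceberg read path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of the repro: pruning, default backfill, then the engine's residual filter.
final class ReproScanSketch {
  static final String DEFAULT_C = "US"; // initial-default of column c

  // A one-row data file; hasC is false for the file written before c was added.
  static final class DataFile {
    final long id;
    final boolean hasC;
    final String c;

    DataFile(long id, boolean hasC, String c) {
      this.id = id;
      this.hasC = hasC;
      this.c = c;
    }
  }

  static List<DataFile> reproFiles() {
    return Arrays.asList(new DataFile(1L, false, null), new DataFile(2L, true, "US"));
  }

  // defaultAware=false models the old behavior, defaultAware=true models the fix.
  static List<Long> scan(List<DataFile> files, Predicate<String> pred, boolean defaultAware) {
    List<Long> ids = new ArrayList<>();
    for (DataFile file : files) {
      // Pruning: a missing column is judged as null (before) or as its default (after).
      // Stats-based pruning of present columns is not modeled, so those files are always read.
      boolean mightMatch = file.hasC || pred.test(defaultAware ? DEFAULT_C : null);
      if (!mightMatch) {
        continue; // whole file skipped, its row never reaches the engine
      }
      // Read path: backfill the default, then apply the residual filter.
      String value = file.hasC ? file.c : DEFAULT_C;
      if (pred.test(value)) {
        ids.add(file.id);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    Predicate<String> eqUs = v -> "US".equals(v);
    System.out.println(scan(reproFiles(), eqUs, false)); // [2]
    System.out.println(scan(reproFiles(), eqUs, true));  // [1, 2]
  }
}
```

In this model `c IS NULL` returns no rows in both modes: before the fix the old file is read but the residual filter rejects the backfilled value, and after the fix the file is pruned outright.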

Fix

The drop happens in ParquetMetricsRowGroupFilter, whose per-predicate handlers treat a column
missing from the file as all-null (// the column is not present and is all nulls). The filter is
already constructed with the table read schema, which carries initialDefault, so it has everything
it needs to be correct.

Override the single predicate dispatch point: when a predicate references a column that is missing
from the file but has an initialDefault, evaluate it against the default value
(pred.test(default)) instead of falling through to the null-assuming handlers. c = 'US' →
ROWS_MIGHT_MATCH, c = 'CA' → ROWS_CANNOT_MATCH, IsNotNull(c) → match, IsNull(c) → skip.
Columns present in the file, and missing columns without a default, are unchanged — so present-column
files still prune on real stats, and existing missing-column behavior is preserved.

This mirrors how the filter already gets the schema; it's a fix at the source rather than a rewrite
of the predicate at the call site.
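
The dispatch check described above can be sketched roughly as follows. This is a simplified, hypothetical model rather than the code in this PR: `DefaultAwareDispatchSketch` and `eval` are illustrative names, predicates are plain `java.util.function.Predicate<String>` values, and columns present in the file always return ROWS_MIGHT_MATCH because stats-based pruning is not modeled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Simplified, hypothetical model of the dispatch check; not the Iceberg implementation.
final class DefaultAwareDispatchSketch {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  static boolean eval(
      Set<String> fileColumns,
      Map<String, String> initialDefaults,
      String column,
      Predicate<String> pred) {
    if (fileColumns.contains(column)) {
      // Column is in the file: a real filter would prune on min/max and null counts here.
      return ROWS_MIGHT_MATCH;
    }
    String initialDefault = initialDefaults.get(column);
    if (initialDefault != null) {
      // Missing from the file but defaulted: every row reads as the default value.
      return pred.test(initialDefault) ? ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
    }
    // Missing with no default: keep the existing all-null assumption.
    return pred.test(null) ? ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
  }

  public static void main(String[] args) {
    Set<String> oldFile = new HashSet<>(Arrays.asList("id", "name")); // written before c existed
    Map<String, String> defaults = Collections.singletonMap("c", "US");
    System.out.println(eval(oldFile, defaults, "c", v -> "US".equals(v))); // true
    System.out.println(eval(oldFile, defaults, "c", v -> "CA".equals(v))); // false
    System.out.println(eval(oldFile, defaults, "c", v -> v != null));      // true
    System.out.println(eval(oldFile, defaults, "c", v -> v == null));      // false
  }
}
```

Passing an empty defaults map to `eval` exercises the unchanged path: a missing column without a default is still treated as all-null, so only an IsNull-style predicate keeps the row group.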

Testing

  • TestMetricsRowGroupFilter#testColumnNotInFileWithInitialDefault (Parquet) — unit coverage for the
    fold: match/non-match/IsNull/IsNotNull/notEqual against the default.
  • TestDefaultValuesFilteredRead (Spark 3.5) — end-to-end; fails on main without the fix.
  • Full :iceberg-parquet:test, :iceberg-data:test TestMetricsRowGroupFilter (Parquet + ORC), and
    :iceberg-arrow:test pass.

Scope / notes

  • ParquetReader and VectorizedParquetReader both share this filter via ReadConf, so the row and
    vectorized read paths are both fixed.
  • The dictionary and bloom row-group filters do not skip a column missing from the file, so they need
    no change (verified by the end-to-end test).
  • Avro applies no file-level filtering (residual is applied by the engine after the default is
    injected), so it is already correct. ORC throws UnsupportedOperationException on reading an
    initial-default column, so there is no silent drop there.
  • The record-level filter path (ParquetFilters.convert) shares the same missing-column assumption
    but is only reached via the @Deprecated readSupport reader, not by any engine read; it can get
    the same treatment as a follow-up if desired.

@cbb330 cbb330 force-pushed the fix-16690-default-filter-drop branch from dbc0dff to bbafc90 Compare June 5, 2026 09:40
* @param caseSensitive whether column resolution is case sensitive
* @return the filter with absent initial-default columns folded to their default value
*/
static Expression replaceMissingColumnDefaults(

Why is this not part of the ParquetFilters.convert method?

…lted column

A column added by schema evolution with an initial-default is backfilled with
the default at read time, but ParquetMetricsRowGroupFilter evaluates predicates
against a column that is absent from a data file as if it were all-null. For a
file written before the column existed, a predicate referencing the column --
including the IsNotNull engines infer for null-intolerant predicates -- skipped
the row group, silently dropping exactly the rows the default backfills. The
full scan was correct, so the case was untested and unobserved.

Make the row-group metrics filter default-aware: when a predicate references a
column that is missing from the file but carries an initialDefault, evaluate it
against the default value instead of assuming null. The filter already receives
the table schema carrying the default, so this is a single check at the
predicate dispatch point. Columns present in the file, and missing columns
without a default, are unchanged.

Closes apache#16690

Signed-off-by: Christian Bush <chbush@linkedin.com>
@cbb330 cbb330 force-pushed the fix-16690-default-filter-drop branch from bbafc90 to 8326eb7 Compare June 5, 2026 20:33
@github-actions github-actions Bot added the data label Jun 5, 2026
Development

Successfully merging this pull request may close these issues.

v3 initial-default rows are silently dropped when a query filters on the defaulted column

2 participants