Parquet: handle AlwaysFalse in ParquetFilters.convert() by laserninja · Pull Request #16110 · apache/iceberg

laserninja · 2026-04-25T19:30:55Z

Summary

Handle AlwaysFalse.INSTANCE in ParquetFilters.convert() so it returns FilterCompat.NOOP instead of passing the Iceberg-internal sentinel through to Parquet.

Problem

Fixes #16032

ParquetFilters.convert() had a TODO comment acknowledging that AlwaysFalse.INSTANCE was not handled. The method only guarded against AlwaysTrue.INSTANCE before calling FilterCompat.get(pred). Parquet does not understand Iceberg's internal AlwaysFalse type, so the resulting filter was undefined — it could silently read all rows instead of none.

Solution

Added && pred != AlwaysFalse.INSTANCE to the sentinel check in convert(), matching the existing AlwaysTrue.INSTANCE guard. When the expression reduces to AlwaysFalse, the method returns FilterCompat.NOOP. This is safe because Iceberg's own manifest-level filtering has already ensured no matching row groups will be scanned.

Testing

Added TestParquetFilters with five unit tests:

alwaysFalseReturnsNoop — regression guard for the exact bug
alwaysTrueReturnsNoop — confirms existing AlwaysTrue behaviour
realPredicateReturnsFilter — confirms real predicates still produce active filters
notAlwaysFalseReturnsNoop — not(alwaysFalse) resolves via the visitor
andWithAlwaysFalseReturnsNoop — and(alwaysFalse, pred) resolves via the visitor

All existing iceberg-parquet tests pass.

ParquetFilters.convert() passed AlwaysFalse.INSTANCE through to FilterCompat.get(pred), but Parquet does not understand Iceberg's internal AlwaysFalse sentinel. This caused the filter to be applied incorrectly and potentially read all rows instead of none. Add an explicit AlwaysFalse check alongside the existing AlwaysTrue check so that both sentinels return FilterCompat.NOOP. Row-group pruning for always-false predicates is handled by Iceberg's own manifest filtering before Parquet is ever invoked. Fixes apache#16032

pvary · 2026-04-27T12:44:21Z

I have concerns about the proposed solution. Originally we converted the input filter to the equivalent Parquet filter. Following that line of thought, AlwaysFalse should be converted to a filter that drops every record. Instead, we assume user behavior and expect that the user will have filtered everything out before reaching this point.

laserninja · 2026-04-30T03:21:44Z

Yes, I agree NOOP is the wrong semantic. NOOP means "pass everything through", which is the opposite of AlwaysFalse. My rationale (that Iceberg's manifest filtering would have already excluded all relevant files) is an assumption about callers that shouldn't be baked into this method.

The challenge is that Parquet's FilterApi doesn't expose a native alwaysFalse() predicate. A few approaches I can see:

Visitor-level fix: have ConvertFilterToParquet.alwaysFalse() return a UserDefinedPredicate that always returns false. This keeps the semantics correct at the Parquet level but requires picking a concrete column type.

Propagate as a special case: detect the AlwaysFalse.INSTANCE sentinel in convert() and either return a RecordFilter or throw, making it the caller's responsibility to avoid this path.
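The first approach can be sketched as follows. Parquet's UserDefinedPredicate is an abstract class with keep, canDrop, and inverseCanDrop methods. To keep the example runnable without the Parquet dependency, the hypothetical Udp class below only mirrors that shape, and MinMax is a simplified stand-in for Parquet's Statistics. All names here are illustrative assumptions, not code from the PR.

```java
/** Hypothetical sketch; Udp and MinMax are stand-ins, not Parquet classes. */
final class DropAllSketch {
  /** Simplified stand-in for the column statistics of one row group. */
  static final class MinMax<T extends Comparable<T>> {
    final T min;
    final T max;

    MinMax(T min, T max) {
      this.min = min;
      this.max = max;
    }
  }

  /** Mirrors the three-method shape of a user-defined predicate. */
  abstract static class Udp<T extends Comparable<T>> {
    /** Row level: should this value be kept? */
    abstract boolean keep(T value);

    /** Row-group level: can the whole group be skipped for this predicate? */
    abstract boolean canDrop(MinMax<T> stats);

    /** Row-group level: can the whole group be skipped for not(this predicate)? */
    abstract boolean inverseCanDrop(MinMax<T> stats);
  }

  /** Always-false predicate: rejects every value and lets every row group be skipped. */
  static final class DropAll<T extends Comparable<T>> extends Udp<T> {
    @Override
    boolean keep(T value) {
      return false;
    }

    @Override
    boolean canDrop(MinMax<T> stats) {
      return true;
    }

    @Override
    boolean inverseCanDrop(MinMax<T> stats) {
      // not(false) is true for every row, so the inverse can never drop a group.
      return false;
    }
  }

  public static void main(String[] args) {
    Udp<Integer> dropAll = new DropAll<>();
    MinMax<Integer> stats = new MinMax<>(1, 100);
    System.out.println(dropAll.keep(42));              // false
    System.out.println(dropAll.canDrop(stats));        // true
    System.out.println(dropAll.inverseCanDrop(stats)); // false
  }
}
```

The type parameter is where the "concrete column type" problem shows up: a real predicate of this kind has to be attached to a column whose type matches T.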

I'm open to suggestions. Is there an existing pattern in the codebase you'd prefer?
I'll update the PR with whichever direction makes the most sense.

pvary · 2026-04-30T10:55:34Z

@laserninja: Could we have a TableScan based test case where this issue is highlighted? Could this happen in any way with the current code?

laserninja · 2026-05-02T05:30:25Z

Can this happen with current code?

Yes, but not through a direct alwaysFalse() scan filter. When table.newScan().filter(Expressions.alwaysFalse()) is used, Iceberg's manifest-level filtering short-circuits and returns no file tasks, so ParquetFilters.convert() is never called. However, the bug is reachable when:

A predicate is applied on a column that does not exist in an older Parquet file (schema evolution) — the Iceberg manifest evaluator may include the file (null bounds → "could match"), but ConvertFilterToParquet binds against the Parquet file's schema, resolves the predicate to AlwaysFalse.INSTANCE, and the bug triggers.
The lower-level Parquet.read() API is called directly with an alwaysFalse() filter.

What the bug actually does: Both AlwaysTrue and AlwaysFalse are Iceberg-internal FilterPredicate placeholders whose accept() method throws UnsupportedOperationException("AlwaysTrue is a placeholder only"). So the bug causes an exception, not silently incorrect results.
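The failure mode described above can be modelled in a few lines. This is a minimal sketch with hypothetical stand-in types (Pred, Leaf, Placeholder, convertUnguarded); it does not use the Iceberg or Parquet classes. It assumes only what the thread states: the sentinels are compared by identity, their accept() throws, and the pre-fix code guarded only the always-true sentinel before handing the predicate to something that visits it.

```java
/** Hypothetical sketch; none of these types are the Iceberg or Parquet classes. */
final class PlaceholderSketch {
  interface Visitor<R> {
    R visitLeaf(String column);
  }

  interface Pred {
    <R> R accept(Visitor<R> visitor);
  }

  /** A real predicate dispatches to the visitor. */
  static final class Leaf implements Pred {
    final String column;

    Leaf(String column) {
      this.column = column;
    }

    @Override
    public <R> R accept(Visitor<R> visitor) {
      return visitor.visitLeaf(column);
    }
  }

  /** A sentinel that is meant to be compared by identity and never visited. */
  static final class Placeholder implements Pred {
    static final Placeholder ALWAYS_TRUE = new Placeholder();
    static final Placeholder ALWAYS_FALSE = new Placeholder();

    @Override
    public <R> R accept(Visitor<R> visitor) {
      throw new UnsupportedOperationException("placeholder only");
    }
  }

  /** Guards only ALWAYS_TRUE, so ALWAYS_FALSE reaches the visitor and throws. */
  static String convertUnguarded(Pred pred) {
    if (pred != null && pred != Placeholder.ALWAYS_TRUE) {
      Visitor<String> describe = column -> "filter on " + column;
      return pred.accept(describe);
    }
    return "NOOP";
  }

  public static void main(String[] args) {
    System.out.println(convertUnguarded(new Leaf("id")));         // filter on id
    System.out.println(convertUnguarded(Placeholder.ALWAYS_TRUE)); // NOOP
    try {
      convertUnguarded(Placeholder.ALWAYS_FALSE);
    } catch (UnsupportedOperationException e) {
      System.out.println("threw: " + e.getMessage());             // threw: placeholder only
    }
  }
}
```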

On a TableScan-level test: I can write one using the schema evolution scenario above (write files without a column, add the column, scan with a predicate on that column, and verify that no exception is thrown). Would you prefer that scenario, or do you have a specific case in mind? I can also add a Parquet.read()-level test that directly demonstrates the exception being thrown before the fix.

pvary · 2026-05-04T09:45:37Z

However, the bug is reachable when:

A predicate is applied on a column that does not exist in an older Parquet file (schema evolution) — the Iceberg manifest evaluator may include the file (null bounds → "could match"), but ConvertFilterToParquet binds against the Parquet file's schema, resolves the predicate to AlwaysFalse.INSTANCE, and the bug triggers.

I think this is reproducible, and we can write a test for it.
Maybe the solution is to only push down filters to the Parquet files where the columns are actually present in the file.

CC: @gaborkaszab

The lower-level Parquet.read() API is called directly with an alwaysFalse() filter.

In my opinion this is a user error.

WDYT?

laserninja · 2026-05-08T05:44:04Z

Agreed on both points.

Agree that Parquet.read() with alwaysFalse() directly is user error, not something we need to protect against.

The schema evolution scenario is the real bug, and "only push down where the columns are present" is a cleaner fix than what this PR currently has. I see two ways to implement it:

1. In Parquet.java, before calling ParquetFilters.convert(fileSchema, filter, caseSensitive), check whether the columns referenced by filter are all present in fileSchema. If any are missing, fall back to FilterCompat.NOOP for that file. Row-level correctness still holds because Parquet will return all rows and Iceberg's own residual evaluation will handle the rest.

2. In ConvertFilterToParquet, when binding a predicate against a column that doesn't exist in the file schema, return null instead of AlwaysFalse.INSTANCE, and propagate nulls through and/or/not to indicate "not pushable" rather than "false". convert() then treats a null result the same way AlwaysFalse.INSTANCE is treated today: it falls back to NOOP.

I think option 1 is simpler and more transparent. Happy to update the PR with the schema evolution test and the Parquet.java-level column-presence check. Let me know if you (or @gaborkaszab) have a preference.
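A minimal sketch of the column-presence check in option 1 is shown below. Expr, referencedColumns, and canPushDown are hypothetical names, and columns are matched by name only for brevity; how the real check would identify columns (for example by field ID) is left open here.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of option 1; Expr and canPushDown are illustrative names. */
final class ColumnPresenceSketch {
  /** Minimal expression tree: a leaf references one column, a node combines children. */
  static final class Expr {
    final String column;   // non-null for leaves, null for and/or/not nodes
    final Expr[] children; // empty for leaves

    private Expr(String column, Expr... children) {
      this.column = column;
      this.children = children;
    }

    static Expr leaf(String column) {
      return new Expr(column);
    }

    static Expr node(Expr... children) {
      return new Expr(null, children);
    }
  }

  /** Collects every column name referenced anywhere in the expression. */
  static Set<String> referencedColumns(Expr expr) {
    Set<String> out = new HashSet<>();
    collect(expr, out);
    return out;
  }

  private static void collect(Expr expr, Set<String> out) {
    if (expr.column != null) {
      out.add(expr.column);
    }
    for (Expr child : expr.children) {
      collect(child, out);
    }
  }

  /** Push the filter down only when every referenced column exists in the file. */
  static boolean canPushDown(Expr filter, Set<String> fileColumns) {
    return fileColumns.containsAll(referencedColumns(filter));
  }

  public static void main(String[] args) {
    Set<String> fileColumns = new HashSet<>();
    fileColumns.add("id");
    fileColumns.add("name");

    Expr onOldColumns = Expr.node(Expr.leaf("id"), Expr.leaf("name"));
    Expr onNewColumn = Expr.node(Expr.leaf("id"), Expr.leaf("added_later"));

    System.out.println(canPushDown(onOldColumns, fileColumns)); // true
    System.out.println(canPushDown(onNewColumn, fileColumns));  // false
  }
}
```

The check is all-or-nothing: a single missing column disables pushdown for the whole filter on that file, which is the trade-off against option 2.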

pvary · 2026-05-08T09:08:43Z

Shall we add a new predicate for the Ignored columns and handle this upstream in the expression? Then we could have a "correct" result in many cases.

  /**
   * Sentinel marker for a predicate that referenced a column outside the allowed set and therefore
   * cannot be evaluated. Distinct from {@link AlwaysTrue} so that {@code not(IGNORED)} does not
   * become {@code AlwaysFalse} and {@code or(x, IGNORED)} does not silently drop matching rows.
   */
  private static class Ignored implements FilterPredicate {
    static final Ignored INSTANCE = new Ignored();

    @Override
    public <R> R accept(Visitor<R> visitor) {
      throw new UnsupportedOperationException("Ignored is a placeholder only");
    }
  }

laserninja · 2026-05-11T06:19:48Z

Ignored can be the right abstraction. The key semantic difference from AlwaysFalse is in or: or(x, AlwaysFalse) correctly simplifies to x, but or(x, Ignored) must stay Ignored because the missing column could have matching rows, so we can't push down the OR at all.

Proposed semantics:

not(Ignored) → Ignored
or(x, Ignored) / or(Ignored, x) → Ignored (can't push; might miss rows)
and(real, Ignored) / and(Ignored, real) → real (safe to push the resolvable side; AND with an ignored term can only be more restrictive)
and(Ignored, Ignored) → Ignored
convert(): treat Ignored.INSTANCE same as AlwaysTrue.INSTANCE → NOOP
This gives "correct result in many cases" for AND-heavy filters on partially evolved files.
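The proposed rules can be written down as a small algebra and checked in isolation. The sketch below is not the PR's code: Pred, leaf, and the and/or/not/convert helpers are hypothetical stand-ins for the predicates that ConvertFilterToParquet would produce, with pushable predicates kept as plain text.

```java
import java.util.Objects;

/** Hypothetical sketch of the proposed rules; not the Parquet FilterPredicate API. */
final class IgnoredSemanticsSketch {
  /** Either a pushable predicate (kept as text here) or the IGNORED sentinel. */
  static final class Pred {
    static final Pred IGNORED = new Pred("IGNORED");
    final String text;

    private Pred(String text) {
      this.text = Objects.requireNonNull(text);
    }

    static Pred leaf(String text) {
      return new Pred(text);
    }

    boolean isIgnored() {
      return this == IGNORED; // the sentinel is compared by identity
    }

    @Override
    public String toString() {
      return text;
    }
  }

  /** not(Ignored) -> Ignored: negating an unknown term is still unknown. */
  static Pred not(Pred child) {
    return child.isIgnored() ? Pred.IGNORED : Pred.leaf("not(" + child + ")");
  }

  /** or(x, Ignored) -> Ignored: the unknown side might match rows that x rejects. */
  static Pred or(Pred left, Pred right) {
    if (left.isIgnored() || right.isIgnored()) {
      return Pred.IGNORED;
    }
    return Pred.leaf("or(" + left + ", " + right + ")");
  }

  /** and(real, Ignored) -> real: dropping a conjunct only widens the pushed filter. */
  static Pred and(Pred left, Pred right) {
    if (left.isIgnored()) {
      return right; // also covers and(Ignored, Ignored) -> Ignored
    }
    if (right.isIgnored()) {
      return left;
    }
    return Pred.leaf("and(" + left + ", " + right + ")");
  }

  /** An Ignored root means nothing can be pushed down, modelled here as "NOOP". */
  static String convert(Pred root) {
    return root.isIgnored() ? "NOOP" : "FILTER[" + root + "]";
  }

  public static void main(String[] args) {
    Pred x = Pred.leaf("x > 5");
    System.out.println(convert(and(x, Pred.IGNORED))); // FILTER[x > 5]
    System.out.println(convert(or(x, Pred.IGNORED)));  // NOOP
    System.out.println(convert(not(Pred.IGNORED)));    // NOOP
  }
}
```

One thing the sketch makes easy to check: dropping the Ignored side of an and is a widening only when the and is not itself nested under a not. With these rules, not(and(real, Ignored)) reduces to not(real), which is narrower than the original expression, so the and rule appears to rely on negations having been pushed down to the leaves before conversion. That precondition is an assumption here; the thread does not state it.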

I'll update the PR with:

The Ignored sentinel class and the visitor changes above
A TableScan-level integration test using the schema evolution scenario (write file without column, add column, scan with predicate on the new column)
Updated unit tests in TestParquetFilters
Does the and semantic look right to you, or would you prefer and(real, Ignored) → Ignored (simpler, NOOP the whole thing)?

github-actions Bot added the parquet label Apr 25, 2026

laserninja wants to merge 1 commit into apache:main from laserninja:fix/16032-parquet-filters-handle-always-false
