Parquet: handle AlwaysFalse in ParquetFilters.convert() #16110

Open
laserninja wants to merge 1 commit into apache:main from laserninja:fix/16032-parquet-filters-handle-always-false

Conversation

@laserninja
Contributor

Summary

Handle AlwaysFalse.INSTANCE in ParquetFilters.convert() so it returns FilterCompat.NOOP instead of passing the Iceberg-internal sentinel through to Parquet.

Problem

Fixes #16032

ParquetFilters.convert() had a TODO comment acknowledging that AlwaysFalse.INSTANCE was not handled. The method only guarded against AlwaysTrue.INSTANCE before calling FilterCompat.get(pred). Parquet does not understand Iceberg's internal AlwaysFalse type, so the resulting filter was undefined — it could silently read all rows instead of none.

Solution

Added && pred != AlwaysFalse.INSTANCE to the sentinel check in convert(), matching the existing AlwaysTrue.INSTANCE guard. When the expression reduces to AlwaysFalse, the method returns FilterCompat.NOOP. This is safe because Iceberg's own manifest-level filtering has already ensured no matching row groups will be scanned.

Testing

Added TestParquetFilters with five unit tests:

  • alwaysFalseReturnsNoop: regression guard for the exact bug
  • alwaysTrueReturnsNoop: confirms existing AlwaysTrue behaviour
  • realPredicateReturnsFilter: confirms real predicates still produce active filters
  • notAlwaysFalseReturnsNoop: not(alwaysFalse) resolves via the visitor
  • andWithAlwaysFalseReturnsNoop: and(alwaysFalse, pred) resolves via the visitor

All existing iceberg-parquet tests pass.

ParquetFilters.convert() passed AlwaysFalse.INSTANCE through to
FilterCompat.get(pred), but Parquet does not understand Iceberg's
internal AlwaysFalse sentinel. This caused the filter to be applied
incorrectly and potentially read all rows instead of none.

Add an explicit AlwaysFalse check alongside the existing AlwaysTrue
check so that both sentinels return FilterCompat.NOOP. Row-group
pruning for always-false predicates is handled by Iceberg's own
manifest filtering before Parquet is ever invoked.

Fixes apache#16032
@pvary
Contributor

pvary commented Apr 27, 2026

I have concerns about the proposed solution. Originally we converted the input filter to the equivalent Parquet filter. Following that line of thought, AlwaysFalse should be converted to a filter that drops every record. Instead, we are assuming user behavior and expecting the caller to have filtered everything out before reaching this point.

@laserninja
Contributor Author

laserninja commented Apr 30, 2026

Yes, I agree NOOP is the wrong semantic. NOOP means "pass everything through", which is the opposite of AlwaysFalse. My rationale (that Iceberg's manifest filtering would have already excluded all relevant files) is an assumption about callers that shouldn't be baked into this method.

The challenge is that Parquet's FilterApi doesn't expose a native alwaysFalse() predicate. I can see two approaches:

  • Visitor-level fix: have ConvertFilterToParquet.alwaysFalse() return a UserDefinedPredicate that always returns false. This keeps the semantics correct at the Parquet level but requires picking a concrete column type.
  • Propagate as a special case: detect the AlwaysFalse.INSTANCE sentinel in convert() and either return a RecordFilter or throw, making it the caller's responsibility to avoid this path.

Is there an existing pattern in the codebase you'd prefer? I'll update the PR with whichever direction makes the most sense.

@pvary
Contributor

pvary commented Apr 30, 2026

@laserninja: Could we have a TableScan-based test case that highlights this issue? Could this happen in any way with the current code?

@laserninja
Contributor Author

laserninja commented May 2, 2026

Can this happen with current code?

Yes, but not through a direct alwaysFalse() scan filter. When table.newScan().filter(Expressions.alwaysFalse()) is used, Iceberg's manifest-level filtering short-circuits and returns no file tasks, so ParquetFilters.convert() is never called. However, the bug is reachable when:

  • A predicate is applied on a column that does not exist in an older Parquet file (schema evolution) — the Iceberg manifest evaluator may include the file (null bounds → "could match"), but ConvertFilterToParquet binds against the Parquet file's schema, resolves the predicate to AlwaysFalse.INSTANCE, and the bug triggers.
  • The lower-level Parquet.read() API is called directly with an alwaysFalse() filter.

What the bug actually does: Both AlwaysTrue and AlwaysFalse are Iceberg-internal FilterPredicate placeholders whose accept() method throws UnsupportedOperationException("AlwaysTrue is a placeholder only"). So the bug causes an exception, not silently incorrect results.

For a TableScan-level test, I can use the schema evolution scenario above: write files without a column, add the column, scan with a predicate on that column, and verify that no exception is thrown. Would you prefer that scenario, or do you have a specific case in mind? I can also add a Parquet.read()-level test that directly demonstrates the exception being thrown before the fix.
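To make the failure mode concrete, here is a minimal sketch of a placeholder predicate whose accept() throws, next to a sentinel check that avoids visiting it. Every type here (Visitor, FilterPredicate, Leaf, the NOOP string) is a stand-in written for illustration, not the real Iceberg or Parquet class, and the AlwaysFalse message text is assumed by analogy with the AlwaysTrue message quoted above. The guard mirrors the one in the PR description, which the review comments above question on semantic grounds.

```java
// Stand-in types only: they mimic the shape described above and are not the
// real Iceberg or Parquet classes.
final class PlaceholderGuardSketch {

  interface Visitor<R> {
    R visitLeaf(String description);
  }

  interface FilterPredicate {
    <R> R accept(Visitor<R> visitor);
  }

  /** An ordinary predicate that a visitor can process. */
  static final class Leaf implements FilterPredicate {
    private final String description;

    Leaf(String description) {
      this.description = description;
    }

    @Override
    public <R> R accept(Visitor<R> visitor) {
      return visitor.visitLeaf(description);
    }
  }

  /** Placeholder that was never meant to be visited. */
  static final class AlwaysFalse implements FilterPredicate {
    static final AlwaysFalse INSTANCE = new AlwaysFalse();

    @Override
    public <R> R accept(Visitor<R> visitor) {
      // Message text is an assumption, modelled on the AlwaysTrue message.
      throw new UnsupportedOperationException("AlwaysFalse is a placeholder only");
    }
  }

  /** Hypothetical marker meaning "nothing was pushed down". */
  static final String NOOP = "NOOP";

  /** Hands the predicate straight to a visitor, as an unguarded convert() would. */
  static String convertUnguarded(FilterPredicate pred) {
    return pred.<String>accept(description -> "filter(" + description + ")");
  }

  /** Checks for the sentinel first, so the placeholder is never visited. */
  static String convertGuarded(FilterPredicate pred) {
    if (pred == AlwaysFalse.INSTANCE) {
      return NOOP;
    }
    return convertUnguarded(pred);
  }

  public static void main(String[] args) {
    System.out.println(convertGuarded(new Leaf("id > 5")));   // filter(id > 5)
    System.out.println(convertGuarded(AlwaysFalse.INSTANCE)); // NOOP
    try {
      convertUnguarded(AlwaysFalse.INSTANCE);
    } catch (UnsupportedOperationException e) {
      System.out.println("unguarded: " + e.getMessage());
    }
  }
}
```

A regression test at the Parquet.read() level would presumably assert the same two things: the exception without the guard, and no exception with it.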

@pvary
Contributor

pvary commented May 4, 2026

However, the bug is reachable when:

  • A predicate is applied on a column that does not exist in an older Parquet file (schema evolution) — the Iceberg manifest evaluator may include the file (null bounds → "could match"), but ConvertFilterToParquet binds against the Parquet file's schema, resolves the predicate to AlwaysFalse.INSTANCE, and the bug triggers.

I think this is reproducible, and we can write a test for it.
Maybe the solution is to only push down filters to the Parquet files where the columns are actually present in the file.

CC: @gaborkaszab

  • The lower-level Parquet.read() API is called directly with an alwaysFalse() filter.

In my opinion this is a user error.

WDYT?

@laserninja
Contributor Author

Agreed on both points.

Agree that Parquet.read() with alwaysFalse() directly is user error, not something we need to protect against.

The schema evolution scenario is the real bug, and "only push down where the columns are present" is a cleaner fix than what this PR currently has. I see two ways to implement it:

1. In Parquet.java, before calling ParquetFilters.convert(fileSchema, filter, caseSensitive), check whether the columns referenced by filter are all present in fileSchema. If any are missing, fall back to FilterCompat.NOOP for that file. Row-level correctness still holds because Parquet will return all rows, and Iceberg's own residual evaluation will handle the rest.

2. In ConvertFilterToParquet, when binding a predicate against a column that doesn't exist in the file schema, return null instead of AlwaysFalse.INSTANCE, and propagate nulls through and/or/not to indicate "not pushable" rather than "false". convert() then treats a null result the same way AlwaysFalse.INSTANCE is treated today: fall back to NOOP.

I think option 1 is simpler and more transparent. Happy to update the PR with the schema evolution test and the Parquet.java-level column-presence check. Let me know if you (or @gaborkaszab) have a preference.
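As a rough illustration of the column-presence check in option 1, the sketch below works on plain sets of column names. The class and method names are hypothetical, and how the referenced columns would actually be collected from the Iceberg expression and the Parquet file schema is left out, so treat it as the shape of the check rather than the change itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: none of these names come from Iceberg or Parquet.
final class ColumnPresenceSketch {

  /** Returns true only if every column the filter references exists in the file. */
  static boolean canPushDown(
      Set<String> referencedColumns, Set<String> fileColumns, boolean caseSensitive) {
    if (caseSensitive) {
      return fileColumns.containsAll(referencedColumns);
    }
    // Case-insensitive mode: compare lower-cased names on both sides.
    Set<String> lowered = new HashSet<>();
    for (String name : fileColumns) {
      lowered.add(name.toLowerCase(Locale.ROOT));
    }
    for (String name : referencedColumns) {
      if (!lowered.contains(name.toLowerCase(Locale.ROOT))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Set<String> oldFile = new HashSet<>(Arrays.asList("id", "data"));
    Set<String> filterOnNewColumn = new HashSet<>(Arrays.asList("category"));
    Set<String> filterOnId = new HashSet<>(Arrays.asList("ID"));

    // Column added after the file was written: skip pushdown for this file.
    System.out.println(canPushDown(filterOnNewColumn, oldFile, true)); // false
    // Same column name in a different case, matched case-insensitively.
    System.out.println(canPushDown(filterOnId, oldFile, false));       // true
  }
}
```

When the check returns false, the caller would fall back to FilterCompat.NOOP for that file, as described in option 1.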

@pvary
Contributor

pvary commented May 8, 2026

Shall we add a new predicate for the Ignored columns and handle this upstream in the expression? Then we could have a "correct" result in many cases.

  /**
   * Sentinel marker for a predicate that referenced a column outside the allowed set and therefore
   * cannot be evaluated. Distinct from {@link AlwaysTrue} so that {@code not(IGNORED)} does not
   * become {@code AlwaysFalse} and {@code or(x, IGNORED)} does not silently drop matching rows.
   */
  private static class Ignored implements FilterPredicate {
    static final Ignored INSTANCE = new Ignored();

    @Override
    public <R> R accept(Visitor<R> visitor) {
      throw new UnsupportedOperationException("Ignored is a placeholder only");
    }
  }

@laserninja
Contributor Author

Ignored can be the right abstraction. The key semantic difference from AlwaysFalse is in or: or(x, AlwaysFalse) correctly simplifies to x, but or(x, Ignored) must stay Ignored because the missing column could have matching rows, so we can't push down the OR at all.

Proposed semantics:

  • not(Ignored) → Ignored
  • or(x, Ignored) / or(Ignored, x) → Ignored (can't push; might miss rows)
  • and(real, Ignored) / and(Ignored, real) → real (safe to push the resolvable side; AND with an ignored term can only be more restrictive)
  • and(Ignored, Ignored) → Ignored
  • convert(): treat Ignored.INSTANCE the same as AlwaysTrue.INSTANCE → NOOP

This gives "correct result in many cases" for AND-heavy filters on partially evolved files.
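The rules above can be sketched as three small combinators. This is a stand-in model only: Pred, Leaf, And, Or and Not are hypothetical types, not the Parquet FilterPredicate hierarchy or the visitor in ConvertFilterToParquet. One assumption worth flagging: collapsing and(real, Ignored) to real is only safe when no not sits above it, so the sketch assumes negations have already been pushed down to the leaves before these rules run.

```java
// Stand-in model of the proposed rules; these are not the Iceberg or Parquet types.
final class IgnoredSemanticsSketch {

  interface Pred {}

  /** Placeholder for a predicate on a column that is missing from the file. */
  static final class Ignored implements Pred {
    static final Ignored INSTANCE = new Ignored();

    private Ignored() {}
  }

  /** A pushable leaf predicate, identified only by a label. */
  static final class Leaf implements Pred {
    final String label;

    Leaf(String label) {
      this.label = label;
    }
  }

  static final class Not implements Pred {
    final Pred child;

    Not(Pred child) {
      this.child = child;
    }
  }

  static final class And implements Pred {
    final Pred left;
    final Pred right;

    And(Pred left, Pred right) {
      this.left = left;
      this.right = right;
    }
  }

  static final class Or implements Pred {
    final Pred left;
    final Pred right;

    Or(Pred left, Pred right) {
      this.left = left;
      this.right = right;
    }
  }

  /** not(Ignored) stays Ignored; assumes NOT is only applied to leaves. */
  static Pred not(Pred child) {
    return child == Ignored.INSTANCE ? Ignored.INSTANCE : new Not(child);
  }

  /** Keeps the resolvable side; and(Ignored, Ignored) falls out as Ignored. */
  static Pred and(Pred left, Pred right) {
    if (left == Ignored.INSTANCE) {
      return right;
    }
    if (right == Ignored.INSTANCE) {
      return left;
    }
    return new And(left, right);
  }

  /** Any Ignored side makes the whole OR unpushable. */
  static Pred or(Pred left, Pred right) {
    if (left == Ignored.INSTANCE || right == Ignored.INSTANCE) {
      return Ignored.INSTANCE;
    }
    return new Or(left, right);
  }

  /** convert() would return NOOP whenever this is false. */
  static boolean isPushable(Pred result) {
    return result != Ignored.INSTANCE;
  }

  public static void main(String[] args) {
    Pred real = new Leaf("id > 5");
    System.out.println(and(real, Ignored.INSTANCE) == real);      // true
    System.out.println(isPushable(or(real, Ignored.INSTANCE)));   // false
  }
}
```

Under the simpler alternative, and(real, Ignored) → Ignored, the and method would mirror or and the negation caveat would no longer apply.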

I'll update the PR with:

  • The Ignored sentinel class and the visitor changes above
  • A TableScan-level integration test using the schema evolution scenario (write file without column, add column, scan with predicate on the new column)
  • Updated unit tests in TestParquetFilters

Does the and semantic look right to you, or would you prefer and(real, Ignored) → Ignored (simpler, NOOP the whole thing)?

Development

Successfully merging this pull request may close these issues.

Parquet: ParquetFilters.convert() does not handle AlwaysFalse, may produce incorrect filter pushdown