Parquet: handle AlwaysFalse in ParquetFilters.convert()#16110
Parquet: handle AlwaysFalse in ParquetFilters.convert()#16110laserninja wants to merge 1 commit into
Conversation
ParquetFilters.convert() passed AlwaysFalse.INSTANCE through to FilterCompat.get(pred), but Parquet does not understand Iceberg's internal AlwaysFalse sentinel. This caused the filter to be applied incorrectly and potentially read all rows instead of none. Add an explicit AlwaysFalse check alongside the existing AlwaysTrue check so that both sentinels return FilterCompat.NOOP. Row-group pruning for always-false predicates is handled by Iceberg's own manifest filtering before Parquet is ever invoked. Fixes apache#16032
|
I have concerns with the proposed solution, because originally we converted the input filter to the equivalent Parquet filter. Following this line of thought, the |
|
Yes, I agree NOOP is the wrong semantic. NOOP means "pass everything through", which is the opposite of AlwaysFalse. My rationale (that Iceberg's manifest filtering would have already excluded all relevant files) is an assumption about callers that shouldn't be baked into this method. The challenge is that Parquet's FilterApi doesn't expose a native alwaysFalse() predicate. A few approaches I can see: Visitor-level fix, have ConvertFilterToParquet.alwaysFalse() return a UserDefinedPredicate that always returns false. This keeps the semantics correct at the Parquet level but requires picking a concrete column type. Propagate as a special case, detect the AlwaysFalse.INSTANCE sentinel in convert() and return a RecordFilter / throw, making it the caller's responsibility to avoid this path. Open to your suggestion, is there an existing pattern in the codebase you'd prefer? |
|
@laserninja: Could we have a TableScan based test case where this issue is highlighted? Could this happen in any way with the current code? |
Yes, but not through a direct
What the bug actually does: Both On a |
I think this is reproducible, and we can write a test for it. CC: @gaborkaszab
In my opinion this is an user error. WDYT? |
|
Agreed on both points. Agree that Parquet.read() with alwaysFalse() directly is user error, not something we need to protect against. The schema evolution scenario is the real bug, and "only push down where the columns are present" is a cleaner fix than what this PR currently has. I see two ways to implement it: In Parquet.java before calling ParquetFilters.convert(fileSchema, filter, caseSensitive), check whether the columns referenced by filter are all present in fileSchema. If any are missing, fall back to FilterCompat.NOOP for that file. The row-level correctness still holds because Parquet will return all rows, and Iceberg's own residual evaluation will handle the rest. In ConvertFilterToParquet when binding a predicate against a column that doesn't exist in the file schema, return null instead of AlwaysFalse.INSTANCE, and propagate nulls through and/or/not to indicate "not pushable" rather than "false". convert() then treats a null result the same as AlwaysFalse.INSTANCE is treated today, fall back to NOOP. I think option 1 is simpler and more transparent. Happy to update the PR with the schema evolution test and the Parquet.java-level column-presence check. Let me know if you (or @gaborkaszab) have a preference. |
|
Shall we add a new predicate for the |
|
Ignored can be the right abstraction. The key semantic difference from AlwaysFalse is in or: or(x, AlwaysFalse) correctly simplifies to x, but or(x, Ignored) must stay Ignored because the missing column could have matching rows, so we can't push down the OR at all. Proposed semantics: not(Ignored) → Ignored I'll update the PR with: The Ignored sentinel class and the visitor changes above |
Summary
Handle
AlwaysFalse.INSTANCEinParquetFilters.convert()so it returnsFilterCompat.NOOPinstead of passing the Iceberg-internal sentinel through to Parquet.Problem
Fixes #16032
ParquetFilters.convert()had a TODO comment acknowledging thatAlwaysFalse.INSTANCEwas not handled. The method only guarded againstAlwaysTrue.INSTANCEbefore callingFilterCompat.get(pred). Parquet does not understand Iceberg's internalAlwaysFalsetype, so the resulting filter was undefined — it could silently read all rows instead of none.Solution
Added
&& pred != AlwaysFalse.INSTANCEto the sentinel check inconvert(), matching the existingAlwaysTrue.INSTANCEguard. When the expression reduces toAlwaysFalse, the method returnsFilterCompat.NOOP. This is safe because Iceberg's own manifest-level filtering has already ensured no matching row groups will be scanned.Testing
Added
TestParquetFilterswith five unit tests:alwaysFalseReturnsNoop— regression guard for the exact bugalwaysTrueReturnsNoop— confirms existing AlwaysTrue behaviourrealPredicateReturnsFilter— confirms real predicates still produce active filtersnotAlwaysFalseReturnsNoop—not(alwaysFalse)resolves via the visitorandWithAlwaysFalseReturnsNoop—and(alwaysFalse, pred)resolves via the visitorAll existing
iceberg-parquettests pass.