Is your feature request related to a problem or challenge?
Filters such as WHERE struct_col IS NOT NULL cannot be pushed down into the Parquet scan when struct_col is a struct column. The PushdownChecker in row_filter.rs blanket-rejects any filter that references a whole struct column, even though IS NOT NULL only needs the struct's null bitmap, not any leaf data.
This forces a FilterExec above the scan that materializes ALL leaf columns of the struct just to check nullability:
FilterExec: price@4 IS NOT NULL
DataSourceExec: projection=[..., price, ...] -- reads all price.* leaf columns
For a struct with many fields, this means reading and decoding all leaf columns when semantically zero data is needed — only the struct's null bitmap.
PRs #20822 and #20854 added pushdown support for get_field expressions on struct fields (e.g., s['value'] > 10), but IS NULL / IS NOT NULL on the whole struct was explicitly left unsupported. The test struct_data_structures_prevent_pushdown asserts this rejection.
Describe the solution you'd like
Allow IS NULL and IS NOT NULL on struct columns to be pushed down as row-level filters, reading only a single leaf column instead of all leaves.
In Parquet, definition levels encode nullability at every nesting level independently. When arrow-rs reads even one leaf column, it reconstructs the struct's null bitmap from definition levels. is_not_null() then checks this bitmap, not the leaf's data.
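To make the definition-level argument concrete, here is a minimal sketch in plain Rust (not arrow-rs code). It assumes a nullable struct with one nullable leaf and no repeated fields, so the leaf's maximum definition level is 2 and the struct is defined from level 1 upward. The function name and the constant are illustrative assumptions, not existing APIs.

```rust
/// Definition level at which the struct itself is non-null.
/// Assumed schema: `optional group s { optional int32 a; }`, so
/// 0 = struct is null, 1 = struct present but leaf null, 2 = leaf present.
const STRUCT_DEF_LEVEL: i16 = 1;

/// Rebuild the struct's validity (true = non-null) from one leaf's
/// definition levels. The leaf's decoded values are never consulted.
fn struct_validity(def_levels: &[i16]) -> Vec<bool> {
    def_levels
        .iter()
        .map(|&level| level >= STRUCT_DEF_LEVEL)
        .collect()
}
```

For definition levels `[2, 1, 0, 2]` this yields `[true, true, false, true]`: only the third row has a null struct, and the second row (struct present, leaf null) still counts as non-null for the struct-level check.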
This means we can:
- Detect IS NULL(Column(struct)) / IS NOT NULL(Column(struct)) in PushdownChecker::f_down before the Column node is visited
- Project only the first leaf column of the struct in the ProjectionMask
- Let arrow-rs reconstruct the struct null bitmap from that single leaf's definition levels
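The steps above could look roughly like the following sketch. It uses a toy expression enum and a hypothetical `FileSchema` type rather than DataFusion's real `PhysicalExpr` and `PushdownChecker`, so every name here is an assumption made for illustration. It only shows the shape of the check: match the null test first, then pick a single leaf.

```rust
/// Toy expression tree; a stand-in for a physical expression,
/// not DataFusion's actual types.
#[derive(Debug)]
enum Expr {
    /// Reference to a top-level column by its index in the file schema.
    Column(usize),
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
}

/// Hypothetical schema description: for each top-level column, the Parquet
/// leaf indices it spans (one for a primitive, several for a struct).
struct FileSchema {
    leaves_per_column: Vec<Vec<usize>>,
}

/// Returns the leaf indices to project for `IS [NOT] NULL` applied directly
/// to a column, or `None` if the expression does not have that shape.
fn null_check_projection(expr: &Expr, schema: &FileSchema) -> Option<Vec<usize>> {
    // Step 1: recognise the null check before descending into the column.
    let inner: &Expr = match expr {
        Expr::IsNull(inner) | Expr::IsNotNull(inner) => inner.as_ref(),
        _ => return None,
    };
    match inner {
        Expr::Column(idx) => {
            let leaves = schema.leaves_per_column.get(*idx)?;
            // Step 2: any leaf carries the struct's definition levels,
            // so projecting the first one is enough.
            leaves.first().map(|&leaf| vec![leaf])
        }
        _ => None,
    }
}
```

With a schema whose column 1 is a struct spanning leaves 1, 2 and 3, `IS NOT NULL(Column(1))` projects only leaf 1, while any other expression shape falls through to the existing pushdown rules.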
Describe alternatives you've considered
- Definition-level-only reads: Read only the definition levels of one leaf column without decoding the data values. This would be optimal (~1 bit per row vs 4+ bytes) but requires arrow-rs API changes (ProjectionMask::definition_levels_only() or similar) that don't exist today.
- Statistics-based pruning only: Use Parquet row-group null_count statistics to prune entire row groups. However, Parquet statistics are stored only for leaf columns, not struct columns, so struct-level null_count isn't directly available from metadata.
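To illustrate why a leaf's null_count cannot stand in for a struct-level count, here is a small sketch in plain Rust. It again assumes a nullable struct with one nullable leaf and no repeated fields, and the function names are hypothetical. As I understand the format, a leaf's statistic counts every row where the leaf has no value, whether the leaf or the enclosing struct is null.

```rust
/// Assumed schema: `optional group s { optional int32 a; }`.
const LEAF_MAX_DEF_LEVEL: i16 = 2;
const STRUCT_DEF_LEVEL: i16 = 1;

/// What the leaf's `null_count` reflects: rows where the leaf has no value,
/// either because the leaf is null or because the whole struct is null.
fn leaf_null_count(def_levels: &[i16]) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level < LEAF_MAX_DEF_LEVEL)
        .count()
}

/// What the filter actually needs: rows where the struct itself is null.
fn struct_null_count(def_levels: &[i16]) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level < STRUCT_DEF_LEVEL)
        .count()
}
```

For definition levels `[2, 1, 0, 2]` the leaf reports 2 nulls while the struct has only 1, so the leaf statistic is at best an upper bound on the struct-level count.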
Additional context
Related to the filter pushdown EPIC: #20324
Builds on: #20822 (struct field pushdown) and #20854 (leaf projection refinement)
PR generated with Claude Code