Push down IS NULL / IS NOT NULL on struct columns into Parquet scan #21795

@Druva-D

Is your feature request related to a problem or challenge?

Filters like WHERE struct_col IS NOT NULL on struct columns cannot be pushed down into the Parquet scan. The PushdownChecker in row_filter.rs blanket-rejects any filter that references a whole struct column, even though IS NOT NULL only needs the struct's null bitmap — not any leaf data.

This forces a FilterExec above the scan that materializes ALL leaf columns of the struct just to check nullability:

FilterExec: price@4 IS NOT NULL
  DataSourceExec: projection=[..., price, ...]  -- reads all price.* leaf columns

For a struct with many fields, this means reading and decoding every leaf column even though the predicate needs no leaf values at all, only the struct's null bitmap.

PRs #20822 and #20854 added pushdown support for get_field expressions on struct fields (e.g., s['value'] > 10), but IS NULL / IS NOT NULL on the whole struct was explicitly left unsupported. The test struct_data_structures_prevent_pushdown asserts this rejection.

Describe the solution you'd like

Allow IS NULL and IS NOT NULL on struct columns to be pushed down as row-level filters, reading only a single leaf column instead of all leaves.

In Parquet, definition levels encode nullability at every nesting level independently. When arrow-rs reads even one leaf column, it reconstructs the struct's null bitmap from that leaf's definition levels. The is_not_null() check then reads this bitmap, not the leaf's data.
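To make the definition-level argument concrete, here is a minimal sketch of how a single leaf's definition levels determine struct-level validity. It is a simplified model, not arrow-rs code: it assumes an optional struct with an optional leaf and no repeated (list) ancestors. The function names and the `price.amount` schema are hypothetical.

```rust
// Simplified model of Parquet definition levels for one leaf column.
// Assumed (hypothetical) schema:
//   optional group price { optional int64 amount; }
// Definition levels for `price.amount`:
//   0 => `price` itself is NULL
//   1 => `price` is present, `amount` is NULL
//   2 => both `price` and `amount` are present

/// Struct-level validity: a row's struct is non-null when the leaf's
/// definition level reaches the level at which the struct is defined.
pub fn struct_validity(def_levels: &[i16], struct_def_level: i16) -> Vec<bool> {
    def_levels.iter().map(|&d| d >= struct_def_level).collect()
}

/// Leaf-level validity: the leaf value exists only at the maximum level.
pub fn leaf_validity(def_levels: &[i16], max_def_level: i16) -> Vec<bool> {
    def_levels.iter().map(|&d| d == max_def_level).collect()
}
```

For levels `[2, 1, 0, 2]`, `struct_validity(&levels, 1)` marks only the third row as a NULL struct, while `leaf_validity(&levels, 2)` also marks the second row as a NULL leaf. The struct bitmap therefore comes from the levels alone and never depends on the decoded values.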

This means we can:

  1. Detect IS NULL(Column(struct)) / IS NOT NULL(Column(struct)) in PushdownChecker::f_down before the Column node is visited
  2. Project only the first leaf column of the struct in the ProjectionMask
  3. Let arrow-rs reconstruct the struct null bitmap from that single leaf's definition levels
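The first two steps could look roughly like the sketch below. It uses a toy expression and type model rather than DataFusion's `PhysicalExpr` or the real `PushdownChecker`. Every type and function name here (`Expr`, `DataType`, `struct_null_check_column`, `first_leaf_path`) is hypothetical and only illustrates the pattern match and the leaf selection.

```rust
use std::collections::HashMap;

// Toy stand-ins for the real schema and expression types (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Struct(Vec<(String, DataType)>),
}

#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
}

/// Step 1: recognize `IS [NOT] NULL(Column(struct))` at the parent node,
/// before a visitor would descend into (and reject) the bare struct column.
pub fn struct_null_check_column<'a>(
    expr: &'a Expr,
    schema: &HashMap<String, DataType>,
) -> Option<&'a str> {
    let inner: &Expr = match expr {
        Expr::IsNull(inner) | Expr::IsNotNull(inner) => inner.as_ref(),
        _ => return None,
    };
    match inner {
        Expr::Column(name) if matches!(schema.get(name), Some(DataType::Struct(_))) => {
            Some(name.as_str())
        }
        _ => None,
    }
}

/// Step 2: pick the first primitive leaf under the struct (depth-first),
/// which is the only column the projection would need to include.
pub fn first_leaf_path(name: &str, data_type: &DataType) -> Option<String> {
    match data_type {
        DataType::Struct(fields) => fields
            .iter()
            .find_map(|(child, child_type)| {
                first_leaf_path(&format!("{}.{}", name, child), child_type)
            }),
        _ => Some(name.to_string()),
    }
}
```

With a schema where `price` is a struct of `amount` and `currency`, `struct_null_check_column` returns `Some("price")` for `price IS NOT NULL`, and `first_leaf_path` returns `price.amount`. A null check on a primitive column returns `None` and would follow the existing pushdown path.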

Describe alternatives you've considered

  1. Definition-level-only reads: Read only the definition levels of one leaf column without decoding the data values. This would be optimal (~1 bit per row vs 4+ bytes) but requires arrow-rs API changes (ProjectionMask::definition_levels_only() or similar) that don't exist today.

  2. Statistics-based pruning only: Use Parquet row-group null_count statistics to prune entire row groups. However, Parquet statistics are stored only for leaf columns, not struct columns, so struct-level null_count isn't directly available from metadata.

Additional context

Related to the filter pushdown EPIC: #20324
Builds on: #20822 (struct field pushdown) and #20854 (leaf projection refinement)

PR generated with Claude Code
