Fix fully matched row groups with null counts by xudong963 · Pull Request #21907 · apache/datafusion

xudong963 · 2026-04-29T02:07:45Z

Which issue does this PR close?

Related to Skip RowFilter and page pruning for fully matched row groups #21637

Rationale for this change

This is split out from review feedback on #21637. Row groups can only be marked fully matched when all rows are guaranteed to pass the filter. For nullable predicate columns, proving NOT(predicate) is not enough because rows where the predicate evaluates to NULL do not pass the filter.
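To illustrate the rationale, here is a minimal sketch of SQL three-valued logic in plain Rust. It is not DataFusion code: the predicate `x > 5` and every function name below are hypothetical, chosen only to show why "no row satisfies NOT(predicate)" is weaker than "every row passes the predicate" once NULLs are involved.

```rust
/// SQL three-valued NOT: NULL (None) stays NULL.
fn sql_not(v: Option<bool>) -> Option<bool> {
    v.map(|b| !b)
}

/// A filter keeps a row only when the predicate evaluates to exactly TRUE.
fn passes_filter(v: Option<bool>) -> bool {
    v == Some(true)
}

/// Hypothetical predicate `x > 5`; a NULL input yields NULL.
fn gt_five(x: Option<i64>) -> Option<bool> {
    x.map(|x| x > 5)
}

/// True when no row satisfies NOT(predicate).
/// This is the most that an inverted pruning predicate can establish.
fn no_row_satisfies_negation(rows: &[Option<i64>]) -> bool {
    rows.iter().all(|&x| !passes_filter(sql_not(gt_five(x))))
}

/// True when every row passes the original predicate.
/// This is what "fully matched" actually requires.
fn all_rows_pass(rows: &[Option<i64>]) -> bool {
    rows.iter().all(|&x| passes_filter(gt_five(x)))
}
```

For the rows `[Some(10), None, Some(7)]`, `no_row_satisfies_negation` returns `true` while `all_rows_pass` returns `false`: the NULL row satisfies neither the predicate nor its negation.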

What changes are included in this PR?

This PR makes the fully matched row-group proof conservative for nulls by adding IS NULL checks for nullable columns referenced by the predicate before evaluating the inverted pruning predicate.

It also threads with_missing_null_counts_as_zero through RowGroupPruningStatistics so normal row-group pruning keeps the existing default behavior, while fully matched proofs treat missing null counts as unknown. This reuses the existing statistics conversion path instead of adding a separate null-count conversion pass.
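The conservative proof described above can be sketched roughly as follows. This is a simplified model, not the PR's implementation: `ColumnStats`, the predicate `x > 5`, and all function names are assumptions made for illustration, and the real code works through pruning predicates rather than direct field comparisons.

```rust
/// Hypothetical, simplified row-group statistics for one predicate column.
struct ColumnStats {
    min: i64,
    nullable: bool,
    /// `None` means the null count is unknown.
    null_count: Option<u64>,
}

/// Stand-in for the IS NULL check: true only when no row can be NULL.
/// An unknown null count proves nothing, so it yields false.
fn proves_no_nulls(stats: &ColumnStats) -> bool {
    !stats.nullable || stats.null_count == Some(0)
}

/// Stand-in for the inverted pruning predicate of `x > 5`:
/// `x <= 5` cannot match any row when the minimum is above 5.
fn inverted_predicate_matches_nothing(stats: &ColumnStats) -> bool {
    stats.min > 5
}

/// A row group is fully matched only when both proofs succeed.
/// The null check runs first, mirroring the order in the description.
fn is_fully_matched(stats: &ColumnStats) -> bool {
    proves_no_nulls(stats) && inverted_predicate_matches_nothing(stats)
}
```

Under this model a nullable column with `min = 10` is fully matched only when its null count is known to be zero; a known non-zero count or an unknown count both fall back to normal filtering.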

Are these changes tested?

Added a regression test covering row groups with known nulls, known zero nulls, and missing null counts.
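The three cases named above can be modeled with a small sketch. This is not the regression test from the PR: the function names are hypothetical, and only the parameter name `missing_null_counts_as_zero` is taken from the PR itself.

```rust
/// Interpret a raw null count read from row-group metadata.
/// `raw == None` models a file whose writer did not record a null count.
fn interpret_null_count(raw: Option<u64>, missing_null_counts_as_zero: bool) -> Option<u64> {
    match raw {
        Some(n) => Some(n),
        // Default behavior: assume the writer omitted a zero count.
        None if missing_null_counts_as_zero => Some(0),
        // Conservative behavior: the count stays unknown.
        None => None,
    }
}

/// The fully matched proof may proceed only when "no nulls" is established.
fn no_nulls_proven(raw: Option<u64>, missing_null_counts_as_zero: bool) -> bool {
    interpret_null_count(raw, missing_null_counts_as_zero) == Some(0)
}
```

With the flag set to `false`, known nulls and missing counts both block the proof, and only a known zero count allows it. With the flag set to `true`, a missing count would be read as zero, which is the false positive this PR aims to prevent for fully matched proofs.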

Are there any user-facing changes?

No API changes. This only prevents false positives in the row-group fully matched optimization.

alamb

Thanks @xudong963 -- this seems better than what is on main, but I am not quite sure about the decision to pick

            missing_null_counts_as_zero: false,

vs

            missing_null_counts_as_zero: true,

alamb · 2026-04-30T18:44:07Z

            parquet_schema,
            row_group_metadatas,
            arrow_schema,
+            missing_null_counts_as_zero: true,


Shouldn't this value also be passed down rather than hard-coded?

> It also threads with_missing_null_counts_as_zero through RowGroupPruningStatistics so normal row-group pruning keeps the existing default behavior, while fully matched proofs treat missing null counts as unknown. This reuses the existing statistics conversion path instead of adding a separate null-count conversion pass.

I think this is a setting on the reader that controls how missing statistics are interpreted (older versions of arrow-rs didn't write null counts when there were 0 nulls).

I am not sure why this code path is changing its value.

alamb · 2026-04-30T18:44:44Z

                .map(|&i| &groups[i])
                .collect::<Vec<_>>(),
            arrow_schema,
+            missing_null_counts_as_zero: false,


Why is this always false? It may be worth adding some comments explaining what is going on here.

Fix fully matched row groups with null counts

fcd521c

github-actions bot added the datasource (Changes to the datasource crate) label Apr 29, 2026

Merge branch 'main' into fix-fully-matched-null-counts

8b002d9

blaginin mentioned this pull request Apr 29, 2026

dependencies check are now required to merge ci #21940

Merged

xudong963 mentioned this pull request Apr 30, 2026

Skip RowFilter and page pruning for fully matched row groups #21637

Draft

alamb approved these changes Apr 30, 2026

xudong963 wants to merge 2 commits into apache:main from xudong963:fix-fully-matched-null-counts

3 participants
