Skip to content

Fix fully matched row groups with null counts#21907

Open
xudong963 wants to merge 2 commits intoapache:mainfrom
xudong963:fix-fully-matched-null-counts
Open

Fix fully matched row groups with null counts#21907
xudong963 wants to merge 2 commits intoapache:mainfrom
xudong963:fix-fully-matched-null-counts

Conversation

@xudong963
Copy link
Copy Markdown
Member

@xudong963 xudong963 commented Apr 29, 2026

Which issue does this PR close?

Rationale for this change

This is split out from review feedback on #21637. Row groups can only be marked fully matched when all rows are guaranteed to pass the filter. For nullable predicate columns, proving NOT(predicate) is not enough because rows where the predicate evaluates to NULL do not pass the filter.

What changes are included in this PR?

This PR makes the fully matched row-group proof conservative for nulls by adding IS NULL checks for nullable columns referenced by the predicate before evaluating the inverted pruning predicate.

It also threads with_missing_null_counts_as_zero through RowGroupPruningStatistics so normal row-group pruning keeps the existing default behavior, while fully matched proofs treat missing null counts as unknown. This reuses the existing statistics conversion path instead of adding a separate null-count conversion pass.

Are these changes tested?

Added a regression test covering row groups with known nulls, known zero nulls, and missing null counts.

Are there any user-facing changes?

No API changes. This only prevents false positives in the row-group fully matched optimization.

@github-actions github-actions Bot added the datasource Changes to the datasource crate label Apr 29, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xudong963 -- this seems better than what is on main, but I am not quite sure about the decision to pick

            missing_null_counts_as_zero: false,

vs

            missing_null_counts_as_zero: true,

parquet_schema,
row_group_metadatas,
arrow_schema,
missing_null_counts_as_zero: true,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this value also be passed down rather than hard coded?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also threads with_missing_null_counts_as_zero through RowGroupPruningStatistics so normal row-group pruning keeps the existing default behavior, while fully matched proofs treat missing null counts as unknown. This reuses the existing statistics conversion path instead of adding a separate null-count conversion pass.

I think this is a setting on the reader that controls how missing statistics are interpreted (as older versions of arrow-rs didn't write null counts when there were 0 nulls)

I am not sure why this code path is changing its value

.map(|&i| &groups[i])
.collect::<Vec<_>>(),
arrow_schema,
missing_null_counts_as_zero: false,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this always false? Maybe it is worth some comments explaining what is going on here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants