Fix fully matched row groups with null counts #21907
xudong963 wants to merge 2 commits into apache:main
alamb left a comment
Thanks @xudong963 -- this seems better than what is on main, but I am not quite sure about the decision to pick
missing_null_counts_as_zero: false,
vs
missing_null_counts_as_zero: true,
parquet_schema,
row_group_metadatas,
arrow_schema,
missing_null_counts_as_zero: true,
Shouldn't this value also be passed down rather than hard-coded?
> It also threads `with_missing_null_counts_as_zero` through `RowGroupPruningStatistics` so normal row-group pruning keeps the existing default behavior, while fully matched proofs treat missing null counts as unknown. This reuses the existing statistics conversion path instead of adding a separate null-count conversion pass.
I think this is a setting on the reader that controls how missing statistics are interpreted (older versions of arrow-rs didn't write null counts when there were 0 nulls).
I am not sure why this code path is changing its value.
.map(|&i| &groups[i])
.collect::<Vec<_>>(),
arrow_schema,
missing_null_counts_as_zero: false,
Why is this always false? Maybe it is worth adding some comments explaining what is going on here.
Which issue does this PR close?
Rationale for this change
This is split out from review feedback on #21637. Row groups can only be marked fully matched when all rows are guaranteed to pass the filter. For nullable predicate columns, proving `NOT(predicate)` is not enough because rows where the predicate evaluates to NULL do not pass the filter.
What changes are included in this PR?
This PR makes the fully matched row-group proof conservative for nulls by adding `IS NULL` checks for nullable columns referenced by the predicate before evaluating the inverted pruning predicate.
It also threads `with_missing_null_counts_as_zero` through `RowGroupPruningStatistics` so normal row-group pruning keeps the existing default behavior, while fully matched proofs treat missing null counts as unknown. This reuses the existing statistics conversion path instead of adding a separate null-count conversion pass.
Are these changes tested?
Added a regression test covering row groups with known nulls, known zero nulls, and missing null counts.
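The three cases in that test can be sketched in isolation. The snippet below is only an illustration of the two interpretations of a missing null count, not the code in this PR: `NullCountStat`, `resolve_null_count`, and `provably_null_free` are hypothetical names, and the real logic lives in the statistics conversion path.

```rust
/// Hypothetical stand-in for one column's null-count statistic in a row group.
/// `None` means the writer did not record a null count at all.
#[derive(Clone, Copy, Debug, PartialEq)]
struct NullCountStat(Option<u64>);

/// Resolve the statistic under a given policy.
/// Returns `None` when the number of nulls must be treated as unknown.
fn resolve_null_count(stat: NullCountStat, missing_as_zero: bool) -> Option<u64> {
    match stat.0 {
        // A recorded count is always trusted as-is.
        Some(n) => Some(n),
        // Legacy interpretation: a missing count is read as "no nulls".
        None if missing_as_zero => Some(0),
        // Conservative interpretation: a missing count stays unknown.
        None => None,
    }
}

/// A column can support a "fully matched" proof only if it is known
/// to contain no nulls under the chosen policy.
fn provably_null_free(stat: NullCountStat, missing_as_zero: bool) -> bool {
    resolve_null_count(stat, missing_as_zero) == Some(0)
}
```

Under these assumptions, a row group with a missing count is treated as null-free when `missing_as_zero` is `true` and as unknown when it is `false`, while known counts (zero or non-zero) behave the same under both policies.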
Are there any user-facing changes?
No API changes. This only prevents false positives in the row-group fully matched optimization.
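To make the false positive concrete, here is a minimal row-level sketch of SQL three-valued logic. It is an illustration only: the real optimization reasons over row-group statistics rather than individual rows, and the predicate `x > 5` and all function names are hypothetical.

```rust
/// SQL three-valued boolean: `None` stands for NULL.
type Sql3 = Option<bool>;

/// SQL NOT: NULL stays NULL.
fn sql_not(v: Sql3) -> Sql3 {
    v.map(|b| !b)
}

/// Hypothetical predicate `x > 5` on a nullable value.
fn gt_five(x: Option<i64>) -> Sql3 {
    x.map(|v| v > 5)
}

/// A filter keeps a row only when the predicate is exactly TRUE.
fn passes_filter(v: Sql3) -> bool {
    v == Some(true)
}

/// Ground truth: every row passes the filter.
fn all_rows_pass(rows: &[Option<i64>]) -> bool {
    rows.iter().all(|&x| passes_filter(gt_five(x)))
}

/// Naive proof: no row satisfies NOT(predicate).
/// NULL rows slip through because NOT(NULL) is NULL, not TRUE.
fn naive_fully_matched(rows: &[Option<i64>]) -> bool {
    rows.iter().all(|&x| !passes_filter(sql_not(gt_five(x))))
}

/// Conservative proof: additionally require that no value is NULL.
fn conservative_fully_matched(rows: &[Option<i64>]) -> bool {
    rows.iter().all(|x| x.is_some()) && naive_fully_matched(rows)
}
```

For rows `[10, NULL, 7]`, the naive check reports the group as fully matched even though the NULL row does not pass the filter; the conservative check rejects it.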