Parquet pruning predicate for IS NULL #1591

Closed
houqp opened this issue Jan 17, 2022 · 1 comment · Fixed by #1595
Labels
datafusion (Changes in the datafusion crate), performance

Comments

@houqp
Member

houqp commented Jan 17, 2022

Describe the bug

Evaluating a col = null expression against the statistics array throws a runtime error, which results in an incorrect true result even when the stats have the null count set to 0.

The other problem is that the col = null expression is converted into the predicate expression col_min <= NULL AND NULL <= col_max. I believe we should handle null as a special case and return an expression that checks against the null count column instead.
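To illustrate why the min/max form carries no pruning information, here is a minimal sketch that assumes standard SQL three-valued logic, where any comparison with NULL is unknown. It is not DataFusion code (the issue reports that the real evaluation raises a runtime error instead); the function names are hypothetical, and NULL and "unknown" are both modeled as `None`.

```rust
/// SQL-style `a <= b`: the result is unknown (`None`) if either side is NULL.
fn sql_lt_eq(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x <= y),
        _ => None,
    }
}

/// Three-valued (Kleene) AND: false wins, otherwise unknown is contagious.
fn sql_and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Hypothetical stand-in for the rewritten predicate
/// `col_min <= NULL AND NULL <= col_max`.
fn min_max_predicate_for_null(col_min: Option<i64>, col_max: Option<i64>) -> Option<bool> {
    let null_literal: Option<i64> = None;
    sql_and(
        sql_lt_eq(col_min, null_literal),
        sql_lt_eq(null_literal, col_max),
    )
}
```

Under these assumptions the predicate is unknown for every possible min/max pair, so the min and max statistics cannot decide whether a row group holds nulls.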

To Reproduce

See our test cases at: https://github.com/apache/arrow-datafusion/blob/f027e5f4d9a44ad9cc879c133abc913f78fa76f0/datafusion/src/physical_plan/file_format/parquet.rs#L722-L763

The test case asserts that the results for both row groups should be true, whereas they should both be false because both row groups have the null count set to 0.

Expected behavior

Row group pruning for col = null should take the row group's null count statistics into account.
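The expected behavior could be sketched roughly as follows. This is an illustration only, not the fix that landed: `RowGroupStats`, `keep_for_is_null`, and `prune_row_groups` are hypothetical names, and the sketch assumes the convention used in the test description above, where true means the row group must be read and false means it can be skipped.

```rust
/// Hypothetical per-row-group statistics. `None` means the writer did not
/// record a null count for the column.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    null_count: Option<u64>,
}

/// Returns true if the row group may contain a NULL and must be read,
/// false if the statistics prove it has no NULLs and it can be skipped.
fn keep_for_is_null(stats: &RowGroupStats) -> bool {
    match stats.null_count {
        Some(0) => false, // no nulls recorded: nothing can match
        Some(_) => true,  // at least one null: must read
        None => true,     // unknown: stay conservative and read
    }
}

/// Evaluates the null check for each row group in order.
fn prune_row_groups(groups: &[RowGroupStats]) -> Vec<bool> {
    groups.iter().map(keep_for_is_null).collect()
}
```

With two row groups that both report a null count of 0, `prune_row_groups` yields false for each, which matches the result the issue expects. Missing statistics fall back to reading the row group, so the check can only skip data it has proven irrelevant.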

@alamb
Contributor

alamb commented Jan 17, 2022

It should be noted that this is an 'optimization' bug rather than a correctness bug -- in the sense that returning false means "don't filter the row group" and returning true means "do filter (aka skip) the row group"

@houqp houqp added the performance label and removed the bug (Something isn't working) label Jan 18, 2022
@alamb alamb added the datafusion (Changes in the datafusion crate) label Feb 10, 2022
@alamb alamb changed the title from "Parquet pruning predicate is not handling null comparisons correctly" to "Parquet pruning predicate for IS NULL" Feb 10, 2022