Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix NaN handling in parquet statistics #256

Merged
merged 1 commit into from May 5, 2021

Conversation

crepererum
Copy link
Contributor

@crepererum crepererum commented May 4, 2021

Which issue does this PR close?

Closes #255.

Rationale for this change

Fixes NaN handling in parquet statistics. This is in line with the C++ stack:

https://github.com/apache/arrow/blob/b3e43987c47b2f01b204a2d954f882f7161616ef/cpp/src/parquet/statistics_test.cc#L1000-L1043

What changes are included in this PR?

Filter out NaN values from statistics + tests.

Are there any user-facing changes?

Yes: formally NaN were included in the stats but at "random" (i.e. when the data started with an NaN than the min/max values are NaN, otherwise min/max are non-NaN). Now the behavior is: NaN are excluded always. If the only NaN values are present, then min/max are unset.

Comment on lines +925 to +927
// simple "isNaN" check that works for all types
if val == val {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we introduce some generic T::is_nan function into the data type abstraction layer (that basically returns False for every non-float/double type and calls .is_nan() for float/double) or is this local workaround good enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's necessary. It'll just be adding overhead for us.

The compiler can figure out what we're trying to do, and will optimise out the branch for anything that's not a floating point number.

Look at this example: https://godbolt.org/z/z7M3Mfqzz

The is_nan::<i32>() always returns false, and the ucomisd does the floating point comparison

@codecov-commenter
Copy link

Codecov Report

Merging #256 (acd4f52) into master (8f030db) will increase coverage by 0.01%.
The diff coverage is 90.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #256      +/-   ##
==========================================
+ Coverage   82.52%   82.53%   +0.01%     
==========================================
  Files         162      162              
  Lines       43672    43786     +114     
==========================================
+ Hits        36039    36139     +100     
- Misses       7633     7647      +14     
Impacted Files Coverage Δ
parquet/src/column/writer.rs 93.76% <90.00%> (-0.27%) ⬇️
parquet/src/arrow/levels.rs 81.25% <0.00%> (-0.44%) ⬇️
parquet/src/arrow/arrow_writer.rs 98.42% <0.00%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f030db...acd4f52. Read the comment docs.

@alamb
Copy link
Contributor

alamb commented May 4, 2021

I believe this fixes #255 rather than #225 so I updated the description accordingly

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change (and tests) look really good to me. @sunchao or @nevi-me do you have any thoughts?

@alamb
Copy link
Contributor

alamb commented May 4, 2021

Thanks @crepererum !

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NaNs can break parquet statistics
4 participants