Add parquet page stats for float{16, 32, 64} #10982

Merged on Jun 19, 2024 (1 commit)

Conversation

Contributor

@tmi tmi commented Jun 18, 2024

Which issue does this PR close?

Closes #10951.

Rationale for this change

This just extends the existing functionality for parquet page data stats: the int cases were already covered, and this PR adds the float cases in the same fashion.

What changes are included in this PR?

  • Float64 was a breeze: just add a macro invocation and a case branch,
  • Float32 additionally needed an extension of the data generation options -- I hope I didn't miss one that already exists,
  • Float16 got a bit messy and required something I'm not happy with. Unlike other data types, it's physically represented as a FixedLenByteArray of 2 bytes, which doesn't behave like a primitive type. In particular, I needed to add a .clone() at a particular place, which required going from ident to expr in a macro, affecting all cases. I'm very open to suggestions here -- I haven't come up with a better idea myself yet.
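
To illustrate the Float16 point, here is a minimal sketch of how a 2-byte fixed-length value could be decoded into a float. It assumes the bytes hold an IEEE 754 half-precision value in little-endian order, which is how the Parquet Float16 logical type is specified. The function names are hypothetical and are not part of this PR; the actual change keeps the raw fixed-length bytes and converts them elsewhere.

```rust
/// Hypothetical helper: widen an IEEE 754 half-precision value, given as two
/// little-endian bytes, to an f32. Every f16 value is exactly representable
/// as an f32, so the conversion is lossless.
fn f16_le_bytes_to_f32(bytes: [u8; 2]) -> f32 {
    let bits = u16::from_le_bytes(bytes);
    let sign = u32::from(bits >> 15) << 31; // sign bit moved to the f32 position
    let exp = u32::from((bits >> 10) & 0x1f); // 5-bit biased exponent
    let frac = u32::from(bits & 0x3ff); // 10-bit fraction

    match exp {
        // Zero or subnormal: the value is frac * 2^-24.
        0 => {
            let magnitude = frac as f32 / 16_777_216.0;
            if sign != 0 { -magnitude } else { magnitude }
        }
        // Infinity (frac == 0) or NaN (frac != 0).
        0x1f => f32::from_bits(sign | 0x7f80_0000 | (frac << 13)),
        // Normal number: rebias the exponent from 15 to 127 (difference 112).
        _ => f32::from_bits(sign | ((exp + 112) << 23) | (frac << 13)),
    }
}

/// Hypothetical helper: decode a raw statistics value, returning None when
/// the slice is not exactly two bytes long.
fn decode_f16_stat(raw: &[u8]) -> Option<f32> {
    if raw.len() != 2 {
        return None;
    }
    Some(f16_le_bytes_to_f32([raw[0], raw[1]]))
}
```

With this sketch, `decode_f16_stat(&[0x00, 0x3C])` yields `Some(1.0)`, since 0x3C00 is the half-precision encoding of 1.0.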

Are these changes tested?

Yes, I extended the existing test coverage to the newly added cases.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the core Core datafusion crate label Jun 18, 2024
@alamb
Contributor

alamb commented Jun 18, 2024

FYI there appears to be another PR open for this #10972

@tmi
Contributor Author

tmi commented Jun 18, 2024

oh thx, haven't checked since yesterday...

Commented there. Happy either to continue with this one or to close it in favour of the other. I have no further commits queued for this PR.

@tmi tmi marked this pull request as ready for review June 18, 2024 15:59
@alamb alamb changed the title from "Minor: add parquet page stats for float{16, 32, 64}" to "Add parquet page stats for float{16, 32, 64}" Jun 18, 2024
Contributor

@alamb alamb left a comment

Very nice -- thank you again @tmi

make_data_page_stats_iterator!(
MinFloat16DataPageStatsIterator,
|x: &PageIndex<FixedLenByteArray>| { x.min.clone() },
Index::FIXED_LEN_BYTE_ARRAY,
Contributor

cargo doc --document-private-items -p datafusion --open

[Screenshot, 2024-06-18]

I see what is going on -- I think this makes sense for now

Implies that the iterator is an iterator over Vec<FixedLengthItem>

Contributor Author

exactly, basically caused by there being no f16 index in https://arrow.apache.org/rust/src/parquet/file/page_index/index.rs.html#74 ... I was thinking of mapping to f16 first earlier on, to make it cleaner... but this turned out to be the simplest I came up with.

#10972 (comment) is I think the best way to make this better eventually
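
A minimal sketch of why the `.clone()` pushes the macro argument from an ident to an expr: a `Copy` statistic can be read straight out of a borrowed page, while a non-`Copy` value has to be cloned, so the accessor has to be a full closure expression. All names below (`PageIndex`, `FixedBytes`, `make_stat_extractor!`) are hypothetical stand-ins and not the parquet crate's types or the macro in this PR.

```rust
/// Hypothetical stand-in for one page's index entry.
struct PageIndex<T> {
    min: Option<T>,
    max: Option<T>,
}

/// Hypothetical non-Copy value, standing in for a fixed-length byte array.
#[derive(Clone, Debug, PartialEq)]
struct FixedBytes(Vec<u8>);

/// Generates a function that collects one statistic from every page.
/// `$func` is an expression (a closure), so each invocation decides whether
/// to copy or clone the value out of the borrowed page.
macro_rules! make_stat_extractor {
    ($name:ident, $func:expr, $value_type:ty) => {
        fn $name(pages: &[PageIndex<$value_type>]) -> Vec<Option<$value_type>> {
            pages.iter().map($func).collect()
        }
    };
}

// f64 is Copy, so the field can be read directly through the reference.
make_stat_extractor!(min_f64_stats, |x: &PageIndex<f64>| x.min, f64);
make_stat_extractor!(max_f64_stats, |x: &PageIndex<f64>| x.max, f64);

// FixedBytes is not Copy, so the value must be cloned out of the page.
make_stat_extractor!(
    min_fixed_stats,
    |x: &PageIndex<FixedBytes>| x.min.clone(),
    FixedBytes
);
```

If the macro only accepted a field name, the `FixedBytes` case would try to move the value out of a shared reference and fail to compile; taking an expression keeps the choice at the call site, at the cost of slightly noisier invocations for the primitive types.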

@@ -614,6 +614,94 @@ async fn test_int_8() {
.run();
}

#[tokio::test]
Contributor

💯 very nice tests

@@ -614,6 +614,94 @@ async fn test_int_8() {
.run();
}

#[tokio::test]
Contributor

@tshauck tshauck Jun 18, 2024

I think this test is mostly a duplicate of test_float16, except that one uses Check::RowGroup and the other Check::Both... maybe one or the other can be updated?

Contributor Author

@tshauck this is one of the things that were bugging me a bit, the other one being https://github.com/apache/datafusion/pull/10982/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40R749-R763 -- both are very repetitive

However, I didn't want to get overeager, only to realize later that the abstraction chosen was not the right one. Perhaps the best way forward would be to address #10952 first (which may also have its own "float16"-like tricky case), and then get the macros for testing, index handling, etc. right. Speaking of that, would you like to take that one? I'd be happy to review it.

Contributor

Sounds like a good plan to me -- let's merge this PR and then work on #10952

Contributor

Filed #11000 to track duplication

Contributor

@tshauck tshauck left a comment

Just a couple comments if we want to go w/ this one around test duplication.

}

#[tokio::test]
async fn test_float_64() {
Contributor

There's a similar test_float64 in this file that could similarly be updated or removed?

Contributor

@marvinlanhenke marvinlanhenke left a comment

@tmi
Thanks, this looks good - especially figuring out how to deal with f16.

@alamb alamb merged commit 268f648 into apache:main Jun 19, 2024
24 checks passed
@alamb
Contributor

alamb commented Jun 19, 2024

Thanks again @tmi @marvinlanhenke and @tshauck

Labels
core Core datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support extracting Float{16, 32, 64} statistics from Parquet Data Pages
4 participants