Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-7350: [Python] Decode parquet statistics as scalars #12902

Closed

Conversation

wjones127
Copy link
Member

Thanks to Joris for pointing out we had this function in C++.

@github-actions
Copy link

@wjones127 wjones127 requested a review from lidavidm April 18, 2022 21:31
@lidavidm lidavidm closed this in aa641d5 Apr 18, 2022
@wjones127 wjones127 deleted the ARROW-7350-parquet-stats-decode branch April 18, 2022 21:42
@ursabot
Copy link

ursabot commented Apr 19, 2022

Benchmark runs are scheduled for baseline = 5b2c0a0 and contender = aa641d5. aa641d5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.46% ⬆️0.13%] test-mac-arm
[Failed ⬇️0.75% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.3% ⬆️0.13%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/528| aa641d51 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/515| aa641d51 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/514| aa641d51 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/525| aa641d51 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/527| 5b2c0a0f ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/514| 5b2c0a0f test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/513| 5b2c0a0f ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/524| 5b2c0a0f ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jorisvandenbossche
Copy link
Member

And thanks for the PR @wjones127 !

Unfortunately it seems this one caused some failures in one of the nightly integration builds: https://github.com/ursacomputing/crossbow/runs/6071471966?check_suite_focus=true
The kartothek build is failing with errors that seem related to this (but didn't actually verify this!)

From a quick look at the error message, this might be because we now return a datetime.date object instead of integer for date type?

@lidavidm
Copy link
Member

Hmm, it sure seems like it. Should we revert this? Or maybe we expose the 'old' value under min and the 'new' value under min_value or min_scalar? (We can use the same impl in both cases, with a special case for dates?)

@jorisvandenbossche
Copy link
Member

We already have min/max and min_raw/max_raw, so I would like to avoid adding a third one to overcome this backwards compatibility issue ...

It's certainly also an option to see this as a "bug fix" (we shouldn't have used integers for the date type before) and ask kartothek to update their code. It's also for the lesser used date type (not for timestamp), and quite likely the behaviour here now also might have changes for other less used types that were not explicitly handled before (eg I don't how "interval" type would be handled?)

@lidavidm
Copy link
Member

Hmm, fair. In that case we should test all the data types then just to make the behavior explicit.

@wjones127
Copy link
Member Author

FWIW I considered it a bug that date statistics returned an integer. That's why I added tests for dates and decimals.

Would you like me to add tests for all other types in a new PR?

@jorisvandenbossche
Copy link
Member

Ah, I missed that you added date already to the tests as well.
Checking the parquet docs, I think the only other logical type that we don't yet explicitly handled is "interval" (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval), but that is also not yet handled by the C++ StatisticsAsScalars, so this won't have changed in behaviour (and in general I don't know the support of mapping our own interval types to parquet interval).

@wjones127 would you want to open an issue with kartothek to warn them about the upcoming change?

@jorisvandenbossche
Copy link
Member

Thanks for opening that issue!

On the short term, we should also fix our nightly builds (either temporarily disabling them altogether, or ideally on skipping those failing tests). Opened https://issues.apache.org/jira/browse/ARROW-16262 for tracking that

jorisvandenbossche pushed a commit that referenced this pull request Apr 22, 2022
…gration

As discussed on #12902 (comment) there is an issue on the Kartothek integration JDASoftwareGroup/kartothek#515
This PR aims to skip the failing tests.

Closes #12947 from raulcd/ARROW-16262

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants