ARROW-7350: [Python] Decode parquet statistics as scalars #12902

wjones127 · 2022-04-15T22:16:12Z

Thanks to Joris for pointing out we had this function in C++.

github-actions · 2022-04-15T22:16:30Z

https://issues.apache.org/jira/browse/ARROW-7350

python/pyarrow/_parquet.pyx

ursabot · 2022-04-19T02:21:17Z

Benchmark runs are scheduled for baseline = 5b2c0a0 and contender = aa641d5. aa641d5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.46% ⬆️0.13%] test-mac-arm
[Failed ⬇️0.75% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.3% ⬆️0.13%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/528| aa641d51 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/515| aa641d51 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/514| aa641d51 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/525| aa641d51 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/527| 5b2c0a0f ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/514| 5b2c0a0f test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/513| 5b2c0a0f ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/524| 5b2c0a0f ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jorisvandenbossche · 2022-04-19T14:50:37Z

And thanks for the PR @wjones127 !

Unfortunately it seems this one caused some failures in one of the nightly integration builds: https://github.com/ursacomputing/crossbow/runs/6071471966?check_suite_focus=true
The kartothek build is failing with errors that seem related to this (but didn't actually verify this!)

From a quick look at the error message, this might be because we now return a datetime.date object instead of integer for date type?

lidavidm · 2022-04-19T14:59:51Z

Hmm, it sure seems like it. Should we revert this? Or maybe we expose the 'old' value under min and the 'new' value under min_value or min_scalar? (We can use the same impl in both cases, with a special case for dates?)

jorisvandenbossche · 2022-04-19T15:24:32Z

We already have min/max and min_raw/max_raw, so I would like to avoid adding a third one to overcome this backwards compatibility issue ...

It's certainly also an option to see this as a "bug fix" (we shouldn't have used integers for the date type before) and ask kartothek to update their code. It's also for the lesser used date type (not for timestamp), and quite likely the behaviour here now also might have changes for other less used types that were not explicitly handled before (eg I don't how "interval" type would be handled?)

lidavidm · 2022-04-19T15:27:31Z

Hmm, fair. In that case we should test all the data types then just to make the behavior explicit.

wjones127 · 2022-04-19T15:54:53Z

FWIW I considered it a bug that date statistics returned an integer. That's why I added tests for dates and decimals.

Would you like me to add tests for all other types in a new PR?

jorisvandenbossche · 2022-04-19T16:19:49Z

Ah, I missed that you added date already to the tests as well.
Checking the parquet docs, I think the only other logical type that we don't yet explicitly handled is "interval" (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval), but that is also not yet handled by the C++ StatisticsAsScalars, so this won't have changed in behaviour (and in general I don't know the support of mapping our own interval types to parquet interval).

@wjones127 would you want to open an issue with kartothek to warn them about the upcoming change?

jorisvandenbossche · 2022-04-21T09:26:34Z

Thanks for opening that issue!

On the short term, we should also fix our nightly builds (either temporarily disabling them altogether, or ideally on skipping those failing tests). Opened https://issues.apache.org/jira/browse/ARROW-16262 for tracking that

…gration As discussed on #12902 (comment) there is an issue on the Kartothek integration JDASoftwareGroup/kartothek#515 This PR aims to skip the failing tests. Closes #12947 from raulcd/ARROW-16262 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Decode statistics as scalars

8b70feb

github-actions bot added the Component: Python label Apr 15, 2022

lidavidm reviewed Apr 18, 2022

View reviewed changes

python/pyarrow/_parquet.pyx Outdated Show resolved Hide resolved

Check status and use better names

37aded4

wjones127 requested a review from lidavidm April 18, 2022 21:31

lidavidm approved these changes Apr 18, 2022

View reviewed changes

lidavidm closed this in aa641d5 Apr 18, 2022

wjones127 deleted the ARROW-7350-parquet-stats-decode branch April 18, 2022 21:42

wjones127 mentioned this pull request Apr 19, 2022

Parquet stats for dates will no longer return ints in PyArrow 8.0 (upcoming) JDASoftwareGroup/kartothek#515

Open

raulcd mentioned this pull request Apr 21, 2022

ARROW-16262: [CI][Integration] Skip failing tests from kartothek integration #12947

Closed

asfimport mentioned this pull request Apr 25, 2022

[CI] Kartothek nightly integration build is failing because of Parquet statistics date change #31656

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-7350: [Python] Decode parquet statistics as scalars #12902

ARROW-7350: [Python] Decode parquet statistics as scalars #12902

wjones127 commented Apr 15, 2022

github-actions bot commented Apr 15, 2022

ursabot commented Apr 19, 2022

jorisvandenbossche commented Apr 19, 2022

lidavidm commented Apr 19, 2022

jorisvandenbossche commented Apr 19, 2022

lidavidm commented Apr 19, 2022

wjones127 commented Apr 19, 2022

jorisvandenbossche commented Apr 19, 2022

jorisvandenbossche commented Apr 21, 2022

ARROW-7350: [Python] Decode parquet statistics as scalars #12902

ARROW-7350: [Python] Decode parquet statistics as scalars #12902

Conversation

wjones127 commented Apr 15, 2022

github-actions bot commented Apr 15, 2022

ursabot commented Apr 19, 2022

jorisvandenbossche commented Apr 19, 2022

lidavidm commented Apr 19, 2022

jorisvandenbossche commented Apr 19, 2022

lidavidm commented Apr 19, 2022

wjones127 commented Apr 19, 2022

jorisvandenbossche commented Apr 19, 2022

jorisvandenbossche commented Apr 21, 2022