Clean up `percentiles_summary` logic #11094
# `data.values` doesn't work for cudf, so we need to
# use `quantile(..., method="table")` as a fallback
# Series.quantile doesn't work with some data types (e.g. strings)
if PANDAS_GE_150:
Hopefully this will ensure that the unnecessary `_percentile` code path gets removed when `pandas<1.5` is no longer supported.
# Convert to array if necessary (and possible)
if is_series_like(vals):
    try:
        vals = vals.values
    except (ValueError, TypeError):
        # cudf->cupy won't work if nulls are present,
        # or if this is a string dtype
        pass
My sense is that this entire code block can/should be removed for the sake of simplicity. However, I'm currently leaving the `.values` operation in place to keep the scope of this PR small.
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0, 15 suites ±0, 3h 24m 15s ⏱️ +39s
Results for commit 4f29798. ± Comparison against base commit efb4a62.
thx
This is essentially a revision of #10551, which generalized the partition-wise `quantiles` logic used in `percentiles_summary` to work with both `cudf` and `pandas`-backed data. It turns out that those changes do not quite cover the case that a numerical column contains null values.

An additional problem with the current logic is that error handling for the `.values` operation is mixed with error handling for the `quantile` operation itself. This PR separates these steps a bit to make the logic less confusing.

Overview of the partition-wise `quantiles` logic after this PR:

- First, try the `Series.quantile` method (without calling `.values` yet).
- If that fails, use the `DataFrame.quantile(..., method="table")` method as a fallback if `pandas>=1.5`.
  - For `pandas<1.5`, we fall back to `percentile` (this entire code path can be removed when pandas 2+ is required).
- Finally, convert the result to an array with `.values`. Since `cudf->cupy` will fail to do this for string columns and/or null values, we simply `pass` if this step fails.
  - It is not clear that we need `.values` here at all. I believe NEP18 covers all necessary `Series` logic anyway. Perhaps we can remove this step?
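The steps described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper name `partition_quantiles` is hypothetical, the sketch assumes `pandas>=1.5` (so the `pandas<1.5` branch is omitted), and it uses plain `pandas` rather than `cudf`. The point is that the `quantile` call and the `.values` conversion each get their own `try`/`except`.

```python
import numpy as np
import pandas as pd


def partition_quantiles(data: pd.Series, qs: np.ndarray):
    """Hypothetical helper: quantiles of one partition, qs given in [0, 1]."""
    try:
        # Step 1: try Series.quantile first, without calling .values
        vals = data.quantile(q=qs, interpolation="linear")
    except (TypeError, NotImplementedError):
        # Step 2: fallback for dtypes that Series.quantile may reject
        # (e.g. strings). Assumes pandas>=1.5, where method="table" exists;
        # "table" requires a non-interpolating mode such as "nearest".
        vals = (
            data.to_frame()
            .quantile(
                q=qs,
                interpolation="nearest",
                numeric_only=False,
                method="table",
            )
            .iloc[:, 0]
        )

    # Step 3: array conversion is handled separately from the quantile call,
    # so a failure here is not confused with a failure above.
    try:
        vals = vals.values
    except (ValueError, TypeError):
        # With cudf this conversion can fail (nulls, strings); keep the Series
        pass
    return vals
```

For a numeric column with a null value, such as `pd.Series([1.0, None, 3.0])`, the first step succeeds on its own because `Series.quantile` skips missing values, so the fallback is never reached.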