Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up percentiles_summary logic #11094

Merged
merged 1 commit into from
May 6, 2024

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented May 3, 2024

This is essentially a revision of #10551, which generalized the partition-wise quantiles logic used in percentiles_summary to work with both cudf and pandas-backed data. It turns out that those changes do not quite cover the case that a numerical column contains null values.

An additional problem with the current logic is that error handling for the .values operation is mixed with error handling for the quantile operation itself. This PR separates these steps a bit to make the logic less confusing.

Overview of the partition-wise quantiles logic after this PR:

  1. We try to use the Series.quantile method (without calling .values yet)
  2. If (1) fails, we use the more-robust DataFrame.quantile(..., method="table") method as if pandas>=1.5
  3. If (1) fails and pandas<1.5, we fall back to percentile (this entire code path can be removed when pandas2+ is required)
  4. After (1), (2) or (3) has succeeded, we try converting the result to an array with .values. Since cudf->cupy will fail to do this for string columns and/or null values, we simply pass if this step fails.
  • NOTE: It is not clear to me why we need to call .values here at all. I believe NEP18 covers all necessary Series logic anyway. Perhaps we can remove this step?

@rjzamora rjzamora added dataframe enhancement Improve existing functionality or make things work better labels May 3, 2024
@rjzamora rjzamora self-assigned this May 3, 2024
# `data.values` doesn't work for cudf, so we need to
# use `quantile(..., method="table")` as a fallback
# Series.quantile doesn't work with some data types (e.g. strings)
if PANDAS_GE_150:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully this will ensure that the unnecessary _percentile code path gets removed when pandas<1.5 is no longer supported.

Comment on lines +457 to +464
# Convert to array if necessary (and possible)
if is_series_like(vals):
try:
vals = vals.values
except (ValueError, TypeError):
# cudf->cupy won't work if nulls are present,
# or if this is a string dtype
pass
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My sense is that this entire code block can/should be removed for the sake of simplicity. However, I'm currently leaving the .values operation in place to keep the scope if this PR small.

Copy link
Contributor

github-actions bot commented May 3, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     15 files  ±0       15 suites  ±0   3h 24m 15s ⏱️ +39s
 13 121 tests ±0   12 190 ✅ ±0     931 💤 ±0  0 ❌ ±0 
162 468 runs  ±0  142 412 ✅ ±0  20 056 💤 ±0  0 ❌ ±0 

Results for commit 4f29798. ± Comparison against base commit efb4a62.

@phofl phofl merged commit bc6f42b into dask:main May 6, 2024
27 of 28 checks passed
@phofl
Copy link
Collaborator

phofl commented May 6, 2024

thx

@rjzamora rjzamora deleted the revise-percentiles_summary branch May 6, 2024 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dataframe enhancement Improve existing functionality or make things work better
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants