Clean up `percentiles_summary` logic #11094

rjzamora · 2024-05-03T18:11:03Z

This is essentially a revision of #10551, which generalized the partition-wise quantiles logic used in percentiles_summary to work with both cudf and pandas-backed data. It turns out that those changes do not quite cover the case where a numerical column contains null values.

An additional problem with the current logic is that error handling for the .values operation is mixed with error handling for the quantile operation itself. This PR separates these steps a bit to make the logic less confusing.

Overview of the partition-wise quantiles logic after this PR:

1. We try to use the Series.quantile method (without calling .values yet).
2. If (1) fails, we use the more-robust DataFrame.quantile(..., method="table") method if pandas>=1.5.
3. If (1) fails and pandas<1.5, we fall back to _percentile (this entire code path can be removed when pandas>=2 is required).
4. After (1), (2) or (3) has succeeded, we try converting the result to an array with .values. Since the cudf->cupy conversion will fail for string columns and/or null values, we simply pass if this step fails.
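
For illustration, the steps above can be sketched roughly as follows. This is a minimal, pandas-only sketch and not the code in this PR: the helper name `partition_quantiles` is hypothetical, it assumes pandas>=1.5 (so the `_percentile` fallback of step 3 is omitted), and it fixes `interpolation="nearest"` for simplicity.

```python
import pandas as pd


def partition_quantiles(data: pd.Series, qs):
    """Hypothetical helper sketching the fallback order (not the dask code)."""
    try:
        # Step 1: plain Series.quantile, without calling .values yet
        vals = data.quantile(q=qs, interpolation="nearest")
    except (TypeError, ValueError, NotImplementedError):
        # Step 2: table-wise quantile, which handles more dtypes
        # (assumes pandas>=1.5, where method="table" is available)
        vals = (
            data.to_frame()
            .quantile(
                q=qs,
                interpolation="nearest",
                numeric_only=False,
                method="table",
            )
            .iloc[:, 0]
        )

    # Step 4: convert to an array if possible; with cudf this conversion
    # can fail for strings or nulls, in which case the Series is kept
    try:
        vals = vals.values
    except (ValueError, TypeError):
        pass
    return vals


result = partition_quantiles(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]), [0.0, 0.5, 1.0])
```

The point of the structure is that a failure in the quantile computation and a failure in the array conversion are handled by two separate `try` blocks, so one no longer masks the other.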

NOTE: It is not clear to me why we need to call .values here at all. I believe NEP18 covers all necessary Series logic anyway. Perhaps we can remove this step?

rjzamora · 2024-05-03T18:12:38Z

dask/dataframe/partitionquantiles.py

-            # `data.values` doesn't work for cudf, so we need to
-            # use `quantile(..., method="table")` as a fallback
+        # Series.quantile doesn't work with some data types (e.g. strings)
+        if PANDAS_GE_150:


Hopefully this will ensure that the unnecessary _percentile code path gets removed when pandas<1.5 is no longer supported.

rjzamora · 2024-05-03T18:14:19Z

dask/dataframe/partitionquantiles.py

+    # Convert to array if necessary (and possible)
+    if is_series_like(vals):
+        try:
+            vals = vals.values
+        except (ValueError, TypeError):
+            # cudf->cupy won't work if nulls are present,
+            # or if this is a string dtype
+            pass


My sense is that this entire code block can/should be removed for the sake of simplicity. However, I'm currently leaving the .values operation in place to keep the scope of this PR small.

github-actions · 2024-05-03T18:43:06Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 15 suites ±0 3h 24m 15s ⏱️ +39s
13 121 tests ±0 12 190 ✅ ±0 931 💤 ±0 0 ❌ ±0
162 468 runs ±0 142 412 ✅ ±0 20 056 💤 ±0 0 ❌ ±0

Results for commit 4f29798. ± Comparison against base commit efb4a62.

phofl · 2024-05-06T09:44:38Z

thx

align and simplify percentiles_summary logic

4f29798

rjzamora added dataframe enhancement Improve existing functionality or make things work better labels May 3, 2024

rjzamora self-assigned this May 3, 2024

phofl approved these changes May 6, 2024

phofl merged commit bc6f42b into dask:main May 6, 2024
27 of 28 checks passed

rjzamora deleted the revise-percentiles_summary branch May 6, 2024 14:52
