ENH: Adds sort_values to dask.DataFrame #7286

gerrymanoim · 2021-02-26T22:18:09Z

Closes add support for .sort_values #958
Tests added / passed
Passes black dask / flake8 dask. Running black on core.py and shuffle.py caused large diffs - maybe I have something misconfigured?

As suggested in #958 (comment) and #2367, I've mostly reused the code in set_index and left the underlying bits alone. Shared code between set_index and sort_values is in _calculate_divisions.

Implementation caveats:

only single column sorting is supported
only ascending=True is supported
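
The approach described above (compute divisions for the sort column, move rows into the matching partitions, then sort each partition) can be sketched in plain Python. This is only an illustration of the idea, not the dask implementation: the helpers `calculate_divisions` and `sort_partitions` are hypothetical, partitions are modeled as lists of numbers, and the divisions come from an exact sort of all values rather than approximate quantiles. It assumes non-empty input.

```python
from bisect import bisect_right


def calculate_divisions(partitions, npartitions):
    """Pick npartitions + 1 boundary values from the data (hypothetical helper)."""
    values = sorted(v for part in partitions for v in part)
    # Evenly spaced positions stand in for approximate quantiles.
    step = (len(values) - 1) / npartitions
    return [values[round(i * step)] for i in range(npartitions + 1)]


def sort_partitions(partitions, npartitions=None):
    """Range-partition values by divisions, then sort each output partition."""
    npartitions = npartitions or len(partitions)
    divisions = calculate_divisions(partitions, npartitions)
    out = [[] for _ in range(npartitions)]
    for part in partitions:
        for v in part:
            # Only the interior boundaries decide the target partition.
            target = bisect_right(divisions[1:-1], v)
            out[target].append(v)
    # Ascending order only, mirroring the caveat above.
    return [sorted(part) for part in out]


parts = [[5, 1, 9], [3, 8, 2], [7, 4, 6]]
result = sort_partitions(parts)
flat = [v for part in result for v in part]
```

Because every value is routed by a monotone rule and each output partition is sorted locally, concatenating the partitions in order gives a globally sorted sequence.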

Closes #958

jsignell · 2021-03-01T20:50:05Z

This seems like a reasonable solution to me and I feel like we are open to an implementation of sort_values

code duplication

I'd prefer that this much duplication be abstracted away a bit.

Running black on core.py and shuffle.py caused large diffs - maybe I have something misconfigured?

We generally use pre-commit to run black. You can read more about how that works here: https://docs.dask.org/en/latest/develop.html#code-formatting

gerrymanoim · 2021-03-01T20:53:23Z

Thanks for the feedback - I'll get this cleaned up.

gerrymanoim · 2021-03-06T21:18:05Z

@jsignell I think this is good to go, let me know if there are changes I should make.

jsignell

There are some lingering issues to be resolved, but this PR is coming along nicely!

dask/dataframe/tests/test_shuffle.py

dask/dataframe/shuffle.py

dask/dataframe/core.py

jsignell · 2021-03-08T14:52:01Z

dask/dataframe/core.py

+        divisions: list, optional
+            Known values on which to separate index values of the partitions.
+            See https://docs.dask.org/en/latest/dataframe-design.html#partitions
+            Defaults to computing this with a single pass over the data. Note
+            that if ``sorted=True``, specified divisions are assumed to match
+            the existing partitions in the data. If ``sorted=False``, you should
+            leave divisions empty and call ``repartition`` after ``set_index``.


This docstring looks copied from set_index and I'm not sure how much of it applies to sort_values. In particular, it is probably safe to assume that the column is not sorted, so I don't think divisions should even be an option in this method.

Yeah - you're right here. I was copying the interface from set_index and this doesn't make sense, removing.

jsignell · 2021-03-08T15:11:22Z

dask/dataframe/shuffle.py

+            divisions = mins + [maxes[-1]]
+            return df.map_partitions(M.sort_values, value)
+    df = rearrange_by_divisions(df, value, divisions)
+    df.divisions = divisions


I don't think you want to set df.divisions to divisions. The output of _calculate_divisions is the divisions that the sort column would have if it were the index, but in sort_values the column does not become the index, so the divisions computed in this method are not equivalent to the divisions on the resultant dataframe.
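
To illustrate the point about divisions: they describe index values at partition boundaries, and sorting by a column leaves the index in arbitrary order. A small plain-Python sketch, with rows modeled as hypothetical `(index, a)` tuples rather than a real dask dataframe:

```python
# Rows are (index, a) pairs; both names are illustrative only.
rows = [(0, 30), (1, 10), (2, 40), (3, 20)]

# Sort by column "a" and split into two partitions, as a sort would.
by_a = sorted(rows, key=lambda r: r[1])
partitions = [by_a[:2], by_a[2:]]

# Boundaries computed on column "a" (what a divisions helper would return).
column_divisions = [
    partitions[0][0][1],
    partitions[1][0][1],
    partitions[1][-1][1],
]

# Index values actually present in each partition after the sort.
index_per_partition = [[idx for idx, _ in part] for part in partitions]
```

Here the column boundaries are 10, 30 and 40, while the index values end up split as [1, 3] and [0, 2]. The boundaries say nothing about where a given index value lives, so they cannot serve as the divisions of the result.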

jsignell · 2021-03-08T15:14:07Z

dask/dataframe/tests/test_shuffle.py

+    tm.assert_frame_equal(
+        ddf.sort_values("a").compute().reset_index(drop=True),
+        df.sort_values("a").reset_index(drop=True),
+    )


It's preferable to use assert_eq like the rest of the tests in this file do. That helper function checks properties of the dask dataframe before compute is called. By changing this test to use assert_eq you'll see the divisions issue that I mentioned above.

Ah! I see - thanks!

gerrymanoim · 2021-03-09T06:53:01Z

This should be in a better state now - thanks for catching those oversights.

jakirkham · 2021-03-15T17:35:57Z

cc @rjzamora (in case this is of interest 🙂)

rjzamora · 2021-03-15T18:10:47Z

cc @rjzamora (in case this is of interest 🙂)

Seems like the logical way to handle single-column sort_values - Thanks for this @gerrymanoim!

Dask-CuDF actually does the exact same thing to support both single- and multi-column sort_values. However, this is only possible with cudf, because the divisions calculation stage is performed with cudf.DataFrame.quantiles (which returns a single quantile column for the entire dataframe input, rather than independent quantiles for each column).
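
The distinction between one set of row-wise quantiles and independent per-column quantiles can be shown with a plain-Python sketch. It does not call cudf; rows are tuples compared lexicographically, and `row_quantiles` and `column_quantiles` are hypothetical helpers written only for this illustration.

```python
def row_quantiles(rows, n):
    """n + 1 boundary rows taken from the lexicographically sorted rows."""
    ordered = sorted(rows)
    step = (len(ordered) - 1) / n
    return [ordered[round(i * step)] for i in range(n + 1)]


def column_quantiles(rows, n):
    """Boundaries computed separately per column, then zipped back together."""
    per_column = [row_quantiles([(v,) for v in col], n) for col in zip(*rows)]
    return [tuple(q[0] for q in qs) for qs in zip(*per_column)]


rows = [(1, 9), (1, 2), (2, 5), (2, 1), (3, 7)]
row_divs = row_quantiles(rows, 2)
col_divs = column_quantiles(rows, 2)
```

With this data the row-wise boundaries are actual rows in sorted order, whereas the per-column version produces combinations such as (1, 1) and (3, 9) that never occur in the input, which is why per-column quantiles are not a drop-in source of multi-column divisions.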

gerrymanoim · 2021-03-15T19:12:46Z

which returns a single quantile column for the entire dataframe input, rather than independent quantiles for each column

Ah that would be a nice extension.

jsignell · 2021-03-17T15:46:10Z

Thanks for the PR @gerrymanoim!

ryan-williams · 2021-03-24T02:44:05Z

dask/dataframe/shuffle.py

+        and npartitions == df.npartitions
+    ):
+        # divisions are in the right place
+        divisions = mins + [maxes[-1]]


just noticing while rebasing #7214: I think this line is dead code.

Not sure if this value was supposed to go anywhere, from skimming the review comments, it seems like it was originally passed to something but is no longer (which is correct / as it should be), so this line can just be removed.

ryan-williams · 2021-03-24T02:47:00Z

dask/dataframe/shuffle.py

+    partition_size=128e6,
+    **kwargs,
+):
+    """ See _Frame.sort_values for docstring """


I think it ended up as DataFrame.sort_values, though moving to _Frame does sound like it should "just work" and would be beneficial

actually Series.sort_values() needs a different API than DataFrame.sort_values (former does not take a by column name), so someone can add it in another PR; I just updated this comment in #7462

ENH: Add sort_values method to dask.dataframe

f883524

gerrymanoim changed the title ~~[WIP] ENH: Adds sort_values to dask.DataFrame~~ ENH: Adds sort_values to dask.DataFrame Mar 6, 2021

jsignell reviewed Mar 8, 2021


Base automatically changed from master to main March 8, 2021 20:20

ENH: Add sort_values method to dask.dataframe

e374f26

jsignell merged commit 8da02d0 into dask:main Mar 17, 2021

gerrymanoim deleted the sort-values branch March 17, 2021 16:42

ryan-williams reviewed Mar 24, 2021


ryan-williams mentioned this pull request Mar 24, 2021

minor sort_values housekeeping #7462

Merged

3 tasks
