Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Adds sort_values to dask.DataFrame #7286

Merged
merged 2 commits into from Mar 17, 2021
Merged

ENH: Adds sort_values to dask.DataFrame #7286

merged 2 commits into from Mar 17, 2021

Conversation

gerrymanoim
Copy link
Contributor

@gerrymanoim gerrymanoim commented Feb 26, 2021

  • Closes add support for .sort_values #958
  • Tests added / passed
  • Passes black dask / flake8 dask. Running black on core.py and shuffle.py caused large diffs - maybe I have something misconfigured?

As suggested #958 (comment) and #2367, I've pretty much reused the code in set_index and don't mess with the underlying bits. Shared code between set_index and sort_values is in _calculate_divisions.

Implementation caveats:

  • only single column sorting is supported
  • only ascending=True is supported

Closes #958

@jsignell
Copy link
Member

jsignell commented Mar 1, 2021

This seems like a reasonable solution to me and I feel like we are open to an implementation of sort_values

code duplication

I'd prefer that this much duplication be abstracted away a bit.

Running black on core.py and shuffle.py caused large diffs - maybe I have something misconfigured?

We generally use pre-commit to run black. You can read more about how that works here: https://docs.dask.org/en/latest/develop.html#code-formatting

@gerrymanoim
Copy link
Contributor Author

Thanks for the feedback - I'll get this cleaned up.

@gerrymanoim gerrymanoim changed the title [WIP] ENH: Adds sort_values to dask.DataFrame ENH: Adds sort_values to dask.DataFrame Mar 6, 2021
@gerrymanoim
Copy link
Contributor Author

@jsignell I think this is good to go, let me know if there are changes I should make.

Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some lingering issues to be resolved, but this PR is coming along nicely!

dask/dataframe/tests/test_shuffle.py Outdated Show resolved Hide resolved
dask/dataframe/shuffle.py Outdated Show resolved Hide resolved
dask/dataframe/core.py Outdated Show resolved Hide resolved
Comment on lines 3829 to 3835
divisions: list, optional
Known values on which to separate index values of the partitions.
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
Defaults to computing this with a single pass over the data. Note
that if ``sorted=True``, specified divisions are assumed to match
the existing partitions in the data. If ``sorted=False``, you should
leave divisions empty and call ``repartition`` after ``set_index``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring looks copied from set_index and I'm not sure how much of it applies to sort_values. In particular, it is probably safe to assume that the column is not sorted, so I don't think divisions should even be an option in this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - you're right here. I was copying the interface from set_index and this doesn't make sense, removing.

divisions = mins + [maxes[-1]]
return df.map_partitions(M.sort_values, value)
df = rearrange_by_divisions(df, value, divisions)
df.divisions = divisions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you want to set df.divisions to divisions. The output of _calculate_division is the divisions that would be on the sort_by_col if it were the index, but in the case of sort_by_values the column does not become the index, so the divisions in this method are not equivalent to the divisions on the resultant dataframe.

Comment on lines 1146 to 1149
tm.assert_frame_equal(
ddf.sort_values("a").compute().reset_index(drop=True),
df.sort_values("a").reset_index(drop=True),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's preferable to use assert_eq like the rest of the tests in this file do. That helper function checks properties of the dask dataframe before compute is called. By changing this test to use assert_eq you'll see the divisions issue that I mentioned above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I see - thanks!

Base automatically changed from master to main March 8, 2021 20:20
@gerrymanoim
Copy link
Contributor Author

This should be in a better state now - thanks for catching those oversights.

@jakirkham
Copy link
Member

cc @rjzamora (in case this if of interest 🙂)

@rjzamora
Copy link
Member

cc @rjzamora (in case this if of interest 🙂)

Seems like the logical way to handle single-column sort_values - Thanks for this @gerrymanoim!

Dask-CuDF actually does the same exact thing to support both single- and multi-column sort_values. However, this is only possible with cudf, because the divisions calculation stage is performed with cudf.DataFrame.quantiles (which returns a single quantile column for the entire dataframe input, rather than independent quatiles for each column).

@gerrymanoim
Copy link
Contributor Author

which returns a single quantile column for the entire dataframe input, rather than independent quatiles for each column

Ah that would be a nice extension.

@jsignell jsignell merged commit 8da02d0 into dask:main Mar 17, 2021
@jsignell
Copy link
Member

Thanks for the PR @gerrymanoim!

@gerrymanoim gerrymanoim deleted the sort-values branch March 17, 2021 16:42
and npartitions == df.npartitions
):
# divisions are in the right place
divisions = mins + [maxes[-1]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noticing while rebasing #7214: I think this line is dead code.

Not sure if this value was supposed to go anywhere, from skimming the review comments, it seems like it was originally passed to something but is no longer (which is correct / as it should be), so this line can just be removed.

partition_size=128e6,
**kwargs,
):
""" See _Frame.sort_values for docstring """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it ended up as DataFrame.sort_values, though moving to _Frame does sound like it should "just work" and would be beneficial

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually Series.sort_values() needs a different API than DataFrame.sort_values (former does not take a by column name), so someone can add it in another PR; I just updated this comment in #7462

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add support for .sort_values
5 participants