New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Adds sort_values to dask.DataFrame #7286
Conversation
This seems like a reasonable solution to me and I feel like we are open to an implementation of
I'd prefer that this much duplication be abstracted away a bit.
We generally use pre-commit to run black. You can read more about how that works here: https://docs.dask.org/en/latest/develop.html#code-formatting |
Thanks for the feedback - I'll get this cleaned up. |
@jsignell I think this is good to go, let me know if there are changes I should make. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some lingering issues to be resolved, but this PR is coming along nicely!
dask/dataframe/core.py
Outdated
divisions: list, optional | ||
Known values on which to separate index values of the partitions. | ||
See https://docs.dask.org/en/latest/dataframe-design.html#partitions | ||
Defaults to computing this with a single pass over the data. Note | ||
that if ``sorted=True``, specified divisions are assumed to match | ||
the existing partitions in the data. If ``sorted=False``, you should | ||
leave divisions empty and call ``repartition`` after ``set_index``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This docstring looks copied from set_index
and I'm not sure how much of it applies to sort_values
. In particular, it is probably safe to assume that the column is not sorted, so I don't think divisions
should even be an option in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - you're right here. I was copying the interface from set_index
and this doesn't make sense, removing.
dask/dataframe/shuffle.py
Outdated
divisions = mins + [maxes[-1]] | ||
return df.map_partitions(M.sort_values, value) | ||
df = rearrange_by_divisions(df, value, divisions) | ||
df.divisions = divisions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you want to set df.divisions
to divisions
. The output of _calculate_division
is the divisions that would be on the sort_by_col
if it were the index, but in the case of sort_by_values
the column does not become the index, so the divisions in this method are not equivalent to the divisions on the resultant dataframe.
dask/dataframe/tests/test_shuffle.py
Outdated
tm.assert_frame_equal( | ||
ddf.sort_values("a").compute().reset_index(drop=True), | ||
df.sort_values("a").reset_index(drop=True), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's preferable to use assert_eq
like the rest of the tests in this file do. That helper function checks properties of the dask dataframe before compute
is called. By changing this test to use assert_eq
you'll see the divisions issue that I mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! I see - thanks!
This should be in a better state now - thanks for catching those oversights. |
cc @rjzamora (in case this if of interest 🙂) |
Seems like the logical way to handle single-column Dask-CuDF actually does the same exact thing to support both single- and multi-column |
Ah that would be a nice extension. |
Thanks for the PR @gerrymanoim! |
and npartitions == df.npartitions | ||
): | ||
# divisions are in the right place | ||
divisions = mins + [maxes[-1]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just noticing while rebasing #7214: I think this line is dead code.
Not sure if this value was supposed to go anywhere, from skimming the review comments, it seems like it was originally passed to something but is no longer (which is correct / as it should be), so this line can just be removed.
partition_size=128e6, | ||
**kwargs, | ||
): | ||
""" See _Frame.sort_values for docstring """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it ended up as DataFrame.sort_values
, though moving to _Frame
does sound like it should "just work" and would be beneficial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually Series.sort_values()
needs a different API than DataFrame.sort_values
(former does not take a by
column name), so someone can add it in another PR; I just updated this comment in #7462
black dask
/flake8 dask
. Running black oncore.py
andshuffle.py
caused large diffs - maybe I have something misconfigured?As suggested #958 (comment) and #2367, I've pretty much reused the code in
set_index
and don't mess with the underlying bits. Shared code betweenset_index
andsort_values
is in_calculate_divisions
.Implementation caveats:
ascending=True
is supportedCloses #958