Skip to content

[REVIEW] Expose sort= argument for groupby#5801

Merged
TomAugspurger merged 9 commits intodask:masterfrom
rjzamora:groupby-sort
Feb 5, 2020
Merged

[REVIEW] Expose sort= argument for groupby#5801
TomAugspurger merged 9 commits intodask:masterfrom
rjzamora:groupby-sort

Conversation

@rjzamora
Copy link
Copy Markdown
Member

@rjzamora rjzamora commented Jan 17, 2020

This may address #5441 and cudf#3319, and may be a reasonable alternative to #5450

The idea here is to accept a sort= argument for groupby operations, which can be passed along the final apply phase of apply-concat-apply groupby aggregations. Currently, aggregations triggered by _groupby_aggregate are hard-coded to use sort=False (I assume for performance reasons), while others us the backend's default behavior. Ideally, the (final) sorting behavior should always depend on the argument added here.

Notes:

  • Support sort= keyword in some groupby-aggregations #5450 achieves something similar by exposing sort= for aggregations themselves. The approach used here seems easier to implement/maintain, but I am open to feedback.
  • [TODO] Perhaps we should also set sort=False for the first apply phase of aggregation (for performace reasons). Aggregations use sort=False for all but the final ACA phase (if the groupby object's sort attribute is True)
  • [TODO] The changes in this PR currently address cudf#3319, but thorough testing still needs to be added.

cc @beckernick (please feel free to advise on downstream needs here)

  • Tests added / passed
  • Passes black dask / flake8 dask

@rjzamora rjzamora changed the title [WIP] Expose sort= argument for groupby [REVIEW] Expose sort= argument for groupby Jan 22, 2020
@rjzamora
Copy link
Copy Markdown
Member Author

@mrocklin @beckernick - Any thoughts/concerns here?

Copy link
Copy Markdown
Member

@beckernick beckernick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally LGTM, but I'm a bit concerned about how this would interact with split_out. It would be useful to explicitly test a few multi-partition output scenarios

@rjzamora
Copy link
Copy Markdown
Member Author

This generally LGTM, but I'm a bit concerned about how this would interact with split_out. It would be useful to explicitly test a few multi-partition output scenarios

Sure - We can test the behavior of plit_out, but the expected behavior would be that sort=True/False would only affect the order of the groupby keys within each of the split_out partitions (because the kwarg would only be passed for the final aggregation after the data is already split between partitions). Perhaps we should add a note/comment to this effect?

@mrocklin
Copy link
Copy Markdown
Member

but the expected behavior would be that sort=True/False would only affect the order of the groupby keys within each of the split_out partitions

I think that a novice user might actually expect that this keyword sorts the entire dataset, rather than sort per-partition.

Perhaps we should add a note/comment to this effect?

I don't think that we can trust users to read notes like this.

Instead, short term we might raise an informative NotImplementedError. We could also try to do some sorting after-the-fact.

@TomAugspurger
Copy link
Copy Markdown
Member

Was the split_out issue resolved? I didn't quite follow the disagreement. Is the issue that you wouldn't necessarily get sorted output with sort=True?

@rjzamora
Copy link
Copy Markdown
Member Author

rjzamora commented Feb 4, 2020

Was the split_out issue resolved? I didn't quite follow the disagreement. Is the issue that you wouldn't necessarily get sorted output with sort=True?

Right - Still didn't get a chance to revisit/test this. I am just assuming that the output will not always be sorted when split_out>1. The discussion was just about how this information should be conveyed to the user (comment, error), or if more work is needed to make sure the output is actually sorted.

@rjzamora
Copy link
Copy Markdown
Member Author

rjzamora commented Feb 5, 2020

@mrocklin @TomAugspurger - For now, I am raising an error if the user specifies sort=True for the groupby, and then split_out>1 for an aggregation. I would like to revisit this and make sorting work accross split_out partitions, but I'd like to get the current version merged in case I don't have time in the near future.

@TomAugspurger
Copy link
Copy Markdown
Member

Thanks @rjzamora!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants