[REVIEW] Expose sort= argument for groupby #5801
@mrocklin @beckernick - Any thoughts/concerns here?
Sure - We can test the behavior of |
I think that a novice user might actually expect that this keyword sorts the entire dataset, rather than sort per-partition.
I don't think that we can trust users to read notes like this. Instead, short term we might raise an informative NotImplementedError. We could also try to do some sorting after-the-fact.
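To make the concern concrete, here is a minimal pure-Python sketch (no dask or pandas involved). It shows that sorting keys inside each partition does not yield a globally sorted result, and what an informative guard could look like. The helper names `sort_each_partition`, `flatten`, and `validate_sort`, and the `per_partition_only` flag, are hypothetical and are not part of this PR or of dask.

```python
def sort_each_partition(partitions):
    """Sort keys inside every partition independently (no data moves between partitions)."""
    return [sorted(part) for part in partitions]


def flatten(partitions):
    """Concatenate partitions in order, as a user would see the full output."""
    return [key for part in partitions for key in part]


def validate_sort(sort, per_partition_only):
    """Hypothetical guard: refuse sort=True when only a per-partition sort is possible."""
    if sort and per_partition_only:
        raise NotImplementedError(
            "sort=True would only order groups within each partition, "
            "not across the whole dataset; pass sort=False instead."
        )
    return sort


partitions = [["d", "b"], ["c", "a"]]
locally_sorted = sort_each_partition(partitions)

print(locally_sorted)  # [['b', 'd'], ['a', 'c']]
# Each partition is ordered, but the concatenated output is not.
print(flatten(locally_sorted) == sorted(flatten(partitions)))  # False
```

Raising early like this trades convenience for clarity: the user gets an explicit message instead of output that silently looks sorted only within partitions.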
Was the |
Right - Still didn't get a chance to revisit/test this. I am just assuming that the output will not always be sorted when |
@mrocklin @TomAugspurger - For now, I am raising an error if the user specifies |
Thanks @rjzamora!
This may address #5441 and cudf#3319, and may be a reasonable alternative to #5450
The idea here is to accept a `sort=` argument for groupby operations, which can be passed along to the final apply phase of apply-concat-apply (ACA) groupby aggregations. Currently, aggregations triggered by `_groupby_aggregate` are hard-coded to use `sort=False` (I assume for performance reasons), while others use the backend's default behavior. Ideally, the (final) sorting behavior should always depend on the argument added here.

Notes:
- `sort=` for aggregations themselves. The approach used here seems easier to implement/maintain, but I am open to feedback.
- Aggregations use `sort=False` for the first apply phase of aggregation (for performance reasons).
- [TODO] Perhaps we should also set `sort=False` for all but the final ACA phase (if the groupby object's `sort` attribute is `True`)
- [TODO] The changes in this PR currently address cudf#3319, but thorough testing still needs to be added.

cc @beckernick (please feel free to advise on downstream needs here)
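As a rough illustration of the idea described above, the following pure-Python sketch models an apply-concat-apply groupby sum in which the per-partition phase never sorts and only the final apply phase honors `sort`. It is not the dask implementation; `chunk_sum` and `aca_groupby_sum` are hypothetical names, and plain dicts and lists of `(key, value)` pairs stand in for DataFrame partitions.

```python
def chunk_sum(rows):
    """Sum values per key, keeping first-seen key order (the analogue of sort=False)."""
    out = {}
    for key, value in rows:
        out[key] = out.get(key, 0) + value
    return out


def aca_groupby_sum(partitions, sort=True):
    """Apply-concat-apply groupby sum; only the final phase looks at `sort`."""
    # Apply: partial sums per partition, never sorted.
    partials = [chunk_sum(part) for part in partitions]
    # Concat: flatten the partial results into one list of (key, partial_sum) rows.
    combined = [item for partial in partials for item in partial.items()]
    # Final apply: aggregate the partials, then sort the group keys if requested.
    result = chunk_sum(combined)
    if sort:
        result = dict(sorted(result.items()))
    return result


partitions = [
    [("b", 1), ("a", 2), ("b", 3)],
    [("c", 4), ("a", 5)],
]

print(aca_groupby_sum(partitions, sort=True))   # {'a': 7, 'b': 4, 'c': 4}
print(aca_groupby_sum(partitions, sort=False))  # {'b': 4, 'a': 7, 'c': 4}
```

Sorting only once, after all partial results have been combined, avoids repeated per-partition sorts whose ordering would be discarded by the final aggregation anyway.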
`black dask` / `flake8 dask`