Add support for observed= in groupby operations #4385
Conversation
Thanks for the PR. I'm a little concerned about the need to store the ._kwargs
and pass it through everywhere. I was hoping the change could be a bit more localized, but I haven't thought of a way better than how you've done it. I'll think on it a bit more.
Happy to rework it if you find a better alternative :)
dask/dataframe/tests/test_groupby.py
Outdated
```python
result_f = getattr(result_g, agg_func)

expected = expected_f()
result = result_f().compute()
```
Remove `.compute()`; `assert_eq` will do that.
I needed it; otherwise `assert_eq` fails. It boils down to dask not really supporting MultiIndex.
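A minimal pandas-only sketch of the issue behind this comment (the data here is invented for illustration): a two-key groupby produces a MultiIndex result, and dask divisions cannot describe a MultiIndex, which is why the test computes eagerly before comparing.

```python
import pandas as pd

# Hypothetical data, just to show the shape of the result.
df = pd.DataFrame({"c1": ["x", "x", "y"],
                   "c2": [1, 2, 2],
                   "c3": [10, 20, 30]})

# Grouping on two keys yields a MultiIndex; dask cannot represent this
# in its divisions, so the dask result is computed before comparison.
result = df.groupby(["c1", "c2"])["c3"].sum()
assert isinstance(result.index, pd.MultiIndex)
```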
dask/dataframe/tests/test_groupby.py
Outdated
```python
expected = expected_f()
result = result_f().compute()

# when applying groupby(['c1', 'c2'])(['c3']).sum(), pandas returns nans
```
And dask returns 0 here? I'm not sure what's expected here, though the pandas behavior seems buggy. I would expect pandas to follow `min_count=0` and return 0, but I don't recall if we special-cased unobserved categories.
This is the same bug as above, resolved in pandas 0.24.
dask/dataframe/groupby.py
Outdated
```diff
@@ -923,8 +1003,41 @@ def count(self, split_every=None, split_out=1):

     @derived_from(pd.core.groupby.GroupBy)
     def mean(self, split_every=None, split_out=1):
         return (self.sum(split_every=split_every, split_out=split_out) /
```
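The line being discussed derives the groupby mean from sum and count, both of which reduce cleanly across partitions. A quick pandas check of that identity (toy data, not from the PR):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, 2.0, 4.0]})
grouped = df.groupby("g")["v"]

# mean == sum / count per group: this is what lets dask compute a groupby
# mean from two tree-reducible aggregations instead of a dedicated one.
manual_mean = grouped.sum() / grouped.count()
assert manual_mean.equals(grouped.mean())
```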
Does this not work anymore?
I am starting to believe that it doesn't make much sense to support the `observed=` keyword if dask does not support MultiIndex in the index. What's your opinion, @TomAugspurger?
This is still useful for results that have a MultiIndex, I think:

```python
In [2]: df = pd.DataFrame({"A": [1, 2, 3, 4],
   ...:                    "B": pd.Categorical(['a', 'a', 'b', 'b'],
   ...:                                        categories=['a', 'b', 'c'])})

In [3]: df.groupby("B", observed=True).A.sum()
Out[3]:
B
a    3
b    7
Name: A, dtype: int64
```

And dask does have limited support for MultiIndex results, so it should be OK, I think.
@jmunar, did you mean to close this, or are you still working on it?
I am no longer working on it; these days I am using `dd.core.aca` for almost every groupby operation; it
hey @jmunar, do you happen to have an example of your workaround using `dd.core.aca`?
`flake8 dask`
Feature explained in #4371