Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for observed= in groupby operations #4385

Closed
wants to merge 0 commits into from
Closed

Add support for observed= in groupby operations #4385

wants to merge 0 commits into from

Conversation

jmunar
Copy link

@jmunar jmunar commented Jan 14, 2019

  • Tests added / passed
  • Passes flake8 dask

Feature explained in #4371

Copy link
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I'm a little concerned about the need to store the ._kwargs and pass it through everywhere. I was hoping the change could be a bit more localized, but I haven't thought of a way better than how you've done it. I'll think on it a bit more.

dask/dataframe/groupby.py Outdated Show resolved Hide resolved
@jmunar
Copy link
Author

jmunar commented Jan 15, 2019

Happy to rework it if you find a better alternative :)

dask/dataframe/groupby.py Outdated Show resolved Hide resolved
dask/dataframe/groupby.py Outdated Show resolved Hide resolved
result_f = getattr(result_g, agg_func)

expected = expected_f()
result = result_f().compute()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove .compute. assert_eq will do that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed it; otherwise assert_eq fails. It boils down to dask not really supporting multiindex

expected = expected_f()
result = result_f().compute()

# when applying groupby(['c1', 'c2'])(['c3']).sum(), pandas returns nans
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And dask returns 0 here? I'm not sure what's expected here, though the pandas behavior seems buggy. I would expect pandas to follow min_count=0 and return 0, but I don't recall if we special cased unobserved categories.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same bug as above, resolved in pandas 0.24

@@ -923,8 +1003,41 @@ def count(self, split_every=None, split_out=1):

@derived_from(pd.core.groupby.GroupBy)
def mean(self, split_every=None, split_out=1):
return (self.sum(split_every=split_every, split_out=split_out) /
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this not work anymore?

@jmunar
Copy link
Author

jmunar commented Jan 18, 2019

I am starting to believe that it doesn't make that much sense to support the observed= keyword, if dask does not support MultiIndex in the index. What's your opinion, @TomAugspurger ?

@TomAugspurger
Copy link
Member

This is still useful for results that have a MultiIndex I think

In [2]: df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.Categorical(['a', 'a', 'b', 'b'], categories=['a', 'b', 'c'])})

In [3]: df.groupby("B", observed=True).A.sum()
Out[3]:
B
a    3
b    7
Name: A, dtype: int64

And dask does have limited support for MultiIndex results, so it should be OK I think.

@jmunar jmunar closed this Mar 5, 2019
@martindurant
Copy link
Member

@jmunar , did you mean to close this, are you still working on it?

@jmunar
Copy link
Author

jmunar commented May 1, 2019

I am no longer working on it, these days I am using dd.core.aca for almost every groupby operation; it
s not ideal but it works

@maximz
Copy link

maximz commented Sep 3, 2020

hey @jmunar do you happen to have an example of your workaround using dd.core.aca to avoid this issue? thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants