Add support for observed= in groupby operations #4385
Conversation
Thanks for the PR. I'm a little concerned about the need to store the ._kwargs
and pass it through everywhere. I was hoping the change could be a bit more localized, but I haven't thought of a way better than how you've done it. I'll think on it a bit more.
Happy to rework it if you find a better alternative :)
dask/dataframe/tests/test_groupby.py
Outdated
```python
result_f = getattr(result_g, agg_func)

expected = expected_f()
result = result_f().compute()
```
Remove `.compute()`; `assert_eq` will do that.
I needed it; otherwise `assert_eq` fails. It boils down to dask not really supporting MultiIndex.
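A minimal pandas-only sketch of the issue behind this comment (the data here is invented for illustration): a two-key groupby produces a MultiIndex result, and dask divisions cannot describe a MultiIndex, which is why the test computes eagerly before comparing.

```python
import pandas as pd

# Hypothetical data, just to show the shape of the result.
df = pd.DataFrame({"c1": ["x", "x", "y"],
                   "c2": [1, 2, 2],
                   "c3": [10, 20, 30]})

# Grouping on two keys yields a MultiIndex; dask cannot represent this
# in its divisions, so the dask result is computed before comparison.
result = df.groupby(["c1", "c2"])["c3"].sum()
assert isinstance(result.index, pd.MultiIndex)
```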
dask/dataframe/tests/test_groupby.py
Outdated
```python
expected = expected_f()
result = result_f().compute()

# when applying groupby(['c1', 'c2'])(['c3']).sum(), pandas returns nans
```
And dask returns 0 here? I'm not sure what's expected here, though the pandas behavior seems buggy. I would expect pandas to follow `min_count=0` and return 0, but I don't recall if we special-cased unobserved categories.
This is the same bug as above, resolved in pandas 0.24.
dask/dataframe/groupby.py
Outdated
```diff
@@ -923,8 +1003,41 @@ def count(self, split_every=None, split_out=1):

     @derived_from(pd.core.groupby.GroupBy)
     def mean(self, split_every=None, split_out=1):
         return (self.sum(split_every=split_every, split_out=split_out) /
```
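The line being discussed derives the groupby mean from sum and count, both of which reduce cleanly across partitions. A quick pandas check of that identity (toy data, not from the PR):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, 2.0, 4.0]})
grouped = df.groupby("g")["v"]

# mean == sum / count per group: this is what lets dask compute a groupby
# mean from two tree-reducible aggregations instead of a dedicated one.
manual_mean = grouped.sum() / grouped.count()
assert manual_mean.equals(grouped.mean())
```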
Does this not work anymore?
I am starting to believe that it doesn't make much sense to support the `observed=` keyword if dask does not support MultiIndex in the index. What's your opinion, @TomAugspurger?
This is still useful for results that have a MultiIndex, I think:

```python
In [2]: df = pd.DataFrame({"A": [1, 2, 3, 4],
   ...:                    "B": pd.Categorical(['a', 'a', 'b', 'b'],
   ...:                                        categories=['a', 'b', 'c'])})

In [3]: df.groupby("B", observed=True).A.sum()
Out[3]:
B
a    3
b    7
Name: A, dtype: int64
```

And dask does have limited support for MultiIndex results, so it should be OK, I think.
@jmunar, did you mean to close this, or are you still working on it?
I am no longer working on it; these days I am using `dd.core.aca` for almost every groupby operation; it
hey @jmunar, do you happen to have an example of your workaround using `dd.core.aca`?
`flake8 dask`
Feature explained in #4371