[ENH] Add support for cumulative aggregations in GroupBy.agg #9620

Open
charlesbluca opened this issue Nov 3, 2022 · 1 comment
Labels: dataframe, feature (Something is missing), needs attention (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.)

Comments

@charlesbluca
Member

Currently, cumulative grouped aggregations can be called using their named methods, but passing them into .agg() raises an error:

from dask.datasets import timeseries

ddf = timeseries()

ddf.groupby("name").cumsum()  # works
ddf.groupby("name").agg("cumsum")  # raises ValueError
ValueError                                Traceback (most recent call last)
Cell In [3], line 3
      1 from dask.datasets import timeseries
----> 3 timeseries().groupby("name").agg("cumsum")

File ~/dev/dask/main/dask/dataframe/groupby.py:2557, in DataFrameGroupBy.agg(self, arg, split_every, split_out, shuffle, **kwargs)
   2555 @_aggregate_docstring(based_on="pd.core.groupby.DataFrameGroupBy.agg")
   2556 def agg(self, arg=None, split_every=None, split_out=1, shuffle=None, **kwargs):
-> 2557     return self.aggregate(
   2558         arg=arg,
   2559         split_every=split_every,
   2560         split_out=split_out,
   2561         shuffle=shuffle,
   2562         **kwargs,
   2563     )

File ~/dev/dask/main/dask/dataframe/groupby.py:2547, in DataFrameGroupBy.aggregate(self, arg, split_every, split_out, shuffle, **kwargs)
   2544 if arg == "size":
   2545     return self.size()
-> 2547 return super().aggregate(
   2548     arg=arg,
   2549     split_every=split_every,
   2550     split_out=split_out,
   2551     shuffle=shuffle,
   2552     **kwargs,
   2553 )

File ~/dev/dask/main/dask/dataframe/groupby.py:1976, in _GroupBy.aggregate(self, arg, split_every, split_out, shuffle, **kwargs)
   1973 else:
   1974     raise ValueError(f"aggregate on unknown object {self.obj}")
-> 1976 chunk_funcs, aggregate_funcs, finalizers = _build_agg_args(spec)
   1978 if isinstance(self.by, (tuple, list)) and len(self.by) > 1:
   1979     levels = list(range(len(self.by)))

File ~/dev/dask/main/dask/dataframe/groupby.py:833, in _build_agg_args(spec)
    830 if not isinstance(func, Aggregation):
    831     func = funcname(known_np_funcs.get(func, func))
--> 833 impls = _build_agg_args_single(result_column, func, input_column)
    835 # overwrite existing result-columns, generate intermediates only once
    836 for spec in impls["chunk_funcs"]:

File ~/dev/dask/main/dask/dataframe/groupby.py:886, in _build_agg_args_single(result_column, func, input_column)
    883     return _build_agg_args_custom(result_column, func, input_column)
    885 else:
--> 886     raise ValueError(f"unknown aggregate {func}")

ValueError: unknown aggregate cumsum

It looks like the underlying issue here is that we don't have a _build_agg_args_* function for any of the cumulative aggregations; not sure if that's all that needs to be done here to unblock this functionality.
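Until .agg() accepts cumulative names, a caller can route them to the named methods by hand. The snippet below is a minimal sketch of that idea, not something Dask provides: `agg_or_cumulative` and `CUMULATIVE_METHODS` are hypothetical names, and only `cumsum` is confirmed by the report above; the other entries assume the groupby object exposes methods with those names.

```python
# Hypothetical helper, not part of Dask or pandas.
# Only "cumsum" is confirmed by the issue; the other names are assumptions.
CUMULATIVE_METHODS = frozenset({"cumsum", "cumprod", "cumcount"})


def agg_or_cumulative(grouped, func):
    """Apply ``func`` to a groupby object, routing cumulative names to their methods."""
    if isinstance(func, str) and func in CUMULATIVE_METHODS:
        # Call the named method, e.g. grouped.cumsum(), which already works.
        return getattr(grouped, func)()
    # Anything else goes through the regular aggregation path.
    return grouped.agg(func)
```

With the example above, `agg_or_cumulative(ddf.groupby("name"), "cumsum")` would take the same path as `ddf.groupby("name").cumsum()`, while `agg_or_cumulative(ddf.groupby("name"), "sum")` would still go through `.agg()`.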

@charlesbluca charlesbluca added dataframe feature Something is missing labels Nov 3, 2022
@rubenvdg
Contributor

rubenvdg commented Nov 29, 2022

ddf.groupby("name").agg("cumsum") should not work imo because cumsum is not an aggregation (like min and sum) but a transformation.

The fact that df.groupby("name").agg('cumsum') currently works in pandas is due to a bug (e.g. pandas-dev/pandas#44845, pandas-dev/pandas#35725).

For example, in pandas if you'd do something like:

df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "value": range(5)
})

df.groupby("key")["value"].agg(lambda col: col.cumsum())

you get ValueError: Must produce aggregated value (which is the expected behavior).
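To make the aggregation vs. transformation distinction concrete, here is a small pandas sketch using the same frame. It only relies on `agg("sum")` and `transform("cumsum")`; the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "value": range(5),
})
grouped = df.groupby("key")["value"]

# Aggregation: each group collapses to one value, indexed by the group key.
totals = grouped.agg("sum")

# Transformation: one output value per input row, aligned to the original index.
running = grouped.transform("cumsum")

print(len(totals), len(running))  # 2 groups vs. 5 rows
```

The result of `sum` has one entry per group, while the result of `cumsum` keeps the shape of the input, which is why it fits `transform` (or the named method) rather than `agg`.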

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Apr 29, 2024