Avoid groupby.agg(callable) in groupby-var (#4482)
This has two benefits:

1.  It's much faster; the following benchmark shows a 5x improvement
2.  It doesn't require the pandas-like container to implement
    groupby.agg(callable), which helps cudf
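The equivalence behind both points can be checked in plain pandas. The snippet below is a minimal sketch, not the code in this commit: the toy frame and the names `slow` and `fast` are assumptions made for illustration. It squares the values before grouping and uses the built-in `sum`, then confirms the result matches `agg` with a Python callable.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the benchmark in this commit.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "data": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Approach being removed: a Python callable is invoked for each group.
slow = df.groupby("id").agg(lambda x: (x ** 2).sum())

# Approach being added: square every value once, then use the built-in sum.
# The grouping key is passed separately so that it is not squared too.
values = df[["data"]]
fast = (values ** 2).groupby(df["id"]).sum()

# Both approaches yield the per-group sum of squares.
pd.testing.assert_frame_equal(slow, fast)
```

Only `sum` is required of the groupby object in the second form, which is presumably what makes it friendlier to containers that lack `groupby.agg(callable)`.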

Benchmark
---------

On master this takes about five seconds;
on this branch it takes less than one second.

```
from time import time
import dask
df = dask.datasets.timeseries(dtypes={'id': int, 'data': float}).persist()

start = time()
for i in range(3):
    df.groupby('id').data.std().compute()
stop = time()

print(stop - start)
```
mrocklin committed Feb 14, 2019
1 parent 2ed8b63 commit af231de
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion dask/dataframe/groupby.py
@@ -242,8 +242,14 @@ def _var_chunk(df, *index):
 df = df.to_frame()
 g = _groupby_raise_unaligned(df, by=index)
 x = g.sum()
-x2 = g.agg(lambda x: (x**2).sum()).rename(columns=lambda c: c + '-x2')
+
 n = g.count().rename(columns=lambda c: c + '-count')
+
+df2 = df ** 2
+g2 = _groupby_raise_unaligned(df2, by=index)
+x2 = g2.sum().rename(columns=lambda c: c + '-x2')
+
+x2.index = x.index
 return pd.concat([x, x2, n], axis=1)
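
The chunk function returns three per-group quantities: the sum `x`, the sum of squares `x2`, and the count `n`. As a rough illustration of why these suffice, the sketch below recovers the variance from them using the standard identity `var = (x2 - x**2 / n) / (n - ddof)`. The toy data and variable names are assumptions, and this is not claimed to be the exact aggregation step dask performs.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "data": [1.0, 2.0, 3.0, 4.0, 6.0],
})
key = df["id"]
values = df[["data"]]

x = values.groupby(key).sum()           # per-group sum of values
x2 = (values ** 2).groupby(key).sum()   # per-group sum of squared values
n = values.groupby(key).count()         # per-group number of observations

# Standard identity: sum of squared deviations = x2 - x**2 / n.
ddof = 1
var = (x2 - x ** 2 / n) / (n - ddof)

# Cross-check against pandas' own grouped variance.
expected = values.groupby(key).var(ddof=ddof)
pd.testing.assert_frame_equal(var, expected)
```

Taking the square root of `var` gives the grouped standard deviation, which is the quantity the benchmark in this commit computes.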

