Avoid groupby.agg(callable) in groupby-var #4482

Merged
mrocklin merged 1 commit into dask:master from mrocklin:dataframe-groupby-accel on Feb 14, 2019

Conversation

@mrocklin
Member

This has two benefits:

  1. It's much faster: the following benchmark shows roughly a 5x improvement.
  2. It doesn't require the pandas-like container to implement
    groupby.agg(callable), which helps cudf.
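To illustrate the idea behind both points, here is a minimal sketch in plain pandas, not the code from this PR. It assumes the callable being avoided is a per-group sum of squares, and that the same quantity can be obtained by squaring the column first and then using the built-in `sum` reduction. The helper names `group_var_builtin` and `group_var_callable` are hypothetical.

```python
import numpy as np
import pandas as pd


def group_var_builtin(df, by, col, ddof=1):
    """Per-group variance using only built-in groupby reductions."""
    # Square the values up front so that a plain groupby-sum is enough.
    squared = df[[by]].assign(x2=df[col] ** 2)
    g = df.groupby(by)[col]
    n = g.count()
    s = g.sum()
    s2 = squared.groupby(by)["x2"].sum()
    # var = (sum(x^2) - sum(x)^2 / n) / (n - ddof)
    return (s2 - s ** 2 / n) / (n - ddof)


def group_var_callable(df, by, col, ddof=1):
    """Same formula, but the sum of squares goes through agg(callable)."""
    g = df.groupby(by)[col]
    n = g.count()
    s = g.sum()
    s2 = g.agg(lambda x: (x ** 2).sum())  # the pattern being avoided
    return (s2 - s ** 2 / n) / (n - ddof)


rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"id": rng.integers(0, 5, size=1000), "data": rng.normal(size=1000)}
)

fast = group_var_builtin(df, "id", "data")
slow = group_var_callable(df, "id", "data")
expected = df.groupby("id")["data"].var()
```

Both helpers should agree with pandas' own `var` up to floating-point error. Only the second one needs the container to support `agg` with an arbitrary Python callable.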

Benchmark

I get five-ish seconds for this on master and less than one second on this branch:

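```
from time import time
import dask

# Synthetic timeseries with an integer 'id' column and a float 'data' column,
# persisted so that data generation is not part of the timing
df = dask.datasets.timeseries(dtypes={'id': int, 'data': float}).persist()

start = time()
for i in range(3):
    # groupby-std is built on groupby-var, the code path this PR changes
    df.groupby('id').data.std().compute()
stop = time()

print(stop - start)
```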
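For readers without dask at hand, the same comparison can be approximated in plain pandas. This is only a sketch: the data size, the `time_call` helper, and the choice of a sum-of-squares callable are assumptions for illustration, and the timings will vary by machine and pandas version.

```python
from time import perf_counter

import numpy as np
import pandas as pd


def time_call(fn, repeat=3):
    """Return (best wall-clock seconds over `repeat` runs, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = perf_counter()
        result = fn()
        best = min(best, perf_counter() - start)
    return best, result


rng = np.random.default_rng(0)
n_rows = 200_000
df = pd.DataFrame(
    {"id": rng.integers(0, 1000, size=n_rows), "data": rng.normal(size=n_rows)}
)


def with_callable():
    # Calls a Python function once per group
    return df.groupby("id")["data"].agg(lambda x: (x ** 2).sum())


def with_builtin():
    # Squares once, then uses the built-in grouped sum
    return (df["data"] ** 2).groupby(df["id"]).sum()


t_callable, r_callable = time_call(with_callable)
t_builtin, r_builtin = time_call(with_builtin)
print(f"agg(callable): {t_callable:.4f}s, built-in sum: {t_builtin:.4f}s")
```

The two functions return the same per-group values, so any difference in the printed timings comes from how the reduction is dispatched.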
  • Tests added / passed
  • Passes flake8 dask

cc @thomcom @TomAugspurger

@mrocklin
Member Author

@jangorecki this may affect your benchmarks.

mrocklin force-pushed the dataframe-groupby-accel branch from 27e911c to 095793f on February 14, 2019 01:02
mrocklin merged commit af231de into dask:master on Feb 14, 2019
mrocklin deleted the dataframe-groupby-accel branch on February 14, 2019 15:03
@thomcom
Contributor

thomcom commented Feb 14, 2019

Nice!

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019