Avoid groupby.agg(callable) in groupby-var by mrocklin · Pull Request #4482 · dask/dask

mrocklin · 2019-02-14T00:37:01Z

This has two benefits:

1. It's much faster: the following benchmark shows a 5x improvement.
2. It doesn't require the pandas-like container to implement groupby.agg(callable), which helps cudf.
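To illustrate why dropping groupby.agg(callable) is possible at all, here is a minimal sketch in plain pandas. It is not the code from this PR. It assumes the variance is rebuilt from per-group count, sum, and sum of squares, which need only built-in reductions. The helper name `grouped_var` and the temporary `_sq` column are hypothetical.

```python
import pandas as pd


def grouped_var(df, by, col, ddof=1):
    """Per-group variance of `col` using only sum and count reductions.

    Hypothetical helper for illustration; no callable is passed to agg.
    """
    # Square the column up front so the groupby only needs a plain sum.
    g = df.assign(_sq=df[col] ** 2).groupby(by)
    n = g[col].count()
    s = g[col].sum()
    ss = g["_sq"].sum()
    # Identity: sum((x - mean)^2) == sum(x^2) - sum(x)^2 / n
    numerator = ss - s ** 2 / n
    denominator = n - ddof
    # Groups with too few rows have no defined variance, so mask them to NaN.
    return (numerator / denominator).where(denominator > 0)


df = pd.DataFrame(
    {"id": [1, 1, 1, 2, 2, 3], "data": [1.0, 2.0, 4.0, 3.0, 5.0, 7.0]}
)
result = grouped_var(df, "id", "data")
expected = df.groupby("id")["data"].var()  # reference value from pandas
```

Count, sum, and sum of squares can each be added across partitions, which is what makes this form convenient for a chunked computation. The sum-of-squares identity can lose precision when the mean is large relative to the spread, so treat this as a sketch of the idea rather than a numerically robust implementation.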

Benchmark

I get about five seconds for this on master and less than one second on this branch.

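from time import time
import dask

# Synthetic time series held in memory, so only the groupby is timed
df = dask.datasets.timeseries(dtypes={'id': int, 'data': float}).persist()

start = time()
for i in range(3):
    # Grouped standard deviation, which is built on groupby-var
    df.groupby('id').data.std().compute()
stop = time()

print(stop - start)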

Tests added / passed
Passes flake8 dask

cc @thomcom @TomAugspurger

mrocklin · 2019-02-14T00:43:20Z

@jangorecki this may affect your benchmarks

thomcom · 2019-02-14T16:42:30Z

Nice!

mrocklin mentioned this pull request Feb 14, 2019

[FEA] Support callables in groupby.agg rapidsai/cudf#901

Closed

mrocklin force-pushed the dataframe-groupby-accel branch from 27e911c to 095793f Compare February 14, 2019 01:02

TomAugspurger approved these changes Feb 14, 2019

mrocklin merged commit af231de into dask:master Feb 14, 2019

mrocklin deleted the dataframe-groupby-accel branch February 14, 2019 15:03

mrocklin merged 1 commit into dask:master from mrocklin:dataframe-groupby-accel
