Avoid apply in _compute_sum_of_squares for groupby std agg by rjzamora · Pull Request #6186 · dask/dask

rjzamora · 2020-05-08T16:55:56Z

This PR addresses #6034 by avoiding the apply call in _compute_sum_of_squares for std groupby aggregations.
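For context, a groupby std over several partitions can be assembled from three per-group partial aggregates: the count, the sum, and the sum of squares. That is why a helper like _compute_sum_of_squares sits on the std code path. The following is a minimal pandas-only sketch of that decomposition, not the dask implementation itself. The helper name `partial_aggregates`, the two hand-made "partitions", and the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data; the two slices stand in for dask partitions (assumption).
df = pd.DataFrame(
    {"a": [1, 1, 2, 2, 1, 2], "b": [4.0, 5.0, 6.0, 10.0, 7.0, 3.0]}
)
parts = [df.iloc[:3], df.iloc[3:]]


def partial_aggregates(part):
    """Hypothetical helper: per-group count, sum, and sum of squares."""
    grouped = part.groupby("a")["b"]
    # Square the column first, then use the built-in groupby sum (no apply).
    squared = part.assign(b=part["b"] ** 2).groupby("a")["b"]
    return pd.DataFrame(
        {"n": grouped.count(), "x": grouped.sum(), "x2": squared.sum()}
    )


# Combine the partials by summing the rows that share a group key.
combined = pd.concat([partial_aggregates(p) for p in parts]).groupby(level=0).sum()

# Sample variance from the combined aggregates (ddof=1 matches the pandas default).
ddof = 1
var = (combined["x2"] - combined["x"] ** 2 / combined["n"]) / (combined["n"] - ddof)
std_from_partials = np.sqrt(var)

# Reference result computed on the full, unpartitioned frame.
expected = df.groupby("a")["b"].std()
```

Because every step is a plain reduction, the per-partition work never needs a Python-level function call per group.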

Motivation: The following does not currently work with dask_cudf, because the dask.dataframe implementation uses groupby-apply (which does not exhibit "correct" behavior for all cases in cudf):

import cudf, dask_cudf

df = cudf.DataFrame({'a': [1,1,2,2],'b': [4,5,6,10]})
ddf = dask_cudf.from_cudf(df, npartitions=2)
ddf.groupby('a').agg({'b':['mean','std']}).compute()

Note that this change also improves pandas-backed dask.dataframe performance:

import dask.dataframe as dd
import pandas as pd
import numpy as np

size = 100_000_000
df = pd.DataFrame({'a': np.random.randint(10, size=size),'b': np.random.randint(10, size=size)})
ddf = dd.from_pandas(df, npartitions=2)

%timeit ddf.groupby('a').agg({'b':['mean','std']}).compute()

Master: 3.46 s ± 71.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This PR: 1.74 s ± 36.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tests added / passed
Passes black dask / flake8 dask

mrocklin · 2020-05-08T18:53:28Z

Oh great! I'm glad to see that it was possible to find a solution to this that used only more efficient Pandas/cudf API. Seeing the speedup on pandas is great, and knowing that it allows cudf operations to work at all is awesome. This seems to pass tests, so I'm +1, but I figure we wait a day in case someone like @TomAugspurger has thoughts.

TomAugspurger

Just one question. Looks good otherwise.

dask/dataframe/groupby.py

rjzamora · 2020-05-08T21:04:23Z

@shwina - Let me know if you have any thoughts/advice on how I am using cudf/groupby here. [EDIT: More specifically, is there a "public" attribute I should use instead?]

rjzamora · 2020-05-08T21:09:37Z

Oh - It looks like Ashwin pointed out the keys attribute here.

kkraus14 · 2020-05-09T02:08:26Z

dask/dataframe/groupby.py

 def _compute_sum_of_squares(grouped, column):
-    base = grouped[column] if column is not None else grouped
-    return base.apply(lambda x: (x ** 2).sum())
+    # Note: CuDF cannot use `groupby.apply`.


Just a clarification: we can use groupby.apply, but we should avoid it for both pandas and cuDF whenever possible, as it goes through a slower, iteration-based code path.
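To make the difference concrete, here is a small pandas-only sketch comparing the apply-based sum of squares from the removed lines with an apply-free equivalent. It reuses the sample data from the PR description. The variable names are illustrative, and the apply-free form shown here is one possible approach, not necessarily the exact code merged in this PR.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [4, 5, 6, 10]})

# apply-based: the Python lambda is called once per group.
with_apply = df.groupby("a")["b"].apply(lambda x: (x ** 2).sum())

# apply-free: square the whole column once, then use the built-in groupby sum.
without_apply = df["b"].pow(2).groupby(df["a"]).sum()
```

Both forms give the same per-group totals. The second one only uses an elementwise power and a built-in groupby reduction.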

jcrist · 2020-05-11T13:56:09Z

LGTM, thanks @rjzamora. Merging.

avoid apply in _compute_sum_of_squares

92cf63b

TomAugspurger reviewed May 8, 2020


Add explicit note about cudf-specific unpacking

8d52424

use

b48742b

rjzamora mentioned this pull request May 8, 2020

[BUG] std on dask_cudf frame fails using the agg api rapidsai/cudf#4388

Closed

kkraus14 reviewed May 9, 2020

jcrist merged commit b0adaa3 into dask:master May 11, 2020

rjzamora deleted the avoid-aply branch May 11, 2020 14:07

rjzamora mentioned this pull request May 11, 2020

cuDF reductions on the same groups #6034

Closed

TomAugspurger mentioned this pull request May 11, 2020

Dask groupby count & standard deviation gives incorrect result after datetime to date conversion #6180

Closed

rjzamora mentioned this pull request Jan 10, 2024

Remove usage of pandas grouper in groupby #10770

Open

jcrist merged 3 commits into dask:master from rjzamora:avoid-aply
