Dask slow in simple benchmark #4344

andaag · 2018-12-31T12:42:39Z

So this is in the category "I'm not sure I should open a bug", as it might be me being new to dask and not understanding why this operation is slow.

See https://gist.github.com/andaag/207fcdc6965b86b7085406221279e4c2

With pandas:

df3['test2'] = df3['salary'].apply(f)

Runtime 22 seconds.

With dask:

%%time
dfn['test'] = dfn['salary'].apply(f, meta=('x', 'f8'))
dfn.compute()

Runtime 36 seconds. OK, it's a fairly small sample, but that seems like a huge overhead for the extra threads we're adding. To rule that out, let's try a custom parallel function that splits the array into chunks, runs the chunks as parallel tasks, and then puts the results back together.

With custom parallel function:

%%time
df3['test'] = parallel(df3['salary'], f, n_jobs=n_jobs)

Runtime 4.7 seconds. What's going on here?

If dask is slow in this case due to the overhead of parallelization my custom function should not be faster.
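The gist's `parallel` helper is not reproduced in this thread. As a rough illustration of the split / run in parallel / reassemble idea described above, here is a minimal standard-library sketch. The function body, the `_apply_chunk` helper and the `executor_cls` parameter are assumptions made for illustration, not the code from the gist.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def _apply_chunk(args):
    # Runs inside a worker: apply func to every element of one chunk.
    func, chunk = args
    return [func(x) for x in chunk]


def parallel(values, func, n_jobs=4, executor_cls=ProcessPoolExecutor):
    """Hypothetical stand-in for the gist's helper (not the original code)."""
    values = list(values)
    # Ceil division so that at most n_jobs contiguous chunks are produced.
    size = max(1, -(-len(values) // n_jobs))
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with executor_cls(max_workers=n_jobs) as pool:
        # pool.map yields results in submission order, so the flattened
        # output lines up with the input.
        parts = pool.map(_apply_chunk, [(func, c) for c in chunks])
        return [y for part in parts for y in part]
```

With the default `ProcessPoolExecutor`, `func` has to be picklable (for example a module-level function). Passing `executor_cls=ThreadPoolExecutor` keeps the same structure but runs the chunks in threads instead.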

This experiment is run inside https://hub.docker.com/r/andaag/aibox_cuda9, which is built from https://github.com/andaag/aibox. (The image is huge because of the deep learning libraries and CUDA, sorry.)


andaag · 2018-12-31T12:43:02Z

For completeness, I also tried the processes scheduler:

%%time
dfn['test'] = dfn['salary'].apply(f, meta=('x', 'f8'))
dfn.compute(scheduler="processes")

Runtime 2min 12s, what's going on here?!

mrocklin · 2019-01-02T17:45:03Z

Apply won't be any faster with threads because it uses Python for loops.
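To make the point about threads concrete, here is a small standard-library sketch that times a pure-Python, CPU-bound function run serially and in a thread pool. The function `f` here is a hypothetical stand-in, not the one from the gist. On a standard CPython build with the GIL, the two timings are expected to be similar; no specific numbers are claimed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def f(x):
    # Hypothetical stand-in for a pure-Python, CPU-bound function.
    total = 0
    for i in range(5_000):
        total += (x * i) % 7
    return total


def timed(label, fn):
    # Run fn once, print the elapsed wall-clock time, return its result.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result


data = list(range(200))
serial = timed("serial", lambda: [f(x) for x in data])
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = timed("4 threads", lambda: list(pool.map(f, data)))
```

Both runs produce the same values; only the wall-clock time is of interest.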

Not sure what is going on with the processes situation. I would run this on the local distributed scheduler and watch the dashboard to get a sense of what is going on.
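A minimal sketch of that suggestion, assuming `dask[distributed]` is installed and that the pandas DataFrame has a `salary` column as in the snippets above. The function name and the worker and partition counts are illustrative assumptions.

```python
def run_on_local_cluster(pdf, func, npartitions=8, n_workers=4):
    """Run the apply on a local distributed cluster (illustrative sketch)."""
    # Imported here so the sketch only needs dask when it is actually called.
    import dask.dataframe as dd
    from dask.distributed import Client

    # Single-threaded worker processes, so pure-Python code in different
    # workers does not contend for one interpreter lock.
    with Client(n_workers=n_workers, threads_per_worker=1) as client:
        # Open this URL in a browser while the computation is running.
        print("Dashboard:", client.dashboard_link)
        ddf = dd.from_pandas(pdf, npartitions=npartitions)
        ddf["test"] = ddf["salary"].apply(func, meta=("test", "f8"))
        # With a Client active, compute() runs on that cluster by default.
        return ddf.compute()
```

The dashboard is only served while the client is open, so for interactive inspection it may be more convenient to create the `Client` once in a notebook cell and leave it running.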

GenevieveBuckley added the dataframe label Oct 12, 2021
