Dask slow in simple benchmark #4344

Open
andaag opened this issue Dec 31, 2018 · 2 comments

andaag commented Dec 31, 2018

This falls in the category of "I'm not sure I should open a bug": I'm new to dask and may simply not understand why this operation is slow.

See https://gist.github.com/andaag/207fcdc6965b86b7085406221279e4c2

With pandas:

df3['test2'] = df3['salary'].apply(f)

Runtime 22 seconds.

With dask:

%%time
dfn['test'] = dfn['salary'].apply(f, meta=('x', 'f8'))
dfn.compute()

Runtime: 36 seconds. OK, it's a fairly small sample, but that seems like a huge overhead for the extra threads we're adding. To rule out parallelization overhead, let's try a custom parallel function that splits the array apart, runs the pieces as parallel tasks, then puts the result back together.

With custom parallel function:

%%time
df3['test'] = parallel(df3['salary'], f, n_jobs=n_jobs)

Runtime 4.7 seconds. What's going on here?

If dask were slow in this case because of parallelization overhead, my custom function should not be faster either.
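For readers without access to the gist, here is a minimal sketch of the split / run in parallel / reassemble pattern described above. It is not the code from the gist: the names `split_into_chunks`, `apply_to_chunk`, and `parallel_apply` are hypothetical, it works on plain Python lists rather than a pandas Series, and it assumes `func` is a picklable, module-level function so it can be sent to worker processes.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal pieces."""
    n_chunks = max(1, min(n_chunks, len(values)))
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def apply_to_chunk(func, chunk):
    """Apply func to every element of one chunk (runs inside a worker)."""
    return [func(v) for v in chunk]


def parallel_apply(values, func, n_jobs=4, executor_cls=ProcessPoolExecutor):
    """Hypothetical stand-in for the `parallel` helper used in the issue.

    With the default executor, func must be picklable (for example a
    module-level function) because it is shipped to worker processes.
    """
    chunks = split_into_chunks(list(values), n_jobs)
    with executor_cls(max_workers=n_jobs) as pool:
        # Executor.map preserves input order, so the flattened result
        # lines up with the original sequence.
        parts = pool.map(apply_to_chunk, [func] * len(chunks), chunks)
        return [y for part in parts for y in part]
```

Each worker receives one large chunk rather than one element at a time, which keeps the number of inter-process transfers small. When calling this from a script with the default process pool, the call should sit under an `if __name__ == "__main__":` guard.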

This experiment runs inside https://hub.docker.com/r/andaag/aibox_cuda9, which is built from https://github.com/andaag/aibox. (The image is huge because of the deep learning libraries and CUDA, sorry.)

Author

andaag commented Dec 31, 2018

For completeness, I also tried the processes scheduler:

%%time
dfn['test'] = dfn['salary'].apply(f, meta=('x', 'f8'))
dfn.compute(scheduler="processes")

Runtime: 2min 12s. What's going on here?!

Member

mrocklin commented Jan 2, 2019

Apply won't be any faster with threads because it uses Python for loops.
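The point above is presumably about CPython's global interpreter lock (GIL): on a standard CPython build, only one thread executes Python bytecode at a time, so a function that spends its time in Python-level loops gains little from a thread pool. The following standard-library sketch lets you check this on your own machine. The function names and the workload are hypothetical and are not taken from the issue or the gist.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_python_function(n):
    """CPU-bound work done entirely in a Python-level loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_serial(inputs):
    """Run the function on each input in turn; return (results, seconds)."""
    start = time.perf_counter()
    results = [slow_python_function(n) for n in inputs]
    return results, time.perf_counter() - start


def time_threaded(inputs, n_threads=4):
    """Run the same work in a thread pool; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(slow_python_function, inputs))
    return results, time.perf_counter() - start
```

Calling `time_serial([2_000_000] * 8)` and `time_threaded([2_000_000] * 8)` and comparing the elapsed times should, on a GIL-enabled interpreter, show roughly similar durations. Exact numbers depend on the machine and Python version, so none are quoted here.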

Not sure what is going on with the processes situation. I would run this on the local distributed scheduler and watch the dashboard to get a sense of what is going on.
