Use atop fusion in dask dataframe #4229
OK, most dataframe tests pass here now.
OK, this is ready for review. There is some future work though:
@jcrist @TomAugspurger if either of you have time, it would be helpful if you could take a look at this.
Will probably have time next week.
assert len(b) <= 15


@pytest.mark.xfail(reason="need better high level fusion")
This is future work. We need better fusion on the high level graph level. cc @jcrist ?
Sure, that'd be fun to work on. Is this blocking for this PR to merge, or just future work?
Not blocking. Things are better with this PR (`df + 1 + 2 + 3 + 4` happens in one task) but could be much better still with better fusion. In particular I've seen the use case in this failing test happen frequently. It seems that common use of Pandas includes dozens of lines of modifying columns in place, all of which generate diamond-like graphs.
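To make the chain-versus-diamond distinction concrete, here is a toy sketch in plain Python. It is not dask's optimizer: `fuse_linear`, `refs`, `substitute`, and `run` are hypothetical helpers written for illustration, and the graphs are simple dicts in the spirit of dask's task-graph convention. A child is fused into its parent only when the child has exactly one dependent and the parent has exactly one dependency, so the linear chain collapses to one task while the diamond is left alone.

```python
from operator import add, mul


def is_task(x):
    # A task is a tuple whose first element is callable: (func, arg, ...)
    return isinstance(x, tuple) and len(x) > 0 and callable(x[0])


def refs(graph, expr):
    # Keys of `graph` that `expr` refers to, looking inside nested tasks.
    if is_task(expr):
        found = set()
        for arg in expr[1:]:
            found |= refs(graph, arg)
        return found
    if isinstance(expr, str) and expr in graph:
        return {expr}
    return set()


def substitute(expr, key, replacement):
    # Replace references to `key` inside `expr` with `replacement`.
    if is_task(expr):
        return (expr[0],) + tuple(substitute(a, key, replacement) for a in expr[1:])
    if isinstance(expr, str) and expr == key:
        return replacement
    return expr


def run(graph, expr):
    # Naive recursive evaluation of a key, a (possibly nested) task, or a literal.
    if is_task(expr):
        return expr[0](*(run(graph, a) for a in expr[1:]))
    if isinstance(expr, str) and expr in graph:
        return run(graph, graph[expr])
    return expr


def fuse_linear(graph, keep):
    # Toy rule (an assumption of this sketch): inline `child` into `parent`
    # only if child has one dependent and parent has one dependency.
    graph = dict(graph)
    while True:
        deps = {k: refs(graph, v) for k, v in graph.items()}
        dependents = {k: set() for k in graph}
        for k, ds in deps.items():
            for d in ds:
                dependents[d].add(k)
        for child in list(graph):
            if child in keep or len(dependents[child]) != 1:
                continue
            (parent,) = dependents[child]
            if len(deps[parent]) != 1:
                continue
            graph[parent] = substitute(graph[parent], child, graph[child])
            del graph[child]
            break  # recompute dependencies after every fusion
        else:
            return graph


# Linear chain, analogous to df + 1 + 2 + 3 + 4 on one partition.
chain = {"x": 1, "a": (add, "x", 1), "b": (add, "a", 2),
         "c": (add, "b", 3), "d": (add, "c", 4)}
fused_chain = fuse_linear(chain, keep={"d"})

# Diamond: "x" feeds two tasks that are later recombined.
diamond = {"x": 1, "a": (add, "x", 1), "b": (mul, "x", 2), "c": (add, "a", "b")}
fused_diamond = fuse_linear(diamond, keep={"c"})
```

In this sketch the chain ends up as a single task under the key `"d"`, while the diamond keeps all four keys. That is the kind of case the xfailed test is about.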
The PR itself could also use review though.
This PR looks good, but I haven't really kept up with the new high-level graph stuff.
Only request would be a docstring for the new broadcast method. And could you explain the reasoning behind that name? IIUC, it adds a task to each partition / block of the input args? I worry a bit about broadcast because it clashes with NumPy's concept (and broadcasting data in distributed), but I haven't come up with a better name.
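For readers following along, the "one task per partition / block" reading can be sketched in plain Python. This is only an illustration of that reading, not the method added in this PR: `Partitioned` and `partitionwise` are hypothetical names, and the output is a plain dict keyed by `(name, partition_index)`.

```python
from operator import add


class Partitioned:
    """Hypothetical stand-in for a collection split into partitions."""

    def __init__(self, name, npartitions):
        self.name = name
        self.npartitions = npartitions


def partitionwise(func, out_name, *args):
    """Build one task per partition, applying func to aligned partitions.

    Partitioned arguments contribute their i-th partition to the i-th task;
    every other argument is passed unchanged to every task.
    """
    counts = {a.npartitions for a in args if isinstance(a, Partitioned)}
    if len(counts) != 1:
        raise ValueError("need at least one Partitioned arg, all with equal npartitions")
    (n,) = counts
    graph = {}
    for i in range(n):
        task_args = tuple(
            (a.name, i) if isinstance(a, Partitioned) else a for a in args
        )
        graph[(out_name, i)] = (func,) + task_args
    return graph


# Adds 10 to each of the three partitions of "x".
graph = partitionwise(add, "y", Partitioned("x", 3), 10)
```

The scalar `10` is handed to every task, which is what makes the name read a little like NumPy-style broadcasting even though no array shapes are involved.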
A fun motivating Stack Overflow question: https://stackoverflow.com/questions/53844188/how-do-i-use-dask-to-efficiently-calculate-many-simple-statistics/53869772#53869772
Renamed to |
Just did a test locally and this branch had a dramatic impact (~2x improvement) in performance for the code below, and even has a lower memory footprint for the larger stat tests.
For calculating 200 statistics (10 unique filters, 20 different statistics from each filtered subset)
For calculating 2000 statistics
I also tried to increase the scale of stats (
However, on this branch, the calculation with 2000 stats actually completes!
Also, the
The biggest difference from current
Thank you for trying things out @bluecoconut and for the benchmark. It's always nice to see things work well on unseen problems :) I look forward to finding out why graph construction was more expensive (I would have expected it to decrease). I suspect that you'll see additional boosts as we get a bit better at fusing things at the high level (see the comments on the xfailed test).
Merging this tomorrow if there are no further comments.
The high level atop fusion layers can be used in dask dataframe as well as dask array. Today this makes it cheaper to perform task fusion on dataframes with many partitions. In the future this should make it easier to build more sophisticated high level optimizations.
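As a rough illustration of why fusing at the layer level is cheaper with many partitions, here is a toy sketch. It does not use dask's actual classes: `ElementwiseLayer`, `fuse_layers`, and `materialize` are hypothetical names. Each operation is described once, symbolically, so fusion touches one object per operation instead of one task per partition. Concrete tasks are generated only after fusion.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ElementwiseLayer:
    """Symbolic description of 'apply func to every partition of input'."""
    output: str
    func: Callable
    input: str


def fuse_layers(layers):
    """Fuse a linear sequence of layers into one layer.

    The cost depends on the number of layers, not the number of partitions.
    """
    for prev, nxt in zip(layers, layers[1:]):
        if nxt.input != prev.output:
            raise ValueError("layers do not form a linear chain")
    funcs = [layer.func for layer in layers]

    def fused(x):
        for f in funcs:
            x = f(x)
        return x

    return ElementwiseLayer(layers[-1].output, fused, layers[0].input)


def materialize(layer, npartitions):
    """Expand a symbolic layer into one concrete task per partition."""
    return {
        (layer.output, i): (layer.func, (layer.input, i))
        for i in range(npartitions)
    }


layers = [
    ElementwiseLayer("a", lambda x: x + 1, "df"),
    ElementwiseLayer("b", lambda x: x + 2, "a"),
]
fused = fuse_layers(layers)          # two symbolic layers become one
graph = materialize(fused, 1000)     # 1000 tasks instead of 2000
```

Without the symbolic step, an optimizer would have to walk all 2000 low-level tasks to discover the same fusion.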
This currently builds on #4092
This currently breaks tests. There is a fair amount that we can clean up in both dask.dataframe's broadcast operations (map_partitions, elemwise) and in the dask.array.atop code.
flake8 dask