Skip to content

Test that numpy literals in graph don't get inlined#2080

Merged
mrocklin merged 3 commits intodask:masterfrom
mrocklin:dont-fuse-numpy-arrays
Mar 15, 2017
Merged

Test that numpy literals in graph don't get inlined#2080
mrocklin merged 3 commits intodask:masterfrom
mrocklin:dont-fuse-numpy-arrays

Conversation

@mrocklin
Copy link
Copy Markdown
Member

OK, so it's important that numpy arrays stay where they are in the graph and don't get fused. This is important for two reasons:

  1. We can serialize values better if they are naked in the graph rather than enclosed within tasks
  2. If a value becomes inlined into a fast-function task (like getitem) then it can be copied to many other tasks during fast_function_inlining. This can cause blowup of serialization costs.

cc @eriknw any thoughts on how to address this cheaply?

@eriknw
Copy link
Copy Markdown
Member

eriknw commented Mar 14, 2017

Include the tasks you don't want fused in the keys= keyword argument to fuse. All or most optimization functions accept keys=.

@mrocklin
Copy link
Copy Markdown
Member Author

mrocklin commented Mar 14, 2017 via email

@mrocklin
Copy link
Copy Markdown
Member Author

OK, expanding on this a bit I actually also don't want to fuse the last task of a getitem chain. When thinking about distributed computing we want to quickly split up large numpy arrays with getitem calls, and then move them to other workers as necessary before fusing with any other computation. We want to avoid moving large intermediates.

So in regards to fusion this is a bit odd. We want to fuse long getitem chains (this is very important for single-machine xarray workloads). But we don't want to fuse a getitem chain to anything else that comes afterwards (like a +100 operation), this helps keep data movement easier in a distributed system.

My current approach uses the suggestion from @eriknw to specify a set of keys to specifically avoid fusing. We start at the bottom of the graph from all data nodes (anything that isn't a task) and move up while there are getitem chains. We claim the topmost element of that chain as a key that we don't want to lose/fuse away. This costs two linear scans, one to find data and then one call to reverse_dict (possibly expensive).

@mrocklin mrocklin force-pushed the dont-fuse-numpy-arrays branch from fe9c701 to f1537d8 Compare March 15, 2017 12:07
@mrocklin mrocklin force-pushed the dont-fuse-numpy-arrays branch from f1537d8 to 915c6bd Compare March 15, 2017 12:39
@mrocklin
Copy link
Copy Markdown
Member Author

I would like to merge this soon. @shoyer I've run this against the xarray test suite and didn't run into any problems. Still you might want to look at this if you get a chance. This stops Dask from fusing getitem calls into other, non-getitem calls.

@mrocklin
Copy link
Copy Markdown
Member Author

OK, I'm going ahead with this.

@mrocklin mrocklin merged commit 7b7c765 into dask:master Mar 15, 2017
@mrocklin mrocklin deleted the dont-fuse-numpy-arrays branch March 15, 2017 21:03
@sinhrks sinhrks added this to the 0.14.1 milestone Mar 30, 2017
jcrist added a commit to jcrist/dask that referenced this pull request May 19, 2017
In dask#2080 a bug was introduced that prevented fusing slices across
aliases in the graph. These would show up when several dask arrays were
concatenated together. This fixes that bug, and adds a test that fusing
works across aliases.
jcrist added a commit that referenced this pull request May 30, 2017
* Fuse slices works with alias in graph

In #2080 a bug was introduced that prevented fusing slices across
aliases in the graph. These would show up when several dask arrays were
concatenated together. This fixes that bug, and adds a test that fusing
works across aliases.

* Fuse slices works with locks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants