Improve reuse of temporaries with numpy #5933
Conversation
numpy has an optimisation that allows operations to be done in place if an operand is a temporary (has no more references), e.g. in the expression `a + b + c`, `c` will be added in place to the result of `a + b`. Tweak the single-threaded scheduler to avoid creating unnecessary references. To make this effective, also modify `Blockwise` to run `fuse` on the subgraph before creating `SubgraphCallable`. This reduces the total runtime of the following code from 7.3s to 5.7s on my machine:

```python
from pprint import pprint

import dask.array as da
from dask.blockwise import optimize_blockwise
from dask.base import visualize
from dask.array.optimization import optimize

a = da.ones(2000000000, chunks=10000000)
b = a + a + a
c = da.sum(b)
c.compute()
```
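To illustrate the numpy behaviour being exploited here: when the intermediate result of `a + b` has no remaining references, numpy can reuse its buffer for the next operation instead of allocating a new one. A rough sketch of the equivalent explicit in-place form (this is an illustration, not the PR's code):

```python
import numpy as np

a = np.ones(1000)
b = np.ones(1000)
c = np.ones(1000)

# Naively, a + b + c allocates two result arrays: one for (a + b) and
# one for the final sum. When the intermediate (a + b) is a temporary
# with no other references, numpy can instead do the equivalent of:
tmp = a + b              # one allocation for the intermediate
np.add(tmp, c, out=tmp)  # second addition reuses the same buffer

# The result is identical to the naive expression.
assert np.array_equal(tmp, a + b + c)
```

Holding an extra reference to the intermediate (as the scheduler did before this change) defeats the optimisation, since numpy can no longer prove the operand is a temporary.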
I have a few questions:
@jcrist you may be interested in this one

@jcrist any thoughts?
It'd also be nice to have an asv benchmark for this in https://github.com/dask/dask-benchmarks/pulls. I could easily see it being undone in a refactor that adds another reference to a value.
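An asv benchmark for this might look roughly like the following sketch (the class and method names here are hypothetical, not the benchmark that was eventually added; asv times any `time_*` method on a benchmark class):

```python
import dask.array as da


class TimeTemporaryReuse:
    """Chained elementwise ops whose intermediates can be reused in place."""

    def setup(self):
        # Small enough to run quickly; asv repeats the timed method.
        self.a = da.ones(1_000_000, chunks=100_000)

    def time_sum_of_adds(self):
        # The synchronous scheduler is where the refcount tweak applies.
        (self.a + self.a + self.a).sum().compute(scheduler="synchronous")
```

A regression that reintroduces an extra reference to the intermediate would show up as a jump in this timing.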
That sounds right to me, too. Can you share the output from the failed test? It might make sense to adjust it.
Overall this seems sane to me. I agree with Tom that a benchmark here would be good.
The optimization should be done prior to passing to
If you want to look into the errors, here's a patch against this branch for my alternative implementation:

```diff
diff --git a/dask/blockwise.py b/dask/blockwise.py
index 82af9ae9..c574e96c 100644
--- a/dask/blockwise.py
+++ b/dask/blockwise.py
@@ -193,8 +193,7 @@ class Blockwise(Mapping):
             return self._cached_dict
         else:
             keys = tuple(map(blockwise_token, range(len(self.indices))))
-            dsk, _ = fuse(self.dsk, [self.output])
-            func = SubgraphCallable(dsk, self.output, keys)
+            func = SubgraphCallable(self.dsk, self.output, keys)
             self._cached_dict = make_blockwise_graph(
                 func,
                 self.output,
@@ -683,6 +682,7 @@ def rewrite_blockwise(inputs):
     sub = {blockwise_token(k): blockwise_token(v) for k, v in sub.items()}
     dsk = {k: subs(v, sub) for k, v in dsk.items()}
+    dsk, _ = fuse(dsk, [root])
     indices_check = {k for k, v in indices if v is not None}
     numblocks = toolz.merge([inp.numblocks for inp in inputs.values()])
```

Here's the first failure - there are several, though. It should be possible to adjust the expected values to account for the optimisation, but I worry that it will make the test more fragile, because the expected value will now depend on implementation details: both which optimizations we choose to run and how those optimizations are implemented. I guess we can handle the latter by parametrizing with the unoptimized graph and optimizing it inside the test.
Added a benchmark for this in TomAugspurger/dask-benchmarks@9daf334. Thanks @bmerry!