Add optimizations to dask.array to remap ufunc(data)[key] to ufunc(data[key]) #746

Open
shoyer opened this Issue Sep 22, 2015 · 20 comments

@shoyer
Member

shoyer commented Sep 22, 2015

This is a reminder to finish up/reboot the aborted PR #410

These optimizations would be extremely useful for xray's integration with dask (both internally and for our users), by allowing us to write transformations on entire dask arrays without any cost in unnecessary IO. Once this is working robustly, I'll be able to delete a pretty significant chunk of code from xray that deals with lazy computation.

@jcrist

Member

jcrist commented Sep 23, 2015

I'm working on this. For single argument ufuncs, this is fairly simple. I'm uncertain how to handle functions with multiple arguments (or operators), as the slicing can't be pushed to all arguments due to broadcasting:

In [87]: a = np.arange(27).reshape((3, 3, 3))

In [88]: b = np.arange(9).reshape((3, 3))

In [89]: (a + b)[0]
Out[89]:
array([[ 0,  2,  4],
       [ 6,  8, 10],
       [12, 14, 16]])

In [90]: a[0] + b[0]
Out[90]:
array([[ 0,  2,  4],
       [ 3,  5,  7],
       [ 6,  8, 10]])

Suggestions on how to properly handle this would be appreciated.

@shoyer

Member

shoyer commented Sep 23, 2015

I think we should insert an explicit broadcasting step into elemwise using the dask equivalent of np.broadcast_arrays. We can build this up out of broadcast_to, though we'll need our own code to calculate the target shape. Then, we can simply index into each non-scalar argument, because they will all be the same shape.
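The target-shape calculation mentioned above can be sketched in plain Python, following NumPy's standard broadcasting rules (a rough sketch only; `broadcast_shape` is a hypothetical helper name, not an existing dask function):

```python
def broadcast_shape(*shapes):
    """Compute the shape the inputs broadcast to, per NumPy's rules:
    align shapes on their trailing axes; along each axis the sizes
    must either match or be 1."""
    ndim = max(len(s) for s in shapes)
    out = []
    for axis in range(1, ndim + 1):  # walk axes from the right
        sizes = {s[-axis] for s in shapes if len(s) >= axis}
        sizes.discard(1)             # size-1 dims stretch to anything
        if len(sizes) > 1:
            raise ValueError("shapes %r are not broadcastable" % (shapes,))
        out.append(sizes.pop() if sizes else 1)
    return tuple(reversed(out))

# e.g. the shapes from the example above:
# broadcast_shape((3, 3, 3), (3, 3)) -> (3, 3, 3)
```

With the target shape in hand, each non-scalar argument can be passed through `broadcast_to` so all arguments share one shape before indexing.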

@shoyer

Member

shoyer commented Sep 23, 2015

It's possibly worth copying NumPy's unit tests from _broadcast_shape in numpy.lib.stride_tricks: https://github.com/numpy/numpy/blob/941a4e037d394dada43e3c2beef1e650d5505742/numpy/lib/tests/test_stride_tricks.py#L267

However, the current implementation only works on NumPy arrays because it uses the NumPy array iterator: https://github.com/numpy/numpy/blob/941a4e037d394dada43e3c2beef1e650d5505742/numpy/lib/stride_tricks.py#L115

There's some code that does this calculation in an old version of stride_tricks, but I suspect you could simplify it considerably:
https://github.com/numpy/numpy/blob/edb902cdc6573553afcf11047ecdfb447e444322/numpy/lib/stride_tricks.py#L81-L118

@jcrist

Member

jcrist commented Sep 23, 2015

After experimenting with this for a bit, it seems that explicitly calling broadcast_to in the graph is about 20% slower on simple cases (summing a 500x500x500 array with a 500x500 one, chunked by 250 in each dimension). I tried a few other cases and saw similar numbers. This may be insignificant in real cases; I'm not certain. I suspect it will have an impact though, as elementwise computations are pretty common.

Another option would be to special case the getitem instead, doing the broadcasting inside the selection after the rewrite. This wouldn't penalize normal operation, and would allow the rewriting you want. Downside is that it's yet-another-get*-function, and may not play nice with the other optimizations. I like this option better though.

Thoughts?

@shoyer

Member

shoyer commented Sep 23, 2015

20% slower at compute or compile time?


@jcrist

Member

jcrist commented Sep 23, 2015

In [13]: a = np.arange(500**3).reshape((500, 500, 500))

In [14]: b = np.arange(500**2).reshape((500, 500))

In [15]: t = da.from_array(a, chunks=(250, 250, 250))

In [16]: s = da.from_array(b, chunks=(250, 250))

In [17]: s2 = da.broadcast_to(s, t.shape)

In [18]: o = t + s

In [19]: o2 = t + s2

In [20]: %timeit o.compute()
1 loops, best of 3: 807 ms per loop

In [21]: %timeit o2.compute()
1 loops, best of 3: 1.02 s per loop

Again, toy computation, probably not representative of real world things.

@shoyer

Member

shoyer commented Sep 23, 2015

Hmm. Might be worth a shot trying getitem_from_broadcasted then.

@jcrist

Member

jcrist commented Sep 23, 2015

So I'm running into a problem: the following won't work with non-array objects (hdf5 files, etc.).

def getitem_from_broadcasted(arg, index, outshape):
    return np.broadcast_to(arg, outshape)[index]

For indexing/slicing operations, I can transform the index appropriately with a little extra logic to remove the need to call broadcast_to. I'm unsure if the same can be done for indexing with boolean or integer arrays.

@shoyer

Member

shoyer commented Sep 23, 2015

As long as you aren't doing vectorized fancy indexing, you can treat each axis independently. Indexing into a synthetically enlarged axis is then pretty simple: just grab the appropriate number of elements (preferably using broadcast_to), since they will all be duplicates.

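The per-axis approach above can be made concrete with a small sketch (hypothetical code, not dask's actual implementation; it assumes `index` is a full tuple of ints/slices, one per output axis, with no Ellipsis, newaxis, or fancy indexing):

```python
import numpy as np

def getitem_then_broadcast(arr, index, out_shape):
    """Apply a basic index to `arr` as if it had been broadcast to
    `out_shape`, without materializing the broadcast.  Each axis is
    handled independently: on a synthetically enlarged axis every
    element is identical, so we index position 0 and re-broadcast
    the small result afterwards."""
    ndim = len(out_shape)
    # Left-pad the input shape with 1s to align trailing axes.
    in_shape = (1,) * (ndim - arr.ndim) + arr.shape
    arr = arr.reshape(in_shape)
    inner = []         # index applied to the original (padded) array
    result_shape = []  # shape the small sliced result broadcasts to
    for idx, n_in, n_out in zip(index, in_shape, out_shape):
        if n_in == n_out:                 # real axis: pass index through
            inner.append(idx)
            if isinstance(idx, slice):
                result_shape.append(len(range(*idx.indices(n_out))))
        elif isinstance(idx, slice):      # broadcast axis, sliced
            inner.append(slice(0, 1))     # keep one (repeated) element
            result_shape.append(len(range(*idx.indices(n_out))))
        else:                             # broadcast axis, integer index
            inner.append(0)               # axis is dropped from output
    return np.broadcast_to(arr[tuple(inner)], tuple(result_shape))
```

The key point is that `arr[tuple(inner)]` only reads the data the original chunk actually holds; the final `broadcast_to` is a zero-copy view.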

@jcrist

Member

jcrist commented Sep 24, 2015

So would it be fine if this optimization pass only worked on indexes composed of slices, ellipsis, or single integers (i.e., skip the rewrite otherwise)?

@shoyer

Member

shoyer commented Sep 24, 2015

Indexing with 1D arrays is also pretty important. I don't think that's much harder than doing slices. Slices, ellipsis and single integers would be a good start, though, and I can see if there's an obvious way to extend that.

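A predicate for deciding whether an index qualifies for the rewrite could look like this (a hypothetical sketch; `is_rewritable_index` is an illustrative name, not an existing dask function):

```python
import numbers

def is_rewritable_index(index):
    """Return True only when every component of `index` is a slice,
    Ellipsis, or integer -- the cases the proposed rewrite pass would
    handle.  Boolean/integer arrays (fancy indexing) and np.newaxis
    yield False, so the optimization is simply skipped for them."""
    if not isinstance(index, tuple):
        index = (index,)
    return all(
        i is Ellipsis or isinstance(i, (numbers.Integral, slice))
        for i in index
    )
```

Skipping (rather than erroring on) unsupported indexes keeps the pass safe: the graph is left as-is and still computes correctly, just without the IO saving.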

@jakirkham

Member

jakirkham commented Dec 4, 2017

Have there been any more recent PRs on this topic since the one mentioned in the OP? What issues are still blocking it?

@shoyer

Member

shoyer commented Dec 4, 2017

Even the single argument version of this would be enough to solve xarray's lazy decoding issue, I think. (We would technically need binary arithmetic between an array and a scalar, but that can be mapped into the single argument version.)
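The scalar case maps onto the single-argument version because binding the scalars turns the binary operation into a unary elementwise function. A hypothetical xarray-style scale/offset decode, for illustration (illustrative names only):

```python
import numpy as np

def make_decoder(scale, offset):
    # Closing over the scalars turns `data * scale + offset` into a
    # single-argument elementwise function, so the
    # f(data)[key] -> f(data[key]) rewrite applies unchanged.
    return lambda data: data * scale + offset

decode = make_decoder(0.01, 273.15)
data = np.arange(6.0)
key = slice(1, 4)
# Indexing commutes with the unary elementwise function:
assert np.allclose(decode(data)[key], decode(data[key]))
```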

@jakirkham

Member

jakirkham commented Jul 20, 2018

Do you have any advice, @jcrist, regarding the single argument version? Maybe a PR or write-up that might be a useful starting point?

@shoyer

Member

shoyer commented Jul 21, 2018

Over in pydata/xarray#2298, @mrocklin mentioned #2538 as another way to solve this problem, which looks very promising. It seems like some of the geoscience/astronomy use cases could really use the ability to perform this optimization on arrays that are functions of multiple arguments.

@jakirkham

Member

jakirkham commented Oct 30, 2018

Given the recent optimizations that went in, is this one still outstanding or has it been addressed?

@mrocklin

Member

mrocklin commented Oct 30, 2018

@jakirkham

Member

jakirkham commented Nov 5, 2018

As an interesting related case, I would submit key = mask (indexing with a boolean mask).

@jakirkham

Member

jakirkham commented Nov 16, 2018

FWIW, a colleague voiced interest in solving map_blocks cases with selection using a similar optimization. I expect the use case is not too different from the ones that motivated opening this issue in the first place.

@jakirkham

Member

jakirkham commented Nov 19, 2018

I should add that map_overlap would be another case where this would be nice to solve. However, given that the operation is overlapping, it's not really a ufunc, and it is probably different enough to deserve its own issue.
