Add optimizations to dask.array to remap ufunc(data)[key] to ufunc(data[key]) #746

shoyer opened this Issue Sep 22, 2015 · 11 comments


@shoyer
Member
shoyer commented Sep 22, 2015

This is a reminder to finish up/reboot the aborted PR #410.

These optimizations would be extremely useful for xray's integration with dask (both internally and for our users), since they would let us write transformations on entire dask arrays without incurring unnecessary IO. Once this is working robustly, I'll be able to delete a pretty significant chunk of code from xray that deals with lazy computation.

@jcrist
Member
jcrist commented Sep 23, 2015

I'm working on this. For single-argument ufuncs this is fairly simple. I'm uncertain how to handle functions with multiple arguments (or operators), since the slicing can't be pushed into all arguments due to broadcasting:

In [87]: a = np.arange(27).reshape((3, 3, 3))

In [88]: b = np.arange(9).reshape((3, 3))

In [89]: (a + b)[0]
Out[89]:
array([[ 0,  2,  4],
       [ 6,  8, 10],
       [12, 14, 16]])

In [90]: a[0] + b[0]
Out[90]:
array([[ 0,  2,  4],
       [ 3,  5,  7],
       [ 6,  8, 10]])

Suggestions on how to properly handle this would be appreciated.

@shoyer
Member
shoyer commented Sep 23, 2015

I think we should insert an explicit broadcasting step into elemwise using the dask equivalent of np.broadcast_arrays. We can build this up out of broadcast_to, though we'll need our own code to calculate the target shape. Then, we can simply index into each non-scalar argument, because they will all be the same shape.
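For example, here is a minimal sketch (reusing Jim's arrays from above) of why broadcasting first makes the rewrite valid: once the smaller operand is broadcast to the output shape, pushing the same index into every argument gives the same answer:

import numpy as np
import dask.array as da

a = da.from_array(np.arange(27).reshape((3, 3, 3)), chunks=3)
b = da.from_array(np.arange(9).reshape((3, 3)), chunks=3)

# After broadcasting, every non-scalar argument has the output shape,
# so the same index selects corresponding elements from each of them.
b2 = da.broadcast_to(b, a.shape)
assert ((a + b)[0] == (a[0] + b2[0])).all().compute()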

@shoyer
Member
shoyer commented Sep 23, 2015

It's possibly worth copying NumPy's unit tests from _broadcast_shape in numpy.lib.stride_tricks: https://github.com/numpy/numpy/blob/941a4e037d394dada43e3c2beef1e650d5505742/numpy/lib/tests/test_stride_tricks.py#L267

However, the current implementation only works on NumPy arrays because it uses the NumPy array iterator: https://github.com/numpy/numpy/blob/941a4e037d394dada43e3c2beef1e650d5505742/numpy/lib/stride_tricks.py#L115

There's some code that does this calculation in an old version of stride_tricks, but I suspect you could simplify it considerably:
https://github.com/numpy/numpy/blob/edb902cdc6573553afcf11047ecdfb447e444322/numpy/lib/stride_tricks.py#L81-L118
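For reference, a pure-Python version that works on shape tuples alone (so it never touches the array data, and hence would work for dask arrays too) might look something like this; the name broadcast_shape is hypothetical, not an existing dask function:

def broadcast_shape(*shapes):
    # Left-pad shorter shapes with 1s, per NumPy's broadcasting rules.
    ndim = max(len(s) for s in shapes)
    padded = [(1,) * (ndim - len(s)) + tuple(s) for s in shapes]
    result = []
    for dims in zip(*padded):
        # Along each axis, all non-unit sizes must agree.
        sizes = set(d for d in dims if d != 1)
        if len(sizes) > 1:
            raise ValueError("shapes %r are not broadcastable" % (shapes,))
        result.append(sizes.pop() if sizes else 1)
    return tuple(result)

assert broadcast_shape((3, 3, 3), (3, 3)) == (3, 3, 3)
assert broadcast_shape((5, 1, 4), (2, 4)) == (5, 2, 4)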

@jcrist
Member
jcrist commented Sep 23, 2015

After experimenting with this for a bit, it seems that explicitly calling broadcast_to in the graph is about 20% slower in simple cases (adding a 500x500x500 array to a 500x500 one, chunked by 250 along each dimension). I tried a few other cases and saw similar numbers. This may be insignificant in real cases, I'm not certain, but I suspect it will have an impact, as elementwise computations are pretty common.

Another option would be to special-case the getitem instead, doing the broadcasting inside the selection after the rewrite. This wouldn't penalize normal operation, and would still allow the rewriting you want. The downside is that it's yet-another-get*-function, and may not play nicely with the other optimizations. I like this option better, though.

Thoughts?

@shoyer
Member
shoyer commented Sep 23, 2015

20% slower at compute or compile time?


@jcrist
Member
jcrist commented Sep 23, 2015

In [13]: a = np.arange(500**3).reshape((500, 500, 500))

In [14]: b = np.arange(500**2).reshape((500, 500))

In [15]: t = da.from_array(a, chunks=(250, 250, 250))

In [16]: s = da.from_array(b, chunks=(250, 250))

In [17]: s2 = da.broadcast_to(s, t.shape)

In [18]: o = t + s

In [19]: o2 = t + s2

In [20]: %timeit o.compute()
1 loops, best of 3: 807 ms per loop

In [21]: %timeit o2.compute()
1 loops, best of 3: 1.02 s per loop

Again, this is a toy computation, probably not representative of real-world workloads.

@shoyer
Member
shoyer commented Sep 23, 2015

Hmm. Might be worth a shot trying getitem_from_broadcasted then.

@jcrist
Member
jcrist commented Sep 23, 2015

So I'm running into a problem: the following won't work with non-array objects (hdf5 files, etc.):

import numpy as np  # np.broadcast_to requires an actual ndarray

def getitem_from_broadcasted(arg, index, outshape):
    # Broadcast arg up to the full output shape, then apply the index.
    return np.broadcast_to(arg, outshape)[index]

For indexing/slicing operations, I can transform the index appropriately with a little extra logic to remove the need to call broadcast_to. I'm unsure if the same can be done for indexing with boolean or integer arrays.

@shoyer
Member
shoyer commented Sep 23, 2015

As long as you aren't doing vectorized fancy-indexing stuff, you can treat each axis independently. Then indexing into a synthetically enlarged axis is pretty simple: just grab the appropriate number of elements (preferably using broadcast_to), since they will all be duplicates.
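Concretely, something like this per-axis rule (a 1-d sketch; the helper name is made up): for an axis that was synthetically enlarged from length 1, every position holds the same data, so a slice reduces to reading element 0 and replicating it, and an integer index just drops the axis:

import numpy as np

def index_broadcast_axis(x, idx, n):
    # x has length 1 along its first axis but was (conceptually)
    # broadcast to length n before being indexed with idx.
    if isinstance(idx, slice):
        k = len(range(*idx.indices(n)))  # number of selected positions
        return np.broadcast_to(x[0:1], (k,) + x.shape[1:])
    else:
        # A single integer: the axis is dropped entirely.
        return x[0]

x = np.array([[1, 2, 3]])            # shape (1, 3)
big = np.broadcast_to(x, (5, 3))     # synthetically enlarged axis
assert (index_broadcast_axis(x, slice(1, 4), 5) == big[1:4]).all()
assert (index_broadcast_axis(x, 2, 5) == big[2]).all()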


@jcrist
Member
jcrist commented Sep 24, 2015

So would it be fine if this optimization pass only worked on indexes composed of slices, ellipsis, or single integers (i.e., skip the rewrite otherwise)?

@shoyer
Member
shoyer commented Sep 24, 2015

Indexing with 1D arrays is also pretty important. I don't think that's much harder than doing slices. Slices, ellipsis and single integers would be a good start, though, and I can see if there's an obvious way to extend that.
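To pin down the rule, the guard for the optimization pass could be a small predicate along these lines (hypothetical, not dask API), with 1-d integer arrays behind an opt-in flag until that extension is worked out:

import numpy as np

def can_rewrite(index, allow_1d_arrays=False):
    # Only push the index through the ufunc when every component is a
    # type we know how to handle.
    if not isinstance(index, tuple):
        index = (index,)
    for ind in index:
        if isinstance(ind, (int, slice)) or ind is Ellipsis:
            continue
        if (allow_1d_arrays and isinstance(ind, np.ndarray)
                and ind.ndim == 1):
            continue
        return False
    return True

assert can_rewrite((slice(0, 2), Ellipsis, 3))
assert not can_rewrite(np.array([[0, 1]]))  # 2-d fancy indexing
assert can_rewrite(np.array([0, 2]), allow_1d_arrays=True)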

