
Allowing setitem-like operation on dask array #2000

Closed · jakirkham opened this issue Feb 21, 2017 · 21 comments
@jakirkham (Member)

Even though stack and concatenate are nice for combining arrays, sometimes they don't fit the data I have, or they require a significant amount of work to use; combining blocks of different data is one example. In cases like these, it would be nice to be able to use array assignment. While it is true that dask creates graphs of pure operations (with few exceptions) and assignment is impure, one could imagine an array-like object that translates assignments into slicing and stacking/concatenating. This would let a user write __setitem__-like syntax, yet result in a new dask array (or potentially a modified graph for the existing one), so the net effect behaves like assignment while remaining pure.
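For a contiguous 1-D slice, the translation described above might look something like this (a minimal sketch of the idea using plain dask.array operations; this is not an existing dask API):

```python
import dask.array as da

x = da.arange(10, chunks=5)
y = da.zeros(3, chunks=3, dtype=x.dtype)

# Emulate the impure x[4:7] = y by slicing around the target region and
# concatenating, which yields a brand-new (pure) dask array.
x_new = da.concatenate([x[:4], y, x[7:]])

print(x_new.compute())  # [0 1 2 3 0 0 0 7 8 9]
```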

@mrocklin (Member)

We have started adding in-place operations elsewhere, notably in dask.dataframe.

I think it's sensible to think about this. There are two problems:

  1. Dask.array won't have the same view semantics as numpy (see Support series row mutation #1915)
  2. Dask.array slicing can be quite complex to implement, so this would require a lot of work by someone.

@jakirkham (Member, Author) commented Feb 21, 2017

Interesting; I was about to ask whether this is already supported and I had merely missed the docs. ( #1840 )

That said, could you unpack what you mean in point 2 about slicing being complex? Are there particular cases you know of that would be tricky?

Edit: It seems this is restricted to bool arrays.

@mrocklin (Member)

```python
x[-5:15:-3] = y
```
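Part of what makes that case tricky: with a negative start and a negative step, which elements the slice touches depends on the array's length, which dask has to resolve against the chunk structure. A small NumPy illustration (example mine):

```python
import numpy as np

a = np.arange(30)
a[-5:15:-3] = 0   # -5 resolves to index 25: hits 25, 22, 19, 16

b = np.arange(20)
b[-5:15:-3] = 0   # -5 resolves to 15, equal to the stop: selects nothing
```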

@mrocklin (Member)

You should take a look at dask/array/slicing.py

@mrocklin (Member)

But there is also the question of how to handle the following:

```python
y = x[0, :]
y[:] = 0
print(x[0, :].compute())
```
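(For contrast, in plain NumPy y is a view, so the mutation is visible through x; illustration mine:)

```python
import numpy as np

x = np.ones((2, 3))
y = x[0, :]      # in NumPy, y is a view into x
y[:] = 0
print(x[0, :])   # [0. 0. 0.] -- a dask array without view semantics
                 # would still show the original ones here
```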

@mrocklin (Member)

As written, the current in-place approach used in dask.dataframe differs from how NumPy currently works. This could cause confusing errors.

Anyway, I have no objection to in-place operations as long as they are sensible to users and don't significantly complicate other parts of the code. I encourage exploration here.

@jakirkham (Member, Author)

Thanks for the examples and feedback.

I can see how a concatenate/stack approach might degrade, performance-wise, with non-unit step sizes. Using where would definitely be a better approach there, though working out which values land in which region could become complex as well.
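To make the where idea concrete, here is a minimal sketch (mine, not an existing dask API) that emulates a strided slice assignment with a scalar fill, purely:

```python
import numpy as np
import dask.array as da

x = da.arange(20, chunks=5)

# Emulate x[2:17:3] = 0 without mutation: build a boolean mask of the
# positions the strided slice touches, then select fill vs. original.
mask = np.zeros(x.shape, dtype=bool)
mask[2:17:3] = True
result = da.where(da.from_array(mask, chunks=x.chunks), 0, x)

print(result.compute())
```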

As for views, it is not totally clear how to handle this yet, but I can imagine a couple of ways one might approach the problem. I'm not entirely sure how practical they are in Dask currently, but they may be worth some thought.

One option might be for parents to register with children for __setitem__ updates. This would allow such changes to propagate upward as far as they need to go. It would also let other objects listen to this event chain, should that be important for some reason.

Another option might be to replace any selection obtained via __getitem__ ( e.g. x[0, :] ) with a variable in the graph that points to a subgraph, which y would also reference. That way, changes performed on y would be visible for the corresponding region of x.

@shoyer (Member) commented Feb 24, 2017

I think maintaining view semantics with dask arrays would be nearly impossible, so I wouldn't even bother. It's safer to treat every operation as producing an entirely new array.

For sanity, we should probably say that chunks for a dask array are immutable. Operations that modify a piece of a chunk, rather than replacing a whole chunk, are still tricky, though, and should probably be skipped for now. For example, consider assigning x[i, j] = y in a loop for integer i and j, with the original chunks consisting of large tiles. If we aren't replacing chunks entirely, we would need a temporary buffer array for them, and dask array would need to be aware of that buffer in some way to avoid copying each chunk entirely every time it is assigned to.
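(To illustrate the cost a buffer would avoid, here is the naive pure alternative sketched at the NumPy level; example mine:)

```python
import numpy as np

def pure_setitem(x, index, value):
    # Naive pure update: copy everything, then assign. At dask scale this
    # means a full chunk copy per assignment, which is exactly what a
    # buffer-aware implementation would try to avoid.
    z = np.array(x, copy=True)
    z[index] = value
    return z

x = np.zeros((4, 4))
for i, j in [(0, 0), (1, 2), (3, 3)]:
    x = pure_setitem(x, (i, j), 1.0)  # one full copy per iteration
```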

@jakirkham (Member, Author)

Since this was raised, there have been a lot of changes. It's probably time to do an overview of what is and isn't possible on this front. From that, we can evaluate whether there is anything still worth doing or whether this is effectively resolved.

@dionhaefner

Strong +1. I would love to implement a dask.array-based backend for Veros, a high-performance ocean simulator in Python. We already support NumPy and Bohrium as computational backends, both of which use the NumPy API. To handle ghost cells and boundaries, we do a lot of operations like:

```python
some_array[1:-2, 2:-2, :] = other_array
```

I think Dask could really be a key ingredient in bringing our model to distributed architectures, but I cannot rewrite my code to be compliant with Dask without breaking compatibility with the other backends. A setitem implementation, even if it is not truly in-place, would be tremendously helpful for evaluating whether we can use Dask for distributed simulations.

I would be happy to contribute code if necessary, but I don't know much about the internal workings of dask.array, so I'd need some guidance on that.

@mrocklin (Member)

Contributions in this direction would certainly be welcome. You might want to read the following:

The slicing code is a bit ugly today, which is why I list it last. It arose organically as we added support for more and more slicing index types (ints, slices, numpy arrays, dask arrays, ...). Refactoring there would also be welcome.

@jakirkham (Member, Author)

To clarify, are the operations that you are doing basically numpy.pad, @dionhaefner? At least, that is what the example code snippet makes me think. You might be interested in issues ( #1926 ) and ( #2415 ) if that is the case.

@dionhaefner

> To clarify, are the operations that you are doing basically numpy.pad, @dionhaefner?

Often, yes (but not always).

However, even if there were a dask.array.pad, I would probably not want to use it. I am trying to use the exact same code base for several different NumPy-compliant computational backends, and the current formulation deliberately mutates the array objects for performance reasons. Maybe this is an unusual use case, but I value being able to use the same code with different libraries and have it perform somewhat well. Also, I don't think that effectively translating

```python
array[1:-2, 2:-2, 0] = other_array
```

into

```python
array = np.dstack((
    np.pad(other_array, ((1, 2), (2, 2)), "constant"), array[:, :, 1:]
))
```

does the code any favors.

@jakirkham (Member, Author)

That's understandable. Ultimately, this is a larger problem you are stumbling upon. NumPy is a great library with a nice interface, and many other Python libraries that work with N-D array data in different domains mimic that interface. However, getting these different NumPy-like libraries to play nicely with each other is a difficult problem. In fact, it's a problem all of us interested in Dask and other NumPy-like libraries have struggled with. Recently Matt wrote a nice blog post outlining this problem and discussing possible solutions, and a NEP is being worked on/discussed to figure out how best to handle dispatching from NumPy to NumPy-like libraries.

ref: http://matthewrocklin.com/blog/work/2018/05/27/beyond-numpy
xref: numpy/numpy#11189
ref: https://mail.python.org/pipermail/numpy-discussion/2018-June/078127.html

@safijari

Has there been any movement on this issue since? Having mutable dask arrays would be a huge boon to one of my existing codebases.

@mrocklin (Member)

No. There has been no activity here. This is non-trivial to do.

@shoyer (Member) commented Jul 2, 2020

There are at least two parts to this issue:

  1. Modifying dask arrays in-place
  2. "Scatter" type operations that perform the NumPy equivalent of z = x.copy(); z[i] = y; return z

Part (1) is arguably the most problematic for dask, because array properties like chunks are expected to be immutable.

Part (2) is the functionality we really need, regardless of how it's spelled. JAX uses the notation z = x.at[i].set(y).
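(For readers unfamiliar with that notation, here is the same pure "scatter" pattern rendered in plain NumPy; example mine:)

```python
import numpy as np

x = np.arange(6)
i, y = 2, 99

# A new array z is produced and x is left untouched. JAX spells this
# z = x.at[i].set(y); how dask would spell it is an open question.
z = x.copy()
z[i] = y
```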

I believe it could be significantly easier to implement (2) in dask without the baggage of mutable __setitem__ syntax, e.g., so we can feel free to change chunk sizes as appropriate.

It is of course always possible to translate __setitem__ into __getitem__ in user code, but there are a number of cases where this syntax is much more natural. Notable examples include:

@mrocklin (Member) commented Jul 2, 2020

> Part (1) is arguably the most problematic for dask, because array properties like chunks are expected to be immutable.

We support some mutation operations today, for example:

```python
x[x < 0] = 0
```
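(A runnable version of that example, for concreteness; the printed values are mine:)

```python
import dask.array as da

x = da.arange(10, chunks=5) - 5

# Boolean-mask assignment is supported; semantically it is equivalent to
# the pure rewrite da.where(x < 0, 0, x) rather than true mutation.
x[x < 0] = 0
print(x.compute())  # [0 0 0 0 0 0 1 2 3 4]
```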

So presumably (1) is in-scope, but what stops us here is the complexity around the full generality of setitem syntax. Getitem was, as you recall, finicky to get right (at least right enough for people to be happy).

Overriding boundaries of arrays is probably an incremental improvement on what we have today.

Something like x[i] += y seems trickier because we may not know the values of i. If we do know the values of i (perhaps it is a numpy array) then this is easier, and not necessarily terribly difficult. I wouldn't expect this to modify chunking.
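One pure way to handle x[i] += y when i is a concrete numpy array (a sketch of my own, not how dask implements anything):

```python
import numpy as np
import dask.array as da

x = da.arange(10, chunks=5)
i = np.array([1, 4, 7])

# Scatter the update into a dense buffer, then add it. Like NumPy's
# fancy-index +=, duplicate indices in i would apply only once.
bump = np.zeros(10, dtype=x.dtype)
bump[i] = 5
x2 = x + da.from_array(bump, chunks=x.chunks)

print(x2.compute())  # [0 6 2 3 9 5 6 12 8 9]
```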

@jakirkham (Member, Author) commented Apr 12, 2021

This is largely supported after recent PRs ( #7033, #7393 ) and earlier work. While there are always additional things we could do here, it might be easier to track those as new issues. Going to go ahead and close this out. Thanks everyone! 😄

@davidhassell (Contributor)

Thanks, @jakirkham.

> This is largely supported after PR ( 7033 )

Just for the record, in case it's useful to those who come across this issue in the future: the PR that finally supported this turned out to be #7393, after flaws in #7033 were uncovered.

@jakirkham (Member, Author)

Ah right, thanks for the correction, David 🙂 I've updated my post above.
