
Allowing setitem-like operation on dask array #2000

Closed · jakirkham opened this issue Feb 21, 2017 · 21 comments
@jakirkham (Member)

Even though stack and concatenate are nice for combining arrays, sometimes they don't fit the data I have, or they require a significant amount of work to use; combining blocks of different data is one example. In cases like these, it would be nice to be able to use array assignment. While it is true that dask creates graphs of pure operations (with few exceptions) and assignment is impure, one could imagine an array-like object that translates assignments into slicing and stacking/concatenating. This would let a user write __setitem__-like syntax, yet result in a new dask array (or potentially a modified graph for the existing one), so the net effect behaves like assignment while remaining pure.
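For a contiguous 1-D slice, the translation described above might look something like this (a minimal sketch of the idea using plain dask.array operations; this is not an existing dask API):

```python
import dask.array as da

x = da.arange(10, chunks=5)
y = da.zeros(3, chunks=3, dtype=x.dtype)

# Emulate the impure x[4:7] = y by slicing around the target region and
# concatenating, which yields a brand-new (pure) dask array.
x_new = da.concatenate([x[:4], y, x[7:]])

print(x_new.compute())  # [0 1 2 3 0 0 0 7 8 9]
```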

@mrocklin (Member)

We have started adding in-place operations elsewhere, notably in dask.dataframe.

I think it's sensible to think about this. There are two problems:

  1. Dask.array won't have the same view semantics as numpy (see Support series row mutation #1915)
  2. Dask.array slicing can be quite complex to implement, so this would require a lot of work by someone.

@jakirkham (Member, Author) commented Feb 21, 2017

Interesting; I was about to ask whether this is already supported and I had merely missed the docs. ( #1840 )

That said, could you unpack what you mean in point 2 about slicing being complex? Are there particular cases you know of that would be tricky?

Edit: It seems this is restricted to bool arrays.

@mrocklin (Member)

```python
x[-5:15:-3] = y
```
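Part of what makes that case tricky: with a negative start and a negative step, which elements the slice touches depends on the array's length, which dask has to resolve against the chunk structure. A small NumPy illustration (example mine):

```python
import numpy as np

a = np.arange(30)
a[-5:15:-3] = 0   # -5 resolves to index 25: hits 25, 22, 19, 16

b = np.arange(20)
b[-5:15:-3] = 0   # -5 resolves to 15, equal to the stop: selects nothing
```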

@mrocklin (Member)

You should take a look at dask/array/slicing.py

@mrocklin (Member)

But there is also the question of how to handle the following:

```python
y = x[0, :]
y[:] = 0
print(x[0, :].compute())
```
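(For contrast, in plain NumPy y is a view, so the mutation is visible through x; illustration mine:)

```python
import numpy as np

x = np.ones((2, 3))
y = x[0, :]      # in NumPy, y is a view into x
y[:] = 0
print(x[0, :])   # [0. 0. 0.] -- a dask array without view semantics
                 # would still show the original ones here
```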

@mrocklin (Member)

As written, the current in-place approach used in dask.dataframe differs from how NumPy currently works. This could cause confusing errors.

Anyway, I have no objection to in-place operations as long as they are sensible to users and don't significantly complicate other parts of the code. I encourage exploration here.

@jakirkham (Member, Author)

Thanks for the examples and feedback.

I can see how a concatenate/stack approach might degrade, performance-wise, with non-unit step sizes. Using where would definitely be a better approach there, though working out which values land in which region could become complex as well.
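To make the where idea concrete, here is a minimal sketch (mine, not an existing dask API) that emulates a strided slice assignment with a scalar fill, purely:

```python
import numpy as np
import dask.array as da

x = da.arange(20, chunks=5)

# Emulate x[2:17:3] = 0 without mutation: build a boolean mask of the
# positions the strided slice touches, then select fill vs. original.
mask = np.zeros(x.shape, dtype=bool)
mask[2:17:3] = True
result = da.where(da.from_array(mask, chunks=x.chunks), 0, x)

print(result.compute())
```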

As for views, it is not totally clear how to handle this yet, but I can imagine a couple of ways one might approach the problem. I'm not entirely sure how practical they are in Dask currently, but they may be worth some thought.

One option might be for parents to register with children for __setitem__ updates. This would allow such changes to propagate upward as far as they need to go. It would also let other objects listen to this event chain, should that be important for some reason.

Another option might be to replace any selection obtained via __getitem__ ( e.g. x[0, :] ) with a variable in the graph that points to a subgraph, which y would also reference. That way, changes performed on y would be visible for the corresponding region of x.

@shoyer (Member) commented Feb 24, 2017

I think maintaining view semantics with dask arrays would be nearly impossible, so I wouldn't even bother. It's safer to treat every operation as producing an entirely new array.

For sanity, we should probably say that chunks for a dask array are immutable. Operations that modify a piece of a chunk, rather than replacing a whole chunk, are still tricky, though, and should probably be skipped for now. For example, consider assigning x[i, j] = y in a loop for integer i and j, with the original chunks consisting of large tiles. If we aren't replacing chunks entirely, we would need a temporary buffer array for them, and dask array would need to be aware of that buffer in some way to avoid copying each chunk entirely every time it is assigned to.
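(To illustrate the cost a buffer would avoid, here is the naive pure alternative sketched at the NumPy level; example mine:)

```python
import numpy as np

def pure_setitem(x, index, value):
    # Naive pure update: copy everything, then assign. At dask scale this
    # means a full chunk copy per assignment, which is exactly what a
    # buffer-aware implementation would try to avoid.
    z = np.array(x, copy=True)
    z[index] = value
    return z

x = np.zeros((4, 4))
for i, j in [(0, 0), (1, 2), (3, 3)]:
    x = pure_setitem(x, (i, j), 1.0)  # one full copy per iteration
```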

@jakirkham (Member, Author)

Since this was raised, there have been a lot of changes. It's probably time to do an overview of what is and isn't possible on this front. From that, we can evaluate whether there is anything still worth doing or whether this is effectively resolved.

@dionhaefner

Strong +1. I would love to implement a dask.array-based backend for Veros, a high-performance ocean simulator in Python. We already support NumPy and Bohrium as computational backends, both of which use the NumPy API. To handle ghost cells and boundaries, we do a lot of operations like:

```python
some_array[1:-2, 2:-2, :] = other_array
```

I think Dask could really be a key ingredient in bringing our model to distributed architectures, but I cannot rewrite my code to be compliant with Dask without breaking compatibility with the other backends. A setitem implementation, even if it is not truly in-place, would be tremendously helpful for evaluating whether we can use Dask for distributed simulations.

I would be happy to contribute code if necessary, but I don't know much about the internal workings of dask.array, so I'd need some guidance on that.

@mrocklin (Member)

Contributions in this direction would certainly be welcome. You might want to read the following:

The slicing code is a bit ugly today, which is why I list it last. It arose organically as we added support for more and more slicing index types (ints, slices, numpy arrays, dask arrays, ...). Refactoring there would also be welcome.

@jakirkham (Member, Author)

To clarify, are the operations that you are doing basically numpy.pad, @dionhaefner? At least, that is what the example code snippet makes me think. You might be interested in issues ( #1926 ) and ( #2415 ) if that is the case.

@dionhaefner

> To clarify, are the operations that you are doing basically numpy.pad, @dionhaefner?

Often, yes (but not always).

However, even if there were a dask.array.pad, I would probably not want to use it. I am trying to use the exact same code base for several different NumPy-compliant computational backends, and the current formulation deliberately mutates the array objects for performance reasons. Maybe this is an unusual use case, but I value being able to use the same code with different libraries and have it perform somewhat well. Also, I don't think that effectively translating

```python
array[1:-2, 2:-2, 0] = other_array
```

into

```python
array = np.dstack((
    np.pad(other_array, ((1, 2), (2, 2)), "constant"), array[:, :, 1:]
))
```

does the code any favors.

@jakirkham (Member, Author)

That's understandable. Ultimately, this is a larger problem you are stumbling upon. NumPy is a great library with a nice interface, and many other Python libraries that work with N-D array data in different domains mimic that interface. However, getting these different NumPy-like libraries to play nicely with each other is a difficult problem. In fact, it's a problem all of us interested in Dask and other NumPy-like libraries have struggled with. Recently Matt wrote a nice blog post outlining this problem and discussing possible solutions, and a NEP is being worked on/discussed to figure out how best to handle dispatching from NumPy to NumPy-like libraries.

ref: http://matthewrocklin.com/blog/work/2018/05/27/beyond-numpy
xref: numpy/numpy#11189
ref: https://mail.python.org/pipermail/numpy-discussion/2018-June/078127.html

@safijari

Has there been any movement on this issue since? Having mutable dask arrays would be a huge boon to one of my existing codebases.

@mrocklin (Member)

No. There has been no activity here. This is non-trivial to do.

@shoyer (Member) commented Jul 2, 2020

There are at least two parts to this issue:

  1. Modifying dask arrays in-place
  2. "Scatter" type operations that perform the NumPy equivalent of z = x.copy(); z[i] = y; return z

Part (1) is arguably the most problematic for dask, because array properties like chunks are expected to be immutable.

Part (2) is the functionality we really need, regardless of how it's spelled. JAX uses the notation z = x.at[i].set(y).
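(For readers unfamiliar with that notation, here is the same pure "scatter" pattern rendered in plain NumPy; example mine:)

```python
import numpy as np

x = np.arange(6)
i, y = 2, 99

# A new array z is produced and x is left untouched. JAX spells this
# z = x.at[i].set(y); how dask would spell it is an open question.
z = x.copy()
z[i] = y
```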

I believe it could be significantly easier to implement (2) in dask without the baggage of mutable __setitem__ syntax, e.g., so we can feel free to change chunk sizes as appropriate.

It is of course always possible to translate __setitem__ into __getitem__ in user code, but there are a number of cases where this syntax is much more natural. Notable examples include:

@mrocklin (Member) commented Jul 2, 2020

> Part (1) is arguably the most problematic for dask, because array properties like chunks are expected to be immutable.

We support some mutation operations today, for example:

```python
x[x < 0] = 0
```
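(A runnable version of that example, for concreteness; the printed values are mine:)

```python
import dask.array as da

x = da.arange(10, chunks=5) - 5

# Boolean-mask assignment is supported; semantically it is equivalent to
# the pure rewrite da.where(x < 0, 0, x) rather than true mutation.
x[x < 0] = 0
print(x.compute())  # [0 0 0 0 0 0 1 2 3 4]
```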

So presumably (1) is in-scope, but what stops us here is the complexity around the full generality of setitem syntax. Getitem was, as you recall, finicky to get right (at least right enough for people to be happy).

Overriding boundaries of arrays is probably an incremental improvement on what we have today.

Something like x[i] += y seems trickier because we may not know the values of i. If we do know the values of i (perhaps it is a numpy array) then this is easier, and not necessarily terribly difficult. I wouldn't expect this to modify chunking.
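One pure way to handle x[i] += y when i is a concrete numpy array (a sketch of my own, not how dask implements anything):

```python
import numpy as np
import dask.array as da

x = da.arange(10, chunks=5)
i = np.array([1, 4, 7])

# Scatter the update into a dense buffer, then add it. Like NumPy's
# fancy-index +=, duplicate indices in i would apply only once.
bump = np.zeros(10, dtype=x.dtype)
bump[i] = 5
x2 = x + da.from_array(bump, chunks=x.chunks)

print(x2.compute())  # [0 6 2 3 9 5 6 12 8 9]
```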

@jakirkham (Member, Author) commented Apr 12, 2021

This is largely supported after recent PRs ( #7033, #7393 ) and earlier work. While there are always additional things we could do here, it might be easier to track those as new issues. Going to go ahead and close this out. Thanks everyone! 😄

@davidhassell (Contributor)

Thanks, @jakirkham.

> This is largely supported after PR ( 7033 )

Just for the record, in case it's useful to those who come across this issue in the future: the PR that finally supported this turned out to be #7393, after flaws in #7033 were uncovered.

@jakirkham (Member, Author)

Ah right, thanks for the correction, David 🙂 I've updated my post above.
