Allowing setitem-like operation on dask array #2000
We started adding inplace operations in other cases, notably dask.dataframe. I think it's sensible to think about this. There are two problems, notably around in-place semantics and the complexity of our slicing code.
Interesting, I was about to ask if this is already supported and I had merely missed the docs (#1840). That said, maybe you can unpack what you mean by slicing being complex in problem 2. Are there particular cases that you know of that would be tricky? Edit: Seems this is restricted to
You should take a look at dask/array/slicing.py
But there is also the question of how to handle the following:

```python
y = x[0, :]
y[:] = 0
print(x[0, :].compute())
```
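For comparison, here is a minimal NumPy sketch of the view semantics in question (not code from the thread): in plain NumPy the slice is a view, so the write is visible through the parent array.

```python
import numpy as np

x = np.arange(6).reshape(2, 3)
y = x[0, :]     # in NumPy this is a view into x, not a copy
y[:] = 0        # writing through the view mutates x as well
print(x[0, :])  # [0 0 0]
```

Mimicking this with dask would require the write to `y` to somehow be reflected in `x`'s task graph, which is the crux of the question above.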
As written, the current inplace approach used in dask.dataframe differs from how NumPy currently works. This could cause confusing errors. Anyway, I don't have any objection to inplace operations as long as they are sensible to users and don't significantly complicate other parts of the code. I encourage exploration here.
Thanks for the examples and feedback. As for views, it is not totally clear how to fix this yet, but I can imagine a couple of ways one might approach this problem. I'm not entirely sure how practical they are in Dask currently, but they may be worth some thought. One option might be for parents to register with children for updates. Another option might be to replace the selection with a new array.
I think maintaining view semantics with dask arrays would be nearly impossible, so I wouldn't even bother. It's safer to treat every operation as producing an entirely new array. For sanity, we should probably say that indexing always produces a copy.
Since this issue was raised there have been a lot of changes. It's probably time to do an overview of what is and isn't possible on this front. From that we can evaluate whether there is anything still worth doing, or whether this is effectively resolved.
Strong +1. I would love to be able to write

```python
some_array[1:-2, 2:-2, :] = other_array
```

I think Dask could really be a key ingredient in bringing our model to distributed architectures, but I cannot rewrite my code to be compliant with Dask without breaking compatibility with the other backends. I would be happy to contribute code if necessary, but I don't know much about the internal workings of dask.
Contributions in this direction would certainly be welcome. You might want to read the following:
The slicing code is a bit ugly today, which is why I list it last. It arose organically as we added support for more and more slicing index types (ints, slices, NumPy arrays, dask arrays, ...). Refactoring there would also be welcome.
To clarify, are the operations that you are doing basically assignments of that form?
Often, yes (but not always). However, even if there were a way to do this, I don't think translating

```python
array[1:-2, 2:-2, 0] = other_array
```

into

```python
array = np.dstack((
    np.pad(other_array, ((1, 2), (2, 2)), "constant"),
    array[:, :, 1:],
))
```

does the code any favors.
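For what it's worth, the pad/dstack translation above can be checked against direct assignment. This is a self-contained sketch with assumed shapes (a 5x6x3 array of zeros and a 2x2 update), not code from the thread.

```python
import numpy as np

array = np.zeros((5, 6, 3))
other_array = np.ones((2, 2))

# Direct (impure) assignment into plane 0 of a copy.
expected = array.copy()
expected[1:-2, 2:-2, 0] = other_array

# The pure translation discussed above: pad the update out to the
# full plane, then restack it with the untouched planes.
translated = np.dstack((
    np.pad(other_array, ((1, 2), (2, 2)), "constant"),
    array[:, :, 1:],
))

assert np.array_equal(expected, translated)
```

Note the pad trick is only equivalent here because the surrounding values are zero; in general the translation would also have to carry the original values through, which is exactly why it does the code no favors.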
That's understandable. Ultimately this is a larger problem that you are stumbling upon. Namely, NumPy is a great library with a nice interface, and many other Python libraries that work with N-D array data in different domains mimic the NumPy interface. However, getting these different NumPy-like libraries to play nicely with each other is a difficult problem. In fact, it's a problem all of us interested in Dask or other NumPy-like libraries have struggled with. Recently Matt wrote a nice blog post outlining this problem and discussing possible solutions, and a NEP is being worked on/discussed to figure out how best to handle dispatching from NumPy to NumPy-like libraries.

ref: http://matthewrocklin.com/blog/work/2018/05/27/beyond-numpy
Has there been any movement on this issue since? Having mutable dask arrays would be a huge boon to one of my existing codebases.
No. There has been no activity here. This is non-trivial to do. |
There are at least two parts to this issue:

1. Mutating a dask array in place via `__setitem__`.
2. Producing a new array with selected elements replaced, as a pure functional operation.

Part (1) is arguably the most problematic for dask, because array properties would need to change underneath the user. Part (2) is the functionality we really need, regardless of how it's spelled; JAX uses the notation `x.at[idx].set(value)` for this. I believe it could be significantly easier to implement (2) in dask without the baggage of mutable arrays. It is of course always possible to translate (1) into (2).
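A pure functional indexed update in the sense of (2) can be sketched in a few lines of NumPy; `set_at` here is a hypothetical helper for illustration, not a dask or JAX API.

```python
import numpy as np

def set_at(x, idx, value):
    """Return a new array equal to x with x[idx] replaced by value.

    The input is never mutated, so this is safe to use on shared
    or lazily computed data.
    """
    out = x.copy()
    out[idx] = value
    return out

x = np.arange(5)
y = set_at(x, slice(1, 3), 0)
print(x)  # [0 1 2 3 4]  (unchanged)
print(y)  # [0 0 0 3 4]
```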
We support some mutation operations today. For example:

```python
x[x < 0] = 0
```

So presumably (1) is in scope, but what stops us here is the complexity around the full generality of setitem syntax. Getitem was, as you recall, finicky to get right (at least right enough for people to be happy). Overriding boundaries of arrays is probably an incremental improvement on what we have today.
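The mask assignment mentioned above follows NumPy's semantics; a minimal NumPy illustration of the behavior, which dask.array mirrors for boolean masks:

```python
import numpy as np

x = np.array([-1, 2, -3, 4])
x[x < 0] = 0  # replace negative entries in place
print(x)      # [0 2 0 4]
```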
Thanks, @jakirkham.
Just for the record, in case it's useful to those who come across this issue in the future, the PR that finally supported this turned out to be #7393, after flaws in #7033 were uncovered.
Ah right thanks for the correction David 🙂 Updated my post above |
Even though `stack` and `concatenate` are nice for combining arrays, sometimes they don't fit the data I have or require a significant amount of work to use (for instance, combining blocks of different data). In cases like these, it would be nice to be able to use array assignment. While it is true that `dask` creates graphs of pure operations (with few exceptions) and assignment is impure, one could imagine creating an array-like object that translates assignments into slicing and stacking/concatenating. This would allow a user to make use of a `__setitem__`-like syntax, but result in creating a new dask array (or potentially modifying the graph of the existing one), so the net result behaves like assignment while remaining pure.
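The translation described here, turning an assignment into slicing plus concatenation so the result is a new array, can be sketched for the 1-D case; `assign_pure` is a hypothetical illustration, not a dask API.

```python
import numpy as np

def assign_pure(x, start, stop, value):
    """Translate ``x[start:stop] = value`` into a pure operation:
    slice around the target region and concatenate, returning a
    new array while leaving x untouched."""
    value = np.broadcast_to(value, (stop - start,) + x.shape[1:])
    return np.concatenate([x[:start], value, x[stop:]])

x = np.arange(6)
y = assign_pure(x, 1, 3, 0)
print(x)  # [0 1 2 3 4 5]  (original is unchanged)
print(y)  # [0 0 0 3 4 5]
```

The same slice-and-concatenate idea extends to N dimensions and to dask arrays, at the cost of rebuilding the graph for the affected region.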