
Array creation and __array_function__ #4883

Closed · pentschev opened this issue Jun 5, 2019 · 21 comments

@pentschev (Member) commented Jun 5, 2019

One of the issues that arises when introducing __array_function__ in a NumPy-like library such as Dask is array creation. Many functions require some sort of array to be created, either for temporary use or to hold results. Within Dask specifically, most resulting arrays are simply NumPy arrays, usually wrapped in a Dask array. However, when we deal with another NumPy-like library, be it CuPy, xarray, Sparse, etc., we need to ensure that the arrays created within Dask match those types.

A very common case is the use of empty (and its counterparts full, ones, and zeros), which we can now handle thanks to the introduction of the shape argument in empty_like (and counterparts) in numpy/numpy#13046. This is a great starting point, but it doesn't solve all array creation issues. Most notably, we have array and asarray, which are not included in the __array_function__ scope, so we have no simple way to deal with certain situations involving those functions.
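A minimal sketch of what that shape argument enables (assuming NumPy >= 1.17; with a CuPy or Sparse input, __array_function__ would dispatch the same call to that library instead of NumPy):

```python
import numpy as np

# zeros_like/empty_like/full_like/ones_like now accept a `shape` override,
# so library code can allocate "an array like this one" with a new shape
# without hard-coding np.zeros. With a NumPy input this returns a NumPy
# array; a duck-array input would dispatch via __array_function__.
a = np.arange(6, dtype=np.float32)
out = np.zeros_like(a, shape=(2, 3))
print(type(out).__name__, out.shape, out.dtype)  # ndarray (2, 3) float32
```

The point is that the calling code never has to name the concrete array type.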

With Dask, we could of course introduce mechanisms such as looking up the type of the array and dispatching a call to that library, but this would limit the scope of use to the few libraries we support and would, to my understanding, be counter-productive to the goals of __array_function__.

So my question here is: are there ideas out there already or discussions already initiated on how to solve such problems? Maybe @shoyer, @mrocklin or @jakirkham know something about this already?

@mrocklin (Member) commented Jun 5, 2019

I'm not sure I fully understand; what are the situations where this goes poorly?

For example, if I call da.asarray(my_cupy_array), do I not get a dask array of cupy arrays?

@pentschev (Member, Author) commented Jun 5, 2019

> For example, if I call da.asarray(my_cupy_array), do I not get a dask array of cupy arrays?

That's correct, and if I'm not mistaken, this is the primary reason we have to use asarray=False when creating a Dask array from a CuPy array.

@pentschev (Member, Author) commented Jun 5, 2019

And beyond that, we have cases where we create an array from a list. Currently there's no mechanism to ensure that such an array will be of the same type as some other array it will be operated on with. One example is percentile, in which we have an array a that can be, say, a CuPy array, and q, which is array-like but needs to be made into a CuPy array. See https://github.com/dask/dask/blob/master/dask/array/percentile.py#L90

@mrocklin (Member) commented Jun 5, 2019

If possible, I would like to focus first on the da.asarray(my_cupy_array) case; I think it is likely the easier one to manage.

Some relevant code is here:

dask/dask/array/core.py

Lines 2316 to 2334 in 165f71e

if type(x) is np.ndarray and all(len(c) == 1 for c in chunks):
    # No slicing needed
    dsk = {(name, ) + (0, ) * x.ndim: x}
else:
    if getitem is None:
        if type(x) is np.ndarray and not lock:
            # simpler and cleaner, but missing all the nuances of getter
            getitem = operator.getitem
        elif fancy:
            getitem = getter
        else:
            getitem = getter_nofancy
    dsk = getem(original_name, chunks, getitem=getitem, shape=x.shape,
                out_name=name, lock=lock, asarray=asarray,
                dtype=x.dtype)
    dsk[original_name] = x
return Array(dsk, name, chunks, dtype=x.dtype)

There is some clear special-casing of NumPy here. I wonder if there is some check we could do instead of the explicit type check that would still be valid. cc @shoyer in case he has suggestions. For example, I wonder what would happen if we added the check

asarray = not hasattr(x, '__array_function__')

This would change behavior on matrix objects (although maybe that's ok now with _meta).

Also, I noticed that there is an asanyarray function, which isn't as nice as using asarray directly, but might be a short-term stand-in.

dask/dask/array/core.py

Lines 3163 to 3201 in 165f71e

def asanyarray(a):
    """Convert the input to a dask array.

    Subclasses of ``np.ndarray`` will be passed through as chunks unchanged.

    Parameters
    ----------
    a : array-like
        Input data, in any form that can be converted to a dask array.

    Returns
    -------
    out : dask array
        Dask array interpretation of a.

    Examples
    --------
    >>> import dask.array as da
    >>> import numpy as np
    >>> x = np.arange(3)
    >>> da.asanyarray(x)
    dask.array<array, shape=(3,), dtype=int64, chunksize=(3,)>
    >>> y = [[1, 2, 3], [4, 5, 6]]
    >>> da.asanyarray(y)
    dask.array<array, shape=(2, 3), dtype=int64, chunksize=(2, 3)>
    """
    if isinstance(a, Array):
        return a
    elif hasattr(a, 'to_dask_array'):
        return a.to_dask_array()
    elif hasattr(a, 'data') and type(a).__module__.startswith('xarray.'):
        return asanyarray(a.data)
    elif isinstance(a, (list, tuple)) and any(isinstance(i, Array) for i in a):
        a = stack(a)
    elif not isinstance(getattr(a, 'shape', None), Iterable):
        a = np.asanyarray(a)
    return from_array(a, chunks=a.shape, getitem=getter_inline,
                      asarray=False)

@mrocklin (Member) commented Jun 5, 2019

The da.asarray(my_list) case seems a bit harder to resolve. Maybe a like= keyword? That seems a bit odd, though.
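One possible shape for such a keyword, sketched here with NumPy-only semantics. The function name and the empty_like-based allocation are assumptions, and cross-library filling (e.g. host to device) would need more care:

```python
import numpy as np

def asarray(a, like=None):
    # Hypothetical like= keyword: allocate through empty_like(like, ...)
    # so __array_function__ can pick the right library, then fill in the
    # values. With a plain NumPy `like`, this is just np.asarray.
    if like is None:
        return np.asarray(a)
    values = np.asarray(a)
    out = np.empty_like(like, shape=values.shape, dtype=values.dtype)
    out[...] = values  # NOTE: cross-device assignment may not be this easy
    return out

q = asarray([1, 2, 3], like=np.ones(3))
```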

@pentschev (Member, Author) commented Jun 5, 2019

I don't have a particular order in mind; I'm just raising this question in case we already have something ongoing, and to get ideas from other people.

The idea I had was similar to your like= argument suggestion, but with a new function, similar to empty_like. In any case, I don't know what the better long-term solution would be. Ideally, supposing the idea is to have __array_function__ more widely used, we would have a way for libraries to specify what a resulting array should look like: e.g., given an array as input, the output should be the same type of array, which often requires that other arrays created internally be of that type as well.

Maybe I have incorrect expectations and this was left out of NEP-18 for a reason, but for now I think that capability is missing, and it's essential in some situations, such as the percentile case I mentioned before.

@pentschev (Member, Author) commented Jun 5, 2019

And by the way, something that maybe wasn't clear before: the issue is not with Dask alone, since we can't use numpy.asarray(cupy_array) either:

>>> import numpy as np
>>> import cupy
>>> a = cupy.empty((2, 2))
>>> b = np.asarray(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nfs/pentschev/.local/lib/python3.7/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: object __array__ method not producing an array

@jakirkham (Member) commented Jun 5, 2019

For fixing da.asarray, maybe we can go through a deprecation cycle and then default to asarray=False.

That said, IIUC the da.asarray is a finer point. The larger issue is, how do we handle array creation in algorithmic code? Currently we have hard-coded NumPy array creation in some places. How can we relax this to create an array matching the input type when needed?

@pentschev (Member, Author) commented Jun 5, 2019

> For fixing da.asarray, maybe we can go through a deprecation cycle and then default to asarray=False.

I think the main problem isn't da.asarray, but np.asarray itself, since we can't dispatch that with a CuPy array, for example. Maybe if that was possible, we wouldn't need to use asarray=False in Dask array creation at all.

> That said, IIUC the da.asarray is a finer point. The larger issue is, how do we handle array creation in algorithmic code? Currently we have hard-coded NumPy array creation in some places. How can we relax this to create an array matching the input type when needed?

Correct, and I'm raising this here to see what ideas we have on the Dask side, but this may be a question that we need to raise in NumPy itself.


@jakirkham (Member) commented Jun 5, 2019

IIUC Peter is suggesting that np.asarray should mean cp.asarray in some cases. Perhaps similar to the backend selection idea. Though please correct me if I'm getting this wrong, Peter.

Anyways I don't think we should get too bogged down in asarray in particular. It's only one example of the kinds of issues we still encounter.

@mrocklin (Member) commented Jun 5, 2019

My guess is that that's not possible today. Too much of the ecosystem probably depends on np.asarray to produce exactly a numpy array.

@pentschev (Member, Author) commented Jun 5, 2019

> IIUC Peter is suggesting that np.asarray should mean cp.asarray in some cases. Perhaps similar to the backend selection idea. Though please correct me if I'm getting this wrong, Peter.
>
> Anyways I don't think we should get too bogged down in asarray in particular. It's only one example of the kinds of issues we still encounter.

You're correct in your entire statement. :)

@pentschev (Member, Author) commented Jun 5, 2019

> My guess is that that's not possible today. Too much of the ecosystem probably depends on np.asarray to produce exactly a numpy array.

That's right, it isn't possible today, and that's exactly what I want to discuss: how do we make it possible?

It is too often the case that we need to create new arrays, and for __array_function__ to work properly in all (or almost all) cases, we need to deal with that. This is why I used the example of shape in empty_like: we depend on it in several instances. We need something similar for np.array and np.asarray, and perhaps others as well, though I don't know how many of NumPy's array creation functions are of much relevance to other libraries such as Dask and CuPy.
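At the call site, such a dispatchable coercion could look like the following sketch, assuming a hypothetical NumPy that accepts a like= reference on asarray. With a NumPy reference this would simply be a NumPy array, while a CuPy reference would be expected to yield a CuPy array via __array_function__:

```python
import numpy as np

# Hypothetical like= reference on asarray: "make q the same kind of
# array as a", without naming CuPy/Dask/Sparse anywhere in the code.
a = np.linspace(0, 1, 5)
q = np.asarray([25, 50, 75], like=a)
```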

@shoyer (Member) commented Jun 6, 2019

> My guess is that that's not possible today. Too much of the ecosystem probably depends on np.asarray to produce exactly a numpy array.
>
> That's right, it isn't possible today, and that's exactly what I want to discuss: how do we make it possible?

My view is that we need a new coercion function/protocol inside NumPy, something like np.duckarray(). We can't change np.asarray() because that is used in some cases where the user expects real NumPy arrays (e.g., with strides).

Nathaniel Smith and I discussed this in NEP-22, but didn't come to a concrete proposal, mostly because we struggled to come up with a good name.

@jakirkham (Member) commented Jun 12, 2019

Is there an issue on the NumPy repo that we should use to discuss this proposal, @shoyer?

@pentschev (Member, Author) commented Jun 12, 2019

TBH, I've been trying to wrap my head around the NEP-22 proposal, but I still don't understand how we would use it in practice. I'm quite sure I don't understand the proposal well, but my limited understanding is that duck arrays would automatically convert themselves, which will face some rejection; here I'm thinking particularly of CuPy, where no implicit conversion is allowed. Happy to continue the discussion either here or somewhere more appropriate.

@shoyer (Member) commented Jun 12, 2019

It wouldn't be a bad idea to discuss this in the NumPy tracker somewhere.

My thinking is something like the following implementation for the protocol:

import numpy as np

# hypothetical np.duckarray() function
def duckarray(array_like):
    if hasattr(array_like, '__duckarray__'):
        # return an object that can be substituted for np.ndarray
        return array_like.__duckarray__()
    return np.asarray(array_like)

Example usage:

class SparseArray:
    def __duckarray__(self):
        return self
    def __array__(self):
        raise TypeError

duckarray(SparseArray())  # returns a SparseArray object
np.array(SparseArray())   # raises TypeError

@jakirkham (Member) commented Jun 17, 2019

What you are proposing sounds reasonable, Stephan. Is there a particular NumPy issue we should move to?

@jakirkham (Member) commented Jun 25, 2019

@shoyer, I went ahead and opened an issue (numpy/numpy#13831), copying everyone here, to get the ball rolling on this. Hope that is ok. 🙂

@pentschev (Member, Author) commented Aug 1, 2019

We now have a more appropriate place to discuss this in numpy/numpy#13831, so I'm closing this.

@pentschev closed this Aug 1, 2019
