
Array creation and __array_function__ #4883

Closed · pentschev opened this issue Jun 5, 2019 · 21 comments

@pentschev (Member) commented Jun 5, 2019

One of the issues that arises when introducing __array_function__ in a NumPy-like library such as Dask is array creation. Many functions require some sort of array to be created, either for temporary use or to hold results. Within Dask specifically, most resulting arrays are simply NumPy arrays, usually wrapped in a Dask array. However, when we deal with another NumPy-like library, be it CuPy, xarray, Sparse, etc., we need to ensure that the arrays created within Dask match those types.

A very common case is the use of empty (and its counterparts full, ones, and zeros), which we can now handle thanks to the introduction of the shape argument in empty_like (and counterparts) in numpy/numpy#13046. This is a great starting point, but it doesn't solve all array creation issues. Most notably, we have array and asarray, which are not included in the __array_function__ scope, so we have no simple way to deal with certain situations involving those functions.
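A minimal sketch of what that shape argument enables (assuming NumPy >= 1.17; with a CuPy or Sparse input, __array_function__ would dispatch the same call to that library instead of NumPy):

```python
import numpy as np

# zeros_like/empty_like/full_like/ones_like now accept a `shape` override,
# so library code can allocate "an array like this one" with a new shape
# without hard-coding np.zeros. With a NumPy input this returns a NumPy
# array; a duck-array input would dispatch via __array_function__.
a = np.arange(6, dtype=np.float32)
out = np.zeros_like(a, shape=(2, 3))
print(type(out).__name__, out.shape, out.dtype)  # ndarray (2, 3) float32
```

The point is that the calling code never has to name the concrete array type.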

With Dask, we could of course introduce mechanisms such as looking up the type of the array and dispatching a call to that library, but this would limit the scope of use to the few libraries we support and would, to my understanding, be counter-productive to the goals of __array_function__.

So my question here is: are there ideas out there already or discussions already initiated on how to solve such problems? Maybe @shoyer, @mrocklin or @jakirkham know something about this already?

@mrocklin (Member) commented Jun 5, 2019

I'm not sure I fully understand; what are the situations where this goes poorly?

For example, if I call da.asarray(my_cupy_array), do I not get a dask array of cupy arrays?

@pentschev (Member, Author) commented Jun 5, 2019

> For example, if I call da.asarray(my_cupy_array), do I not get a dask array of cupy arrays?

That's correct, and if I'm not mistaken, this is the primary reason we have to use asarray=False when creating a Dask array from a CuPy array.

@pentschev (Member, Author) commented Jun 5, 2019

And beyond that, we have cases where we create an array from a list. Currently there's no mechanism to ensure that such an array will be of the same type as some other array it will be operated on with. One example is percentile, in which we have an array a that can be, say, a CuPy array, and q, which is array-like but needs to be made into a CuPy array. See https://github.com/dask/dask/blob/master/dask/array/percentile.py#L90

@mrocklin (Member) commented Jun 5, 2019

If possible, I would like to focus first on the da.asarray(my_cupy_array) case; I think it is likely the easier one to manage.

Some relevant code is here:

dask/dask/array/core.py

Lines 2316 to 2334 in 165f71e

if type(x) is np.ndarray and all(len(c) == 1 for c in chunks):
    # No slicing needed
    dsk = {(name, ) + (0, ) * x.ndim: x}
else:
    if getitem is None:
        if type(x) is np.ndarray and not lock:
            # simpler and cleaner, but missing all the nuances of getter
            getitem = operator.getitem
        elif fancy:
            getitem = getter
        else:
            getitem = getter_nofancy
    dsk = getem(original_name, chunks, getitem=getitem, shape=x.shape,
                out_name=name, lock=lock, asarray=asarray,
                dtype=x.dtype)
    dsk[original_name] = x
return Array(dsk, name, chunks, dtype=x.dtype)

There is some clear special-casing of NumPy here. I wonder if there is some check we could do instead of the explicit type check that would still be valid. cc @shoyer in case he has suggestions. For example, I wonder what would happen if we added the check

asarray = not hasattr(x, '__array_function__')

This would change behavior on matrix objects (although maybe that's ok now with _meta).

Also, I noticed that there is an asanyarray function, which isn't as nice as using asarray directly, but might be a short-term stand-in.

dask/dask/array/core.py

Lines 3163 to 3201 in 165f71e

def asanyarray(a):
    """Convert the input to a dask array.

    Subclasses of ``np.ndarray`` will be passed through as chunks unchanged.

    Parameters
    ----------
    a : array-like
        Input data, in any form that can be converted to a dask array.

    Returns
    -------
    out : dask array
        Dask array interpretation of a.

    Examples
    --------
    >>> import dask.array as da
    >>> import numpy as np
    >>> x = np.arange(3)
    >>> da.asanyarray(x)
    dask.array<array, shape=(3,), dtype=int64, chunksize=(3,)>
    >>> y = [[1, 2, 3], [4, 5, 6]]
    >>> da.asanyarray(y)
    dask.array<array, shape=(2, 3), dtype=int64, chunksize=(2, 3)>
    """
    if isinstance(a, Array):
        return a
    elif hasattr(a, 'to_dask_array'):
        return a.to_dask_array()
    elif hasattr(a, 'data') and type(a).__module__.startswith('xarray.'):
        return asanyarray(a.data)
    elif isinstance(a, (list, tuple)) and any(isinstance(i, Array) for i in a):
        a = stack(a)
    elif not isinstance(getattr(a, 'shape', None), Iterable):
        a = np.asanyarray(a)
    return from_array(a, chunks=a.shape, getitem=getter_inline,
                      asarray=False)

@mrocklin (Member) commented Jun 5, 2019

The da.asarray(my_list) case seems a bit harder to resolve. Maybe a like= keyword? That seems a bit odd, though.
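One possible shape for such a keyword, sketched here with NumPy-only semantics. The function name and the empty_like-based allocation are assumptions, and cross-library filling (e.g. host to device) would need more care:

```python
import numpy as np

def asarray(a, like=None):
    # Hypothetical like= keyword: allocate through empty_like(like, ...)
    # so __array_function__ can pick the right library, then fill in the
    # values. With a plain NumPy `like`, this is just np.asarray.
    if like is None:
        return np.asarray(a)
    values = np.asarray(a)
    out = np.empty_like(like, shape=values.shape, dtype=values.dtype)
    out[...] = values  # NOTE: cross-device assignment may not be this easy
    return out

q = asarray([1, 2, 3], like=np.ones(3))
```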

@pentschev (Member, Author) commented Jun 5, 2019

I don't have a particular order in mind; I'm just raising this question in case we already have something ongoing, and to get ideas from other people.

The idea I had was similar to your like= argument suggestion, but with a new function, similar to empty_like. In any case, I don't know what the better long-term solution would be. Ideally, supposing the idea is to have __array_function__ more widely used, we would have a way for libraries to specify what a resulting array should look like: e.g., given an array as input, the output should be the same type of array, which often requires that other arrays created internally be of that type as well.

Maybe I have incorrect expectations and this was left out of NEP-18 for a reason, but for now I think that capability is missing, and it's essential in some situations, such as the percentile case I mentioned before.

@pentschev (Member, Author) commented Jun 5, 2019

And by the way, something that maybe wasn't clear before: the issue is not with Dask alone, since we can't use numpy.asarray(cupy_array) either:

>>> import numpy as np
>>> import cupy
>>> a = cupy.empty((2, 2))
>>> b = np.asarray(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nfs/pentschev/.local/lib/python3.7/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: object __array__ method not producing an array

@jakirkham (Member) commented Jun 5, 2019

For fixing da.asarray, maybe we can go through a deprecation cycle and then default to asarray=False.

That said, IIUC the da.asarray is a finer point. The larger issue is, how do we handle array creation in algorithmic code? Currently we have hard-coded NumPy array creation in some places. How can we relax this to create an array matching the input type when needed?

@pentschev (Member, Author) commented Jun 5, 2019

> For fixing da.asarray, maybe we can go through a deprecation cycle and then default to asarray=False.

I think the main problem isn't da.asarray, but np.asarray itself, since we can't dispatch that with a CuPy array, for example. Maybe if that was possible, we wouldn't need to use asarray=False in Dask array creation at all.

> That said, IIUC the da.asarray is a finer point. The larger issue is, how do we handle array creation in algorithmic code? Currently we have hard-coded NumPy array creation in some places. How can we relax this to create an array matching the input type when needed?

Correct, and I'm raising this here to see what ideas we have on the Dask side, but this may be a question that we need to raise in NumPy itself.


@jakirkham (Member) commented Jun 5, 2019

IIUC Peter is suggesting that np.asarray should mean cp.asarray in some cases. Perhaps similar to the backend selection idea. Though please correct me if I'm getting this wrong, Peter.

Anyways I don't think we should get too bogged down in asarray in particular. It's only one example of the kinds of issues we still encounter.

@mrocklin (Member) commented Jun 5, 2019

My guess is that that's not possible today. Too much of the ecosystem probably depends on np.asarray to produce exactly a numpy array.

@pentschev (Member, Author) commented Jun 5, 2019

> IIUC Peter is suggesting that np.asarray should mean cp.asarray in some cases. Perhaps similar to the backend selection idea. Though please correct me if I'm getting this wrong, Peter.
>
> Anyways I don't think we should get too bogged down in asarray in particular. It's only one example of the kinds of issues we still encounter.

You're correct in your entire statement. :)

@pentschev (Member, Author) commented Jun 5, 2019

> My guess is that that's not possible today. Too much of the ecosystem probably depends on np.asarray to produce exactly a numpy array.

That's right, it isn't possible today, and that's exactly what I want to discuss: how do we make it possible?

It is too often the case that we need to create new arrays, and for __array_function__ to work properly in all (or almost all) cases, we need to deal with that. This is why I used the example of shape in empty_like: we depend on it in several instances. We need something similar for np.array and np.asarray, and perhaps others as well, though I don't know how many of NumPy's array creation functions are of much relevance to other libraries such as Dask and CuPy.
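At the call site, such a dispatchable coercion could look like the following sketch, assuming a hypothetical NumPy that accepts a like= reference on asarray. With a NumPy reference this would simply be a NumPy array, while a CuPy reference would be expected to yield a CuPy array via __array_function__:

```python
import numpy as np

# Hypothetical like= reference on asarray: "make q the same kind of
# array as a", without naming CuPy/Dask/Sparse anywhere in the code.
a = np.linspace(0, 1, 5)
q = np.asarray([25, 50, 75], like=a)
```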

@shoyer (Member) commented Jun 6, 2019

> My guess is that that's not possible today. Too much of the ecosystem probably depends on np.asarray to produce exactly a numpy array.
>
> That's right, it isn't possible today, and that's exactly what I want to discuss: how do we make it possible?

My view is that we need a new coercion function/protocol inside NumPy, something like np.duckarray(). We can't change np.asarray() because that is used in some cases where the user expects real NumPy arrays (e.g., with strides).

Nathaniel Smith and I discussed this in NEP-22, but didn't come to a concrete proposal, mostly because we struggled to come up with a good name.

@jakirkham (Member) commented Jun 12, 2019

Is there an issue on the NumPy repo that we should use to discuss this proposal, @shoyer?

@pentschev (Member, Author) commented Jun 12, 2019

TBH, I've been trying to wrap my head around the NEP-22 proposal, but I still don't understand how we would use it in practice. I'm quite sure I don't understand the proposal well, but my limited understanding is that duck arrays would automatically convert themselves, which will face some rejection; here I'm thinking particularly of CuPy, where no implicit conversion is allowed. Happy to continue the discussion either here or somewhere more appropriate.

@shoyer (Member) commented Jun 12, 2019

It wouldn't be a bad idea to discuss this in the NumPy tracker somewhere.

My thinking is something like the following implementation for the protocol:

import numpy as np

# hypothetical np.duckarray() function
def duckarray(array_like):
    if hasattr(array_like, '__duckarray__'):
        # return an object that can be substituted for np.ndarray
        return array_like.__duckarray__()
    return np.asarray(array_like)

Example usage:

class SparseArray:
    def __duckarray__(self):
        return self
    def __array__(self):
        raise TypeError

duckarray(SparseArray())  # returns a SparseArray object
np.array(SparseArray())   # raises TypeError

@jakirkham (Member) commented Jun 17, 2019

What you are proposing sounds reasonable, Stephan. Is there a particular NumPy issue we should move to?

@jakirkham (Member) commented Jun 25, 2019

@shoyer, I went ahead and opened an issue (numpy/numpy#13831), copying everyone here, to get the ball rolling on this. Hope that is ok. 🙂

@pentschev (Member, Author) commented Aug 1, 2019

We now have a more appropriate place to discuss this in numpy/numpy#13831, so I'm closing this.

@pentschev closed this Aug 1, 2019
