NEP-18: mean_chunk() object __array__ method not producing an array #4481

Closed
pentschev opened this issue Feb 13, 2019 · 15 comments

@pentschev (Member) commented Feb 13, 2019

Several Dask operations that use mean_chunk() from dask/array/reductions.py fail for Dask arrays created from non-NumPy (e.g., CuPy, sparse) arrays.

Some of the operations confirmed to fail (including some in non-core Dask projects):

  • dask.mean()
  • dask.glm.algorithms.*

A sample to reproduce the issue is given below, followed by its traceback.

import cupy
import dask.array as da

x = cupy.random.random((5000, 1000))

d = da.from_array(x, chunks=(1000, 1000), asarray=False)

d.mean().compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/base.py", line 398, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/local.py", line 459, in get_async
    raise_exception(exc, tb)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/compatibility.py", line 112, in reraise
    raise exc
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/local.py", line 230, in execute_task
    result = _execute_task(task, data)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/optimization.py", line 942, in __call__
    dict(zip(self.inkeys, args)))
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/compatibility.py", line 93, in apply
    return func(*args, **kwargs)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/array/reductions.py", line 336, in mean_chunk
    result['n'] = n
ValueError: object __array__ method not producing an array
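
For context, the failing pattern boils down to the reconstruction below (inferred from the traceback; running it requires a GPU with cupy installed): mean_chunk() allocates an empty NumPy structured array and then assigns the CuPy intermediates into it, which forces NumPy to coerce the device array through its __array__ hook.

import numpy as np
import cupy

total = cupy.ones((1, 1))
n = cupy.ones((1, 1))

# mean_chunk() allocates a *NumPy* structured array...
result = np.empty(n.shape, dtype=[('total', total.dtype), ('n', n.dtype)])

# ...and assigning a device array into it requires an implicit
# device-to-host conversion, which cupy does not provide:
result['n'] = n  # ValueError: object __array__ method not producing an array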
@mrocklin (Member) commented Feb 13, 2019

Hrm, @jcrist, do you have any suggestions here? The challenge is that we're creating an empty NumPy array and then assigning into it. This makes things hard for non-NumPy implementations like sparse or cupy.

Two thoughts:

  1. We could have the chunk functions return a tuple rather than a record array. My guess is that downstream code will get upset.
  2. We could use empty_like rather than empty for our dispatch. This would allow the __array_function__ protocol to work as desired (see the sketch below).
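
For context, a minimal sketch of why empty_like plays well with __array_function__ while empty cannot (MyArray is a made-up stand-in for cupy/sparse, and this assumes a NumPy version with the protocol enabled, i.e. 1.17+): empty takes no array argument, so there is nothing to dispatch on, whereas empty_like receives a prototype whose type can route the call back to its own library.

import numpy as np

class MyArray:
    """Made-up NEP-18-aware array, standing in for cupy/sparse."""
    def __init__(self, shape, dtype=np.float64):
        self.shape, self.dtype = shape, np.dtype(dtype)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.empty_like:
            prototype = args[0]
            return MyArray(prototype.shape, kwargs.get('dtype') or prototype.dtype)
        return NotImplemented

a = MyArray((3, 3))
print(type(np.empty_like(a)))  # <class '__main__.MyArray'>
# np.empty((3, 3)) has no array argument to dispatch on, so it can only
# ever return a plain np.ndarray.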
@jcrist (Member) commented Feb 13, 2019

We already have a dispatch for empty in dask.array.reductions; could cupy just register an implementation for that?

empty_like won't work, since you can't override the shape (unfortunately).
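
To illustrate the limitation (true as of early 2019; NumPy 1.17 later added a shape= override to the *_like creation functions):

import numpy as np

x = np.ones((4, 4))
print(np.empty_like(x).shape)  # (4, 4): the result always mirrors the prototype
# np.empty_like(x, shape=(2, 2)) only became possible with the shape=
# keyword introduced in NumPy 1.17, after this discussion.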

@mrocklin (Member) commented Feb 13, 2019

Right, following @jcrist's suggestion, this should be easy.

@pentschev you probably want something like the following in dask/array/backends.py:

from .reductions import empty_lookup
empty_lookup.register(cupy.ndarray, cupy.empty)
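
For reference, a rough sketch of how such a registration resolves (a toy example assuming empty_lookup is a dask.utils.Dispatch instance, as the comment above suggests):

import numpy as np
from dask.utils import Dispatch

# Toy dispatch mirroring empty_lookup (the real one lives in
# dask.array.reductions): the chunk's concrete type picks the allocator.
empty_lookup = Dispatch('empty')
empty_lookup.register(np.ndarray, np.empty)

chunk = np.zeros((2, 2))
empty = empty_lookup.dispatch(type(chunk))      # np.empty for numpy chunks
result = empty(chunk.shape, dtype=chunk.dtype)  # allocated by the chunk's library
print(result.shape, result.dtype)               # (2, 2) float64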
@pentschev (Member, Author) commented Feb 14, 2019

from .reductions import empty_lookup
empty_lookup.register(cupy.ndarray, cupy.empty)

Yes, that would probably solve the problem, but cupy.empty() doesn't support a list of tuples as dtype. Just as a means of reproduction, consider the following (total is another CuPy array from the same session):

In [25]: n = cupy.array([[1000000.]])

In [26]: n
Out[26]: array([[1000000.]])

In [27]: n.shape
Out[27]: (1, 1)

In [28]: cupy.empty(n.shape, dtype=[('total', total.dtype), ('n', total.dtype)])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-ea9bfab84d63> in <module>
----> 1 cupy.empty(n.shape, dtype=[('total', total.dtype), ('n', total.dtype)])

~/.local/lib/python3.5/site-packages/cupy/creation/basic.py in empty(shape,
dtype, order)
     20 
     21     """
---> 22     return cupy.ndarray(shape, dtype, order=order)
     23 
     24 

~/src/cupy/cupy/core/core.pyx in cupy.core.core.ndarray.__init__()

~/src/cupy/cupy/core/_dtype.pyx in cupy.core._dtype.get_dtype_with_itemsize()

TypeError: unhashable type: 'list'

I'm not sure whether this is a bug or a missing feature in CuPy. But as far as I can tell, this empty array is only used internally by Dask, so an alternative would be to replace it with something else, since it only holds two values by name, much like a hash table.
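
One hedged sketch of that alternative (illustrative, not Dask's actual code, though it is essentially the direction the fix below ends up taking): have mean_chunk() return its two named intermediates in a plain dict, so no host-side structured array is allocated at all.

import numpy as np

def mean_chunk(x, dtype='f8', **kwargs):
    # Return the two intermediates by name; a dict imposes no array type,
    # so cupy/sparse chunks pass through untouched.
    return {'total': x.sum(dtype=dtype, **kwargs),
            'n': np.ones_like(x).sum(**kwargs)}

def mean_agg(chunks, dtype='f8'):
    # Combine the per-chunk intermediates into the final mean.
    total = sum(c['total'] for c in chunks)
    n = sum(c['n'] for c in chunks)
    return (total / n).astype(dtype)

chunks = [mean_chunk(np.arange(4.)), mean_chunk(np.arange(4., 8.))]
print(mean_agg(chunks))  # 3.5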

@pentschev (Member, Author) commented Feb 14, 2019

@mrocklin I'm opening the issue on CuPy's tracker. If you find an appropriate way of avoiding the use of record arrays, I could also work on the implementation if you'd like; just let me know.

@mrocklin (Member) commented Feb 14, 2019

Raised here: cupy/cupy#2031

It looks like we can resolve this on our end just by wrapping the list in an np.dtype(...) call.

@mrocklin (Member) commented Feb 14, 2019

So the following change is probably needed, and welcome, on the Dask side:

-   result = empty(n.shape, dtype=[('total', total.dtype), ('n', n.dtype)])
+   result = empty(n.shape, dtype=np.dtype([('total', total.dtype), ('n', n.dtype)]))
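
Presumably this works because np.dtype(...) normalizes the unhashable list of field tuples into a hashable np.dtype instance before it ever reaches CuPy's dtype handling:

import numpy as np

fields = [('total', 'f8'), ('n', 'f8')]
# hash(fields)         # TypeError: unhashable type: 'list' -- what cupy tripped on
dt = np.dtype(fields)  # a proper dtype object: hashable and library-neutral
print(dt.names)        # ('total', 'n')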
@pentschev (Member, Author) commented Feb 14, 2019

Yes, that should work. I can try it out now and submit a PR, or are you already working on this, @mrocklin?

@mrocklin (Member) commented Feb 14, 2019

All yours

@pentschev (Member, Author) commented Feb 14, 2019

Well, there's yet another CuPy problem when assigning a value to result; I guess it can only handle numeric indices with the bracket operator:

Traceback (most recent call last):
  File "empty.py", line 9, in <module>
    result['n'] = n
  File "cupy/core/core.pyx", line 1183, in cupy.core.core.ndarray.__setitem__
  File "cupy/core/_routines_indexing.pyx", line 49, in cupy.core._routines_indexing._ndarray_setitem
  File "cupy/core/_routines_indexing.pyx", line 726, in cupy.core._routines_indexing._scatter_op
  File "cupy/core/_routines_indexing.pyx", line 309, in cupy.core._routines_indexing._simple_getitem
ValueError: invalid literal for int() with base 10: 'n'

Looking at cupy.core._routines_indexing._simple_getitem(), this is where it fails:

308         elif numpy.isscalar(s):
309             ind = int(s)

According to NumPy's documentation, a string counts as a scalar, so CuPy immediately tries to convert it to an int, since only numeric indices seem to be expected there.
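
The scalar behavior is easy to confirm on the host side:

import numpy as np

print(np.isscalar('n'))  # True: NumPy counts Python strings as scalars
int('n')                 # ValueError: invalid literal for int() with base 10: 'n'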

@pentschev (Member, Author) commented Feb 14, 2019

Agreed. Perhaps this could be part of your recently opened issue cupy/cupy#2031, since this is more of a follow-up case? If a fix like the one you suggested lands, this will definitely be the next issue to surface, so I see no reason to fix one without the other.

mrocklin added a commit to mrocklin/dask that referenced this issue Feb 20, 2019

Modify mean chunk functions to return dicts rather than arrays
These functions return two array chunks each:

1.  The sum of the array
2.  The counts of the array

We need to return these as a single intermediate value.

Previously we did this by constructing an empty numpy array and then
assigning into it.  This doesn't work as well with numpy-like arrays
like cupy and sparse.

Now we return a dict containing two arrays.  This is nice in a few ways,
but does mean that we need to replicate the `concatenate=`
functionality within our chunk aggregation and combine functions, which
can be tricky.  This commit fails to do it correctly, but may be a good
start.

After this we would like to implement the same fix for other functions,
notably the `moment_*` functions in this same reductions.py file.

Fixes dask#4481
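
For the concatenate= point above, a rough sketch of what replicating it for dict intermediates might look like (illustrative only; the real combine logic in the commit is more involved):

import numpy as np

def concatenate_dicts(parts, axis=0):
    # Given a list of {'total': ..., 'n': ...} intermediates, concatenate
    # each named field independently, mimicking what concatenate=True did
    # for plain array chunks.
    return {k: np.concatenate([p[k] for p in parts], axis=axis)
            for k in parts[0]}

parts = [{'total': np.array([[6.]]), 'n': np.array([[4.]])},
         {'total': np.array([[22.]]), 'n': np.array([[4.]])}]
print(concatenate_dicts(parts)['total'])  # [[ 6.] [22.]]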
@mrocklin (Member) commented Feb 20, 2019

I've made an attempt at this in #4513; it's not complete, though. If anyone has time to investigate it further, that would be welcome.

@pentschev (Member, Author) commented Feb 20, 2019

I had a quick look here, but it really seems to be a more complicated issue than it appeared at first. I will try to investigate a little further.

By the way, I would say this is probably the highest-priority bug on the Dask side regarding the CuPy integration. I've seen many functions break due to mean_chunk(), and it is the breaking point for the Dask-GLM algorithms as well. Most likely we'll uncover other bugs after fixing this one, but we should really fix it soon.

mrocklin added a commit that referenced this issue Feb 25, 2019

Modify mean chunk functions to return dicts rather than arrays (#4513)

Fixes #4481

pentschev referenced this issue Apr 24, 2019: NEP-18 Issue Tracking #4731 (open)

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this issue May 14, 2019

Modify mean chunk functions to return dicts rather than arrays (dask#4513)