Fixed mean() and moment() on sparse arrays #4525

Merged: 28 commits merged into dask:master from pentschev:fix-sparse-mean-moment on Apr 12, 2019 · 6 participants
@pentschev (Member) commented Feb 25, 2019

To create sparse auxiliary arrays properly, a dispatcher for ones_like has been
added. However, sparse arrays created from ones_like cause sparse reductions to
fail, since reducing them would produce a dense array. To fix that, such arrays
must first be converted to dense, which requires a new dispatcher for that
purpose; for arrays that are already dense, the conversion must be bypassed,
which is implemented here as numpy.ndarray.view(), simply returning a shallow
copy of the array.

This commit fixes #4523 and adds the tests suggested in that issue.

@mrocklin

  • Tests added / passed
  • Passes flake8 dask

@pentschev (Member Author) commented Feb 25, 2019

@mrocklin please let me know what you think of this. I'm not particularly happy with the need for the todense() dispatcher I added, but it doesn't seem like a terrible solution either.
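
(For context: a minimal sketch of the lookup mechanism behind dispatchers like ones_like_lookup and the todense() one mentioned here, using dask's Dispatch. The registration shown is illustrative and is not necessarily identical to the PR's actual code.)

import numpy as np
from dask.utils import Dispatch

ones_like_lookup = Dispatch('ones_like')

@ones_like_lookup.register(np.ndarray)
def _ones_like_ndarray(x, **kwargs):
    # Dense NumPy arrays fall through to NumPy itself.
    return np.ones_like(x, **kwargs)

# Calling the lookup dispatches on the type of the first argument, so
# backends such as sparse or CuPy can register their own implementations.
ones = ones_like_lookup(np.arange(4))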

@mrocklin (Member) commented Feb 25, 2019

I think that we shouldn't ever have to create the original ones array. Instead we should figure out what the output looks like and create that.

In [2]: from dask.array.reductions import numel

In [3]: import numpy as np

In [4]: x = np.ones((2, 3, 4))

In [5]: numel(x, axis=(2,), keepdims=False)
Out[5]:
array([[4., 4., 4.],
       [4., 4., 4.]])

In [6]: numel(x, axis=(2,), keepdims=False).shape
Out[6]: (2, 3)

In [7]: np.ones(shape=x.shape[:2]) * x.shape[2]
Out[7]:
array([[4., 4., 4.],
       [4., 4., 4.]])

This gets a little bit tricky for complex values of axis= and keepdims=, but it avoids having to create a large dense array on the CPU. We only have to create the much smaller reduced array.
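
(For illustration: a minimal sketch of that idea, computing numel's result directly from shape, axis, and keepdims without materializing a full-size ones array. This is an assumed form, not the implementation that was eventually merged.)

import numpy as np
from functools import reduce
from operator import mul

def numel_sketch(x, axis=None, keepdims=False, dtype='f8'):
    # The element count per output position is just the product of the
    # sizes along the reduced axes, broadcast to the reduced shape.
    if axis is None:
        axis = tuple(range(x.ndim))
    elif not isinstance(axis, tuple):
        axis = (axis,)
    axis = tuple(ax % x.ndim for ax in axis)  # normalize negative axes
    n = reduce(mul, (x.shape[ax] for ax in axis), 1)
    if keepdims:
        shape = tuple(1 if ax in axis else s for ax, s in enumerate(x.shape))
    else:
        shape = tuple(s for ax, s in enumerate(x.shape) if ax not in axis)
    return np.full(shape, n, dtype=dtype)

# Matches the session above: numel_sketch(np.ones((2, 3, 4)), axis=(2,))
# returns a (2, 3) array filled with 4.0.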

@mrocklin (Member) commented Feb 25, 2019

And actually, going even further, we may not need to track numel at all. I think that we may be able to just track a single integer throughout all of the cases where numel is used. If so, this would be preferable.

So in general I think that instead of improving numel we should take a hard look at how it is used today, and see if we can come up with a better algorithm (if you're up for that).

@pentschev (Member Author) commented Feb 25, 2019

I agree that not creating the original ones array would be a better solution. However, unless I'm missing something, we still can't create a ones() array of arbitrary shape here, because the backend of x is unknown (think of CuPy, for instance). Calculating the final array is tricky but doable, yet we still depend on ones_like() for that case until we have a solution for creating arrays with the appropriate backend.

@mrocklin (Member) commented Feb 25, 2019

It may be that we never need to create the array of ones at all. It may be that we're passing way more information than we need to. I think that we only need the single value stored in the array, along with the shape of the array (which happens to be the same as the shape of the other arrays currently passed in the same dictionary).

@pentschev referenced this pull request on Mar 6, 2019: [WIP]: Dask Array _meta attribute #4543 (open)
@pentschev (Member Author) commented Mar 10, 2019

@mrocklin I changed the numel() function so that it no longer creates an array of ones, but instead computes the counts from the input parameters. The exceptions are masked arrays and nannumel(), which can't really be computed without going through the x array anyway.

This still doesn't solve returning only the value and shape, but it's a start.

Also, I'm sure the numel() implementation is not in its best "Pythonic" form; feel free to criticize it.

@pentschev (Member Author) commented Mar 10, 2019

And I almost forgot: the todense() has to remain in moment_chunk() for now, since sparse doesn't implement sum().

@mrocklin (Member) commented Mar 10, 2019

This looks great to me. I'm glad to see this come together.

There appears to be a failure in the Python 2.7 build; it looks like behavior might have changed in one of the relevant libraries.

@mrocklin (Member) commented Mar 11, 2019

This looks great to me. Thanks @pentschev ! Merging.

@mrocklin (Member) commented Mar 11, 2019

Actually, can we safely remove the ones_like_lookup now?

@pentschev (Member Author) commented Mar 11, 2019

Actually, can we safely remove the ones_like_lookup now?

Yes, I thought I would leave it since it could be useful, but creating a ones() sparse matrix probably isn't that useful after all; I'll remove it.

@mrocklin (Member) commented Mar 11, 2019

So, I think that we can avoid implementing the todense dispatch (I'd really like to avoid having to implement these if we can) if sparse supports sum on arrays with fill values. I've reported this upstream at pydata/sparse#237
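
(A sketch of the pattern in question, assuming a sparse version with fill_value support: subtracting a scalar such as the mean from a COO array yields a nonzero fill value, and sum() on such arrays is exactly what pydata/sparse#237 asks for.)

import numpy as np
import sparse

x = sparse.COO.from_numpy(np.eye(3))
y = x - 0.5        # elementwise shift: the result has fill_value == -0.5
total = y.sum()    # failed before fill-value sums were supported; -1.5 after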

@mrocklin (Member) commented Mar 11, 2019

I hope you don't mind, but I pushed a commit to your branch that removes the todense_lookup. I think that long term it will be better to fix this upstream in sparse. My guess is that we can get other folks to handle that though.

@pentschev (Member Author) commented Mar 11, 2019

So, I think that we can avoid implementing the todense dispatch (I'd really like to avoid having to implement these if we can) if sparse supports sum on arrays with fill values. I've reported this upstream at pydata/sparse#237

That's also what I had in mind.

I hope you don't mind, but I pushed a commit to your branch that removes the todense_lookup. I think that long term it will be better to fix this upstream in sparse. My guess is that we can get other folks to handle that though.

No worries, I see you marked it xfail for now. Hopefully when sum() is supported, this will automatically work.

@hameerabbasi (Contributor) commented Mar 11, 2019

You can remove the xfail: as soon as pydata/sparse#238 is merged, all will be fine. Tests are green, just waiting for CI to pass.

@pentschev (Member Author) commented Mar 11, 2019

@hameerabbasi that was fast, thanks!

I didn't push yet, but now it fails when calling np.stack() on a sparse array, which means there's no __array_function__. Should we also add it to sparse?
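
(For reference: a toy illustration of the __array_function__ protocol that np.stack relies on here. This is not sparse's implementation, just the shape of the hook; NumPy 1.16 also required enabling the protocol via the NUMPY_EXPERIMENTAL_ARRAY_FUNCTION environment variable, as the "Enable __array_function__ in Python 3.7 build" commit below reflects.)

import numpy as np

class WrappedArray:
    # Hypothetical stand-in for a sparse array type.
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation
        # whenever a WrappedArray appears among the arguments.
        if func is np.stack:
            arrays = [a.data if isinstance(a, WrappedArray) else a
                      for a in args[0]]
            return WrappedArray(np.stack(arrays, **kwargs))
        return NotImplemented

stacked = np.stack([WrappedArray([1, 2]), WrappedArray([3, 4])])
# stacked is a WrappedArray wrapping a (2, 2) ndarray.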

@hameerabbasi (Contributor) commented Mar 11, 2019

That will have to wait for a day or so. Of course, you're free to make a PR and I'll review it.

@pentschev (Member Author) commented Mar 11, 2019

I can send a PR; the question was more whether that's something we should do, but I got my answer now. :)

@@ -410,8 +409,7 @@ def moment_chunk(A, order=2, sum=chunk.sum, numel=numel, dtype='f8', **kwargs):
     n = numel(A, **kwargs).astype(np.int64)
     u = total / n
-    diff = A - u
-    todense = todense_lookup.dispatch(type(diff))
-    xs = [sum(todense(A - u)**i, dtype=dtype, **kwargs) for i in range(2, order + 1)]
+    xs = [((A - u)**i).sum(dtype=dtype, **kwargs) for i in range(2, order + 1)]

@pentschev (Member Author) commented on the diff, Mar 12, 2019

@mrocklin this is the line that causes the nanvar test to fail. Please note that sum() here is not Python's nor NumPy's sum(), but one of moment_chunk()'s arguments. I'm now wondering three things:

  1. Should we rename the sum argument to something like sum_func to avoid confusion? (A sketch of the shadowing follows this list.)
  2. Is it possible there's a bug in the sum() function in dask/array/reductions.py? I don't think they should give us different results anyway.
  3. Did we accidentally change behavior in some reduction functions when sum() was replaced by _concatenate2().sum()?
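
(To illustrate the shadowing in question, a simplified sketch of moment_chunk's signature; dask's real function takes more arguments.)

import numpy as np

def moment_chunk(A, order=2, sum=np.sum, dtype='f8', **kwargs):
    # Inside this body, ``sum`` names the keyword argument (dask passes
    # chunk.sum by default), shadowing the builtin sum() as well as np.sum.
    total = sum(A, dtype=dtype, **kwargs)
    return total / A.size

# Renaming the parameter to something like ``sum_func`` would make call
# sites unambiguous and restore access to the builtin inside the body.
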
@pentschev (Member Author) commented Mar 12, 2019

As @hameerabbasi pointed out, we can now remove the xfail, but for that we'll need to use upstream sparse or wait for a new release. How should we handle that?

@pentschev (Member Author) commented Apr 4, 2019

Everything passes again.

@pentschev (Member Author) commented Apr 4, 2019

Earlier today, @jakirkham and I discussed this PR offline, and I now understand his intent regarding upstream libraries. I therefore did what we agreed to try, which is to reenable the development build that uses upstream libraries.

@jakirkham (Member) commented Apr 4, 2019

Thanks @pentschev! LGTM

@jcrist and @martindurant, thoughts?

@jakirkham (Member) commented Apr 8, 2019

I'm seeing a few errors coming from sparse/Numba in the dev tests that look like this:

tp = <class 'numba.errors.TypingError'>
value = TypingError('Failed in nopython mode pipeline (step: nopython frontend)\nFailed in nopython mode pipeline (step: nopyt...rt the error message\nand traceback, along with a minimal reproducer at:\nhttps://github.com/numba/numba/issues/new\n')
tb = None
    def reraise(tp, value, tb=None):
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
>           raise value.with_traceback(tb)
E           numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
E           Failed in nopython mode pipeline (step: nopython frontend)
E           Invalid use of Function(<function searchsorted at 0x7f37444bb840>) with argument(s) of type(s): (array(int64, 1d, C), int64, side=Literal[str](left))
E            * parameterized
E           In definition 0:
E               TypingError: Failed in nopython mode pipeline (step: nopython frontend)
E           Invalid use of Function(<function _searchsorted.<locals>.searchsorted_inner at 0x7f37048f27b8>) with argument(s) of type(s): (array(int64, 1d, C), int64)
E            * parameterized
E           In definition 0:
E               TypingError: Failed in nopython mode pipeline (step: nopython frontend)
E           Invalid use of Function(<ufunc 'isnan'>) with argument(s) of type(s): (int64)
E            * parameterized
E           In definition 0:
E               TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typing/npydecl.py:114
E           In definition 1:
E               TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typing/npydecl.py:114
E           This error is usually caused by passing an argument of a type that is unsupported by the named function.
E           [1] During: resolving callee type: Function(<ufunc 'isnan'>)
E           [2] During: typing of call at /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py (2884)
E           
E           
E           File "../../../miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2884:
E               def searchsorted_inner(a, v):
E                   <source elided>
E                   n = len(a)
E                   if np.isnan(v):
E                   ^
E           
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typeinfer.py:861
E           In definition 1:
E               TypingError: Failed in nopython mode pipeline (step: nopython frontend)
E           Invalid use of Function(<ufunc 'isnan'>) with argument(s) of type(s): (int64)
E            * parameterized
E           In definition 0:
E               TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typing/npydecl.py:114
E           In definition 1:
E               TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typing/npydecl.py:114
E           This error is usually caused by passing an argument of a type that is unsupported by the named function.
E           [1] During: resolving callee type: Function(<ufunc 'isnan'>)
E           [2] During: typing of call at /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py (2884)
E           
E           
E           File "../../../miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2884:
E               def searchsorted_inner(a, v):
E                   <source elided>
E                   n = len(a)
E                   if np.isnan(v):
E                   ^
E           
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typeinfer.py:861
E           This error is usually caused by passing an argument of a type that is unsupported by the named function.
E           [1] During: resolving callee type: Function(<function _searchsorted.<locals>.searchsorted_inner at 0x7f37048f27b8>)
E           [2] During: typing of call at /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py (2939)
E           
E           
E           File "../../../miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2939:
E                   def searchsorted_impl(a, v, side='left'):
E                       return loop_impl(a, v)
E                       ^
E           
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typeinfer.py:861
E           In definition 1:
E               ValueError: Invalid value given for 'side': unicode_type
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py:2917
E           This error is usually caused by passing an argument of a type that is unsupported by the named function.
E           [1] During: resolving callee type: Function(<function searchsorted at 0x7f37444bb840>)
E           [2] During: typing of call at /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/sparse/coo/indexing.py (451)
E           
E           
E           File "../../../miniconda/envs/test-environment/lib/python3.7/site-packages/sparse/coo/indexing.py", line 451:
E           def _get_mask_pairs(starts_old, stops_old, c, idx):  # pragma: no cover
E               <source elided>
E                   for p_match in range(idx[0], idx[1], idx[2]):
E                       start = np.searchsorted(c[starts_old[j]:stops_old[j]], p_match, side='left') + starts_old[j]
E                       ^
E           
E           [1] During: resolving callee type: type(CPUDispatcher(<function _get_mask_pairs at 0x7f37263d8268>))
E           [2] During: typing of call at /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/sparse/coo/indexing.py (387)
E           
E           
E           File "../../../miniconda/envs/test-environment/lib/python3.7/site-packages/sparse/coo/indexing.py", line 387:
E           def _compute_mask(coords, indices):  # pragma: no cover
E               <source elided>
E                   # Which would come out of indexing a single integer.
E                   starts, stops, n_matches = _get_mask_pairs(starts, stops, coords[i], indices[i])
E                   ^
E           
E           This is not usually a problem with Numba itself but instead often caused by
E           the use of unsupported features or an issue in resolving types.
E           
E           To see Python/NumPy features supported by the latest release of Numba visit:
E           http://numba.pydata.org/numba-doc/dev/reference/pysupported.html
E           and
E           http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
E           
E           For more information about typing errors and how to debug them visit:
E           http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile
E           
E           If you think your code should work with Numba, please report the error message
E           and traceback, along with a minimal reproducer at:
E           https://github.com/numba/numba/issues/new
../../../miniconda/envs/test-environment/lib/python3.7/site-packages/numba/six.py:658: TypingError

ref: https://travis-ci.org/dask/dask/jobs/515911552

@jcrist (Member) commented Apr 8, 2019

The development builds often have periodic failures due to upstream issues that have nothing to do with us. As such, I'd prefer to either:

  • Mark the development build as an allowed failure. This will still run the tests (so you can see if things broke when working on things that may affect upstream packages), but won't result in a red x on the PR.
  • Make the development build an optional build, switched on some string in the pull request. This allows that build to be optionally turned on when changing functionality that may affect these features. We do this for the hdfs tests here:
    if: type != pull_request OR commit_message =~ test-hdfs # Skip on PRS unless the commit message contains "test-hdfs"

    This may be less suited here, as the dev versions of dependencies may affect much of dask, but it's nice for features that rarely need to be tested.
@hameerabbasi (Contributor) commented Apr 8, 2019

I've reported the issue upstream in Numba (numba/numba#3953).

@pentschev (Member Author) commented Apr 8, 2019

I agree that we should find a middle ground, like @jcrist suggested.

I would really like to get this PR merged soon; we're starting to stack up different problems here, which prevents fixes from getting merged (like the FFT MKL workaround, which is by the way already stacked in this PR), and that isn't nice. So my proposal is the following:

  • Revert my last commit and let Dask build against the sparse release for Python 3.6 and sparse upstream for Python 3.7, to ensure we're properly testing this PR;
  • Open another PR with the changes from my last commit so we can discuss and find a good solution.

After both items above are done, we can get this PR merged before it becomes a big snowball attempting to fix several partially unrelated problems at once.

Do you agree with my proposal, @jakirkham @mrocklin @jcrist @hameerabbasi?

@jakirkham (Member) commented Apr 8, 2019

Thanks @jcrist. This seems like a reasonable request to me. @pentschev, any thoughts on this? (Sorry Peter, GitHub didn't show your latest comment.)

Thanks for following up on that, @hameerabbasi. I see that issue is closed. What is the takeaway?

@hameerabbasi (Contributor) commented Apr 8, 2019

It was a temporary break on Numba's end.

@jakirkham (Member) commented Apr 8, 2019

Is there a known working version of Numba? Should we pin it to something?

@hameerabbasi (Contributor) commented Apr 8, 2019

It was only in master AFAICT.

@jakirkham (Member) commented Apr 8, 2019

Sorry I think I'm missing something. How would that cause the CI failure we saw? Are we getting a dev version of Numba somewhere that I missed?

@pentschev (Member Author) commented Apr 8, 2019

@jakirkham I agree with @jcrist's suggestions for that particular test build. In general I tend not to like this sort of test, because it's likely to get ignored or forgotten even when necessary. However, I'd still like to test against sparse upstream for every PR for a while, to ensure this keeps working well.

Nevertheless, @jakirkham, what do you think of the proposal from my last comment? I think we can and should think through the way we build, but I would prefer a separate PR for that, as it's likely to be an issue with diverging opinions that would hold off merging this PR.

@hameerabbasi (Contributor) commented Apr 8, 2019

I just checked: apparently UPSTREAM_DEV doesn't update Numba, and I did nothing in sparse to cause this. 🤔

@pentschev (Member Author) commented Apr 8, 2019

UPSTREAM_DEV is probably updating Numba as a dependency of some other package. Passing builds are using Numba 0.41.0, whereas the failing UPSTREAM_DEV build is using Numba 0.43.1.

@hameerabbasi (Contributor) commented Apr 8, 2019

Anyway, I isolated the issue (as it was essentially a compilation issue, which is easy to isolate) and made sure it's not there in Numba master.

@pentschev (Member Author) commented Apr 8, 2019

Anyway, I isolated the issue (as it was essentially a compilation issue, which is easy to isolate) and made sure it's not there in Numba master.

Thanks so much for doing that @hameerabbasi.

One alternative we have now is to try using Numba master for UPSTREAM_DEV as well, but I think that may lead us into digging up further problems.

@pentschev (Member Author) commented Apr 12, 2019

Since there were no objections to my proposal, I reverted the last commit so that this PR can focus on the fixes we aim to have, while the discussion on how to test against upstream libraries moves to #4696.

Please let me know if there are any other suggestions for the fixes here; otherwise, I think this was already good for merging a couple of weeks ago.

@jakirkham merged commit c9285c5 into dask:master on Apr 12, 2019

4 checks passed:

  • codecov/patch: 100% of diff hit (target 91.2%)
  • codecov/project: 91.24% (+0.03%) compared to 0780ca5
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
@jakirkham (Member) commented Apr 12, 2019

Thanks @pentschev!

@jakirkham referenced this pull request on Apr 12, 2019: Reenable development build, uses upstream libraries #4696 (merged)

@pentschev deleted the pentschev:fix-sparse-mean-moment branch on Apr 17, 2019

asmith26 added a commit to asmith26/dask that referenced this pull request on Apr 22, 2019: Fixed mean() and moment() on sparse arrays (dask#4525). The squashed commit message lists the PR's individual commits (the first commit's description repeats the PR description above):

* Fixed mean() and moment() on sparse arrays

* Reductions' numel() won't create ones() for unmasked arrays

* Add tests for new numel() implementation

* Fix numel() test, previously failing in Python 2.7.

* Remove ones_like_lookup() for sparse matrices

* remove todense_lookup

* Call correct sum() function in moment_chunk()

* Remove xfail from mean() sparse test

* Add sparse std() test back

* Test sparse moment()

* Test sparse var()

* Build also against sparse upstream

* Fix condition for CI upstream sparse installation

* Attempt to fix upstream sparse installation once more

* Enable __array_function__ in Python 3.7 build

* Remove leftover export from run_tests.sh

* Workaround for mkl.fft failures in test_array_function.py

* Minor reductions readability/code consistency changes

* Increase coverage of numel()

* Remove unnecessary for loop in numel() test

* Reenable development build, uses upstream libraries

* Revert "Reenable development build, uses upstream libraries"

This reverts commit 1705689.

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request on May 14, 2019 (same commit message as above).