Fixed mean() and moment() on sparse arrays #4525

Merged: 28 commits merged into dask:master from pentschev:fix-sparse-mean-moment on Apr 12, 2019 · 6 participants
@pentschev (Member) commented Feb 25, 2019

To create sparse auxiliary arrays properly, a dispatcher for ones_like has been
added. However, sparse arrays created from ones_like cause sparse reductions to
fail, since reducing them would produce a dense array. To fix that, such arrays
must first be converted to dense, which requires a new dispatcher for that
purpose; for arrays that are already dense, the conversion must be bypassed,
which is implemented here as numpy.ndarray.view(), simply returning a shallow
copy of the array.

This commit fixes #4523 and adds the tests suggested in that issue.

@mrocklin

  • Tests added / passed
  • Passes flake8 dask

@pentschev (Member Author) commented Feb 25, 2019

@mrocklin please let me know what you think of this. I'm not particularly happy with the need for the todense() dispatcher I added, but it doesn't seem like a terrible solution either.
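
(For context: a minimal sketch of the lookup mechanism behind dispatchers like ones_like_lookup and the todense() one mentioned here, using dask's Dispatch. The registration shown is illustrative and is not necessarily identical to the PR's actual code.)

import numpy as np
from dask.utils import Dispatch

ones_like_lookup = Dispatch('ones_like')

@ones_like_lookup.register(np.ndarray)
def _ones_like_ndarray(x, **kwargs):
    # Dense NumPy arrays fall through to NumPy itself.
    return np.ones_like(x, **kwargs)

# Calling the lookup dispatches on the type of the first argument, so
# backends such as sparse or CuPy can register their own implementations.
ones = ones_like_lookup(np.arange(4))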

@mrocklin (Member) commented Feb 25, 2019

I think that we shouldn't ever have to create the original ones array. Instead we should figure out what the output looks like and create that.

In [2]: from dask.array.reductions import numel

In [3]: import numpy as np

In [4]: x = np.ones((2, 3, 4))

In [5]: numel(x, axis=(2,), keepdims=False)
Out[5]:
array([[4., 4., 4.],
       [4., 4., 4.]])

In [6]: numel(x, axis=(2,), keepdims=False).shape
Out[6]: (2, 3)

In [7]: np.ones(shape=x.shape[:2]) * x.shape[2]
Out[7]:
array([[4., 4., 4.],
       [4., 4., 4.]])

This gets a little bit tricky for complex values of axis= and keepdims=, but it avoids having to create a large dense array on the CPU. We only have to create the much smaller reduced array.
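
(For illustration: a minimal sketch of that idea, computing numel's result directly from shape, axis, and keepdims without materializing a full-size ones array. This is an assumed form, not the implementation that was eventually merged.)

import numpy as np
from functools import reduce
from operator import mul

def numel_sketch(x, axis=None, keepdims=False, dtype='f8'):
    # The element count per output position is just the product of the
    # sizes along the reduced axes, broadcast to the reduced shape.
    if axis is None:
        axis = tuple(range(x.ndim))
    elif not isinstance(axis, tuple):
        axis = (axis,)
    axis = tuple(ax % x.ndim for ax in axis)  # normalize negative axes
    n = reduce(mul, (x.shape[ax] for ax in axis), 1)
    if keepdims:
        shape = tuple(1 if ax in axis else s for ax, s in enumerate(x.shape))
    else:
        shape = tuple(s for ax, s in enumerate(x.shape) if ax not in axis)
    return np.full(shape, n, dtype=dtype)

# Matches the session above: numel_sketch(np.ones((2, 3, 4)), axis=(2,))
# returns a (2, 3) array filled with 4.0.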

@mrocklin (Member) commented Feb 25, 2019

And actually, going even further, we may not need to track numel at all. I think that we may be able to just track a single integer throughout all of the cases where numel is used. If so, this would be preferable.

So in general I think that instead of improving numel we should take a hard look at how it is used today, and see if we can come up with a better algorithm (if you're up for that).

@pentschev (Member Author) commented Feb 25, 2019

I agree that not creating the original ones array would be a better solution. However, unless I'm missing something, we still can't create a ones() array of arbitrary shape here, because the backend of x is unknown (think of CuPy, for instance). Calculating the final array is tricky but doable, yet we still depend on ones_like() for that case until we have a solution for creating arrays with the appropriate backend.

@mrocklin (Member) commented Feb 25, 2019

It may be that we never need to create the array of ones at all. It may be that we're passing way more information than we need to. I think that we only need the single value stored in the array, along with the shape of the array (which happens to be the same as the shape of the other arrays currently passed in the same dictionary).

@pentschev referenced this pull request on Mar 6, 2019: [WIP]: Dask Array _meta attribute #4543 (open)
@pentschev (Member Author) commented Mar 10, 2019

@mrocklin I changed the numel() function so that it no longer creates an array of ones, but instead computes the counts from the input parameters. The exceptions are masked arrays and nannumel(), which can't really be computed without going through the x array anyway.

This still doesn't solve returning only the value and shape, but it's a start.

Also, I'm sure the numel() implementation is not in its best "Pythonic" form; feel free to criticize it.

@pentschev (Member Author) commented Mar 10, 2019

And I almost forgot: the todense() has to remain in moment_chunk() for now, since sparse doesn't implement sum().

@mrocklin (Member) commented Mar 10, 2019

This looks great to me. I'm glad to see this come together.

There appears to be a failure in the Python 2.7 build; it looks like behavior might have changed in one of the relevant libraries.

@mrocklin (Member) commented Mar 11, 2019

This looks great to me. Thanks @pentschev ! Merging.

@mrocklin (Member) commented Mar 11, 2019

Actually, can we safely remove the ones_like_lookup now?

@pentschev (Member Author) commented Mar 11, 2019

Actually, can we safely remove the ones_like_lookup now?

Yes, I thought I would leave it since it could be useful, but creating a ones() sparse matrix probably isn't that useful after all; I'll remove it.

@mrocklin (Member) commented Mar 11, 2019

So, I think that we can avoid implementing the todense dispatch (I'd really like to avoid having to implement these if we can) if sparse supports sum on arrays with fill values. I've reported this upstream at pydata/sparse#237
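
(A sketch of the pattern in question, assuming a sparse version with fill_value support: subtracting a scalar such as the mean from a COO array yields a nonzero fill value, and sum() on such arrays is exactly what pydata/sparse#237 asks for.)

import numpy as np
import sparse

x = sparse.COO.from_numpy(np.eye(3))
y = x - 0.5        # elementwise shift: the result has fill_value == -0.5
total = y.sum()    # failed before fill-value sums were supported; -1.5 after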

@mrocklin (Member) commented Mar 11, 2019

I hope you don't mind, but I pushed a commit to your branch that removes the todense_lookup. I think that long term it will be better to fix this upstream in sparse. My guess is that we can get other folks to handle that though.

@pentschev (Member Author) commented Mar 11, 2019

So, I think that we can avoid implementing the todense dispatch (I'd really like to avoid having to implement these if we can) if sparse supports sum on arrays with fill values. I've reported this upstream at pydata/sparse#237

That's also what I had in mind.

I hope you don't mind, but I pushed a commit to your branch that removes the todense_lookup. I think that long term it will be better to fix this upstream in sparse. My guess is that we can get other folks to handle that though.

No worries, I see you marked it xfail for now. Hopefully when sum() is supported, this will automatically work.

@hameerabbasi (Contributor) commented Mar 11, 2019

You can remove the xfail: as soon as pydata/sparse#238 is merged, all will be fine. Tests are green, just waiting for CI to pass.

@pentschev (Member Author) commented Mar 11, 2019

@hameerabbasi that was fast, thanks!

I didn't push yet, but now it fails when calling np.stack() on a sparse array, which means there's no __array_function__. Should we also add it to sparse?
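
(For reference: a toy illustration of the __array_function__ protocol that np.stack relies on here. This is not sparse's implementation, just the shape of the hook; NumPy 1.16 also required enabling the protocol via the NUMPY_EXPERIMENTAL_ARRAY_FUNCTION environment variable, as the "Enable __array_function__ in Python 3.7 build" commit below reflects.)

import numpy as np

class WrappedArray:
    # Hypothetical stand-in for a sparse array type.
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation
        # whenever a WrappedArray appears among the arguments.
        if func is np.stack:
            arrays = [a.data if isinstance(a, WrappedArray) else a
                      for a in args[0]]
            return WrappedArray(np.stack(arrays, **kwargs))
        return NotImplemented

stacked = np.stack([WrappedArray([1, 2]), WrappedArray([3, 4])])
# stacked is a WrappedArray wrapping a (2, 2) ndarray.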

@hameerabbasi (Contributor) commented Mar 11, 2019

That will have to wait for a day or so. Of course, you're free to make a PR and I'll review it.

@pentschev (Member Author) commented Mar 11, 2019

I can send a PR; the question was more whether that's something we should do, but I got my answer now. :)

@@ -410,8 +409,7 @@ def moment_chunk(A, order=2, sum=chunk.sum, numel=numel, dtype='f8', **kwargs):
     n = numel(A, **kwargs).astype(np.int64)
     u = total / n
-    diff = A - u
-    todense = todense_lookup.dispatch(type(diff))
-    xs = [sum(todense(A - u)**i, dtype=dtype, **kwargs) for i in range(2, order + 1)]
+    xs = [((A - u)**i).sum(dtype=dtype, **kwargs) for i in range(2, order + 1)]

@pentschev (Member Author) commented on the diff, Mar 12, 2019

@mrocklin this is the line that causes the nanvar test to fail. Please note that sum() here is not Python's nor NumPy's sum(), but one of moment_chunk()'s arguments. I'm now wondering three things:

  1. Should we rename the sum argument to something like sum_func to avoid confusion? (A sketch of the shadowing follows this list.)
  2. Is it possible there's a bug in the sum() function in dask/array/reductions.py? I don't think they should give us different results anyway.
  3. Did we accidentally change behavior in some reduction functions when sum() was replaced by _concatenate2().sum()?
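
(To illustrate the shadowing in question, a simplified sketch of moment_chunk's signature; dask's real function takes more arguments.)

import numpy as np

def moment_chunk(A, order=2, sum=np.sum, dtype='f8', **kwargs):
    # Inside this body, ``sum`` names the keyword argument (dask passes
    # chunk.sum by default), shadowing the builtin sum() as well as np.sum.
    total = sum(A, dtype=dtype, **kwargs)
    return total / A.size

# Renaming the parameter to something like ``sum_func`` would make call
# sites unambiguous and restore access to the builtin inside the body.
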
@pentschev (Member Author) commented Mar 12, 2019

As @hameerabbasi pointed out, we can now remove the xfail, but for that we'll need to use upstream sparse or wait for a new release. How should we handle that?

@pentschev (Member Author) commented Apr 4, 2019

Everything passes again.

@pentschev (Member Author) commented Apr 4, 2019

Earlier today, @jakirkham and I discussed this PR offline, and I now understand his intent regarding upstream libraries. I therefore did what we agreed to try, which is to reenable the development build that uses upstream libraries.

@jakirkham (Member) commented Apr 4, 2019

Thanks @pentschev! LGTM

@jcrist and @martindurant, thoughts?

@jakirkham (Member) commented Apr 8, 2019

I'm seeing a few errors coming from sparse/Numba in the dev tests that look like this:

tp = <class 'numba.errors.TypingError'>
value = TypingError('Failed in nopython mode pipeline (step: nopython frontend)\nFailed in nopython mode pipeline (step: nopyt...rt the error message\nand traceback, along with a minimal reproducer at:\nhttps://github.com/numba/numba/issues/new\n')
tb = None
    def reraise(tp, value, tb=None):
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
>           raise value.with_traceback(tb)
E           numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
E           Failed in nopython mode pipeline (step: nopython frontend)
E           Invalid use of Function(<function searchsorted at 0x7f37444bb840>) with argument(s) of type(s): (array(int64, 1d, C), int64, side=Literal[str](left))
E            * parameterized
E           In definition 0:
E               TypingError: Failed in nopython mode pipeline (step: nopython frontend)
E           Invalid use of Function(<function _searchsorted.<locals>.searchsorted_inner at 0x7f37048f27b8>) with argument(s) of type(s): (array(int64, 1d, C), int64)
E            * parameterized
E           In definition 0:
E               TypingError: Failed in nopython mode pipeline (step: nopython frontend)
E           Invalid use of Function(<ufunc 'isnan'>) with argument(s) of type(s): (int64)
E            * parameterized
E           In definition 0:
E               TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typing/npydecl.py:114
E           In definition 1:
E               TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typing/npydecl.py:114
E           This error is usually caused by passing an argument of a type that is unsupported by the named function.
E           [1] During: resolving callee type: Function(<ufunc 'isnan'>)
E           [2] During: typing of call at /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py (2884)
E           
E           
E           File "../../../miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2884:
E               def searchsorted_inner(a, v):
E                   <source elided>
E                   n = len(a)
E                   if np.isnan(v):
E                   ^
E           
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typeinfer.py:861
E           In definition 1:
E               TypingError: Failed in nopython mode pipeline (step: nopython frontend)
E           Invalid use of Function(<ufunc 'isnan'>) with argument(s) of type(s): (int64)
E            * parameterized
E           In definition 0:
E               TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typing/npydecl.py:114
E           In definition 1:
E               TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typing/npydecl.py:114
E           This error is usually caused by passing an argument of a type that is unsupported by the named function.
E           [1] During: resolving callee type: Function(<ufunc 'isnan'>)
E           [2] During: typing of call at /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py (2884)
E           
E           
E           File "../../../miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2884:
E               def searchsorted_inner(a, v):
E                   <source elided>
E                   n = len(a)
E                   if np.isnan(v):
E                   ^
E           
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typeinfer.py:861
E           This error is usually caused by passing an argument of a type that is unsupported by the named function.
E           [1] During: resolving callee type: Function(<function _searchsorted.<locals>.searchsorted_inner at 0x7f37048f27b8>)
E           [2] During: typing of call at /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py (2939)
E           
E           
E           File "../../../miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2939:
E                   def searchsorted_impl(a, v, side='left'):
E                       return loop_impl(a, v)
E                       ^
E           
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/typeinfer.py:861
E           In definition 1:
E               ValueError: Invalid value given for 'side': unicode_type
E               raised from /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/numba/targets/arraymath.py:2917
E           This error is usually caused by passing an argument of a type that is unsupported by the named function.
E           [1] During: resolving callee type: Function(<function searchsorted at 0x7f37444bb840>)
E           [2] During: typing of call at /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/sparse/coo/indexing.py (451)
E           
E           
E           File "../../../miniconda/envs/test-environment/lib/python3.7/site-packages/sparse/coo/indexing.py", line 451:
E           def _get_mask_pairs(starts_old, stops_old, c, idx):  # pragma: no cover
E               <source elided>
E                   for p_match in range(idx[0], idx[1], idx[2]):
E                       start = np.searchsorted(c[starts_old[j]:stops_old[j]], p_match, side='left') + starts_old[j]
E                       ^
E           
E           [1] During: resolving callee type: type(CPUDispatcher(<function _get_mask_pairs at 0x7f37263d8268>))
E           [2] During: typing of call at /home/travis/miniconda/envs/test-environment/lib/python3.7/site-packages/sparse/coo/indexing.py (387)
E           
E           
E           File "../../../miniconda/envs/test-environment/lib/python3.7/site-packages/sparse/coo/indexing.py", line 387:
E           def _compute_mask(coords, indices):  # pragma: no cover
E               <source elided>
E                   # Which would come out of indexing a single integer.
E                   starts, stops, n_matches = _get_mask_pairs(starts, stops, coords[i], indices[i])
E                   ^
E           
E           This is not usually a problem with Numba itself but instead often caused by
E           the use of unsupported features or an issue in resolving types.
E           
E           To see Python/NumPy features supported by the latest release of Numba visit:
E           http://numba.pydata.org/numba-doc/dev/reference/pysupported.html
E           and
E           http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
E           
E           For more information about typing errors and how to debug them visit:
E           http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile
E           
E           If you think your code should work with Numba, please report the error message
E           and traceback, along with a minimal reproducer at:
E           https://github.com/numba/numba/issues/new
../../../miniconda/envs/test-environment/lib/python3.7/site-packages/numba/six.py:658: TypingError

ref: https://travis-ci.org/dask/dask/jobs/515911552

@jcrist (Member) commented Apr 8, 2019

The development builds often have periodic failures due to upstream issues that have nothing to do with us. As such, I'd prefer to either:

  • Mark the development build as an allowed failure. This will still run the tests (so you can see if things broke when working on things that may affect upstream packages), but won't result in a red x on the PR.
  • Make the development build an optional build, switched on some string in the pull request. This allows that build to be optionally turned on when changing functionality that may affect these features. We do this for the hdfs tests here:
    if: type != pull_request OR commit_message =~ test-hdfs # Skip on PRS unless the commit message contains "test-hdfs"

    This may be less suited here, as the dev versions of dependencies may affect much of dask, but it's nice for features that rarely need to be tested.
@hameerabbasi (Contributor) commented Apr 8, 2019

I've reported the issue upstream in Numba (numba/numba#3953).

@pentschev (Member Author) commented Apr 8, 2019

I agree that we should find a middle ground, like @jcrist suggested.

I would really like to get this PR merged soon; we're starting to stack up different problems here, which prevents fixes from getting merged (like the FFT MKL workaround, which is by the way already stacked in this PR), and that isn't nice. So my proposal is the following:

  • Revert my last commit and let Dask build against the sparse release for Python 3.6 and sparse upstream for Python 3.7, to ensure we're properly testing this PR;
  • Open another PR with the changes from my last commit so we can discuss and find a good solution.

After both items above are done, we can get this PR merged before it becomes a big snowball attempting to fix several partially unrelated problems at once.

Do you agree with my proposal, @jakirkham @mrocklin @jcrist @hameerabbasi?

@jakirkham (Member) commented Apr 8, 2019

Thanks @jcrist. This seems like a reasonable request to me. @pentschev, any thoughts on this? (Sorry Peter, GitHub didn't show your latest comment.)

Thanks for following up on that, @hameerabbasi. I see that issue is closed. What is the takeaway?

@hameerabbasi (Contributor) commented Apr 8, 2019

It was a temporary break on Numba's end.

@jakirkham (Member) commented Apr 8, 2019

Is there a known working version of Numba? Should we pin it to something?

@hameerabbasi (Contributor) commented Apr 8, 2019

It was only in master AFAICT.

@jakirkham (Member) commented Apr 8, 2019

Sorry I think I'm missing something. How would that cause the CI failure we saw? Are we getting a dev version of Numba somewhere that I missed?

@pentschev (Member Author) commented Apr 8, 2019

@jakirkham I agree with @jcrist's suggestions for that particular test build. In general I tend not to like this sort of test, because it's likely to get ignored or forgotten even when necessary. However, I'd still like to test against sparse upstream for every PR for a while, to ensure this keeps working well.

Nevertheless, @jakirkham, what do you think of the proposal from my last comment? I think we can and should think through the way we build, but I would prefer a separate PR for that, as it's likely to be an issue with diverging opinions that would hold off merging this PR.

@hameerabbasi (Contributor) commented Apr 8, 2019

I just checked: apparently UPSTREAM_DEV doesn't update Numba, and I did nothing in sparse to cause this. 🤔

@pentschev (Member Author) commented Apr 8, 2019

UPSTREAM_DEV is probably updating Numba as a dependency of some other package. Passing builds are using Numba 0.41.0, whereas the failing UPSTREAM_DEV build is using Numba 0.43.1.

@hameerabbasi (Contributor) commented Apr 8, 2019

Anyway, I isolated the issue (as it was essentially a compilation issue, which is easy to isolate) and made sure it's not there in Numba master.

@pentschev (Member Author) commented Apr 8, 2019

Anyway, I isolated the issue (as it was essentially a compilation issue, which is easy to isolate) and made sure it's not there in Numba master.

Thanks so much for doing that @hameerabbasi.

One alternative we have now is to try using Numba master for UPSTREAM_DEV as well, but I think that may lead us into digging up further problems.

@pentschev (Member Author) commented Apr 12, 2019

Since there were no objections to my proposal, I reverted the last commit so that this PR can focus on the fixes we aim to have, while the discussion on how to test against upstream libraries moves to #4696.

Please let me know if there are any other suggestions for the fixes here; otherwise, I think this was already good for merging a couple of weeks ago.

@jakirkham merged commit c9285c5 into dask:master on Apr 12, 2019

4 checks passed:

  • codecov/patch: 100% of diff hit (target 91.2%)
  • codecov/project: 91.24% (+0.03%) compared to 0780ca5
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
@jakirkham (Member) commented Apr 12, 2019

Thanks @pentschev!

@jakirkham referenced this pull request on Apr 12, 2019: Reenable development build, uses upstream libraries #4696 (merged)

@pentschev deleted the pentschev:fix-sparse-mean-moment branch on Apr 17, 2019

asmith26 added a commit to asmith26/dask that referenced this pull request on Apr 22, 2019: Fixed mean() and moment() on sparse arrays (dask#4525). The squashed commit message lists the PR's individual commits (the first commit's description repeats the PR description above):

* Fixed mean() and moment() on sparse arrays

* Reductions' numel() won't create ones() for unmasked arrays

* Add tests for new numel() implementation

* Fix numel() test, previously failing in Python 2.7.

* Remove ones_like_lookup() for sparse matrices

* remove todense_lookup

* Call correct sum() function in moment_chunk()

* Remove xfail from mean() sparse test

* Add sparse std() test back

* Test sparse moment()

* Test sparse var()

* Build also against sparse upstream

* Fix condition for CI upstream sparse installation

* Attempt to fix upstream sparse installation once more

* Enable __array_function__ in Python 3.7 build

* Remove leftover export from run_tests.sh

* Workaround for mkl.fft failures in test_array_function.py

* Minor reductions readability/code consistency changes

* Increase coverage of numel()

* Remove unnecessary for loop in numel() test

* Reenable development build, uses upstream libraries

* Revert "Reenable development build, uses upstream libraries"

This reverts commit 1705689.

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request on May 14, 2019 (same commit message as above).