
Some functions fail on Dask arrays from sparse #4523

Closed
pentschev opened this issue Feb 22, 2019 · 12 comments

@pentschev (Member) commented Feb 22, 2019

A short reproduction sample and the traceback follow.

import numpy as np
import sparse
import dask.array as da
from numpy.testing import assert_array_equal

x = da.random.random((2, 3, 4), chunks=(1, 2, 2))
x[x < 0.8] = 0

y = x.map_blocks(sparse.COO.from_numpy)

xx = x.mean()
yy = y.mean()

assert_array_equal(xx, yy)
Traceback (most recent call last):
  File "dask_sparse.py", line 14, in <module>
    assert_array_equal(xx, yy)
  File "/home/pentschev/.local/lib/python3.5/site-packages/numpy/testing/_private/utils.py", line 894, in assert_array_equal
    verbose=verbose, header='Arrays are not equal')
  File "/home/pentschev/.local/lib/python3.5/site-packages/numpy/testing/_private/utils.py", line 697, in assert_array_compare
    y = array(y, copy=False, subok=True)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/array/core.py", line 998, in __array__
    x = self.compute()
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/base.py", line 398, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/local.py", line 475, in get_async
    finish(dsk, state, not succeeded)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/callbacks.py", line 99, in local_callbacks
    yield callbacks or ()
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/local.py", line 460, in get_async
    raise_exception(exc, tb)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/compatibility.py", line 112, in reraise
    raise exc 
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/local.py", line 230, in execute_task
    result = _execute_task(task, data)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/optimization.py", line 942, in __call__
    dict(zip(self.inkeys, args)))
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/compatibility.py", line 93, in apply
    return func(*args, **kwargs)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/array/reductions.py", line 332, in mean_chunk
    n = numel(x, dtype=dtype, **kwargs)
  File "/home/pentschev/.local/lib/python3.5/site-packages/dask/array/reductions.py", line 323, in numel
    return chunk.sum(np.ones_like(x), **kwargs)
  File "/home/pentschev/.local/lib/python3.5/site-packages/numpy/core/overrides.py", line 151, in public_api
    implementation, public_api, relevant_args, args, kwargs)
  File "/home/pentschev/.local/lib/python3.5/site-packages/numpy/core/numeric.py", line 289, in ones_like
    res = empty_like(a, dtype=dtype, order=order, subok=subok)
  File "/home/pentschev/.local/lib/python3.5/site-packages/numpy/core/overrides.py", line 151, in public_api
    implementation, public_api, relevant_args, args, kwargs)
  File "/home/pentschev/.local/lib/python3.5/site-packages/sparse/sparse_array.py", line 213, in __array__
    raise RuntimeError('Cannot convert a sparse array to dense automatically. '
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.

Another function that fails with the same error, besides mean(), is std(). After fixing the error, we have to make sure to add both functions to the sparse tests, inside the functions list.
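The root cause is that dask's numel() helper calls np.ones_like() on each chunk, which a sparse COO refuses to densify. For a plain (unmasked) array the element count depends only on the shape, so it can be computed without allocating anything — this is the idea the eventual fix took ("numel() won't create ones() for unmasked arrays"). A minimal sketch, with a hypothetical function name and no handling of negative axes:

```python
import numpy as np

def numel_no_ones(x, dtype=float, axis=None, keepdims=False):
    """Element count for an unmasked array, computed from shape alone.

    Avoids np.ones_like(), so it never triggers densification of sparse chunks.
    (Illustrative sketch; negative axis values are not handled.)
    """
    if axis is None:
        n = int(np.prod(x.shape))
        return np.full((1,) * x.ndim, n, dtype=dtype) if keepdims else np.array(n, dtype=dtype)
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    n = int(np.prod([x.shape[a] for a in axes]))
    if keepdims:
        shape = tuple(1 if i in axes else s for i, s in enumerate(x.shape))
    else:
        shape = tuple(s for i, s in enumerate(x.shape) if i not in axes)
    return np.full(shape, n, dtype=dtype)
```

For masked arrays a shape-only count is wrong (masked elements must be excluded), which is why the real implementation keeps a ones-based path for that case.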

pentschev added a commit to pentschev/dask that referenced this issue Feb 25, 2019

Fixed mean() and moment() on sparse arrays
For proper creation of sparse auxiliary arrays, a dispatcher for ones_like has been added. However, sparse arrays created from ones_like cause sparse reductions to fail, as the reduction would produce a dense array. To fix that, the arrays first have to be converted to dense, which requires a new dispatcher for that purpose; for already dense arrays the conversion must be bypassed, implemented here as numpy.ndarray.view(), which simply returns a shallow copy of the array.

This commit fixes dask#4523 and adds the tests
suggested in that issue.
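The dispatcher pattern the commit message describes can be sketched with functools.singledispatch (dask's actual implementation uses its own Dispatch class; the name ones_like here is illustrative, not dask's API):

```python
from functools import singledispatch

import numpy as np

@singledispatch
def ones_like(x, **kwargs):
    # default backend: plain NumPy
    return np.ones_like(x, **kwargs)

# A sparse backend would register its own implementation, e.g.:
# @ones_like.register(sparse.COO)
# def _(x, **kwargs):
#     ...  # build an array of ones appropriate for sparse workflows
```

Registering per-type implementations this way lets reduction helpers stay generic while each array backend decides how auxiliary arrays get built.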


pentschev referenced this issue Feb 25, 2019 (merged): Fixed mean() and moment() on sparse arrays #4525
@Hoeze commented Mar 25, 2019

Also not working: y.compute()

@pentschev (Member, Author) commented Mar 25, 2019

@Hoeze that works for me, could you post a traceback?

@Hoeze commented Mar 25, 2019

@pentschev
Dask v1.1.4:

x = sparse.COO.from_numpy(np.random.randint(1, size=[12, 30]))
y = da.from_array(x, chunks=x.shape)
y.compute()
Traceback (most recent call last):
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-45-ef0b14f31685>", line 1, in <module>
    y.compute()
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/base.py", line 143, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/base.py", line 392, in compute
    results = get(dsk, keys, **kwargs)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/compatibility.py", line 67, in reraise
    raise exc
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/array/core.py", line 76, in getter
    c = np.asarray(c)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/numpy/core/numeric.py", line 501, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sparse/sparse_array.py", line 212, in __array__
    raise RuntimeError('Cannot convert a sparse array to dense automatically. '
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.
@Hoeze commented Mar 25, 2019

The sad thing here is that I have no idea how to execute todense() on that:

y.map_blocks(lambda b: b.todense(), dtype=y.dtype).compute()
Traceback (most recent call last):
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-14-b21bc5f418cb>", line 1, in <module>
    y.map_blocks(lambda b: b.todense(), dtype=x.dtype).compute()
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/base.py", line 398, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/local.py", line 462, in get_async
    raise_exception(exc, tb)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/compatibility.py", line 112, in reraise
    raise exc
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/local.py", line 230, in execute_task
    result = _execute_task(task, data)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/dask/array/core.py", line 82, in getter
    c = np.asarray(c)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/numpy/core/numeric.py", line 501, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sparse/sparse_array.py", line 212, in __array__
    raise RuntimeError('Cannot convert a sparse array to dense automatically. '
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.
@Hoeze commented Mar 25, 2019

Maybe related?
Automatic densification seems like an important feature for using sparse with Dask, especially when mixing different types of arrays...

EDIT:
The problem seems to be that np.asarray() fails on sparse matrices.
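The mechanism can be reproduced without sparse at all: np.asarray() invokes the object's __array__ hook, and sparse's SparseArray implements that hook to raise rather than densify silently. A minimal stand-in (the class name here is hypothetical):

```python
import numpy as np

class NoAutoDensify:
    """Stand-in for sparse.SparseArray's guard: refuse implicit densification."""
    def __array__(self, *args, **kwargs):
        # Mirrors the error raised by pydata/sparse
        raise RuntimeError("Cannot convert a sparse array to dense automatically.")

try:
    np.asarray(NoAutoDensify())
except RuntimeError as e:
    print("np.asarray raised:", e)
```

Any dask code path that funnels a chunk through np.asarray() (such as the getter in dask/array/core.py shown in the traceback) therefore hits this error.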

@pentschev (Member, Author) commented Mar 25, 2019

You can try this:

x = np.random.randint(1, size=[12, 30])
y = da.from_array(x, chunks=x.shape)
z = y.map_blocks(sparse.COO.from_numpy)
z.compute()

or, alternatively without a numpy array:

x = da.random.randint(1, size=[12, 30])
y = x.map_blocks(sparse.COO.from_numpy)
y.compute()

Would that be an option for your use case?

@Hoeze commented Mar 25, 2019

Unfortunately, this does not work for me.
I need to load a huge sparse matrix from disk and combine it with another dense matrix.
Therefore, I planned something like this:

sparse_array = da.from_array(some_sparse_array)
dense_array = da.from_array(some_dense_array)
full_data = da.concatenate([sparse_array, dense_array])

My hope was that Dask would then hide all the “if data is sparse, then …” logic from me.

@Hoeze commented Mar 25, 2019

@pentschev
@hameerabbasi just stated that disallowing np.asarray was intentional.
This means that Dask needs to check at every point whether the current chunk is a sparse matrix (e.g. to calculate the concatenation of a sparse and a dense array). Is this doable?

@pentschev (Member, Author) commented Mar 25, 2019

My hope was that Dask would then hide all the “if data is sparse, then …” logic from me.

Yes, I see your point. Unfortunately, this won't be the case, due to sparse's design to prevent accidental densification of a sparse matrix.

I think the only alternative for your case is to densify your matrix manually. From your prior example, you would need to do something like the following:

full_data = da.concatenate([sparse_array.map_blocks(lambda b: b.todense(), dtype=sparse_array.dtype), dense_array])
@hameerabbasi (Contributor) commented Mar 25, 2019

Copying relevant comment of mine from elsewhere: pydata/sparse#10 (comment)

@Hoeze commented Mar 25, 2019

Setting SPARSE_AUTO_DENSIFY=1 works!
Of course, it's a quick and dirty way, but that's still the only option in my case, since I cannot convert the full dataset into a dense format.

Thanks again @hameerabbasi and @pentschev
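For reference, pydata/sparse reads this flag at import time, so it must be in the environment before sparse is first imported (or exported in the shell before starting Python):

```python
import os

# Must be set before `import sparse`; the flag is read when the module loads.
os.environ["SPARSE_AUTO_DENSIFY"] = "1"

# import sparse  # from here on, np.asarray(coo) densifies instead of raising
```

Equivalently, from the shell: SPARSE_AUTO_DENSIFY=1 python my_script.py. Note this disables the safety guard globally, so an oversized accidental densification will no longer be caught.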

@hameerabbasi (Contributor) commented Mar 25, 2019

@Hoeze I added std and var in sparse master (pydata/sparse#244). 🙂 You should be able to use those without any kind of densification now. mean was already there.

jakirkham added a commit that referenced this issue Apr 12, 2019

Fixed mean() and moment() on sparse arrays (#4525)

* Fixed mean() and moment() on sparse arrays

* Reductions' numel() won't create ones() for unmasked arrays

* Add tests for new numel() implementation

* Fix numel() test, previously failing in Python 2.7.

* Remove ones_like_lookup() for sparse matrices

* remove todense_lookup

* Call correct sum() function in moment_chunk()

* Remove xfail from mean() sparse test

* Add sparse std() test back

* Test sparse moment()

* Test sparse var()

* Build also against sparse upstream

* Fix condition for CI upstream sparse installation

* Attempt to fix upstream sparse installation once more

* Enable __array_function__ in Python 3.7 build

* Remove leftover export from run_tests.sh

* Workaround for mkl.fft failures in test_array_function.py

* Minor reductions readability/code consistency changes

* Increase coverage of numel()

* Remove unnecessary for loop in numel() test

* Reenable development build, uses upstream libraries

* Revert "Reenable development build, uses upstream libraries"

This reverts commit 1705689.

asmith26 added a commit to asmith26/dask that referenced this issue Apr 22, 2019

Fixed mean() and moment() on sparse arrays (dask#4525)

pentschev referenced this issue Apr 24, 2019 (open): NEP-18 Issue Tracking #4731

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this issue May 14, 2019

Fixed mean() and moment() on sparse arrays (dask#4525)