
Bincount fix slicing #7391

Merged: 10 commits merged into dask:main on Mar 18, 2021
Conversation

@GenevieveBuckley (Contributor) commented Mar 15, 2021

We've found problems trying to slice the output of dask.array.bincount() following #7183

  • Passes black dask / flake8 dask

  • Tests added / passed

    • Added an assertion to the test in dask/array/tests/test_routines.py to check that the output of da.bincount can be sliced
    • Passes previously failing tests in skimage: pytest skimage/filters/tests/test_thresholding.py::test_thresholds_dask_compatibility -vv
    • Passes previously failing cupy tests
    • Passes previously failing tests in dask-ml

Relevant issues:

Summary:
When dask.array.bincount was modified in #7183, we lost the ability to slice the output returned by this function. The output's shape and chunks attributes were empty tuples instead of the expected values.

We want to keep the benefits of #7183, since it manages memory much better when given large datasets.

So, to restore the ability to slice the output of bincount, I've added some of the extra logic normally handled by dask.array.reductions.reduction() to keep track of the array shape and chunk sizes.
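
A minimal sketch of the behaviour this PR restores (the array values here are chosen for illustration; they are not taken from the PR):

import dask.array as da

x = da.from_array([0, 1, 1, 2, 2, 2], chunks=3)
counts = da.bincount(x, minlength=3)

# With shape/chunks metadata tracked again, the output can be sliced
# just as it could before #7183:
print(counts.shape)          # (3,)
print(counts[1:].compute())  # [2 3]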

@GenevieveBuckley (Contributor, Author) commented

The CI is currently failing because, in my attempt to fix the failing cupy test, I've used NumPy's like= array-creation kwarg, which was only introduced in numpy version 1.20 (whereas the dask minimum dependency is currently numpy 1.15.1). Discussion here: #7324 (comment)

@pentschev (Member) left a comment

Thanks @GenevieveBuckley, I can confirm the test passes on my end as well. The fact that we now need NumPy>=1.20 is technically a regression, but probably a small one; I'm fine with telling users they need to upgrade. But could you add the line below to test_bincount to ensure it doesn't fail on older NumPy versions?

@pytest.mark.skipif(not _numpy_120, reason="NEP-35 is not available")
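
For context, a sketch of how that marker would sit on the test (assuming the _numpy_120 flag from dask/array/numpy_compat.py; the real test body lives in the dask test suite):

import pytest
from dask.array.numpy_compat import _numpy_120

@pytest.mark.skipif(not _numpy_120, reason="NEP-35 is not available")
def test_bincount():
    ...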

@rabernat (Contributor) commented

Thanks @GenevieveBuckley for finding this! In xhistogram (xgcm/xhistogram#27) we are experiencing problems with bincount with dask 2021.03 too.

Specifically:

import dask.array as dsa
ones = dsa.ones(shape=(100,), chunks=(110), dtype='int')
print(dsa.bincount(ones, minlength=2))
# -> dask.array<_bincount_agg-aggregate, shape=(), dtype=int64, chunksize=(), chunktype=numpy.ndarray>

The output of bincount no longer has a shape, so it can't be reshaped (which is what we need to do).
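
For instance, a sketch of the downstream failure (not taken from xhistogram itself):

counts = dsa.bincount(ones, minlength=2)  # shape () under dask 2021.03.0
counts.reshape((2, 1))                    # fails: no shape information to reshape against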

It looks like this PR will fix that. So 🙌 for your work!

@GenevieveBuckley (Contributor, Author) commented

It's nice to hear that, @rabernat. This has been an unexpectedly high-impact change; I only knew about the problems in scikit-image when I started looking into it.

@GenevieveBuckley (Contributor, Author) commented

The cupy bincount test passes locally now (I have cupy version 9.0.0b3 and numpy version 1.20.1):

pytest dask/array/tests/test_cupy.py::test_bincount -v

@GenevieveBuckley (Contributor, Author) commented

I think this is ready to merge @dask/maintenance

@pentschev (Member) left a comment

@GenevieveBuckley thanks for the latest changes. I added another suggestion to make this a bit safer; I'm not sure that case can actually occur in Dask, but I guess there's no harm in guarding against it.

@jrbourbeau (Member) left a comment

Thanks for your work on this @GenevieveBuckley!

Since axis is always (0,), we can streamline some of the logic around constructing chunks tuples (see the diff below). My hope is that this will make the logic here easier to reason about in the future.

Diff:
diff --git a/dask/array/routines.py b/dask/array/routines.py
index 8d53541e..69f56b63 100644
--- a/dask/array/routines.py
+++ b/dask/array/routines.py
@@ -645,7 +645,6 @@ def bincount(x, weights=None, minlength=0, split_every=None):
         if weights.chunks != x.chunks:
             raise ValueError("Chunks of input array x and weights must match.")

-    axis = (0,)
     token = tokenize(x, weights, minlength)
     args = [x, "i"]
     if weights is not None:
@@ -655,32 +654,27 @@ def bincount(x, weights=None, minlength=0, split_every=None):
         meta = array_safe(np.bincount([]), x._meta)

     if minlength == 0:
-        output_size = np.nan
+        output_size = (np.nan,)
     else:
-        output_size = minlength
+        output_size = (minlength,)

     chunked_counts = blockwise(
         partial(np.bincount, minlength=minlength), "i", *args, token=token, meta=meta
     )
-    chunked_counts._chunks = tuple(
-        (output_size,) * len(c) if i in axis else c
-        for i, c in enumerate(chunked_counts.chunks)
-    )
+    chunked_counts._chunks = (output_size * len(chunked_counts.chunks[0]), *chunked_counts.chunks[1:])

     from .reductions import _tree_reduce

     output = _tree_reduce(
         chunked_counts,
         aggregate=partial(_bincount_agg, dtype=meta.dtype),
-        axis=axis,
+        axis=(0,),
         keepdims=True,
         dtype=meta.dtype,
         split_every=split_every,
         concatenate=False,
     )
-    output._chunks = tuple(
-        (output_size,) if i in axis else c for i, c in enumerate(chunked_counts.chunks)
-    )
+    output._chunks = (output_size, *chunked_counts.chunks[1:])
     output._meta = meta
     return output
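
As a side note, a small sketch of the tuple arithmetic the diff relies on (values chosen purely for illustration):

import numpy as np

output_size = (np.nan,)   # minlength == 0: output size unknown until compute
input_chunks = ((3, 3),)  # e.g. two chunks of length 3 along axis 0

# Before the tree reduction: one output chunk per input chunk along axis 0
print((output_size * len(input_chunks[0]), *input_chunks[1:]))  # ((nan, nan),)

# After the tree reduction collapses axis 0 to a single chunk
print((output_size, *input_chunks[1:]))                         # ((nan,),)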

@GenevieveBuckley (Contributor, Author) commented

Thank you for the feedback, @pentschev and @jrbourbeau. Those are very helpful suggestions, and I've updated this PR to include them.

@pentschev (Member) left a comment

LGTM, thanks @GenevieveBuckley ! 🙂

@jrbourbeau (Member) left a comment

Thanks for all your work on this @GenevieveBuckley, and for reviewing @pentschev! This is in.

@jrbourbeau merged commit bd37996 into dask:main on Mar 18, 2021
@jrbourbeau (Member) commented

> The fact that we now need NumPy>=1.20

@pentschev could you clarify when NumPy>=1.20 is needed? I noticed we're only skipping test_bincount in the CuPy tests

@GenevieveBuckley (Contributor, Author) commented

> > The fact that we now need NumPy>=1.20
>
> @pentschev could you clarify when NumPy>=1.20 is needed? I noticed we're only skipping test_bincount in the CuPy tests

@jrbourbeau I hope this helps explain it:

If we use array_safe in a function and the cupy test uses assert_eq, then numpy>=1.20 is required for cupy users. Regular users of Dask are ok with any numpy version.

  • If numpy>=1.20, then the like= keyword argument is used in the array creation, so everything works properly for everyone (cupy and numpy Dask users alike). However, this keyword argument is only available from numpy 1.20 onwards.
  • If numpy<1.20, then array_safe ensures that dask doesn't raise an error, by falling back to array creation without the like= keyword argument. That means we get a numpy array by default instead of a cupy one. That's no good for cupy users, but it works just fine for people using Dask with numpy array chunks. That's why we need to skip the cupy test if the numpy version is too low (see the sketch below).
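
For illustration, a sketch of the like= dispatch described above (the cupy reference array is hypothetical):

import numpy as np

data = [0, 1, 1, 2]

# NumPy >= 1.20: the like= argument dispatches creation to the reference
# array's library, so a cupy reference would yield a cupy array:
#     np.array(data, like=some_cupy_array)   # hypothetical cupy reference
# NumPy < 1.20: array_safe falls back to plain creation, which always
# returns a numpy array, so cupy comparisons in the test would fail:
counts = np.bincount(np.array(data))
print(type(counts))  # <class 'numpy.ndarray'>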

@jrbourbeau (Member) commented

Great, thanks for that very clear explanation @GenevieveBuckley

benjaminhwilliams added a commit to DiamondLightSource/python-tristan that referenced this pull request Mar 31, 2021
* Add VDS creation command-line tool
* Add Dask image binner command-line tool, with options for single-image, multi-image sweep and multi-image pump-probe binning.
* Add command line tool to inspect cue messages
* Avoid Dask v2021.03.0 due to a regression in that release: the dask.array.bincount function does not permit slicing of its output in that version (see dask/dask#7391).
* Add more cue message info and tidy up the functions for discovering the timestamps of chosen cues