
Bincount fix slicing #7391

Merged: 10 commits merged into dask:main on Mar 18, 2021
Conversation

@GenevieveBuckley (Contributor) commented Mar 15, 2021

We've found problems trying to slice the output of dask.array.bincount() following #7183

  • Passes black dask / flake8 dask

  • Tests added / passed

    • Added an assertion to the test in dask/array/tests/test_routines.py to check that the output of da.bincount can be sliced
    • Passes previously failing tests in skimage: pytest skimage/filters/tests/test_thresholding.py::test_thresholds_dask_compatibility -vv
    • Passes previously failing cupy tests
    • Passes previously failing tests in dask-ml

Relevant issues:

Summary:
When dask.array.bincount was modified in #7183, we lost the ability to slice the output returned by this function. The output's shape and chunks attributes were empty tuples instead of the expected values.

We want to keep the benefits of #7183, since it manages memory much better when given large datasets.

So, to restore the ability to slice the output of bincount, I've added some of the extra logic normally handled by dask.array.reductions.reduction() to keep track of the array shape and chunk sizes.
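
A minimal sketch of the behaviour this PR restores (the array values here are chosen for illustration; they are not taken from the PR):

import dask.array as da

x = da.from_array([0, 1, 1, 2, 2, 2], chunks=3)
counts = da.bincount(x, minlength=3)

# With shape/chunks metadata tracked again, the output can be sliced
# just as it could before #7183:
print(counts.shape)          # (3,)
print(counts[1:].compute())  # [2 3]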

@GenevieveBuckley (Contributor, Author) commented

The CI is currently failing because, in my attempt to fix the failing cupy test, I've used NumPy's like= array-creation kwarg, which was only introduced in numpy version 1.20 (whereas the dask minimum dependency is currently numpy 1.15.1). Discussion here: #7324 (comment)

@pentschev (Member) left a comment

Thanks @GenevieveBuckley, I can confirm the test passes on my end as well. The fact that we now need NumPy>=1.20 is technically a regression, but probably a small one; I'm fine with telling users they need to upgrade. But could you add the line below to test_bincount to ensure it doesn't fail on older NumPy versions?

@pytest.mark.skipif(not _numpy_120, reason="NEP-35 is not available")
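
For context, a sketch of how that marker would sit on the test (assuming the _numpy_120 flag from dask/array/numpy_compat.py; the real test body lives in the dask test suite):

import pytest
from dask.array.numpy_compat import _numpy_120

@pytest.mark.skipif(not _numpy_120, reason="NEP-35 is not available")
def test_bincount():
    ...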

@rabernat (Contributor) commented

Thanks @GenevieveBuckley for finding this! In xhistogram (xgcm/xhistogram#27) we are experiencing problems with bincount with dask 2021.03 too.

Specifically:

import dask.array as dsa
ones = dsa.ones(shape=(100,), chunks=(110), dtype='int')
print(dsa.bincount(ones, minlength=2))
# -> dask.array<_bincount_agg-aggregate, shape=(), dtype=int64, chunksize=(), chunktype=numpy.ndarray>

The output of bincount no longer has a shape, so it can't be reshaped (which is what we need to do).
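
For instance, a sketch of the downstream failure (not taken from xhistogram itself):

counts = dsa.bincount(ones, minlength=2)  # shape () under dask 2021.03.0
counts.reshape((2, 1))                    # fails: no shape information to reshape against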

It looks like this PR will fix that. So 🙌 for your work!

@GenevieveBuckley (Contributor, Author) commented

It's nice to hear that, @rabernat. This has been an unexpectedly high-impact change; I only knew about the problems in scikit-image when I started looking into it.

@GenevieveBuckley (Contributor, Author) commented

The cupy bincount test passes locally now (I have cupy version 9.0.0b3 and numpy version 1.20.1):

pytest dask/array/tests/test_cupy.py::test_bincount -v

@GenevieveBuckley (Contributor, Author) commented

I think this is ready to merge @dask/maintenance

@pentschev (Member) left a comment

@GenevieveBuckley thanks for the latest changes. I added another suggestion to make this a bit safer; I'm not sure that case can actually occur in Dask, but I guess there's no harm in guarding against it.

@jrbourbeau (Member) left a comment

Thanks for your work on this @GenevieveBuckley!

Since axis is always (0,), we can streamline some of the logic around constructing chunks tuples (see the diff below). My hope is that this will make the logic here easier to reason about in the future.

Diff:
diff --git a/dask/array/routines.py b/dask/array/routines.py
index 8d53541e..69f56b63 100644
--- a/dask/array/routines.py
+++ b/dask/array/routines.py
@@ -645,7 +645,6 @@ def bincount(x, weights=None, minlength=0, split_every=None):
         if weights.chunks != x.chunks:
             raise ValueError("Chunks of input array x and weights must match.")

-    axis = (0,)
     token = tokenize(x, weights, minlength)
     args = [x, "i"]
     if weights is not None:
@@ -655,32 +654,27 @@ def bincount(x, weights=None, minlength=0, split_every=None):
         meta = array_safe(np.bincount([]), x._meta)

     if minlength == 0:
-        output_size = np.nan
+        output_size = (np.nan,)
     else:
-        output_size = minlength
+        output_size = (minlength,)

     chunked_counts = blockwise(
         partial(np.bincount, minlength=minlength), "i", *args, token=token, meta=meta
     )
-    chunked_counts._chunks = tuple(
-        (output_size,) * len(c) if i in axis else c
-        for i, c in enumerate(chunked_counts.chunks)
-    )
+    chunked_counts._chunks = (output_size * len(chunked_counts.chunks[0]), *chunked_counts.chunks[1:])

     from .reductions import _tree_reduce

     output = _tree_reduce(
         chunked_counts,
         aggregate=partial(_bincount_agg, dtype=meta.dtype),
-        axis=axis,
+        axis=(0,),
         keepdims=True,
         dtype=meta.dtype,
         split_every=split_every,
         concatenate=False,
     )
-    output._chunks = tuple(
-        (output_size,) if i in axis else c for i, c in enumerate(chunked_counts.chunks)
-    )
+    output._chunks = (output_size, *chunked_counts.chunks[1:])
     output._meta = meta
     return output
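
As a side note, a small sketch of the tuple arithmetic the diff relies on (values chosen purely for illustration):

import numpy as np

output_size = (np.nan,)   # minlength == 0: output size unknown until compute
input_chunks = ((3, 3),)  # e.g. two chunks of length 3 along axis 0

# Before the tree reduction: one output chunk per input chunk along axis 0
print((output_size * len(input_chunks[0]), *input_chunks[1:]))  # ((nan, nan),)

# After the tree reduction collapses axis 0 to a single chunk
print((output_size, *input_chunks[1:]))                         # ((nan,),)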

@GenevieveBuckley (Contributor, Author) commented

Thank you for the feedback, @pentschev and @jrbourbeau. Those are very helpful suggestions, and I've updated this PR to include them.

@pentschev (Member) left a comment

LGTM, thanks @GenevieveBuckley ! 🙂

@jrbourbeau (Member) left a comment

Thanks for all your work on this @GenevieveBuckley, and for reviewing @pentschev! This is in.

@jrbourbeau merged commit bd37996 into dask:main on Mar 18, 2021
@jrbourbeau (Member) commented

> The fact that we now need NumPy>=1.20

@pentschev could you clarify when NumPy>=1.20 is needed? I noticed we're only skipping test_bincount in the CuPy tests

@GenevieveBuckley (Contributor, Author) commented

> > The fact that we now need NumPy>=1.20
>
> @pentschev could you clarify when NumPy>=1.20 is needed? I noticed we're only skipping test_bincount in the CuPy tests

@jrbourbeau I hope this helps explain it:

If we use array_safe in a function and the cupy test uses assert_eq, then numpy>=1.20 is required for cupy users. Regular users of Dask are ok with any numpy version.

  • If numpy>=1.20, then the like= keyword argument is used in the array creation, so everything works properly for everyone (cupy and numpy Dask users alike). However, this keyword argument is only available from numpy 1.20 onwards.
  • If numpy<1.20, then array_safe ensures that dask doesn't raise an error, by falling back to array creation without the like= keyword argument. That means we get a numpy array by default instead of a cupy one. That's no good for cupy users, but it works just fine for people using Dask with numpy array chunks. That's why we need to skip the cupy test if the numpy version is too low (see the sketch below).
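
For illustration, a sketch of the like= dispatch described above (the cupy reference array is hypothetical):

import numpy as np

data = [0, 1, 1, 2]

# NumPy >= 1.20: the like= argument dispatches creation to the reference
# array's library, so a cupy reference would yield a cupy array:
#     np.array(data, like=some_cupy_array)   # hypothetical cupy reference
# NumPy < 1.20: array_safe falls back to plain creation, which always
# returns a numpy array, so cupy comparisons in the test would fail:
counts = np.bincount(np.array(data))
print(type(counts))  # <class 'numpy.ndarray'>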

@jrbourbeau (Member) commented

Great, thanks for that very clear explanation @GenevieveBuckley

benjaminhwilliams added a commit to DiamondLightSource/python-tristan that referenced this pull request Mar 31, 2021
* Add VDS creation command-line tool
* Add Dask image binner command-line tool, with options for single-image, multi-image sweep and multi-image pump-probe binning.
* Add command line tool to inspect cue messages
* Avoid Dask v2021.03.0 due to a regression in that release: the dask.array.bincount function does not permit slicing of its output in that version (see dask/dask#7391).
* Add more cue message info and tidy up the functions for discovering the timestamps of chosen cues