dask.array.bincount() - make 'minlength' keyword argument optional #4684
This PR would make the `minlength` keyword argument to `dask.array.bincount()` optional.
If we can think of a nicer way to sum a bunch of jagged arrays together, I'm all ears. Right now I'm zero-padding things to the same length, but maybe there's a neater way?
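For illustration, the zero-padding approach mentioned here might look like the following sketch (a hypothetical helper, not the PR's actual code), which right-pads each array with zeros to a common length before summing:

```python
import numpy as np

def pad_and_sum(arrays):
    """Right-pad each 1-D array with zeros to a common length, then sum them."""
    n = max(len(a) for a in arrays)
    padded = [np.pad(a, (0, n - len(a)), mode="constant") for a in arrays]
    return np.sum(padded, axis=0)

# Example with two jagged arrays of lengths 2 and 3:
pad_and_sum([np.array([1, 2]), np.array([0, 1, 3])])  # -> array([1, 3, 3])
```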
Right, so the second graph is nicer than the previous PR because we can compute
I think that we can probably simplify this further by moving some of this logic into a custom `bincount_sum` Python function that does all of the zero-padding and such in Python, rather than as separate tasks in Dask. We tend to move things out of Dask and into plain Python when they are complex and reasonably fast anyway.
Maybe something like this ...
```python
def bincount_sum(bincounts):
    n = max(map(len, bincounts))
    out = np.zeros(n, dtype=int)
    for b in bincounts:
        # bincount results are indexed from bin 0, so a shorter
        # result is missing the high bins and is padded on the right
        out[:len(b)] += b
    return out

dsk['bincount-total-' + token] = (
    bincount_sum,
    [('bincount-' + token, i) for i in range(...)],
)
```
I think that this will remove tasks from the graph and run at the same speed anyway, even if it's not in parallel.
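To make the suggestion concrete, here is a self-contained sketch of such a helper, checked against NumPy: summing per-chunk `np.bincount` results should reproduce `np.bincount` of the whole array. (Note the alignment assumption: bincount results share bin 0, so shorter arrays are padded on the right.)

```python
import numpy as np

def bincount_sum(bincounts):
    """Sum per-chunk bincount results of varying lengths into one array."""
    n = max(map(len, bincounts))
    out = np.zeros(n, dtype=int)
    for b in bincounts:
        # shorter results are missing the high bins, so add at the front
        out[:len(b)] += b
    return out

x = np.array([0, 1, 1, 1, 3, 7, 5])
chunks = [x[:4], x[4:]]  # simulate two Dask chunks
combined = bincount_sum([np.bincount(c) for c in chunks])
assert (combined == np.bincount(x)).all()
```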
Here's an implementation of your suggestion.
It looks like NumPy needs to be bumped up to version 1.13.0 for the Python 3.5 Travis job, so I'm trying that now.
It's nice; the only thing is that I think I have to make the output array shape equal to
```python
import dask.array as da

x = da.from_array([0, 1, 1, 1, 3, 7, 5], chunks=2)
result = da.bincount(x)
result.compute()
```
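For reference, the Dask result on that data should match the equivalent eager NumPy call:

```python
import numpy as np

# Same data as the Dask example above; each value's count becomes one bin
np.bincount(np.array([0, 1, 1, 1, 3, 7, 5]))  # -> array([1, 3, 0, 1, 0, 1, 0, 1])
```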
mrocklin left a comment
A few minor suggestions on code style below. In general this looks great to me though.
I'm curious, what was breaking before? We like to be intentional about when we stop supporting older versions of those libraries.
Just an update: I wouldn't worry about this point, because we recently decided to bump the minimum required NumPy version to 1.13.0 anyway ( #4720 ). In general we should be a bit more cautious about bumping requirements to fix test failures, though we don't need to worry about it in this case.
Of course, thanks for that!
I don't have any other questions for this PR. You and @mrocklin have been very helpful; I feel like I understand the internals of Dask a lot better now.