Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dask.array.bincount() - make 'minlength' keyword argument optional #4684

Merged
merged 18 commits into from Apr 24, 2019

Conversation

Projects
None yet
3 participants
@GenevieveBuckley
Copy link
Contributor

commented Apr 10, 2019

This PR would make the minlength keyword argument to dask.array.bincount() optional, instead of required.

It implements @mrocklin 's suggested approach:

As an alternative, what if we just called np.bincount on all of the chunks without specifying a minlength value and only afterwards, once we had all of them together, would we make a new array of the correct length and then add all of them into that array? This might require a bit more book keeping as we add things together (it's no longer a simple np.sum I think) but would probably improve our ability to keep this running smoothly in low memory.

If we can think of a nicer way to sum a bunch of jagged arrays together, I'm all ears. Right now I'm zero-padding things to the same length, but maybe there's a neater way?

  • Tests added / passed
  • Passes flake8 dask

Closes #4556

@GenevieveBuckley

This comment has been minimized.

Copy link
Contributor Author

commented Apr 10, 2019

import dask.array as da
x = da.from_array([0, 1, 1, 1, 3, 7, 5], chunks=2)

Here is the computation graph where minlength is specified:

image

And here is the computation graph if we don't specify minlength:

image

@mrocklin

This comment has been minimized.

Copy link
Member

commented Apr 10, 2019

Right, so the second graph is nicer than the previous PR because we can compute binlength and release the array pieces (which are probably large) quickly rather than wait on the length computation.

I think that we can probably simplify this further by moving some of this logic into a custom binlength_sum Python function, that does all of the zeropadding and such in Python, rather than as separate tasks in Dask. We tend to move things away from Dask and into Python when they are complex, and pretty fast anyway.

Maybe something like this ...

def bincount_sum(bincounts):
    n = max(map(len, bincounts))
    out = np.zeros(n, dtype=int)
    for b in bincounts:
        pad = n - len(b)
        out[pad:] += b
    return out

dsk['bincount-total-' + token] = (bincount_sum, [('bincount-' + token, i) for i in range(...)])

I think that this will remove tasks from the graph and run at the same speed anyway, even if it's not in parallel.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Apr 10, 2019

That code is untested and will probably need to be modified. Hopefully it gets a point across though.

@jakirkham jakirkham added the array label Apr 16, 2019

@GenevieveBuckley

This comment has been minimized.

Copy link
Contributor Author

commented Apr 17, 2019

Here's an implementation of your suggestion.

It looks like numpy needs to be bumped up to version 1.13.0 for the python 3.5 travis job, so I'm trying that now.

It's nice, the only thing is that I think I have to make the output array shape equal to (nan,) if minlength is not specified. There's no way to tell dask "I'm not exactly sure how big this will be, but I do know it will be relatively small and should all go in one chunk", is there?

import dask.array as da
x = da.from_array([0, 1, 1, 1, 3, 7, 5], chunks=2)
result = da.bincount(x)
result.compute()
array([1, 3, 0, 1, 0, 1, 0, 1])

dask_bincount

@jakirkham

This comment has been minimized.

Copy link
Member

commented Apr 17, 2019

Setting the chunk size to a single nan is the way to do that. 😉

@mrocklin
Copy link
Member

left a comment

A few minor suggestions on code style below. In general this looks great to me though.

It looks like numpy needs to be bumped up to version 1.13.0 for the python 3.5 travis job, so I'm trying that now.

I'm curious, what was breaking before? We like to be intentional of when we stop supporting older versions of those libraries.

Show resolved Hide resolved dask/array/routines.py Outdated
Show resolved Hide resolved dask/array/routines.py Outdated

# Sum up all of the intermediate bincounts per block
bincount_list = [i for i in list(dsk) if i[0] == 'bincount-' + token]
name = 'bincount-sum-' + token

This comment has been minimized.

Copy link
@mrocklin

mrocklin Apr 17, 2019

Member

Maybe we can move the definition of name up by tokenize again (maybe even removing the token intermediate variable)

This comment has been minimized.

Copy link
@GenevieveBuckley

GenevieveBuckley Apr 21, 2019

Author Contributor

Sure. I don't really understand what what token is doing here, tbh

This comment has been minimized.

Copy link
@GenevieveBuckley

GenevieveBuckley Apr 24, 2019

Author Contributor

token variable has been removed.

This comment has been minimized.

Copy link
@mrocklin

mrocklin Apr 24, 2019

Member

Ah, sorry, the tokenize process is still improtant. Every task is uniquely named in Dask, similar to git commits, to help with deduplication. I've reverted this and changed things a little bit in 41c6b00 and added tests to show why it's used.

This comment has been minimized.

Copy link
@GenevieveBuckley

GenevieveBuckley Apr 26, 2019

Author Contributor

Aaah, ok. I had assumed that was what it was for, but then got confused about what you were trying to say with this comment. Thanks for clarifying!

@jakirkham

This comment has been minimized.

Copy link
Member

commented Apr 20, 2019

It looks like numpy needs to be bumped up to version 1.13.0 for the python 3.5 travis job, so I'm trying that now.

I'm curious, what was breaking before? We like to be intentional of when we stop supporting older versions of those libraries.

Just an update, I wouldn't worry about this point because we recently decided to bump the minimum required NumPy version to 1.13.0 anyways. ( #4720 ) In general we should be a bit more cautious about bumping requirements to fix test failures. Though we don't need to worry about it in this case.

mrocklin and others added some commits Apr 21, 2019

Update dask/array/routines.py
Replace unnecessary variable bincount_list

Co-Authored-By: GenevieveBuckley <30920819+GenevieveBuckley@users.noreply.github.com>
Update dask/array/routines.py
Remove bincount_list variable

Co-Authored-By: GenevieveBuckley <30920819+GenevieveBuckley@users.noreply.github.com>
@jakirkham

This comment has been minimized.

Copy link
Member

commented Apr 22, 2019

Have taken the liberty of fixing the merge conflict above. Hope that is ok 🙂

Show resolved Hide resolved dask/array/routines.py Outdated
Show resolved Hide resolved dask/array/routines.py Outdated
Show resolved Hide resolved dask/array/routines.py Outdated
@jakirkham

This comment has been minimized.

Copy link
Member

commented Apr 22, 2019

Thanks for working on this @GenevieveBuckley. This looks like a nice addition. Had a couple of questions and suggestions above. Please ping me if you have questions or need another review 🙂

jakirkham and others added some commits Apr 24, 2019

Update dask/array/routines.py
Co-Authored-By: GenevieveBuckley <30920819+GenevieveBuckley@users.noreply.github.com>
@GenevieveBuckley

This comment has been minimized.

Copy link
Contributor Author

commented Apr 24, 2019

Have taken the liberty of fixing the merge conflict above. Hope that is ok 🙂

Of course, thanks for that!

I don't have any other questions for this PR. You and @mrocklin have been very helpful, I feel like I understand the internals of dask a lot better now. 😄

@mrocklin mrocklin merged commit 78feda7 into dask:master Apr 24, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@mrocklin

This comment has been minimized.

Copy link
Member

commented Apr 24, 2019

This is in. Thanks for your work here @GenevieveBuckley ! This is definitely an improvement over what was here before. It'll be nice not to have to ever explain this limitation again :)

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

dask.array.bincount() - make 'minlength' keyword argument optional (d…
…ask#4684)

* da.bincount() automatic calculation of 'minlength' kwarg

* Test for da.bincount unspecified minlength kwarg

* _bincount_sum() helper function for bincount

* Make sure dtype is propagated through computation

* (nan,) shape if minlength not specified, must compute to find len()

* Remove unused import from dask/array/routines.py

* Not necessary to specify dtype of bincount input array

* Bump numpy version to 1.13.0 for travis python 3.5 builds

* Bump pandas to 0.20.2 for python 3.5 travis build

* Update dask/array/routines.py

Replace unnecessary variable bincount_list

Co-Authored-By: GenevieveBuckley <30920819+GenevieveBuckley@users.noreply.github.com>

* Update dask/array/routines.py

Remove bincount_list variable

Co-Authored-By: GenevieveBuckley <30920819+GenevieveBuckley@users.noreply.github.com>

* Move name definition next to tokenize for dask.array.bincount

* Update dask/array/routines.py

Co-Authored-By: GenevieveBuckley <30920819+GenevieveBuckley@users.noreply.github.com>

* Return to using dictionary comprehensions

* Raise ValueError instead of using assert statements in bincount function

* check names and add back token

Thomas-Z added a commit to Thomas-Z/dask that referenced this pull request May 17, 2019

dask.array.bincount() - make 'minlength' keyword argument optional (d…
…ask#4684)

* da.bincount() automatic calculation of 'minlength' kwarg

* Test for da.bincount unspecified minlength kwarg

* _bincount_sum() helper function for bincount

* Make sure dtype is propagated through computation

* (nan,) shape if minlength not specified, must compute to find len()

* Remove unused import from dask/array/routines.py

* Not necessary to specify dtype of bincount input array

* Bump numpy version to 1.13.0 for travis python 3.5 builds

* Bump pandas to 0.20.2 for python 3.5 travis build

* Update dask/array/routines.py

Replace unnecessary variable bincount_list

Co-Authored-By: GenevieveBuckley <30920819+GenevieveBuckley@users.noreply.github.com>

* Update dask/array/routines.py

Remove bincount_list variable

Co-Authored-By: GenevieveBuckley <30920819+GenevieveBuckley@users.noreply.github.com>

* Move name definition next to tokenize for dask.array.bincount

* Update dask/array/routines.py

Co-Authored-By: GenevieveBuckley <30920819+GenevieveBuckley@users.noreply.github.com>

* Return to using dictionary comprehensions

* Raise ValueError instead of using assert statements in bincount function

* check names and add back token
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.