Modify mean chunk functions to return dicts rather than arrays #4513

Merged: 2 commits into dask:master from mrocklin:mean-dict on Feb 25, 2019

@mrocklin (Member) commented Feb 20, 2019:

These functions return two array chunks each:

  1. The sum of the array
  2. The counts of the array

We need to return these as a single intermediate value.

Previously we did this by constructing an empty numpy array and then
assigning into it. This doesn't work as well with numpy-like arrays
like cupy and sparse.

Now we return a dict containing two arrays. This is nice in a few ways,
but does mean that we need to replicate the `concatenate=`
functionality within our chunk aggregation and combine functions, which
can be tricky. This commit fails to do it correctly, but may be a good
start.
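
As a rough sketch of what this looks like, here is a minimal numpy-only version of the dict-returning approach (the names `mean_chunk`, `mean_agg`, and the `'total'`/`'n'` keys are illustrative, not necessarily the exact identifiers in reductions.py):

    import numpy as np

    def mean_chunk(x, dtype='f8'):
        # Each chunk contributes a partial sum and an element count,
        # returned together as one dict rather than packed into a
        # pre-allocated array.
        return {'total': x.sum(dtype=dtype), 'n': x.size}

    def mean_agg(intermediates, dtype='f8'):
        # Combine the per-chunk dicts into the final mean.
        total = sum(d['total'] for d in intermediates)
        n = sum(d['n'] for d in intermediates)
        return np.divide(total, n, dtype=dtype)

    chunks = [np.arange(4), np.arange(4, 8)]
    mean_agg([mean_chunk(c) for c in chunks])  # -> 3.5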

This is also somewhat suboptimal in a distributed setting because we no longer get special-cased handling of numpy arrays like we used to. The scheduler will probably miscount the number of bytes that these objects hold and will probably not use ideal serialization and compression. I think that this is probably ok because they're likely to be small, but it's worth pointing out.

After this we would like to implement the same fix for other functions,
notably the `moment_*` functions in this same reductions.py file.

Fixes #4481

Help or ideas with this would be welcome. cc @pentschev @jakirkham

  • Tests added / passed
  • Passes flake8 dask
    total = _concatenate2(totals, axes=axis).sum(axis=axis, **kwargs)

    return divide(total.sum(dtype=dtype, **kwargs),
                  n.sum(dtype=dtype, **kwargs), dtype=dtype)

@pentschev (Member) commented on the diff, Feb 21, 2019:

Found the mistake: you're calling .sum() twice for both total and n. Get rid of one and you should be good to go!
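
One plausible reading of that fix, touching only the lines quoted above (whether n needs the same treatment depends on code outside this snippet):

    total = _concatenate2(totals, axes=axis).sum(axis=axis, **kwargs)

    # `total` is already fully reduced on the line above, so divide it
    # directly instead of summing it a second time.
    return divide(total, n.sum(dtype=dtype, **kwargs), dtype=dtype)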

@pentschev (Member) commented Feb 21, 2019:

By the way, @mrocklin, if you want to get your commit fixed and merge it, feel free to leave moment_* to me; I just found out it's the next one in line for Dask-GLM + CuPy.

@mrocklin (Member, Author) commented Feb 21, 2019.

@pentschev (Member) commented Feb 21, 2019:

I don't mind taking it over, but the fix is to literally remove the double sum(). All tests passed locally after I did so. If you prefer that I take over, then just close this PR and I'll commit my changes to a new PR.

@mrocklin (Member, Author) commented Feb 21, 2019:

Pushed. @jakirkham, if you're around to review, that would be welcome.

@mrocklin (Member, Author) commented Feb 21, 2019:

Thinking about extending this to moment, it looks like there is a little bit of a challenge here:

    # Pre-allocate an output array of the appropriate backend type,
    # then assign one moment order into each slot along the trailing axis.
    empty = empty_lookup.dispatch(type(n))
    M = empty(n.shape + (order - 1,), dtype=dtype)

    for o in range(2, order + 1):
        M[..., o - 2] = _moment_helper(Ms, ns, inner_term, o, sum, kwargs)

Here we construct an empty array and then assign into it.

Instead, we might construct many arrays and then concatenate them together.

    xs = [_moment_helper(Ms, ns, inner_term, o, sum, kwargs)
          for o in range(2, order + 1)]
    M = np.stack(xs, axis=-1)

(I might have the axis= keyword above wrong. Also, depending on the shape of the output of _moment_helper, we might want np.concatenate rather than np.stack.)
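
For reference, a tiny self-contained example of the difference between the two (with made-up shapes standing in for the outputs of _moment_helper):

    import numpy as np

    # Three per-order results of shape (4,), standing in for the
    # outputs of _moment_helper for orders 2, 3, and 4.
    xs = [np.full(4, o, dtype='f8') for o in range(2, 5)]

    np.stack(xs, axis=-1).shape        # (4, 3): adds a new trailing axis
    np.concatenate(xs, axis=-1).shape  # (12,): joins along an existing axis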

cc @pentschev

@pentschev (Member) commented Feb 21, 2019:

@mrocklin that might work, but couldn't that cause a potential performance issue? I guess the stacked blocks will all remain separate memory blocks, right? My concern is with the case where stacking does consolidate them into a single memory block, since that involves an allocation and a copy.

@pentschev (Member) commented Feb 21, 2019:

And also, I'd like to point out that this will still require a lookup for empty_like, which is simple to add.

@mrocklin (Member, Author) commented Feb 21, 2019:

> but couldn't that cause a potential performance issue?

I suspect that the data at this stage of the computation will probably be pretty small (it has already gone through a reduction along at least one axis), so I'm not too concerned about performance.

> And also, I'd like to point out that this will still require a lookup for empty_like, which is simple to add.

I think that we can avoid this. We no longer construct an empty array and then assign into it. Instead we create many arrays and then call np.concatenate. It's true that np.concatenate will now need to be dispatchable, but I think that is already handled by the __array_function__ protocol.
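
As a toy illustration of that dispatch (assuming a NumPy version with the protocol enabled, e.g. NumPy 1.16 with NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1; the Duck class below is invented for this example), a minimal duck array can intercept np.concatenate:

    import numpy as np

    class Duck:
        # Minimal duck array opting in to the __array_function__ protocol.
        def __init__(self, data):
            self.data = np.asarray(data)

        def __array_function__(self, func, types, args, kwargs):
            if func is np.concatenate:
                # Unwrap our type, defer to numpy, and rewrap the result.
                arrays = [a.data if isinstance(a, Duck) else a
                          for a in args[0]]
                return Duck(np.concatenate(arrays, **kwargs))
            return NotImplemented

    out = np.concatenate([Duck([1, 2]), Duck([3, 4])])
    type(out)  # Duck: numpy dispatched concatenate to our implementation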

@pentschev (Member) commented Feb 21, 2019:

Never mind, you're right; I misinterpreted your suggestion on how to do the stacking. I'll work on it, thanks for the suggestions.

pentschev added a commit to pentschev/dask that referenced this pull request Feb 22, 2019

@pentschev referenced this pull request Feb 22, 2019: Modify moment chunk functions to return dicts #4519 (merged)
@mrocklin (Member, Author) commented Feb 22, 2019:

Merging this later today if there are no further comments.

@mrocklin (Member, Author) commented Feb 22, 2019:

Nope, not merging. Adding a failing test. We still use np.ones_like here, as described in #4519 (comment).

To be honest I'm hoping that @pentschev is able to come up with a solution that I'll be able to steal here :)

@pentschev (Member) commented Feb 22, 2019:

> Nope, not merging. Adding a failing test. We still use np.ones_like here, as described in #4519 (comment).
>
> To be honest I'm hoping that @pentschev is able to come up with a solution that I'll be able to steal here :)

I just tried this new test you added and it also fails on current master (i.e., before the changes in this PR). I think this is something we could (maybe even should) tackle in a different issue?

@pentschev (Member) commented Feb 22, 2019:

Opened a new issue, #4523, to tackle the sparse error.

@mrocklin I would suggest reverting the test you added here and later adding it to the issue I just mentioned.

mrocklin added a commit that referenced this pull request Feb 25, 2019

@mrocklin mrocklin force-pushed the mrocklin:mean-dict branch from e33a301 to b2f3394 Feb 25, 2019

@pentschev (Member) commented Feb 25, 2019:

I think this is good to be merged now, @mrocklin.

@mrocklin mrocklin merged commit 119950b into dask:master Feb 25, 2019

2 checks passed:

  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed

@mrocklin mrocklin deleted the mrocklin:mean-dict branch Feb 25, 2019

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019
