Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: shuffle dask array #3901

Merged
merged 13 commits into from Aug 8, 2019

Conversation

@TomAugspurger
Copy link
Member

commented Aug 24, 2018

Closes #3409

Some brief timings on

x = da.random.random((100_000, 10), chunks=10_000)
index = np.arange(len(x))
np.random.shuffle(index)
task old new
build graph 419 ms 28.7 ms
compute 14.5 s 48.9 ms
@TomAugspurger

This comment has been minimized.

Copy link
Member Author

commented Aug 24, 2018

IIUC, In #3409, @mrocklin mentioned adjusting slicing_plan to detect when we should use this slicing method. I've not attempted that. So a "naive" shuffle of a dask array with

index = np.arange(len(arr))
np.random.shuffle(index)
arr[index]

is still going to be very slow. But the use cases I have in mind (#3409, lmcinnes/umap#62, approximate nearest neighbors) can opt into the faster slicing, when we know we have the right kind of index array.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Aug 24, 2018

@TomAugspurger

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2018

Thanks, I had forgotten about that. I'll take a look.


offsets = np.roll(np.cumsum(chunks[0]), 1)
offsets[0] = 0
offsets

This comment has been minimized.

Copy link
@jakirkham

jakirkham Aug 27, 2018

Member

Is this needed?

@jcrist

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

@TomAugspurger, what's left to be done here?

@TomAugspurger

This comment has been minimized.

Copy link
Member Author

commented May 1, 2019

Sorry, forgot about this. I think that this is useful, so I've fixed the merge conflicts. I haven't re-reviewed the implementation though.

@jakirkham

This comment has been minimized.

Copy link
Member

commented May 29, 2019

@shoyer, any thoughts on this implementation of shuffle for Dask Arrays?

@martindurant

This comment has been minimized.

Copy link
Member

commented Jun 19, 2019

Ping: this seems to have been left to go stale

@TomAugspurger

This comment has been minimized.

Copy link
Member Author

commented Jun 19, 2019

The difference in inplace vs. a new Array is the main thing concerning me. Do we have other places in dask.array that differ from NumPy like this?

@martindurant

This comment has been minimized.

Copy link
Member

commented Jun 19, 2019

So long as the doc is clear, it should be OK - we are different from numpy in a number of ways in a number of places; I expect that includes in-place behaviour somewhere, although I don't know for sure.

dask/array/random.py Outdated Show resolved Hide resolved
dask/array/tests/test_random.py Outdated Show resolved Hide resolved
@jcrist

This comment has been minimized.

Copy link
Member

commented Jun 25, 2019

The difference in inplace vs. a new Array is the main thing concerning me.

There are other places in the api where we mutate an existing array/dataframe object inplace (this is fine as long as we don't mutate the graph). I'd prefer we match numpy's mutating api here if possible, as I suspect differing will lead to user issues in the future.

@TomAugspurger

This comment has been minimized.

Copy link
Member Author

commented Jul 1, 2019

Merging later today if there aren't any objections.

@TomAugspurger TomAugspurger merged commit 51ff4e6 into dask:master Aug 8, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@TomAugspurger TomAugspurger added the array label Aug 8, 2019

@jakirkham

This comment has been minimized.

Copy link
Member

commented Aug 8, 2019

Thanks for working on this Tom! 😄

@stsievert

This comment has been minimized.

Copy link
Member

commented Aug 30, 2019

Thanks for working on this @TomAugspurger!

I've rerun the timing comparison to see how the new implementation works and show how to use it.

Implementation Graph build Computation
shuffle_blocks (this PR) 77.8ms 67.1 ms
Naive indexing 721ms 13.007s

Here's the code I used to generate it:

import dask.array as da
import numpy as np
import dask
from time import time
from dask.array.slicing import shuffle_slice

if __name__ == "__main__":
    x = da.random.random((100_000, 10), chunks=10_000)
    index = np.arange(len(x))
    np.random.shuffle(index)

    start = time()
    y2 = shuffle_slice(x, index)  # 0.07785s
    print(time() - start)
    start = time()
    z2 = y2.compute()  # 0.06716
    print(time() - start)

    start = time()
    y1 = x[index]  # 0.721
    print(time() - start)
    start = time()
    z1 = y1.compute()  # 13.0067
    print(time() - start)

The core of this code is

x = da.random.random((100_000, 10), chunks=10_000)
index = np.arange(len(x))
np.random.shuffle(index)

y1 = shuffle_blocks(x, index)  # shuffle_blocks
y2 = x[index]  # naive indexing

@TomAugspurger TomAugspurger deleted the TomAugspurger:ndarray-shuffle branch Aug 30, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.