Skip to content

Speed up dask.bag.Bag.random_sample#10356

Merged
crusaderky merged 6 commits intodask:mainfrom
crusaderky:random_state_python
Jul 13, 2023
Merged

Speed up dask.bag.Bag.random_sample#10356
crusaderky merged 6 commits intodask:mainfrom
crusaderky:random_state_python

Conversation

@crusaderky
Copy link
Copy Markdown
Collaborator

@crusaderky crusaderky commented Jun 14, 2023

Closes #10351

%timeit random_state_data_python(1, 0)
%timeit random_state_data_python(10, 0)
%timeit random_state_data_python(100, 0)
%timeit random_state_data_python(1000, 0)

Before:

265 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
2.64 ms ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
26.5 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
272 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:

131 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
234 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.26 ms ± 9.68 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
20.4 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@crusaderky crusaderky self-assigned this Jun 14, 2023
@crusaderky crusaderky force-pushed the random_state_python branch from 7ffee40 to fc970c8 Compare June 14, 2023 11:05
dask/bag/core.py Outdated
[0, 1, 4]
>>> list(b.random_sample(0.5, 43))
[0, 3, 4]
[0, 1, 4]
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annoyingly this doctest will fail if a user hasn't installed numpy. I don't think it's a big problem.

dask/bag/core.py Outdated

if isinstance(random_state, Random):
random_state = random_state.randint(0, maxuint32)
np_rng = np.random.RandomState(random_state)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy (and dask) RandomState is frozen and wont be getting any updates. Cant we use the new Generator interface here?

And if all we want is a bunch of reproducible parallel streams, perhaps we can move _spawn_bitgens from da.random to dask.utils and use that. Bonus would be getting rid of all those 624s (Mersenne Twister detail) in the code.

Do let me know if I am not making any sense.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second look, perhaps just using default_rng instead of RandomState in line 2557 is enough. 624 is too deep in the stack to handle.

Feel free to ignore. Sorry

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the wait. I replaced RandomState with default_rng. Do you think this is sufficient to merge this PR?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, good for merge afaic. thank you

@crusaderky crusaderky merged commit 484afe2 into dask:main Jul 13, 2023
@crusaderky crusaderky deleted the random_state_python branch July 13, 2023 18:10
j-bennet pushed a commit to j-bennet/dask that referenced this pull request Jul 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parallel random_state_data_python

2 participants