Speed up dask.bag.Bag.random_sample #10356
dask/bag/core.py
Outdated
[0, 1, 4]
>>> list(b.random_sample(0.5, 43))
[0, 3, 4]
[0, 1, 4]
Annoyingly this doctest will fail if a user hasn't installed numpy. I don't think it's a big problem.
dask/bag/core.py
Outdated
if isinstance(random_state, Random):
    random_state = random_state.randint(0, maxuint32)
np_rng = np.random.RandomState(random_state)
numpy (and dask) RandomState is frozen and won't be getting any updates. Can't we use the new Generator interface here?
And if all we want is a bunch of reproducible parallel streams, perhaps we can move _spawn_bitgens from da.random to dask.utils and use that. Bonus would be getting rid of all those 624s (Mersenne Twister detail) in the code.
Do let me know if I am not making any sense.
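For reference, a minimal sketch of the "reproducible parallel streams" idea, using only public NumPy APIs (`SeedSequence.spawn` and `default_rng`). It does not use dask's internal `_spawn_bitgens`, whose signature is not shown in this thread; the helper name `spawn_generators` is hypothetical and exists only for illustration.

```python
import numpy as np


def spawn_generators(seed, n):
    """Return n independent Generators derived reproducibly from one seed.

    Hypothetical helper for illustration; not part of dask.
    """
    # SeedSequence.spawn produces child seed sequences that yield
    # statistically independent streams for the same parent seed.
    children = np.random.SeedSequence(seed).spawn(n)
    return [np.random.default_rng(child) for child in children]


# One generator per partition, all derived from a single user-facing seed.
gens = spawn_generators(42, 3)
draws = [g.random(2).tolist() for g in gens]
```

Calling `spawn_generators` again with the same seed should give the same per-partition draws, which is the property a seeded `random_sample` would rely on.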
On second look, perhaps just using default_rng instead of RandomState in line 2557 is enough. 624 is too deep in the stack to handle.
Feel free to ignore. Sorry
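A rough sketch of what the simpler swap could look like, assuming the surrounding code only needs a seeded NumPy generator. The names `make_generator`, `sample_mask`, and `MAX_UINT32` are hypothetical and are not taken from `dask/bag/core.py`; only `random.Random.randint`, `np.random.default_rng`, and `Generator.random` are real APIs.

```python
import random

import numpy as np

# Assumption: any non-negative integer works as a seed for default_rng;
# this bound is chosen for the sketch only.
MAX_UINT32 = 2**32 - 1


def make_generator(random_state=None):
    """Build a numpy Generator from None, an int seed, or a stdlib Random."""
    if isinstance(random_state, random.Random):
        # Derive an integer seed from the stdlib generator (bounds inclusive).
        random_state = random_state.randint(0, MAX_UINT32)
    return np.random.default_rng(random_state)


def sample_mask(n, prob, random_state=None):
    """Return a boolean mask of length n; each entry is True with probability prob."""
    rng = make_generator(random_state)
    return rng.random(n) < prob
```

Note that `default_rng` and `RandomState` produce different streams for the same seed, so any doctest output that depends on a seed would change with this swap.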
Sorry for the wait. I replaced RandomState with default_rng. Do you think this is sufficient to merge this PR?
yes, good for merge afaic. thank you
Closes #10351
Before:
After: