Random Choice on Bags (#4799) #6208
I also wrote a description of the proposed solution, here. |
|
Hi @eracle. Is there a reason why you wouldn't want to use |
|
Hey @jsignell , thanks for the point. |
|
Issue #6205 also has the (3.8) Windows CI build failing for the same reason.
jsignell
left a comment
Thanks for making the change to take out numpy! This looks pretty good. I have a few comments about code style, but the biggest point is that sample does not work properly as written.
dask/bag/random.py
Outdated
| - list of k samples; | ||
| - total number of elements from where the sample was drawn; | ||
| - total number of elements to be sampled; | ||
| - boolean which is True whether the sample is with or without replacement. |
The return docstring here could be clearer and should be in the style:
Returns
-------
Done on latest commit, thanks!
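For reference, a minimal sketch of what a numpydoc-style `Returns` section can look like for a per-partition helper. The function name `sample_partition` and its exact return values are assumptions made for illustration; this is not the code from the PR.

```python
import random


def sample_partition(population, k):
    """Draw k elements from one partition, with replacement.

    Parameters
    ----------
    population : list
        Elements of the partition.
    k : int
        Number of elements to sample.

    Returns
    -------
    sampled : list
        The k sampled elements (empty if the partition is empty).
    n : int
        Number of elements in the partition the sample was drawn from.
    """
    population = list(population)
    # random.choices raises on an empty population, so guard against it.
    sampled = random.choices(population, k=k) if population else []
    return sampled, len(population)
```

Each returned value gets its own `name : type` line followed by an indented description, which is the layout the review comment asks for.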
dask/bag/random.py
Outdated
| return sampled, lx, k, replace | ||
|
|
||
|
|
||
| def _sample_reduce(reduce_iter): |
I think this function should take k and reduce directly. It seems a little convoluted to include them on the response from each map partition when they are globally defined.
Done on latest commit, thanks!
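One way to read the suggestion above is to bind the global `k` to the reduce step with `functools.partial`, so that each partition only returns its own sample and size. The sketch below uses plain lists in place of bag partitions to stay runnable without dask; `sample_partition` and `combine_samples` are hypothetical names, and the merge strategy is an assumption, not the PR's implementation.

```python
import random
from functools import partial


def sample_partition(population, k):
    # Per-partition step: k draws with replacement plus the partition size.
    population = list(population)
    sampled = random.choices(population, k=k) if population else []
    return sampled, len(population)


def combine_samples(partials, k):
    """Merge per-partition (sampled, n) pairs into k draws with replacement."""
    partials = [(s, n) for s, n in partials if n > 0]
    if not partials:
        return []
    sizes = [n for _, n in partials]
    iterators = [iter(s) for s, _ in partials]
    # Pick a source partition for each draw, weighted by partition size,
    # then consume the next unused element from that partition's sample.
    picks = random.choices(range(len(partials)), weights=sizes, k=k)
    return [next(iterators[i]) for i in picks]


k = 3
partitions = [[0, 1, 2], [3, 4]]
mapped = [sample_partition(p, k) for p in partitions]
# k is bound once here instead of being carried in every mapped result.
aggregate = partial(combine_samples, k=k)
result = aggregate(mapped)
```

With a real bag, the two callables would presumably be passed as the `perpartition` and `aggregate` arguments of `Bag.reduction`, both wrapped with `partial` in the same way.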
dask/bag/random.py
Outdated
| >>> from dask.bag import random | ||
| >>> | ||
| >>> b = db.from_sequence(range(5), npartitions=2) | ||
| >>> len(list(random.sample(b, 3).compute())) |
I understand that you are showing len here because it'll always be 3, but it'd be more useful to show an actual response and just add the skip doctest flag.
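A small sketch of the skip flag in a docstring, using the standard library only. The helper `pick_three` and the output shown in its docstring are illustrative assumptions; the point is that `# doctest: +SKIP` lets the docstring show a realistic result without the example being executed.

```python
import doctest
import random


def pick_three(items):
    """Pick three items at random, with replacement.

    The output below is illustrative only; the example is skipped.

    >>> pick_three([0, 1, 2, 3, 4])  # doctest: +SKIP
    [1, 4, 4]
    """
    return random.choices(list(items), k=3)


# Parse and run the docstring example; the skipped line cannot fail.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    pick_three.__doc__, {"pick_three": pick_three}, "pick_three", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
```

Here `results.failed` is 0 even though the random output would almost never match the text in the docstring.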
dask/bag/tests/test_random.py
Outdated
| """ | ||
| a = db.from_sequence(range(10), partition_size=9) | ||
| s = random.sample(a, k=10) | ||
| assert len(list(s.compute())) == 10 |
You should have some tests for the actual contents of the returned values. At least to check that there is replacement or not depending on whether using choices or sample.
I have removed the sample method for now. I had some trouble testing the exact output of the choices method because of reproducibility problems.
I considered adding a seed function to bag.random, to mimic Python's random.seed.
It could be done as a further development, but it is not straightforward, since there are issues that must be resolved first.
Yeah that makes sense. I just meant something like asserting that all the values in the output are in the input bag.
Ok, I just added more specific tests in the latest commit. Thanks for the advice.
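The kind of seed-independent checks discussed in this thread could look roughly like the following. `choices_from_partitions` is a hypothetical stand-in that flattens plain lists, used here only so the sketch runs without dask; with a real bag the call would presumably be something like `random.choices(bag, k=...).compute()`.

```python
import random


def choices_from_partitions(partitions, k):
    # Hypothetical stand-in for a bag-level choices: draw k elements
    # with replacement from all partitions taken together.
    flat = [x for part in partitions for x in part]
    return random.choices(flat, k=k)


def test_choices_values_come_from_input():
    partitions = [[0, 1, 2], [3, 4]]
    out = choices_from_partitions(partitions, k=20)
    assert len(out) == 20
    # Every sampled value must be present in the input.
    assert set(out) <= set(range(5))


def test_choices_allows_k_larger_than_population():
    # Asking for more elements than exist only works with replacement,
    # so at least one value has to repeat.
    out = choices_from_partitions([[0, 1], [2]], k=10)
    assert len(out) == 10
    assert len(set(out)) < len(out)


test_choices_values_come_from_input()
test_choices_allows_k_larger_than_population()
```

Neither test depends on the exact values drawn, so they pass without any seeding.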
dask/bag/core.py
Outdated
|
|
||
| def reduction( | ||
| self, perpartition, aggregate, split_every=None, out_type=Item, name=None | ||
| self, perpartition, aggregate, split_every=None, out_type=Item, name=None, |
I don't think you mean to change this file.
Done on latest commit, thanks!
|
Hey @jsignell, thanks a lot for the review. I will soon push a change addressing some of those points; for the others, I will reply on a per-line basis.
Co-authored-by: Julia Signell <jsignell@gmail.com>
Hey, after this message I asked myself whether dask already has the choice feature, so I tried the one from array.random you suggested:
>>> import dask.array.random as random
>>> import dask.bag as db
>>> b = db.from_sequence(range(5), npartitions=2)
>>> random.choice(b, size=5, replace=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../dask/array/random.py", line 230, in choice
raise ValueError("a must be one dimensional")
ValueError: a must be one dimensional
That raised an error, so it cannot be used on bag instances. To wrap up, implementing Python's random library in bag.random and numpy's in array.random makes sense. Unfortunately there is a lot of work to do in order to fully mimic Python's random; a simple example is the seed function, which would let me write fully reproducible unit tests on the choices outputs :-)
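For context, this is the reproducibility that seeding gives in Python's standard `random` module, which is the behavior a future bag-level seed function would aim to mimic. The sketch only covers the single-process stdlib case; how to carry a seed across bag partitions is the part left unresolved in this thread.

```python
import random

# Two generators created with the same seed produce the same draws.
rng_a = random.Random(42)
rng_b = random.Random(42)

first = rng_a.choices(range(5), k=5)
second = rng_b.choices(range(5), k=5)

# With a fixed seed, a test can compare exact outputs.
assert first == second
```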
jsignell
left a comment
Just a small cleanup of an errant replace
dask/bag/random.py
Outdated
| Number of elements on the partition. | ||
| k : int | ||
| Number of elements to sample. | ||
| replace: boolean |
Done on latest commit, thanks!
|
Right, I don't think I was clear about what I meant. You can't use a bag with
>>> import numpy as np
>>> import dask.array as da
>>>
>>> a = da.from_array(np.arange(5), 2)
>>> da.random.choice(a, size=5, replace=True)
So yeah, I agree that this PR is a useful addition. I don't think you need to open a new issue for adding choices on bags. We'll just leave #4799 open for when someone figures out how to implement sampling. Thanks for your work on this!
black dask/flake8 dask