Random Choice on Bags (#4799) by eracle · Pull Request #6208 · dask/dask

eracle · 2020-05-14T13:19:05Z

Tests added / passed
Passes black dask / flake8 dask

eracle · 2020-05-14T13:24:26Z

I also wrote a description of the proposed solution, here.

jsignell · 2020-05-14T16:30:26Z

Hi @eracle. Is there a reason why you wouldn't want to use dask.array.random.choice? It would be a very big and undesirable change to include numpy as a dependency of dask bag.

jsignell · 2020-05-14T16:36:28Z

Ah sorry, @eracle. I missed the linked issue (#4799) where there was more discussion. It seems like the consensus in that issue was to try to mimic random.sample rather than numpy.random.choice.

eracle · 2020-05-14T20:00:09Z

Hey @jsignell, thanks for pointing that out.
I just realized that a weighted sampling function exists outside numpy: the choices function (here's a link).
I will remove numpy from the bags; you are definitely right.
Regarding which API to mimic: as you pointed out, there is no need to re-discuss it, since we already reached consensus on mimicking sample. I originally wrote it this way because it was nice to have the replace parameter, which distinguished sampling with replacement from sampling without.
Since I will remove numpy from bags, I will provide both behaviours by implementing two functions, sample and choices, in a separate module named bag.random.
That way, I will mimic the API of Python's random module 100%.
Thanks for the heads up.
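
For reference, the two standard-library functions named above differ in whether they sample with replacement. The following is a minimal sketch using only Python's random module; the population and weights are made up for illustration and are not taken from the PR.

```python
import random

population = ["a", "b", "c", "d", "e"]

# random.sample picks k distinct positions: sampling WITHOUT replacement.
without_replacement = random.sample(population, k=3)

# random.choices draws k items independently: sampling WITH replacement.
# It also accepts optional per-item weights (illustrative values here).
with_replacement = random.choices(population, weights=[5, 1, 1, 1, 1], k=8)

print(without_replacement)
print(with_replacement)
```

Because choices draws with replacement, k may exceed the population size, which is not allowed for sample.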

eracle · 2020-05-15T12:15:54Z

Issue #6205 also has the (3.8) Windows CI build failing for the same reason.
I suppose it is related to how numpy dependencies are managed, probably a change in the latest numpy releases that is not fully backward compatible.
I should investigate this further.

jsignell

Thanks for making the change to take out numpy! This looks pretty good. I have a few comments about code style, but the biggest point is that sample does not work properly as written.

jsignell · 2020-05-15T19:12:45Z

dask/bag/random.py

+    - list of k samples;
+    - total number of elements from where the sample was drawn;
+    - total number of elements to be sampled;
+    - boolean which is True whether the sample is with or without replacement.


The return docstring here could be clearer and should be in the style:

Returns
-------

Done on latest commit, thanks!
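
For readers unfamiliar with the layout being requested, here is a hedged sketch of a numpydoc-style Returns section. The function name sample_partition and its exact return values are hypothetical, chosen only to illustrate the docstring format; they are not the code from this PR.

```python
import random


def sample_partition(population, k):
    """Sample ``k`` elements from one partition, with replacement.

    Parameters
    ----------
    population : iterable
        Elements of the partition.
    k : int
        Number of elements to sample.

    Returns
    -------
    sampled : list
        The ``k`` sampled elements (empty if the partition is empty).
    n : int
        Number of elements in the partition the sample was drawn from.
    """
    population = list(population)
    sampled = random.choices(population, k=k) if population else []
    return sampled, len(population)
```

Each returned value gets its own name, type, and indented description, which is easier to scan than a bulleted sentence list.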

jsignell · 2020-05-15T20:09:30Z

dask/bag/random.py

+    return sampled, lx, k, replace
+
+
+def _sample_reduce(reduce_iter):


I think this function should take k and replace directly. It seems a little convoluted to include them in the response from each map partition when they are globally defined.

Done on latest commit, thanks!
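
One way to read this suggestion is sketched below: the per-partition step returns only what is local to the partition, and the global parameter k is bound to the reduce step with functools.partial. This is an illustrative sketch under assumptions, not the PR's implementation: the names sample_partition and reduce_samples are hypothetical, plain lists stand in for bag partitions, and only sampling with replacement is covered.

```python
import random
from functools import partial


def sample_partition(population, k):
    """Per-partition step: draw k items with replacement, report partition size."""
    population = list(population)
    sampled = random.choices(population, k=k) if population else []
    return sampled, len(population)


def reduce_samples(partials, k):
    """Aggregate step: merge per-partition samples into one sample of size k.

    ``k`` is passed in directly rather than carried in every partial result.
    """
    partials = [(sampled, n) for sampled, n in partials if n > 0]
    if not partials:
        return []
    sizes = [n for _, n in partials]
    # Choose a source partition for each output slot, weighted by partition size.
    picked = random.choices(range(len(partials)), weights=sizes, k=k)
    # Consume each partition's sample in order so that no draw is reused.
    iterators = [iter(sampled) for sampled, _ in partials]
    return [next(iterators[i]) for i in picked]


partitions = [[0, 1, 2], [3, 4], []]
k = 4
partials = [sample_partition(part, k) for part in partitions]
aggregate = partial(reduce_samples, k=k)  # k bound once, globally
result = aggregate(partials)
print(result)
```

Binding k once keeps each partial result down to the sample and the partition size, which are the only values that actually vary per partition.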

jsignell · 2020-05-15T20:14:36Z

dask/bag/random.py

+    >>> from dask.bag import random
+    >>>
+    >>> b = db.from_sequence(range(5), npartitions=2)
+    >>> len(list(random.sample(b, 3).compute()))


I understand that you are showing len here because it'll always be 3, but it'd be more useful to show an actual response and just add the skip doctest flag.

uhh thanks! :-D
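
As a hedged illustration of the skip flag: the doctest directive `# doctest: +SKIP` lets a docstring show a realistic but non-deterministic output without the example being executed. The function pick_three below is a hypothetical stand-in built on the standard library, not the bag function from this PR.

```python
import doctest
import random


def pick_three(items):
    """Return three randomly chosen, distinct items.

    >>> pick_three(range(5))  # doctest: +SKIP
    [1, 3, 4]
    """
    return random.sample(list(items), 3)


# Parse and run the docstring examples; the skipped example is not executed,
# so its random output cannot cause a failure.
test = doctest.DocTestParser().get_doctest(
    pick_three.__doc__, {"pick_three": pick_three}, "pick_three", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed)
```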

jsignell · 2020-05-15T20:20:43Z

dask/bag/tests/test_random.py

+    """
+    a = db.from_sequence(range(10), partition_size=9)
+    s = random.sample(a, k=10)
+    assert len(list(s.compute())) == 10


You should have some tests for the actual contents of the returned values. At least to check that there is replacement or not depending on whether using choices or sample.

I have removed the sample method for now. I had some trouble testing the exact output of the choices method because of reproducibility problems.
I considered adding a seed function to bag.random in order to mimic Python's random.seed.
It could be done as a further development, but it is not straightforward, since there are issues that must be resolved first.

Yeah that makes sense. I just meant something like asserting that all the values in the output are in the input bag.

OK, I just added more specific tests in the latest commit. Thanks for the advice.
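
A content check of the kind suggested above does not need reproducible output. The sketch below is an assumption-laden illustration: random.choices from the standard library stands in for the bag-level function under test, and the helper name check_choices_output is hypothetical.

```python
import random


def check_choices_output(population, sampled, k):
    """Checks that hold for any random draw, so no seed is required."""
    population = set(population)
    assert len(sampled) == k
    # Every sampled value must come from the input.
    assert all(item in population for item in sampled)


def test_choices_values_come_from_input():
    population = range(10)
    # Stand-in for the function under test (assumption for this sketch).
    sampled = random.choices(population, k=25)
    check_choices_output(population, sampled, k=25)
    # With k larger than the population, duplicates are guaranteed,
    # which confirms sampling with replacement.
    assert len(set(sampled)) < len(sampled)


test_choices_values_come_from_input()
```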

jsignell · 2020-05-15T20:21:08Z

dask/bag/core.py


    def reduction(
-        self, perpartition, aggregate, split_every=None, out_type=Item, name=None
+        self, perpartition, aggregate, split_every=None, out_type=Item, name=None,


I don't think you mean to change this file.

Done on latest commit, thanks!

eracle · 2020-05-17T11:10:05Z

Hey @jsignell, thanks a lot for the review. I will soon push a change addressing some of these points; for the others, I will reply on a per-line basis.

eracle · 2020-05-18T10:02:58Z

Hi @eracle. Is there a reason why you wouldn't want to use dask.array.random.choice? It would be a very big and undesirable change to include numpy as a dependency of dask bag.

Hey, after this message I asked myself whether dask already has the choice feature, so I tried the one from array.random that you suggested:

>>> import dask.array.random as random
>>> import dask.bag as db
>>> b = db.from_sequence(range(5), npartitions=2)
>>> random.choice(b, size=5, replace=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../dask/array/random.py", line 230, in choice
    raise ValueError("a must be one dimensional")
ValueError: a must be one dimensional

Which led to an error, so it cannot be used on bag instances.

To wrap up: implementing Python's random library in bag.random and numpy's in array.random makes sense. Unfortunately, there is a lot of work to do in order to fully mimic Python's random. A simple example is the seed function, which would let me write fully reproducible unit tests on the choices outputs :-)
To implement it, bag.random would need a structure similar to that of random. But I would go one step at a time.
Another point is the sample function (sampling without replacement). As I already pointed out, the distributed algorithm I implemented can do it if it uses a sampling function without replacement internally. Unfortunately, I haven't yet found a weighted sampling function without replacement outside numpy.
So this is another point for further improvement, which would then allow closing #4799.
Should I open a dedicated issue for the choices function on bags? Right now I am OK with merging this, but I don't know what your usual workflow is.
Thanks
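
To make the reproducibility point concrete, here is a minimal single-process sketch of how a seed makes draws repeatable with the standard library. The function seeded_choices is hypothetical, and the sketch deliberately does not address how a seed would be propagated to the partitions of a bag, which is the part described above as unresolved.

```python
import random


def seeded_choices(population, k, seed=None):
    """Draw k items with replacement using a private, optionally seeded generator."""
    rng = random.Random(seed)  # independent of the module-level generator
    return rng.choices(list(population), k=k)


first = seeded_choices(range(100), k=5, seed=42)
second = seeded_choices(range(100), k=5, seed=42)
print(first == second)  # the same seed gives the same draw
```

With a fixed seed, a unit test could compare the output against a stored expected list instead of checking only length and membership.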

jsignell

Just a small cleanup of an errant replace

jsignell · 2020-05-20T19:26:00Z

dask/bag/random.py

+        Number of elements on the partition.
+    k : int
+        Number of elements to sample.
+    replace: boolean


This is gone now right?

Done on latest commit, thanks!

jsignell · 2020-05-20T19:44:14Z

Right, I don't think I was clear about what I meant. You can't use a bag with dask.array.random.choice; you would have to use an array:

>>> import numpy as np
>>> import dask.array as da 
>>>
>>> a = da.from_array(np.arange(5), 2) 
>>> da.random.choice(a, size=5, replace=True)

So yeah, I agree that this PR is a useful addition. I don't think you need to open a new issue for adding choices on bags. We'll just leave #4799 open for when someone figures out how to implement sampling. Thanks for your work on this!

jsignell

Thanks for your patience @eracle. @dask/maintenance this looks good to me!

Choice method on Bags implemented (dask#4799)

3cd0712

eracle mentioned this pull request May 14, 2020

Random sampling of k elements from a dask bag #4799

Closed

eracle added 4 commits May 14, 2020 15:33

fixed doctest

9ec589d

fixing CI testing dependencies

bbd666a

fixed bug with numpy on python3.6

d01ca06

fixing black

88a0b44

eracle changed the title ~~Choice method on Bags implemented (#4799)~~ Random Choice on Bags (#4799) May 14, 2020

removing numpy from bags, separating bag sampling in bag.random module, added tests

3809ada

jsignell self-requested a review May 15, 2020 15:45

jsignell requested changes May 15, 2020

eracle and others added 4 commits May 17, 2020 13:47

Update dask/bag/random.py

f751832

Co-authored-by: Julia Signell <jsignell@gmail.com>

fixes after code review, deleted sample function

f31bf97

merged

33b6fbc

fixing comma

0f41d98

jsignell requested changes May 20, 2020

improving tests, fixing replace parameter docstring

c3af402

jsignell approved these changes May 21, 2020

martindurant merged commit bd19848 into dask:master May 21, 2020

eracle deleted the features/bag_choice branch May 25, 2020 21:51

3 participants
