Skip to content

Random Choice on Bags (#4799)#6208

Merged
martindurant merged 11 commits intodask:masterfrom
eracle:features/bag_choice
May 21, 2020
Merged

Random Choice on Bags (#4799)#6208
martindurant merged 11 commits intodask:masterfrom
eracle:features/bag_choice

Conversation

@eracle
Copy link
Copy Markdown
Contributor

@eracle eracle commented May 14, 2020

  • Tests added / passed
  • Passes black dask / flake8 dask

@eracle
Copy link
Copy Markdown
Contributor Author

eracle commented May 14, 2020

I also wrote a description of the proposed solution, here.

@jsignell
Copy link
Copy Markdown
Member

Hi @eracle. Is there a reason why you wouldn't want to use dask.array.random.choice? It would be a very big and undesirable change to include numpy as a dependency of dask bag.

@jsignell
Copy link
Copy Markdown
Member

Ah sorry, @eracle. I missed the linked issue (#4799) where there was more discussion. It seems like the consensus in that issue was to try to mimic random.sample rather than numpy.random.choice

@eracle eracle changed the title Choice method on Bags implemented (#4799) Random Choice on Bags (#4799) May 14, 2020
@eracle
Copy link
Copy Markdown
Contributor Author

eracle commented May 14, 2020

Hey @jsignell , thanks for the point.
I just realized there exists a weighted sampling function outside numpy: the choices function (here's a link).
I will get rid of numpy from the bags, you're are definitely right.
Regarding which API to mimic, as you already pointed out, it doesn't make sense to re-discuss it since we already had consensus on mimic sample , btw, I made it as such since it was cool to have the replace parameter, which allowed to distinguish between with or without replacement.
At this point, since I will get rid of numpy from bags, in order to have with/without replacement functionality I will implement two functions (on a separated module, named bag.random): sample and choices.
By doing so, I will mimic 100% its API.
Thanks for the heads up.

@eracle
Copy link
Copy Markdown
Contributor Author

eracle commented May 15, 2020

As #6205 issue also have (3.8) Windows CI build failing for the same reason.
I suppose something related with how numpy dependencies are managed, probably some not fully backward-compatible change in latest releases of numpy.
I should investigate more on this.

@jsignell jsignell self-requested a review May 15, 2020 15:45
Copy link
Copy Markdown
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the change to take out numpy! This looks pretty good. I have a few comments about code style, but the biggest point is that sample does not work properly as written.

- list of k samples;
- total number of elements from where the sample was drawn;
- total number of elements to be sampled;
- boolean which is True whether the sample is with or without replacement.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return docstring here could be clearer and should be in the style:

Returns
-------

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done on latest commit, thanks!

return sampled, lx, k, replace


def _sample_reduce(reduce_iter):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function should take k and reduce directly. It seems a little convoluted to include them on the response from each map partition when they are globally defined.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done on latest commit, thanks!

>>> from dask.bag import random
>>>
>>> b = db.from_sequence(range(5), npartitions=2)
>>> len(list(random.sample(b, 3).compute()))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you are showing len here because it'll always be 3, but it'd be more useful to show an actual response and just add the skip docktest flag.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhh thanks! :-D

"""
a = db.from_sequence(range(10), partition_size=9)
s = random.sample(a, k=10)
assert len(list(s.compute())) == 10
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should have some tests for the actual contents of the returned values. At least to check that there is replacement or not depending on whether using choices or sample.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the sample method, right now. I had some troubles on testing the exact output from the choices method because of reproducibility problems.
I evaluated the possibility to add the seed function to bag.random: in order to mimic python random seed.
It could be done as a further development, but is not super straightforward since there are issues that must be resolved with it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense. I just meant something like asserting that all the values in the output are in the input bag.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I just added more specific tests on latest commit. Thanks for the advice

dask/bag/core.py Outdated

def reduction(
self, perpartition, aggregate, split_every=None, out_type=Item, name=None
self, perpartition, aggregate, split_every=None, out_type=Item, name=None,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you mean to change this file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done on latest commit, thanks!

@eracle
Copy link
Copy Markdown
Contributor Author

eracle commented May 17, 2020

Hey @jsignell Thanks a lot for the review, I will soon push a change regarding some of those, for the other, I will reply on a per line basis.

@eracle
Copy link
Copy Markdown
Contributor Author

eracle commented May 18, 2020

Hi @eracle. Is there a reason why you wouldn't want to use dask.array.random.choice? It would be a very big and undesirable change to include numpy as a dependency of dask bag.

Hey, after this message, I was asking myself if we already have the choice feature on dask, and so I tried the one from array.random you suggested:

>>> import dask.array.random as random
>>> import dask.bag as db
>>> b = db.from_sequence(range(5), npartitions=2)
>>> random.choice(b, size=5, replace=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../dask/array/random.py", line 230, in choice
    raise ValueError("a must be one dimensional")
ValueError: a must be one dimensional

Which led to an error, so it cannot be used on bag instances.

As a wrap up, by implementing python random library on bag.random and numpy's one on array.random makes sense. Unfortunately there is a lot of work to do in order to fully mimic python random, a simple example is the seed function, which would led me to implement fully reproducible unit tests on the choices outputs :-)
In order to implement it, a similar anatomy of random must be implemented. But I would go a step at the time.
Another point here is the sample function (sampling without replacement). As I already pointed out, the distributed algorithm I implemented can do it using sampling functions without replacement internally. Unfortunately I haven't found any weighted sampling functions outside numpy yet.
So this is another point for further improvements, which would then imply closing #4799.
Should I open a dedicated issue for choices function on bags? Because, right now I am ok for merging this, but don't know what's your usual workflow.
Thanks

Copy link
Copy Markdown
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small cleanup of an errant replace

Number of elements on the partition.
k : int
Number of elements to sample.
replace: boolean
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is gone now right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done on latest commit, thanks!

@jsignell
Copy link
Copy Markdown
Member

Right I don't think I was clear about what I meant. You can't use a bag with dask.array.random.choice you would have to use an array:

>>> import numpy as np
>>> import dask.array as da 
>>>
>>> a = da.from_array(np.arange(5), 2) 
>>> da.random.choice(a, size=5, replace=True) 

So yeah, I agree that this PR is a useful addition. I don't think you need to open a new issue for adding choices on bags. We'll just leave #4799 open for when someone figures out how to implement sampling. Thanks for your work on this!

Copy link
Copy Markdown
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience @eracle. @dask/maintenance this looks good to me!

@martindurant martindurant merged commit bd19848 into dask:master May 21, 2020
@eracle eracle deleted the features/bag_choice branch May 25, 2020 21:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants