Add key argument to Bag.distinct #4423
Thanks for the contribution @daniel-severo! I've added a few comments below. I'm looking forward to seeing this in.
Looking good. Some additional feedback below.
dask/bag/core.py
@@ -1630,8 +1648,14 @@ def from_delayed(values):
    return Bag(graph, name, len(values))


def merge_distinct(seqs):
    return set().union(*seqs)
@curry
It would be good to avoid curry here if possible. Currying can easily generate complex situations. I recommend either using functools.partial or passing keyword arguments explicitly (this would require changes to reduction to optionally support new perpartition_kwargs and aggregate_kwargs keyword arguments). Short term I recommend going with partial because it's probably easier.
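A minimal sketch of the functools.partial approach suggested above, using only the standard library. The merge_distinct body here is a hypothetical stand-in, not the code from this PR; the point is only how partial binds key ahead of time without curry.

```python
from functools import partial


def merge_distinct(seqs, key=None):
    """Hypothetical stand-in: merge per-partition results, keeping the
    first item seen for each key."""
    seen = set()
    merged = []
    for seq in seqs:
        for item in seq:
            marker = item if key is None else key(item)
            if marker not in seen:
                seen.add(marker)
                merged.append(item)
    return merged


# Bind the keyword argument once. The result is a plain callable that
# takes only `seqs`, which is the shape an aggregate function needs.
merge_by_length = partial(merge_distinct, key=len)

partitions = [["a", "bb"], ["cc", "ddd"]]
result = merge_by_length(partitions)
print(result)
```

Unlike a curried function, the partial object exposes what it wraps through its func and keywords attributes, which makes it easier to inspect.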
dask/bag/core.py
        """
        return self.reduction(set, merge_distinct, out_type=Bag,
        key_func = key if callable(key) or key is None else lambda x: x[key]
        perpartition = lambda seq: list(toolz.unique(seq, key=key_func))
We'd like to avoid the dynamic creation of functions here. They are somewhat costly to serialize. I recommend making another top-level function like merge_distinct.
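As a rough illustration of that suggestion, the per-partition lambda can become a named module-level function. The name distinct_perpartition is hypothetical and the body uses only the standard library rather than toolz. The standard pickle module stores functions by their qualified name, so a named top-level function can be referenced cheaply, whereas a lambda has no importable name.

```python
def distinct_perpartition(seq, key=None):
    """Hypothetical module-level replacement for the inline lambda:
    return the distinct items of one partition, in first-seen order."""
    seen = set()
    out = []
    for item in seq:
        marker = item if key is None else key(item)
        if marker not in seen:
            seen.add(marker)
            out.append(item)
    return out


# For comparison only: an anonymous function doing the same job.
anonymous = lambda seq: list(dict.fromkeys(seq))

# The named function can be looked up by name; the lambda cannot.
print(distinct_perpartition.__name__)
print(anonymous.__name__)

partition = [3, 1, 3, 2, 1]
deduped = distinct_perpartition(partition)
print(deduped)
```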
dask/bag/core.py
        """
        return self.reduction(set, merge_distinct, out_type=Bag,
        key_func = key if callable(key) or key is None else lambda x: x[key]
Similarly I would handle this within the perpartition or merge_distinct functions in order to avoid passing around lambdas in the graph.
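One way to read this comment, sketched with hypothetical names (resolve_key, unique_in_partition) and the standard library only: the function itself decides whether key is None, a callable, or an index, so the caller passes plain data such as a string and never has to build a lambda.

```python
def resolve_key(item, key):
    """Hypothetical helper: compute the comparison value for one item."""
    if key is None:
        return item          # compare items directly
    if callable(key):
        return key(item)     # user-supplied function
    return item[key]         # otherwise treat key as an index or dict key


def unique_in_partition(seq, key=None):
    """Distinct items of one partition, keeping the first per key."""
    seen = set()
    out = []
    for item in seq:
        marker = resolve_key(item, key)
        if marker not in seen:
            seen.add(marker)
            out.append(item)
    return out


records = [
    {"name": "Alice", "id": 1},
    {"name": "Bob", "id": 2},
    {"name": "Alice", "id": 3},
]
by_name = unique_in_partition(records, key="name")
by_length = unique_in_partition(["a", "b", "cc"], key=len)
plain = unique_in_partition([1, 2, 2, 1])
print(by_name, by_length, plain)
```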
dask/bag/core.py
@curry
def merge_distinct(seqs, key=None):
    if key is None:
        return list(toolz.unique(toolz.concat(seqs)))
If cytoolz is around then we'll probably want to use cytoolz.unique instead. See the toolz and cytoolz imports at the top of this file.
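The usual shape of such a fallback import is sketched below. This is an assumption about the pattern, not a copy of the imports in dask/bag/core.py. The innermost pure-Python unique is only there so the sketch runs when neither package is installed.

```python
from operator import itemgetter

try:
    from cytoolz import unique  # compiled implementation, if installed
except ImportError:
    try:
        from toolz import unique  # pure-Python implementation
    except ImportError:
        # Last-resort stand-in for this sketch only, not part of the PR.
        def unique(seq, key=None):
            seen = set()
            for item in seq:
                marker = item if key is None else key(item)
                if marker not in seen:
                    seen.add(marker)
                    yield item


words = ["apple", "avocado", "banana", "blueberry"]
# Keep the first word seen for each initial letter.
first_per_letter = list(unique(words, key=itemgetter(0)))
print(first_per_letter)
```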
Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>
Any thoughts on the other comments @daniel-severo? No time pressure, I just wanted to make sure that you had seen them.
@dsevero any update here? Are you comfortable making the changes here that would reduce the creation of dynamic functions?
^ ping should have been to @dsevero
I pushed some updates (hope you don't mind @dsevero). Should be good to go once tests pass.
Sorry guys, I had to change my GitHub username (hence the mix-up) and I was away from development for a while due to personal reasons (not related to the name change). Not at all @jcrist, thanks for the help! Sorry to keep you guys waiting.
* Comply with flake8
* Add docs to Bag.distinct
* Add toolz.compose to Bag.distinct
* Attempt to fix doctests
* Attempt to fix doctests
* Use assert_eq in bag tests
* Removed toolz from merge_distinct and Bag.distinct
* Removed unused import
* Update dask/bag/core.py
  Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>
* Update dask/bag/core.py
  Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>
* fixups
flake8 dask
closes #2493
A few things to discuss:
* fold instead of reduction?
* key to toolz.identity instead of None + conditional logic?
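To make the second question concrete, here is a small comparison of the two defaults. Both functions and the local identity helper are hypothetical; identity stands in for toolz.identity so the sketch needs no third-party imports.

```python
def identity(x):
    """Stand-in for toolz.identity."""
    return x


def distinct_with_none(seq, key=None):
    """Default of None: needs a branch to decide how to compare items."""
    seen = set()
    out = []
    for item in seq:
        marker = item if key is None else key(item)
        if marker not in seen:
            seen.add(marker)
            out.append(item)
    return out


def distinct_with_identity(seq, key=identity):
    """Default of identity: no branch, key is always called."""
    seen = set()
    out = []
    for item in seq:
        marker = key(item)
        if marker not in seen:
            seen.add(marker)
            out.append(item)
    return out


data = [1, 2, 2, 3, 1]
none_default = distinct_with_none(data)
identity_default = distinct_with_identity(data)
by_abs = distinct_with_identity([-1, 1, 2], key=abs)
print(none_default, identity_default, by_abs)
```

The identity default removes the conditional at the cost of one extra function call per item, while the None default lets the implementation skip the call entirely when no key is given.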