Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add key argument to Bag.distinct #4423

Merged
merged 11 commits into from
May 1, 2019
Merged

Add key argument to Bag.distinct #4423

merged 11 commits into from
May 1, 2019

Conversation

dsevero
Copy link
Contributor

@dsevero dsevero commented Jan 25, 2019

  • Tests added / passed
  • Passes flake8 dask

closes #2493

A few things to discuss:

  1. Should we change the implementation to use fold instead of reduction?
  2. Is it ok to default key to toolz.identity instead of None + conditional logic?

@dsevero
Copy link
Contributor Author

dsevero commented Jan 26, 2019

@mrocklin can't quite get my head around why AppVeyor is failing (also happened in #4427).

Copy link
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @daniel-severo ! I've added a few comments below. I'm looking forward to seeing this in.

dask/bag/tests/test_bag.py Outdated Show resolved Hide resolved
dask/bag/core.py Outdated Show resolved Hide resolved
dask/bag/core.py Outdated Show resolved Hide resolved
Copy link
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good . Some additional feedback below.

dask/bag/core.py Outdated Show resolved Hide resolved
dask/bag/core.py Outdated Show resolved Hide resolved
dask/bag/core.py Outdated
@@ -1630,8 +1648,14 @@ def from_delayed(values):
return Bag(graph, name, len(values))


def merge_distinct(seqs):
return set().union(*seqs)
@curry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to avoid curry here if possible. Currying can easily generate complex situations. I recommend either using functools.partial or passing keyword arguments explicitly (this would require changes to reduction to optionally support new perpartition_kwargs and aggregate_kwargs keyword arguments. Short term I recommend going with partial because it's probably easier.

dask/bag/core.py Outdated
"""
return self.reduction(set, merge_distinct, out_type=Bag,
key_func = key if callable(key) or key is None else lambda x: x[key]
perpartition = lambda seq: list(toolz.unique(seq, key=key_func))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd like to avoid the dynamic creation of functions here. They are somewhat costly to serialize. I recommend making another top level function like merge_distinct

dask/bag/core.py Outdated
"""
return self.reduction(set, merge_distinct, out_type=Bag,
key_func = key if callable(key) or key is None else lambda x: x[key]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly I would handle this within the perpartition or merge_distinct functions in order to avoid passing around lambdas in the graph.

dask/bag/core.py Outdated
@curry
def merge_distinct(seqs, key=None):
if key is None:
return list(toolz.unique(toolz.concat(seqs)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If cytoolz is around then we'll probably want to use cytoolz.unique instead. See the toolz and cytoolz imports at the top of this file.

mrocklin and others added 2 commits January 28, 2019 20:40
Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>
Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>
@mrocklin
Copy link
Member

Any thoughts on the other comments @daniel-severo ? No time pressure, I just wanted to make sure that you had seen them.

@mrocklin
Copy link
Member

@dsevero any update here? Are you comfortable making the changes here that would reduce the creation of dynamic functions?

@martindurant
Copy link
Member

^ ping should have been to @dsevero

@jcrist jcrist self-assigned this Apr 30, 2019
@jcrist
Copy link
Member

jcrist commented Apr 30, 2019

I pushed some updates (hope you don't mind @dsevero). Should be good to go on tests pass.

@jcrist jcrist merged commit 78c412f into dask:master May 1, 2019
@dsevero dsevero deleted the issue-2493 branch May 10, 2019 16:52
@dsevero
Copy link
Contributor Author

dsevero commented May 10, 2019

Sorry guys, I had to change my github username (hence the mix-up) and I was away from development for a while due to personal reasons (not related to the name change).

Not at all @jcrist, thanks for the help! Sorry to keep you guys waiting.

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019
* Comply with flake8

* Add docs to Bag.distinct

* Add toolz.compose to Bag.distinct

* Attempt to fix doctests

* Attempt to fix doctests

* Use assert_eq in bag tests

* Removed toolz drom merge_distinct and Bag.distinct

* Removed unused import

* Update dask/bag/core.py

Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>

* Update dask/bag/core.py

Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>

* fixups
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ability to drop duplicates on Bag based on subset of data
4 participants