Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add key argument to Bag.distinct #4423

Merged
merged 11 commits into from May 1, 2019

Conversation

Projects
None yet
4 participants
@dsevero
Copy link
Contributor

commented Jan 25, 2019

  • Tests added / passed
  • Passes flake8 dask

closes #2493

A few things to discuss:

  1. Should we change the implementation to use fold instead of reduction?
  2. Is it ok to default key to toolz.identity instead of None + conditional logic?

dsevero added some commits Jan 25, 2019

@dsevero

This comment has been minimized.

Copy link
Contributor Author

commented Jan 26, 2019

@mrocklin can't quite get my head around why AppVeyor is failing (also happened in #4427).

@mrocklin
Copy link
Member

left a comment

Thanks for the contribution @daniel-severo ! I've added a few comments below. I'm looking forward to seeing this in.

Show resolved Hide resolved dask/bag/tests/test_bag.py Outdated
Show resolved Hide resolved dask/bag/core.py Outdated
Show resolved Hide resolved dask/bag/core.py Outdated

dsevero added some commits Jan 28, 2019

@mrocklin
Copy link
Member

left a comment

Looking good . Some additional feedback below.

Show resolved Hide resolved dask/bag/core.py Outdated
Show resolved Hide resolved dask/bag/core.py Outdated
@@ -1630,8 +1648,14 @@ def from_delayed(values):
return Bag(graph, name, len(values))


def merge_distinct(seqs):
return set().union(*seqs)
@curry

This comment has been minimized.

Copy link
@mrocklin

mrocklin Jan 28, 2019

Member

It would be good to avoid curry here if possible. Currying can easily generate complex situations. I recommend either using functools.partial or passing keyword arguments explicitly (this would require changes to reduction to optionally support new perpartition_kwargs and aggregate_kwargs keyword arguments. Short term I recommend going with partial because it's probably easier.

"""
return self.reduction(set, merge_distinct, out_type=Bag,
key_func = key if callable(key) or key is None else lambda x: x[key]
perpartition = lambda seq: list(toolz.unique(seq, key=key_func))

This comment has been minimized.

Copy link
@mrocklin

mrocklin Jan 28, 2019

Member

We'd like to avoid the dynamic creation of functions here. They are somewhat costly to serialize. I recommend making another top level function like merge_distinct

"""
return self.reduction(set, merge_distinct, out_type=Bag,
key_func = key if callable(key) or key is None else lambda x: x[key]

This comment has been minimized.

Copy link
@mrocklin

mrocklin Jan 28, 2019

Member

Similarly I would handle this within the perpartition or merge_distinct functions in order to avoid passing around lambdas in the graph.

@curry
def merge_distinct(seqs, key=None):
if key is None:
return list(toolz.unique(toolz.concat(seqs)))

This comment has been minimized.

Copy link
@mrocklin

mrocklin Jan 28, 2019

Member

If cytoolz is around then we'll probably want to use cytoolz.unique instead. See the toolz and cytoolz imports at the top of this file.

mrocklin and others added some commits Jan 28, 2019

Update dask/bag/core.py
Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>
Update dask/bag/core.py
Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>
@mrocklin

This comment has been minimized.

Copy link
Member

commented Jan 31, 2019

Any thoughts on the other comments @daniel-severo ? No time pressure, I just wanted to make sure that you had seen them.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

@dsevero any update here? Are you comfortable making the changes here that would reduce the creation of dynamic functions?

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

^ ping should have been to @dsevero

@jcrist jcrist self-assigned this Apr 30, 2019

@jcrist

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

I pushed some updates (hope you don't mind @dsevero). Should be good to go on tests pass.

@jcrist jcrist merged commit 78c412f into dask:master May 1, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@dsevero dsevero deleted the dsevero:issue-2493 branch May 10, 2019

@dsevero

This comment has been minimized.

Copy link
Contributor Author

commented May 10, 2019

Sorry guys, I had to change my github username (hence the mix-up) and I was away from development for a while due to personal reasons (not related to the name change).

Not at all @jcrist, thanks for the help! Sorry to keep you guys waiting.

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Add key argument to Bag.distinct (dask#4423)
* Comply with flake8

* Add docs to Bag.distinct

* Add toolz.compose to Bag.distinct

* Attempt to fix doctests

* Attempt to fix doctests

* Use assert_eq in bag tests

* Removed toolz drom merge_distinct and Bag.distinct

* Removed unused import

* Update dask/bag/core.py

Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>

* Update dask/bag/core.py

Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>

* fixups

Thomas-Z added a commit to Thomas-Z/dask that referenced this pull request May 17, 2019

Add key argument to Bag.distinct (dask#4423)
* Comply with flake8

* Add docs to Bag.distinct

* Add toolz.compose to Bag.distinct

* Attempt to fix doctests

* Attempt to fix doctests

* Use assert_eq in bag tests

* Removed toolz drom merge_distinct and Bag.distinct

* Removed unused import

* Update dask/bag/core.py

Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>

* Update dask/bag/core.py

Co-Authored-By: daniel-severo <danielsouzasevero@gmail.com>

* fixups
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.