Fix drop_duplicates with split_out #1828
Merged
jcrist merged 2 commits into dask:master on Dec 1, 2016
Conversation
The `split_out` kwarg was added to `aca` in 303d038. It allows outputs to be split across partitions by hashing the rows, improving efficiency for large outputs. However, it defaulted to hashing all columns of the output dataframe, which didn't play well with the `subset` kwarg to `drop_duplicates`: not all duplicate rows were dropped, because rows that were duplicates only on the subset could still hash to different partitions when the non-subset columns differed. To fix this, we replace the `split_index` kwarg with `split_out_setup`, which optionally takes a function to apply to each chunk before it is hashed. `split_out_setup_kwargs` is also available to pass keywords to this function. `drop_duplicates` was modified to adjust for this change.
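As a plain-Python sketch (not dask's actual implementation) of why this matters: duplicates on the subset must hash to the same output partition, or the final per-partition dedup never sees them together. `partition_for` is a hypothetical helper for illustration only.

```python
# Illustrative sketch of partitioning-by-hash, as used by split_out.
def partition_for(row, nparts, subset=None):
    # Hash only the subset columns when given (what split_out_setup
    # enables); otherwise hash the whole row.
    cols = subset if subset is not None else sorted(row)
    return hash(tuple(row[c] for c in cols)) % nparts

a = {"x": 1, "y": "a"}
b = {"x": 1, "y": "b"}  # a duplicate of `a` on subset=["x"] only

# Subset-aware hashing sends both rows to the same partition, so the
# final per-partition drop_duplicates can see and drop the duplicate.
print(partition_for(a, 4, subset=["x"]) == partition_for(b, 4, subset=["x"]))  # True
```

With whole-row hashing, `a` and `b` may land in different partitions, and neither partition sees a duplicate to drop.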
Member
Author
cc @jreback.
mrocklin
reviewed
Dec 1, 2016
dask/dataframe/core.py
Outdated
else:
    split_out_setup = split_out_setup_kwargs = None

if 'keep' in kwargs and kwargs['keep'] is False:
Member
kwargs.get('keep', True) is False?
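The two forms behave identically; `.get()` folds the membership test and the lookup into one call. A quick check (function names here are mine, mirroring the snippet under review):

```python
# The verbose form from the diff above.
def keep_is_false_verbose(kwargs):
    return 'keep' in kwargs and kwargs['keep'] is False

# The suggested one-lookup form.
def keep_is_false_get(kwargs):
    return kwargs.get('keep', True) is False

# Equivalent for every relevant shape of kwargs.
for kw in ({}, {'keep': 'first'}, {'keep': 'last'}, {'keep': False}):
    assert keep_is_false_verbose(kw) == keep_is_false_get(kw)
```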
  token='drop-duplicates', split_every=split_every,
- split_out=split_out, split_index=False, **kwargs)
+ split_out=split_out, split_out_setup=split_out_setup,
+ split_out_setup_kwargs=split_out_setup_kwargs, **kwargs)
Member
Author
It is a bit verbose, but I doubt it'll be used that much. I think I prefer the clarity over the need to type a few more characters.
assert_eq(res2, sol)
assert res._name != res2._name

pytest.raises(NotImplementedError, lambda: d.drop_duplicates(keep=False))
Member
style nit: slight preference for context manager
with pytest.raises(NotImplementedError):
d.drop_duplicates(keep=False)
Member
OK