Fix drop_duplicates with split_out #1828
Merged
jcrist merged 2 commits into dask:master on Dec 1, 2016
Conversation
The `split_out` kwarg was added to `aca` in 303d038. It allows outputs to be split across partitions by hashing the rows, improving efficiency for large outputs. However, it defaulted to hashing all columns of the output dataframe, which didn't play well with the `subset` kwarg to `drop_duplicates`: not all duplicate rows were dropped, because rows that were duplicates only on the subset could still hash to different partitions when the non-subset columns differed. To fix this, we replace the `split_index` kwarg with `split_out_setup`, which optionally takes a function to apply to each chunk before it is hashed. `split_out_setup_kwargs` is also available to pass keywords to this function. `drop_duplicates` was modified to adjust for this change.
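As a plain-Python sketch (not dask's actual implementation) of why this matters: duplicates on the subset must hash to the same output partition, or the final per-partition dedup never sees them together. `partition_for` is a hypothetical helper for illustration only.

```python
# Illustrative sketch of partitioning-by-hash, as used by split_out.
def partition_for(row, nparts, subset=None):
    # Hash only the subset columns when given (what split_out_setup
    # enables); otherwise hash the whole row.
    cols = subset if subset is not None else sorted(row)
    return hash(tuple(row[c] for c in cols)) % nparts

a = {"x": 1, "y": "a"}
b = {"x": 1, "y": "b"}  # a duplicate of `a` on subset=["x"] only

# Subset-aware hashing sends both rows to the same partition, so the
# final per-partition drop_duplicates can see and drop the duplicate.
print(partition_for(a, 4, subset=["x"]) == partition_for(b, 4, subset=["x"]))  # True
```

With whole-row hashing, `a` and `b` may land in different partitions, and neither partition sees a duplicate to drop.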
Member
Author
cc @jreback.
mrocklin
reviewed
Dec 1, 2016
dask/dataframe/core.py
Outdated
else:
    split_out_setup = split_out_setup_kwargs = None

if 'keep' in kwargs and kwargs['keep'] is False:
Member
kwargs.get('keep', True) is False?
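The two forms behave identically; `.get()` folds the membership test and the lookup into one call. A quick check (function names here are mine, mirroring the snippet under review):

```python
# The verbose form from the diff above.
def keep_is_false_verbose(kwargs):
    return 'keep' in kwargs and kwargs['keep'] is False

# The suggested one-lookup form.
def keep_is_false_get(kwargs):
    return kwargs.get('keep', True) is False

# Equivalent for every relevant shape of kwargs.
for kw in ({}, {'keep': 'first'}, {'keep': 'last'}, {'keep': False}):
    assert keep_is_false_verbose(kw) == keep_is_false_get(kw)
```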
  token='drop-duplicates', split_every=split_every,
- split_out=split_out, split_index=False, **kwargs)
+ split_out=split_out, split_out_setup=split_out_setup,
+ split_out_setup_kwargs=split_out_setup_kwargs, **kwargs)
Member
Author
It is a bit verbose, but I doubt it'll be used that much. I think I prefer the clarity over the need to type a few more characters.
assert_eq(res2, sol)
assert res._name != res2._name

pytest.raises(NotImplementedError, lambda: d.drop_duplicates(keep=False))
Member
style nit: slight preference for context manager
with pytest.raises(NotImplementedError):
d.drop_duplicates(keep=False)
Member
OK