Skip to content

Fix drop_duplicates with split_out#1828

Merged
jcrist merged 2 commits intodask:masterfrom
jcrist:drop-duplicates-split-out
Dec 1, 2016
Merged

Fix drop_duplicates with split_out#1828
jcrist merged 2 commits intodask:masterfrom
jcrist:drop-duplicates-split-out

Conversation

@jcrist
Copy link
Copy Markdown
Member

@jcrist jcrist commented Dec 1, 2016

The split_out kwarg was added to aca in 303d038. This allows for
outputs to be split by hashing the rows, improving efficiency for large
outputs. However, this defaulted to hashing all columns in the output
dataframe, which didn't play well with the subset kwarg to
drop_duplicates. This was resulting in not all duplicate rows being
dropped, as they'd hash to different partitions since the hashed columns
weren't being subset.

To fix this, we replace the split_index kwarg with split_out_setup,
which optionally takes a function to apply to each chunk before being
hashed. split_out_setup_kwargs is also available to pass keywords to
this function. drop_duplicates was modified to adjust for this change.

The `split_out` kwarg was added to `aca` in 303d038. This allows for
outputs to be split by hashing the rows, improving efficiency for large
outputs. However, this defaulted to hashing all columns in the output
dataframe, which didn't play well with the `subset` kwarg to
`drop_duplicates`. This was resulting in not all duplicate rows being
dropped, as they'd hash to different partitions since the hashed columns
weren't being subset.

To fix this, we replace the `split_index` kwarg with `split_out_setup`,
which optionally takes a function to apply to each chunk before being
hashed. `split_out_setup_kwargs` is also available to pass keywords to
this function. `drop_duplicates` was modified to adjust for this change.
@jcrist
Copy link
Copy Markdown
Member Author

jcrist commented Dec 1, 2016

cc @jreback.

Copy link
Copy Markdown
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

else:
split_out_setup = split_out_setup_kwargs = None

if 'keep' in kwargs and kwargs['keep'] is False:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kwargs.get('keep', True) is False?

token='drop-duplicates', split_every=split_every,
split_out=split_out, split_index=False, **kwargs)
split_out=split_out, split_out_setup=split_out_setup,
split_out_setup_kwargs=split_out_setup_kwargs, **kwargs)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oof, split_out_setup_kwargs

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit verbose, but I doubt it'll be used that much. I think I prefer the clarity over the need to type a few more characters.

assert_eq(res2, sol)
assert res._name != res2._name

pytest.raises(NotImplementedError, lambda: d.drop_duplicates(keep=False))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: slight preference for context manager

with pytest.raises(NotImplementedError):
    d.drop_duplicates(keep=False)

@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Dec 1, 2016 via email

@jcrist jcrist merged commit e73546a into dask:master Dec 1, 2016
@jcrist jcrist deleted the drop-duplicates-split-out branch December 1, 2016 21:26
@sinhrks sinhrks added this to the 0.13.0 milestone Jan 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants