Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement min_count for the DataFrame methods sum and prod #4090

Merged
merged 6 commits into from
Oct 15, 2018

Conversation

bartbroere
Copy link
Contributor

@bartbroere bartbroere commented Oct 12, 2018

As mentioned in #3981 dask DataFrames do not support the relatively new min_count parameter on the methods sum and prod. I've tried to address that with this pull request.

  • Tests passed
  • Tests added
  • Passes flake8 dask

@bartbroere
Copy link
Contributor Author

Tests for Python 3.5 have failed, but it does not seem to be directly related to changes in this pull request. It was test_num_workers_config[processes-3] that failed.

@mrocklin
Copy link
Member

Thanks for the contribution @bartbroere .

CC'ing @jrbourbeau for the failing test, I suspect that this is due to a recent PR.

Speaking of tests ... @bartbroere would you be willing to add tests that verify the functionality that you've added in this PR? There should be plenty of examples in dask/dataframe/tests/test_dataframe.py

@jrbourbeau
Copy link
Member

Thanks for the PR @bartbroere! The failing test is not related to changes in this PR.

Hmmmm looks like the failing test was just added in #4086. I'll open up a new issue for this.

@bartbroere
Copy link
Contributor Author

Thanks for the quick reply and for looking into the failing tests. When developing I had a scratch file that I can convert to a proper test for the new functionality.

for axis in axes:
for min_count in [0, 1, 2, 3]:
assert_eq(df.sum(min_count=min_count, axis=axis),
ddf.sum(min_count=min_count, axis=axis).compute())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to call .compute() here. See the bottom section of http://docs.dask.org/en/latest/develop.html#test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend that you also test prod here, perhaps just with a second assert_eq statement in the same for loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for the hint. I have made the tests without the call to .compute() now.

@mrocklin
Copy link
Member

This looks fine to me. Merging in a few hours if there are no further comments (cc @TomAugspurger )

Copy link
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame changes look good. I'm not sure why the changes from #4093 are still showing up in this diff though.

@mrocklin
Copy link
Member

I've rebased on master and removed the errant commit. I've force-pushed to your branch @bartbroere . I hope that you don't mind. Merging on a green light from CI.

@bartbroere
Copy link
Contributor Author

I don't mind the force-push of course. The diff now makes more sense. Good to hear it's probably ready to merge.

@TomAugspurger TomAugspurger merged commit 692572c into dask:master Oct 15, 2018
@TomAugspurger
Copy link
Member

Thanks @bartbroere!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants