[DOC] Add docstring for split_out and split_every in dask groupby-aggregate API #6386

Open
VibhuJawa opened this issue Jul 9, 2020 · 3 comments
Labels: dataframe, documentation (Improve or add to documentation), good first issue (Clearly described and easy to accomplish. Good for beginners to the project.)

Comments

@VibhuJawa
Contributor

I think it might be helpful to add docstrings for `split_out` and `split_every` in the dask groupby-aggregate API.

We can probably add something like this for `split_out`:

`split_out`: Number of output partitions for groupby-like aggregations (defaults to 1)
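To illustrate what "number of outputs" means here, below is a minimal pure-Python sketch, not dask's implementation. It assumes group keys are assigned to output partitions by hashing the key; the function name `groupby_sum_split_out` and the sample data are hypothetical.

```python
from collections import defaultdict


def groupby_sum_split_out(chunks, split_out=1):
    """Sum values per key, spreading the result over ``split_out`` partitions.

    Conceptual sketch only: each key is routed to an output partition by
    ``hash(key) % split_out``, so a given key always ends up in exactly
    one partition.
    """
    outputs = [defaultdict(int) for _ in range(split_out)]
    for chunk in chunks:
        for key, value in chunk:
            outputs[hash(key) % split_out][key] += value
    return [dict(part) for part in outputs]


# Three input chunks of (key, value) pairs. Integer keys keep hashing
# deterministic across runs.
chunks = [[(1, 10), (2, 20)], [(1, 1), (3, 30)], [(4, 40), (2, 2)]]

single = groupby_sum_split_out(chunks, split_out=1)  # one output partition
double = groupby_sum_split_out(chunks, split_out=2)  # two output partitions
```

With `split_out=1` every group lands in a single result; with a larger value the same groups are spread over several smaller results, which is the trade-off the proposed docstring would describe.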

And use the following for `split_every`:

split_every: int >= 2 or dict(axis: int), optional
Determines the depth of the recursive aggregation. If set to a value
greater than or equal to the number of input chunks, the aggregation
will be performed in two steps, one ``chunk`` function per input chunk
and a single ``aggregate`` function at the end. If set to less than
that, an intermediate ``combine`` function will be used, so that any one
``combine`` or ``aggregate`` function has no more than ``split_every``
inputs. The depth of the aggregation graph will be
:math:`log_{split_every}(input chunks along reduced axes)`. Setting to
a low value can reduce cache size and network transfers, at the cost of
more CPU and a larger dask graph.
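The chunk/combine/aggregate scheme described in this docstring can be sketched in plain Python. This is a rough illustration under stated assumptions, not dask's code: the helper `tree_reduce` and its return value (the result plus the number of intermediate combine rounds) are hypothetical.

```python
def tree_reduce(chunks, chunk_fn, combine_fn, aggregate_fn, split_every):
    """Reduce ``chunks`` so that no step takes more than ``split_every`` inputs.

    Returns the final result and the number of intermediate combine rounds.
    """
    if split_every < 2:
        raise ValueError("split_every must be >= 2")

    # Step 1: one chunk function call per input chunk.
    parts = [chunk_fn(chunk) for chunk in chunks]

    # Step 2: while too many parts remain for a single aggregate call,
    # combine them in groups of at most ``split_every``.
    combine_rounds = 0
    while len(parts) > split_every:
        parts = [
            combine_fn(parts[i:i + split_every])
            for i in range(0, len(parts), split_every)
        ]
        combine_rounds += 1

    # Step 3: a single aggregate call over the remaining parts.
    return aggregate_fn(parts), combine_rounds


# Eight input chunks holding the numbers 1..16.
chunks = [[2 * i + 1, 2 * i + 2] for i in range(8)]

# split_every=2: 8 -> 4 -> 2 parts, then the final aggregate.
deep_total, deep_rounds = tree_reduce(chunks, sum, sum, sum, split_every=2)

# split_every=8: no combine step, only chunk and aggregate.
flat_total, flat_rounds = tree_reduce(chunks, sum, sum, sum, split_every=8)
```

Both calls produce the same total; the low `split_every` adds combine rounds (more tasks in the graph) while keeping each step's input count small.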

Happy to open a PR if the `split_every` docstring above is correct for group-by and we agree it's fine to add both docstrings wherever `split_every`/`split_out` appear.

@gforsyth
Contributor

Hey @VibhuJawa -- I'd recommend you open a PR. It's easier for us to leave inline comments that way, rather than iterating within an issue.

@VibhuJawa
Contributor Author

> Hey @VibhuJawa -- I'd recommend you open a PR. It's easier for us to leave inline comments that way, rather than iterating within an issue.

Sure, I will do that.

@adonig

adonig commented Apr 22, 2021

The same applies to drop_duplicates.

@GenevieveBuckley added the dataframe, documentation, and good first issue labels on Oct 13, 2021