[Experimental] Add cardinality argument to GroupBy.aggregate #9446

Closed
wants to merge 2 commits into from

rjzamora
Member

Adds an optional cardinality argument to GroupBy.aggregate:

"""
cardinality : float or "infer", optional
    Approximate ratio of aggregated data size with respect to the
    initial data size. If specified, this ratio will be used to override
    the defaults for ``split_every``, ``split_out``, and ``shuffle``.
    If ``"infer"`` is specified, the first non-empty partition will be
    used to estimate the global cardinality ratio.
"""
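
As a rough illustration of what the `"infer"` option describes, the sketch below estimates the ratio from the first non-empty partition as (number of distinct group keys) / (number of rows). It is not the code from this PR: `infer_cardinality` is a hypothetical helper, partitions are modelled as plain sequences of key values rather than dataframe partitions, and falling back to 1.0 for empty input is an assumption.

```python
from collections.abc import Hashable, Iterable, Sequence


def infer_cardinality(partitions: Iterable[Sequence[Hashable]]) -> float:
    """Estimate the aggregated-size / initial-size ratio.

    Hypothetical helper: each partition is a sequence of group-key
    values. Only the first non-empty partition is inspected.
    """
    for keys in partitions:
        if len(keys) > 0:
            # One output row per distinct key, one input row per key value.
            return len(set(keys)) / len(keys)
    # Assumption: with no data to sample, treat every row as its own group.
    return 1.0


# The first partition is empty, so the second one drives the estimate:
# 2 distinct keys over 4 rows.
ratio = infer_cardinality([[], ["a", "a", "b", "b"], ["c"]])
```

An estimate taken from a single partition can be far off when keys are sorted or clustered by partition, which is one reason the heuristics built on it may need benchmarking.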

This PR is related to the discussion in #9406 (on setting good defaults automatically). Note that the specific heuristics used to set split_every and split_out in this PR may need to be tweaked; more benchmarking is necessary.

  • Tests added / passed
  • Passes pre-commit run --all-files

@rjzamora rjzamora added the dataframe and enhancement (Improve existing functionality or make things work better) labels Aug 31, 2022
),
self.obj.npartitions,
)
split_every = split_every or min(max(int(1.0 / cardinality), 2), 32)
Collaborator

I think it's important to allow split_every to be one (cf #9406 (comment)), so that there isn't any repartitioning to fewer partitions before shuffling. I found this makes a significant difference for the case when there are fewer, larger partitions.

Otherwise, this looks like a reasonable heuristic to me.
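
To make the suggestion concrete, here is a minimal sketch of the reviewed expression with the lower clamp pulled out as a parameter. The function name `suggested_split_every` and the `lower` parameter are hypothetical and only for illustration; `lower=2` reproduces the expression in the diff, and `lower=1` is the variant proposed here.

```python
def suggested_split_every(cardinality: float, lower: int = 2, upper: int = 32) -> int:
    """Clamp int(1 / cardinality) to the range [lower, upper].

    Hypothetical helper mirroring the reviewed line:
        min(max(int(1.0 / cardinality), 2), 32)
    """
    if cardinality <= 0:
        raise ValueError("cardinality must be positive")
    return min(max(int(1.0 / cardinality), lower), upper)


# High-cardinality case (almost no reduction): the current clamp still
# forces a pairwise reduction, while lower=1 allows split_every == 1.
current = suggested_split_every(1.0)             # lower bound of 2
proposed = suggested_split_every(1.0, lower=1)   # lower bound of 1

# Low-cardinality cases are unaffected by the change.
quarter = suggested_split_every(0.25)
tiny = suggested_split_every(0.001)
```

Only inputs with `cardinality > 0.5` change behavior under `lower=1`, since those are the cases where `int(1.0 / cardinality)` evaluates to 1.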

@ian-r-rose ian-r-rose mentioned this pull request Sep 1, 2022
3 tasks
@rjzamora
Member Author

rjzamora commented Sep 7, 2022

Closing this for now.

@rjzamora rjzamora closed this Sep 7, 2022