Skip to content

Allow split_out to be None, which then defaults to 1 in groupby().aggregate()#9491

Merged
jrbourbeau merged 3 commits intodask:mainfrom
ian-r-rose:allow-split-out-none
Sep 14, 2022
Merged

Allow split_out to be None, which then defaults to 1 in groupby().aggregate()#9491
jrbourbeau merged 3 commits intodask:mainfrom
ian-r-rose:allow-split-out-none

Conversation

@ian-r-rose
Copy link
Copy Markdown
Collaborator

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ian-r-rose! Would you mind adding a small test (e.g. similar to the example in #9490)?

@jrbourbeau
Copy link
Copy Markdown
Member

Also, should we deprecate allowing None here?

@ian-r-rose
Copy link
Copy Markdown
Collaborator Author

Thanks @ian-r-rose! Would you mind adding a small test (e.g. similar to the example in #9490)?

I was a bit uncomfortable adding a test since I wasn't sure if we actually wanted to support None here. I'd be curious to hear from @wence- if there are any things that would become hugely onerous if we dropped that support.

@jrbourbeau
Copy link
Copy Markdown
Member

Agreed, it'd be good to get thoughts from @wence- on his particular use case. That said, do we advertise anywhere that split_out can be None? From a brief skim of the API docs, it looks like we're consistent in saying it's an int

@ian-r-rose
Copy link
Copy Markdown
Collaborator Author

I only see None documented in one place (sort of), which is the docstring for groupby.aggregate: https://docs.dask.org/en/latest/generated/dask.dataframe.groupby.DataFrameGroupBy.aggregate.html#dask.dataframe.groupby.DataFrameGroupBy.aggregate

Even then, it's saying it's an optional arg, which is a bit ambiguous.

@ian-r-rose
Copy link
Copy Markdown
Collaborator Author

The implementation for apply_concat_apply clearly anticipates a None value, but I might consider that an internal API.

@wence-
Copy link
Copy Markdown
Contributor

wence- commented Sep 14, 2022

The implementation for apply_concat_apply clearly anticipates a None value, but I might consider that an internal API.

That also has a bug :( (sort=True, split_out=None => TypeError)

@wence-
Copy link
Copy Markdown
Contributor

wence- commented Sep 14, 2022

Thanks @ian-r-rose! Would you mind adding a small test (e.g. similar to the example in #9490)?

I was a bit uncomfortable adding a test since I wasn't sure if we actually wanted to support None here. I'd be curious to hear from @wence- if there are any things that would become hugely onerous if we dropped that support.

I don't think it would be problematic. I noticed because in dask_cudf we explicitly test groupby().agg with split_out=None. But that is in tests so I can remove it.

The maintenance issue here is somehow that there are lots of public APIs that call into the few "semi-private" APIs. So there are lots of places for the defaults to go out of whack.

One way to partially solve this is to not provide default values for arguments to internal APIs, and promote the default handling to the public callers. I generally think that internal APIs should not do argument defaulting, since all that business logic handling is not really germane to the usage of the function.

It does make calling these functions slightly more verbose (especially given that they all have many arguments), though one can utilise my favourite python argument list approach of required keyword-only args:

def aca(*, split_out, split_every, ...)

@ian-r-rose
Copy link
Copy Markdown
Collaborator Author

though one can utilise my favourite python argument list approach of required keyword-only args:

If I ever have a week of nothing to do, I'd love to make 90% of the APIs here make use of keyword-only args.

@wence-
Copy link
Copy Markdown
Contributor

wence- commented Sep 14, 2022

Also, should we deprecate allowing None here?

I am inclined to say yes. I will change things on the dask_cudf side.

wence- added a commit to wence-/cudf that referenced this pull request Sep 14, 2022
Discussion in dask/dask#9490 and dask/dask#9491 suggests that
split_out=None as a default value was never really intended and is
likely to be deprecated.
@rjzamora
Copy link
Copy Markdown
Member

My concern with dropping support for split_out=None is that we may find a reasonable way to automatically infer a good value for split_out in the future. For example, perhaps the Groupby class itself should optionally infer group cardinality up-front. In that case, we would probably not want to override split_out if it was already set to a specific integer value.

This hypothetical concern may not be realistic.

@ian-r-rose
Copy link
Copy Markdown
Collaborator Author

My concern with dropping support for split_out=None is that we may find a reasonable way to automatically infer a good value for split_out in the future. For example, perhaps the Groupby class itself should optionally infer group cardinality up-front. In that case, we would probably not want to override split_out if it was already set to a specific integer value.

This hypothetical concern may not be realistic.

I think it's a realistic hypothetical, but I guess that in that case I would advocate for expanding the accepted types to include None, or a sentinel value like the pandas NoDefault when it actually comes up. That change shouldn't be API-breaking for, except to the extent that we might want to make inference default.

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Sep 14, 2022
Discussion in dask/dask#9490 and dask/dask#9491 suggests that split_out=None as a default value was never really intended and is likely to be deprecated.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ashwin Srinath (https://github.com/shwina)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11704
@jrbourbeau jrbourbeau changed the title Allow split_out to be None, which then defaults to 1 in groupby().aggregate() Allow split_out to be None, which then defaults to 1 in groupby().aggregate() Sep 14, 2022
Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ian-r-rose!

@jrbourbeau jrbourbeau changed the title Allow split_out to be None, which then defaults to 1 in groupby().aggregate() Allow split_out to be None, which then defaults to 1 in groupby().aggregate() Sep 14, 2022
@jrbourbeau jrbourbeau merged commit 982376e into dask:main Sep 14, 2022
@ian-r-rose ian-r-rose self-assigned this Sep 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Can no longer provide split_out=None as a default in groupby.aggregate

4 participants