Skip to content

Add numeric_only support to min, max and prod#10219

Merged
jrbourbeau merged 14 commits intodask:mainfrom
phofl:numeric_only_others
Apr 30, 2023
Merged

Add numeric_only support to min, max and prod#10219
jrbourbeau merged 14 commits intodask:mainfrom
phofl:numeric_only_others

Conversation

@phofl
Copy link
Copy Markdown
Collaborator

@phofl phofl commented Apr 24, 2023

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

Sits on top of #10194

Copy link
Copy Markdown
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code generally looks good to me, someone with more knowledge of this part of the codebase should have another look though.

return result

@_numeric_only
@derived_from(pd.DataFrame)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you removed the decorator. The decorator did two things, raise on numeric_only=False, and filter the underlying data on numeric_only=True. Are they both irrelevant now?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas takes care of filtering for numeric_only=True, we technically don't need it here. Is there a reason why we would want to do it ourselves?

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for offloading to pandas where we can


ddf = dd.from_pandas(df, 3)
funcs = ["sum"]
funcs = ["sum", "prod", "product", "min", "max"]
Copy link
Copy Markdown
Contributor

@j-bennet j-bennet Apr 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to move these in pytest.mark.parametrize, if possible. There, you could also mark some as xfail.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought about this as well, modelled after the test below that uses the same pattern. Not sure why that was, performance might be a reason? Happy to change though if you prefer.

One thing about xfails: They really slow down the test suite if you have too many of them.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why the original tests were formatted this way. Not a huge deal either way, but FWIW I also prefer pytest.mark.parametrize. Pushed a tiny commit that moves funcs into a parameterization.

One thing about xfails: They really slow down the test suite if you have too many of them.

Yeah, fair point. We could avoid pytest.mark.parametrize in these cases or even just mark them as skip instead of xfail -- this isn't exactly the same, but may be close enough

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I prefer parametrisation as well. Saw the loop logic a couple of times in the test suite, so wasn't sure if this was preferred here.

Yeah skip is better performance wise

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @phofl for the updates here and @j-bennet for reviewing

getattr(ddf, func)()
getattr(ddf, func)(numeric_only=False)

warning = FutureWarning
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice -- this is a neat way to determine warning behavior later on

return result

@_numeric_only
@derived_from(pd.DataFrame)
Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for offloading to pandas where we can

getattr(df, func)(**kwargs),
getattr(ddf, func)(**kwargs),
check_dtype=func in ["mean", "max"],
check_dtype=func in ["mean"],
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious if this was still needed, so I removed it and locally things passed. Including a small change to remove the special check_dtype handling here -- let's just use the default of always checking.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this makes sense. Didn't pay too much attention, since I expect that we get rid of this test soonish

getattr(df_numerics, func)(),
getattr(ddf_numerics, func)(),
check_dtype=func in ["mean", "max"],
check_dtype=func in ["mean"],
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar thing here


ddf = dd.from_pandas(df, 3)
funcs = ["sum"]
funcs = ["sum", "prod", "product", "min", "max"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why the original tests were formatted this way. Not a huge deal either way, but FWIW I also prefer pytest.mark.parametrize. Pushed a tiny commit that moves funcs into a parameterization.

One thing about xfails: They really slow down the test suite if you have too many of them.

Yeah, fair point. We could avoid pytest.mark.parametrize in these cases or even just mark them as skip instead of xfail -- this isn't exactly the same, but may be close enough

@jrbourbeau
Copy link
Copy Markdown
Member

Note: test failures are unrelated to the changes in this PR and are being addressed over in #10241

@jrbourbeau jrbourbeau merged commit 7db8efc into dask:main Apr 30, 2023
@phofl phofl deleted the numeric_only_others branch April 30, 2023 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants