Add numeric_only support to min, max and prod (#10219)
Conversation
```
# Conflicts:
#	dask/dataframe/_compat.py
#	dask/dataframe/tests/test_arithmetics_reduction.py
```
hendrikmakait left a comment:
The code generally looks good to me; someone with more knowledge of this part of the codebase should have another look, though.
```diff
  return result


- @_numeric_only
  @derived_from(pd.DataFrame)
```
I see you removed the decorator. The decorator did two things: raise on numeric_only=False, and filter the underlying data on numeric_only=True. Are they both irrelevant now?
pandas takes care of filtering for numeric_only=True, so we technically don't need it here. Is there a reason why we would want to do it ourselves?
+1 for offloading to pandas where we can
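For reference, pandas does this filtering itself, so forwarding the keyword is enough. A small illustrative sketch (not code from this PR):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# pandas drops the non-numeric column itself when numeric_only=True,
# so callers don't need to pre-filter the frame
result = df.min(numeric_only=True)
```
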
```diff
  ddf = dd.from_pandas(df, 3)
- funcs = ["sum"]
+ funcs = ["sum", "prod", "product", "min", "max"]
```
It might be good to move these in pytest.mark.parametrize, if possible. There, you could also mark some as xfail.
Yeah, I thought about this as well; I modelled it after the test below, which uses the same pattern. Not sure why that was -- performance might be a reason? Happy to change it, though, if you prefer.
One thing about xfails: They really slow down the test suite if you have too many of them.
Not sure why the original tests were formatted this way. Not a huge deal either way, but FWIW I also prefer pytest.mark.parametrize. Pushed a tiny commit that moves funcs into a parameterization.
> One thing about xfails: They really slow down the test suite if you have too many of them.
Yeah, fair point. We could avoid pytest.mark.parametrize in these cases or even just mark them as skip instead of xfail -- this isn't exactly the same, but may be close enough
Yeah, I prefer parametrisation as well. I saw the loop logic a couple of times in the test suite, so I wasn't sure whether it was preferred here.
Yeah, skip is better performance-wise.
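The restructuring discussed here might look roughly like this (the test name, parameter set, and skip reason are hypothetical, just to show the shape):

```python
import pytest

# Hypothetical parametrization replacing an in-test funcs loop; a case
# known not to work can be skipped via pytest.param, which is cheaper
# than xfail since the test body never runs
@pytest.mark.parametrize(
    "func",
    [
        "sum",
        "min",
        "max",
        pytest.param("prod", marks=pytest.mark.skip(reason="hypothetical example")),
    ],
)
def test_reduction_numeric_only(func):
    # body would call getattr(ddf, func)(numeric_only=...) as in the diff above
    pass
```
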
```diff
  getattr(ddf, func)()
  getattr(ddf, func)(numeric_only=False)

+ warning = FutureWarning
```
Nice -- this is a neat way to determine warning behavior later on
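A sketch of the pattern being praised here (the helper function and warning message are made up; only the `warning = FutureWarning` assignment comes from the diff):

```python
import warnings
import pytest

# Decide the expected warning class up front...
warning = FutureWarning  # could differ by pandas version in a real test

def call_that_warns():
    # stand-in for the dask call under test
    warnings.warn("numeric_only default is changing", FutureWarning)

# ...then assert against the variable in one place later on,
# instead of branching the whole test body
with pytest.warns(warning):
    call_that_warns()
```
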
```diff
  getattr(df, func)(**kwargs),
  getattr(ddf, func)(**kwargs),
- check_dtype=func in ["mean", "max"],
+ check_dtype=func in ["mean"],
```
I was curious whether this was still needed, so I removed it and things passed locally. I've included a small change to remove the special check_dtype handling here -- let's just use the default of always checking.
Yeah, I think this makes sense. I didn't pay too much attention to it, since I expect we'll get rid of this test soonish.
```diff
  getattr(df_numerics, func)(),
  getattr(ddf_numerics, func)(),
- check_dtype=func in ["mean", "max"],
+ check_dtype=func in ["mean"],
```
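For readers unfamiliar with check_dtype: it mirrors the flag in pandas' own testing helpers, sketched here with pandas.testing as a stand-in (values are made up):

```python
import pandas as pd
import pandas.testing as tm

s1 = pd.Series([1, 2], dtype="int64")
s2 = pd.Series([1, 2], dtype="int32")

# check_dtype=False tolerates the dtype mismatch as long as values match
tm.assert_series_equal(s1, s2, check_dtype=False)

# the default, check_dtype=True, flags the mismatch
dtype_mismatch_detected = False
try:
    tm.assert_series_equal(s1, s2)
except AssertionError:
    dtype_mismatch_detected = True
```
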
Note: test failures are unrelated to the changes in this PR and are being addressed over in #10241
`pre-commit run --all-files`

Sits on top of #10194