Deprecate axis= for some groupby operations#10094
Conversation
| PANDAS_GT_150 = PANDAS_VERSION >= Version("1.5.0") | ||
| PANDAS_GT_200 = PANDAS_VERSION.major >= 2 # Also true for nightly builds | ||
| PANDAS_GT_200 = PANDAS_VERSION.major >= 2 | ||
| PANDAS_GT_210 = PANDAS_VERSION.release >= (2, 1, 0) |
There was a problem hiding this comment.
@phofl I saw pandas-dev/pandas#52018 is in the 2.1 milestone, so I think the changes in this PR are correct. Just wanted to double check that this deprecation won't show up in a future pandas=2.0.x release
There was a problem hiding this comment.
Yes, this will come in 2.1. generally we don’t backport deprecations if we can avoid it.
there will be a couple more axis=1 deprecations soonish (just as a heads up)
| if axis in (1, "columns"): | ||
| raise NotImplementedError( | ||
| f"The axis={axis} keyword is not implemented for groupby.idxmax" | ||
| ) |
There was a problem hiding this comment.
Noting for reviewers that axis=1 isn't currently supported and just gets silently ignored. I've opted here to raise an informative error in this case (test coverage is included in test_df_groupby_idx_axis)
| if axis in (1, "columns"): | ||
| raise NotImplementedError( | ||
| f"The axis={axis} keyword is not implemented for groupby.idxmin" | ||
| ) | ||
| self._normalize_axis(axis, "idxmin") |
There was a problem hiding this comment.
Similar comment here about axis=1 being silently ignored now
|
rerun tests |
| "be removed in a future version. Call without passing 'axis' instead.", | ||
| FutureWarning, | ||
| ) | ||
| else: |
There was a problem hiding this comment.
I notice that with this condition, we emit a warning for any arbitrary axis="foo", even when this would subsequently raise an error. Would it make sense to change this to
| else: | |
| elif axis in (1, "columns", None): |
To capture the remaining supported axis values instead?
There was a problem hiding this comment.
To capture the remaining supported axis values instead?
What other axis values are supported?
There was a problem hiding this comment.
I had assumed (1, "columns", None) based on DataFrame._validate_axis, but I recognize that this list varies; I guess my underlying question is why we opt to flip pandas' approach here of specifying the deprecated axis argument in a finite set of cases, and the general deprecation warning otherwise:
There was a problem hiding this comment.
pandas has some internal methods for normalizing to 0 or 1 for input axis values (i.e. "columns" --> 1). For example, not even when one specifies axis="columns" in pandas, the deprecation warnings that's emitted says axis=1
In [6]: type(df)
Out[6]: pandas.core.frame.DataFrame
In [7]: df
Out[7]:
A B C D E
0 1 3 NaN 4.0 6.0
1 1 4 3.0 NaN NaN
2 2 3 NaN 5.0 7.0
3 2 4 NaN NaN NaN
In [8]: df.groupby("A").fillna(0, axis="columns")
<ipython-input-8-b27392e2513d>:1: FutureWarning: DataFrameGroupBy.fillna with axis=1 is deprecated and will be removed in a future version. Operate on the un-grouped DataFrame instead
df.groupby("A").fillna(0, axis="columns")
Out[8]:
B C D E
0 3.0 0.0 4.0 6.0
1 4.0 3.0 0.0 0.0
2 3.0 0.0 5.0 7.0
3 4.0 0.0 0.0 0.0In this PR we're not normalizing to integers and are instead using the user-provided axis value in the deprecation warning
In [12]: ddf.groupby("A").fillna(0, axis="columns")
/Users/james/projects/dask/dask/dask/dataframe/groupby.py:2741: FutureWarning: DataFrameGroupBy.fillna with axis=columns is deprecated and will be removed in a future version. Operate on the un-grouped DataFrame instead
warnings.warn(
Out[12]:
Dask DataFrame Structure:
B C D E
npartitions=2
float64 float64 float64 float64
... ... ... ...
... ... ... ...
Dask Name: droplevel, 8 graph layersI think this is a bit more clear, but probably not a huge deal either way. Does that help clarify things? Or am I missing something?
dask/dataframe/groupby.py
Outdated
| extract_kwargs, | ||
| nonempty=True, | ||
| ) | ||
| meta = self._meta_nonempty.shift(**meta_kwargs) |
There was a problem hiding this comment.
Is the intent here to emit the deprecation warning only through Dask, and suppress any cases where it comes up in pandas? If so, we might want to conditionally suppress warnings here depending on the value of include_axis to avoid getting two identical warnings.
There was a problem hiding this comment.
Yeah, that's the intent. While digging into #10094 (comment) I realized we weren't properly handling the case when we're checking for multiple warnings being emitted. As part of fixing, I decided to always forward axis to pandas, but also catch and ignore these deprecation warnings (coming from pandas, we emit these warnings ourself in `dask)
dask/dataframe/groupby.py
Outdated
| token="groupby-shift", | ||
| group_keys=self.group_keys, | ||
| meta=meta, | ||
| **axis_kwarg, |
There was a problem hiding this comment.
Believe the same as above should apply to the g.shift() call in _groupby_slice_shift
| with groupby_axis_deprecated(): | ||
| expected = df.groupby("A", group_keys=group_keys).fillna(0, axis=axis) | ||
| with groupby_axis_deprecated(): | ||
| result = ddf.groupby("A", group_keys=group_keys).fillna(0, axis=axis) |
There was a problem hiding this comment.
Just for my understanding, is there a reason we're opting to do the fillna calls in separate contexts throughout instead of
| with groupby_axis_deprecated(): | |
| expected = df.groupby("A", group_keys=group_keys).fillna(0, axis=axis) | |
| with groupby_axis_deprecated(): | |
| result = ddf.groupby("A", group_keys=group_keys).fillna(0, axis=axis) | |
| with groupby_axis_deprecated(): | |
| expected = df.groupby("A", group_keys=group_keys).fillna(0, axis=axis) | |
| result = ddf.groupby("A", group_keys=group_keys).fillna(0, axis=axis) |
There was a problem hiding this comment.
Doing them separately ensure that both pandas and dask emit FutureWarnings. With a single context manager, things would still pass if only one emitted a FutureWarning
There was a problem hiding this comment.
Thanks for the clarification!
|
Thanks @j-bennet @charlesbluca for reviewing! |
Corresponding PR to pandas-dev/pandas#52018
Note that
shift-related tests, e.g.test_groupby_shift_basic_input, should be raising deprecation warnings, but aren't. When I run the same code outside the test suite, I see the deprecation warnings. Will need to look into this further.