Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upstream fix for numeric_only default [WIP] #9241

Closed
wants to merge 1 commit into from

Conversation

ncclementi
Copy link
Member

I started to work on a fix for the numeric_only CI failure we are seeing. I've only covered one file, but I wonder if this is the approach we want to pursue. If so I can continue adding all the other files.

cc: @jsignell @ian-r-rose and @pavithraes because I know you were looking at this one.

@jsignell
Copy link
Member

jsignell commented Jul 6, 2022

Thanks for starting this Naty! I would think that the approach would be to change the default on the method definition itself. The part that is up in the air is whether the default value of numeric_only should depend on the version of pandas or not.

@ncclementi
Copy link
Member Author

I would think that the approach would be to change the default on the method definition itself. The part that is up in the air is whether the default value of numeric_only should depend on the version of pandas or not.

Do you mean like here for example for the mean?

dask/dask/dataframe/groupby.py

Lines 1475 to 1480 in ecbab9d

def mean(self, split_every=None, split_out=1):
s = self.sum(split_every=split_every, split_out=split_out)
c = self.count(split_every=split_every, split_out=split_out)
if is_dataframe_like(s):
c = c[s.columns]
return s / c

Do you want to change it on the method to be numeric_only=True, regarding of the pandas version? This will be an inconsistency as Pandas default will be numeric_only=False, is this the way we want to go?. Unless I'm missing something here.

@jsignell
Copy link
Member

jsignell commented Jul 6, 2022

Oh. I hadn't fully internalized that pandas is explicitly trying to make people specify numeric_only. This seems super verbose, but I guess yeah what you have is right if we are committed to exposing dask users to the future warnings from pandas. Sorry for the distraction.

@jsignell
Copy link
Member

jsignell commented Jul 6, 2022

When you go to add numeric_only to groupby methods you can probably use the _numeric_only decorator.

@pavithraes
Copy link
Member

pavithraes commented Jul 6, 2022

@ncclementi @ian-r-rose and I looked at this today.

According to this changelog entry:

  • for pandas=1.5, numeric_only=None has been deprecated and set to True by default, edit: pandas will raise a deprecation warning if default values are used and if None is used (but no defaults are getting changed)
  • however, there is a deprecation warning that says it will be False by default in a future release (looks like 2.0)
  • there are also new functions that have "gained" this numeric_only kwarg -- same defaults as the previous two points

Next,

  • Currently in Dask we don't support numeric_only=False (see here)
  • If this will be the default in pandas, do we want to support it? (probably not?)
  • We can also set the default to False in the definition but continue to only support the True case (like we do with apply, axis=1 here)

Should we support this intermediate numeric_only=True case? Or, shall we jump to False directly?

It'll be nice to get a consensus beforehand because we'll be touching many functions with this.

Sidenote, current docs (example here) says we don't support numeric_only because it's not in the function definition, but we actually do support (at least numeric_only=True) because pandas does.

@jsignell
Copy link
Member

jsignell commented Jul 6, 2022

I think short term we should add support for numeric_only=False and then when pandas switches to False so can we.

for pandas=1.5, numeric_only=None has been deprecated and set to True by default

Yeah I think that's right and it raises a warning unless you explicitly set numeric_only:

In [1]: import pandas as pd
   ...: 
   ...: df = pd.DataFrame({"a": [1,2,3], "b": ["a", "b", "c"]})

In [2]: df.mean()
<ipython-input-2-c61f0c8f89b5>:1: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.mean()
Out[2]: 
a    2.0
dtype: float64

If False will be the default in pandas, do we want to support it? (probably not?)

I think it would be fairly easy to support. And it would eagerly fail (like pandas does) most of the time since we try to run functions on the meta first.

We can also set the default to False in the definition but continue to only support the True case.

I don't like this idea very much.

Should we support this intermediate numeric_only=True case? Or, shall we jump to False directly?

I think even longer term in pandas the intention is to get rid of this kwarg so probably we can just jump to False and delete the whole thing when pandas goes to 2.0 and we can just stop supporting versions of pandas where numeric_only isn't a valid kwarg.

@ncclementi
Copy link
Member Author

Linking here the reply on the pandas repo about the defaults status for all the affected operations.
pandas-dev/pandas#46560 (comment)

@ncclementi
Copy link
Member Author

@jsignell and I were chatting about this one, I'm closing this PR in favor of #9269

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

⚠️ Upstream CI failed ⚠️
3 participants