Remove nullable handling where possible from sklearn 1.2.2 upgrade #4072
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main   #4072     +/- ##
=======================================
- Coverage   99.7%   99.7%   -0.0%
=======================================
  Files        349     349
  Lines      37583   37546     -37
=======================================
- Hits       37465   37428     -37
  Misses       118     118
y_pred_proba = infer_feature_types(y_pred_proba).to_numpy()

if len(y_pred_proba.shape) == 1:
    y_pred_proba = y_pred_proba.reshape(-1, 1)
The roc_curve changes are this drastic because we were manipulating y_true and y_predict_proba as numpy arrays instead of pandas series/dataframes, which is what blocked nullable types. We generally tend to stay in pandas in evalml (infer_feature_types converts numpy arrays to pandas), so I don't think it's problematic to stay in pandas here and avoid the extra nullable type logic. It is worth noting a slight increase in runtime when we switch to pandas (the difference grew proportionally as I increased the number of rows, and was similar whether the dtype was int64 or Int64). Right now, I think that's a fair price for being able to remove the nullable type logic.
The reason we can't stay in numpy is that a series.to_numpy() call on pandas nullable dtypes will always result in data with the object dtype, and this is not something numpy is going to change. Previously, we couldn't pass the nullable types to the label_binarizer or sklearn_roc_curve, so moving to numpy right away via _convert_ww_series_to_np_array made sense. Now that we can pass in pandas data with nullable dtypes, we should follow the rest of evalml and stay in pandas when possible.
Imo the slowdown is relatively small compared to the benefits you highlighted here so I'm onboard with removing the numpy conversion.
@@ -838,9 +838,9 @@ def objective_function(self, y_true, y_predicted, X=None, sample_weight=None):
        "targets contain the value 0.",
    )
    if isinstance(y_true, pd.Series):
        y_true = y_true.values
This change was needed because .values doesn't produce a numpy array when used with nullable dtypes: Int64 produces an IntegerArray, which is then problematic for the .mean call below. Changing to to_numpy fixed this, so I'm not calling it an integer nullable incompatibility, since the workaround doesn't need to worry about the types.
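A minimal sketch of the .values vs .to_numpy() difference being described (illustrative, not the evalml source):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")

# .values on a nullable-dtype Series returns the pandas extension
# array (IntegerArray), not a numpy ndarray
print(type(s.values).__name__)           # IntegerArray
print(isinstance(s.values, np.ndarray))  # False

# .to_numpy() always hands back a real numpy ndarray
arr = s.to_numpy()
print(isinstance(arr, np.ndarray))       # True
```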
LGTM! Just small questions.
# Only use one column for binary inputs that are still a DataFrame
elif y_pred_proba.shape[1] == 2:
    y_pred_proba = pd.DataFrame(y_pred_proba.iloc[:, 1])
I'm not quite following the logic here. When do we hit this case, what kind of input?
Yeah, I also struggled to understand this, but it's specifically meant to let users pass in the results from binary classification predict_proba without having to pull out the positive class themselves, since we don't make any such requirement for the multiclass case. It was suggested by freddy in this comment: #1164 (comment). We test it here: https://github.com/alteryx/evalml/blob/main/evalml/tests/model_understanding_tests/test_metrics.py#L347-L366
I tried to add some clarity with the comment "Only use one column for binary inputs that are still a DataFrame", but let me know if there's a better way to clarify what's going on here.
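As a standalone illustration of the branch being discussed (hypothetical data, not the test fixtures): a binary predict_proba result has two columns, one per class, and the elif reduces it to just the positive-class column.

```python
import pandas as pd

# Hypothetical binary predict_proba output: column 0 is the negative
# class, column 1 the positive class; each row sums to 1
y_pred_proba = pd.DataFrame({0: [0.9, 0.2, 0.6], 1: [0.1, 0.8, 0.4]})

# Only use one column for binary inputs that are still a DataFrame
if y_pred_proba.shape[1] == 2:
    y_pred_proba = pd.DataFrame(y_pred_proba.iloc[:, 1])

print(y_pred_proba.shape)  # (3, 1)
```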
y_true = y_true.ww.replace({0: 10})
y_pred = y_pred.replace({0: 10})

obj.score(y_true=y_true, y_predicted=y_pred, X=X)
For sanity's sake, should we be checking any equality or non-nullness here?
Added a check that the value isn't null: b5ba0a2
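A minimal sketch of the kind of non-null check being suggested, using stand-in data and a simple accuracy-style score in place of the objective (the names and the scoring logic here are illustrative, not the evalml test):

```python
import pandas as pd

# Stand-in nullable-dtype targets and predictions after the 0 -> 10 replace
y_true = pd.Series([10, 1, 1, 10], dtype="Int64")
y_pred = pd.Series([10, 1, 10, 10], dtype="Int64")

# Stand-in for obj.score: fraction of matching predictions
score = (y_true.to_numpy() == y_pred.to_numpy()).mean()

# The sanity check: scoring nullable-dtype inputs yields a real number
assert not pd.isnull(score)
print(score)  # 0.75
```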
closes #4021, closes #4020, closes #4018, closes #3992, closes #4019, closes #4054
Removes the nullable type handling logic from model understanding and Objectives, since sklearn 1.2.2 allows us to fully support nullable types!