
Remove nullable handlings where possible from sklearn 1.2.2 upgrade #4072

Merged
merged 8 commits into from
Mar 16, 2023

Conversation

@tamargrey (Contributor) commented Mar 13, 2023

closes #4021, closes #4020, closes #4018, closes #3992, closes #4019, closes #4054

Removes logic related to model understanding nullable type handling and Objective nullable type handling, since sklearn 1.2.2 allows us to fully support nullable types!

codecov bot commented Mar 13, 2023

Codecov Report

Merging #4072 (ba32924) into main (405655f) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4072     +/-   ##
=======================================
- Coverage   99.7%   99.7%   -0.0%     
=======================================
  Files        349     349             
  Lines      37583   37546     -37     
=======================================
- Hits       37465   37428     -37     
  Misses       118     118             
Impacted Files Coverage Δ
evalml/tests/conftest.py 98.3% <ø> (-<0.1%) ⬇️
evalml/model_understanding/metrics.py 100.0% <100.0%> (ø)
evalml/objectives/objective_base.py 100.0% <100.0%> (ø)
evalml/objectives/standard_metrics.py 100.0% <100.0%> (ø)
...ml/tests/model_understanding_tests/test_metrics.py 100.0% <100.0%> (ø)
evalml/tests/objective_tests/test_objectives.py 100.0% <100.0%> (ø)


y_pred_proba = infer_feature_types(y_pred_proba).to_numpy()

if len(y_pred_proba.shape) == 1:
    y_pred_proba = y_pred_proba.reshape(-1, 1)
@tamargrey (Contributor, Author) commented Mar 13, 2023

The roc_curve changes are drastic because, to support passing in nullable types, we had to stop manipulating y_true and y_pred_proba as numpy arrays and keep them as pandas series/dataframes.

We generally stay in pandas in evalml (infer_feature_types converts numpy arrays to pandas), so I don't think it's problematic to stay in pandas here to avoid the extra nullable type logic. It is worth noting a slight increase in runtime when we switch to pandas (the difference grew proportionally as I increased the number of rows and was similar whether the dtype was int64 or Int64). Right now, I think that cost is worth being able to remove the nullable type logic.

[image: runtime comparison of the numpy and pandas implementations]

The reason we can't stay in numpy is that a series.to_numpy() call on pandas nullable dtypes will always produce data with the object dtype, and this is not something numpy is going to change. Previously, we couldn't pass nullable types to the label_binarizer or sklearn_roc_curve, so moving to numpy right away via _convert_ww_series_to_np_array made sense. But now that we can pass in pandas data with nullable dtypes, we should follow the rest of evalml and stay in pandas when possible.
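A minimal sketch of the to_numpy() behavior described above, using plain pandas (no evalml helpers involved):

```python
import numpy as np
import pandas as pd

# A nullable Int64 series: to_numpy() falls back to the object dtype
s = pd.Series([1, 2, None], dtype="Int64")
print(s.to_numpy().dtype)  # object

# An explicit dtype and na_value are needed to get a numeric ndarray back
arr = s.to_numpy(dtype="float64", na_value=np.nan)
print(arr.dtype)  # float64
```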

Contributor

Imo the slowdown is relatively small compared to the benefits you highlighted here, so I'm on board with removing the numpy conversion.

@@ -838,9 +838,9 @@ def objective_function(self, y_true, y_predicted, X=None, sample_weight=None):
"targets contain the value 0.",
)
if isinstance(y_true, pd.Series):
    y_true = y_true.values
@tamargrey (Contributor, Author)

This change was needed because .values doesn't produce a numpy array when used with nullable dtypes: Int64 produces an IntegerArray, which is then problematic for the .mean call below. Changing to .to_numpy() fixed this, so I'm not calling it an integer nullable incompatibility, since the workaround doesn't need to worry about the types.
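A small sketch of the .values vs .to_numpy() difference, again with plain pandas outside the evalml codebase:

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")

# .values returns the extension array, not an ndarray
print(type(s.values).__name__)      # IntegerArray

# .to_numpy() always returns a plain numpy array
print(type(s.to_numpy()).__name__)  # ndarray
```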

@eccabay (Contributor) left a comment

LGTM! Just small questions.

Comment on lines +266 to +268
# Only use one column for binary inputs that are still a DataFrame
elif y_pred_proba.shape[1] == 2:
    y_pred_proba = pd.DataFrame(y_pred_proba.iloc[:, 1])
Contributor

I'm not quite following the logic here. When do we hit this case, what kind of input?

@tamargrey (Contributor, Author)

Yeah, I also struggled to understand this, but it's specifically meant to let users pass in the results of binary classification predict_proba without having to pull out the positive class themselves, since we don't make any such requirement for the multiclass case. It was suggested by freddy in this comment: #1164 (comment). We test it here: https://github.com/alteryx/evalml/blob/main/evalml/tests/model_understanding_tests/test_metrics.py#L347-L366

I tried to add some clarity with the comment "Only use one column for binary inputs that are still a DataFrame", but let me know if there's a better way to clarify what's going on here.
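As a hedged illustration of that branch, here is the same positive-class slicing applied to a made-up two-column DataFrame standing in for a binary predict_proba result:

```python
import pandas as pd

# Hypothetical binary predict_proba output: one column per class
y_pred_proba = pd.DataFrame({0: [0.9, 0.2, 0.4], 1: [0.1, 0.8, 0.6]})

# Keep only the positive-class column, mirroring the elif branch
if isinstance(y_pred_proba, pd.DataFrame) and y_pred_proba.shape[1] == 2:
    y_pred_proba = pd.DataFrame(y_pred_proba.iloc[:, 1])

print(y_pred_proba.shape)                # (3, 1)
print(y_pred_proba.iloc[:, 0].tolist())  # [0.1, 0.8, 0.6]
```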

y_true = y_true.ww.replace({0: 10})
y_pred = y_pred.replace({0: 10})

obj.score(y_true=y_true, y_predicted=y_pred, X=X)
Contributor

For sanity's sake, should we be checking any equality or non-nullness here?

@tamargrey (Contributor, Author)

Added a check that the value isn't null: b5ba0a2

docs/source/release_notes.rst (outdated; resolved)
@tamargrey tamargrey force-pushed the use-sklearn-1.2.2-fixes branch 2 times, most recently from b982900 to 7d859af Compare March 16, 2023 15:08
@tamargrey tamargrey merged commit a40c7f1 into main Mar 16, 2023
@tamargrey tamargrey deleted the use-sklearn-1.2.2-fixes branch March 16, 2023 18:46
@chukarsten chukarsten mentioned this pull request Mar 17, 2023