
Remove nullable handlings where possible from sklearn 1.2.2 upgrade #4072

Merged
merged 8 commits into from
Mar 16, 2023

Conversation

@tamargrey (Contributor) commented Mar 13, 2023

closes #4021, closes #4020, closes #4018, closes #3992, closes #4019, closes #4054

Removes logic related to model understanding nullable type handling and Objective nullable type handling, since sklearn 1.2.2 allows us to fully support nullable types!

codecov bot commented Mar 13, 2023

Codecov Report

Merging #4072 (ba32924) into main (405655f) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4072     +/-   ##
=======================================
- Coverage   99.7%   99.7%   -0.0%     
=======================================
  Files        349     349             
  Lines      37583   37546     -37     
=======================================
- Hits       37465   37428     -37     
  Misses       118     118             
Impacted Files Coverage Δ
evalml/tests/conftest.py 98.3% <ø> (-<0.1%) ⬇️
evalml/model_understanding/metrics.py 100.0% <100.0%> (ø)
evalml/objectives/objective_base.py 100.0% <100.0%> (ø)
evalml/objectives/standard_metrics.py 100.0% <100.0%> (ø)
...ml/tests/model_understanding_tests/test_metrics.py 100.0% <100.0%> (ø)
evalml/tests/objective_tests/test_objectives.py 100.0% <100.0%> (ø)


y_pred_proba = infer_feature_types(y_pred_proba).to_numpy()

if len(y_pred_proba.shape) == 1:
    y_pred_proba = y_pred_proba.reshape(-1, 1)
@tamargrey (Contributor, Author) commented Mar 13, 2023

The roc_curve changes are drastic because, to support passing in nullable types, we had to stop manipulating y_true and y_pred_proba as numpy arrays and keep them as pandas series/dataframes.

We generally stay in pandas in evalml (infer_feature_types converts numpy arrays to pandas), so I don't think it's problematic to stay in pandas here to avoid the extra nullable type logic. It is worth noting a slight increase in runtime when we switch to pandas (the difference grew proportionally as I increased the number of rows and was similar whether the dtype was int64 or Int64). Right now, I think that cost is worth being able to remove the nullable type logic.

[image: runtime comparison of the numpy and pandas implementations]

The reason we can't stay in numpy is that a series.to_numpy() call on pandas nullable dtypes will always produce data with the object dtype, and this is not something numpy is going to change. Previously, we couldn't pass nullable types to the label_binarizer or sklearn_roc_curve, so moving to numpy right away via _convert_ww_series_to_np_array made sense. But now that we can pass in pandas data with nullable dtypes, we should follow the rest of evalml and stay in pandas when possible.
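A minimal sketch of the to_numpy() behavior described above, using plain pandas (no evalml helpers involved):

```python
import numpy as np
import pandas as pd

# A nullable Int64 series: to_numpy() falls back to the object dtype
s = pd.Series([1, 2, None], dtype="Int64")
print(s.to_numpy().dtype)  # object

# An explicit dtype and na_value are needed to get a numeric ndarray back
arr = s.to_numpy(dtype="float64", na_value=np.nan)
print(arr.dtype)  # float64
```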

Contributor

Imo the slowdown is relatively small compared to the benefits you highlighted here, so I'm on board with removing the numpy conversion.

@@ -838,9 +838,9 @@ def objective_function(self, y_true, y_predicted, X=None, sample_weight=None):
"targets contain the value 0.",
)
if isinstance(y_true, pd.Series):
    y_true = y_true.values
@tamargrey (Contributor, Author)

This change was needed because .values doesn't produce a numpy array when used with nullable dtypes: Int64 produces an IntegerArray, which is then problematic for the .mean call below. Changing to .to_numpy() fixed this, so I'm not calling it an integer nullable incompatibility, since the workaround doesn't need to worry about the types.
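A small sketch of the .values vs .to_numpy() difference, again with plain pandas outside the evalml codebase:

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")

# .values returns the extension array, not an ndarray
print(type(s.values).__name__)      # IntegerArray

# .to_numpy() always returns a plain numpy array
print(type(s.to_numpy()).__name__)  # ndarray
```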

@eccabay (Contributor) left a comment

LGTM! Just small questions.

Comment on lines +266 to +268
# Only use one column for binary inputs that are still a DataFrame
elif y_pred_proba.shape[1] == 2:
    y_pred_proba = pd.DataFrame(y_pred_proba.iloc[:, 1])
Contributor

I'm not quite following the logic here. When do we hit this case, what kind of input?

@tamargrey (Contributor, Author)

Yeah, I also struggled to understand this, but it's specifically meant to let users pass in the results of binary classification predict_proba without having to pull out the positive class themselves, since we don't make any such requirement for the multiclass case. It was suggested by freddy in this comment: #1164 (comment). We test it here: https://github.com/alteryx/evalml/blob/main/evalml/tests/model_understanding_tests/test_metrics.py#L347-L366

I tried to add some clarity with the comment "Only use one column for binary inputs that are still a DataFrame", but let me know if there's a better way to clarify what's going on here.
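As a hedged illustration of that branch, here is the same positive-class slicing applied to a made-up two-column DataFrame standing in for a binary predict_proba result:

```python
import pandas as pd

# Hypothetical binary predict_proba output: one column per class
y_pred_proba = pd.DataFrame({0: [0.9, 0.2, 0.4], 1: [0.1, 0.8, 0.6]})

# Keep only the positive-class column, mirroring the elif branch
if isinstance(y_pred_proba, pd.DataFrame) and y_pred_proba.shape[1] == 2:
    y_pred_proba = pd.DataFrame(y_pred_proba.iloc[:, 1])

print(y_pred_proba.shape)                # (3, 1)
print(y_pred_proba.iloc[:, 0].tolist())  # [0.1, 0.8, 0.6]
```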

y_true = y_true.ww.replace({0: 10})
y_pred = y_pred.replace({0: 10})

obj.score(y_true=y_true, y_predicted=y_pred, X=X)
Contributor

For sanity's sake, should we be checking any equality or non-nullness here?

@tamargrey (Contributor, Author)

Added a check that the value isn't null: b5ba0a2

docs/source/release_notes.rst (outdated; resolved)
@tamargrey tamargrey force-pushed the use-sklearn-1.2.2-fixes branch 2 times, most recently from b982900 to 7d859af Compare March 16, 2023 15:08
@tamargrey tamargrey merged commit a40c7f1 into main Mar 16, 2023
@tamargrey tamargrey deleted the use-sklearn-1.2.2-fixes branch March 16, 2023 18:46
@chukarsten chukarsten mentioned this pull request Mar 17, 2023