Upgrade sklearn to 1.2 and switch tuner#3983
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3983 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 347 347
Lines 36896 36906 +10
=======================================
+ Hits 36775 36785 +10
Misses 121 121
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
| >>> y_true = pd.Series([0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1]) | ||
| >>> y_pred = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]) | ||
| >>> np.testing.assert_almost_equal(LogLossBinary().objective_function(y_true, y_pred), 18.8393325) | ||
| >>> np.testing.assert_almost_equal(LogLossBinary().objective_function(y_true, y_pred), 19.6601745) |
There was a problem hiding this comment.
Should this really change that much?
There was a problem hiding this comment.
Agreed that this is weird, but I did test between the two versions and the number does in fact change. The release notes for version 1.2 mention the change of an epsilon parameter for log loss, which might be the cause of this.
| Example: | ||
| >>> pipeline = TimeSeriesRegressionPipeline(component_graph=["Simple Imputer", "Linear Regressor"], | ||
| ... parameters={"Linear Regressor": {"normalize": True}, | ||
| ... parameters={"Simple Imputer": {"impute_strategy": "mean"}, |
There was a problem hiding this comment.
Why did we have to do this?
There was a problem hiding this comment.
Scikit-learn removed the normalize parameter from their linear models (it's been deprecated for a while before that, as well, you can see the note about it in old docs here)
| assert snc.categorical_features == [0, 1, 2] | ||
| assert snc.categorical_features == [20, 21, 2] |
There was a problem hiding this comment.
Why would this be the case?
There was a problem hiding this comment.
It's due to how columns are sorted with strings vs numbers - since the column names get converted to strings before adding the new columns, the new ones are added at the end instead of the beginning. I could remove the need for this change by converting the column names to strings after the new columns are added.
| "cat_c": Boolean(), | ||
| "EMAIL_ADDRESS_TO_DOMAIN(email)_gmail.com": Boolean(), | ||
| "IS_FREE_EMAIL_DOMAIN(email)_True": Boolean(), | ||
| "IS_FREE_EMAIL_DOMAIN(email)_1.0": Boolean(), |
There was a problem hiding this comment.
Uh oh, I don't know if this is desired, right? @Cmancuso
There was a problem hiding this comment.
Yeah, I'm concerned about this as well, and I don't quite understand why it's happening. I spent some time digging into it and it seems to be stemming exclusively from scikit-learn. We correctly pass booleans into the encoder, but this came out as a float with no change on our end.
|
Just noting the relation to the nullable types epic for posterity: scikit-learn 1.2.0 and 1.2.1 introduce new support for nullable types, which will allow us to use the estimators mentioned in #3910 with nullable types. |
ValueError: Unknown label type: when nullable y passed in
#3927
Resolves #3984
Trying #3909 again, this time updating our
SKOptTunerto work around the skopt issue, since their repo hasn't been updated since October of 2021.Perf tests for the tuner estimator produced no significant changes save some small speedups.