-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade sklearn to 1.2 and switch tuner #3983
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3983 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 347 347
Lines 36896 36906 +10
=======================================
+ Hits 36775 36785 +10
Misses 121 121
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
@@ -577,7 +577,7 @@ class LogLossBinary(BinaryClassificationObjective): | |||
Example: | |||
>>> y_true = pd.Series([0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1]) | |||
>>> y_pred = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]) | |||
>>> np.testing.assert_almost_equal(LogLossBinary().objective_function(y_true, y_pred), 18.8393325) | |||
>>> np.testing.assert_almost_equal(LogLossBinary().objective_function(y_true, y_pred), 19.6601745) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this really change that much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that this is weird, but I did test between the two versions and the number does in fact change. The release notes for version 1.2 mention the change of an epsilon parameter for log loss, which might be the cause of this.
... parameters={"Linear Regressor": {"normalize": True}, | ||
... parameters={"Simple Imputer": {"impute_strategy": "mean"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did we have to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scikit-learn removed the normalize
parameter from their linear models (it's been deprecated for a while before that, as well, you can see the note about it in old docs here)
assert snc.categorical_features == [0, 1, 2] | ||
assert snc.categorical_features == [20, 21, 2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would this be the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's due to how columns are sorted with strings vs numbers - since the column names get converted to strings before adding the new columns, the new ones are added at the end instead of the beginning. I could remove the need for this change by converting the column names to strings after the new columns are added.
"IS_FREE_EMAIL_DOMAIN(email)_True": Boolean(), | ||
"IS_FREE_EMAIL_DOMAIN(email)_1.0": Boolean(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh oh, I don't know if this is desired, right? @Cmancuso
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm concerned about this as well, and I don't quite understand why it's happening. I spent some time digging into it and it seems to be stemming exclusively from scikit-learn. We correctly pass booleans into the encoder, but this came out as a float with no change on our end.
Just noting the relation to the nullable types epic for posterity: scikit-learn 1.2.0 and 1.2.1 introduce new support for nullable types, which will allow us to use the estimators mentioned in #3910 with nullable types. |
ValueError: Unknown label type:
when nullable y passed in
#3927
Resolves #3984
Trying #3909 again, this time updating our
SKOptTuner
to work around the skopt issue, since their repo hasn't been updated since October of 2021.Perf tests for the tuner estimator produced no significant changes save some small speedups.