
Upgrade sklearn to 1.2 and switch tuner #3983

Merged
merged 8 commits into from
Feb 13, 2023

Conversation


@eccabay commented Feb 8, 2023

Resolves #3984

Trying #3909 again, this time updating our SKOptTuner to work around the skopt issue, since their repo hasn't been updated since October of 2021.

Perf tests for the tuner estimator showed no significant changes aside from some small speedups.


codecov bot commented Feb 8, 2023

Codecov Report

Merging #3983 (c4ff45c) into main (5f58354) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3983     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        347     347             
  Lines      36896   36906     +10     
=======================================
+ Hits       36775   36785     +10     
  Misses       121     121             
Impacted Files Coverage Δ
evalml/objectives/standard_metrics.py 100.0% <ø> (ø)
...s/estimators/regressors/decision_tree_regressor.py 100.0% <ø> (ø)
...ents/estimators/regressors/elasticnet_regressor.py 100.0% <ø> (ø)
evalml/pipelines/components/utils.py 96.6% <ø> (ø)
evalml/pipelines/regression_pipeline.py 100.0% <ø> (ø)
...valml/pipelines/time_series_regression_pipeline.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_components.py 99.0% <ø> (ø)
evalml/tests/component_tests/test_en_regressor.py 100.0% <ø> (ø)
...valml/tests/pipeline_tests/test_component_graph.py 99.9% <ø> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 99.9% <ø> (ø)
... and 7 more


@eccabay mentioned this pull request Feb 8, 2023
@eccabay marked this pull request as ready for review February 8, 2023 19:12
@@ -577,7 +577,7 @@ class LogLossBinary(BinaryClassificationObjective):
Example:
>>> y_true = pd.Series([0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1])
>>> y_pred = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
- >>> np.testing.assert_almost_equal(LogLossBinary().objective_function(y_true, y_pred), 18.8393325)
+ >>> np.testing.assert_almost_equal(LogLossBinary().objective_function(y_true, y_pred), 19.6601745)
Contributor

Should this really change that much?

Contributor Author

Agreed that this is weird, but I did test between the two versions and the number does in fact change. The release notes for version 1.2 mention a change to an epsilon parameter for log loss, which is likely the cause.
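For what it's worth, the size of the jump is consistent with exactly that kind of epsilon change. A back-of-the-envelope sketch (assuming, per the 1.2 release notes, that the default clipping epsilon moved from 1e-15 to float64 machine epsilon): six of the eleven predictions in the doctest are hard-wrong, each gets clipped to eps before the log, so the score reduces to 6 * -log(eps) / 11, which reproduces both doctest values.

```python
import math

# Sketch, assuming the score change comes from log_loss's default clipping
# epsilon moving from 1e-15 to float64 machine epsilon in scikit-learn 1.2.
# In the doctest, y_pred disagrees with y_true on 6 of the 11 samples; each
# hard-wrong probability is clipped to eps, so -log(eps) dominates the loss.
n_wrong, n = 6, 11

old_eps = 1e-15       # pre-1.2 default
new_eps = 2 ** -52    # float64 machine epsilon, the assumed 1.2 value

old_loss = n_wrong * -math.log(old_eps) / n
new_loss = n_wrong * -math.log(new_eps) / n

print(f"{old_loss:.7f}")  # ≈ 18.8393326, matching the old doctest value
print(f"{new_loss:.7f}")  # ≈ 19.6601746, matching the new doctest value
```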

- ... parameters={"Linear Regressor": {"normalize": True},
+ ... parameters={"Simple Imputer": {"impute_strategy": "mean"},
Contributor

Why did we have to do this?

Contributor Author

Scikit-learn removed the normalize parameter from their linear models (it had been deprecated for a while before that, as well; you can see the note about it in the old docs here).
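For reference, the migration scikit-learn's deprecation notice pointed to is scaling explicitly in a Pipeline. A minimal sketch (note that StandardScaler is not a numerically exact replacement for the old normalize=True, which divided by the L2 norm after centering):

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Before scikit-learn 1.2:
#   model = LinearRegression(normalize=True)
# From 1.2 on, the parameter is gone; scale the features explicitly
# instead. This follows the deprecation notice's suggestion but is not
# an exact numerical equivalent of normalize=True.
model = make_pipeline(StandardScaler(), LinearRegression())
```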

Comment on lines -309 to +314
- assert snc.categorical_features == [0, 1, 2]
+ assert snc.categorical_features == [20, 21, 2]
Contributor

Why would this be the case?

Contributor Author

It's due to how columns are sorted with strings vs numbers: since the column names get converted to strings before the new columns are added, the new ones end up appended at the end instead of the beginning. I could remove the need for this change by converting the column names to strings after the new columns are added.
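A tiny illustration of the kind of string-vs-number ordering difference being described (the names here are made up, not the actual test columns): numeric names compare by value, but stringified names compare lexicographically, so the same columns can land in different positions.

```python
# Hypothetical column names: numeric names sort by value, but once
# converted to strings they sort lexicographically, shifting positions.
numeric_names = [2, 10, 20]
string_names = [str(n) for n in numeric_names]

print(sorted(numeric_names))  # [2, 10, 20]
print(sorted(string_names))   # ['10', '2', '20']
```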

- "IS_FREE_EMAIL_DOMAIN(email)_True": Boolean(),
+ "IS_FREE_EMAIL_DOMAIN(email)_1.0": Boolean(),
Contributor

Uh oh, I don't know if this is desired, right? @Cmancuso

Contributor Author

Yeah, I'm concerned about this as well, and I don't quite understand why it's happening. I spent some time digging into it, and it seems to stem exclusively from scikit-learn: we correctly pass booleans into the encoder, but they come out as floats with no change on our end.

@tamargrey
Contributor

Just noting the relation to the nullable types epic for posterity: scikit-learn 1.2.0 and 1.2.1 introduce new support for nullable types, which will allow us to use the estimators mentioned in #3910 with nullable types.

Development

Successfully merging this pull request may close these issues.

Update EvalML to Sklearn 1.2.0