
Add StandardScaler to ElasticNet pipelines. #1065

Merged
3 commits merged into main on Aug 17, 2020

Conversation

freddyaboulton (Contributor)

Pull Request Description

Adds StandardScaler to all ElasticNet pipelines.
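For context on why this matters for ElasticNet: the L1/L2 penalties shrink coefficients by the same amount regardless of each feature's scale, so unscaled features are penalized unevenly. A minimal, self-contained sketch of what standard scaling does to a feature column (illustrative only, not EvalML's StandardScaler implementation, which wraps scikit-learn's):

```python
# Illustrative sketch: standardize a feature column to zero mean and unit
# variance before it reaches a regularized linear model such as ElasticNet,
# so the L1/L2 penalties treat all features on the same scale.
from statistics import mean, pstdev

def standard_scale(column):
    """Return the column shifted to mean 0 and scaled to (population) std 1."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]

raw = [10.0, 20.0, 30.0, 40.0]
scaled = standard_scale(raw)
```

After scaling, a unit change in any feature moves the penalty term by a comparable amount, which is the behavior this PR gives every ElasticNet pipeline.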


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.


codecov bot commented Aug 14, 2020

Codecov Report

Merging #1065 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1065   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files         188      188           
  Lines       10286    10296   +10     
=======================================
+ Hits        10277    10287   +10     
  Misses          9        9           
Impacted Files Coverage Δ
evalml/pipelines/utils.py 100.00% <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3fe1630...0dabc81. Read the comment docs.

@freddyaboulton freddyaboulton changed the title All linear models have a standard scaler in the pipeline. Add StandardScaler to ElasticNet pipelines. Aug 14, 2020
@freddyaboulton freddyaboulton marked this pull request as ready for review August 14, 2020 18:51
angela97lin (Contributor) left a comment

Left a comment about adding a test but thanks for catching and adding this! 😁

(Side note: I'm curious / abusing the perf tests but still curious if/how this changes scores :d)

@@ -56,7 +55,7 @@ def _get_preprocessing_components(X, y, problem_type, estimator_class):
     if (add_datetime_featurizer or len(categorical_cols.columns) > 0) and estimator_class not in {CatBoostClassifier, CatBoostRegressor}:
         pp_components.append(OneHotEncoder)
 
-    if estimator_class in {LinearRegressor, LogisticRegressionClassifier}:
+    if estimator_class.model_family == ModelFamily.LINEAR_MODEL:
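The diff above swaps an explicit set of estimator classes for a check on a shared model_family attribute, so new linear estimators like ElasticNet pick up the scaler automatically. A minimal sketch of that pattern (class and enum names are illustrative stand-ins, not EvalML's real definitions):

```python
# Sketch of gating a preprocessing step on a shared model_family attribute
# instead of enumerating estimator classes one by one. Illustrative only.
from enum import Enum

class ModelFamily(Enum):
    LINEAR_MODEL = "linear_model"
    CATBOOST = "catboost"

class LogisticRegressionClassifier:
    model_family = ModelFamily.LINEAR_MODEL

class ElasticNetClassifier:  # picked up by the check below without editing it
    model_family = ModelFamily.LINEAR_MODEL

class CatBoostClassifier:
    model_family = ModelFamily.CATBOOST

def needs_scaler(estimator_class):
    """StandardScaler is added for every linear-model estimator."""
    return estimator_class.model_family == ModelFamily.LINEAR_MODEL
```

The design benefit is that membership in the family is declared once on the estimator class, so the pipeline-building code never needs to be touched when a new linear model is added.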
angela97lin (Contributor)
Nice! Could we add a test (testing test_make_pipeline) to make sure :d

freddyaboulton (Contributor Author)
Certainly!

dsherry (Contributor) left a comment

Great! Thanks for thinking of this. As mentioned in slack, I think all non-tree-based models should apply scaling, for now.

@freddyaboulton it would be great to see perf test results on this, perhaps 5 trials on datasets_small_0.yaml with model_family limited to linear. But since this is a small change, not required--if it did introduce a regression we'd catch it before release.

eccabay (Contributor) left a comment

Wonderful, LGTM!

angela97lin (Contributor) left a comment

LGTM! thanks for adding the test!

@@ -114,6 +116,11 @@ def test_make_pipeline():
     assert isinstance(binary_pipeline, type(BinaryClassificationPipeline))
     assert binary_pipeline.component_graph == [DropNullColumns, Imputer, DateTimeFeaturizer, OneHotEncoder, StandardScaler, LogisticRegressionClassifier]
 
+    en_binary_pipeline = make_pipeline(X, y, ElasticNetClassifier, ProblemTypes.BINARY)
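The new test case presumably asserts that the ElasticNet pipeline's component graph now contains StandardScaler, mirroring the logistic-regression assertion above it. A toy, self-contained illustration of that kind of check (make_component_graph is a hypothetical stand-in, not EvalML's real make_pipeline):

```python
# Toy stand-in for component selection in make_pipeline (illustrative only;
# the real function also inspects the data and problem type).
def make_component_graph(estimator_name, is_linear):
    graph = ["Imputer", "OneHotEncoder"]
    if is_linear:  # the behavior this PR adds for ElasticNet estimators
        graph.append("StandardScaler")
    graph.append(estimator_name)
    return graph

en_graph = make_component_graph("ElasticNetClassifier", is_linear=True)
rf_graph = make_component_graph("RandomForestClassifier", is_linear=False)
```

The test-style assertions then verify the scaler appears for the linear model and not for the tree-based one.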
Contributor

Nice, thanks for this!

freddyaboulton (Contributor Author)

@angela97lin @dsherry @eccabay I was able to run the performance tests for this change. The results are here. In short, introducing this change increases the number of times EN is picked as the best pipeline without introducing any regressions (fit time or average best pipeline score) so I think we're ok to merge this.

@freddyaboulton freddyaboulton merged commit 2ecf3bd into main Aug 17, 2020
@dsherry dsherry mentioned this pull request Aug 25, 2020
@freddyaboulton freddyaboulton deleted the add-standard-scaler-elastic-net-pipelines branch October 22, 2020 18:28
4 participants