Implement stacked ensemble classes #1134

angela97lin · 2020-09-02T23:01:31Z

@dsherry I did some digging to see what it would take to pass our estimators to scikit-learn directly, rather than estimator._component_obj. At a minimum, it required:

implementing get_params(deep)
adding classes_ and _estimator_type

These additions were fine, but the underlying scikit-learn implementation for concatenating the results from each estimator assumes that each estimator's predict method returns an np.array, and it errors out since we pass in pd.DataFrame. Enabling this would take more work, so I opted to just pass in _component_obj, as we've done for our other components.
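For illustration, the attributes listed above can be sketched on a duck-typed wrapper using only the standard library. This is a minimal sketch, not the PR's implementation: InnerModel and SklearnStyleClassifier are hypothetical names, and the sketch assumes scikit-learn of that period looked up the underscore-prefixed `_estimator_type` attribute and rebuilt estimators from `get_params`.

```python
class InnerModel:
    """Hypothetical stand-in for an EvalML-style estimator (not the real API)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y):
        # Remember the distinct labels seen during training.
        self.labels_ = sorted(set(y))
        return self

    def predict(self, X):
        # Toy rule: first label if the first feature is below the threshold,
        # otherwise the last label.
        return [
            self.labels_[0] if row[0] < self.threshold else self.labels_[-1]
            for row in X
        ]


class SklearnStyleClassifier:
    """Duck-typed wrapper exposing get_params, classes_ and _estimator_type."""

    # Assumption: this attribute is what scikit-learn helpers inspected
    # to decide whether an object is a classifier.
    _estimator_type = "classifier"

    def __init__(self, inner=None):
        self.inner = inner

    def get_params(self, deep=True):
        # Keys must match the __init__ argument names so that the object
        # can be rebuilt as type(self)(**params).
        return {"inner": self.inner}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        self.inner.fit(X, y)
        # classes_ is only available after fitting.
        self.classes_ = self.inner.labels_
        return self

    def predict(self, X):
        return self.inner.predict(X)


clf = SklearnStyleClassifier(InnerModel(threshold=0.5))
clf.fit([[0.1], [0.9]], ["a", "b"])
# Simplified version of rebuilding an estimator from its parameters.
rebuilt = type(clf)(**clf.get_params(deep=False))
```

The rebuild on the last line is a simplification: it reuses the same inner object, whereas a real clone would typically copy the parameters as well.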

Update: This PR also adds a utility method that wraps our pipeline / component classes in a simple scikit-learn estimator instance. Currently, check_estimator(obj) (from the scikit-learn documentation on rolling your own estimator) fails, so the returned estimators are probably not fully scikit-learn compliant, but they're sufficient in this case. Will put up an issue for this once this is merged :)

I tried making the objects scikit-learn compliant, but it is probably not feasible without some big API changes on our end. I ran into a lot of pickling issues, since dynamically created classes can't be pickled with pickle, and check_estimator includes tests for this.
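The pickling limitation can be reproduced with the standard library alone. A minimal sketch, assuming a class built at runtime with type() and bound to a variable whose name differs from the class's __name__ (make_wrapper_class, DynamicWrapper and StaticWrapper are hypothetical names, not EvalML code):

```python
import pickle


def make_wrapper_class(name):
    # The class is created at call time. pickle stores classes by
    # module and qualified name, so it must be able to find an attribute
    # with this exact name on the defining module.
    return type(name, (object,), {})


# Bound to "Dynamic", but the class calls itself "DynamicWrapper",
# so looking it up by name on the module fails.
Dynamic = make_wrapper_class("DynamicWrapper")


class StaticWrapper:
    """Ordinary module-level class, reachable by its own name."""

    def __init__(self, value):
        self.value = value


def can_pickle(obj):
    """Return True if pickle.dumps succeeds for obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


dynamic_ok = can_pickle(Dynamic())
static_ok = can_pickle(StaticWrapper(1))
```

In a plain script, instances of the module-level class normally pickle while instances of the dynamically named class do not, which is the kind of failure a pickling check would surface.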

It seems like the wrapper works fine if n_jobs=1, and weird behavior happens otherwise (https://app.circleci.com/pipelines/github/alteryx/evalml/6037/workflows/ebe774d2-77be-42cb-8716-d73b9f382130/jobs/64121). Hence, for the sake of pushing this PR forward, I'll try to investigate this issue and, as a fallback, just use n_jobs=1 for now.

Potentially relevant issue: scikit-learn-contrib/skope-rules#18

codecov · 2020-09-03T20:01:44Z

Codecov Report

Merging #1134 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##             main    #1134    +/-   ##
========================================
  Coverage   99.92%   99.93%            
========================================
  Files         201      207     +6     
  Lines       12511    12898   +387     
========================================
+ Hits        12502    12889   +387     
  Misses          9        9

Impacted Files	Coverage Δ
evalml/exceptions/__init__.py	`100.00% <ø> (ø)`
evalml/tests/component_tests/test_en_classifier.py	`100.00% <ø> (ø)`
evalml/tests/component_tests/test_et_classifier.py	`100.00% <ø> (ø)`
evalml/tests/component_tests/test_et_regressor.py	`100.00% <ø> (ø)`
...alml/tests/component_tests/test_lgbm_classifier.py	`100.00% <ø> (ø)`
evalml/exceptions/exceptions.py	`100.00% <100.00%> (ø)`
evalml/model_family/model_family.py	`100.00% <100.00%> (ø)`
evalml/pipelines/components/__init__.py	`100.00% <100.00%> (ø)`
evalml/pipelines/components/ensemble/__init__.py	`100.00% <100.00%> (ø)`
...lines/components/ensemble/stacked_ensemble_base.py	`100.00% <100.00%> (ø)`
... and 19 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7dcf640...4f5f9f9.

evalml/tests/component_tests/test_components.py

evalml/pipelines/components/ensemble/ensemble_base.py

dsherry · 2020-09-24T16:43:00Z

@angela97lin if there's more work which needs to be done on this impl, could you please close the PR and get a fresh one up when it's ready to merge?

evalml/pipelines/components/ensemble/stacked_ensemble_classifier.py

evalml/pipelines/components/estimators/estimator.py

dsherry

@angela97lin great work, this looks real solid!

The only things I had which were blocking were:

Remove line added to TextFeaturizer
Docstring for WrappedSKClassifier/other in utils
Resolve comment on default_parameters
File bug in epic for the multi-processing error you saw related to n_jobs

Everything else was just small stuff. Right on!

dsherry · 2020-09-28T19:59:58Z

evalml/pipelines/components/ensemble/stacked_ensemble_base.py

+        """Stacked ensemble base class.
+
+        Arguments:
+            input_pipelines (list(PipelineBase or subclass)): List of PipelineBase objects to use as the base estimators.


"List of PipelineBase classes" instead of objects because they're not instances, yes?

Right now we're passing in objects / instances of PipelineBase subclasses :o

Ah, got it, my bad.

evalml/pipelines/components/ensemble/stacked_ensemble_base.py

dsherry · 2020-09-29T01:33:40Z

evalml/pipelines/components/ensemble/stacked_ensemble_base.py

+                - A scikit-learn cross-validation generator object
+                - An iterable yielding (train, test) splits
+            n_jobs (int or None): Non-negative integer describing level of parallelism used for pipelines.
+                None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.


This is supported below -1? That's counterintuitive to me.

Haha yup: https://scikit-learn.org/stable/glossary.html#term-n-jobs
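The rule quoted in the docstring can be written out as a small function. This is a sketch of the formula from the scikit-learn glossary, not joblib's own code; resolve_n_jobs is a hypothetical helper, and both the clamp to at least one worker and the rejection of 0 are assumptions made here for the illustration.

```python
def resolve_n_jobs(n_jobs, n_cpus):
    """Translate an n_jobs setting into a concrete worker count."""
    if n_jobs is None:
        # None and 1 are equivalent.
        return 1
    if n_jobs == 0:
        # Assumption: zero workers is treated as an error.
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 means all CPUs, -2 means all but one, and so on.
        # Clamp so that very negative values still yield one worker.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


# On an 8-CPU machine:
all_cpus = resolve_n_jobs(-1, 8)      # 8 + 1 - 1 = 8
all_but_one = resolve_n_jobs(-2, 8)   # 8 + 1 - 2 = 7
```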

evalml/pipelines/components/ensemble/stacked_ensemble_base.py

dsherry · 2020-09-29T03:03:28Z

evalml/pipelines/components/utils.py

+            return WrappedSKRegressor(evalml_obj)
+        elif evalml_obj.supported_problem_types == [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
+            return WrappedSKClassifier(evalml_obj)
+    raise ValueError("Could not wrap EvalML object in scikit-learn wrapper.")


Two notes:

You can replace is_pipeline with isinstance(evalml_obj, PipelineBase)

This code is great but augh we're still being chased by this issue of estimators not having an explicit problem type!

@dsherry Hmm, doing so will cause a circular dependency (having to import PipelineBase) but I do like this better. For now, will put the import statement in the method.

Ah good point on the circular dep. Now that you mention it, since this is supposed to work for pipelines and components, it should go in evalml/pipelines/utils.py. Then the circular dep wouldn't happen either :)

I don't think you updated the impl but that's ok!

@dsherry Oh, I believe the circular dependency still happens if we put it in evalml/pipelines/utils.py since the method is being used in a component (StackedEnsembleBase). I think this is another case where we should think about separating pipelines from components to avoid the problem!

Also, I currently see on main that from evalml.pipelines.pipeline_base import PipelineBase is called in scikit_learn_wrapped_estimator. Is that different from what you see? :o

Ah, interesting, right... this component now knows about pipelines. Ok, got it, thanks.

Oh lol I was looking for it in this PR. Got it.

Yup! Definitely a bit odd. This works for now but could be indicative of some API considerations we'll need to make.

And cool, glad that everything was in place! Odd that Github doesn't mark it as outdated but I guess the change was not exactly to this block 😂
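The dispatch discussed in this thread can be sketched in isolation. Everything below is a hypothetical stand-in rather than the merged code: ProblemTypes, PipelineBase and the two wrapper classes are redefined locally so the block runs on its own, wrap_for_sklearn is an invented name, and the set comparison (instead of comparing against an ordered list) is a choice made for this sketch.

```python
from enum import Enum


class ProblemTypes(Enum):
    BINARY = "binary"
    MULTICLASS = "multiclass"
    REGRESSION = "regression"


class PipelineBase:
    """Stand-in: a pipeline carries one explicit problem type."""
    problem_type = None


class WrappedSKClassifier:
    def __init__(self, evalml_obj):
        self.evalml_obj = evalml_obj


class WrappedSKRegressor:
    def __init__(self, evalml_obj):
        self.evalml_obj = evalml_obj


CLASSIFICATION = {ProblemTypes.BINARY, ProblemTypes.MULTICLASS}


def wrap_for_sklearn(evalml_obj):
    # In a real package, the PipelineBase import could live here, inside
    # the function body, so it runs at call time instead of at module
    # import time and does not trigger a circular import.
    if isinstance(evalml_obj, PipelineBase):
        problem_types = {evalml_obj.problem_type}
    else:
        # Estimators have no single problem type, only a supported list.
        problem_types = set(getattr(evalml_obj, "supported_problem_types", []))

    if problem_types == {ProblemTypes.REGRESSION}:
        return WrappedSKRegressor(evalml_obj)
    if problem_types and problem_types <= CLASSIFICATION:
        return WrappedSKClassifier(evalml_obj)
    raise ValueError("Could not wrap EvalML object in scikit-learn wrapper.")


class BinaryPipeline(PipelineBase):
    problem_type = ProblemTypes.BINARY


class SomeRegressor:
    supported_problem_types = [ProblemTypes.REGRESSION]


class SomeClassifier:
    # Order differs from [BINARY, MULTICLASS]; the set comparison ignores it.
    supported_problem_types = [ProblemTypes.MULTICLASS, ProblemTypes.BINARY]
```

Comparing sets keeps the dispatch independent of the order in which an estimator lists its supported problem types, which an equality check against a fixed list would not.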

evalml/tests/component_tests/test_components.py

evalml/tests/pipeline_tests/test_pipelines.py

dsherry · 2020-09-29T03:17:17Z

evalml/tests/pipeline_tests/test_pipelines.py

+    elif problem_type == ProblemTypes.MULTICLASS:
+        X, y = X_y_multi
+        base_pipeline_class = MulticlassClassificationPipeline
+        input_pipelines = [make_pipeline_from_components([classifier], problem_type) for classifier in stackable_classifiers]


So this makes a pipeline with a single component, which is the estimator? Where the list of estimators from stackable_classifiers/stackable_regressors includes all accessible estimators? Cool :)

Yup! Maybe overkill but it was my way of checking that any of our accessible classifiers/regressors are allowed as inputs to the stacked ensemble component ahaha

Resolved

angela97lin added 3 commits August 31, 2020 18:19

init

aec7234

init

c1af455

fixing but some tricky things to do

cd7b13b

angela97lin self-assigned this Sep 2, 2020

angela97lin added 3 commits September 3, 2020 11:22

reeee more testing

c95c383

Merge branch 'main' into 1129_ensemble_classes

0267edc

clean up and more test

02d49bc

angela97lin added 8 commits September 3, 2020 17:38

Merge branch 'main' into 1129_ensemble_classes

8842cab

release note

8d497f5

Merge branch 'main' into 1129_ensemble_classes

7ea1594

add test, add broken pipeline sadness

9b625e3

updating estimators to be a kwarg

bdb0c4e

cleanup and remove feature importances

3e76578

add empty hyperparameter ranges

2faf743

merging

eab4524

angela97lin marked this pull request as ready for review September 9, 2020 18:34

angela97lin marked this pull request as draft September 9, 2020 18:34

angela97lin added 4 commits September 9, 2020 15:23

fix some tests, serialization still broken

0a4515c

add separate tests for serialization

1a5eec0

fix more tests

72a3968

revert code

1dd9b22

angela97lin commented Sep 10, 2020

evalml/tests/component_tests/test_components.py (outdated, resolved)

angela97lin commented Sep 10, 2020

evalml/pipelines/components/ensemble/ensemble_base.py (outdated, resolved)

angela97lin requested review from dsherry, freddyaboulton, jeremyliweishih, bchen1116, christopherbunn and eccabay September 10, 2020 14:08

set directly in parameters

4b2efcb

angela97lin added 4 commits September 24, 2020 13:23

revert requirement

ea19650

set n_jobs = 1

a96ec86

fix test

340fa23

Merge branch 'main' into 1129_ensemble_classes

d637f12

angela97lin commented Sep 24, 2020

evalml/pipelines/components/ensemble/stacked_ensemble_classifier.py (resolved)

angela97lin requested review from dsherry, freddyaboulton and eccabay September 24, 2020 19:10

angela97lin added 5 commits September 24, 2020 15:12

add to api ref

aba6e87

Merge branch '1129_ensemble_classes' of github.com:alteryx/evalml int…

9c00abb

…o 1129_ensemble_classes

fix docs by adding custom default parameters impl

2ad139a

fix docstr

bc809cd

Merge branch 'main' into 1129_ensemble_classes

631e5e1

angela97lin commented Sep 26, 2020

evalml/pipelines/components/estimators/estimator.py (resolved)

angela97lin mentioned this pull request Sep 28, 2020

Ensembling #1128

Closed

Merge branch 'main' into 1129_ensemble_classes

7f2374c

dsherry approved these changes Sep 29, 2020

merging

e54984b

angela97lin mentioned this pull request Sep 29, 2020

TerminatedWorker error when running StackedEnsemble components on CircleCI with n_jobs!=1 #1240

Closed

angela97lin added 5 commits September 29, 2020 13:00

some cleanup

631ffa0

minor cleanup for tests and docstring addition

db44af5

linting, merging, cleanup

d7a0aaa

fix new test

92d7dc6

merge and move release notes

4f5f9f9

angela97lin merged commit a985d86 into main Sep 30, 2020

angela97lin deleted the 1129_ensemble_classes branch September 30, 2020 14:56

dsherry mentioned this pull request Oct 29, 2020

Release v0.15.0 #1370

Merged
