
Implement stacked ensemble classes #1134

Merged
angela97lin merged 90 commits into main from 1129_ensemble_classes on Sep 30, 2020

@angela97lin angela97lin commented Sep 2, 2020

Contributor

Closes #1129

@dsherry I did some digging around to see what would be necessary to support passing in our estimators to scikit-learn directly, rather than estimator._component_obj. It required at least:

  • implementing a get_params(deep)
  • adding classes_ and estimator_type

These additions were fine, but the underlying implementation for concatenating the results from each estimator assumes that each estimator's predict method returns an np.array, and it errors out because ours return a pd.DataFrame. As such, I think more work would be needed to enable passing in our estimators directly, so I opted to just pass in _component_obj, as we've done for our other components.


Update: This PR also includes adding a utility method to wrap our pipeline / component classes in a simple scikit-learn estimator instance. Currently, check_estimator(obj) (from the documentation on rolling your own estimator) breaks, so the returned estimators are probably not fully scikit-learn compliant, but they're sufficient in this case. Will put up an issue for this once this is merged :)
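For readers following along, here is a minimal, dependency-free sketch of the kind of wrapper described above. It is not the PR's WrappedSKClassifier: the names ToyEstimator and WrappedClassifierSketch are hypothetical, nothing here subclasses scikit-learn, and the list conversion in predict only stands in for the pd.DataFrame to np.array conversion mentioned earlier.

```python
from collections import Counter


class ToyEstimator:
    """Hypothetical stand-in for an EvalML estimator: always predicts the majority class."""

    def __init__(self):
        self._majority = None

    def fit(self, X, y):
        self._majority = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Returns a tuple on purpose, to mimic a non-array return type.
        return tuple(self._majority for _ in X)


class WrappedClassifierSketch:
    """Hypothetical wrapper exposing the attributes listed above."""

    # scikit-learn's ClassifierMixin sets this same attribute to "classifier".
    _estimator_type = "classifier"

    def __init__(self, wrapped):
        self.wrapped = wrapped

    def get_params(self, deep=True):
        # Constructor arguments by name; `deep` is accepted but unused in this sketch.
        return {"wrapped": self.wrapped}

    def fit(self, X, y):
        self.classes_ = sorted(set(y))  # labels seen during fit
        self.wrapped.fit(X, y)
        return self

    def predict(self, X):
        # Normalize the wrapped object's output to one plain container type.
        return list(self.wrapped.predict(X))


clf = WrappedClassifierSketch(ToyEstimator()).fit([[0], [1], [2]], ["a", "b", "b"])
predictions = clf.predict([[3], [4]])
```

A wrapper like this only satisfies the duck-typed surface; passing check_estimator would require considerably more, as noted above.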

I tried making the objects scikit-learn compliant, but that probably isn't feasible without some big API changes on our end. I ran into a lot of pickling issues, since dynamically created classes can't be pickled using pickle and check_estimator includes tests for this.
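The pickling limitation can be reproduced with the standard library alone. This is a small sketch, assuming the dynamic classes in question are built at runtime (for example with type()) and never bound to a module-level name; make_wrapper_class and DynamicWrapperSketch are hypothetical names.

```python
import pickle


def make_wrapper_class(name):
    # Build a class at runtime; it is never bound to a module-level name.
    return type(name, (object,), {})


DynamicCls = make_wrapper_class("DynamicWrapperSketch")

try:
    pickle.dumps(DynamicCls())
    pickled = True
except (pickle.PicklingError, AttributeError) as err:
    # pickle stores classes by reference (module plus qualified name),
    # so a class it cannot look up by name cannot be serialized.
    pickled = False
    failure = type(err).__name__
```

Because pickle records classes by importable name rather than by value, the lookup for DynamicWrapperSketch fails and the dump is rejected.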

It seems like the wrapper works fine if n_jobs=1, and weird behavior happens otherwise (https://app.circleci.com/pipelines/github/alteryx/evalml/6037/workflows/ebe774d2-77be-42cb-8716-d73b9f382130/jobs/64121). Hence, for the sake of pushing this PR forward, I'll try to investigate this issue and as a fallback, just use n_jobs=1 for now.

Potentially relevant issue: scikit-learn-contrib/skope-rules#18

@angela97lin angela97lin self-assigned this Sep 2, 2020

codecov Bot commented Sep 3, 2020

Codecov Report

Merging #1134 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##             main    #1134    +/-   ##
========================================
  Coverage   99.92%   99.93%            
========================================
  Files         201      207     +6     
  Lines       12511    12898   +387     
========================================
+ Hits        12502    12889   +387     
  Misses          9        9            
Impacted Files                                          Coverage Δ
evalml/exceptions/__init__.py                           100.00% <ø> (ø)
evalml/tests/component_tests/test_en_classifier.py      100.00% <ø> (ø)
evalml/tests/component_tests/test_et_classifier.py      100.00% <ø> (ø)
evalml/tests/component_tests/test_et_regressor.py       100.00% <ø> (ø)
...alml/tests/component_tests/test_lgbm_classifier.py   100.00% <ø> (ø)
evalml/exceptions/exceptions.py                         100.00% <100.00%> (ø)
evalml/model_family/model_family.py                     100.00% <100.00%> (ø)
evalml/pipelines/components/__init__.py                 100.00% <100.00%> (ø)
evalml/pipelines/components/ensemble/__init__.py        100.00% <100.00%> (ø)
...lines/components/ensemble/stacked_ensemble_base.py   100.00% <100.00%> (ø)
... and 19 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7dcf640...4f5f9f9.

@angela97lin angela97lin marked this pull request as ready for review September 9, 2020 18:34
@angela97lin angela97lin marked this pull request as draft September 9, 2020 18:34
Comment thread evalml/tests/component_tests/test_components.py Outdated
Comment thread evalml/pipelines/components/ensemble/ensemble_base.py Outdated
dsherry commented Sep 24, 2020

Contributor

@angela97lin if there's more work which needs to be done on this impl, could you please close the PR and get a fresh one up when it's ready to merge?

Comment thread evalml/pipelines/components/ensemble/stacked_ensemble_classifier.py
Comment thread evalml/pipelines/components/estimators/estimator.py
@angela97lin angela97lin mentioned this pull request Sep 28, 2020

@dsherry dsherry left a comment

Contributor

@angela97lin great work, this looks real solid!

The only things I had which were blocking were:

  • Remove line added to TextFeaturizer
  • Docstring for WrappedSKClassifier/other in utils
  • Resolve comment on default_parameters
  • File bug in epic for the multi-processing error you saw related to n_jobs

Everything else was just small stuff. Right on!

"""Stacked ensemble base class.

Arguments:
input_pipelines (list(PipelineBase or subclass)): List of PipelineBase objects to use as the base estimators.

Contributor

"List of PipelineBase classes" instead of objects because they're not instances, yes?

Contributor Author

Right now we're passing in objects / instances of PipelineBase subclasses :o

Contributor

Ah, got it, my bad.

Comment thread evalml/pipelines/components/ensemble/stacked_ensemble_base.py
- A scikit-learn cross-validation generator object
- An iterable yielding (train, test) splits
n_jobs (int or None): Non-negative integer describing level of parallelism used for pipelines.
None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.
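The rule in this docstring can be sketched as a small helper. This is only an illustration of the documented arithmetic, not code from the PR: resolve_n_jobs is a hypothetical name, and both the clamp to at least one worker and the rejection of 0 are assumptions.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Return the worker count implied by the docstring's n_jobs rule (sketch)."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs is None:
        return 1  # None and 1 are equivalent
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")  # assumption
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on
        return max(n_cpus + 1 + n_jobs, 1)  # clamp is an assumption
    return n_jobs


examples = {n: resolve_n_jobs(n, n_cpus=8) for n in (None, 1, 4, -1, -2, -3)}
```

With eight CPUs, -1 maps to 8 workers and -2 maps to 7, which is the sense in which values below -1 are supported.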

Contributor

This is supported below -1? That's counterintuitive to me.

Contributor Author

Comment thread evalml/pipelines/components/ensemble/stacked_ensemble_base.py
Comment thread evalml/pipelines/components/ensemble/stacked_ensemble_base.py Outdated
        return WrappedSKRegressor(evalml_obj)
    elif evalml_obj.supported_problem_types == [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
        return WrappedSKClassifier(evalml_obj)
raise ValueError("Could not wrap EvalML object in scikit-learn wrapper.")

Contributor

Two notes:

  1. You can replace is_pipeline with isinstance(evalml_obj, PipelineBase)
  2. This code is great but augh we're still being chased by this issue of estimators not having an explicit problem type!

@angela97lin angela97lin Sep 29, 2020

Contributor Author

@dsherry Hmm, doing so will cause a circular dependency (having to import PipelineBase) but I do like this better. For now, will put the import statement in the method.

Contributor

Ah good point on the circular dep. Now that you mention it, since this is supposed to work for pipelines and components, it should go in evalml/pipelines/utils.py. Then the circular dep wouldn't happen either :)

Contributor

I don't think you updated the impl but that's ok!

Contributor Author

@dsherry Oh, I believe the circular dependency still happens if we put it in evalml/pipelines/utils.py since the method is being used in a component (StackedEnsembleBase). I think this is another case where we should think about separating pipelines from components to avoid the problem!

Also, I currently see on main that from evalml.pipelines.pipeline_base import PipelineBase is called in scikit_learn_wrapped_estimator. Is that different from what you see? :o

Contributor

Ah, interesting, right... this component now knows about pipelines. Ok, got it, thanks.

Oh lol I was looking for it in this PR. Got it.

Contributor Author

Yup! Definitely a bit odd. This works for now but could be indicative of some API considerations we'll need to make.

And cool, glad that everything was in place! Odd that Github doesn't mark it as outdated but I guess the change was not exactly to this block 😂
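The function-local import discussed in this thread can be illustrated with two throwaway modules. This is a sketch under stated assumptions: sketch_pipelines and sketch_components are hypothetical module names written to a temporary directory, standing in for the pipeline and component modules, and is_pipeline is a hypothetical helper rather than the PR's utility.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module sources; the pipelines module imports the components module.
PIPELINES_SRC = '''
from sketch_components import StackedEnsembleSketch  # pipelines depend on components


class PipelineBase:
    pass
'''

COMPONENTS_SRC = '''
def is_pipeline(obj):
    # Function-local import: resolved at call time, after both modules have loaded.
    from sketch_pipelines import PipelineBase
    return isinstance(obj, PipelineBase)


class StackedEnsembleSketch:
    pass
'''

module_dir = Path(tempfile.mkdtemp())
(module_dir / "sketch_pipelines.py").write_text(PIPELINES_SRC)
(module_dir / "sketch_components.py").write_text(COMPONENTS_SRC)
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

pipelines = importlib.import_module("sketch_pipelines")
components = importlib.import_module("sketch_components")

wrapped_is_pipeline = components.is_pipeline(pipelines.PipelineBase())
other_is_pipeline = components.is_pipeline(object())
```

If the import inside is_pipeline were moved to the top of sketch_components, importing sketch_pipelines first would fail, because PipelineBase would not yet be defined when the components module asks for it.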

Comment thread evalml/tests/component_tests/test_components.py Outdated
Comment thread evalml/tests/pipeline_tests/test_pipelines.py Outdated
Comment thread evalml/tests/pipeline_tests/test_pipelines.py
elif problem_type == ProblemTypes.MULTICLASS:
X, y = X_y_multi
base_pipeline_class = MulticlassClassificationPipeline
input_pipelines = [make_pipeline_from_components([classifier], problem_type) for classifier in stackable_classifiers]

Contributor

So this makes a pipeline with a single component, which is the estimator? Where the list of estimators from stackable_classifiers/stackable_regressors includes all accessible estimators? Cool :)

Contributor Author

Yup! Maybe overkill but it was my way of checking that any of our accessible classifiers/regressors are allowed as inputs to the stacked ensemble component ahaha

@angela97lin angela97lin merged commit a985d86 into main Sep 30, 2020
@angela97lin angela97lin deleted the 1129_ensemble_classes branch September 30, 2020 14:56
@dsherry dsherry mentioned this pull request Oct 29, 2020

Development

Successfully merging this pull request may close these issues.

Implement stacked ensemble classifier and regressor classes

4 participants