Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement stacked ensemble classes #1134

Merged
merged 90 commits into from Sep 30, 2020
Merged

Implement stacked ensemble classes #1134

merged 90 commits into from Sep 30, 2020

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Sep 2, 2020

Closes #1129

@dsherry I did some digging around to see what would be necessary to support passing in our estimators to scikit-learn directly, rather than estimator._component_obj. It required at least:

  • implementing a get_params(deep)
  • adding classes_ and estimator_type

These additions were fine, but the underlying implementation for concatenating the results from each of the estimators assumed that the predict method for each estimator returned an np.array and errors out since we pass in pd.DataFrame. As such, I think there would need to be more work done to enable passing in our estimators and decided to opt for just passing in _component.obj, just as we've done for our other components.


Update: This PR also includes adding a utility method to wrap our pipeline / component classes in a simple scikit-learn estimator instance. Currently, check_estimator(obj) (from documentation on rolling your own estimator breaks, so the returned estimators are probably not fully scikit-learn compliant, but they're sufficient in this case. Will put up an issue for this once this is merged :)

I tried making the objects scikit-learn compliant, but it is probably not really plausible without some big API changes on our end. I ran into a lot of pickling issues, since dynamic classes can't be pickled using pickle and there are tests in check_estimator for this.

It seems like the wrapper works fine if n_jobs=1, and weird behavior happens otherwise (https://app.circleci.com/pipelines/github/alteryx/evalml/6037/workflows/ebe774d2-77be-42cb-8716-d73b9f382130/jobs/64121). Hence, for the sake of pushing this PR forward, I'll try to investigate this issue and as a fallback, just use n_jobs=1 for now.

Potentially relevant issue: scikit-learn-contrib/skope-rules#18

@angela97lin angela97lin self-assigned this Sep 2, 2020
@codecov
Copy link

codecov bot commented Sep 3, 2020

Codecov Report

Merging #1134 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##             main    #1134    +/-   ##
========================================
  Coverage   99.92%   99.93%            
========================================
  Files         201      207     +6     
  Lines       12511    12898   +387     
========================================
+ Hits        12502    12889   +387     
  Misses          9        9            
Impacted Files Coverage Δ
evalml/exceptions/__init__.py 100.00% <ø> (ø)
evalml/tests/component_tests/test_en_classifier.py 100.00% <ø> (ø)
evalml/tests/component_tests/test_et_classifier.py 100.00% <ø> (ø)
evalml/tests/component_tests/test_et_regressor.py 100.00% <ø> (ø)
...alml/tests/component_tests/test_lgbm_classifier.py 100.00% <ø> (ø)
evalml/exceptions/exceptions.py 100.00% <100.00%> (ø)
evalml/model_family/model_family.py 100.00% <100.00%> (ø)
evalml/pipelines/components/__init__.py 100.00% <100.00%> (ø)
evalml/pipelines/components/ensemble/__init__.py 100.00% <100.00%> (ø)
...lines/components/ensemble/stacked_ensemble_base.py 100.00% <100.00%> (ø)
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7dcf640...4f5f9f9. Read the comment docs.

@angela97lin angela97lin marked this pull request as ready for review September 9, 2020 18:34
@angela97lin angela97lin marked this pull request as draft September 9, 2020 18:34
@dsherry
Copy link
Collaborator

dsherry commented Sep 24, 2020

@angela97lin if there's more work which needs to be done on this impl, could you please close the PR and get a fresh one up when its ready to merge?

@angela97lin angela97lin mentioned this pull request Sep 28, 2020
Copy link
Collaborator

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin great work, this looks real solid!

The only things I had which were blocking were:

  • Remove line added to TextFeaturizer
  • Docstring for WrappedSKClassifier/other in utils
  • Resolve comment on default_parameters
  • File bug in epic for the multi-processing error you saw related to n_jobs

Everything else was just small stuff. Right on!

"""Stacked ensemble base class.

Arguments:
input_pipelines (list(PipelineBase or subclass)): List of PipelineBase objects to use as the base estimators.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"List of PipelineBase classes" instead of objects because they're not instances, yes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we're passing in objects / instances of PipelineBase subclasses :o

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it, my bad.

- An scikit-learn cross-validation generator object
- An iterable yielding (train, test) splits
n_jobs (int or None): Non-negative integer describing level of parallelism used for pipelines.
None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supported below -1? That's counterintuitive to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return WrappedSKRegressor(evalml_obj)
elif evalml_obj.supported_problem_types == [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
return WrappedSKClassifier(evalml_obj)
raise ValueError("Could not wrap EvalML object in scikit-learn wrapper.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two notes:

  1. You can replace is_pipeline with isinstance(evalml_obj, PipelineBase)
  2. This code is great but augh we're still being chased by this issue of estimators not having an explicit problem type!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry Hmm, doing so will cause a circular dependency (having to import PipelineBase) but I do like this better. For now, will put the import statement in the method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point on the circular dep. Now that you mention it, since this is supposed to work for pipelines and components, it should go in evalml/pipelines/utils.py. Then the circular dep wouldn't happen either :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you updated the impl but that's ok!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry Oh, I believe the circular dependency still happens if we put it in evalml/pipelines/utils.py since the method is being used in a component (StackedEnsembleBase). I think this is another case where we should think about separating pipelines from components to avoid the problem!

Also, I currently see on main that from evalml.pipelines.pipeline_base import PipelineBase is called in scikit_learn_wrapped_estimator. Is that different from what you see? :o

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, interesting, right... this component now knows about pipelines. Ok, got it, thanks.

Oh lol I was looking for it in this PR. Got it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! Definitely a bit odd. This works for now but could be indicative of some API considerations we'll need to make.

And cool, glad that everything was in place! Odd that Github doesn't mark it as outdated but I guess the change was not exactly to this block 😂

evalml/tests/component_tests/test_components.py Outdated Show resolved Hide resolved
evalml/tests/pipeline_tests/test_pipelines.py Outdated Show resolved Hide resolved
evalml/tests/pipeline_tests/test_pipelines.py Show resolved Hide resolved
elif problem_type == ProblemTypes.MULTICLASS:
X, y = X_y_multi
base_pipeline_class = MulticlassClassificationPipeline
input_pipelines = [make_pipeline_from_components([classifier], problem_type) for classifier in stackable_classifiers]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this makes a pipeline with a single component, which is the estimator? Where the list of estimators from stackable_classifiers/stackable_regressors includes all accessible estimators? Cool :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! Maybe overkill but it was my way of checking that any of our accessible classifiers/regressors are allowed as inputs to the stacked ensemble component ahaha

@angela97lin angela97lin merged commit a985d86 into main Sep 30, 2020
@angela97lin angela97lin deleted the 1129_ensemble_classes branch September 30, 2020 14:56
@dsherry dsherry mentioned this pull request Oct 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement stacked ensemble classifier and regressor classes
4 participants