-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement stacked ensemble classes #1134
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1134 +/- ##
========================================
Coverage 99.92% 99.93%
========================================
Files 201 207 +6
Lines 12511 12898 +387
========================================
+ Hits 12502 12889 +387
Misses 9 9
Continue to review full report at Codecov.
|
@angela97lin if there's more work which needs to be done on this impl, could you please close the PR and get a fresh one up when its ready to merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin great work, this looks real solid!
The only things I had which were blocking were:
- Remove line added to
TextFeaturizer
- Docstring for WrappedSKClassifier/other in utils
- Resolve comment on
default_parameters
- File bug in epic for the multi-processing error you saw related to
n_jobs
Everything else was just small stuff. Right on!
"""Stacked ensemble base class. | ||
|
||
Arguments: | ||
input_pipelines (list(PipelineBase or subclass)): List of PipelineBase objects to use as the base estimators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"List of PipelineBase classes" instead of objects because they're not instances, yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we're passing in objects / instances of PipelineBase subclasses :o
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it, my bad.
- An scikit-learn cross-validation generator object | ||
- An iterable yielding (train, test) splits | ||
n_jobs (int or None): Non-negative integer describing level of parallelism used for pipelines. | ||
None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is supported below -1? That's counterintuitive to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return WrappedSKRegressor(evalml_obj) | ||
elif evalml_obj.supported_problem_types == [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]: | ||
return WrappedSKClassifier(evalml_obj) | ||
raise ValueError("Could not wrap EvalML object in scikit-learn wrapper.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two notes:
- You can replace
is_pipeline
withisinstance(evalml_obj, PipelineBase)
- This code is great but augh we're still being chased by this issue of estimators not having an explicit problem type!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsherry Hmm, doing so will cause a circular dependency (having to import PipelineBase
) but I do like this better. For now, will put the import statement in the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good point on the circular dep. Now that you mention it, since this is supposed to work for pipelines and components, it should go in evalml/pipelines/utils.py
. Then the circular dep wouldn't happen either :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you updated the impl but that's ok!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsherry Oh, I believe the circular dependency still happens if we put it in evalml/pipelines/utils.py
since the method is being used in a component (StackedEnsembleBase
). I think this is another case where we should think about separating pipelines from components to avoid the problem!
Also, I currently see on main that from evalml.pipelines.pipeline_base import PipelineBase
is called in scikit_learn_wrapped_estimator
. Is that different from what you see? :o
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, interesting, right... this component now knows about pipelines. Ok, got it, thanks.
Oh lol I was looking for it in this PR. Got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup! Definitely a bit odd. This works for now but could be indicative of some API considerations we'll need to make.
And cool, glad that everything was in place! Odd that Github doesn't mark it as outdated but I guess the change was not exactly to this block 😂
elif problem_type == ProblemTypes.MULTICLASS: | ||
X, y = X_y_multi | ||
base_pipeline_class = MulticlassClassificationPipeline | ||
input_pipelines = [make_pipeline_from_components([classifier], problem_type) for classifier in stackable_classifiers] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this makes a pipeline with a single component, which is the estimator? Where the list of estimators from stackable_classifiers
/stackable_regressors
includes all accessible estimators? Cool :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup! Maybe overkill but it was my way of checking that any of our accessible classifiers/regressors are allowed as inputs to the stacked ensemble component ahaha
Closes #1129
@dsherry I did some digging around to see what would be necessary to support passing in our estimators to scikit-learn directly, rather than
estimator._component_obj
. It required at least:get_params(deep)
classes_
andestimator_type
These additions were fine, but the underlying implementation for concatenating the results from each of the estimators assumed that the
predict
method for each estimator returned annp.array
and errors out since we pass inpd.DataFrame
. As such, I think there would need to be more work done to enable passing in our estimators and decided to opt for just passing in_component.obj
, just as we've done for our other components.Update: This PR also includes adding a utility method to wrap our pipeline / component classes in a simple scikit-learn estimator instance. Currently,
check_estimator(obj)
(from documentation on rolling your own estimator breaks, so the returned estimators are probably not fully scikit-learn compliant, but they're sufficient in this case. Will put up an issue for this once this is merged :)I tried making the objects scikit-learn compliant, but it is probably not really plausible without some big API changes on our end. I ran into a lot of pickling issues, since dynamic classes can't be pickled using
pickle
and there are tests incheck_estimator
for this.It seems like the wrapper works fine if
n_jobs=1
, and weird behavior happens otherwise (https://app.circleci.com/pipelines/github/alteryx/evalml/6037/workflows/ebe774d2-77be-42cb-8716-d73b9f382130/jobs/64121). Hence, for the sake of pushing this PR forward, I'll try to investigate this issue and as a fallback, just usen_jobs=1
for now.Potentially relevant issue: scikit-learn-contrib/skope-rules#18