-
Notifications
You must be signed in to change notification settings - Fork 89
Integrate ComponentGraphs into Pipelines #1543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1543 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 240 240
Lines 17677 18030 +353
=========================================
+ Hits 17669 18022 +353
Misses 8 8
Continue to review full report at Codecov.
|
evalml/pipelines/pipeline_base.py
Outdated
# TODO: Does this make sense | ||
return ModelFamily.ENSEMBLE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the best way to determine a model family for non-linear pipelines would be. The main problems:
- A non-linear pipeline can have multiple estimators, should the model family reflect all of them or just the final one?
- Since this is a class property, we only have access to the dictionary describing the pipeline, which may not be in any sort of computation order. Without the instantiated ComponentGraph object, we have no clean way of knowing which estimator is the final one
Since this is ambiguous, I've opted to bundle these all under the ensemble model family but I'm looking for input on this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting question! These are some of the places where we call problem_type
at the pipeline level:
- To skip cv for ensemble pipelines in
_compute_cv_scores
in AutoMLSearch. - To set allowed_model_families in AutoMLSearch after pipelines are created
- To describe a pipeline.
- To check which pipelines are created by automl in tests
- To check the input pipeline has a decision tree as a final estimator in our model understanding utils
- To check we don't pass in baseline pipelines for partial dependence
- To determine which pipelines to add to our ensemble in IterativeAlgorithm
- To determine how the pipeline scores should be saved in IterativeAlgorithm
In all of these use cases, what's intended is to get the final estimator/component model family. It's interesting that we always access model_family
for the instance, not the class. I think we can avoid the computation order problem you bring up by making this a property
instead.
I think we should fix this before merge. I don't think returning ModelFamily.Ensemble
is a good idea because in our codebase, that typically means a StackedEnsembleClassifier
or StackedEnsembleRegressor
is in the pipeline, which isn't guaranteed to be the case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just talked to @eccabay and we explored two options:
- Instantiating the
component_graph
here and getting the order - Moving
ComponentGraph.compute_order
to a static method and calling it here
Number two looks like a good compromise for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay Looks good! I have a couple of comments on the implementation/design.
What's the plan for adding support for non-linear pipelines in make_pipeline_from_components
?
evalml/pipelines/pipeline_base.py
Outdated
# TODO: Does this make sense | ||
return ModelFamily.ENSEMBLE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting question! These are some of the places where we call problem_type
at the pipeline level:
- To skip cv for ensemble pipelines in
_compute_cv_scores
in AutoMLSearch. - To set allowed_model_families in AutoMLSearch after pipelines are created
- To describe a pipeline.
- To check which pipelines are created by automl in tests
- To check the input pipeline has a decision tree as a final estimator in our model understanding utils
- To check we don't pass in baseline pipelines for partial dependence
- To determine which pipelines to add to our ensemble in IterativeAlgorithm
- To determine how the pipeline scores should be saved in IterativeAlgorithm
In all of these use cases, what's intended is to get the final estimator/component model family. It's interesting that we always access model_family
for the instance, not the class. I think we can avoid the computation order problem you bring up by making this a property
instead.
I think we should fix this before merge. I don't think returning ModelFamily.Ensemble
is a good idea because in our codebase, that typically means a StackedEnsembleClassifier
or StackedEnsembleRegressor
is in the pipeline, which isn't guaranteed to be the case here.
@@ -344,8 +344,8 @@ def test_make_pipeline_from_components(X_y_binary, logistic_regression_binary_pi | |||
est = RandomForestClassifier(random_state=7) | |||
pipeline = make_pipeline_from_components([imp, est], ProblemTypes.BINARY, custom_name='My Pipeline', | |||
random_state=15) | |||
assert [c.__class__ for c in pipeline.component_graph] == [Imputer, RandomForestClassifier] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this highlights a subtle breaking change. Before component_graph
would be populated with instances of the components after init. Now component_graph
will always be the "static" graph definition (no instances). Is there a reason the instances are kept private as _component_graph
?
If there is a reason to keep the instances separate from the "static" graph, I'd prefer if the instances were kept public like before. In that case, we can rename _component_graph
to component_instances
or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really interesting point. The challenge here is that the instances are contained within the ComponentGraph
object, which we're keeping completely private from users. It may be possible to add a separate public reference to the instances directly from the pipeline, but I worry that between this and the linearized_component_graph
classproperty we discussed earlier, we're adding a fair amount of redundant information that's only there to maintain the current status quo.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean about wanting to keep ComponentGraph
private from the end user. I think you're right that we shouldn't add another accessor for the instances. Maybe what we should do is just make iterating over/indexing the pipeline instead of the pipeline.component_graph the "preferred" public accessor for the component instances. The benefit is that this functionality is already built into your PR!
For example, the line we're commenting on would turn to [c.__class__ for c in pipeline]
. And accessing individual components can be done with pipeline[index]
This would require updating our docs to not access instances with pipeline.component_graph
as well as our unit tests.
What do you think? If you are on board, we could also leave this for another issue since this diff is large already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea, I think it makes a lot of sense! I do agree it's better fit for a separate issue though haha, so I can file that if other people also think this would be beneficial!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed issue #1571
Closes #1278
This definitely needs more tests but at this point I'm not sure what would be the best to add, so I'm very open to suggestions!
The most up to date design doc