Integrate ComponentGraphs into Pipelines#1543
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1543 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 240 240
Lines 17677 18030 +353
=========================================
+ Hits 17669 18022 +353
Misses 8 8
Continue to review full report at Codecov.
|
| # TODO: Does this make sense | ||
| return ModelFamily.ENSEMBLE |
There was a problem hiding this comment.
I'm not sure what the best way to determine a model family for non-linear pipelines would be. The main problems:
- A non-linear pipeline can have multiple estimators, should the model family reflect all of them or just the final one?
- Since this is a class property, we only have access to the dictionary describing the pipeline, which may not be in any sort of computation order. Without the instantiated ComponentGraph object, we have no clean way of knowing which estimator is the final one
Since this is ambiguous, I've opted to bundle these all under the ensemble model family but I'm looking for input on this!
There was a problem hiding this comment.
Interesting question! These are some of the places where we call problem_type at the pipeline level:
- To skip cv for ensemble pipelines in
_compute_cv_scoresin AutoMLSearch. - To set allowed_model_families in AutoMLSearch after pipelines are created
- To describe a pipeline.
- To check which pipelines are created by automl in tests
- To check the input pipeline has a decision tree as a final estimator in our model understanding utils
- To check we don't pass in baseline pipelines for partial dependence
- To determine which pipelines to add to our ensemble in IterativeAlgorithm
- To determine how the pipeline scores should be saved in IterativeAlgorithm
In all of these use cases, what's intended is to get the final estimator/component model family. It's interesting that we always access model_family for the instance, not the class. I think we can avoid the computation order problem you bring up by making this a property instead.
I think we should fix this before merge. I don't think returning ModelFamily.Ensemble is a good idea because in our codebase, that typically means a StackedEnsembleClassifier or StackedEnsembleRegressor is in the pipeline, which isn't guaranteed to be the case here.
There was a problem hiding this comment.
just talked to @eccabay and we explored two options:
- Instantiating the
component_graphhere and getting the order - Moving
ComponentGraph.compute_orderto a static method and calling it here
Number two looks like a good compromise for now.
freddyaboulton
left a comment
There was a problem hiding this comment.
@eccabay Looks good! I have a couple of comments on the implementation/design.
What's the plan for adding support for non-linear pipelines in make_pipeline_from_components ?
| # TODO: Does this make sense | ||
| return ModelFamily.ENSEMBLE |
There was a problem hiding this comment.
Interesting question! These are some of the places where we call problem_type at the pipeline level:
- To skip cv for ensemble pipelines in
_compute_cv_scoresin AutoMLSearch. - To set allowed_model_families in AutoMLSearch after pipelines are created
- To describe a pipeline.
- To check which pipelines are created by automl in tests
- To check the input pipeline has a decision tree as a final estimator in our model understanding utils
- To check we don't pass in baseline pipelines for partial dependence
- To determine which pipelines to add to our ensemble in IterativeAlgorithm
- To determine how the pipeline scores should be saved in IterativeAlgorithm
In all of these use cases, what's intended is to get the final estimator/component model family. It's interesting that we always access model_family for the instance, not the class. I think we can avoid the computation order problem you bring up by making this a property instead.
I think we should fix this before merge. I don't think returning ModelFamily.Ensemble is a good idea because in our codebase, that typically means a StackedEnsembleClassifier or StackedEnsembleRegressor is in the pipeline, which isn't guaranteed to be the case here.
| est = RandomForestClassifier(random_state=7) | ||
| pipeline = make_pipeline_from_components([imp, est], ProblemTypes.BINARY, custom_name='My Pipeline', | ||
| random_state=15) | ||
| assert [c.__class__ for c in pipeline.component_graph] == [Imputer, RandomForestClassifier] |
There was a problem hiding this comment.
I think this highlights a subtle breaking change. Before component_graph would be populated with instances of the components after init. Now component_graph will always be the "static" graph definition (no instances). Is there a reason the instances are kept private as _component_graph?
If there is a reason to keep the instances separate from the "static" graph, I'd prefer if the instances were kept public like before. In that case, we can rename _component_graph to component_instances or something else?
There was a problem hiding this comment.
This is a really interesting point. The challenge here is that the instances are contained within the ComponentGraph object, which we're keeping completely private from users. It may be possible to add a separate public reference to the instances directly from the pipeline, but I worry that between this and the linearized_component_graph classproperty we discussed earlier, we're adding a fair amount of redundant information that's only there to maintain the current status quo.
What do you think?
There was a problem hiding this comment.
I see what you mean about wanting to keep ComponentGraph private from the end user. I think you're right that we shouldn't add another accessor for the instances. Maybe what we should do is just make iterating over/indexing the pipeline instead of the pipeline.component_graph the "preferred" public accessor for the component instances. The benefit is that this functionality is already built into your PR!
For example, the line we're commenting on would turn to [c.__class__ for c in pipeline]. And accessing individual components can be done with pipeline[index]
This would require updating our docs to not access instances with pipeline.component_graph as well as our unit tests.
What do you think? If you are on board, we could also leave this for another issue since this diff is large already.
There was a problem hiding this comment.
I like this idea, I think it makes a lot of sense! I do agree it's better fit for a separate issue though haha, so I can file that if other people also think this would be beneficial!
Closes #1278
This definitely needs more tests but at this point I'm not sure what would be the best to add, so I'm very open to suggestions!
The most up to date design doc