Integrate ComponentGraphs into Pipelines #1543

merged 24 commits into from Dec 18, 2020

@eccabay eccabay commented Dec 10, 2020

Closes #1278

This definitely needs more tests, but at this point I'm not sure which ones would be best to add, so I'm very open to suggestions!

The most up to date design doc

codecov bot commented Dec 10, 2020

Codecov Report

Merging #1543 (29a8a88) into main (162992d) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1543     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         240      240             
  Lines       17677    18030    +353     
=========================================
+ Hits        17669    18022    +353     
  Misses          8        8             
Impacted Files Coverage Δ
...lml/automl/automl_algorithm/iterative_algorithm.py 100.0% <100.0%> (ø)
evalml/model_family/model_family.py 100.0% <100.0%> (ø)
evalml/pipelines/binary_classification_pipeline.py 100.0% <100.0%> (ø)
evalml/pipelines/classification_pipeline.py 100.0% <100.0%> (ø)
evalml/pipelines/component_graph.py 100.0% <100.0%> (ø)
evalml/pipelines/pipeline_base.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 100.0% <100.0%> (ø)
.../automl_tests/test_automl_search_classification.py 100.0% <100.0%> (ø)
...ests/automl_tests/test_automl_search_regression.py 100.0% <100.0%> (ø)
...lml/tests/automl_tests/test_iterative_algorithm.py 100.0% <100.0%> (ø)
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines 272 to 273
# TODO: Does this make sense
return ModelFamily.ENSEMBLE
Contributor Author

I'm not sure what the best way to determine a model family for non-linear pipelines would be. The main problems:

  • A non-linear pipeline can have multiple estimators. Should the model family reflect all of them or just the final one?
  • Since this is a class property, we only have access to the dictionary describing the pipeline, which may not be in any sort of computation order. Without the instantiated ComponentGraph object, we have no clean way of knowing which estimator is the final one.

Since this is ambiguous, I've opted to bundle all of these under the ensemble model family, but I'm looking for input on this!

Contributor

Interesting question! These are some of the places where we call model_family at the pipeline level:

  1. To skip cv for ensemble pipelines in _compute_cv_scores in AutoMLSearch.
  2. To set allowed_model_families in AutoMLSearch after pipelines are created.
  3. To describe a pipeline.
  4. To check which pipelines are created by automl in tests.
  5. To check the input pipeline has a decision tree as a final estimator in our model understanding utils.
  6. To check we don't pass in baseline pipelines for partial dependence.
  7. To determine which pipelines to add to our ensemble in IterativeAlgorithm.
  8. To determine how the pipeline scores should be saved in IterativeAlgorithm.

In all of these use cases, what's intended is to get the final estimator/component model family. It's interesting that we always access model_family for the instance, not the class. I think we can avoid the computation order problem you bring up by making this a property instead.

I think we should fix this before merge. I don't think returning ModelFamily.ENSEMBLE is a good idea because in our codebase, that typically means a StackedEnsembleClassifier or StackedEnsembleRegressor is in the pipeline, which isn't guaranteed to be the case here.

Contributor

Just talked to @eccabay and we explored two options:

  1. Instantiating the component_graph here and getting the order
  2. Moving ComponentGraph.compute_order to a static method and calling it here

Number two looks like a good compromise for now.
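
To illustrate option two, here is a minimal sketch of how a computation order could be derived from the graph dictionary alone, without instantiating any components. It is not the implementation in this PR: the function name `compute_order`, the dictionary layout (component first, then parent references such as "Imputer.x"), and the use of plain strings in place of component classes are all assumptions made for illustration.

```python
from collections import deque


def compute_order(graph):
    """Return component names in a valid computation order (Kahn's algorithm).

    Assumed layout: `graph` maps each component name to a list whose first
    entry is the component and whose remaining entries are parent references,
    optionally suffixed with ".x" or ".y".
    """
    # Strip the ".x"/".y" suffix to recover each parent's name.
    parents = {
        name: [ref.split(".")[0] for ref in info[1:]]
        for name, info in graph.items()
    }
    remaining = {name: len(deps) for name, deps in parents.items()}
    children = {name: [] for name in graph}
    for name, deps in parents.items():
        for dep in deps:
            children[dep].append(name)

    # Start from the components that have no parents.
    ready = deque(name for name, count in remaining.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for child in children[name]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if len(order) != len(graph):
        raise ValueError("component graph contains a cycle")
    return order


# Hypothetical graph, deliberately listed out of computation order.
example_graph = {
    "Logistic Regression": ["Logistic Regression Classifier", "Random Forest", "Elastic Net"],
    "Random Forest": ["Random Forest Classifier", "One Hot Encoder.x"],
    "Elastic Net": ["Elastic Net Classifier", "One Hot Encoder.x"],
    "One Hot Encoder": ["One Hot Encoder", "Imputer.x"],
    "Imputer": ["Imputer"],
}

order = compute_order(example_graph)
final_component = order[-1]
```

Because a function like this only needs the dictionary, a class-level model_family could look up the last name in the order and report that component's model family.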

@freddyaboulton freddyaboulton left a comment

@eccabay Looks good! I have a couple of comments on the implementation/design.

What's the plan for adding support for non-linear pipelines in make_pipeline_from_components?

@@ -344,8 +344,8 @@ def test_make_pipeline_from_components(X_y_binary, logistic_regression_binary_pi
est = RandomForestClassifier(random_state=7)
pipeline = make_pipeline_from_components([imp, est], ProblemTypes.BINARY, custom_name='My Pipeline',
random_state=15)
assert [c.__class__ for c in pipeline.component_graph] == [Imputer, RandomForestClassifier]
Contributor

I think this highlights a subtle breaking change. Before, component_graph would be populated with instances of the components after init. Now component_graph will always be the "static" graph definition (no instances). Is there a reason the instances are kept private as _component_graph?

If there is a reason to keep the instances separate from the "static" graph, I'd prefer that the instances stay public like before. In that case, could we rename _component_graph to component_instances or something similar?

Contributor Author

This is a really interesting point. The challenge here is that the instances are contained within the ComponentGraph object, which we're keeping completely private from users. It may be possible to add a separate public reference to the instances directly from the pipeline, but I worry that between this and the linearized_component_graph classproperty we discussed earlier, we're adding a fair amount of redundant information that's only there to maintain the current status quo.

What do you think?

Contributor

I see what you mean about wanting to keep ComponentGraph private from the end user. I think you're right that we shouldn't add another accessor for the instances. Maybe what we should do is just make iterating over/indexing the pipeline instead of the pipeline.component_graph the "preferred" public accessor for the component instances. The benefit is that this functionality is already built into your PR!

For example, the line we're commenting on would become [c.__class__ for c in pipeline], and individual components could be accessed with pipeline[index].

This would require updating our docs and our unit tests so they no longer access instances through pipeline.component_graph.

What do you think? If you are on board, we could also leave this for another issue since this diff is large already.
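
A rough sketch of what that accessor pattern could look like is below. `ToyPipeline`, `FakeImputer`, and `FakeEstimator` are hypothetical stand-ins and not evalml classes; the sketch only assumes that the pipeline keeps its component instances privately, in computation order, and exposes them through iteration and integer indexing.

```python
class FakeImputer:
    """Hypothetical stand-in for a transformer component."""


class FakeEstimator:
    """Hypothetical stand-in for an estimator component."""


class ToyPipeline:
    """Hypothetical pipeline that hides its graph object from users."""

    def __init__(self, components):
        # Instances are kept private, in computation order.
        self._components = list(components)

    def __iter__(self):
        # Iterating over the pipeline yields the component instances.
        return iter(self._components)

    def __getitem__(self, index):
        # Only integer indexing is handled in this sketch.
        if isinstance(index, slice):
            raise NotImplementedError("slicing is not supported in this sketch")
        return self._components[index]

    def __len__(self):
        return len(self._components)


pipeline = ToyPipeline([FakeImputer(), FakeEstimator()])
component_classes = [c.__class__ for c in pipeline]
final_estimator = pipeline[-1]
```

With this shape, callers never touch the private graph object, and a static component_graph definition can stay separate from the instances.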

Contributor Author

I like this idea, I think it makes a lot of sense! I do agree it's a better fit for a separate issue though haha, so I can file that if other people also think this would be beneficial!

Contributor Author

Filed issue #1571

@eccabay eccabay merged commit 3eb27b8 into main Dec 18, 2020
1 check passed
@dsherry dsherry mentioned this pull request Dec 29, 2020
@eccabay eccabay deleted the 1278_pipelines_as_dags branch March 10, 2022 15:34
Successfully merging this pull request may close these issues.

Update PipelineBase to use ComponentGraph object
4 participants