Add option to AutoMLSearch to exclude featurizers from pipelines#3631
Add option to AutoMLSearch to exclude featurizers from pipelines#3631thehomebrewnerd merged 22 commits intomainfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3631 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 335 335
Lines 33666 33750 +84
=======================================
+ Hits 33543 33627 +84
Misses 123 123
Help us with your feedback. Take ten seconds to tell us how you rate us. |
jeremyliweishih
left a comment
There was a problem hiding this comment.
This looks great to me but just have some docstring suggestions and a testing suggestion!
| @pytest.mark.parametrize("input_type", ["pd", "ww"]) | ||
| @pytest.mark.parametrize("automl_algorithm", ["default", "iterative"]) | ||
| @pytest.mark.parametrize("problem_type", ProblemTypes.all_problem_types) | ||
| def test_exclude_featurizers( |
There was a problem hiding this comment.
could you repurpose this test to run search and then check the pipelines generated by search with get_pipeline()? I think this current test would be better suited in both test_default_algorithm and test_iterative_algorithm separately where we check pipeline generation with each algorithm. In general I try to separate testing top level search behavior in this file and then algorithm specific pipeline generation logic in each algorithm's test file. You can use our AutoMLTestEnv fixture to skip computation here. Here's a good example.
There was a problem hiding this comment.
@jeremyliweishih Just to be clear on what you are proposing.
You think I should split the existing test into two, and move the first case for default algorithm into test_default_algorithm.py and the second iterative case into test_iterative_algorithm.py. Then, create a new test in test_automl.py that uses the AutoMLTestEnv fixture like the example you linked? Is that correct?
There was a problem hiding this comment.
Works for me! I'll work on that update this afternoon.
There was a problem hiding this comment.
@jeremyliweishih I'm having a little trouble with using AutoMLTestEnv, so might need some pointers next week on how to resolve. This commit contains my attempt at updating the test in test_automl.py: c9c3377
Is that what you were envisioning?
There was a problem hiding this comment.
yup! it looks basically there. Please ping me next week!
There was a problem hiding this comment.
Thanks - will do. This is the error I am getting, but haven't really dug too deep to understand what is happening.
evalml/automl/automl_search.py:1064: in search
pipeline_id = self._post_evaluation_callback(
evalml/automl/automl_search.py:1388: in _post_evaluation_callback
self.automl_algorithm.add_result(
evalml/automl/automl_algorithm/default_algorithm.py:485: in add_result
self._parse_selected_categorical_features(pipeline)
evalml/automl/automl_algorithm/default_algorithm.py:431: in _parse_selected_categorical_features
self._get_feature_provenance_and_remove_engineered_features(
evalml/automl/automl_algorithm/default_algorithm.py:411: in _get_feature_provenance_and_remove_engineered_features
component = pipeline.get_component(component_name)
evalml/pipelines/pipeline_base.py:224: in get_component
return self.component_graph.get_component(name)
FAILED evalml/tests/automl_tests/test_automl.py::test_exclude_featurizers[binary-default-pd] - ValueError: Component URL Featurizer is not in the graph
jeremyliweishih
left a comment
There was a problem hiding this comment.
This looks basically there! Just one small thing to fix re checking if a component is in a pipeline.
| # A check to make sure we actually retrieve constructed pipelines from the algo. | ||
| assert len(pipelines) > 0 | ||
|
|
||
| assert not any([DateTimeFeaturizer.name in pl for pl in pipelines]) |
There was a problem hiding this comment.
get_pipeline() returns a pipeline instance so you would need to check for DateTimeFeaturizer.name in pl.component_graph.compute_order for pl in pipelines. You could also do DateTimeFeaturizer.name in pl.name for pl in pipelines but the compute order check is more explicit as compute_order is the list of components in text.
There was a problem hiding this comment.
Thanks for the clarification.
Updated: f111676
jeremyliweishih
left a comment
There was a problem hiding this comment.
This LGTM! Great work and this brings our two libraries even closer 😄 just left a comment about updating the algo tests with the same component checking logic!
| # A check to make sure we actually retrieve constructed pipelines from the algo. | ||
| assert len(pipelines) > 0 | ||
|
|
||
| assert not any([DateTimeFeaturizer.name in pl for pl in pipelines]) |
There was a problem hiding this comment.
could you repeat the changes you've made to test_exclude_featurizers here and in test_exclude_featurizers_iterative_algorithm as well? Thanks!
chukarsten
left a comment
There was a problem hiding this comment.
This looks good @thehomebrewnerd . I think the only thing to perhaps consider is : what happens if the user inputs "TmeSeriesFeaturizer" expecting, but not checking for the featurizer to be removed? I get that this is a flag added mostly for internal stakeholder use, but it might be nice to add a quick name check by pulling all the featurizer components and raising a ValueError if the provided name is not contained.
I added a check for this in AutoMLSearch: 9d79cf5 |
|
@chukarsten @jeremyliweishih Added the check for invalid parameter inputs as requested, just wanted to confirm you are good with that change (and the new test) before I merge. |
Add option to AutoMLSearch to exclude featurizers from pipelines
Adds an optional parameter to
AutoMLSearchto allow users to optionally exclude the EvalML*Featurizercomponents from pipelines in situations where feature engineering has been performed ahead of executing the search.Closes #3619
Closes #3590