Add option to AutoMLSearch to exclude featurizers from pipelines #3631
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main    #3631    +/-   ##
=======================================
+ Coverage    99.7%    99.7%    +0.1%
=======================================
  Files         335      335
  Lines       33666    33750      +84
=======================================
+ Hits        33543    33627      +84
  Misses        123      123
This looks great to me, but I just have some docstring suggestions and a testing suggestion!
@pytest.mark.parametrize("input_type", ["pd", "ww"]) | ||
@pytest.mark.parametrize("automl_algorithm", ["default", "iterative"]) | ||
@pytest.mark.parametrize("problem_type", ProblemTypes.all_problem_types) | ||
def test_exclude_featurizers( |
Could you repurpose this test to run search and then check the pipelines generated by search with get_pipeline()? I think this current test would be better suited in both test_default_algorithm and test_iterative_algorithm separately, where we check pipeline generation with each algorithm. In general, I try to keep tests of top-level search behavior in this file and algorithm-specific pipeline generation logic in each algorithm's test file. You can use our AutoMLTestEnv fixture to skip computation here. Here's a good example.
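For reference, a rough sketch of what the suggested search-level test could look like, assuming the exclude_featurizers parameter name, the X_y_binary fixture, and an AutoMLTestEnv fixture that exposes a test_context() context manager for mocking fit/score; these names and signatures are inferred from this discussion and should be treated as assumptions rather than the final implementation:

from evalml.automl import AutoMLSearch
from evalml.pipelines.components import DateTimeFeaturizer


def test_exclude_featurizers_with_search(X_y_binary, AutoMLTestEnv):
    X, y = X_y_binary
    automl = AutoMLSearch(
        X_train=X,
        y_train=y,
        problem_type="binary",
        exclude_featurizers=["DatetimeFeaturizer"],  # assumed option string
    )
    env = AutoMLTestEnv("binary")
    # Mock out fitting/scoring so the test only exercises pipeline generation.
    with env.test_context(score_return_value={automl.objective.name: 1.0}):
        automl.search()

    pipelines = [automl.get_pipeline(i) for i in automl.rankings["id"]]
    # A check to make sure we actually retrieve constructed pipelines from search.
    assert len(pipelines) > 0
    # compute_order lists component names, so the membership check is explicit.
    assert not any(
        DateTimeFeaturizer.name in pl.component_graph.compute_order
        for pl in pipelines
    )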
@jeremyliweishih Just to be clear on what you are proposing: you think I should split the existing test into two, moving the first case for the default algorithm into test_default_algorithm.py and the second iterative case into test_iterative_algorithm.py, and then create a new test in test_automl.py that uses the AutoMLTestEnv fixture like the example you linked? Is that correct?
@thehomebrewnerd yes! how does that sound?
Works for me! I'll work on that update this afternoon.
@jeremyliweishih I'm having a little trouble using AutoMLTestEnv, so I might need some pointers next week on how to resolve it. This commit contains my attempt at updating the test in test_automl.py: c9c3377. Is that what you were envisioning?
yup! it looks basically there. Please ping me next week!
Thanks - will do. This is the error I am getting, but I haven't dug in deep enough yet to understand what is happening.
evalml/automl/automl_search.py:1064: in search
pipeline_id = self._post_evaluation_callback(
evalml/automl/automl_search.py:1388: in _post_evaluation_callback
self.automl_algorithm.add_result(
evalml/automl/automl_algorithm/default_algorithm.py:485: in add_result
self._parse_selected_categorical_features(pipeline)
evalml/automl/automl_algorithm/default_algorithm.py:431: in _parse_selected_categorical_features
self._get_feature_provenance_and_remove_engineered_features(
evalml/automl/automl_algorithm/default_algorithm.py:411: in _get_feature_provenance_and_remove_engineered_features
component = pipeline.get_component(component_name)
evalml/pipelines/pipeline_base.py:224: in get_component
return self.component_graph.get_component(name)
FAILED evalml/tests/automl_tests/test_automl.py::test_exclude_featurizers[binary-default-pd] - ValueError: Component URL Featurizer is not in the graph
This looks basically there! Just one small thing to fix regarding checking whether a component is in a pipeline.
# A check to make sure we actually retrieve constructed pipelines from the algo.
assert len(pipelines) > 0

assert not any([DateTimeFeaturizer.name in pl for pl in pipelines])
get_pipeline() returns a pipeline instance, so you would need to check DateTimeFeaturizer.name in pl.component_graph.compute_order for pl in pipelines. You could also do DateTimeFeaturizer.name in pl.name for pl in pipelines, but the compute order check is more explicit since compute_order is the list of component names as text.
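For reference, the corrected check might read as follows (a minimal sketch, assuming pipelines holds the instances returned by get_pipeline()):

# compute_order is the list of component names in the pipeline's graph,
# so membership is checked against it rather than against the pipeline instance.
assert not any(
    DateTimeFeaturizer.name in pl.component_graph.compute_order
    for pl in pipelines
)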
Thanks for the clarification.
Updated: f111676
This LGTM! Great work, and this brings our two libraries even closer 😄 I just left a comment about updating the algo tests with the same component-checking logic!
# A check to make sure we actually retrieve constructed pipelines from the algo.
assert len(pipelines) > 0

assert not any([DateTimeFeaturizer.name in pl for pl in pipelines])
Could you repeat the changes you've made to test_exclude_featurizers here and in test_exclude_featurizers_iterative_algorithm as well? Thanks!
Updated here: f115377
This looks good @thehomebrewnerd. I think the only thing to perhaps consider is: what happens if the user inputs "TmeSeriesFeaturizer", expecting the featurizer to be removed but never checking that it was? I get that this is a flag added mostly for internal stakeholder use, but it might be nice to add a quick name check by pulling all the featurizer components and raising a ValueError if the provided name is not among them.
I added a check for this in AutoMLSearch: 9d79cf5
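For illustration only, a minimal sketch of the kind of name validation described above; the variable names and the set of valid featurizer strings are assumptions for illustration, not the contents of the linked commit:

# Hypothetical check inside AutoMLSearch.__init__ (names are assumptions).
VALID_FEATURIZER_NAMES = {
    "DatetimeFeaturizer",
    "EmailFeaturizer",
    "URLFeaturizer",
    "NaturalLanguageFeaturizer",
    "TimeSeriesFeaturizer",
}

exclude_featurizers = exclude_featurizers or []
invalid = [name for name in exclude_featurizers if name not in VALID_FEATURIZER_NAMES]
if invalid:
    raise ValueError(
        f"Invalid value provided for exclude_featurizers: {invalid}. "
        f"Valid options are: {sorted(VALID_FEATURIZER_NAMES)}"
    )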
@chukarsten @jeremyliweishih Added the check for invalid parameter inputs as requested; just wanted to confirm you are good with that change (and the new test) before I merge.
LGTM - thanks!
Add option to AutoMLSearch to exclude featurizers from pipelines
Adds an optional parameter to AutoMLSearch that allows users to exclude the EvalML *Featurizer components from pipelines in situations where feature engineering has been performed ahead of executing the search.

Closes #3619
Closes #3590
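As a usage sketch (the exclude_featurizers parameter name and the accepted option string are inferred from the discussion above, not quoted from the final documentation):

from evalml.automl import AutoMLSearch

# Assumes datetime feature engineering was already performed on X_train,
# so the corresponding featurizer is excluded from the generated pipelines.
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    exclude_featurizers=["DatetimeFeaturizer"],  # assumed valid option string
)
automl.search()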