
Add option to AutoMLSearch to exclude featurizers from pipelines #3631

Merged: thehomebrewnerd merged 22 commits into main from exclude-featurizers on Aug 2, 2022

Conversation

thehomebrewnerd (Contributor)

Adds an optional parameter to AutoMLSearch that lets users exclude the EvalML *Featurizer components from pipelines in situations where feature engineering has already been performed before running the search.

Closes #3619
Closes #3590
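
For illustration, a minimal sketch of the intended usage based on this description and the discussion below. The parameter name exclude_featurizers and the space-free component names (e.g. "DatetimeFeaturizer") are assumptions inferred from the tests and review comments; X and y stand in for pre-featurized data.

from evalml.automl import AutoMLSearch

# X and y are assumed to have already been through external feature
# engineering, so EvalML's own featurizer components would be redundant.
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    exclude_featurizers=["DatetimeFeaturizer", "URLFeaturizer"],
)
automl.search()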

thehomebrewnerd marked this pull request as draft on July 28, 2022 16:41
codecov bot commented on Jul 28, 2022

Codecov Report

Merging #3631 (bc12c9a) into main (05a38e0) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3631     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        335     335             
  Lines      33666   33750     +84     
=======================================
+ Hits       33543   33627     +84     
  Misses       123     123             
Impacted Files Coverage Δ
...valml/automl/automl_algorithm/default_algorithm.py 100.0% <100.0%> (ø)
...lml/automl/automl_algorithm/iterative_algorithm.py 97.4% <100.0%> (+0.1%) ⬆️
evalml/automl/automl_search.py 99.5% <100.0%> (+0.1%) ⬆️
evalml/pipelines/utils.py 99.5% <100.0%> (+0.1%) ⬆️
evalml/tests/automl_tests/test_automl.py 99.5% <100.0%> (+0.1%) ⬆️
...valml/tests/automl_tests/test_default_algorithm.py 100.0% <100.0%> (ø)
...lml/tests/automl_tests/test_iterative_algorithm.py 100.0% <100.0%> (ø)


thehomebrewnerd marked this pull request as ready for review on July 28, 2022 18:22
jeremyliweishih (Collaborator) left a comment

This looks great to me, but I just have some docstring suggestions and a testing suggestion!

@pytest.mark.parametrize("input_type", ["pd", "ww"])
@pytest.mark.parametrize("automl_algorithm", ["default", "iterative"])
@pytest.mark.parametrize("problem_type", ProblemTypes.all_problem_types)
def test_exclude_featurizers(
jeremyliweishih (Collaborator) commented on Jul 29, 2022

Could you repurpose this test to run search and then check the pipelines generated by search with get_pipeline()? I think this current test would be better suited in test_default_algorithm and test_iterative_algorithm separately, where we check pipeline generation with each algorithm. In general, I try to keep top-level search behavior tested in this file and algorithm-specific pipeline generation logic in each algorithm's test file. You can use our AutoMLTestEnv fixture to skip computation here. Here's a good example.
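
A rough sketch of the test shape being suggested here, assuming the AutoMLTestEnv fixture exposes a test_context context manager that mocks pipeline fitting and scoring via score_return_value; the fixture names are illustrative, not the final test.

from evalml.automl import AutoMLSearch
from evalml.pipelines.components import DateTimeFeaturizer

def test_exclude_featurizers(problem_type, AutoMLTestEnv, X_y_binary):
    X, y = X_y_binary
    automl = AutoMLSearch(
        X_train=X,
        y_train=y,
        problem_type=problem_type,
        exclude_featurizers=["DatetimeFeaturizer"],
    )
    # Mock out fit/score so search() runs without real computation.
    env = AutoMLTestEnv(problem_type)
    with env.test_context(score_return_value={automl.objective.name: 1.0}):
        automl.search()
    # Check the pipelines search actually constructed via get_pipeline().
    pipelines = [automl.get_pipeline(i) for i in automl.rankings["id"]]
    assert len(pipelines) > 0
    assert not any(
        DateTimeFeaturizer.name in pl.component_graph.compute_order
        for pl in pipelines
    )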

thehomebrewnerd (Contributor, Author)

@jeremyliweishih Just to be clear on what you are proposing.

You think I should split the existing test in two: move the first, default-algorithm case into test_default_algorithm.py and the second, iterative case into test_iterative_algorithm.py. Then create a new test in test_automl.py that uses the AutoMLTestEnv fixture, like the example you linked? Is that correct?

jeremyliweishih (Collaborator)

@thehomebrewnerd yes! how does that sound?

thehomebrewnerd (Contributor, Author)

Works for me! I'll work on that update this afternoon.

thehomebrewnerd (Contributor, Author)

@jeremyliweishih I'm having a little trouble using AutoMLTestEnv, so I might need some pointers next week on how to resolve it. This commit contains my attempt at updating the test in test_automl.py: c9c3377

Is that what you were envisioning?

jeremyliweishih (Collaborator)

Yup! It looks basically there. Please ping me next week!

thehomebrewnerd (Contributor, Author)

Thanks, will do. This is the error I am getting, but I haven't dug in deeply enough yet to understand what is happening.

evalml/automl/automl_search.py:1064: in search
    pipeline_id = self._post_evaluation_callback(
evalml/automl/automl_search.py:1388: in _post_evaluation_callback
    self.automl_algorithm.add_result(
evalml/automl/automl_algorithm/default_algorithm.py:485: in add_result
    self._parse_selected_categorical_features(pipeline)
evalml/automl/automl_algorithm/default_algorithm.py:431: in _parse_selected_categorical_features
    self._get_feature_provenance_and_remove_engineered_features(
evalml/automl/automl_algorithm/default_algorithm.py:411: in _get_feature_provenance_and_remove_engineered_features
    component = pipeline.get_component(component_name)
evalml/pipelines/pipeline_base.py:224: in get_component
    return self.component_graph.get_component(name)

FAILED evalml/tests/automl_tests/test_automl.py::test_exclude_featurizers[binary-default-pd] - ValueError: Component URL Featurizer is not in the graph

jeremyliweishih (Collaborator) left a comment

This looks basically there! Just one small thing to fix regarding checking whether a component is in a pipeline.

# A check to make sure we actually retrieve constructed pipelines from the algo.
assert len(pipelines) > 0

assert not any([DateTimeFeaturizer.name in pl for pl in pipelines])
jeremyliweishih (Collaborator)

get_pipeline() returns a pipeline instance, so you would need to check DateTimeFeaturizer.name in pl.component_graph.compute_order for pl in pipelines. You could also do DateTimeFeaturizer.name in pl.name for pl in pipelines, but the compute_order check is more explicit, since compute_order is the list of component names as strings.
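
Concretely, the corrected assertion would look something like this (a sketch; pipelines is assumed to hold the instances returned by get_pipeline()):

from evalml.pipelines.components import DateTimeFeaturizer

# compute_order lists component names as strings, so membership is explicit.
assert not any(
    DateTimeFeaturizer.name in pl.component_graph.compute_order
    for pl in pipelines
)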

thehomebrewnerd (Contributor, Author)

Thanks for the clarification.

Updated: f111676

jeremyliweishih (Collaborator) left a comment

This LGTM! Great work, and this brings our two libraries even closer 😄 I just left a comment about updating the algo tests with the same component-checking logic!

# A check to make sure we actually retrieve constructed pipelines from the algo.
assert len(pipelines) > 0

assert not any([DateTimeFeaturizer.name in pl for pl in pipelines])
jeremyliweishih (Collaborator)

Could you repeat the changes you've made to test_exclude_featurizers here and in test_exclude_featurizers_iterative_algorithm as well? Thanks!

thehomebrewnerd (Contributor, Author)

Updated here: f115377

chukarsten (Contributor) left a comment

This looks good @thehomebrewnerd. I think the only thing to perhaps consider is: what happens if the user inputs "TmeSeriesFeaturizer" (note the typo), expecting the featurizer to be removed but never checking that it was? I get that this is a flag added mostly for internal stakeholder use, but it might be nice to add a quick name check by pulling all the featurizer components and raising a ValueError if the provided name is not among them.

thehomebrewnerd (Contributor, Author)

> This looks good @thehomebrewnerd. I think the only thing to perhaps consider is: what happens if the user inputs "TmeSeriesFeaturizer" (note the typo), expecting the featurizer to be removed but never checking that it was? I get that this is a flag added mostly for internal stakeholder use, but it might be nice to add a quick name check by pulling all the featurizer components and raising a ValueError if the provided name is not among them.

I added a check for this in AutoMLSearch: 9d79cf5
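
A sketch of the kind of name check described; the constant and helper names here are hypothetical, and the actual implementation is in commit 9d79cf5.

# Hypothetical list of valid names; the real set would be derived from the
# featurizer components EvalML ships.
EXCLUDE_FEATURIZERS_OPTIONS = [
    "DatetimeFeaturizer",
    "EmailFeaturizer",
    "URLFeaturizer",
    "NaturalLanguageFeaturizer",
    "TimeSeriesFeaturizer",
]

def _validate_exclude_featurizers(exclude_featurizers):
    """Raise a ValueError if any requested name is not a known featurizer."""
    invalid = set(exclude_featurizers) - set(EXCLUDE_FEATURIZERS_OPTIONS)
    if invalid:
        raise ValueError(
            f"Invalid value(s) provided for exclude_featurizers: {sorted(invalid)}. "
            f"Valid options are: {EXCLUDE_FEATURIZERS_OPTIONS}."
        )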

thehomebrewnerd (Contributor, Author)

@chukarsten @jeremyliweishih Added the check for invalid parameter inputs as requested; I just wanted to confirm you are good with that change (and the new test) before I merge.

jeremyliweishih (Collaborator) left a comment

LGTM - thanks!

thehomebrewnerd merged commit 31d4708 into main on Aug 2, 2022
thehomebrewnerd deleted the exclude-featurizers branch on August 2, 2022 17:02
chukarsten mentioned this pull request on Aug 10, 2022