Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix AutoMLSearch so users can specify custom hyperparameters to pipelines with duplicate components #2133

Merged
merged 7 commits into from Apr 19, 2021

Conversation

freddyaboulton
Copy link
Contributor

@freddyaboulton freddyaboulton commented Apr 13, 2021

Pull Request Description

Fixes #2049, #2132, #2125

When I started, I wasn't planning on a three-in-one PR but we can't fix #2049 without fixing #2132.

Without fixing #2125 , users can't apply custom hyperparameters by attaching them to the pipeline definition so automl support for pipelines with duplicate components would still be broken.


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@@ -289,6 +289,10 @@ def __init__(self,
allowed_estimators = get_estimators(self.problem_type, self.allowed_model_families)
logger.debug(f"allowed_estimators set to {[estimator.name for estimator in allowed_estimators]}")
self.allowed_pipelines = [make_pipeline(self.X_train, self.y_train, estimator, self.problem_type, custom_hyperparameters=self.pipeline_parameters) for estimator in allowed_estimators]
else:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to fix #2132. Without this change the hyperparameters passed in to pipeline_parameters would only get used in the first batch.

if cls.custom_hyperparameters and component_name in cls.custom_hyperparameters:
component_hyperparameters.update(cls.custom_hyperparameters.get(component_name, {}))
hyperparameter_ranges[component_name] = component_hyperparameters
for component_name, component_class in ComponentGraph.linearized_component_graph(component_graph):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to fix #2125.


automl = AutoMLSearch(X, y, problem_type="binary", allowed_pipelines=[PipelineDict, PipeLineLinear],
pipeline_parameters={"Imputer": {"numeric_impute_strategy": Categorical(["most_frequent"])},
"Imputer_1": {"numeric_impute_strategy": Categorical(["median"])}},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to use Categorical because of #2130.

X, y = X_y_binary

class PipelineDict(BinaryClassificationPipeline):
custom_hyperparameters = {"Imputer": {"numeric_impute_strategy": Categorical(["most_frequent", 'mean'])},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to specify mean for the imputers and 50 for the rf classifier because of #2131

@freddyaboulton freddyaboulton force-pushed the 2049-automl-with-duplicate-components branch from ccffde4 to f3a4b15 Compare April 13, 2021 16:50
@codecov
Copy link

codecov bot commented Apr 13, 2021

Codecov Report

Merging #2133 (cc35405) into main (ff3fa9b) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #2133     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         293      293             
  Lines       23915    24010     +95     
=========================================
+ Hits        23905    24000     +95     
  Misses         10       10             
Impacted Files Coverage Δ
...lml/automl/automl_algorithm/iterative_algorithm.py 100.0% <100.0%> (ø)
evalml/automl/automl_search.py 100.0% <100.0%> (ø)
evalml/pipelines/component_graph.py 100.0% <100.0%> (ø)
evalml/pipelines/pipeline_base.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff3fa9b...cc35405. Read the comment docs.

@freddyaboulton freddyaboulton force-pushed the 2049-automl-with-duplicate-components branch from f3a4b15 to 092af0f Compare April 13, 2021 18:00
@@ -260,6 +262,10 @@ def __init__(self,
except ImportError:
logger.warning("Unable to import plotly; skipping pipeline search plotting\n")

if allowed_pipelines is not None and not isinstance(allowed_pipelines, list):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this data validation so that custom_hyperparameters can be overloaded with pipeline_parameters without weird bugs.

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks solid! 3 issues in a PR, you're going off.

Left a comment on potentially changing the warning for pipelines that are passed in that are instantiated, but otherwise, test coverage looks good and code changes look clean!

with pytest.raises(ValueError, match="Parameter allowed_pipelines must be either None or a list!"):
AutoMLSearch(X, y, problem_type="binary", allowed_pipelines=dummy_binary_pipeline_class)

with pytest.raises(ValueError, match="Every element of allowed_pipelines must be a subclass of PipelineBase!"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it could be confusing for the user since the pipeline is technically a subclass of PipelineBase. I think it'd be much more helpful to output an error stating the class must not be instantiated, since that's the actual issue here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counterpoint: Ut just happened to be that @freddyaboulton tested using an instantiated object but this error would be raised if we just threw in something random ex: AutoMLSearch class so I think this is fine

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, can you always resolve 3 issues in one PR? LOL

I had minor comments about the UX of creating duplicate components and knowing how to specify the hyperparameters of each--could just be some documentation somewhere :d

Comment on lines +12 to +15
* Fixed bug where ``hyperparameters`` were not displaying properly for pipelines with a list ``component_graph`` and duplicate components :pr:`2133`
* Fixed bug where ``pipeline_parameters`` argument in ``AutoMLSearch`` was not applied to pipelines passed in as ``allowed_pipelines`` :pr:`2133`
* Fixed bug where ``AutoMLSearch`` was not applying custom hyperparameters to pipelines with a list ``component_graph`` and duplicate components :pr:`2133`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, fixing three bugs in one 😂

else:
for pipeline_class in self.allowed_pipelines:
if self.pipeline_parameters:
pipeline_class.custom_hyperparameters = self.pipeline_parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Does this mean that if the pipeline classes passed in already had custom_hyperparameters set, we're choosing to ignore them? I guess this touches on the two-ways-to-specify-hyperparameter-ranges we talked about in #2091 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! I thought this is what the intended behavior should be. My reasoning was that users should use one api over the other. I thought it'd weird for a user to go through the trouble of defining pipeline classes and specifying custom hyperparameters and then decide to specify additional hyperparameters through AutoMLSearch.

I filed #2147 to continue the discussion about the two apis but in the mean time I think the least aggressive thing to do is to just update the custom hyperparameters with the values in self.pipeline_parameters !

if cls.custom_hyperparameters and component_name in cls.custom_hyperparameters:
component_hyperparameters.update(cls.custom_hyperparameters.get(component_name, {}))
hyperparameter_ranges[component_name] = component_hyperparameters
for component_name, component_class in ComponentGraph.linearized_component_graph(component_graph):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JW, why not call for component_name, component_class in PipelineBase.linearized_component_graph?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just cause I saw the existing implementation used a copy of the component graph lol. I don't think that's necessary though? In that case, I can just use the pipeline class property 👍

Comment on lines +2213 to +2215
component_graph = {"Imputer": ["Imputer"],
"Imputer_1": ["Imputer", "Imputer"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if my brain's a little fried, but what is this doing here? We're passing the imputer's return val to Imputer_1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe yes. I decided to keep the names between the dict and list component graph the same so that we don't have to check for different names but I recognize it's confusing. I'll add a comment.

"Imputer_1": ["Imputer", "Imputer"],
"Random Forest Classifier": ["Random Forest Classifier", "Imputer_1"]}

class PipeLineLinear(BinaryClassificationPipeline):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad this PR fixes this case, though I have to say its a bit weird for the user to specify Imputer_1 in pipeline_parameters; I guess this is dependent on them knowing how the naming would work for duplicate cases. Looks fine though!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's weird. One thing I thought about doing was be able to specify custom component names for the list api by using tuples. Something like

class MyPipeline(BinaryClassificationPipeline):
     component_graph = [("My Imputer", "Imputer"), ("Cool Classifier", "Logistic Regression Classifier")]

but I thought that too big a change for what the original issue asked for. Plus, not sure how long the list api will be around 😅 I will file an issue so we can decide if this is worth doing and prioritize if so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI #2146

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the APIs must die...we give the users too much freedom, lol.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put #2146 into the next sprint, per the Elder Council's approval.

with pytest.raises(ValueError, match="Parameter allowed_pipelines must be either None or a list!"):
AutoMLSearch(X, y, problem_type="binary", allowed_pipelines=dummy_binary_pipeline_class)

with pytest.raises(ValueError, match="Every element of allowed_pipelines must be a subclass of PipelineBase!"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counterpoint: Ut just happened to be that @freddyaboulton tested using an instantiated object but this error would be raised if we just threw in something random ex: AutoMLSearch class so I think this is fine

Comment on lines +978 to +979
custom_hyperparameters = {"One Hot Encoder_1": {"top_n": Integer(10, 50)}}
component_graph = ["One Hot Encoder", "One Hot Encoder", "Random Forest Classifier"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment above about weirdness w/ user knowing to specify "One Hot Encoder_1"... maybe the right thing to do is just to have some documentation about the naming convention for duplicate components?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I anticipate that's going to be a big gotcha going forward...

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I just have a docstring change request! Other than that, way to be super efficient! Good work!

evalml/pipelines/component_graph.py Outdated Show resolved Hide resolved
evalml/pipelines/component_graph.py Show resolved Hide resolved
"Imputer_1": ["Imputer", "Imputer"],
"Random Forest Classifier": ["Random Forest Classifier", "Imputer_1"]}

class PipeLineLinear(BinaryClassificationPipeline):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the APIs must die...we give the users too much freedom, lol.

evalml/tests/automl_tests/test_automl.py Show resolved Hide resolved
Comment on lines +978 to +979
custom_hyperparameters = {"One Hot Encoder_1": {"top_n": Integer(10, 50)}}
component_graph = ["One Hot Encoder", "One Hot Encoder", "Random Forest Classifier"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I anticipate that's going to be a big gotcha going forward...

"Imputer_1": ["Imputer", "Imputer"],
"Random Forest Classifier": ["Random Forest Classifier", "Imputer_1"]}

class PipeLineLinear(BinaryClassificationPipeline):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put #2146 into the next sprint, per the Elder Council's approval.

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

@freddyaboulton freddyaboulton force-pushed the 2049-automl-with-duplicate-components branch from 2535e2f to cc35405 Compare April 19, 2021 16:11
@freddyaboulton freddyaboulton merged commit afe60a9 into main Apr 19, 2021
@freddyaboulton freddyaboulton deleted the 2049-automl-with-duplicate-components branch April 19, 2021 17:58
chukarsten added a commit that referenced this pull request Apr 20, 2021
…skEngine`` #1975.

- Added optional ``engine`` argument to ``AutoMLSearch`` #1975
- Added a warning about how time series support is still in beta when a user passes in a time series problem to ``AutoMLSearch`` #2118
- Added ``NaturalLanguageNaNDataCheck`` data check #2122
- Added ValueError to ``partial_dependence`` to prevent users from computing partial dependence on columns with all NaNs #2120
- Added standard deviation of cv scores to rankings table #2154
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use ``minority:majority`` ratio instead of ``majority:minority`` #2077
- Fixed bug where two-way partial dependence plots with categorical variables were not working correctly #2117
- Fixed bug where ``hyperparameters`` were not displaying properly for pipelines with a list ``component_graph`` and duplicate components #2133
- Fixed bug where ``pipeline_parameters`` argument in ``AutoMLSearch`` was not applied to pipelines passed in as ``allowed_pipelines`` #2133
- Fixed bug where ``AutoMLSearch`` was not applying custom hyperparameters to pipelines with a list ``component_graph`` and duplicate components #2133
- Removed ``hyperparameter_ranges`` from Undersampler and renamed ``balanced_ratio`` to ``sampling_ratio`` for samplers #2113
- Renamed ``TARGET_BINARY_NOT_TWO_EXAMPLES_PER_CLASS`` data check message code to ``TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS`` #2126
- Modified one-way partial dependence plots of categorical features to display data with a bar plot #2117
- Renamed ``score`` column for ``automl.rankings`` as ``mean_cv_score`` #2135
- Fixed ``conf.py`` file #2112
- Added a sentence to the automl user guide stating that our support for time series problems is still in beta. #2118
- Fixed documentation demos #2139
- Update test badge in README to use GitHub Actions #2150
- Fixed ``test_describe_pipeline`` for ``pandas`` ``v1.2.4`` #2129
- Added a GitHub Action for building the conda package #1870 #2148
.. warning::
- Renamed ``balanced_ratio`` to ``sampling_ratio`` for the ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, ``BalancedClassficationSampler``, and Undersampler #2113
- Deleted the "errors" key from automl results #1975
- Deleted the ``raise_and_save_error_callback`` and the ``log_and_save_error_callback`` #1975
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use minority:majority ratio instead of majority:minority #2077
@chukarsten chukarsten mentioned this pull request Apr 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AutoML cannot handle updating pipelines parameters with duplicate components
4 participants