Fix AutoMLSearch so users can specify custom hyperparameters to pipelines with duplicate components #2133
Conversation
@@ -289,6 +289,10 @@ def __init__(self,
            allowed_estimators = get_estimators(self.problem_type, self.allowed_model_families)
            logger.debug(f"allowed_estimators set to {[estimator.name for estimator in allowed_estimators]}")
            self.allowed_pipelines = [make_pipeline(self.X_train, self.y_train, estimator, self.problem_type, custom_hyperparameters=self.pipeline_parameters) for estimator in allowed_estimators]
        else:
This is to fix #2132. Without this change, the hyperparameters passed in to pipeline_parameters would only get used in the first batch.
evalml/pipelines/pipeline_base.py
Outdated
if cls.custom_hyperparameters and component_name in cls.custom_hyperparameters:
    component_hyperparameters.update(cls.custom_hyperparameters.get(component_name, {}))
hyperparameter_ranges[component_name] = component_hyperparameters
for component_name, component_class in ComponentGraph.linearized_component_graph(component_graph):
This is to fix #2125.
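The loop in the diff above can be sketched in isolation. In this sketch, plain dicts stand in for evalml's component classes and their default ranges; the function name and data shapes are illustrative, not evalml's real API. The key behavior is that class-level custom hyperparameters are merged over each component's defaults, keyed by the (possibly suffixed) component name:

```python
# Sketch of assembling per-component hyperparameter ranges, merging
# class-level custom_hyperparameters over component defaults.
# All names here are illustrative stand-ins for evalml internals.

def build_hyperparameter_ranges(linearized_graph, custom_hyperparameters=None):
    """linearized_graph: list of (component_name, default_ranges) pairs."""
    custom_hyperparameters = custom_hyperparameters or {}
    hyperparameter_ranges = {}
    for component_name, default_ranges in linearized_graph:
        component_hyperparameters = dict(default_ranges)
        if component_name in custom_hyperparameters:
            # Custom ranges win over the component's defaults.
            component_hyperparameters.update(custom_hyperparameters[component_name])
        hyperparameter_ranges[component_name] = component_hyperparameters
    return hyperparameter_ranges


graph = [
    ("Imputer", {"numeric_impute_strategy": ["mean", "median"]}),
    ("Imputer_1", {"numeric_impute_strategy": ["mean", "median"]}),
]
ranges = build_hyperparameter_ranges(
    graph, {"Imputer_1": {"numeric_impute_strategy": ["most_frequent"]}}
)
print(ranges["Imputer_1"]["numeric_impute_strategy"])  # ['most_frequent']
print(ranges["Imputer"]["numeric_impute_strategy"])    # ['mean', 'median']
```

Because the ranges are keyed by the linearized (suffixed) names, the two Imputer copies can carry different search spaces, which is exactly what the duplicate-component bug broke.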
automl = AutoMLSearch(X, y, problem_type="binary", allowed_pipelines=[PipelineDict, PipeLineLinear],
                      pipeline_parameters={"Imputer": {"numeric_impute_strategy": Categorical(["most_frequent"])},
                                           "Imputer_1": {"numeric_impute_strategy": Categorical(["median"])}},
Need to use Categorical because of #2130.
X, y = X_y_binary

class PipelineDict(BinaryClassificationPipeline):
    custom_hyperparameters = {"Imputer": {"numeric_impute_strategy": Categorical(["most_frequent", 'mean'])},
Need to specify mean for the imputers and 50 for the rf classifier because of #2131.
Force-pushed from ccffde4 to f3a4b15
Codecov Report
@@ Coverage Diff @@
## main #2133 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 293 293
Lines 23915 24010 +95
=========================================
+ Hits 23905 24000 +95
Misses 10 10
Continue to review full report at Codecov.
Force-pushed from f3a4b15 to 092af0f
@@ -260,6 +262,10 @@ def __init__(self,
        except ImportError:
            logger.warning("Unable to import plotly; skipping pipeline search plotting\n")

        if allowed_pipelines is not None and not isinstance(allowed_pipelines, list):
Adding this data validation so that custom_hyperparameters can be overloaded with pipeline_parameters without weird bugs.
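The validation being added here can be sketched as a standalone check. The class names and the helper below are illustrative stand-ins (evalml does this inline in AutoMLSearch.__init__), but the two error messages match the tests in this PR:

```python
# Sketch of the allowed_pipelines validation discussed above.
# PipelineBase/MyPipeline are stand-ins for evalml's real classes.

class PipelineBase:
    pass

class MyPipeline(PipelineBase):
    pass

def validate_allowed_pipelines(allowed_pipelines):
    if allowed_pipelines is not None and not isinstance(allowed_pipelines, list):
        raise ValueError("Parameter allowed_pipelines must be either None or a list!")
    for pipeline in allowed_pipelines or []:
        # An instantiated pipeline fails this check too, since an
        # instance is not itself a subclass of PipelineBase.
        if not (isinstance(pipeline, type) and issubclass(pipeline, PipelineBase)):
            raise ValueError("Every element of allowed_pipelines must be a subclass of PipelineBase!")

validate_allowed_pipelines([MyPipeline])   # ok: a list of pipeline classes
try:
    validate_allowed_pipelines(MyPipeline)  # not wrapped in a list
except ValueError as err:
    print(err)
```

Accepting only uninstantiated classes keeps a single place (the class attribute) where custom_hyperparameters can live, which is what makes the overload safe.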
Force-pushed from 092af0f to 8581ac8
Force-pushed from 8581ac8 to c061798
Looks solid! 3 issues in a PR, you're going off.
Left a comment on potentially changing the warning for pipelines that are passed in that are instantiated, but otherwise, test coverage looks good and code changes look clean!
with pytest.raises(ValueError, match="Parameter allowed_pipelines must be either None or a list!"):
    AutoMLSearch(X, y, problem_type="binary", allowed_pipelines=dummy_binary_pipeline_class)

with pytest.raises(ValueError, match="Every element of allowed_pipelines must be a subclass of PipelineBase!"):
This seems like it could be confusing for the user, since the pipeline is technically a subclass of PipelineBase. I think it'd be much more helpful to output an error stating the class must not be instantiated, since that's the actual issue here.
Counterpoint: it just happened that @freddyaboulton tested using an instantiated object, but this error would also be raised if we threw in something random, e.g. the AutoMLSearch class itself, so I think this is fine.
Looks good, can you always resolve 3 issues in one PR? LOL
I had minor comments about the UX of creating duplicate components and knowing how to specify the hyperparameters of each; could just be some documentation somewhere :D
* Fixed bug where ``hyperparameters`` were not displaying properly for pipelines with a list ``component_graph`` and duplicate components :pr:`2133`
* Fixed bug where ``pipeline_parameters`` argument in ``AutoMLSearch`` was not applied to pipelines passed in as ``allowed_pipelines`` :pr:`2133`
* Fixed bug where ``AutoMLSearch`` was not applying custom hyperparameters to pipelines with a list ``component_graph`` and duplicate components :pr:`2133`
Amazing, fixing three bugs in one 😂
evalml/automl/automl_search.py
Outdated
else:
    for pipeline_class in self.allowed_pipelines:
        if self.pipeline_parameters:
            pipeline_class.custom_hyperparameters = self.pipeline_parameters
Interesting. Does this mean that if the pipeline classes passed in already had custom_hyperparameters set, we're choosing to ignore them? I guess this touches on the two-ways-to-specify-hyperparameter-ranges we talked about in #2091 😅
That's a good point! I thought this is what the intended behavior should be. My reasoning was that users should use one api over the other. I thought it'd be weird for a user to go through the trouble of defining pipeline classes and specifying custom hyperparameters, and then decide to specify additional hyperparameters through AutoMLSearch.
I filed #2147 to continue the discussion about the two apis, but in the meantime I think the least aggressive thing to do is to just update the custom hyperparameters with the values in self.pipeline_parameters!
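The "least aggressive" merge described in this comment can be sketched with plain dicts standing in for the pipeline class attribute (the helper name is illustrative, not evalml's API): each component's existing ranges are updated with, rather than replaced by, the values from pipeline_parameters.

```python
# Sketch of updating a pipeline's custom_hyperparameters per component
# with the values from pipeline_parameters, instead of overwriting the
# whole dict. Plain dicts stand in for evalml's pipeline classes.

def merge_custom_hyperparameters(existing, pipeline_parameters):
    merged = {name: dict(ranges) for name, ranges in (existing or {}).items()}
    for component_name, new_ranges in pipeline_parameters.items():
        # setdefault keeps components the user never mentioned untouched.
        merged.setdefault(component_name, {}).update(new_ranges)
    return merged


existing = {"Imputer": {"numeric_impute_strategy": ["mean"]},
            "Random Forest Classifier": {"n_estimators": [50, 100]}}
pipeline_parameters = {"Imputer": {"numeric_impute_strategy": ["median"]}}

merged = merge_custom_hyperparameters(existing, pipeline_parameters)
print(merged["Imputer"])                   # pipeline_parameters wins
print(merged["Random Forest Classifier"])  # pre-existing ranges survive
```

With a wholesale replacement, the Random Forest Classifier ranges the user attached to the class would silently disappear; the merge keeps them.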
evalml/pipelines/pipeline_base.py
Outdated
if cls.custom_hyperparameters and component_name in cls.custom_hyperparameters:
    component_hyperparameters.update(cls.custom_hyperparameters.get(component_name, {}))
hyperparameter_ranges[component_name] = component_hyperparameters
for component_name, component_class in ComponentGraph.linearized_component_graph(component_graph):
JW, why not call for component_name, component_class in PipelineBase.linearized_component_graph?
Just cause I saw the existing implementation used a copy of the component graph lol. I don't think that's necessary though? In that case, I can just use the pipeline class property 👍
component_graph = {"Imputer": ["Imputer"],
                   "Imputer_1": ["Imputer", "Imputer"],
Sorry if my brain's a little fried, but what is this doing here? We're passing the imputer's return val to Imputer_1?
Hehe yes. I decided to keep the names between the dict and list component graph the same so that we don't have to check for different names but I recognize it's confusing. I'll add a comment.
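The naming convention at play here can be sketched as follows: when a list component_graph repeats a component, the second occurrence gets a "_1" suffix, the third "_2", and so on, which is why the dict graph above reuses "Imputer_1". The function below is an illustrative stand-in, not evalml's implementation.

```python
# Sketch of the duplicate-component naming convention: "Imputer",
# "Imputer_1", "Imputer_2", ... Illustrative stand-in only.

def linearized_names(component_graph):
    counts = {}
    names = []
    for component in component_graph:
        seen = counts.get(component, 0)
        names.append(component if seen == 0 else f"{component}_{seen}")
        counts[component] = seen + 1
    return names


print(linearized_names(["Imputer", "Imputer", "Random Forest Classifier"]))
# ['Imputer', 'Imputer_1', 'Random Forest Classifier']
```

These suffixed names are the keys users must use in pipeline_parameters and custom_hyperparameters, which is the UX gotcha raised elsewhere in this review.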
"Imputer_1": ["Imputer", "Imputer"],
"Random Forest Classifier": ["Random Forest Classifier", "Imputer_1"]}

class PipeLineLinear(BinaryClassificationPipeline):
I'm glad this PR fixes this case, though I have to say it's a bit weird for the user to specify Imputer_1 in pipeline_parameters; I guess this is dependent on them knowing how the naming would work for duplicate cases. Looks fine though!
I agree it's weird. One thing I thought about doing was allowing custom component names in the list api by using tuples. Something like
class MyPipeline(BinaryClassificationPipeline):
    component_graph = [("My Imputer", "Imputer"), ("Cool Classifier", "Logistic Regression Classifier")]
but I thought that was too big a change for what the original issue asked for. Plus, not sure how long the list api will be around 😅 I will file an issue so we can decide if this is worth doing and prioritize if so.
FYI #2146
One of the APIs must die...we give the users too much freedom, lol.
I put #2146 into the next sprint, per the Elder Council's approval.
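The tuple-naming idea proposed above (now tracked in #2146) could look something like the sketch below. Everything here is hypothetical, since the feature was deliberately left out of this PR: a list entry is either a bare component name or a (custom_name, component_name) tuple, and explicit names remove the need for auto-generated "_1" suffixes.

```python
# Hypothetical sketch of the (custom_name, component) tuple proposal
# from #2146. Not implemented in evalml at the time of this PR.

def resolve_names(component_graph):
    resolved = []
    for entry in component_graph:
        if isinstance(entry, tuple):
            name, component = entry
        else:
            name = component = entry
        resolved.append((name, component))
    names = [n for n, _ in resolved]
    if len(set(names)) != len(names):
        # With explicit names, duplicates become a user error instead of
        # silently turning into "Imputer_1", "Imputer_2", ...
        raise ValueError("Component names must be unique; use (name, component) tuples to disambiguate.")
    return resolved


graph = [("My Imputer", "Imputer"),
         ("Second Imputer", "Imputer"),
         "Random Forest Classifier"]
print(resolve_names(graph))
```

Users would then reference "My Imputer" and "Second Imputer" directly in pipeline_parameters, sidestepping the suffix-guessing gotcha discussed in this thread.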
custom_hyperparameters = {"One Hot Encoder_1": {"top_n": Integer(10, 50)}}
component_graph = ["One Hot Encoder", "One Hot Encoder", "Random Forest Classifier"]
Same comment above about weirdness w/ user knowing to specify "One Hot Encoder_1"... maybe the right thing to do is just to have some documentation about the naming convention for duplicate components?
Yea, I anticipate that's going to be a big gotcha going forward...
I think I just have a docstring change request! Other than that, way to be super efficient! Good work!
Force-pushed from 7f09241 to 7cc887e
Force-pushed from 2535e2f to cc35405
- …skEngine`` #1975.
- Added optional ``engine`` argument to ``AutoMLSearch`` #1975
- Added a warning about how time series support is still in beta when a user passes in a time series problem to ``AutoMLSearch`` #2118
- Added ``NaturalLanguageNaNDataCheck`` data check #2122
- Added ValueError to ``partial_dependence`` to prevent users from computing partial dependence on columns with all NaNs #2120
- Added standard deviation of cv scores to rankings table #2154
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use ``minority:majority`` ratio instead of ``majority:minority`` #2077
- Fixed bug where two-way partial dependence plots with categorical variables were not working correctly #2117
- Fixed bug where ``hyperparameters`` were not displaying properly for pipelines with a list ``component_graph`` and duplicate components #2133
- Fixed bug where ``pipeline_parameters`` argument in ``AutoMLSearch`` was not applied to pipelines passed in as ``allowed_pipelines`` #2133
- Fixed bug where ``AutoMLSearch`` was not applying custom hyperparameters to pipelines with a list ``component_graph`` and duplicate components #2133
- Removed ``hyperparameter_ranges`` from Undersampler and renamed ``balanced_ratio`` to ``sampling_ratio`` for samplers #2113
- Renamed ``TARGET_BINARY_NOT_TWO_EXAMPLES_PER_CLASS`` data check message code to ``TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS`` #2126
- Modified one-way partial dependence plots of categorical features to display data with a bar plot #2117
- Renamed ``score`` column for ``automl.rankings`` as ``mean_cv_score`` #2135
- Fixed ``conf.py`` file #2112
- Added a sentence to the automl user guide stating that our support for time series problems is still in beta #2118
- Fixed documentation demos #2139
- Update test badge in README to use GitHub Actions #2150
- Fixed ``test_describe_pipeline`` for ``pandas`` ``v1.2.4`` #2129
- Added a GitHub Action for building the conda package #1870 #2148

.. warning::

- Renamed ``balanced_ratio`` to ``sampling_ratio`` for the ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, ``BalancedClassficationSampler``, and Undersampler #2113
- Deleted the "errors" key from automl results #1975
- Deleted the ``raise_and_save_error_callback`` and the ``log_and_save_error_callback`` #1975
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use minority:majority ratio instead of majority:minority #2077
Pull Request Description
Fixes #2049, #2132, #2125
When I started, I wasn't planning on a three-in-one PR, but we can't fix #2049 without fixing #2132.
And without fixing #2125, users can't apply custom hyperparameters by attaching them to the pipeline definition, so automl support for pipelines with duplicate components would still be broken.
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.