Take hyperparameter ranges from components #516
Conversation
Codecov Report
@@ Coverage Diff @@
## master #516 +/- ##
=======================================
Coverage 98.68% 98.69%
=======================================
Files 114 114
Lines 4026 4056 +30
=======================================
+ Hits 3973 4003 +30
Misses 53 53
Continue to review full report at Codecov.
evalml/pipelines/pipeline_base.py (Outdated)

```diff
@@ -32,6 +32,8 @@ def component_graph(cls):
     def problem_types(cls):
         return NotImplementedError("This pipeline must have `problem_types` as a class variable.")

+    _hyperparameters = None
```
@jeremyliweishih why the underscore here? I see you included an example in the custom pipelines doc of how to override hyperparams. If we support that, we should make this public, right?
If we include `PipelineBase.hyperparameters` as a class variable, we can't name this field `hyperparameters` as well, since they share the same namespace. If it makes more sense I can make this `hyperparameters` and `PipelineBase._hyperparameters`. I went with this setup because it mirrors `_name` as a class variable and `name` as a property.
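The `_name`/`name` pattern described above can be sketched roughly as follows. This is a hypothetical, simplified stand-in, not evalml's actual implementation: `classproperty` here is a minimal descriptor, and the component-merging step is elided.

```python
# Minimal sketch (assumed, not evalml's real code) of the pattern discussed:
# a private class variable `_hyperparameters` holding optional overrides,
# exposed through a read-only `hyperparameters` classproperty.
class classproperty:
    """Minimal read-only property that works on the class itself."""
    def __init__(self, func):
        self.func = func

    def __get__(self, obj, owner):
        return self.func(owner)


class PipelineBase:
    _hyperparameters = None  # subclasses may override with a dict of ranges

    @classproperty
    def hyperparameters(cls):
        """Return hyperparameter ranges, applying any class-level overrides."""
        ranges = {}
        # ... in the real code, ranges from each component are merged here ...
        if cls._hyperparameters:
            ranges.update(cls._hyperparameters)
        return ranges


class MyPipeline(PipelineBase):
    _hyperparameters = {"impute_strategy": ["most_frequent"]}
```

With this setup, `MyPipeline.hyperparameters` resolves through the descriptor on the class itself, the same way `name` resolves from `_name`.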
Oh! I had forgotten this is what we did for `name`. Cool, makes sense to me.
I bet there's a way to wrangle this using Python magic methods/decorators. But it's not worth it. What you have here is great 👍
```python
    "max_depth": Integer(1, 32),
    "impute_strategy": ["mean", "median", "most_frequent"],
    "percent_features": Real(.01, 1)
}
```
Awesome!!
evalml/pipelines/pipeline_base.py (Outdated)

```diff
     return handle_component(cls.component_graph[-1]).model_family

+    @classproperty
+    def hyperparameters(cls):
+        "Returns hyperparameter ranges from components"
```
Can you please add a quick description of the return structure, specifically that it returns hyperparameters as a flat list across all components?
evalml/pipelines/pipeline_base.py (Outdated)

```diff
         hyperparameter_ranges.update(component.hyperparameter_ranges)

+        if cls._hyperparameters:
+            hyperparameter_ranges.update(cls._hyperparameters)
```
So this is how we support pipeline-level overriding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this was the easiest way I thought of doing it.
```python
    "impute_strategy": ["most_frequent"],
    "n_estimators": Integer(10, 1000),
    "eta": Real(0, 1),
    "max_depth": Integer(1, 8),
```
Oh great. So after your changes, this pipeline inherits its hyperparams from its components, but also fixes `impute_strategy` to `"most_frequent"`, because that's all that's supported by CatBoost. Right?
Yeah, that's how the overriding works!
Can you explain why the other settings for `impute_strategy` don't matter here? I would think you'd still want to explore the other options for the numeric features handled by the simple imputer.
CatBoost handles categorical encoding natively and since One Hot Encoder is not a component of this pipeline, Simple Imputer needs to handle categorical data as well.
If we set the simple imputer like this, it also uses `most_frequent` for the numeric columns. What if the user wanted to also search for imputing with `mean`?
Based on the CatBoost documentation, it appears CatBoost can handle NaNs as well, at least for numeric features: https://catboost.ai/docs/concepts/algorithm-missing-values-processing.html
I can put up an issue for this and discuss more there! This PR doesn't actually alter how the hyperparameters are set for the catboost pipelines.
ya, let's do that. sorry for the tangent. thanks
```python
}

assert MockPipeline.hyperparameters == hyperparameters
assert MockPipeline(parameters={}, objective='precision').hyperparameters == hyperparameters
```
Suggestions for more tests:
- Create a pipeline which just uses one component, which has no hyperparameters. Make sure the hyperparams list comes back empty with no error.
- Try setting invalid hyperparam values. This may be outside the scope of this PR... but it would be nice to have hyperparameter validation somewhere... although it's probably best if that's a separate PR, so that's ok.
- What happens if you set `"impute_strategy": None` in the pipeline hyperparams? I'd hope that would mean the tuner would ignore that parameter. We should decide what that should do.
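The first suggested test could look roughly like the sketch below. The component and pipeline classes here are hypothetical stand-ins for evalml's real API, just to illustrate the expected behavior: no hyperparameters, no error.

```python
# Hypothetical sketch (stand-in classes, not evalml's real API) of the
# suggested test: a pipeline whose only component defines no hyperparameter
# ranges should report an empty dict without raising.
class NoParamComponent:
    hyperparameter_ranges = {}


class NoParamPipeline:
    component_graph = [NoParamComponent]
    _hyperparameters = None

    @classmethod
    def hyperparameters(cls):
        """Merge ranges from components, then apply any class overrides."""
        ranges = {}
        for component in cls.component_graph:
            ranges.update(component.hyperparameter_ranges)
        if cls._hyperparameters:
            ranges.update(cls._hyperparameters)
        return ranges


def test_empty_hyperparameters():
    assert NoParamPipeline.hyperparameters() == {}
```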
> Try setting invalid hyperparam values. This may be outside the scope of this PR... but it would be nice to have hyperparameter validation somewhere... although it's probably best if that's a separate PR, so that's ok.

I'll add an issue to validate overridden hyperparameters.

> What happens if you set `"impute_strategy": None` in the pipeline hyperparams? I'd hope that would mean the tuner would ignore that parameter. We should decide what that should do.

When testing this, skopt gives me `ValueError: Dimension has to be a list or tuple.` Should we handle this case in validation as well?
Yeah, good call, can add to #518. Maybe we need code which ignores args set to `None`. We can discuss elsewhere.
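One way to "ignore args set to `None`" could be a small filtering step before the space reaches the tuner, so skopt never sees an invalid dimension. This is only a sketch of the idea under discussion; `filter_hyperparameter_space` is a hypothetical helper, not an existing evalml function.

```python
# Hypothetical helper (not part of evalml): drop any hyperparameter whose
# range is None before handing the search space to the tuner, so skopt
# never raises "Dimension has to be a list or tuple".
def filter_hyperparameter_space(space):
    """Return a copy of `space` with None-valued dimensions removed."""
    return {name: dim for name, dim in space.items() if dim is not None}


space = {
    "impute_strategy": None,        # user opted out of tuning this parameter
    "n_estimators": (10, 1000),     # a normal range dimension
}
filtered = filter_hyperparameter_space(space)
```

Whether silently dropping the parameter is the right behavior (versus raising a clear validation error) is exactly the open question for #518.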
LGTM! Left some testing suggestions. Also started a discussion of whether `_hyperparameters` should be marked private or not.
```diff
@@ -31,10 +31,10 @@ def __init__(self, space, n_points=10, random_state=None):
     raw_dimensions = list()
     for dimension in space:
         # Categorical dimension
-        if isinstance(dimension, list) and all(isinstance(s, (str, bool)) for s in dimension):
+        if isinstance(dimension, list):
```
I put this patch in because `RFClassifierSelectFromModel` and `RFRegressorSelectFromModel` include a list with a string and a number as its values: `"threshold": ['mean', -np.inf]`. I like this change because we should standardize on using `Real` and `Integer` for ranges and lists for specific values. Let me know what you guys think.
Makes sense. So our "categorical" hyperparameter option can support not only string values, but also numbers or even Python objects. That's cool.
I have two suggestions:
- Add unit test coverage for this type of input, for all our tuners.
- Where should we document this? Perhaps we can add it to the list for #385 (Documentation for Component and Pipeline API)? This should go into a guide we write about how to define custom pipelines.
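The relaxed check being discussed amounts to treating *any* list as a categorical dimension, regardless of element types. A rough sketch of that classification step, as a standalone illustration (not the tuner's actual code):

```python
# Sketch (assumed, not the real tuner code) of the relaxed categorical
# check: any list is a categorical dimension, so mixed-type values such
# as ['mean', -np.inf] are allowed alongside plain string lists.
def split_dimensions(space):
    """Separate categorical (list) dimensions from range dimensions."""
    categorical, ranges = {}, {}
    for name, dimension in space.items():
        if isinstance(dimension, list):  # relaxed check: any list is categorical
            categorical[name] = dimension
        else:
            ranges[name] = dimension
    return categorical, ranges


space = {
    "threshold": ["mean", float("-inf")],  # mixed string/number categorical
    "percent_features": (0.01, 1.0),       # a range, not categorical
}
categorical, ranges = split_dimensions(space)
```

This matches the convention proposed above: `Real`/`Integer` (or tuples) for ranges, lists for enumerated values.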
Approved, pending a) update to the `_hyperparameters` name, and b) unit test coverage of the `GridSearch` patch. Good stuff!
Fixes #359.