
Take hyperparameter ranges from components #516

Merged
merged 18 commits into from
Mar 30, 2020
Merged

Conversation

jeremyliweishih (Collaborator)

Fixes #359.

@jeremyliweishih jeremyliweishih changed the title Take hyperparameter ranges from components [WIP] Take hyperparameter ranges from components Mar 20, 2020
@jeremyliweishih jeremyliweishih self-assigned this Mar 20, 2020
codecov bot commented Mar 20, 2020

Codecov Report

Merging #516 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #516   +/-   ##
=======================================
  Coverage   98.68%   98.69%           
=======================================
  Files         114      114           
  Lines        4026     4056   +30     
=======================================
+ Hits         3973     4003   +30     
  Misses         53       53           
Impacted Files Coverage Δ
...ml/pipelines/classification/logistic_regression.py 100.00% <ø> (ø)
evalml/pipelines/classification/random_forest.py 100.00% <ø> (ø)
evalml/pipelines/classification/xgboost.py 100.00% <ø> (ø)
evalml/pipelines/regression/linear_regression.py 100.00% <ø> (ø)
evalml/pipelines/regression/random_forest.py 100.00% <ø> (ø)
evalml/pipelines/classification/catboost.py 100.00% <100.00%> (ø)
evalml/pipelines/pipeline_base.py 98.52% <100.00%> (+0.07%) ⬆️
evalml/pipelines/regression/catboost.py 100.00% <100.00%> (ø)
evalml/tests/conftest.py 100.00% <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 99.62% <100.00%> (+0.04%) ⬆️
... and 4 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ba98f3...c87e4ba. Read the comment docs.

@jeremyliweishih jeremyliweishih changed the title [WIP] Take hyperparameter ranges from components Take hyperparameter ranges from components Mar 20, 2020
@@ -32,6 +32,8 @@ def component_graph(cls):
     def problem_types(cls):
         return NotImplementedError("This pipeline must have `problem_types` as a class variable.")
 
+    _hyperparameters = None
Contributor

@jeremyliweishih why the underscore here? I see you included an example in the custom pipelines doc of how to override hyperparams. If we support that, we should make this public, right?

Collaborator Author

If we include `PipelineBase.hyperparameters` as a class variable, we can't also name this field `hyperparameters`, since they share the same namespace. If it makes more sense, I can make this `hyperparameters` and the other `PipelineBase._hyperparameters`. I went with this because it follows the same setup as `_name` as a class variable with `name` as a property.

Contributor

Oh! I had forgotten this is what we did for name. Cool, makes sense to me.

I bet there's a way to wrangle this using python magic methods/decorators. But it's not worth it. What you have here is great 👍
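The `_name`/`name` pattern referenced above can be sketched roughly as follows. This is a minimal illustrative stand-in: the `classproperty` helper here is simplified, and the class and field names mirror the discussion rather than evalml's actual source.

```python
# Minimal sketch of a classproperty-based pattern: a private class
# variable (_name) that subclasses may override, exposed through a
# public class-level property (name).
class classproperty:
    """Simplified stand-in for a class-level property descriptor."""
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        # Called on class access; `owner` is the class itself.
        return self.func(owner)


class PipelineBase:
    _name = None  # subclasses may override this private field

    @classproperty
    def name(cls):
        # Fall back to the class name when _name isn't set.
        return cls._name or cls.__name__


class MyPipeline(PipelineBase):
    _name = "My Custom Pipeline"
```

Accessing `MyPipeline.name` then returns the overridden value, while `PipelineBase.name` falls back to the class name.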

"max_depth": Integer(1, 32),
"impute_strategy": ["mean", "median", "most_frequent"],
"percent_features": Real(.01, 1)
}
Contributor

Awesome!!

return handle_component(cls.component_graph[-1]).model_family

@classproperty
def hyperparameters(cls):
"Returns hyperparameter ranges from components"
Contributor

Can you please add a quick description of the return structure, specifically that it returns hyperparameters as a flat list across all components?

hyperparameter_ranges.update(component.hyperparameter_ranges)

if cls._hyperparameters:
hyperparameter_ranges.update(cls._hyperparameters)
Contributor

So this is how we support pipeline-level overriding?

Collaborator Author

Yeah this was the easiest way I thought of doing it.
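A rough sketch of the merge-then-override behavior under discussion: component ranges are collected in order, then a pipeline-level `_hyperparameters` dict overrides them. The component and pipeline classes below are illustrative stand-ins, not evalml's real classes.

```python
# Sketch: each component exposes hyperparameter_ranges; the pipeline
# merges them, then applies its own _hyperparameters as overrides.
class SimpleImputer:
    hyperparameter_ranges = {"impute_strategy": ["mean", "median", "most_frequent"]}


class RandomForestClassifier:
    hyperparameter_ranges = {"n_estimators": (10, 1000)}


class MockPipeline:
    component_graph = [SimpleImputer, RandomForestClassifier]
    # Pipeline-level override: fix impute_strategy to a single value.
    _hyperparameters = {"impute_strategy": ["most_frequent"]}

    @classmethod
    def hyperparameters(cls):
        hyperparameter_ranges = {}
        for component in cls.component_graph:
            hyperparameter_ranges.update(component.hyperparameter_ranges)
        if cls._hyperparameters:
            hyperparameter_ranges.update(cls._hyperparameters)
        return hyperparameter_ranges
```

Because `dict.update` is applied last with the pipeline-level dict, any key it shares with a component wins.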

"impute_strategy": ["most_frequent"],
"n_estimators": Integer(10, 1000),
"eta": Real(0, 1),
"max_depth": Integer(1, 8),
Contributor

Oh great. So after your changes, this pipeline inherits its hyperparams from its components, but also fixes impute_strategy to "most_frequent", because that's all that's supported by catboost. Right?

Collaborator Author

Yeah, that's how the overriding works!

Contributor

Can you explain why the other settings for impute_strategy don't matter here? I would think you'd still want to explore the other options for the numeric features handled by the simple imputer.

Collaborator Author

CatBoost handles categorical encoding natively and since One Hot Encoder is not a component of this pipeline, Simple Imputer needs to handle categorical data as well.

Contributor

If we set the simple imputer like this, it also uses most_frequent for the numeric columns. What if the user wanted to also search over imputing with the mean?

Contributor

Based on the CatBoost documentation, it appears CatBoost can handle NaNs as well, at least for numeric features:

https://catboost.ai/docs/concepts/algorithm-missing-values-processing.html

Collaborator Author

I can put up an issue for this and discuss more there! This PR doesn't actually alter how the hyperparameters are set for the catboost pipelines.

Contributor

Yeah, let's do that. Sorry for the tangent. Thanks!

Collaborator Author

I filed #519. Glad you brought it up, @kmax12!

}

assert MockPipeline.hyperparameters == hyperparameters
assert MockPipeline(parameters={}, objective='precision').hyperparameters == hyperparameters
Contributor

Suggestion for more tests:

  • Create a pipeline which just uses one component, which has no hyperparameters. Make sure the hyperparams list comes back empty with no error
  • Try setting invalid hyperparam values. This may be outside the scope of this PR... but it would be nice to have hyperparameter validation somewhere... although it's probably best if that's a separate PR, so that's ok
  • What happens if you set "impute_strategy": None in the pipeline hyperparams? I'd hope that would mean the tuner would ignore that parameter. We should decide what that should do
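The first suggested test could be sketched roughly like this. All names here are hypothetical stand-ins, not evalml's actual test fixtures.

```python
# Sketch of the suggested test: a pipeline whose only component defines
# no hyperparameter ranges should yield an empty dict, with no error.
class NoParamsComponent:
    hyperparameter_ranges = {}


def pipeline_hyperparameters(component_graph, overrides=None):
    """Illustrative merge helper mirroring the logic in this PR."""
    ranges = {}
    for component in component_graph:
        ranges.update(component.hyperparameter_ranges)
    if overrides:
        ranges.update(overrides)
    return ranges


assert pipeline_hyperparameters([NoParamsComponent]) == {}
```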

Collaborator Author

  • Try setting invalid hyperparam values. This may be outside the scope of this PR... but it would be nice to have hyperparameter validation somewhere... although it's probably best if that's a separate PR, so that's ok

I'll add an issue to validate overridden hyperparameters.

  • What happens if you set "impute_strategy": None in the pipeline hyperparams? I'd hope that would mean the tuner would ignore that parameter. We should decide what that should do

When testing this, skopt gives me a `ValueError: Dimension has to be a list or tuple.` Should we handle this case in validation as well?

Contributor — @dsherry, Mar 27, 2020

Yeah, good call, can add to #518 . Maybe we need code which ignores args set to None. We can discuss elsewhere
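The "ignore args set to None" idea could be sketched like this. This is purely illustrative; `filter_search_space` is a hypothetical helper, not part of evalml.

```python
# Sketch: drop hyperparameters whose dimension is None before handing
# the search space to the tuner, so skopt never sees an invalid
# dimension and the tuner simply skips that parameter.
def filter_search_space(hyperparameters):
    return {name: dim for name, dim in hyperparameters.items() if dim is not None}


space = {"impute_strategy": None, "n_estimators": (10, 1000)}
assert filter_search_space(space) == {"n_estimators": (10, 1000)}
```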

@dsherry (Contributor) left a comment

LGTM! Left some testing suggestions. Also started a discussion of whether _hyperparameters should be marked private or not.

@@ -31,10 +31,10 @@ def __init__(self, space, n_points=10, random_state=None):
         raw_dimensions = list()
         for dimension in space:
             # Categorical dimension
-            if isinstance(dimension, list) and all(isinstance(s, (str, bool)) for s in dimension):
+            if isinstance(dimension, list):
Collaborator Author

@dsherry @christopherbunn

I put this patch in because RFClassifierSelectFromModel and RFRegressorSelectFromModel include a list with a string and a number as its values: `"threshold": ['mean', -np.inf]`. I like this change because we should standardize on using Real and Integer for ranges and lists for specific values. Let me know what you think.
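The loosened check means grid expansion also works for mixed-type categorical lists. A self-contained sketch of that idea follows; it is not the actual GridSearchTuner code, just an illustration of treating any list as a literal set of values and a `(low, high)` pair as a sampled numeric range.

```python
import itertools
import math


def expand_grid(space, n_points=10):
    """Sketch of grid expansion: lists keep their literal values
    (strings, numbers, anything); (low, high) tuples become n_points
    evenly spaced samples."""
    raw_dimensions = []
    for dimension in space:
        if isinstance(dimension, list):
            # Categorical dimension: values may mix types, e.g. ['mean', -inf]
            raw_dimensions.append(dimension)
        else:
            # Numeric range: sample n_points evenly spaced values
            low, high = dimension
            step = (high - low) / (n_points - 1)
            raw_dimensions.append([low + i * step for i in range(n_points)])
    return list(itertools.product(*raw_dimensions))


grid = expand_grid([["mean", -math.inf], (0.0, 1.0)], n_points=3)
# 2 categorical values x 3 sampled points = 6 grid entries
```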

Contributor

Makes sense. So our "categorical" hyperparameter option can support not only string values, but also numbers or even python objects. That's cool.

I have two suggestions:

  • Add unit test coverage for this type of input, for all our tuners
  • Where should we document this? Perhaps we can add this to the list for Documentation for Component and Pipeline API  #385 ? This should go into a guide we write about how to define custom pipelines.

@dsherry (Contributor) left a comment

Approved, pending a) update to _hyperparameters name, and b) unit test coverage of the GridSearch patch. Good stuff!

Successfully merging this pull request may close these issues.

Hyperparameter ranges defined in components are not used
3 participants