-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding default_parameters classproperty to components and pipelines. #879
Adding default_parameters classproperty to components and pipelines. #879
Conversation
Codecov Report
@@ Coverage Diff @@
## master #879 +/- ##
=======================================
Coverage 99.75% 99.75%
=======================================
Files 195 195
Lines 8505 8532 +27
=======================================
+ Hits 8484 8511 +27
Misses 21 21
Continue to review full report at Codecov.
|
from evalml.exceptions import MissingComponentError | ||
from evalml.pipelines import PipelineBase | ||
from evalml.pipelines.classification import * # noqa: F401,F403 | ||
from evalml.pipelines.components import * # noqa: F401,F403 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsherry What are your thoughts on using this pattern to automatically update all_components
and all_pipelines
? We would need to be careful not to import base classes but I think it may be more maintainable than having to update components/utils.py
manually. For example, SelectColumns
is not in all_components
because I did not know I needed to update components/utils.py
. 😞
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton are you suggesting we say from evalml.pipelines.components import *
in evalml/components/utils.py
?
#363 tracks coming up with a more permanent solution for how to define 1) the list of all components/estimators defined in the evalml codebase and 2) the subset of components/estimators supported for automl (we haven't added elasticnet/extratrees to automl yet although we have components in place for them). I suggest we continue this discussion there and don't make any changes for that in this PR. I just added a comment there.
RE SelectColumns
not being in there, oh oops! More evidence that we need a better pattern in place.
RE this specific line of code: do you think we should update our lint to allow this syntax? We've forbidden the use of import *
to-date. I think we should avoid putting noqa
in our code as much as possible. I think we only use those statements in three places outside of __init__.py
files. I'm kind of partial to continuing to forbid it, because I think it'll generally keep our imports faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsherry Sounds good - let's continue the discussion about a better registry in #363. Regarding this line of code, I think I can just get rid of it by using all_components
and all_pipelines
(after adding SelectColumns
). I used the import *
statement because I did not know there was an existing solution!
On a separate note, are you ok with these tests living here or do you want me to add them to test_components
/test_pipelines
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks.
Cool! Sounds good.
Ah got it. RE my other comments in the tests, perhaps moving them to test_components.py
/test_pipelines.py
as you suggested would be more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments. Impl looks good! Left comments on testing and some small stuff. Also left a discussion on the section about #363 and component/pipeline registration, which I see you already responded to.
@@ -18,6 +18,7 @@ Changelog | |||
* Added new utility functions necessary for generating dynamic preprocessing pipelines :pr:`852` | |||
* Added kwargs to all components :pr:`863` | |||
* Added SelectColumns transformer :pr:`873` | |||
* Added `default_parameters` class property to components and pipelines :pr:`879` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
defaults = {} | ||
for variable_name, value in signature.parameters.items(): | ||
if value.default is not inspect.Parameter.empty and variable_name != "random_state": | ||
defaults[variable_name] = value.default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does value.default is not inspect.Parameter.empty
mean we're excluding parameter=None
from the defaults? And if so, why do so? We currently use None
to specify defaults in some places
Good call on ignoring random_state
. We've been treating random seeds as special and separate from other params.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it excludes parameters that do not have defaults. Nones are preserved, so for example, DropColumns
has {"columns": None}
as defaults.
@@ -32,6 +38,21 @@ def parameters(self): | |||
"""Returns the parameters which were used to initialize the component""" | |||
return copy.copy(self._parameters) | |||
|
|||
@classproperty | |||
def default_parameters(cls,): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Food for thought: something we could do once we have a metaclass is, in the metaclasses' __new__
method, i.e. when each component is being defined, we can set default_parameters
as a dict and avoid having to recompute it. I suppose we could implement memoization in other ways too lol, I just have a desire for metaclasses on my mind lately!
@@ -32,6 +38,21 @@ def parameters(self): | |||
"""Returns the parameters which were used to initialize the component""" | |||
return copy.copy(self._parameters) | |||
|
|||
@classproperty | |||
def default_parameters(cls,): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's an extra comma here which can be deleted. Funny that lint doesn't care about that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted!
# Our convention is that default parameters are what get passed in to the parameters dict | ||
# when nothing is changed to the init. In this case, since we use None to encode a list of date units, | ||
# we need to manually specify the defaults. | ||
return {"features_to_extract": ["year", "month", "day_of_week", "hour"]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its fine for now to delete this and just have ComponentBase.default_parameters
return {'features_to_extract': None}
, right? In the short-term, as long as the parameters produced here correspond to the default behavior, we're all set. In the long-term, as we discussed this morning, we have a representation challenge to solve around how to best encode defaults and how to handle backwards compatibility of defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know - as a user I think the implementation as it is now is better. {"features_to_extract": None}
makes it sound like no features are being extracted! Plus, this makes it possible to test all components with the Component.default_parameters == Component().parameters
line without having to check if the Component is DateTimeFeaturization
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton
I agree the behavior is better the way you have it, but I'm not sure I agree the implementation is better! What if we want to change these values -- we'd have to remember to change them in both the component __init__
and the component default_parameters
definitions. Having to update this in two places isn't great w.r.t. maintainability.
We could define a static member like _DATETIME_FEATURES_DEFAULTS=["year", "month", "day_of_week", "hour"]
and use that in both places, but that's kinda messy and doesn't fully solve the problem IMO.
I have a proposal. What if you do this in ComponentBase
and delete all the subclass impls:
@classproperty
def default_parameters(cls):
inst = cls()
return inst.parameters
This would ensure the behavior of each component's __init__
method is always reflected in default_parameters
I could see an argument being made about it being bad performance-wise to create an instance here; but memoization would solve that, plus I'd want to see that happening in the profiler before trying to optimize it.
Here's the POC I just did to verify this would work:
from evalml.utils import classproperty
class Main:
def __init__(self, a='A', b='B', random_state=0):
self.a = a
self.b = b
self.random_state = random_state
self._parameters = {'a':a, 'b':b}
@property
def parameters(self):
return self._parameters
@classproperty
def default_parameters(cls):
inst = cls()
return inst.parameters
print(Main.default_parameters)
inst = Main()
print(inst.parameters)
print(inst.default_parameters)
inst = Main(a='aaa', b='bbb')
print(inst.parameters)
print(inst.default_parameters)
With output:
{'a': 'A', 'b': 'B'}
{'a': 'A', 'b': 'B'}
{'a': 'A', 'b': 'B'}
{'a': 'aaa', 'b': 'bbb'}
{'a': 'A', 'b': 'B'}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsherry I like your suggestion! I think adding memoization will be straight-forward:
@classproperty
def default_parameters(cls):
"""Returns the default parameters for this component.
Our convention is that Component.default_parameters == Component().parameters.
Returns:
dict: default parameters for this component.
"""
if cls._default_parameters is None:
cls._default_parameters = cls().parameters
return cls._default_parameters
I'll implement and push this change if it's ok with you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah!! Sweet
evalml/pipelines/pipeline_base.py
Outdated
@@ -270,6 +270,16 @@ def parameters(self): | |||
""" | |||
return {c.name: copy.copy(c.parameters) for c in self.component_graph if c.parameters} | |||
|
|||
@classproperty | |||
def default_parameters(cls): | |||
"""Returns the default parameter dictionary for this pipeline.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs Returns
line in docstring
component = handle_component_class(c) | ||
if component.default_parameters: | ||
defaults[component.name] = component.default_parameters | ||
return defaults |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Another great reason to have a metaclass: we can validate if the component_graph
is correct before the class is defined, rather than having to worry about whether that code could break in this for
loop! Ok, sorry, I'll stop peppering this PR with comments about metaclasses 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe it's a neat idea! I'll look into metaclasses more!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#561 tracks this!
|
||
regression_pipelines = inspect.getmembers(sys.modules['evalml.pipelines.regression'], inspect.isclass) | ||
classification_pipelines = inspect.getmembers(sys.modules['evalml.pipelines.classification'], inspect.isclass) | ||
all_pipelines = regression_pipelines + classification_pipelines + [("PipelineBase", PipelineBase)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, perhaps this was part of the pattern you mentioned? I think we should use all_components
/all_pipelines
here and have a separate conversation about how to update the implementation of those methods.
|
||
def cannot_check_because_base_or_not_installed(cls): | ||
|
||
if issubclass(cls, ComponentBase): # noqa: F405 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the noqa
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just get rid of this function!
if cannot_check_because_base_or_not_installed(cls): | ||
pytest.skip(f"Skipping {class_name} because it is not installed or it is a base class.") | ||
|
||
assert cls.default_parameters == cls({}).parameters, f"{class_name}'s default parameters don't match __init__." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fancy! It took me a bit to understand the cannot_check_because_base_or_not_installed
but I think I follow now. I wonder if there's a middle ground in terms of complexity between what you've got and the for
loops I was doing. Perhaps you could keep the parametrize over all components, but then spell out the contents of the test? OK to have duplicate code.
Come to think of it, why can't you do:
@pytest.mark.parametrize("class_name,cls", all_components())
def test_default_parameters(class_name, cls):
assert cls.default_parameters == cls().parameters, f"{class_name}'s default parameters don't match __init__."
@pytest.mark.parametrize("class_name,cls", all_pipelines())
def test_pipeline_default_parameters(class_name, cls):
assert cls.default_parameters == cls({}).parameters, f"{class_name}'s default parameters don't match __init__."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I like this suggestion! I just didn't know about all_components
and all_pipelines
!
…st_pipelines.py and test_components.py
…st_pipelines.py and test_components.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you take a look at pipeline_class.rst
to see how to add these class properties to the API reference? Basically you need to manually add autoattribute
for default_parameters
for it to show up in the API reference. You would need to do this for both pipelines and components. Then you can verify either locally by building docs or looking at your build on RTD. I know this is quite a confusing process so let me know if you want to talk about it!
@jeremyliweishih Do you also want me to add |
@freddyaboulton theres no need to add
Sphinx can do the same for the class properties (but I think it cannot retrieve the docstring) but we want to be able to show the values of these class in the API reference as well. |
@jeremyliweishih Got it, thanks for explaining that! |
…rmer_class rst files.
…es' of github.com:FeatureLabs/evalml into 848-add-default-params-getter-to-components-and-pipelines
@jeremyliweishih I just added |
@freddyaboulton looks good to me! you can also verify here when its done building. You can check the builds here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me!
@@ -494,3 +509,12 @@ def test_estimator_predict_output_type(X_y): | |||
assert isinstance(predict_proba_output, pd.DataFrame) | |||
assert predict_proba_output.shape == (len(y), len(np.unique(y))) | |||
assert (predict_proba_output.columns == y_cols_expected).all() | |||
|
|||
|
|||
components = list(all_components().items()) + [(DateTimeFeaturization.name, DateTimeFeaturization)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we don't list this in all_components
? Why is that? @angela97lin do you remember? I think we should be including it in all_components
.
We can do it in a separate PR, just wondering!
@@ -6,13 +6,14 @@ | |||
|
|||
.. autoclass:: {{ objname }} | |||
|
|||
{% set class_attributes = ['name', 'model_family', 'hyperparameter_ranges'] %} | |||
{% set class_attributes = ['name', 'model_family', 'hyperparameter_ranges', 'default_parameters'] %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @freddyaboulton !
I love having that parametrize
test coverage on all our components.
I left a comment on the impl -- I have a suggestion for something which I think will be easier to maintain. Let's resolve that conversation before you merge.
Also left a comment on the tests about DatetimeFeaturization
not being in all_components
but that's not blocking / we can handle it separately.
@@ -51,6 +52,14 @@ def __init__(self, features_to_extract=None, random_state=0, **kwargs): | |||
component_obj=None, | |||
random_state=random_state) | |||
|
|||
@classproperty | |||
def default_parameters(cls,): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh style nit-pick: lol one more extra comma!
…ization does not have custom code.
Pull Request Description
Addresses #848 by adding a
default_parameters
classproperty to bothComponentBase
andPipelineBase
by inspecting the__init__
signature.Potential Issues
There can be cases where the default parameters of the component don't match the
__init__
signature, e.g.DateTimeFeaturization
. In these cases, the component should manually specify the default parameters. Fortunately, this issue will come up during the unit tests as long as it's added to the components__init__
file.After creating the pull request: in order to pass the changelog_updated check you will need to update the "Future Release" section of
docs/source/changelog.rst
to include this pull request by adding :pr:123
.