Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding default_parameters classproperty to components and pipelines. #879

Merged

Conversation

freddyaboulton
Copy link
Contributor

@freddyaboulton freddyaboulton commented Jun 24, 2020

Pull Request Description

Addresses #848 by adding a default_parameters classproperty to both ComponentBase and PipelineBase by inspecting the __init__ signature.

Potential Issues

There can be cases where the default parameters of the component don't match the __init__ signature, e.g. DateTimeFeaturization. In these cases, the component should manually specify the default parameters. Fortunately, this issue will come up during the unit tests as long as it's added to the components __init__ file.


After creating the pull request: in order to pass the changelog_updated check you will need to update the "Future Release" section of docs/source/changelog.rst to include this pull request by adding :pr:123.

@freddyaboulton freddyaboulton linked an issue Jun 24, 2020 that may be closed by this pull request
@codecov
Copy link

codecov bot commented Jun 24, 2020

Codecov Report

Merging #879 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #879   +/-   ##
=======================================
  Coverage   99.75%   99.75%           
=======================================
  Files         195      195           
  Lines        8505     8532   +27     
=======================================
+ Hits         8484     8511   +27     
  Misses         21       21           
Impacted Files Coverage Δ
evalml/pipelines/components/utils.py 100.00% <ø> (ø)
evalml/pipelines/components/component_base.py 100.00% <100.00%> (ø)
evalml/pipelines/pipeline_base.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_utils.py 97.67% <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 99.80% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1740209...f79ecf4. Read the comment docs.

@freddyaboulton freddyaboulton self-assigned this Jun 24, 2020
@freddyaboulton freddyaboulton marked this pull request as ready for review June 24, 2020 18:35
@freddyaboulton freddyaboulton requested review from dsherry and jeremyliweishih and removed request for dsherry June 24, 2020 18:35
from evalml.exceptions import MissingComponentError
from evalml.pipelines import PipelineBase
from evalml.pipelines.classification import * # noqa: F401,F403
from evalml.pipelines.components import * # noqa: F401,F403
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry What are your thoughts on using this pattern to automatically update all_components and all_pipelines? We would need to be careful not to import base classes but I think it may be more maintainable than having to update components/utils.py manually. For example, SelectColumns is not in all_components because I did not know I needed to update components/utils.py. 😞

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton are you suggesting we say from evalml.pipelines.components import * in evalml/components/utils.py?

#363 tracks coming up with a more permanent solution for how to define 1) the list of all components/estimators defined in the evalml codebase and 2) the subset of components/estimators supported for automl (we haven't added elasticnet/extratrees to automl yet although we have components in place for them). I suggest we continue this discussion there and don't make any changes for that in this PR. I just added a comment there.

RE SelectColumns not being in there, oh oops! More evidence that we need a better pattern in place.

RE this specific line of code: do you think we should update our lint to allow this syntax? We've forbidden the use of import * to-date. I think we should avoid putting noqa in our code as much as possible. I think we only use those statements in three places outside of __init__.py files. I'm kind of partial to continuing to forbid it, because I think it'll generally keep our imports faster.

Copy link
Contributor Author

@freddyaboulton freddyaboulton Jun 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry Sounds good - let's continue the discussion about a better registry in #363. Regarding this line of code, I think I can just get rid of it by using all_components and all_pipelines (after adding SelectColumns). I used the import * statement because I did not know there was an existing solution!

On a separate note, are you ok with these tests living here or do you want me to add them to test_components/test_pipelines?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks.

Cool! Sounds good.

Ah got it. RE my other comments in the tests, perhaps moving them to test_components.py/test_pipelines.py as you suggested would be more clear.

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. Impl looks good! Left comments on testing and some small stuff. Also left a discussion on the section about #363 and component/pipeline registration, which I see you already responded to.

@@ -18,6 +18,7 @@ Changelog
* Added new utility functions necessary for generating dynamic preprocessing pipelines :pr:`852`
* Added kwargs to all components :pr:`863`
* Added SelectColumns transformer :pr:`873`
* Added `default_parameters` class property to components and pipelines :pr:`879`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

defaults = {}
for variable_name, value in signature.parameters.items():
if value.default is not inspect.Parameter.empty and variable_name != "random_state":
defaults[variable_name] = value.default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does value.default is not inspect.Parameter.empty mean we're excluding parameter=None from the defaults? And if so, why do so? We currently use None to specify defaults in some places

Good call on ignoring random_state. We've been treating random seeds as special and separate from other params.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it excludes parameters that do not have defaults. Nones are preserved, so for example, DropColumns has {"columns": None} as defaults.

@@ -32,6 +38,21 @@ def parameters(self):
"""Returns the parameters which were used to initialize the component"""
return copy.copy(self._parameters)

@classproperty
def default_parameters(cls,):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Food for thought: something we could do once we have a metaclass is, in the metaclasses' __new__ method, i.e. when each component is being defined, we can set default_parameters as a dict and avoid having to recompute it. I suppose we could implement memoization in other ways too lol, I just have a desire for metaclasses on my mind lately!

@@ -32,6 +38,21 @@ def parameters(self):
"""Returns the parameters which were used to initialize the component"""
return copy.copy(self._parameters)

@classproperty
def default_parameters(cls,):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's an extra comma here which can be deleted. Funny that lint doesn't care about that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted!

# Our convention is that default parameters are what get passed in to the parameters dict
# when nothing is changed to the init. In this case, since we use None to encode a list of date units,
# we need to manually specify the defaults.
return {"features_to_extract": ["year", "month", "day_of_week", "hour"]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its fine for now to delete this and just have ComponentBase.default_parameters return {'features_to_extract': None}, right? In the short-term, as long as the parameters produced here correspond to the default behavior, we're all set. In the long-term, as we discussed this morning, we have a representation challenge to solve around how to best encode defaults and how to handle backwards compatibility of defaults.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know - as a user I think the implementation as it is now is better. {"features_to_extract": None} makes it sound like no features are being extracted! Plus, this makes it possible to test all components with the Component.default_parameters == Component().parameters line without having to check if the Component is DateTimeFeaturization.

Copy link
Contributor

@dsherry dsherry Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton
I agree the behavior is better the way you have it, but I'm not sure I agree the implementation is better! What if we want to change these values -- we'd have to remember to change them in both the component __init__ and the component default_parameters definitions. Having to update this in two places isn't great w.r.t. maintainability.

We could define a static member like _DATETIME_FEATURES_DEFAULTS=["year", "month", "day_of_week", "hour"] and use that in both places, but that's kinda messy and doesn't fully solve the problem IMO.

I have a proposal. What if you do this in ComponentBase and delete all the subclass impls:

@classproperty
def default_parameters(cls):
    inst = cls()
    return inst.parameters

This would ensure the behavior of each component's __init__ method is always reflected in default_parameters

I could see an argument being made about it being bad performance-wise to create an instance here; but memoization would solve that, plus I'd want to see that happening in the profiler before trying to optimize it.


Here's the POC I just did to verify this would work:

from evalml.utils import classproperty

class Main:
    def __init__(self, a='A', b='B', random_state=0):
        self.a = a
        self.b = b
        self.random_state = random_state
        self._parameters = {'a':a, 'b':b}
    @property
    def parameters(self):
        return self._parameters
    @classproperty
    def default_parameters(cls):
        inst = cls()
        return inst.parameters

print(Main.default_parameters)

inst = Main()
print(inst.parameters)
print(inst.default_parameters)

inst = Main(a='aaa', b='bbb')
print(inst.parameters)
print(inst.default_parameters)

With output:

{'a': 'A', 'b': 'B'}
{'a': 'A', 'b': 'B'}
{'a': 'A', 'b': 'B'}
{'a': 'aaa', 'b': 'bbb'}
{'a': 'A', 'b': 'B'}

Copy link
Contributor Author

@freddyaboulton freddyaboulton Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry I like your suggestion! I think adding memoization will be straight-forward:

    @classproperty
    def default_parameters(cls):
        """Returns the default parameters for this component.

         Our convention is that Component.default_parameters == Component().parameters.

         Returns:
             dict: default parameters for this component.
        """

        if cls._default_parameters is None:
            cls._default_parameters = cls().parameters

        return cls._default_parameters

I'll implement and push this change if it's ok with you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah!! Sweet

@@ -270,6 +270,16 @@ def parameters(self):
"""
return {c.name: copy.copy(c.parameters) for c in self.component_graph if c.parameters}

@classproperty
def default_parameters(cls):
"""Returns the default parameter dictionary for this pipeline."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs Returns line in docstring

component = handle_component_class(c)
if component.default_parameters:
defaults[component.name] = component.default_parameters
return defaults
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Another great reason to have a metaclass: we can validate if the component_graph is correct before the class is defined, rather than having to worry about whether that code could break in this for loop! Ok, sorry, I'll stop peppering this PR with comments about metaclasses 😂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe it's a neat idea! I'll look into metaclasses more!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#561 tracks this!


regression_pipelines = inspect.getmembers(sys.modules['evalml.pipelines.regression'], inspect.isclass)
classification_pipelines = inspect.getmembers(sys.modules['evalml.pipelines.classification'], inspect.isclass)
all_pipelines = regression_pipelines + classification_pipelines + [("PipelineBase", PipelineBase)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, perhaps this was part of the pattern you mentioned? I think we should use all_components/all_pipelines here and have a separate conversation about how to update the implementation of those methods.


def cannot_check_because_base_or_not_installed(cls):

if issubclass(cls, ComponentBase): # noqa: F405
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the noqa here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just get rid of this function!

if cannot_check_because_base_or_not_installed(cls):
pytest.skip(f"Skipping {class_name} because it is not installed or it is a base class.")

assert cls.default_parameters == cls({}).parameters, f"{class_name}'s default parameters don't match __init__."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fancy! It took me a bit to understand the cannot_check_because_base_or_not_installed but I think I follow now. I wonder if there's a middle ground in terms of complexity between what you've got and the for loops I was doing. Perhaps you could keep the parametrize over all components, but then spell out the contents of the test? OK to have duplicate code.

Come to think of it, why can't you do:

@pytest.mark.parametrize("class_name,cls", all_components())
def test_default_parameters(class_name, cls):
    assert cls.default_parameters == cls().parameters, f"{class_name}'s default parameters don't match __init__."

@pytest.mark.parametrize("class_name,cls", all_pipelines())
def test_pipeline_default_parameters(class_name, cls):
    assert cls.default_parameters == cls({}).parameters, f"{class_name}'s default parameters don't match __init__."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I like this suggestion! I just didn't know about all_components and all_pipelines!

@freddyaboulton freddyaboulton requested a review from dsherry June 25, 2020 14:18
Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you take a look at pipeline_class.rst to see how to add these class properties to the API reference? Basically you need to manually add autoattribute for default_parameters for it to show up in the API reference. You would need to do this for both pipelines and components. Then you can verify either locally by building docs or looking at your build on RTD. I know this is quite a confusing process so let me know if you want to talk about it!

@freddyaboulton
Copy link
Contributor Author

@jeremyliweishih Do you also want me to add parameters to the pipeline_class.rst? Looks like it's not there.

@jeremyliweishih
Copy link
Collaborator

@jeremyliweishih Do you also want me to add parameters to the pipeline_class.rst? Looks like it's not there.

@freddyaboulton theres no need to add parameters or any python @property since Sphinx autodoc will pick them up below:

   .. rubric:: Instance attributes

   .. autosummary::
      :nosignatures:

   {% for item in attributes %}
   {%- if item not in class_attributes %}
      ~{{ name }}.{{ item }}
   {%- endif %}
   {%- endfor %}
   {% endblock %}

Sphinx can do the same for the class properties (but I think it cannot retrieve the docstring) but we want to be able to show the values of these class in the API reference as well.

@freddyaboulton
Copy link
Contributor Author

@jeremyliweishih Got it, thanks for explaining that!

…es' of github.com:FeatureLabs/evalml into 848-add-default-params-getter-to-components-and-pipelines
@freddyaboulton
Copy link
Contributor Author

Can you take a look at pipeline_class.rst to see how to add these class properties to the API reference? Basically you need to manually add autoattribute for default_parameters for it to show up in the API reference. You would need to do this for both pipelines and components. Then you can verify either locally by building docs or looking at your build on RTD. I know this is quite a confusing process so let me know if you want to talk about it!

@jeremyliweishih I just added default_parameters to pipeline_class.rst, transformer_class.rst and estimator_class.rst.

@jeremyliweishih
Copy link
Collaborator

jeremyliweishih commented Jun 25, 2020

@freddyaboulton looks good to me! you can also verify here when its done building. You can check the builds here.

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!

@@ -494,3 +509,12 @@ def test_estimator_predict_output_type(X_y):
assert isinstance(predict_proba_output, pd.DataFrame)
assert predict_proba_output.shape == (len(y), len(np.unique(y)))
assert (predict_proba_output.columns == y_cols_expected).all()


components = list(all_components().items()) + [(DateTimeFeaturization.name, DateTimeFeaturization)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we don't list this in all_components? Why is that? @angela97lin do you remember? I think we should be including it in all_components.

We can do it in a separate PR, just wondering!

@@ -6,13 +6,14 @@

.. autoclass:: {{ objname }}

{% set class_attributes = ['name', 'model_family', 'hyperparameter_ranges'] %}
{% set class_attributes = ['name', 'model_family', 'hyperparameter_ranges', 'default_parameters'] %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice!

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @freddyaboulton !

I love having that parametrize test coverage on all our components.

I left a comment on the impl -- I have a suggestion for something which I think will be easier to maintain. Let's resolve that conversation before you merge.

Also left a comment on the tests about DatetimeFeaturization not being in all_components but that's not blocking / we can handle it separately.

@@ -51,6 +52,14 @@ def __init__(self, features_to_extract=None, random_state=0, **kwargs):
component_obj=None,
random_state=random_state)

@classproperty
def default_parameters(cls,):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh style nit-pick: lol one more extra comma!

@freddyaboulton freddyaboulton merged commit 1daa7b0 into master Jun 26, 2020
@angela97lin angela97lin mentioned this pull request Jun 30, 2020
@freddyaboulton freddyaboulton deleted the 848-add-default-params-getter-to-components-and-pipelines branch July 10, 2020 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add default_parameters dict getter to components and pipelines
3 participants