Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

363 common method for registering components #898

Merged
merged 11 commits into from
Jul 8, 2020

Conversation

freddyaboulton
Copy link
Contributor

@freddyaboulton freddyaboulton commented Jun 30, 2020

Pull Request Description

Addresses #363 by creating a function called get_importable_subclasses for creating a list of all non-base-class subclasses of Transformer, Estimator, and PipelineBase. This function has a flag for filtering for estimators/pipelines used for automl search. It should be removed in #899.


After creating the pull request: in order to pass the changelog_updated check you will need to update the "Future Release" section of docs/source/changelog.rst to include this pull request by adding :pr:123.

@codecov
Copy link

codecov bot commented Jun 30, 2020

Codecov Report

Merging #898 into master will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #898      +/-   ##
==========================================
- Coverage   99.83%   99.83%   -0.01%     
==========================================
  Files         168      168              
  Lines        8352     8331      -21     
==========================================
- Hits         8338     8317      -21     
  Misses         14       14              
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
...lml/automl/automl_algorithm/iterative_algorithm.py 100.00% <100.00%> (ø)
evalml/model_family/utils.py 100.00% <100.00%> (ø)
evalml/pipelines/components/utils.py 100.00% <100.00%> (ø)
evalml/pipelines/pipeline_base.py 100.00% <100.00%> (ø)
evalml/pipelines/utils.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_estimators.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_utils.py 100.00% <100.00%> (ø)
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b7afaa...52b6966. Read the comment docs.

@freddyaboulton freddyaboulton linked an issue Jun 30, 2020 that may be closed by this pull request
@freddyaboulton freddyaboulton self-assigned this Jun 30, 2020
@freddyaboulton freddyaboulton force-pushed the 363-common-method-for-registering-components branch from 83cb4f2 to 0adc224 Compare June 30, 2020 17:44
@freddyaboulton freddyaboulton added this to the July 2020 milestone Jun 30, 2020
@freddyaboulton freddyaboulton marked this pull request as ready for review June 30, 2020 18:25
@freddyaboulton freddyaboulton force-pushed the 363-common-method-for-registering-components branch 2 times, most recently from 52370da to bb62700 Compare July 1, 2020 19:11
Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just discussed with Freddy, agreed on plan for changes. Its close though!


return classes

return get_all_classes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the extra level of indirection here? Why not just return the list of classes?

evalml/utils/gen_utils.py Outdated Show resolved Hide resolved
evalml/utils/gen_utils.py Show resolved Hide resolved
evalml/utils/gen_utils.py Outdated Show resolved Hide resolved
classes = []
for cls in all_classes:
try:
cls(*args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not blocking but perhaps we should just update pipelines to have parameters default to an empty dict. Not sure there's a good reason why we require that for pipelines. What if you want a pipeline with default params? The way to get that would be to pass in an empty dict.

Could file separately

subclasses = []

while classes_to_check:
subclass = classes_to_check.pop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we do this:

# __module__ == 'evalml.pipelines.components'
# __name__ == 'MyComponent'
if not 'evalml.pipelines' in subclass.__module__:
    continue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

logger.debug('Estimator {} failed import, withholding from all_estimators'.format(estimator_name))
return estimators


def get_estimators(problem_type, model_families=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also move this to evalml/pipelines/components/utils.py, to be alongside the all_estimators list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed offline, this leads to a circular dependency becausemodel_family depends on pipeline_base.py, which depends on pipelines/components/utils which depends on model_family

_estimator_message = 'Estimator {} failed import, withholding from all_estimators'
_all_estimators = get_importable_subclasses(Estimator, args=[],
message=_estimator_message,
used_in_automl=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In discussion with Freddy. We are including this so that our tests can look up the components which we don't currently support in automl, e.g. extratrees, elasticnet, and baseline estimators. We hope to delete this in the future once those are supported for automl.

used_in_automl=False)
all_estimators_used_in_search = get_importable_subclasses(Estimator, args=[],
message=_estimator_message,
used_in_automl=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In discussion with Freddy. Let's keep all_estimators_used_in_search private for now. Once we add all components to automl, we won't need to draw a distinction between those in our codebase vs those in our codebase supported for automl. And at that point we'll want to rename this. So, keeping private.

@dsherry
Copy link
Contributor

dsherry commented Jul 2, 2020

Use-cases to support:

  1. User wants to run automl with stock evalml components/estimators, make_pipeline
  2. User wants to run automl with custom estimators (we don't currently support this, can support later)
  3. User wants to run automl with custom pipelines (allowed_pipelines)

Plan for this PR:

  • Define get_importable_subclasses, only looks at classes in the evalml module, excludes custom classes
  • Move evalml.pipelines.get_estimators to evalml.pipelines.components.get_estimators
  • Keep the all_estimators* lists private for now
  • Resolve other comments

@freddyaboulton freddyaboulton force-pushed the 363-common-method-for-registering-components branch from bb62700 to e475845 Compare July 2, 2020 20:53
Copy link
Contributor Author

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry This should be good for a second look now! The one thing I wanted to call out is that I saw another circular dependency issue with model_family and pipeline utils. I resolved this by not importing the pipeline utils in pipelines/__init__.py and I left a comment explaining my reasoning. This change had some minor impacts on the api_reference.rst file but apart from that, I kept pretty close to what we discussed earlier.

@@ -16,7 +16,7 @@
"import evalml\n",
"from evalml import AutoMLSearch\n",
"from evalml.demos import load_diabetes\n",
"from evalml.pipelines import PipelineBase, get_pipelines\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed get_pipelines because it is not used.

@@ -59,14 +59,7 @@
BaselineRegressionPipeline,
MeanBaselineRegressionPipeline
)
from .utils import (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either the model family utils or the pipeline utils are exposed at the top level of their respective modules but not both.

If both are exposed at the top level, we will get circular dependency errors. The reason is as follows:

  1. model_family is imported first. This needs to import pipelines/components/utils because of the line from evalml.pipelines.components.utils import _all_estimators_used_in_search
  2. Before you can import components/utils, you need to run pipelines/__init__.py
    If you expose, pipeline utils in pipelines/__init__.py, then you will need to import model_family which takes you full circle.

I decided to not expose the pipeline utils because they are only used by automl search internally and so users are not likely to do from evalml.pipelines.get_pipeline (at least for now) whereas model_family.list_model_families feels more intuitive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, yeah sounds good to me. We can always change this later if there's a good reason to do so.

@freddyaboulton
Copy link
Contributor Author

Also, there seems to be another issue with code cov like the one I brought up at stand up earliear this week. I reduced the hits and lines by the same amount and now I fail project coverage. I can't find any dark red lines in my diff though.

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton looks good!

I think we should wait to merge this until @angela97lin mergeas #904 . Unfortunately it looks like there will be conflicts to resolve but hopefully not too bad

all_pipelines
get_pipelines
get_estimators
make_pipeline
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton why make this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since pipelines.utils are no longer listed in pipelines/__init__.py, we need to change the current module so that sphinx can build the docs.

@@ -59,14 +59,7 @@
BaselineRegressionPipeline,
MeanBaselineRegressionPipeline
)
from .utils import (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, yeah sounds good to me. We can always change this later if there's a good reason to do so.

_all_estimators = get_importable_subclasses(Estimator, args=[], used_in_automl=False)
_all_estimators_used_in_search = get_importable_subclasses(Estimator, args=[], used_in_automl=True)
_all_transformers = get_importable_subclasses(Transformer, args=[], used_in_automl=False)
all_components = _all_estimators + _all_transformers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is great.

pipeline_name = pipeline_class.name
logger.debug('Pipeline {} failed import, withholding from all_pipelines'.format(pipeline_name))
return pipelines
all_pipelines = get_importable_subclasses(PipelineBase, args=[{}], used_in_automl=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once #904 is merged, we can delete this, and we can then also delete the args input.

components = all_components()
for component_name, component_class in components.items():
components = all_components
for component_class in components:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make into one line

class ChildClass(ComponentBase):
pass

assert ChildClass not in get_importable_subclasses(ComponentBase, args=[])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@freddyaboulton freddyaboulton merged commit d2f6c79 into master Jul 8, 2020
@freddyaboulton freddyaboulton deleted the 363-common-method-for-registering-components branch July 8, 2020 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Common method for registering pipeline/component/other classes
2 participants