
Refined make_pipeline_from_components implementation #1204

Merged
merged 7 commits from 1162_util_refinement into main on Sep 29, 2020

Conversation

christopherbunn
Contributor

@christopherbunn christopherbunn commented Sep 21, 2020

Continuing from the initial PR (#1176), this adds additional work for #1162.

  • Added fixes so that make_pipeline_from_components shows up in the API documentation
  • Updated function implementation so that component_graph isn't directly set
  • Added additional test to show that two pipelines with the same components will have the same predictions
  • Added example in API reference
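
For context, a minimal usage sketch of the utility these changes refine. The component choices and the custom name below are illustrative; the call matches the signature make_pipeline_from_components(component_instances, problem_type, custom_name=None) shown later in this thread.

from evalml.pipelines.components import Imputer, StandardScaler, RandomForestClassifier
from evalml.pipelines.utils import make_pipeline_from_components
from evalml.problem_types import ProblemTypes

# Build a pipeline instance directly from already-constructed component instances.
components = [Imputer(), StandardScaler(), RandomForestClassifier()]
pipeline = make_pipeline_from_components(components, ProblemTypes.BINARY, custom_name="My Pipeline")
print(pipeline.name)        # the custom name passed above
print(pipeline.parameters)  # parameters collected from each component instance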

@christopherbunn christopherbunn changed the title Added additional refinements to make_pipeline_from_components implementation Refined make_pipeline_from_components implementation Sep 21, 2020
@codecov

codecov bot commented Sep 22, 2020

Codecov Report

Merging #1204 into main will increase coverage by 0.19%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1204      +/-   ##
==========================================
+ Coverage   99.73%   99.92%   +0.19%     
==========================================
  Files         201      201              
  Lines       12464    12489      +25     
==========================================
+ Hits        12431    12480      +49     
+ Misses         33        9      -24     
Impacted Files Coverage Δ
evalml/pipelines/utils.py 100.00% <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 100.00% <100.00%> (+0.74%) ⬆️
...lml/tests/model_understanding_tests/test_graphs.py 100.00% <0.00%> (+0.14%) ⬆️
evalml/automl/automl_search.py 99.58% <0.00%> (+0.41%) ⬆️
.../automl_tests/test_automl_search_classification.py 100.00% <0.00%> (+0.43%) ⬆️
evalml/tests/component_tests/test_components.py 100.00% <0.00%> (+0.93%) ⬆️
...ests/automl_tests/test_automl_search_regression.py 100.00% <0.00%> (+1.06%) ⬆️
evalml/utils/gen_utils.py 99.09% <0.00%> (+1.80%) ⬆️
evalml/tests/component_tests/test_utils.py 100.00% <0.00%> (+3.57%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4e1b86a...d59c4a6.

Contributor

@freddyaboulton freddyaboulton left a comment

@christopherbunn Thanks for making these changes! I have a question about what the intended behavior should be when all the components are fitted. I think it'd be good to decide on something and add a unit test before merge.

pipeline_instance = TemplatedPipeline({})
pipeline_instance.component_graph = component_instances
return pipeline_instance
return TemplatedPipeline({c.name: c.parameters for c in component_instances})
Contributor

@christopherbunn I think if all components are fitted, then the pipeline instance should be considered fitted as well. I don't think we decided what the intended behavior should be in #1176, but whatever we decide, we should add a unit test to check that the pipeline instance (and its components) end up with the correct _is_fitted value.

from evalml.demos import load_breast_cancer
from evalml.pipelines import BinaryClassificationPipeline
from evalml.pipelines.utils import make_pipeline_from_components

X, y = load_breast_cancer()

class Pipeline(BinaryClassificationPipeline):
    component_graph = ["Simple Imputer", "Random Forest Classifier"]
    
pl = Pipeline({})
pl.fit(X, y)

assert pl._is_fitted
assert all(c._is_fitted for c in pl.component_graph)

new_pl = make_pipeline_from_components(pl.component_graph, pl.problem_type)

assert not new_pl._is_fitted
assert not new_pl.estimator._is_fitted

Contributor Author

I think it's fair for a pipeline to be marked as fitted if all of the components are fitted. I'll update the function logic and add in a unit test.
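
For illustration, one way the function logic could mark the pipeline as fitted is sketched below. This is an assumption about the implementation (not the merged code), and the follow-up comments below explain why the idea was ultimately dropped.

# Rough sketch inside make_pipeline_from_components (illustrative only):
pipeline_instance = TemplatedPipeline({c.name: c.parameters for c in component_instances})
if all(component._is_fitted for component in component_instances):
    # If every supplied component is already fitted, treat the pipeline as fitted too.
    pipeline_instance._is_fitted = True
return pipeline_instance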

Contributor Author

@freddyaboulton following up on this: while marking the pipeline as fit isn't a problem, it doesn't operate the same as the previously fitted pipeline.

X, y = X_y_binary
automl = AutoMLSearch(problem_type='binary', objective="Log Loss Binary", max_pipelines=5)
automl.search(X, y)
best_pipeline = automl.best_pipeline
best_predictions = best_pipeline.predict(X)  # predictions from the originally fitted pipeline
new_fitted_pipeline = make_pipeline_from_components(best_pipeline.component_graph, ProblemTypes.BINARY)
assert new_fitted_pipeline._is_fitted
new_fitted_predictions = new_fitted_pipeline.predict(X)
assert np.array_equal(best_predictions, new_fitted_predictions)

The example above fails on new_fitted_pipeline.predict(X) because input_feature_names is not set on new_fitted_pipeline. This pipeline attribute is set during the pipeline fitting process and is used during prediction. Because of this, just taking the components from the previous pipeline doesn't create an equivalent pipeline.

As a result, I don't think we can set the new pipeline as fitted unless we want to have input_feature_names passed into this function as well.
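
To make the trade-off concrete, a hypothetical variant that also copies the fitted state might look like the sketch below. The input_feature_names argument and the helper name are invented for illustration; they are not part of the actual function.

# Hypothetical sketch only: shows the extra state a "fitted" pipeline would need.
def make_fitted_pipeline_from_components(component_instances, problem_type,
                                         input_feature_names, custom_name=None):
    pipeline = make_pipeline_from_components(component_instances, problem_type, custom_name)
    pipeline.input_feature_names = input_feature_names  # normally populated during fit()
    pipeline._is_fitted = True
    return pipeline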

Contributor

@freddyaboulton good idea on the fitting. Unfortunately, I don't think we can have this method return a fitted pipeline. Our pipeline classes do more than just wrap the components at this point: we do things like encoding classification labels to [0, 1, 2, ...] to prevent modeling errors.

A test case that would break:

X, y = X_y_binary
# convert each 0 or 1 int value to a str
y = np.array(list(map(str, y)), dtype='str')
...

In other words, there is state on the pipeline which needs the training data. There are ways around this: we could try to represent the target encoding as a component in the graph just like any other, but then we'd need to define a decode component on the output, and that would get hairy, so we opted for making our pipeline "smart" enough to encode the target to ints for classification targets.
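
As a generic illustration of that encode/decode round trip (using scikit-learn's LabelEncoder rather than evalml internals):

import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(["0", "1", "1", "0"])                 # string targets, as in the test case above
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)               # array([0, 1, 1, 0]): ints for the estimator
y_decoded = encoder.inverse_transform(y_encoded)   # back to the original string labels at predict time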

So, let's update the docs for this method to say that it returns an untrained pipeline instance, even if the components are already trained.

Presumably, if someone can train the components, they should be able to get access to the training data anyway, so this shouldn't be a big deal.

@dsherry dsherry self-requested a review September 26, 2020 00:06
Contributor

@dsherry dsherry left a comment

Thanks for chasing this down!

I liked your discussion of using fitted components. I left my recommendation on what we should do, which is to punt on having this method create a fitted pipeline from fitted components. I also left unit test comments. Otherwise, good stuff!

@@ -119,6 +119,11 @@ def make_pipeline_from_components(component_instances, problem_type, custom_name
Returns:
Pipeline instance with component instances and specified estimator

Example:
>>> components = [Imputer(), StandardScaler(), Estimator()]
Contributor

I think you should say RandomForestClassifier() here; otherwise this example isn't runnable, because Estimator() is abstract and will throw. Actually, I thought our docstrings got run as part of the docs build; strange that this didn't fail!

In [2]: evalml.pipelines.components.Estimator()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-81c4f09910a6> in <module>
----> 1 evalml.pipelines.components.Estimator()

TypeError: Can't instantiate abstract class Estimator with abstract methods model_family, name, supported_problem_types

Contributor Author

Ah yeah, this did fail here. I'm having an interesting problem with the final estimator in this docstring example. Estimator doesn't work for the reasons you stated, but linting keeps flagging RandomForestClassifier as an "unused" import since it isn't used in the code itself. I tried CatBoost, but that fails during the core dependencies check.

I think the best solution for now is to import RandomForestClassifier and add a # noqa: F401 after that line.
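
A sketch of that approach in evalml/pipelines/utils.py; the exact docstring wording in the merged change may differ.

# RandomForestClassifier is only referenced inside the docstring example, so the
# noqa comment keeps flake8 from flagging it as an unused import.
from evalml.pipelines.components import RandomForestClassifier  # noqa: F401

# ...and in the docstring of make_pipeline_from_components:
#     Example:
#         >>> components = [Imputer(), StandardScaler(), RandomForestClassifier()]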

@@ -280,6 +281,19 @@ class DummyEstimator(Estimator):
expected_parameters = {'Dummy!': {'bar': 'baz'}}
assert pipeline.parameters == expected_parameters

X, y = X_y_binary
automl = AutoMLSearch(problem_type='binary', objective="Log Loss Binary", max_pipelines=5)
automl.search(X, y)
Contributor

@christopherbunn why use automl in this unit test? You can just create a pipeline class to use:

from evalml.pipelines import BinaryClassificationPipeline
from evalml.pipelines.components import Imputer, OneHotEncoder, LogisticRegressionClassifier

class TestPipeline(BinaryClassificationPipeline):
    component_graph = [Imputer, OneHotEncoder, LogisticRegressionClassifier]

pipeline = TestPipeline(parameters={})
...

Or, use one already defined in our test fixtures, like this one.

Contributor Author

Hmm, I think the original intent was to compare fitted pipelines, but I guess the new pipeline gets fitted separately anyway. I'll update this.

new_pipeline.fit(X, y)
new_predictions = new_pipeline.predict(X)
assert np.array_equal(best_predictions, new_predictions)
assert np.array_equal(best_pipeline.feature_importance, new_pipeline.feature_importance)
Contributor

Cool that you're comparing the predictions! Can you also compare both the static and instance pipeline attributes: name, parameters, summary, component_graph (you can just compare each element with isinstance or something), and describe?
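
For instance, continuing from the test snippet above, assertions along these lines (the attribute names follow the evalml pipeline API; the exact checks in the merged test may differ):

assert new_pipeline.parameters == best_pipeline.parameters
assert new_pipeline.summary == best_pipeline.summary
for original_component, new_component in zip(best_pipeline.component_graph, new_pipeline.component_graph):
    assert isinstance(new_component, type(original_component))
new_pipeline.describe()  # should print the pipeline description without raising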

@@ -244,7 +245,7 @@ def test_make_pipeline_problem_type_mismatch():
make_pipeline(pd.DataFrame(), pd.Series(), Transformer, ProblemTypes.MULTICLASS)


def test_make_pipeline_from_components():
def test_make_pipeline_from_components(X_y_binary):
Contributor

@christopherbunn this test is great! What are some other cases you could cover? Here are a few thoughts on edge cases:
def make_pipeline_from_components(component_instances, problem_type, custom_name=None):

  • What if component_instances contains class references instead of instances? Or generally things which aren't evalml component instances?
  • What if problem_type is invalid?
  • What if custom_name is not a string?

In all of those cases, I'd hope we'd get some sort of exception when we try to define the pipeline template class. As it stands, because our PipelineBaseMeta metaclass is pretty simple and doesn't check these fields, I bet no error comes up for some of them, so perhaps make_pipeline_from_components should error out instead.
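
A rough sketch of the kind of validation being suggested; the checks, error messages, and the helper name are illustrative, not the merged implementation.

from evalml.pipelines.components import ComponentBase
from evalml.problem_types import handle_problem_types

def _validate_make_pipeline_inputs(component_instances, problem_type, custom_name=None):
    """Illustrative validation only (not the actual evalml code)."""
    if custom_name is not None and not isinstance(custom_name, str):
        raise TypeError("Custom pipeline name must be a string")
    for component in component_instances:
        if not isinstance(component, ComponentBase):
            raise TypeError("Every element of component_instances must be a component instance")
    # handle_problem_types raises if problem_type is not a valid problem type
    return handle_problem_types(problem_type)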

@christopherbunn christopherbunn force-pushed the 1162_util_refinement branch 5 times, most recently from 6a1195d to 7a0a035 on September 29, 2020 16:59
@christopherbunn christopherbunn merged commit 7e3ca42 into main Sep 29, 2020
@christopherbunn christopherbunn deleted the 1162_util_refinement branch September 29, 2020 17:30
@angela97lin angela97lin mentioned this pull request Sep 29, 2020