
Refined make_pipeline_from_components implementation #1204

Merged
merged 7 commits from 1162_util_refinement into main on Sep 29, 2020

Conversation

christopherbunn
Contributor

@christopherbunn christopherbunn commented Sep 21, 2020

Continuing from the initial PR (#1176), this adds additional work for #1162.

  • Added fixes so that make_pipeline_from_components shows up in the API documentation
  • Updated function implementation so that component_graph isn't directly set
  • Added additional test to show that two pipelines with the same components will have the same predictions
  • Added example in API reference
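
For context, a minimal usage sketch of the utility these changes refine. The component choices and the custom name below are illustrative; the call matches the signature make_pipeline_from_components(component_instances, problem_type, custom_name=None) shown later in this thread.

from evalml.pipelines.components import Imputer, StandardScaler, RandomForestClassifier
from evalml.pipelines.utils import make_pipeline_from_components
from evalml.problem_types import ProblemTypes

# Build a pipeline instance directly from already-constructed component instances.
components = [Imputer(), StandardScaler(), RandomForestClassifier()]
pipeline = make_pipeline_from_components(components, ProblemTypes.BINARY, custom_name="My Pipeline")
print(pipeline.name)        # the custom name passed above
print(pipeline.parameters)  # parameters collected from each component instance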

@christopherbunn christopherbunn changed the title Added additional refinements to make_pipeline_from_components implementation Refined make_pipeline_from_components implementation Sep 21, 2020
@codecov

codecov bot commented Sep 22, 2020

Codecov Report

Merging #1204 into main will increase coverage by 0.19%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1204      +/-   ##
==========================================
+ Coverage   99.73%   99.92%   +0.19%     
==========================================
  Files         201      201              
  Lines       12464    12489      +25     
==========================================
+ Hits        12431    12480      +49     
+ Misses         33        9      -24     
Impacted Files Coverage Δ
evalml/pipelines/utils.py 100.00% <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 100.00% <100.00%> (+0.74%) ⬆️
...lml/tests/model_understanding_tests/test_graphs.py 100.00% <0.00%> (+0.14%) ⬆️
evalml/automl/automl_search.py 99.58% <0.00%> (+0.41%) ⬆️
.../automl_tests/test_automl_search_classification.py 100.00% <0.00%> (+0.43%) ⬆️
evalml/tests/component_tests/test_components.py 100.00% <0.00%> (+0.93%) ⬆️
...ests/automl_tests/test_automl_search_regression.py 100.00% <0.00%> (+1.06%) ⬆️
evalml/utils/gen_utils.py 99.09% <0.00%> (+1.80%) ⬆️
evalml/tests/component_tests/test_utils.py 100.00% <0.00%> (+3.57%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4e1b86a...d59c4a6.

Contributor

@freddyaboulton freddyaboulton left a comment

@christopherbunn Thanks for making these changes! I have a question about what the intended behavior should be when all the components are fitted. I think it'd be good to decide on something and add a unit test before merge.

pipeline_instance = TemplatedPipeline({})
pipeline_instance.component_graph = component_instances
return pipeline_instance
return TemplatedPipeline({c.name: c.parameters for c in component_instances})
Contributor

@christopherbunn I think if all components are fitted, then the pipeline instance should be considered fitted as well. I don't think we decided what the intended behavior should be in #1176, but whatever we decide, we should add a unit test to check that the pipeline instance (and its components) end up with the correct _is_fitted value.

from evalml.demos import load_breast_cancer
from evalml.pipelines import BinaryClassificationPipeline
from evalml.pipelines.utils import make_pipeline_from_components

X, y = load_breast_cancer()

class Pipeline(BinaryClassificationPipeline):
    component_graph = ["Simple Imputer", "Random Forest Classifier"]
    
pl = Pipeline({})
pl.fit(X, y)

assert pl._is_fitted
assert all(c._is_fitted for c in pl.component_graph)

new_pl = make_pipeline_from_components(pl.component_graph, pl.problem_type)

assert not new_pl._is_fitted
assert not new_pl.estimator._is_fitted

Contributor Author

I think it's fair for a pipeline to be marked as fitted if all of the components are fitted. I'll update the function logic and add in a unit test.
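
For illustration, one way the function logic could mark the pipeline as fitted is sketched below. This is an assumption about the implementation (not the merged code), and the follow-up comments below explain why the idea was ultimately dropped.

# Rough sketch inside make_pipeline_from_components (illustrative only):
pipeline_instance = TemplatedPipeline({c.name: c.parameters for c in component_instances})
if all(component._is_fitted for component in component_instances):
    # If every supplied component is already fitted, treat the pipeline as fitted too.
    pipeline_instance._is_fitted = True
return pipeline_instance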

Contributor Author

@freddyaboulton following up on this: while marking the pipeline as fit isn't a problem, it doesn't operate the same as the previously fitted pipeline.

X, y = X_y_binary
automl = AutoMLSearch(problem_type='binary', objective="Log Loss Binary", max_pipelines=5)
automl.search(X, y)
best_pipeline = automl.best_pipeline
best_predictions = best_pipeline.predict(X)  # predictions from the originally fitted pipeline
new_fitted_pipeline = make_pipeline_from_components(best_pipeline.component_graph, ProblemTypes.BINARY)
assert new_fitted_pipeline._is_fitted
new_fitted_predictions = new_fitted_pipeline.predict(X)
assert np.array_equal(best_predictions, new_fitted_predictions)

The example above fails on new_fitted_pipeline.predict(X) because input_feature_names is not set on new_fitted_pipeline. This pipeline attribute is set during the pipeline fitting process and is used during prediction. Because of this, just taking the components from the previous pipeline doesn't create an equivalent pipeline.

As a result, I don't think we can set the new pipeline as fitted unless we want to have input_feature_names passed into this function as well.
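
To make the trade-off concrete, a hypothetical variant that also copies the fitted state might look like the sketch below. The input_feature_names argument and the helper name are invented for illustration; they are not part of the actual function.

# Hypothetical sketch only: shows the extra state a "fitted" pipeline would need.
def make_fitted_pipeline_from_components(component_instances, problem_type,
                                         input_feature_names, custom_name=None):
    pipeline = make_pipeline_from_components(component_instances, problem_type, custom_name)
    pipeline.input_feature_names = input_feature_names  # normally populated during fit()
    pipeline._is_fitted = True
    return pipeline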

Contributor

@freddyaboulton good idea on the fitting. Unfortunately, I don't think we can have this method return a fitted pipeline. Our pipeline classes do more than just wrap the components at this point: we do things like encoding classification labels to [0, 1, 2, ...] to prevent modeling errors.

A test case that would break:

X, y = X_y_binary
# convert each 0 or 1 int value to a str
y = np.array(list(map(str, y)), dtype='str')
...

In other words, there is state on the pipeline which needs the training data. There are ways around this: we could try to represent the target encoding as a component in the graph just like any other, but then we'd need to define a decode component on the output, and that would get hairy, so we opted for making our pipeline "smart" enough to encode the target to ints for classification targets.
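
As a generic illustration of that encode/decode round trip (using scikit-learn's LabelEncoder rather than evalml internals):

import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(["0", "1", "1", "0"])                 # string targets, as in the test case above
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)               # array([0, 1, 1, 0]): ints for the estimator
y_decoded = encoder.inverse_transform(y_encoded)   # back to the original string labels at predict time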

So, let's update the docs for this method to say that it returns an untrained pipeline instance, even if the components are already trained.

Presumably, if someone can train the components, they should be able to get access to the training data anyway, so this shouldn't be a big deal.

@dsherry dsherry self-requested a review September 26, 2020 00:06
Contributor

@dsherry dsherry left a comment

Thanks for chasing this down!

I liked your discussion of using fitted components. I left my recommendation on what we should do, which is to punt on having this method create a fitted pipeline from fitted components. I also left unit test comments. Otherwise, good stuff!

@@ -119,6 +119,11 @@ def make_pipeline_from_components(component_instances, problem_type, custom_name
Returns:
Pipeline instance with component instances and specified estimator

Example:
>>> components = [Imputer(), StandardScaler(), Estimator()]
Contributor

I think you should say RandomForestClassifier() here; otherwise this example isn't runnable, because Estimator() is abstract and will throw. Actually, I thought our docstrings got run as part of the docs build; strange that this didn't fail!

In [2]: evalml.pipelines.components.Estimator()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-81c4f09910a6> in <module>
----> 1 evalml.pipelines.components.Estimator()

TypeError: Can't instantiate abstract class Estimator with abstract methods model_family, name, supported_problem_types

Contributor Author

Ah yeah, this did fail here. I'm having an interesting problem with the final estimator in this docstring example. Estimator doesn't work for the reasons you stated, but linting keeps flagging RandomForestClassifier as an "unused" import since it isn't used in the code itself. I tried CatBoost, but that fails during the core dependencies check.

I think the best solution for now is to import RandomForestClassifier and add a # noqa: F401 after that line.
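
A sketch of that approach in evalml/pipelines/utils.py; the exact docstring wording in the merged change may differ.

# RandomForestClassifier is only referenced inside the docstring example, so the
# noqa comment keeps flake8 from flagging it as an unused import.
from evalml.pipelines.components import RandomForestClassifier  # noqa: F401

# ...and in the docstring of make_pipeline_from_components:
#     Example:
#         >>> components = [Imputer(), StandardScaler(), RandomForestClassifier()]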

@@ -280,6 +281,19 @@ class DummyEstimator(Estimator):
expected_parameters = {'Dummy!': {'bar': 'baz'}}
assert pipeline.parameters == expected_parameters

X, y = X_y_binary
automl = AutoMLSearch(problem_type='binary', objective="Log Loss Binary", max_pipelines=5)
automl.search(X, y)
Contributor

@christopherbunn why use automl in this unit test? You can just create a pipeline class to use:

from evalml.pipelines import BinaryClassificationPipeline
from evalml.pipelines.components import Imputer, OneHotEncoder, LogisticRegressionClassifier

class TestPipeline(BinaryClassificationPipeline):
    component_graph = [Imputer, OneHotEncoder, LogisticRegressionClassifier]

pipeline = TestPipeline(parameters={})
...

Or, use one already defined in our test fixtures, like this one.

Contributor Author

Hmm, I think the original intent was to compare fitted pipelines, but I guess the new pipeline gets fitted separately anyway. I'll update this.

new_pipeline.fit(X, y)
new_predictions = new_pipeline.predict(X)
assert np.array_equal(best_predictions, new_predictions)
assert np.array_equal(best_pipeline.feature_importance, new_pipeline.feature_importance)
Contributor

Cool that you're comparing the predictions! Can you also compare both the static and instance pipeline attributes: name, parameters, summary, component_graph (you can just compare each element with isinstance or something), and describe?
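
For instance, continuing from the test snippet above, assertions along these lines (the attribute names follow the evalml pipeline API; the exact checks in the merged test may differ):

assert new_pipeline.parameters == best_pipeline.parameters
assert new_pipeline.summary == best_pipeline.summary
for original_component, new_component in zip(best_pipeline.component_graph, new_pipeline.component_graph):
    assert isinstance(new_component, type(original_component))
new_pipeline.describe()  # should print the pipeline description without raising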

@@ -244,7 +245,7 @@ def test_make_pipeline_problem_type_mismatch():
make_pipeline(pd.DataFrame(), pd.Series(), Transformer, ProblemTypes.MULTICLASS)


def test_make_pipeline_from_components():
def test_make_pipeline_from_components(X_y_binary):
Contributor

@christopherbunn this test is great! What are some other cases you could cover? Here are a few thoughts on edge cases:
def make_pipeline_from_components(component_instances, problem_type, custom_name=None):

  • What if component_instances contains class references instead of instances? Or generally things which aren't evalml component instances?
  • What if problem_type is invalid?
  • What if custom_name is not a string?

In all of those cases, I'd hope we'd get some sort of exception when we try to define the pipeline template class. As it stands, because our PipelineBaseMeta metaclass is pretty simple and doesn't check these fields, I bet no error comes up for some of them, so perhaps make_pipeline_from_components should error out instead.
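
A rough sketch of the kind of validation being suggested; the checks, error messages, and the helper name are illustrative, not the merged implementation.

from evalml.pipelines.components import ComponentBase
from evalml.problem_types import handle_problem_types

def _validate_make_pipeline_inputs(component_instances, problem_type, custom_name=None):
    """Illustrative validation only (not the actual evalml code)."""
    if custom_name is not None and not isinstance(custom_name, str):
        raise TypeError("Custom pipeline name must be a string")
    for component in component_instances:
        if not isinstance(component, ComponentBase):
            raise TypeError("Every element of component_instances must be a component instance")
    # handle_problem_types raises if problem_type is not a valid problem type
    return handle_problem_types(problem_type)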

@christopherbunn christopherbunn force-pushed the 1162_util_refinement branch 5 times, most recently from 6a1195d to 7a0a035 on September 29, 2020 16:59
@christopherbunn christopherbunn merged commit 7e3ca42 into main Sep 29, 2020
@christopherbunn christopherbunn deleted the 1162_util_refinement branch September 29, 2020 17:30
@angela97lin angela97lin mentioned this pull request Sep 29, 2020