Refined make_pipeline_from_components implementation #1204
Conversation
Force-pushed from c1b07be to ca4208c
Codecov Report
@@            Coverage Diff             @@
##             main    #1204      +/-   ##
==========================================
+ Coverage   99.73%   99.92%   +0.19%
==========================================
  Files         201      201
  Lines       12464    12489      +25
==========================================
+ Hits        12431    12480      +49
+ Misses         33        9      -24
Continue to review full report at Codecov.
@christopherbunn Thanks for making these changes! I have a question about what the intended behavior should be when all the components are fitted. I think it'd be good to decide on something and add a unit test before merge.
-    pipeline_instance = TemplatedPipeline({})
-    pipeline_instance.component_graph = component_instances
-    return pipeline_instance
+    return TemplatedPipeline({c.name: c.parameters for c in component_instances})
@christopherbunn I think if all components are fitted then the pipeline instance should be considered fit as well - I don't think we decided what the intended behavior should be in #1176, but whatever we decide, we should add a unit test to check that the pipeline instance (and components) have the correct _is_fitted value:
from evalml.demos import load_breast_cancer
from evalml.pipelines import BinaryClassificationPipeline
from evalml.pipelines.utils import make_pipeline_from_components
X, y = load_breast_cancer()
class Pipeline(BinaryClassificationPipeline):
component_graph = ["Simple Imputer", "Random Forest Classifier"]
pl = Pipeline({})
pl.fit(X, y)
assert pl._is_fitted
assert all(c._is_fitted for c in pl.component_graph)
new_pl = make_pipeline_from_components(pl.component_graph, pl.problem_type)
assert not new_pl._is_fitted
assert not new_pl.estimator._is_fitted
I think it's fair for a pipeline to be marked as fitted if all of the components are fitted. I'll update the function logic and add in a unit test.
@freddyaboulton following up on this: while marking the pipeline as fit isn't a problem, it doesn't operate the same as the previously fitted pipeline.
X, y = X_y_binary
automl = AutoMLSearch(problem_type='binary', objective="Log Loss Binary", max_pipelines=5)
automl.search(X, y)
best_pipeline = automl.best_pipeline
best_predictions = best_pipeline.predict(X)  # predictions from the originally fitted pipeline
new_fitted_pipeline = make_pipeline_from_components(best_pipeline.component_graph, ProblemTypes.BINARY)
assert new_fitted_pipeline._is_fitted
new_fitted_predictions = new_fitted_pipeline.predict(X)
assert np.array_equal(best_predictions, new_fitted_predictions)
The example above fails on new_fitted_pipeline.predict(X) because input_feature_names is not set on new_fitted_pipeline. This pipeline attribute is set during the pipeline fitting process and is used during prediction. Because of this, just taking the components from the previous pipeline doesn't create an equivalent pipeline. As a result, I don't think we can mark the new pipeline as fitted unless we also have input_feature_names passed into this function.
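As an illustration only, continuing from the snippet above: if input_feature_names were the only missing piece, copying it across manually would look roughly like this (this assumes input_feature_names is a plain attribute that can be reassigned, which is untested, and the label-encoding issue raised below would still apply):

# Rough workaround sketch, not a supported API: copy the attribute that
# fit() would normally populate from the already-fitted pipeline.
new_fitted_pipeline.input_feature_names = best_pipeline.input_feature_names
new_fitted_predictions = new_fitted_pipeline.predict(X)
assert np.array_equal(best_predictions, new_fitted_predictions)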
@freddyaboulton good idea on the fitting. Unfortunately, I don't think we can have this method return a fitted pipeline. Our pipeline classes do more than just wrap the components at this point: we do things like encoding classification labels to [0, 1, 2, ...] to prevent modeling errors.
Here's a test case which would break:
X, y = X_y_binary
# convert each 0 or 1 int value to a str
y = np.array(list(map(str, y)), dtype='str')
...
In other words, there are structures on the pipeline which need the training data. There are ways around this: we could try to represent the target encoding as a component in the graph just like any other. The thing is, we'd need to define a decode component on the output, and that would get hairy, so we opted for making our pipeline "smart" enough to encode the target to ints for classification targets.
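For illustration of the target-encoding idea, here is a stand-in using scikit-learn's LabelEncoder (evalml's internal mechanism may differ, so treat this as a sketch of the concept only):

import numpy as np
from sklearn.preprocessing import LabelEncoder

# String class labels, like the breaking test case above
y = np.array(list(map(str, [0, 1, 1, 0])), dtype='str')

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)              # array([0, 1, 1, 0]) -- ints the estimator can handle
y_decoded = encoder.inverse_transform(y_encoded)  # back to the original string labels

The fitted label mapping lives on the pipeline rather than on any individual component, which is why reassembling already-fitted components alone can't reproduce it.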
So, let's update the docs for this method to say that it returns an untrained pipeline instance, even if the components are already trained.
Presumably, if someone can train the components, they should be able to get access to the training data anyways, so this shouldn't be a big deal.
Thanks for chasing this down!
I liked your discussion of using fitted components. I left my recommendation on what we should do, which is to punt on having this method create a fitted pipeline from fitted components. I also left unit test comments. Otherwise, good stuff!
evalml/pipelines/utils.py
Outdated
@@ -119,6 +119,11 @@ def make_pipeline_from_components(component_instances, problem_type, custom_name
     Returns:
         Pipeline instance with component instances and specified estimator

+    Example:
+        >>> components = [Imputer(), StandardScaler(), Estimator()]
I think you should say RandomForestClassifier() here, otherwise this example isn't runnable because Estimator() is abstract and will throw. Actually, I thought our docstrings got run as part of the docs build; strange that this didn't fail!
In [2]: evalml.pipelines.components.Estimator()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-81c4f09910a6> in <module>
----> 1 evalml.pipelines.components.Estimator()
TypeError: Can't instantiate abstract class Estimator with abstract methods model_family, name, supported_problem_types
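For reference, a runnable version of the docstring example along the lines suggested might look like this (the component choices are a sketch and import paths may vary by evalml version, so this is not the exact final docstring):

Example:
    >>> from evalml.pipelines.components import Imputer, StandardScaler, RandomForestClassifier
    >>> from evalml.problem_types import ProblemTypes
    >>> components = [Imputer(), StandardScaler(), RandomForestClassifier()]
    >>> pipeline = make_pipeline_from_components(components, ProblemTypes.BINARY)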
Ah yeah, this did fail here. I'm having an interesting problem with the end estimator in the docs. Estimator doesn't work for the reasons you stated, but linting keeps catching RandomForestClassifier as an "unused" import since it isn't used in the code itself. I tried Catboost, but that would fail during the core dependencies check. I think the best solution for now is to import RandomForestClassifier and put a # noqa: F401 after this line.
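That workaround might look something like this at the top of the module containing the docstring (a sketch of the linting fix described above):

# Imported only so the doctest example can resolve; "noqa: F401" silences the unused-import warning.
from evalml.pipelines.components import RandomForestClassifier  # noqa: F401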
@@ -280,6 +281,19 @@ class DummyEstimator(Estimator):
     expected_parameters = {'Dummy!': {'bar': 'baz'}}
     assert pipeline.parameters == expected_parameters

+    X, y = X_y_binary
+    automl = AutoMLSearch(problem_type='binary', objective="Log Loss Binary", max_pipelines=5)
+    automl.search(X, y)
@christopherbunn why use automl in this unit test? You can just create a pipeline class to use:
class TestPipeline(BinaryClassificationPipeline):
    component_graph = [Imputer, OneHotEncoder, LogisticRegressionClassifier]

pipeline = TestPipeline(parameters={})
...
Or, use one already defined in our test fixtures, like this one.
Hmm, I think the intent was to originally compare fitted pipelines, but then I guess that gets fitted separately anyways. I'll update this.
+    new_pipeline.fit(X, y)
+    new_predictions = new_pipeline.predict(X)
+    assert np.array_equal(best_predictions, new_predictions)
+    assert np.array_equal(best_pipeline.feature_importance, new_pipeline.feature_importance)
Cool that you're comparing the predictions! Can you also compare both static and instance pipeline attributes: name, parameters, summary, component_graph (can just compare each element with isinstance or something), and describe?
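A rough sketch of those additional assertions (attribute names come from the comment above; whether strict equality is the right check for each, e.g. name, depends on how the new pipeline is constructed):

# Compare static and instance attributes between the original and rebuilt pipelines
assert new_pipeline.name == best_pipeline.name
assert new_pipeline.parameters == best_pipeline.parameters
assert new_pipeline.summary == best_pipeline.summary
# component_graph: compare element by element by type, as suggested
for new_component, old_component in zip(new_pipeline.component_graph, best_pipeline.component_graph):
    assert isinstance(new_component, type(old_component))
# describe() prints a summary rather than returning one, so comparing it would mean capturing stdout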
@@ -244,7 +245,7 @@ def test_make_pipeline_problem_type_mismatch():
     make_pipeline(pd.DataFrame(), pd.Series(), Transformer, ProblemTypes.MULTICLASS)


-def test_make_pipeline_from_components():
+def test_make_pipeline_from_components(X_y_binary):
@christopherbunn this test is great! What are some other cases you could cover? Here are a few thoughts on edge cases:

def make_pipeline_from_components(component_instances, problem_type, custom_name=None):

- What if component_instances contains class references instead of instances? Or generally things which aren't evalml component instances?
- What if problem_type is invalid?
- What if custom_name is not a string?

In all of those cases, I'd hope we'd get some sort of exception when we try to define the pipeline template class. As it stands, because our PipelineBaseMeta metaclass is pretty simple and doesn't check these fields, I bet no error comes up for some of them, so perhaps make_pipeline_from_components should error out instead (a sketch of possible tests follows below).
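A hedged sketch of what those edge-case tests could look like, assuming make_pipeline_from_components is updated to validate its inputs (the exception types here are placeholders, not the actual behavior):

import pytest

from evalml.pipelines.components import Imputer, RandomForestClassifier
from evalml.pipelines.utils import make_pipeline_from_components
from evalml.problem_types import ProblemTypes


def test_make_pipeline_from_components_invalid_inputs():
    # class references instead of instances
    with pytest.raises(ValueError):
        make_pipeline_from_components([Imputer, RandomForestClassifier], ProblemTypes.BINARY)
    # things which aren't evalml component instances at all
    with pytest.raises(ValueError):
        make_pipeline_from_components(["not a component"], ProblemTypes.BINARY)
    # invalid problem type
    with pytest.raises(ValueError):
        make_pipeline_from_components([Imputer(), RandomForestClassifier()], "not a problem type")
    # non-string custom name
    with pytest.raises(TypeError):
        make_pipeline_from_components([Imputer(), RandomForestClassifier()], ProblemTypes.BINARY, custom_name=42)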
Force-pushed from 6a1195d to 7a0a035
Force-pushed from 7a0a035 to d59c4a6
Continuing from the initial PR (#1176), additional work for #1162.
- make_pipeline_from_components shows up in the API documentation
- component_graph isn't directly set