Clone Pipelines #842
Conversation
Codecov Report
@@           Coverage Diff           @@
##           master     #842   +/-   ##
=======================================
  Coverage   99.69%   99.69%
=======================================
  Files         195      195
  Lines        7774     7846    +72
=======================================
+ Hits         7750     7822    +72
  Misses         24       24
Continue to review full report at Codecov.
@@ -26,6 +28,22 @@ def name(cls):
    def model_family(cls):
        """Returns ModelFamily of this component"""

    def clone(self, learned=True, random_state='match'):
Two thoughts:
- Let's change learned=True to deep=False
- I like that you thought about random_state here. Let's change this to be an int / RandomState with default 0, just like all other callsites in the repo. (A rough sketch of the resulting signature is below.)
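A hedged sketch of what the revised signature could look like, assuming deep=True takes over the old learned=True deep-copy behavior (that mapping is my reading of the suggestion, as is the use of self.__class__ proposed further down in this review):

    def clone(self, deep=False, random_state=0):
        # deep=True: keep the learned/fitted state by deep-copying the component
        if deep:
            return copy.deepcopy(self)
        # deep=False: return a fresh, unfitted copy with the same parameters
        return self.__class__(**self.parameters, random_state=random_state)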
""" | ||
if learned: | ||
return copy.deepcopy(self) | ||
else: |
Flattening: since you return on the previous line, no need for this else
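For illustration, a hedged sketch of the flattened version; the condition random_state == 'match' is my guess at the line elided from the diff below (the default value suggests it), not something confirmed here:

    if learned:
        return copy.deepcopy(self)
    # no else needed: execution only reaches this point when learned is False
    if random_state == 'match':
        return component_class(parameters=self.parameters, random_state=self.random_state)
    return component_class(parameters=self.parameters, random_state=random_state)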
            cloned_component = component_class(parameters=self.parameters, random_state=self.random_state)
        else:
            cloned_component = component_class(parameters=self.parameters, random_state=random_state)
        return cloned_component
Instead of adding parameters as an argument to every component, let's do
return self.__class__(**self.parameters, random_state=random_state)
This is how we instantiate components in PipelineBase.__init__.
@eccabay looking good! You picked up a hard one... like I mentioned the other day, this touches on some real challenges with how our code is structured right now. It'll be great to have this merged.
I left comments about the impl. I see you have a test for each component and classifier -- wow! Will dig into that on the next pass.
Please revert the changes you made to the components, other than ComponentBase of course. We shouldn't add parameters as an input to the components. It's not necessary when you can use Python's ** syntax to pass the args along. See this explanation of how that works if you didn't know already, and the minimal example below.
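For anyone following along, a minimal sketch of Python's ** unpacking; the component name and parameters here are hypothetical:

    params = {"alpha": 0.5, "l1_ratio": 0.5}
    # ** expands the dict into keyword arguments, so this call is equivalent to
    # SomeComponent(alpha=0.5, l1_ratio=0.5, random_state=0)
    component = SomeComponent(**params, random_state=0)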
            min_weight_fraction_leaf=min_weight_fraction_leaf,
        "max_depth": max_depth,
        "min_samples_split": min_samples_split,
        "min_weight_fraction_leaf": min_weight_fraction_leaf}
@jeremyliweishih pointed out you added some params here which were missing. That's great. We should keep changes like this, so that user-inputted values are always saved (except for random_state, which we treat separately). A rough sketch of the pattern is below.
I started looking at #522 recently, which tracks fixing these sorts of issues. Heads up, I may end up scooping these changes into my PR :)
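A hedged sketch of that pattern, with parameter names borrowed from the diff above and the rest assumed:

    def __init__(self, max_depth=6, min_samples_split=2, min_weight_fraction_leaf=0.0, random_state=0):
        # save every user-inputted value into the parameters dict...
        parameters = {"max_depth": max_depth,
                      "min_samples_split": min_samples_split,
                      "min_weight_fraction_leaf": min_weight_fraction_leaf}
        # ...while random_state is passed along separately, outside of parameters
        super().__init__(parameters=parameters, random_state=random_state)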
Yeah, this and much more is now incorporated into #847
                eta=parameters['eta'],
                max_depth=parameters['max_depth'],
                n_estimators=parameters['n_estimators'],
                min_child_weight=parameters['min_child_weight'])
@eccabay could you please back these changes out?
Why: if there's no reason to make them, we shouldn't make them :) perhaps this was left over from earlier?
@@ -5,7 +5,7 @@
from evalml.pipelines.components import CatBoostClassifier
from evalml.utils import SEED_BOUNDS

-importorskip('catboost', reason='Skipping test because catboost not installed')
+catboost = importorskip('catboost', reason='Skipping test because catboost not installed')
Could you please back out this change?
@@ -5,7 +5,16 @@
from evalml.pipelines.components import CatBoostRegressor
from evalml.utils import SEED_BOUNDS

-importorskip('catboost', reason='Skipping test because catboost not installed')
+catboost = importorskip('catboost', reason='Skipping test because catboost not installed')
Same for all changes in this file: please back them out since they're not necessary
@@ -40,6 +40,7 @@ def test_en_init(X_y):
    'Elastic Net Classifier': {
        "alpha": 0.5,
        "l1_ratio": 0.5,
        "max_iter": 1000
Is this change important for this PR? Or is it left over from an earlier state of the PR?
Just an earlier state of the PR, will remove.
@@ -9,7 +9,7 @@
from evalml.pipelines.components import XGBoostRegressor
from evalml.utils import get_random_seed, get_random_state, import_or_raise

-importorskip('xgboost', reason='Skipping test because xgboost not installed')
+xgboost = importorskip('xgboost', reason='Skipping test because xgboost not installed')
Same as others, please back this out
clf_clone = clf.clone()
with pytest.raises(ValueError, match='Component is not fit'):
    clf_clone.predict(X)
clf.parameters == clf_clone.parameters
I think you meant assert clf.parameters == clf_clone.parameters
@@ -314,3 +314,42 @@ def test_component_parameters_all_saved():

expected_parameters = {arg: default for (arg, default) in zip(args, defaults)}
assert parameters == expected_parameters
@eccabay could you please add another test here which does the following:
def test_clone_init():
    ...
    clf = MockFitComponent(**params)
    clf_clone = clf.clone()
    assert clf.parameters == clf_clone.parameters
    assert clf.random_state == clf_clone.random_state
You'll probably want to make MockFitComponent available via a test fixture in this file.
Oh, and I forgot to say: it's valuable to add this because we need coverage of cloning a component which hasn't been fitted. Your main test test_clone covers the fitted case.
I guess on reflection, you could just add assertions to your existing test. My key point was that we should have a test somewhere which checks the params and random state match before fitting. Your call on how to organize that.
predicted = clf.predict(X)
assert isinstance(predicted, type(np.array([])))

clf_clone = clf.clone()
After the call to clone, let's also check assert clf_clone.random_state == clf.random_state. This helps ensure that cloning a fitted pipeline still copies the random_state correctly.
In order for the results to be consistent between the original component and its clone, the random_state is not copied from the original object, but is instead re-seeded with the same integer. Because of this, assert clf_clone.random_state == clf.random_state will actually fail if included in the tests, since they're different objects.
Ah, right, of course.
If I follow right, this wouldn't affect test_clone_init. And then in this unit test we could save the first value of the RandomState instance and use it to check the right seed was used in the clone:
clf = MockFitComponent(**params)
random_state_first_value = clf.random_state.randint(2**30)
...
clf.fit(...)
...
clf_clone = clf.clone()
assert clf_clone.random_state.randint(2**30) == random_state_first_value
@eccabay we also need test coverage for setting the random_state field, both with a) a non-default int and b) a custom np.random.RandomState instance. We have some coverage in place for this at the automl level, I believe.
My suggestion: define a separate test test_clone_random_state to do this. For the RandomState case, after cloning you can assert clf_clone.random_state.randint(2**30) == clf.random_state.randint(2**30) to ensure the RandomState objects are equivalent. A rough sketch is below.
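A hedged sketch of what such a test could look like (MockFitComponent and params are assumed to come from the surrounding test file):

    def test_clone_random_state():
        # a) non-default int seed: the clone should be re-seeded with the same integer
        clf = MockFitComponent(**params, random_state=2)
        clf_clone = clf.clone(random_state=2)
        assert clf_clone.random_state.randint(2**30) == clf.random_state.randint(2**30)

        # b) custom np.random.RandomState instance
        clf = MockFitComponent(**params, random_state=np.random.RandomState(2))
        clf_clone = clf.clone(random_state=np.random.RandomState(2))
        assert clf_clone.random_state.randint(2**30) == clf.random_state.randint(2**30)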
def test_clone(X_y):
In light of adding the other tests, I suggest you call this test_clone_fitted since it focuses on cloning fitted components.
clf.fit(X, y)
predicted = clf.predict(X)
assert isinstance(predicted, type(np.array([])))
I think you want assert isinstance(predicted, np.ndarray). But actually my suggestion is to not do this here. We have other tests which should do this, and I'm about to add more in my PR for #236.
clf_clone.fit(X, y)
predicted_clone = clf_clone.predict(X)
np.testing.assert_almost_equal(predicted, predicted_clone)
Awesome! Good thinking. We want clones to predict the same values when trained on the same data with the same random seed.
@@ -658,3 +660,70 @@ class PipelineWithDropCol(BinaryClassificationPipeline):
pipeline_with_drop_col.fit(X, y)
pipeline_with_drop_col.score(X, y, ['auc'])
assert list(pipeline_with_drop_col.feature_importances["feature"]) == ['other col']
I think many of my comments from test_components.py apply to this file too.
Is there any reason we need a separate test for each problem type? Unless I'm overlooking something here, I'm leaning towards no, and that we can just have one unit test. Why: pipeline cloning is implemented in PipelineBase, not in each pipeline subclass, and component cloning is also implemented in ComponentBase, not in each component subclass. So there should be no variation in behavior between the problem types.
@eccabay this is super close! The impl looks great, no further comments there. I particularly like the component impl--so simple now! All my comments were on the tests. Let's get those resolved and we can get this merged!
evalml/pipelines/pipeline_base.py (outdated)
cloned_pipeline.component_graph = cloned_components
cloned_pipeline.estimator = cloned_components[-1] if isinstance(cloned_components[-1], Estimator) else None

return cloned_pipeline
One final thought here: I think what you have works, but that we should rely on the pipeline constructor to set up any necessary state:
def clone(self, random_state=0):
    return self.__class__(self.parameters, random_state=random_state)
Just like you did for ComponentBase. Fine to do this in a separate PR if you prefer, since the impl you have passes the tests.
clf = MockFitComponent(**params, random_state=np.random.RandomState(2))
clf_clone = clf.clone(random_state=np.random.RandomState(2))
assert clf_clone.random_state.randint(2**30) == clf.random_state.randint(2**30)
In one of the two code blocks in this test, did you mean to pass in an integer for random_state instead of a np.random.RandomState?
Oops, you're totally right 🤦
assert pipeline_clone.random_state.randint(2**30) == pipeline.random_state.randint(2**30)

pipeline = LinearRegressionPipeline(parameters=parameters, random_state=np.random.RandomState(2))
pipeline_clone = pipeline.clone(random_state=np.random.RandomState(2))
Same comment as for components, should this be saying random_state=2 instead?
with pytest.raises(RuntimeError):
    pipeline_clone.predict(X)
pipeline_clone.fit(X, y)
X_t_clone = pipeline_clone.predict_proba(X)
Is there a reason to do predict_proba here instead of predict?
Leftover from when I was getting weird test failures with some of the pipelines; it gave me a more detailed look into what was going on.
@eccabay I left a few more comments but I think this is ready to go! 🎊 In particular, check out my suggestion on how to simplify PipelineBase.clone.
Closes #534