
Clone Pipelines #842 (Merged)

merged 29 commits into master from 534_cloning_pipelines on Jun 16, 2020

Conversation

eccabay (Contributor) commented Jun 11, 2020

Closes #534

codecov bot commented Jun 11, 2020

Codecov Report

Merging #842 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #842   +/-   ##
=======================================
  Coverage   99.69%   99.69%           
=======================================
  Files         195      195           
  Lines        7774     7846   +72     
=======================================
+ Hits         7750     7822   +72     
  Misses         24       24           
Impacted Files                                      Coverage       Δ
evalml/pipelines/components/component_base.py        100.00%   <100.00%> (ø)
evalml/pipelines/pipeline_base.py                     100.00%   <100.00%> (ø)
evalml/tests/component_tests/test_components.py       100.00%   <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py          99.76%   <100.00%> (+0.01%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2e8c859...119c6e1.

eccabay marked this pull request as ready for review on June 11, 2020 17:50
eccabay changed the title from "534 cloning pipelines" to "Clone Pipelines" on Jun 11, 2020
@@ -26,6 +28,22 @@ def name(cls):
def model_family(cls):
"""Returns ModelFamily of this component"""

def clone(self, learned=True, random_state='match'):
Contributor:

Two thoughts:

  • Let's change learned=True to deep=False
  • I like that you thought about random_state here. Let's change this to be an int / RandomState with default 0, just like all other callsites in the repo
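
For illustration only, a minimal sketch of what that suggested signature could look like; MockComponent and its alpha parameter are invented for this example and are not evalml code:

import copy

class MockComponent:
    """Toy stand-in for ComponentBase, only to illustrate the suggested signature."""

    def __init__(self, alpha=0.5, random_state=0):
        self.parameters = {"alpha": alpha}
        self.random_state = random_state

    def clone(self, deep=False, random_state=0):
        # deep=True preserves any fitted state via deepcopy;
        # otherwise re-instantiate from the saved parameters with the requested seed
        if deep:
            return copy.deepcopy(self)
        return self.__class__(**self.parameters, random_state=random_state)

clone = MockComponent(alpha=0.9).clone()
assert clone.parameters == {"alpha": 0.9}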

"""
if learned:
return copy.deepcopy(self)
else:
Contributor:

Flattening: since you return on the previous line, no need for this else

cloned_component = component_class(parameters=self.parameters, random_state=self.random_state)
else:
cloned_component = component_class(parameters=self.parameters, random_state=random_state)
return cloned_component
Contributor:

Instead of adding parameters as an argument to every component, let's do

return self.__class__(**self.parameters, random_state=random_state)

This is how we instantiate components in PipelineBase.__init__.

dsherry (Contributor) left a comment:

@eccabay looking good! You picked up a hard one... like I mentioned the other day, this touches on some real challenges with how our code is structured right now. It'll be great to have this merged.

I left comments about the impl. I see you have a test for each component and classifier -- wow! Will dig into that on the next pass.

Please revert the changes you made to the components, other than ComponentBase of course. We shouldn't add parameters as an input to the components. Not necessary when you can use python's ** syntax to pass the args along. See this explanation of how that works if you didn't know already.
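
For anyone following along, here is a quick illustration of the ** unpacking being referred to; the function and dict below are made up for this example, not evalml code:

def make_estimator(max_depth=6, n_estimators=100, random_state=0):
    return {"max_depth": max_depth, "n_estimators": n_estimators, "random_state": random_state}

saved_parameters = {"max_depth": 3, "n_estimators": 50}

# **saved_parameters expands the dict into keyword arguments, so a component
# can be re-instantiated from self.parameters without a dedicated `parameters` argument.
estimator = make_estimator(**saved_parameters, random_state=42)
assert estimator == {"max_depth": 3, "n_estimators": 50, "random_state": 42}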

min_weight_fraction_leaf=min_weight_fraction_leaf,
"max_depth": max_depth,
"min_samples_split": min_samples_split,
"min_weight_fraction_leaf": min_weight_fraction_leaf}
Contributor:

@jeremyliweishih pointed out you added some params here which were missing. That's great. We should keep changes like this, so that user-inputted values are always saved (except for random_state which we treat separately).

I started looking at #522 recently, which tracks fixing these sorts of issues. Heads up I may end up scooping these changes into my PR :)

Contributor:

Yeah, this and much more is now incorporated into #847

eta=parameters['eta'],
max_depth=parameters['max_depth'],
n_estimators=parameters['n_estimators'],
min_child_weight=parameters['min_child_weight'])
Contributor:

@eccabay could you please back these changes out?

Contributor:

Why: if there's no reason to make them, we shouldn't make them :) perhaps this was left over from earlier?

@@ -5,7 +5,7 @@
from evalml.pipelines.components import CatBoostClassifier
from evalml.utils import SEED_BOUNDS

importorskip('catboost', reason='Skipping test because catboost not installed')
catboost = importorskip('catboost', reason='Skipping test because catboost not installed')
Contributor:

Could you please back out this change?

@@ -5,7 +5,16 @@
from evalml.pipelines.components import CatBoostRegressor
from evalml.utils import SEED_BOUNDS

importorskip('catboost', reason='Skipping test because catboost not installed')
catboost = importorskip('catboost', reason='Skipping test because catboost not installed')
Contributor:

Same for all changes in this file: please back them out since they're not necessary

@@ -40,6 +40,7 @@ def test_en_init(X_y):
'Elastic Net Classifier': {
"alpha": 0.5,
"l1_ratio": 0.5,
"max_iter": 1000
Contributor:

Is this change important for this PR? Or is it left over from an earlier state of the PR?

eccabay (author):

Just an earlier state of the PR, will remove.

@@ -9,7 +9,7 @@
from evalml.pipelines.components import XGBoostRegressor
from evalml.utils import get_random_seed, get_random_state, import_or_raise

importorskip('xgboost', reason='Skipping test because xgboost not installed')
xgboost = importorskip('xgboost', reason='Skipping test because xgboost not installed')
Contributor:

Same as others, please back this out

clf_clone = clf.clone()
with pytest.raises(ValueError, match='Component is not fit'):
    clf_clone.predict(X)
clf.parameters == clf_clone.parameters
Contributor:

I think you meant assert clf.parameters == clf_clone.parameters

@@ -314,3 +314,42 @@ def test_component_parameters_all_saved():

expected_parameters = {arg: default for (arg, default) in zip(args, defaults)}
assert parameters == expected_parameters

Contributor:

@eccabay could you please add another test here which does the following:

def test_clone_init():
    ...
    clf = MockFitComponent(**params)
    clf_clone = clf.clone()
    assert clf.parameters == clf_clone.parameters
    assert clf.random_state == clf_clone.random_state

Contributor:

You'll probably want to make MockFitComponent available via a test fixture in this file

dsherry (Contributor), Jun 15, 2020:

Oh, and I forgot to say: it's valuable to add this because we need coverage of cloning a component which hasn't been fitted. Your main test test_clone covers the fitted case.

dsherry (Contributor), Jun 15, 2020:

I guess on reflection, you could just add assertions to your existing test. My key point was that we should have a test somewhere which checks the params and random state match before fitting. Your call on how to organize that.

predicted = clf.predict(X)
assert isinstance(predicted, type(np.array([])))

clf_clone = clf.clone()
Contributor:

After the call to clone, let's also check assert clf_clone.random_state == clf.random_state

Helpful to have this to ensure that cloning a fitted pipeline still copies the random_state correctly

eccabay (author):

In order for the results to be consistent between the original component and its clone, the random_state is not copied from the original object, but is instead re-seeded with the same integer. Because of this, assert clf_clone.random_state == clf.random_state will actually fail if included in the tests, since they're different objects.

Contributor:

Ah, right, of course.

If I follow right, this wouldn't affect test_clone_init. And then in this unit test we could save the first value of the RandomState instance and use it to check the right seed was used in the clone:

clf = MockFitComponent(**params)
random_state_first_value = clf.random_state.randint(2**30)
...
clf.fit(...)
...
clf_clone = clf.clone()
assert clf_clone.random_state.randint(2**30) == random_state_first_value

@@ -314,3 +314,42 @@ def test_component_parameters_all_saved():

expected_parameters = {arg: default for (arg, default) in zip(args, defaults)}
assert parameters == expected_parameters


Contributor:

@eccabay we also need test coverage for setting the random_state field, both with a) a non-default int and with b) a custom np.random.RandomState instance. We have some coverage in place for this at the automl-level I believe.

My suggestion: Define a separate test test_clone_random_state to do this. For the RandomState case, after cloning you can assert clf_clone.random_state.randint(2**30) == clf.random_state.randint(2**30) to ensure the RandomState are equivalent.
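
A rough sketch of what test_clone_random_state might look like; MockFitComponent is the test component suggested earlier in this thread, and the params dict and seed values are placeholders, not the actual test code:

import numpy as np

def test_clone_random_state():
    params = {"param_a": 2, "param_b": 10}  # placeholder parameters

    # a) non-default int seed
    clf = MockFitComponent(**params, random_state=5)
    clf_clone = clf.clone(random_state=5)
    assert clf_clone.parameters == clf.parameters

    # b) custom np.random.RandomState instance
    clf = MockFitComponent(**params, random_state=np.random.RandomState(2))
    clf_clone = clf.clone(random_state=np.random.RandomState(2))
    # equivalent RandomState instances produce the same first draw
    assert clf_clone.random_state.randint(2**30) == clf.random_state.randint(2**30)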

@@ -314,3 +314,42 @@ def test_component_parameters_all_saved():

expected_parameters = {arg: default for (arg, default) in zip(args, defaults)}
assert parameters == expected_parameters


def test_clone(X_y):
Contributor:

In light of adding the other tests, I suggest you call this test_clone_fitted since it focuses on cloning fitted components.


clf.fit(X, y)
predicted = clf.predict(X)
assert isinstance(predicted, type(np.array([])))
dsherry (Contributor), Jun 15, 2020:

I think you want assert isinstance(predicted, np.ndarray). But actually my suggestion is to not do this here. We have other tests which should do this. And I'm about to add more in my PR for #236


clf_clone.fit(X, y)
predicted_clone = clf_clone.predict(X)
np.testing.assert_almost_equal(predicted, predicted_clone)
Contributor:

Awesome! Good thinking. We want clones to predict the same values when trained on the same data with the same random seed.

@@ -658,3 +660,70 @@ class PipelineWithDropCol(BinaryClassificationPipeline):
pipeline_with_drop_col.fit(X, y)
pipeline_with_drop_col.score(X, y, ['auc'])
assert list(pipeline_with_drop_col.feature_importances["feature"]) == ['other col']


Contributor:

I think many of my comments from test_components.py apply to this file too.

Is there any reason we need a separate test for each problem type? Unless I'm overlooking something here, I'm leaning towards no, and that we can just have one unit test. Why: pipeline cloning is implemented in PipelineBase, not in each pipeline subclass, and component cloning is also implemented in ComponentBase, not in each component subclass. So there should be no variation in behavior between the problem types.

dsherry (Contributor) left a comment:

@eccabay this is super close! The impl looks great, no further comments there. I particularly like the component impl--so simple now! All my comments were on the tests. Let's get those resolved and we can get this merged!

cloned_pipeline.component_graph = cloned_components
cloned_pipeline.estimator = cloned_components[-1] if isinstance(cloned_components[-1], Estimator) else None

return cloned_pipeline
Contributor:

One final thought here: I think what you have works, but that we should rely on the pipeline constructor to set up any necessary state:

def clone(self, random_state=0):
    return self.__class__(self.parameters, random_state=random_state)

Just like you did for ComponentBase.

Fine to do this in a separate PR if you prefer, since the impl you have passes the tests.


clf = MockFitComponent(**params, random_state=np.random.RandomState(2))
clf_clone = clf.clone(random_state=np.random.RandomState(2))
assert clf_clone.random_state.randint(2**30) == clf.random_state.randint(2**30)
Contributor:

In one of the two code blocks in this test, did you mean to pass in an integer for random_state instead of a np.random.RandomState?

eccabay (author):

Oops, you're totally right 🤦
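
Presumably the fix is for one of the two blocks to use a plain integer seed, along these lines (a sketch only; the thread above implies int seeds are converted to a RandomState internally, so the same first-draw comparison still applies):

# integer-seed variant of the block above
clf = MockFitComponent(**params, random_state=2)
clf_clone = clf.clone(random_state=2)
assert clf_clone.random_state.randint(2**30) == clf.random_state.randint(2**30)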

assert pipeline_clone.random_state.randint(2**30) == pipeline.random_state.randint(2**30)

pipeline = LinearRegressionPipeline(parameters=parameters, random_state=np.random.RandomState(2))
pipeline_clone = pipeline.clone(random_state=np.random.RandomState(2))
Contributor:

Same comment as for components, should this be saying random_state=2 instead?

with pytest.raises(RuntimeError):
    pipeline_clone.predict(X)
pipeline_clone.fit(X, y)
X_t_clone = pipeline_clone.predict_proba(X)
Contributor:

Is there a reason to do predict_proba here instead of predict?

eccabay (author):

Leftover from when I was getting weird test failures with some of the pipelines; it gave me a more detailed look into what was going on.

dsherry (Contributor) left a comment:

@eccabay I left a few more comments but I think this is ready to go! 🎊 In particular check out my suggestion on how to simplify PipelineBase.clone.

eccabay merged commit f6a9815 into master on Jun 16, 2020
angela97lin mentioned this pull request on Jun 30, 2020
dsherry deleted the 534_cloning_pipelines branch on October 29, 2020 23:48
Successfully merging this pull request may close these issues:

Ability to clone components and pipelines