
Introduce pipeline / component creation from DataCheckActions #1907

Merged
merged 4 commits into from Mar 2, 2021

Conversation

@angela97lin (Contributor) commented Mar 1, 2021

Closes #1879

Currently, the only action we have is DropCol 😅
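To make the idea concrete, here is a minimal sketch of how a data check action could be turned into a component. All names here (the action-code enum, `DataCheckAction`, `DropColumns`, `make_component_from_action`) are stand-ins for illustration, not evalml's exact classes; with a single action, a plain if statement is enough, which is what the review discussion below is about.

```python
from enum import Enum


class DataCheckActionCode(Enum):
    # Only one action exists so far, matching the PR description.
    DROP_COL = "drop_col"


class DataCheckAction:
    """An action code plus metadata describing what to do."""
    def __init__(self, action_code, metadata=None):
        self.action_code = action_code
        self.metadata = metadata or {}


class DropColumns:
    """Stand-in for a drop-columns transformer component."""
    name = "Drop Columns Transformer"

    def __init__(self, columns=None):
        self.parameters = {"columns": columns or []}


def make_component_from_action(action):
    # One action today, so a single if suffices; reviewers suggest a keyed
    # dict once more actions accumulate.
    if action.action_code == DataCheckActionCode.DROP_COL:
        return DropColumns(columns=action.metadata.get("columns"))
    raise ValueError(f"Unsupported action code: {action.action_code}")


action = DataCheckAction(DataCheckActionCode.DROP_COL, {"columns": ["a", "b"]})
component = make_component_from_action(action)
print(component.parameters)
# → {'columns': ['a', 'b']}
```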

@angela97lin angela97lin self-assigned this Mar 1, 2021
@angela97lin angela97lin added this to the Sprint 2021 Feb B milestone Mar 1, 2021
Comment on lines +1 to +466
```python
        'Imputer': {
            'categorical_impute_strategy': 'most_frequent',
            'numeric_impute_strategy': 'median',
            'categorical_fill_value': None,
            'numeric_fill_value': None},
        'Random Forest Classifier': {
            'n_estimators': 100,
            'max_depth': 6,
            'n_jobs': -1}
    }
    assert pipeline.parameters == expected_parameters
    assert pipeline.random_seed == 15

    class DummyEstimator(Estimator):
        name = "Dummy!"
        model_family = "foo"
        supported_problem_types = [ProblemTypes.BINARY]
        parameters = {'bar': 'baz'}
        random_seed = 42

    pipeline = make_pipeline_from_components([DummyEstimator(random_seed=3)], ProblemTypes.BINARY,
                                             random_seed=random_seed)
    components_list = [c for c in pipeline]
    assert len(components_list) == 1
    assert isinstance(components_list[0], DummyEstimator)
    assert components_list[0].random_seed == random_seed
    expected_parameters = {'Dummy!': {'bar': 'baz'}}
    assert pipeline.parameters == expected_parameters
    assert pipeline.random_seed == random_seed

    X, y = X_y_binary
    pipeline = logistic_regression_binary_pipeline_class(parameters={"Logistic Regression Classifier": {"n_jobs": 1}},
                                                         random_seed=42)
    component_instances = [c for c in pipeline]
    new_pipeline = make_pipeline_from_components(component_instances, ProblemTypes.BINARY)
    pipeline.fit(X, y)
    predictions = pipeline.predict(X)
    new_pipeline.fit(X, y)
    new_predictions = new_pipeline.predict(X)
    assert np.array_equal(predictions, new_predictions)
    assert np.array_equal(pipeline.feature_importance, new_pipeline.feature_importance)
    assert new_pipeline.name == 'Templated Pipeline'
    assert pipeline.parameters == new_pipeline.parameters
    for component, new_component in zip(pipeline._component_graph, new_pipeline._component_graph):
        assert isinstance(new_component, type(component))
    assert pipeline.describe() == new_pipeline.describe()


@pytest.mark.parametrize("problem_type", [ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.REGRESSION])
def test_stacked_estimator_in_pipeline(problem_type, X_y_binary, X_y_multi, X_y_regression,
                                       stackable_classifiers,
                                       stackable_regressors,
                                       logistic_regression_binary_pipeline_class,
                                       logistic_regression_multiclass_pipeline_class,
                                       linear_regression_pipeline_class):
    if problem_type == ProblemTypes.BINARY:
        X, y = X_y_binary
        base_pipeline_class = BinaryClassificationPipeline
        stacking_component_name = StackedEnsembleClassifier.name
        input_pipelines = [make_pipeline_from_components([classifier], problem_type) for classifier in stackable_classifiers]
        comparison_pipeline = logistic_regression_binary_pipeline_class(parameters={"Logistic Regression Classifier": {"n_jobs": 1}})
        objective = 'Log Loss Binary'
    elif problem_type == ProblemTypes.MULTICLASS:
        X, y = X_y_multi
        base_pipeline_class = MulticlassClassificationPipeline
        stacking_component_name = StackedEnsembleClassifier.name
        input_pipelines = [make_pipeline_from_components([classifier], problem_type) for classifier in stackable_classifiers]
        comparison_pipeline = logistic_regression_multiclass_pipeline_class(parameters={"Logistic Regression Classifier": {"n_jobs": 1}})
        objective = 'Log Loss Multiclass'
    elif problem_type == ProblemTypes.REGRESSION:
        X, y = X_y_regression
        base_pipeline_class = RegressionPipeline
        stacking_component_name = StackedEnsembleRegressor.name
        input_pipelines = [make_pipeline_from_components([regressor], problem_type) for regressor in stackable_regressors]
        comparison_pipeline = linear_regression_pipeline_class(parameters={"Linear Regressor": {"n_jobs": 1}})
        objective = 'R2'
    parameters = {
        stacking_component_name: {
            "input_pipelines": input_pipelines,
            "n_jobs": 1
        }
    }
    graph = ['Simple Imputer', stacking_component_name]

    class StackedPipeline(base_pipeline_class):
        component_graph = graph
        model_family = ModelFamily.ENSEMBLE

    pipeline = StackedPipeline(parameters=parameters)
    pipeline.fit(X, y)
    comparison_pipeline.fit(X, y)
    assert not np.isnan(pipeline.predict(X).to_series()).values.any()

    pipeline_score = pipeline.score(X, y, [objective])[objective]
    comparison_pipeline_score = comparison_pipeline.score(X, y, [objective])[objective]

    if problem_type == ProblemTypes.BINARY or problem_type == ProblemTypes.MULTICLASS:
        assert not np.isnan(pipeline.predict_proba(X).to_dataframe()).values.any()
        assert (pipeline_score <= comparison_pipeline_score)
    else:
        assert (pipeline_score >= comparison_pipeline_score)
```
@angela97lin (Contributor, Author) commented:
Moved these tests from test_pipelines.py, since the functions under test live in evalml/pipelines/utils.py and test_pipelines.py is huge.

@jeremyliweishih (Contributor) left a comment:

Cool! This looks good to me. When the data check actions start adding up, we can revisit and consider whether we want to stick with if statements or another implementation.

@bchen1116 (Contributor) left a comment:

LGTM! Yep, I agree with @jeremyliweishih: when there are more actions, it'd be useful to look into a keyed dictionary rather than if/else statements, just to make it a little cleaner. No changes needed now!
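The keyed-dictionary idea the reviewers float could look roughly like the sketch below. This is an illustrative assumption, not evalml's actual API: the action codes, the `_ACTION_TO_COMPONENT` table, and the returned dict shape are all made up to show how dict dispatch replaces a growing if/elif chain.

```python
# Hypothetical registry mapping an action code to a builder that turns the
# action's metadata into a component description. Adding a new action is one
# dict entry instead of another elif branch.
_ACTION_TO_COMPONENT = {
    "drop_col": lambda metadata: {"component": "Drop Columns Transformer",
                                  "columns": metadata.get("columns", [])},
}


def component_for_action(action_code, metadata):
    # Dict lookup scales to many actions; unknown codes fail loudly.
    try:
        build = _ACTION_TO_COMPONENT[action_code]
    except KeyError:
        raise ValueError(f"No component registered for action {action_code!r}")
    return build(metadata)


print(component_for_action("drop_col", {"columns": ["id"]}))
# → {'component': 'Drop Columns Transformer', 'columns': ['id']}
```

With only DropCol today the if statement in the PR is simpler; the registry pays off once several actions exist.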

@freddyaboulton (Contributor) left a comment:

@angela97lin This looks good!

The design doc mentions adding a prepended_components argument to make_pipeline. Is there an issue for that? Or are you planning on doing it in #1883?

@angela97lin (Author) replied:

@freddyaboulton Yupperino, I was thinking about doing it as part of #1883, though I guess there's no reason I couldn't do it here/separately, heh. Gonna stick with that plan rather than ask for reapprovals on more impl!
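For context, a `prepended_components` argument like the one the design doc mentions (deferred to #1883 in this thread) might look like the minimal sketch below. The signature, the list-of-names pipeline shape, and the component names are all assumptions for illustration, not evalml's real `make_pipeline`.

```python
# Illustrative-only sketch: extra components (e.g. ones built from
# DataCheckActions, such as a drop-columns transformer) get placed at the
# front of the component graph, before the standard preprocessing.
def make_pipeline(estimator, preprocessing, prepended_components=None):
    components = list(prepended_components or [])  # actions' components first
    components.extend(preprocessing)               # then usual preprocessing
    components.append(estimator)                   # estimator last
    return components


graph = make_pipeline("Random Forest Classifier", ["Imputer"],
                      prepended_components=["Drop Columns Transformer"])
print(graph)
# → ['Drop Columns Transformer', 'Imputer', 'Random Forest Classifier']
```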

@angela97lin angela97lin merged commit d60b851 into main Mar 2, 2021
@angela97lin angela97lin deleted the 1879_pipeline_action branch March 2, 2021 20:17
@dsherry dsherry mentioned this pull request Mar 11, 2021
4 participants