Added utility function to create pipeline instance from a list of component instances #1176

Merged (6 commits) on Sep 17, 2020
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -13,6 +13,7 @@ Release Notes
        * Added the corresponding probability threshold for each point displayed in `graph_roc_curve` :pr:`1161`
        * Added support for multiclass classification for `roc_curve` :pr:`1164`
        * Added `categories` accessor to `OneHotEncoder` for listing the categories associated with a feature :pr:`1182`
        * Added utility function to create pipeline instances from a list of component instances :pr:`1176`
    * Fixes
        * Fixed XGBoost column names for partial dependence methods :pr:`1104`
        * Removed dead code validating column type from `TextFeaturizer` :pr:`1122`
51 changes: 41 additions & 10 deletions evalml/pipelines/utils.py
@@ -12,6 +12,7 @@
    CatBoostRegressor,
    DateTimeFeaturizer,
    DropNullColumns,
    Estimator,
    Imputer,
    OneHotEncoder,
    StandardScaler
@@ -60,6 +61,16 @@ def _get_preprocessing_components(X, y, problem_type, estimator_class):
    return pp_components


def _get_pipeline_base_class(problem_type):
    """Returns pipeline base class for problem_type"""
    if problem_type == ProblemTypes.BINARY:
        return BinaryClassificationPipeline
    elif problem_type == ProblemTypes.MULTICLASS:
        return MulticlassClassificationPipeline
    elif problem_type == ProblemTypes.REGRESSION:
        return RegressionPipeline


def make_pipeline(X, y, estimator, problem_type):
    """Given input data, target data, an estimator class and the problem type,
    generates a pipeline class with a preprocessing chain which was recommended based on the inputs.

@@ -85,20 +96,40 @@ def make_pipeline(X, y, estimator, problem_type):
     if not isinstance(X, pd.DataFrame):
         X = pd.DataFrame(X)

-    def get_pipeline_base_class(problem_type):
-        """Returns pipeline base class for problem_type"""
-        if problem_type == ProblemTypes.BINARY:
-            return BinaryClassificationPipeline
-        elif problem_type == ProblemTypes.MULTICLASS:
-            return MulticlassClassificationPipeline
-        elif problem_type == ProblemTypes.REGRESSION:
-            return RegressionPipeline
-
-    base_class = get_pipeline_base_class(problem_type)
+    base_class = _get_pipeline_base_class(problem_type)

     class GeneratedPipeline(base_class):
         custom_name = f"{estimator.name} w/ {' + '.join([component.name for component in preprocessing_components])}"
         component_graph = complete_component_graph
         custom_hyperparameters = hyperparameters

     return GeneratedPipeline


def make_pipeline_from_components(component_instances, problem_type, custom_name=None):
Contributor: This looks great!

Contributor: What happens if fitted components are passed in instead of unfitted components?

Contributor: @christopherbunn one more thing I just noticed: this doesn't show up in the API ref.

Author (christopherbunn): Just checked, fitted components that are passed into this function remain fitted. However, the resulting pipeline doesn't show as fitted even if all of the components are fitted. Should it show as fitted? [screenshot]

Author (christopherbunn): RE: the API ref, not sure why it's not showing up, but I'll wrap it up into the docs improvement PR.
    """Given a list of component instances and the problem type, a pipeline instance is generated with the component instances.
    The pipeline will be a subclass of the appropriate pipeline base class for the specified problem_type. A custom name for
    the pipeline can optionally be specified; otherwise the default pipeline name will be 'Templated Pipeline'.

    Arguments:
        component_instances (list): a list of all of the components to include in the pipeline
        problem_type (str or ProblemTypes): problem type for the pipeline to generate
        custom_name (string): a name for the new pipeline

    Returns:
        Pipeline instance with component instances and specified estimator
    """

Contributor: Might be useful to say that the default name is Templated Pipeline.
Contributor: @christopherbunn could you please include an example usage here? I think that'll help people understand what this does.

Author (christopherbunn): I'll put up a new PR with an example use 👍

    if not isinstance(component_instances[-1], Estimator):
        raise ValueError("Pipeline needs to have an estimator at the last position of the component list")
Author (christopherbunn), on lines +123 to +124: I'm going to leave this check in for the last component to be an estimator. I know that in #1162 it said that there is the possibility that we will need to be able to build a pipeline without an estimator. That should be addressed when making a PR to resolve #712.

    pipeline_name = custom_name
    problem_type = handle_problem_types(problem_type)

    class TemplatedPipeline(_get_pipeline_base_class(problem_type)):
        custom_name = pipeline_name
        component_graph = [c.__class__ for c in component_instances]

    pipeline_instance = TemplatedPipeline({})
    pipeline_instance.component_graph = component_instances
Contributor: @christopherbunn yeah this works. I think we should update this impl though. Technically, setting the component_graph directly is bad.

    class TemplatedPipeline(_get_pipeline_base_class(problem_type)):
        custom_name = pipeline_name
        component_graph = [c.__class__ for c in component_instances]
    return TemplatedPipeline({c.name: c.parameters for c in component_instances})

Author (christopherbunn): I see, I'll update the implementation in the new PR.
    return pipeline_instance
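The mechanism above — deriving a TemplatedPipeline subclass whose class-level component_graph lists the component classes, then seeding the instance with the already-built component instances — can be sketched in plain Python. Everything below (Component, BinaryPipeline, and the toy Imputer/RandomForestClassifier) is a simplified stand-in for illustration, not the real evalml API:

```python
# Minimal stand-ins for evalml's component and pipeline base classes.
class Component:
    name = "Component"

    def __init__(self, **parameters):
        self.parameters = parameters


class Imputer(Component):
    name = "Imputer"


class RandomForestClassifier(Component):
    name = "Random Forest Classifier"


class BinaryPipeline:
    custom_name = None
    component_graph = []  # class-level: component *classes*

    def __init__(self, parameters):
        self.parameters = parameters


def make_pipeline_from_components(component_instances, custom_name=None):
    pipeline_name = custom_name

    # Dynamically derive a pipeline subclass; the class body reads
    # pipeline_name and component_instances from the enclosing scope.
    class TemplatedPipeline(BinaryPipeline):
        custom_name = pipeline_name
        component_graph = [c.__class__ for c in component_instances]

    # Build the instance from the instances' own parameters, then swap
    # in the already-constructed component instances.
    pipeline = TemplatedPipeline({c.name: c.parameters for c in component_instances})
    pipeline.component_graph = component_instances
    return pipeline


pipeline = make_pipeline_from_components(
    [Imputer(numeric_impute_strategy="median"), RandomForestClassifier()],
    custom_name="My Pipeline",
)
print(pipeline.custom_name)                        # My Pipeline
print([c.name for c in pipeline.component_graph])  # ['Imputer', 'Random Forest Classifier']
```

Constructing the instance from `{c.name: c.parameters ...}`, as the reviewer suggests, keeps the class-level component_graph (classes) and the instance parameters consistent instead of overwriting the attribute after construction.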
44 changes: 43 additions & 1 deletion evalml/tests/pipeline_tests/test_pipelines.py
@@ -27,6 +27,7 @@
    DropNullColumns,
    ElasticNetClassifier,
    ElasticNetRegressor,
    Estimator,
    Imputer,
    LinearRegressor,
    LogisticRegressionClassifier,
@@ -41,7 +42,11 @@
    _all_estimators_used_in_search,
    allowed_model_families
)
-from evalml.pipelines.utils import get_estimators, make_pipeline
+from evalml.pipelines.utils import (
+    get_estimators,
+    make_pipeline,
+    make_pipeline_from_components
+)
from evalml.problem_types import ProblemTypes
from evalml.utils.gen_utils import (
    categorical_dtypes,

@@ -240,6 +245,43 @@ def test_make_pipeline_problem_type_mismatch():
        make_pipeline(pd.DataFrame(), pd.Series(), Transformer, ProblemTypes.MULTICLASS)


def test_make_pipeline_from_components():
Author (christopherbunn): There is already an existing function that is called make_pipeline, so I split this off into its own name. We could potentially overload the previous function, but it seemed cleaner to me to separate it off.
    with pytest.raises(ValueError, match="Pipeline needs to have an estimator at the last position of the component list"):
        make_pipeline_from_components([Imputer], problem_type='binary')

    imp = Imputer(numeric_impute_strategy='median')
    est = RandomForestClassifier()
    pipeline = make_pipeline_from_components([imp, est], ProblemTypes.BINARY, custom_name='My Pipeline')
    components_list = pipeline.component_graph
    assert components_list == [imp, est]
    assert pipeline.problem_type == ProblemTypes.BINARY
    assert pipeline.custom_name == 'My Pipeline'
    expected_parameters = {
        'Imputer': {
            'categorical_impute_strategy': 'most_frequent',
            'numeric_impute_strategy': 'median',
            'categorical_fill_value': None,
            'numeric_fill_value': None},
        'Random Forest Classifier': {
            'n_estimators': 100,
            'max_depth': 6,
            'n_jobs': -1}
    }
    assert pipeline.parameters == expected_parameters

    class DummyEstimator(Estimator):
        name = "Dummy!"
        model_family = "foo"
        supported_problem_types = [ProblemTypes.BINARY]
        parameters = {'bar': 'baz'}

    pipeline = make_pipeline_from_components([DummyEstimator()], ProblemTypes.BINARY)
    components_list = pipeline.component_graph
    assert len(components_list) == 1
    assert isinstance(components_list[0], DummyEstimator)
    expected_parameters = {'Dummy!': {'bar': 'baz'}}
    assert pipeline.parameters == expected_parameters
Contributor: We should also check that you can fit/predict with this pipeline instance.

Additionally, I'd like to see a test which a) creates a pipeline normally, fits it on some data and generates predictions, b) uses make_pipeline_from_components with the component graph from the first pipeline, fits that instance on the same data and generates predictions on the same data, and c) asserts the predictions are identical.


def test_required_fields():
    class TestPipelineWithoutComponentGraph(PipelineBase):
        pass