Integrate ComponentGraphs into Pipelines (#1543)
* Add ComponentGraph to PipelineBase

* Add tests for nonlinear pipelines in test_pipelines

* Tweak classification pipelines to work with ComponentGraphs

* Add automl tests

* Update pipeline doc in user guide

* Fix bug with repeat estimators during fit
eccabay committed Dec 18, 2020
1 parent 162992d commit 3eb27b8
Showing 17 changed files with 844 additions and 150 deletions.
2 changes: 1 addition & 1 deletion docs/source/_templates/pipeline_base_class.rst
@@ -5,7 +5,7 @@
.. autoclass:: {{ objname }}
{% set class_attributes = ['name', 'custom_name', 'summary', 'component_graph', 'problem_type',
'model_family', 'hyperparameters', 'custom_hyperparameters',
'default_parameters'] %}
'linearized_component_graph', 'default_parameters'] %}


{% block attributes %}
2 changes: 2 additions & 0 deletions docs/source/release_notes.rst
@@ -21,6 +21,7 @@ Release Notes
* Added multiclass support for ``partial_dependence`` and ``graph_partial_dependence`` :pr:`1554`
* Added ``TimeSeriesBinaryClassificationPipeline`` and ``TimeSeriesMulticlassClassificationPipeline`` classes :pr:`1528`
* Added ``make_data_splitter`` method for easier automl data split customization :pr:`1568`
* Integrated ``ComponentGraph`` class into Pipelines for full non-linear pipeline support :pr:`1543`
* Fixes
* Fix Windows CI jobs: install ``numba`` via conda, required for ``shap`` :pr:`1490`
* Added custom-index support for `reset-index-get_prediction_vs_actual_over_time_data` :pr:`1494`
@@ -43,6 +44,7 @@ Release Notes

**Breaking Changes**
* Updated minimal dependencies: ``numpy>=1.19.1``, ``pandas>=1.1.0``, ``scikit-learn>=0.23.1``, ``scikit-optimize>=0.8.1``
* Pipeline component instances can no longer be iterated through using ``Pipeline.component_graph`` :pr:`1543`



67 changes: 47 additions & 20 deletions docs/source/user_guide/pipelines.ipynb
@@ -18,7 +18,9 @@
"metadata": {},
"source": [
"## Class Definition\n",
"Pipeline definitions must inherit from the proper pipeline base class, `RegressionPipeline`, `BinaryClassificationPipeline` or `MulticlassClassificationPipeline`. They must also include a `component_graph` list as a class variable containing the sequence of components to be fit and evaluated. The `component_graph` list is used to determine the ordered list of components that should be instantiated when a pipeline instance is created. Each component in `component_graph` can be provided as a reference to the component class for custom components, and as either a string name or as a reference to the component class for components defined in EvalML."
"Pipeline definitions must inherit from the proper pipeline base class, `RegressionPipeline`, `BinaryClassificationPipeline` or `MulticlassClassificationPipeline`. They must also include a `component_graph` class variable, which can either be a list or a dictionary containing a sequence of components to be fit and evaluated.\n",
"\n",
"A `component_graph` list is the default representation, which represents a linear order of transforming components with an estimator as the final component. A `component_graph` dictionary is used to represent a non-linear graph of components, where the key is a unique name for each component and the value is a list with the component's class as the first element and any parents of the component as the following element(s). For either `component_graph` format, each component can be provided as a reference to the component class for custom components, and as either a string name or as a reference to the component class for components defined in EvalML."
]
},
{
@@ -33,6 +35,22 @@
" component_graph = ['Imputer', 'Random Forest Classifier']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class CustomNonlinearMulticlassClassificationPipeline(MulticlassClassificationPipeline):\n",
" component_graph = {\n",
" 'Imputer': ['Imputer'],\n",
" 'Encoder': ['One Hot Encoder', 'Imputer'],\n",
" 'Random Forest Clf': ['Random Forest Classifier', 'Encoder'],\n",
" 'Elastic Net Clf': ['Elastic Net Classifier', 'Encoder'],\n",
" 'Final Estimator': ['Logistic Regression Classifier', 'Random Forest Clf', 'Elastic Net Clf']\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -222,6 +240,16 @@
"cp.graph()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nonlinear_cp = CustomNonlinearMulticlassClassificationPipeline({})\n",
"nonlinear_cp.graph()"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -239,29 +267,26 @@
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Component Graph\n",
"\n",
"You can use the pipeline's `component_graph` attribute to access a component at a specific index:"
"nonlinear_cp.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"first_component = cp.component_graph[0]\n",
"print (first_component.name)"
"## Component Graph"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can use `pipeline.get_component(name)` and provide the component name instead (API reference [here](../generated/methods/evalml.pipelines.PipelineBase.get_component.ipynb)):"
"You can use `pipeline.get_component(name)` and provide the component name to access any component (API reference [here](../generated/methods/evalml.pipelines.PipelineBase.get_component.ipynb)):"
]
},
{
@@ -274,21 +299,21 @@
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Pipeline Estimator\n",
"\n",
"EvalML enforces that the last component of a pipeline is an estimator. You can access this estimator directly by using either `pipeline.component_graph[-1]` or `pipeline.estimator`."
"nonlinear_cp.get_component('Elastic Net Clf')"
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"cp.component_graph[-1]"
"## Pipeline Estimator\n",
"\n",
"EvalML enforces that the last component of a linear pipeline is an estimator. You can access this estimator directly by using `pipeline.estimator`."
]
},
{
@@ -324,7 +349,9 @@
"source": [
"## Generate Code\n",
"\n",
"Once you have a pipeline defined in EvalML, you can generate string Python code to recreate this pipeline, which can then be saved and run elsewhere with EvalML. `generate_pipeline_code` requires a pipeline instance as the input. It can also handle custom components, but it won't return the code required to define the component."
"Once you have a pipeline defined in EvalML, you can generate string Python code to recreate this pipeline, which can then be saved and run elsewhere with EvalML. `generate_pipeline_code` requires a pipeline instance as the input. It can also handle custom components, but it won't return the code required to define the component.\n",
"\n",
"Note that code generation is not yet supported for nonlinear pipelines."
]
},
{
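The user guide change above describes two `component_graph` formats: a list for linear pipelines, and a dictionary whose values hold the component first and its parents after it. The relationship between the two can be sketched roughly as follows. `linear_list_to_dict` and `parents_of` are hypothetical helpers written for illustration, not EvalML functions, and the sketch assumes every component name in the list is unique.

```python
# Hypothetical helpers for illustration only; they are not part of EvalML.

def linear_list_to_dict(components):
    """Expand a linear list of component names into the dictionary layout,
    where each value is [component, parent_1, parent_2, ...]."""
    graph = {}
    previous = None
    for name in components:
        # The first component has no parent; every later one depends on
        # the component immediately before it.
        graph[name] = [name] if previous is None else [name, previous]
        previous = name
    return graph


def parents_of(graph, name):
    """Return the parents of a component: everything after the first element."""
    return graph[name][1:]


linear = linear_list_to_dict(['Imputer', 'Random Forest Classifier'])

# The nonlinear example from the notebook cell in this commit.
nonlinear = {
    'Imputer': ['Imputer'],
    'Encoder': ['One Hot Encoder', 'Imputer'],
    'Random Forest Clf': ['Random Forest Classifier', 'Encoder'],
    'Elastic Net Clf': ['Elastic Net Classifier', 'Encoder'],
    'Final Estimator': ['Logistic Regression Classifier',
                        'Random Forest Clf', 'Elastic Net Clf'],
}

print(linear)
print(parents_of(nonlinear, 'Final Estimator'))
```

In the dictionary form the key is the unique name and the first list element is the component itself, which is why two components can share a parent such as `'Encoder'` without ambiguity.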
2 changes: 1 addition & 1 deletion evalml/automl/automl_algorithm/iterative_algorithm.py
@@ -124,7 +124,7 @@ def _transform_parameters(self, pipeline_class, proposed_parameters):
parameters = {}
if self._pipeline_params:
parameters['pipeline'] = self._pipeline_params
component_graph = [handle_component_class(c) for c in pipeline_class.component_graph]
component_graph = [handle_component_class(c) for c in pipeline_class.linearized_component_graph]
for component_class in component_graph:
component_parameters = proposed_parameters.get(component_class.name, {})
init_params = inspect.signature(component_class.__init__).parameters
2 changes: 1 addition & 1 deletion evalml/model_family/model_family.py
@@ -22,7 +22,7 @@ class ModelFamily(Enum):
EXTRA_TREES = 'extra_trees'
"""Extra Trees model family."""

ENSEMBLE = 'ensemble',
ENSEMBLE = 'ensemble'
"""Ensemble model family."""

DECISION_TREE = 'decision_tree'
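The one-character change to `ENSEMBLE` in this hunk is easy to overlook. A trailing comma after a string literal creates a one-element tuple, so the enum member's value would no longer be the plain string. The standalone sketch below uses two made-up enums (`WithComma`, `WithoutComma`) and a hypothetical `can_look_up` helper to show the difference. It is not EvalML code.

```python
from enum import Enum


class WithComma(Enum):
    # The trailing comma makes the value the tuple ('ensemble',).
    ENSEMBLE = 'ensemble',


class WithoutComma(Enum):
    # Without the comma the value is the string 'ensemble'.
    ENSEMBLE = 'ensemble'


def can_look_up(enum_cls, value):
    """Return True if enum_cls(value) resolves to a member."""
    try:
        enum_cls(value)
        return True
    except ValueError:
        return False


print(type(WithComma.ENSEMBLE.value).__name__)
print(type(WithoutComma.ENSEMBLE.value).__name__)
print(can_look_up(WithComma, 'ensemble'), can_look_up(WithoutComma, 'ensemble'))
```

Any code that looks a member up by its string value, or compares `.value` against a string, would behave differently for the tuple-valued member.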
3 changes: 1 addition & 2 deletions evalml/pipelines/binary_classification_pipeline.py
@@ -27,15 +27,14 @@ def _predict(self, X, objective=None):
Returns:
pd.Series: Estimated labels
"""
X_t = self.compute_estimator_features(X)

if objective is not None:
objective = get_objective(objective, return_instance=True)
if not objective.is_defined_for_problem_type(self.problem_type):
raise ValueError("You can only use a binary classification objective to make predictions for a binary classification pipeline.")

if self.threshold is None:
return self.estimator.predict(X_t)
return self._component_graph.predict(X)
ypred_proba = self.predict_proba(X)
ypred_proba = ypred_proba.iloc[:, 1]
if objective is None:
3 changes: 1 addition & 2 deletions evalml/pipelines/classification_pipeline.py
@@ -78,8 +78,7 @@ def _predict(self, X, objective=None):
Returns:
pd.Series: Estimated labels
"""
X_t = self.compute_estimator_features(X)
return self.estimator.predict(X_t)
return self._component_graph.predict(X)

def predict(self, X, objective=None):
"""Make predictions using selected features.
49 changes: 30 additions & 19 deletions evalml/pipelines/component_graph.py
@@ -1,5 +1,6 @@
import networkx as nx
import pandas as pd
import woodwork as ww
from networkx.algorithms.dag import topological_sort
from networkx.exception import NetworkXUnfeasible

@@ -25,8 +26,7 @@ def __init__(self, component_dict=None, random_state=0):
raise ValueError('All component information should be passed in as a list')
component_class = handle_component_class(component_info[0])
self.component_instances[component_name] = component_class
self.compute_order = []
self._recompute_order()
self.compute_order = self.generate_order(self.component_dict)
self.input_feature_names = {}

@classmethod
@@ -96,8 +96,17 @@ def fit_features(self, X, y):
X (pd.DataFrame): The input training data of shape [n_samples, n_features]
y (pd.Series): The target training data of length [n_samples]
"""
self._compute_features(self.compute_order[:-1], X, y, fit=True)
return self
if len(self.compute_order) <= 1:
return X

component_outputs = self._compute_features(self.compute_order[:-1], X, y=y, fit=True)
final_component_inputs = []
for parent in self.get_parents(self.compute_order[-1]):
parent_output = component_outputs.get(parent, component_outputs.get(f'{parent}.x'))
if isinstance(parent_output, pd.Series):
parent_output = pd.DataFrame(parent_output, columns=[parent])
final_component_inputs.append(parent_output)
return pd.concat(final_component_inputs, axis=1)

def predict(self, X):
"""Make predictions using selected features.
@@ -127,7 +136,7 @@ def compute_final_component_features(self, X, y=None):
if len(self.compute_order) <= 1:
return X

component_outputs = self._compute_features(self.compute_order, X, y=y, fit=False)
component_outputs = self._compute_features(self.compute_order[:-1], X, y=y, fit=False)
final_component_inputs = []
for parent in self.get_parents(self.compute_order[-1]):
parent_output = component_outputs.get(parent, component_outputs.get(f'{parent}.x'))
@@ -151,7 +160,8 @@ def _compute_features(self, component_list, X, y=None, fit=False):
"""
if len(component_list) == 0:
return X

if isinstance(X, ww.DataTable):
X = X.to_dataframe()
if not isinstance(X, pd.DataFrame):
X = pd.DataFrame(X)

@@ -189,7 +199,7 @@ def _compute_features(self, component_list, X, y=None, fit=False):
else:
if fit:
component_instance.fit(input_x, input_y)
if not (fit and component_instance.name == self.get_last_component().name): # Don't call predict on the final component during fit
if not (fit and component_name == self.compute_order[-1]): # Don't call predict on the final component during fit
output = component_instance.predict(input_x)
else:
output = None
@@ -305,40 +315,41 @@ def graph(self, name=None, graph_format=None):
for key, val in component_class.parameters.items()]) # noqa: W605
label = '%s |%s\l' % (component_name, parameters) # noqa: W605
graph.node(component_name, shape='record', label=label)
edges = self._get_edges()
edges = self._get_edges(self.component_dict)
graph.edges(edges)
return graph

def _get_edges(self):
@staticmethod
def _get_edges(component_dict):
edges = []
for component_name, component_info in self.component_dict.items():
for component_name, component_info in component_dict.items():
if len(component_info) > 1:
for parent in component_info[1:]:
if parent[-2:] == '.x' or parent[-2:] == '.y':
parent = parent[:-2]
edges.append((parent, component_name))
return edges

def _recompute_order(self):
@classmethod
def generate_order(cls, component_dict):
"""Regenerates the topologically sorted order of the graph"""
edges = self._get_edges()
if len(self.component_dict) == 1:
self.compute_order = list(self.component_dict.keys())
return
edges = cls._get_edges(component_dict)
if len(component_dict) == 1:
return list(component_dict.keys())
if len(edges) == 0:
self.compute_order = []
return
return []
digraph = nx.DiGraph()
digraph.add_edges_from(edges)
if not nx.is_weakly_connected(digraph):
raise ValueError('The given graph is not completely connected')
try:
self.compute_order = list(topological_sort(digraph))
compute_order = list(topological_sort(digraph))
except NetworkXUnfeasible:
raise ValueError('The given graph contains a cycle')
end_components = [component for component in self.compute_order if len(nx.descendants(digraph, component)) == 0]
end_components = [component for component in compute_order if len(nx.descendants(digraph, component)) == 0]
if len(end_components) != 1:
raise ValueError('The given graph has more than one final (childless) component')
return compute_order

def __iter__(self):
self._i = 0
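The ordering logic that `generate_order` and `_get_edges` implement in the hunk above can be approximated with the standard library alone. The sketch below is an assumption-laden stand-in rather than the commit's code: it swaps `networkx` for `graphlib.TopologicalSorter` (Python 3.9+), uses free functions named `get_edges` and `generate_order` for readability, and omits the weak-connectivity check because `graphlib` has no equivalent of `nx.is_weakly_connected`.

```python
from graphlib import CycleError, TopologicalSorter


def get_edges(component_dict):
    """Collect (parent, child) pairs, stripping a '.x' or '.y' suffix from parents."""
    edges = []
    for name, info in component_dict.items():
        for parent in info[1:]:
            if parent.endswith(('.x', '.y')):
                parent = parent[:-2]
            edges.append((parent, name))
    return edges


def generate_order(component_dict):
    """Return component names in an order where parents precede children."""
    if len(component_dict) == 1:
        return list(component_dict)
    edges = get_edges(component_dict)
    if not edges:
        return []
    # graphlib expects a mapping of node -> set of predecessors.
    predecessors = {}
    for parent, child in edges:
        predecessors.setdefault(child, set()).add(parent)
        predecessors.setdefault(parent, set())
    try:
        order = list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        raise ValueError('The given graph contains a cycle')
    # A final component is one that never appears as a parent.
    has_children = {parent for parent, _ in edges}
    end_components = [name for name in order if name not in has_children]
    if len(end_components) != 1:
        raise ValueError('The given graph has more than one final (childless) component')
    return order


example = {
    'Imputer': ['Imputer'],
    'Encoder': ['One Hot Encoder', 'Imputer'],
    'Random Forest Clf': ['Random Forest Classifier', 'Encoder'],
    'Elastic Net Clf': ['Elastic Net Classifier', 'Encoder'],
    'Final Estimator': ['Logistic Regression Classifier',
                        'Random Forest Clf', 'Elastic Net Clf'],
}
order = generate_order(example)
print(order)
```

The relative order of the two middle estimators is not fixed by the graph, so only the positions of `'Imputer'`, `'Encoder'`, and `'Final Estimator'` are guaranteed. Returning the order instead of mutating instance state, as the commit does, lets the ordering be computed from a bare dictionary before any instance exists.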
