Integrate ComponentGraphs into Pipelines (#1543)

* Add ComponentGraph to PipelineBase
* Add tests for nonlinear pipelines in test_pipelines
* Tweak classification pipelines to work with ComponentGraphs
* Add automl tests
* Update pipeline doc in user guide
* Fix bug with repeat estimators during fit
alteryx · Dec 18, 2020 · 3eb27b8
1 parent 162992d
Showing 17 changed files with 844 additions and 150 deletions.
diff --git a/docs/source/_templates/pipeline_base_class.rst b/docs/source/_templates/pipeline_base_class.rst
@@ -5,7 +5,7 @@
 .. autoclass:: {{ objname }}
    {% set class_attributes = ['name', 'custom_name', 'summary', 'component_graph', 'problem_type',
                               'model_family', 'hyperparameters', 'custom_hyperparameters',
-                              'default_parameters'] %}
+                              'linearized_component_graph', 'default_parameters'] %}
 
 
    {% block attributes %}

diff --git a/docs/source/release_notes.rst b/docs/source/release_notes.rst
@@ -21,6 +21,7 @@ Release Notes
         * Added multiclass support for ``partial_dependence`` and ``graph_partial_dependence`` :pr:`1554`
         * Added ``TimeSeriesBinaryClassificationPipeline`` and ``TimeSeriesMulticlassClassificationPipeline`` classes :pr:`1528`
         * Added ``make_data_splitter`` method for easier automl data split customization :pr:`1568`
+        * Integrated ``ComponentGraph`` class into Pipelines for full non-linear pipeline support :pr:`1543`
     * Fixes
         * Fix Windows CI jobs: install ``numba`` via conda, required for ``shap`` :pr:`1490`
         * Added custom-index support for `reset-index-get_prediction_vs_actual_over_time_data` :pr:`1494`
@@ -43,6 +44,7 @@ Release Notes
 
     **Breaking Changes**
         * Updated minimal dependencies: ``numpy>=1.19.1``, ``pandas>=1.1.0``, ``scikit-learn>=0.23.1``, ``scikit-optimize>=0.8.1``
+        * Pipeline component instances can no longer be iterated through using ``Pipeline.component_graph`` :pr:`1543`
 
 
 

diff --git a/docs/source/user_guide/pipelines.ipynb b/docs/source/user_guide/pipelines.ipynb
@@ -18,7 +18,9 @@
    "metadata": {},
    "source": [
     "## Class Definition\n",
-    "Pipeline definitions must inherit from the proper pipeline base class, `RegressionPipeline`, `BinaryClassificationPipeline` or `MulticlassClassificationPipeline`. They must also include a `component_graph` list as a class variable containing the sequence of components to be fit and evaluated. The `component_graph` list is used to determine the ordered list of components that should be instantiated when a pipeline instance is created. Each component in `component_graph` can be provided as a reference to the component class for custom components, and as either a string name or as a reference to the component class for components defined in EvalML."
+    "Pipeline definitions must inherit from the proper pipeline base class, `RegressionPipeline`, `BinaryClassificationPipeline` or `MulticlassClassificationPipeline`. They must also include a `component_graph` class variable, which can either be a list or a dictionary containing a sequence of components to be fit and evaluated.\n",
+    "\n",
+    "A `component_graph` list is the default representation, which represents a linear order of transforming components with an estimator as the final component. A `component_graph` dictionary is used to represent a non-linear graph of components, where the key is a unique name for each component and the value is a list with the component's class as the first element and any parents of the component as the following element(s). For either `component_graph` format, each component can be provided as a reference to the component class for custom components, and as either a string name or as a reference to the component class for components defined in EvalML."
    ]
   },
   {
@@ -33,6 +35,22 @@
     "    component_graph = ['Imputer', 'Random Forest Classifier']"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class CustomNonlinearMulticlassClassificationPipeline(MulticlassClassificationPipeline):\n",
+    "    component_graph = {\n",
+    "        'Imputer': ['Imputer'],\n",
+    "        'Encoder': ['One Hot Encoder', 'Imputer'],\n",
+    "        'Random Forest Clf': ['Random Forest Classifier', 'Encoder'],\n",
+    "        'Elastic Net Clf': ['Elastic Net Classifier', 'Encoder'],\n",
+    "        'Final Estimator': ['Logistic Regression Classifier', 'Random Forest Clf', 'Elastic Net Clf']\n",
+    "    }"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -222,6 +240,16 @@
     "cp.graph()"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nonlinear_cp = CustomNonlinearMulticlassClassificationPipeline({})\n",
+    "nonlinear_cp.graph()"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -239,29 +267,26 @@
    ]
   },
   {
-   "cell_type": "markdown",
+   "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
+   "outputs": [],
    "source": [
-    "## Component Graph\n",
-    "\n",
-    "You can use the pipeline's `component_graph` attribute to access a component at a specific index:"
+    "nonlinear_cp.describe()"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "first_component = cp.component_graph[0]\n",
-    "print (first_component.name)"
+    "## Component Graph"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Alternatively, you can use `pipeline.get_component(name)` and provide the component name instead (API reference [here](../generated/methods/evalml.pipelines.PipelineBase.get_component.ipynb)):"
+    "You can use `pipeline.get_component(name)` and provide the component name to access any component (API reference [here](../generated/methods/evalml.pipelines.PipelineBase.get_component.ipynb)):"
    ]
   },
   {
@@ -274,21 +299,21 @@
    ]
   },
   {
-   "cell_type": "markdown",
+   "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
+   "outputs": [],
    "source": [
-    "## Pipeline Estimator\n",
-    "\n",
-    "EvalML enforces that the last component of a pipeline is an estimator. You can access this estimator directly by using either `pipeline.component_graph[-1]` or `pipeline.estimator`."
+    "nonlinear_cp.get_component('Elastic Net Clf')"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "cp.component_graph[-1]"
+    "## Pipeline Estimator\n",
+    "\n",
+    "EvalML enforces that the last component of a linear pipeline is an estimator. You can access this estimator directly by using `pipeline.estimator`."
    ]
   },
   {
@@ -324,7 +349,9 @@
    "source": [
     "## Generate Code\n",
     "\n",
-    "Once you have a pipeline defined in EvalML, you can generate string Python code to recreate this pipeline, which can then be saved and run elsewhere with EvalML. `generate_pipeline_code` requires a pipeline instance as the input. It can also handle custom components, but it won't return the code required to define the component."
+    "Once you have a pipeline defined in EvalML, you can generate string Python code to recreate this pipeline, which can then be saved and run elsewhere with EvalML. `generate_pipeline_code` requires a pipeline instance as the input. It can also handle custom components, but it won't return the code required to define the component.\n",
+    "\n",
+    "Note that code generation is not yet supported for nonlinear pipelines."
    ]
   },
   {
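
Note on the dictionary format described in the user guide change above: each key is a component name, and each value is a list holding the component class followed by its parent names. The sketch below shows how that format could be turned into edges and a fit/predict order. It is an illustration under stated assumptions, not the code in this commit: the function names `get_edges` and `generate_order` are standalone stand-ins, the standard-library `graphlib` module (Python 3.9+) is swapped in for `networkx`, and the connectivity and single-final-component checks are left out.

```python
from graphlib import CycleError, TopologicalSorter


def get_edges(component_dict):
    """Return (parent, child) pairs, dropping any '.x' / '.y' suffix on parent names."""
    edges = []
    for name, info in component_dict.items():
        # info[0] is the component class; everything after it names a parent.
        for parent in info[1:]:
            if parent.endswith(('.x', '.y')):
                parent = parent[:-2]
            edges.append((parent, name))
    return edges


def generate_order(component_dict):
    """Return component names ordered so that every parent precedes its children."""
    predecessors = {name: set() for name in component_dict}
    for parent, child in get_edges(component_dict):
        predecessors[child].add(parent)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        raise ValueError('The given graph contains a cycle')


# The nonlinear graph from the user guide example.
component_graph = {
    'Imputer': ['Imputer'],
    'Encoder': ['One Hot Encoder', 'Imputer'],
    'Random Forest Clf': ['Random Forest Classifier', 'Encoder'],
    'Elastic Net Clf': ['Elastic Net Classifier', 'Encoder'],
    'Final Estimator': ['Logistic Regression Classifier', 'Random Forest Clf', 'Elastic Net Clf'],
}

edges = get_edges(component_graph)
order = generate_order(component_graph)
print(order)
```

In any valid order for this graph, 'Imputer' comes first, 'Encoder' second, and 'Final Estimator' last; the two intermediate classifiers may appear in either order between them.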

diff --git a/evalml/automl/automl_algorithm/iterative_algorithm.py b/evalml/automl/automl_algorithm/iterative_algorithm.py
@@ -124,7 +124,7 @@ def _transform_parameters(self, pipeline_class, proposed_parameters):
         parameters = {}
         if self._pipeline_params:
             parameters['pipeline'] = self._pipeline_params
-        component_graph = [handle_component_class(c) for c in pipeline_class.component_graph]
+        component_graph = [handle_component_class(c) for c in pipeline_class.linearized_component_graph]
         for component_class in component_graph:
             component_parameters = proposed_parameters.get(component_class.name, {})
             init_params = inspect.signature(component_class.__init__).parameters

diff --git a/evalml/model_family/model_family.py b/evalml/model_family/model_family.py
@@ -22,7 +22,7 @@ class ModelFamily(Enum):
     EXTRA_TREES = 'extra_trees'
     """Extra Trees model family."""
 
-    ENSEMBLE = 'ensemble',
+    ENSEMBLE = 'ensemble'
     """Ensemble model family."""
 
     DECISION_TREE = 'decision_tree'
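
Note on the one-character change above: in Python, a trailing comma after a value creates a one-element tuple, so the old line gave the enum member a tuple value rather than a string. The commit message does not say which symptom prompted the fix; the sketch below only demonstrates the language behavior, using two hypothetical enum classes that are not part of EvalML.

```python
from enum import Enum


class WithTrailingComma(Enum):
    # Hypothetical class: the trailing comma makes the value the tuple ('ensemble',).
    ENSEMBLE = 'ensemble',


class WithoutTrailingComma(Enum):
    # Hypothetical class: the value is the plain string 'ensemble'.
    ENSEMBLE = 'ensemble'


print(repr(WithTrailingComma.ENSEMBLE.value))
print(repr(WithoutTrailingComma.ENSEMBLE.value))

# Lookup by string value only succeeds when the value really is a string.
found = WithoutTrailingComma('ensemble')
try:
    WithTrailingComma('ensemble')
    lookup_failed = False
except ValueError:
    lookup_failed = True
```

With the comma in place, any code that compares the member's value against the string 'ensemble', or looks the member up by that string, would not match.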

diff --git a/evalml/pipelines/binary_classification_pipeline.py b/evalml/pipelines/binary_classification_pipeline.py
@@ -27,15 +27,14 @@ def _predict(self, X, objective=None):
         Returns:
             pd.Series: Estimated labels
         """
-        X_t = self.compute_estimator_features(X)
 
         if objective is not None:
             objective = get_objective(objective, return_instance=True)
             if not objective.is_defined_for_problem_type(self.problem_type):
                 raise ValueError("You can only use a binary classification objective to make predictions for a binary classification pipeline.")
 
         if self.threshold is None:
-            return self.estimator.predict(X_t)
+            return self._component_graph.predict(X)
         ypred_proba = self.predict_proba(X)
         ypred_proba = ypred_proba.iloc[:, 1]
         if objective is None:

diff --git a/evalml/pipelines/classification_pipeline.py b/evalml/pipelines/classification_pipeline.py
@@ -78,8 +78,7 @@ def _predict(self, X, objective=None):
         Returns:
             pd.Series: Estimated labels
         """
-        X_t = self.compute_estimator_features(X)
-        return self.estimator.predict(X_t)
+        return self._component_graph.predict(X)
 
     def predict(self, X, objective=None):
         """Make predictions using selected features.

diff --git a/evalml/pipelines/component_graph.py b/evalml/pipelines/component_graph.py
@@ -1,5 +1,6 @@
 import networkx as nx
 import pandas as pd
+import woodwork as ww
 from networkx.algorithms.dag import topological_sort
 from networkx.exception import NetworkXUnfeasible
 
@@ -25,8 +26,7 @@ def __init__(self, component_dict=None, random_state=0):
                 raise ValueError('All component information should be passed in as a list')
             component_class = handle_component_class(component_info[0])
             self.component_instances[component_name] = component_class
-        self.compute_order = []
-        self._recompute_order()
+        self.compute_order = self.generate_order(self.component_dict)
         self.input_feature_names = {}
 
     @classmethod
@@ -96,8 +96,17 @@ def fit_features(self, X, y):
             X (pd.DataFrame): The input training data of shape [n_samples, n_features]
             y (pd.Series): The target training data of length [n_samples]
         """
-        self._compute_features(self.compute_order[:-1], X, y, fit=True)
-        return self
+        if len(self.compute_order) <= 1:
+            return X
+
+        component_outputs = self._compute_features(self.compute_order[:-1], X, y=y, fit=True)
+        final_component_inputs = []
+        for parent in self.get_parents(self.compute_order[-1]):
+            parent_output = component_outputs.get(parent, component_outputs.get(f'{parent}.x'))
+            if isinstance(parent_output, pd.Series):
+                parent_output = pd.DataFrame(parent_output, columns=[parent])
+            final_component_inputs.append(parent_output)
+        return pd.concat(final_component_inputs, axis=1)
 
     def predict(self, X):
         """Make predictions using selected features.
@@ -127,7 +136,7 @@ def compute_final_component_features(self, X, y=None):
         if len(self.compute_order) <= 1:
             return X
 
-        component_outputs = self._compute_features(self.compute_order, X, y=y, fit=False)
+        component_outputs = self._compute_features(self.compute_order[:-1], X, y=y, fit=False)
         final_component_inputs = []
         for parent in self.get_parents(self.compute_order[-1]):
             parent_output = component_outputs.get(parent, component_outputs.get(f'{parent}.x'))
@@ -151,7 +160,8 @@ def _compute_features(self, component_list, X, y=None, fit=False):
         """
         if len(component_list) == 0:
             return X
-
+        if isinstance(X, ww.DataTable):
+            X = X.to_dataframe()
         if not isinstance(X, pd.DataFrame):
             X = pd.DataFrame(X)
 
@@ -189,7 +199,7 @@ def _compute_features(self, component_list, X, y=None, fit=False):
             else:
                 if fit:
                     component_instance.fit(input_x, input_y)
-                if not (fit and component_instance.name == self.get_last_component().name):  # Don't call predict on the final component during fit
+                if not (fit and component_name == self.compute_order[-1]):  # Don't call predict on the final component during fit
                     output = component_instance.predict(input_x)
                 else:
                     output = None
@@ -305,40 +315,41 @@ def graph(self, name=None, graph_format=None):
                                         for key, val in component_class.parameters.items()])  # noqa: W605
                 label = '%s |%s\l' % (component_name, parameters)  # noqa: W605
             graph.node(component_name, shape='record', label=label)
-        edges = self._get_edges()
+        edges = self._get_edges(self.component_dict)
         graph.edges(edges)
         return graph
 
-    def _get_edges(self):
+    @staticmethod
+    def _get_edges(component_dict):
         edges = []
-        for component_name, component_info in self.component_dict.items():
+        for component_name, component_info in component_dict.items():
             if len(component_info) > 1:
                 for parent in component_info[1:]:
                     if parent[-2:] == '.x' or parent[-2:] == '.y':
                         parent = parent[:-2]
                     edges.append((parent, component_name))
         return edges
 
-    def _recompute_order(self):
+    @classmethod
+    def generate_order(cls, component_dict):
         """Regenerated the topologically sorted order of the graph"""
-        edges = self._get_edges()
-        if len(self.component_dict) == 1:
-            self.compute_order = list(self.component_dict.keys())
-            return
+        edges = cls._get_edges(component_dict)
+        if len(component_dict) == 1:
+            return list(component_dict.keys())
         if len(edges) == 0:
-            self.compute_order = []
-            return
+            return []
         digraph = nx.DiGraph()
         digraph.add_edges_from(edges)
         if not nx.is_weakly_connected(digraph):
             raise ValueError('The given graph is not completely connected')
         try:
-            self.compute_order = list(topological_sort(digraph))
+            compute_order = list(topological_sort(digraph))
         except NetworkXUnfeasible:
             raise ValueError('The given graph contains a cycle')
-        end_components = [component for component in self.compute_order if len(nx.descendants(digraph, component)) == 0]
+        end_components = [component for component in compute_order if len(nx.descendants(digraph, component)) == 0]
         if len(end_components) != 1:
             raise ValueError('The given graph has more than one final (childless) component')
+        return compute_order
 
     def __iter__(self):
         self._i = 0