
Commit

Update components and pipelines to return Woodwork data structures (#1668)

* init

* updated imputer init, starting to update tests...

* what a mess! messing around with simpleimputer logic and type inference

* clean up imputer tests

* update datetime featurizer

* update per_column_imputer

* fix per col imputer tests

* fix drop null cols tests

* fix ohe tests

* fix pca

* fix lda

* fix lsa and text featurizer

* update featuretools

* update col selector transformer

* update baseline tests

* update baseline regressor

* update target encoder

* update delayed feature transformer

* fix estimator tests

* fix some component tests, more to go

* continue fixing tests, more to go

* fix one more test

* fix component tests

* fix more pipeline tests

* fix stacked ensemble component tests

* fix more tests in automl

* fix component graph and regression pipeline tests

* fix time series pipeline tests

* fix more component tests

* fix some more tests

* fix baseline classification test

* fixing more automl and pipeline test

* fix time series baseline regressor tests

* fix baseline regression pipeline tests and cbm and component graph tests

* fix prediction explanation algo tests

* fix explainer tests and pipeline misc tests

* holy potato fix partial dependence tests

* remove unnecessary try/finally block

* update regression test to use OHE instead of target

* push for tests

* hmmm... adding code to component graph to handle carrying original logical types set

* add check for data column and data check in component graph

* update component graph to handle naming

* fix docs

* uncomment test

* fix pipelines docs

* mini cleanup here

* fix tests

* a bit of cleanup

* fix tests

* remove catboost changes from this branch

* cleanup some comments

* clean up some estimators

* more minor cleanup

* a little more cleanup

* even more cleanup

* fix feature selector

* clean up and add durations flag to pytest

* oops fix typos

* update partial dependence impl

* fix knn

* clean up graphs

* some more cleanup

* cleanup gen utils

* fix tests

* major cleanup, condense component graph

* oops fix test

* fix tests

* fix more tests

* rename helper and add docstrings

* cleaning up docstrings and linting

* more docstring updates

* more cleanup of impl and docstrings

* more cleanup

* some cleanup of unnecessary code in standard scaler

* make classification and time series classification predict same

* fix tests and more cleanup

* oops fix test

* oops fix imputer

* actually fixing tests

* fix delayed feature transformer not returning

* clean up mock

* combine prediction computation into one function

* oops, fix typo

* some final touchups

* docstring update

* updating component graph impl and adding test

* fix docs

* fi

* lint and document

* fix some tests

* fix docstr

* update tests

* test docstr update

* more cleanup, update partial dep impl

* some more cleanup of feature selector and baseline tests

* clean up components notebook

* the tiniest of docstring caps cleanup
angela97lin committed Jan 27, 2021
1 parent 47b4874 commit 01bf21f
Showing 104 changed files with 1,380 additions and 1,349 deletions.
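
Before the file-by-file diff, here is a minimal sketch of what the change means for callers. It is not part of the commit; the `Imputer` component and the toy data are illustrative, and it assumes an evalml build that includes this PR along with woodwork installed.

```python
# Hedged sketch (not from the diff): after this change, fitted transformers return
# Woodwork data structures instead of pandas ones.
import pandas as pd
from evalml.pipelines.components import Imputer

X = pd.DataFrame({"a": [1.0, 2.0, None], "b": [3.0, None, 5.0]})
y = pd.Series([0, 1, 0])

X_t = Imputer().fit_transform(X, y)
print(type(X_t))           # expected: a Woodwork DataTable rather than a pandas DataFrame
print(X_t.to_dataframe())  # convert back to pandas when downstream code needs it
```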
2 changes: 2 additions & 0 deletions docs/source/release_notes.rst
@@ -49,6 +49,7 @@ Release Notes
* Changes
* Added labeling to ``graph_confusion_matrix`` :pr:`1632`
* Rerunning search for ``AutoMLSearch`` results in a message thrown rather than failing the search, and removed ``has_searched`` property :pr:`1647`
* Updated components and pipelines to return ``Woodwork`` data structures :pr:`1668`
* Changed tuner class to allow and ignore single parameter values as input :pr:`1686`
* Capped LightGBM version limit to remove bug in docs :pr:`1711`
* Removed support for `np.random.RandomState` in EvalML :pr:`1727`
@@ -64,6 +65,7 @@ Release Notes

**Breaking Changes**
* Removed ``has_searched`` property from ``AutoMLSearch`` :pr:`1647`
* Components and pipelines return ``Woodwork`` data structures instead of ``pandas`` data structures :pr:`1668`
* Removed support for `np.random.RandomState` in EvalML. Rather than passing ``np.random.RandomState`` as component and pipeline random_state values, we use int random_seed :pr:`1727`


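The breaking-change entry above means code that consumed pandas objects from pipelines now receives Woodwork structures. Below is a hedged migration sketch; the pipeline class, component choices, and data are illustrative, not taken from the diff.

```python
# Hedged migration sketch for the breaking change listed in the release notes above.
import pandas as pd
from evalml.pipelines import BinaryClassificationPipeline

class SimplePipeline(BinaryClassificationPipeline):
    component_graph = ["Imputer", "Random Forest Classifier"]

X = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]})
y = pd.Series([0, 1, 0, 1, 0, 1])

pipeline = SimplePipeline(parameters={})
pipeline.fit(X, y)

preds = pipeline.predict(X)         # now a ww.DataColumn instead of a pd.Series
preds_pd = preds.to_series()        # explicit conversion for pandas-based code
proba = pipeline.predict_proba(X)   # now a ww.DataTable instead of a pd.DataFrame
proba_pd = proba.to_dataframe()
```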
150 changes: 37 additions & 113 deletions docs/source/user_guide/components.ipynb
@@ -148,8 +148,11 @@
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from evalml.pipelines.components import Transformer\n",
"from evalml.utils.gen_utils import (\n",
" _convert_to_woodwork_structure,\n",
" _convert_woodwork_types_wrapper\n",
")\n",
"\n",
"class DropNullColumns(Transformer):\n",
" \"\"\"Transformer to drop features whose percentage of NaN values exceeds a specified threshold\"\"\"\n",
@@ -175,10 +178,19 @@
" random_state=random_state)\n",
"\n",
" def fit(self, X, y=None):\n",
" \"\"\"Fits DropNullColumns component to data\n",
"\n",
" Arguments:\n",
" X (list, ww.DataTable, pd.DataFrame): The input training data of shape [n_samples, n_features]\n",
" y (list, ww.DataColumn, pd.Series, np.ndarray, optional): The target training data of length [n_samples]\n",
"\n",
" Returns:\n",
" self\n",
" \"\"\"\n",
" pct_null_threshold = self.parameters[\"pct_null_threshold\"]\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" percent_null = X.isnull().mean()\n",
" X_t = _convert_to_woodwork_structure(X)\n",
" X_t = _convert_woodwork_types_wrapper(X_t.to_dataframe())\n",
" percent_null = X_t.isnull().mean()\n",
" if pct_null_threshold == 0.0:\n",
" null_cols = percent_null[percent_null > 0]\n",
" else:\n",
@@ -188,16 +200,16 @@
"\n",
" def transform(self, X, y=None):\n",
" \"\"\"Transforms data X by dropping columns that exceed the threshold of null values.\n",
"\n",
" Arguments:\n",
" X (pd.DataFrame): Data to transform\n",
" y (pd.Series, optional): Targets\n",
" X (ww.DataTable, pd.DataFrame): Data to transform\n",
" y (ww.DataColumn, pd.Series, optional): Ignored.\n",
"\n",
" Returns:\n",
" pd.DataFrame: Transformed X\n",
" ww.DataTable: Transformed X\n",
" \"\"\"\n",
"\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" return X.drop(columns=self._cols_to_drop, axis=1)"
" X_t = _convert_to_woodwork_structure(X)\n",
" return X_t.drop(self._cols_to_drop)"
]
},
{
@@ -214,9 +226,9 @@
"\n",
"- `__init__()` - the `__init__()` method of your transformer will need to call `super().__init__()` and pass three parameters in: a `parameters` dictionary holding the parameters to the component, the `component_obj`, and the `random_state` value. You can see that `component_obj` is set to `None` above and we will discuss `component_obj` in depth later on.\n",
"\n",
"- `fit()` - the `fit()` method is responsible for fitting your component on training data.\n",
"- `fit()` - the `fit()` method is responsible for fitting your component on training data. It should return the component object.\n",
"\n",
"- `transform()` - after fitting a component, the `transform()` method will take in new data and transform accordingly. Note: a component must call `fit()` before `transform()`.\n",
"- `transform()` - after fitting a component, the `transform()` method will take in new data and transform accordingly. It should return a Woodwork DataTable. Note: a component must call `fit()` before `transform()`.\n",
"\n",
"You can also call or override `fit_transform()` that combines `fit()` and `transform()` into one method."
]
@@ -252,14 +264,14 @@
" name = \"Baseline Regressor\"\n",
" hyperparameter_ranges = {}\n",
" model_family = ModelFamily.BASELINE\n",
" supported_problem_types = [ProblemTypes.REGRESSION]\n",
" supported_problem_types = [ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION]\n",
"\n",
" def __init__(self, strategy=\"mean\", random_state=0, **kwargs):\n",
" \"\"\"Baseline regressor that uses a simple strategy to make predictions.\n",
"\n",
" Arguments:\n",
" strategy (str): Method used to predict. Valid options are \"mean\", \"median\". Defaults to \"mean\".\n",
" random_state (int): Seed for the random number generator\n",
" random_state (int): Seed for the random number generator. Defaults to 0.\n",
"\n",
" \"\"\"\n",
" if strategy not in [\"mean\", \"median\"]:\n",
@@ -276,9 +288,9 @@
" def fit(self, X, y=None):\n",
" if y is None:\n",
" raise ValueError(\"Cannot fit Baseline regressor if y is None\")\n",
"\n",
" if not isinstance(y, pd.Series):\n",
" y = pd.Series(y)\n",
" X = _convert_to_woodwork_structure(X)\n",
" y = _convert_to_woodwork_structure(y)\n",
" y = _convert_woodwork_types_wrapper(y.to_series())\n",
"\n",
" if self.parameters[\"strategy\"] == \"mean\":\n",
" self._prediction_value = y.mean()\n",
@@ -288,7 +300,9 @@
" return self\n",
"\n",
" def predict(self, X):\n",
" return pd.Series([self._prediction_value] * len(X))\n",
" X = _convert_to_woodwork_structure(X)\n",
" predictions = pd.Series([self._prediction_value] * len(X))\n",
" return _convert_to_woodwork_structure(predictions)\n",
"\n",
" @property\n",
" def feature_importance(self):\n",
@@ -298,7 +312,7 @@
" np.ndarray (float): An array of zeroes\n",
"\n",
" \"\"\"\n",
" return np.zeros(self._num_features)"
" return np.zeros(self._num_features)\n"
]
},
{
@@ -402,45 +416,6 @@
"AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component's `hyperparameter_ranges` class attribute. Any component parameter you add an entry for in `hyperparameter_ranges` will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression as SKLinearRegression\n",
"\n",
"from evalml.model_family import ModelFamily\n",
"from evalml.pipelines.components.estimators import Estimator\n",
"from evalml.problem_types import ProblemTypes\n",
"\n",
"class LinearRegressor(Estimator):\n",
" \"\"\"Linear Regressor.\"\"\"\n",
" name = \"Linear Regressor\"\n",
" hyperparameter_ranges = {\n",
" 'fit_intercept': [True, False],\n",
" 'normalize': [True, False]\n",
" }\n",
" model_family = ModelFamily.LINEAR_MODEL\n",
" supported_problem_types = [ProblemTypes.REGRESSION]\n",
"\n",
" def __init__(self, fit_intercept=True, normalize=False, n_jobs=-1, random_state=0, **kwargs):\n",
" parameters = {\n",
" 'fit_intercept': fit_intercept,\n",
" 'normalize': normalize,\n",
" 'n_jobs': n_jobs\n",
" }\n",
" parameters.update(kwargs)\n",
" linear_regressor = SKLinearRegression(**parameters)\n",
" super().__init__(parameters=parameters,\n",
" component_obj=linear_regressor,\n",
" random_state=random_state)\n",
"\n",
" @property\n",
" def feature_importance(self):\n",
" return self._component_obj.coef_"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -471,8 +446,7 @@
"outputs": [],
"source": [
"# this string can then be copy and pasted into a separate window and executed as python code\n",
"exec(code)\n",
"logisticRegressionClassifier"
"exec(code)"
]
},
{
Expand All @@ -481,60 +455,10 @@
"metadata": {},
"outputs": [],
"source": [
"# custom component\n",
"from evalml.pipelines.components import Transformer\n",
"import pandas as pd\n",
"# We can also do this for custom components\n",
"from evalml.pipelines.components.utils import generate_component_code\n",
"\n",
"class MyDropNullColumns(Transformer):\n",
" \"\"\"Transformer to drop features whose percentage of NaN values exceeds a specified threshold\"\"\"\n",
" name = \"My Drop Null Columns Transformer\"\n",
" hyperparameter_ranges = {}\n",
"\n",
" def __init__(self, pct_null_threshold=1.0, random_state=0, **kwargs):\n",
" \"\"\"Initalizes an transformer to drop features whose percentage of NaN values exceeds a specified threshold.\n",
"\n",
" Arguments:\n",
" pct_null_threshold(float): The percentage of NaN values in an input feature to drop.\n",
" Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.\n",
" If equal to 1.0, will drop columns with all null values. Defaults to 0.95.\n",
" \"\"\"\n",
" if pct_null_threshold < 0 or pct_null_threshold > 1:\n",
" raise ValueError(\"pct_null_threshold must be a float between 0 and 1, inclusive.\")\n",
" parameters = {\"pct_null_threshold\": pct_null_threshold}\n",
" parameters.update(kwargs)\n",
"\n",
" self._cols_to_drop = None\n",
" super().__init__(parameters=parameters,\n",
" component_obj=None,\n",
" random_state=random_state)\n",
"\n",
" def fit(self, X, y=None):\n",
" pct_null_threshold = self.parameters[\"pct_null_threshold\"]\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" percent_null = X.isnull().mean()\n",
" if pct_null_threshold == 0.0:\n",
" null_cols = percent_null[percent_null > 0]\n",
" else:\n",
" null_cols = percent_null[percent_null >= pct_null_threshold]\n",
" self._cols_to_drop = list(null_cols.index)\n",
" return self\n",
"\n",
" def transform(self, X, y=None):\n",
" \"\"\"Transforms data X by dropping columns that exceed the threshold of null values.\n",
" Arguments:\n",
" X (pd.DataFrame): Data to transform\n",
" y (pd.Series, optional): Targets\n",
" Returns:\n",
" pd.DataFrame: Transformed X\n",
" \"\"\"\n",
"\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" return X.drop(columns=self._cols_to_drop, axis=1)\n",
" \n",
"myDropNull = MyDropNullColumns()\n",
"myDropNull = DropNullColumns()\n",
"print(generate_component_code(myDropNull))"
]
},
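As a quick check of the behavior documented in the components notebook above, here is a hedged usage sketch. It uses evalml's built-in `DropNullColumns` component with illustrative data rather than the notebook's custom class.

```python
# Hedged sketch: dropping mostly-null columns and getting a Woodwork DataTable back.
import numpy as np
import pandas as pd
from evalml.pipelines.components import DropNullColumns

X = pd.DataFrame({
    "no_nulls": [1, 2, 3],
    "some_nulls": [1.0, np.nan, 3.0],
    "all_nulls": [np.nan, np.nan, np.nan],
})

transformer = DropNullColumns(pct_null_threshold=0.5)
X_t = transformer.fit_transform(X)        # a ww.DataTable after this change
print(list(X_t.to_dataframe().columns))   # expected: ['no_nulls', 'some_nulls']
```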
11 changes: 6 additions & 5 deletions docs/source/user_guide/pipelines.ipynb
@@ -389,6 +389,7 @@
"from evalml.pipelines.utils import generate_pipeline_code\n",
"from evalml.pipelines import MulticlassClassificationPipeline\n",
"import pandas as pd\n",
"from evalml.utils import _convert_to_woodwork_structure, _convert_woodwork_types_wrapper\n",
"\n",
"class MyDropNullColumns(Transformer):\n",
" \"\"\"Transformer to drop features whose percentage of NaN values exceeds a specified threshold\"\"\"\n",
@@ -415,8 +416,8 @@
"\n",
" def fit(self, X, y=None):\n",
" pct_null_threshold = self.parameters[\"pct_null_threshold\"]\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" X = _convert_to_woodwork_structure(X)\n",
" X = _convert_woodwork_types_wrapper(X.to_dataframe())\n",
" percent_null = X.isnull().mean()\n",
" if pct_null_threshold == 0.0:\n",
" null_cols = percent_null[percent_null > 0]\n",
@@ -434,9 +435,9 @@
" pd.DataFrame: Transformed X\n",
" \"\"\"\n",
"\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" return X.drop(columns=self._cols_to_drop, axis=1)\n",
" X = _convert_to_woodwork_structure(X)\n",
" return X.drop(columns=self._cols_to_drop)\n",
"\n",
"\n",
"class CustomPipeline(MulticlassClassificationPipeline):\n",
" name = \"Custom Pipeline\"\n",
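The pipelines notebook above imports `generate_pipeline_code`; for reference, a hedged sketch of how it is typically called. The `TinyPipeline` class here is illustrative, not the notebook's `CustomPipeline`.

```python
# Hedged sketch of generate_pipeline_code; the pipeline definition is illustrative.
from evalml.pipelines import MulticlassClassificationPipeline
from evalml.pipelines.utils import generate_pipeline_code

class TinyPipeline(MulticlassClassificationPipeline):
    component_graph = ["Imputer", "Random Forest Classifier"]

code = generate_pipeline_code(TinyPipeline(parameters={}))
print(code)  # a Python snippet that can be pasted elsewhere to recreate the pipeline
```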
2 changes: 1 addition & 1 deletion evalml/automl/automl_algorithm/automl_algorithm.py
@@ -25,7 +25,7 @@ def __init__(self,
allowed_pipelines (list(class)): A list of PipelineBase subclasses indicating the pipelines allowed in the search. The default of None indicates all pipelines for this problem type are allowed.
max_iterations (int): The maximum number of iterations to be evaluated.
tuner_class (class): A subclass of Tuner, to be used to find parameters for each pipeline. The default of None indicates the SKOptTuner will be used.
random_state (int): The random seed. Defaults to 0.
random_state (int): Seed for the random number generator. Defaults to 0.
"""
self.random_state = get_random_seed(random_state)
self.allowed_pipelines = allowed_pipelines or []
2 changes: 1 addition & 1 deletion evalml/automl/automl_algorithm/iterative_algorithm.py
@@ -31,7 +31,7 @@ def __init__(self,
allowed_pipelines (list(class)): A list of PipelineBase subclasses indicating the pipelines allowed in the search. The default of None indicates all pipelines for this problem type are allowed.
max_iterations (int): The maximum number of iterations to be evaluated.
tuner_class (class): A subclass of Tuner, to be used to find parameters for each pipeline. The default of None indicates the SKOptTuner will be used.
random_state (int): The random seed. Defaults to 0.
random_state (int): Seed for the random number generator. Defaults to 0.
pipelines_per_batch (int): The number of pipelines to be evaluated in each batch, after the first batch.
n_jobs (int or None): Non-negative integer describing level of parallelism used for pipelines.
number_features (int): The number of columns in the input features.
13 changes: 5 additions & 8 deletions evalml/automl/automl_search.py
@@ -157,7 +157,7 @@ def __init__(self,
additional_objectives (list): Custom set of objectives to score on.
Will override default objectives for problem type if not empty.
random_state (int): The random seed. Defaults to 0.
random_state (int): Seed for the random number generator. Defaults to 0.
n_jobs (int or None): Non-negative integer describing level of parallelism used for pipelines.
None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.
@@ -566,8 +566,8 @@ def _tune_binary_threshold(self, pipeline, X_threshold_tuning, y_threshold_tunin
Arguments:
pipeline (Pipeline): Pipeline instance to threshold
X_threshold_tuning (ww DataTable): X data to tune pipeline to
y_threshold_tuning (ww DataColumn): Target data to tune pipeline to
X_threshold_tuning (ww.DataTable): X data to tune pipeline to
y_threshold_tuning (ww.DataColumn): Target data to tune pipeline to
Returns:
Trained pipeline instance
@@ -576,10 +576,7 @@ def _tune_binary_threshold(self, pipeline, X_threshold_tuning, y_threshold_tunin
pipeline.threshold = 0.5
if X_threshold_tuning:
y_predict_proba = pipeline.predict_proba(X_threshold_tuning)
if isinstance(y_predict_proba, pd.DataFrame):
y_predict_proba = y_predict_proba.iloc[:, 1]
else:
y_predict_proba = y_predict_proba[:, 1]
y_predict_proba = y_predict_proba.iloc[:, 1]
pipeline.threshold = self.objective.optimize_threshold(y_predict_proba, y_threshold_tuning, X=X_threshold_tuning)
return pipeline

@@ -849,7 +846,7 @@ def get_pipeline(self, pipeline_id, random_state=0):
Arguments:
pipeline_id (int): pipeline to retrieve
random_state (int): The random seed. Defaults to 0.
random_state (int): Seed for the random number generator. Defaults to 0.
Returns:
PipelineBase: untrained pipeline instance associated with the provided ID
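The `_tune_binary_threshold` hunk above selects the positive-class probability column and hands it to the objective's `optimize_threshold`. Here is a hedged standalone sketch of that call; the `F1` objective and the sample data are illustrative.

```python
# Hedged sketch of the threshold-optimization step used in _tune_binary_threshold above.
import pandas as pd
from evalml.objectives import F1

y_true = pd.Series([0, 1, 1, 0, 1, 0])
y_pred_proba = pd.Series([0.2, 0.8, 0.6, 0.4, 0.7, 0.3])  # positive-class probabilities

objective = F1()
threshold = objective.optimize_threshold(y_pred_proba, y_true)
print(threshold)  # decision threshold that maximizes F1 on this sample
```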
16 changes: 8 additions & 8 deletions evalml/automl/utils.py
@@ -39,17 +39,17 @@ def make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=
"""Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.
Arguments:
X (pd.DataFrame, ww.DataTable): The input training data of shape [n_samples, n_features].
y (pd.Series, ww.DataColumn): The target training data of length [n_samples].
problem_type (ProblemType): the type of machine learning problem.
X (ww.DataTable, pd.DataFrame): The input training data of shape [n_samples, n_features].
y (ww.DataColumn, pd.Series): The target training data of length [n_samples].
problem_type (ProblemType): The type of machine learning problem.
problem_configuration (dict, None): Additional parameters needed to configure the search. For example,
in time series problems, values should be passed in for the gap and max_delay variables.
n_splits (int, None): the number of CV splits, if applicable. Default 3.
shuffle (bool): whether or not to shuffle the data before splitting, if applicable. Default True.
random_state (int): The random seed. Defaults to 0.
in time series problems, values should be passed in for the gap and max_delay variables. Defaults to None.
n_splits (int, None): The number of CV splits, if applicable. Defaults to 3.
shuffle (bool): Whether or not to shuffle the data before splitting, if applicable. Defaults to True.
random_state (int): Seed for the random number generator. Defaults to 0.
Returns:
sklearn.model_selection.BaseCrossValidator: data splitting method.
sklearn.model_selection.BaseCrossValidator: Data splitting method.
"""
problem_type = handle_problem_types(problem_type)
data_splitter = None
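To round out the `make_data_splitter` docstring changes above, a hedged usage sketch with illustrative data:

```python
# Hedged sketch of make_data_splitter as documented above.
import pandas as pd
from evalml.automl.utils import make_data_splitter

X = pd.DataFrame({"a": range(12), "b": range(12)})
y = pd.Series([0, 1] * 6)

splitter = make_data_splitter(X, y, problem_type="binary", n_splits=3)
print(splitter)  # an sklearn-compatible cross-validator chosen for the problem type
```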
