
Commit

Update components and pipelines to return Woodwork data structures (#1668)

* init

* updated imputer init, starting to update tests...

* what a mess! messing around with simpleimputer logic and type inference

* clean up imputer tests

* update datetime featurizer

* update per_column_imputer

* fix per col imputer tests

* fix drop null cols tests

* fix ohe tests

* fix pca

* fix lda

* fix lsa and text featurizer

* update featuretools

* update col selector transformer

* update baseline tests

* update baseline regressor

* update target encoder

* update delayed feature transformer

* fix estimator tests

* fix some component tests, more to go

* continue fixing tests, more to go

* fix one more test

* fix component tests

* fix more pipeline tests

* fix stacked ensemble component tests

* fix more tests in automl

* fix component graph and regression pipeline tests

* fix time series pipeline tests

* fix more component tests

* fix some more tests

* fix baseline classification test

* fixing more automl and pipeline test

* fix time series baseline regressor tests

* fix baseline regression pipeline tests and cbm and component graph tests

* fix prediction explanation algo tests

* fix explainer tests and pipeline misc tests

* holy potato fix partial dependence tests

* remove unnecessary try/finally block

* update regression test to use OHE instead of target

* push for tests

* hmmm... adding code to component graph to handle carrying original logical types set

* add check for data column and data check in component graph

* update component graph to handle naming

* fix docs

* uncomment test

* fix pipelines docs

* mini cleanup here

* fix tests

* a bit of cleanup

* fix tests

* remove catboost changes from this branch

* cleanup some comments

* clean up some estimators

* more minor cleanup

* a little more cleanup

* even more cleanup

* fix feature selector

* clean up and add durations flag to pytest

* oops fix typos

* update partial dependence impl

* fix knn

* clean up graphs

* some more cleanup

* cleanup gen utils

* fix tests

* major cleanup, condense component graph

* oops fix test

* fix tests

* fix more tests

* rename helper and add docstrings

* cleaning up docstrings and linting

* more docstring updates

* more cleanup of impl and docstrings

* more cleanup

* some cleanup of unnecessary code in standard scaler

* make classification and time series classification predict same

* fix tests and more cleanup

* oops fix test

* oops fix imputer

* actually fixing tests

* fix delayed feature transformer not returning

* clean up mock

* combine prediction computation into one function

* oops, fix typo

* some final touchups

* docstring update

* updating component graph impl and adding test

* fix docs

* fi

* lint and document

* fix some tests

* fix docstr

* update tests

* test docstr update

* more cleanup, update partial dep impl

* some more cleanup of feature selector and baseline tests

* clean up components notebook

* the tiniest of docstring caps cleanup
angela97lin committed Jan 27, 2021
1 parent 47b4874 commit 01bf21f
Showing 104 changed files with 1,380 additions and 1,349 deletions.
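
Before the file-by-file diff, here is a minimal sketch of what the change means for callers. It is not part of the commit; the `Imputer` component and the toy data are illustrative, and it assumes an evalml build that includes this PR along with woodwork installed.

```python
# Hedged sketch (not from the diff): after this change, fitted transformers return
# Woodwork data structures instead of pandas ones.
import pandas as pd
from evalml.pipelines.components import Imputer

X = pd.DataFrame({"a": [1.0, 2.0, None], "b": [3.0, None, 5.0]})
y = pd.Series([0, 1, 0])

X_t = Imputer().fit_transform(X, y)
print(type(X_t))           # expected: a Woodwork DataTable rather than a pandas DataFrame
print(X_t.to_dataframe())  # convert back to pandas when downstream code needs it
```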
2 changes: 2 additions & 0 deletions docs/source/release_notes.rst
@@ -49,6 +49,7 @@ Release Notes
* Changes
* Added labeling to ``graph_confusion_matrix`` :pr:`1632`
* Rerunning search for ``AutoMLSearch`` results in a message thrown rather than failing the search, and removed ``has_searched`` property :pr:`1647`
* Updated components and pipelines to return ``Woodwork`` data structures :pr:`1668`
* Changed tuner class to allow and ignore single parameter values as input :pr:`1686`
* Capped LightGBM version limit to remove bug in docs :pr:`1711`
* Removed support for `np.random.RandomState` in EvalML :pr:`1727`
@@ -64,6 +65,7 @@ Release Notes

**Breaking Changes**
* Removed ``has_searched`` property from ``AutoMLSearch`` :pr:`1647`
* Components and pipelines return ``Woodwork`` data structures instead of ``pandas`` data structures :pr:`1668`
* Removed support for `np.random.RandomState` in EvalML. Rather than passing ``np.random.RandomState`` as component and pipeline random_state values, we use int random_seed :pr:`1727`


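The breaking-change entry above means code that consumed pandas objects from pipelines now receives Woodwork structures. Below is a hedged migration sketch; the pipeline class, component choices, and data are illustrative, not taken from the diff.

```python
# Hedged migration sketch for the breaking change listed in the release notes above.
import pandas as pd
from evalml.pipelines import BinaryClassificationPipeline

class SimplePipeline(BinaryClassificationPipeline):
    component_graph = ["Imputer", "Random Forest Classifier"]

X = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]})
y = pd.Series([0, 1, 0, 1, 0, 1])

pipeline = SimplePipeline(parameters={})
pipeline.fit(X, y)

preds = pipeline.predict(X)         # now a ww.DataColumn instead of a pd.Series
preds_pd = preds.to_series()        # explicit conversion for pandas-based code
proba = pipeline.predict_proba(X)   # now a ww.DataTable instead of a pd.DataFrame
proba_pd = proba.to_dataframe()
```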
150 changes: 37 additions & 113 deletions docs/source/user_guide/components.ipynb
@@ -148,8 +148,11 @@
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from evalml.pipelines.components import Transformer\n",
"from evalml.utils.gen_utils import (\n",
" _convert_to_woodwork_structure,\n",
" _convert_woodwork_types_wrapper\n",
")\n",
"\n",
"class DropNullColumns(Transformer):\n",
" \"\"\"Transformer to drop features whose percentage of NaN values exceeds a specified threshold\"\"\"\n",
@@ -175,10 +178,19 @@
" random_state=random_state)\n",
"\n",
" def fit(self, X, y=None):\n",
" \"\"\"Fits DropNullColumns component to data\n",
"\n",
" Arguments:\n",
" X (list, ww.DataTable, pd.DataFrame): The input training data of shape [n_samples, n_features]\n",
" y (list, ww.DataColumn, pd.Series, np.ndarray, optional): The target training data of length [n_samples]\n",
"\n",
" Returns:\n",
" self\n",
" \"\"\"\n",
" pct_null_threshold = self.parameters[\"pct_null_threshold\"]\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" percent_null = X.isnull().mean()\n",
" X_t = _convert_to_woodwork_structure(X)\n",
" X_t = _convert_woodwork_types_wrapper(X_t.to_dataframe())\n",
" percent_null = X_t.isnull().mean()\n",
" if pct_null_threshold == 0.0:\n",
" null_cols = percent_null[percent_null > 0]\n",
" else:\n",
@@ -188,16 +200,16 @@
"\n",
" def transform(self, X, y=None):\n",
" \"\"\"Transforms data X by dropping columns that exceed the threshold of null values.\n",
"\n",
" Arguments:\n",
" X (pd.DataFrame): Data to transform\n",
" y (pd.Series, optional): Targets\n",
" X (ww.DataTable, pd.DataFrame): Data to transform\n",
" y (ww.DataColumn, pd.Series, optional): Ignored.\n",
"\n",
" Returns:\n",
" pd.DataFrame: Transformed X\n",
" ww.DataTable: Transformed X\n",
" \"\"\"\n",
"\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" return X.drop(columns=self._cols_to_drop, axis=1)"
" X_t = _convert_to_woodwork_structure(X)\n",
" return X_t.drop(self._cols_to_drop)"
]
},
{
@@ -214,9 +226,9 @@
"\n",
"- `__init__()` - the `__init__()` method of your transformer will need to call `super().__init__()` and pass three parameters in: a `parameters` dictionary holding the parameters to the component, the `component_obj`, and the `random_state` value. You can see that `component_obj` is set to `None` above and we will discuss `component_obj` in depth later on.\n",
"\n",
"- `fit()` - the `fit()` method is responsible for fitting your component on training data.\n",
"- `fit()` - the `fit()` method is responsible for fitting your component on training data. It should return the component object.\n",
"\n",
"- `transform()` - after fitting a component, the `transform()` method will take in new data and transform accordingly. Note: a component must call `fit()` before `transform()`.\n",
"- `transform()` - after fitting a component, the `transform()` method will take in new data and transform accordingly. It should return a Woodwork DataTable. Note: a component must call `fit()` before `transform()`.\n",
"\n",
"You can also call or override `fit_transform()` that combines `fit()` and `transform()` into one method."
]
@@ -252,14 +264,14 @@
" name = \"Baseline Regressor\"\n",
" hyperparameter_ranges = {}\n",
" model_family = ModelFamily.BASELINE\n",
" supported_problem_types = [ProblemTypes.REGRESSION]\n",
" supported_problem_types = [ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION]\n",
"\n",
" def __init__(self, strategy=\"mean\", random_state=0, **kwargs):\n",
" \"\"\"Baseline regressor that uses a simple strategy to make predictions.\n",
"\n",
" Arguments:\n",
" strategy (str): Method used to predict. Valid options are \"mean\", \"median\". Defaults to \"mean\".\n",
" random_state (int): Seed for the random number generator\n",
" random_state (int): Seed for the random number generator. Defaults to 0.\n",
"\n",
" \"\"\"\n",
" if strategy not in [\"mean\", \"median\"]:\n",
@@ -276,9 +288,9 @@
" def fit(self, X, y=None):\n",
" if y is None:\n",
" raise ValueError(\"Cannot fit Baseline regressor if y is None\")\n",
"\n",
" if not isinstance(y, pd.Series):\n",
" y = pd.Series(y)\n",
" X = _convert_to_woodwork_structure(X)\n",
" y = _convert_to_woodwork_structure(y)\n",
" y = _convert_woodwork_types_wrapper(y.to_series())\n",
"\n",
" if self.parameters[\"strategy\"] == \"mean\":\n",
" self._prediction_value = y.mean()\n",
@@ -288,7 +300,9 @@
" return self\n",
"\n",
" def predict(self, X):\n",
" return pd.Series([self._prediction_value] * len(X))\n",
" X = _convert_to_woodwork_structure(X)\n",
" predictions = pd.Series([self._prediction_value] * len(X))\n",
" return _convert_to_woodwork_structure(predictions)\n",
"\n",
" @property\n",
" def feature_importance(self):\n",
@@ -298,7 +312,7 @@
" np.ndarray (float): An array of zeroes\n",
"\n",
" \"\"\"\n",
" return np.zeros(self._num_features)"
" return np.zeros(self._num_features)\n"
]
},
{
@@ -402,45 +416,6 @@
"AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component's `hyperparameter_ranges` class attribute. Any component parameter you add an entry for in `hyperparameter_ranges` will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression as SKLinearRegression\n",
"\n",
"from evalml.model_family import ModelFamily\n",
"from evalml.pipelines.components.estimators import Estimator\n",
"from evalml.problem_types import ProblemTypes\n",
"\n",
"class LinearRegressor(Estimator):\n",
" \"\"\"Linear Regressor.\"\"\"\n",
" name = \"Linear Regressor\"\n",
" hyperparameter_ranges = {\n",
" 'fit_intercept': [True, False],\n",
" 'normalize': [True, False]\n",
" }\n",
" model_family = ModelFamily.LINEAR_MODEL\n",
" supported_problem_types = [ProblemTypes.REGRESSION]\n",
"\n",
" def __init__(self, fit_intercept=True, normalize=False, n_jobs=-1, random_state=0, **kwargs):\n",
" parameters = {\n",
" 'fit_intercept': fit_intercept,\n",
" 'normalize': normalize,\n",
" 'n_jobs': n_jobs\n",
" }\n",
" parameters.update(kwargs)\n",
" linear_regressor = SKLinearRegression(**parameters)\n",
" super().__init__(parameters=parameters,\n",
" component_obj=linear_regressor,\n",
" random_state=random_state)\n",
"\n",
" @property\n",
" def feature_importance(self):\n",
" return self._component_obj.coef_"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -471,8 +446,7 @@
"outputs": [],
"source": [
"# this string can then be copy and pasted into a separate window and executed as python code\n",
"exec(code)\n",
"logisticRegressionClassifier"
"exec(code)"
]
},
{
Expand All @@ -481,60 +455,10 @@
"metadata": {},
"outputs": [],
"source": [
"# custom component\n",
"from evalml.pipelines.components import Transformer\n",
"import pandas as pd\n",
"# We can also do this for custom components\n",
"from evalml.pipelines.components.utils import generate_component_code\n",
"\n",
"class MyDropNullColumns(Transformer):\n",
" \"\"\"Transformer to drop features whose percentage of NaN values exceeds a specified threshold\"\"\"\n",
" name = \"My Drop Null Columns Transformer\"\n",
" hyperparameter_ranges = {}\n",
"\n",
" def __init__(self, pct_null_threshold=1.0, random_state=0, **kwargs):\n",
" \"\"\"Initalizes an transformer to drop features whose percentage of NaN values exceeds a specified threshold.\n",
"\n",
" Arguments:\n",
" pct_null_threshold(float): The percentage of NaN values in an input feature to drop.\n",
" Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.\n",
" If equal to 1.0, will drop columns with all null values. Defaults to 0.95.\n",
" \"\"\"\n",
" if pct_null_threshold < 0 or pct_null_threshold > 1:\n",
" raise ValueError(\"pct_null_threshold must be a float between 0 and 1, inclusive.\")\n",
" parameters = {\"pct_null_threshold\": pct_null_threshold}\n",
" parameters.update(kwargs)\n",
"\n",
" self._cols_to_drop = None\n",
" super().__init__(parameters=parameters,\n",
" component_obj=None,\n",
" random_state=random_state)\n",
"\n",
" def fit(self, X, y=None):\n",
" pct_null_threshold = self.parameters[\"pct_null_threshold\"]\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" percent_null = X.isnull().mean()\n",
" if pct_null_threshold == 0.0:\n",
" null_cols = percent_null[percent_null > 0]\n",
" else:\n",
" null_cols = percent_null[percent_null >= pct_null_threshold]\n",
" self._cols_to_drop = list(null_cols.index)\n",
" return self\n",
"\n",
" def transform(self, X, y=None):\n",
" \"\"\"Transforms data X by dropping columns that exceed the threshold of null values.\n",
" Arguments:\n",
" X (pd.DataFrame): Data to transform\n",
" y (pd.Series, optional): Targets\n",
" Returns:\n",
" pd.DataFrame: Transformed X\n",
" \"\"\"\n",
"\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" return X.drop(columns=self._cols_to_drop, axis=1)\n",
" \n",
"myDropNull = MyDropNullColumns()\n",
"myDropNull = DropNullColumns()\n",
"print(generate_component_code(myDropNull))"
]
},
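As a quick check of the behavior documented in the components notebook above, here is a hedged usage sketch. It uses evalml's built-in `DropNullColumns` component with illustrative data rather than the notebook's custom class.

```python
# Hedged sketch: dropping mostly-null columns and getting a Woodwork DataTable back.
import numpy as np
import pandas as pd
from evalml.pipelines.components import DropNullColumns

X = pd.DataFrame({
    "no_nulls": [1, 2, 3],
    "some_nulls": [1.0, np.nan, 3.0],
    "all_nulls": [np.nan, np.nan, np.nan],
})

transformer = DropNullColumns(pct_null_threshold=0.5)
X_t = transformer.fit_transform(X)        # a ww.DataTable after this change
print(list(X_t.to_dataframe().columns))   # expected: ['no_nulls', 'some_nulls']
```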
11 changes: 6 additions & 5 deletions docs/source/user_guide/pipelines.ipynb
@@ -389,6 +389,7 @@
"from evalml.pipelines.utils import generate_pipeline_code\n",
"from evalml.pipelines import MulticlassClassificationPipeline\n",
"import pandas as pd\n",
"from evalml.utils import _convert_to_woodwork_structure, _convert_woodwork_types_wrapper\n",
"\n",
"class MyDropNullColumns(Transformer):\n",
" \"\"\"Transformer to drop features whose percentage of NaN values exceeds a specified threshold\"\"\"\n",
@@ -415,8 +416,8 @@
"\n",
" def fit(self, X, y=None):\n",
" pct_null_threshold = self.parameters[\"pct_null_threshold\"]\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" X = _convert_to_woodwork_structure(X)\n",
" X = _convert_woodwork_types_wrapper(X.to_dataframe())\n",
" percent_null = X.isnull().mean()\n",
" if pct_null_threshold == 0.0:\n",
" null_cols = percent_null[percent_null > 0]\n",
@@ -434,9 +435,9 @@
" pd.DataFrame: Transformed X\n",
" \"\"\"\n",
"\n",
" if not isinstance(X, pd.DataFrame):\n",
" X = pd.DataFrame(X)\n",
" return X.drop(columns=self._cols_to_drop, axis=1)\n",
" X = _convert_to_woodwork_structure(X)\n",
" return X.drop(columns=self._cols_to_drop)\n",
"\n",
"\n",
"class CustomPipeline(MulticlassClassificationPipeline):\n",
" name = \"Custom Pipeline\"\n",
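The pipelines notebook above imports `generate_pipeline_code`; for reference, a hedged sketch of how it is typically called. The `TinyPipeline` class here is illustrative, not the notebook's `CustomPipeline`.

```python
# Hedged sketch of generate_pipeline_code; the pipeline definition is illustrative.
from evalml.pipelines import MulticlassClassificationPipeline
from evalml.pipelines.utils import generate_pipeline_code

class TinyPipeline(MulticlassClassificationPipeline):
    component_graph = ["Imputer", "Random Forest Classifier"]

code = generate_pipeline_code(TinyPipeline(parameters={}))
print(code)  # a Python snippet that can be pasted elsewhere to recreate the pipeline
```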
2 changes: 1 addition & 1 deletion evalml/automl/automl_algorithm/automl_algorithm.py
@@ -25,7 +25,7 @@ def __init__(self,
allowed_pipelines (list(class)): A list of PipelineBase subclasses indicating the pipelines allowed in the search. The default of None indicates all pipelines for this problem type are allowed.
max_iterations (int): The maximum number of iterations to be evaluated.
tuner_class (class): A subclass of Tuner, to be used to find parameters for each pipeline. The default of None indicates the SKOptTuner will be used.
random_state (int): The random seed. Defaults to 0.
random_state (int): Seed for the random number generator. Defaults to 0.
"""
self.random_state = get_random_seed(random_state)
self.allowed_pipelines = allowed_pipelines or []
2 changes: 1 addition & 1 deletion evalml/automl/automl_algorithm/iterative_algorithm.py
@@ -31,7 +31,7 @@ def __init__(self,
allowed_pipelines (list(class)): A list of PipelineBase subclasses indicating the pipelines allowed in the search. The default of None indicates all pipelines for this problem type are allowed.
max_iterations (int): The maximum number of iterations to be evaluated.
tuner_class (class): A subclass of Tuner, to be used to find parameters for each pipeline. The default of None indicates the SKOptTuner will be used.
random_state (int): The random seed. Defaults to 0.
random_state (int): Seed for the random number generator. Defaults to 0.
pipelines_per_batch (int): The number of pipelines to be evaluated in each batch, after the first batch.
n_jobs (int or None): Non-negative integer describing level of parallelism used for pipelines.
number_features (int): The number of columns in the input features.
13 changes: 5 additions & 8 deletions evalml/automl/automl_search.py
@@ -157,7 +157,7 @@ def __init__(self,
additional_objectives (list): Custom set of objectives to score on.
Will override default objectives for problem type if not empty.
random_state (int): The random seed. Defaults to 0.
random_state (int): Seed for the random number generator. Defaults to 0.
n_jobs (int or None): Non-negative integer describing level of parallelism used for pipelines.
None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.
@@ -566,8 +566,8 @@ def _tune_binary_threshold(self, pipeline, X_threshold_tuning, y_threshold_tunin
Arguments:
pipeline (Pipeline): Pipeline instance to threshold
X_threshold_tuning (ww DataTable): X data to tune pipeline to
y_threshold_tuning (ww DataColumn): Target data to tune pipeline to
X_threshold_tuning (ww.DataTable): X data to tune pipeline to
y_threshold_tuning (ww.DataColumn): Target data to tune pipeline to
Returns:
Trained pipeline instance
@@ -576,10 +576,7 @@ def _tune_binary_threshold(self, pipeline, X_threshold_tuning, y_threshold_tunin
pipeline.threshold = 0.5
if X_threshold_tuning:
y_predict_proba = pipeline.predict_proba(X_threshold_tuning)
if isinstance(y_predict_proba, pd.DataFrame):
y_predict_proba = y_predict_proba.iloc[:, 1]
else:
y_predict_proba = y_predict_proba[:, 1]
y_predict_proba = y_predict_proba.iloc[:, 1]
pipeline.threshold = self.objective.optimize_threshold(y_predict_proba, y_threshold_tuning, X=X_threshold_tuning)
return pipeline

@@ -849,7 +846,7 @@ def get_pipeline(self, pipeline_id, random_state=0):
Arguments:
pipeline_id (int): pipeline to retrieve
random_state (int): The random seed. Defaults to 0.
random_state (int): Seed for the random number generator. Defaults to 0.
Returns:
PipelineBase: untrained pipeline instance associated with the provided ID
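The `_tune_binary_threshold` hunk above selects the positive-class probability column and hands it to the objective's `optimize_threshold`. Here is a hedged standalone sketch of that call; the `F1` objective and the sample data are illustrative.

```python
# Hedged sketch of the threshold-optimization step used in _tune_binary_threshold above.
import pandas as pd
from evalml.objectives import F1

y_true = pd.Series([0, 1, 1, 0, 1, 0])
y_pred_proba = pd.Series([0.2, 0.8, 0.6, 0.4, 0.7, 0.3])  # positive-class probabilities

objective = F1()
threshold = objective.optimize_threshold(y_pred_proba, y_true)
print(threshold)  # decision threshold that maximizes F1 on this sample
```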
16 changes: 8 additions & 8 deletions evalml/automl/utils.py
@@ -39,17 +39,17 @@ def make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=
"""Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.
Arguments:
X (pd.DataFrame, ww.DataTable): The input training data of shape [n_samples, n_features].
y (pd.Series, ww.DataColumn): The target training data of length [n_samples].
problem_type (ProblemType): the type of machine learning problem.
X (ww.DataTable, pd.DataFrame): The input training data of shape [n_samples, n_features].
y (ww.DataColumn, pd.Series): The target training data of length [n_samples].
problem_type (ProblemType): The type of machine learning problem.
problem_configuration (dict, None): Additional parameters needed to configure the search. For example,
in time series problems, values should be passed in for the gap and max_delay variables.
n_splits (int, None): the number of CV splits, if applicable. Default 3.
shuffle (bool): whether or not to shuffle the data before splitting, if applicable. Default True.
random_state (int): The random seed. Defaults to 0.
in time series problems, values should be passed in for the gap and max_delay variables. Defaults to None.
n_splits (int, None): The number of CV splits, if applicable. Defaults to 3.
shuffle (bool): Whether or not to shuffle the data before splitting, if applicable. Defaults to True.
random_state (int): Seed for the random number generator. Defaults to 0.
Returns:
sklearn.model_selection.BaseCrossValidator: data splitting method.
sklearn.model_selection.BaseCrossValidator: Data splitting method.
"""
problem_type = handle_problem_types(problem_type)
data_splitter = None
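To round out the `make_data_splitter` docstring changes above, a hedged usage sketch with illustrative data:

```python
# Hedged sketch of make_data_splitter as documented above.
import pandas as pd
from evalml.automl.utils import make_data_splitter

X = pd.DataFrame({"a": range(12), "b": range(12)})
y = pd.Series([0, 1] * 6)

splitter = make_data_splitter(X, y, problem_type="binary", n_splits=3)
print(splitter)  # an sklearn-compatible cross-validator chosen for the problem type
```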
