Automl fails with custom pipelines if data contains datetime feature(s) #1367

dsherry · 2020-10-29T19:09:25Z

Background

We're in a funny intermediate stage with adding support for woodwork in #1229. We've added support to automl search, but pipelines, components, make_pipeline and other utils still expect pandas DataFrames.

Another fact which is important here: creating a woodwork DataTable will modify the types of the underlying pandas DataFrame.

Problem

The following code, borrowed from @bchen1116's docs additions in PR #1284 and simplified, doesn't work:

import evalml
X, y = evalml.demos.load_fraud(n_rows=1000)

estimators = evalml.pipelines.components.utils.get_estimators('binary', [evalml.model_family.ModelFamily.EXTRA_TREES])
pipelines = [evalml.pipelines.utils.make_pipeline(X, y, estimator, 'binary') for estimator in estimators]
automl = evalml.automl.AutoMLSearch(problem_type='binary', allowed_pipelines=pipelines)
automl.search(X, y)

errors out with

AutoMLSearchException: All pipelines in the current AutoML batch produced a score of np.nan on the primary objective <evalml.objectives.standard_metrics.LogLossBinary object at 0x13ea47700>.

Reason: make_pipeline only sees the raw pandas DataFrame, not the woodwork types. Because the datetime feature comes in with type "object" (string), make_pipeline won't know there's a datetime feature, and won't add the DateTimeFeaturizer.

Then, in automl search, because we use woodwork DataTables in there right now, the datetime feature will get converted to type datetime64. When the pipeline is evaluated on that data, it'll break, because the StandardScaler will try to apply scaling to a datetime64-type column and will error out.
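The type mismatch itself can be reproduced with pandas alone. The sketch below is not evalml or woodwork code: `has_datetime_feature` is a hypothetical stand-in for the check make_pipeline performs, and `pd.to_datetime` stands in for the conversion woodwork applies inside automl search.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def has_datetime_feature(df):
    """Hypothetical stand-in for make_pipeline's check: is any column datetime-typed?"""
    return any(is_datetime64_any_dtype(df[col]) for col in df.columns)


# Datetime values stored as strings, as described for the fraud demo data
X = pd.DataFrame({
    "amount": [10.0, 25.5, 7.25],
    "datetime": ["2019-01-01 00:12:26", "2019-01-01 09:42:03", "2019-01-02 17:30:00"],
})

# At pipeline-construction time the column is still a string column
seen_at_construction = has_datetime_feature(X)

# Stand-in for the type conversion that happens later, inside automl search
X_search = X.copy()
X_search["datetime"] = pd.to_datetime(X_search["datetime"])
seen_at_search = has_datetime_feature(X_search)

print(seen_at_construction, seen_at_search)  # False True
```

The pipeline is built against the first view of the data and evaluated against the second, which is why the DateTimeFeaturizer is missing when the datetime64 column reaches the StandardScaler.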

Workaround

The following code does work:

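import evalml
import woodwork as ww

X, y = evalml.demos.load_fraud(n_rows=1000)
# Side effect: building the DataTable converts the datetime column of X to datetime64
X_dt = ww.DataTable(X)
y_dc = ww.DataColumn(y)

# ModelFamily is fully qualified here; the bare name was never imported
estimators = evalml.pipelines.components.utils.get_estimators('binary', [evalml.model_family.ModelFamily.EXTRA_TREES])
# X now carries the datetime64 type, so make_pipeline adds the DateTimeFeaturizer
pipelines = [evalml.pipelines.utils.make_pipeline(X, y, estimator, 'binary') for estimator in estimators]
automl = evalml.automl.AutoMLSearch(problem_type='binary', allowed_pipelines=pipelines)
automl.search(X, y)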

That's because calling X_dt = ww.DataTable(X) will modify the underlying pandas DataFrame to have type datetime64 for the datetime feature. make_pipeline will then correctly add a DateTimeFeaturizer to the pipeline, which will generate features from the datetime column and delete the original datetime feature. So, by the time the StandardScaler is run, there won't be a datetime feature passed in.
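To make the "generate features, then delete the original" step concrete, here is a minimal pandas sketch. It is not evalml's DateTimeFeaturizer: `featurize_datetime` is a hypothetical helper, and the extracted parts (year, month, day of week, hour) are an assumption chosen for illustration.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def featurize_datetime(df):
    """Replace each datetime64 column with numeric parts (hypothetical sketch)."""
    out = df.copy()
    for col in df.columns:
        if not is_datetime64_any_dtype(df[col]):
            continue
        out[f"{col}_year"] = df[col].dt.year
        out[f"{col}_month"] = df[col].dt.month
        out[f"{col}_day_of_week"] = df[col].dt.dayofweek  # Monday == 0
        out[f"{col}_hour"] = df[col].dt.hour
        # Drop the original column so later steps never see datetime64 data
        out = out.drop(columns=[col])
    return out


X = pd.DataFrame({
    "amount": [10.0, 25.5],
    "datetime": pd.to_datetime(["2019-01-01 00:12:26", "2019-03-05 14:00:00"]),
})
features = featurize_datetime(X)
print(list(features.columns))
```

After this step every remaining column is numeric, so a scaler placed later in the pipeline has nothing datetime-typed to trip over.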

Fix

Short-term: either use the workaround, or update StandardScaler to only be applied to numeric features, i.e. those with type int/float in the pandas DF.
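A rough sketch of what "only apply to numeric features" could look like, using pandas only. `scale_numeric_only` is a hypothetical helper, not the actual StandardScaler component, and it assumes population standard deviation (ddof=0) for the scaling.

```python
import pandas as pd


def scale_numeric_only(df):
    """Standardize int/float columns; pass every other column through unchanged."""
    out = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    for col in numeric_cols:
        mean = df[col].mean()
        std = df[col].std(ddof=0)
        # Guard against constant columns to avoid dividing by zero
        out[col] = (df[col] - mean) / (std if std > 0 else 1.0)
    return out


X = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0],
    "datetime": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
})
scaled = scale_numeric_only(X)
print(scaled.dtypes)
```

The datetime column passes through untouched instead of raising, so a pipeline built without a DateTimeFeaturizer would at least not fail at the scaling step.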

Long-term: update StandardScaler, and make_pipeline, to use woodwork! StandardScaler should apply to any numeric feature, and make_pipeline should check whether there are any datetime-typed features and use that to determine whether to add a DateTimeFeaturizer.

This is why we're moving towards using woodwork: so the woodwork DataTable is a single source of truth on the type of each feature.
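As an illustration of the single-source-of-truth idea, the sketch below drives both decisions from one column-to-type mapping. Everything in it is hypothetical: `select_components` is not an evalml function, the mapping is a plain dict rather than a woodwork DataTable, and the returned strings are just labels for the components named above.

```python
def select_components(logical_types):
    """Pick preprocessing steps from one column -> logical-type mapping (hypothetical)."""
    types = set(logical_types.values())
    components = []
    # Datetime columns must be featurized before any numeric-only step runs
    if "Datetime" in types:
        components.append("DateTimeFeaturizer")
    # Scaling applies only when at least one numeric column exists
    if types & {"Integer", "Double"}:
        components.append("StandardScaler")
    return components


# The same mapping would be consulted at construction time and at evaluation time
fraud_like_types = {"amount": "Double", "datetime": "Datetime", "provider": "Categorical"}
print(select_components(fraud_like_types))  # ['DateTimeFeaturizer', 'StandardScaler']
```

Because pipeline construction and evaluation read the same mapping, the two can no longer disagree about whether a column is a datetime.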

Next steps

My recommendation for this issue: make the update to StandardScaler now, because we'll have to make a similar update for #1229 anyways, so it won't be wasted work.

angela97lin · 2020-11-11T02:46:14Z

Closed via #1393

dsherry added the bug label (Issues tracking problems with existing features.) Oct 29, 2020

dsherry added this to the November 2020 milestone Oct 29, 2020

dsherry mentioned this issue Oct 29, 2020

Freeze Hyperparameters for AutoMLSearch #1284

Merged

angela97lin mentioned this issue Nov 3, 2020

Update pipelines and make_pipelines to accept Woodwork DataTables #1393

Merged

angela97lin closed this as completed Nov 11, 2020
