Automl fails with custom pipelines if data contains datetime feature(s) #1367

Closed
dsherry opened this issue Oct 29, 2020 · 1 comment
Labels: bug (Issues tracking problems with existing features.)
Milestone: November 2020

dsherry commented Oct 29, 2020

Background

We're in an awkward intermediate stage of adding woodwork support (#1229). AutoML search now supports woodwork, but pipelines, components, make_pipeline, and other utils still expect pandas DataFrames.

Another important fact here: creating a woodwork DataTable modifies the dtypes of the underlying pandas DataFrame.

Problem

The following code, borrowed from @bchen1116's docs additions in PR #1284 with some modifications for simplicity, doesn't work:

import evalml
X, y = evalml.demos.load_fraud(n_rows=1000)

estimators = evalml.pipelines.components.utils.get_estimators('binary', [evalml.model_family.ModelFamily.EXTRA_TREES])
pipelines = [evalml.pipelines.utils.make_pipeline(X, y, estimator, 'binary') for estimator in estimators]
automl = evalml.automl.AutoMLSearch(problem_type='binary', allowed_pipelines=pipelines)
automl.search(X, y)

It errors out with:

AutoMLSearchException: All pipelines in the current AutoML batch produced a score of np.nan on the primary objective <evalml.objectives.standard_metrics.LogLossBinary object at 0x13ea47700>.

Reason: make_pipeline only sees the raw pandas DataFrame, not woodwork's inferred types. Because the datetime feature comes in with dtype "object" (string), make_pipeline doesn't know there's a datetime feature and doesn't add the DateTimeFeaturizer.

Then, in automl search, which currently uses woodwork DataTables internally, the datetime feature gets converted to datetime64. When the pipeline is evaluated on that data, it breaks: the StandardScaler tries to scale a datetime64 column and errors out.
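
The dtype mismatch described above can be reproduced with pandas alone. This is only an illustrative sketch: the two-column frame is made up, and pd.to_datetime is an assumed stand-in for the conversion woodwork performs inside automl search. It does not call evalml or woodwork.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

# Made-up stand-in for the fraud data: one numeric and one date-string column.
X = pd.DataFrame({
    "amount": [10.0, 25.5, 3.2],
    "datetime": ["2020-10-01 12:00:00", "2020-10-02 08:30:00", "2020-10-03 17:45:00"],
})

# What a dtype check sees when the pipeline is built: strings, not datetimes.
detected_before = is_datetime64_any_dtype(X["datetime"])

# Assumption: pd.to_datetime mimics the type conversion done during search.
X_search = X.copy()
X_search["datetime"] = pd.to_datetime(X_search["datetime"])
detected_after = is_datetime64_any_dtype(X_search["datetime"])

# The converted column is not numeric, so a scaler cannot meaningfully scale it.
scalable = is_numeric_dtype(X_search["datetime"])

print(detected_before, detected_after, scalable)
```

The pipeline is built against the first view of the data and evaluated against the second, which is how a datetime64 column can reach a scaler with no featurizer in front of it.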

Workaround

The following code does work:

import evalml
import woodwork as ww

X, y = evalml.demos.load_fraud(n_rows=1000)
# Side effect: building the DataTable converts X's datetime column to datetime64
X_dt = ww.DataTable(X)
y_dc = ww.DataColumn(y)

estimators = evalml.pipelines.components.utils.get_estimators('binary', [evalml.model_family.ModelFamily.EXTRA_TREES])
# X now carries a datetime64 column, so make_pipeline adds a DateTimeFeaturizer
pipelines = [evalml.pipelines.utils.make_pipeline(X, y, estimator, 'binary') for estimator in estimators]
automl = evalml.automl.AutoMLSearch(problem_type='binary', allowed_pipelines=pipelines)
automl.search(X, y)

That's because calling X_dt = ww.DataTable(X) modifies the underlying pandas DataFrame so that the datetime feature has dtype datetime64. make_pipeline then correctly adds a DateTimeFeaturizer to the pipeline, which generates datetime-derived features and drops the original datetime column. So, by the time the StandardScaler runs, no datetime feature is passed in.

Fix

Short-term: either use the workaround, or update StandardScaler to only be applied to numeric features, i.e. those with type int/float in the pandas DF.

Long-term: update StandardScaler and make_pipeline to use woodwork! StandardScaler should apply to any numeric feature, and make_pipeline should check whether there are any datetime-typed features and use that to decide whether to add a DateTimeFeaturizer.

This is why we're moving towards using woodwork: so the woodwork DataTable is a single source of truth on the type of each feature.
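
A rough sketch of the type-driven decision described in the long-term fix. Everything here is hypothetical: choose_preprocessing_steps is not an evalml function, the step names are plain strings rather than real components, and pandas dtypes stand in for woodwork's types. It only shows the idea of deriving the steps from one typed view of the data.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def choose_preprocessing_steps(X):
    """Hypothetical helper: pick preprocessing steps from column types."""
    steps = []
    # Add a datetime featurizer only if some column is actually datetime-typed.
    if any(is_datetime64_any_dtype(X[col]) for col in X.columns):
        steps.append("datetime_featurizer")
    # Add a scaler only if there is at least one numeric column to scale.
    if len(X.select_dtypes(include="number").columns) > 0:
        steps.append("standard_scaler")
    return steps


# The same made-up data, once with string dates and once with typed dates.
X_strings = pd.DataFrame({
    "amount": [10.0, 25.5],
    "datetime": ["2020-10-01", "2020-10-02"],
})
X_typed = X_strings.copy()
X_typed["datetime"] = pd.to_datetime(X_typed["datetime"])

steps_for_strings = choose_preprocessing_steps(X_strings)
steps_for_typed = choose_preprocessing_steps(X_typed)
print(steps_for_strings, steps_for_typed)
```

The two calls disagree only because the inputs disagree on types. If both pipeline construction and search read types from the same source, that disagreement cannot arise.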

Next steps

My recommendation for this issue: make the update to StandardScaler now. We'll have to make a similar update for #1229 anyway, so it won't be wasted work.
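
For illustration, restricting scaling to numeric columns might look roughly like the sketch below. This is not evalml's StandardScaler: scale_numeric_only is a hypothetical helper written with plain pandas, using the population standard deviation (ddof=0) as an assumption about how standardization is defined.

```python
import pandas as pd


def scale_numeric_only(X):
    """Hypothetical helper: standardize numeric columns, leave others untouched."""
    X_out = X.copy()
    # select_dtypes(include="number") picks int/float columns and skips datetime64.
    numeric_cols = X_out.select_dtypes(include="number").columns
    for col in numeric_cols:
        mean = X_out[col].mean()
        std = X_out[col].std(ddof=0)
        # Guard against constant columns to avoid dividing by zero.
        if std == 0:
            std = 1.0
        X_out[col] = (X_out[col] - mean) / std
    return X_out


# Made-up example data with one numeric and one datetime64 column.
X_example = pd.DataFrame({
    "amount": [1.0, 2.0, 3.0],
    "datetime": pd.to_datetime(["2020-10-01", "2020-10-02", "2020-10-03"]),
})
X_scaled = scale_numeric_only(X_example)
```

With this kind of guard, a datetime64 column that slips through passes by unscaled instead of raising, although a pipeline would still need a featurizer to make use of it.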

dsherry added the bug label Oct 29, 2020
dsherry added this to the November 2020 milestone Oct 29, 2020
@angela97lin

Closed via #1393
