Background
We're in a funny intermediate stage with adding support for woodwork in #1229. We've added support to automl search, but pipelines, components, make_pipeline and other utils still expect pandas DataFrames.
Another important fact here: creating a woodwork DataTable will modify the types of the underlying pandas DataFrame.
Problem
The following code doesn't work, borrowed from @bchen1116's docs additions in PR #1284 with some modifications for simplicity:
It errors out with:
AutoMLSearchException: All pipelines in the current AutoML batch produced a score of np.nan on the primary objective <evalml.objectives.standard_metrics.LogLossBinary object at 0x13ea47700>.
Reason: make_pipeline doesn't have access to the pandas DF. Because the datetime feature comes in with type "object" (string), make_pipeline won't know there's a datetime feature, and won't add the DateTimeFeaturizer.
Then, in automl search, because we use woodwork DataTables in there right now, the datetime feature will get converted to type datetime64. When the pipeline is evaluated on that data, it'll break, because the StandardScaler will try to apply scaling to a datetime64-type column and will error out.
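The mismatch described above can be sketched without evalml, woodwork, or pandas. This is a library-free illustration, not the actual implementation: `infer_type`, `build_steps`, and `convert_like_datatable` are hypothetical stand-ins for dtype inspection, make_pipeline, and the DataTable conversion, and the step names are plain strings.

```python
from datetime import datetime


def infer_type(values):
    """Classify a column as 'datetime', 'numeric', or 'string' from its values."""
    if all(isinstance(v, datetime) for v in values):
        return "datetime"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "string"


def build_steps(columns):
    """Hypothetical stand-in for make_pipeline: pick steps from the visible types."""
    types = {name: infer_type(vals) for name, vals in columns.items()}
    steps = []
    if "datetime" in types.values():
        steps.append("DateTimeFeaturizer")
    steps.append("StandardScaler")
    return steps


def convert_like_datatable(columns):
    """Hypothetical stand-in for the conversion: parse ISO date strings."""
    converted = {}
    for name, vals in columns.items():
        try:
            converted[name] = [datetime.fromisoformat(v) for v in vals]
        except (TypeError, ValueError):
            # Not a column of ISO date strings, so keep it unchanged.
            converted[name] = list(vals)
    return converted


raw = {
    "amount": [10.0, 12.5, 9.0],
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
}

# The pipeline is planned while the date column still looks like strings.
steps_from_raw = build_steps(raw)

# The search then evaluates on converted data, where the column is a datetime.
searched = convert_like_datatable(raw)
types_at_eval = {name: infer_type(vals) for name, vals in searched.items()}

# Planning from the converted data instead would include the featurizer.
steps_from_converted = build_steps(searched)
```

Here the plan built from the raw data contains only the scaler, while the data seen at evaluation time has a datetime column, which is the combination that breaks.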
Workaround
The code does work if X_dt = ww.DataTable(X) is called before make_pipeline. That's because calling X_dt = ww.DataTable(X) will modify the underlying pandas DataFrame to have type datetime64 for the datetime feature. make_pipeline will then correctly add a DateTimeFeaturizer to the pipeline, which will create datetime-derived features and delete the original datetime feature. So, by the time the StandardScaler is run, there won't be a datetime feature passed in.
Fix
Short-term: either use the workaround, or update StandardScaler to only be applied to numeric features, i.e. those with type int/float in the pandas DF.
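As a rough sketch of the short-term idea, the scaler can skip anything that isn't int/float and pass it through untouched. This is a stdlib-only illustration with columns held in a plain dict; `scale_numeric_only` is a hypothetical helper, not the evalml StandardScaler component.

```python
from datetime import datetime
from statistics import mean, pstdev


def is_numeric(values):
    """True when every value is an int or float (bools excluded)."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)


def scale_numeric_only(columns):
    """Standardize numeric columns; leave every other column unchanged."""
    scaled = {}
    for name, vals in columns.items():
        if not is_numeric(vals):
            scaled[name] = list(vals)  # e.g. datetime columns pass through
            continue
        mu = mean(vals)
        sigma = pstdev(vals) or 1.0  # avoid dividing by zero on constant columns
        scaled[name] = [(v - mu) / sigma for v in vals]
    return scaled


columns = {
    "amount": [10.0, 12.5, 9.0],
    "date": [datetime(2020, 1, 1), datetime(2020, 1, 2), datetime(2020, 1, 3)],
}
out = scale_numeric_only(columns)
```

With this guard, a datetime column that reaches the scaler is ignored rather than causing an error.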
Long-term: update StandardScaler, and make_pipeline, to use woodwork! StandardScaler should apply to any numeric feature, and make_pipeline should check whether there are any datetime-typed features and use that to determine whether to add a DateTimeFeaturizer.
This is why we're moving towards using woodwork: so the woodwork DataTable is a single source of truth on the type of each feature.
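To make the "single source of truth" point concrete, here is a minimal sketch in which both the pipeline planner and the scaler consult the same declared types rather than inspecting raw values. `TypedTable`, `plan_pipeline`, and `columns_to_scale` are hypothetical names for illustration only and are not woodwork or evalml APIs.

```python
from dataclasses import dataclass


@dataclass
class TypedTable:
    """Hypothetical stand-in for a typed table: data plus declared column types."""
    data: dict   # column name -> list of values
    types: dict  # column name -> "numeric", "datetime", or "categorical"


def plan_pipeline(table):
    """Choose steps from the declared types only."""
    steps = []
    if "datetime" in table.types.values():
        steps.append("DateTimeFeaturizer")
    if "numeric" in table.types.values():
        steps.append("StandardScaler")
    return steps


def columns_to_scale(table):
    """The scaler asks the same table which columns are numeric."""
    return [name for name, kind in table.types.items() if kind == "numeric"]


table = TypedTable(
    data={
        "amount": [10.0, 12.5, 9.0],
        # Still stored as strings, but declared as a datetime feature.
        "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    },
    types={"amount": "numeric", "date": "datetime"},
)

plan = plan_pipeline(table)
to_scale = columns_to_scale(table)
```

Because both decisions read `table.types`, the planner and the scaler cannot disagree about whether the date column is a datetime feature, regardless of how the values happen to be stored.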
Next steps
My recommendation for this issue: make the update to StandardScaler now, because we'll have to make a similar update for #1229 anyways, so it won't be wasted work.