Fit fails when input data has categorical columns #970

rabsr · 2020-10-05T13:21:55Z

Describe the bug

auto-sklearn fails when the input data contains categorical columns. I changed example_pandas_train_test.py to use the OpenML dataset with data_id 1558 and updated the lists of categorical and numerical columns accordingly.
Changes made:

X, y = sklearn.datasets.fetch_openml(data_id=1558, return_X_y=True, as_frame=False)
X = pd.DataFrame(
    data=X,
    columns=['V' + str(i) for i in range(1, 17)]
)
desired_boolean_columns = ['']
desired_categorical_columns = ['V2', 'V3', 'V4', 'V5', 'V7', 'V8', 'V9', 'V11', 'V16']
desired_numerical_columns = ['V1', 'V6', 'V10', 'V12', 'V13', 'V14', 'V15']

As far as I understand, the categorical columns are not being encoded, which is why fit fails.

Actual behavior, stacktrace or logfile

Traceback (most recent call last):
  File "example_pandas_train_test.py", line 105, in <module>
    cls.fit(X_train, y_train, X_test, y_test)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/estimators.py", line 587, in fit
    dataset_name=dataset_name,
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/estimators.py", line 346, in fit
    self.automl_.fit(load_models=True, **kwargs)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/automl.py", line 1154, in fit
    load_models=load_models,
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/automl.py", line 607, in fit
    _proc_smac.run_smbo()
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/smbo.py", line 374, in run_smbo
    metalearning_configurations = self.get_metalearning_suggestions()
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/smbo.py", line 584, in get_metalearning_suggestions
    self.datamanager.perform1HotEncoding()
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/data/abstract_data_manager.py", line 94, in perform1HotEncoding
    data=data)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/data/abstract_data_manager.py", line 28, in perform_one_hot_encoding
    rvals.append(encoder.fit_transform(data[0]))
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/data_preprocessing.py", line 117, in fit_transform
    return self.fit(X, y).transform(X)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/data_preprocessing.py", line 105, in fit
    self.column_transformer.fit(X)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 484, in fit
    self.fit_transform(X, y=y)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 518, in fit_transform
    result = self._fit_transform(X, y, _fit_transform_one)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 457, in _fit_transform
    self._iter(fitted=fitted, replace_strings=True), 1))
  File "Desktop/mls/lib/python3.7/site-packages/joblib/parallel.py", line 1029, in __call__
    if self.dispatch_one_batch(iterator):
  File "Desktop/mls/lib/python3.7/site-packages/joblib/parallel.py", line 847, in dispatch_one_batch
    self._dispatch(tasks)
  File "Desktop/mls/lib/python3.7/site-packages/joblib/parallel.py", line 765, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "Desktop/mls/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 206, in apply_async
    result = ImmediateResult(func)
  File "Desktop/mls/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 570, in __init__
    self.results = batch()
  File "Desktop/mls/lib/python3.7/site-packages/joblib/parallel.py", line 253, in __call__
    for func, args, kwargs in self.items]
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/pipeline.py", line 728, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/pipeline.py", line 385, in fit_transform
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/pipeline.py", line 315, in _fit
    **fit_params_steps[name])
  File "Desktop/mls/lib/python3.7/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/pipeline.py", line 728, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/category_shift/category_shift.py", line 30, in fit_transform
    return self.fit(X, y).transform(X)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/category_shift/category_shift.py", line 21, in fit
    self.preprocessor.fit(X, y)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/implementations/CategoryShift.py", line 26, in fit
    self._convert_and_check_X(X)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/implementations/CategoryShift.py", line 16, in _convert_and_check_X
    raise ValueError('Categories should be non-negative numbers. '
ValueError: Categories should be non-negative numbers. NOTE: floats will be casted to integers.

Environment and installation:

Please give details about your installation:

Is your installation in a virtual environment or conda environment? virtual environment
Python version: 3.7
Auto-sklearn version: 0.10.0

mfeurer · 2020-10-07T15:21:21Z

Thanks a lot for the bug report. I can reproduce it and have proposed a fix in #975.

When encoding a pandas array in autosklearn.data.validator, the columns are re-ordered by the ColumnTransformer. This PR re-orders the feature types so that when passing the data to the actual ML pipeline, columns and feature types are sorted the same way.
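The reordering described above can be reproduced with scikit-learn alone. The following is a minimal sketch, not the code from autosklearn.data.validator or from #975: the toy frame, the `feat_type` list, and the use of `OrdinalEncoder` are assumptions made for illustration. It shows that a `ColumnTransformer` emits the transformed columns first and the passthrough columns after them, so a feature-type list kept in the original column order no longer lines up unless it is sorted the same way.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy frame (made-up values): numerical and categorical columns are interleaved.
X = pd.DataFrame({
    "V1": [58.0, 44.0, 33.0],
    "V2": ["management", "technician", "management"],
    "V6": [2143.0, 29.0, 2.0],
    "V4": ["tertiary", "secondary", "secondary"],
})
# Hypothetical feature-type list, given in the original column order.
feat_type = ["numerical", "categorical", "numerical", "categorical"]

categorical = [c for c, t in zip(X.columns, feat_type) if t == "categorical"]

# Encode only the categorical columns; everything else is passed through.
ct = ColumnTransformer(
    [("cat", OrdinalEncoder(), categorical)],
    remainder="passthrough",
)
X_encoded = ct.fit_transform(X)

# The output order is: transformed columns first, then the remainder.
new_order = categorical + [c for c in X.columns if c not in categorical]

# Sort the feature types the same way so they match the encoded columns again.
type_of = dict(zip(X.columns, feat_type))
sorted_feat_type = [type_of[c] for c in new_order]

print(new_order)         # column order after encoding
print(sorted_feat_type)  # feature types aligned with that order
```

If the unsorted `feat_type` were used with `X_encoded`, the first column (the encoded V2) would be treated as numerical and the second column (the encoded V4) as categorical, and a numerical column with negative values could end up being treated as categorical.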

mfeurer · 2020-11-03T10:25:53Z

This was fixed by #975

ricoms · 2021-04-19T14:54:08Z

I'm still running into this problem, but I can't provide a small working example. I have a dataset of around 200,000 instances, which I split into train and test sets. I managed to train a model on the train dataset, but when I later loaded the joblib model artifact and tried to predict on the test dataset, I got this error.

I'm trying to understand whether something is wrong with my test set, but I did not find negative values in the columns that I set as categorical. I do have negative values in numeric columns, but I don't think that should be a problem. Any hints or tips on what I should look for?

Thanks in advance.
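One way to narrow this down is to inspect the categorical columns of the test set directly. The sketch below uses only pandas and NumPy; the helper name `find_suspect_categorical_columns` and the toy frames are hypothetical. The checks are an assumption based on the wording of the error message (categories must be non-negative numbers), not a copy of auto-sklearn's internal validation.

```python
import numpy as np
import pandas as pd


def find_suspect_categorical_columns(X_train, X_test, categorical_columns):
    """Return {column: [problems]} for categorical columns of X_test that look suspicious."""
    report = {}
    for col in categorical_columns:
        problems = []
        values = X_test[col]
        if values.dtype.kind not in "buif":
            # Strings, objects and pandas categoricals are not plain numbers.
            problems.append(f"non-numeric dtype {values.dtype}")
        elif np.nanmin(values.to_numpy()) < 0:
            problems.append("negative values")
        # Values the model never saw during training are also worth a look.
        seen = set(X_train[col].dropna().unique().tolist())
        unseen = set(values.dropna().unique().tolist()) - seen
        if unseen:
            problems.append(f"values not seen in training: {sorted(unseen, key=str)}")
        if problems:
            report[col] = problems
    return report


# Toy data (made up): "c" is categorical, "n" is numerical.
train = pd.DataFrame({"c": [0, 1, 2], "n": [-1.0, 2.0, 3.0]})
test = pd.DataFrame({"c": [1, -1], "n": [0.5, -2.0]})

report = find_suspect_categorical_columns(train, test, ["c"])
same_column_order = list(train.columns) == list(test.columns)
print(report)
print(same_column_order)
```

Comparing the column order of the two frames is included because the fix discussed earlier in this thread concerned columns and feature types getting out of sync; if the test frame's columns are ordered differently from the training frame's, a numerical column with negative values might be interpreted as categorical.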

mfeurer · 2021-04-21T21:59:28Z

Could you please:

- open a new issue
- give us the version of Auto-sklearn you're using
- reduce the training set to a small dataset that still trains Auto-sklearn, and the test set to two or so examples that make predict fail
- provide a small example that demonstrates how your code fails

This will allow us to debug the problem you're facing.

mfeurer mentioned this issue Oct 7, 2020

Fix #970, sort feature types #975

Merged

mfeurer closed this as completed Nov 3, 2020

vopani mentioned this issue May 4, 2021

Predict fails with category error #1141

Closed
