Fit fails when input data has categorical columns #970

Closed
rabsr opened this issue Oct 5, 2020 · 4 comments

Comments

@rabsr
Contributor

rabsr commented Oct 5, 2020

Describe the bug

auto-sklearn fails when the input data contains categorical columns. I changed example_pandas_train_test.py to use the OpenML dataset with data_id 1558 and updated the categorical and numerical column lists accordingly.
Changes done:

X, y = sklearn.datasets.fetch_openml(data_id=1558, return_X_y=True, as_frame=False)
X = pd.DataFrame(
    data=X,
    columns=['V' + str(i) for i in range(1, 17)]
)
desired_boolean_columns = ['']
desired_categorical_columns = ['V2', 'V3', 'V4', 'V5', 'V7', 'V8', 'V9', 'V11', 'V16']
desired_numerical_columns = ['V1', 'V6', 'V10', 'V12', 'V13', 'V14', 'V15']

As far as I understand, the categorical columns are not encoded, which is why the fit fails.
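
For readers following along, here is a minimal sketch of how column lists like the ones above could be applied as pandas dtypes before calling fit. The toy frame and the loop are assumptions for illustration, not the contents of example_pandas_train_test.py. It also shows one pandas detail relevant to the error message: a missing value in a categorical column gets the code -1.

```python
import pandas as pd

# Toy stand-in for the OpenML data (assumption): every column arrives as object dtype.
X = pd.DataFrame(
    {
        "V1": ["58", "44", "33"],
        "V2": ["management", "technician", None],
    },
    dtype=object,
)

desired_categorical_columns = ["V2"]
desired_numerical_columns = ["V1"]

# Cast each column to the dtype its list asks for.
for column in X.columns:
    if column in desired_categorical_columns:
        X[column] = X[column].astype("category")
    elif column in desired_numerical_columns:
        X[column] = pd.to_numeric(X[column])

# Integer codes pandas assigns to the categories; missing values map to -1.
codes = X["V2"].cat.codes
print(X.dtypes)
print(codes.tolist())
```

Whether auto-sklearn ever sees such codes directly is not stated in this report; the sketch only illustrates the dtype handling on the pandas side.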

Actual behavior, stacktrace or logfile

Traceback (most recent call last):
  File "example_pandas_train_test.py", line 105, in <module>
    cls.fit(X_train, y_train, X_test, y_test)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/estimators.py", line 587, in fit
    dataset_name=dataset_name,
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/estimators.py", line 346, in fit
    self.automl_.fit(load_models=True, **kwargs)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/automl.py", line 1154, in fit
    load_models=load_models,
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/automl.py", line 607, in fit
    _proc_smac.run_smbo()
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/smbo.py", line 374, in run_smbo
    metalearning_configurations = self.get_metalearning_suggestions()
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/smbo.py", line 584, in get_metalearning_suggestions
    self.datamanager.perform1HotEncoding()
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/data/abstract_data_manager.py", line 94, in perform1HotEncoding
    data=data)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/data/abstract_data_manager.py", line 28, in perform_one_hot_encoding
    rvals.append(encoder.fit_transform(data[0]))
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/data_preprocessing.py", line 117, in fit_transform
    return self.fit(X, y).transform(X)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/data_preprocessing.py", line 105, in fit
    self.column_transformer.fit(X)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 484, in fit
    self.fit_transform(X, y=y)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 518, in fit_transform
    result = self._fit_transform(X, y, _fit_transform_one)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 457, in _fit_transform
    self._iter(fitted=fitted, replace_strings=True), 1))
  File "Desktop/mls/lib/python3.7/site-packages/joblib/parallel.py", line 1029, in __call__
    if self.dispatch_one_batch(iterator):
  File "Desktop/mls/lib/python3.7/site-packages/joblib/parallel.py", line 847, in dispatch_one_batch
    self._dispatch(tasks)
  File "Desktop/mls/lib/python3.7/site-packages/joblib/parallel.py", line 765, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "Desktop/mls/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 206, in apply_async
    result = ImmediateResult(func)
  File "Desktop/mls/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 570, in __init__
    self.results = batch()
  File "Desktop/mls/lib/python3.7/site-packages/joblib/parallel.py", line 253, in __call__
    for func, args, kwargs in self.items]
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/pipeline.py", line 728, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/pipeline.py", line 385, in fit_transform
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/pipeline.py", line 315, in _fit
    **fit_params_steps[name])
  File "Desktop/mls/lib/python3.7/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "Desktop/mls/lib/python3.7/site-packages/sklearn/pipeline.py", line 728, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/category_shift/category_shift.py", line 30, in fit_transform
    return self.fit(X, y).transform(X)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/category_shift/category_shift.py", line 21, in fit
    self.preprocessor.fit(X, y)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/implementations/CategoryShift.py", line 26, in fit
    self._convert_and_check_X(X)
  File "Desktop/mls/lib/python3.7/site-packages/autosklearn/pipeline/implementations/CategoryShift.py", line 16, in _convert_and_check_X
    raise ValueError('Categories should be non-negative numbers. '
ValueError: Categories should be non-negative numbers. NOTE: floats will be casted to integers.

Environment and installation:

Please give details about your installation:

  • Is your installation in a virtual environment or conda environment? virtual environment
  • Python version: 3.7
  • Auto-sklearn version: 0.10.0
mfeurer added a commit that referenced this issue Oct 7, 2020
When encoding a pandas array in autosklearn.data.validator,
the columns are re-ordered by the ColumnTransformer. This PR
re-orders the feature types so that when passing the data to
the actual ML pipeline, columns and feature types are sorted
the same way.
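
The reordering described in this commit message can be reproduced with plain scikit-learn. The sketch below is not the code from autosklearn.data.validator; the toy DataFrame, the feat_type list and the variable names are assumptions made for illustration. It shows that ColumnTransformer emits the columns of its listed transformers first and the passthrough remainder last, so a feature-type list kept in the original column order no longer lines up with the data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy frame (assumption): categorical and numerical columns are interleaved.
X = pd.DataFrame(
    {
        "V1": [1.0, 2.0, 3.0],
        "V2": ["a", "b", "a"],
        "V3": [0.5, 0.1, 0.9],
        "V4": ["x", "x", "y"],
    }
)
feat_type = ["numerical", "categorical", "numerical", "categorical"]

categorical = [c for c, t in zip(X.columns, feat_type) if t == "categorical"]
numerical = [c for c, t in zip(X.columns, feat_type) if t == "numerical"]

# Encode only the categorical columns; everything else is passed through.
ct = ColumnTransformer(
    [("cat", OrdinalEncoder(), categorical)],
    remainder="passthrough",
)
X_enc = ct.fit_transform(X)

# The output columns are now V2, V4, V1, V3, so the feature types
# have to be put into the same order before they are used together.
new_column_order = categorical + numerical
feat_type_sorted = ["categorical"] * len(categorical) + ["numerical"] * len(numerical)

print(new_column_order)
print(feat_type_sorted)
```

If the original feat_type list were used unchanged here, the first output column (the encoded V2) would be treated as numerical and the third (V1, which may hold arbitrary floats) as categorical.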
@mfeurer
Contributor

mfeurer commented Oct 7, 2020

Thanks a lot for the bug report. I can reproduce it and suggest a fix in #975.

franchuterivera pushed a commit that referenced this issue Oct 8, 2020
@mfeurer
Contributor

mfeurer commented Nov 3, 2020

This was fixed by #975

@mfeurer mfeurer closed this as completed Nov 3, 2020
@ricoms

ricoms commented Apr 19, 2021

I'm still running into this problem, but I can't provide a small working example. I have a dataset of around 200,000 instances, which I split into train and test sets. I managed to train a model on the train set, but when I later loaded the joblib model artifact and tried to predict on the test set, I got this error.

I'm trying to work out whether something is wrong with my test set, but I did not find negative values in the columns that I set as categorical. I do have negative values in numeric columns, but I don't think that should be a problem. Any hints or tips on what I should look for?

thanks in advance
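
One way to approach the question above is to compare the categorical columns of the two splits directly. The helper below is only a sketch: its name, the checks it runs and the toy frames are assumptions, and it does not reproduce any validation that auto-sklearn performs internally. Given this issue's history, it also seems worth checking that both frames list their columns in the same order.

```python
import pandas as pd


def check_categorical_columns(train, test, categorical_columns):
    """Return {column: [problems]} for categorical columns of `test` (hypothetical helper)."""
    problems = {}
    for col in categorical_columns:
        found = []
        n_missing = int(test[col].isna().sum())
        if n_missing:
            found.append(f"{n_missing} missing value(s)")
        if pd.api.types.is_numeric_dtype(test[col]) and (test[col] < 0).any():
            found.append("negative value(s)")
        # Categories that occur in the test split but never in the train split.
        unseen = set(test[col].dropna().tolist()) - set(train[col].dropna().tolist())
        if unseen:
            found.append(f"unseen categories: {sorted(unseen, key=str)}")
        if test[col].dtype != train[col].dtype:
            found.append(f"dtype {test[col].dtype} differs from train dtype {train[col].dtype}")
        if found:
            problems[col] = found
    return problems


# Toy data (assumption): "color" is categorical, "size" is numerical.
train = pd.DataFrame({"color": [0, 1, 2], "size": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"color": [0, 3, -1], "size": [-1.5, 2.0, 0.0]})

problems = check_categorical_columns(train, test, ["color"])
same_order = list(train.columns) == list(test.columns)
print(problems)
print("same column order:", same_order)
```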

@mfeurer
Contributor

mfeurer commented Apr 21, 2021

Could you please

  1. open a new issue
  2. give us the version of Auto-sklearn you're using
  3. reduce the training set to a small dataset on which Auto-sklearn still trains, and the test set to two or so examples that make predict fail
  4. provide a small example that demonstrates how your code fails

?

This will allow us to debug the problem you're facing.
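
The reduction asked for in step 3 can be partly automated. The following is a rough sketch, not part of Auto-sklearn: find_failing_rows and fake_predict are hypothetical names, and fake_predict merely stands in for the predict method of a loaded model so that the example runs on its own.

```python
def find_failing_rows(predict, rows, max_rows=2):
    """Call `predict` on one row at a time and collect the first rows that raise."""
    failing = []
    for index, row in enumerate(rows):
        try:
            predict([row])
        except Exception as error:  # any failure is of interest here
            failing.append((index, repr(error)))
            if len(failing) >= max_rows:
                break
    return failing


def fake_predict(batch):
    """Stand-in for a fitted model (assumption): rejects negative values in column 0."""
    for row in batch:
        if row[0] < 0:
            raise ValueError("Categories should be non-negative numbers.")
    return [0] * len(batch)


rows = [[0, 1.5], [-1, 2.0], [2, -3.0], [-4, 0.0], [-5, 1.0]]
failing = find_failing_rows(fake_predict, rows)
print(failing)
```

For a pandas test set, the rows would need to be passed as single-row frames and the predict([row]) call adapted accordingly; that part is left out here.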
