
Turning off the data preprocessing step causes algorithms to crash #1257

Closed
Rattko opened this issue Sep 27, 2021 · 8 comments · Fixed by #1269


Rattko commented Sep 27, 2021

Describe the bug

I have updated Auto-Sklearn to the latest version to be able to completely turn off any preprocessing. I followed the example on the webpage and it seems that turning off the preprocessing introduces crashes.

The example provided on the webpage seems to suffer from the same issue, as MyDummyClassifier is returned in the ensemble.

To Reproduce

# Code for the NoPreprocessing class, as shown in the example, omitted here
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={
        'data_preprocessor': ['NoPreprocessing'],
        'classifier': ['random_forest']
    },
    initial_configurations_via_metalearning=0
).fit(X, y)

print(automl.sprint_statistics())

Expected behavior

By removing the data_preprocessor, I obtain the expected behaviour with the expected output.

from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={
        'classifier': ['random_forest']
    },
    initial_configurations_via_metalearning=0
).fit(X, y)

print(automl.sprint_statistics())

Output:

auto-sklearn results:
  Dataset name: 5f06c41c-1fbb-11ec-b878-acde48001122
  Metric: accuracy
  Best validation score: 0.957447
  Number of target algorithm runs: 10
  Number of successful target algorithm runs: 7
  Number of crashed target algorithm runs: 2
  Number of target algorithms that exceeded the time limit: 1
  Number of target algorithms that exceeded the memory limit: 0

Actual behavior, stacktrace or logfile

Output when 'data_preprocessor': ['NoPreprocessing'] is used:

auto-sklearn results:
  Dataset name: 2df246e0-1fba-11ec-b827-acde48001122
  Metric: accuracy
  Number of target algorithm runs: 22
  Number of successful target algorithm runs: 0
  Number of crashed target algorithm runs: 22
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

Environment and installation

  • macOS Catalina
  • Python's venv module
  • Python 3.9.7
  • auto-sklearn 0.14.0

eddiebergman commented Sep 28, 2021

Hi @Rattko,

Thanks for reporting this, I'll have a look. It seems that this dataset shouldn't require any data preprocessing in the first place, but apparently all the models still seem to crash.

Link to dataset for self reference

https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset


Rattko commented Oct 8, 2021

Hi @eddiebergman,

Is there any update on this issue? I plan to use Auto-Sklearn in my bachelor thesis, but I need to turn off the preprocessing steps.


eddiebergman commented Oct 8, 2021

Hi @Rattko,

Sorry for the delay; I haven't been able to have a look at this yet, apologies. It seems that the example is actually quite outdated and that data preprocessing is something that can't be turned off. Data preprocessing is how we ensure NaN values and columns with strings don't reach the sklearn models. If you really wish to have full control over missing value imputation and how categorical data is dealt with, you can do this yourself and make sure that every column ends up numerical and has no NaNs, as sketched below.
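A minimal sketch of what doing that yourself could look like, assuming a pandas DataFrame with a mix of numerical and categorical columns (the column names, example data and imputation strategies below are only illustrative, not something auto-sklearn requires):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative frame; replace with your own data.
df = pd.DataFrame({
    'age': [25, np.nan, 40],
    'city': ['Brno', 'Prague', None],
    'target': [0, 1, 0],
})

X = df.drop(columns=['target'])
y = df['target']

categorical = X.select_dtypes(include=['object', 'category']).columns
numerical = X.columns.difference(categorical)

preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), numerical),
    ('cat', make_pipeline(
        SimpleImputer(strategy='most_frequent'),
        OneHotEncoder(handle_unknown='ignore'),
    ), categorical),
])

# After this, every column is numerical and free of NaNs before auto-sklearn sees it.
X_clean = preprocess.fit_transform(X)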

However, perhaps you were looking at turning off feature preprocessing instead; see here for the distinction. I can confirm that the snippet below works.

from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={
        'feature_preprocessor': ['no_preprocessing'],
        'classifier': ['random_forest']
    },
    initial_configurations_via_metalearning=0
).fit(X, y)

print(automl.sprint_statistics())

eddiebergman commented:

I've created a proper issue, #1266, for us to deal with this confusion and provide a simpler option to disable it once I am back and available to work on this.


Rattko commented Oct 15, 2021

Hey @eddiebergman,

I wasn't looking at turning off feature_preprocessing; I know this can be turned off. I need to turn off data_preprocessing and especially balancing_strategy. I found that balancing_strategy can be turned off by simply removing it from the pipeline here.

It seems that the example is actually quite outdated and data preprocessing is something that can't be turned off.

Having seen the issue #900, the pull request #977 and the release notes for 0.14.0, I thought that turning off the data_preprocessing is a new feature and I expected it to work.

If you really wish to have full control of missing value imputation and how categorical data is dealt with, you can do this yourself and make sure that every column ends up numerical and has no NaNs.

If I understand it correctly, once I do all the necessary preprocessing myself (encoding, imputation, etc.), Auto-Sklearn will not perform any data_preprocessing. Also, 'data_preprocessor:__choice__': 'feature_type', 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', etc. being present in the pipeline means that the data went through the OneHotEncoder, but, as it was preprocessed beforehand, it came out of the OneHotEncoder unchanged.

Is there a way to check this behaviour, i.e. that the input to the data_preprocessing step and its output are the same?

I've created a proper issue #1266 for us to deal with this confusion and provide a simpler option to disable it for once I am back and available to work on this.

How much effort would be needed to make this work? I might be able to help if it doesn't require a huge amount of work.

eddiebergman commented:

Hi @Rattko,

I just wanted to check about data preprocessing and feature preprocessing as the confusion has arisen before.

I found that the balancing_strategy can be turned off by simply removing it from the pipeline here.

There should be a better way to turn this off but yes, that should work.

I thought that turning off the data_preprocessing is a new feature and I expected it to work.

It should; we should have had better tests and documentation in the PR that implemented this feature, #977. I also tried to check the commits themselves, but the history was muddled by a rebase and it is difficult to see what was changed in that PR. I only started working on auto-sklearn after this, so unfortunately this is also new to me.

If I understand it correctly, once I do all the necessary preprocessing myself (encoding, imputation, etc.) the Auto-Sklearn will not perform any data_preprocessing

I have checked and this will not be the case.

import numpy as np

from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={ 'classifier': ['random_forest'] },
    initial_configurations_via_metalearning=0
).fit(X, y)

weight, model = automl.get_models_with_weights()[0]  # Get one of the models in the ensemble
data_preprocessor = model.steps[0][1] # model.steps[0] == ('datapreprocessor', DataPreprocessorChoice at 0x45..231)

assert np.array_equal(X, data_preprocessor.transform(X))  # Fails: the preprocessor changes the data

How much effort would be needed for this to make it work? I might be able to help, if it doesn't require a huge amount of work.

I think the best strategy is a custom DataPreprocessor, as is done in the example, but with the issue that occurs fixed. This DataPreprocessor should perform a no-op and provide an empty ConfigurationSpace, i.e. there is no optimization required here. In practice, I think a working solution should not take too much effort; however, properly testing and documenting the functionality might be some extra work. We would welcome a PR on this matter, but I imagine this might be more time than a quick fix or parameter change. I will add this to the upcoming agenda of things to deal with.

eddiebergman commented:

Hi @Rattko,

In the meantime, I've fixed the example.

from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):

    def __init__(self, **kwargs):
        for key, val in kwargs.items():
            setattr(self, key, val)
            
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        return X
        
    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            'shortname': 'no',
            'name': 'NoPreprocessing',
            'handles_regression': True,
            'handles_classification': True,
            'handles_multiclass': True,
            'handles_multilabel': True,
            'handles_multioutput': True,
            'is_deterministic': True,
            'input': (SPARSE, DENSE, UNSIGNED_DATA),
            'output': (INPUT,)
        }
        
    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        return ConfigurationSpace()
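
To actually use the component, it still needs to be registered with auto-sklearn and then selected via include. A minimal usage sketch; the add_preprocessor registration call below is an assumption based on how other custom components are added, so check the updated example for the exact call:

# Usage sketch -- `add_preprocessor` is assumed to mirror the registration helper
# used for other custom components; see the updated documentation example.
import autosklearn.pipeline.components.data_preprocessing
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)

X, y = load_breast_cancer(return_X_y=True)

automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={
        'data_preprocessor': ['NoPreprocessing'],
        'classifier': ['random_forest']
    },
    initial_configurations_via_metalearning=0
).fit(X, y)

print(automl.sprint_statistics())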

For debugging purposes, in case you run into future issues, you can see why an individual model failed using the runhistory created by SMAC, the underlying optimizer.

for runkey, runval in automl.automl_.runhistory.data.items():
    print(runval)
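
Each value is a SMAC RunValue; for crashed runs, the exception and traceback usually end up in its additional_info field, so printing that typically points at what went wrong (a sketch based on SMAC's RunValue fields):

# Sketch: print the status, cost and any error information SMAC recorded per run.
for run_key, run_value in automl.automl_.runhistory.data.items():
    print(run_value.status, run_value.cost)
    print(run_value.additional_info)  # crashed runs usually carry the error/traceback here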

eddiebergman commented:

This is addressed, updated, and documented in PR #1269.
