
Turning off the data preprocessing step causes algorithms to crash #1257

Closed
Rattko opened this issue Sep 27, 2021 · 8 comments · Fixed by #1269


Rattko commented Sep 27, 2021

Describe the bug

I have updated Auto-Sklearn to the latest version to be able to completely turn off any preprocessing. I followed the example on the webpage and it seems that turning off the preprocessing introduces crashes.

The example provided on the webpage seems to suffer from the same issue, as MyDummyClassifier is returned in the ensemble.

To Reproduce

# Code for the NoPreprocessing class, as shown in the example, omitted here
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={
        'data_preprocessor': ['NoPreprocessing'],
        'classifier': ['random_forest']
    },
    initial_configurations_via_metalearning=0
).fit(X, y)

print(automl.sprint_statistics())

Expected behavior

By removing the data_preprocessor, I obtain the expected behaviour with the expected output.

from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={
        'classifier': ['random_forest']
    },
    initial_configurations_via_metalearning=0
).fit(X, y)

print(automl.sprint_statistics())

Output:

auto-sklearn results:
  Dataset name: 5f06c41c-1fbb-11ec-b878-acde48001122
  Metric: accuracy
  Best validation score: 0.957447
  Number of target algorithm runs: 10
  Number of successful target algorithm runs: 7
  Number of crashed target algorithm runs: 2
  Number of target algorithms that exceeded the time limit: 1
  Number of target algorithms that exceeded the memory limit: 0

Actual behavior, stacktrace or logfile

Output when 'data_preprocessor': ['NoPreprocessing'] is used:

auto-sklearn results:
  Dataset name: 2df246e0-1fba-11ec-b827-acde48001122
  Metric: accuracy
  Number of target algorithm runs: 22
  Number of successful target algorithm runs: 0
  Number of crashed target algorithm runs: 22
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

Environment and installation

  • macOS Catalina
  • Python's venv module
  • Python 3.9.7
  • auto-sklearn 0.14.0

eddiebergman commented Sep 28, 2021

Hi @Rattko,

Thanks for reporting this, I'll have a look. It seems that this dataset shouldn't require any data preprocessing in the first place, but apparently all the models still seem to crash.

Link to dataset for self reference

https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset


Rattko commented Oct 8, 2021

Hi @eddiebergman,

Is there any update on this issue? I plan to use Auto-Sklearn in my bachelor thesis, but I need to turn off the preprocessing steps.


eddiebergman commented Oct 8, 2021

Hi @Rattko,

Sorry for the delay; I haven't been able to have a look at this yet, apologies. It seems that the example is actually quite outdated and that data preprocessing is something that can't be turned off. Data preprocessing is how we ensure NaN values and columns with strings don't reach the sklearn models. If you really wish to have full control over missing value imputation and how categorical data is dealt with, you can do this yourself and make sure that every column ends up numerical and has no NaNs, as sketched below.
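A minimal sketch of what doing that yourself could look like, assuming a pandas DataFrame with a mix of numerical and categorical columns (the column names, example data and imputation strategies below are only illustrative, not something auto-sklearn requires):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative frame; replace with your own data.
df = pd.DataFrame({
    'age': [25, np.nan, 40],
    'city': ['Brno', 'Prague', None],
    'target': [0, 1, 0],
})

X = df.drop(columns=['target'])
y = df['target']

categorical = X.select_dtypes(include=['object', 'category']).columns
numerical = X.columns.difference(categorical)

preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), numerical),
    ('cat', make_pipeline(
        SimpleImputer(strategy='most_frequent'),
        OneHotEncoder(handle_unknown='ignore'),
    ), categorical),
])

# After this, every column is numerical and free of NaNs before auto-sklearn sees it.
X_clean = preprocess.fit_transform(X)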

However, perhaps you were looking at turning off feature preprocessing instead; see here for the distinction. I can confirm that the snippet below works.

from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={
        'feature_preprocessor': ['no_preprocessing'],
        'classifier': ['random_forest']
    },
    initial_configurations_via_metalearning=0
).fit(X, y)

print(automl.sprint_statistics())

eddiebergman commented:

I've created a proper issue, #1266, for us to deal with this confusion and provide a simpler option to disable it once I am back and available to work on this.


Rattko commented Oct 15, 2021

Hey @eddiebergman,

I wasn't looking at turning off feature_preprocessing; I know this can be turned off. I need to turn off data_preprocessing and especially balancing_strategy. I found that balancing_strategy can be turned off by simply removing it from the pipeline here.

It seems that the example is actually quite outdated and data preprocessing is something that can't be turned off.

Having seen the issue #900, the pull request #977 and the release notes for 0.14.0, I thought that turning off the data_preprocessing is a new feature and I expected it to work.

If you really wish to have full control of missing value imputation and how categorical data is dealt with, you can do this yourself and make sure that every column ends up numerical and has no NaNs.

If I understand it correctly, once I do all the necessary preprocessing myself (encoding, imputation, etc.), Auto-Sklearn will not perform any data_preprocessing. Also, 'data_preprocessor:__choice__': 'feature_type', 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', etc. being present in the pipeline means that the data went through the OneHotEncoder, but, as it was preprocessed beforehand, it came out of the OneHotEncoder unchanged.

Is there a way to check this behaviour, i.e. that the input to the data_preprocessing step and its output are the same?

I've created a proper issue #1266 for us to deal with this confusion and provide a simpler option to disable it for once I am back and available to work on this.

How much effort would be needed to make this work? I might be able to help if it doesn't require a huge amount of work.

eddiebergman commented:

Hi @Rattko,

I just wanted to check about data preprocessing and feature preprocessing as the confusion has arisen before.

I found that the balancing_strategy can be turned off by simply removing it from the pipeline here.

There should be a better way to turn this off but yes, that should work.

I thought that turning off the data_preprocessing is a new feature and I expected it to work.

It should; we should have had better tests and documentation in the PR that implemented this feature, #977. I also tried to check the commits themselves, but the history was muddled by a rebase and it is difficult to see what was changed in that PR. I only started working on auto-sklearn after this, so unfortunately this is also new to me.

If I understand it correctly, once I do all the necessary preprocessing myself (encoding, imputation, etc.) the Auto-Sklearn will not perform any data_preprocessing

I have checked and this will not be the case.

import numpy as np

from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={ 'classifier': ['random_forest'] },
    initial_configurations_via_metalearning=0
).fit(X, y)

weight, model = automl.get_models_with_weights()[0]  # Get one of the models in the ensemble
data_preprocessor = model.steps[0][1] # model.steps[0] == ('datapreprocessor', DataPreprocessorChoice at 0x45..231)

assert np.array_equal(X, data_preprocessor.transform(X))  # Fails: the preprocessor changes the data

How much effort would be needed for this to make it work? I might be able to help, if it doesn't require a huge amount of work.

I think the best strategy is a custom DataPreprocessor, as is done in the example, but with the issue that occurs fixed. This DataPreprocessor should perform a no-op and provide an empty ConfigurationSpace, i.e. there is no optimization required here. In practice, I think a working solution should not take too much effort; however, properly testing and documenting the functionality might be some extra work. We would welcome a PR on this matter, but I imagine this might be more time than a quick fix or parameter change. I will add this to the upcoming agenda of things to deal with.

eddiebergman commented:

Hi @Rattko,

In the meantime, I've fixed the example.

from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):

    def __init__(self, **kwargs):
        for key, val in kwargs.items():
            setattr(self, key, val)
            
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        return X
        
    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            'shortname': 'no',
            'name': 'NoPreprocessing',
            'handles_regression': True,
            'handles_classification': True,
            'handles_multiclass': True,
            'handles_multilabel': True,
            'handles_multioutput': True,
            'is_deterministic': True,
            'input': (SPARSE, DENSE, UNSIGNED_DATA),
            'output': (INPUT,)
        }
        
    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        return ConfigurationSpace()
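
To actually use the component, it still needs to be registered with auto-sklearn and then selected via include. A minimal usage sketch; the add_preprocessor registration call below is an assumption based on how other custom components are added, so check the updated example for the exact call:

# Usage sketch -- `add_preprocessor` is assumed to mirror the registration helper
# used for other custom components; see the updated documentation example.
import autosklearn.pipeline.components.data_preprocessing
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)

X, y = load_breast_cancer(return_X_y=True)

automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={
        'data_preprocessor': ['NoPreprocessing'],
        'classifier': ['random_forest']
    },
    initial_configurations_via_metalearning=0
).fit(X, y)

print(automl.sprint_statistics())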

For debugging purposes, in case you run into future issues, you can see why an individual model failed using the runhistory created by SMAC, the underlying optimizer.

for runkey, runval in automl.automl_.runhistory.data.items():
    print(runval)
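
Each value is a SMAC RunValue; for crashed runs, the exception and traceback usually end up in its additional_info field, so printing that typically points at what went wrong (a sketch based on SMAC's RunValue fields):

# Sketch: print the status, cost and any error information SMAC recorded per run.
for run_key, run_value in automl.automl_.runhistory.data.items():
    print(run_value.status, run_value.cost)
    print(run_value.additional_info)  # crashed runs usually carry the error/traceback here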

eddiebergman commented:

This is addressed, updated, and documented in PR #1269.
