Turning off the data preprocessing step causes algorithms to crash #1257
Comments
Hi @Rattko, Thanks for reporting this, I'll have a look. It seems that this dataset shouldn't require any data preprocessing in the first place, but apparently all the models still seem to crash. Link to dataset for self reference: https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset |
Hi @eddiebergman, is there any update on this issue? I plan to use Auto-Sklearn in my bachelor thesis, but I need to turn off the preprocessing steps. |
Hi @Rattko, Sorry for the delay, I haven't been able to have a look at this since. It seems that the example is actually quite outdated and data preprocessing is something that can't be turned off. Data preprocessing is how we ensure NaN values and columns with strings don't reach the sklearn models. If you really wish to have full control of missing value imputation and how categorical data is dealt with, you can do this yourself and make sure that every column ends up numerical and has no NaNs. However, perhaps you were looking at turning off feature preprocessing instead, see here for the distinction. I can confirm that the snippet there works.
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={
        'feature_preprocessor': ['no_preprocessing'],
        'classifier': ['random_forest']
    },
    initial_configurations_via_metalearning=0
).fit(X, y)
print(automl.sprint_statistics()) |
I've created a proper issue #1266 for us to deal with this confusion and provide a simpler option to disable it, for once I am back and available to work on this. |
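For the do-it-yourself route mentioned above (every column numerical, no NaNs), a minimal sketch of what "ready for the models" could look like is given here. It uses only NumPy; the helper names `impute_column_means` and `is_model_ready` are hypothetical and not part of auto-sklearn, and mean imputation is just one possible strategy chosen for illustration.

```python
import numpy as np


def impute_column_means(X):
    """Return a float copy of X with each NaN replaced by its column mean."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)      # per-column mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))     # positions of the missing values
    X[rows, cols] = col_means[cols]
    return X


def is_model_ready(X):
    """True if X is purely numerical and contains no NaNs."""
    X = np.asarray(X)
    if not np.issubdtype(X.dtype, np.number):
        return False                       # e.g. string or object columns
    return not bool(np.isnan(X.astype(float)).any())


X_raw = np.array([[1.0, np.nan], [3.0, 4.0]])
X_clean = impute_column_means(X_raw)
print(is_model_ready(X_raw), is_model_ready(X_clean))
```

Categorical columns would still need to be encoded to numbers before a check like this passes; that step is left out of the sketch.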
Hey @eddiebergman, I wasn't looking at turning off the feature_preprocessing, I know this can be turned off. I need to turn off data_preprocessing and especially balancing_strategy. I found that the balancing_strategy can be turned off by simply removing it from the pipeline here.
Having seen the issue #900, the pull request #977 and the release notes for 0.14.0, I thought that turning off the data_preprocessing is a new feature and I expected it to work.
If I understand it correctly, once I do all the necessary preprocessing myself (encoding, imputation, etc.), Auto-Sklearn will not perform any data_preprocessing. Also, is there a way to check this behaviour, i.e. that the input to the data_preprocessing and its output are the same?
How much effort would be needed for this to make it work? I might be able to help, if it doesn't require a huge amount of work. |
Hi @Rattko, I just wanted to check about data preprocessing and feature preprocessing as the confusion has arisen before.
There should be a better way to turn this off but yes, that should work.
It should; the PR that implemented this feature, #977, should have had better tests and documentation. I also tried to check the commits themselves, but the history has been ruined by a rebase and it is difficult to see what was changed in that PR. I only started working on auto-sklearn after this, so unfortunately this is also new to me.
I have checked and this will not be the case.
import numpy as np
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
automl = AutoSklearnClassifier(
    time_left_for_this_task=30,
    include={'classifier': ['random_forest']},
    initial_configurations_via_metalearning=0
).fit(X, y)
weight, model = automl.get_models_with_weights()[0]  # Get one of the models
data_preprocessor = model.steps[0][1]  # model.steps[0] == ('datapreprocessor', DataPreprocessorChoice at 0x45..231)
# Compare the whole arrays; a bare `X == ...` is elementwise and cannot be used directly in an assert
print(np.array_equal(X, data_preprocessor.transform(X)))  # False
I think the best strategy is a custom DataPreprocessor, as is done in the example, but with the issue that occurs there fixed. This DataPreprocessor should perform a no-op and provide an empty ConfigurationSpace, i.e. there is no optimization required here. In practice, I think a working solution should not take too much effort; however, properly testing and documenting the functionality might be some extra work. We would welcome a PR on this matter, but I imagine this might take more time than a quick fix or parameter change. I will add this to the upcoming agenda of things to deal with. |
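The question raised earlier, whether the input to a preprocessing step and its output are the same, can be checked with a small helper along these lines. This is only a sketch: `IdentityTransformer`, `ScaleByTwo` and `transform_is_identity` are hypothetical names, the stand-in classes only mimic the `fit`/`transform` interface, and the comparison assumes dense array output.

```python
import numpy as np


class IdentityTransformer:
    """Stand-in for a no-op preprocessing step."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class ScaleByTwo:
    """Stand-in for a step that does change the data."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) * 2


def transform_is_identity(transformer, X):
    """True if transformer.transform(X) has the same shape and values as X."""
    X = np.asarray(X)
    return bool(np.array_equal(X, np.asarray(transformer.transform(X))))


X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
print(transform_is_identity(IdentityTransformer(), X_demo))
print(transform_is_identity(ScaleByTwo(), X_demo))
```

`np.array_equal` is used instead of `==` because `==` on arrays is elementwise and its result cannot be passed to `assert` or `if` directly.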
Hi @Rattko, In the meantime, I've fixed the example.
# Imports the class below relies on
from ConfigSpace.configuration_space import ConfigurationSpace
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):

    def __init__(self, **kwargs):
        # Store any keyword arguments as attributes
        for key, val in kwargs.items():
            setattr(self, key, val)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # No-op: pass the data through unchanged
        return X

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            'shortname': 'no',
            'name': 'NoPreprocessing',
            'handles_regression': True,
            'handles_classification': True,
            'handles_multiclass': True,
            'handles_multilabel': True,
            'handles_multioutput': True,
            'is_deterministic': True,
            'input': (SPARSE, DENSE, UNSIGNED_DATA),
            'output': (INPUT,)
        }

    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        # Empty search space: nothing to optimize
        return ConfigurationSpace()

For debugging purposes, in case you run into future issues, you can see why an individual model failed using the runhistory created by SMAC, the underlying optimizer.
for runkey, runval in automl.automl_.runhistory.data.items():
    print(runval) |
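When the runhistory is long, printing every entry can be hard to scan. A minimal sketch of summarizing it by status follows; it assumes each run value exposes `status` and `additional_info` attributes, as SMAC's `RunValue` does. The helper names and the `FakeRunValue` stand-in data are hypothetical and exist only so the sketch runs without auto-sklearn installed.

```python
from collections import Counter, namedtuple


def summarize_run_statuses(run_data):
    """Count runs per status in a runhistory-like mapping of run key -> run value."""
    return Counter(str(runval.status) for runval in run_data.values())


def failed_runs(run_data, success_marker="SUCCESS"):
    """Return only the entries whose status does not mention success."""
    return {
        key: runval
        for key, runval in run_data.items()
        if success_marker not in str(runval.status)
    }


# Hypothetical stand-in for real run values, for illustration only
FakeRunValue = namedtuple("FakeRunValue", ["cost", "status", "additional_info"])
demo = {
    1: FakeRunValue(0.05, "StatusType.SUCCESS", {}),
    2: FakeRunValue(1.0, "StatusType.CRASHED", {"error": "hypothetical failure message"}),
}

print(summarize_run_statuses(demo))
for key, runval in failed_runs(demo).items():
    print(key, runval.status, runval.additional_info)
```

With a fitted estimator, the same helpers could presumably be called on `automl.automl_.runhistory.data` from the snippet above.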
This is addressed, updated and documented with PR #1269 |
Describe the bug
I have updated Auto-Sklearn to the latest version to be able to completely turn off any preprocessing. I have followed the example on the webpage and, it seems, turning off the preprocessing introduces crashes.
The provided example on the webpage seems to suffer from the same issue, as MyDummyClassifier is returned in the ensemble.
To Reproduce
Expected behavior
By removing the data_preprocessor, I obtain the expected behaviour with the expected output.
Output:
Actual behavior, stacktrace or logfile
Output, when data_preprocessor = ['NoPreprocessing'] is used:
Environment and installation
venv module