making data preprocessing step configurable with two options no preprocessing and feature type split #977
Conversation
Hi @rabsr, I finally got around to looking at this PR. Thanks again for all your work! Let me briefly check whether I understand this PR correctly: are you proposing to add the possibility to drop the whole data preprocessing pipeline? And could you briefly state the goal of this: is it 1) being able to configure whether the data preprocessing should be used, or 2) being able to disable the data preprocessing entirely?
@mfeurer The plan is to make the data preprocessing step configurable, such that each step within the data preprocessing pipeline is also configurable. Users can then configure whether they want to perform rescaling, categorical encoding, etc. This gives more control over the pipeline configuration space and also reduces the number of hyperparameters to tune.

The idea of this PR is to make the data preprocessing step configurable in the same way as the other steps in the pipeline, providing the ability to add custom algorithms for data preprocessing. Currently, the implementation uses the available data preprocessing pipeline by default. If users want to disable the step, they can set include_data_preprocessor to ['no_preprocessing'], or they can allow both available algorithms by setting it to ['no_preprocessing', 'feature_type']. It also supports adding a custom data preprocessing step, but I am facing issues handling init_params, so the PR still needs improvements.

Also, even when the data does not have any missing values, missing value imputation is a mandatory step. The output then contains a configuration for the imputation strategy and gives the false impression that imputation was performed even though there were no missing values in the data. As a further enhancement, we could enable missing value imputation only when the data has missing values and forbid that configuration otherwise.

To summarise the goal:

It would help if you could provide me more details regarding init_params and what needs to be done to handle it properly. Also, how can I add the ability to configure each step in the data preprocessing pipeline?
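The include/exclude behaviour described above can be sketched in plain Python. This is an illustrative sketch, independent of auto-sklearn's actual code: the `resolve_choices` helper and the component-name constants are assumptions for demonstration only.

```python
# Illustrative sketch, not auto-sklearn code: a configurable pipeline step
# whose available algorithms are restricted via an include or exclude list.

AVAILABLE_DATA_PREPROCESSORS = {"feature_type", "no_preprocessing"}  # hypothetical names

def resolve_choices(available, include=None, exclude=None):
    """Return the set of algorithms a step may use, honouring include/exclude."""
    if include is not None and exclude is not None:
        raise ValueError("include and exclude cannot both be set for the same step")
    if include is not None:
        unknown = set(include) - available
        if unknown:
            raise ValueError(f"Unknown components: {sorted(unknown)}")
        return set(include)
    if exclude is not None:
        return available - set(exclude)
    return set(available)

# Disable data preprocessing entirely:
print(resolve_choices(AVAILABLE_DATA_PREPROCESSORS, include=["no_preprocessing"]))
# Let the optimizer choose between both options:
print(resolve_choices(AVAILABLE_DATA_PREPROCESSORS,
                      include=["no_preprocessing", "feature_type"]))
```

With no include/exclude given, the full set of available components is used, which mirrors the default behaviour described above.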
Got it.
I see. This totally makes sense. However, there's a small issue now: 'no_preprocessing' is always an option, which can potentially lead to a lot of problems when the data has missing values. I guess it would be good if the no_preprocessing option were only given in an example, so it is not there by default.
Regarding the init_params, could you please give me some details on what is failing right now? Regarding configuring each step in data preprocessing, that'll be rather tough given the current API. I guess the way forward here would be to replace …

As a final remark, please excuse the slow response time. I can only have a look at one or two PRs at a time, and currently your PR for multiple metrics has a higher priority for me.
I agree, no_preprocessing should be handled carefully. We can make sure of this by having more control over each sub-component depending on the data's meta-features, e.g. only enabling imputation when the dataset has missing values. Currently, each sub-component is enabled by default and its hyperparameters are included in the optimization, which may not be required. It also gives the false impression that mean imputation was performed even though the data has no missing values. We can always raise an exception if no_preprocessing is selected and the data has missing values.
Here, init_params is set for categorical features, and the values that are set are strictly checked by the check_init_params method. This makes it somewhat difficult to make data_preprocessing configurable.
Agreed on that. Do you have any idea how the API should look?
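The data-dependent gating suggested above (only enable the imputation sub-component when the dataset actually contains missing values) could look roughly like this. `needs_imputation` is a hypothetical helper written in plain Python for illustration, not part of the PR:

```python
import math

def needs_imputation(rows):
    """Hypothetical check: enable the imputation sub-component only when
    the dataset actually contains missing (NaN) values."""
    return any(isinstance(v, float) and math.isnan(v)
               for row in rows for v in row)

X_clean = [[1.0, 2.0], [3.0, 4.0]]
X_missing = [[1.0, float("nan")], [3.0, 4.0]]

# Only add the imputation step when it is actually needed, so the
# configuration output no longer claims imputation ran on complete data.
steps = ["rescaling", "categorical_encoding"]
if needs_imputation(X_missing):
    steps.insert(0, "imputation")
```

A forbidden-configuration clause in the search space (as mentioned above for no_preprocessing with missing values) would be the ConfigSpace-level counterpart of this check.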
Sorry for the long round-trip time, I'm slowly catching up with older issues.
If you have a look at the way we currently construct the search space (`autosklearn/util/pipeline.py`, line 27 in 2a4d388) and at the pipeline base class (`autosklearn/pipeline/base.py`, line 333 in 2a4d388) …

Having such an API, where it is possible to specify what goes into each part of the pipeline, would be a good first step towards a user being able to specify the complete pipeline themselves.
@mfeurer Thanks for the reply. I checked how the input is translated into a dictionary mapping each component to its include and exclude lists. I will change the API to accept a dictionary directly as input, keeping include and exclude as two separate parameters with the same behaviour.
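The translation described here, from the old flat parameters into per-step include/exclude dictionaries, might look like the following sketch. The function name `build_include_exclude` and the exact step keys are assumptions for illustration, not the PR's actual code:

```python
def build_include_exclude(task,
                          include_estimators=None, exclude_estimators=None,
                          include_preprocessors=None, exclude_preprocessors=None):
    """Map old-style flat parameters onto per-step include/exclude dicts."""
    step = "classifier" if task == "classification" else "regressor"
    include, exclude = {}, {}
    if include_estimators is not None and exclude_estimators is not None:
        raise ValueError("Cannot set both include and exclude for estimators")
    if include_estimators is not None:
        include[step] = list(include_estimators)
    if exclude_estimators is not None:
        exclude[step] = list(exclude_estimators)
    if include_preprocessors is not None:
        include["feature_preprocessor"] = list(include_preprocessors)
    if exclude_preprocessors is not None:
        exclude["feature_preprocessor"] = list(exclude_preprocessors)
    return include, exclude

inc, exc = build_include_exclude("classification",
                                 include_estimators=["sgd", "lda"],
                                 exclude_preprocessors=["kernel_pca"])
# inc == {'classifier': ['sgd', 'lda']}
# exc == {'feature_preprocessor': ['kernel_pca']}
```

Accepting the dictionaries directly, as proposed, makes this translation layer unnecessary for new code while keeping backwards compatibility possible.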
Codecov Report
```diff
@@             Coverage Diff             @@
##           development     #977     +/-   ##
===============================================
- Coverage        88.10%   88.08%   -0.03%
===============================================
  Files              138      139       +1
  Lines            10866    10951      +85
===============================================
+ Hits              9574     9646      +72
- Misses            1292     1305      +13
===============================================
```
Continue to review full report at Codecov.
@mfeurer I have updated the implementation with the suggested changes and it is ready for review. This PR:
```diff
     # pipeline if needed
     self.categ_ppl = CategoricalPreprocessingPipeline(
         config=None, steps=pipeline, dataset_properties=dataset_properties,
-        include=include, exclude=exclude, random_state=random_state,
-        init_params=init_params)
+        random_state=random_state, init_params=init_params)
```
Sorry for the trouble, could you elaborate why include and exclude are removed here?
Earlier, there was no configuration for the data preprocessing step, so None was used for include/exclude here. But now that the step is configurable, we need to handle it carefully. It's kind of parallel pipelines (which have configurable steps such as rescaling and categorical encoding) within a pipeline. For now I added it back, and removed include and exclude from where they were passed in here.

Please let me know whether include and exclude are required for the numerical/categorical pipeline steps. We may need to think about what the schema of the include/exclude option should look like for complex pipeline structures.
Example of the currently implemented schema for include/exclude:

```python
{
    'classifier': ['sgd', 'lda'],
    'feature_preprocessor': ['pca', 'kernel_pca']
}
```
What do you think of the schema structure below for complex pipelines?
```python
{
    'classifier': ['sgd', 'lda'],
    'feature_preprocessor': ['pca', 'kernel_pca'],
    'data_preprocessor': {
        'numerical_transformer': {
            'rescaling': ['minmax', 'normalize']
        },
        'categorical_transformer': {
            'category_coalescence': ['no_coalescense'],
            'categorical_encoding': ['no_encoding']
        }
    }
}
```
Please let me know if this makes sense and I will make changes accordingly.
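A recursive validator for the nested schema sketched above could look like this. It is plain Python for illustration only: the registry contents and the `validate` helper are made up, not taken from auto-sklearn.

```python
# Hypothetical registry of available components, mirroring the nested schema.
AVAILABLE = {
    "classifier": {"sgd", "lda", "random_forest"},
    "feature_preprocessor": {"pca", "kernel_pca", "no_preprocessing"},
    "data_preprocessor": {
        "numerical_transformer": {
            "rescaling": {"minmax", "normalize", "standardize"},
        },
        "categorical_transformer": {
            "category_coalescence": {"no_coalescense", "minority_coalescer"},
            "categorical_encoding": {"no_encoding", "one_hot_encoding"},
        },
    },
}

def validate(spec, available, path=""):
    """Recursively check an include/exclude spec against the registry."""
    for key, value in spec.items():
        if key not in available:
            raise ValueError(f"Unknown step: {path + key!r}")
        allowed = available[key]
        if isinstance(value, dict):
            if not isinstance(allowed, dict):
                raise ValueError(f"{path + key!r} has no sub-steps")
            validate(value, allowed, path + key + ".")
        else:
            unknown = set(value) - allowed
            if unknown:
                raise ValueError(f"Unknown choices for {path + key!r}: {sorted(unknown)}")

validate({
    "classifier": ["sgd", "lda"],
    "data_preprocessor": {
        "numerical_transformer": {"rescaling": ["minmax", "normalize"]},
    },
}, AVAILABLE)  # passes silently
```

The recursion is what makes the flat and nested schemas coexist: a list leaf is checked directly, while a dict value descends one level.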
Thanks for your reply. So the question we want to answer here is whether someone would want to use, for instance, only 'minmax' scaling. But what happens if there are no numerical columns? We have to take multiple scenarios into account.

To my understanding, if someone wants to use only minmax as a data preprocessing step, he/she can do so via:

```python
autosklearn.pipeline.components.data_preprocessing.add_preprocessor(<custom preprocessing>)
```

So adding support for these custom scenarios might complicate the code, and there is a work-around for these custom cases. In other words, I think your change gives enough flexibility to the users as it is. What do you think?
Agreed, it would complicate the code. We can go ahead with the current implementation. Let me know if you have any other feedback.
`autosklearn/smbo.py` (Outdated)

```python
    exclude['regressor'] = self.exclude_estimators
else:
    raise ValueError(self.task)
if self.include is not None and self.exclude is not None:
```
I think that if you rebase the code onto the latest version, you'll see this has changed a bit.
Rebased. It looks the same to me; there are no changes here.
Yes, sorry. I thought a change was already merged (#1096). Depending on what gets merged first, we will update the code accordingly
Sure.
Pulled latest changes after #1096 merge.
@mfeurer @franchuterivera Can either of you help with the test cases? What is this error about?
The error message is about this patch breaking the parallel mode of Auto-sklearn. However, the error actually also happens in sequential mode; the difference is just that the parallel mode breaks the unit tests. Therefore, I suggest that you run the first failing test on your local machine to see what the issue is.
@mfeurer I have fixed all the test cases and rebased onto the development branch. Please check now.
Hi @rabsr, it looks good to me. Unfortunately I missed out on most of this discussion as it was happening, and the rebase has collapsed the file changes to show all of the development changes, so my ability to do my own review is hampered. Seeing as @mfeurer and @franchuterivera were happy up to commit 12e56e3, and all changes since were test and build fixes, I am happy to merge it, as it has been on the radar for a while. If any issues appear down the line, I will create an issue and tag you in! Thanks again and sorry for the slow response time :)
…ocessing and feature type split (#977)

* making data preprocessing step configurable with two options no preprocessing and feature type split
* Fix: execution fails when data_preprocessor is no_preprocesing
* Incorporating review comments
* Fixing test cases; updating metalearning with updated hyperparameters
* Fixing examples
* Updating portfolios with new config
* Incorporated review comments and fix test case
* Test fixes
* Test fixes
* Fix metalearning config
* Remove unused imports
* Fix test cases
* Fix test cases and examples
* Adding more checks for include and exclude params
* Fix flake error
* Fix flake error
* Handling target_type in datatset_properties
* Fixes
* Fixes
* Fix error
* Fix test cases
* Adding datatype annotations
* Fix test cases
* Fix build
* Fix test case'
* Update stale.yaml
* Fix annotation type
* Update portfolios with new config

Co-authored-by: Rohit Agarwal <rohit.agarwal4@aexp.com>
New approach for #900: the data preprocessing step of the AutoML system will have two options:

Introduced new parameters:

If neither parameter is set by the user, by default 'no_preprocessing' is added to exclude_data_preprocessors and only the existing FeatureTypeSplit component is used.
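The default behaviour described above might be implemented along these lines; the function name and the shape of the return value are assumptions for illustration, not the PR's actual code:

```python
def resolve_data_preprocessor_defaults(include_data_preprocessors=None,
                                       exclude_data_preprocessors=None):
    """If the user sets neither parameter, exclude 'no_preprocessing' so that
    only the existing FeatureTypeSplit component remains in the search space."""
    if include_data_preprocessors is None and exclude_data_preprocessors is None:
        exclude_data_preprocessors = ["no_preprocessing"]
    return include_data_preprocessors, exclude_data_preprocessors

# Default: no_preprocessing is excluded, FeatureTypeSplit remains.
print(resolve_data_preprocessor_defaults())
# Explicit user choice is passed through untouched.
print(resolve_data_preprocessor_defaults(
    include_data_preprocessors=["no_preprocessing"]))
```

This keeps the pre-existing behaviour for users who do not opt in, while making the step fully configurable for those who do.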