
making data preprocessing step configurable with two options no preprocessing and feature type split #977

Merged: 48 commits into automl:development on Aug 2, 2021

Conversation

@rabsr (Contributor) commented Oct 12, 2020

New approach for #900: the data preprocessing step of the AutoML system will have two options:

  • use the feature_type split method (existing implementation)
  • disable the step by selecting no_preprocessing

Introduced new parameters:

  • include_data_preprocessors
  • exclude_data_preprocessors

If neither parameter is set by the user, 'no_preprocessing' is added to exclude_data_preprocessors by default, so only the existing FeatureTypeSplit component is used.
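The default behaviour described above can be sketched as follows. This is an illustrative stand-in with assumed names (AVAILABLE, resolve_data_preprocessors), not auto-sklearn's actual code:

```python
# Illustrative sketch of the described default: if the user sets neither
# parameter, 'no_preprocessing' is excluded so only FeatureTypeSplit remains.
AVAILABLE = ["feature_type", "no_preprocessing"]

def resolve_data_preprocessors(include=None, exclude=None):
    """Return the list of data preprocessing components left active."""
    if include is None and exclude is None:
        exclude = ["no_preprocessing"]  # default described in the PR
    if include is not None:
        return [c for c in AVAILABLE if c in include]
    return [c for c in AVAILABLE if c not in (exclude or [])]

print(resolve_data_preprocessors())  # ['feature_type']
print(resolve_data_preprocessors(include=["no_preprocessing"]))
```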

@mfeurer (Contributor) commented Nov 9, 2020

Hi @rabsr, I finally got around to looking at this PR. Thanks again for all your work!

Let me briefly check whether I understand this PR correctly: are you proposing to add the possibility to drop the whole data preprocessing pipeline? And could you briefly state the goal of this: is it 1) being able to configure whether data preprocessing should be used, or 2) being able to disable data preprocessing entirely?

@rabsr (Contributor, Author) commented Nov 11, 2020

@mfeurer The plan is to make the data preprocessing step configurable, such that each step within the data preprocessing pipeline is also configurable. Users can then choose whether to perform rescaling, categorical encoding, etc. This gives more control over the pipeline configuration space and also reduces the number of hyperparameters to tune, as required.

The idea of this PR is to make the data preprocessing step configurable, similar to other steps in the pipeline, providing the ability to add custom algorithms for data preprocessing.

Currently, the implementation only uses the available data preprocessing pipeline by default. If users want to disable the step, they can set include_data_preprocessor to ['no_preprocessing'], or use both available algorithms by setting it to ['no_preprocessing', 'feature_type']. It also supports adding a custom data preprocessing step, but I am facing issues handling init_params. The PR needs improvements.

Also, even when the data does not have any missing values, missing value imputation is a mandatory step. The output contains a config for the imputation strategy, giving the false impression that imputation was performed even when there are no missing values in the data. As a further enhancement, we could enable missing value imputation only when the data has missing values and forbid the configuration otherwise.
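The suggested enhancement could look roughly like this. A minimal standalone sketch, assuming list-of-rows data and a hypothetical needs_imputation helper:

```python
# Minimal standalone sketch: enable the imputation subcomponent only when
# the dataset actually contains missing values (None or NaN cells).
import math

def needs_imputation(rows):
    """Return True if any cell is None or NaN."""
    return any(
        v is None or (isinstance(v, float) and math.isnan(v))
        for row in rows for v in row
    )

data = [[1.0, 2.0], [3.0, float("nan")]]
steps = ["rescaling"] + (["imputation"] if needs_imputation(data) else [])
print(steps)  # ['rescaling', 'imputation']
```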

To summarise the goal:

  • Ability to configure data preprocessing step in SimpleClassificationPipeline/SimpleRegressionPipeline (Implemented)
  • Ability to configure each step in data preprocessing pipeline
  • Ability to add new custom data preprocessing step (Implemented, missing proper handling of init_params)
  • Enable steps in data preprocessing pipeline based on dataset properties

It would help if you could provide more details regarding init_params and what needs to be done to handle it properly, and also how I can add the ability to configure each step in the data preprocessing pipeline.

@mfeurer (Contributor) commented Nov 18, 2020

> The idea of this PR is to make the data preprocessing step configurable, similar to other steps in the pipeline, providing the ability to add custom algorithms for data preprocessing.

Got it.

> Currently, the implementation only uses the available data preprocessing pipeline by default. If users want to disable the step, they can set include_data_preprocessor to ['no_preprocessing'], or use both available algorithms by setting it to ['no_preprocessing', 'feature_type']. It also supports adding a custom data preprocessing step, but I am facing issues handling init_params. The PR needs improvements.

I see. This totally makes sense. However, there's a small issue in that 'no_preprocessing' is now always an option, which can potentially lead to a lot of problems in case the data has missing values. I guess it would be good if the no_preprocessing option were only given in an example, so it is not there by default.

> It would help if you could provide more details regarding init_params and what needs to be done to handle it properly, and also how I can add the ability to configure each step in the data preprocessing pipeline.

Regarding the init_params, could you please give me some details on what is failing right now? Regarding configuring each step in data preprocessing, that'll be rather tough given the current API. I guess the way forward here would be to replace include_estimators, include_preprocessors and include_data_preprocessor by a single dictionary which allows addressing each component by name. I'll keep thinking about an API for that and will keep you posted.

As a final remark, please excuse the slow response time. I can only have a look at at most one or two PRs at a time, and currently your PR for multiple metrics has a higher priority for me.

@rabsr (Contributor, Author) commented Jan 12, 2021

> there's a small issue in that 'no_preprocessing' is now always an option, which can potentially lead to a lot of problems in case the data has missing values. I guess it would be good if the no_preprocessing option were only given in an example, so it is not there by default.

I agree, no_preprocessing should be handled carefully. We can ensure this by having more control over each subcomponent depending on the dataset's meta-features, e.g. only enabling imputation when the dataset has missing values. Currently, each subcomponent is enabled by default and its hyperparameters are optimized, which may not be required. It also gives the false impression that mean imputation was performed even though the data doesn't have any missing values. We could always raise an exception if no_preprocessing is selected and the data has missing values.
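The exception mentioned above could be sketched as a simple guard; validate_choice is a hypothetical name for illustration, not an auto-sklearn function:

```python
# Illustrative guard: fail fast when 'no_preprocessing' is chosen but the
# data still contains missing values.
def validate_choice(choice, has_missing_values):
    """Reject 'no_preprocessing' when the data still has missing values."""
    if choice == "no_preprocessing" and has_missing_values:
        raise ValueError(
            "no_preprocessing selected but the data contains missing values"
        )
    return choice

validate_choice("feature_type", True)       # fine: imputation is available
validate_choice("no_preprocessing", False)  # fine: nothing to impute
```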

> Regarding the init_params, could you please give me some details on what is failing right now?

Here, init_params is set for categorical features, and the values that are set are strictly checked by the check_init_params method. This makes it somewhat difficult to make data_preprocessing configurable.

> Regarding configuring each step in data preprocessing, that'll be rather tough given the current API. I guess the way forward here would be to replace include_estimators, include_preprocessors and include_data_preprocessor by a single dictionary which allows addressing each component by name. I'll keep thinking about an API for that and will keep you posted.

Agree on that. Do you have any idea of how the API should look?

@mfeurer (Contributor) commented Jan 20, 2021

Sorry for the long round-trip time, I'm slowly catching up with older issues.

> Agree on that. Do you have any idea of how the API should look?

If you have a look at the way we currently construct the search space (`get_configuration_space(info: Dict[str, Any], ...)`), you'll see that it's translated into a dictionary mapping each component name to what to include and exclude for that component, which is later used in `find_active_choices(matches, node, node_idx, ...)`.

Having such an API, where it is possible to specify what's in each part of the pipeline, would be a good first step towards users being able to specify the complete pipeline themselves.

@rabsr (Contributor, Author) commented Jan 22, 2021

@mfeurer Thanks for the reply.

I checked how the input is translated into a dictionary mapping each component to its include and exclude lists. I will change the API to accept a dictionary directly as input, keeping include and exclude as two parameters with the same behaviour.

@codecov bot commented Jan 23, 2021

Codecov Report

Merging #977 (96fee04) into development (832b412) will decrease coverage by 0.02%.
The diff coverage is 100.00%.


```diff
@@               Coverage Diff               @@
##           development     #977      +/-   ##
===============================================
- Coverage        88.10%   88.08%   -0.03%
===============================================
  Files              138      139       +1
  Lines            10866    10951      +85
===============================================
+ Hits              9574     9646      +72
- Misses            1292     1305      +13
```

| Impacted Files | Coverage Δ |
| --- | --- |
| autosklearn/evaluation/abstract_evaluator.py | 92.91% <ø> (+0.78%) ⬆️ |
| ...osklearn/metalearning/metafeatures/metafeatures.py | 94.59% <ø> (ø) |
| autosklearn/pipeline/base.py | 87.83% <ø> (+1.27%) ⬆️ |
| autosklearn/pipeline/classification.py | 86.50% <ø> (-0.38%) ⬇️ |
| ...arn/pipeline/components/classification/__init__.py | 84.78% <ø> (ø) |
| ...pipeline/components/data_preprocessing/__init__.py | 82.97% <ø> (ø) |
| ...line/components/data_preprocessing/feature_type.py | 88.11% <ø> (ø) |
| ...nts/data_preprocessing/feature_type_categorical.py | 90.90% <ø> (ø) |
| ...nents/data_preprocessing/feature_type_numerical.py | 90.32% <ø> (ø) |
| autosklearn/pipeline/regression.py | 93.87% <ø> (-0.81%) ⬇️ |
| ... and 22 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rabsr (Contributor, Author) commented Jan 27, 2021

@mfeurer I have updated the implementation with the suggested changes and it is ready for review. This PR:

  • makes the data_preprocessor step configurable, changing the default implementation to the feature_type component for this step
  • adds an example demonstrating how to extend and skip the data_preprocessor step
  • updates configurations
  • removes the individual include/exclude parameters for each step. Components can now be included or excluded by passing a dictionary with the step name as key and the list of components to include/exclude as value, making sure that the same step cannot be present in both include and exclude.

The snippet under review, reconstructed as a diff:

```diff
 # pipeline if needed
 self.categ_ppl = CategoricalPreprocessingPipeline(
     config=None, steps=pipeline, dataset_properties=dataset_properties,
-    include=include, exclude=exclude, random_state=random_state,
-    init_params=init_params)
+    random_state=random_state, init_params=init_params)
```
Contributor:

Sorry for the trouble, could you elaborate why include and exclude are removed here?

Contributor Author:

Earlier there was no configuration for the data preprocessing step, so None was used for include/exclude here. But now that the step is configurable, we need to handle it carefully. It is effectively a set of parallel pipelines (with configurable steps such as rescaling and categorical encoding) within the pipeline. For now I added it back, and removed include and exclude from where they were passed in.
Please let me know if it is required to have include and exclude for the numerical/categorical pipeline steps. We may also need to think about what the schema of the include/exclude option should look like for complex pipeline structures.
An example of the currently implemented schema for include/exclude:

```python
{
    'classifier': ['sgd', 'lda'],
    'feature_preprocessor': ['pca', 'kernel_pca']
}
```

What do you think of the schema structure below for complex pipelines?

```python
{
    'classifier': ['sgd', 'lda'],
    'feature_preprocessor': ['pca', 'kernel_pca'],
    'data_preprocessor': {
        'numerical_transformer': {
            'rescaling': ['minmax', 'normalize']
        },
        'categorical_transformer': {
            'category_coalescence': ['no_coalescense'],
            'categorical_encoding': ['no_encoding']
        }
    }
}
```

Please let me know if this makes sense and I will make changes accordingly.
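One way to reason about such a nested schema is to flatten it into colon-separated step names, one entry per configurable sub-step. This is a hypothetical sketch, not part of the PR:

```python
# Hypothetical sketch: flatten a nested include/exclude schema into
# colon-separated step names so each sub-step can be addressed by one key.
def flatten(schema, prefix=""):
    flat = {}
    for name, value in schema.items():
        key = prefix + name
        if isinstance(value, dict):
            flat.update(flatten(value, key + ":"))
        else:
            flat[key] = value
    return flat

schema = {
    "classifier": ["sgd", "lda"],
    "data_preprocessor": {
        "numerical_transformer": {"rescaling": ["minmax", "normalize"]}
    },
}
flat = flatten(schema)
# flat now maps 'data_preprocessor:numerical_transformer:rescaling'
# to ['minmax', 'normalize'] alongside the top-level 'classifier' entry.
```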

Contributor:

Thanks for your reply. So the question we want to answer here is what happens if someone wants to use, for instance, only 'minmax' scaling. But what happens if there are no numerical columns? We have to take multiple scenarios into account.

To my understanding, if someone wants to use only minmax as data preprocessing, they can do so via:

```python
autosklearn.pipeline.components.data_preprocessing.add_preprocessor(<custom preprocessing>)
```

So adding support for these custom scenarios might complicate the code, and there is a workaround for these custom cases. In other words, I think your change gives users enough flexibility as it is. What do you think?

Contributor Author:

Agreed, it would complicate the code. We can go ahead with the current implementation. Let me know if you have any other feedback.

The snippet under review:

```python
exclude['regressor'] = self.exclude_estimators
else:
    raise ValueError(self.task)
if self.include is not None and self.exclude is not None:
```
Contributor:

I think that if you rebase the code onto the latest version, you'll see this has changed a bit.

Contributor Author:

Rebased. It looks the same to me; there are no changes here.

Contributor:

Yes, sorry. I thought a change had already been merged (#1096). Depending on what gets merged first, we will update the code accordingly.

Contributor Author:

Sure.

Contributor Author:

Pulled latest changes after #1096 merge.

@rabsr (Contributor, Author) commented Jul 1, 2021

@mfeurer @franchuterivera Can either of you help with the test cases? What's this error about?

@mfeurer (Contributor) commented Jul 9, 2021

The error message is about this patch breaking the parallel mode of Auto-sklearn. However, the error actually also happens in sequential mode; the difference is just that the parallel mode breaks the unit tests. Therefore, I suggest that you run the first failing test on your local machine to see what the issue is.

@rabsr (Contributor, Author) commented Aug 1, 2021

@mfeurer I have fixed all the test cases and rebased onto the development branch. Please check now.

@eddiebergman (Contributor) commented Aug 2, 2021

Hi @rabsr,

It looks good to me. Unfortunately I missed out on most of this discussion as it was happening, and the rebase has caused the file diff to show all of the development branch's changes, so my ability to do my own review is hampered.

Seeing as @mfeurer and @franchuterivera were happy up to commit 12e56e3, and all changes since were test and build fixes, I am happy to merge it, as it's been on the radar for a while.

If any issues are to appear down the line, I will create an issue and tag you in!

Thanks again and sorry for the slow response time :)

@eddiebergman eddiebergman merged commit 897fe40 into automl:development Aug 2, 2021
github-actions bot pushed a commit that referenced this pull request Aug 2, 2021
…ptions no preprocessing and feature type split (#977)
@rabsr rabsr deleted the config_data_preprocess branch August 2, 2021 12:00
eddiebergman pushed a commit that referenced this pull request Aug 18, 2021
…ocessing and feature type split (#977)

* making data preprocessing step configurable with two options no preprocessing and feature type split

* Fix: execution fails when data_preprocessor is no_preprocesing

* Incorporating review comments

* Fixing test cases; updating metalearning with updated hyperparameters

* Fixing examples

* Updating portfolios with new config

* Incorporated review comments and fix test case

* Test fixes

* Test fixes

* Fix metalearning config

* Remove unused imports

* Fix test cases

* Fix test cases and examples

* Adding more checks for include and exclude params

* Fix flake error

* Fix flake error

* Handling target_type in datatset_properties

* Fixes

* Fixes

* Fix error

* Fix test cases

* Adding datatype annotations

* Fix test cases

* Fix build

* Fix test case'

* Update stale.yaml

* Fix annotation type

* Update portfolios with new config

Co-authored-by: Rohit Agarwal <rohit.agarwal4@aexp.com>
5 participants