
making data preprocessing step configurable with two options no preprocessing and feature type split #977

Merged: 48 commits into automl:development on Aug 2, 2021

Conversation

@rabsr (Contributor) commented Oct 12, 2020

New approach for #900: the data preprocessing step of the AutoML system will have two options:

  • use the feature_type split method (existing implementation)
  • disable the step by selecting no_preprocessing

Introduced new parameters:

  • include_data_preprocessors
  • exclude_data_preprocessors

If neither parameter is set by the user, 'no_preprocessing' is added to exclude_data_preprocessors by default, so only the existing FeatureTypeSplit component is used.
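The default behaviour described above can be sketched as follows. This is an illustrative stand-in with assumed names (AVAILABLE, resolve_data_preprocessors), not auto-sklearn's actual code:

```python
# Illustrative sketch of the described default: if the user sets neither
# parameter, 'no_preprocessing' is excluded so only FeatureTypeSplit remains.
AVAILABLE = ["feature_type", "no_preprocessing"]

def resolve_data_preprocessors(include=None, exclude=None):
    """Return the list of data preprocessing components left active."""
    if include is None and exclude is None:
        exclude = ["no_preprocessing"]  # default described in the PR
    if include is not None:
        return [c for c in AVAILABLE if c in include]
    return [c for c in AVAILABLE if c not in (exclude or [])]

print(resolve_data_preprocessors())  # ['feature_type']
print(resolve_data_preprocessors(include=["no_preprocessing"]))
```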

@mfeurer (Contributor) commented Nov 9, 2020

Hi @rabsr, I finally got around to looking at this PR. Thanks again for all your work!

Let me briefly check whether I understand this PR correctly: are you proposing to add the possibility to drop the whole data preprocessing pipeline? And could you briefly state the goal of this: is it 1) being able to configure whether data preprocessing should be used, or 2) being able to disable data preprocessing entirely?

@rabsr (Contributor, Author) commented Nov 11, 2020

@mfeurer The plan is to make the data preprocessing step configurable, such that each step within the data preprocessing pipeline is also configurable. Users can then choose whether to perform rescaling, categorical encoding, etc. This gives more control over the pipeline configuration space and also reduces the number of hyperparameters to tune, as required.

The idea of this PR is to make the data preprocessing step configurable, similar to other steps in the pipeline, providing the ability to add custom algorithms for data preprocessing.

Currently, the implementation only uses the available data preprocessing pipeline by default. If users want to disable the step, they can set include_data_preprocessor to ['no_preprocessing'], or use both available algorithms by setting it to ['no_preprocessing', 'feature_type']. It also supports adding a custom data preprocessing step, but I am facing issues handling init_params. The PR needs improvements.

Also, even when the data does not have any missing values, missing value imputation is a mandatory step. The output contains a config for the imputation strategy, giving the false impression that imputation was performed even when there are no missing values in the data. As a further enhancement, we could enable missing value imputation only when the data has missing values and forbid the configuration otherwise.
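The suggested enhancement could look roughly like this. A minimal standalone sketch, assuming list-of-rows data and a hypothetical needs_imputation helper:

```python
# Minimal standalone sketch: enable the imputation subcomponent only when
# the dataset actually contains missing values (None or NaN cells).
import math

def needs_imputation(rows):
    """Return True if any cell is None or NaN."""
    return any(
        v is None or (isinstance(v, float) and math.isnan(v))
        for row in rows for v in row
    )

data = [[1.0, 2.0], [3.0, float("nan")]]
steps = ["rescaling"] + (["imputation"] if needs_imputation(data) else [])
print(steps)  # ['rescaling', 'imputation']
```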

To summarise the goal:

  • Ability to configure data preprocessing step in SimpleClassificationPipeline/SimpleRegressionPipeline (Implemented)
  • Ability to configure each step in data preprocessing pipeline
  • Ability to add new custom data preprocessing step (Implemented, missing proper handling of init_params)
  • Enable steps in data preprocessing pipeline based on dataset properties

It would help if you could provide more details regarding init_params and what needs to be done to handle it properly, and also how I can add the ability to configure each step in the data preprocessing pipeline.

@mfeurer (Contributor) commented Nov 18, 2020

> The idea of this PR is to make the data preprocessing step configurable, similar to other steps in the pipeline, providing the ability to add custom algorithms for data preprocessing.

Got it.

> Currently, the implementation only uses the available data preprocessing pipeline by default. If users want to disable the step, they can set include_data_preprocessor to ['no_preprocessing'], or use both available algorithms by setting it to ['no_preprocessing', 'feature_type']. It also supports adding a custom data preprocessing step, but I am facing issues handling init_params. The PR needs improvements.

I see. This totally makes sense. However, there's a small issue in that 'no_preprocessing' is now always an option, which can potentially lead to a lot of problems in case the data has missing values. I guess it would be good if the no_preprocessing option were only given in an example, so it is not there by default.

> It would help if you could provide more details regarding init_params and what needs to be done to handle it properly, and also how I can add the ability to configure each step in the data preprocessing pipeline.

Regarding the init_params, could you please give me some details on what is failing right now? Regarding configuring each step in data preprocessing, that'll be rather tough given the current API. I guess the way forward here would be to replace include_estimators, include_preprocessors and include_data_preprocessor by a single dictionary which allows addressing each component by name. I'll keep thinking about an API for that and will keep you posted.

As a final remark, please excuse the slow response time. I can only have a look at at most one or two PRs at a time, and currently your PR for multiple metrics has a higher priority for me.

@rabsr (Contributor, Author) commented Jan 12, 2021

> there's a small issue in that 'no_preprocessing' is now always an option, which can potentially lead to a lot of problems in case the data has missing values. I guess it would be good if the no_preprocessing option were only given in an example, so it is not there by default.

I agree, no_preprocessing should be handled carefully. We can ensure this by having more control over each subcomponent depending on the dataset's meta-features, e.g. only enabling imputation when the dataset has missing values. Currently, each subcomponent is enabled by default and its hyperparameters are optimized, which may not be required. It also gives the false impression that mean imputation was performed even though the data doesn't have any missing values. We could always raise an exception if no_preprocessing is selected and the data has missing values.
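The exception mentioned above could be sketched as a simple guard; validate_choice is a hypothetical name for illustration, not an auto-sklearn function:

```python
# Illustrative guard: fail fast when 'no_preprocessing' is chosen but the
# data still contains missing values.
def validate_choice(choice, has_missing_values):
    """Reject 'no_preprocessing' when the data still has missing values."""
    if choice == "no_preprocessing" and has_missing_values:
        raise ValueError(
            "no_preprocessing selected but the data contains missing values"
        )
    return choice

validate_choice("feature_type", True)       # fine: imputation is available
validate_choice("no_preprocessing", False)  # fine: nothing to impute
```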

> Regarding the init_params, could you please give me some details on what is failing right now?

Here, init_params is set for categorical features, and the values that are set are strictly checked by the check_init_params method. This makes it somewhat difficult to make data_preprocessing configurable.

> Regarding configuring each step in data preprocessing, that'll be rather tough given the current API. I guess the way forward here would be to replace include_estimators, include_preprocessors and include_data_preprocessor by a single dictionary which allows addressing each component by name. I'll keep thinking about an API for that and will keep you posted.

Agree on that. Do you have any idea of how the API should look?

@mfeurer (Contributor) commented Jan 20, 2021

Sorry for the long round-trip time, I'm slowly catching up with older issues.

> Agree on that. Do you have any idea of how the API should look?

If you have a look at the way we currently construct the search space (`get_configuration_space(info: Dict[str, Any], ...)`), you'll see that it's translated into a dictionary mapping each component name to what to include and exclude for that component, which is later used in `find_active_choices(matches, node, node_idx, ...)`.

Having such an API, where it is possible to specify what's in each part of the pipeline, would be a good first step towards users being able to specify the complete pipeline themselves.

@rabsr (Contributor, Author) commented Jan 22, 2021

@mfeurer Thanks for the reply.

I checked how the input is translated into a dictionary mapping each component to its include and exclude lists. I will change the API to accept a dictionary directly as input, keeping include and exclude as two parameters with the same behaviour.

@codecov bot commented Jan 23, 2021

Codecov Report

Merging #977 (96fee04) into development (832b412) will decrease coverage by 0.02%.
The diff coverage is 100.00%.


```diff
@@               Coverage Diff               @@
##           development     #977      +/-   ##
===============================================
- Coverage        88.10%   88.08%   -0.03%
===============================================
  Files              138      139       +1
  Lines            10866    10951      +85
===============================================
+ Hits              9574     9646      +72
- Misses            1292     1305      +13
```

| Impacted Files | Coverage Δ |
| --- | --- |
| autosklearn/evaluation/abstract_evaluator.py | 92.91% <ø> (+0.78%) ⬆️ |
| ...osklearn/metalearning/metafeatures/metafeatures.py | 94.59% <ø> (ø) |
| autosklearn/pipeline/base.py | 87.83% <ø> (+1.27%) ⬆️ |
| autosklearn/pipeline/classification.py | 86.50% <ø> (-0.38%) ⬇️ |
| ...arn/pipeline/components/classification/__init__.py | 84.78% <ø> (ø) |
| ...pipeline/components/data_preprocessing/__init__.py | 82.97% <ø> (ø) |
| ...line/components/data_preprocessing/feature_type.py | 88.11% <ø> (ø) |
| ...nts/data_preprocessing/feature_type_categorical.py | 90.90% <ø> (ø) |
| ...nents/data_preprocessing/feature_type_numerical.py | 90.32% <ø> (ø) |
| autosklearn/pipeline/regression.py | 93.87% <ø> (-0.81%) ⬇️ |
| ... and 22 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rabsr (Contributor, Author) commented Jan 27, 2021

@mfeurer I have updated the implementation with the suggested changes and it is ready for review. This PR:

  • makes the data_preprocessor step configurable, changing the default implementation to the feature_type component for this step
  • adds an example demonstrating how to extend and skip the data_preprocessor step
  • updates configurations
  • removes the individual include/exclude parameters for each step. Components can now be included or excluded by passing a dictionary with the step name as key and the list of components to include/exclude as value, making sure that the same step cannot be present in both include and exclude.

The snippet under review, reconstructed as a diff:

```diff
 # pipeline if needed
 self.categ_ppl = CategoricalPreprocessingPipeline(
     config=None, steps=pipeline, dataset_properties=dataset_properties,
-    include=include, exclude=exclude, random_state=random_state,
-    init_params=init_params)
+    random_state=random_state, init_params=init_params)
```
Contributor:

Sorry for the trouble, could you elaborate why include and exclude are removed here?

Contributor Author:

Earlier there was no configuration for the data preprocessing step, so None was used for include/exclude here. But now that the step is configurable, we need to handle it carefully. It is effectively a set of parallel pipelines (with configurable steps such as rescaling and categorical encoding) within the pipeline. For now I added it back, and removed include and exclude from where they were passed in.
Please let me know if it is required to have include and exclude for the numerical/categorical pipeline steps. We may also need to think about what the schema of the include/exclude option should look like for complex pipeline structures.
An example of the currently implemented schema for include/exclude:

```python
{
    'classifier': ['sgd', 'lda'],
    'feature_preprocessor': ['pca', 'kernel_pca']
}
```

What do you think of the schema structure below for complex pipelines?

```python
{
    'classifier': ['sgd', 'lda'],
    'feature_preprocessor': ['pca', 'kernel_pca'],
    'data_preprocessor': {
        'numerical_transformer': {
            'rescaling': ['minmax', 'normalize']
        },
        'categorical_transformer': {
            'category_coalescence': ['no_coalescense'],
            'categorical_encoding': ['no_encoding']
        }
    }
}
```

Please let me know if this makes sense and I will make changes accordingly.
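One way to reason about such a nested schema is to flatten it into colon-separated step names, one entry per configurable sub-step. This is a hypothetical sketch, not part of the PR:

```python
# Hypothetical sketch: flatten a nested include/exclude schema into
# colon-separated step names so each sub-step can be addressed by one key.
def flatten(schema, prefix=""):
    flat = {}
    for name, value in schema.items():
        key = prefix + name
        if isinstance(value, dict):
            flat.update(flatten(value, key + ":"))
        else:
            flat[key] = value
    return flat

schema = {
    "classifier": ["sgd", "lda"],
    "data_preprocessor": {
        "numerical_transformer": {"rescaling": ["minmax", "normalize"]}
    },
}
flat = flatten(schema)
# flat now maps 'data_preprocessor:numerical_transformer:rescaling'
# to ['minmax', 'normalize'] alongside the top-level 'classifier' entry.
```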

Contributor:

Thanks for your reply. So the question we want to answer here is what happens if someone wants to use, for instance, only 'minmax' scaling. But what happens if there are no numerical columns? We have to take multiple scenarios into account.

To my understanding, if someone wants to use only minmax as data preprocessing, they can do so via:

```python
autosklearn.pipeline.components.data_preprocessing.add_preprocessor(<custom preprocessing>)
```

So adding support for these custom scenarios might complicate the code, and there is a workaround for these custom cases. In other words, I think your change gives users enough flexibility as it is. What do you think?

Contributor Author:

Agreed, it would complicate the code. We can go ahead with the current implementation. Let me know if you have any other feedback.

The snippet under review:

```python
exclude['regressor'] = self.exclude_estimators
else:
    raise ValueError(self.task)
if self.include is not None and self.exclude is not None:
```
Contributor:

I think that if you rebase the code onto the latest version, you'll see this has changed a bit.

Contributor Author:

Rebased. It looks the same to me; there are no changes here.

Contributor:

Yes, sorry. I thought a change had already been merged (#1096). Depending on what gets merged first, we will update the code accordingly.

Contributor Author:

Sure.

Contributor Author:

Pulled latest changes after #1096 merge.

@rabsr (Contributor, Author) commented Jul 1, 2021

@mfeurer @franchuterivera Can either of you help with the test cases? What's this error about?

@mfeurer (Contributor) commented Jul 9, 2021

The error message is about this patch breaking the parallel mode of Auto-sklearn. However, the error actually also happens in sequential mode; the difference is just that the parallel mode breaks the unit tests. Therefore, I suggest that you run the first failing test on your local machine to see what the issue is.

@rabsr (Contributor, Author) commented Aug 1, 2021

@mfeurer I have fixed all the test cases and rebased onto the development branch. Please check now.

@eddiebergman (Contributor) commented Aug 2, 2021

Hi @rabsr,

It looks good to me. Unfortunately I missed out on most of this discussion as it was happening, and the rebase has caused the file diff to show all of the development branch's changes, so my ability to do my own review is hampered.

Seeing as @mfeurer and @franchuterivera were happy up to commit 12e56e3, and all changes since were test and build fixes, I am happy to merge it, as it's been on the radar for a while.

If any issues are to appear down the line, I will create an issue and tag you in!

Thanks again and sorry for the slow response time :)

@eddiebergman eddiebergman merged commit 897fe40 into automl:development Aug 2, 2021
github-actions bot pushed a commit that referenced this pull request Aug 2, 2021
…ptions no preprocessing and feature type split (#977)
@rabsr rabsr deleted the config_data_preprocess branch August 2, 2021 12:00
eddiebergman pushed a commit that referenced this pull request Aug 18, 2021
…ocessing and feature type split (#977)

* making data preprocessing step configurable with two options no preprocessing and feature type split

* Fix: execution fails when data_preprocessor is no_preprocesing

* Incorporating review comments

* Fixing test cases; updating metalearning with updated hyperparameters

* Fixing examples

* Updating portfolios with new config

* Incorporated review comments and fix test case

* Test fixes

* Test fixes

* Fix metalearning config

* Remove unused imports

* Fix test cases

* Fix test cases and examples

* Adding more checks for include and exclude params

* Fix flake error

* Fix flake error

* Handling target_type in datatset_properties

* Fixes

* Fixes

* Fix error

* Fix test cases

* Adding datatype annotations

* Fix test cases

* Fix build

* Fix test case'

* Update stale.yaml

* Fix annotation type

* Update portfolios with new config

Co-authored-by: Rohit Agarwal <rohit.agarwal4@aexp.com>
5 participants