[ADD] Add column transformer #305

Merged
merged 26 commits into from Nov 4, 2021

Contributor

@ravinkohli ravinkohli commented Oct 26, 2021

Addresses #304

Specifically, this PR does the following:

  1. Adds a simple imputer to impute categorical columns.
  2. Combines the imputer and the encoder into a column transformer.
  3. Moves the comparator to a class method in tabular_feature_validator.py.
  4. Adapts the tests to these changes.
  5. Adds a test for the comparator.
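
For readers unfamiliar with the pattern in items 1 and 2, here is a minimal sketch of an imputer and an encoder chained inside a scikit-learn column transformer. The toy DataFrame and its column names are assumptions made for illustration; this is not the code from the PR.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one categorical column with a missing value
# and one numerical column that should pass through untouched.
df = pd.DataFrame({
    "color": pd.Series(["red", np.nan, "blue", "red"], dtype=object),
    "size": [1.0, 2.0, 3.0, 4.0],
})

# Impute first (constant fill), then ordinal-encode the imputed values.
categorical_pipeline = make_pipeline(
    SimpleImputer(strategy="constant"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)

# Only the categorical column goes through the pipeline;
# every other column is appended unchanged after it.
column_transformer = ColumnTransformer(
    [("categorical_pipeline", categorical_pipeline, ["color"])],
    remainder="passthrough",
)

transformed = column_transformer.fit_transform(df)
print(transformed)
```

Note that the transformed categorical columns come first in the output and the passthrough columns follow, so the output column order can differ from the input order.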

@ravinkohli ravinkohli added this to In progress in release v0.1.0 via automation Oct 26, 2021
@ravinkohli ravinkohli added the enhancement New feature or request label Oct 26, 2021
@ravinkohli ravinkohli moved this from In progress to Review in progress in release v0.1.0 Oct 26, 2021
codecov bot commented Oct 26, 2021

Codecov Report

Merging #305 (a04ba08) into development (9002937) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@               Coverage Diff               @@
##           development     #305      +/-   ##
===============================================
+ Coverage        81.74%   81.82%   +0.08%     
===============================================
  Files              151      151              
  Lines             8655     8646       -9     
  Branches          1330     1321       -9     
===============================================
  Hits              7075     7075              
+ Misses            1109     1105       -4     
+ Partials           471      466       -5     
Impacted Files Coverage Δ
autoPyTorch/api/base_task.py 84.46% <ø> (ø)
autoPyTorch/data/base_feature_validator.py 100.00% <100.00%> (ø)
autoPyTorch/data/tabular_feature_validator.py 91.19% <100.00%> (+4.88%) ⬆️
...ts/setup/network_backbone/InceptionTimeBackbone.py 98.92% <100.00%> (ø)
autoPyTorch/ensemble/ensemble_builder.py 72.32% <0.00%> (-0.84%) ⬇️
...peline/components/training/trainer/base_trainer.py 96.82% <0.00%> (+1.05%) ⬆️
...ipeline/components/setup/network_backbone/utils.py 88.72% <0.00%> (+1.50%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9002937...a04ba08.

Contributor

@nabenabe0928 nabenabe0928 left a comment

Thanks for the PR
I checked everything except autoPyTorch/data/tabular_feature_validator.py.

test/test_data/test_feature_validator.py (resolved)
@@ -51,7 +51,7 @@ def __init__(self,
         self.dtypes = []  # type: typing.List[str]
         self.column_order = []  # type: typing.List[str]

-        self.encoder = None  # type: typing.Optional[BaseEstimator]
+        self.column_transformer = None  # type: typing.Optional[BaseEstimator]
Contributor

Suggested change
self.column_transformer = None # type: typing.Optional[BaseEstimator]
self.column_transformer: typing.Optional[BaseEstimator] = None

Contributor Author

I wanted to leave these changes for your PR. Once my PR is merged, you can rebase and make the edit there. That way we don't make the same change twice, and the merge stays easy.
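
For context, the two annotation styles discussed in this thread behave the same at runtime; they differ only in syntax. A minimal sketch, where `FeatureValidatorSketch` and the stand-in `BaseEstimator` class are hypothetical names used only to keep the example self-contained:

```python
import typing


class BaseEstimator:
    """Hypothetical stand-in for sklearn.base.BaseEstimator."""


class FeatureValidatorSketch:
    def __init__(self) -> None:
        # Style used in the diff: a type comment after the assignment.
        self.column_order = []  # type: typing.List[str]
        # Suggested style: an inline variable annotation (Python 3.6+).
        self.column_transformer: typing.Optional[BaseEstimator] = None


validator = FeatureValidatorSketch()
print(validator.column_order, validator.column_transformer)
```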

Comment on lines +23 to +67
def _create_column_transformer(
    preprocessors: Dict[str, List[BaseEstimator]],
    categorical_columns: List[str],
) -> ColumnTransformer:
    """
    Given a dictionary of preprocessors, this function
    creates a sklearn column transformer with appropriate
    columns associated with their preprocessors.

    Args:
        preprocessors (Dict[str, List[BaseEstimator]]):
            Dictionary containing list of numerical and categorical preprocessors.
        categorical_columns (List[str]):
            List of names of categorical columns

    Returns:
        ColumnTransformer
    """

    categorical_pipeline = make_pipeline(*preprocessors['categorical'])

    return ColumnTransformer([
        ('categorical_pipeline', categorical_pipeline, categorical_columns)],
        remainder='passthrough'
    )


def get_tabular_preprocessors() -> Dict[str, List[BaseEstimator]]:
    """
    This function creates a Dictionary containing a list
    of numerical and categorical preprocessors

    Returns:
        Dict[str, List[BaseEstimator]]
    """
    preprocessors: Dict[str, List[BaseEstimator]] = dict()

    # Categorical Preprocessors
    onehot_encoder = preprocessing.OrdinalEncoder(handle_unknown='use_encoded_value',
                                                  unknown_value=-1)
    categorical_imputer = SimpleImputer(strategy='constant', copy=False)

    preprocessors['categorical'] = [categorical_imputer, onehot_encoder]

    return preprocessors
Contributor

Is there a reason you have not made this part the same as in refactor_development_regularization_cocktail?

Contributor Author

This PR is intended to replace the previous _impute_nan_in_categoricals with a scikit-learn imputer, and thus to add a column transformer. If we would like to make it the same as refactor_development_regularization_cocktail, I think that is better done by merging its PR #161.

X = self.impute_nan_in_categories(X)

X = self.encoder.transform(X)
X = self.column_transformer.transform(X)
Contributor

Just to make sure:
We are imputing in the transform, right?

Contributor Author

yes.
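
To illustrate what "imputing in the transform" means here, the sketch below fits an imputer and an encoder on training data and then applies them to unseen data. The toy data is an assumption, and the pipeline is a simplified stand-in for the validator's column transformer rather than the PR's code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test split of a single categorical column.
train = pd.DataFrame({"color": pd.Series(["red", "blue", np.nan], dtype=object)})
test = pd.DataFrame({"color": pd.Series([np.nan, "green", "red"], dtype=object)})

pipeline = make_pipeline(
    SimpleImputer(strategy="constant"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)

# fit() learns the categories, including the constant fill value
# that replaces the NaN present in the training data.
pipeline.fit(train)

# transform() imputes NaN in the new data before encoding it;
# categories never seen during fit are mapped to -1.
encoded_test = pipeline.transform(test)
print(encoded_test.ravel())
```

With these choices, a missing value in the test data gets the same code as a missing value in the training data, while the unseen category `green` gets the `unknown_value` of -1.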


return preprocessors


class TabularFeatureValidator(BaseFeatureValidator):
Contributor

Could you add a docstring describing the attributes, such as categories, enc_columns, column_transformer, numerical_columns, categorical_columns, all_nan_columns, and column_order?

Contributor Author

Definitely.

@nabenabe0928 nabenabe0928 merged commit a11caf4 into automl:development Nov 4, 2021
release v0.1.0 automation moved this from Review in progress to Done Nov 4, 2021
github-actions bot pushed a commit that referenced this pull request Nov 4, 2021