Request: skipping without error when there are no variables to transform #599

Open
david-cortes opened this issue Jan 20, 2023 · 4 comments

Comments

@david-cortes
Contributor

Transformers in this package have a nice feature: if the list of variable names is not supplied, they automatically apply to all numerical or all categorical variables, depending on what the transformer does.

Sometimes one wants to run automated feature selection as a step before or after a transformer. If, for example, the pipeline contains a transformer like MatchCategories and the selector drops all categorical variables, the pipeline will raise an error later on, because there are no columns left for the transformer.

It would be nice to have an option to turn off the error on empty variable lists.
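To make the scenario concrete, here is a minimal, library-free sketch of the failure. It is not Feature-engine code: the table is a plain dict of lists, and `find_categorical`, `ToyCategoryMatcher` and `drop_columns` are hypothetical names used only for illustration.

```python
# Toy stand-in for a table: column name -> list of values.
# All names below are hypothetical; this is not Feature-engine code.

def find_categorical(table):
    """Return the names of columns whose values are all strings."""
    return [name for name, values in table.items()
            if all(isinstance(v, str) for v in values)]


class ToyCategoryMatcher:
    """Remembers the categories seen in each categorical column during fit."""

    def fit(self, table):
        variables = find_categorical(table)
        if not variables:
            # Mirrors the behaviour described in this issue: with no
            # suitable columns, fit fails instead of doing nothing.
            raise ValueError("No categorical variables found in this table.")
        self.categories_ = {name: sorted(set(table[name])) for name in variables}
        return self


def drop_columns(table, to_drop):
    """Toy selection step: return a copy of the table without `to_drop`."""
    return {name: values for name, values in table.items() if name not in to_drop}


data = {"colour": ["red", "blue", "red"], "size": [1.0, 2.5, 3.0]}

# Fitting works while a categorical column is present.
matcher = ToyCategoryMatcher().fit(data)

# An earlier selection step happens to drop every categorical column ...
selected = drop_columns(data, to_drop={"colour"})

# ... so the next step in the pipeline now raises.
try:
    ToyCategoryMatcher().fit(selected)
    failed = False
except ValueError:
    failed = True
```

In an automated pipeline the user does not know in advance which columns the selection step will keep, which is why the error is hard to guard against from the outside.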

@solegalli
Collaborator

@glevv what do you think about this?

@glevv
Contributor

glevv commented Jan 21, 2023

If I understand it correctly, this will only work for estimators that can transform without calling fit first, which is incompatible with sklearn notation.

@david-cortes
Contributor Author

> If I understand it correctly, this will only work for estimators that can transform without calling fit first, which is incompatible with sklearn notation.

Not really. When there are no columns, a call to fit just needs to return the same object, and a call to transform just needs to return the same data that was passed as input.
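A rough sketch of that behaviour, assuming a toy dict-of-lists table rather than a real DataFrame. `PassThroughWhenEmpty` is a hypothetical class, not part of Feature-engine; the point is only that fit is still called first and still returns the object itself.

```python
class PassThroughWhenEmpty:
    """Toy transformer: upper-cases string columns, or does nothing if none exist.

    Hypothetical sketch; the dict-of-lists table is an illustrative
    assumption, not Feature-engine's API.
    """

    def fit(self, table):
        # Learn which columns to act on; an empty list is allowed here.
        self.variables_ = [name for name, values in table.items()
                           if all(isinstance(v, str) for v in values)]
        return self  # fit returns the same object, as usual

    def transform(self, table):
        if not self.variables_:
            return table  # nothing to do: hand back the input unchanged
        return {name: [v.upper() for v in values] if name in self.variables_ else values
                for name, values in table.items()}


numeric_only = {"size": [1.0, 2.5, 3.0]}
transformer = PassThroughWhenEmpty()
fitted = transformer.fit(numeric_only)
output = transformer.transform(numeric_only)

mixed = {"colour": ["red", "blue"], "size": [1.0, 2.5]}
mixed_output = PassThroughWhenEmpty().fit(mixed).transform(mixed)
```

With no string columns, `output` is the very same object that was passed in; with a string column present, the transformer behaves normally.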

@solegalli
Collaborator

At the moment, if encoders find, for example, that the dataset has no categorical variables, they raise an error and do not perform the encoding. If you set ignore_format=True, they will also encode numerical variables, but this is not what @david-cortes wants.

Numerical transformers also raise an error if they find no numerical variables in the dataset.

This was done intentionally. My idea when designing these transformers was to stop users from inadvertently applying encoding methods to numerical variables, or numerical transformations to categorical variables.

As a clear example, if you set the strategy of SimpleImputer() to "most_frequent", the transformer will impute both numerical and categorical variables with the mode. However, this method is really only suitable for categorical variables; numerical variables should be imputed with the mean or the median. This is the type of behaviour that Feature-engine is designed to prevent.
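A small illustration of the imputation point, using only Python's standard `statistics` module rather than SimpleImputer itself. The `ages` and `colours` values are made up for the example.

```python
import statistics

# Made-up columns with one missing value each (None marks the gap).
ages = [23.0, 31.0, 31.0, 45.0, 52.0, 67.0, None]
colours = ["red", "blue", "red", None]

observed_ages = [a for a in ages if a is not None]
observed_colours = [c for c in colours if c is not None]

# For a categorical column, the most frequent value is a natural fill.
colour_fill = statistics.mode(observed_colours)

# For a numerical column, the mode only reflects which value happened to
# repeat, while the median describes the centre of the distribution.
age_mode_fill = statistics.mode(observed_ages)
age_median_fill = statistics.median(observed_ages)

imputed_ages = [age_median_fill if a is None else a for a in ages]
```

Here the mode of `ages` is 31.0 only because that value appears twice, whereas the median is 38.0.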

Hence, if a categorical encoder encounters no categorical variable in the dataset, it will fail, because it does not have a suitable input for the transformation.

@david-cortes is asking that, instead of failing, they just pass. That is, if no categorical variable is found in the dataset, instead of failing, just carry out fit and transform without modifying the dataset.

My concern is that most users will not go into the source code, and some don't even read the documentation. So, if we allow the transformers to pass and do nothing, users might believe that the transformer worked, whatever that means. Whereas if we raise an error, we encourage them to think about what might be going on.

@david-cortes is not the first to request this. Someone else requested it for selectors: see #566 and the somewhat related #567.
