OneHotEncoder: expose categories detected for each feature during fit #1182

Merged
merged 6 commits into from Sep 16, 2020

Conversation

Collaborator

@dsherry dsherry commented Sep 16, 2020

Fix #1180

Added an API to the one-hot-encoder for accessing the list of categories associated with a given feature.

Also, generalized the BaseMeta abstraction to support subclasses overriding the list of methods to be validated.

Will retarget against main once #1179 is merged.

codecov bot commented Sep 16, 2020

Codecov Report

Merging #1182 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1182   +/-   ##
=======================================
  Coverage   99.92%   99.92%           
=======================================
  Files         196      196           
  Lines       11710    11729   +19     
=======================================
+ Hits        11701    11720   +19     
  Misses          9        9           
| Impacted Files | Coverage Δ |
|---|---|
| ...components/transformers/encoders/onehot_encoder.py | 100.00% <100.00%> (ø) |
| ...alml/tests/component_tests/test_one_hot_encoder.py | 100.00% <100.00%> (ø) |
| evalml/utils/base_meta.py | 100.00% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fd65f56...ab3b754.

@dsherry dsherry changed the base branch from main to ds_remove_encoder_base_class Sep 16, 2020
METHODS_TO_CHECK = ComponentBaseMeta.METHODS_TO_CHECK + ['categories']


class OneHotEncoder(Transformer, metaclass=OneHotEncoderMeta):
Collaborator Author

@dsherry dsherry Sep 16, 2020

I generalized BaseMeta so that specific subclasses can add to the list of methods to check. For the one-hot encoder, we want an error to be thrown if a user calls categories before fit.

It would be so cool if we found a way to roll this all up in a decorator... but I'm not sure how at the moment.
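For readers following along, the idea can be sketched roughly as below. This is a minimal, self-contained illustration and not the evalml code: the names BaseMeta, OneHotEncoderMeta, METHODS_TO_CHECK and check_for_fit come from this PR, while NotFittedError, FIT_METHODS, set_fit, the _is_fitted flag and ToyEncoder are assumptions made up for the example.

```python
from functools import wraps


class NotFittedError(Exception):
    """Hypothetical error raised when a component is used before fit."""


class BaseMeta(type):
    # Methods that mark an instance as fitted, and methods that must not
    # run before fit. Subclasses of the metaclass may extend either list.
    FIT_METHODS = ['fit']
    METHODS_TO_CHECK = ['transform']

    @staticmethod
    def set_fit(method):
        @wraps(method)
        def _set_fit(self, *args, **kwargs):
            result = method(self, *args, **kwargs)
            self._is_fitted = True  # assumed flag name
            return result
        return _set_fit

    @staticmethod
    def check_for_fit(method):
        @wraps(method)
        def _check_for_fit(self, *args, **kwargs):
            if not getattr(self, '_is_fitted', False):
                raise NotFittedError(
                    f'Call fit before calling {method.__name__}.')
            return method(self, *args, **kwargs)
        return _check_for_fit

    def __new__(cls, name, bases, dct):
        # cls is the metaclass in use, so a subclass's METHODS_TO_CHECK
        # is the list consulted here.
        for attribute in list(dct):
            if attribute in cls.FIT_METHODS:
                dct[attribute] = cls.set_fit(dct[attribute])
            if attribute in cls.METHODS_TO_CHECK:
                dct[attribute] = cls.check_for_fit(dct[attribute])
        return super().__new__(cls, name, bases, dct)


class OneHotEncoderMeta(BaseMeta):
    # Extend the inherited list instead of replacing it.
    METHODS_TO_CHECK = BaseMeta.METHODS_TO_CHECK + ['categories']


class ToyEncoder(metaclass=OneHotEncoderMeta):
    """Stand-in for the real encoder; X is a dict of column -> values."""

    def fit(self, X):
        self._categories = {col: sorted(set(vals)) for col, vals in X.items()}
        return self

    def categories(self, feature_name):
        return self._categories[feature_name]
```

With this sketch, calling ToyEncoder().categories('color') raises NotFittedError, and the same call succeeds once fit has run.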

Contributor

@freddyaboulton freddyaboulton Sep 16, 2020

Nice change!

@dsherry dsherry marked this pull request as ready for review Sep 16, 2020
         self._encoder = None
         super().__init__(parameters=parameters,
                          component_obj=None,
                          random_state=random_state)

-    def _get_cat_cols(self, X):
+    @staticmethod
Collaborator Author

@dsherry dsherry Sep 16, 2020

Unrelated change, why not :)

@@ -72,24 +80,24 @@ def fit(self, X, y=None):
         if not isinstance(X, pd.DataFrame):
             X = pd.DataFrame(X)
         X_t = X
-        cols_to_encode = self._get_cat_cols(X_t)
+        self._cols_to_encode = self._get_cat_cols(X_t)
Collaborator Author

@dsherry dsherry Sep 16, 2020

We now save _cols_to_encode, so that in categories we can index into the sklearn encoder's category list correctly.
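As a rough illustration of why the saved column list matters: a fitted sklearn OneHotEncoder exposes categories_ as a list with one entry per encoded column, in the order the columns were passed to fit, so a feature name has to be translated into a position first. The sketch below uses plain lists as stand-ins for that state; the values and the standalone categories function are assumptions for the example, not the code in this PR.

```python
# Stand-ins for state saved during fit (illustrative values only).
cols_to_encode = ['color', 'size']  # column order the encoder was fit on
encoder_categories = [['blue', 'red'], ['L', 'M', 'S']]  # like categories_


def categories(feature_name):
    """Return the categories recorded for feature_name during fit."""
    try:
        # Position of the feature among the encoded columns.
        index = cols_to_encode.index(feature_name)
    except ValueError:
        raise ValueError(f'Feature "{feature_name}" was not encoded') from None
    return encoder_categories[index]
```

Without the saved list, there would be no reliable way to map a column name back to its slot in the encoder's category list.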

@dsherry dsherry requested a review from jeremyliweishih Sep 16, 2020
                property_orig = dct[attribute]
                dct[attribute] = property(cls.check_for_fit(property_orig.__get__),
                                          property_orig.__set__,
                                          property_orig.__delattr__)
Collaborator Author

@dsherry dsherry Sep 16, 2020

Building off @jeremyliweishih's great code
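A small standalone sketch of the same property-wrapping idea follows. It is a variant and not the code in the diff: it rebuilds the property from its fget, fset and fdel attributes instead of the dunder methods used above, and NotFittedError, ToyComponent, n_features and the _is_fitted flag are all hypothetical names chosen for the example.

```python
class NotFittedError(Exception):
    """Hypothetical error raised when a property is read before fit."""


def check_for_fit(getter):
    """Wrap a property getter so it raises unless the instance is fitted."""
    def _checked(self):
        if not getattr(self, '_is_fitted', False):
            raise NotFittedError('Call fit before reading this property.')
        return getter(self)
    return _checked


class ToyComponent:
    def __init__(self):
        self._is_fitted = False  # assumed flag name
        self._n_features = None

    def fit(self, X):
        self._n_features = len(X[0])
        self._is_fitted = True
        return self

    @property
    def n_features(self):
        """Number of features seen during fit."""
        return self._n_features


# Rebuild the property with a guarded getter, keeping the original
# setter, deleter and docstring.
_orig = ToyComponent.__dict__['n_features']
ToyComponent.n_features = property(check_for_fit(_orig.fget),
                                   _orig.fset,
                                   _orig.fdel,
                                   _orig.__doc__)
```

Reading n_features on a fresh ToyComponent raises NotFittedError; after fit it returns the stored value.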

@dsherry dsherry force-pushed the ds_1180_one_hot_categories branch from 81c3332 to 2704eaa Compare Sep 16, 2020
@dsherry dsherry changed the base branch from ds_remove_encoder_base_class to main Sep 16, 2020
Contributor

@freddyaboulton freddyaboulton left a comment

@dsherry Looks good!

@dsherry dsherry force-pushed the ds_1180_one_hot_categories branch from 2704eaa to ab3b754 Compare Sep 16, 2020
Contributor

@bchen1116 bchen1116 left a comment

LGTM!

@dsherry dsherry merged commit ccc7e05 into main Sep 16, 2020
@dsherry dsherry deleted the ds_1180_one_hot_categories branch Sep 16, 2020
This was referenced Sep 17, 2020
Development

Successfully merging this pull request may close these issues.

OneHotEncoder: expose categories detected for each feature during fit
3 participants