
Remove usage of scikit-learn's LabelEncoder #3161

Merged: 14 commits into main, Dec 30, 2021

Conversation

angela97lin (Contributor)

Closes #3132

codecov bot commented Dec 20, 2021

Codecov Report

Merging #3161 (f81bb6a) into main (785be8f) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #3161     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        324     324             
  Lines      31233   31232      -1     
=======================================
  Hits       31128   31128             
+ Misses       105     104      -1     
Impacted Files | Coverage Δ
evalml/tests/component_tests/test_components.py | 99.3% <ø> (+0.2%) ⬆️
...ents/estimators/classifiers/catboost_classifier.py | 100.0% <100.0%> (ø)
...ents/estimators/classifiers/lightgbm_classifier.py | 100.0% <100.0%> (ø)
...nents/estimators/classifiers/xgboost_classifier.py | 100.0% <100.0%> (ø)
...ansformers/preprocessing/time_series_featurizer.py | 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeremyliweishih (Collaborator) left a comment:


Impeccable work!

@@ -167,7 +168,7 @@ def _encode_labels(self, y):
         if not is_integer_dtype(y_encoded):
             self._label_encoder = LabelEncoder()
             y_encoded = pd.Series(
-                self._label_encoder.fit_transform(y_encoded), dtype="int64"
+                self._label_encoder.fit_transform(None, y_encoded)[1], dtype="int64"
jeremyliweishih (Collaborator):

do we need the pd.Series call here when we don't have it in catboost?
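For reference, here is a minimal sketch contrasting the call before and after this change. It assumes evalml's LabelEncoder component is importable as shown below and that fit_transform(X, y) returns an (X, y_encoded) pair, which is what the [1] index in the diff implies; neither detail is confirmed elsewhere in this thread.

# Hedged sketch of the API change above; import path is an assumption.
import pandas as pd
from sklearn.preprocessing import LabelEncoder as SklearnLabelEncoder
from evalml.pipelines.components import LabelEncoder  # assumed import path

y = pd.Series(["cat", "dog", "cat", "bird"])

# Before this PR: scikit-learn's encoder is fit on y alone and returns a bare
# integer ndarray, hence the pd.Series(..., dtype="int64") wrapper.
y_old = pd.Series(SklearnLabelEncoder().fit_transform(y), dtype="int64")

# After this PR: evalml's component takes (X, y) and returns both back, so the
# encoded target is the second element of the returned tuple.
y_new = pd.Series(LabelEncoder().fit_transform(None, y)[1], dtype="int64")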

@@ -103,7 +103,7 @@ def _convert_bool_to_int(X):
     def _label_encode(self, y):
         if not is_integer_dtype(y):
             self._label_encoder = LabelEncoder()
-            y = pd.Series(self._label_encoder.fit_transform(y), dtype="int64")
+            y = pd.Series(self._label_encoder.fit_transform(None, y)[1], dtype="int64")
jeremyliweishih (Collaborator):

Same question here!

bchen1116 (Contributor) left a comment:


LGTM! Agreed with @jeremyliweishih: the label encoder component already returns the y value as an int-encoded series, so I don't think we need to do the extra initialization and type-casting in lightgbm and xgboost
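A sketch of the simplification being suggested here, under the same assumptions as the earlier sketch (import path, tuple return) plus the reviewers' claim that the component already returns an integer-encoded series; the exact returned dtype is not confirmed in this thread.

# Hedged sketch of the suggested follow-up; import path is an assumption.
import pandas as pd
from evalml.pipelines.components import LabelEncoder  # assumed import path

y = pd.Series(["a", "b", "a", "c"])

# As merged in this PR: re-wrap and re-cast the already-encoded target.
y_wrapped = pd.Series(LabelEncoder().fit_transform(None, y)[1], dtype="int64")

# Suggested simplification: rely on the component's own return value directly.
_, y_direct = LabelEncoder().fit_transform(None, y)
print(y_wrapped.dtype, pd.Series(y_direct).dtype)  # expected to match if the suggestion holds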

@angela97lin angela97lin merged commit 377b4ad into main Dec 30, 2021
@angela97lin angela97lin deleted the 3132_remove_label_encoder branch December 30, 2021 18:40
@chukarsten chukarsten mentioned this pull request Jan 6, 2022