Use label encoder component instead of doing encoding on the pipeline level #2821

angela97lin · 2021-09-21T17:58:22Z

Perf test docs here: https://alteryx.atlassian.net/wiki/spaces/PS/pages/1088684090/Using+label+encoder+as+a+component+in+pipelines

codecov · 2021-09-29T18:04:59Z

Codecov Report

Merging #2821 (cae9eb0) into main (939ff9f) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2821     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        302     302             
  Lines      28392   28394      +2     
=======================================
+ Hits       28296   28301      +5     
+ Misses        96      93      -3

| Impacted Files | Coverage Δ | |
|---|---|---|
| evalml/tests/automl_tests/test_automl_utils.py | `100.0% <ø> (ø)` | |
| evalml/__init__.py | `100.0% <100.0%> (ø)` | |
| evalml/automl/automl_search.py | `99.9% <100.0%> (+0.1%)` | ⬆️ |
| ...alml/model_understanding/permutation_importance.py | `100.0% <100.0%> (ø)` | |
| ...nderstanding/prediction_explanations/explainers.py | `100.0% <100.0%> (ø)` | |
| evalml/objectives/cost_benefit_matrix.py | `100.0% <100.0%> (ø)` | |
| .../pipelines/binary_classification_pipeline_mixin.py | `100.0% <100.0%> (ø)` | |
| evalml/pipelines/classification_pipeline.py | `100.0% <100.0%> (ø)` | |
| .../components/transformers/encoders/label_encoder.py | `100.0% <100.0%> (ø)` | |
| .../pipelines/time_series_classification_pipelines.py | `99.0% <100.0%> (ø)` | |

... and 12 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

angela97lin · 2021-10-06T18:20:13Z

evalml/pipelines/components/transformers/encoders/label_encoder.py

@@ -65,8 +65,7 @@ def transform(self, X, y=None):
            ValueError: If input `y` is None.
        """
        if y is None:
-            raise ValueError("y cannot be None!")
-
+            return X, y


Updating to not raise an error since we want to be able to use this in our component graph transform. If y is not given, we simply return the inputs :)
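A minimal sketch of the pass-through behavior described above, in plain Python. `SketchLabelEncoder` is a hypothetical stand-in, not evalml's `LabelEncoder`; only the `y is None` early return mirrors the diff, and the sorted-label mapping is an assumption made for illustration.

```python
class SketchLabelEncoder:
    """Hypothetical stand-in for a target label encoder (not evalml's class)."""

    def fit(self, X, y):
        # Map each distinct label to an integer, in sorted order (assumption).
        self._mapping = {label: i for i, label in enumerate(sorted(set(y)))}
        return self

    def transform(self, X, y=None):
        # No target supplied (e.g. at predict time): return the inputs unchanged
        # instead of raising, so the component can sit inside a graph transform.
        if y is None:
            return X, y
        return X, [self._mapping[label] for label in y]


encoder = SketchLabelEncoder().fit(X=[[0], [1], [2]], y=["b", "a", "b"])
X_out, y_out = encoder.transform([[3]])                   # no target given
X_enc, y_enc = encoder.transform([[0], [1]], ["a", "b"])  # target given
```

With this shape, callers that only have features can still run the full transform without special-casing the encoder.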

angela97lin · 2021-10-06T20:42:41Z

evalml/model_understanding/permutation_importance.py

@@ -325,6 +324,6 @@ def _fast_scorer(pipeline, features, X, y, objective):
        preds = pipeline.estimator.predict_proba(features)
    else:
        preds = pipeline.estimator.predict(features)
-    preds = pipeline.inverse_transform(preds)
+        preds = pipeline.inverse_transform(preds)


We only want to call inverse_transform when the predictions are not probabilities. Previously, inverse_transform only occurred for regression problems.
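A rough sketch of the branching in the hunk above, using only the standard library. `ToyEstimator`, `fast_score`, `decode`, and both metrics are hypothetical names for illustration; they are not evalml's `_fast_scorer` or its objectives.

```python
CLASSES = ["no", "yes"]  # hypothetical original labels, indexed by encoded value


class ToyEstimator:
    """Hypothetical estimator that predicts on encoded (integer) labels."""

    def predict(self, features):
        return [1 if x > 0 else 0 for x in features]

    def predict_proba(self, features):
        return [[0.2, 0.8] if x > 0 else [0.9, 0.1] for x in features]


def decode(preds):
    # Inverse of label encoding: integers back to the original labels.
    return [CLASSES[p] for p in preds]


def accuracy(y_true, preds):
    return sum(t == p for t, p in zip(y_true, preds)) / len(y_true)


def mean_true_class_proba(y_true, probs):
    return sum(row[CLASSES.index(t)] for t, row in zip(y_true, probs)) / len(y_true)


def fast_score(estimator, features, y_true, score_fn, needs_proba):
    if needs_proba:
        # Probabilities are scored as-is; decoding them would be meaningless.
        preds = estimator.predict_proba(features)
    else:
        # Class predictions are decoded back to the original label space.
        preds = decode(estimator.predict(features))
    return score_fn(y_true, preds)


features, y_true = [1, -1, 2], ["yes", "no", "no"]
acc = fast_score(ToyEstimator(), features, y_true, accuracy, needs_proba=False)
proba_score = fast_score(
    ToyEstimator(), features, y_true, mean_true_class_proba, needs_proba=True
)
```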

angela97lin · 2021-10-06T21:34:25Z

evalml/pipelines/classification_pipeline.py

        super().__init__(
            component_graph,
            custom_name=custom_name,
            parameters=parameters,
            random_seed=random_seed,
        )
+        try:
+            self._encoder = self.component_graph.get_component("Label Encoder")


This assumes that the label encoder is named "Label Encoder". I'm okay with this assumption, though of course it won't work in every case, e.g. if someone decides to name their label encoder "mY lABeL enCoder" 😅. I can file a separate issue to track this.

Is it worth trying to make this more robust? Maybe lower-case the string, remove the space, and compare it to "labelencoder"? Does the current behavior involve both adding a label encoder that our library names and allowing the user to add a custom-named label encoder? I seem to recall both are an option.

I think perhaps the most / an even more robust way would be to not rely on names at all, but instead find the LabelEncoder component class. Granted, that's not easy to do right now so I had filed #2878 which I think could help with this :)

I can file a separate issue about the label encoder specifically!
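One way to avoid relying on names, as suggested in this thread, is to search for the component by class. A minimal sketch, assuming the graph can be modeled as a name-to-instance dict; `Component`, `LabelEncoder`, `Imputer`, and `find_component_by_class` here are hypothetical stand-ins, not evalml's classes or its `ComponentGraph` API.

```python
class Component:
    """Hypothetical base class for illustration only."""


class LabelEncoder(Component):
    pass


class Imputer(Component):
    pass


def find_component_by_class(components, component_class):
    """Return the first (name, component) pair matching the class, else None."""
    for name, component in components.items():
        if isinstance(component, component_class):
            return name, component
    return None


# The encoder is found regardless of what the user chose to call it.
graph = {"Imputer": Imputer(), "mY lABeL enCoder": LabelEncoder()}
match = find_component_by_class(graph, LabelEncoder)
missing = find_component_by_class({"Imputer": Imputer()}, LabelEncoder)
```

Returning `None` when nothing matches leaves it to the caller to decide whether a missing encoder is an error.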

angela97lin · 2021-10-06T21:42:26Z

evalml/tests/conftest.py

@@ -926,18 +1059,6 @@ def safe_init_component_with_njobs_1(component_class):
                component = component_class()
            return component

-        @staticmethod
-        def safe_init_pipeline_with_njobs_1(pipeline_class):


This was only used in one place, which assumed that the component graph was a list of components. Since I had to update that caller and the helper is no longer used, I'm deleting it.

bchen1116

Looks good to me! Glad that we're able to switch more over to a component so it's more transparent for users. The code changes look good!

It would be useful to run perf tests on this to ensure that performance doesn't change with this PR! I would expect it to be the same, but it would be nice to have confirmation.

chukarsten

Great work, Angela! Nothing blocking, just a few nitty questions that you can take or leave as you desire!

chukarsten · 2021-10-13T14:21:19Z

evalml/pipelines/classification_pipeline.py

        super().__init__(
            component_graph,
            custom_name=custom_name,
            parameters=parameters,
            random_seed=random_seed,
        )
+        try:
+            self._encoder = self.component_graph.get_component("Label Encoder")


Is it worth trying to make this more robust? Maybe lower-case the string, remove the space, and compare it to "labelencoder"? Does the current behavior involve both adding a label encoder that our library names and allowing the user to add a custom-named label encoder? I seem to recall both are an option.
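The normalization floated here could look roughly like the following. `is_label_encoder_name` is a hypothetical helper, not part of evalml; note that it tolerates case and whitespace differences but still misses names with extra words.

```python
def is_label_encoder_name(name):
    """Compare a component name to "labelencoder", ignoring case and whitespace."""
    return "".join(name.split()).lower() == "labelencoder"


default_name = is_label_encoder_name("Label Encoder")      # library default name
spaced_name = is_label_encoder_name("label   ENCODER")     # odd case and spacing
custom_name = is_label_encoder_name("mY lABeL enCoder")    # extra word "mY"
```

The last case shows the limit of a name-based check: a custom prefix defeats it.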

chukarsten · 2021-10-13T14:27:02Z

evalml/pipelines/classification_pipeline.py

+            except ValueError as e:
+                raise ValueError(str(e))


Do you know why we only want to raise the ValueError? It seems kind of weird here to have an explicit return if the transform is successful but an implicit return of None if it isn't (in a non-ValueError'y way).

But also this is out of scope for your PR, so don't sweat it. I can file an issue for it later, lol.

I'm not sure if this fully answers your question, but I believe we only run into this block if we get passed a target that has new unseen values. Ex: label encoder fitted with values "a", "b", and we try to encode an input that has values "a", "b", "c". We consider this to be a bad input. Since the label encoder component will throw a ValueError, we catch it and raise it again.
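The unseen-value case can be sketched in plain Python as follows. `SketchLabelEncoder` and `encode_targets` are hypothetical stand-ins written for this example; the error message text is an assumption, and only the catch-and-re-raise shape mirrors the hunk under review.

```python
class SketchLabelEncoder:
    """Hypothetical encoder that rejects labels it was not fitted on."""

    def fit(self, y):
        self._mapping = {label: i for i, label in enumerate(sorted(set(y)))}
        return self

    def transform(self, y):
        unseen = sorted(set(y) - set(self._mapping))
        if unseen:
            raise ValueError(f"y contains previously unseen labels: {unseen}")
        return [self._mapping[label] for label in y]


def encode_targets(encoder, y):
    # Catch the encoder's ValueError and raise it again with the same message.
    try:
        return encoder.transform(y)
    except ValueError as e:
        raise ValueError(str(e))


encoder = SketchLabelEncoder().fit(["a", "b"])
encoded = encode_targets(encoder, ["b", "a"])

message = None
try:
    encode_targets(encoder, ["a", "b", "c"])  # "c" was never seen during fit
except ValueError as err:
    message = str(err)
```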

chukarsten · 2021-10-13T14:37:08Z

evalml/pipelines/utils.py

@@ -73,6 +76,9 @@ def _get_preprocessing_components(
    """
    pp_components = []

+    if is_classification(problem_type):
+        pp_components.append(LabelEncoder)


sweating profusely
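For context on the hunk above, here is a small sketch of putting the encoder first only for classification problems. The problem-type strings, the `is_classification` stub, and the string component names are assumptions for illustration; the real function works with evalml's own types and component classes.

```python
# Assumed set of classification problem types (illustrative only).
CLASSIFICATION_TYPES = {
    "binary",
    "multiclass",
    "time series binary",
    "time series multiclass",
}


def is_classification(problem_type):
    return problem_type in CLASSIFICATION_TYPES


def get_preprocessing_components(problem_type, other_components):
    pp_components = []
    # The target encoder goes first so later components see encoded targets.
    if is_classification(problem_type):
        pp_components.append("Label Encoder")
    pp_components.extend(other_components)
    return pp_components


binary_components = get_preprocessing_components("binary", ["Imputer"])
regression_components = get_preprocessing_components("regression", ["Imputer"])
```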

init

dbab01b

angela97lin self-assigned this Sep 21, 2021

angela97lin added 14 commits September 24, 2021 10:33

more impl, more to replace

97ac15a

Merge branch 'main' into 2648_encoding_as_component

f6f5591

add tests

bd2c0a9

merge

6f969d7

release notes

695826a

Merge branch 'main' into 2852_label_encoder

47b5de3

oops fix release notes annotation

b2555fa

Merge branch '2852_label_encoder' of github.com:alteryx/evalml into 2852_label_encoder

8f1014d

revert changes to pipelines

69b25de

clean up tests

17ba69f

fix tests

0911c17

Merge branch 'main' into 2852_label_encoder

f963a8c

Merge branch 'main' into 2852_label_encoder

4b68b4a

retrigger tests

d5da2f4

angela97lin added 13 commits September 30, 2021 01:24

fix one classification test

2721033

merge

325973e

remove some tests

a6819e8

try to add back encode_targets

98e7016

oops caught wrong error

e01c0a1

Merge branch 'main' into 2648_encoding_as_component

9afcc8d

cleanup some tests, more to go

1ad42a3

Merge branch '2648_encoding_as_component' of github.com:alteryx/evalml into 2648_encoding_as_component

515b683

Merge branch 'main' into 2648_encoding_as_component

8037afb

cleanup more tests and logic

f730712

Merge branch '2648_encoding_as_component' of github.com:alteryx/evalml into 2648_encoding_as_component

db39747

some updates to implementation

ddbcd9f

revert mock for tests

ee7fcea

revert label encoder

c9b7a95

angela97lin commented Oct 6, 2021

angela97lin added 2 commits October 6, 2021 14:27

more cleanup and fix classes test

2c5b968

more cleanup

b838d18

angela97lin commented Oct 6, 2021

angela97lin added 4 commits October 6, 2021 17:49

clean up more tests and impl

b2e14ca

more cleanup :]

0c9d1c2

fix fixture

1ad7b03

more cleanup

e50a2d1

angela97lin marked this pull request as ready for review October 7, 2021 04:33

angela97lin requested review from dsherry, bchen1116, christopherbunn, eccabay, jeremyliweishih and ParthivNaresh October 7, 2021 04:33

bchen1116 approved these changes Oct 7, 2021

angela97lin and others added 4 commits October 8, 2021 14:35

merging

8ada99e

Merge branch 'main' into 2648_encoding_as_component

6ebb56c

Merge branch 'main' into 2648_encoding_as_component

4e4bfda

Merge branch 'main' into 2648_encoding_as_component

3a9deec

chukarsten approved these changes Oct 13, 2021

Merge branch 'main' into 2648_encoding_as_component

8714160

angela97lin mentioned this pull request Oct 14, 2021

Update label encoder in pipeline implementation to look for LabelEncoder class instead of looking for "Label Encoder" #2906

Open

combine two lines

cae9eb0

angela97lin merged commit 3607e03 into main Oct 14, 2021

angela97lin deleted the 2648_encoding_as_component branch October 14, 2021 01:26

chukarsten mentioned this pull request Oct 14, 2021

Release v0.35.0 #2918

Merged
