Use label encoder component instead of doing encoding on the pipeline level #2821

Merged
angela97lin merged 64 commits into main from 2648_encoding_as_component
Oct 14, 2021

@angela97lin angela97lin commented Sep 21, 2021

@angela97lin angela97lin self-assigned this Sep 21, 2021
codecov bot commented Sep 29, 2021

Codecov Report

Merging #2821 (cae9eb0) into main (939ff9f) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2821     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        302     302             
  Lines      28392   28394      +2     
=======================================
+ Hits       28296   28301      +5     
+ Misses        96      93      -3     
Impacted Files Coverage Δ
evalml/tests/automl_tests/test_automl_utils.py 100.0% <ø> (ø)
evalml/__init__.py 100.0% <100.0%> (ø)
evalml/automl/automl_search.py 99.9% <100.0%> (+0.1%) ⬆️
...alml/model_understanding/permutation_importance.py 100.0% <100.0%> (ø)
...nderstanding/prediction_explanations/explainers.py 100.0% <100.0%> (ø)
evalml/objectives/cost_benefit_matrix.py 100.0% <100.0%> (ø)
.../pipelines/binary_classification_pipeline_mixin.py 100.0% <100.0%> (ø)
evalml/pipelines/classification_pipeline.py 100.0% <100.0%> (ø)
.../components/transformers/encoders/label_encoder.py 100.0% <100.0%> (ø)
.../pipelines/time_series_classification_pipelines.py 99.0% <100.0%> (ø)
... and 12 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -65,8 +65,7 @@ def transform(self, X, y=None):
             ValueError: If input `y` is None.
         """
         if y is None:
-            raise ValueError("y cannot be None!")
-
+            return X, y
Contributor Author

Updating to not raise an error since we want to be able to use this in our component graph transform. If y is not given, we simply return the inputs :)
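
The pass-through behavior described in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not the evalml implementation: `SketchLabelEncoder` and its internals are hypothetical, and plain lists stand in for pandas objects.

```python
class SketchLabelEncoder:
    """Hypothetical stand-in for a label encoder component (not evalml's class)."""

    def __init__(self):
        self._mapping = None

    def fit(self, X, y):
        # Learn an integer code for every distinct label in the target.
        self._mapping = {label: i for i, label in enumerate(sorted(set(y)))}
        return self

    def transform(self, X, y=None):
        # No target supplied (for example at predict time):
        # return the inputs untouched instead of raising.
        if y is None:
            return X, y
        return X, [self._mapping[label] for label in y]


encoder = SketchLabelEncoder().fit(None, ["b", "a", "b"])
X = [[1.0], [2.0]]
passthrough = encoder.transform(X)          # features only
encoded = encoder.transform(X, ["a", "b"])  # features and target
```

Returning `(X, y)` unchanged lets the same component sit in a graph that is traversed both with and without a target.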

@@ -325,6 +324,6 @@ def _fast_scorer(pipeline, features, X, y, objective):
         preds = pipeline.estimator.predict_proba(features)
     else:
         preds = pipeline.estimator.predict(features)
-    preds = pipeline.inverse_transform(preds)
+        preds = pipeline.inverse_transform(preds)
Contributor Author

We only want to inverse-transform the predictions when they are not probabilities. Previously, inverse_transform only occurred for regression problems.
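
A rough sketch of the intent, under stated assumptions: `FakeEstimator`, `FakePipeline`, `fast_predictions`, and the `needs_proba` flag are all hypothetical names invented for this illustration, since the real branching condition is not visible in the hunk.

```python
class FakeEstimator:
    """Hypothetical estimator that predicts encoded integer labels."""

    def predict(self, features):
        return [0 for _ in features]

    def predict_proba(self, features):
        return [[0.9, 0.1] for _ in features]


class FakePipeline:
    """Hypothetical pipeline exposing only what this sketch needs."""

    def __init__(self, classes):
        self.estimator = FakeEstimator()
        self._classes = classes

    def inverse_transform(self, preds):
        # Map encoded integers back to the original labels.
        return [self._classes[p] for p in preds]


def fast_predictions(pipeline, features, needs_proba):
    if needs_proba:
        # Probabilities are returned as-is: there are no labels to decode.
        preds = pipeline.estimator.predict_proba(features)
    else:
        preds = pipeline.estimator.predict(features)
        # Only label predictions are decoded back to the original classes.
        preds = pipeline.inverse_transform(preds)
    return preds


pipeline = FakePipeline(["no", "yes"])
labels = fast_predictions(pipeline, [[1], [2]], needs_proba=False)
probas = fast_predictions(pipeline, [[1], [2]], needs_proba=True)
```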

        super().__init__(
            component_graph,
            custom_name=custom_name,
            parameters=parameters,
            random_seed=random_seed,
        )
        try:
            self._encoder = self.component_graph.get_component("Label Encoder")
Contributor Author

This assumes that we have a label encoder named Label Encoder. I'm okay with this assumption, though of course it might not work in every case, e.g. if someone decides to name their label encoder "mY lABeL enCoder" 😅. I can file a separate issue to track this.

Contributor

Is it worth trying to robustify this? For example, we could lower-case the string, remove the spaces, and compare it to "labelencoder". Does the current behavior allow both a label encoder that our library names and a custom-named label encoder added by the user? I seem to recall both are options.

Contributor Author

I think an even more robust way would be to not rely on names at all, but instead to find the LabelEncoder component by its class. Granted, that's not easy to do right now, so I filed #2878, which I think could help with this :)

I can file a separate issue about the label encoder specifically!
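
To make the three options in this thread concrete, here is a small sketch. It assumes a component graph can be modeled as a plain dict from component name to instance; the classes and the three `find_*` helpers are hypothetical and are not part of evalml's API.

```python
class LabelEncoder:
    """Hypothetical placeholder for the label encoder component class."""


class Imputer:
    """Hypothetical placeholder for some other component."""


def find_by_exact_name(components, name="Label Encoder"):
    # Name-based lookup: misses any component the user renamed or re-cased.
    return components.get(name)


def find_by_normalized_name(components, target="labelencoder"):
    # Lower-case the name and drop spaces before comparing.
    for name, component in components.items():
        if name.lower().replace(" ", "") == target:
            return component
    return None


def find_by_class(components, component_class):
    # Ignore names entirely and match on the component's type.
    for component in components.values():
        if isinstance(component, component_class):
            return component
    return None


recased = {"Imputer": Imputer(), "LABEL encoder": LabelEncoder()}
renamed = {"Imputer": Imputer(), "mY lABeL enCoder": LabelEncoder()}
```

In this sketch, normalizing the name recovers `"LABEL encoder"` but still misses `"mY lABeL enCoder"`, which is why a class-based lookup is the more robust of the three.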

@@ -926,18 +1059,6 @@ def safe_init_component_with_njobs_1(component_class):
         component = component_class()
         return component
 
-    @staticmethod
-    def safe_init_pipeline_with_njobs_1(pipeline_class):
Contributor Author

This was only used in one place, where it assumed that the component graph was a list of components. Since I had to update that call site and the method is no longer used, I'm deleting it.

Contributor

@bchen1116 bchen1116 left a comment

Looks good to me! Glad that we're able to switch more over to a component so it's more transparent for users. The code changes look good!

It would be useful to run perf tests on this to confirm that performance doesn't change with this update! I would expect it to be the same, but it would be nice to have confirmation.

Contributor

@chukarsten chukarsten left a comment

Great work, Angela! Nothing blocking, just a few nitty questions that you can take or leave as you desire!

Comment on lines +68 to +69
        except ValueError as e:
            raise ValueError(str(e))
Contributor

Do you know why we only want to raise the ValueError? It seems kind of weird here to have an explicit return if the transform is successful but an implicit return of None if it isn't (in a non-ValueError'y way).

But also this is out of scope for your PR, so don't sweat it. I can file an issue for it later, lol.

Contributor Author

I'm not sure if this fully answers your question, but I believe we only run into this block if we get passed a target that has new, unseen values. For example: a label encoder fitted with values "a" and "b" is asked to encode an input that has values "a", "b", and "c". We consider this to be a bad input. Since the label encoder component will throw a ValueError, we catch it and raise it again.
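
The unseen-label case from this reply can be reproduced with a small sketch. `SketchEncoder` and `encode_targets` are hypothetical names, and the error message is invented for illustration; only the `except ValueError` / `raise ValueError(str(e))` shape comes from the hunk under review.

```python
class SketchEncoder:
    """Hypothetical encoder that rejects labels it did not see during fit."""

    def fit(self, y):
        self._mapping = {label: i for i, label in enumerate(sorted(set(y)))}
        return self

    def transform(self, y):
        unseen = sorted(set(y) - set(self._mapping))
        if unseen:
            raise ValueError(f"y contains previously unseen labels: {unseen}")
        return [self._mapping[label] for label in y]


def encode_targets(encoder, y):
    try:
        return encoder.transform(y)
    except ValueError as e:
        # Catch the encoder's error and raise it again as a ValueError.
        raise ValueError(str(e))


encoder = SketchEncoder().fit(["a", "b"])
ok = encode_targets(encoder, ["a", "b"])

try:
    encode_targets(encoder, ["a", "b", "c"])
    message = None
except ValueError as e:
    message = str(e)
```

Note that in Python any exception type other than `ValueError` is not caught by this handler and propagates to the caller unchanged.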

@@ -73,6 +76,9 @@ def _get_preprocessing_components(
     """
     pp_components = []
 
+    if is_classification(problem_type):
+        pp_components.append(LabelEncoder)
Contributor

classic

Contributor Author

sweating profusely
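
For reference, the hunk discussed in this thread (adding `LabelEncoder` to the preprocessing components for classification problems) can be sketched as below. The `LabelEncoder` class, the string-based `is_classification` helper, and the simplified function signature are stand-ins assumed for this illustration, not the library's actual definitions.

```python
class LabelEncoder:
    """Hypothetical placeholder for the label encoder component class."""


# Assumed set of classification problem type names for this sketch.
CLASSIFICATION_TYPES = {
    "binary",
    "multiclass",
    "time series binary",
    "time series multiclass",
}


def is_classification(problem_type):
    # Simplified stand-in: compares plain strings rather than an enum.
    return problem_type in CLASSIFICATION_TYPES


def get_preprocessing_components(problem_type):
    pp_components = []

    # Classification targets get a label encoder as an explicit component,
    # so encoding is visible in the pipeline rather than hidden inside it.
    if is_classification(problem_type):
        pp_components.append(LabelEncoder)

    # Other preprocessing components would be appended here.
    return pp_components


binary_components = get_preprocessing_components("binary")
regression_components = get_preprocessing_components("regression")
```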

@angela97lin angela97lin merged commit 3607e03 into main Oct 14, 2021
@angela97lin angela97lin deleted the 2648_encoding_as_component branch October 14, 2021 01:26
@chukarsten chukarsten mentioned this pull request Oct 14, 2021
Successfully merging this pull request may close these issues.

Do label encoding at the component level rather than pipeline level
3 participants