Use label encoder component instead of doing encoding on the pipeline level #2821
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##            main   #2821     +/-  ##
=======================================
+ Coverage   99.7%   99.7%    +0.1%
=======================================
  Files        302     302
  Lines      28392   28394       +2
=======================================
+ Hits       28296   28301       +5
+ Misses        96      93       -3
```

Continue to review the full report at Codecov.
```diff
@@ -65,8 +65,7 @@ def transform(self, X, y=None):
             ValueError: If input `y` is None.
         """
         if y is None:
-            raise ValueError("y cannot be None!")
-
+            return X, y
```
Updating to not raise an error since we want to be able to use this in our component graph transform. If y is not given, we simply return the inputs :)
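The pass-through behavior described here can be sketched as follows. This is a toy stand-in, not the actual evalml transformer; the class name and the placeholder transformation are hypothetical.

```python
class PassThroughTargetTransformer:
    """Toy illustration of returning the inputs when no target is given."""

    def transform(self, X, y=None):
        if y is None:
            # No target supplied (e.g. called from a component graph
            # transform): hand the inputs back instead of raising.
            return X, y
        # Normal path: a real component would transform y here.
        transformed_y = [str(v).lower() for v in y]  # placeholder transformation
        return X, transformed_y
```

With this shape, `transform(X)` is always safe to call inside a component graph, whether or not a target is available.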
```diff
@@ -325,6 +324,6 @@ def _fast_scorer(pipeline, features, X, y, objective):
         preds = pipeline.estimator.predict_proba(features)
     else:
         preds = pipeline.estimator.predict(features)
-    preds = pipeline.inverse_transform(preds)
+        preds = pipeline.inverse_transform(preds)
```
Only want to transform if not probabilities. Previously, inverse_transform only occurred for regression problems.
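The branch being described can be sketched like this; the function name and the stand-in objects are hypothetical, not evalml's real API. Probability outputs stay in encoded space, while class predictions get decoded back to the original labels:

```python
def score_predictions(estimator, inverse_transform, features, needs_proba):
    """Sketch of the scoring branch: only decode non-probability outputs."""
    if needs_proba:
        # Probabilities are per-class columns; decoding does not apply.
        preds = estimator.predict_proba(features)
    else:
        # Class predictions come back as encoded ints; map them back
        # to the original labels before scoring.
        preds = estimator.predict(features)
        preds = inverse_transform(preds)
    return preds
```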
```python
        super().__init__(
            component_graph,
            custom_name=custom_name,
            parameters=parameters,
            random_seed=random_seed,
        )
        try:
            self._encoder = self.component_graph.get_component("Label Encoder")
```
This assumes that we have a label encoder named Label Encoder. I'm okay with this assumption, though of course it might not work in every case aka if someone decides to name their label encoder "mY lABeL enCoder" 😅 . Can file an issue separately to track this.
Is it worth trying to robustify this? Like maybe lower-casing the string, removing the space, and comparing it to "labelencoder"? Does the current behavior of adding a label encoder involve both adding a label encoder that our library names and allowing the user to add a custom-named label encoder? I seem to recall both are an option.
I think perhaps the most / an even more robust way would be to not rely on names at all, but instead find the LabelEncoder component class. Granted, that's not easy to do right now so I had filed #2878 which I think could help with this :)
I can file a separate issue about the label encoder specifically!
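The class-based lookup suggested above could look roughly like this. The helper name and the "components as a flat iterable" shape are assumptions for illustration, not evalml's actual API:

```python
def find_component_of_type(components, component_class):
    """Return the first component that is an instance of component_class,
    regardless of what name it was registered under."""
    for component in components:
        if isinstance(component, component_class):
            return component
    return None
```

Matching on the class sidesteps the naming problem entirely: "mY lABeL enCoder" and "Label Encoder" both resolve as long as they are instances of the same component class.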
```diff
@@ -926,18 +1059,6 @@ def safe_init_component_with_njobs_1(component_class):
         component = component_class()
         return component
 
-    @staticmethod
-    def safe_init_pipeline_with_njobs_1(pipeline_class):
```
This was only used in one place, which assumed the component graph was a list of components. Since I had to update that call site and the helper is no longer used anywhere, I'm deleting it.
Looks good to me! Glad that we're able to switch more over to a component so it's more transparent for users.
It would be useful to run perf tests on this to make sure performance isn't affected. I'd expect it to be the same, but it would be nice to have confirmation.
Great work, Angela! Nothing blocking, just a few nitty questions that you can take or leave as you desire!
```python
        except ValueError as e:
            raise ValueError(str(e))
```
Do you know why we only want to raise the ValueError? It seems kind of weird here to have an explicit return if the transform is successful but an implicit return of None if it isn't (in a non-ValueError'y way).
But also this is out of scope for your PR, so don't sweat it. I can file an issue for it later, lol.
I'm not sure if this fully answers your question, but I believe we only run into this block if we get passed a target that has new unseen values. Ex: label encoder fitted with values "a", "b", and we try to encode an input that has values "a", "b", "c". We consider this to be a bad input. Since the label encoder component will throw a ValueError, we catch it and raise it again.
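The unseen-value case described here can be illustrated with a toy encoder (the class is a stand-in, not evalml's Label Encoder component): fitted on "a" and "b", it raises a `ValueError` when asked to transform a "c".

```python
class ToyLabelEncoder:
    """Toy label encoder to illustrate the unseen-value failure mode."""

    def fit(self, y):
        # Map each distinct label seen at fit time to an integer code.
        self.mapping = {label: i for i, label in enumerate(sorted(set(y)))}
        return self

    def transform(self, y):
        unseen = set(y) - set(self.mapping)
        if unseen:
            # New values at transform time are treated as bad input.
            raise ValueError(f"y contains previously unseen labels: {sorted(unseen)}")
        return [self.mapping[label] for label in y]
```

The `try`/`except` in the pipeline then just catches this `ValueError` and re-raises it, so the bad-input error surfaces to the caller.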
```diff
@@ -73,6 +76,9 @@ def _get_preprocessing_components(
     """
     pp_components = []
 
+    if is_classification(problem_type):
+        pp_components.append(LabelEncoder)
```
classic
sweating profusely
Closes #2648
Perf test docs here: https://alteryx.atlassian.net/wiki/spaces/PS/pages/1088684090/Using+label+encoder+as+a+component+in+pipelines