Add ability for users to define which class is "positive" for label encoder in binary classification case #3033
Conversation
Codecov Report

```diff
@@           Coverage Diff           @@
##            main    #3033    +/- ##
=====================================
- Coverage   99.8%    99.8%    -0.0%
=====================================
  Files        312      312
  Lines      30184    30229      +45
=====================================
+ Hits       30095    30139      +44
- Misses        89       90       +1
```

Continue to review full report at Codecov.
@angela97lin Thank you for picking up this issue! I think we can simplify the implementation if we just forgo the sklearn implementation and roll our own.
I think this will work:
```python
def fit(self, X, y):
    """Fits the label encoder.

    Args:
        X (pd.DataFrame): The input training data of shape [n_samples, n_features]. Ignored.
        y (pd.Series): The target training data of length [n_samples].

    Returns:
        self

    Raises:
        ValueError: If input `y` is None.
    """
    if y is None:
        raise ValueError("y cannot be None!")
    y = infer_feature_types(y)
    sorted_uniques = sorted(y.unique())
    self.mapping = {val: i for i, val in enumerate(sorted_uniques)}
    if self.parameters["positive_label"] is not None:
        if len(self.mapping) != 2:
            raise ValueError(
                "positive_label should only be set for binary classification targets. Otherwise, positive_label should be None."
            )
        self.mapping = {
            val: int(val == self.parameters["positive_label"])
            for val in sorted_uniques
        }
    self.inverse_mapping = {encoded: original for original, encoded in self.mapping.items()}
    return self
```
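The mapping logic can be exercised on its own, outside the component class. This is a hypothetical standalone sketch (the `build_mapping` helper is not in the PR) of the same behavior: sorted-order encoding by default, and a forced 0/1 assignment when a positive label is given for a binary target.

```python
def build_mapping(labels, positive_label=None):
    """Build a label -> int mapping mirroring the encoder's fit logic."""
    sorted_uniques = sorted(set(labels))
    mapping = {val: i for i, val in enumerate(sorted_uniques)}
    if positive_label is not None:
        if len(mapping) != 2:
            raise ValueError(
                "positive_label should only be set for binary classification targets."
            )
        # Force the chosen class to encode as 1, the other as 0.
        mapping = {val: int(val == positive_label) for val in sorted_uniques}
    return mapping
```

For example, `build_mapping(["b", "a"])` yields `{"a": 0, "b": 1}`, while `build_mapping(["b", "a"], positive_label="a")` flips it to `{"a": 1, "b": 0}`.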
Then in transform and inverse_transform we just call map on the series with the right mapping. Maybe there's a performance hit, but we'll have to check.
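A minimal sketch of that transform/inverse_transform flow with plain pandas (ignoring the woodwork typing layer; the `mapping` values here are illustrative, not from the PR):

```python
import pandas as pd

mapping = {"no": 0, "yes": 1}  # as built in fit
inverse_mapping = {encoded: original for original, encoded in mapping.items()}

y = pd.Series(["yes", "no", "yes"])

# Guard against labels fit never saw.
unseen = set(y.unique()) - set(mapping)
if unseen:
    raise ValueError(f"y contains previously unseen labels: {unseen}")

y_t = y.map(mapping)                 # transform
y_back = y_t.map(inverse_mapping)    # inverse_transform round-trips
```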
```python
                f"y contains previously unseen labels: {y_unique_values.difference(self.mapping.keys())}"
            )
        y_t = y_ww.map(self.mapping)
        return X, ww.init_series(y_t, logical_type="integer")
```
Without setting the logical type, we end up with a categorical column, which defeats the purpose 😂
```diff
@@ -261,6 +260,7 @@ def predict_in_sample(self, X, y, X_train, y_train, objective=None):
             proba = proba.iloc[:, 1]
             if objective is None:
                 predictions = proba > self.threshold
+                predictions = predictions.astype(int)
```
This is necessary because `proba > self.threshold` returns bools, which we don't know how to map back (since we expect ints!). Although we treated bools and 0/1 as interchangeable, it actually matters now that we're storing a mapping. Previously with sklearn this wasn't an issue because they used np.searchsorted, which treats ints and bools the same. This is no longer true. :)
Nice! This looks good to me, solid test coverage!
Looks good to me @angela97lin ! Thank you
Closes #2926.
General question: since we now have behavior specific to our label encoder, should we move away from sklearn's and write our own entirely? We don't get much additional functionality from sklearn's implementation, and we could end up with different stack traces depending on whether positive_label is defined (if defined, our stack traces and errors; if not, sklearn's).