Add ability for users to define which class is "positive" for label encoder in binary classification case #3033

Merged

15 commits merged into main from 2926_positive_label on Nov 15, 2021

Conversation

@angela97lin (Contributor) commented on Nov 10, 2021:

Closes #2926.

General question: since we have specific behavior in the label encoder, should we just move away from sklearn's implementation and write our own entirely? We don't get much additional functionality from sklearn's, and as it stands we can end up with different stack traces depending on whether positive_label is defined (if defined, our errors and stack traces; if not, sklearn's).
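For context, here's a hypothetical usage sketch of the feature this PR adds (the import path and exact call signature are assumed from the diff, not quoted from it):

    import pandas as pd
    from evalml.pipelines.components import LabelEncoder  # import path assumed

    X = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})
    y = pd.Series(["benign", "malignant", "benign"])

    # Default: classes encode in sorted order, so "benign" -> 0, "malignant" -> 1.
    _, y_default = LabelEncoder().fit_transform(X, y)

    # With positive_label, the user chooses which class becomes 1:
    _, y_pos = LabelEncoder(positive_label="benign").fit_transform(X, y)
    # y_pos: 1, 0, 1 ("benign" is now the positive class)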

@angela97lin self-assigned this on Nov 10, 2021
@codecov bot commented on Nov 10, 2021:

Codecov Report

Merging #3033 (4fc3ed5) into main (c5fe27c) will decrease coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #3033     +/-   ##
=======================================
- Coverage   99.8%   99.8%   -0.0%     
=======================================
  Files        312     312             
  Lines      30184   30229     +45     
=======================================
+ Hits       30095   30139     +44     
- Misses        89      90      +1     
Impacted Files | Coverage Δ
--- | ---
evalml/tests/automl_tests/test_automl.py | 99.5% <ø> (ø)
evalml/tests/pipeline_tests/test_pipelines.py | 99.8% <ø> (ø)
evalml/model_understanding/decision_boundary.py | 100.0% <100.0%> (ø)
.../components/transformers/encoders/label_encoder.py | 100.0% <100.0%> (ø)
.../pipelines/time_series_classification_pipelines.py | 99.0% <100.0%> (+0.1%) ⬆️
evalml/tests/component_tests/test_label_encoder.py | 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py | 98.8% <0.0%> (-0.1%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c5fe27c...4fc3ed5

@angela97lin marked this pull request as ready for review on November 11, 2021
@freddyaboulton (Contributor) left a comment:

@angela97lin Thank you for picking up this issue! I think we can simplify the implementation if we just forgo the sklearn implementation and roll our own.

I think this will work:

    def fit(self, X, y):
        """Fits the label encoder.

        Args:
            X (pd.DataFrame): The input training data of shape [n_samples, n_features]. Ignored.
            y (pd.Series): The target training data of length [n_samples].

        Returns:
            self

        Raises:
            ValueError: If input `y` is None.
        """
        if y is None:
            raise ValueError("y cannot be None!")
        y = infer_feature_types(y)
        sorted_uniques = sorted(y.unique())
        # Default: encode the classes in sorted order as 0..n-1.
        self.mapping = {val: i for i, val in enumerate(sorted_uniques)}
        if self.parameters["positive_label"] is not None:
            if len(self.mapping) != 2:
                raise ValueError(
                    "positive_label should only be set for binary classification targets. Otherwise, positive_label should be None."
                )
            # Binary case: the chosen positive label encodes to 1, the other class to 0.
            self.mapping = {
                val: int(val == self.parameters["positive_label"])
                for val in sorted_uniques
            }
        # Invert the mapping (encoded int -> original label) for inverse_transform.
        self.inverse_mapping = {i: val for val, i in self.mapping.items()}

        return self

Then in transform and inverse_transform, we just call map on the series with the right mapping.

Maybe there's a performance hit but we'll have to check.
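For concreteness, a minimal sketch of what those two methods might look like under this approach (the bodies are assumed, not the merged code; infer_feature_types and ww as used elsewhere in this PR):

    def transform(self, X, y=None):
        """Transforms the target using the fitted mapping (sketch)."""
        if y is None:
            return X, y
        y_ww = infer_feature_types(y)
        unseen = set(y_ww.unique()).difference(self.mapping.keys())
        if unseen:
            raise ValueError(f"y contains previously unseen labels: {unseen}")
        y_t = y_ww.map(self.mapping)
        # Force an integer logical type; otherwise woodwork can re-infer Categorical.
        return X, ww.init_series(y_t, logical_type="integer")

    def inverse_transform(self, y):
        """Maps encoded values back to the original labels (sketch)."""
        y_ww = infer_feature_types(y)
        return ww.init_series(y_ww.map(self.inverse_mapping))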

f"y contains previously unseen labels: {y_unique_values.difference(self.mapping.keys())}"
)
y_t = y_ww.map(self.mapping)
return X, ww.init_series(y_t, logical_type="integer")
@angela97lin (Author) commented:
Without setting the logical type, we end up with a categorical column which defeats the purpose 😂
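A small illustration of that failure mode (mine, not from the PR; pandas 1.x behavior):

    import pandas as pd
    import woodwork as ww

    # A categorical target stays categorical through .map() when the mapping
    # is one-to-one, so woodwork would infer Categorical for the encoded ints:
    y = pd.Series(["a", "b", "a"], dtype="category")
    y_t = y.map({"a": 0, "b": 1})  # dtype: category, categories [0, 1]

    ww.init_series(y_t)                          # inferred as Categorical
    ww.init_series(y_t, logical_type="integer")  # forced to Integer, as intended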

Diff context (predict_in_sample):

    @@ -261,6 +260,7 @@ def predict_in_sample(self, X, y, X_train, y_train, objective=None):
         proba = proba.iloc[:, 1]
         if objective is None:
             predictions = proba > self.threshold
    +        predictions = predictions.astype(int)
@angela97lin (Author) commented:
This is necessary because proba > self.threshold returns bools, which we don't know how to map back (since we expect ints!). Although we previously treated bools and 0/1 as interchangeable, it actually matters now that we're storing a mapping.

Previously, with sklearn, this wasn't an issue because it used np.searchsorted, which treats ints and bools the same. This is no longer true. :)
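To make the distinction concrete, a small illustration (mine, not from the PR):

    import numpy as np

    classes = np.array([0, 1])
    # np.searchsorted coerces bools to ints, so a bool prediction series
    # behaved exactly like 0/1 under sklearn's implementation:
    np.searchsorted(classes, [False, True])  # -> array([0, 1])
    # A dict keyed on ints does no such coercion when used as a pandas
    # mapping, hence the explicit .astype(int) above.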

@bchen1116 (Contributor) left a comment:
Nice! This looks good to me, solid test coverage!

@freddyaboulton (Contributor) left a comment:
Looks good to me, @angela97lin! Thank you

@angela97lin merged commit 4c8a8c1 into main on Nov 15, 2021
@angela97lin deleted the 2926_positive_label branch on November 15, 2021
Successfully merging this pull request may close these issues:

- Binary classification: ability for users to define which class is "positive" (#2926)