Fix permutation importance failing when target is categorical #3017

Merged 5 commits on Nov 8, 2021
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -11,6 +11,7 @@ Release Notes
         * Added AutoML function to access ensemble pipeline's input pipelines IDs :pr:`3011`
     * Fixes
         * Fixed bug where ``Oversampler`` didn't consider boolean columns to be categorical :pr:`2980`
+        * Fixed permutation importance failing when target is categorical :pr:`3017`
         * Updated estimator and pipelines' ``predict``, ``predict_proba``, ``transform``, ``inverse_transform`` methods to preserve input indices :pr:`2979`
     * Changes
     * Documentation Changes
4 changes: 3 additions & 1 deletion evalml/model_understanding/permutation_importance.py
@@ -5,6 +5,7 @@

 from evalml.objectives.utils import get_objective
 from evalml.problem_types import is_classification
+from evalml.problem_types.utils import is_regression
 from evalml.utils import infer_feature_types


@@ -324,6 +325,7 @@ def _fast_scorer(pipeline, features, X, y, objective):
         preds = pipeline.estimator.predict_proba(features)
     else:
         preds = pipeline.estimator.predict(features)
-        preds = pipeline.inverse_transform(preds)
+        if is_regression(pipeline.problem_type):
Contributor

I think we're OK for classification here. In _fast_permutation_importance we encode the target with the pipeline._encode_targets method, so the estimator predictions will also be encoded.

Of course, if the pipeline does not have an encoder but has string-valued targets, this would fail, but I would say that's a pipeline definition bug rather than a permutation importance bug.

The fact that our classification objectives only support integer-valued targets, and that a pipeline needs an encoder in order to be scored, may not be super clear though.

Contributor Author

Agreed! I wish there were a way to make this clearer. Pipelines require a label encoder not only for scoring but also for fit, so it still feels consistent, but I wonder if moving forward we could make these types of error messages clearer.

+            preds = pipeline.inverse_transform(preds)
     score = pipeline._score(X, y, preds, objective)
     return score if objective.greater_is_better else -score
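
The review discussion above can be illustrated with a minimal sketch. It assumes a hypothetical `TinyLabelEncoder` and `accuracy` helper (neither is evalml API) and made-up labels. It only shows why decoding already-encoded classification predictions before scoring them against an encoded target would break the comparison, which is presumably why the change limits `inverse_transform` to regression.

```python
class TinyLabelEncoder:
    """Hypothetical stand-in for a pipeline's label encoder (not evalml code)."""

    def fit(self, y):
        # Map each distinct label to an integer, in sorted order.
        self.classes_ = sorted(set(y))
        self._to_int = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, y):
        return [self._to_int[label] for label in y]

    def inverse_transform(self, encoded):
        return [self.classes_[i] for i in encoded]


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction equals target."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y = ["fraud", "ok", "ok", "fraud"]   # made-up string-valued target
encoder = TinyLabelEncoder().fit(y)
y_encoded = encoder.transform(y)     # target as the scorer would see it
estimator_preds = [0, 1, 0, 0]       # pretend estimator output, already encoded

# Encoded target vs. encoded predictions: a meaningful score.
score_encoded = accuracy(y_encoded, estimator_preds)

# Encoded target vs. decoded (string) predictions: nothing can match.
score_decoded = accuracy(y_encoded, encoder.inverse_transform(estimator_preds))

print(score_encoded, score_decoded)
```

In this toy setup the decoded predictions are strings while the target is integers, so every comparison fails. In the real code path the mismatch may surface as an error rather than a zero score.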
@@ -478,14 +478,20 @@ def test_get_permutation_importance_invalid_objective(


 @pytest.mark.parametrize("data_type", ["np", "pd", "ww"])
+@pytest.mark.parametrize("use_numerical_target", [True, False])
 def test_get_permutation_importance_binary(
-    X_y_binary,
     data_type,
+    use_numerical_target,
+    X_y_binary,
+    fraud_100,
     logistic_regression_binary_pipeline_class,
     binary_test_objectives,
     make_data_type,
 ):
-    X, y = X_y_binary
+    if use_numerical_target:
+        X, y = X_y_binary
+    else:
+        X, y = fraud_100
     X = make_data_type(data_type, X)
     y = make_data_type(data_type, y)

@@ -510,8 +516,11 @@ def test_get_permutation_importance_binary(
                 pipeline, X, y, col, objective, fast=False
             )
         )
+        permutation_importance_sorted_row = permutation_importance_sorted[
+            permutation_importance_sorted["feature"] == col
+        ]["importance"]
         np.testing.assert_almost_equal(
-            permutation_importance_sorted["importance"][col],
+            permutation_importance_sorted_row.iloc[0],
             permutation_importance_one_col,
         )
