
Adds utility methods to calculate and graph permutation importances #860

Merged
merged 15 commits into from Jun 23, 2020

Conversation


@angela97lin angela97lin commented Jun 16, 2020

Closes #155

Graph looks like:

[screenshot: permutation importance graph]


codecov bot commented Jun 16, 2020

Codecov Report

Merging #860 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff            @@
##           master     #860    +/-   ##
========================================
  Coverage   99.74%   99.75%            
========================================
  Files         195      195            
  Lines        8365     8467   +102     
========================================
+ Hits         8344     8446   +102     
  Misses         21       21            
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.00% <ø> (ø)
evalml/pipelines/pipeline_base.py 100.00% <ø> (ø)
evalml/pipelines/utils.py 100.00% <100.00%> (ø)
evalml/tests/pipeline_tests/test_graphs.py 100.00% <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 99.82% <100.00%> (+0.01%) ⬆️

Powered by Codecov. Last update d63209e...3a6fea4.


angela97lin commented Jun 18, 2020

@kmax12 @dsherry

I wanted to start working on this and took what Max had outlined in the issue. In the current PR, I've implemented get_permutation_importances as another instance method of PipelineBase but realized we should discuss this before I move forward with testing/more implementation.

Here are some proposed implementations/APIs:

  • Add get_permutation_importances as an instance method. This is what's done here.

    • Pros: allows X, y so users can choose whether they want to use test / training data to calculate. Doesn't require any holding of state.
    • Cons: is it confusing to have both get_permutation_importances and the feature_importances property?
  • Add permutation_importances as a class property.

    • Pros:
    • Cons: We'd then have to use training data to calculate during fit().
  • Replace feature_importances as a class property.

    • Pros: Less confusing than having two attributes
    • Cons: Lose the ability to get feature_importances from the estimator.
  • Add get_permutation_importances as util function

    • Pros: standalone
    • Cons: standalone? lol

These were just off the top of my head, would love to discuss more / over a call if that's easier!

@angela97lin angela97lin marked this pull request as ready for review Jun 18, 2020
@angela97lin angela97lin marked this pull request as draft Jun 18, 2020

dsherry commented Jun 18, 2020

@angela97lin great, thanks for starting the discussion.

@angela97lin and I just did a call to discuss. We'll keep permutation importance as a separate method for now, not attached to the pipelines/estimators. This is in line with sklearn's current API. We can add permutation importance as a pipeline/estimator member later, once we've had a release or two to play with it on real data.

Here's the usage we want to support in this PR:

automl = AutoRegressionSearch()
automl.search(X_train, y_train)
# get an untrained copy of a particular pipeline/parameter set (coming soon in #719)
pipeline = automl.get_pipeline(id)
# retrain on entire training data
pipeline.train(X_train, y_train)
permutation_importance_train = permutation_importance(pipeline, X_train, y_train)
permutation_importance_test = permutation_importance(pipeline, X_test, y_test)
graph_permutation_importance(permutation_importance_train, ...)
graph_permutation_importance(permutation_importance_test, ...)

and for estimators

estimator = LogisticRegressionClassifier()
estimator.fit(X_train, y_train)
permutation_importance_train = permutation_importance(estimator, X_train, y_train)
permutation_importance_test = permutation_importance(estimator, X_test, y_test)
graph_permutation_importance(permutation_importance_train, ...)
graph_permutation_importance(permutation_importance_test, ...)
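
For context, the quantity a helper like this computes can be sketched as follows. This is an illustrative, self-contained version of the sklearn-style calculation, not evalml's actual implementation; the function, model, and scorer names here are hypothetical:

```python
import numpy as np

def permutation_importance_sketch(model, X, y, score_fn, n_repeats=5, seed=0):
    # Importance of a feature = baseline score minus the mean score
    # obtained after randomly shuffling that feature's column.
    rng = np.random.default_rng(seed)
    baseline = score_fn(y, model.predict(X))
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, col])  # break the feature/target relationship
            drops.append(baseline - score_fn(y, model.predict(X_perm)))
        importances.append(float(np.mean(drops)))
    return importances

# Toy check: a "model" that only ever looks at column 0.
class FirstColumnModel:
    def predict(self, X):
        return X[:, 0]

X = np.arange(20.0).reshape(10, 2)
y = X[:, 0].copy()
neg_mse = lambda y_true, y_pred: -float(np.mean((y_true - y_pred) ** 2))
imp = permutation_importance_sketch(FirstColumnModel(), X, y, neg_mse)
# Shuffling column 0 hurts the score (positive importance);
# shuffling column 1, which the model ignores, changes nothing (importance 0).
```

The key design point, mirrored in the proposed API above, is that the score can be computed on either training or test data, since the caller passes X and y explicitly.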

Also @angela97lin I think when you're saying "class method" here, you're referring to an instance method (i.e. takes self as first arg), as opposed to a class method (uses @classmethod decorator), right?


angela97lin commented Jun 18, 2020

@dsherry Yup! Updated my comment to "instance method" to make this more clear :)

@angela97lin angela97lin changed the title Adds permutation importance Adds utility methods to calculate and graph permutation importances Jun 18, 2020

angela97lin commented Jun 18, 2020

@dsherry I don't think we can currently support this for estimators since they don't have a score method. Two options here:

  1. We can wrap the estimator in a pipeline object such that the pipeline component graph is just the estimator, and things should work.
  2. We can just support pipelines for now.

I think I'm more in favor of the latter, unless you see a use case for estimators. Does that sound okay to you?


dsherry commented Jun 18, 2020

@angela97lin ah, good point. I think its fine to only support pipelines for now.

@angela97lin angela97lin requested review from dsherry and kmax12 Jun 19, 2020
@angela97lin angela97lin marked this pull request as ready for review Jun 19, 2020

@dsherry dsherry left a comment

Looks good! I left a few suggestions. Not much blocking: adding the graph method to the init/docs, fixing the graph docstring, and some testing comments.

@angela97lin angela97lin requested review from dsherry and kmax12 Jun 22, 2020

angela97lin commented Jun 22, 2020

@dsherry @kmax12 I addressed all of your comments, if you could give it another look :D

@dsherry dsherry left a comment

Good stuff! Left comments, mostly on testing, but none blocking.

"""
go = import_or_raise("plotly.graph_objects", error_msg="Cannot find dependency plotly.graph_objects")
perm_importance = get_permutation_importances(pipeline, X, y, objective)
perm_importance['importance'] = abs(perm_importance['importance'])
@dsherry dsherry Jun 22, 2020

Oh interesting. What does a negative importance mean here?

@kmax12 kmax12 Jun 22, 2020

hmm, not sure we should be doing the abs here...

@angela97lin angela97lin Jun 23, 2020

Hm, not 100% sure either, but based on https://github.com/scikit-learn/scikit-learn/blob/fd237278e895b42abe8d8d09105cbb82dc2cbba7/sklearn/inspection/_permutation_importance.py#L13, they're using the difference between the two scores to calculate permutation importance. So if the importance is negative, then the baseline score minus the permuted score is negative, meaning the permuted feature performed better than the original. My guess would be that this means the feature is really not useful if noise performs better than it? 🤔

@dsherry dsherry Jun 23, 2020

Got it. Yeah, I think that's the right interpretation of a negative score. In that case I'm thinking we shouldn't do abs and should show the sign in these plots, right? Would be fine to file this for later.
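
To make the sign convention from this thread concrete (the scores below are made up for illustration): with importance defined as baseline score minus mean permuted score, a feature whose shuffled version scores better than the original comes out negative, and taking abs() would plot it as if it were a helpful feature:

```python
# Hypothetical accuracy scores, for illustration only.
baseline_score = 0.80                   # pipeline scored on intact X
permuted_scores = [0.85, 0.83, 0.84]    # scores after shuffling one feature, 3 repeats
importance = round(baseline_score - sum(permuted_scores) / len(permuted_scores), 4)
print(importance)       # -0.04: the shuffled feature scored *better* than the original
print(abs(importance))  # 0.04: abs() hides the sign, making the feature look helpful
```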

kmax12 approved these changes Jun 22, 2020
return pd.DataFrame(mean_perm_importance, columns=["feature", "importance"])


def graph_permutation_importances(pipeline, X, y, objective, show_all_features=False):
@kmax12 kmax12 Jun 22, 2020

can this function share the plotting functionality with our current Pipeline.graph_feature_importances?

@angela97lin angela97lin Jun 23, 2020

Hmmm, do you mean abstracting some code away in a common helper function? I'd say a lot of things are similar between the two so maybe! But we can address this again in #868 where we update both methods?

@kmax12 kmax12 Jun 23, 2020

sounds like a plan

@angela97lin angela97lin merged commit 56c4521 into master Jun 23, 2020
2 checks passed
@angela97lin angela97lin deleted the 155_perm_importance branch Jun 23, 2020
@dsherry
Copy link
Collaborator

dsherry commented Jun 23, 2020

@angela97lin I forgot to mention this, but I tried this out myself on a couple of datasets and it looks great! This was from the iris dataset:

[screenshot: permutation importance graph on the iris dataset]

Development

Successfully merging this pull request may close these issues.

Use permutation importance for pipeline feature importance