Adds utility methods to calculate and graph permutation importances #860
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##           master     #860    +/-   ##
========================================
  Coverage   99.74%   99.75%
========================================
  Files         195      195
  Lines        8365     8467    +102
========================================
+ Hits         8344     8446    +102
  Misses         21       21
========================================
```
I wanted to start working on this and took what Max had outlined in the issue. Here are some proposed implementations/APIs for the current PR:

These were just off the top of my head; would love to discuss more, or over a call if that's easier!
@angela97lin great, thanks for starting the discussion. @angela97lin and I just did a call to discuss. We'll keep permutation importance as a separate method for now, not attached to the pipelines/estimators. This is in line with sklearn's current API. We can add permutation importance as a pipeline/estimator member later, once we've had a release or two to play with it on real data. Here's the usage we want to support in this PR:

```python
automl = AutoRegressionSearch()
automl.search(X_train, y_train)
# get an untrained copy of a particular pipeline/parameter set (coming soon in #719)
pipeline = automl.get_pipeline(id)
# retrain on entire training data
pipeline.train(X_train, y_train)
permutation_importance_train = permutation_importance(pipeline, X_train, y_train)
permutation_importance_test = permutation_importance(pipeline, X_test, y_test)
graph_permutation_importance(permutation_importance_train, ...)
graph_permutation_importance(permutation_importance_test, ...)
```

and for estimators:

```python
estimator = LogisticRegressionClassifier()
estimator.fit(X_train, y_train)
permutation_importance_train = permutation_importance(estimator, X_train, y_train)
permutation_importance_test = permutation_importance(estimator, X_test, y_test)
graph_permutation_importance(permutation_importance_train, ...)
graph_permutation_importance(permutation_importance_test, ...)
```

Also @angela97lin, I think when you're saying "class method" here, you're referring to an instance method (i.e. takes `self`).
@dsherry Yup! Updated my comment to "instance method" to make this more clear :)
@dsherry I don't think we can currently support this for estimators since they don't have a
I think I'm more in favor of the latter, unless you see a use case for estimators. Does that sound okay to you?
@angela97lin ah, good point. I think it's fine to only support pipelines for now.
Looks good! I left a few suggestions. Not much is blocking: add the graph method to the init/docs, fix the graph docstring, and address some testing comments.
Good stuff! Left comments, mostly on testing, but none blocking
evalml/pipelines/utils.py (Outdated)

```python
    """
    go = import_or_raise("plotly.graph_objects", error_msg="Cannot find dependency plotly.graph_objects")
    perm_importance = get_permutation_importances(pipeline, X, y, objective)
    perm_importance['importance'] = abs(perm_importance['importance'])
```
Oh interesting. What does a negative importance mean here?
hmm, not sure we should be doing the abs here...
Hm, not 100% sure either, but based on https://github.com/scikit-learn/scikit-learn/blob/fd237278e895b42abe8d8d09105cbb82dc2cbba7/sklearn/inspection/_permutation_importance.py#L13, they're using the difference between the two scores to calculate permutation importance. So if the importance is negative, then the baseline score minus the permuted score is negative, meaning the permuted feature performed better than the original. My guess would be that this means the feature is really not useful, if noise performs better than it? 🤔
Got it. Yeah, I think that's the right interpretation of a negative score. In that case I'm thinking we shouldn't do abs and should show the sign in these plots, right? Would be fine to file this for later.
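To make the sign convention concrete, here is a minimal sketch of the scheme discussed above (not evalml's actual implementation; the function name, `predict` callable, and `score_fn` parameter are hypothetical): each feature's importance is the baseline score minus the mean score after shuffling that feature's column, so a negative value means the model actually scored better with the feature permuted.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, score_fn, n_repeats=5, seed=0):
    """Importance of feature j = baseline score - mean score with column j shuffled."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(y, predict(X))
    importances = []
    for j in range(X.shape[1]):
        permuted_scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])  # shuffle one column in place
            permuted_scores.append(score_fn(y, predict(X_perm)))
        # A negative importance means the shuffled column scored better than
        # the original, suggesting the feature carries no real signal.
        importances.append(baseline - np.mean(permuted_scores))
    return importances
```

Keeping the sign, rather than taking abs, lets the graph distinguish genuinely useful features from ones where noise did as well or better.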
```python
    return pd.DataFrame(mean_perm_importance, columns=["feature", "importance"])


def graph_permutation_importances(pipeline, X, y, objective, show_all_features=False):
```
Can this function share the plotting functionality with our current Pipeline.graph_feature_importances?
Hmmm, do you mean abstracting some code away into a common helper function? I'd say a lot of things are similar between the two, so maybe! But we can address this again in #868, where we update both methods?
sounds like a plan
@angela97lin I forgot to mention this, but I tried this out myself on a couple of datasets and it looks great! This was from the iris dataset:
Closes #155
Graph looks like: