Refactor permutation importance method and add per-column permutation importance method #2302
return _go.Figure(layout=layout, data=graph_data)

def _calculate_permutation_scores_fast(pipeline, precomputed_features, y, objective, col_name,
I moved everything into a permutation_importance file, since we were starting to add more methods whose only job is to calculate permutation importance. I left the method that graphs permutation importance here, though.
Codecov Report
@@ Coverage Diff @@
## main #2302 +/- ##
=======================================
+ Coverage 99.9% 99.9% +0.1%
=======================================
Files 280 281 +1
Lines 24492 24566 +74
=======================================
+ Hits 24464 24538 +74
Misses 28 28
@angela97lin What I like about that test is that it makes sure our permutation importance implementation is producing correct results (at least by sklearn standards). I'm not sure what you have in mind for how the new tests would look, but I think it's worth ensuring we're producing correct values (especially since we're venturing out with our own impl now). We don't have to call sklearn's permutation importance to do that, though.
@freddyaboulton I converted the previous test, which checked against sklearn's impl, to compare the new "slow" method against the fast method as a way of checking for correctness. The test I added above checks that the "slow" method produces the same results as sklearn's, which I think is fine to check for this PR. But since this PR is also breaking away from the public sklearn interface, perhaps it's not necessary to keep? What do you think? 🤔
Got it! I don't think we have to use sklearn to establish "correctness"!
@freddyaboulton Sounds good!! 😁
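For illustration, the slow-vs-fast comparison discussed in this thread could look roughly like the sketch below. It is plain Python, not the EvalML code: `transform`, `estimator_predict`, `score_slow`, and `score_fast` are hypothetical names, and it assumes the fast path works by precomputing features once and permuting the transformed column, as the `precomputed_features` argument suggests.

```python
import random


def transform(X):
    # Hypothetical elementwise preprocessing step: scale every column by 10.
    return {col: [10 * v for v in vals] for col, vals in X.items()}


def estimator_predict(features):
    # Hypothetical fitted estimator that consumes transformed features.
    return [a + 0.5 * b for a, b in zip(features["a"], features["b"])]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permuted(values, seed):
    # Same seed and same length give the same reordering of positions.
    out = list(values)
    random.Random(seed).shuffle(out)
    return out


def score_slow(X, y, col, seed):
    # Permute the raw column, then rerun the whole pipeline.
    X_perm = {**X, col: permuted(X[col], seed)}
    return mse(y, estimator_predict(transform(X_perm)))


def score_fast(precomputed, y, col, seed):
    # Permute the already-transformed column; only the estimator reruns.
    feats = {**precomputed, col: permuted(precomputed[col], seed)}
    return mse(y, estimator_predict(feats))


X = {"a": [1, 2, 3, 4, 5], "b": [5, 3, 1, 4, 2]}
y = estimator_predict(transform(X))
precomputed = transform(X)  # computed once, reused for every column and seed

scores_match = all(
    score_slow(X, y, col, seed) == score_fast(precomputed, y, col, seed)
    for col in X
    for seed in range(3)
)
```

The two paths agree here only because the toy transform acts on each value independently. A transform that mixes columns or depends on row order would not commute with the shuffle, so a comparison test like this is only meaningful for pipelines where the fast path is valid.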
bchen1116 left a comment
Looks good! I left a few nit-pick comments, but the test coverage is solid, and I like how you broke the methods apart!
pipeline (PipelineBase or subclass): Fitted pipeline
X (pd.DataFrame): The input data used to score and compute permutation importance
y (pd.Series): The target data
objective (str, ObjectiveBase): Objective to score on
Nitpick, but can you sort these args to match the order of the args in the method signature? Also, can you add col_name to the args here?
return pd.DataFrame(mean_perm_importance, columns=["feature", "importance"])

def calculate_permutation_importance_one_column(X, y, pipeline, col_name, objective,
nit: to make this and the previous method more similar, can you normalize the order of X, y, pipeline vs pipeline, X, y?
chukarsten left a comment
Hey Angela, this looks good. I think there are a few docstring adjustments to make. It also seems that the per-column and base perm importance functions are pretty similar, so I was wondering if it's possible to reuse the function more or less as-is. I had grand ideas to try it out, but I got distracted :P. Anyway, I'm sure it's the way it is for a reason, so take it or leave it if you think it's worth exploring.
Closes #2299
Standardizes calculate_permutation_importance to only return a dict with the mean. Previously, the slow impl would return the std / the feature importances, but that differed from the fast impl, which only returned the mean. Hence, standardizing to return just the mean, since we don't use the std / feature importances to begin with.
Question for discussion: We no longer use scikit-learn's permutation importance implementation as the "slow" method. Should we still compare outputs? I originally wrote this test and confirmed that it passed, but I don't think it's necessary to add it to the codebase. (It also takes 5 minutes to run the test locally 😱). Since we're moving away from the sklearn method, I think it's fine.
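As a rough illustration of the mean-only return shape described above, here is a minimal sketch in plain Python. It is not the EvalML implementation: `permutation_importance_mean`, `predict`, and `neg_mse` are hypothetical names, a dict of lists stands in for a DataFrame, and the score is assumed to be "greater is better".

```python
import random
from statistics import mean


def permutation_importance_mean(predict, X, y, score, n_repeats=5, seed=0):
    """Return {column: mean drop in score} when that column is shuffled.

    X maps column name -> list of values (a stand-in for a DataFrame).
    predict and score are hypothetical callables, not EvalML APIs.
    """
    rng = random.Random(seed)
    baseline = score(y, predict(X))
    importances = {}
    for col in X:
        drops = []
        for _ in range(n_repeats):
            shuffled = X[col][:]
            rng.shuffle(shuffled)
            X_perm = {**X, col: shuffled}
            drops.append(baseline - score(y, predict(X_perm)))
        # Only the mean is kept; per-repeat values and std are discarded.
        importances[col] = mean(drops)
    return importances


# Toy model: the prediction depends only on "a"; "b" is ignored.
def predict(X):
    return [2 * a for a in X["a"]]


def neg_mse(y_true, y_pred):
    # Negated so that a higher score means a better fit.
    return -mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


X = {"a": [1, 2, 3, 4, 5, 6], "b": [9, 8, 7, 6, 5, 4]}
y = [2 * a for a in X["a"]]
result = permutation_importance_mean(predict, X, y, neg_mse)
```

In this toy setup the ignored column "b" gets an importance of exactly zero, while "a" gets a positive one. That property can be checked without any reference implementation.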