Standardize feature importance for estimators #3305
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##            main    #3305    +/-  ##
=======================================
+ Coverage   99.7%    99.7%   +0.1%
=======================================
  Files        327      327
  Lines      31814    31840     +26
=======================================
+ Hits       31689    31715     +26
  Misses       125      125
```

Continue to review full report at Codecov.
jeremyliweishih
left a comment
Some food for thought but not blocking!
evalml/model_understanding/graphs.py (Outdated)

```diff
 )
 coef_ = estimator.feature_importance
-coef_ = pd.Series(coef_, name="Coefficients", index=features)
+coef_ = pd.Series(coef_, name="Coefficients")
```
Do we need all the pd.Series calls with this call here?
Nope, since `coef_` should already be a `pd.Series` object now. We still want the name, though, so I can update this line to set the name instead.
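For reference, a minimal sketch of what "set the name instead" could look like when the estimator already returns a Series (the values here are illustrative, not from the project):

```python
import pandas as pd

# coef_ already arrives as a pd.Series from the estimator, so only the
# name needs to be set; no reconstruction of the Series is required.
coef_ = pd.Series([0.1, 0.9])
coef_ = coef_.rename("Coefficients")
assert coef_.name == "Coefficients"
```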
```diff
 pd.testing.assert_index_equal(X_original_index, X_pred.index, check_names=True)


+@pytest.mark.parametrize("estimator_class", _all_estimators())
```
Should we delete any of the previous tests? I think we could remove all the tests that don't check equality (i.e., the ones for estimators that don't support feature importance), since there's some redundancy between these tests.
Indeed, I can see this going both ways. I can see the tests here in test_estimators existing to verify that no feature_importance() sends back NaN. If we wanted any assertion on the shapes of feature importances relative to the input datasets, this would be the place to do it. If we wanted, for some reason, to test the specific feature_importance() of an individual estimator and make sure that it calculates the right values, we'd want to do that in that estimator's own module.
I like what @chukarsten is saying here: this test is more general and just checks that we don't return NaN values and that the shapes are as expected. Specific modules will test for expected values (since many of the feature importances are calculated by an external library) or assert that the importance can't be calculated.
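A hypothetical sketch of the kind of general check being described; `check_feature_importance` and the `Dummy` estimator here are illustrative stand-ins, not the project's actual test code:

```python
import numpy as np
import pandas as pd

def check_feature_importance(estimator, X):
    # General invariants only: a Series, no NaNs, and one entry per
    # input feature. Exact values are left to estimator-specific tests.
    fi = estimator.feature_importance
    assert isinstance(fi, pd.Series)
    assert not fi.isna().any()
    assert len(fi) == X.shape[1]

class Dummy:
    # Stand-in for an estimator exposing a feature_importance Series.
    feature_importance = pd.Series([0.5, 0.5])

check_feature_importance(Dummy(), np.zeros((10, 2)))
```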
chukarsten
left a comment
Thanks for standardizing and reducing my OCD :)
Looks great, just left a comment about whether we want a Series name to be part of the standardization in case people call it from the estimator directly and not from model understanding.
```diff
 coef_ = estimator.feature_importance
-coef_ = pd.Series(coef_, name="Coefficients", index=features)
+coef_ = pd.Series(coef_, name="Coefficients")
+coef_.index = features
```
Just out of curiosity, why did we separate these?
I couldn't get them to work together because of pandas behavior: when the data passed to `pd.Series` is already a Series, supplying `index=` reindexes it against the new labels and returns all NaNs (or that's my rough understanding). https://pandas.pydata.org/docs/reference/api/pandas.Series.html
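A small sketch of the pandas behavior being described, with made-up values and labels; passing an existing Series plus a new `index` to the constructor reindexes rather than relabels:

```python
import pandas as pd

# An existing Series with a default integer index (0, 1, 2).
coef_ = pd.Series([0.5, 0.3, 0.2], name="Coefficients")

# Passing new labels in the constructor REINDEXES against them: none of
# these labels exist in coef_'s current index, so every value becomes NaN.
wrong = pd.Series(coef_, name="Coefficients", index=["a", "b", "c"])
assert wrong.isna().all()

# Assigning the index afterwards relabels in place and keeps the values,
# which is why the two statements were split onto separate lines.
coef_.index = ["a", "b", "c"]
assert list(coef_) == [0.5, 0.3, 0.2]
```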
```diff
 regressor.fit(X, y)
-assert regressor.feature_importance == np.zeros(1)
+pd.testing.assert_series_equal(
+    regressor.feature_importance, pd.Series(np.zeros(1))
```
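For context, a sketch of why the bare `==` assertion no longer works once `feature_importance` returns a Series (the values here are illustrative):

```python
import numpy as np
import pandas as pd

fi = pd.Series(np.zeros(1))  # feature importance now returned as a Series

# With a Series, fi == np.zeros(1) is an element-wise boolean Series,
# and `assert` on it raises "The truth value of a Series is ambiguous".
try:
    assert fi == np.zeros(1)
except ValueError:
    pass

# assert_series_equal compares values, dtype, and index explicitly and
# raises with a useful diff message on mismatch.
pd.testing.assert_series_equal(fi, pd.Series(np.zeros(1)))
```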
Do we want to add a name to the feature importance series?
Filed #3330 that we can use to track this! :)
Closes #3183