Standardize feature importance for estimators #3305
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3305 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 327 327
Lines 31814 31840 +26
=======================================
+ Hits 31689 31715 +26
Misses 125 125
Some food for thought but not blocking!
evalml/model_understanding/graphs.py (Outdated)
@@ -1526,7 +1526,8 @@ def get_linear_coefficients(estimator, features=None):
             "before using this estimator."
         )
     coef_ = estimator.feature_importance
-    coef_ = pd.Series(coef_, name="Coefficients", index=features)
+    coef_ = pd.Series(coef_, name="Coefficients")
Do we need all the pd.Series calls with this call here?
Nope, since coef_ should be a pd.Series object now. Though we still want the name, so I can update this line to set the name instead.
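To illustrate, a minimal sketch of "setting the name instead" (not the exact line from the PR; the stand-in values are made up):

    import pandas as pd

    # feature_importance now returns a pd.Series, so only the name needs
    # setting instead of rebuilding the Series from scratch.
    coef_ = pd.Series([0.5, -0.2])        # stand-in for estimator.feature_importance
    coef_ = coef_.rename("Coefficients")  # Series.rename with a scalar sets .name
    assert coef_.name == "Coefficients"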
@@ -359,3 +359,30 @@ def test_estimator_fit_predict_and_predict_proba_respect_custom_indices(
     )
     X_pred = estimator.predict(X)
     pd.testing.assert_index_equal(X_original_index, X_pred.index, check_names=True)
+
+
+@pytest.mark.parametrize("estimator_class", _all_estimators())
Should we delete any of the previous tests? I think we could remove all the tests that don't check equality (i.e., the ones for estimators that don't support feature importance), since there's some redundancy between these tests.
Indeed, I can see this going both ways. I can see the tests here in test_estimators existing for the purpose of determining that no feature_importance() sends back NaN. If we wanted to do any assertion on the shapes of feature importances relative to the input datasets, this would be the place to do it. I think if we wanted, for some reason, to test the specific feature_importance() of an individual estimator and make sure that it calculates the right values, we'd want to do that in that estimator's own test module.
I like what @chukarsten is saying here, where this test stays general and just checks that we don't return NaN values and that the shapes are as expected. Specific modules will test for expected values (since many of the feature importances are calculated by an external library) or assert that we can't calculate it.
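For context, a rough sketch of the kind of general check being discussed (this is not the PR's actual test; the X_y_binary fixture and the single problem type are assumptions made for illustration):

    import pandas as pd
    import pytest

    @pytest.mark.parametrize("estimator_class", _all_estimators())
    def test_feature_importance_has_no_nans(estimator_class, X_y_binary):
        # Assumed fixture providing a small compatible dataset; handling
        # of estimators with other problem types is elided for brevity.
        X, y = X_y_binary
        estimator = estimator_class()
        estimator.fit(X, y)
        importance = estimator.feature_importance
        assert isinstance(importance, pd.Series)
        assert not importance.isnull().any()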
Thanks for standardizing and reducing my OCD :)
Looks great, just left a comment about whether we want a Series name to be part of the standardization in case people call it from the estimator directly and not from model understanding.
evalml/model_understanding/graphs.py
@@ -1526,7 +1526,8 @@ def get_linear_coefficients(estimator, features=None):
             "before using this estimator."
         )
     coef_ = estimator.feature_importance
-    coef_ = pd.Series(coef_, name="Coefficients", index=features)
+    coef_ = pd.Series(coef_, name="Coefficients")
+    coef_.index = features
Just out of curiosity, why did we separate these?
I couldn't get it to work with them together: because the data is already a Series, passing index= to the pd.Series constructor reindexes against the new labels and returns all NaNs when put on the same line (or that's my rough understanding). See https://pandas.pydata.org/docs/reference/api/pandas.Series.html
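A quick standalone demonstration of that pandas behavior (example data made up, not code from the PR):

    import pandas as pd

    s = pd.Series([0.5, -0.2], index=["a", "b"])

    # Passing index= when the data is already a Series aligns on the
    # existing labels, so unmatched labels come back as NaN:
    realigned = pd.Series(s, index=["x", "y"])
    assert realigned.isnull().all()

    # Assigning .index afterwards relabels positionally instead:
    renamed = pd.Series(s, name="Coefficients")
    renamed.index = ["x", "y"]
    assert list(renamed) == [0.5, -0.2]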
@@ -119,7 +119,9 @@ def test_feature_importance(ts_data):
     regressor = ExponentialSmoothingRegressor()
     with patch.object(regressor, "_component_obj"):
         regressor.fit(X, y)
-    assert regressor.feature_importance == np.zeros(1)
+    pd.testing.assert_series_equal(
+        regressor.feature_importance, pd.Series(np.zeros(1))
+    )
Do we want to add a name to the feature importance series?
Filed #3330 that we can use to track this! :)
Closes #3183