
Standardize feature importance for estimators #3305

Merged: 7 commits merged into main from 3183_feature_importance on Feb 15, 2022

Conversation

angela97lin (Contributor) commented:

Closes #3183

@angela97lin angela97lin self-assigned this Feb 4, 2022

codecov bot commented Feb 4, 2022

Codecov Report

Merging #3305 (194e60c) into main (e67430c) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3305     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        327     327             
  Lines      31814   31840     +26     
=======================================
+ Hits       31689   31715     +26     
  Misses       125     125             
Impacted Files Coverage Δ
evalml/model_understanding/graphs.py 100.0% <100.0%> (ø)
...ents/estimators/classifiers/baseline_classifier.py 100.0% <100.0%> (ø)
...ents/estimators/classifiers/catboost_classifier.py 100.0% <100.0%> (ø)
...ts/estimators/classifiers/elasticnet_classifier.py 100.0% <100.0%> (ø)
...tors/classifiers/logistic_regression_classifier.py 100.0% <100.0%> (ø)
...nents/estimators/classifiers/xgboost_classifier.py 100.0% <100.0%> (ø)
...valml/pipelines/components/estimators/estimator.py 100.0% <100.0%> (ø)
...onents/estimators/regressors/catboost_regressor.py 100.0% <100.0%> (ø)
...ents/estimators/regressors/elasticnet_regressor.py 100.0% <100.0%> (ø)
...tors/regressors/exponential_smoothing_regressor.py 100.0% <100.0%> (ø)
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeremyliweishih (Collaborator) left a comment:

Some food for thought but not blocking!

@@ -1526,7 +1526,8 @@ def get_linear_coefficients(estimator, features=None):
             "before using this estimator."
         )
     coef_ = estimator.feature_importance
-    coef_ = pd.Series(coef_, name="Coefficients", index=features)
+    coef_ = pd.Series(coef_, name="Coefficients")
Collaborator commented:

Do we need all the pd.Series calls with this call here?

angela97lin (Contributor, Author) replied:

Nope, since coef_ should be a pd.Series object now. Though we still want the name, so I can update this line to set the name instead.
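
For reference, a minimal sketch of the rename-in-place approach being discussed here (values are illustrative, not evalml code):

import pandas as pd

# feature_importance is assumed to already return a pd.Series after this change.
coef_ = pd.Series([0.5, -1.2])
# Set the name on the existing Series instead of re-wrapping it in pd.Series(...).
coef_ = coef_.rename("Coefficients")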

@@ -359,3 +359,30 @@ def test_estimator_fit_predict_and_predict_proba_respect_custom_indices(
     )
     X_pred = estimator.predict(X)
     pd.testing.assert_index_equal(X_original_index, X_pred.index, check_names=True)
+
+
+@pytest.mark.parametrize("estimator_class", _all_estimators())
Collaborator commented:

Should we delete any of the previous tests? I think we could remove all the tests that don't check equality (i.e., the ones for estimators that don't support feature importance), since there's some redundancy between these tests.

chukarsten (Contributor) replied:

Indeed, I can see this going both ways. I see the tests here in test_estimators existing to verify that no feature_importance() sends back NaN. If we wanted to assert anything about the shapes of feature importances relative to the input datasets, here would be the place to do it. If we wanted, for some reason, to test the specific feature_importance() of an individual estimator and make sure that it calculates the right values, we'd want to do that in that estimator's own module.

angela97lin (Contributor, Author) replied:

I like what @chukarsten is saying here: this test is more general and just checks that we don't return NaN values and that the shapes are as expected. The specific modules will test for expected values (since many of the feature importances are calculated by an external library) or assert that we can't calculate it.
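
As a rough sketch of the split being proposed, the shared test would only assert generic properties (a hypothetical helper, not evalml's actual test code; assumes the estimator reports one importance value per input feature):

import pandas as pd

def check_generic_feature_importance(estimator, X, y):
    # Shared check: feature importance is a NaN-free pd.Series of the expected shape.
    estimator.fit(X, y)
    importance = estimator.feature_importance
    assert isinstance(importance, pd.Series)
    assert not importance.isnull().any()  # no NaN values come back
    assert len(importance) == X.shape[1]  # one entry per input feature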

chukarsten (Contributor) left a comment:

Thanks for standardizing and reducing my OCD :)

ParthivNaresh (Contributor) left a comment:

Looks great! Just left a comment about whether we want a Series name to be part of the standardization, in case people call it from the estimator directly rather than from model understanding.

@@ -1526,7 +1526,8 @@ def get_linear_coefficients(estimator, features=None):
             "before using this estimator."
         )
     coef_ = estimator.feature_importance
-    coef_ = pd.Series(coef_, name="Coefficients", index=features)
+    coef_ = pd.Series(coef_, name="Coefficients")
+    coef_.index = features
Contributor commented:

Just out of curiosity, why did we separate these?

angela97lin (Contributor, Author) replied:

I couldn't get them to work on one line because of pandas behavior: passing an existing Series to the pd.Series constructor along with index= reindexes (aligns) it against the new labels and returns all NaNs (or that's my rough understanding): https://pandas.pydata.org/docs/reference/api/pandas.Series.html
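
A quick standalone demonstration of that pandas behavior (illustrative values): passing an existing Series to the pd.Series constructor with index= aligns on the old labels rather than relabeling, while assigning .index afterwards relabels in place:

import pandas as pd

s = pd.Series([0.5, -1.2])  # default integer index 0, 1

# index= in the constructor reindexes (aligns); no old labels match, so all NaN.
aligned = pd.Series(s, index=["a", "b"])
print(aligned.isnull().all())  # True

# Assigning the index afterwards relabels and keeps the values.
s.index = ["a", "b"]
print(s["a"])  # 0.5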

@@ -119,7 +119,9 @@ def test_feature_importance(ts_data):
     regressor = ExponentialSmoothingRegressor()
     with patch.object(regressor, "_component_obj"):
         regressor.fit(X, y)
-    assert regressor.feature_importance == np.zeros(1)
+    pd.testing.assert_series_equal(
+        regressor.feature_importance, pd.Series(np.zeros(1))
+    )
ParthivNaresh (Contributor) commented:

Do we want to add a name to the feature importance series?

angela97lin (Contributor, Author) replied:

Filed #3330, which we can use to track this! :)

@angela97lin angela97lin merged commit 95aa7e2 into main Feb 15, 2022
@angela97lin angela97lin deleted the 3183_feature_importance branch February 15, 2022 03:57
@chukarsten chukarsten mentioned this pull request Feb 18, 2022
Development

Successfully merging this pull request may close these issues: Standardize Feature Importance Return (#3183).