Why is the default value for feature_importance 'weight' in Python but R uses 'gain'? #2706

Closed
bbennett36 opened this issue Sep 13, 2017 · 3 comments

bbennett36 commented Sep 13, 2017

I was reading through the docs and noticed that the R-package section
http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html#feature-importance
says the following:

"The column Gain provide the information we are looking for."

"Frequency is a simpler way to measure the Gain. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it)."
I'm assuming Frequency corresponds to 'weight' in the Python package (correct me if I'm wrong).

If we shouldn't be using weight/frequency to check feature importance, why is it the default parameter in XGBoost?

This is a bit concerning as someone who just started learning XGBoost: I was using the default plot_importance to find the best features, but now that seems misleading, since it doesn't default to the best option for the job. Unless I'm missing something, 'gain' should be the default parameter.
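For reference, a minimal sketch (the data and parameters are made up for illustration) of requesting gain-based importance explicitly in the Python package, since both Booster.get_score and plot_importance take an importance_type argument that defaults to 'weight':

```python
import numpy as np
import xgboost as xgb

# Toy data, purely for illustration
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

dtrain = xgb.DMatrix(X, label=y, feature_names=[f"f{i}" for i in range(5)])
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=20)

# Default: split counts ('weight', i.e. Frequency in the R docs)
print(booster.get_score(importance_type="weight"))

# Gain-based importance, which the R vignette recommends
print(booster.get_score(importance_type="gain"))

# plot_importance accepts importance_type as well; it defaults to 'weight'
xgb.plot_importance(booster, importance_type="gain")
```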

@pommedeterresautee
Member

There are many ways to look at feature importance. Gain is more informative than just frequency.
Please check this paper for some recent developments: https://arxiv.org/abs/1706.06060
It's going to be implemented in XGBoost (#2438).
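For readers who want to see what per-prediction attributions look like once that lands, a rough sketch, assuming an xgboost build where the pred_contribs option of Booster.predict is available:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = (X[:, 1] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)

# Per-row feature contributions; the last column is the bias term.
contribs = booster.predict(dtrain, pred_contribs=True)

# Averaging absolute contributions gives a model-level importance measure.
print(np.abs(contribs[:, :-1]).mean(axis=0))
```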

@bbennett36
Author

Sounds good, I'll check it out. Thank you!


HugoDLopes commented Nov 15, 2017

Hi @pommedeterresautee,
I've run into the same issue today. feature_importances_ defaulting to weight in the Python package can be really misleading. I then dug into the code and noticed that XGBoost's definition of feature importance is indeed the weight.

The sklearn classifiers (RandomForest or GradientBoosting) don't use this type of feature importance. Shouldn't it be more similar to sklearn, since the purpose of feature_importances_ is to resemble the sklearn implementation?

Currently the weight is not very helpful, since it is far from the actual predictive contribution of a feature to the whole model.
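To make the concern concrete, a small sketch (assuming the sklearn wrapper's feature_importances_ is weight-based, as discussed above, and that get_booster() is available) comparing it with a normalised gain-based ranking pulled from the underlying booster:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(500, 4)
y = (X[:, 0] > 0.5).astype(int)

clf = xgb.XGBClassifier(n_estimators=30).fit(X, y)

# Weight-based ranking exposed through the sklearn-style attribute
print(clf.feature_importances_)

# Gain-based ranking, normalised to sum to 1 like sklearn's
# RandomForest/GradientBoosting feature_importances_
gain = clf.get_booster().get_score(importance_type="gain")
total = sum(gain.values())
print({name: value / total for name, value in gain.items()})
```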

lock bot locked this conversation as resolved and limited it to collaborators on Oct 25, 2018