There are many ways to look at feature importance. Gain is more informative than just frequency.
Please check this paper for some recent developments: https://arxiv.org/abs/1706.06060
It's going to be implemented in XGBoost: #2438
Hi, @pommedeterresautee
I ran into the same issue today. Having feature_importances_ default to weight in the Python package can be really misleading. I then dug into the code and noticed that XGBoost defines feature importance as the weight.
The sklearn classifiers (RF or GB) do not use this type of feature importance. Shouldn't XGBoost be more similar to sklearn here, since the purpose of feature_importances_ is to mirror the sklearn implementation?
Currently the weight is not very helpful, since it is far from the actual predictive contribution of a feature to the whole model.
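To illustrate how far weight can drift from gain, here is a minimal sketch in plain Python. It does not call XGBoost; the split list, the feature names, and the helpers `importance` and `ranking` are all hypothetical, and the gain values are invented. It assumes that weight counts how often a feature is split on, and that gain is reported either as a total or as an average over those splits.

```python
from collections import defaultdict

# Hypothetical splits from a small boosted ensemble:
# (feature name, loss reduction achieved by that split).
# These numbers are invented for illustration only.
splits = [
    ("age", 0.5), ("age", 0.25), ("age", 0.25), ("age", 0.5),
    ("income", 8.0),
    ("city", 1.0), ("city", 2.0),
]

def importance(splits):
    """Return (weight, total_gain, avg_gain) dictionaries keyed by feature."""
    weight = defaultdict(int)        # number of splits that use the feature
    total_gain = defaultdict(float)  # summed loss reduction over those splits
    for feature, gain in splits:
        weight[feature] += 1
        total_gain[feature] += gain
    avg_gain = {f: total_gain[f] / weight[f] for f in weight}
    return dict(weight), dict(total_gain), avg_gain

def ranking(scores):
    """Feature names sorted from most to least important."""
    return sorted(scores, key=scores.get, reverse=True)

weight, total_gain, avg_gain = importance(splits)
print("by weight:      ", ranking(weight))
print("by total gain:  ", ranking(total_gain))
print("by average gain:", ranking(avg_gain))
```

In this toy data "age" is split on most often, so it ranks first by weight, yet its splits reduce the loss the least, so it ranks last by gain. In the Python package, `plot_importance` and `Booster.get_score` accept an `importance_type` argument, so passing `importance_type="gain"` requests gain instead of the default weight.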
lockbot locked as resolved and limited conversation to collaborators on Oct 25, 2018
I was reading through the docs and noticed that the R-package section
http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html#feature-importance
says the following:
"The column Gain provide the information we are looking for."
"Frequency is a simpler way to measure the Gain. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it)."
I'm assuming Frequency is 'weight' in the Python package. (Correct me if I'm wrong.)
If we shouldn't be using weight/frequency to check feature importance, why is it the default param in XGBoost?
This is a bit concerning to me as someone who just started learning XGBoost: I was using the default "plot_importance" to find the best features, but now that seems misleading, since it doesn't default to the best param for this. Unless I am missing something, it seems like 'gain' should be the default parameter.