What are the ways of treating missing values in XGBoost? #21
xgboost naturally accepts sparse feature format: you can directly feed data in as a sparse matrix that contains only the non-missing values, i.e. features that are not present in the sparse feature matrix are treated as 'missing'. XGBoost will handle this internally and you do not need to do anything about it.
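For example, through the Python API it is something like the following sketch (the toy data and parameter values are made up for illustration):

```python
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

# 4 rows, 3 features; entries that are never stored in the sparse matrix are
# exactly the "missing" values that XGBoost routes with its learned default direction.
rows = [0, 0, 1, 2, 3]
cols = [0, 2, 1, 0, 2]
vals = [1.0, 3.5, 2.0, 0.7, 1.2]
X = sp.csr_matrix((vals, (rows, cols)), shape=(4, 3))
y = np.array([1, 0, 1, 0])

dtrain = xgb.DMatrix(X, label=y)  # unstored entries are treated as missing
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=5)
```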
Internally, XGBoost will automatically learn the best direction to go when a value is missing. Equivalently, this can be viewed as automatically "learning" the best imputation for missing values based on the reduction in training loss.
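Roughly, the split search enumerates only the non-missing values of a feature and scores each candidate threshold twice, once sending the missing instances left and once right; the better direction becomes that node's default (this is the sparsity-aware split finding written up in Sec. 3.4 of the arXiv paper linked later in this thread). A simplified, made-up sketch of that scoring:

```python
# Sketch of scoring one feature's splits with a learned default direction.
# values, g, h: feature values and gradient/hessian stats of the non-missing instances.
# g_miss, h_miss: summed gradient/hessian of the instances missing this feature.
# Threshold placement and tie handling are kept deliberately simple here.
def best_split_with_default_direction(values, g, h, g_miss, h_miss, lam=1.0):
    def gain(gl, hl, gr, hr, gt, ht):
        return gl * gl / (hl + lam) + gr * gr / (hr + lam) - gt * gt / (ht + lam)

    G, H = sum(g) + g_miss, sum(h) + h_miss        # totals for the whole node
    order = sorted(range(len(values)), key=lambda i: values[i])

    best = (float("-inf"), None, None)             # (gain, threshold, default direction)
    gl = hl = 0.0
    for i in order[:-1]:                           # keep at least one instance on the right
        gl, hl = gl + g[i], hl + h[i]
        # missing goes right: left sums exclude the missing instances
        right_gain = gain(gl, hl, G - gl, H - hl, G, H)
        # missing goes left: add the missing stats to the left sums
        left_gain = gain(gl + g_miss, hl + h_miss, G - gl - g_miss, H - hl - h_miss, G, H)
        if right_gain > best[0]:
            best = (right_gain, values[i], "right")
        if left_gain > best[0]:
            best = (left_gain, values[i], "left")
    return best
```

The direction with the higher gain is stored on the node, and at prediction time any instance missing that feature simply follows it.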
I haven't done a formal comparison with other methods, but I think it should be comparable, and it also gives a computational benefit when your feature matrix is sparse.
Well, if values are not provided, it takes them as missing. So are all 0 values also treated as missing? Example: a column has 25 values, 15 are 1, 5 are missing/NA, and 5 are 0.
It will depend on how you present the data. If you put data in as LIBSVM format and list the zero features there, they will not be treated as missing.
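Concretely, something like these made-up LIBSVM rows: a feature written out with value 0 is an observed zero, while a feature omitted from the row is missing:

```python
import xgboost as xgb

# Two made-up LIBSVM-format rows: in the first row feature 2 is written out as 0
# (an observed zero); in the second row feature 2 is omitted entirely (missing).
with open("toy.libsvm", "w") as f:
    f.write("1 1:0.5 2:0 3:1.2\n"
            "0 1:0.8 3:0.3\n")

# Recent xgboost versions want the explicit format hint; older ones accepted the bare path.
dtrain = xgb.DMatrix("toy.libsvm?format=libsvm")
```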
It may be extremely difficult to list the 0 features in the case of sparse data. So should we avoid xgboost in cases where there is missing data and many 0 features?
Just gave a quick glance at the code (it is beautiful, by the way). It is very interesting the way you treat the missing values: the direction is chosen by whatever makes the tree better. Does this method/algorithm have a name?
Normally, it is fine to treat missing and zero both as zero :)
Sincerely, Tianqi Chen
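For the dense (NumPy) input path the distinction is explicit: in a sketch like the one below (made-up data), np.nan entries are missing, zeros stay zeros, and the DMatrix `missing` argument can change which sentinel counts as missing:

```python
import numpy as np
import xgboost as xgb

X = np.array([[1.0, 0.0],
              [np.nan, 2.0]])   # np.nan is missing, 0.0 is an observed zero
y = np.array([1, 0])

d_default = xgb.DMatrix(X, label=y)                    # by default only NaN counts as missing
d_zero_missing = xgb.DMatrix(X, label=y, missing=0.0)  # here zeros are treated as missing too
```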
I invented the protocol and tricks myself; maybe you can just call it …
Sincerely, Tianqi Chen
I am not surprised by the speed of xgboost, but the score is also better than sklearn's GBR. The missing-value trick might be one of the reasons. Have you published any paper on the boosting algorithm you used for xgboost? Unlike random forest, I could not find much code for boosting with a parallel algorithm; I may need to improve my Google skills though.
I haven't yet published any paper describing xgboost. For parallel boosted tree code, the only one I am aware of so far is …
Sincerely, Tianqi Chen
A follow-up question: while I understand how XGBoost handles missing values within discrete variables, I'm not sure how it handles continuous (numeric) variables.
For continuous features, a missing (default) direction is learnt for missing-value data to go into, so when the value of that specific feature is missing, the instance goes in the default direction.
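One way to see the learned default direction is to dump the trees: each split node records which child the missing values fall into. A made-up sketch (the exact dump text varies by version):

```python
import numpy as np
import xgboost as xgb

X = np.array([[0.1], [0.9], [np.nan], [0.4]])
y = np.array([0, 1, 1, 0])

booster = xgb.train({"objective": "binary:logistic", "max_depth": 2, "min_child_weight": 0},
                    xgb.DMatrix(X, label=y), num_boost_round=1)

# Each split node in the dump looks roughly like "[f0<0.65] yes=1,no=2,missing=2";
# the "missing=" field is the learned default direction for that split.
print(booster.get_dump()[0])
```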
Thanks Tianqi.
Hi Tianqi
See https://arxiv.org/abs/1603.02754, sec. 3.4.
Does XGBoost also work in the presence of categorical features? Do we not need to preprocess them (binarisation, etc.)? For example, my dataset has a feature called city which has the values "Milan", "Rome", "Venice". Can I present them to xgboost without any preprocessing at all?
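One common way to handle this is to one-hot encode such string columns before building the DMatrix; the sketch below reuses the city values from the question and otherwise made-up data:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# Made-up frame reusing the city values from the question above.
df = pd.DataFrame({"city": ["Milan", "Rome", "Venice", None],
                   "nights": [3, 2, 4, 1]})
y = np.array([1, 0, 1, 0])

# One-hot encode the string column; a row whose city is missing simply gets
# all-zero indicator columns here.
X = pd.get_dummies(df, columns=["city"], dtype=float)
dtrain = xgb.DMatrix(X, label=y)
```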
Tianqi, I have a question about the xgb.importance function. When I run it and look at the Real Cover, it seems as though if there is any missing data in a feature, the Real Cover is NA. Is there any way to deal with this issue and get a co-occurrence count for each split? Rex
Hi Tianqi, thanks.
@tqchen
What about the case when the train set has no missing values, but the test set does?
Generally, does the model performance get better with that?