What are the ways of treating missing values in XGBoost? #21

naggar1 · 2014-08-12T21:46:00Z

Does model performance generally improve with that?

tqchen · 2014-08-12T21:48:35Z

XGBoost natively accepts a sparse feature format. You can feed the data in directly as a sparse matrix that contains only the non-missing values.

That is, features that are not present in the sparse feature matrix are treated as 'missing'. XGBoost handles this internally, and you do not need to do anything about it.

tqchen · 2014-08-12T21:51:47Z

Internally, XGBoost automatically learns the best direction to go when a value is missing. Equivalently, this can be viewed as automatically "learning" the best imputation value for missing values, based on the reduction in training loss.
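A minimal sketch of the idea above, not XGBoost's actual code: for one feature and one candidate threshold, compute the split gain twice, once with the missing rows sent left and once with them sent right, and keep the better direction. The function names (`leaf_score`, `best_default_direction`), the use of `None` to mark a missing value, and the regularization value `lam` are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def leaf_score(g: float, h: float, lam: float) -> float:
    """Score G^2 / (H + lambda) for a leaf with gradient sum G and hessian sum H."""
    return g * g / (h + lam)


def best_default_direction(
    values: List[Optional[float]],  # None marks a missing value (assumption)
    grads: List[float],
    hess: List[float],
    threshold: float,
    lam: float = 1.0,
) -> Tuple[str, float]:
    """Return (direction, gain) for the better way to route missing rows."""
    g_left = h_left = g_right = h_right = g_miss = h_miss = 0.0
    for v, g, h in zip(values, grads, hess):
        if v is None:
            g_miss += g
            h_miss += h
        elif v < threshold:
            g_left += g
            h_left += h
        else:
            g_right += g
            h_right += h

    parent = leaf_score(g_left + g_right + g_miss, h_left + h_right + h_miss, lam)
    # Constant factors and the per-leaf complexity penalty are omitted here,
    # since they are identical for both directions and do not change the winner.
    gain_left = (
        leaf_score(g_left + g_miss, h_left + h_miss, lam)
        + leaf_score(g_right, h_right, lam)
        - parent
    )
    gain_right = (
        leaf_score(g_left, h_left, lam)
        + leaf_score(g_right + g_miss, h_right + h_miss, lam)
        - parent
    )
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


# Toy data: squared-error loss with an initial prediction of 0,
# so gradient = -label and hessian = 1 for every row.
values = [1.0, 2.0, 8.0, 9.0, None, None]
labels = [0.0, 0.0, 10.0, 10.0, 10.0, 10.0]
grads = [-y for y in labels]
hess = [1.0] * len(labels)
direction, gain = best_default_direction(values, grads, hess, threshold=5.0)
```

In this toy data the rows with a missing value have the same labels as the high-valued rows, so routing them to the right child gives the larger gain.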

tqchen · 2014-08-12T21:55:18Z

I haven't done a formal comparison with other methods, but I think it should be comparable, and it also gives a computational benefit when your feature matrix is sparse.

rkirana · 2014-08-29T18:37:26Z

Well, if values are not provided, it takes them as missing. So are all 0 values also treated as missing?

Example: A column has 25 values, 15 are 1, 5 are missing/NA and 5 are 0.
Are the 5 + 5 = 10 treated as missing?

tqchen · 2014-08-29T21:13:42Z

It depends on how you present the data. If you put the data in LIBSVM format and list the zero-valued features there, they will not be treated as missing.
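To make the distinction concrete, here is a small sketch with a hand-rolled parser (not XGBoost's own loader; the helper name `parse_libsvm_line` is made up for illustration). A LIBSVM line can list a feature with an explicit zero, while an index that is simply omitted stays missing.

```python
from typing import Dict, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    parts = line.split()
    label = float(parts[0])
    features: Dict[int, float] = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features


rows = [
    "1 0:1",    # feature 0 is present with value 1
    "0 0:0",    # feature 0 is present with an explicit zero
    "1 1:3.5",  # feature 0 is omitted, so it counts as missing
]

for line in rows:
    label, feats = parse_libsvm_line(line)
    state = feats[0] if 0 in feats else "missing"
    print(label, state)
```

Under this reading, only the third row has a missing value for feature 0; the explicit `0:0` in the second row is an ordinary zero.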

rkirana · 2014-08-30T12:11:25Z

It may be extremely difficult to list 0 features in the case of sparse data. So should we avoid xgboost in cases where there is missing data and many 0 features?

maxliu · 2014-08-30T15:56:19Z

Just took a quick glance at the code (it is beautiful, by the way). The way you treat missing values is very interesting: it depends on what makes the tree better. Does this method/algorithm have a name?

tqchen · 2014-08-30T15:59:54Z

Normally, it is fine to treat missing values and zeros all as zero :)

Sincerely,

Tianqi Chen
Computer Science & Engineering, University of Washington

tqchen · 2014-08-30T16:01:36Z

I invented the protocol and tricks myself; maybe you can just call it
xgboost. The general algorithm, however, fits into the framework of gradient
boosting.

maxliu · 2014-08-30T16:40:25Z

I am not surprised by the speed of xgboost, but the score is better than sklearn-GBR. The missing-value trick might be one of the reasons.

Have you published any paper on the boosting algorithm used in xgboost? Unlike random forests, I could not find much code for boosting with a parallel algorithm, though I may need to improve my Google skills.

tqchen · 2014-08-30T16:46:53Z

I haven't published any paper describing xgboost yet.

For parallel boosted tree code, the only one I am aware of so far is
http://machinelearning.wustl.edu/pmwiki.php/Main/Pgbrt . You can try it out
and compare it with xgb if you are interested.

Acriche · 2015-07-06T10:06:23Z

A follow-up question:

While I understand how XGBoost handles missing values in discrete variables, I'm not sure how it handles continuous (numeric) variables.
Can you please explain?

tqchen · 2015-07-07T03:27:11Z

For continuous features, a default (missing) direction is learned for rows whose value is missing, so when the value of a specific feature is missing, the row goes in the default direction.
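As a rough illustration of that routing at prediction time, here is a sketch of a single tree whose split nodes carry a default direction. The `Leaf` and `Split` classes, the `default_left` field name, and the dict-based row format are assumptions for this example; XGBoost stores its trees differently.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    value: float


@dataclass
class Split:
    feature: int
    threshold: float
    default_left: bool  # direction for missing values, learned during training
    left: Union["Split", Leaf]
    right: Union["Split", Leaf]


def predict(node: Union[Split, Leaf], row: Dict[int, float]) -> float:
    """Walk the tree; a feature absent from `row` follows the default direction."""
    while isinstance(node, Split):
        v = row.get(node.feature)  # None when the feature is missing
        if v is None:
            go_left = node.default_left
        else:
            go_left = v < node.threshold
        node = node.left if go_left else node.right
    return node.value


tree = Split(
    feature=0,
    threshold=5.0,
    default_left=False,
    left=Leaf(0.0),
    right=Leaf(10.0),
)

low = predict(tree, {0: 1.0})   # compares 1.0 with the threshold, goes left
high = predict(tree, {0: 7.0})  # compares 7.0 with the threshold, goes right
missing = predict(tree, {})     # feature 0 is absent, takes the default (right)
```

The comparison against the threshold is only made when the value is present; a missing value never needs an imputed number, it just takes the stored branch.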

Acriche · 2015-07-07T07:47:26Z

Thanks Tianqi.
And what about missing continuous features in generalized linear models?

akshenndra · 2016-04-29T16:07:46Z

Hi Tianqi
I am looking for an algorithm that does no imputation of missing values internally and yet works. How does xgboost handle missing values internally? (Can you share the basic idea?)

tqchen · 2016-04-29T16:44:01Z

See Sec. 3.4 of https://arxiv.org/abs/1603.02754

akshenndra · 2016-05-01T01:42:09Z

Does xgboost also work in the presence of categorical features, so that we don't need to preprocess them (binarisation, etc.)? For example, my dataset has a feature called city with the values "Milan", "Rome", "venice". Can I present them to xgboost without any preprocessing at all?

johnsonr05 · 2016-05-02T19:54:10Z

Tianqi,

I have a question about the xgb.importance function. When I run it and look at the Real Cover, it seems that if there is any missing data in a feature, the Real Cover is NA. Is there any way to deal with this issue to get a co-occurrence count for each split?

Rex

nyutal · 2017-01-19T11:58:33Z

Hi Tianqi,
My processing pipeline includes normalizing the features before learning. Also, I have a lot of indicators that are missing, rather than zero, for a negative indication.
As a result, my normalized indicators are 0 for indicated values and missing for non-indicated values.
Will xgboost handle such behavior properly? (Changing non-existent features to 0 will cause problems...)

Thanks,
Nadav

acc-to-learn · 2017-11-22T08:50:07Z

@tqchen
You wrote:

Internally, XGBoost will automatically learn what is the best direction to go when a value is missing. Equivalently, this can be viewed as automatically "learn" what is the best imputation value for missing values based on reduction on training loss.

What about the case where the training set has no missing values, but the test set does?

tqchen added the question label Aug 12, 2014

tqchen changed the title ~~What are the ways of treatng missing values in Xgboost?~~ What are the ways of treatng missing values in XGboost? Aug 12, 2014

tqchen closed this as completed Aug 12, 2014

randombishop mentioned this issue Sep 11, 2015

LibSVM format question #485

Closed

imenelk mentioned this issue Apr 22, 2016

Segmentation fault with Python package for the approximate method only #1133

Closed

raghavrv mentioned this issue Aug 1, 2016

[MRG] ENH Add support for missing values to Tree based Classifiers scikit-learn/scikit-learn#5974

Closed

18 tasks

keycai mentioned this issue Sep 8, 2017

sparse format data contain NA， when feeding into xgboost， the result is very different from filling them with 0 #2683

Closed

cowden mentioned this issue Dec 1, 2017

Clarification on Trivial Splits #2914

Open

kapil-stp84 mentioned this issue Apr 24, 2018

Treating missing values for categorical variables in XGBoost #3267

Closed

Pscheidl mentioned this issue May 30, 2018

PUBDEV-5551 XGBoost Zero/NaN representation dense & sparse h2oai/h2o-3#2461

Closed

kingfengji mentioned this issue Jun 5, 2018

How to deal with missing values？ pylablanche/gcForest#13

Open

lock bot locked as resolved and limited conversation to collaborators Oct 25, 2018
