
Different predictions for the same data from sparse_matrix (Bug) #2488

Closed

vadym-wix opened this issue Jul 4, 2017 · 7 comments

Comments

vadym-wix commented Jul 4, 2017

xgboost prediction yields different results for the same data when the DMatrix was created from a SciPy csr_matrix. The difference occurs depending on whether I specify the zeros explicitly or implicitly.

In the case below I got predictions of 0.11 and 0.92, but they should be the same!

Environment info

Operating System: Debian
Package used (python/R/jvm/C++): python
xgboost version used: 0.6
The python version and distribution: 2.7.9

Steps to reproduce

import numpy as np
from sklearn.datasets import load_svmlight_file
from scipy.sparse import csr_matrix, vstack
import xgboost as xgb

# make dataset
loaded = load_svmlight_file('../data/agaricus.txt.train')
data = csr_matrix(loaded[0])
labels = loaded[1]
dtrain = xgb.DMatrix(data, labels)

# train a model
param = {'max_depth': 4, 'eta': 0.2, 'silent': 1, 'objective': 'binary:logistic'}
num_round = 10
bst = xgb.train(param, dtrain, num_round)

# create 2 equal sparse rows, but one row contains explicit zeros
r1 = np.array([1] * 126)
r1[[108, 125]] = 0
r1 = csr_matrix(r1)          # zeros set before conversion are never stored
r2 = csr_matrix([1] * 126)
r2[0, [108, 125]] = 0        # zeros assigned after conversion stay stored explicitly
print(np.array_equal(r1.toarray(), r2.toarray()))  # True

# evaluate results; predictions should be the same!
test_data = vstack([r1, r2])
dtest = xgb.DMatrix(test_data)
bst.predict(dtest)
khotilov (Member) commented Jul 4, 2017

Sparse elements are treated by gbtree as if they were "missing" values (but note that the gblinear booster considers them to be zeros). If there are any explicit zeros in a sparse matrix (as in your r2), those would be treated as actual zero values in tree algorithms. Since the training sparse data did not have explicit zeros, it wouldn't be good to have them in new data.
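To make the distinction concrete, here is a minimal sketch (reusing the r1 and r2 rows from the report above) showing that the two matrices are element-wise equal yet store a different number of entries, because r2 keeps its zeros explicitly:

import numpy as np
from scipy.sparse import csr_matrix

# r1: zeros set before conversion, so they are never stored
r1 = np.array([1] * 126)
r1[[108, 125]] = 0
r1 = csr_matrix(r1)

# r2: zeros assigned after conversion, so they remain stored explicitly
r2 = csr_matrix([1] * 126)
r2[0, [108, 125]] = 0

print(np.array_equal(r1.toarray(), r2.toarray()))  # True: identical dense values
print(r1.nnz, r2.nnz)                              # 124 126: r2 carries two explicit zeros

gbtree sees the two stored zeros in r2 as actual zero feature values, while the corresponding entries in r1 are "missing".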

khotilov (Member) commented Jul 4, 2017

An illustration using dense data:

In [17]: d1 = np.array([[1.] * 126]) # the same as sparse data r1
    ...: d1[0, [108, 125]] = np.nan
    ...:

In [18]: bst.predict(xgb.DMatrix(d1))
Out[18]: array([ 0.11394204], dtype=float32)


In [19]: d2 = np.array([[1.] * 126]) # the same as sparse data with explicit zeros r2
    ...: d2[0, [108, 125]] = 0
    ...:

In [20]: bst.predict(xgb.DMatrix(d2))
Out[20]: array([ 0.9202382], dtype=float32)

khotilov closed this as completed Jul 4, 2017
vadym-wix (Author) commented

Nevertheless, it gives different results when there are multiple rows and one of them contains np.nan:

>>> r1 = np.array([1.] * 126)
>>> r1[[108]] = 0
>>> r1 = csr_matrix(r1)
>>> bst.predict(xgb.DMatrix(r1))
array([ 0.11394204], dtype=float32)

>>> d1 = np.array([[1.] * 126]) # the same as sparse data r1
>>> d1[0, [108]] = np.nan
>>> bst.predict(xgb.DMatrix(d1))
array([ 0.11394204], dtype=float32)

>>> test_data = vstack([r1, d1])
>>> dtest = xgb.DMatrix(test_data)
>>> bst.predict(dtest)
array([ 0.11394204,  0.9202382 ], dtype=float32)

P.S. My goal is, given a sparse matrix, to explicitly set some values to zero.

khotilov (Member) commented Jul 5, 2017

How did you install xgboost? Handling of NaNs in sparse data was fixed in #2062 last February, while the xgboost PyPI package was last updated in Aug 2016.
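A quick way to see whether an installed build could predate that fix is to check the package version (a minimal sketch; the 0.6 release on PyPI predates #2062, so picking up the fix required a source build from a newer master at the time):

import xgboost as xgb

# the PyPI release available then (0.6) predates the NaN fix in #2062
print(xgb.__version__)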

Setting "some values to zero" in sparse matrix would depend on why you want to do that:

  • if by "setting elements to zero" you actually mean "making elements sparse", then assigning zeros has to be followed by a call to eliminate_zeros()
  • but if you want actual explicit zeros in sparse data, then, as I've mentioned, it would only make sense if columns in training data also had explicit zeros.
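A minimal sketch of the first option: assign the zeros, then drop them from storage with eliminate_zeros() so that xgboost's tree boosters treat those entries as missing rather than as zero values:

from scipy.sparse import csr_matrix

m = csr_matrix([[1., 1., 1., 1.]])
m[0, [1, 3]] = 0       # the entries become zero but stay stored explicitly
print(m.nnz)           # 4

m.eliminate_zeros()    # drop the stored zeros, making those entries sparse
print(m.nnz)           # 2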

vadym-wix (Author) commented Jul 6, 2017

Yeah, the reason is that I used the xgboost PyPI package.

In my case I'll go with eliminate_zeros().
Thanks for the explanation!

papinisan commented

@khotilov "note that the gblinear booster considers them to be zeros"

So glinear imputes missing data with "0"? Does it standardize the features, making this a mean imputation, or should features be standardized in preprocessing?

khotilov (Member) commented Aug 3, 2017

@papinisan no feature standardization is currently done within xgboost; that is on the user.
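So if zero imputation by gblinear should amount to mean imputation, the features need standardizing beforehand. A minimal sketch, assuming scikit-learn's StandardScaler (not part of this thread's code) and made-up data:

import numpy as np
import xgboost as xgb
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# after standardization, a zero equals the (training) column mean
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

dtrain = xgb.DMatrix(X_std, label=y)
bst = xgb.train({'booster': 'gblinear', 'objective': 'binary:logistic'}, dtrain, 10)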

lock bot locked as resolved and limited conversation to collaborators Oct 25, 2018