
Different predictions for the same data from sparse_matrix (Bug) #2488

Closed

vadym-wix opened this issue Jul 4, 2017 · 7 comments

Comments

vadym-wix commented Jul 4, 2017

xgboost prediction yields different results for the same data when the DMatrix was created from a SciPy csr_matrix. The difference occurs depending on whether I specify the zeros explicitly or implicitly.

In the case below I got predictions of 0.11 and 0.92, but they should be the same!

Environment info

Operating System: Debian
Package used (python/R/jvm/C++): python
xgboost version used: 0.6
The python version and distribution: 2.7.9

Steps to reproduce

import numpy as np
from sklearn.datasets import load_svmlight_file
from scipy.sparse import csr_matrix, vstack
import xgboost as xgb

# make dataset
loaded = load_svmlight_file('../data/agaricus.txt.train')
data = csr_matrix(loaded[0])
labels = loaded[1]
dtrain = xgb.DMatrix(data, labels)

# train a model
param = {'max_depth': 4, 'eta': 0.2, 'silent': 1, 'objective': 'binary:logistic'}
num_round = 10
bst = xgb.train(param, dtrain, num_round)

# create 2 equal sparse rows, but one row contains explicit zeros
r1 = np.array([1] * 126)
r1[[108, 125]] = 0
r1 = csr_matrix(r1)          # zeros set before conversion are never stored
r2 = csr_matrix([1] * 126)
r2[0, [108, 125]] = 0        # zeros assigned after conversion stay stored explicitly
print(np.array_equal(r1.toarray(), r2.toarray()))  # True

# evaluate results; predictions should be the same!
test_data = vstack([r1, r2])
dtest = xgb.DMatrix(test_data)
bst.predict(dtest)
khotilov (Member) commented Jul 4, 2017

Sparse elements are treated by gbtree as if they were "missing" values (but note that the gblinear booster considers them to be zeros). If there are any explicit zeros in a sparse matrix (as in your r2), those would be treated as actual zero values in tree algorithms. Since the training sparse data did not have explicit zeros, it wouldn't be good to have them in new data.
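To make the distinction concrete, here is a minimal sketch (reusing the r1 and r2 rows from the report above) showing that the two matrices are element-wise equal yet store a different number of entries, because r2 keeps its zeros explicitly:

import numpy as np
from scipy.sparse import csr_matrix

# r1: zeros set before conversion, so they are never stored
r1 = np.array([1] * 126)
r1[[108, 125]] = 0
r1 = csr_matrix(r1)

# r2: zeros assigned after conversion, so they remain stored explicitly
r2 = csr_matrix([1] * 126)
r2[0, [108, 125]] = 0

print(np.array_equal(r1.toarray(), r2.toarray()))  # True: identical dense values
print(r1.nnz, r2.nnz)                              # 124 126: r2 carries two explicit zeros

gbtree sees the two stored zeros in r2 as actual zero feature values, while the corresponding entries in r1 are "missing".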

khotilov (Member) commented Jul 4, 2017

An illustration using dense data:

In [17]: d1 = np.array([[1.] * 126]) # the same as sparse data r1
    ...: d1[0, [108, 125]] = np.nan
    ...:

In [18]: bst.predict(xgb.DMatrix(d1))
Out[18]: array([ 0.11394204], dtype=float32)


In [19]: d2 = np.array([[1.] * 126]) # the same as sparse data with explicit zeros r2
    ...: d2[0, [108, 125]] = 0
    ...:

In [20]: bst.predict(xgb.DMatrix(d2))
Out[20]: array([ 0.9202382], dtype=float32)

khotilov closed this as completed Jul 4, 2017
vadym-wix (Author) commented

Nevertheless, it gives different results when there are multiple rows and one of them contains np.nan:

>>> r1 = np.array([1.] * 126)
>>> r1[[108]] = 0
>>> r1 = csr_matrix(r1)
>>> bst.predict(xgb.DMatrix(r1))
array([ 0.11394204], dtype=float32)

>>> d1 = np.array([[1.] * 126]) # the same as sparse data r1
>>> d1[0, [108]] = np.nan
>>> bst.predict(xgb.DMatrix(d1))
array([ 0.11394204], dtype=float32)

>>> test_data = vstack([r1, d1])
>>> dtest = xgb.DMatrix(test_data)
>>> bst.predict(dtest)
array([ 0.11394204,  0.9202382 ], dtype=float32)

P.S. My goal is, given a sparse matrix, to explicitly set some values to zero.

khotilov (Member) commented Jul 5, 2017

How did you install xgboost? Handling of NaNs in sparse data was fixed in #2062 last February, while the xgboost PyPI package was last updated in Aug 2016.
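A quick way to see whether an installed build could predate that fix is to check the package version (a minimal sketch; the 0.6 release on PyPI predates #2062, so picking up the fix required a source build from a newer master at the time):

import xgboost as xgb

# the PyPI release available then (0.6) predates the NaN fix in #2062
print(xgb.__version__)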

Setting "some values to zero" in sparse matrix would depend on why you want to do that:

  • if by "setting elements to zero" you actually mean "making elements sparse", then assigning zeros has to be followed by a call to eliminate_zeros()
  • but if you want actual explicit zeros in sparse data, then, as I've mentioned, it would only make sense if columns in training data also had explicit zeros.
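A minimal sketch of the first option: assign the zeros, then drop them from storage with eliminate_zeros() so that xgboost's tree boosters treat those entries as missing rather than as zero values:

from scipy.sparse import csr_matrix

m = csr_matrix([[1., 1., 1., 1.]])
m[0, [1, 3]] = 0       # the entries become zero but stay stored explicitly
print(m.nnz)           # 4

m.eliminate_zeros()    # drop the stored zeros, making those entries sparse
print(m.nnz)           # 2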

vadym-wix (Author) commented Jul 6, 2017

Yeah, the reason is that I used the xgboost PyPI package.

In my case I'll go with eliminate_zeros().
Thanks for the explanation!

papinisan commented

@khotilov "note that the gblinear booster considers them to be zeros"

So glinear imputes missing data with "0"? Does it standardize the features, making this a mean imputation, or should features be standardized in preprocessing?

khotilov (Member) commented Aug 3, 2017

@papinisan no feature standardization is currently done within xgboost; that is on the user.
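So if zero imputation by gblinear should amount to mean imputation, the features need standardizing beforehand. A minimal sketch, assuming scikit-learn's StandardScaler (not part of this thread's code) and made-up data:

import numpy as np
import xgboost as xgb
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# after standardization, a zero equals the (training) column mean
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

dtrain = xgb.DMatrix(X_std, label=y)
bst = xgb.train({'booster': 'gblinear', 'objective': 'binary:logistic'}, dtrain, 10)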

lock bot locked as resolved and limited conversation to collaborators Oct 25, 2018