Different predictions for the same data from sparse_matrix (Bug) #2488
Comments
Sparse elements are treated by gbtree as if they were "missing" values (but note that the gblinear booster considers them to be zeros). If there are any explicit zeros in a sparse matrix (as in your r2), those would be treated as actual zero values in tree algorithms. Since the training sparse data did not have explicit zeros, it wouldn't be good to have them in new data.
An illustration using dense data:

```python
In [17]: d1 = np.array([[1.] * 126])  # the same as sparse data r1
    ...: d1[0, [108, 125]] = np.nan

In [18]: bst.predict(xgb.DMatrix(d1))
Out[18]: array([ 0.11394204], dtype=float32)

In [19]: d2 = np.array([[1.] * 126])  # the same as sparse data with explicit zeros r2
    ...: d2[0, [108, 125]] = 0

In [20]: bst.predict(xgb.DMatrix(d2))
Out[20]: array([ 0.9202382], dtype=float32)
```
Nevertheless, it gives different results if there are multiple rows and one of them contains np.nan.

P.S. My goal is, given a sparse matrix, to explicitly set some values to zero.
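As a side note on that goal: in scipy, assigning zero to stored entries of a `csr_matrix` keeps them as *explicit* (stored) zeros, which gbtree would then see as real zero values rather than missing. Calling `eliminate_zeros()` drops the stored zeros so those positions become implicit (missing) again. A minimal sketch, not taken from the thread:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Start with a fully dense row stored as CSR: 5 stored entries.
m = csr_matrix(np.ones((1, 5)))

# Setting existing entries to zero preserves the sparsity structure,
# so they remain stored as explicit zeros.
m[0, [2, 4]] = 0
print(m.nnz)  # 5 stored entries, two of them explicit zeros

# eliminate_zeros() removes the stored zeros in place.
m.eliminate_zeros()
print(m.nnz)  # 3 stored entries
```

Whether you want `eliminate_zeros()` depends on the intent: keep explicit zeros if gbtree should treat them as zero-valued features, drop them if they should count as missing.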
How did you install xgboost? Handling of NaNs in sparse data was fixed in #2062 last February, while the xgboost pypi package was last updated in August 2016. How to set "some values to zero" in a sparse matrix would depend on why you want to do that.
Yeah, the reason is that I used the xgboost pypi package. In my case I'll go with …
@khotilov "note that the gblinear booster considers them to be zeros" So gblinear imputes missing data with 0? Does it standardize the features, making this a mean imputation, or should features be standardized in preprocessing?
@papinisan no feature standardization is currently done within xgboost; this one is on the user.
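For reference, standardizing in preprocessing amounts to computing per-column statistics on the training data and reusing them for any new data. A minimal numpy sketch (illustrative only, not an xgboost API; the array names are made up):

```python
import numpy as np

# Hypothetical training data: 3 rows, 2 feature columns.
X_train = np.array([[1.0, 10.0],
                    [3.0, 30.0],
                    [5.0, 50.0]])

# Fit the scaling on training data only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Standardize: each column ends up with mean ~0 and std ~1.
X_std = (X_train - mu) / sigma

# New data must be scaled with the *training* mu and sigma,
# never with its own statistics.
X_new = np.array([[2.0, 20.0]])
X_new_std = (X_new - mu) / sigma
```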
xgboost prediction yields different results when the DMatrix was created from a scipy
csr_matrix
for the same data. The difference occurs depending on whether I specify zeros explicitly or implicitly.
In the case below I got predictions 0.11 and 0.92, but they should be the same!
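The distinction the report hinges on can be reproduced without xgboost at all: two CSR matrices can be equal as dense arrays yet differ in which zeros are *stored*. A hedged sketch (the shapes mirror the thread's r1/r2, but the construction here is my own):

```python
import numpy as np
from scipy.sparse import csr_matrix

# r1: columns 108 and 125 are simply absent -> implicit zeros,
# which gbtree treats as missing.
cols = np.array([c for c in range(126) if c not in (108, 125)])
r1 = csr_matrix((np.ones(124), (np.zeros(124, dtype=int), cols)),
                shape=(1, 126))

# r2: every column is stored, with columns 108 and 125 holding
# explicit zeros, which gbtree treats as actual zero values.
vals = np.ones(126)
vals[[108, 125]] = 0.0
r2 = csr_matrix((vals, (np.zeros(126, dtype=int), np.arange(126))),
                shape=(1, 126))

# Dense views are identical, but the stored structure is not.
print((r1.toarray() == r2.toarray()).all())  # True
print(r1.nnz, r2.nnz)                        # 124 126
```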
Environment info
Operating System: Debian
Package used (python/R/jvm/C++): python
xgboost version used: 0.6
The python version and distribution: 2.7.9
Steps to reproduce