
xgboost in R providing unexpected prediction #7294

Closed
dwayne298 opened this issue Oct 7, 2021 · 7 comments
@dwayne298

Below is code that builds a simple xgboost model to demonstrate the issue I've been seeing. Once the model has been built, we predict with it and take the second row of our data. If we take the log of the ratio between the predictions from the first 10 trees and the first 9 trees, it should give us the contribution of the 10th tree: 0.00873184 in this case.

Now, if we take the input to the tree (matrix "a", which has value 0.1234561702 for row 2) and trace it through the plotted tree by hand, we expect a prediction of 0.0121501638. However, it looks like at the second split (<0.123456173) it takes the wrong direction and ends up at the node with value 0.00873187464, which matches the actual prediction above.

Does anyone have an idea what is going on?

[Screenshot: plot of the 10th tree]

Versions:
R: 4.1.0
xgboost: 1.4.1.1
dplyr: 1.0.7
data.table: 1.14.0

library(xgboost)
library(dplyr)
library(data.table)

set.seed(2)
# 1000 values of a single feature, all within a tiny interval
a <- matrix(runif(1000, 0.1234561, 0.1234562),
            ncol = 1, nrow = 1000)
colnames(a) <- c("b")
d <- abs(rnorm(1000, 3 * a[, 1]))
d2 <- xgb.DMatrix(data = a, label = d)
e <- xgboost::xgboost(data = d2, nrounds = 10, method = "hist",
                      objective = "reg:gamma")

xgb.plot.tree(e$feature_names, e, trees = 9)  # plot the 10th tree (trees are 0-indexed)
x <- 2
# log-ratio of 10-tree vs 9-tree predictions = contribution of the 10th tree
log((predict(e, a, ntreelimit = 10) / predict(e, a, ntreelimit = 9)))[x]
format(a[x, ], nsmall = 10)
@dwayne298 (Author)

Just realised you can get the actual value returned from that tree as follows:

predict(e,a,predleaf = T)[2,10]
xgb.dump(e)

Which, as expected from above, gives the result:
"21:leaf=0.00873187464"

@dwayne298 (Author)

Having done some testing with PMML files (using r2pmml), this issue seems to be caused by the parameter method="hist".

Comparing the R predictions against the PMML predictions on a sample dataset, including "hist" leads to differences, but excluding "hist" doesn't (at least, nowhere near as extreme; the remaining differences seem to be rounding errors).

@trivialfis (Member)

Let me take a closer look today. Thanks for opening the issue.

@trivialfis (Member) commented Oct 12, 2021

Hi, it's a floating-point precision issue. Internally XGBoost uses float32 for data, while R uses double by default. Your input is 0.1234561702374036, and a static cast of it to float32, static_cast<float>(v), is 0.123456173, which is a split value in the tree.

BTW, the correct parameter for the training algorithm is called tree_method, not method; I think there's a warning about that. ;-)

@dwayne298 (Author) commented Oct 12, 2021

Thanks for looking at it so quickly!

So nothing will change with the xgboost package? To get the correct output I'll need to make sure to display using float32?

Ah yes, that's what I get for rushing to get an example! Didn't get the warning for some reason.

Edit: I do get the warning, missed that.

Edit 2: Does this mean it can be a problem even when tree_method isn't "hist"?

@trivialfis (Member)

> So nothing will change with the xgboost package? To get the correct output I'll need to make sure to display using float32?

At the moment we don't plan to store the data in double precision. We use double optionally in some places where we need to accumulate multiple values.

> To get the correct output I'll need to make sure to display using float32?

That's one way around it, though I don't know how to ask R to use float32. In your example, the output is in some sense "correct": the split value 0.123456173 of the tree node is not invented but an actual value taken from your data; I guess it's the original a after truncation to float32. So a does follow the correct path.

@dwayne298 (Author)

Thank you for clarifying and all the work on xgboost!
