
xgboost in R providing unexpected prediction #7294

Closed
dwayne298 opened this issue Oct 7, 2021 · 7 comments
@dwayne298

Below is code that builds a simple xgboost model to demonstrate the issue I've been seeing. Once the model has been built, we predict with it and take the second row of our data. If we take the log of the ratio between the predictions from the first 10 trees and the first 9 trees, it should give us the contribution of the 10th tree: 0.00873184 in this case.

Now, if we take the input to the tree (matrix "a", which has value 0.1234561702 for row 2) and trace it through the plotted tree by hand, we expect a prediction of 0.0121501638. However, it looks like at the second split (<0.123456173) it takes the wrong direction and ends up at the node with value 0.00873187464, which matches the actual prediction above.

Does anyone have an idea what is going on?

[Screenshot: plot of the 10th tree]

Versions:
R: 4.1.0
xgboost: 1.4.1.1
dplyr: 1.0.7
data.table: 1.14.0

library(xgboost)
library(dplyr)
library(data.table)

set.seed(2)
# 1000 values of a single feature, all within a tiny interval
a <- matrix(runif(1000, 0.1234561, 0.1234562),
            ncol = 1, nrow = 1000)
colnames(a) <- c("b")
d <- abs(rnorm(1000, 3 * a[, 1]))
d2 <- xgb.DMatrix(data = a, label = d)
e <- xgboost::xgboost(data = d2, nrounds = 10, method = "hist",
                      objective = "reg:gamma")

xgb.plot.tree(e$feature_names, e, trees = 9)  # plot the 10th tree (trees are 0-indexed)
x <- 2
# log-ratio of 10-tree vs 9-tree predictions = contribution of the 10th tree
log((predict(e, a, ntreelimit = 10) / predict(e, a, ntreelimit = 9)))[x]
format(a[x, ], nsmall = 10)
@dwayne298 (Author)

Just realised you can get the actual value returned from that tree as follows:

predict(e,a,predleaf = T)[2,10]
xgb.dump(e)

Which, as expected from above, gives the result:
"21:leaf=0.00873187464"

@dwayne298 (Author)

Having done some testing with PMML files (using r2pmml), this issue seems to be caused by the parameter method="hist".

Comparing the R predictions against the PMML predictions on a sample dataset, including "hist" leads to differences, but excluding "hist" doesn't (at least, nowhere near as extreme; the remaining differences seem to be rounding errors).

@trivialfis (Member)

Let me take a closer look today. Thanks for opening the issue.

@trivialfis (Member) commented Oct 12, 2021

Hi, it's a floating-point precision issue. Internally XGBoost uses float32 for data, while R uses double by default. Your input is 0.1234561702374036, and a static cast of it to float32, static_cast<float>(v), is 0.123456173, which is a split value in the tree.

BTW, the correct parameter for the training algorithm is called tree_method, not method; I think there's a warning about that. ;-)

@dwayne298 (Author) commented Oct 12, 2021

Thanks for looking at it so quickly!

So nothing will change with the xgboost package? To get the correct output I'll need to make sure to display using float32?

Ah yes, that's what I get for rushing to get an example! Didn't get the warning for some reason.

Edit: I do get the warning, missed that.

Edit 2: Does this mean it can be a problem even when tree_method isn't "hist"?

@trivialfis (Member)

> So nothing will change with the xgboost package? To get the correct output I'll need to make sure to display using float32?

At the moment we don't plan to store the data in double precision. We use double optionally in some places where we need to accumulate multiple values.

> To get the correct output I'll need to make sure to display using float32?

That's one way around it, though I don't know how to ask R to use float32. In your example, the output is in some sense "correct": the split value 0.123456173 of the tree node is not invented but an actual value taken from your data; I guess it's the original a after truncation to float32. So a does follow the correct path.

@dwayne298 (Author)

Thank you for clarifying and all the work on xgboost!
