xgboost in R providing unexpected prediction #7294
Comments
Just realised you can get the actual value returned from that tree as follows:
Which, as expected from above, gives the result.
Having done some testing with pmml files (using r2pmml), this issue seems to occur due to the parameter method="hist". Comparing the R vs pmml predictions on a sample dataset, including "hist" leads to differences but excluding it doesn't (at least, nowhere near as extreme; the remaining differences look like rounding errors).
Let me take a closer look today. Thanks for opening the issue.
Hi, it's a floating-point issue. Internally XGBoost uses float32 for data while R uses double by default, so your input is rounded to single precision before prediction. BTW, the correct name of the parameter for the training algorithm is tree_method.
Thanks for looking at it so quickly! So nothing will change in the xgboost package? To get the correct output I'll need to make sure to display using float32? Ah yes, that's what I get for rushing to put together an example! Didn't get the warning for some reason.
Edit: I do get the warning, missed that.
Edit 2: Does this mean it can be a problem even when tree_method isn't "hist"?
At the moment we don't plan to store the data in double precision. We use double optionally in some places where we need to accumulate multiple values.
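The effect of accumulating in single versus double precision can be seen outside of XGBoost. Below is a small plain-Python sketch (not XGBoost's actual code) that emulates a float32 accumulator with the stdlib struct module: repeatedly rounding the running sum to float32 drifts visibly away from the double-precision result.

```python
import struct

def f32(x):
    """Round a Python double to the nearest IEEE 754 float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Sum 0.1 one million times: once with a float32 accumulator
# (rounded after every addition), once in double precision.
acc32 = 0.0
for _ in range(1_000_000):
    acc32 = f32(acc32 + f32(0.1))
acc64 = 0.1 * 1_000_000  # double-precision reference, ~100000

# The float32 accumulator drifts far from 100000; the double one does not.
print(acc32, acc64)
```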
That's one way around it. I don't know how to ask R to use float32. In your example, the output is in some sense "correct": the split value in the dump is already the float32 representation, so the comparison is effectively made in single precision.
Thank you for clarifying and all the work on xgboost!
Below is code that produces a simple xgboost model showing the issue I've been seeing. Once the model has been built, we predict using it and take the second row of our data. If we take the log of the relative difference between the predictions of the 10-tree and 9-tree models, it should give us the prediction of the 10th tree: 0.00873184 in this case.
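That step implicitly assumes a log-link objective (e.g. count:poisson or reg:gamma), where the prediction is the exponential of the sum of leaf values, so log(pred10 / pred9) recovers the 10th tree's leaf. A minimal sketch with made-up leaf values (not the real model's numbers):

```python
import math

# Hypothetical leaf values returned for one row by each of 10 trees;
# the last one is chosen to match the value quoted in the issue.
leaves = [0.05, -0.02, 0.01, 0.03, -0.01,
          0.02, 0.015, -0.005, 0.01, 0.00873184]

def predict(n_trees, base_margin=0.0):
    """Prediction using only the first n_trees, under a log link."""
    return math.exp(base_margin + sum(leaves[:n_trees]))

# The log of the ratio of the 10-tree and 9-tree predictions
# is exactly the 10th tree's leaf value.
print(math.log(predict(10) / predict(9)))  # ~0.00873184
```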
Now if we take the input to the tree (matrix "a", which has value 0.1234561702 in row 2) and run it through the model by hand, we expect a prediction of 0.0121501638. However, it looks like at the second split (<0.123456173) it takes the wrong direction and ends up at the leaf with value 0.00873187464, very close to the observed prediction above!
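The wrong turn is consistent with the float32 explanation given in this thread: the row value 0.1234561702 and the split threshold 0.123456173 round to the same float32 value, so the single-precision comparison x < split is false even though it is true in double precision. A quick check in plain Python, using struct to emulate float32 rounding:

```python
import struct

def f32(x):
    """Round a Python double to the nearest IEEE 754 float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

x = 0.1234561702     # feature value for row 2
split = 0.123456173  # threshold at the second split in the dump

print(x < split)            # True: in double precision we go left
print(f32(x) == f32(split)) # True: both round to the same float32
print(f32(x) < f32(split))  # False: in float32 we go right instead
```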
Does anyone have an idea what is going on?
Versions:
R: 4.1.0
xgboost: 1.4.1.1
dplyr: 1.0.7
data.table: 1.14.0