
Imported JSON xgb.dump yields incorrect predictions due to internal single precision floats #4097

Closed
ras44 opened this issue Feb 1, 2019 · 5 comments


ras44 (Contributor) commented Feb 1, 2019

Related to the discussion in: #3960

```r
library(xgboost)
library(jsonlite)

# set display options to show 12 digits
options(digits = 12)

dates <- c(20180130, 20180130, 20180130,
           20180130, 20180130, 20180130,
           20180131, 20180131, 20180131,
           20180131, 20180131, 20180131,
           20180131, 20180131, 20180131,
           20180134, 20180134, 20180134)

labels <- c(1, 1, 1,
            1, 1, 1,
            0, 0, 0,
            0, 0, 0,
            0, 0, 0,
            0, 0, 0)

data <- data.frame(dates = dates, labels = labels)

bst <- xgboost(
  data = as.matrix(data$dates),
  label = labels,
  nthread = 2,
  nrounds = 1,
  objective = "binary:logistic",
  missing = NA,
  max_depth = 1
)
bst_preds <- predict(bst, as.matrix(data$dates))

# display the JSON dump string
cat(xgb.dump(bst, with_stats = FALSE, dump_format = 'json'))

# dump to JSON, then import the JSON model
bst_json <- xgb.dump(bst, with_stats = FALSE, dump_format = 'json')
bst_from_json <- jsonlite::fromJSON(bst_json, simplifyDataFrame = FALSE)
node <- bst_from_json[[1]]
bst_from_json_preds <- 1 / (1 + exp(-1 * ifelse(data$dates < node$split_condition,
                                                node$children[[1]]$leaf,
                                                node$children[[2]]$leaf)))

# test that values are equal
bst_preds - bst_from_json_preds
stopifnot(bst_preds - bst_from_json_preds == 0)
```

xgboost handles values as single-precision floats internally; however, when the model is exported as JSON, the values in the child leaf nodes lose precision. The discrepancy is amplified in binary:logistic scores when the JSON model is imported and parsed for scoring. Importing a binary-saved model does not have this issue, since it apparently preserves the single-precision float values.
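A minimal NumPy sketch (mine, not from the report) of the rounding that drives this: 20180131 is exact in float64 but needs 25 significant bits, one more than a float32 significand holds at this magnitude, so xgboost's internal 32-bit copy of the feature is not the value in the 64-bit input.

```python
import numpy as np

# 20180131 is odd; float32 can only represent even integers near 2e7
# (the gap between adjacent float32 values there is 2), so the cast
# rounds it to the nearest even-significand neighbour.
x64 = 20180131.0
x32 = np.float32(x64)
print(float(x32))                      # 20180132.0 -- off by 1
print(float(np.float32(20180130.0)))   # 20180130.0 -- even, still exact
```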

hcho3 (Collaborator) commented Feb 1, 2019

See #3960 (comment). Can you convert your input data into 32-bit float first?

ras44 (Contributor, Author) commented Feb 1, 2019

Thanks for the reference. In current applications, much of the data comes from a very large DB, so converting it to 32-bit float would be very cumbersome. I use xgb.dump to translate the model's scoring process into SQL so that I can score in-database (instead of attempting to score in-memory). The features in the database would also need to be converted, and that's unlikely to happen.

hcho3 (Collaborator) commented Feb 1, 2019

There may not be an easy solution, since XGBoost converts training data into 32-bit floats internally at training time. So splits are chosen assuming that input data is 32-bit.

To accommodate your use case, we would need to dump each 32-bit floating-point value x so that the following holds for every 64-bit floating-point value y:

    x < convert-to-32-bit(y)    if and only if    convert-to-64-bit(x) < y

This may or may not be practicable.

EDIT: In fact, such a guarantee is NOT possible. There will always be a gap between convert-to-64-bit(convert-to-32-bit(y)) and y. See here for an explanation of why 1.0f (32-bit) and 1.0 (64-bit) are NOT the same number. So the only safe bet is to convert the input data to 32-bit.
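A small NumPy counterexample (mine, not from the thread) to the equivalence above: pick a threshold x that is exact in float32 and a 64-bit y strictly inside the float32 gap just above it, so one side of the iff holds and the other does not.

```python
import numpy as np

# Threshold x as dumped (exactly representable in float32), and a
# 64-bit feature value y lying inside the float32 gap above x.
x = np.float32(20180130.0)
y = 20180130.5

lhs = bool(x < np.float32(y))   # float32(y) rounds down onto x -> False
rhs = bool(float(x) < y)        # 20180130.0 < 20180130.5       -> True
print(lhs, rhs)                 # False True: the equivalence fails
```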

@khotilov Does my explanation seem reasonable to you?

ras44 (Contributor, Author) commented Feb 1, 2019

I agree. I think the only reasonable solution is to convert any input data to 32-bit. Thanks for sharing.

ras44 (Contributor, Author) commented May 2, 2019

In case anyone is interested, I was able to compute the same values from the JSON output using the following code. Note the use of the float package to convert both the input data and the tree values to 32-bit floats:

```r
library(xgboost)
library(jsonlite)
library(float)

# set display options to show 22 digits
options(digits = 22)

dates <- c(20180130, 20180130, 20180130,
           20180130, 20180130, 20180130,
           20180131, 20180131, 20180131,
           20180131, 20180131, 20180131,
           20180131, 20180131, 20180131,
           20180134, 20180134, 20180134)

labels <- c(1, 1, 1,
            1, 1, 1,
            0, 0, 0,
            0, 0, 0,
            0, 0, 0,
            0, 0, 0)

data <- data.frame(dates = dates, labels = labels)

bst <- xgboost(
  data = as.matrix(data$dates),
  label = labels,
  nthread = 2,
  nrounds = 1,
  objective = "binary:logistic",
  missing = NA,
  max_depth = 1
)
bst_preds <- predict(bst, as.matrix(data$dates))

# display the JSON dump string
cat(xgb.dump(bst, with_stats = FALSE, dump_format = 'json'))

# dump to JSON, then import the JSON model
bst_json <- xgb.dump(bst, with_stats = FALSE, dump_format = 'json')
bst_from_json <- jsonlite::fromJSON(bst_json, simplifyDataFrame = FALSE)
node <- bst_from_json[[1]]

# cast both the input data and the tree values to 32-bit with fl()
# before comparing and applying the logistic transform
bst_from_json_preds <- ifelse(
  as.numeric(fl(data$dates)) < as.numeric(fl(node$split_condition)),
  as.numeric(fl(1) / (fl(1) + exp(fl(-1) * fl(node$children[[1]]$leaf)))),
  as.numeric(fl(1) / (fl(1) + exp(fl(-1) * fl(node$children[[2]]$leaf))))
)

# test that values are equal
bst_preds
bst_from_json_preds
stopifnot(bst_preds - bst_from_json_preds == 0)
```
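For readers outside R, the same float-emulation idea can be sketched with NumPy. The split and leaf values below are made-up placeholders (the real ones come from xgb.dump); the point is only that every operand is cast to float32 before comparison, as fl() does above.

```python
import numpy as np

# Hypothetical depth-1 tree from a JSON dump (placeholder values)
split_condition = 20180131.0
leaf_left, leaf_right = 0.36, -0.36

# Cast features and split to float32 first, mirroring xgboost internals;
# note 20180131 rounds to 20180132 in float32, so it goes right.
dates = np.array([20180130, 20180131, 20180134], dtype=np.float64)
dates32 = dates.astype(np.float32)
split32 = np.float32(split_condition)

raw = np.where(dates32 < split32, np.float32(leaf_left), np.float32(leaf_right))
preds = 1.0 / (1.0 + np.exp(-raw.astype(np.float64)))
print(preds)   # first value > 0.5 (left leaf), the rest < 0.5
```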

@lock lock bot locked as resolved and limited conversation to collaborators Jul 31, 2019