
Large integers as split conditions not represented well in dumps #3960

Closed
jjdelvalle opened this issue Dec 3, 2018 · 11 comments

Comments

@jjdelvalle

jjdelvalle commented Dec 3, 2018

If your split conditions are large integers, there is a good chance they won't be represented correctly in a dump file. Curiously though, if you save the model in binary format it will be represented properly and will predict correctly. The JSON and text dumps are the problem.

Minimal example

library(xgboost)

dates <- c(20180130, 20180130, 20180130,
          20180130, 20180130, 20180130,
          20180131, 20180131, 20180131,
          20180131, 20180131, 20180131,
          20180131, 20180131, 20180131,
          20180131, 20180131, 20180131)

labels <- c(1, 1, 1,
           1, 1, 1,
           0, 0, 0,
           0, 0, 0,
           0, 0, 0,
           0, 0, 0)

data <- data.frame(dates = dates, labels=labels)
data

bst <- xgboost(
  data = as.matrix(data$dates),
  label = labels,
  nthread = 2,
  nrounds = 2,
  objective = "binary:logistic",
  missing = NA,
  max_depth = 2
)

### Output JSON
#[
# { "nodeid": 0, "depth": 0, "split": 0, "split_condition": 20180132, "yes": 1, "no": 2, "missing": 1, "gain": 10.9636364, "cover": 4.5, "children": [
#   { "nodeid": 1, "leaf": 0.360000014, "cover": 1.5 },
#   { "nodeid": 2, "leaf": -0.450000018, "cover": 3 }
#   ]},
# { "nodeid": 0, "depth": 0, "split": 0, "split_condition": 20180132, "yes": 1, "no": 2, "missing": 1, "gain": 7.22717142, "cover": 4.30553818, "children": [
#   { "nodeid": 1, "leaf": 0.301630229, "cover": 1.45243073 },
#   { "nodeid": 2, "leaf": -0.363783985, "cover": 2.85310745 }
#   ]}
# ]

As you can see, the splits are all wrong in the JSON file. If you parse this model, everything will be predicted as 1.

Random thoughts: I know floats above a certain range get rounded to a multiple of 2, which is why this is happening (20180131 isn't a multiple of 2 but 20180132 is). I checked the code but I couldn't find an easy way to get the value in a non-float form; I couldn't get it as an integer or as a double. Could someone more familiar with the library tell me how it's stored internally and how to fetch it, so that maybe I could fix this in a fork while an appropriate solution is found?

@hcho3
Collaborator

hcho3 commented Dec 3, 2018

All split thresholds are stored as single-precision floats internally. So the issue is not confined to the dump function.
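To illustrate the rounding this implies, here is a minimal Python sketch (not from the thread), using `struct` to squeeze a double through single precision:

```python
import struct

def to_float32(x: float) -> float:
    """Round a double to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Above 2**24, consecutive integers are no longer representable in float32:
# the spacing between adjacent floats is 2, so odd values get rounded.
print(to_float32(20180130))  # 20180130.0 (even, exactly representable)
print(to_float32(20180131))  # 20180132.0 (odd, rounded)
```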

@jjdelvalle
Author

jjdelvalle commented Dec 4, 2018

That's interesting. How come I can save the model (using xgb.save, not xgb.dump), load it, and have it evaluate the conditions properly? What else is going on here?

@hcho3
Collaborator

hcho3 commented Dec 4, 2018

@clinchergt Beats me. We'll have to do some debugging here.

@jjdelvalle
Author

@tqchen Any idea what's going on here?

@trivialfis
Member

@hcho3 Is it possible for the new JSON RFC to take dump_model into consideration?

@jjdelvalle
Author

Any news regarding this? As of now, this makes dumps pretty unreliable.

@trivialfis
Member

@clinchergt There's an ongoing discussion on using JSON to represent XGBoost's state; we'll get to this once that's finished. I want to remove the old, separate model dumping.

@khotilov
Member

@clinchergt You are hitting the limits of single precision floats here (see the example below). Feature values and splits are stored as floats, so while int 20180130 gets converted to float 20180130, int 20180131 is converted to float 20180132, and the comparison would still (luckily) work for these specific numbers when comparing floats to floats.

Shifting from 20XX into the two-digit XX year range would be the easiest solution in your case. There's sometimes a price to pay for the single precision, and some extra work is needed in situations when this precision is insufficient.

Perhaps some limitations on feature values due to single precision should be documented somewhere.

#include <iostream>
#include <iomanip>
#include <cstdint>

int main() {
  union ufloat {
    float f;
    std::uint32_t i;
  };

  for (std::uint32_t i = 20180130; i < 20180130 + 10; ++i) {
    ufloat x{static_cast<float>(i)};  // initializes the first member of the union
    std::cout << std::dec << std::setprecision(17) << i << " "
              << std::defaultfloat << x.f << "   "
              << std::hex << i << " "
              << x.i << std::endl;
  }
}

produces this:

20180130 20180130   133eca2 4b99f651
20180131 20180132   133eca3 4b99f652
20180132 20180132   133eca4 4b99f652
20180133 20180132   133eca5 4b99f652
20180134 20180134   133eca6 4b99f653
20180135 20180136   133eca7 4b99f654
20180136 20180136   133eca8 4b99f654
20180137 20180136   133eca9 4b99f654
20180138 20180138   133ecaa 4b99f655
20180139 20180140   133ecab 4b99f656

@jjdelvalle
Author

@khotilov Yes, that is exactly the problem; I stated as much in the OP. However, what confuses me is that xgboost itself evaluates the conditions properly, yet when the values are dumped this float issue appears.

Is xgboost internally using int for integer thresholds?

@khotilov
Copy link
Member

what confuses me, is that xgboost itself does evaluate it properly

Since internally it uses float features and thresholds, it sees everything at single precision during training and during evaluation. If you convert the data to float first and then apply the parsed model to it, you will get correct predictions.

If an R example would help more than C++, here's one:

> library(float)
> dates <- c(20180130, 20180131) # R uses double precision for numeric
> dates
[1] 20180130 20180131
> fl(dates) # precision is lost after conversion to 32 bit floats
# A float32 vector: 2
[1] 20180130 20180132
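The same effect, sketched in Python (not from the thread; `to_float32` is a helper that round-trips a double through single precision): comparing float-to-float still separates the two dates, while a parser comparing raw doubles against the dumped threshold sends 20180131 down the wrong branch.

```python
import struct

def to_float32(x: float) -> float:
    """Round a double to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

threshold = 20180132.0  # the split_condition as it appears in the JSON dump

# Inside XGBoost both sides are float32, so the comparison still works:
assert to_float32(20180130) < threshold        # goes left
assert not (to_float32(20180131) < threshold)  # 20180131 -> 20180132, goes right

# A naive parser comparing doubles takes the wrong branch for 20180131:
assert 20180131 < threshold                    # goes left, hence "everything is 1"
```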

@ras44
Contributor

ras44 commented May 2, 2019

For anyone interested, this will correctly reproduce the binary model's predictions for the above example by parsing the JSON output:

library(xgboost)
library(jsonlite)
library(float)

# show enough digits to see the precision loss
options(digits=22)


dates <- c(20180130, 20180130, 20180130,
          20180130, 20180130, 20180130,
          20180131, 20180131, 20180131,
          20180131, 20180131, 20180131,
          20180131, 20180131, 20180131,
          20180134, 20180134, 20180134)

labels <- c(1, 1, 1,
           1, 1, 1,
           0, 0, 0,
           0, 0, 0,
           0, 0, 0,
           0, 0, 0)

data <- data.frame(dates = dates, labels=labels)

bst <- xgboost(
  data = as.matrix(data$dates), 
  label = labels,
  nthread = 2,
  nrounds = 1,
  objective = "binary:logistic",
  missing = NA,
  max_depth = 1
)
bst_preds <- predict(bst,as.matrix(data$dates))

# display the json dump string
cat(xgb.dump(bst, with_stats = FALSE, dump_format='json'))

#dump to json, import the json model
bst_json <- xgb.dump(bst, with_stats = FALSE, dump_format='json')
bst_from_json <- jsonlite::fromJSON(bst_json, simplifyDataFrame = FALSE)
node <- bst_from_json[[1]]
bst_from_json_preds <- ifelse(as.numeric(fl(data$dates))<as.numeric(fl(node$split_condition)),
                              as.numeric(fl(1)/(fl(1)+exp(fl(-1)*fl(node$children[[1]]$leaf)))),
                              as.numeric(fl(1)/(fl(1)+exp(fl(-1)*fl(node$children[[2]]$leaf))))
                              )

# test that values are equal
bst_preds
bst_from_json_preds
stopifnot(bst_preds - bst_from_json_preds == 0)
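The same round-trip can be sketched in Python (a sketch, not from the thread; it hardcodes the single-tree JSON structure shown at the top of this issue rather than calling xgboost):

```python
import json
import math
import struct

def f32(x: float) -> float:
    """Round a double through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

# First tree from the dump at the top of this issue (stats omitted)
dump = """[{"nodeid": 0, "depth": 0, "split": 0, "split_condition": 20180132,
            "yes": 1, "no": 2, "missing": 1, "children": [
              {"nodeid": 1, "leaf": 0.360000014},
              {"nodeid": 2, "leaf": -0.450000018}]}]"""
tree = json.loads(dump)[0]

def predict(date: float) -> float:
    # Cast both sides to float32 before comparing, as XGBoost does internally
    if f32(date) < f32(tree["split_condition"]):
        child = tree["children"][0]  # "yes" branch
    else:
        child = tree["children"][1]  # "no" branch
    return 1.0 / (1.0 + math.exp(-child["leaf"]))  # binary:logistic transform

print(predict(20180130))  # > 0.5: predicted positive
print(predict(20180131))  # < 0.5: rounds to 20180132, takes the "no" branch
```

Without the `f32` cast on the left-hand side, 20180131 would compare less than 20180132 and be predicted positive, reproducing the bug described in the OP.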

@lock lock bot locked as resolved and limited conversation to collaborators Jul 31, 2019