[RFC] Add support for Universal Binary JSON #7545
I approve the idea of adopting UBJSON to speed up serialization.
Do you plan to implement a method to serialize Booster object as a string or byte-buffer (without using the disk)? Or should this feature be in a different PR?
We already have that feature, in R that's xgb.save.raw
Hmm .. the SHAP package is using that raw buffer ...
@trivialfis There isn't an equivalent feature in Python. Treelite was using
@hcho3 It's a function that calls
But the idea is that it's quite easy to change the format for that function.
@trivialfis Can we add an optional argument to
Yes, we can. But we need a new C api function.
Got it. Let's add a new C API function.
Added a new API function along with support for R and Python.
@hcho3 I plan to gradually remove the support for the old binary model. A simple roadmap would be
I will open an issue for the roadmap if it's feasible.
@trivialfis Can you put up a roadmap issue for phasing out the binary serialization format?
This looks very nice. Can we expect dask performance to improve? In the past I remember model serialisation in a multi-process setting to be a significant bottleneck.
We have an optimization in the dask interface that avoids model serialization: users can pass a future object to xgb.dask.predict. But for training large models the cost can still be significant, for instance #5474 (comment).
ubjson is chosen for its performance and compactness. Its typed containers suit XGBoost's schema, which has lots of arrays like leaf weights.
* Add UBJson reader and writer.
* Use it for memory snapshot.
* Add typed arrays for int and float
* Remove most of the bracket operator.
* Remove assignment operator for JSON Value.
* Use typed array in model.
* Add support for different formats for saving raw buffer.
Code comments. R doc. Fixes. More explicit about type. Warning. Typo. Port changes. Avoid using string. Require move. cleanup.
All merged as separate PRs.
This is an RFC for adding a binary JSON format (https://ubjson.org/) to XGBoost
serialization to resolve the performance issues. In the original proposal for revising the
serialization format, one of the reasons we chose JSON was that multiple
binary-equivalent specifications exist.
Motivation
Serialization can occur in multiple places, including saving a model, pickling,
transferring the model between distributed workers, and releasing memory at the end of
training (copy). We adopted the JSON format in the past, which cleaned up the
serialization format but also introduced performance overhead. As originally proposed,
we will add a binary implementation when needed. Binary JSON can close the gap
between the current JSON format and our old binary format, allowing us to phase out support
for the old binary models.
Universal Binary JSON
Universal Binary JSON (https://ubjson.org/) is a specification for serializing a JSON
document into a binary format with a focus on performance and ease of implementation.
So far I have found 2 promising candidates. The first is the BSON specification. It's widely
adopted and well maintained, with big projects like MongoDB using it. However, it's
complicated, requires a known document size, and most importantly limits the
total document size to 16 megabytes (https://docs.mongodb.com/manual/reference/limits/).
The other is UBJSON, adopted in this RFC, which is efficient for arrays but has
fewer implementations. For instance, none of the existing Python implementations fully
supports the specification. It has the concept of a typed container (array and object),
which particularly suits XGBoost since XGBoost has many arrays, including
tree weights, node indices, etc. Also, it's a relatively simple specification that we can
implement easily.
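To make the typed container idea concrete, here is a minimal Python sketch of a UBJSON strongly typed float32 array, following the layout in the UBJSON specification (`[`, then `$` plus the element type, then `#` plus the element count, then the raw big-endian payload). The helper names `encode_float32_array` and `decode_float32_array` and the sample weights are made up for illustration; this is not XGBoost's implementation.

```python
import struct


def encode_float32_array(values):
    """Encode a list of floats as a UBJSON strongly typed float32 array."""
    # '[' opens an array, '$d' fixes every element to float32, and
    # '#l' announces an int32 element count. When a count is present,
    # no closing ']' marker is written.
    header = b"[$d#l" + struct.pack(">i", len(values))
    # UBJSON stores numeric values in big-endian byte order.
    payload = struct.pack(f">{len(values)}f", *values)
    return header + payload


def decode_float32_array(buf):
    """Decode a buffer produced by encode_float32_array."""
    if buf[:5] != b"[$d#l":
        raise ValueError("not a float32 typed array with an int32 count")
    (count,) = struct.unpack(">i", buf[5:9])
    return list(struct.unpack(f">{count}f", buf[9 : 9 + 4 * count]))


# Hypothetical leaf weights, chosen to be exactly representable in float32.
leaf_weights = [0.5, -0.25, 1.0, 0.125]
encoded = encode_float32_array(leaf_weights)
decoded = decode_float32_array(encoded)
```

Because the type marker is written once in the header, each element costs a fixed 4 bytes and can be read back without parsing digits, which is where the benefit over text JSON comes from for long numeric arrays.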
Implementation
The serializer is easily added on top of the existing JSON infrastructure since the JSON
implementation in XGBoost has a modularized reader and writer. Also, I have changed model
serialization to use typed containers wherever appropriate.
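As a rough illustration of why a modular writer makes this cheap, the Python sketch below serializes the same in-memory value tree through two interchangeable writers. The class names `JsonTextWriter` and `UbjsonWriter` are hypothetical and do not mirror XGBoost's C++ classes. The binary writer handles only a small subset of UBJSON (untyped containers, int64, float64, strings) and always uses int64 lengths for simplicity.

```python
import json
import struct


class JsonTextWriter:
    """Text back end: delegates to the standard json module."""

    def write(self, value):
        return json.dumps(value, separators=(",", ":")).encode("utf-8")


class UbjsonWriter:
    """Binary back end for a small subset: dict, list, str, bool, int, float."""

    def write(self, value):
        out = bytearray()
        self._emit(value, out)
        return bytes(out)

    def _emit_length_prefixed(self, text, out):
        # A length is an integer marker ('L' = int64) plus the value,
        # followed by the UTF-8 bytes.
        raw = text.encode("utf-8")
        out += b"L" + struct.pack(">q", len(raw)) + raw

    def _emit(self, value, out):
        if isinstance(value, dict):
            out += b"{"
            for key, item in value.items():
                # Object keys are strings written without the 'S' marker.
                self._emit_length_prefixed(key, out)
                self._emit(item, out)
            out += b"}"
        elif isinstance(value, list):
            out += b"["
            for item in value:
                self._emit(item, out)
            out += b"]"
        elif isinstance(value, str):
            out += b"S"
            self._emit_length_prefixed(value, out)
        elif isinstance(value, bool):
            # bool must be checked before int because bool subclasses int.
            out += b"T" if value else b"F"
        elif isinstance(value, int):
            out += b"L" + struct.pack(">q", value)
        elif isinstance(value, float):
            out += b"D" + struct.pack(">d", value)
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")


# A made-up model fragment; both writers consume the same tree unchanged.
model = {"name": "gbtree", "weights": [0.5, -0.25], "num_trees": 2}
for writer in (JsonTextWriter(), UbjsonWriter()):
    print(type(writer).__name__, len(writer.write(model)))
```

The point of the design is that the value tree never needs to know which format is being produced, so adding a binary format means adding one more writer (and a matching reader) rather than touching the model code.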
Performance
The benchmark is carried out on a model trained on the airline dataset with 1000 boosting rounds
and a max_depth of 12. The following result is the total over 1000 calls of
Booster.__copy__. In total, 1e6 tree copies are carried out in the test.
Close #7145
Close #6697
Close #7158
Close #4060
TO-DOs