Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Add support for Universal Binary JSON #7545

Closed
wants to merge 1 commit into from

Conversation

trivialfis
Copy link
Member

@trivialfis trivialfis commented Jan 8, 2022

This is an RFC for adding binary JSON format (https://ubjson.org/) to XGBoost for
serialization to resolve the performance issues. In the original proposal of revising the
serialization format, one of the reasons we choose JSON is that there are multiple
specifications of binary equivalence.

Motivation

The serialization can occur in multiple places, including saving model, pickle,
transferring the model between distributed workers, releasing memory at the end of
training (copy). We have adopted JSON format in the past, which cleanups the
serialization format, but also introduced performance overhead. As originally proposed,
we will add a binary implementation when needed. The binary JSON can close the gap
between the current JSON format and our old binary format, allowing us to phase out support
for the old binary models.

Universal Binary JSON

Universal Binary JSON (https://ubjson.org/) is a specification for serializing JSON
document into the binary format with a focus on performance and ease of implementation.

So far I have found 2 promising candidates. First is the BSON specification. It's widely
adopted and well maintained with big projects like mongoDB utilizing it. However it's
complicated, requires known document size, and most importantly it has a limit on the
total document size as 16 megabytes. (https://docs.mongodb.com/manual/reference/limits/).
Another one is the ubjson adopted in this RFC, which is efficient for array objects but has
lesser implementations. For instance non of the existing Python implementations have full
support for the specification. It has the concept of a typed container (array and object),
which particularly suits the need for XGBoost since XGBoost has many arrays including
tree weights, node indices, etc. Also, it's a relatively simple specification that we can
implement easily.

Implementation

The serializer is easily added on top of existing JSON infrastructure since the JSON
implementation in XGBoost has a modularized reader and writer. Also, I have changed model
serialization to using typed containers wherever appropriate.

Performance

Benchmark is carried out on a model trained on airline dataset with 1000 boosting rounds
and 12 max_depth. The following result is a total of 1000 calls of Booster.__copy__. In
total 1e6 tree copies are carried out in the test.

Master PR
Time 1279.9649624824524 53.655569553375244

Close #7145
Close #6697
Close #7158
Close #4060

TO-DOs

  • Remove the experimental tag for JSON model format.
  • Investigate JVM support.
  • More documents.
  • Add support to the SHAP package.
  • Add warnings according to the roadmap.
  • Roadmap for deprecating the old binary format.

@hcho3
Copy link
Collaborator

hcho3 commented Jan 8, 2022

I approve the idea of adopting UBJSON to speed up serialization.

Close #7158

Do you plan to implement a method to serialize Booster object as a string or byte-buffer (without using the disk)? Or should this feature be in a different PR?

@trivialfis
Copy link
Member Author

We already have that feature, in R that's xgb.save.raw

@trivialfis
Copy link
Member Author

Hmm .. the SHAP package is using that raw buffer ...

@hcho3
Copy link
Collaborator

hcho3 commented Jan 9, 2022

@trivialfis There isn't an equivalent feature in Python. Treelite was using xgb.Booster.save_raw() method which doesn't work with categorical support and I am looking for an alternative.

@trivialfis
Copy link
Member Author

@hcho3 It's a function that calls Learner.SaveModel. This PR changes it to use UBJSON from old binary format, which I might change back due to breaking.

@trivialfis
Copy link
Member Author

But the idea is that it's quite easy to change the format for that function.

@hcho3
Copy link
Collaborator

hcho3 commented Jan 9, 2022

@trivialfis Can we add an optional argument to save_raw(), like save_raw(format='json') ? We can set the argument default to binary for now, but after a while we can change the default to json or ubjson.

@trivialfis
Copy link
Member Author

Yes, we can. But we need a new C api function.

@hcho3
Copy link
Collaborator

hcho3 commented Jan 9, 2022

Got it. Let's add a new C API function.

@trivialfis
Copy link
Member Author

Added a new API function along with support for R and Python.

@trivialfis
Copy link
Member Author

trivialfis commented Jan 9, 2022

@hcho3 I plan to gradually remove the support for the old binary model. A simple roadmap would be

  • Release 1.6: Add ubjson to python/R/jvm, emit a warning about the model format. The jvm part might be more difficult, we will see.
  • Release 2.0: Continue the warning.
  • Release 2.1: Continue the warning, save the model as json or ubj by default. I haven't decided which one should we choose for file output. I would probably go with text JSON, since the typed array in this PR also improves the performance for using text format.
  • Release 2.2: Remove support for saving to old binary format. But keep the loading part with a warning.
  • Release 2.3: Remove support for loading old binary format.

I will open an issue for the roadmap if it's feasible.

@hcho3
Copy link
Collaborator

hcho3 commented Jan 9, 2022

@trivialfis Can you put up a roadmap issue for phasing out the binary serialization format?

@trivialfis trivialfis mentioned this pull request Jan 10, 2022
@RAMitchell
Copy link
Member

This looks very nice. Can we expect dask performance to improve? In the past I remember model serialisation in a multi-process setting to be a significant bottleneck.

@trivialfis
Copy link
Member Author

Can we expect dask performance to improve? In the past I remember model serialisation in a multi-process setting to be a significant bottleneck.

We have optimization on dask to avoid model serialization, specifically, users can pass a future object to xgb.dask.predict. But for training large models this can still be significant, for instance #5474 (comment) .

jvm-packages/xgboost4j/src/native/xgboost4j.cpp Outdated Show resolved Hide resolved
src/c_api/c_api.cc Outdated Show resolved Hide resolved
src/c_api/c_api_utils.h Outdated Show resolved Hide resolved
ubjson is chosen for its performance and compactness.  Its typed container suites
XGBoost's schema which has lots of arrays like leaf weights.

* Add UBJson reader and writer.
* Use it for memory snapshot.
* Add typed arrays for int and float
* Remove most of the bracket operator.
* Remove assignment operator for JSON Value.
* Use typed array in model.
* Add support for different formats for saving raw buffer.

Code comments.

R doc.

Fixes.

More explicit about type.

Warning.

Typo.

Port changes.

Avoid using string.

Require move.

cleanup.
@trivialfis
Copy link
Member Author

All merged as separated PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants