[RFC] Add support for Universal Binary JSON #7545

trivialfis · 2022-01-08T10:20:58Z

This is an RFC for adding the Universal Binary JSON format (https://ubjson.org/) to XGBoost
serialization to resolve its performance issues. In the original proposal for revising the
serialization format, one of the reasons we chose JSON was that several binary-equivalent
specifications exist for it.

Motivation

Serialization occurs in multiple places, including saving a model, pickling,
transferring the model between distributed workers, and releasing memory at the end of
training (copy). We adopted the JSON format in the past, which cleaned up the
serialization format but also introduced performance overhead. As originally proposed,
we would add a binary implementation when needed. Binary JSON can close the gap
between the current JSON format and our old binary format, allowing us to phase out support
for the old binary models.

Universal Binary JSON

Universal Binary JSON (https://ubjson.org/) is a specification for serializing a JSON
document into a binary format with a focus on performance and ease of implementation.

So far I have found 2 promising candidates. The first is the BSON specification. It's widely
adopted and well maintained, with big projects like MongoDB utilizing it. However, it's
complicated, requires a known document size, and most importantly it has a limit of
16 megabytes on the total document size (https://docs.mongodb.com/manual/reference/limits/).
The other is UBJSON, adopted in this RFC, which is efficient for arrays but has
fewer implementations. For instance, none of the existing Python implementations fully
support the specification. It has the concept of a typed container (array and object),
which particularly suits XGBoost since XGBoost has many arrays, including
tree weights, node indices, etc. Also, it's a relatively simple specification that we can
implement easily.
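
To make the typed-container idea concrete, here is a minimal sketch of how a float32 array could be laid out following the UBJSON draft's optimized array form (`[`, then `$` plus a type marker, then `#` plus a count, then the raw big-endian payload). The helper names are hypothetical and this is not XGBoost's writer; it also assumes the count is always written as an int32 (`l`).

```python
import json
import struct


def encode_f32_array(values):
    """Encode floats as a UBJSON strongly typed array of float32 ('d')."""
    # Header: '[' opens the array, '$d' fixes the element type to float32,
    # '#l' announces an int32 element count (an assumption of this sketch).
    header = b"[$d#l" + struct.pack(">i", len(values))
    # Payload: elements back to back, big-endian, with no per-element marker.
    payload = struct.pack(f">{len(values)}f", *values)
    return header + payload


def decode_f32_array(buf):
    """Inverse of encode_f32_array; only understands that exact layout."""
    if buf[:5] != b"[$d#l":
        raise ValueError("not a float32 typed array with an int32 count")
    (count,) = struct.unpack(">i", buf[5:9])
    return list(struct.unpack(f">{count}f", buf[9:9 + 4 * count]))


weights = [0.5, -1.25, 2.0]
binary = encode_f32_array(weights)
text = json.dumps(weights).encode("utf-8")
print(len(binary), "bytes as typed UBJSON array;", len(text), "bytes as JSON text")
```

Because the type and count are written once up front, each element costs exactly 4 bytes and the array needs no closing `]`, which is what makes this form attractive for long numeric arrays.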

Implementation

The serializer is easily added on top of the existing JSON infrastructure since the JSON
implementation in XGBoost has a modularized reader and writer. Also, I have changed model
serialization to use typed containers wherever appropriate.

Performance

The benchmark is carried out on a model trained on the airline dataset with 1000 boosting rounds
and a max_depth of 12. The following result is the total time for 1000 calls of Booster.__copy__. In
total, 1e6 tree copies are carried out in the test.

	Master	PR
Time	1279.9649624824524	53.655569553375244
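
A measurement of this kind can be sketched as follows. This is an assumption about the shape of the harness, not the script used for the numbers above: `FakeBooster` is a hypothetical stand-in so the snippet runs without XGBoost, and a real run would pass a trained Booster instead. The last lines only compute the ratio of the two reported totals.

```python
import copy
import time


def time_copies(obj, n_calls=1000):
    """Return the total seconds spent in n_calls shallow copies of obj."""
    start = time.perf_counter()
    for _ in range(n_calls):
        copy.copy(obj)  # dispatches to obj.__copy__ when the class defines it
    return time.perf_counter() - start


class FakeBooster:
    """Hypothetical stand-in for a trained model; not part of XGBoost."""

    def __init__(self, n_trees):
        self.trees = [[0.0] * 8 for _ in range(n_trees)]

    def __copy__(self):
        # Mimic a copy that re-creates every tree.
        clone = FakeBooster(0)
        clone.trees = [list(tree) for tree in self.trees]
        return clone


elapsed = time_copies(FakeBooster(n_trees=1000), n_calls=10)

# Ratio of the two totals reported in the table above.
speedup = 1279.9649624824524 / 53.655569553375244
print(f"stand-in elapsed: {elapsed:.4f}, reported Master/PR ratio: {speedup:.1f}x")
```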

Close #7145
Close #6697
Close #7158
Close #4060

TO-DOs

Remove the experimental tag for JSON model format.
Investigate JVM support.
More documents.
Add support to the SHAP package.
Add warnings according to the roadmap.
Roadmap for deprecating the old binary format.

hcho3 · 2022-01-08T23:40:31Z

I approve the idea of adopting UBJSON to speed up serialization.

> Close #7158

Do you plan to implement a method to serialize Booster object as a string or byte-buffer (without using the disk)? Or should this feature be in a different PR?

trivialfis · 2022-01-09T05:34:27Z

We already have that feature, in R that's xgb.save.raw

trivialfis · 2022-01-09T05:37:12Z

Hmm .. the SHAP package is using that raw buffer ...

hcho3 · 2022-01-09T08:33:02Z

@trivialfis There isn't an equivalent feature in Python. Treelite was using the xgb.Booster.save_raw() method, which doesn't work with categorical support, so I am looking for an alternative.

trivialfis · 2022-01-09T08:36:58Z

@hcho3 It's a function that calls Learner.SaveModel. This PR changes it from the old binary format to UBJSON, which I might change back because it is a breaking change.

trivialfis · 2022-01-09T08:37:34Z

But the idea is that it's quite easy to change the format for that function.

hcho3 · 2022-01-09T08:38:35Z

@trivialfis Can we add an optional argument to save_raw(), like save_raw(format='json') ? We can set the argument default to binary for now, but after a while we can change the default to json or ubjson.
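
A minimal sketch of that calling convention, under loud assumptions: `save_raw_sketch`, the parameter name `raw_format`, and the toy dict model are all hypothetical, and `pickle` merely stands in for the old binary format. It is not XGBoost's implementation; it only shows how an optional argument keeps existing callers on the old format until the default is flipped.

```python
import json
import pickle

# Formats accepted by this sketch; "binary" stays the default for now.
_FORMATS = ("binary", "json")


def save_raw_sketch(model: dict, raw_format: str = "binary") -> bytes:
    """Serialize a toy model to an in-memory buffer in the requested format."""
    if raw_format not in _FORMATS:
        raise ValueError(f"unknown format {raw_format!r}; expected one of {_FORMATS}")
    if raw_format == "json":
        return json.dumps(model).encode("utf-8")
    # Stand-in for the legacy binary path (assumption: pickle is NOT the real format).
    return pickle.dumps(model)


toy_model = {"trees": [[0.1, -0.2], [0.3]], "version": [1, 6, 0]}
as_json = save_raw_sketch(toy_model, raw_format="json")
as_binary = save_raw_sketch(toy_model)  # existing callers are unaffected
```

Changing the default later is then a one-line edit to the signature, while callers that pass the argument explicitly keep their behavior.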

trivialfis · 2022-01-09T08:40:31Z

Yes, we can. But we need a new C API function.

hcho3 · 2022-01-09T08:42:01Z

Got it. Let's add a new C API function.

trivialfis · 2022-01-09T10:18:35Z

Added a new API function along with support for R and Python.

trivialfis · 2022-01-09T11:05:14Z

@hcho3 I plan to gradually remove the support for the old binary model. A simple roadmap would be

Release 1.6: Add ubjson to Python/R/JVM, emit a warning about the model format. The JVM part might be more difficult, we will see.
Release 2.0: Continue the warning.
Release 2.1: Continue the warning, save the model as json or ubj by default. I haven't decided which one we should choose for file output. I would probably go with text JSON, since the typed array in this PR also improves performance for the text format.
Release 2.2: Remove support for saving to old binary format. But keep the loading part with a warning.
Release 2.3: Remove support for loading old binary format.

I will open an issue for the roadmap if it's feasible.

hcho3 · 2022-01-09T19:24:23Z

@trivialfis Can you put up a roadmap issue for phasing out the binary serialization format?

RAMitchell · 2022-01-10T11:19:09Z

This looks very nice. Can we expect dask performance to improve? In the past I remember model serialisation in a multi-process setting to be a significant bottleneck.

trivialfis · 2022-01-10T11:58:38Z

> Can we expect dask performance to improve? In the past I remember model serialisation in a multi-process setting to be a significant bottleneck.

We have an optimization in dask to avoid model serialization: specifically, users can pass a future object to xgb.dask.predict. But for training large models this can still be significant, for instance #5474 (comment).

ubjson is chosen for its performance and compactness. Its typed container suits XGBoost's schema, which has lots of arrays like leaf weights.

* Add UBJson reader and writer.
* Use it for memory snapshot.
* Add typed arrays for int and float.
* Remove most of the bracket operator.
* Remove assignment operator for JSON Value.
* Use typed array in model.
* Add support for different formats for saving raw buffer.

Code comments. R doc. Fixes. More explicit about type. Warning. Typo. Port changes. Avoid using string. Require move. cleanup.

trivialfis · 2022-01-22T13:47:11Z

All merged as separate PRs.

trivialfis added the status: WIP label Jan 8, 2022

This was referenced Jan 8, 2022

[RFC] Unify model format customize string or Json microsoft/LightGBM#4887

Open

Serializers convert floats to strings microsoft/LightGBM#4681

Open

trivialfis mentioned this pull request Jan 10, 2022

Implement ubjson. #7549

Merged

trivialfis force-pushed the ubjson branch from 4ad62f3 to 79c9821 Compare January 10, 2022 17:09

trivialfis removed the status: WIP label Jan 10, 2022

trivialfis force-pushed the ubjson branch from 79c9821 to b90b635 Compare January 11, 2022 04:57

trivialfis mentioned this pull request Jan 11, 2022

[breaking] Change internal model serialization to UBJSON. #7556

Merged

trivialfis force-pushed the ubjson branch from 3506136 to ad474ed Compare January 14, 2022 08:48

trivialfis force-pushed the ubjson branch from fb45229 to 9816389 Compare January 15, 2022 18:15

This was referenced Jan 15, 2022

[jvm-packages] Implement new save_raw in jvm-packages. #7570

Merged

[R] Implement new save_raw in R. #7571

Merged

trivialfis closed this Jan 22, 2022

trivialfis deleted the ubjson branch January 22, 2022 13:47

StrikerRUS mentioned this pull request Jan 23, 2022

[R-package] User-friendly redesign for lightgbm() microsoft/LightGBM#4968

Open
