-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R] On-demand serialization + standardization of attributes #9924
Conversation
Thank you for introducing the new serialization! This is a major upgrade to the existing interface and I'm excited about it. I will look into the PR deeper in the coming days and learn more about the changes. Since this is related to serialization, I will assist in writing some testing guidelines, as mentioned earlier, serialization has been a foot gun and we still have issues with it today. I would like to be extra careful from the beginning. |
Related #9908 |
Looks like the windows build from buildkite is still using R 3.6:
|
What's the minimum R version do we need for ALTREP? I can update the R version on the Windows side. |
It's R 4.3. |
Currently, we have a policy of not including binary files in the git repository, since git isn't suited for handling diffs in binary (non-text) files. Some alternatives:
|
Does it mean that the new XGBoost R package will only be compatible with R 4.3+ ? We probably should document the requirement. |
It's already stated in the |
Sorry, I missed the conversation in #9847, as I was gone for vacation. I will go ahead and update the R version in the Windows CI pipeline. |
Awesome work! I noted that we will have a mix of non-Markdown and Markdown help files in "xgb.Booster.R" and "xgb.model.dt.tree.R". E.g., backticks instead of \code, two asterisks instead of \bold, Furthermore, I am avoiding long titles, Of course, I can go over these files after merging. Currently, I am not working on the help files until most of the PRs will be merged to avoid such conflicts. |
There's also the option of pre-downloading it as part of the R package build script; and/or to look for the file under somewhere like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objections from me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huge thanks for the work on serialization! It will take some time to process it (I'm still on PTO). Can we extract some of the unrelated changes and merge them first?
In addition, a brief introduction to how it works as code comment would be nice. Something that can get new comers up to speed.
Thanks for pointing this out. What would be the python equivalent of this function? |
pickle. |
specifically, |
I'm quite excited about this PR, and am comfortable with merging it regarding the interface change, this way we can unblock other PRs and avoid too many rebase work. Do you want to merge it and leave the RDS issue in a follow-up PR or do you want to have it solved in this one (I can help with the serialization)? |
I've added a small note about the R attribute I am wondering however if there's some C-level function that would automatically gather and serialize all C attributes that aren't handled by This step could be done in the same ALTREP serializer if we know what else needs to be there, but I see that the python class has many things that it keeps in attributes and I'm not sure which ones are C things that need to be re-assigned. Does it need to also gather
Let's better do it in this PR to keep all serialization-related things together. |
@david-cortes See #9924 (comment) .
It saves everything that's defined in C, including parameters and attributes. It's used by the now removed xgboost/python-package/xgboost/core.py Line 1811 in 01c4711
|
I think we can use it here: https://github.com/dmlc/xgboost/pull/9924/files#r1445141247 . |
Changed it to use Seems to pass the tests and work as expected. |
We should probably retain the note about backward compatibility: if saveRDS is used, the model may not be fully usable in a future version of XGBoost. |
Is that because |
When loading from a different version of XGBoost, the internal parameters are automatically discarded, in which case, it's the same as |
Ok, then I guess it's no big deal, and the compatibility note makes sense to keep. |
See #9734 (comment)
sounds good! |
This comment has been minimized.
This comment has been minimized.
I've rewritten the compatibility note to reflect the current situation per my understanding. Would be ideal if you and @trivialfis could take a look at it. It doesn't have a warning to not use R's serializers, but describes a bit the differences. I think it's still important to have it, since now we have an incompatibility with models that were created before this PR, and an incompatibility with older versions of |
Thanks for the new wording. I recognize that some use cases do call for the use of |
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
ref #9810
This PR changes the serialization logic of
xgb.Booster
objects to trigger only on-demand, by using an R altrep list class on which serialization methods are implemented, and not all of the expected altrep methods are implemented so as to avoid potential unwanted conversions that might lose the serializers.Since the serialization logic changes with this PR, the way in which attributes are kept in model objects also needed to be changed:
attributes(model)
; and C-level attributes which are kept in the model JSON, which can be accessed and set throughxgb.attributes(model)
.xgb.Booster.handle
class. I removed it as part of this PR, since it would not have any use-case where it'd have some advantage overxgb.Booster
without optional attributes.This separation means that now accessing attributes is not as simple as calling
model$<attribute>
- now one needs to explicitly use eitherattributes
orxgb.attributes
.Since the logic for keeping track of attributes was changed here, this PR required doing changes throughout pretty much all the R code. As it is right now, all of the tests are passing, but I am not entirely confident that this PR won't break something not covered in the tests, and I am not sure if I have updated all the docs that became outdated after these changes.
The changes here also now make xgboost incompatible with {caret}, so I've removed all the references to it. I also removed suggestions around it since this package is not in active development anymore as it was superceded by {tidymodels}. Haven't tested {mlr} but I have a feeling it might also break.
A couple notes about the PR and about many things I noticed - would be ideal if maintainers could open independent issues about some of these if needed:
Prioritization
As this PR creates merge conflicts with all others and it's the hardest to keep track of, would be very helpful to merge it before others like inplace predict or quantile dmatrices or roxygen updates.
Shallow and deep copies
Before this PR, serialization and de-serialization was triggered multiple times throughout the runs of different functions. Doing this was very inefficient, and had the potential to create inconsistencies between the R attributes and the C booster.
After this PR, there are no such duplications - objects are updated in-place, at the C level. It's now also possible to incrementally update a booster in-place through
xgb.train
, which is controlled by an optional parametertraining_continuation
, but I wasn't sure how this parameter should play along with xgboost parameterprocess_type
.predict
or similar methods.Since one might now need to make copies of the booster, a helper function
xgb.copy.Booster
was also implemented in the public interface. I could not find any C-level function to duplicate a booster so I used the ubj to-bytes serializer for it, which I am guessing is not the most efficient way.Information lost during serialization
It seems that the serialization functions that save models to disk (e.g.
XGBoosterSaveModel
), regardless of the format that they end up using (json/ubj), will never save custom user attributes that one sets to the booster outside of training, except for some particular attributes like 'niter'; unlike the serialization functions that save them to bytes likeXGBoosterSaveModelToRaw
which keep all of the custom attributes.Additionally, these serializers also seem to lose feature names and types if they were set in the booster through
XGBoosterSetStrFeatureInfo
. As such, I modified some of the tests to manually remove the feature names for the comparisons, but I have a feeling that something here could be improved.Serialization compatibility
After this PR, it will not be possible to load models that were saved with R serializers like
saveRDS
with a previous xgboost version, and it will not be possible to load models saved withsaveRDS
after this PR in previous xgboost versions. I updated the compatibility note to mention the breakpoint being xgboost version 2.1.0, which I suppose will be the next release.There's a test which downloads serialized files from the internet and which will need to be updated. For now, I simply modified the test to skip the Rds files, but in the future would be more logical to update the files there.
I'm also thinking that it might be better to commit those model files here and bundle them in the R package instead of downloading them when the tests are run - if the files are updated now, unless the download link is changed, then the same test running on older xgboost versions will fail - will also make testing faster as there won't be a need to download files from the internet.
Serialization advise
XGBoost's own docs advise users to not use R-specific serializers, but lots of the functionalities from the public interface actually require using R serializers, such as function
xgb.gblinear.history
, which uses attributes fromcallbacks
that aren't part of the standard C booster and thus do not get saved in functions likeXGBoosterSaveModel
orXGBoosterSaveModelToRaw
. Moreover, from the point above, even C-level information is lost when using to-disk serializers, such as the feature names in the booster, which can also lead to very unexpected behaviors (e.g. they are saved withxgb.save.raw
, but not withxgb.save
).I was thinking that perhaps the booster could add serialized versions of the R attributes (as obtained by
base::serialize
), but those raw bytes would not conform to a JSON-loadable string, would add latency and memory usage, are not usable in other interfaces, among other minuses.It leaves me wondering what should be the actual suggestion in the doc, given that a user following that advise might find that things actually break after the fact, and there's advantages to using
xgb.save.raw
+ R'swriteBin
compared to directly usingxgb.save
.Plotting multi-valued trees
I am not sure if the plotting functionalities need any kind of adjustment for models like multi-quantile regression, multi-target objectives, and so on. I didn't add any modification here.
In the case of multi-quantile regression, functions seem to produce some output, but I am not sure about the correctness of such output, since there's one value per leaf visible there. Should this perhaps produce an error at the C++ level?
For mutli-output regression, functions like
xgb.dump
will error out from the C side - I would guess that multi-quantile and multi-class might share some structural similarities but am not familiar with tree plotting.Retrieving booster attributes
As part of this PR, I added internal functions to access fields that are part of the C booster, such as the booster type.
I implemented them by producing full dumps of the booster JSON through
XGBoosterSaveJsonConfig
, then parsing them with R'sjsonlite
, and accessing the field in that parsed JSON.This is rather inefficient, but I couldn't find any C-level function to extract only one particular field from the booster JSON. Would be quite useful to add more functions to retrieve and set commonly used attributes like the booster type and the number of threads, or to retrieve only one particular attribute from the JSON if given a path like
field1.subfield2.<etc>
.Missing functions in core library
Both the R and Python interfaces now resort to parsing text dumps in order to extract information such as model coefficients in a linear booster. If such functionalities are going to be required throughout different interfaces, would be ideal to create C-level accessors for them that would for example avoid the loss of precision from the conversion from float to string and back.
Furthermore, some of these parsings broke after adding feature names, and the regexes needed to be updated. I am not sure that I've pushed updated everywhere necessary.
Attribute 'niter'
Handling of this attribute was rather strange before this PR. This attribute was kept both in the booster attributes and in the R attributes, with the caveat that the C one used base-0 indexing, while the R one used base-1. Further, they were not updated in synch everywhere.
After this PR, I preferred to remove this attribute, sticking instead to 'nrounds' as returned by function
XGBoosterBoostedRounds
.There is one particular issue with this function however: after calling
XGBoosterSetParam
on a fitted booster, depending on what the parameters there contain, functionXGBoosterBoostedRounds
will afterwards return zero, but the trees will still be there when looking at e.g.learner.gradient_booster.gbtree_model_param.num_trees
in the JSON config.Linear intercepts
When creating a
gblinear
booster, it will have two intercepts:base_score
.Both of those intercepts are added together in the prediction, which is quite confusing. Would be ideal if this booster type could auto-adjust itself after the fact to contain only one intercept - for example by forcibly setting
base_score
to zero and adding it to the last coefficient.It would also improve things for some objective like
multi:softmax
as it gets an automatically added abase_score
of 0.5 which, logically speaking, does not help with convergence / loss decrease.Legacy R code
Changing things throughout multiple places made me notice that the logic for callbacks is very hard to follow and rather unidiomatic - for example, it requires defining variables inside the function call that do not seem to be used and one cannot tell from a look at that function if they could be removed or not. I am guessing it will also be the hardest thing to review in this PR, and the one that's most prone to errors.
Since these callbacks are part of the public interface (meaning: the user can pass custom callbacks), it'd be ideal to:
bst
variable in the environment that needs to be accessed), towards using function keyword arguments.xgb.Callback
that would take arguments likebefore_training
,after_training
,before_iteration
,after_iteration
, each taking functions expecting a defined function signature.<<-
setters - better to keep a shared Renvironment
-class variable for such purposes.R-specific functionality
I also noticed that there are some functionalities in the R interface that the Python one lacks, such as function
xgb.gblinear.history
. I think this function could also be moved into a C-level functionality to keep that coefficient history as an internal booster property.