diff --git a/R-package/DESCRIPTION b/R-package/DESCRIPTION index 468098a083e6..b5d7585a3ca8 100644 --- a/R-package/DESCRIPTION +++ b/R-package/DESCRIPTION @@ -64,5 +64,5 @@ Imports: data.table (>= 1.9.6), magrittr (>= 1.5), stringi (>= 0.5.2) -RoxygenNote: 7.1.0 +RoxygenNote: 7.1.1 SystemRequirements: GNU make, C++14 diff --git a/R-package/R/utils.R b/R-package/R/utils.R index 48e84be41fc6..b0c653f17671 100644 --- a/R-package/R/utils.R +++ b/R-package/R/utils.R @@ -308,18 +308,64 @@ xgb.createFolds <- function(y, k = 10) #' @name xgboost-deprecated NULL -#' Do not use saveRDS() for long-term archival of models. Use xgb.save() instead. +#' Do not use \code{\link[base]{saveRDS}} or \code{\link[base]{save}} for long-term archival of +#' models. Instead, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}}. #' -#' It is a common practice to use the built-in \code{saveRDS()} function to persist R objects to -#' the disk. While \code{xgb.Booster} objects can be persisted with \code{saveRDS()} as well, it -#' is not advisable to use it if the model is to be accessed in the future. If you train a model -#' with the current version of XGBoost and persist it with \code{saveRDS()}, the model is not -#' guaranteed to be accessible in later releases of XGBoost. To ensure that your model can be -#' accessed in future releases of XGBoost, use \code{xgb.save()} instead. For more details and -#' explanation, consult the page +#' It is a common practice to use the built-in \code{\link[base]{saveRDS}} function (or +#' \code{\link[base]{save}}) to persist R objects to the disk. While it is possible to persist +#' \code{xgb.Booster} objects using \code{\link[base]{saveRDS}}, it is not advisable to do so if +#' the model is to be accessed in the future. If you train a model with the current version of +#' XGBoost and persist it with \code{\link[base]{saveRDS}}, the model is not guaranteed to be +#' accessible in later releases of XGBoost. 
To ensure that your model can be accessed in future +#' releases of XGBoost, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}} instead. +#' +#' @details +#' Use \code{\link{xgb.save}} to save the XGBoost model as a stand-alone file. You may opt into +#' the JSON format by specifying the JSON extension. To read the model back, use +#' \code{\link{xgb.load}}. +#' +#' Use \code{\link{xgb.save.raw}} to save the XGBoost model as a sequence (vector) of raw bytes +#' in a future-proof manner. Future releases of XGBoost will be able to read the raw bytes and +#' re-construct the corresponding model. To read the model back, use \code{\link{xgb.load.raw}}. +#' The \code{\link{xgb.save.raw}} function is useful if you'd like to persist the XGBoost model +#' as part of another R object. +#' +#' Note: Do not use \code{\link{xgb.serialize}} to store models long-term. It persists not only the +#' model but also internal configurations and parameters, and its format is not stable across +#' multiple XGBoost versions. Use \code{\link{xgb.serialize}} only for checkpointing. +#' +#' For more details and explanation about model persistence and archival, consult the page #' \url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}. 
#' -#' @name a-compatibility-note-for-saveRDS +#' @examples +#' data(agaricus.train, package='xgboost') +#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2, +#' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic") +#' +#' # Save as a stand-alone file; load it with xgb.load() +#' xgb.save(bst, 'xgb.model') +#' bst2 <- xgb.load('xgb.model') +#' +#' # Save as a stand-alone file (JSON); load it with xgb.load() +#' xgb.save(bst, 'xgb.model.json') +#' bst2 <- xgb.load('xgb.model.json') +#' +#' # Save as a raw byte vector; load it with xgb.load.raw() +#' xgb_bytes <- xgb.save.raw(bst) +#' bst2 <- xgb.load.raw(xgb_bytes) +#' +#' # Persist XGBoost model as part of another R object +#' obj <- list(xgb_model_bytes = xgb.save.raw(bst), description = "My first XGBoost model") +#' # Persist the R object. Here, saveRDS() is okay, since it doesn't persist +#' # xgb.Booster directly. What's being persisted is the future-proof byte representation +#' # as given by xgb.save.raw(). +#' saveRDS(obj, 'my_object.rds') +#' # Read back the R object +#' obj2 <- readRDS('my_object.rds') +#' # Re-construct xgb.Booster object from the bytes +#' bst2 <- xgb.load.raw(obj2$xgb_model_bytes) +#' +#' @name a-compatibility-note-for-saveRDS-save NULL # Lookup table for the deprecated parameters bookkeeping diff --git a/R-package/R/xgb.Booster.R b/R-package/R/xgb.Booster.R index a2bde19cfb35..acc040c4b81c 100644 --- a/R-package/R/xgb.Booster.R +++ b/R-package/R/xgb.Booster.R @@ -111,6 +111,8 @@ xgb.get.handle <- function(object) { #' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic") #' saveRDS(bst, "xgb.model.rds") #' +#' # Warning: The resulting RDS file is only compatible with the current XGBoost version. +#' # Refer to the section titled "a-compatibility-note-for-saveRDS-save". 
#' bst1 <- readRDS("xgb.model.rds") #' if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds") #' # the handle is invalid: diff --git a/R-package/R/xgb.save.R b/R-package/R/xgb.save.R index 4a5d462ecd7f..1e68cd0adae8 100644 --- a/R-package/R/xgb.save.R +++ b/R-package/R/xgb.save.R @@ -13,7 +13,11 @@ #' #' Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{readRDS}} #' or \code{\link[base]{save}}). However, it would then only be compatible with R, and -#' corresponding R-methods would need to be used to load it. +#' corresponding R-methods would need to be used to load it. Moreover, persisting the model with +#' \code{\link[base]{saveRDS}} or \code{\link[base]{save}} will cause compatibility problems in +#' future versions of XGBoost. Consult \code{\link{a-compatibility-note-for-saveRDS-save}} to learn +#' how to persist models in a future-proof way, i.e., to make the model accessible in future +#' releases of XGBoost. #' #' @seealso #' \code{\link{xgb.load}}, \code{\link{xgb.Booster.complete}}. diff --git a/R-package/man/a-compatibility-note-for-saveRDS-save.Rd b/R-package/man/a-compatibility-note-for-saveRDS-save.Rd new file mode 100644 index 000000000000..63b8dfce52ac --- /dev/null +++ b/R-package/man/a-compatibility-note-for-saveRDS-save.Rd @@ -0,0 +1,62 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/utils.R +\name{a-compatibility-note-for-saveRDS-save} +\alias{a-compatibility-note-for-saveRDS-save} +\title{Do not use \code{\link[base]{saveRDS}} or \code{\link[base]{save}} for long-term archival of +models. Instead, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}}.} +\description{ +It is a common practice to use the built-in \code{\link[base]{saveRDS}} function (or +\code{\link[base]{save}}) to persist R objects to the disk.
While it is possible to persist +\code{xgb.Booster} objects using \code{\link[base]{saveRDS}}, it is not advisable to do so if +the model is to be accessed in the future. If you train a model with the current version of +XGBoost and persist it with \code{\link[base]{saveRDS}}, the model is not guaranteed to be +accessible in later releases of XGBoost. To ensure that your model can be accessed in future +releases of XGBoost, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}} instead. +} +\details{ +Use \code{\link{xgb.save}} to save the XGBoost model as a stand-alone file. You may opt into +the JSON format by specifying the JSON extension. To read the model back, use +\code{\link{xgb.load}}. + +Use \code{\link{xgb.save.raw}} to save the XGBoost model as a sequence (vector) of raw bytes +in a future-proof manner. Future releases of XGBoost will be able to read the raw bytes and +re-construct the corresponding model. To read the model back, use \code{\link{xgb.load.raw}}. +The \code{\link{xgb.save.raw}} function is useful if you'd like to persist the XGBoost model +as part of another R object. + +Note: Do not use \code{\link{xgb.serialize}} to store models long-term. It persists not only the +model but also internal configurations and parameters, and its format is not stable across +multiple XGBoost versions. Use \code{\link{xgb.serialize}} only for checkpointing. + +For more details and explanation about model persistence and archival, consult the page +\url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}. 
+} +\examples{ +data(agaricus.train, package='xgboost') +bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2, + eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic") + +# Save as a stand-alone file; load it with xgb.load() +xgb.save(bst, 'xgb.model') +bst2 <- xgb.load('xgb.model') + +# Save as a stand-alone file (JSON); load it with xgb.load() +xgb.save(bst, 'xgb.model.json') +bst2 <- xgb.load('xgb.model.json') + +# Save as a raw byte vector; load it with xgb.load.raw() +xgb_bytes <- xgb.save.raw(bst) +bst2 <- xgb.load.raw(xgb_bytes) + +# Persist XGBoost model as part of another R object +obj <- list(xgb_model_bytes = xgb.save.raw(bst), description = "My first XGBoost model") +# Persist the R object. Here, saveRDS() is okay, since it doesn't persist +# xgb.Booster directly. What's being persisted is the future-proof byte representation +# as given by xgb.save.raw(). +saveRDS(obj, 'my_object.rds') +# Read back the R object +obj2 <- readRDS('my_object.rds') +# Re-construct xgb.Booster object from the bytes +bst2 <- xgb.load.raw(obj2$xgb_model_bytes) + +} diff --git a/R-package/man/a-compatibility-note-for-saveRDS.Rd b/R-package/man/a-compatibility-note-for-saveRDS.Rd deleted file mode 100644 index 2c5c4a1b4b46..000000000000 --- a/R-package/man/a-compatibility-note-for-saveRDS.Rd +++ /dev/null @@ -1,15 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/utils.R -\name{a-compatibility-note-for-saveRDS} -\alias{a-compatibility-note-for-saveRDS} -\title{Do not use saveRDS() for long-term archival of models. Use xgb.save() instead.} -\description{ -It is a common practice to use the built-in \code{saveRDS()} function to persist R objects to -the disk. While \code{xgb.Booster} objects can be persisted with \code{saveRDS()} as well, it -is not advisable to use it if the model is to be accessed in the future. 
If you train a model -with the current version of XGBoost and persist it with \code{saveRDS()}, the model is not -guaranteed to be accessible in later releases of XGBoost. To ensure that your model can be -accessed in future releases of XGBoost, use \code{xgb.save()} instead. For more details and -explanation, consult the page -\url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}. -} diff --git a/R-package/man/xgb.Booster.complete.Rd b/R-package/man/xgb.Booster.complete.Rd index 2b38b4c04c38..214694565c29 100644 --- a/R-package/man/xgb.Booster.complete.Rd +++ b/R-package/man/xgb.Booster.complete.Rd @@ -38,6 +38,8 @@ bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_dep eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic") saveRDS(bst, "xgb.model.rds") +# Warning: The resulting RDS file is only compatible with the current XGBoost version. +# Refer to the section titled "a-compatibility-note-for-saveRDS-save". bst1 <- readRDS("xgb.model.rds") if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds") # the handle is invalid: diff --git a/R-package/man/xgb.create.features.Rd b/R-package/man/xgb.create.features.Rd index 9c59d90b1f58..fecd24ad82f3 100644 --- a/R-package/man/xgb.create.features.Rd +++ b/R-package/man/xgb.create.features.Rd @@ -24,9 +24,9 @@ This is the function inspired from the paragraph 3.1 of the paper: \strong{Practical Lessons from Predicting Clicks on Ads at Facebook} -\emph{(Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yan, xin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, +\emph{(Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yan, xin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, Joaquin Quinonero Candela)} - + International Workshop on Data Mining for Online Advertising (ADKDD) - August 24, 2014 \url{https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/}. 
@@ -37,10 +37,10 @@ Extract explaining the method: convenient way to implement non-linear and tuple transformations of the kind we just described. We treat each individual tree as a categorical feature that takes as value the -index of the leaf an instance ends up falling in. We use -1-of-K coding of this type of features. +index of the leaf an instance ends up falling in. We use +1-of-K coding of this type of features. -For example, consider the boosted tree model in Figure 1 with 2 subtrees, +For example, consider the boosted tree model in Figure 1 with 2 subtrees, where the first subtree has 3 leafs and the second 2 leafs. If an instance ends up in leaf 2 in the first subtree and leaf 1 in second subtree, the overall input to the linear classifier will diff --git a/R-package/man/xgb.cv.Rd b/R-package/man/xgb.cv.Rd index 8532305b11e9..98e70e48cade 100644 --- a/R-package/man/xgb.cv.Rd +++ b/R-package/man/xgb.cv.Rd @@ -28,7 +28,7 @@ xgb.cv( ) } \arguments{ -\item{params}{the list of parameters. The complete list of parameters is +\item{params}{the list of parameters. The complete list of parameters is available in the \href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}. Below is a shorter summary: \itemize{ diff --git a/R-package/man/xgb.dump.Rd b/R-package/man/xgb.dump.Rd index 210c6e2a967b..f1eeff2fbc30 100644 --- a/R-package/man/xgb.dump.Rd +++ b/R-package/man/xgb.dump.Rd @@ -16,14 +16,14 @@ xgb.dump( \arguments{ \item{model}{the model object.} -\item{fname}{the name of the text file where to save the model text dump. +\item{fname}{the name of the text file where to save the model text dump. If not provided or set to \code{NULL}, the model is returned as a \code{character} vector.} \item{fmap}{feature map file representing feature types. -Detailed description could be found at +Detailed description could be found at \url{https://github.com/dmlc/xgboost/wiki/Binary-Classification#dump-model}. 
See demo/ for walkthrough example in R, and -\url{https://github.com/dmlc/xgboost/blob/master/demo/data/featmap.txt} +\url{https://github.com/dmlc/xgboost/blob/master/demo/data/featmap.txt} for example Format.} \item{with_stats}{whether to dump some additional statistics about the splits. @@ -47,7 +47,7 @@ data(agaricus.train, package='xgboost') data(agaricus.test, package='xgboost') train <- agaricus.train test <- agaricus.test -bst <- xgboost(data = train$data, label = train$label, max_depth = 2, +bst <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic") # save the model in file 'xgb.model.dump' dump_path = file.path(tempdir(), 'model.dump') diff --git a/R-package/man/xgb.importance.Rd b/R-package/man/xgb.importance.Rd index 84a18e1f2a2e..d9367b2116c6 100644 --- a/R-package/man/xgb.importance.Rd +++ b/R-package/man/xgb.importance.Rd @@ -22,7 +22,7 @@ Non-null \code{feature_names} could be provided to override those in the model.} \item{trees}{(only for the gbtree booster) an integer vector of tree indices that should be included into the importance calculation. If set to \code{NULL}, all trees of the model are parsed. -It could be useful, e.g., in multiclass classification to get feature importances +It could be useful, e.g., in multiclass classification to get feature importances for each class separately. IMPORTANT: the tree index in xgboost models is zero-based (e.g., use \code{trees = 0:4} for first 5 trees).} @@ -37,7 +37,7 @@ For a tree model, a \code{data.table} with the following columns: \itemize{ \item \code{Features} names of the features used in the model; \item \code{Gain} represents fractional contribution of each feature to the model based on - the total gain of this feature's splits. Higher percentage means a more important + the total gain of this feature's splits. Higher percentage means a more important predictive feature. 
\item \code{Cover} metric of the number of observation related to this feature; \item \code{Frequency} percentage representing the relative number of times @@ -51,7 +51,7 @@ A linear model's importance \code{data.table} has the following columns: \item \code{Class} (only for multiclass models) class label. } -If \code{feature_names} is not provided and \code{model} doesn't have \code{feature_names}, +If \code{feature_names} is not provided and \code{model} doesn't have \code{feature_names}, index of the features will be used instead. Because the index is extracted from the model dump (based on C++ code), it starts at 0 (as in C/C++ or Python) instead of 1 (usual in R). } @@ -61,21 +61,21 @@ Creates a \code{data.table} of feature importances in a model. \details{ This function works for both linear and tree models. -For linear models, the importance is the absolute magnitude of linear coefficients. -For that reason, in order to obtain a meaningful ranking by importance for a linear model, -the features need to be on the same scale (which you also would want to do when using either +For linear models, the importance is the absolute magnitude of linear coefficients. +For that reason, in order to obtain a meaningful ranking by importance for a linear model, +the features need to be on the same scale (which you also would want to do when using either L1 or L2 regularization). 
} \examples{ # binomial classification using gbtree: data(agaricus.train, package='xgboost') -bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2, +bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic") xgb.importance(model = bst) # binomial classification using gblinear: -bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, booster = "gblinear", +bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, booster = "gblinear", eta = 0.3, nthread = 1, nrounds = 20, objective = "binary:logistic") xgb.importance(model = bst) diff --git a/R-package/man/xgb.model.dt.tree.Rd b/R-package/man/xgb.model.dt.tree.Rd index cf1750117968..b89d298b6298 100644 --- a/R-package/man/xgb.model.dt.tree.Rd +++ b/R-package/man/xgb.model.dt.tree.Rd @@ -20,7 +20,7 @@ Non-null \code{feature_names} could be provided to override those in the model.} \item{model}{object of class \code{xgb.Booster}} -\item{text}{\code{character} vector previously generated by the \code{xgb.dump} +\item{text}{\code{character} vector previously generated by the \code{xgb.dump} function (where parameter \code{with_stats = TRUE} should have been set). \code{text} takes precedence over \code{model}.} @@ -53,10 +53,10 @@ The columns of the \code{data.table} are: \item \code{Quality}: either the split gain (change in loss) or the leaf value \item \code{Cover}: metric related to the number of observation either seen by a split or collected by a leaf during training. -} +} When \code{use_int_id=FALSE}, columns "Yes", "No", and "Missing" point to model-wide node identifiers -in the "ID" column. When \code{use_int_id=TRUE}, those columns point to node identifiers from +in the "ID" column. When \code{use_int_id=TRUE}, those columns point to node identifiers from the corresponding trees in the "Node" column. 
} \description{ @@ -67,17 +67,17 @@ Parse a boosted tree model text dump into a \code{data.table} structure. data(agaricus.train, package='xgboost') -bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2, +bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2, eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic") (dt <- xgb.model.dt.tree(colnames(agaricus.train$data), bst)) -# This bst model already has feature_names stored with it, so those would be used when +# This bst model already has feature_names stored with it, so those would be used when # feature_names is not set: (dt <- xgb.model.dt.tree(model = bst)) # How to match feature names of splits that are following a current 'Yes' branch: merge(dt, dt[, .(ID, Y.Feature=Feature)], by.x='Yes', by.y='ID', all.x=TRUE)[order(Tree,Node)] - + } diff --git a/R-package/man/xgb.plot.deepness.Rd b/R-package/man/xgb.plot.deepness.Rd index b642398701e2..39e291a811cc 100644 --- a/R-package/man/xgb.plot.deepness.Rd +++ b/R-package/man/xgb.plot.deepness.Rd @@ -23,7 +23,7 @@ or a data.table result of the \code{xgb.model.dt.tree} function.} \item{which}{which distribution to plot (see details).} -\item{plot}{(base R barplot) whether a barplot should be produced. +\item{plot}{(base R barplot) whether a barplot should be produced. If FALSE, only a data.table is returned.} \item{...}{other parameters passed to \code{barplot} or \code{plot}.} @@ -45,10 +45,10 @@ When \code{which="2x1"}, two distributions with respect to the leaf depth are plotted on top of each other: \itemize{ \item the distribution of the number of leafs in a tree model at a certain depth; - \item the distribution of average weighted number of observations ("cover") + \item the distribution of average weighted number of observations ("cover") ending up in leafs at certain depth. 
} -Those could be helpful in determining sensible ranges of the \code{max_depth} +Those could be helpful in determining sensible ranges of the \code{max_depth} and \code{min_child_weight} parameters. When \code{which="max.depth"} or \code{which="med.depth"}, plots of either maximum or median depth diff --git a/R-package/man/xgb.plot.tree.Rd b/R-package/man/xgb.plot.tree.Rd index 3f9f99a18baf..8fd7196afdba 100644 --- a/R-package/man/xgb.plot.tree.Rd +++ b/R-package/man/xgb.plot.tree.Rd @@ -60,7 +60,7 @@ The content of each node is organised that way: \item \code{Gain} (for split nodes): the information gain metric of a split (corresponds to the importance of the node in the model). \item \code{Value} (for leafs): the margin value that the leaf may contribute to prediction. -} +} The tree root nodes also indicate the Tree index (0-based). The "Yes" branches are marked by the "< split_value" label. @@ -80,7 +80,7 @@ xgb.plot.tree(model = bst) xgb.plot.tree(model = bst, trees = 0, show_node_id = TRUE) \dontrun{ -# Below is an example of how to save this plot to a file. +# Below is an example of how to save this plot to a file. # Note that for `export_graph` to work, the DiagrammeRsvg and rsvg packages must also be installed. library(DiagrammeR) gr <- xgb.plot.tree(model=bst, trees=0:1, render=FALSE) diff --git a/R-package/man/xgb.save.Rd b/R-package/man/xgb.save.Rd index 7d1842d89e97..235fc504c9ed 100644 --- a/R-package/man/xgb.save.Rd +++ b/R-package/man/xgb.save.Rd @@ -15,21 +15,25 @@ xgb.save(model, fname) Save xgboost model to a file in binary format. } \details{ -This methods allows to save a model in an xgboost-internal binary format which is universal +This method allows saving a model in an xgboost-internal binary format which is universal among the various xgboost interfaces.
In R, the saved model file could be read-in later -using either the \code{\link{xgb.load}} function or the \code{xgb_model} parameter +using either the \code{\link{xgb.load}} function or the \code{xgb_model} parameter of \code{\link{xgb.train}}. -Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{readRDS}} -or \code{\link[base]{save}}). However, it would then only be compatible with R, and -corresponding R-methods would need to be used to load it. +Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{readRDS}} +or \code{\link[base]{save}}). However, it would then only be compatible with R, and +corresponding R-methods would need to be used to load it. Moreover, persisting the model with +\code{\link[base]{saveRDS}} or \code{\link[base]{save}} will cause compatibility problems in +future versions of XGBoost. Consult \code{\link{a-compatibility-note-for-saveRDS-save}} to learn +how to persist models in a future-proof way, i.e., to make the model accessible in future +releases of XGBoost. } \examples{ data(agaricus.train, package='xgboost') data(agaricus.test, package='xgboost') train <- agaricus.train test <- agaricus.test -bst <- xgboost(data = train$data, label = train$label, max_depth = 2, +bst <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic") xgb.save(bst, 'xgb.model') bst <- xgb.load('xgb.model') diff --git a/R-package/man/xgb.train.Rd b/R-package/man/xgb.train.Rd index e592b7a036ab..94db595cbc65 100644 --- a/R-package/man/xgb.train.Rd +++ b/R-package/man/xgb.train.Rd @@ -42,7 +42,7 @@ xgboost( ) } \arguments{ -\item{params}{the list of parameters. The complete list of parameters is +\item{params}{the list of parameters. The complete list of parameters is available in the \href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}.
Below is a shorter summary: diff --git a/doc/tutorials/saving_model.rst b/doc/tutorials/saving_model.rst index 44a85cb7cc30..544ef4c66a01 100644 --- a/doc/tutorials/saving_model.rst +++ b/doc/tutorials/saving_model.rst @@ -15,27 +15,36 @@ name with ``.json`` as file extension when saving/loading model: ``booster.save_model('model.json')``. More details below. Before we get started, XGBoost is a gradient boosting library with focus on tree model, -which means inside XGBoost, there are 2 distinct parts: the model consisted of trees and -algorithms used to build it. If you come from Deep Learning community, then it should be +which means inside XGBoost, there are 2 distinct parts: + +1. The model consisting of trees and +2. Hyperparameters and configurations used for building the model. + +If you come from the Deep Learning community, then it should be clear to you that there are differences between the neural network structures composed of -weights with fixed tensor operations, and the optimizers (like RMSprop) used to train -them. +weights with fixed tensor operations, and the optimizers (like RMSprop) used to train them. -So when one calls ``booster.save_model``, XGBoost saves the trees, some model parameters -like number of input columns in trained trees, and the objective function, which combined +So when one calls ``booster.save_model`` (``xgb.save`` in R), XGBoost saves the trees, some model +parameters like number of input columns in trained trees, and the objective function, which combined to represent the concept of "model" in XGBoost. As for why are we saving the objective as part of model, that's because objective controls transformation of global bias (called ``base_score`` in XGBoost). Users can share this model with others for prediction, evaluation or continue the training with a different set of hyper-parameters etc. + However, this is not the end of story. There are cases where we need to save something more than just the model itself.
For example, in distrbuted training, XGBoost performs checkpointing operation. Or for some reasons, your favorite distributed computing framework decide to copy the model from one worker to another and continue the training in there. In such cases, the serialisation output is required to contain enougth information to continue previous training without user providing any parameters again. We consider -such scenario as memory snapshot (or memory based serialisation method) and distinguish it -with normal model IO operation. In Python, this can be invoked by pickling the -``Booster`` object. Other language bindings are still working in progress. +such a scenario as a **memory snapshot** (or memory based serialisation method) and distinguish it +from normal model IO operations. Currently, memory snapshots are used in the following places: + +* Python package: when the ``Booster`` object is pickled with the built-in ``pickle`` module. +* R package: when the ``xgb.Booster`` object is persisted with the built-in functions ``saveRDS`` + or ``save``. + +Other language bindings are still a work in progress. .. note:: @@ -48,12 +57,17 @@ To enable JSON format support for model IO (saving only the trees and objective) a filename with ``.json`` as file extension: .. code-block:: python + :caption: Python bst.save_model('model_file_name.json') -While for enabling JSON as memory based serialisation format, pass -``enable_experimental_json_serialization`` as a training parameter. In Python this can be -done by: +.. code-block:: r + :caption: R + + xgb.save(bst, 'model_file_name.json') + +To use JSON to store memory snapshots, add ``enable_experimental_json_serialization`` as a training +parameter. In Python this can be done by: .. code-block:: python @@ -63,13 +77,33 @@ done by: Notice the ``filename`` is for Python intrinsic function ``open``, not for XGBoost. Hence parameter ``enable_experimental_json_serialization`` is required to enable JSON format.
-As the name suggested, memory based serialisation captures many stuffs internal to -XGBoost, so it's only suitable to be used for checkpoints, which doesn't require stable -output format. That being said, loading pickled booster (memory snapshot) in a different -XGBoost version may lead to errors or undefined behaviors. But we promise the stable -output format of binary model and JSON model (once it's no-longer experimental) as they -are designed to be reusable. This scheme fits as Python itself doesn't guarantee pickled -bytecode can be used in different Python version. + +Similarly, in the R package, add ``enable_experimental_json_serialization`` to the training +parameters: + +.. code-block:: r + + params <- list(enable_experimental_json_serialization = TRUE, ...) + bst <- xgb.train(params, dtrain, nrounds = 10) + saveRDS(bst, 'filename.rds') + +*************************************************************** +A note on backward compatibility of models and memory snapshots +*************************************************************** + +**We guarantee backward compatibility for models but not for memory snapshots.** + +Models (trees and objective) use a stable representation, so that models produced in earlier +versions of XGBoost are accessible in later versions of XGBoost. **If you'd like to store or archive +your model for long-term storage, use** ``save_model`` (Python) and ``xgb.save`` (R). + +On the other hand, a memory snapshot (serialisation) captures many internals of XGBoost, and its +format is not stable and is subject to frequent changes. Therefore, memory snapshots are suitable for +checkpointing only, where you persist the complete snapshot of the training configurations so that +you can recover robustly from possible failures and resume the training process. Loading a memory +snapshot generated by an earlier version of XGBoost may result in errors or undefined behaviors.
+**If a model is persisted with** ``pickle.dump`` (Python) or ``saveRDS`` (R), **then the model may +not be accessible in later versions of XGBoost.** *************************** Custom objective and metric *************************** @@ -98,6 +132,18 @@ suits simple use cases, and it's advised not to use pickle when stability is nee It's located in ``xgboost/doc/python`` with the name ``convert_090to100.py``. See comments in the script for more details. +A similar procedure may be used to recover a model persisted in an old RDS file. In R, you can +install an older version of XGBoost using the ``remotes`` package: + +.. code-block:: r + + library(remotes) + remotes::install_version("xgboost", "0.90.0.1") # Install version 0.90.0.1 + +Once the desired version is installed, you can load the RDS file with ``readRDS`` and recover the +``xgb.Booster`` object. Then call ``xgb.save`` to export the model using the stable representation. +Now you should be able to use the model in the latest version of XGBoost. + ******************************************************** Saving and Loading the internal parameters configuration ********************************************************
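A note on the pattern recommended in the R examples above: storing the future-proof byte representation (as returned by `xgb.save.raw()`) inside a larger object, and persisting only that wrapper, is what makes `saveRDS`/`pickle` safe to use. The sketch below illustrates the wrapping pattern with placeholder bytes standing in for a real model, so it runs without XGBoost installed; in real code the bytes would come from `xgb.save.raw()` and be fed back to `xgb.load.raw()`.

```python
import os
import pickle
import tempfile

# Placeholder standing in for the future-proof byte representation of a
# trained model (what xgb.save.raw() returns in R). Hypothetical bytes only.
model_bytes = b"example-model-bytes"

# Wrap the bytes in a larger object together with any metadata.
obj = {"xgb_model_bytes": model_bytes, "description": "My first XGBoost model"}

# Pickling here is safe across versions: what is persisted is the stable
# byte representation of the model, not the in-memory booster object itself.
path = os.path.join(tempfile.mkdtemp(), "my_object.pkl")
with open(path, "wb") as f:
    pickle.dump(obj, f)

# Read the wrapper back and recover the model bytes intact.
with open(path, "rb") as f:
    obj2 = pickle.load(f)

assert obj2["xgb_model_bytes"] == model_bytes
print(obj2["description"])  # prints "My first XGBoost model"
```

The same round trip is what the R example performs with `saveRDS(obj, ...)` and `readRDS`; only the stable byte payload crosses the serialisation boundary.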