---
title: "mixgb: Multiple Imputation Through XGBoost"
author: "Yongshi Deng"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{mixgb: Multiple Imputation Through XGBoost}
  %\VignetteEngine{knitr::rmarkdown}
---
## Introduction
The `mixgb` package offers a scalable solution for imputing large datasets using
XGBoost, subsampling and predictive mean matching. It uses XGBoost, a highly
efficient implementation of gradient boosted trees, to capture interactions and
non-linear relations automatically.
Moreover, we have integrated subsampling and predictive mean matching to
minimize bias and reflect appropriate imputation variability. Our package
supports various types of variables and offers flexible settings for subsampling
and predictive mean matching. We also include diagnostic tools for evaluating
the quality of the imputed values.
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Impute missing values with `mixgb`
We first load the `mixgb` package and the `nhanes3_newborn` dataset, which contains 16 variables of various types
(integer/numeric/factor/ordinal factor). There are 9 variables with missing values.
```{r}
library(mixgb)
str(nhanes3_newborn)
colSums(is.na(nhanes3_newborn))
```
To impute this dataset, we can use the default settings. The default
number of imputed datasets is `m = 5`. Note that we do not need to convert
our data into `dgCMatrix` or one-hot encoded format, as the package
converts it automatically. Variables should be of the following types:
numeric, integer, factor or ordinal factor.
```{r, eval = FALSE}
# use mixgb with default settings
imputed.data <- mixgb(data = nhanes3_newborn, m = 5)
```
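If a dataset contains columns of other types, such as character or logical, they need to be converted before imputation. The following is a minimal sketch in base R only; the toy data frame `raw.data` and the helper `prepare_types()` are hypothetical names introduced for illustration and are not part of `mixgb`.

```{r, eval = FALSE}
# Toy data frame (hypothetical) with a character and a logical column
raw.data <- data.frame(
  age = c(21L, NA, 35L, 40L),
  weight = c(60.5, 72.1, NA, 80.0),
  sex = c("F", "M", NA, "F"),
  smoker = c(TRUE, FALSE, NA, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert character and logical columns to factors,
# leaving numeric, integer and factor columns untouched
prepare_types <- function(df) {
  for (v in names(df)) {
    if (is.character(df[[v]]) || is.logical(df[[v]])) {
      df[[v]] <- factor(df[[v]])
    }
  }
  df
}

clean.data <- prepare_types(raw.data)

# Show the first class of every column after conversion
sapply(clean.data, function(x) class(x)[1])
```

Missing values are preserved by `factor()`, so the converted columns can still be imputed.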
### Customize imputation settings
We can also customize imputation settings:
- The number of imputed datasets: `m`
- The number of imputation iterations: `maxit`
- XGBoost hyperparameters and verbose settings: `xgb.params`, `nrounds`, `early_stopping_rounds`, `print_every_n` and `verbose`
- Subsampling ratio: by default, `subsample = 0.7`; users can change this value under the `xgb.params` argument
- Predictive mean matching settings: `pmm.type`, `pmm.k` and `pmm.link`
- Whether ordinal factors should be converted to integers (the imputation process may be faster): `ordinalAsInteger`
- Whether to use bootstrapping: `bootstrap`
- Initial imputation methods for different types of variables: `initial.num`, `initial.int` and `initial.fac`
- Whether to save models for imputing new data: `save.models` and `save.vars`
```{r, eval = FALSE}
# Use mixgb with chosen settings
params <- list(
  max_depth = 5,
  subsample = 0.9,
  nthread = 2,
  tree_method = "hist"
)
imputed.data <- mixgb(
  data = nhanes3_newborn, m = 10, maxit = 2,
  ordinalAsInteger = FALSE, bootstrap = FALSE,
  pmm.type = "auto", pmm.k = 5, pmm.link = "prob",
  initial.num = "normal", initial.int = "mode", initial.fac = "mode",
  save.models = FALSE, save.vars = NULL,
  xgb.params = params, nrounds = 200, early_stopping_rounds = 10,
  print_every_n = 10L, verbose = 0
)
```
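To give some intuition for what a donor count such as `pmm.k = 5` controls, here is a simplified sketch of predictive mean matching in base R. It is not the implementation used in `mixgb`: a linear model stands in for XGBoost, the toy data are simulated, and the different `pmm.type` and `pmm.link` options are not reproduced.

```{r, eval = FALSE}
set.seed(2024)

# Toy data (hypothetical): y depends on x, and 60 of 200 values of y are missing
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y[sample(n, 60)] <- NA
obs <- !is.na(y)

# Stand-in prediction model: a linear fit on the observed cases
fit <- lm(y ~ x, subset = obs)
pred <- predict(fit, newdata = data.frame(x = x))

# For each missing case, find the k donors whose predicted values are
# closest, then draw one donor at random and copy its observed value
k <- 5
donor.pred <- pred[obs]
donor.y <- y[obs]
imputed <- vapply(pred[!obs], function(p) {
  nearest <- order(abs(donor.pred - p))[seq_len(k)]
  donor.y[nearest[sample.int(k, 1)]]
}, numeric(1))

y.complete <- y
y.complete[!obs] <- imputed
```

Because every imputed value is copied from an observed donor, the imputations stay within the range of the observed data, and a larger `k` spreads the draws over more donors.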
### Tune hyperparameters
Imputation performance can be affected by the hyperparameter settings. Although
tuning a large set of hyperparameters may appear intimidating, it is often
possible to narrow down the search space because many hyperparameters are
correlated. In our package, the function `mixgb_cv()` can be used to tune
the number of boosting rounds, `nrounds`. There is no default `nrounds` value
in XGBoost, so users are required to specify this value themselves. The
default `nrounds` in `mixgb()` is 100. However, we recommend using
`mixgb_cv()` to find the optimal `nrounds` first.
```{r}
params <- list(max_depth = 3, subsample = 0.7, nthread = 2)
cv.results <- mixgb_cv(data = nhanes3_newborn, nrounds = 100, xgb.params = params, verbose = FALSE)
cv.results$evaluation.log
cv.results$response
cv.results$best.nrounds
```
By default, `mixgb_cv()` will randomly choose an incomplete variable as
the response and build an XGBoost model with other variables as explanatory
variables using the complete cases of the dataset. Therefore, each run of `mixgb_cv()`
will likely return different results. Users can also specify the response
and covariates with the arguments `response` and `select_features`, respectively.
```{r}
cv.results <- mixgb_cv(
  data = nhanes3_newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1,
  response = "BMPHEAD",
  select_features = c(
    "HSAGEIR", "HSSEX", "DMARETHN", "BMPRECUM", "BMPSB1",
    "BMPSB2", "BMPTR1", "BMPTR2", "BMPWT"
  ),
  xgb.params = params, verbose = FALSE
)
cv.results$best.nrounds
```
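When the response is left unspecified, one way to make runs comparable is to fix the random seed before each call. The sketch below assumes that the random choice of response in `mixgb_cv()` relies on R's random number generator, so that `set.seed()` repeats it; the object names `cv.a` and `cv.b` are illustrative only. Whether `best.nrounds` is also identical may depend on the XGBoost version and threading.

```{r, eval = FALSE}
library(mixgb)

params <- list(max_depth = 3, subsample = 0.7, nthread = 2)

# Same seed before each call, so the randomly chosen response
# (assumed to be drawn with R's random number generator) should repeat
set.seed(2023)
cv.a <- mixgb_cv(data = nhanes3_newborn, nrounds = 100, xgb.params = params, verbose = FALSE)

set.seed(2023)
cv.b <- mixgb_cv(data = nhanes3_newborn, nrounds = 100, xgb.params = params, verbose = FALSE)

# Compare the response variables selected in the two runs
identical(cv.a$response, cv.b$response)
```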
We can now set `nrounds = cv.results$best.nrounds` in `mixgb()` to obtain 5 imputed datasets.
```{r, eval = FALSE}
imputed.data <- mixgb(data = nhanes3_newborn, m = 5, nrounds = cv.results$best.nrounds)
```
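Before moving on, it can be useful to check the returned object. The sketch below assumes that, with `save.models = FALSE` (the setting used throughout this vignette), `mixgb()` returns a list of `m` imputed datasets; it uses the default settings so that it runs on its own.

```{r, eval = FALSE}
library(mixgb)

imputed.data <- mixgb(data = nhanes3_newborn, m = 5)

# Number of imputed datasets (expected to equal m)
length(imputed.data)

# Count the remaining missing values per variable in the first imputed dataset
colSums(is.na(imputed.data[[1]]))
```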
## Inspect multiply imputed values
The `mixgb` package provides the following visual diagnostics functions:

- Single variable: `plot_hist()`, `plot_box()`, `plot_bar()`
- Two variables: `plot_2num()`, `plot_2fac()`, `plot_1num1fac()`
- Three variables: `plot_2num1fac()`, `plot_1num2fac()`

Each function will return `m+1` panels to compare the observed data with
`m` sets of actual imputed values.
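
As an illustration, a single numeric variable could be inspected as sketched below. The argument names `imputation.list`, `var.name` and `original.data` are assumed from the package documentation and may differ between versions, so check `?plot_hist` for the installed release.

```{r, eval = FALSE}
library(mixgb)

imputed.data <- mixgb(data = nhanes3_newborn, m = 5)

# Histograms of observed values and of the imputed values in each of the
# m datasets for the head circumference variable BMPHEAD
plot_hist(
  imputation.list = imputed.data,
  var.name = "BMPHEAD",
  original.data = nhanes3_newborn
)
```
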
For more details, please check the vignette on GitHub [Visual diagnostics
for multiply imputed
values](https://agnesdeng.github.io/mixgb/articles/web/Visual-diagnostics.html).