-
Notifications
You must be signed in to change notification settings - Fork 0
/
Linear-mixed.Rmd
345 lines (260 loc) · 18 KB
/
Linear-mixed.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
---
title: "Bayesian subset selection for linear mixed models"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Bayesian subset selection for linear mixed models}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = FALSE,
comment = "#>",
fig.width = 8, fig.height = 6
)
```
## Linear Mixed Models
Linear mixed models (LMMs) are instrumental for regression analysis with structured dependence, such as grouped, clustered, or multilevel data. However, selection among the covariates---while accounting for this structured dependence---remains a challenge.
Here, we explore the Bayesian subset selection strategies for LMMs from [Kowal (2022)](https://doi.org/10.1111/biom.13707). A central priority is to account for the *structured dependence* modeled by the LMM, both in the selection criteria and the performance metrics. Thus, our decision analysis uses **Mahalanobis loss** to measure accuracy. The weighting matrix in the Mahalanobis loss is the inverse (marginal) covariance matrix in the LMM. This quantity depends on the variances of the random effects and the errors, and thus requires careful attention for extracting optimal coefficients, providing proper uncertainty quantification, and designing scalable computations.
### Random intercept model
Suppose we observe repeated measurements $y_{ij}$, $j=1,\ldots,m_i$ along with $p$ covariates $x_i$ for each individual $i=1,\ldots,n$. The *random intercept model* is a LMM that seeks to capture the within-subject dependence:
$$
y_{ij} = x_i'\beta + u_i + \epsilon_{ij}
$$
where $u_i \sim N(0,\sigma_u^2)$ and $\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$ are mutually independent. By design, the random intercept $u_i$ is shared among all replicates for subject $i$ and induces the within-subject correlation
$$
corr(y_{ij}, y_{ij'}) = \sigma_u^2/(\sigma_u^2 + \sigma_\epsilon^2)
$$
conditional on $x_i$ and $\beta$.
The **goal** is to perform subset selection for $x$ while accounting for these intra-subject dependencies. We do so using Mahalanobis loss, which modifies squared error loss to include a weight matrix $\Omega$:
$$
\Vert v \Vert_\Omega^2 = v'\Omega v
$$
For the random intercept model, the (scaled) Mahalanobis weight matrix is
$$
\sigma_\epsilon^2 \Omega = \mbox{bdiag}\{I_{m_i} -
(\sigma_\epsilon^2/\sigma_u^2 + m_i)^{-1} 1_{m_i}1_{m_i}'
\}
$$
where bdiag constructs a block-diagonal matrix across $i=1,\ldots,n$, $I_m$ is the $m\times m$ identity, and $1_m$ is the $m\times 1$ vector of ones. Clearly, this term depends on model parameters $(\sigma_\epsilon, \sigma_u)$ and requires careful consideration for efficient computing.
### Computing for the random intercept model
`BayesSubsets` includes a highly efficient MCMC sampler for the random intercept model via the function `bayeslmm()`. In particular, our algorithm *jointly* samples the fixed and random effects $(\beta, \{u_i\})$ using fast and scalable steps. This joint sampler avoids the inefficiencies that arise from iteratively sampling full conditionals of $\beta$ given $u$ and vice versa---which can be so debilitating that it often requires alternative parametrizations (e.g., centered vs. noncentered) or sampling strategies (e.g., interweaving or Stan). Our joint sampler eliminates these issues entirely, while crucially maintaining computational scalability.
### Subset selection
Like the other methods implemented in `BayesSubsets`, we provide
- Optimal **linear** coefficients for any subset of variables;
- Regularization (i.e., shrinkage) for these coefficients, which is inherited from the LMM and helps guard against variance-inflation;
- Uncertainty quantification for these coefficients, again leveraging the LMM;
- The **acceptable family** of "near-optimal" subsets of linear predictors that match or nearly match the "best" (by minimum cross-validated error) subset; and
- Summaries of the acceptable family, including the 1) **smallest acceptable subset** and 2) a **variable importance** metrics that computes, for each variable $j$, the proportion of acceptable subsets in which $j$ appears.
The **key difference** from the alternative subset selection strategies (e.g., Kowal [2021](https://doi.org/10.1080/01621459.2021.1891926), [2022a](https://jmlr.org/papers/v23/21-0403.html)) is that each of the above terms uses Mahalanobis loss to account for the structured dependence modeled by the LMM.
## Getting started
We begin by installing and loading the package:
```{r setup}
# devtools::install_github("drkowal/BayesSubsets")
library(BayesSubsets)
```
For this example, we will consider simulated data with $n \times p$ correlated covariates $X$ and $m \times n$ response matrix $Y$, where $m$ is the number of replicates per subject.
```{r sim}
# To reproduce:
set.seed(123)
# Simulate some data:
dat = simulate_lm_randint(
n = 200, # number of subjects
p = 10, # number of predictors
m = 5, # number of replicates per subject
p_sig = 5, # number of true signals
SNR = 1 # signal-to-noise ratio
)
# Store the data:
Y = dat$Y; X = dat$X
```
Next, we fit a Bayesian LMM with random intercepts. Anticipating sparsity in the regression coefficients, we specify a horseshoe prior for $\beta$. The following function fits this model using a highly efficient MCMC sampling algorithm:
```{r lm}
# Fit the Bayesian linear mixed model:
fit = bayeslmm(Y = Y,
X = X,
nsave = 1000, # MCMC samples to save
nburn = 1000 # initial samples to discard
)
```
For any subset of covariates $S$, we compute the **optimal linear coefficients** according to Bayesian decision analysis [(Kowal, 2022)](https://doi.org/10.1111/biom.13707). For an example subset $S = \{1,3,10\}$ and Mahalanobis loss, the following code computes our optimal linear summary under the random intercept model:
```{r getcoefs}
# Example subset:
S_ex = c(1, 3, 10)
# Optimal coefficients:
get_coefs_randint(
post_y_pred = fit$post_y_pred,
XX = X[,S_ex],
post_sigma_e = fit$post_sigma_e,
post_sigma_u = fit$post_sigma_u,
post_y_pred_sum = fit$post_y_pred_sum)
# Compare to the posterior mean for these coefficients:
coef(fit)[S_ex]
```
## Uncertainty quantification for the linear coefficients
We may also obtain posterior uncertainty quantification for the linear coefficients that are active (nonzero) in $S$. To do so, we project the posterior predictive distribution onto $X_S$ draw-by-draw, which induces a posterior predictive distribution for the linear coefficients under the LMM. Again, we use the Mahalanobis loss for this projection, and summarize the posterior using 95\% credible intervals.
```{r uq}
# Posterior predictive draws of *all* coefficients:
post_beta_s = proj_posterior_randint(post_y_pred = fit$post_y_pred,
XX = X,
sub_x = S_ex,
post_sigma_e = fit$post_sigma_e,
post_sigma_u = fit$post_sigma_u,
post_y_pred_sum = fit$post_y_pred_sum)
dim(post_beta_s) # the coefficients outside S_ex are fixed at zero
# Compute 95% credible intervals for the nonzero entries:
t(apply(post_beta_s[,S_ex], 2,
quantile, c(0.05/2, 1 - 0.05/2)))
```
## Bayesian subset search
To this point, we have focused on point and interval (linear) summaries for an arbitrary yet fixed subset $S$. However, we are often interested in *searching* across subsets and measuring the predictive performances. Here, we use the LMM output to generate a collection of "candidate subsets" using decision analysis [(Kowal, 2022a)](https://jmlr.org/papers/v23/21-0403.html).
Since we use Mahalanobis loss, we first construct a (vectorized) $mn$-dimensional response vector $y^*$ and $mn \times p$ covariate matrix $X^*$ such that *squared error loss* with these quantities is equivalent to Mahalanobis loss.
```{r pseudo}
# Access the "X" and "Y" matrices needed for the search:
objXY = getXY_randint(XX = X,
post_y_pred = fit$post_y_pred,
post_sigma_e = fit$post_sigma_e,
post_sigma_u = fit$post_sigma_u,
post_y_pred_sum = fit$post_y_pred_sum)
X_star = objXY$X_star; y_star = objXY$y_star; rm(objXY)
```
For small $p$ it may be possible to enumerate all possible subsets. Here, we screen to the "best" `n_best = 50` models of each size according to squared error loss. We store these in a Boolean matrix `indicators`: each row is an individual subset, while the columns indicate which variables are included (`TRUE`) or excluded (`FALSE`).
```{r exhaust}
indicators = branch_and_bound(yy = y_star, # response is the fitted values
XX = X_star, # covariates
n_best = 50 # restrict to the "best" 50 subsets of each size
)
# Inspect:
indicators[1:5, 1:10]
# Dimensions:
dim(indicators)
# Summarize the model sizes:
table(rowSums(indicators)) # note: intercept always included
```
When $p \gg 30$, it is recommended to use the `prescreen` function, which restricts the search to include only `num_to_keep` possible active variables. This makes the branch-and-bound algorithm feasible for very large $p$.
## The acceptable family of "near-optimal" subsets
From this large collection of `r nrow(indicators)` candidate subsets, we seek to filter to the **acceptable family** of subsets, i.e., those "near-optimal" subsets that predict about as well as the "best" subset. These are computed based on 10-fold cross-validation, and use the out-of-sample predictive distribution from the LMM to provide uncertainty quantification for predictive accuracy.
```{r accept}
# Compute the acceptable family:
accept_info = accept_family_randint(post_y_pred = fit$post_y_pred,
post_lpd = fit$post_lpd,
post_sigma_e = fit$post_sigma_e,
post_sigma_u = fit$post_sigma_u,
XX = X, YY = Y,
indicators = indicators,
post_y_pred_sum = fit$post_y_pred_sum)
# How many subsets are in the acceptable family?
length(accept_info$all_accept)
# These are the rows of `indicators` that belong to the acceptable family:
head(accept_info$all_accept)
# An example acceptable subset:
ex_accept = accept_info$all_accept[1]
which(indicators[ex_accept,])
```
The plot shows how the out-of-sample predictive performance varies across subsets of different sizes, specifically relative (\% change) to the "best" subset (by minimum cross-validated error; dashed gray vertical line). The x-marks are the (usual) empirical cross-validated error, while the intervals leverage the predictive distribution from the LMM to quantify uncertainty in the out-of-sample predictive performance. While performance improves as variables are added, it is clear that several smaller subsets are highly competitive---especially when accounting for the predictive uncertainty.
## Subset selection: the smallest acceptable subset
If we wish to **select** a single subset, a compelling representative of the acceptable family is the **smallest** acceptable subset. This choice favors parsimony, while its membership in the acceptable family implies that it meets a high standard for predictive accuracy. From the previous plot, we select the smallest subset for which the intervals include zero (solid gray vertical line).
```{r simp}
# Simplest acceptable subset:
beta_hat_small = accept_info$beta_hat_small
# Which coefficients are nonzero:
S_small = which(beta_hat_small != 0)
# How many coefficients are nonzero:
length(S_small)
```
The "best" subset by minimum cross-validation often includes many extraneous variables, which is a well-known (and undesirable) byproduct of cross-validation.
```{r min}
# Acceptable subset that minimizes CV error:
beta_hat_min = accept_info$beta_hat_min
# Typically much larger (and often too large...)
sum(beta_hat_min != 0)
```
For reference, the true model size is `r sum(dat$beta_true != 0)`.
Returning to the *smallest* acceptable subset, we can obtain posterior samples and credible intervals for the coefficients as before:
```{r ci-small}
# Draws from the posterior predictive distribution
# Posterior predictive draws of *all* coefficients:
post_beta_small = proj_posterior_randint(post_y_pred = fit$post_y_pred,
XX = X,
sub_x = S_small,
post_sigma_e = fit$post_sigma_e,
post_sigma_u = fit$post_sigma_u,
post_y_pred_sum = fit$post_y_pred_sum)
# Compute 95% credible intervals for the nonzero entries:
t(apply(post_beta_small[,S_small], 2,
quantile, c(0.05/2, 1 - 0.05/2)))
```
## Variable importance from acceptable subsets
Another useful summary of the acceptable family is the **variable importance**, which reports, for each variable $j$, the proportion of acceptable subsets in which $j$ appears. We are particularly interested in distinguishing among those variables that occur in *all*, *some*, or *no* acceptable subsets, which provides insight about which variables are indispensable ("keystone covariates") and which variables are part of a "predictively plausible" explanation.
```{r vi}
# Variable importance: proportion of *acceptable subsets* in which each variable appears
vi_e = var_imp(indicators = indicators,
all_accept = accept_info$all_accept)$vi_inc
# "Keystone covariates" that appear in *all* acceptable families:
which(vi_e == 1)
# Irrelevant covariates that appear in *no* acceptable families:
which(vi_e == 0)
# Visualize:
barplot(vi_e[order(vi_e, (ncol(X):1))], # order...
horiz = TRUE,
main = paste('Variable importance for the acceptable family'))
abline(v = 1)
```
Note that the covariates are highly correlated in this simulated example (and $p$ is moderate), so it is reasonable to expect that many covariates are roughly interchangeable in terms of predictive accuracy.
## Comparing with traditional posterior summaries
Typically, Bayesian linear regression would report the posterior expectations and 95\% posterior credible intervals of the regression coefficients $\beta$. We plot these together with the point and interval estimates for the *smallest* acceptable model:
```{r plot, echo = FALSE}
# Which variables to include in the plot?
use_vars = 2:ncol(X) # Exclude intercept
nv = length(use_vars)
# Posterior mean of coefficients:
beta_hat = coef(fit)
# Compute 95% credible intervals for posterior and smallest acceptable subset:
ci_beta = t(apply(fit$post_beta, 2,
quantile, c(0.05/2, 1 - 0.05/2)))
ci_beta_small = t(apply(post_beta_small, 2,
quantile, c(0.05/2, 1 - 0.05/2)))
# Order by decreasing coefficients;
ord = order(beta_hat[use_vars], decreasing = TRUE)
# Initialize the plot:
par(mfrow=c(1,1), mai = c(1.15,.75,.75,.5));
jit = 0.25 # jitter
plot(seq(1 - jit, nv + jit, length = nv),
seq(1 - jit, nv + jit, length = nv),
ylim = range(ci_beta[use_vars,], ci_beta_small[use_vars,]),
type='n', ylab='', xaxt='n', xlab = '',
main = 'Estimates and 95% intervals for the linear coefficients')
abline(v = 1:nv, lwd=2, col = 'black')
legend('topright', c('truth', 'post', 'S_small'),
bg = 'white',
pch = c(4,0,1), col = c('black', 'red', 'blue'))
# Full model:
arrows(1:nv + jit, ci_beta[use_vars[ord],1],
1:nv + jit, ci_beta[use_vars[ord],2],
length=0.05, angle=90, code=3, lwd=4, col='red')
lines(1:nv + jit, beta_hat[use_vars[ord]], type='p', pch=0, lwd = 3, cex = 1.5, col='red')
# S_small
arrows(1:nv - jit, ci_beta_small[use_vars[ord],1],
1:nv - jit, ci_beta_small[use_vars[ord],2],
length=0.05, angle=90, code=3, lwd=4, col='blue')
lines(1:nv - jit, beta_hat_small[use_vars[ord]], type='p', pch=1, lwd = 3, cex = 1.5, col='blue')
# Add the truth:
lines(1:nv, dat$beta_true[use_vars[ord]], type='p', lwd=3, cex=1.5, pch = 4)
# Reference line:
abline(h = 0, lwd=1, col='black', lty=2)
axis(1, at = 1:nv, labels=FALSE)
text(1:nv, par("usr")[3], pos = 1, # adj = 1?
labels = colnames(X)[use_vars[ord]],
srt = 45, xpd = TRUE)
```
The traditional model summaries are completely dense: the point estimates $\hat \beta$ are nonzero for *all* covariates. By comparison, the point estimates from the smallest acceptable subset are sparse, with only `r sum(beta_hat_small != 0)` active coefficients. By design, the smallest acceptable subset only reports interval estimates for these active coefficients. In this example, the intervals are narrower for the smallest acceptable subset, while the traditional intervals produce large intervals even for the truly zero coefficients.
## Conclusion
We have sought to demonstrate that `BayesSubsets` is a useful accompaniment to Bayesian LMM regression workflow. Given a Bayesian LMM---and specifically, a random intercept model for longitudinal data regression---we have shown how to compute
1. Optimal linear summaries for any subset of covariates;
2. Accompanying uncertainty quantification via posterior (predictive) distributions and intervals;
3. Bayesian subset search using decision analysis;
4. The acceptable family of near-optimal subsets; and
5. Key summaries of the acceptable family, including the smallest acceptable subset and a variable importance metric.
Each of these quantities incorporates the structured dependence (here, due to repeated measurements) via Mahalanobis loss.
When a single subset is required and parsimony is valued, then we recommend the **smallest acceptable subset**. However, we caution against the overreliance on any single subset without compelling motivation. A key contribution of the acceptable family is that it identifies *many* competing explanations (subsets) that are nearly indistinguishable in predictive accuracy. From a purely predictive perspective, we cannot completely rule out any member of the acceptable family. Thus, we further recommend reporting the **variable importance** as a default, variable-specific summary of the acceptable family.