-
Notifications
You must be signed in to change notification settings - Fork 4
/
2018-04-23-simulate-simulate-part-2-a-linear-mixed-model.Rmd
456 lines (302 loc) · 21.5 KB
/
2018-04-23-simulate-simulate-part-2-a-linear-mixed-model.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
---
title: 'Simulate! Simulate! - Part 2: A linear mixed model'
author: Ariel Muldoon
date: '2018-04-23'
slug: simulate-simulate-part-2
categories:
- r
- statistics
tags:
- simulation
- lmm
- lme4
draft: FALSE
description: "In my second simulation example I show how to simulate data from a basic two-level hierarchical design. I go on to explore how well the random effects variance component is estimated for different sample sizes."
---
```{r setup, echo = FALSE, message = FALSE, purl = FALSE}
knitr::opts_chunk$set(comment = "#")
devtools::source_gist("2500a85297b742c6f2fb3a14549f5851",
filename = 'render_toc.R')
# Make data
set.seed(16)
nstand = 5
nplot = 4
mu = 10
sds = 2
sd = 1
stand = rep(LETTERS[1:nstand], each = nplot)
plot = letters[1:(nstand*nplot)]
standeff = rnorm(nstand, 0, sds)
standeff = rep(standeff, each = nplot)
ploteff = rnorm(nstand*nplot, 0, sd)
dat = data.frame(stand, standeff, plot, ploteff)
dat$resp = with(dat, mu + standeff + ploteff )
```
*This post was last updated on `r format(Sys.Date(), "%Y-%m-%d")`.*
I feel like I learn something every time start simulating new data to update an assignment or exploring a question from a client via simulation. I've seen instances where residual autocorrelation isn't detectable when I *know* it exists (because I simulated it) or I have skewed residuals and/or unequal variances when I simulated residuals from a normal distribution with a single variance. Such results are often due to small sample sizes, which even in this era of big data still isn't so unusual in ecology. I've found exploring the effect of sample size on statistical results to be quite eye-opening `r emo::ji("eye")`.
In [my first simulation post](https://aosmith.rbind.io/2018/01/09/simulate-simulate-part1/) I showed how to simulate data for a basic linear model. Today I'll be talking about how to simulate data for a linear mixed model. So I'm still working with normally distributed residuals but will add an additional level of variation.
## Table of Contents
```{r toc, echo = FALSE, purl = FALSE}
render_toc("2018-04-23-simulate-simulate-part-2-a-linear-mixed-model.Rmd")
```
# Simulate, simulate, dance to the music
I learned the basics of linear mixed models in a class where we learned how to analyze data from "classically designed" experiments. We spent a lot of time writing down the various statistical models in mathematical notation and then fitting said models in SAS. I felt like I understood the basics of mixed models when the class was over (and swore I was done with all the $i$ and $j$ mathematical notation once and for all `r emo::ji("laughing")`).
It wasn't until I started working with clients and teaching labs on mixed models in R that I learned how to do simulations to understand how well such models worked under various scenarios. These simulations took me to a whole new level of understanding of these models and the meaning of all that pesky mathematical notation.
# The statistical model
**Danger, equations below!**
You can opt to [jump to the next section](#a-single-simulation-for-the-two-level-model) if you don't find equations useful.
If you don't have a good understanding of the statistical model (or even if you do), writing it out in mathematical notation can actually be pretty useful. I'll write a relatively simple model with random effects below. (Note that I only have an overall mean in this model; I'll do a short example at the end of the post to show how one can simulate additional fixed effects.)
The study design that is the basis of my model has two different sizes of study units. I'm using a classic forestry example, where stands of trees are somehow chosen for sampling and then multiple plots within each stand are measured. This is a design with two levels, stands and plots; we could add a third level if individual trees were measured in each plot.
Everything today will be perfectly balanced, so the same number of plots will be sampled in each stand.
I'm using (somewhat sloppy) "regression" style notation instead of experimental design notation, where the $t$ indexes the observations.
$$y_t = \mu + (b_s)_t + \epsilon_t$$
+ $y_t$ is the recorded value for the $t$th observation of the quantitative response variable; $t$ goes from 1 to the number of observations in the dataset. Since plot is the level of observation in this example (i.e., we have a single observation for each plot), $t$ indexes both the number of plots and the number of rows in the dataset.
+ $\mu$ is the overall mean response
+ $b_s$ is the (random) effect of the $s$th stand on the response. $s$ goes from 1 to the total number of stands sampled. The stand-level random effects are assumed to come from an iid normal distribution with a mean of 0 and some shared, stand-level variance, $\sigma^2_s$: $b_s \thicksim N(0, \sigma^2_s)$
+ $\epsilon_t$ is the observation-level random effect (the residual error term). Since plots are the level of observation in my scenario, this is essentially the effect of each plot measurement on the response. These are assumed to come from an iid normal distribution with a mean of 0 and some shared variance, $\sigma^2$: $\epsilon_t \thicksim N(0, \sigma^2)$
# A single simulation for the two-level model
Let's jump in and start simulating, as I find the statistical model becomes clearer once we have a simulated dataset to look at.
Here is what the dataset I will create will look like. I have a variable for stands (`stand`), plots (`plot`), and the response variable (`resp`).
```{r datex, echo = FALSE, purl = FALSE}
dat[,c("stand", "plot", "resp")]
```
I'll start by setting the seed so these results can be exactly reproduced.
```{r}
set.seed(16)
```
I need to define the "truth" in the simulation by setting all the parameters in the statistical model to a value of my choosing. Here's what I'll do today.
+ The true mean ($\mu$) will be 10
+ The stand-level variance ($\sigma^2_s$) will be set at 4, so the standard deviation ($\sigma_s$) is 2.
+ The observation-level random effect variance ($\sigma^2$) will be set at 1, so the standard deviation ($\sigma$) is 1.
I'll define the number of groups and number of replicates per group while I'm at it. I'll use 5 stands and 4 plots per stand. The total number of plots (and so observations) is the number of stands times the number of plots per stand: `5*4 = 20`.
```{r}
nstand = 5
nplot = 4
mu = 10
sds = 2
sd = 1
```
I need to create a `stand` variable, containing unique names for the five sampled stands. I use capital letters for this. Each stand name will be repeated four times, because each one was measured four times (i.e., there are four plots in each stand).
```{r}
( stand = rep(LETTERS[1:nstand], each = nplot) )
```
I can make a `plot` variable, as well, although it's not needed for modeling since we have a single value per plot. It is fairly common to give plots the same name in each stand (i.e., plots are named 1-4 in each stand), but I'm a big believer in giving plots unique names. I'll name plots uniquely using lowercase letters. There are a total of 20 plots.
```{r}
( plot = letters[1:(nstand*nplot)] )
```
Now I will simulate the stand-level random effects. I defined these as $b_s \thicksim N(0, \sigma^2_s)$, so will randomly draw from a normal distribution with a mean of 0 and standard deviation of 2 (remember that `rnorm()` in R uses standard deviation, not variance). I have five stands, so I draw five values.
```{r}
( standeff = rnorm(nstand, 0, sds) )
```
Every plot in a stand has the same "stand effect", which I simulated with the five values above. This means that the stand itself is causing the measured response variable to be higher or lower than other stands across all plots in the stand. So the "stand effect" must be repeated for every plot in a stand.
The `stand` variable I made helps me know how to repeat the stand effect values. Based on that variable, every stand effect needs to be repeated four times in a row (once for each plot).
```{r}
( standeff = rep(standeff, each = nplot) )
```
The observation-level random effect is simulated the same way as for a linear model. Every unique plot measurement has some effect on the response, and that effect is drawn from a normal distribution with a mean of 0 and a standard deviation of 1 ($\epsilon_t \thicksim N(0, \sigma^2)$).
I make 20 draws from this distribution, one for every plot/observation.
```{r}
( ploteff = rnorm(nstand*nplot, 0, sd) )
```
I'm going to put all of these variables in a dataset together. This helps me keep things organized for modeling but, more importantly for learning how to do simulations, I think this helps demonstrate how every stand has an overall effect (repeated for every observation in that stand) and every plot has a unique effect. This becomes clear when you peruse the 20-row dataset shown below.
```{r}
( dat = data.frame(stand, standeff, plot, ploteff) )
```
I now have the fixed values of the parameters, the variable `stand` to represent the random effect in a model, and the simulated effects of stands and plots drawn from their defined distributions. That's all the pieces I need to calculate my response variable.
The statistical model
$$y_t = \mu + (b_s)_t + \epsilon_t$$
is my guide for how to combine these pieces to create the simulated response variable, $y_t$. Notice I call the simulated response variable `resp`.
```{r}
( dat$resp = with(dat, mu + standeff + ploteff ) )
```
Now that I have successfully created the dataset I showed you at the start of this section, it's time for model fitting! I can fit a model with two sources of variation (stand and plot) with, e.g., the `lmer()` function from package **lme4**.
```{r, warning = FALSE, message = FALSE}
library(lme4) # v. 1.1-27.1
```
The results for the estimated overall mean and standard deviations of random effects in this model look pretty similar to my defined parameter values.
```{r}
fit1 = lmer(resp ~ 1 + (1|stand), data = dat)
fit1
```
# Make a function for the simulation
A single simulation can help us understand the statistical model, but usually the goal of a simulation is to see how the model behaves over the long run. To repeat this simulation many times in R we'll want to "functionize" the data simulating and model fitting process.
In my function I'm going to set all the arguments to the parameter values as I defined them above. I allow some flexibility, though, so the argument values can be changed if I want to explore the simulation with, say, a different number of replications or different standard deviations at either level.
This function returns a linear model fit with `lmer()`.
```{r}
twolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 2, sigma = 1) {
standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
stand = rep(LETTERS[1:nstand], each = nplot)
ploteff = rnorm(nstand*nplot, 0, sigma)
resp = mu + standeff + ploteff
dat = data.frame(stand, resp)
lmer(resp ~ 1 + (1|stand), data = dat)
}
```
I test the function, using the same `seed`, to make sure things are working as expected and that I get the same results as above.
```{r}
set.seed(16)
twolevel_fun()
```
# Repeat the simulation many times
Now that I have a working function to simulate data and fit the model it's time to do the simulation many times. The model from each individual simulation is saved to allow exploration of long run model performance.
This is a task for `replicate()`, which repeatedly calls a function and saves the output. When using `simplify = FALSE` the output is a list, which is convenient for going through to extract elements from the models later. I'll re-run the simulation 100 times as an example, although I will do 1000 runs later when I explore the long-run performance of variance estimates.
```{r, warning = FALSE, message = FALSE}
sims = replicate(100, twolevel_fun(), simplify = FALSE )
sims[[100]]
```
# Extract results from the linear mixed model
After running all the models we will want to extract whatever we are interested in. The `tidy()` function from package **broom.mixed** can be used to conveniently extract both fixed and random effects.
Below is an example on the practice model. You'll notice there are no p-values for fixed effects. If those are desired and the degrees of freedom can be calculated, see packages **lmerTest** and the `ddf.method` argument in `tidy()`.
```{r, message = FALSE, warning = FALSE}
library(broom.mixed) # v. 0.2.7
tidy(fit1)
```
If we want to extract only the fixed effects:
```{r}
tidy(fit1, effects = "fixed")
```
And for the random effects, which can be pulled out as variances via `scales` instead of the default standard deviations:
```{r}
tidy(fit1, effects = "ran_pars", scales = "vcov")
```
# Explore the effect of sample size on variance estimation
Today I'll look at how well we estimate variances of random effects for different samples sizes. I'll simulate data for sampling 5 stands, 20 stands, and 100 stands.
I'm going to load some helper packages for this, including **purrr** for looping, **dplyr** for data manipulation tasks, and **ggplot2** for plotting.
```{r, message = FALSE, warning = FALSE}
library(purrr) # v. 0.3.4
suppressPackageStartupMessages( library(dplyr) ) # v. 1.0.7
library(ggplot2) # v. 3.3.5
```
I'm going to loop through a vector of the three stand sample sizes and then simulate data and fit a model 1000 times for each one. I'm using **purrr** functions for this, and I end up with a list of lists (1000 models for each sample size). It takes a minute or two to fit the 3000 models. I'm ignoring all warning messages (but this isn't always a good idea).
```{r, message = FALSE, warning = FALSE}
stand_sims = c(5, 20, 100) %>%
set_names() %>%
map(~replicate(1000, twolevel_fun(nstand = .x) ) )
```
Next I'll pull out the stand variance for each model via `tidy()`.
I use `modify_depth()` to work on the nested (innermost) list, and then row bind the nested lists into a data.frame to get things in a convenient format for plotting. I finish by filtering things to keep only the `stand` variance, as I extracted both stand and residual variances from the model.
```{r}
stand_vars = stand_sims %>%
modify_depth(2, ~tidy(.x, effects = "ran_pars", scales = "vcov") ) %>%
map_dfr(bind_rows, .id = "stand_num") %>%
filter(group == "stand")
head(stand_vars)
```
Let's take a look at the distributions of the variances for each sample size via density plots. We know the true variance is 4, so I'll add a vertical line at 4.
```{r, cache = TRUE}
ggplot(stand_vars, aes(x = estimate) ) +
geom_density(fill = "blue", alpha = .25) +
facet_wrap(~stand_num) +
geom_vline(xintercept = 4)
```
Whoops, I need to get my factor levels in order. I'll use package **forcats** for this.
```{r}
stand_vars = mutate(stand_vars, stand_num = forcats::fct_inorder(stand_num) )
```
I'll also add some clearer labels for the facets.
```{r}
add_prefix = function(string) {
paste("Number stands:", string, sep = " ")
}
```
And finally I'll add the median of each distribution as a second vertical line.
```{r, message = FALSE}
groupmed = stand_vars %>%
group_by(stand_num) %>%
summarise(mvar = median(estimate) )
```
Looking at the plots we can really see how poorly we can estimate variances when we have few replications. When only 5 stands are sampled, the variance can be estimated as low as 0 and as high as ~18 (`r emo::ji("open_mouth")`) when it's really 4.
By the time we have 20 stands things look better, and things look quite good with 100 stands (although notice variance still ranges from 1 to ~8).
```{r}
ggplot(stand_vars, aes(x = estimate) ) +
geom_density(fill = "blue", alpha = .25) +
facet_wrap(~stand_num, labeller = as_labeller(add_prefix) ) +
geom_vline(aes(xintercept = 4, linetype = "True variance"), size = .5 ) +
geom_vline(data = groupmed, aes(xintercept = mvar, linetype = "Median variance"),
size = .5) +
theme_bw(base_size = 14) +
scale_linetype_manual(name = "", values = c(2, 1) ) +
theme(legend.position = "bottom",
legend.key.width = unit(.1, "cm") ) +
labs(x = "Estimated Variance", y = NULL)
```
Here are some additional descriptive statistics of the distribution of variances in each group to complement the info in the plot.
```{r, message = FALSE}
stand_vars %>%
group_by(stand_num) %>%
summarise_at("estimate",
list(min = min, mean = mean, med = median, max = max) )
```
So how much of the distribution is below 4 for each estimate?
We can see for 5 samples that the variance is definitely underestimated more often than it is overestimated; almost 60% of the distribution is below the true variance of 4.
Every time I do this sort of simulation I am newly surprised that even large samples tend to underestimate variances slightly more often than overestimate them.
```{r, message = FALSE}
stand_vars %>%
group_by(stand_num) %>%
summarise(mean(estimate < 4) )
```
My take home from all of this? We may need to be cautious with results for studies where the goal is to make inference about the estimates of the variances or for testing variables measured at the level of the largest unit when the number of units is small.
# An actual mixed model (with fixed effects this time)
I simplified things above so much that I didn't have any fixed effects variables. We can certainly include fixed effects in a simulation.
Below is a quick example. I'll create two continuous variables, one measured at the stand level and one measured at the plot level, that have linear relationships with the mean response variable. The study is the same as the one I defined for the previous simulation.
Here's the new statistical model.
$$y_t = \beta_0 + \beta_1*(Elevation_s)_t + \beta_2*Slope_t + (b_s)_t + \epsilon_t$$
Where
+ $\beta_0$ is the mean response when both Elevation and Slope are 0
+ $\beta_1$ is the change in mean response for a 1-unit change in elevation. Elevation is measured at the stand level, so all plots in a stand share a single value of elevation.
+ $\beta_2$ is the change in mean response for a 1-unit change in slope. Slope is measured at the plot level, so every plot potentially has a unique value of slope.
Setting the values for the three new parameters and simulating values for the continuous explanatory variables will be additional steps in the simulation. The random effects are simulated the same way as before.
I define the new parameters below.
+ The intercept ($\beta_0$) will be -1
+ The coefficient for elevation ($\beta_1$) will be set to 0.005
+ The coefficient for slope ($\beta_2$) will be set to 0.1
```{r}
nstand = 5
nplot = 4
b0 = -1
b1 = .005
b2 = .1
sds = 2
sd = 1
```
Here are the variables I simulated previously.
```{r}
set.seed(16)
stand = rep(LETTERS[1:nstand], each = nplot)
standeff = rep( rnorm(nstand, 0, sds), each = nplot)
ploteff = rnorm(nstand*nplot, 0, sd)
```
I will simulate the explanatory variables by randomly drawing from uniform distributions via `runif()`. I change the minimum and maximum values of the uniform distribution as needed to get an appropriate spread for a given variable. If the distribution of your explanatory variables are more skewed you could use a different distribution (like the Gamma distribution).
First I simulate values for elevation. This variable only five values, as it is a stand-level variable. I need to repeat each value for the four plots measured in each stand like I did when making the `stand` variable.
```{r}
( elevation = rep( runif(nstand, 1000, 1500), each = nplot) )
```
I can simulate slope the same way, pulling random values from a uniform distribution with different limits. The slope is measured at the plot level, so I have one value for every plot in the dataset.
```{r}
( slope = runif(nstand*nplot, 2, 75) )
```
We now have everything we need to create the response variable.
Based on our equation $$y_t = \beta_0 + \beta_1*(Elevation_s)_t + \beta_2*Slope_t + (b_s)_t + \epsilon_t$$
the response variable will be calculated via
```{r}
( resp2 = b0 + b1*elevation + b2*slope + standeff + ploteff )
```
Now we can fit a mixed model for `resp2` with `elevation` and `slope` as fixed effects, `stand` as the random effect and the residual error term based on plot-to-plot variation. (*Notice I didn't put these variables in a dataset, which I usually like to do to keep things organized and to avoid problems of vectors in my environment getting overwritten by mistake.*)
We can see some of the estimates in this one model aren't very similar to our set values, and doing a full simulation would allow us to explore the variation in the estimates. For example, I expect the coefficient for elevation, based on only five values, will be extremely unstable.
```{r}
lmer(resp2 ~ elevation + slope + (1|stand) )
```
# Just the code, please
```{r getlabels, echo = FALSE, purl = FALSE}
labs = knitr::all_labels()
labs = labs[!labs %in% c("setup", "toc", "datex", "getlabels", "allcode", "makescript")]
```
```{r makescript, include = FALSE, purl = FALSE}
options(knitr.duplicate.label = "allow") # Needed to purl like this
input = knitr::current_input() # filename of input document
output = here::here("static", "script", paste(tools::file_path_sans_ext(input), "R", sep = ".") )
knitr::purl(input, output, documentation = 0, quiet = TRUE)
```
Here's the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code [from here](/script/2018-04-23-simulate-simulate-part-2-a-linear-mixed-model.R).
```{r allcode, ref.label = labs, eval = FALSE, purl = FALSE}
```