/
appendix3b_extended_binomial.Rmd
354 lines (280 loc) · 13.5 KB
/
appendix3b_extended_binomial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
---
title: 'Appendix 3b: Extended Binomial Example'
subtitle: "Simulating binomial data with crossed random factors"
author: "Lisa M. DeBruine, Dorothy Bishop, Dale J. Barr"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Appendix 3b: Extended Binomial Example}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
fig.width = 8,
fig.height = 5,
out.width = "100%"
)
```
[Download the .Rmd for this example](https://github.com/debruine/lmem_sim/blob/master/vignettes/appendix3b_extended_binomial.Rmd)
To give an overview of the simulation task, we will simulate data from a design with crossed random factors of subjects and stimuli, fit a model to the simulated data, and then try to recover the parameter values we put in from the output. In this hypothetical study, subjects classify the emotional expressions of faces as quickly as possible, and we use accuracy (correct/incorrect) as the primary dependent variable. The faces are of two types: either from the subject's ingroup or from an outgroup. The key question is whether there is any difference in classification accuracy across the type of face.
The important parts of the design are:
* Random factor 1: subjects
* Random factor 2: faces
* Fixed factor 1: expression (level = angry, happy)
* within subject: same subjects see both angry and happy faces
* within face: same faces are both angry and happy
* Fixed factor 2: category (level = ingroup, outgroup)
* within subject: same subjects see both ingroup and outgroup faces
* between face: each face is either ingroup or outgroup
### Required software
```{r, message=FALSE}
# load required packages
library("lme4") # model specification / estimation
library("afex") # deriving p-values from lmer
library("broom.mixed") # extracting data from model fits
library("faux") # data simulation for multivariate normal dist
library("tidyverse") # data wrangling and visualisation
# ensure this script returns the same results on each run
set.seed(8675309)
faux_options(verbose = FALSE)
```
### Probability vs logit
This example presents an extended simulation for a binomial logistic mixed regression. Where response accuracy is measured in terms of probability, the regression needs to work with a link function that uses the logit, which does not have the problems of being bounded by 0 and 1. The functions below are used later to convert between probability and the logit function of probability.
```{r prob-vs-logit, fig.width = 8, fig.height = 3}
logit <- function(x) { log(x / (1 - x)) }
inv_logit <- function(x) { 1 / (1 + exp(-x)) }
data.frame(
prob = seq(0,1,.01)
) %>%
mutate(logit = logit(prob)) %>%
ggplot(aes(prob, logit)) +
geom_point()
```
### 2ww*2wb design
In this example, 30 subjects will respond twice (for happy and angry expressions) to 50 items; 25 items in each of 2 categories. In this example, `expression` is a within-subject and within-item factor and `category` is a within-subject and between-item factor.
### Terms
We use the following prefixes to designate model parameters and sampled values:
* `beta_*`: fixed effect parameters
* `subj_*`: random effect parameters associated with subjects
* `item_*`: random effect parameters associated with items
* `X_*`: effect-coded predictor
* `S_*`: sampled values for subject random effects
* `I_*`: sampled values for item random effects
In previous tutorials, we used numbers to designate fixed effects, but here we will use letter abbreviations to make things clearer:
* `*_0`: intercept
* `*_e`: expression
* `*_c`: category
* `*_ec`: expression * category
Other terms:
* `*_rho`: correlations; a vector of the upper right triangle for the correlation matrix for that group's random effects
* `n_*`: sample size
### Data-generating function
Betas are on logit scale; later we use `inv_logit()` to convert to probabilities.
The defaults for the data-generating function `ext_bin_dgf()` are for null effects. When simulating realistic data, we will call the function using beta values computed using expected probabilities for combinations of independent variables, which will overwrite the defaults specified here.
All SDs, representing variation in item and subject-specific effects, are set to one.
```{r simulatedata}
ext_bin_dgf <- function(
n_subj = 30, # number of subjects
n_ingroup = 25, # number of faces in ingroup
n_outgroup = 25, # number of faces in outgroup
beta_0 = 0, # grand mean - inv_logit(0)=.5
beta_e = 0, # main effect of expression
beta_c = 0, # main effect of category
beta_ec = 0, # interaction between category and expression
item_0 = 1, # by-item random intercept sd
item_e = 1, # by-item random slope for exp
item_rho = 0, # by-item random effect correlation
subj_0 = 1, # by-subject random intercept sd
subj_e = 1, # by-subject random slope sd for exp
subj_c = 1, # by-subject random slope sd for category
subj_ec = 1, # by-subject random slope sd for category*exp
# by-subject random effect correlations
subj_rho = c(0, 0, 0, # subj_0 * subj_e, subj_c, subj_ec
0, 0, # subj_e * subj_c, subj_ec
0) # subj_c * subj_ec
) {
# simulate items; separate item ID for each item;
# simulating each item's individual effect for intercept (I_0)
# and slope (I_e) for expression (the only within-item factor)
items <- faux::rnorm_multi(
n = n_ingroup + n_outgroup,
mu = 0,
sd = c(item_0, item_e),
r = item_rho,
varnames = c("I_0", "I_e")
) %>%
mutate(item_id = faux::make_id(nrow(.), "I"),
category = rep(c("ingroup", "outgroup"),
c(n_ingroup, n_outgroup))) %>%
select(item_id, category, everything())
# simulate subjects: separate subject ID for each subject;
# simulating each subject's individual effect for intercept (I_0)
# and slope for each within-subject factor and their interaction.
subjects <- faux::rnorm_multi(
n = n_subj,
mu = 0,
sd = c(subj_0, subj_e, subj_c, subj_ec),
r = subj_rho,
varnames = c("S_0", "S_e", "S_c", "S_ec")
) %>%
mutate(subj_id = faux::make_id(nrow(.), "S")) %>%
select(subj_id, everything())
# simulate trials
crossing(subjects, items,
expression = factor(c("happy", "angry"), ordered = TRUE)
) %>%
mutate(
# effect code the two fixed factors
X_e = recode(expression, "happy" = -0.5, "angry" = 0.5),
X_c = recode(category, "ingroup" = -0.5, "outgroup" = +0.5),
# add together fixed and random effects for each effect
B_0 = beta_0 + S_0 + I_0,
B_e = beta_e + S_e + I_e,
B_c = beta_c + S_c,
B_ec = beta_ec + S_ec,
# calculate gaussian effect
Y = B_0 + (B_e*X_e) + (B_c*X_c) + (B_ec*X_e*X_c),
pr = inv_logit(Y), # transform to probability of getting 1
Y_bin = rbinom(nrow(.), 1, pr) # sample from bernoulli distribution, ie use pr to set observed binary response to 0 or 1, so if pr = .5, then 0 and 1 equally likely
) %>%
select(subj_id, item_id, expression, category, X_e, X_c, Y, Y_bin)
}
```
You can set the subject and item Ns to very small numbers to check that the design is what you expect. Here, we have two subjects who each see two faces, one ingroup and one outgroup, in each of two expressions, happy and angry, for a total of four trials per subject.
```{r showsimdat}
ext_bin_dgf(n_subj = 2, n_ingroup = 1, n_outgroup = 1)
```
### Single run
The `single_run()` function creates a simulated data set using the `ext_bin_dgf()` function, runs a GLMM analysis, and returns a tidy table of the results. If you set `filename`, it will also record the data to a file, which can be helpful if you need to run a long simulation that might crash or you might need to interrupt.
```{r singlerun_function}
single_run <- function(filename = NULL, ...) {
# ... represents additional arguments to be passed to ext_bin_dgf
dat_sim <- ext_bin_dgf(...)
mod_sim <- glmer(Y_bin ~ 1 + X_e*X_c +
(1 + X_e | item_id) +
(1 + X_e*X_c | subj_id),
data = dat_sim, family = "binomial")
sim_results <- broom.mixed::tidy(mod_sim)
# append the results to a file if filename is set
if (!is.null(filename)) {
append <- file.exists(filename) # append if the file exists
write_csv(sim_results, filename, append = append)
}
# return the tidy table
sim_results
}
```
You can compare a single run output with the parameters that were set. However, unless your Ns are very big, expect a few large deviations by chance. Run this function once with a timing function to see how long it takes. Mixed model simulations can take a long time, so knowing this lets you figure out how long your coffee break can be when you run the simulation below.
```{r}
system.time(
s <- single_run()
)
s
```
### Calculate betas from probabilities
The following function calculates the betas (on a logit scale) from the mean probability correct for each combination of factors. The default is all are 50% correct.
This function only works for a 2x2 design like the one in this example. You will need to figure out a custom function for each design to do this, or estimate fixed effect parameters directly from analysis of pilot data.
```{r prob2param}
# set level 1 for each factor to the +0.5-coded value
prob2param <- function(A1B1 = .5, # angry outgroup
A1B2 = .5, # angry ingroup
A2B1 = .5, # happy outgroup
A2B2 = .5) { # happy ingroup
# convert probabilities to logit
a1b1 <- logit(A1B1)
a1b2 <- logit(A1B2)
a2b1 <- logit(A2B1)
a2b2 <- logit(A2B2)
# calculate and return betas
list(
beta_0 = mean(c(a1b1, a1b2, a2b1, a2b2)), # grand mean
beta_e = (a1b1+a1b2)/2 - (a2b1+a2b2)/2, # angry - happy
beta_c = (a1b1+a2b1)/2 - (a1b2+a2b2)/2, # outgroup - ingroup
beta_ec = (a1b1-a1b2) - (a2b1-a2b2) # angry o-i diff - happy o-i diff
)
}
```
Demonstrate a single run of the function. First we specify proportions correct for the 4 combinations. These are just made-up values to show how it works. With values set to .6, .5, .5, and .4,, we get `beta_0` and `beta_ec` of 0 (no interaction), and `beta_e` and `beta_c` of .405 (same size effect of angry/happy and ingroup/outgroup).
```{r singlerundemo}
# to check false positive rate of analysis, set all these to same value (e.g. .5)
A1B1 <- .6 # angry_outgroup
A1B2 <- .5 # angry_ingroup
A2B1 <- .5 # happy_outgroup
A2B2 <- .4 # happy_ingroup
b <- prob2param(A1B1, A1B2, A2B1, A2B2)
str(b)
```
### Simulations
We can now run many simulations and save to a file on each run. It's a good idea to put the relevant parameters in the filename, or modify the code above to add the parameters as new columns.
```{r filename}
# make a filename indicating this specific set of estimated values
filename <- paste0("sims/binomial_ext_",A1B1,"_",A1B2,"_",A2B1,"_",A2B2,".csv")
```
This can take a very long time, so set `eval = FALSE` to skip this chunk if you've already run it and just load the data from the file. Alternatively, as below, surround the simulation code with a conditional statement to only run if the file isn't already present.
If you want to add additional reps to your file, make sure you change the seed, or each time your run this script from the start, you'll just get identical numbers.
```{r runsims, eval = TRUE}
reps <- 10
# run simulations and save to a file
if (!file.exists(filename)) {
sims <- purrr::map_df(1:reps, ~single_run(
filename = filename,
beta_0 = b$beta_0,
beta_e = b$beta_e,
beta_c = b$beta_c,
beta_ec = b$beta_ec)
)
}
```
```{r read-sims}
# read saved simulation data
sims <- read_csv(filename)
```
### Calculate mean estimates and power
* `mean_estimate` is the mean estimate of that value in the simulations
* `power` is the proportion of alpha values for each fixed effect that are less than your alpha (here set to .05).
* `sim_value` is the true value of beta derived from the probabilities we specified
```{r meanests}
alpha <- 0.05 # justify your alpha
est <- sims %>%
dplyr::filter(effect == "fixed") %>%
dplyr::group_by(term) %>%
dplyr::summarise(
mean_estimate = mean(estimate),
power = mean(p.value < alpha),
.groups = "drop"
)
# Note use of dplyr:: with some tidyverse commands to avoid problems that can arise from conflicts of these names with other packages.
est %>% mutate(
parameter = c("beta_0", "beta_c", "beta_e", "beta_ec"),
sim_value = c(b$beta_0, b$beta_c, b$beta_e, b$beta_ec)
) %>%
mutate_if(is.numeric, round, 4)
```
### Calculate probabilities
Sum estimates for each cell and use inverse logit transform to recover probabilities. Compare with the values of A1B1 ... A2B2 used to calculate the simulation parameters.
```{r}
# get estimates of beta values
int <- est[[1,2]]
cat <- est[[2,2]]
exp <- est[[3,2]]
cat_exp <- est[[4,2]]
# effect-coding of factor levels
happy <- -0.5
angry <- +0.5
ig <- -0.5
og <- +0.5
data.frame(
angry_outgroup = int + og*cat + angry*exp + og*angry*cat_exp,
angry_ingroup = int + ig*cat + angry*exp + ig*angry*cat_exp,
happy_outgroup = int + og*cat + happy*exp + og*happy*cat_exp,
happy_ingroup = int + ig*cat + happy*exp + ig*happy*cat_exp
) %>%
gather(key, estimate, 1:4) %>%
mutate(estimate = inv_logit(estimate)) %>%
mutate(prob = c(A1B1, A1B2, A2B1, A2B2)) %>%
separate(key, c("exp", "cat")) %>%
mutate_if(is.numeric, round, 2)
```