Skip to content
Permalink
master
Switch branches/tags
Go to file
 
 
Cannot retrieve contributors at this time
---
title: 'Simulate! Simulate! - Part 1: A linear model'
author: Ariel Muldoon
date: '2018-01-09'
slug: simulate-simulate-part1
categories: [r, statistics]
tags: [simulation, teaching]
draft: FALSE
description: "Where I discuss simulations, why I love them, and get started on a simulation series with a simple two-group linear model simulation."
---
```{r setup, echo = FALSE, message = FALSE, purl = FALSE}
knitr::opts_chunk$set(comment = "#")
devtools::source_gist("2500a85297b742c6f2fb3a14549f5851",
filename = 'render_toc.R')
# Make data
set.seed(16)
ngroup = 2
nrep = 10
b0 = 5
b1 = -2
sd = 2
group = rep( c("group1", "group2"), each = nrep)
eps = rnorm(ngroup*nrep, 0, sd)
growth = b0 + b1*(group == "group2") + eps
dat = data.frame(group, growth)
```
Confession: I love simulations.
I have found simulations incredibly useful in growing my understanding of statistical theory and assumptions of models. When someone tells me with great certainty "I don't need to meet that assumption because [fill in the blank]" or asks "Does it matter that [something complicated]?", I often turn to simulations instead of textbooks to check.
I like simulations for the same reasons I like building Bayesian models and using resampling methods (i.e., Monte Carlo) for inference. Building the simulation increases my understanding of the problem and makes all the assumptions clearer to me because I must define them explicitly. Plus I think it is fun to put the code together and explore the results. `r emo::ji("slightly_smiling_face")`
## Table of Contents
```{r toc, echo = FALSE, purl = FALSE}
render_toc("2018-01-09-simulate-simulate-part-1-a-linear-model.Rmd")
```
# Simulate, simulate, dance to the music
Simulations have been so helpful in my own understanding of models used for analysis that I want to spread the knowledge. Being able to build a simulation can really help folks understand the strengths and weaknesses of their analysis. I haven't managed to fit teaching simulations in my own work so far, but it's always on my mind. Hence, this post.
Today I'm going to go over an example of simulating data from a two-group linear model. I'll work work up to linear mixed models and generalized linear mixed models (the fun stuff! `r emo::ji("laughing")`) in subsequent posts.
# The statistical model
**Warning: Here there be equations.**
If you're like me and your brain says "I think this section must not pertain to me" when your eyes land on mathematical notation, you can jump right down to the R code in the [next section](#a-single-simulation-from-a-two-group-model). But if you can power through, I've found these equations are actually pretty useful when setting up a simulation (honest).
A simulation for a linear model is based on the statistical model. The statistical model is an equation that describes the processes we believe gave rise to the observed response variable. It includes parameters to describe the assumed effect of explanatory variables on the response variable as well as a description of any distributions associated with processes we assume are random variation. (There is more in-depth coverage of the statistical model in Stroup's 2013 [Generalized Linear Mixed Models](https://books.google.com/books/about/Generalized_Linear_Mixed_Models.html?id=GcGrySpkXRMC) book if you are interested and have access to it.)
The statistical model is where we write down the exact assumptions we are making when we fit a linear model to a set of data.
Here is an example of a linear model for two groups. I wrote the statistical model to match the form of the default summary output from a model fit with `lm()` in R.
$$y_t = \beta_0 + \beta_1*I_{(group_t=\textit{group2})} + \epsilon_t$$
+ $y_t$ is the observed values for the quantitative response variable; $t$ goes from 1 to the number of observations in the dataset
+ $\beta_0$ is the mean response variable when the group is *group1*
+ $\beta_1$ is the difference in mean response between the two groups, *group2* minus *group1*.
+ The indicator variable, $I_{(group_t=\textit{group2})}$, is 1 when the group is *group2* and 0 otherwise, so $\beta_1$ only affects the response variable for observations in *group2*
+ $\epsilon_t$ is the random variation present for each observation that is not explained by the group variable. These are assumed to come from an iid normal distribution with a mean of 0 and some shared variance, $\sigma^2$: $\epsilon_t \thicksim N(0, \sigma^2)$
# R packages
I'll use package **purrr** for looping, **broom** for extracting output of models, **dplyr** for any data manipulation, and **ggplot2** for plotting.
```{r packages, message = FALSE, warning = FALSE}
library(purrr) # v. 0.3.4
library(broom) # v. 0.5.6
library(dplyr) # v. 1.0.0
library(ggplot2) # v. 3.3.1
```
# A single simulation from a two-group model
Let's jump in and start simulating, as I find the statistical model becomes clearer once we have a simulated dataset to look at.
Here is what the dataset I will create will look like. I have a variable for the two groups (`group`) and one for the response variable (`growth`). There are 10 observations in each group.
```{r datex, echo = FALSE, purl = FALSE}
dat
```
I'll set my seed so these particular results can be reproduced. I usually set the seed when I'm first working out my simulation process (but not when actually running the simulation many times).
```{r}
set.seed(16)
```
I start out by defining what the "truth" is in the simulation by setting all the parameters in the statistical model to a value of my choosing. Here's what I'll do today.
+ The true group mean ($\beta_0$) for "group1" will be 5
+ The mean of `group2` will be 2 less than `group1` ($\beta_1$)
+ The shared variance ($\sigma^2$) will be set at 4, so the standard deviation ($\sigma$) is 2.
I'll define the number of groups (2) and number of replicates per group (10) while I'm at it. The total number of observations is the number of groups times the number of replicates per group, `2*10 = 20`.
```{r}
ngroup = 2
nrep = 10
b0 = 5
b1 = -2
sd = 2
```
I need to create the variable I'll call `group` to use as the explanatory variable when fitting a model in R. I use the function `rep()` a lot when doing simulations in order to repeat values of variables to appropriately match the scenario I'm building data for.
Here I'll repeat each level of `group` 10 times (`nrep` as I defined above).
```{r}
( group = rep( c("group1", "group2"), each = nrep) )
```
Next I'll simulate the random errors. Remember I defined these in the statistical model as $\epsilon_t \thicksim N(0, \sigma^2)$. To simulate these I'll take random draws from a normal distribution with a mean of 0 and standard deviation of 2. Note that `rnorm()` takes the standard deviation as input, not the variance.
Every observation ($t$) has a unique value for the random error, so I need to make 20 draws total (`ngroup*nrep`).
```{r}
( eps = rnorm(n = ngroup*nrep, mean = 0, sd = sd) )
```
Now I have the fixed estimates of parameters, the variable `group` on which to base the indicator variable, and the random errors drawn from the defined distribution. That's all the pieces I need to calculate my response variable.
The statistical model
$$y_t = \beta_0 + \beta_1*I_{(group_t=\textit{group2})} + \epsilon_t$$
is my guide for how to combine these pieces to create the simulated response variable, $y_t$. Notice I create the indicator variable in R with `group == "group2"` and call the simulated response variable `growth`.
```{r}
( growth = b0 + b1*(group == "group2") + eps )
```
I often store the variables I will use in fitting the model in a dataset to help keep things organized. This becomes more important when working with more variables.
```{r}
dat = data.frame(group, growth)
```
Once the response and explanatory variables have been created, it's time to fit a model. I can fit the two group linear model with `lm()`. I call this test model `growthfit`.
```{r}
growthfit = lm(growth ~ group, data = dat)
summary(growthfit)
```
# Make a function for the simulation
A single simulation can help us understand the statistical model, but it doesn't help us see how the model behaves over the long run. To repeat this simulation many times in R we'll want to "functionize" the data simulating and model fitting process.
In my function I'm going to set all the arguments to the parameter values I've defined above. I allow some flexibility, though, so the argument values can be changed if I want to explore the simulation with different coefficient values, a different number of replications, or a different amount of random variation.
This function returns a linear model fit with `lm()`.
```{r}
twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
ngroup = 2
group = rep( c("group1", "group2"), each = nrep)
eps = rnorm(n = ngroup*nrep, mean = 0, sd = sigma)
growth = b0 + b1*(group == "group2") + eps
simdat = data.frame(group, growth)
growthfit = lm(growth ~ group, data = simdat)
growthfit
}
```
I test the function, using the same seed as I used figuring out the process to make sure things are working as expected. If things are working, I should get the same results as I did above. Looks good.
```{r}
set.seed(16)
twogroup_fun()
```
If I want to change some element of the simulation, I can do so by changing the default values to the function arguments. Here's a simulation from the same model but with a smaller standard deviation. All other arguments use the defaults I set.
```{r}
twogroup_fun(sigma = 1)
```
# Repeat the simulation many times
Now that I have a working function to simulate data and fit the model, it's time to do the simulation many times. I want to save the model from each individual simulation to allow exploration of long run model performance.
This is a task for the `replicate()` function, which repeatedly calls a function and saves the output. I use `simplify = FALSE` so the output is a list. This is convenient for going through to extract elements from the models later. Also see `purrr::rerun()` as a convenient version of `replicate()`.
I'll run this simulation 1000 times, resulting in a list of fitted two-group linear models based on data simulated from the parameters I've set set. I print the first model in the list below.
```{r}
sims = replicate(n = 1000, twogroup_fun(), simplify = FALSE )
sims[[1]]
```
# Extracting results from the linear model
There are many elements of our model that we might be interested in exploring, including estimated coefficients, estimated standard deviations/variances, and the statistical results (test statistics/p-values).
To get the coefficients and statistical tests of coefficients we can use `tidy()` from package **broom**.
This returns the information on coefficients and tests of those coefficients in a tidy format that is easy to work with. Here's what the results look like for the test model.
```{r}
tidy(growthfit)
```
I have often been interested in understanding how the variances/standard deviations behave over the long run, in particular in mixed models. For a linear model we can extract an estimate of the residual standard deviation from the `summary()` output. This can be squared to get the variance as needed.
```{r}
summary(growthfit)$sigma
```
# Extract simulation results
Now for the fun part! Given we know the true model, how do the parameters behave over many samples?
To extract any results I'm interested in I will loop through the list of models, which I've stored in `sims`, and pull out the element of interest. I will use functions from the *map* family from package **purrr** for looping through the list of models. I'll use functions from **dplyr** for any data manipulation and plot distributions via **ggplot2**.
## Estimated differences in mean response
As this is a linear model about differences among the two groups, the estimate of $\beta_1$ is one of the statistics of primary interest. What does the distribution of differences in mean growth between groups look like across the 1000 simulated datasets? Here's a density plot.
```{r coef}
sims %>%
map_df(tidy) %>%
filter(term == "groupgroup2") %>%
ggplot( aes(x = estimate) ) +
geom_density(fill = "blue", alpha = .5) +
geom_vline( xintercept = -2)
```
These are the kinds of results I find so compelling when doing simulations. The dataset I made is pretty small and has a fair amount of unexplained variation. While *on average* we get the correct result, since the peak of the distribution is right around the true value of `-2`, there is quite a range in the estimated coefficient across simulations. Some samples lead to overestimates and some to underestimates of the parameter. Some models even get the sign of the coefficient wrong. If we had only a single model from a single sample, as we would when collecting data rather than simulating data, we could very well get a fairly bad estimate for the value we are interested in.
See Gelman and Carlin's 2014 paper, [Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors](http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf) if you are interested in further discussion.
## Estimated standard deviation
I can do a similar plot exploring estimates of the residual standard deviation. In this case I extract `sigma` from the model object and put it in a data.frame to plot the distribution with a density plot.
```{r}
sims %>%
map_dbl(~summary(.x)$sigma) %>%
data.frame(sigma = .) %>%
ggplot( aes(x = sigma) ) +
geom_density(fill = "blue", alpha = .5) +
geom_vline(xintercept = 2)
```
The estimated variation ranges between 1 to just over 3, and the distribution is roughly centered on the true value of 2. Like with the coefficient above, the model performs pretty well on average but any single model can have a biased estimate of the standard deviation.
The standard deviation is underestimated a bit more than 50% of the time. This is not uncommon.
```{r}
sims %>%
map_dbl(~summary(.x)$sigma) %>%
{. < 2} %>%
mean()
```
## Extract hypothesis test results
If the goal of a simulation is to get an idea of the statistical power of a test we could look at the proportion of times the null hypothesis was rejected given a fixed alpha level (often 0.05, but of course it can be something else).
Here the proportion of models that correctly rejected the null hypothesis, given that we know the null hypothesis is not true, is just over 56%. That's an estimate of the statistical power.
```{r}
sims %>%
map_df(tidy) %>%
filter(term == "groupgroup2") %>%
pull(p.value) %>%
{. < 0.05} %>%
mean()
```
The three examples I gave above are common statistics of interest but really any output from the model is fair game. For linear models you could also pull out $R^2$ or the overall F-test, for example.
# Where to go from here?
I'll do future posts about simulating from more complicated linear models, starting with linear mixed models. In particular I will explore interesting issues that crop up when estimating variances in [part 2](https://aosmith.rbind.io/2018/04/23/simulate-simulate-part-2/) of the series.
# Just the code, please
```{r getlabels, echo = FALSE, purl = FALSE}
labs = knitr::all_labels()
labs = labs[!labs %in% c("setup", "toc", "getlabels", "allcode", "makescript")]
```
```{r makescript, include = FALSE, purl = FALSE}
options(knitr.duplicate.label = "allow") # Needed to purl like this
input = knitr::current_input() # filename of input document
output = here::here("static", "script", paste(tools::file_path_sans_ext(input), "R", sep = ".") )
knitr::purl(input, output, documentation = 0, quiet = TRUE)
```
Here's the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code [from here](/script/2018-01-09-simulate-simulate-part-1-a-linear-model.R).
```{r allcode, ref.label = labs, eval = FALSE, purl = FALSE}
```