---
title: "Mixed model lab"
date: "`r format(Sys.time(), '%H:%M %d %B %Y')`"
author: Ben Bolker
---
```{r setup,echo=FALSE,message=FALSE}
## ignore this stuff ...
library("reshape2")
library("plyr")
library("lattice")
library("knitr")
opts_chunk$set(fig.align="center",fig.width=5,fig.height=5,tidy=FALSE,message=FALSE)
opts_knit$set(use.highlight=TRUE,error=FALSE)
knit_hooks$set(basefig=function(before, options, envir) {
    if (before) {
        par(bty="l",las=1)
    } else { }
})
```
![cc](pix/cc-attrib-nc.png)
<!---
(http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png)
--->
Licensed under the
[Creative Commons attribution-noncommercial license](http://creativecommons.org/licenses/by-nc-sa/2.5/ca/).
Please share \& remix noncommercially, mentioning its origin.
Packages/versions used:
```{r pkglist,echo=FALSE}
usedpkgs <- sort(c("coefplot2","lme4","nlme","glmmADMB","lmerTest",
                   "MCMCglmm","pbkrtest","coda","reshape2","plyr",
                   "gridExtra"))
i1 <- installed.packages()
print(i1[usedpkgs,"Version"],quote=FALSE)
```
* `coefplot2` is still under development: try `install.packages("coefplot2",repos="http://www.math.mcmaster.ca/bolker/R",type="source")`, *after* making sure that all the dependencies are installed (`coda`, `lme4`, `reshape`).
Linear mixed model: starling example
-----------------
Data from Toby Marthews, [r-sig-mixed-models mailing list](http://tolstoy.newcastle.edu.au/R/e13/help/11/01/1364.html):
```{r starling1,fig.height=4,fig.width=7}
## load 'dataf': assumes you have the data stored in a subdirectory;
## adjust according to where you put it
load("data/starling.RData")
dataf <- transform(dataf,
                   subject=reorder(subject,stmass))
library("lattice")
xyplot(stmass~mnth|roostsitu,data=dataf,
       type="l",groups=subject,layout=c(4,1))
```
The `xyplot` function is a little bit "magic", but methods for
plotting multilevel data are either tedious or magical (see below
for more examples). In this case, we are telling R to plot
`stmass` as a function of month, subdividing into different
panels by location, and plotting each subject as its own line
(`groups=subject`). The `layout` argument just says that we
want the panels arranged in 4 columns across the page.
It's pretty obvious that the starting (November) mass varies among roost situations (tree/nest-box/etc.), and that mass increases from November to January, but we might like to know if the slopes differ among situations. That means our fixed effects would be `~roostsitu*mnth`, with our attention focused on the `roostsitu:mnth` (interaction) term. For random effects, we can allow both intercept (obviously) and slope (maybe) to vary among individuals, via `~1+mnth|subject` or equivalent. In this case, because measurements are taken in only two months, fitting a straight line between the November and January values is equivalent to fitting November and January as two separate categories, so we could also write the random term as `~1|subject/mnth` (random effects of subject, and of month nested within subject). **However**, it turns out that we *can't actually estimate random slopes for this model*. Because every individual is measured only twice, the variance in the slopes would be completely confounded with the residual variance. Put another way, there would be as many separate random-effects values for the subject-month interaction as there are measurements, so there is no way to tell the residual error term apart from the subject-month interaction. This is a very easy sort of mistake to make, and `lme` will go ahead and let you make it; `lme4` usually warns you.
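If you want to see this for yourself, here is a sketch (not evaluated here, since `lme4` is only loaded later in this lab); by default `lmer` detects the problem and stops:
```{r lmer_badslopes,eval=FALSE}
## attempt the unidentifiable random-slopes model: with only two
## measurements per subject there are as many random effects as
## observations, so lmer refuses to fit it by default
## (lme would go ahead and fit it without complaint)
lmer(stmass~mnth*roostsitu+(1+mnth|subject),data=dataf)
```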
Once we forget about including the (unidentifiable) random slopes, we have
```{r lme2}
library("nlme")
lme2 <- lme(stmass~mnth*roostsitu,random=~1|subject,data=dataf)
```
We can now get estimates, although the subject-level random effects are *very* uncertain: see `intervals(lme2,which="var-cov")`.
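Running that command shows just how wide the subject-level intervals are:
```{r intervals}
intervals(lme2,which="var-cov")
```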
Walking through the output:
```{r echo=FALSE}
lme2out <- capture.output(summary(lme2))
cat(lme2out[1:4],sep="\n")
```
* This reminds us that the model was fitted by restricted maximum likelihood,
and gives us the value of the AIC, BIC, and log-likelihood (only useful for
comparing with other fitted models)
```{r echo=FALSE}
cat(lme2out[6:9],sep="\n")
```
* This tells us about the random effect -- it reminds us of the formula
(intercept variation among subjects) and gives us the standard deviation
of the random effect
(this is *not* an estimate of the uncertainty of the estimate -- the estimate
itself *is* a variance, or standard deviation), and the residual standard
deviation. We can see that the among-subject standard deviation is pretty
small (about 1/8 of the residual standard deviation).
```{r echo=FALSE}
cat(lme2out[11:20],sep="\n")
```
* the standard fixed-effect output. `lme` tells us the denominator degrees
of freedom for the test (36), but please note **lme often reports the denominator df wrong in random-slopes models** (although in this case it's correct). The df are fairly large here, so the p-values
are pretty close to their Normal equivalent (the 95% critical value is `r round(qt(0.975,df=36),2)`).
* the next few lines (21-38) are the correlations among the fixed-effect parameters,
which we're not usually very concerned with -- we might be concerned if the correlations were very large (>0.95?), possibly indicating problems with the fit
```{r echo=FALSE}
cat(lme2out[39:41],sep="\n")
```
* the distribution of standardized residuals should be reasonably consistent with
a sample from the standard normal of the appropriate size (median near zero,
quartiles around $\pm$ `r round(qnorm(0.75),2)`, min/max appropriate for the sample size)
```{r echo=FALSE}
cat(lme2out[43:44],sep="\n")
```
* the listed number of observations and groups is *very* useful for double-checking that the random effects grouping specification is OK
* If you compare this fit with the (unidentifiable) model that includes `~1|subject/mnth`, you'll notice that this model's residual variance equals that model's subject-by-month variance plus its residual variance, while the subject-level (intercept) variance is (very nearly) identical -- another symptom of the confounding described above.
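You can extract the variance components directly with `VarCorr` if you want to check such statements numerically:
```{r varcorr}
VarCorr(lme2)
```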
Diagnostic plots: fitted vs. residuals, coloured according to roost location:
```{r lmediag}
plot(lme2,col=dataf$roostsitu)
```
There don't seem to be any systematic patterns here, which is good
(means there aren't patterns we missed in the data).
The fitted values don't vary very much within `roostsitu*month` groups.
Q-Q plot (a straight line indicates normality)
```{r qqnorm}
qqnorm(lme2,col=dataf$roostsitu,abline=c(0,1))
```
(There are some deviations here, but not enough that I would worry very much.)
Boxplots of residuals subdivided by `roostsitu` (it's a quirk that you have to put the grouping variable on the *left* side of the formula here):
```{r diagbox}
plot(lme2,roostsitu~resid(.))
```
One good way to assess the results of a model fit is to look at a coefficient plot:
```{r coefplot,message=FALSE,fig.keep="last"}
library("coefplot2")
coefplot2(lme2)
```
Stop and explain to yourself what these parameters mean. If you're not sure, try resetting the base level of the `roostsitu` factor: `dataf2 <- transform(dataf,roostsitu=relevel(roostsitu,ref="other"))`, predict what will happen to the results, and re-run the analysis.
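A minimal sketch (not run here; make your prediction first!):
```{r relevel,eval=FALSE}
## refit with "other" as the baseline level of roostsitu
dataf2 <- transform(dataf,roostsitu=relevel(roostsitu,ref="other"))
lme2r <- update(lme2,data=dataf2)
coefplot2(lme2r)
```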
Plot random effects, ordered by magnitude:
```{r starling_RE}
rr <- ranef(lme2)
plot(rr[order(rr[,1]),,drop=FALSE])  ## sort rows by the (Intercept) effect
```
I don't see any worrying outliers or breakpoints here ...
**Exercise**: This is actually a slightly trivial example, because there are only two measurements for each individual. Thus we can actually analyze the slopes and get the same answers by using a paired analysis, i.e. by computing the mass *difference* within subjects and then analyzing with a single-level linear model. (In the simplest case this would reduce to a paired $t$-test, but in this case it is a 1-way ANOVA on `roostsitu`.)
Rearrange the data to get differences by subject:
```{r diffmass}
library("plyr")
dataf2 <- ddply(dataf,"subject", summarise,
                roostsitu=roostsitu[1],
                massdiff=diff(stmass))
```
This says to split the data frame `dataf` by the `subject` variable;
for each value of `subject`, put together a data frame with a `roostsitu`
variable equal to the first element of the `roostsitu` vector for that
subject and a `massdiff` variable equal to the difference between the masses.
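If you prefer base R to `plyr`, a roughly equivalent sketch (not run; it assumes, as here, exactly two rows per subject with November first):
```{r diffmass2,eval=FALSE}
## per-subject mass differences via tapply
massdiff <- with(dataf,tapply(stmass,subject,diff))
firstroost <- with(dataf,tapply(as.character(roostsitu),subject,"[",1))
dataf2b <- data.frame(subject=names(massdiff),
                      roostsitu=factor(firstroost),
                      massdiff=massdiff)
```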
Draw a picture (boxplot):
```{r plotdiffs,message=FALSE}
boxplot(massdiff~roostsitu,data=dataf2)
```
* Analyze these data with `lm` and convince yourself that the estimates (fixed-effect coefficients, $t$ statistics, etc.) are equivalent to those found from the previous analysis.
* It is also possible to rearrange and summarize the data to test the difference in intercepts, or to estimate the among-individual variance in intercepts (how?)
```{r lmcheat,echo=FALSE,results="hide"}
summary(lm(massdiff~roostsitu,dataf2))
```
### analyze with `lme4`
The `lmer` syntax is almost identical, except that the random effects (in parentheses) are added to the fixed effects to make a single formula rather than being expressed separately.
```{r lme4load}
library("lme4")
```
```{r lme4_lmer1}
lmer1 <- lmer(stmass~mnth*roostsitu+(1|subject),data=dataf)
```
Compare the results (you can use `coefplot2(list(lmer1,lme2))` to compare the fixed effects graphically).
Most of the other diagnostic methods should also work, with minor exceptions:
* you need to use `qqmath` instead of `qqnorm` to get the Q-Q plot
* you can use `pp <- profile(lmer1); xyplot(pp)` to look at the likelihood profile, and `confint(pp)` to get likelihood profile confidence intervals (compare them with the results of `intervals` above); both are sketched below
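For example (not run here; profiling can take a few seconds):
```{r lmerdiag,eval=FALSE}
qqmath(lmer1)        ## lattice analogue of qqnorm for lmer fits
pp <- profile(lmer1) ## compute the likelihood profile
xyplot(pp)           ## plot the profile for each parameter
confint(pp)          ## profile confidence intervals
```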
Inference
-----------------
### lme
Wald tests:
```{r coefs}
printCoefmat(summary(lme2)$tTable,digits=3,has.Pval=TRUE)
```
(you can see this in context with `summary(lme2)`; I used the more complicated command here to isolate just the coefficient matrix).
We conclude that the interactions are not doing much,
but there's definitely an effect of roost location.
This agrees with the picture from `coefplot2` (above).
However, we probably want to test the overall effect of the interactions, not the individual levels.
Here are the type I (sequential) ANOVA results:
```{r anovalme2}
anova(lme2)
```
Because of the design of this particular study, the denominator degrees of freedom (`denDF` column) is identical for all effects.
If we want to evaluate the marginal sums of squares, i.e. dropping one term at a time from the model, we usually want to change the model to use sum-to-zero contrasts:
```{r contrlme}
lme2B <- update(lme2,
                contrasts=list(mnth="contr.sum",
                               roostsitu="contr.sum"))
```
The alternative approach is to use `options(contrasts=c("contr.sum","contr.poly"))`, then refit the model, but I prefer to use the `contrasts` argument because it is more explicit.
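A sketch of that global alternative (not run here; note that it changes the default contrasts for *every* subsequent fit in the session):
```{r contrglobal,eval=FALSE}
options(contrasts=c("contr.sum","contr.poly"))
lme2B.alt <- update(lme2)  ## refit; picks up the new default contrasts
```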
Type III (marginal) results:
```{r anova2lme2}
anova(lme2B,type="marginal")
```
In this case the results are identical ***because the original design is balanced and the predictors are orthogonal (not collinear)***. This is not true if (1) the data are unbalanced (which is often true of ANOVA [categorical-predictor] designs, and almost always true of regression designs) or (2) the model is a GLMM or nonlinear.
The explicit model-comparison approach uses a likelihood ratio test rather than an $F$ test (i.e., it does not correct for the fact that the denominator sum of squares is estimated with error). In this case it hardly matters.
```{r testmodels}
lme2C <- update(lme2B,method="ML")
lme2D <- update(lme2C,. ~ . - mnth:roostsitu) ## drop interaction
anova(lme2C,lme2D)
```
If we now want to use the model-comparison approach on the reduced (no-interaction) model to test the significance of `roostsitu`, we can use `update` to drop the `roostsitu` effect, but we also have to make sure to update the `contrasts` argument so that it only refers to predictors that remain in the reduced model (otherwise, we get an error).
```{r test2}
lme2E <- update(lme2D,.~.-roostsitu,
                contrasts=list(mnth="contr.sum"))
anova(lme2D,lme2E)
```
If we want to test the random effect, we would in principle remove the random effect and test with `anova`, but this is a bit problematic here: `lme` can't fit a model without any random effects.
Let's try this with `gls` instead:
```{r anovacmp}
gls1 <- gls(stmass~mnth*roostsitu,
            data=dataf,method="ML")
(a1 <- anova(lme2C,gls1))
```
### lmer
For `lmer` we get the same anova table, but (1) without the `Intercept` term included, (2) with sum-of-squares and mean-squares columns included, (3) *without* denominator df or $p$-values, and (4) with slightly different precision (one more significant figure):
```{r anovalmer1}
(a2 <- anova(lmer1))
```
If you want the degrees of freedom, you can use the `lmerTest` package
to get them. This package wraps the `lme4` fit in another layer,
so we have to go ahead and re-fit the model after it's loaded.
The default `anova` uses a [Satterthwaite approximation](https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation), and gets (almost)
the same result as `lme`:
```{r lmerTest,message=FALSE}
library("lmerTest")
lmer1T <- lmer(stmass~mnth*roostsitu+(1|subject),data=dataf)
anova(lmer1T)
```
Alternatively, you can ask for a Kenward-Roger approximation
(in this case identical to `lme`):
```{r lmerTest2,results="hide"}
anova(lmer1T,ddf="Kenward-Roger")
```
```{r detpkg}
detach("package:lmerTest")
```
For more complicated designs, or for `glmer` fits,
we can compare specific models with a parametric bootstrap
(very slow, but the most reliable approach);
the `pbkrtest` package can do this.
Explicitly fitting a model without the month-by-location interaction:
```{r pboot}
lmer2 <- update(lmer1,.~.-mnth:roostsitu)
```
The next step is a bit slow (up to a minute):
```{r pbkrtest,cache=TRUE}
library("pbkrtest")
suppressWarnings(PBmodcomp(lmer1,lmer2))
KRmodcomp(lmer1,lmer2)
```
In this case, the Kenward-Roger correction appropriately does nothing different -- we have a classical balanced design and no correction is actually necessary. But it does give a denominator df and a $p$-value for this lmer model, which is handy ...
Next we compute predicted values and superimpose them on plots of the data.
In `lme`, `predict` has a `level` argument that specifies which levels of the random effects should be included (0 = none, i.e. population-level predictions; 1 = predictions at the subject level; more complex models might have additional nested levels).
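As a quick check, `predict` also accepts a vector of levels; a sketch (the column names `predict.fixed`/`predict.subject` are what I'd expect from the defaults, so double-check the output):
```{r predboth,eval=FALSE}
## one column per requested level of the random effects
head(predict(lme2,level=0:1))
```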
```{r predictplot,fig.width=10}
dataf$pred <- predict(lme2,level=0) ## population level
dataf$pred1 <- predict(lme2,level=1) ## individual level
ss <- split(dataf,dataf$roostsitu)
par(mfrow=c(1,4),las=1,bty="l")
for (i in 1:4) {
    ## observed values:
    ## rearrange data with one column per individual
    m <- t(matrix(ss[[i]]$stmass,ncol=2))
    matplot(m,type="b",lty=1,pch=1,col=1,ylim=c(60,100))
    ## ... now do the same for both prediction types
    mpred <- t(matrix(ss[[i]]$pred,ncol=2))
    matlines(mpred,type="l",lty=1,col=2)
    mpred1 <- t(matrix(ss[[i]]$pred1,ncol=2))
    matlines(mpred1,type="l",lty=2,col="gray")
}
```
While it may seem a bit mystical, it's actually quite a bit
easier (and prettier) to do this with `ggplot`:
```{r ggplot,fig.width=10}
library("ggplot2"); theme_set(theme_bw())
gplot1 <- ggplot(dataf,aes(mnth,stmass))+
    geom_point()+
    geom_line(aes(group=subject))+
    facet_grid(.~roostsitu)
gplot1 +
    geom_line(colour="gray",aes(y=pred1,group=subject)) +
    geom_line(colour="red",aes(y=pred,group=subject),linetype=2)
```
There is so much shrinkage (the among-individual variance is very small) that we can barely see the individual-level predictions (gray lines) behind the population-level predictions (red lines).
Unfortunately computing confidence intervals for the predictions is a little tricky: again, there is some code on the [GLMM faq](http://glmm.wikidot.com/faq) for this (also see below under `MCMCglmm`).
For most cases you will want to set up a new data frame to do prediction rather than just using the covariates from the original data (e.g. if the data are sampled irregularly, or sparsely), and use the `newdata` argument of `predict`. The `expand.grid` function is handy in this context too.
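For example, a minimal sketch for this model (here the observed factor combinations already cover the full grid, so this mainly illustrates the mechanics):
```{r prednewdata,eval=FALSE}
## population-level predictions on a regular month-by-roost grid
newdat <- expand.grid(mnth=unique(dataf$mnth),
                      roostsitu=unique(dataf$roostsitu))
newdat$pred <- predict(lme2,newdata=newdat,level=0)
```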
Generalized linear mixed model: *Culcita* example
-----------------
```{r culcdata}
culcdat <- read.csv("data/culcitalogreg.csv",
                    colClasses=c(rep("factor",2),
                                 "numeric",
                                 rep("factor",6)))
## abbreviate slightly
levels(culcdat$ttt.1) <- c("none","crabs","shrimp","both")
```
Adjust contrasts for the treatment,
to compare (1) no-symbiont vs symbiont cases,
(2) crabs vs shrimp, (3) effects of a single pair/type
of symbionts vs effects of having both:
```{r setcontrasts}
contrasts(culcdat$ttt) <-
    matrix(c(3,-1,-1,-1,
             0,1,-1,0,
             0,1,1,-2),
           nrow=4,
           dimnames=list(c("none","C","S","CS"),
                         c("symb","C.vs.S","twosymb")))
```
(this is a little bit obscure: you can see [here](http://static-content.springer.com/esm/art%3A10.1007%2Fs00442-012-2275-2/MediaObjects/442_2012_2275_MOESM1_ESM.pdf) (p. 2-3) for more details if you like)
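One quick sanity check on a hand-built contrast matrix like this one is that its columns are mutually orthogonal:
```{r contrcheck,eval=FALSE}
## all off-diagonal entries of the crossproduct should be zero
zapsmall(crossprod(contrasts(culcdat$ttt)))
```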
Fit with PQL, Laplace approximation, Gauss-Hermite quadrature:
```{r glmmfits,fig.keep="last"}
library("MASS")
culcmod0 <- glmmPQL(predation~ttt,random=~1|block,family=binomial,
                    data=culcdat,verbose=FALSE)
culcmod1 <- glmer(predation~ttt+(1|block),family=binomial,data=culcdat)
culcmod2 <- glmer(predation~ttt+(1|block),family=binomial,data=culcdat,nAGQ=8)
coefplot2(list(glmmPQL=culcmod0,Laplace=culcmod1,GHQ8=culcmod2),
          col=c(1,2,4),legend.x="right")
```
Try it with `glmmADMB` and `MCMCglmm`:
```{r glmmfits2}
library("MCMCglmm")
library("glmmADMB")
culcmod3 <- glmmadmb(predation~ttt,random=~1|block,family="binomial",
                     data=culcdat)
culcdat$nopred <- 1-culcdat$predation
```
Check out the results.
### analyze starling data with `MCMCglmm`
```{r mcmcglmm,message=FALSE}
mcmcglmm1 <- MCMCglmm(stmass~mnth*roostsitu,
                      random=~subject,data=dataf,
                      verbose=FALSE)
```
We use `verbose=FALSE` to turn off the progress messages, which would be ugly in this document but are generally useful.
* Compare the results (use `summary()`: printing a raw `MCMCglmm` model is ugly).
For MCMC approaches, it is your responsibility to check that the chain(s) are well-behaved.
Try this:
```{r mcmcplot1,fig.keep="none"}
library("coda")
xyplot(as.mcmc(mcmcglmm1$Sol),layout=c(2,4))
```
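Beyond eyeballing the trace plots, the `coda` package provides numerical diagnostics, for example:
```{r mcmcdiag,eval=FALSE}
effectiveSize(mcmcglmm1$Sol)  ## effective sample size for each parameter
geweke.diag(mcmcglmm1$Sol)    ## Geweke convergence Z-scores
```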
You can plot the distributions:
```{r plotdens,fig.keep="none"}
densityplot(mcmcglmm1$Sol)
```
If you use `ggplot`, you can do some nice
*violin plots*:
```{r plotviolin,fig.keep="none"}
md <- melt(as.matrix(mcmcglmm1$Sol))
ggplot(subset(md,Var2!="(Intercept)"),
       aes(Var2,value))+geom_violin(fill="grey")+
    geom_hline(yintercept=0,lty=2)+
    coord_flip()
```
```{r mcmcsum}
summary(mcmcglmm1)
```
MCMCglmm makes it a little bit easier to get confidence intervals on the predictions. The `marginal` argument specifies which random effects we want the predictions to "marginalise" (i.e., average over). The default is to use all of the random effects in the original model (i.e. `~subject` in this case), i.e. to predict the average population-level outcome. The approach taken below to get the subject-level predictions (i.e. marginalise over nothing, since `subject` is the only random effect) is a bit of a hack: this may be easier in future versions.
In order to get all the predictions we want,
we need to refit the model using `pr=TRUE` to store the random effects (which are sometimes quite large, hence not stored by default):
```{r mcmcglmm1R}
mcmcglmm1R <- MCMCglmm(stmass~mnth*roostsitu,
                       random=~subject,data=dataf,
                       verbose=FALSE,pr=TRUE)
```
```{r mcmcpred,warning=FALSE}
mpred0 <- predict(mcmcglmm1R,interval="confidence")
colnames(mpred0) <- paste("mpred0",c("fit","lwr","upr"),sep=".")
mpred1 <- predict(mcmcglmm1R,interval="confidence",marginal=NULL)
colnames(mpred1) <- paste("mpred1",c("fit","lwr","upr"),sep=".")
dataf <- data.frame(dataf,mpred0,mpred1)
```
Testing that we get the same results for the level-1 predictions:
```{r checkpred}
ggplot(dataf,aes(x=pred1,y=mpred1.fit))+geom_point()+
    geom_abline(slope=1,intercept=0)+
    labs(x="lme prediction",y="MCMCglmm prediction")
```
Now we can plot confidence intervals:
```{r mcconf}
g0 <- ggplot(dataf,aes(x=mnth,y=stmass))+
    stat_sum(aes(size=factor(..n..)),alpha=0.5)+
    facet_grid(~roostsitu)+
    scale_size_discrete(name="n",range=c(2,5))
g0 + geom_line(aes(x=as.numeric(mnth),y=mpred0.fit),colour="red")+
    geom_ribbon(aes(x=as.numeric(mnth),y=mpred0.fit,
                    ymin=mpred0.lwr,ymax=mpred0.upr),fill="red",
                alpha=0.3)
```
We can plot individual predictions and confidence intervals as well, although in this case the plot is almost identical:
```{r mcconf2}
g0 + geom_line(aes(x=as.numeric(mnth),y=mpred1.fit,group=subject),
               colour="blue")+
    geom_ribbon(aes(x=as.numeric(mnth),y=mpred1.fit,group=subject,
                    ymin=mpred1.lwr,ymax=mpred1.upr),fill="blue",
                alpha=0.05)
```
### History
* originally developed for NCEAS summer institute, July 2013
* updated, Alaska ASA workshop, August 2014
* updated, IISC, June 2015