---
title: "Linear Models"
author: "Douglas Bates"
date: "10/09/2014"
output:
  ioslides_presentation:
    widescreen: true
    smaller: true
---
```{r preliminaries,echo=FALSE,results='hide'}
library(knitr)
library(ggplot2)
library(hexbin)
library(xtable)
opts_chunk$set(cache=TRUE)
options(width=100)
```
# Simple linear regression
## The brain-weight data
A famous data set from a 1976 _Science_ paper provides average body weight (kg) and brain weight (g) for 62 species of mammals.
```{r brains}
data("brains",package="alr4")
str(brains)
head(brains)
```
## Initial plots
```{r p1,fig.align='center'}
p <- ggplot(brains,aes(x=BodyWt,y=BrainWt)) + xlab("Average body weight (kg)") + ylab("Average brain weight (g)")
(p <- p + geom_point())
```
## On a log-log scale
```{r log-log,fig.align='center'}
(p <- p + scale_x_log10() + scale_y_log10())
```
## With a smoother line
```{r smoother,fig.align='center',echo=FALSE,warning=FALSE}
p + geom_smooth()
```
## With a regression line
```{r regr,fig.align='center',echo=FALSE}
p + geom_smooth(method="lm")
```
# Fitting a simple linear regression
## Calling `lm`
- As with other model-fitting functions in `R`, the first argument to `lm` is a formula.
- The second, optional but recommended, argument is `data`, a data frame in which to evaluate the expressions in the formula
+ it may seem handy to omit it and use variables from the global environment (`.GlobalEnv`), but you lose the audit trail when you do so
```{r fm1}
(fm1 <- lm(log(BrainWt) ~ 1 + log(BodyWt), brains))
```
## The `summary` is more than the object
A peculiarity of the terminology in `R` model fitting is that `print`ing a fitted model provides minimal information but applying `summary` to it provides much more.
```{r summaryfm1}
summary(fm1)
```
## Suppressing "significance stars"
- One of the worst decisions in `R` development was adding "significance stars" to the `summary` output for many models.
- They are optional. Unfortunately, the default is to have them.
```{r signifstars,cache=FALSE}
options(show.signif.stars=FALSE)
summary(fm1)
```
## The model matrix
- the terminology used in `R` is that the formula and the data together generate a `model matrix`
- sometimes the term `design matrix` is used. `model matrix` is more accurate
```{r mm}
head(mm1 <- model.matrix(fm1))
str(mm1)
```
## The Intercept term
- Many "statistical packages" assume that an intercept will be included in a model.
- I prefer to indicate the intercept explicitly by writing the formula as `1 + BodyWt`
- To suppress the intercept, write the formula as `0 + BodyWt` (illustrated after the chunk below)
- The intercept term generates the initial column of 1's in the model matrix
```{r ones}
all(mm1[,1] == 1)
```
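For comparison, a model matrix without the intercept (an illustrative refit, not used elsewhere) has only the single slope column:
```{r nointercept}
head(model.matrix(lm(log(BrainWt) ~ 0 + log(BodyWt), brains)))
```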
## Extracting the coefficient summary
- Sometimes you just want the coefficient table from the summary, which is available as
```{r coef}
(ctbl <- coef(summary(fm1)))
```
- There is a special method for printing tables with p-values in them. It is used in `print.summary.lm` but not in the more terse form used above.
- Because the special printing method shows `< 2e-16` for small p-values, some people believe that probabilities lower than that are not evaluated. This is not the case.
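A quick check of that point: `printCoefmat`, the function used by `print.summary.lm`, shows `< 2e-16`, while the stored value is far smaller.
```{r pvalcheck}
printCoefmat(ctbl)   # the "pretty" form used by print.summary.lm
ctbl[2, 4]           # the stored p-value is well below 2e-16
```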
## Plots of an `lm` object
- There are several "pre-packaged" diagnostic plots for `lm` objects; six are available through the `which` argument of the `plot` method.
```{r plot1,fig.align='center'}
plot(fm1,which=1)
```
## plot 2
```{r plot2,fig.align='center',echo=FALSE}
plot(fm1,which=2)
```
## plot 3
```{r plot3,fig.align='center',echo=FALSE}
plot(fm1,which=3)
```
## plot 4
```{r plot4,fig.align='center',echo=FALSE}
plot(fm1,which=4)
```
## plot 5
```{r plot5,fig.align='center',echo=FALSE}
plot(fm1,which=5)
```
## plot 6
```{r plot6,fig.align='center',echo=FALSE}
plot(fm1,which=6)
```
## Methods for `lm` object
Most model-fitting functions assign a "class" tag to the return value
```{r classlm}
class(fm1)
```
allowing "generic" functions to dispatch to methods specialized for such objects.
The `methods` function provides a list of methods for a given generic or for a given class.
```{r methods}
methods(class=class(fm1))
```
## Using extractor functions
- Many of the methods for `lm` objects are "extractor" functions. That is, they extract some information from the fitted model in a way that does not depend on its internal structure.
- Try to use such functions whenever possible so as to "future proof" your code.
- When writing code for your own model fitting, provide extractors when possible (a minimal sketch follows the chunk below)
```{r coeff}
coef(fm1) # estimates of the coefficients
deviance(fm1) # residual sum of squares
logLik(fm1) # log-likelihood
```
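A minimal sketch of providing such an extractor for your own fitted-model class (the `myfit` class and function here are hypothetical, not part of any package):
```{r ownextractor}
## hypothetical fitting function that returns a classed list
myfit <- function(formula, data) {
    structure(list(coefficients = coef(lm(formula, data))), class = "myfit")
}
## method for the coef() generic; callers never need the internal structure
coef.myfit <- function(object, ...) object$coefficients
```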
## More extractors
```{r extractors}
vcov(fm1) # estimated variance-covariance of coefficient estimators
anova(fm1) # analysis of variance table (trivial in this case)
```
## Even more extractors
```{r confint}
confint(fm1) # confidence intervals on the coefficients
df.residual(fm1) # degrees of freedom for residuals
nobs(fm1) # number of observations used to fit the model
formula(fm1) # formula for the model matrix
```
## Fitted values
```{r residuals}
fitted(fm1)
```
## Residuals (raw)
- weighted residuals are available as `residuals(fm1,type = "pearson")` or as `weighted.residuals(fm1)`
```{r rawresiduals}
unname(residuals(fm1)) # suppress printing of names
```
## Various kinds of cooked residuals
- other generics provide the "studentized" residuals, `rstudent`, and the standardized residuals, `rstandard` (a check of the standardization follows the chunk below)
```{r studentized}
unname(rstudent(fm1))
```
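The standardized residuals can be reproduced from the raw residuals, the residual standard error, and the leverages (a quick check, using extractors shown earlier):
```{r rstandardcheck}
s <- sqrt(deviance(fm1) / df.residual(fm1))   # residual standard error
all.equal(rstandard(fm1), residuals(fm1) / (s * sqrt(1 - hatvalues(fm1))))
```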
# Numerical methods, including simulation
## Simulating responses from a fitted model
- For generality the `simulate` generic returns a `data.frame` of simulated responses
```{r strsimulate}
str(simulate(fm1,nsim=10L))
```
## Conversion to a matrix
```{r datamatrix}
str(yy <- data.matrix(simulate(fm1,nsim=10000L)))
```
- It is very slow to fit each simulated response in a separate call to `lm` inside a loop (a timing sketch follows the chunk below).
- The response in a call to `lm` can instead be a matrix, so all the fits are done in one call.
```{r multipleRHS}
system.time(fms <- lm(yy ~ 1 + log(BodyWt),brains))
str(cc <- coef(fms))
```
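For comparison, a sketch of the loop (not evaluated here because it is slow; timings will vary by machine):
```{r loopfit,eval=FALSE}
## fit each simulated response separately -- much slower than the single matrix fit
system.time(ccloop <- sapply(seq_len(ncol(yy)),
                             function(j) coef(lm(yy[, j] ~ 1 + log(BodyWt), brains))))
```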
## Coefficients from simulated data
```{r plot,echo=FALSE,fig.align='center'}
ccf <- unname(as.data.frame(t(cc)))
names(ccf) <- c("Intercept","logBodyWt")
qplot(Intercept,logBodyWt,data=ccf,geom="hex")
```
## `qr` and `effects`
```{r qr}
str(QR <- qr(fm1))
```
- Objects of class `qr` represent an orthogonal-triangular or QR decomposition of a matrix
```{r effects}
str(ee <- effects(fm1))
```
## Properties of `Q` and `R`
- The orthogonal triangular decomposition of an $n\times p$ matrix $\bf X$ is
$$\bf X = QR = Q_1R_1$$
where $\bf Q$ is $n\times n$ and orthogonal (i.e. $\bf Q'Q=QQ'=I_{\rm n}$) and the $n\times p$ matrix $\bf R$ is zero below the main diagonal. In the condensed form, $\bf Q_{\rm 1}$ is the first $p$ columns of $\bf Q$ ($n\times p$) and $\bf R_{\rm 1}$ is the first $p$ rows of $\bf R$ ($p\times p$ and upper triangular).
- Multiplication of a vector, $\bf y$, by $\bf Q$ or by $\bf Q'$ preserves its length.
$$\bf\|Q'y\|^2=y'QQ'y=y'y=\|y\|^2$$
so the residual sum of squares can be written
$$\|\bf y-X\beta\|^2=\|Q'(y-X\beta)\|^2=\|Q'y-Q'QR\beta\|^2=\|c_{\rm 1}-R_{\rm 1}\beta\|^2+\|c_2\|^2$$
where $\bf c_{\rm 1}$ is the first $p$ elements of $\bf Q'y$ and $\bf c_{\rm 2}$ is the last $n-p$.
- If $\bf X$ has full column rank, which amounts to saying that the diagonal elements of $\bf R_{\rm 1}$ are non-zero, then the least squares estimate, $\widehat{\beta}=\arg\min_{\beta}\|y-X\beta\|^2$, is the solution to
$$\bf R_{\rm 1}\widehat{\beta}=c_{\rm 1}$$
## Methods for `qr` objects
- Functions for working with the objects that later got the class tag `qr` were introduced in S before S3 methods existed.
- The names of these functions do not follow the usual naming conventions: they all begin with `qr.` although they are not S3 methods. The functions are
- `qr.Q` - return $\bf Q_{\rm 1}$ (the default) or $\bf Q$
- `qr.R` - return $\bf R_{\rm 1}$ (the default) or $\bf R$
- `qr.X` - reconstruct $\bf X$ from the decomposition
- `qr.qty` - create $\bf Q'y$ from $\bf y$ without evaluating $\bf Q$ explicitly
- `qr.qy` - as above but for $\bf Qy$
- `qr.resid` - evaluate $\bf Q_{\rm 2}Q_{\rm 2}'y$, where $\bf Q_{\rm 2}$ is the last $n-p$ columns of $\bf Q$, without evaluating $\bf Q$
- `qr.fitted` - evaluate $\bf\widehat{y} = Q_{\rm 1}Q_{\rm 1}'y$
- `qr.coef` - evaluate $\widehat{\beta}$ from $\bf y$
## Examples of using `qr.*` methods
```{r qandr}
(R <- qr.R(QR))
str(Q1 <- qr.Q(QR))
zapsmall(crossprod(Q1)) # equivalent to Q₁'Q₁
```
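As a further check, `qr.X` reconstructs the matrix that was decomposed, which here should match the model matrix:
```{r qrX}
all.equal(qr.X(QR), model.matrix(fm1), check.attributes = FALSE)
```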
## More methods
```{r moremethods}
str(y <- model.response(model.frame(fm1)))
all.equal(unclass(ee),qr.qty(QR,y),check.attributes=FALSE)
(rss <- as.vector(crossprod(ee[-(1:2)]))) # squared length of c₂
all.equal(deviance(fm1), rss)
```
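The length-preserving property of multiplication by $\bf Q'$ can be checked the same way:
```{r lengthcheck}
all.equal(sum(qr.qty(QR, y)^2), sum(y^2))   # ||Q'y||² = ||y||²
```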
## Even more methods
```{r evenmore}
(betahat <- backsolve(R,ee[1:2]))
c(all.equal(betahat,unname(qr.coef(QR,y))),all.equal(qr.coef(QR,y),coef(fm1)),
all.equal(qr.fitted(QR,y),fitted(fm1)),all.equal(qr.resid(QR,y),resid(fm1)))
anova(fm1)[["Sum Sq"]]
ee[1:2]^2 # "sequential" sums of squares from the effects vector
```
## Numerical linear algebra is not what you learned in an intro course
- In an intro course you use matrix inverses all the time. In numerical linear algebra you almost never evaluate an inverse.
- In an intro course the rank of a matrix is well-defined. In floating point computation it isn't.
- Numerical linear algebra is all about decompositions, and the eigenvalue-eigenvector decomposition is one of the least used. In an intro course it is usually the only decomposition discussed.
- In an intro course you evaluate eigenvalues as the roots of the characteristic polynomial. In numerical work it goes the other way: the roots of a polynomial are found as the eigenvalues of its companion matrix (a small illustration follows this list).
- One of the few instances of evaluating a matrix inverse is the covariance matrix of $\widehat{\beta}$, but it is $\bf R_{\rm 1}^{\rm -1}$, not $(\bf X'X)^{\rm -1}$, that is evaluated
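A small illustration of the companion-matrix idea (the cubic below is just an example; `polyroot` is base R's polynomial root finder):
```{r companion}
## roots of x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3) via its companion matrix
a <- c(-6, 11, -6)                    # a0, a1, a2 of the monic cubic
C <- cbind(rbind(0, diag(2)), -a)     # companion matrix; its eigenvalues are the roots
eigen(C)$values
polyroot(c(-6, 11, -6, 1))            # coefficients in increasing order of power
```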
## Evaluating the covariance matrix
```{r vcov}
chol2inv(R)*deviance(fm1)/df.residual(fm1)
vcov(fm1)
QR$rank # the computational rank of X
```
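An illustration of why the computational rank matters (a made-up example: the third column is an exact linear combination of the first two):
```{r ranktol}
qr(cbind(1, log(brains$BodyWt), 1 + 2 * log(brains$BodyWt)))$rank   # detected rank is 2
```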
# Linear models and categorical covariates
## "one-way" analysis of variance
```{r insectsprays}
str(InsectSprays)
```
```{r boxplot1,echo=FALSE,fig.height=2,fig.align='center'}
qplot(reorder(spray,count),count,data=InsectSprays,geom="boxplot")+xlab("Spray")+ylab("Insect count")+coord_flip()
```
```{r boxplot2,echo=FALSE,fig.height=2,fig.align='center'}
qplot(reorder(spray,count),count,data=InsectSprays,geom="boxplot")+xlab("Spray")+ylab("Insect count (square root scale)")+scale_y_sqrt()+coord_flip()
```
## Fitting an `aov` model
```{r fm2}
summary(fm2 <- aov(sqrt(count) ~ 1 + spray,InsectSprays))
class(fm2)
e2 <- effects(fm2)
attr(e2,"assign")
c(e2[1]^2,crossprod(e2[2:6]),crossprod(e2[-(1:6)])) # sequential sum-of-squares decomposition
```
## Using `xtable`
```{r summaryfm2xtable,results='asis',echo=FALSE}
options(xtable.type="html")
xtable(fm2)
oldclass <- class(fm2)
class(fm2) <- "lm"
xtable(fm2)
class(fm2) <- oldclass
```
## Model matrix
```{r modelmatrixfm2}
model.frame(fm2)$spray
head(X <- model.matrix(fm2),15)
```
## Model matrix (cont'd)
```{r modelmatrix2fm2}
tail(X,20)
```
## Contrasts
- a categorical covariate with $k$ levels is converted to $k-1$ columns of "contrasts" in the model matrix (when, as here, the model also contains an intercept).
```{r contrasts}
contrasts(model.frame(fm2)$spray)
attr(model.frame(fm2),"terms")
```
## Crossproduct form
- the crossproduct matrix, $\bf X'X$ has the form
```{r xtx1}
crossprod(model.matrix(fm2))
```
## An `aov` model is an `lm` model (class inheritance)
```{r summaryfm2}
summary.lm(fm2)
```
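The inheritance can be checked directly:
```{r inheritslm}
inherits(fm2, "lm")   # fm2 has class c("aov", "lm"), so lm methods apply
```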
## Interpretation of coefficients
- The default contrasts for a factor are `contr.treatment` and for an ordered factor `contr.poly`
- In the `treatment` contrasts, indicator columns for the factor levels are generated and the first column is dropped (the interpretation is checked after the chunk below)
```{r contrtreatment}
contr.treatment(2)
contr.treatment(3)
```
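With treatment contrasts, the intercept is the mean for the first level and the other coefficients are differences from it; a quick check on `fm2` (on the square-root scale used in the fit):
```{r treatmentcontrasts}
grpmeans <- with(InsectSprays, tapply(sqrt(count), spray, mean))
coef(fm2)    # intercept = mean for spray "A"; the others are differences from "A"
all.equal(unname(coef(fm2)), unname(c(grpmeans[1], grpmeans[-1] - grpmeans[1])))
```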