# dmbates/stat692

Switch branches/tags
Nothing to show
Fetching contributors…
Cannot retrieve contributors at this time
386 lines (298 sloc) 11.6 KB
 --- title: "Linear Models" author: "Douglas Bates" date: "10/09/2014" output: ioslides_presentation: wide: true small: true --- ```{r preliminaries,echo=FALSE,results='hide'} library(knitr) library(ggplot2) library(hexbin) library(xtable) opts_chunk\$set(cache=TRUE) options(width=100) ``` # Simple linear regression ## The brain-weight data A famous data set from a 1976 _Science_ paper provides average body weight (kg) and brain weight (g) for 62 species of mammals. ```{r brains} data("brains",package="alr4") str(brains) head(brains) ``` ## Initial plots ```{r p1,fig.align='center'} p <- ggplot(brains,aes(x=BodyWt,y=BrainWt)) + xlab("Average body weight (kg)") + ylab("Average brain weight (g)") (p <- p + geom_point()) ``` ## On a log-log scale ```{r log-log,fig.align='center'} (p <- p + scale_x_log10() + scale_y_log10()) ``` ## With a smoother line ```{r smoother,fig.align='center',echo=FALSE,warning=FALSE} p + geom_smooth() ``` ## With a regression line ```{r regr,fig.align='center',echo=FALSE} p + geom_smooth(method="lm") ``` # Fitting a simple linear regression ## Calling `lm` - As with other model-fitting functions in `R`, the first argument to `lm` is a formula. - The second, optional but recommended, argument is `data` which is a data frame in which to evaluate the expressions in the formula + it may seem handy to omit it and use variables from the GlobalNameSpace but you lose the audit trail when you do this ```{r fm1} (fm1 <- lm(log(BrainWt) ~ 1 + log(BodyWt), brains)) ``` ## The `summary` is more than the object A peculiarity of the terminology in `R` model fitting is that `print`ing a fitted model provided minimal information but applying `summary` to it provides much more. ```{r summaryfm1} summary(fm1) ``` ## Suppressing "significance stars" - One of the worst decisions in `R` development was adding "significance stars" to the `summary` output for many models. - They are optional. Unfortunately, the default is to have them. ```{r signifstars,cache=FALSE} options(show.signif.stars=FALSE) summary(fm1) ``` ## The model matrix - the terminology used in `R` is that the formula and the data together generate a `model matrix` - sometimes the term `design matrix` is used. `model matrix` is more accurate ```{r mm} head(mm1 <- model.matrix(fm1)) str(mm1) ``` ## The Intercept term - Many "statistical packages" assume that an intercept will be included in a model. - I prefer to indicate the intercept explicitly by writing the formula as `1 + BodyWt` - To suppress the intercept you must write the formula as `0 + BodyWt` - The intercept term generates the initial column of 1's in the model matrix ```{r ones} all(mm1[,1] == 1) ``` ## Extracting the coefficient summary - Sometimes you just want the coefficient table from the summary, which is available as ```{r coef} (ctbl <- coef(summary(fm1))) ``` - There is a special method for printing tables with p-values in them. It is used in `print.summary.lm` but not in the more terse form used above. - Because the special printing method shows `< 2e-16` for small p-values, some people believe that probabilities lower than that are not evaluated. This is not the case. ## Plots of an `lm` object - There are several "pre-packaged" plots for `lm` objects. A total of 6 are available. ```{r plot1,fig.align='center'} plot(fm1,which=1) ``` ## plot 2 ```{r plot2,fig.align='center',echo=FALSE} plot(fm1,which=2) ``` ## plot 3 ```{r plot3,fig.align='center',echo=FALSE} plot(fm1,which=3) ``` ## plot 4 ```{r plot4,fig.align='center',echo=FALSE} plot(fm1,which=4) ``` ## plot 5 ```{r plot5,fig.align='center',echo=FALSE} plot(fm1,which=5) ``` ## plot 6 ```{r plot6,fig.align='center',echo=FALSE} plot(fm1,which=6) ``` ## Methods for `lm` object Most model-fitting functions assign a "class" tag to the return value ```{r classlm} class(fm1) ``` allowing for "generic" functions to have special methods used with such objects. The `methods` function provides a list of methods for a given generic or for a given class. ```{r methods} methods(class=class(fm1)) ``` ## Using extractor functions - Many of the methods for `lm` objects are "extractor" functions. That it, they extract some information from the fitted model in a way that does not depend on the internal structure. - Try to use such functions whenever possible so as to "future proof" your code. - When writing code for your own model fitting, provide extractors when possible ```{r coeff} coef(fm1) # estimates of the coefficients deviance(fm1) # residual sum of squares logLik(fm1) # log-likelihood ``` ## More extractors ```{r extractors} vcov(fm1) # estimated variance-covariance of coefficient estimators anova(fm1) # analysis of variance table (trivial in this case) ``` ## Even more extractors ```{r confint} confint(fm1) # confidence intervals on the coefficients df.residual(fm1) # degrees of freedom for residuals nobs(fm1) # number of observations used to fit the model formula(fm1) # formula for the model matrix ``` ## Fitted values ```{r residuals} fitted(fm1) ``` ## Residuals (raw) - weighted residuals are available as `residuals(fm1,type = "pearson")` or as `weighted.residuals(fm1)` ```{r rawresiduals} unname(residuals(fm1)) # suppress printing of names ``` ## Various kinds of cooked residuals - other generics provide the "studentized" residuals, `rstudent`, and the standardized residuals ```{r studentized} unname(rstudent(fm1)) ``` # Numerical methods, including simulation ## Simulating responses from a fitted model - For generality the `simulate` generic returns a `data.frame` of simulated responses ```{r strsimulate} str(simulate(fm1,nsim=10L)) ``` ## Conversion to a matrix ```{r datamatrix} str(yy <- data.matrix(simulate(fm1,nsim=10000L))) ``` - It is very slow to fit the simulated responses in a loop calling `lm` - The response in a call to `lm` can be a matrix. ```{r multipleRHS} system.time(fms <- lm(yy ~ 1 + log(BodyWt),brains)) str(cc <- coef(fms)) ``` ## Coefficients from simulated data ```{r plot,echo=FALSE,fig.align='center'} ccf <- unname(as.data.frame(t(cc))) names(ccf) <- c("Intercept","logBodyWt") qplot(Intercept,logBodyWt,data=ccf,geom="hex") ``` ## `qr` and `effects` ```{r qr} str(QR <- qr(fm1)) ``` - Objects of class `qr` represent an orthogonal-triangular or QR decomposition of a matrix ```{r effects} str(ee <- effects(fm1)) ``` ## Properties of `Q` and `R` - The orthogonal triangular decomposition of an \$n\times p\$ matrix \$\bf X\$ is \$\$\bf X = QR = Q_1R_1\$\$ where \$\bf Q\$ is \$n\times n\$ and orthogonal (i.e. \$\bf Q'Q=QQ'=I_{\rm n}\$) and the \$n\times p\$ matrix \$\bf R\$ is zero below the main diagonal. In the condensed form \$\bf Q_{\rm 1}\$ is \$n\times p\$ and \$\bf R_{\rm 1}\$ is \$p\times p\$ and upper triangular. - Multiplication of a vector, \$\bf y\$, by \$\bf Q\$ or by \$\bf Q'\$ preserves its length. \$\$\bf\|Q'y\|^2=y'QQ'y=y'y=\|y\|^2\$\$ so the residual sum of squares can be written \$\$\|\bf y-X\beta\|^2=\|Q'(y-X\beta)\|^2=\|Q'y-Q'QR\beta\|^2=\|c_{\rm 1}-R_{\rm 1}\beta\|^2+\|c_2\|^2\$\$ where \$\bf c_{\rm 1}\$ is the first \$p\$ elements of \$\bf Q'y\$ and \$\bf c_{\rm 2}\$ is the last \$n-p\$. - If \$\bf X\$ has full column rank, which amounts to saying that the diagonal elements of \$\bf R_{\rm 1}\$ are non-zero, then the least squares estimates, \$\widehat{\beta}=\arg\min_{\beta}\|y-X\beta\|^2\$, is the solution to \$\$\bf R_{\rm 1}\widehat{\beta}=c_{\rm 1}\$\$ ## Methods for `qr` objects - Functions for working with objects that later got the tag `qr` were introduced in S before S3 methods - The names of these functions do not follow the usual naming conventions. They begin with `qr.`. The methods are - `qr.Q` - return \$\bf Q_{\rm 1}\$ (the default) or \$\bf Q\$ - `qr.R` - return \$\bf R_{\rm 1}\$ (the default) or \$\bf R\$ - `qr.X` - reconstruct \$\bf X\$ from the decomposition - `qr.qty` - create \$\bf Q'y\$ from \$\bf y\$ without evaluating \$\bf Q\$ explicitly - `qr.qy` - as above but for \$\bf Qy\$ - `qr.resid` - evaluate \$\bf Q_{\rm 2}Q_{\rm 2}'y\$ without evaluating \$\bf Q\$ - `qr.fitted` - evaluate \$\bf\widehat{y} = Q_{\rm 1}Q_{\rm 1}'y\$ - `qr.coef` - evaluate \$\widehat{\beta}\$ from \$\bf y\$ ## Examples of using `qr.*` methods ```{r qandr} (R <- qr.R(QR)) str(Q1 <- qr.Q(QR)) zapsmall(crossprod(Q1)) # equivalent to Q₁'Q₁ ``` ## More methods ```{r moremethods} str(y <- model.response(model.frame(fm1))) all.equal(unclass(ee),qr.qty(QR,y),check.attributes=FALSE) (rss <- as.vector(crossprod(ee[-(1:2)]))) # squared length of c₂ all.equal(deviance(fm1), rss) ``` ## Even more methods ```{r evenmore} (betahat <- backsolve(R,ee[1:2])) c(all.equal(betahat,unname(qr.coef(QR,y))),all.equal(qr.coef(QR,y),coef(fm1)), all.equal(qr.fitted(QR,y),fitted(fm1)),all.equal(qr.resid(QR,y),resid(fm1))) anova(fm1)[["Sum Sq"]] ee[1:2]^2 # "sequential" sums of squares from the effects vector ``` ## Numerical linear algebra is not what you learned in an intro course - In an intro course you use matrix inverses all the time. In numerical linear algebra you almost never evaluate an inverse. - In an intro course the rank of a matrix is well-defined. In floating point computation it isn't. - Numerical linear algebra is all about decompositions. The least used is the eigenvalue-eigenvector. In an intro course the only decomposition you discuss is the eigenvalue-eigenvector decomposition. - In an intro course you evaluate eigenvalues as the roots of the characteristic polynomial. In numerical work you solve for the roots of a polynomial by determining the eigenvalues of a companion matrix. - One of the few instances of evaluating the inverse of a matrix is for the covariance of \$\widehat{\beta}\$ but it is \$\bf R_{\rm 1}^{\rm -1}\$, not \$(\bf X'X)^{\rm -1}\$ that is evaluated ## Evaluating the covariance matrix ```{r vcov} chol2inv(R)*deviance(fm1)/df.residual(fm1) vcov(fm1) QR\$rank # the computational rank of X ``` # Linear models and categorical covariates ## "one-way" analysis of variance ```{r insectsprays} str(InsectSprays) ``` ```{r boxplot1,echo=FALSE,fig.height=2,fig.align='center'} qplot(reorder(spray,count),count,data=InsectSprays,geom="boxplot")+xlab("Spray")+ylab("Insect count")+coord_flip() ``` ```{r boxplot2,echo=FALSE,fig.height=2,fig.align='center'} qplot(reorder(spray,count),count,data=InsectSprays,geom="boxplot")+xlab("Spray")+ylab("Insect count (square root scale)")+scale_y_sqrt()+coord_flip() ``` ## Fitting an `aov` model ```{r fm2} summary(fm2 <- aov(sqrt(count) ~ 1 + spray,InsectSprays)) class(fm2) e2 <- effects(fm2) attr(e2,"assign") c(e2[1]^2,crossprod(e2[2:6]),crossprod(e2[-(1:6)])) # sequential sum-of-squares decomposition ``` ## Using `xtable` ```{r summaryfm2xtable,results='asis',echo=FALSE} options(xtable.type="html") xtable(fm2) oldclass <- class(fm2) class(fm2) <- "lm" xtable:::xtable(fm2) class(fm2) <- oldclass ``` ## Model matrix ```{r modelmatrixfm2} model.frame(fm2)\$spray head(X <- model.matrix(fm2),15) ``` ## Model matrix (cont'd) ```{r modelmatrix2fm2} tail(X,20) ``` ## Contrasts - a categorical covariate with \$k\$ levels is converted to \$k-1\$ columns of "contrasts" in the model matrix. ```{r contrasts} contrasts(model.frame(fm2)\$spray) attr(model.frame(fm2),"terms") ``` ## Crossproduct form - the crossproduct matrix, \$\bf X'X\$ has the form ```{r xtx1} crossprod(model.matrix(fm2)) ``` ## An `aov` model is a `lm` model (class inheritance) ```{r summaryfm2} summary.lm(fm2) ``` ## Interpretation of coefficients - The default contrasts for a factor are `contr.treatment` and for an ordered factor `contr.poly` - In the `treatment` contrasts the indicators of the factor levels are generated and the first column dropped ```{r contrtreatment} contr.treatment(2) contr.treatment(3) ```