Linear Algebra Examples
February 18, 2015


Now we are ready to see how matrix algebra can be useful when analyzing data. We start with some simple examples and eventually arrive at the main one: how to write linear models with matrix algebra notation and solve the least squares problem.

The average

To compute the sample average and variance of our data, we use these formulas $\bar{Y}=\frac{1}{N} \sum_{i=1}^N Y_i$ and $\mbox{var}(Y)=\frac{1}{N} \sum_{i=1}^N (Y_i - \bar{Y})^2$. We can represent these with matrix multiplication. First, define this $N \times 1$ matrix made just of 1s:

$$ \mathbf{A}=\begin{pmatrix} 1\\ 1\\ \vdots\\ 1 \end{pmatrix} $$

This implies that:

$$ \frac{1}{N} \mathbf{A}^\top \mathbf{Y} = \frac{1}{N} \begin{pmatrix}1&1&\dots&1\end{pmatrix} \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix}= \frac{1}{N} \sum_{i=1}^N Y_i = \bar{Y} $$

Note that we are multiplying by the scalar $1/N$. In R, we multiply matrices using the %*% operator:

library(UsingR)  # provides the father.son height data
data("father.son")
x <- father.son$fheight  # father heights, used in the linear model below
y <- father.son$sheight  # son heights

N <- length(y)
Y <- matrix(y, N, 1)
A <- matrix(1, N, 1)
barY <- t(A) %*% Y / N


The variance

As we will see later, multiplying the transpose of a matrix with another is very common in statistics. In fact, it is so common that there is a function in R:

barY <- crossprod(A, Y) / N
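As a quick sanity check (a one-line verification using the objects defined above), this agrees with R's built-in mean:

all.equal(as.numeric(barY), mean(y))  # TRUE: both compute the sample average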

For the variance, we note that if:

$$ \mathbf{r}\equiv \begin{pmatrix} Y_1 - \bar{Y}\\ \vdots\\ Y_N - \bar{Y} \end{pmatrix}, \,\, \frac{1}{N} \mathbf{r}^\top\mathbf{r} = \frac{1}{N}\sum_{i=1}^N (Y_i - \bar{Y})^2 $$

In R, if you send only one matrix into crossprod, it computes $\mathbf{r}^\top \mathbf{r}$, so we can simply type:

r <- y - c(barY)  # barY is a 1x1 matrix; c() drops it to a plain scalar
crossprod(r) / N

This is almost equivalent to R's built-in var(y); the difference is that var divides by $N-1$ rather than $N$.
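To see this concretely (a small check using y, r, and N from above):

crossprod(r) / N        # divides by N
var(y) * (N - 1) / N    # rescaling var(y) gives the same value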


Linear models

Now we are ready to put all this to use. Let's start with Galton's example. If we define these matrices:

$$ \mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} , \mathbf{X} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_N \end{pmatrix} , \boldsymbol{\beta} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} $$

Then we can write the model:

$$ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1,\dots,N $$


as:

$$ \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} $$

or simply:

$$ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} $$

which is a much simpler way to write it.
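As a tiny illustration of the notation (made-up numbers; the names xx, XX, and bb are ours, not from the original), the matrix product reproduces the scalar formula entrywise:

xx <- c(1, 2, 3)    # hypothetical covariate values
bb <- c(5, 2)       # hypothetical beta0 and beta1
XX <- cbind(1, xx)  # design matrix: a column of 1s and the covariate
all.equal(as.numeric(XX %*% bb), 5 + 2*xx)  # TRUE: same as beta0 + beta1*x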

The least squares objective becomes simpler as well, since it is the following cross-product:

$$ (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^\top (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) $$
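In R, this objective is easy to write down directly. Here is a minimal sketch (the function rss and its arguments are our own names, not part of base R):

rss <- function(beta, X, y){
  r <- y - X %*% beta       # residuals for a candidate beta
  as.numeric(crossprod(r))  # r'r: the quantity least squares minimizes
}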

So now we are ready to determine which values of $\boldsymbol{\beta}$ minimize the above, which we can do using calculus.

Advanced: Finding the minimum using calculus

There are a series of rules that permit us to compute partial derivatives in matrix notation. By setting the derivative to 0 and solving for $\boldsymbol{\beta}$, we obtain our solution. The only rule we need here tells us that setting the derivative of the objective above to zero gives:

$$ 2 \mathbf{X}^\top (\mathbf{Y} - \mathbf{X} \boldsymbol{\hat{\beta}})=0 $$

$$ \mathbf{X}^\top \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^\top \mathbf{Y} $$

$$ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} $$

and we have our solution. We usually put a hat on the $\beta$ that solves this, $\hat{\boldsymbol{\beta}}$, as it is an estimate of the "real" $\beta$ that generated the data.

Remember that least squares is like a square (we multiply something by itself) and that this formula is similar to the derivative of $f(x)^2$ being $2f(x)f'(x)$.
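For readers who want the intermediate algebra (a sketch using standard matrix-calculus rules, not spelled out in the original), expanding the cross-product gives:

$$ (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^\top (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) = \mathbf{Y}^\top\mathbf{Y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{Y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} $$

and differentiating with respect to $\boldsymbol{\beta}$ yields $2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} - 2\mathbf{X}^\top\mathbf{Y} = -2\mathbf{X}^\top(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})$, which (up to sign) is the expression we set to zero above.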

Finding LSE in R

Let's see how it works in R:

X <- cbind(1, x)  # design matrix: a column of 1s and the father heights
betahat <- solve( t(X) %*% X ) %*% t(X) %*% y
# or, equivalently, using crossprod:
betahat <- solve( crossprod(X) ) %*% crossprod( X, y )
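As a quick numerical check of the calculus above (using the X, y, and betahat just computed), the derivative condition should hold up to floating point error:

crossprod(X, y - X %*% betahat)  # numerically a vector of zeros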

Now we can see the results of this by computing the estimated $\hat{\beta}_0+\hat{\beta}_1 x$ for any value of $x$:

newx <- seq(min(x), max(x), len=100)
X <- cbind(1, newx)
fitted <- X %*% betahat
plot(x, y, xlab="Father's height", ylab="Son's height")
lines(newx, fitted, col=2)  # add the fitted regression line

This $\hat{\boldsymbol{\beta}}=(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}$ is one of the most widely used results in data analysis. One of the advantages of this approach is that we can use it in many different situations. For example, in our falling object problem:

set.seed(1)  # fix the random seed so the simulated errors are reproducible
g <- 9.8  # acceleration of gravity in meters per second squared
n <- 25
tt <- seq(0, 3.4, len=n)  # time in secs; t is a base function, so we use tt
d <- 56.67 - 0.5*g*tt^2 + rnorm(n, sd=1)  # simulated heights with measurement error

Notice that we are using almost exactly the same code:

X <- cbind(1, tt, tt^2)  # design matrix for the quadratic model
y <- d
betahat <- solve(crossprod(X)) %*% crossprod(X, y)
newtt <- seq(min(tt), max(tt), len=100)
X <- cbind(1, newtt, newtt^2)
fitted <- X %*% betahat
plot(tt, d, xlab="Time in seconds", ylab="Distance in meters")
lines(newtt, fitted, col=2)  # fitted quadratic curve
betahat

And the resulting estimates are what we expect:


The Tower of Pisa is about 56 meters high. Since we are just dropping the object, there is no initial velocity, and half the constant of gravity is 9.8/2 = 4.9 meters per second squared. These values correspond to the intercept, the linear coefficient, and (up to sign) the quadratic coefficient we just estimated.

The lm Function

R has a very convenient function that fits these models. We will learn more about this function later, but here is a preview:

X <- cbind(tt, tt^2)  # no column of 1s here: lm adds the intercept for us
fit <- lm(y ~ X)

Note that we obtain the same values as above.
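To confirm the match programmatically (using fit and betahat from above):

cbind(coef(fit), betahat)  # the two columns should agree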


We have shown how to write linear models using linear algebra. We are going to do this for several examples, many of which are related to designed experiments. We also demonstrated how to obtain least squares estimates. Nevertheless, it is important to remember that because $y$ is a random variable, these estimates are random as well. In a later section, we will learn how to compute standard errors for these estimates and use them to perform inference.