## $\S$ 3.2.3. Multiple Regression from Simple Univariate Regression

The linear model with $p \gt 1$ inputs is called the *multiple linear regression model*. The least squares estimates for this model are best understood in terms of the estimates for the *univariate* ($p=1$) linear model, as we indicate in this section.

Consider first a univariate model with no intercept,

\begin{equation}
Y=X\beta+\epsilon.
\end{equation}

The least squares estimate and residuals are

\begin{align}
\hat\beta &= \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^N x_i^2} = \frac{\langle\mathbf{x},\mathbf{y}\rangle}{\langle\mathbf{x},\mathbf{x}\rangle}, \\
r_i &= y_i - x_i\hat\beta, \\
\mathbf{r} &= \mathbf{y} - \mathbf{x}\hat\beta,
\end{align}

where $\mathbf{y}=(y_1,\cdots,y_N)^T$, $\mathbf{x}=(x_1,\cdots,x_N)^T$ and $\langle\cdot,\cdot\rangle$ denotes the inner (dot) product.
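As a minimal sketch of the formulas above, the following NumPy snippet computes $\hat\beta$ and $\mathbf{r}$ from inner products. The helper name `simple_regression` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def simple_regression(x, y):
    """Regress y on x with no intercept (hypothetical helper).

    Returns the coefficient <x, y> / <x, x> and the residual y - x * beta_hat.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    beta_hat = np.dot(x, y) / np.dot(x, x)
    residual = y - beta_hat * x
    return beta_hat, residual

# Toy data, invented for illustration.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.1, 5.9])
beta_hat, r = simple_regression(x, y)

# The residual is orthogonal to x, up to floating-point error.
print(beta_hat, np.dot(x, r))
```

The orthogonality $\langle\mathbf{x},\mathbf{r}\rangle = 0$ is what makes the residual usable as an "adjusted" vector in the constructions that follow.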

This simple univariate regression provides the building block for multiple linear regression.

### Building blocks for multiple linear regression

Suppose next that the columns of the data matrix $\mathbf{X} = \left[\mathbf{x}_1,\cdots,\mathbf{x}_p\right]$ are orthogonal, i.e., 

\begin{equation}
\langle \mathbf{x}_j,\mathbf{x}_k\rangle = 0\text{ for all }j\neq k.
\end{equation}

Then it is easy to check that the multiple least squares estimates $\hat\beta_j$ are equal to the univariate estimates $\frac{\langle\mathbf{x}_j,\mathbf{y}\rangle}{\langle\mathbf{x}_j,\mathbf{x}_j\rangle}$. In other words, when the inputs are orthogonal, they have no effect on each other's parameter estimates in the model. Orthogonal inputs almost never occur with observational data, however, so to carry this idea further we need to orthogonalize the inputs ourselves.
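The claim for orthogonal columns can be checked numerically. The sketch below uses a small hand-built matrix with mutually orthogonal columns (an assumption chosen for illustration) and compares the joint fit from `np.linalg.lstsq` with the column-by-column univariate estimates.

```python
import numpy as np

# Columns are mutually orthogonal by construction (illustrative data).
X = np.array([[1.0,  1.0,  1.0],
              [1.0, -1.0,  1.0],
              [1.0,  1.0, -1.0],
              [1.0, -1.0, -1.0]])
y = np.array([3.0, 1.0, 4.0, 1.5])

# Joint least squares fit on all columns at once.
beta_multiple, *_ = np.linalg.lstsq(X, y, rcond=None)

# Univariate estimates <x_j, y> / <x_j, x_j>, one column at a time.
beta_univariate = (X.T @ y) / np.sum(X * X, axis=0)

print(np.allclose(beta_multiple, beta_univariate))
```

The agreement follows because $\mathbf{X}^T\mathbf{X}$ is diagonal when the columns are orthogonal, so the normal equations decouple.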

Suppose next that we have an intercept and a single input $\mathbf{x}$. Then the least squares coefficient of $\mathbf{x}$ has the form

\begin{equation}
\hat\beta_1 = \frac{\langle\mathbf{x}-\bar{x}\mathbf{1},\mathbf{y}\rangle}{\langle\mathbf{x}-\bar{x}\mathbf{1},\mathbf{x}-\bar{x}\mathbf{1}\rangle},
\end{equation}

where $\bar{x} = \sum_i x_i /N$ and $\mathbf{1} = \mathbf{x}_0$ is the vector of $N$ ones. Note also that

\begin{equation}
\bar{x}\mathbf{1} = \frac{\langle\mathbf{1},\mathbf{x}\rangle}{\langle\mathbf{1},\mathbf{1}\rangle}\mathbf{1},
\end{equation}

which is the fitted value when we regress $\mathbf{x}$ on $\mathbf{x}_0=\mathbf{1}$. We can therefore view $\hat\beta_1$ as the result of two applications of the simple regression, with the following steps:
1. regress $\mathbf{x}$ on $\mathbf{1}$ to produce the residual $\mathbf{z}=\mathbf{x}-\bar{x}\mathbf{1}$;
2. regress $\mathbf{y}$ on the residual $\mathbf{z}$ to give the coefficient $\hat\beta_1$.

> In this procedure, "regress $\mathbf{b}$ on $\mathbf{a}$" means a simple univariate regression of $\mathbf{b}$ on $\mathbf{a}$ with no intercept, producing coefficient $\hat\gamma=\langle\mathbf{a},\mathbf{b}\rangle/\langle\mathbf{a},\mathbf{a}\rangle$ and residual vector $\mathbf{b}-\hat\gamma\mathbf{a}$. We say that $\mathbf{b}$ is adjusted for $\mathbf{a}$, or is "orthogonalized" w.r.t. $\mathbf{a}$.

In other words,
1. orthogonalize $\mathbf{x}$ w.r.t. $\mathbf{x}_0=\mathbf{1}$;
2. fit $\mathbf{y}$ by simple univariate regressions on the orthogonal predictors $\mathbf{1}$ and $\mathbf{z}$.

The orthogonalization does not change the subspace spanned by $\mathbf{x}_0$ and $\mathbf{x}_1$; it simply produces an orthogonal basis for representing it. See Figure 3.4 in the textbook for a geometric interpretation.
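The two-step recipe can be sketched in a few lines and compared with a direct fit that includes an intercept column. The simulated data, the seed, and the variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(size=N)
y = 1.5 + 2.0 * x + rng.normal(scale=0.1, size=N)  # simulated response
ones = np.ones(N)

# Step 1: regress x on 1; the coefficient is the sample mean of x.
gamma = np.dot(ones, x) / np.dot(ones, ones)
z = x - gamma * ones

# Step 2: regress y on the residual z to get the slope.
beta1 = np.dot(z, y) / np.dot(z, z)

# Reference: joint least squares fit on [1, x].
coef, *_ = np.linalg.lstsq(np.column_stack([ones, x]), y, rcond=None)

print(beta1, coef[1])
```

Both routes should give the same slope up to floating-point error, since $\mathbf{1}$ and $\mathbf{z}$ span the same subspace as $\mathbf{1}$ and $\mathbf{x}$.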

This recipe generalizes to the case of $p$ inputs.

#### Algorithm 3.1. Regression by successive orthogonalization
1. Initialize $\mathbf{z}_0=\mathbf{x}_0=\mathbf{1}$.
2. For $j = 1, 2, \cdots, p$,  
  regress $\mathbf{x}_j$ on $\mathbf{z}_0,\mathbf{z}_1,\cdots,\mathbf{z}_{j-1}$ to produce
  * coefficients $\hat\gamma_{kj} = \langle\mathbf{z}_k,\mathbf{x}_j\rangle/\langle\mathbf{z}_k,\mathbf{z}_k\rangle$, for $k=0,\cdots, j-1$ and
  * residual vector $\mathbf{z}_j=\mathbf{x}_j - \sum_{k=0}^{j-1}\hat\gamma_{kj}\mathbf{z}_k$.
3. Regress $\mathbf{y}$ on the residual $\mathbf{z}_p$ to give the estimate $\hat\beta_p$,
\begin{equation}
\hat\beta_p = \frac{\langle\mathbf{z}_p,\mathbf{y}\rangle}{\langle\mathbf{z}_p,\mathbf{z}_p\rangle}.
\end{equation}
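A minimal NumPy sketch of Algorithm 3.1 follows. The function name `successive_orthogonalization`, the convention that the first column of `X` is the vector of ones, and the simulated data are assumptions for illustration; the last coefficient is compared with `np.linalg.lstsq` as a sanity check.

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Sketch of Algorithm 3.1 (hypothetical helper).

    X is N x (p+1) with a first column of ones. Returns Z (orthogonal
    residual columns), Gamma (unit upper triangular) and beta_p.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_cols = X.shape[1]
    Z = np.zeros_like(X)
    Gamma = np.eye(n_cols)
    Z[:, 0] = X[:, 0]                      # step 1: z_0 = x_0 = 1
    for j in range(1, n_cols):             # step 2
        Z[:, j] = X[:, j]
        for k in range(j):
            # gamma_kj = <z_k, x_j> / <z_k, z_k>
            Gamma[k, j] = Z[:, k] @ X[:, j] / (Z[:, k] @ Z[:, k])
            Z[:, j] -= Gamma[k, j] * Z[:, k]
    z_p = Z[:, -1]                         # step 3
    beta_p = z_p @ y / (z_p @ z_p)
    return Z, Gamma, beta_p

# Simulated data with correlated inputs (illustrative only).
rng = np.random.default_rng(0)
N = 40
x1 = rng.normal(size=N)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=N)
X = np.column_stack([np.ones(N), x1, x2])
y = 1.0 + 2.0 * x1 - x2 + rng.normal(scale=0.1, size=N)

Z, Gamma, beta_p = successive_orthogonalization(X, y)
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_p, beta_ref[-1])
```

The coefficients are computed from the original $\mathbf{x}_j$, as in the algorithm statement. Numerically more stable variants exist, but they are outside the scope of this sketch.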

In step 2, $\mathbf{x}_j = \mathbf{z}_j + \sum_{k=0}^{j-1}\hat\gamma_{kj}\mathbf{z}_k$, so each $\mathbf{x}_j$ is a linear combination of the $\mathbf{z}_k$ for $k\le j$. Since the $\mathbf{z}_j$ are all orthogonal, they form a basis for $\text{col}(\mathbf{X})$, and hence the least squares projection onto this subspace is $\hat{\mathbf{y}}$. Since $\mathbf{z}_p$ alone involves $\mathbf{x}_p$ (with coefficient 1), we see that the coefficient $\hat\beta_p$ is indeed the multiple regression coefficient of $\mathbf{y}$ on $\mathbf{x}_p$. By rearranging the $\mathbf{x}_j$, any one of them could be placed in the last position, so a similar result holds for every $\hat\beta_j$. This key result exposes the effect of correlated inputs in multiple regression.

> The multiple regression coefficient $\hat\beta_j$ represents the additional contribution of $\mathbf{x}_j$ on $\mathbf{y}$, after $\mathbf{x}_j$ has been adjusted for $\mathbf{x}_0,\mathbf{x}_1,\cdots,\mathbf{x}_{j-1},\mathbf{x}_{j+1},\cdots,\mathbf{x}_p$.

From step 3 we also obtain an alternative formula for the variance of $\hat\beta_p$,

\begin{equation}
\text{Var}(\hat\beta_p) = \text{Var}\left(\frac{\langle\mathbf{z}_p,\mathbf{y}\rangle}{\langle\mathbf{z}_p,\mathbf{z}_p\rangle}\right) = \frac{\sigma^2}{\|\mathbf{z}_p\|^2},
\end{equation}

implying that the precision with which we can estimate $\hat\beta_p$ depends on the length of the residual vector $\mathbf{z}_p$; this length represents how much of $\mathbf{x}_p$ is unexplained by the other $\mathbf{x}_k$'s.
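As a hedged numerical check, $\sigma^2/\|\mathbf{z}_p\|^2$ should agree with the last diagonal entry of the usual covariance expression $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$. In the sketch below, $\mathbf{z}_p$ is obtained as the residual of regressing $\mathbf{x}_p$ on the other columns; the simulated data and the value of `sigma2` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
X[:, p] += 0.9 * X[:, 1]   # make the last input correlated with the first
sigma2 = 1.0               # noise variance, assumed known here

# z_p: residual of x_p after regressing it on all the other columns.
others = X[:, :p]
coef, *_ = np.linalg.lstsq(others, X[:, p], rcond=None)
z_p = X[:, p] - others @ coef

var_from_residual = sigma2 / (z_p @ z_p)
var_from_inverse = sigma2 * np.linalg.inv(X.T @ X)[p, p]

print(var_from_residual, var_from_inverse)
```

Increasing the correlation between $\mathbf{x}_p$ and the other inputs shortens $\mathbf{z}_p$ and therefore inflates this variance.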

### Gram-Schmidt procedure and QR decomposition

Algorithm 3.1 is known as the *Gram-Schmidt* procedure for multiple regression, and is also a useful numerical strategy for computing the estimates. We can represent step 2 of Algorithm 3.1 in matrix form:

\begin{equation}
\mathbf{X} = \mathbf{Z\Gamma},
\end{equation}

where
* $\mathbf{Z}$ has as columns the $\mathbf{z}_j$ (in order),
* $\mathbf{\Gamma}$ is the upper triangular matrix with entries $\hat\gamma_{kj}$ above the diagonal and ones on the diagonal.

Introducing the diagonal matrix $\mathbf{D}$ with $D_{jj}=\|\mathbf{z}_j\|$, we get

\begin{align}
\mathbf{X} &= \mathbf{ZD}^{-1}\mathbf{D\Gamma} \\
&= \mathbf{QR},
\end{align}

the so-called *QR decomposition* of $\mathbf{X}$. Here
* $\mathbf{Q}$ is an $N\times(p+1)$ matrix with orthonormal columns, so that $\mathbf{Q}^T\mathbf{Q}=\mathbf{I}$,
* $\mathbf{R}$ is a $(p+1)\times(p+1)$ upper triangular matrix.
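The construction $\mathbf{Q}=\mathbf{ZD}^{-1}$, $\mathbf{R}=\mathbf{D\Gamma}$ can be sketched directly from the Gram-Schmidt quantities. The data below are simulated for illustration, and the comparison with `np.linalg.qr` is only up to sign, since library routines may choose different signs for the columns of $\mathbf{Q}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 30, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])

# Gram-Schmidt as in Algorithm 3.1, giving X = Z Gamma.
Z = X.copy()
Gamma = np.eye(p + 1)
for j in range(1, p + 1):
    for k in range(j):
        Gamma[k, j] = Z[:, k] @ X[:, j] / (Z[:, k] @ Z[:, k])
        Z[:, j] -= Gamma[k, j] * Z[:, k]

# D is diagonal with D_jj = ||z_j||; stored here as a vector of norms.
norms = np.linalg.norm(Z, axis=0)
Q = Z / norms                # Q = Z D^{-1}: scale each column to unit length
R = norms[:, None] * Gamma   # R = D Gamma: scale each row of Gamma

# Library QR for comparison; entries may differ in sign.
Q_np, R_np = np.linalg.qr(X)
print(np.allclose(np.abs(R), np.abs(R_np)))
```

Storing $\mathbf{D}$ as a vector and using broadcasting avoids forming and inverting a diagonal matrix explicitly.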

The QR decomposition represents a convenient orthogonal basis for $\text{col}(\mathbf{X})$. It is easy to see, for example, that the least squares solution is given by

\begin{align}
\hat\beta &= \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y}, \\
\hat{\mathbf{y}} &= \mathbf{QQ}^T\mathbf{y}.
\end{align}

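Because $\mathbf{R}$ is upper triangular, the system $\mathbf{R}\hat\beta = \mathbf{Q}^T\mathbf{y}$ is easy to solve by back-substitution, without forming $\mathbf{R}^{-1}$ explicitly.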
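The sketch below illustrates this last point: it solves $\mathbf{R}\hat\beta=\mathbf{Q}^T\mathbf{y}$ by back-substitution and forms $\hat{\mathbf{y}}=\mathbf{QQ}^T\mathbf{y}$. The helper `back_substitution` and the simulated data are assumptions for illustration; `np.linalg.qr` is used here to obtain $\mathbf{Q}$ and $\mathbf{R}$.

```python
import numpy as np

def back_substitution(R, b):
    """Solve R beta = b for upper triangular R (hypothetical helper)."""
    n = R.shape[0]
    beta = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # Subtract the already-solved terms, then divide by the diagonal.
        beta[i] = (b[i] - R[i, i + 1:] @ beta[i + 1:]) / R[i, i]
    return beta

# Simulated data (illustrative only).
rng = np.random.default_rng(3)
N, p = 60, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.arange(1.0, p + 2.0) + rng.normal(scale=0.1, size=N)

Q, R = np.linalg.qr(X)            # reduced QR: Q is N x (p+1)
beta_hat = back_substitution(R, Q.T @ y)
y_hat = Q @ (Q.T @ y)             # projection of y onto col(X)

beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_ref))
```

Back-substitution costs on the order of $(p+1)^2$ operations, which is why the triangular form is convenient.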