## $\S$ 3.5.2. Partial Least Squares

Unlike PCR, partial least squares (PLS) uses $\mathbf{y}$ (in addition to $\mathbf{X}$) to construct a set of linear combinations of the inputs.

Like PCR, PLS is not scale invariant, so we assume that each $\mathbf{x}_j$ is standardized to have mean $0$ and variance $1$.

### Algorithm 3.3. Partial least squares.

1. Standardize each $\mathbf{x}_j$ to have mean $0$ and variance $1$.  
Set
\begin{align}
\hat{\mathbf{y}}^{(0)} &= \bar{y}\mathbf{1} \\
\mathbf{x}_j^{(0)} &= \mathbf{x}_j, \text{ for } j=1,\cdots,p.
\end{align}

2. For $m = 1,2,\cdots,p$
  * $\mathbf{z}_m = \sum_{j=1}^p \hat\rho_{mj}\mathbf{x}_j^{(m-1)}$, where $\hat\rho_{mj} = \langle \mathbf{x}_j^{(m-1)},\mathbf{y}\rangle$.
  * $\hat\theta_m = \langle\mathbf{z}_m,\mathbf{y}\rangle \big/ \langle\mathbf{z}_m,\mathbf{z}_m\rangle$.
  * $\hat{\mathbf{y}}^{(m)} = \hat{\mathbf{y}}^{(m-1)} + \hat\theta_m \mathbf{z}_m$.
  * Orthogonalize each $\mathbf{x}_j^{(m-1)}$ w.r.t. $\mathbf{z}_m$:  
  $\mathbf{x}_j^{(m)} = \mathbf{x}_j^{(m-1)} - \frac{\langle\mathbf{z}_m,\mathbf{x}_j^{(m-1)}\rangle}{\langle\mathbf{z}_m,\mathbf{z}_m\rangle}\mathbf{z}_m, \text{ for } j=1,2,\cdots,p$.

3. Output the sequence of fitted vectors $\left\lbrace \hat{\mathbf{y}}^{(m)}\right\rbrace_1^p$.  
Since the $\left\lbrace \mathbf{z}_l \right\rbrace_1^m$ are linear in the original $\mathbf{x}_j$, so is  
\begin{equation}
\hat{\mathbf{y}}^{(m)} = \mathbf{X}\hat\beta^{\text{pls}}(m).
\end{equation}
These linear coefficients can be recovered from the sequence of PLS transformations.
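The steps of Algorithm 3.3 can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `pls`, the early-exit tolerance `tol`, and the bookkeeping matrix `W` are assumptions introduced here, and the standardization uses the population standard deviation (an arbitrary choice). The matrix `W` tracks how the deflated inputs relate to the standardized ones, which is one way to recover the coefficients $\hat\beta^{\text{pls}}(m)$ on the standardized inputs.

```python
import numpy as np

def pls(X, y, n_components=None, tol=1e-10):
    """Sketch of Algorithm 3.3. Returns (fits, betas, Xs).

    fits[m]  is y_hat^(m) for m = 0..M.
    betas[m] holds coefficients on the standardized inputs Xs, so that
             fits[m] == mean(y) + Xs @ betas[m].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    M = p if n_components is None else n_components

    # Step 1: standardize each column (population std, ddof=0).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Xm = Xs.copy()                  # current x_j^(m-1)
    W = np.eye(p)                   # invariant: Xm == Xs @ W
    yhat = np.full(n, y.mean())     # y_hat^(0)
    beta = np.zeros(p)
    fits, betas = [yhat.copy()], [beta.copy()]
    rho1_norm = None

    # Step 2: build one direction per pass.
    for _ in range(M):
        rho = Xm.T @ y              # rho_mj = <x_j^(m-1), y>
        if rho1_norm is None:
            rho1_norm = np.linalg.norm(rho)
        if np.linalg.norm(rho) <= tol * rho1_norm:
            break                   # nothing left that covaries with y
        z = Xm @ rho                # z_m
        c = W @ rho                 # same direction in terms of Xs: z == Xs @ c
        zz = z @ z
        theta = (z @ y) / zz        # regress y on z_m
        yhat = yhat + theta * z
        beta = beta + theta * c
        proj = (z @ Xm) / zz        # <z_m, x_j^(m-1)> / <z_m, z_m> for each j
        Xm = Xm - np.outer(z, proj) # orthogonalize inputs w.r.t. z_m
        W = W - np.outer(c, proj)
        fits.append(yhat.copy())
        betas.append(beta.copy())

    # Step 3: the sequence of fitted vectors (plus coefficients).
    return np.array(fits), np.array(betas), Xs

# Illustrative synthetic data (not from the text).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 0.5 * X[:, 0]
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=50)
fits, betas, Xs = pls(X, y)
print(fits.shape, betas.shape)
```

The coefficients returned here refer to the standardized inputs; mapping them back to the original scale would require dividing by each column's standard deviation and adjusting the intercept.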

### Gist of the PLS algorithm

PLS begins by computing the weights

\begin{equation}
\hat\rho_{1j} = \langle \mathbf{x}_j,\mathbf{y} \rangle, \text{ for each } j,
\end{equation}
which, since the $\mathbf{x}_j$ are standardized, are the univariate regression coefficients up to an irrelevant constant. This holds only for the first step $m=1$, not for subsequent directions.

From this we construct derived input

\begin{equation}
\mathbf{z}_1 = \sum_j \hat\rho_{1j}\mathbf{x}_j,
\end{equation}

which is the first PLS direction. Hence in the construction of each $\mathbf{z}_m$, the inputs are weighted by the strength of their univariate effect on $\mathbf{y}$.

The outcome $\mathbf{y}$ is regressed on $\mathbf{z}_1$ giving coefficient $\hat\theta_1$, and then we orthogonalize $\mathbf{x}_1,\cdots,\mathbf{x}_p$ w.r.t. $\mathbf{z}_1$.

We continue this process until $M\le p$ directions have been obtained. In this manner, PLS produces a sequence of derived, orthogonal inputs or directions $\mathbf{z}_1,\cdots,\mathbf{z}_M$.

* As with PCR, if $M=p$, then $\hat\beta^{\text{pls}} = \hat\beta^{\text{ls}}$.
* Using $M<p$ directions produces a reduced regression.
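Both properties above, the orthogonality of the derived directions and the agreement with least squares at $M=p$, can be checked numerically. The sketch below uses a hypothetical helper `pls_directions` and synthetic data chosen only for illustration; it assumes the inputs have full column rank.

```python
import numpy as np

def pls_directions(X, y, M):
    """Run M passes of the PLS loop; return unit-length z's and the final fit."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized inputs
    Xm = Xs.copy()
    yhat = np.full(len(y), y.mean())
    Z = []
    for _ in range(M):
        z = Xm @ (Xm.T @ y)                     # z_m from current inputs
        zz = z @ z
        yhat = yhat + (z @ y) / zz * z          # add theta_m * z_m
        Xm = Xm - np.outer(z, z @ Xm) / zz      # orthogonalize w.r.t. z_m
        Z.append(z / np.sqrt(zz))               # normalized only for comparison
    return np.column_stack(Z), yhat, Xs

rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.normal(size=(n, p))
X[:, 1] += 0.6 * X[:, 0]                        # correlated inputs
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

Z, yhat, Xs = pls_directions(X, y, M=p)

# Least squares fit with an intercept on the same standardized inputs.
A = np.column_stack([np.ones(n), Xs])
y_ls = A @ np.linalg.lstsq(A, y, rcond=None)[0]

print("max |Z^T Z - I|      :", np.max(np.abs(Z.T @ Z - np.eye(p))))
print("max |y_pls - y_ls|   :", np.max(np.abs(yhat - y_ls)))
```

Both printed quantities should be at the level of floating-point rounding error.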

### Relation to the optimization problem

> PLS seeks directions that have high variance *and* have high correlation with the response, in contrast to PCR, which keys only on high variance (Stone and Brooks, 1990; Frank and Friedman, 1993).

Since it uses the response $\mathbf{y}$ to construct its directions, its solution path is a nonlinear function of $\mathbf{y}$.

In particular, the $m$th principal component direction $v_m$ solves:

\begin{equation}
\max_\alpha \text{Var}(\mathbf{X}\alpha)\\
\text{subject to } \|\alpha\| = 1, \alpha^T\mathbf{S} v_l = 0 \text{ for } l = 1,\cdots, m-1,
\end{equation}

where $\mathbf{S}$ is the sample covariance matrix of the $\mathbf{x}_j$. The condition $\alpha^T\mathbf{S} v_l= 0$ ensures that $\mathbf{z}_m = \mathbf{X}\alpha$ is uncorrelated with all the previous linear combinations $\mathbf{z}_l = \mathbf{X} v_l$.

The $m$th PLS direction $\hat\rho_m$ solves:

\begin{equation}
\max_\alpha \text{Corr}^2(\mathbf{y},\mathbf{X}\alpha)\text{Var}(\mathbf{X}\alpha)\\
\text{subject to } \|\alpha\| = 1, \alpha^T\mathbf{S}\hat\rho_l = 0 \text{ for } l=1,\cdots, m-1.
\end{equation}
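
For $m=1$ there is no orthogonality constraint, and the criterion $\text{Corr}^2(\mathbf{y},\mathbf{X}\alpha)\text{Var}(\mathbf{X}\alpha)$ can be maximized in closed form. A short derivation, assuming the columns of $\mathbf{X}$ are centered and writing $\tilde{\mathbf{y}} = \mathbf{y} - \bar{y}\mathbf{1}$ and $N$ for the number of observations (notation introduced here for illustration):

\begin{align}
\text{Corr}^2(\mathbf{y},\mathbf{X}\alpha)\text{Var}(\mathbf{X}\alpha)
&= \frac{\text{Cov}^2(\mathbf{y},\mathbf{X}\alpha)}{\text{Var}(\mathbf{y})} \\
&= \frac{\left(\alpha^T\mathbf{X}^T\mathbf{y}\right)^2}{N\,\tilde{\mathbf{y}}^T\tilde{\mathbf{y}}}.
\end{align}

The denominator does not depend on $\alpha$, so by the Cauchy-Schwarz inequality the maximum over $\|\alpha\|=1$ is attained at $\alpha \propto \mathbf{X}^T\mathbf{y}$, whose $j$th entry is $\langle\mathbf{x}_j,\mathbf{y}\rangle = \hat\rho_{1j}$. This recovers the first PLS weight vector, up to normalization.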

Further analysis reveals that the variance aspect tends to dominate, and so PLS behaves much like ridge regression and PCR. We discuss this further in the next section.

If the input matrix $\mathbf{X}$ is orthogonal, then PLS finds the least squares estimates after the first step ($m=1$), and subsequent steps have no effect since $\hat\rho_{mj} = 0$ for $m>1$ (Exercise 3.14).
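The orthogonal case can be illustrated numerically. The sketch below is not a proof of Exercise 3.14, only a check on one synthetic example: it builds standardized inputs with mutually orthogonal columns (via a QR factorization of centered random data, an assumption made for this illustration), runs one PLS step with a hypothetical helper `first_pls_step`, and inspects the weights that the second step would use.

```python
import numpy as np

def first_pls_step(Xs, y):
    """One pass of the PLS loop on standardized inputs Xs.

    Returns the fit after step 1 and the weights rho_2 for step 2.
    """
    z = Xs @ (Xs.T @ y)                       # z_1
    zz = z @ z
    yhat1 = y.mean() + (z @ y) / zz * z       # y_hat^(1)
    X1 = Xs - np.outer(z, z @ Xs) / zz        # deflated inputs x_j^(1)
    rho2 = X1.T @ y                           # <x_j^(1), y>
    return yhat1, rho2

rng = np.random.default_rng(2)
n, p = 30, 4
G = rng.normal(size=(n, p))
Q, _ = np.linalg.qr(G - G.mean(axis=0))       # mean-zero, orthonormal columns
Xs = Q / Q.std(axis=0)                        # unit variance, still orthogonal
y = Xs @ np.array([1.0, 0.0, -3.0, 2.0]) + rng.normal(size=n)

yhat1, rho2 = first_pls_step(Xs, y)

# Least squares fit with an intercept, for comparison.
A = np.column_stack([np.ones(n), Xs])
y_ls = A @ np.linalg.lstsq(A, y, rcond=None)[0]

print("max |y_hat^(1) - y_ls|:", np.max(np.abs(yhat1 - y_ls)))
print("relative size of rho_2:", np.linalg.norm(rho2) / np.linalg.norm(Xs.T @ y))
```

Both printed values should be close to machine precision, consistent with the claim that the first step already reaches the least squares fit.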

It can also be shown that the sequence of PLS coefficients for $m=1,2,\cdots,p$ represents the conjugate gradient sequence for computing the least squares solutions (Exercise 3.18).
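
A numerical illustration of this correspondence, under stated assumptions: the sketch below runs the textbook conjugate gradient iteration on the normal equations $\mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{y}$ for standardized $\mathbf{X}$, started at $\beta = 0$, and compares the fitted vectors $\bar{y}\mathbf{1} + \mathbf{X}\beta^{(m)}$ with the PLS fits $\hat{\mathbf{y}}^{(m)}$. The helper names `pls_fits` and `cg_iterates` and the synthetic data are introduced here for illustration; this is a check on one example, not a solution to Exercise 3.18.

```python
import numpy as np

def pls_fits(Xs, y, M):
    """Fitted vectors y_hat^(1..M) from the PLS loop on standardized Xs."""
    Xm = Xs.copy()
    yhat = np.full(len(y), y.mean())
    out = []
    for _ in range(M):
        z = Xm @ (Xm.T @ y)
        zz = z @ z
        yhat = yhat + (z @ y) / zz * z
        Xm = Xm - np.outer(z, z @ Xm) / zz
        out.append(yhat.copy())
    return np.array(out)

def cg_iterates(A, b, M):
    """First M conjugate gradient iterates for A beta = b, started at 0."""
    beta = np.zeros_like(b)
    r = b.copy()                      # residual b - A beta
    d = r.copy()                      # search direction
    out = []
    for _ in range(M):
        Ad = A @ d
        step = (r @ r) / (d @ Ad)     # exact line search along d
        beta = beta + step * d
        r_new = r - step * Ad
        d = r_new + (r_new @ r_new) / (r @ r) * d
        r = r_new
        out.append(beta.copy())
    return np.array(out)

rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.normal(size=(n, p))
X[:, 2] += 0.7 * X[:, 0]              # correlated inputs
y = X @ np.array([1.5, -1.0, 0.0, 2.0]) + rng.normal(size=n)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

fits_pls = pls_fits(Xs, y, M=p)
betas_cg = cg_iterates(Xs.T @ Xs, Xs.T @ (y - y.mean()), M=p)
fits_cg = y.mean() + betas_cg @ Xs.T  # row m is the fit from iterate m

print("max difference per step:", np.max(np.abs(fits_pls - fits_cg), axis=1))
```

The comparison is made on fitted vectors rather than coefficients, which avoids having to recover $\hat\beta^{\text{pls}}(m)$ explicitly.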