# The Conjugate Gradient Method

In this discussion, we will explore
* The linear conjugate gradient method

## Line Search Review

Recall that every line search algorithm beginning at $\mathbf{x}_k$ chooses some direction $\mathbf{p}_k$ in which to step and then chooses some positive scalar $\alpha_k$ which determines the step length. That is, set
$$ \mathbf{x}_{k+1}=\mathbf{x}_k+\alpha_k\mathbf{p}_k $$

and repeat until convergence. We have discussed many algorithms of this type so far in this course, each with a different choice of step direction, $\mathbf{p}_k$. For each choice, we have also talked about various ways of determining or at least approximating the optimal value of $\alpha_k$ during each iteration, but the choice of step direction is really what defines the method; any algorithm for choosing a step size can be used in conjunction with any choice of step direction.
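The generic iteration can be sketched as a small loop in which the direction rule and the step-size rule are interchangeable callables. This is only an illustration: the function name `line_search`, its parameters, the gradient-norm stopping test, and the fixed step of 0.1 are all assumptions made for the example, not something fixed by the course.

```python
def line_search(grad, direction, step_length, x0, tol=1e-8, max_iter=1000):
    """Generic line search: x_{k+1} = x_k + alpha_k * p_k (hypothetical helper)."""
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        # Assumed stopping rule: quit once the gradient norm is below tol.
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            return x, k
        p = direction(x, g)            # the choice of p_k defines the method
        alpha = step_length(x, p, g)   # any step-size rule can be plugged in
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    return x, max_iter


# Illustrative use: steepest descent with a fixed step on f(x) = x_1^2 + 2 x_2^2.
grad = lambda x: [2.0 * x[0], 4.0 * x[1]]
steepest = lambda x, g: [-gi for gi in g]
fixed_step = lambda x, p, g: 0.1
x_min, iters = line_search(grad, steepest, fixed_step, [1.0, 1.0])
```

Swapping `steepest` for another direction rule changes the method without touching the loop, which is the sense in which the direction defines the algorithm.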

Among the most straightforward choices of step direction is that of **steepest descent**, where we choose $\mathbf{p}_k=-\nabla f_k$. We showed that this method converges at least linearly for most "nice" functions. We also explored the properties of **Newton's method**, where $\mathbf{p}_k = -[\nabla^2f_k]^{-1}\nabla f_k$, and showed that it converges quadratically, but only if the initial guess is "close enough" to the minimizer. To improve global convergence and to reduce the computational cost of Newton's method, we introduced several **quasi-Newton methods**, which set $\mathbf{p}_k = -H_k\nabla f_k$, where $H_k$ is some approximation to the inverse Hessian used in the "pure" Newton's method. Different choices of $H_k$ led to different quasi-Newton methods, but all of the methods discussed still fall under the larger umbrella of line search methods.

## Conjugate Gradient Method

Today we describe yet another line search method, the **conjugate gradient method**, which corresponds to yet another choice of step direction. Before defining the actual iteration, let us first recall from lecture that two vectors $\mathbf{u}$ and $\mathbf{v}$ are said to be $Q$-conjugate if $\mathbf{u}^TQ\mathbf{v}=0$, a kind of modified orthogonality condition. Indeed, if $Q$ is symmetric positive definite (SPD), we can take its [square root](https://en.wikipedia.org/wiki/Square_root_of_a_matrix), so that $\mathbf{u}^TQ\mathbf{v} = (Q^{1/2}\mathbf{u})^T(Q^{1/2}\mathbf{v})$ and we can think of $Q$-conjugacy as orthogonality in the space transformed by $Q^{1/2}$.
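As a quick numerical illustration of the definition, the sketch below checks a pair of vectors that are $Q$-conjugate without being orthogonal in the usual sense. The matrix `Q`, the vectors `u` and `v`, and the helper name `q_inner` are made up for this example.

```python
def q_inner(u, Q, v):
    """Return u^T Q v for a matrix Q stored as a list of rows."""
    Qv = [sum(qij * vj for qij, vj in zip(row, v)) for row in Q]
    return sum(ui * qvi for ui, qvi in zip(u, Qv))


Q = [[2.0, 0.0], [0.0, 1.0]]   # a small SPD matrix chosen for illustration
u = [1.0, 1.0]
v = [1.0, -2.0]

conjugate_product = q_inner(u, Q, v)                  # u^T Q v
ordinary_product = sum(a * b for a, b in zip(u, v))   # u^T v
```

Here `conjugate_product` is zero while `ordinary_product` is not, so conjugacy really is a different condition from ordinary orthogonality unless $Q$ is a multiple of the identity.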

It was shown that each new step of the quasi-Newton methods was $Q$-conjugate to all previous steps, which led to the proof that SR1 converges in at most $n$ steps if the objective function is quadratic in $n$ variables, since it is not possible to "cram" $n+1$ nonzero, mutually $Q$-conjugate vectors into $\mathbb{R}^n$. The conjugate gradient method seeks to mimic this property in the following way:

1. Set $\mathbf{p}_0 = -\nabla f_0$, as in steepest descent.
2. Determine $\alpha_k$ and update $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{p}_k$.
3. Set $\mathbf{p}_{k+1} = -\nabla f_{k+1} + \beta_k\mathbf{p}_k$, where $\beta_k$ is chosen so that $\mathbf{p}_{k+1}$ and $\mathbf{p}_k$ are conjugate in some sense.
4. Repeat 2-3 until convergence.

### Linear conjugate gradient method (Quadratic functions)

The conjugate gradient method was first introduced as an iterative method to solve *linear* equations of the form $Q\mathbf{x}=\mathbf{b}$ where $Q$ is SPD and the dimensionality of the problem is so large that direct methods like Gaussian elimination are time- or even storage-prohibitive. As we have seen before, solving $Q\mathbf{x}=\mathbf{b}$ is equivalent to minimizing the quadratic function $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^TQ\mathbf{x} - \mathbf{b}^T\mathbf{x}$, since $\nabla f(\mathbf{x}) = Q\mathbf{x}-\mathbf{b}$ and thus $\nabla f(\mathbf{x})=\mathbf{0}$ exactly when $Q\mathbf{x}=\mathbf{b}$; because $Q$ is SPD, $f$ is strictly convex, so this stationary point is its unique minimizer.
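The equivalence can be checked by hand on a tiny system. The sketch below uses an arbitrary $2\times 2$ SPD matrix and right-hand side (both assumptions for illustration), evaluates the gradient at the known solution, and compares the objective with a nearby point. The helper names are hypothetical.

```python
def matvec(Q, v):
    """Matrix-vector product for Q stored as a list of rows."""
    return [sum(qij * vj for qij, vj in zip(row, v)) for row in Q]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def f(Q, b, x):
    """Quadratic objective f(x) = 1/2 x^T Q x - b^T x."""
    return 0.5 * dot(x, matvec(Q, x)) - dot(b, x)


def grad_f(Q, b, x):
    """Gradient of the quadratic: Q x - b."""
    return [qi - bi for qi, bi in zip(matvec(Q, x), b)]


Q = [[4.0, 1.0], [1.0, 3.0]]          # illustrative SPD matrix
b = [1.0, 2.0]
x_star = [1.0 / 11.0, 7.0 / 11.0]     # solves Q x = b (worked out by hand)

g_star = grad_f(Q, b, x_star)         # should vanish up to rounding
x_near = [x_star[0] + 0.1, x_star[1] - 0.2]
gap = f(Q, b, x_near) - f(Q, b, x_star)   # positive if x_star is the minimizer
```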

As in a previous calculation for quadratic functions (see Discussion 5), we can easily determine the optimal value of $\alpha_k$ for a given step by taking the derivative of $\phi(\alpha)=f(\mathbf{x}_k +\alpha\mathbf{p}_k)$ and setting it equal to zero, giving

$$ \begin{align*}
    0 &= \mathbf{p}_k^T\big(Q(\mathbf{x}_k+\alpha\mathbf{p}_k)-\mathbf{b}\big) \\
    &= \mathbf{p}_k^T(Q\mathbf{x}_k-\mathbf{b}) + \alpha\mathbf{p}_k^TQ\mathbf{p}_k \\
    &= \mathbf{p}_k^T\nabla f_k + \alpha\mathbf{p}_k^TQ\mathbf{p}_k \\
    \implies &\alpha_k = -\frac{\mathbf{p}_k^T\nabla f_k}{\mathbf{p}_k^TQ\mathbf{p}_k}
\end{align*}$$
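A minimal sketch of this exact step size, assuming plain Python lists for vectors and a list of rows for $Q$, is shown below. The function name `exact_step` and the test data are illustrative choices; the check at the end confirms that $\phi'(\alpha_k)$ vanishes at the computed step.

```python
def matvec(Q, v):
    """Matrix-vector product for Q stored as a list of rows."""
    return [sum(qij * vj for qij, vj in zip(row, v)) for row in Q]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def exact_step(Q, b, x, p):
    """alpha = -(p^T grad f(x)) / (p^T Q p) for f(x) = 1/2 x^T Q x - b^T x."""
    g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
    return -dot(p, g) / dot(p, matvec(Q, p))


Q = [[4.0, 1.0], [1.0, 3.0]]   # illustrative SPD matrix
b = [1.0, 2.0]
x = [2.0, 1.0]
p = [-8.0, -3.0]               # steepest descent direction -grad f(x) at this x

alpha = exact_step(Q, b, x, p)

# phi'(alpha) = p^T grad f(x + alpha p) should be zero at the optimal step.
x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
slope = dot(p, [qi - bi for qi, bi in zip(matvec(Q, x_new), b)])
```

For this data the formula gives $\alpha = 73/331$, since $\mathbf{p}^T\nabla f = -73$ and $\mathbf{p}^TQ\mathbf{p} = 331$.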

With this step size, we now show that if someone sets in our lap a magical set $\{\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_{n-1}\}$ of nonzero vectors which happens to be a $Q$-conjugate (and thus linearly independent) family, we have convergence in at most $n$ steps, where $\mathbf{x}\in\mathbb{R}^n$. First, note that since the $n$ direction vectors are linearly independent, they form a basis of $\mathbb{R}^n$, and thus we can write any vector as a unique linear combination of them. In particular, we can write

$$ \mathbf{x}^*-\mathbf{x}_0 = \sigma_0\mathbf{p}_0 + \sigma_1\mathbf{p}_1  + \cdots + \sigma_{n-1}\mathbf{p}_{n-1}$$

Multiplying the above on the left by $\mathbf{p}_k^TQ$ and using conjugacy, we have

$$ \mathbf{p}_k^TQ(\mathbf{x}^*-\mathbf{x}_0) = \sigma_0\mathbf{p}_k^TQ\mathbf{p}_0 + \sigma_1\mathbf{p}_k^TQ\mathbf{p}_1  + \cdots + \sigma_{n-1}\mathbf{p}_k^TQ\mathbf{p}_{n-1} = \sigma_k\mathbf{p}_k^TQ\mathbf{p}_k \implies \sigma_k = \frac{\mathbf{p}_k^TQ(\mathbf{x}^*-\mathbf{x}_0)}{\mathbf{p}_k^TQ\mathbf{p}_k}$$

We now note that
$$ \mathbf{x}_k = \mathbf{x}_0 + \alpha_0\mathbf{p}_0 + \alpha_1\mathbf{p}_1 + \cdots + \alpha_{k-1}\mathbf{p}_{k-1}$$
which by the same premultiplying trick leads to
$$ \mathbf{p}_k^TQ(\mathbf{x}_k-\mathbf{x}_0) = 0$$
Then the numerator of $\sigma_k$ above is equal to

$$ \mathbf{p}_k^TQ(\mathbf{x}^* - \mathbf{x}_0) = \mathbf{p}_k^TQ(\mathbf{x}^*-\mathbf{x}_k+\mathbf{x}_k-\mathbf{x}_0) = \mathbf{p}_k^TQ(\mathbf{x}^*-\mathbf{x}_k) = \mathbf{p}_k^T(\mathbf{b}-Q\mathbf{x}_k) = -\mathbf{p}_k^T\nabla f_k $$

i.e. $\sigma_k=\alpha_k$. Thus $\mathbf{x}^* = \mathbf{x}_0 + \alpha_0\mathbf{p}_0 + \cdots + \alpha_{n-1}\mathbf{p}_{n-1}$ and we have shown convergence in at most $n$ steps.
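The argument above can be tried numerically. In the sketch below the "magical" conjugate set is produced by a Gram-Schmidt-style conjugation of the standard basis; that construction is an assumption made purely so the example has some directions to work with, and it is not the conjugate gradient update itself. All names and the $3\times 3$ test system are illustrative.

```python
def matvec(Q, v):
    return [sum(qij * vj for qij, vj in zip(row, v)) for row in Q]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_directions(Q):
    """Build n mutually Q-conjugate vectors from the standard basis (assumed construction)."""
    n = len(Q)
    dirs = []
    for i in range(n):
        e = [1.0 if j == i else 0.0 for j in range(n)]
        p = e[:]
        for d in dirs:
            Qd = matvec(Q, d)
            coeff = dot(e, Qd) / dot(d, Qd)   # remove the Q-component along d
            p = [pi - coeff * di for pi, di in zip(p, d)]
        dirs.append(p)
    return dirs


def conjugate_direction_solve(Q, b, x0, dirs):
    """Take one exact step along each conjugate direction."""
    x = list(x0)
    for p in dirs:
        g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
        alpha = -dot(p, g) / dot(p, matvec(Q, p))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    return x


Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]   # illustrative SPD matrix
b = [1.0, 2.0, 3.0]
dirs = conjugate_directions(Q)
x = conjugate_direction_solve(Q, b, [0.0, 0.0, 0.0], dirs)
residual = [qi - bi for qi, bi in zip(matvec(Q, x), b)]   # near zero after n = 3 steps
```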

Now the question becomes, how does one generate such a set of $Q$-conjugate step directions? This is where the conjugate gradient update comes into play: If $\mathbf{p}_{k+1} = -\nabla f_{k+1} + \beta_k\mathbf{p}_k$ then multiplying on the left by $\mathbf{p}_k^TQ$ gives

$$\mathbf{p}_k^TQ \mathbf{p}_{k+1} = \mathbf{p}_k^TQ(-\nabla f_{k+1} + \beta_k\mathbf{p}_k) $$

which vanishes precisely when

$$ \beta_k = \frac{\mathbf{p}_k^TQ\nabla f_{k+1}}{\mathbf{p}_k^TQ\mathbf{p}_k} $$

We skip the details here but it is shown in lecture that even though we only assert $\mathbf{p}_k$ and $\mathbf{p}_{k+1}$ are $Q$-conjugate, this choice of $\beta_k$ happens to give us $Q$-conjugacy with *all* previous step directions, as desired! In fact an even stronger intermediate result is that $\nabla f_k^T\mathbf{p}_j=0$ for all $j<k$.

Thus, a self-contained way of implementing the linear conjugate gradient method is as follows:

1. Set $\mathbf{p}_0 = -\nabla f_0$, as in steepest descent.
2. Set $\alpha_k = -\dfrac{\mathbf{p}_k^T\nabla f_k}{\mathbf{p}_k^TQ\mathbf{p}_k}$ and update $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{p}_k$.
3. Set $\beta_k =\dfrac{\mathbf{p}_k^TQ\nabla f_{k+1}}{\mathbf{p}_k^TQ\mathbf{p}_k}$ and update $\mathbf{p}_{k+1} = -\nabla f_{k+1} + \beta_k\mathbf{p}_k$.
4. Repeat 2-3 until convergence.
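The four steps above translate almost line for line into code. The following is a minimal sketch in plain Python, assuming list-based vectors, a gradient-norm stopping test with tolerance `tol`, and a cap of $n$ iterations; the function name `linear_cg_basic` and the test system are hypothetical.

```python
def matvec(Q, v):
    return [sum(qij * vj for qij, vj in zip(row, v)) for row in Q]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def gradient(Q, b, x):
    """Gradient of f(x) = 1/2 x^T Q x - b^T x, i.e. Q x - b."""
    return [qi - bi for qi, bi in zip(matvec(Q, x), b)]


def linear_cg_basic(Q, b, x0, tol=1e-10):
    x = list(x0)
    g = gradient(Q, b, x)
    p = [-gi for gi in g]                       # step 1: p_0 = -grad f_0
    n = len(b)
    for k in range(n):
        if dot(g, g) ** 0.5 < tol:              # assumed stopping rule
            return x, k
        pQp = dot(p, matvec(Q, p))
        alpha = -dot(p, g) / pQp                # step 2: exact step size
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        g = gradient(Q, b, x)                   # gradient at the new iterate
        beta = dot(p, matvec(Q, g)) / pQp       # step 3: enforce p_k^T Q p_{k+1} = 0
        p = [-gi + beta * pi for gi, pi in zip(g, p)]
    return x, n


Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]   # illustrative SPD matrix
b = [1.0, 2.0, 3.0]
x, iters = linear_cg_basic(Q, b, [0.0, 0.0, 0.0])
residual = gradient(Q, b, x)
```

Each pass through the loop performs three matrix-vector products (`Q p`, `Q x`, and `Q g`), which is the cost the simplifications in the next subsection aim to reduce.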

#### Efficiency improvements

While the above seems bulletproof in terms of analysis, it does require quite a few matrix-vector multiplications and two different evaluations of the gradient during each iteration, which can be costly if the dimensionality of the problem is large. We can actually make some clever adjustments to this calculation based on niceties of quadratic functions to improve the computational efficiency of the algorithm.

Note first that
$$ \nabla f_k = Q\mathbf{x}_k-\mathbf{b} \quad\text{and}\quad \mathbf{p}_k = -\nabla f_k + \beta_{k-1}\mathbf{p}_{k-1}$$

and thus the numerator of $\alpha_k$ simplifies to

$$ \mathbf{p}_k^T\nabla f_k = (-\nabla f_k + \beta_{k-1}\mathbf{p}_{k-1})^T\nabla f_k = -\nabla f_k^T\nabla f_k +\beta_{k-1}\mathbf{p}_{k-1}^T\nabla f_k = -\nabla f_k^T\nabla f_k $$

where the second term vanishes due to the orthogonality of $\nabla f_k$ and $\mathbf{p}_j$ for all $j<k$.

Furthermore, we see that
$$ \nabla f_{k+1} - \nabla f_k = Q(\mathbf{x}_{k+1}-\mathbf{x}_k) = \alpha_kQ\mathbf{p}_k \implies \nabla f_{k+1} = \nabla f_k + \alpha_kQ\mathbf{p}_k$$

gives us a recursive update of the gradient, and an argument similar to the one above leads to the simplification

$$ \beta_k = \frac{\nabla f_{k+1}^T\nabla f_{k+1}}{\nabla f_k^T\nabla f_k}$$
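One way to fill in that argument is sketched here. It assumes $\alpha_k \neq 0$ and relies on the orthogonality result quoted earlier, $\nabla f_{k+1}^T\mathbf{p}_j = 0$ for all $j \le k$. Substituting $Q\mathbf{p}_k = \frac{1}{\alpha_k}(\nabla f_{k+1}-\nabla f_k)$ into the numerator and denominator of $\beta_k$ gives

$$ \begin{align*}
    \mathbf{p}_k^TQ\nabla f_{k+1} &= \frac{1}{\alpha_k}(\nabla f_{k+1}-\nabla f_k)^T\nabla f_{k+1} = \frac{1}{\alpha_k}\nabla f_{k+1}^T\nabla f_{k+1} \\
    \mathbf{p}_k^TQ\mathbf{p}_k &= \frac{1}{\alpha_k}\mathbf{p}_k^T(\nabla f_{k+1}-\nabla f_k) = -\frac{1}{\alpha_k}\mathbf{p}_k^T\nabla f_k = \frac{1}{\alpha_k}\nabla f_k^T\nabla f_k
\end{align*}$$

In the first line, $\nabla f_k^T\nabla f_{k+1}=0$ because $\nabla f_k = -\mathbf{p}_k + \beta_{k-1}\mathbf{p}_{k-1}$ (or simply $\nabla f_0 = -\mathbf{p}_0$ when $k=0$) is a combination of directions that are all orthogonal to $\nabla f_{k+1}$. In the second line, $\mathbf{p}_k^T\nabla f_{k+1}=0$ for the same reason, and $\mathbf{p}_k^T\nabla f_k = -\nabla f_k^T\nabla f_k$ is the numerator simplification derived for $\alpha_k$. Dividing the two lines cancels the factor $\frac{1}{\alpha_k}$ and yields the stated ratio.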

With each of the above, then, we state a final, **efficient version of the linear conjugate gradient method**:

1. Set $\mathbf{r}_0= Q\mathbf{x}_0-\mathbf{b}$ (often called the [residual](https://en.wikipedia.org/wiki/Residual_(numerical_analysis))), $\mathbf{p}_0 = -\mathbf{r}_0$.
2. Calculate and store $\mathbf{y}_k=Q\mathbf{p}_k$.
3. Set $\alpha_k = \dfrac{\mathbf{r}_k^T\mathbf{r}_k}{\mathbf{p}_k^T\mathbf{y}_k}$ and update $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{p}_k$.
4. Update $\mathbf{r}_{k+1} = \mathbf{r}_k + \alpha_k\mathbf{y}_k$.
5. Set $\beta_k =\dfrac{\mathbf{r}_{k+1}^T\mathbf{r}_{k+1}}{\mathbf{r}_k^T\mathbf{r}_k}$ and update $\mathbf{p}_{k+1} = -\mathbf{r}_{k+1} + \beta_k\mathbf{p}_k$.
6. Repeat 2-5 until convergence.
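The efficient version can be sketched as follows, again in plain Python with list-based vectors. The function name `linear_cg`, the residual-norm stopping test, and the default iteration cap of $n$ are assumptions for illustration. In practice an array library would replace the hand-written helpers.

```python
def matvec(Q, v):
    return [sum(qij * vj for qij, vj in zip(row, v)) for row in Q]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_cg(Q, b, x0, tol=1e-10, max_iter=None):
    x = list(x0)
    r = [qi - bi for qi, bi in zip(matvec(Q, x), b)]   # step 1: r_0 = Q x_0 - b
    p = [-ri for ri in r]                              #         p_0 = -r_0
    rr = dot(r, r)                                     # r_k^T r_k, reused below
    max_iter = len(b) if max_iter is None else max_iter
    for k in range(max_iter):
        if rr ** 0.5 < tol:                            # assumed stopping rule
            return x, k
        y = matvec(Q, p)                               # step 2: the only product with Q
        alpha = rr / dot(p, y)                         # step 3
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri + alpha * yi for ri, yi in zip(r, y)]  # step 4: recursive residual
        rr_new = dot(r, r)
        beta = rr_new / rr                             # step 5
        p = [-ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]   # illustrative SPD matrix
b = [1.0, 2.0, 3.0]
x, iters = linear_cg(Q, b, [0.0, 0.0, 0.0])
residual = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
```

Keeping the scalar $\mathbf{r}_k^T\mathbf{r}_k$ from one iteration to the next is a small design choice in this sketch: it supplies both the numerator of $\alpha_k$ and the denominator of $\beta_k$, so the previous residual vector itself can be overwritten.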

Note that in the above, there is only a *single* matrix-vector multiplication per iteration, and the gradient is never explicitly evaluated. Furthermore, no matrices other than $Q$ ever need to be stored in memory, and each of $\mathbf{x}_k$, $\mathbf{p}_k$, and $\mathbf{y}_k$ can be overwritten during each iteration. We only need to keep $\mathbf{r}_k$ for two iterations (in the calculation of $\beta_k$), so a smart implementation of conjugate gradient is also very memory efficient.

### Nonlinear conjugate gradient method

We have shown above that if the objective function happens to be a nice quadratic function with $\mathbf{x}\in\mathbb{R}^n$, we are guaranteed convergence in at most $n$ steps, similar to previously discussed quasi-Newton methods. If the objective function is not quadratic, much of this argument breaks down, but the main idea still holds. There are several variations of nonlinear conjugate gradient methods, each corresponding to a different calculation of the coefficient $\beta_k$ (and, to a lesser extent, $\alpha_k$) since we no longer have the nice constant Hessian $Q$. We will discuss these nonlinear conjugate gradient methods next time.