#  Quadratic Functions, Newton's Method

 We will talk about:
* Rate of convergence for steepest descent on quadratic functions
* Newton's method

---

## Rate of Convergence

Last week, we defined the rate of convergence $p\ge1$ to be the value such that

$$ 0< \lim_{k\to\infty} \frac{\|\mathbf{x}_{k+1}-\mathbf{x}^*\|}{\|\mathbf{x}_k-\mathbf{x}^*\|^p} \equiv L < \infty $$

where $\mathbf{x}^*$ is the known minimizer, i.e. the terminal point satisfying $\nabla f(\mathbf{x}^*)=\mathbf{0}$. If $p=1$ and $L=1$, we say the convergence is **sub-linear**. If $p=1$ and $L<1$, we say the convergence is **linear**. If $p>1$, we say the convergence is **super-linear**. If the limit is equal to 0 for all $p\ge1$, we say the order of convergence is $\infty$.

We also showed that given the steps $\{\mathbf{x}_k\}$ of an iteration, we can determine $p$ empirically by

$$ p\approx \frac{\ln\left(\frac{\|\mathbf{x}_{k+1}-\mathbf{x}_k\|}{\|\mathbf{x}_k-\mathbf{x}_{k-1}\|}\right)}{\ln\left(\frac{\|\mathbf{x}_{k}-\mathbf{x}_{k-1}\|}{\|\mathbf{x}_{k-1}-\mathbf{x}_{k-2}\|}\right)} $$
as $k\to\infty$.
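As a rough illustration of the empirical formula above, the sketch below estimates $p$ from a stored list of iterates. The helper name `estimate_order` and the two test sequences (a geometric sequence and the Newton iteration for $\sqrt{2}$) are assumptions chosen for illustration; they are not part of the lecture material.

```python
import numpy as np

def estimate_order(path):
    """Estimate the order p from successive step lengths ||x_{k+1} - x_k||."""
    path = np.asarray(path, dtype=float)
    if path.ndim == 1:                       # treat scalars as 1-d vectors
        path = path[:, None]
    steps = np.linalg.norm(np.diff(path, axis=0), axis=1)
    # p ~ ln(s_k / s_{k-1}) / ln(s_{k-1} / s_{k-2}), one estimate per k
    return np.log(steps[2:] / steps[1:-1]) / np.log(steps[1:-1] / steps[:-2])

# Linearly convergent sequence: x_k = 0.5**k -> 0
linear_path = [0.5**k for k in range(8)]

# Quadratically convergent sequence: Newton iteration for sqrt(2)
x = 2.0
newton_path = [x]
for _ in range(5):
    x = 0.5 * (x + 2.0 / x)
    newton_path.append(x)

print(estimate_order(linear_path)[-1])   # close to 1
print(estimate_order(newton_path)[-1])   # close to 2
```

The estimate is only meaningful while the step lengths stay well above round-off, so the iteration should be stopped before successive iterates become identical.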

While this seems to be some abstract definition that can only be determined after the fact, we have shown some nice analytic results for a special class of functions called **quadratic functions**, i.e. those of the form

$$ f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^TQ\mathbf{x}-\mathbf{b}^T\mathbf{x} $$

for $\mathbf{b}\in\mathbb{R}^d$, $Q$ a symmetric positive definite $d\times d$ matrix. This may seem like an obscure function, but most of the example functions we have used to this point have been functions of this type.

### Example: Quadratic function

Note that if $\mathbf{x}\in\mathbb{R}^2$, we have
$$ \begin{align*}
    \frac{1}{2}\mathbf{x}^TQ\mathbf{x}-\mathbf{b}^T\mathbf{x} &= \frac{1}{2}\begin{pmatrix} x_1 & x_2 \end{pmatrix}
    \begin{pmatrix}
        Q_{1,1} & Q_{1,2} \\
        Q_{1,2} & Q_{2,2}
    \end{pmatrix}\begin{pmatrix} x_1 \\ x_2\end{pmatrix} - \begin{pmatrix} b_1 & b_2\end{pmatrix}\begin{pmatrix} x_1 \\x_2 \end{pmatrix} \\
    & = \frac{1}{2}Q_{1,1}x_1^2 + \frac{1}{2}Q_{2,2}x_2^2 + Q_{1,2}x_1x_2 - b_1x_1 - b_2x_2
\end{align*}$$
so, using our tried and true example function

$$ g(x,y) = (x-1)^2+(2y-1)^2 = x^2+4y^2 - 2x - 4y + 2 $$

we can identify
$$ Q=\begin{pmatrix} 2 & 0 \\ 0 & 8 \end{pmatrix},\quad \mathbf{b}=\begin{pmatrix} 2 \\ 4 \end{pmatrix} $$

so that minimizing $g$ is equivalent to minimizing the quadratic function $f(x,y)\equiv g(x,y)-2$. Subtracting a constant does not affect the location of the minimizer, $\mathbf{x}^*=(1,1/2)$.
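The identification of $Q$ and $\mathbf{b}$ can be checked numerically. The sketch below compares $g-2$ with the quadratic form at a few random points and recovers the minimizer by solving $Q\mathbf{x}=\mathbf{b}$. The random test points and variable names are assumptions made for illustration.

```python
import numpy as np

Q = np.array([[2.0, 0.0], [0.0, 8.0]])
b = np.array([2.0, 4.0])

g = lambda x: (x[0] - 1)**2 + (2*x[1] - 1)**2      # original example function
f = lambda x: 0.5 * x @ Q @ x - b @ x              # quadratic form

# Compare f with g - 2 at a handful of random points
rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))
max_gap = max(abs(f(x) - (g(x) - 2)) for x in pts)
print(max_gap)                                     # should be at round-off level

# The minimizer satisfies Q x* = b
x_star = np.linalg.solve(Q, b)
print(x_star)
```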

### Steepest Descent on Quadratic Functions

Thanks to the nice form of quadratic functions, there are some reasonably straightforward theoretical convergence results that can be shown. Since we have the simple expression $\nabla f(\mathbf{x}) = Q\mathbf{x}-\mathbf{b}$, one step of steepest descent is given by

$$ \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k\nabla f_k = \mathbf{x}_k - \alpha_k(Q\mathbf{x}_k-\mathbf{b}) $$

Then the optimal value of $\alpha_k$ is easily determined by taking a derivative of $\phi(\alpha)=f(\mathbf{x}_k -\alpha\nabla f_k)$ and setting it equal to zero, giving

$$ \begin{align*}
    0 &= -\nabla f_k^T\big(Q(\mathbf{x}_k-\alpha\nabla f_k)-\mathbf{b}\big) \\
    &= -\nabla f_k^T(Q\mathbf{x}_k-\mathbf{b}) + \alpha\nabla f_k^TQ\nabla f_k \\
    &= -\nabla f_k^T\nabla f_k + \alpha\nabla f_k^TQ\nabla f_k \\
    \implies &\alpha_k = \frac{\|\nabla f_k\|^2}{\|\nabla f_k\|_Q^2} = \frac{(Q\mathbf{x}_k-\mathbf{b})^T(Q\mathbf{x}_k-\mathbf{b})}{(Q\mathbf{x}_k-\mathbf{b})^TQ(Q\mathbf{x}_k-\mathbf{b})}
\end{align*}$$

where $\|\mathbf{u}\|_Q^2 = \mathbf{u}^TQ\mathbf{u}$. 
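A minimal sketch of steepest descent with this exact step size, assuming the quadratic form $f(\mathbf{x})=\frac12\mathbf{x}^TQ\mathbf{x}-\mathbf{b}^T\mathbf{x}$ and reusing the $Q$ and $\mathbf{b}$ identified in the example above. The function name `steepest_descent_exact` and the starting point are illustrative choices.

```python
import numpy as np

def steepest_descent_exact(Q, b, x0, tol=1e-8, max_steps=1000):
    """Steepest descent on 0.5 x^T Q x - b^T x using the exact step alpha_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_steps):
        grad = Q @ x - b                             # gradient of the quadratic
        if np.linalg.norm(grad) <= tol:
            return x, k
        alpha = (grad @ grad) / (grad @ Q @ grad)    # ||grad||^2 / ||grad||_Q^2
        x = x - alpha * grad
    return x, max_steps

Q = np.array([[2.0, 0.0], [0.0, 8.0]])
b = np.array([2.0, 4.0])
x_min, n_iter = steepest_descent_exact(Q, b, x0=[3.0, -2.0])
print(f'After {n_iter} iterations, approximate minimizer is {x_min}')
```

Because the step size is recomputed from the current gradient at every iteration, no step size needs to be tuned by hand; the price is one extra matrix-vector product per step.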

Now, concerning rate of convergence, we need to find the ratio between the error in successive iterations,

$$ \frac{\|\mathbf{x}_{k+1}-\mathbf{x}^*\|}{\|\mathbf{x}_{k}-\mathbf{x}^*\|} $$

Noting that $Q\mathbf{x}^*=\mathbf{b}$ by design (i.e. $\nabla f(\mathbf{x}^*)=\mathbf{0}$), we can rewrite one iteration of steepest descent as

$$ \begin{align*}
    \mathbf{x}_{k+1} &= \mathbf{x}_k - \alpha_k(Q\mathbf{x}_k-Q\mathbf{x}^*) \\
    \implies \mathbf{x}_{k+1}-\mathbf{x}^* &= \mathbf{x}_k -\mathbf{x}^*- \alpha_kQ(\mathbf{x}_k-\mathbf{x}^*) \\
            &= (I-\alpha_kQ)(\mathbf{x}_k-\mathbf{x}^*) \\
    \implies \|\mathbf{x}_{k+1}-\mathbf{x}^*\|^2 &= (\mathbf{x}_{k+1}-\mathbf{x}^*)^T(I-\alpha_kQ)(\mathbf{x}_k-\mathbf{x}^*) \\
            &=(\mathbf{x}_k-\mathbf{x}^*)^T(I-\alpha_kQ)^2(\mathbf{x}_k-\mathbf{x}^*) \\
    \implies \frac{\|\mathbf{x}_{k+1}-\mathbf{x}^*\|^2}{\|\mathbf{x}_{k}-\mathbf{x}^*\|^2} &\le \lambda_{max}\big((I-\alpha_kQ)^2\big)
\end{align*}$$

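where the last inequality uses the fact that $\mathbf{u}^TA\mathbf{u}\le\lambda_{max}(A)\|\mathbf{u}\|^2$ for any symmetric matrix $A$, applied here with $A=(I-\alpha_kQ)^2$ and $\mathbf{u}=\mathbf{x}_k-\mathbf{x}^*$.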

It was shown in lecture that this bound is at a minimum when
$$ \alpha_k = \frac{2}{\lambda_{max}(Q) + \lambda_{min}(Q)} $$

which is independent of $\mathbf{x}_k$ and thus is a good choice for fixed step size steepest descent for quadratic functions. Indeed if we substitute this value of $\alpha$ in, we get the inequality
$$ \frac{\|\mathbf{x}_{k+1}-\mathbf{x}^*\|^2}{\|\mathbf{x}_{k}-\mathbf{x}^*\|^2} \le \left(\frac{\lambda_{max}-\lambda_{min}}{\lambda_{max}+\lambda_{min}}\right)^2 \equiv \left(\frac{\kappa-1}{\kappa+1}\right)^2 $$

where $\kappa\equiv \lambda_{max}/\lambda_{min}$ is called the **condition number** of $Q$. Note that (the square root of) this inequality tells us steepest descent with this choice of $\alpha$ converges **linearly at worst** ($p=1$, $L<1$). In certain special cases the iteration converges in *a single step*: for example, when $Q=\mu I$, so that $\lambda_{max}=\lambda_{min}\implies \kappa=1$, or, using the exact line-search step $\alpha_k$ derived above, when $\mathbf{x}_0-\mathbf{x}^*$ is a multiple of an eigenvector of $Q$ (homework exercise).
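The claims in this subsection can be checked numerically. The sketch below scans fixed step sizes on a grid, compares the best one with $2/(\lambda_{max}+\lambda_{min})$, and then tests the two single-step cases. The matrix $Q$, the grid, and the error vectors are assumptions chosen for illustration, and the eigenvector case is assumed to use the exact line-search step rather than the fixed one.

```python
import numpy as np

Q = np.array([[2.0, 0.0], [0.0, 4.0]])     # illustrative SPD matrix
lam = np.linalg.eigvalsh(Q)                # eigenvalues in ascending order
lam_min, lam_max = lam[0], lam[-1]
kappa = lam_max / lam_min

def contraction(alpha):
    """Largest |eigenvalue| of I - alpha*Q, the square root of the bound."""
    return np.max(np.abs(1.0 - alpha * lam))

alpha_opt = 2.0 / (lam_max + lam_min)
alphas = np.linspace(0.01, 0.5, 491)       # grid with spacing 0.001
alpha_grid = alphas[np.argmin([contraction(a) for a in alphas])]
print(alpha_opt, alpha_grid)                              # should nearly agree
print(contraction(alpha_opt), (kappa - 1) / (kappa + 1))  # should agree

# Single-step case 1: Q = mu*I, fixed step 2/(mu + mu) = 1/mu
mu = 3.0
e0 = np.array([1.0, -2.0])                 # error x_0 - x*
e1_identity = (np.eye(2) - (2.0 / (mu + mu)) * mu * np.eye(2)) @ e0

# Single-step case 2: error along an eigenvector of Q, exact line-search step
e0 = np.array([5.0, 0.0])                  # eigenvector for eigenvalue 2
grad = Q @ e0                              # gradient equals Q (x_0 - x*)
alpha_exact = (grad @ grad) / (grad @ Q @ grad)
e1_exact = (np.eye(2) - alpha_exact * Q) @ e0
e1_fixed = (np.eye(2) - alpha_opt * Q) @ e0    # fixed step: error only shrinks
print(e1_identity, e1_exact, e1_fixed)
```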

Although quadratic functions may seem like a very special class, the above result can be extended to any function $f$, not necessarily quadratic, so long as $H\equiv \nabla^2 f(\mathbf{x}^*)$ is symmetric positive definite at the minimizer, with the substitution $\kappa\to\kappa_H$, where $\kappa_H$ is the condition number of $H$.

### Example: Fixed step steepest descent

Perform fixed step steepest descent on the function $f(x,y)=x^2 + 2y^2 + 4x + y + 6$ starting from an initial guess of $\mathbf{x}_0=(3,-2)$ and choosing $\alpha_k$ to be the constant defined above.

**Solution:** First, note that this is a quadratic function with
$$ Q=\begin{pmatrix}
    2 & 0 \\
    0 & 4
\end{pmatrix},\quad \mathbf{b}=\begin{pmatrix} -4 \\ -1\end{pmatrix} $$

Then $\lambda_{max}(Q)=4$, $\lambda_{min}(Q)=2$, and thus $\alpha_k=2/(4+2)=1/3$ should provide adequately fast convergence.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# function and gradient
f = lambda x: x[0]**2 + 2*x[1]**2 + 4*x[0] + x[1] + 6
Df = lambda x: np.array([2*x[0]+4, 4*x[1]+1])

x = np.array([3,-2])  # initial point
path = [x]
print(f'Initial x={x}')
alpha = 1/3           # step size
tol = 1e-8            # stop when gradient is smaller than this amount
max_steps = 1000      # Maximum number of steps to run the iteration
i=0                   # iteration count
dx = Df(x)            # current gradient
while np.linalg.norm(dx)>tol and i<max_steps:
    xnew = x - alpha*dx
    path.append(xnew)
    x = xnew
    dx = Df(x)
    i += 1

path=np.array(path)
print(f'After {i} iterations, approximate minimum is {f(x)} at {x}')

**Exercise:** Investigate what happens for values of $\alpha$ slightly larger or slightly smaller than the "optimal" fixed value of $1/3$. You should see that convergence takes more iterations in each case.

In [None]:
err = np.linalg.norm(path-np.array([-2,-1/4]),axis=1) # ||x_k - x*||
print(err[-1]/err[-2])   # limiting convergence bound, should be ≤ 1/3 = (K-1)/(K+1)

---

## Newton's Method

While linear convergence is decent, super-linear convergence ($p>1$) would be nice, and the holy grail would be convergence in a single step for *all* functions, not just special edge cases. Here, we introduce another line search method, **Newton's method**, which, as we will show shortly, has super-linear (*quadratic*, $p=2$ in this case) convergence, better than steepest descent.

You have likely heard of [Newton's method](https://en.wikipedia.org/wiki/Newton's_method) for finding the zeros of a single-variable, nonlinear function, i.e. an $x^*$ such that $f(x^*)=0$. The iteration is given by
$$ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} $$

and can be interpreted geometrically as constructing a tangent line to $f$ at $x_k$, finding where the tangent line crosses the $x$-axis, and repeating, as shown in the image below:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Newton_method_scheme.svg/992px-Newton_method_scheme.svg.png" width=30% />

In optimization problems, we're usually searching for a point where $f'(x)=0$, not $f(x)=0$, so simply replacing $f\to f'$ yields an algorithm for finding a minimum, albeit requiring information about the second derivative $f''(x)$. We have for [Newton's method in optimization](https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization),

$$ x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)} $$
or the straightforward generalization to multiple dimensions
$$ \mathbf{x}_{k+1} = \mathbf{x}_k - \left[\nabla^2 f_k\right]^{-1}\nabla f_k $$

It should be easy to see that this does indeed correspond to a line search algorithm with $\mathbf{p}_k = -\left[\nabla^2 f_k\right]^{-1}\nabla f_k$, called the **Newton direction**, and $\alpha_k=1$.
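Before moving to several dimensions, here is a minimal single-variable sketch of the iteration $x_{k+1}=x_k-f'(x_k)/f''(x_k)$. The test function $f(x)=x-\ln x$ (minimizer $x^*=1$) and the helper name `newton_1d` are assumptions chosen for illustration; they do not come from the lecture.

```python
def newton_1d(df, d2f, x0, tol=1e-12, max_steps=50):
    """Newton's method for a 1-d minimization: x <- x - f'(x)/f''(x)."""
    x = x0
    path = [x]
    for _ in range(max_steps):
        if abs(df(x)) <= tol:          # stop once the derivative is ~0
            break
        x = x - df(x) / d2f(x)
        path.append(x)
    return x, path

# f(x) = x - ln(x), so f'(x) = 1 - 1/x and f''(x) = 1/x**2
df = lambda x: 1 - 1 / x
d2f = lambda x: 1 / x**2

x_min, path = newton_1d(df, d2f, x0=0.5)
print(x_min)
print(path)
```

For this particular function the update simplifies to $x_{k+1}=2x_k-x_k^2$, so the error satisfies $1-x_{k+1}=(1-x_k)^2$, i.e. it is squared at every step.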

Let's now apply Newton's method to the example function $f(x,y)=x^2+2y^2+4x+y+6$ defined above:

In [None]:
# function, gradient, and now HESSIAN
f = lambda x: x[0]**2 + 2*x[1]**2 + 4*x[0] + x[1] + 6
Df = lambda x: np.array([2*x[0]+4, 4*x[1]+1])
D2f = lambda x: np.array([[2,0],[0,4]])  # technically doesn't have to be a function here, but for uniformity's sake

x = np.array([3,-2])  # initial point
path = [x]
print(f'Initial x={x}')
alpha = 1             # step size is 1 in Newton's method
tol = 1e-8            # stop when gradient is smaller than this amount
max_steps = 1000      # Maximum number of steps to run the iteration
i=0                   # iteration count
dx = Df(x)            # current gradient
while np.linalg.norm(dx)>tol and i<max_steps:
    pk = -np.linalg.solve(D2f(x),dx)  # Newton direction; faster to solve a system than manually invert
    xnew = x + alpha*pk
    path.append(xnew)
    x = xnew
    dx = Df(x)
    i += 1

path=np.array(path)
print(f'After {i} iterations, approximate minimum is {f(x)} at {x}')

Wow! Newton's method converged in a single iteration! Indeed, this is the power of Newton's method, at least for quadratic functions. Since for any quadratic function $\nabla^2 f = Q$, a single iteration sets

$$ \mathbf{x}_1 = \mathbf{x}_0 - Q^{-1}\nabla f_0 = \mathbf{x}_0 - Q^{-1}(Q\mathbf{x}_0-\mathbf{b}) = \mathbf{x}_0 - \mathbf{x}_0 + Q^{-1}\mathbf{b} = \mathbf{x}^* $$

since $\nabla f = Q\mathbf{x}-\mathbf{b}$ and thus $Q\mathbf{x}^*=\mathbf{b}$. The result is that Newton's method **converges in one iteration** for quadratic functions!

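Indeed, by an argument similar to the one above for steepest descent, it can be shown that the ratio $\|\mathbf{x}_{k+1}-\mathbf{x}^*\|/\|\mathbf{x}_k-\mathbf{x}^*\|^2$ remains bounded, showing that Newton's method **converges quadratically** at worst for any sufficiently smooth function with $\nabla^2 f(\mathbf{x}^*)$ symmetric positive definite (SPD), provided the starting point is close enough to $\mathbf{x}^*$.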
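To see quadratic convergence on a function that is *not* quadratic, the sketch below runs a few Newton steps and prints the ratio $\|\mathbf{x}_{k+1}-\mathbf{x}^*\|/\|\mathbf{x}_k-\mathbf{x}^*\|^2$. The test function $f(x,y)=e^{x+y}-(x+y)+(x-y)^2$, with minimizer $\mathbf{x}^*=(0,0)$, and the starting point are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical non-quadratic test function with minimizer at (0, 0)
f = lambda x: np.exp(x[0] + x[1]) - (x[0] + x[1]) + (x[0] - x[1])**2
Df = lambda x: np.array([np.exp(x[0] + x[1]) - 1 + 2*(x[0] - x[1]),
                         np.exp(x[0] + x[1]) - 1 - 2*(x[0] - x[1])])
D2f = lambda x: (np.exp(x[0] + x[1]) * np.ones((2, 2))
                 + np.array([[2.0, -2.0], [-2.0, 2.0]]))

x = np.array([1.0, 0.5])                 # initial point
errs = [np.linalg.norm(x)]               # ||x_k - x*|| with x* = (0, 0)
for _ in range(5):
    x = x - np.linalg.solve(D2f(x), Df(x))   # full Newton step
    errs.append(np.linalg.norm(x))

errs = np.array(errs)
ratios = errs[1:] / errs[:-1]**2         # stays bounded for quadratic convergence
print(errs)
print(ratios)
```

Only five steps are taken so that the errors stay above round-off level; beyond that point the ratio is dominated by floating-point noise and no longer reflects the convergence order.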

### Drawbacks of Newton's method

Although Newton's method does converge more quickly than steepest descent, there are a few drawbacks:

1. First and foremost, it requires calculating not just the gradient but also the Hessian (at least ostensibly) by hand.
2. Even if we can find a way to numerically calculate the Hessian, Newton's method also requires inverting the Hessian (or, more efficiently, solving a linear system) during each iteration, which can be costly, especially for high-dimensional functions.
3. The Hessian may even be singular far away from the minimizer (though it can be shown the Hessian is nonsingular at least in a neighborhood of $\mathbf{x}^*$), so Newton's method cannot even be defined globally; it is only **locally** convergent.
4. Even if the Hessian is nonsingular at each point on the iteration, the Newton direction may not be a descent direction in general.
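The last drawback can be illustrated with a small check. A direction $\mathbf{p}$ is a descent direction when $\nabla f^T\mathbf{p}<0$, and the sketch below evaluates this quantity at a point where the Hessian is nonsingular but indefinite. The non-convex function $f(x,y)=x^4/4-x^2/2+y^2/2$ and the evaluation point are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical non-convex function: f(x, y) = x^4/4 - x^2/2 + y^2/2
Df = lambda x: np.array([x[0]**3 - x[0], x[1]])
D2f = lambda x: np.array([[3*x[0]**2 - 1, 0.0], [0.0, 1.0]])

x = np.array([0.1, 0.0])                 # Hessian is indefinite here
grad = Df(x)
p_newton = -np.linalg.solve(D2f(x), grad)    # Newton direction
p_sd = -grad                                  # steepest descent direction

# p is a descent direction exactly when grad^T p < 0
print(grad @ p_newton)   # positive: Newton direction points uphill
print(grad @ p_sd)       # negative: steepest descent always points downhill
```

At this point the Newton step heads toward the saddle point at the origin rather than toward either minimizer at $(\pm1,0)$.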

We will talk next time about modifications to Newton's method which attempt to remedy all of the above, called **Quasi-Newton methods**.