# Quasi-Newton Methods

Quasi-Newton methods require only the gradient of the objective function at each iteration; no second derivatives are needed.
By measuring the change in gradients between iterations, they construct a model of the objective function, $f$, that is good enough to produce superlinear convergence.

We begin with the derivation of a Quasi-Newton Broyden class approach by constructing a quadratic model of $f$ at an iterate $x_k$:  

$$ m_k(p) \equiv f_k + \nabla f_k^T p + \frac12 p^T B_k p, $$

where $B_k$ is an $n \times n$ symmetric positive definite (SPD) matrix that will be updated at every iteration.

Note, $m_k(\vec{0}) = f(x_k)$ and $\nabla m_k(\vec{0}) = \nabla f(x_k)$.
Since $B_k$ is an SPD matrix, the model has the analytical minimizer $p_k = -B_k^{-1}\nabla f_k$.
Using this as our search direction, the next iterate is $x_{k+1} = x_k + \alpha_k p_k$, where the step length $\alpha_k$ is chosen to satisfy the Wolfe conditions.
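
As a minimal sketch of this step, the snippet below computes the search direction by solving $B_k p_k = -\nabla f_k$ and checks whether a trial step length satisfies the Wolfe conditions. The helper names, the test quadratic, and the constants `c1 = 1e-4` and `c2 = 0.9` are assumptions chosen for illustration; they do not come from these notes.

```python
import numpy as np

def quasi_newton_direction(B, grad):
    """Minimizer of the quadratic model: solve B p = -grad (B assumed SPD)."""
    return np.linalg.solve(B, -grad)

def satisfies_wolfe(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the sufficient-decrease and curvature (Wolfe) conditions.

    c1 and c2 are illustrative defaults, not values taken from the notes.
    """
    slope0 = grad_f(x) @ p                      # directional derivative at x
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature = grad_f(x_new) @ p >= c2 * slope0
    return bool(sufficient_decrease and curvature)

# Hypothetical test problem: f(x) = 1/2 x^T A x - b^T x with A SPD.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

x0 = np.zeros(2)
B0 = np.eye(2)                                  # initial Hessian approximation
p0 = quasi_newton_direction(B0, grad_f(x0))
wolfe_ok = satisfies_wolfe(f, grad_f, x0, p0, alpha=0.5)
wolfe_too_long = satisfies_wolfe(f, grad_f, x0, p0, alpha=2.0)
```

On this problem the step `alpha=0.5` is accepted, while `alpha=2.0` overshoots and fails the sufficient-decrease test.
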
How, then, do we update $B_k$ to obtain $B_{k+1}$? We use all the information known, i.e. $B_k$, $\nabla f$, $\alpha_k$, $p_k$, $x_k$, and $x_{k+1}$, and impose some desirable conditions on $B_{k+1}$.
One reasonable condition is that $\nabla m_{k+1}(p)$ should match $\nabla f$ at the latest two iterates, $x_k$ and $x_{k+1}$.
By construction $\nabla m_{k+1}(0) = \nabla f_{k+1}$, so the match at $x_{k+1}$ holds automatically and places no constraint on $B_{k+1}$.
The match at $x_k$, however, gives

$$ \nabla m_{k+1} (-\alpha_k p_k) = \nabla f_{k+1} - \alpha_k B_{k+1} p_k = \nabla f_k, $$

and rearranging gives  

$$ B_{k+1}\alpha_k p_k = \nabla f_{k+1} - \nabla f_k \iff  B_{k+1}(x_{k+1} - x_k) = \nabla f_{k+1} - \nabla f_k.$$  

Then, letting $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$, we can write this compactly as  

$$B_{k+1}s_k = y_k,$$  

which is referred to as the _secant equation_.
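
As a quick illustration of where the name comes from (a special case added here, not part of the derivation above), take $n = 1$. Then $B_{k+1}$ is a scalar and the equation determines it uniquely:

$$ B_{k+1} = \frac{y_k}{s_k} = \frac{f'(x_{k+1}) - f'(x_k)}{x_{k+1} - x_k}, $$

which is the slope of the secant line through two points on the graph of $f'$, i.e. a finite-difference approximation of $f''$.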

The _secant equation_ asserts that the new $B_{k+1}$ must map the displacement $s_k$ into the change in gradients $y_k$.
Premultiplying both sides of the _secant equation_ by $s_k^T$ gives $s_k^T B_{k+1} s_k = s_k^T y_k$, so $B_{k+1}$ can be positive definite only if $s_k^T y_k > 0$.
We refer to $s_k^T y_k > 0$ as the _curvature condition_ of our update.
When $f$ is strongly convex, the curvature condition holds for any two points $x_k$ and $x_{k+1}$.
If $f$ is non-convex it can fail, but it is guaranteed to hold whenever $\alpha_k$ is chosen to satisfy the Wolfe conditions: the Wolfe curvature condition $\nabla f_{k+1}^T p_k \ge c_2 \nabla f_k^T p_k$ with $c_2 < 1$ implies $y_k^T s_k \ge (c_2 - 1)\alpha_k \nabla f_k^T p_k > 0$ for a descent direction $p_k$.
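
The sketch below checks the curvature condition numerically for two hypothetical objectives: a strongly convex quadratic, where it holds for an arbitrary pair of points, and the non-convex function $x^4 - 2x^2$, where a step across the concave region violates it. Both test functions and the helper name `curvature_ok` are assumptions made for illustration.

```python
import numpy as np

def curvature_ok(x_old, x_new, grad_f):
    """Return (s, y, flag) where flag is True when s^T y > 0."""
    s = x_new - x_old
    y = grad_f(x_new) - grad_f(x_old)
    return s, y, bool(s @ y > 0)

# Strongly convex quadratic f(x) = 1/2 x^T A x: y = A s, so s^T y = s^T A s > 0.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_convex = lambda x: A @ x

# Non-convex example f(x) = x^4 - 2 x^2, which is concave near x = 0.
grad_nonconvex = lambda x: 4 * x**3 - 4 * x

_, _, ok_convex = curvature_ok(np.array([1.0, -2.0]), np.array([0.5, 3.0]), grad_convex)
s_nc, y_nc, ok_nonconvex = curvature_ok(np.array([-0.5]), np.array([0.5]), grad_nonconvex)
```

In the second case $s_k = 1$ and $y_k = -3$, so no positive definite $B_{k+1}$ can satisfy the secant equation for that step.
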


### Solutions of Secant Equation

When the curvature condition is met, the _secant equation_ always has infinitely many solutions (for $n > 1$), because the $\frac{n(n+1)}2$ degrees of freedom in a symmetric matrix exceed the $n$ conditions imposed by the secant equation. Positive definiteness adds $n$ inequalities (all leading principal minors must be positive), but these do not absorb the remaining degrees of freedom. Thus the solution of the _secant equation_ is not unique, and we must impose an additional criterion.

A reasonable criterion is that, of all the $B_{k+1}$ satisfying the secant equation, we choose the one that is closest to the current matrix $B_k$.
This can be written as the optimization problem $\min_B || B-B_k ||$ subject to $B=B^T$ and $Bs_k = y_k$. (Note that this problem only makes sense if an SPD solution exists, i.e. $s_k^T y_k > 0$.)


Different norms can be used for $||B - B_k||$, but the most common choice is the weighted Frobenius norm  

$$||A||_W \equiv ||W^{1/2}AW^{1/2}||_F,$$  

where $|| C ||^2_F = \sum_{i=1}^n\sum_{j=1}^n c_{ij}^2$.
The weight matrix $W$ can be chosen as _any_ matrix satisfying the relation $W y_k = s_k$.
Most often, we will use the inverse of the _average Hessian_, $W = \bar{G}^{-1}_k$, where  

$$ \bar{G}^{-1}_k = \big[ \int_0^1 \nabla^2 f(x_k + \tau \alpha_k p_k) d\tau \big]^{-1}.$$  

By Taylor's theorem we then have $y_k = \bar{G}_k\alpha_k p_k = \bar{G}_k s_k$, so this $W$ does satisfy $W y_k = s_k$, and it can be shown that this choice gives rise to a scale-invariant optimization method.
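
A small numerical sketch of these definitions follows. It assumes a quadratic objective, for which the Hessian is constant and the average Hessian equals that constant matrix `G`; the helper names `sqrtm_spd` and `weighted_frobenius` are hypothetical.

```python
import numpy as np

def sqrtm_spd(W):
    """Symmetric square root of an SPD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(W)
    return (vecs * np.sqrt(vals)) @ vecs.T

def weighted_frobenius(M, W):
    """||M||_W = ||W^{1/2} M W^{1/2}||_F for an SPD weight W."""
    R = sqrtm_spd(W)
    return np.linalg.norm(R @ M @ R, "fro")

# Assumed quadratic objective: constant Hessian G, so the average Hessian is G.
G = np.array([[3.0, 1.0], [1.0, 2.0]])
s = np.array([0.5, -1.0])
y = G @ s                      # change in gradient for a quadratic
W = np.linalg.inv(G)           # average-Hessian weight

weight_relation_holds = np.allclose(W @ y, s)
norm_of_G = weighted_frobenius(G, W)   # W^{1/2} G W^{1/2} = I, so this is sqrt(n)
```

Measured in this weight, the Hessian itself has norm $\sqrt{n}$ regardless of how the variables are scaled, which gives some intuition for the scale-invariance remark.
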

###### DFP Method:
Using the weighted Frobenius norm in the minimization problem yields the unique solution  

$$ B_{k+1} = (I - \rho_k y_k s_k^T) B_k (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T, $$

with $\rho_k = \frac1{y_k^Ts_k}$.
This update was proposed by Davidon in 1959 and subsequently studied and implemented by Fletcher and Powell.
If we let $H_k = B_k^{-1}$, we can derive a more economical version of DFP, in which $H_k$ is updated directly and the search direction $p_k = -H_k \nabla f_k$ is obtained by a matrix-vector multiplication rather than a linear solve:

$$ H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}. $$

See the discussion below on using the inverse of $B_k$ in our Quasi-Newton method.
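
The two DFP formulas above can be sketched in a few lines. The function names and the sample vectors are hypothetical; the code assumes a symmetric input matrix and that the curvature condition $s_k^T y_k > 0$ holds.

```python
import numpy as np

def dfp_update_B(B, s, y):
    """DFP update of the Hessian approximation B (assumes s^T y > 0)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(y, s)) @ B @ (I - rho * np.outer(s, y)) \
        + rho * np.outer(y, y)

def dfp_update_H(H, s, y):
    """DFP update of the inverse approximation H = B^{-1} (H symmetric)."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)

# Illustrative data with s^T y = 2.75 > 0.
B0 = np.eye(2)
H0 = np.linalg.inv(B0)
s = np.array([1.0, 0.5])
y = np.array([2.0, 1.5])

B1 = dfp_update_B(B0, s, y)
H1 = dfp_update_H(H0, s, y)
```

With these inputs, `B1 @ s` reproduces `y`, `H1 @ y` reproduces `s`, and `H1` is the inverse of `B1`, so the two forms describe the same update.
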


###### BFGS Method:
A more powerful update formula is generated when phrasing the _secant equation_ in an equivalent form  

$$ H_{k+1} y_k = s_k $$

The condition of closeness is now phrased as the program $\min_H || H-H_k ||$ subject to $H=H^T$ and $Hy_k = s_k$.
Then, using a weighted Frobenius norm and selecting $W$ such that $W s_k = y_k$, the unique solution to the program is 

$$ H_{k+1} = (I - \rho_ks_ky_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, $$

with $\rho_k$ defined as in DFP. As with DFP, we can also write the BFGS update in terms of the Hessian approximation as  

$$ B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. $$  

This version of BFGS is less efficient because computing $p_k$ requires solving the system $B_k p_k = -\nabla f_k$, which raises the cost of one step from $O(n^2)$ to $O(n^3)$.
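
To make the $O(n^2)$ claim concrete, here is a minimal sketch of the inverse BFGS update written as an expanded rank-two correction, followed by a small driver loop. The driver is an assumption made for illustration: it minimizes a hypothetical convex quadratic and uses the exact line-search step, which is available in closed form only for quadratics and satisfies the Wolfe conditions there (for $c_1 \le 1/2$). A general implementation would need a proper Wolfe line search instead.

```python
import numpy as np

def bfgs_update_H(H, s, y):
    """Inverse BFGS update in O(n^2): expands (I - rho s y^T) H (I - rho y s^T) + rho s s^T."""
    rho = 1.0 / (y @ s)
    Hy = H @ y                                   # one matrix-vector product
    return (H
            - rho * (np.outer(Hy, s) + np.outer(s, Hy))
            + (rho**2 * (y @ Hy) + rho) * np.outer(s, s))

def bfgs_quadratic(A, b, x0, tol=1e-10, max_iter=50):
    """Minimize 1/2 x^T A x - b^T x (A SPD) with BFGS and an exact line search."""
    x = x0.astype(float)
    H = np.eye(len(x))                           # H_0 = I, an arbitrary choice
    g = A @ x - b
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            return x, k
        p = -H @ g                               # search direction, no linear solve
        alpha = -(g @ p) / (p @ A @ p)           # exact step, quadratic case only
        s = alpha * p
        x = x + s
        g_new = A @ x - b
        H = bfgs_update_H(H, s, g_new - g)
        g = g_new
    return x, max_iter

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_min, iters = bfgs_quadratic(A, b, np.zeros(2))
```

Each iteration uses only matrix-vector and outer products, so its cost is quadratic in $n$.
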


### Summary

Note that our discussion of BFGS and DFP did not mention the selection of $H_0$ or $B_0$.
There is no magic formula, but empirical results show that it is best to choose $\beta I$ (a scaled identity) or an SPD matrix that approximates the Hessian at $x_0$ by finite differences. Neither formula explicitly enforces that the updated matrix is SPD; however, it can be shown that if the current matrix is SPD and the curvature condition holds, the updated matrix is SPD as well (pg. 141). Both BFGS and DFP arise from a weighted Frobenius norm, and there has been an extensive search for methods using other norms; however, none yields a more desirable result. Lastly, both methods are members of the Broyden class of Quasi-Newton updates, a one-parameter family of updating formulas that contains DFP and BFGS. See pg. 142 for implementation notes.
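
The Broyden class is commonly parametrized as $B_{k+1} = (1-\phi_k) B_{k+1}^{BFGS} + \phi_k B_{k+1}^{DFP}$; this formula is the standard textbook form and is not written out in these notes. The sketch below, with hypothetical function names and sample data, builds that combination and checks that for $\phi_k \in [0, 1]$ the result satisfies the secant equation and stays SPD when $B_k$ is SPD and $s_k^T y_k > 0$.

```python
import numpy as np

def bfgs_update_B(B, s, y):
    """BFGS update of the Hessian approximation."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def dfp_update_B(B, s, y):
    """DFP update of the Hessian approximation."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V.T @ B @ V + rho * np.outer(y, y)

def broyden_update_B(B, s, y, phi):
    """Broyden-class update: phi = 0 gives BFGS, phi = 1 gives DFP."""
    return (1.0 - phi) * bfgs_update_B(B, s, y) + phi * dfp_update_B(B, s, y)

# Illustrative SPD matrix and a step with s^T y = 1.375 > 0.
B0 = np.array([[2.0, 0.5], [0.5, 1.0]])
s = np.array([1.0, -0.5])
y = np.array([1.5, 0.25])

updates = {phi: broyden_update_B(B0, s, y, phi) for phi in (0.0, 0.5, 1.0)}
min_eigs = {phi: np.linalg.eigvalsh(Bn).min() for phi, Bn in updates.items()}
```

Every entry of `min_eigs` is positive here, consistent with the SPD-preservation result cited above.
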

Note that the DFP and BFGS updates differ from their predecessor by a rank-2 matrix. A popular alternative called SR1 (symmetric rank-1) produces an update that differs by a rank-1 matrix while preserving symmetry (but with no guarantee of positive definiteness). See pg. 144 for an extensive discussion. SR1, which is also a member of the Broyden class, has many advantages; see pg. 150.
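
For reference, the SR1 formula usually given in textbooks is $B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}$; it is not stated in these notes, so treat the sketch below as an illustration under that assumption. The skip test guarding against a tiny denominator and its threshold `r = 1e-8` are likewise a common safeguard rather than something taken from the text.

```python
import numpy as np

def sr1_update_B(B, s, y, r=1e-8):
    """Symmetric rank-1 update; returns B unchanged when the denominator is unsafe."""
    v = y - B @ s
    denom = v @ s
    # Skip when |v^T s| is tiny relative to ||s|| ||v|| (includes the case v = 0).
    if abs(denom) <= r * np.linalg.norm(s) * np.linalg.norm(v):
        return B.copy()
    return B + np.outer(v, v) / denom

# A step with negative curvature (s^T y < 0), which BFGS and DFP cannot accept.
B0 = np.eye(2)
s = np.array([1.0, 0.0])
y = np.array([-1.0, 0.0])

B1 = sr1_update_B(B0, s, y)
rank_of_change = np.linalg.matrix_rank(B1 - B0)
smallest_eig = np.linalg.eigvalsh(B1).min()
```

Here `B1` satisfies the secant equation and differs from `B0` by a rank-1 matrix, but it has a negative eigenvalue, which illustrates why SR1 does not guarantee positive definiteness.
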

Algorithm 6.2 on page 146 shows the SR1 Trust-Region Method.


