In line search methods, each iteration is given by $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is the search direction and $\alpha_k$ is the step length.
The search direction is often of the form $p_k = -B_k^{-1} \nabla f_k$, where $B_k$ is a symmetric and non-singular matrix. The form of $p_k$ depends on the choice of algorithm.
The ideal step length would be the global minimizer of $f(x_k + \alpha p_k)$ over $\alpha > 0$, but this is generally too expensive to calculate. Instead an inexact line search condition such as the Wolfe conditions can be used:

$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k$$
$$\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k$$

with $0 < c_1 < c_2 < 1$. Here, the first (sufficient decrease) condition ensures that $\alpha_k$ gives a sufficient decrease in $f$, whilst the second (curvature) condition rules out unacceptably short steps. [Nocedal]
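The sketch below shows one simple way to find a step length satisfying both Wolfe conditions by bisection. The function name, default constants and the bisection scheme are illustrative choices rather than anything from the notes; practical codes use the interpolation-based line search described in [Nocedal].

```python
import numpy as np

def wolfe_line_search(f, grad, x, p, c1=1e-4, c2=0.9, alpha=1.0, max_iter=50):
    """Find a step length satisfying both Wolfe conditions by simple bisection.

    Illustrative sketch only; practical implementations use interpolation-based
    line searches (Nocedal & Wright).
    """
    phi0 = f(x)
    dphi0 = grad(x) @ p                # directional derivative at alpha = 0
    lo, hi = 0.0, np.inf
    for _ in range(max_iter):
        if f(x + alpha * p) > phi0 + c1 * alpha * dphi0:
            hi = alpha                 # sufficient decrease fails: shrink the step
            alpha = 0.5 * (lo + hi)
        elif grad(x + alpha * p) @ p < c2 * dphi0:
            lo = alpha                 # curvature condition fails: lengthen the step
            alpha = 2.0 * alpha if np.isinf(hi) else 0.5 * (lo + hi)
        else:
            return alpha               # both Wolfe conditions hold
    return alpha                       # fallback: best attempt after max_iter
```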
Steepest Descent is a simple method in which the search direction $p_k$ is set to $-\nabla f_k$, i.e. the direction along which $f$ decreases most rapidly; a minimal sketch of the resulting iteration is given after the list below.
- Advantages:
- Low storage requirements
- Easy to compute
- Disadvantages:
- Slow convergence for nonlinear problems
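As a concrete illustration, a minimal Steepest Descent loop might look as follows. It reuses the `wolfe_line_search` sketch above; the test function, starting point and tolerance are illustrative assumptions.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=10000):
    """Steepest Descent: p_k = -grad f(x_k), step length from a Wolfe line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:    # stop once the gradient is (nearly) zero
            break
        p = -g                         # direction of steepest descent
        alpha = wolfe_line_search(f, grad, x, p)
        x = x + alpha * p
    return x

# Example: minimise f(x, y) = (x - 1)^2 + 10 (y - x^2)^2, whose minimiser is (1, 1)
f = lambda v: (v[0] - 1)**2 + 10 * (v[1] - v[0]**2)**2
grad = lambda v: np.array([2 * (v[0] - 1) - 40 * v[0] * (v[1] - v[0]**2),
                           20 * (v[1] - v[0]**2)])
x_star = steepest_descent(f, grad, np.array([-1.0, 1.0]))
```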
Conjugate Gradient methods have a faster convergence rate than Steepest Descent but avoid the high computational cost of methods where the inverse Hessian is calculated.
Given an iterate $x_0$, evaluate $f_0 = f(x_0)$, $\nabla f_0 = \nabla f(x_0)$.
Set $p_0 \leftarrow -\nabla f_0$, $k \leftarrow 0$.
Then while $\nabla f_k \neq 0$:
Carry out a line search to compute the next iterate, then evaluate $\nabla f_{k+1}$ and use this to determine the subsequent conjugate direction $p_{k+1} = -\nabla f(x_{k+1}) + \beta_k p_k$.
Different variations of the Conjugate Gradient algorithm use different formulas for $\beta_k$, for example:

Fletcher-Reeves:
$$\beta_k^{FR} = \frac{\nabla f_{k+1}^T \nabla f_{k+1}}{\nabla f_k^T \nabla f_k}$$

Polak-Ribière:
$$\beta_k^{PR} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{\nabla f_k^T \nabla f_k}$$
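A minimal Fletcher-Reeves implementation of this loop might look as follows. It reuses the `wolfe_line_search` sketch above; the function name and stopping tolerance are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient_fr(f, grad, x0, tol=1e-6, max_iter=1000):
    """Nonlinear Conjugate Gradient with the Fletcher-Reeves choice of beta."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                   # p_0 = -grad f_0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # Strong Wolfe conditions with c2 < 1/2 keep p a descent direction for
        # Fletcher-Reeves; the bisection sketch above is used purely for illustration.
        alpha = wolfe_line_search(f, grad, x, p, c2=0.45)
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves formula
        p = -g_new + beta * p
        g = g_new
    return x
```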
- Advantages:
- Considered to be one of the best general purpose methods.
- Faster convergence rate compared to Steepest Descent, and only requires evaluation of the objective function and its gradient - no matrix operations.
- Disadvantages:
- For the Fletcher-Reeves method it can be shown that if the method generates a bad direction and step, then the next direction and step are also likely to be bad. However, this is not the case with the Polak-Ribière method.
- Generally, the Polak-Ribière method is more efficient than the Fletcher-Reeves method, but it has the disadvantage of requiring one more vector of storage.
BFGS is the most popular quasi-Newton method; it uses an approximation to the Hessian (maintained below in inverse form as $H_k$) rather than the true Hessian used in a Newton line search method.
Starting with an initial inverse Hessian approximation $H_0$ and starting point $x_0$:
While $\|\nabla f_k\| > \epsilon$:
Compute the search direction $p_k = -H_k \nabla f_k$.
Then find the next iterate $x_{k+1}$ by performing a line search.
Next, define $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$, then compute

$$H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$$

with $\rho_k = \dfrac{1}{y_k^T s_k}$.
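A minimal BFGS loop following this update might look like the sketch below. It again reuses the `wolfe_line_search` sketch; taking the identity matrix as $H_0$ and the stopping tolerance are illustrative assumptions.

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-6, max_iter=500):
    """BFGS with the inverse-Hessian update H_{k+1} given above."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                      # H_0: identity as the initial approximation
    I = np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        p = -H @ g                     # search direction p_k = -H_k grad f_k
        alpha = wolfe_line_search(f, grad, x, p)
        x_new = x + alpha * p
        g_new = grad(x_new)
        s = x_new - x                  # s_k = x_{k+1} - x_k
        y = g_new - g                  # y_k = grad f_{k+1} - grad f_k
        rho = 1.0 / (y @ s)
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x
```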
- Advantages:
- Superlinear rate of convergence
- Has self-correcting properties - if there is a bad estimate for $H_k$, then it will tend to correct itself within a few iterations.
- No need to compute the Jacobian or Hessian.
- Disadvantages:
- Newton's method has quadratic convergence but this is lost with BFGS.
The Gauss-Newton method is a modified Newton's method with line search for non-linear least-squares problems. Instead of solving the standard Newton equations

$$\nabla^2 f(x_k)\, p = -\nabla f(x_k),$$

solve the system

$$J_k^T J_k\, p_k^{GN} = -J_k^T r_k$$

(where $r_k$ is the residual vector and $J_k$ its Jacobian) to obtain the search direction $p_k^{GN}$. The next iterate is then set as $x_{k+1} = x_k + p_k^{GN}$.
Here, the approximation of the Hessian $\nabla^2 f_k \approx J_k^T J_k$ has been made: for the least-squares objective $f(x) = \tfrac{1}{2}\|r(x)\|^2$ the true Hessian is $J_k^T J_k + \sum_j r_j(x_k)\nabla^2 r_j(x_k)$, so dropping the second term saves computation time as second derivatives are not calculated.
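A minimal sketch of this scheme is given below. The names `r` and `jac` (user-supplied residual vector and Jacobian) and the tolerance are placeholders, and solving the normal equations via a least-squares solve on $J_k p = -r_k$ is an implementation choice rather than anything prescribed in the notes.

```python
import numpy as np

def gauss_newton(r, jac, x0, tol=1e-8, max_iter=100):
    """Gauss-Newton for min 0.5 * ||r(x)||^2 using J^T J p = -J^T r."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J = jac(x)                       # Jacobian of the residuals at x_k
        res = r(x)                       # residual vector r_k
        g = J.T @ res                    # gradient of 0.5 * ||r||^2
        if np.linalg.norm(g) < tol:
            break
        # Solve J^T J p = -J^T r; a least-squares solve of J p = -r is the
        # numerically safer way to do this when J is ill-conditioned.
        p, *_ = np.linalg.lstsq(J, -res, rcond=None)
        x = x + p                        # unit step, as in the text
    return x
```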
- Advantages:
- Calculation of second derivatives is not required.
- If the residuals or their second-order partial derivatives are small, then $J_k^T J_k$ is a close approximation to $\nabla^2 f_k$ and convergence of Gauss-Newton is fast.
- The search direction $p_k^{GN}$ is always a descent direction as long as $J_k$ has full rank and the gradient $\nabla f_k$ is nonzero.
- Disadvantages:
- Without a good initial guess, or if the matrix $J_k^T J_k$ is ill-conditioned, the Gauss-Newton algorithm is very slow to converge to a solution.
- If the relative residuals are large, then a large amount of second-order information is lost in the approximation $\nabla^2 f_k \approx J_k^T J_k$, and convergence can be slow.
- $J_k$ must have full rank.
- Floater
Michael S. Floater (2018), Lecture 13: Non-linear least squares and the Gauss-Newton method, University of Oslo
- Nocedal
Jorge Nocedal, Stephen J. Wright (2006), Numerical Optimization
- Poczos
Barnabas Poczos, Ryan Tibshirani (2012), Lecture 10: Optimization, School of Computer Science, Carnegie Mellon University