In line search methods, each iteration is given by $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is the search direction and $\alpha_k$ is the step length.
The search direction is often of the form $p_k = -B_k^{-1} \nabla f_k$, where $B_k$ is a symmetric and non-singular matrix. The form of $p_k$ depends on the choice of algorithm.
The ideal step length would be the global minimizer of $f(x_k + \alpha p_k)$ over $\alpha > 0$, but this is generally too expensive to calculate. Instead an inexact line search condition such as the Wolfe conditions can be used:

$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k$$
$$\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k$$

with $0 < c_1 < c_2 < 1$. Here, the first (sufficient decrease) condition ensures that $\alpha_k$ gives a sufficient decrease in $f$, whilst the second (curvature) condition rules out unacceptably short steps. [Nocedal]
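The sketch below shows one simple way to find a step length satisfying both Wolfe conditions by bisection. The function name, default constants and the bisection scheme are illustrative choices rather than anything from the notes; practical codes use the interpolation-based line search described in [Nocedal].

```python
import numpy as np

def wolfe_line_search(f, grad, x, p, c1=1e-4, c2=0.9, alpha=1.0, max_iter=50):
    """Find a step length satisfying both Wolfe conditions by simple bisection.

    Illustrative sketch only; practical implementations use interpolation-based
    line searches (Nocedal & Wright).
    """
    phi0 = f(x)
    dphi0 = grad(x) @ p                # directional derivative at alpha = 0
    lo, hi = 0.0, np.inf
    for _ in range(max_iter):
        if f(x + alpha * p) > phi0 + c1 * alpha * dphi0:
            hi = alpha                 # sufficient decrease fails: shrink the step
            alpha = 0.5 * (lo + hi)
        elif grad(x + alpha * p) @ p < c2 * dphi0:
            lo = alpha                 # curvature condition fails: lengthen the step
            alpha = 2.0 * alpha if np.isinf(hi) else 0.5 * (lo + hi)
        else:
            return alpha               # both Wolfe conditions hold
    return alpha                       # fallback: best attempt after max_iter
```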
Steepest Descent is a simple method in which the search direction $p_k$ is set to $-\nabla f_k$, i.e. the direction along which $f$ decreases most rapidly; a minimal sketch of the resulting iteration is given after the list below.
- Advantages:
- Low storage requirements
- Easy to compute
- Disadvantages:
- Slow convergence for nonlinear problems
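As a concrete illustration, a minimal Steepest Descent loop might look as follows. It reuses the `wolfe_line_search` sketch above; the test function, starting point and tolerance are illustrative assumptions.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=10000):
    """Steepest Descent: p_k = -grad f(x_k), step length from a Wolfe line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:    # stop once the gradient is (nearly) zero
            break
        p = -g                         # direction of steepest descent
        alpha = wolfe_line_search(f, grad, x, p)
        x = x + alpha * p
    return x

# Example: minimise f(x, y) = (x - 1)^2 + 10 (y - x^2)^2, whose minimiser is (1, 1)
f = lambda v: (v[0] - 1)**2 + 10 * (v[1] - v[0]**2)**2
grad = lambda v: np.array([2 * (v[0] - 1) - 40 * v[0] * (v[1] - v[0]**2),
                           20 * (v[1] - v[0]**2)])
x_star = steepest_descent(f, grad, np.array([-1.0, 1.0]))
```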
Conjugate Gradient methods have a faster convergence rate than Steepest Descent but avoid the high computational cost of methods where the inverse Hessian is calculated.
Given an iterate $x_0$, evaluate $f_0 = f(x_0)$, $\nabla f_0 = \nabla f(x_0)$.
Set $p_0 \leftarrow -\nabla f_0$, $k \leftarrow 0$.
Then while $\nabla f_k \neq 0$:
Carry out a line search to compute the next iterate, then evaluate $\nabla f_{k+1}$ and use this to determine the subsequent conjugate direction $p_{k+1} = -\nabla f(x_{k+1}) + \beta_k p_k$.
Different variations of the Conjugate Gradient algorithm use different formulas for $\beta_k$, for example:

Fletcher-Reeves:
$$\beta_k^{FR} = \frac{\nabla f_{k+1}^T \nabla f_{k+1}}{\nabla f_k^T \nabla f_k}$$

Polak-Ribière:
$$\beta_k^{PR} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{\nabla f_k^T \nabla f_k}$$
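A minimal Fletcher-Reeves implementation of this loop might look as follows. It reuses the `wolfe_line_search` sketch above; the function name and stopping tolerance are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient_fr(f, grad, x0, tol=1e-6, max_iter=1000):
    """Nonlinear Conjugate Gradient with the Fletcher-Reeves choice of beta."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                   # p_0 = -grad f_0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # Strong Wolfe conditions with c2 < 1/2 keep p a descent direction for
        # Fletcher-Reeves; the bisection sketch above is used purely for illustration.
        alpha = wolfe_line_search(f, grad, x, p, c2=0.45)
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves formula
        p = -g_new + beta * p
        g = g_new
    return x
```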
- Advantages:
- Considered to be one of the best general purpose methods.
- Faster convergence rate compared to Steepest Descent, and only requires evaluation of the objective function and its gradient - no matrix operations.
- Disadvantages:
- For the Fletcher-Reeves method it can be shown that if the method generates a bad direction and step, then the next direction and step are also likely to be bad. However, this is not the case with the Polak-Ribière method.
- Generally, the Polak-Ribière method is more efficient than the Fletcher-Reeves method, but it has the disadvantage of requiring one more vector of storage.
BFGS is the most popular quasi-Newton method; it uses an approximation to the Hessian (maintained below in inverse form as $H_k$) rather than the true Hessian used in a Newton line search method.
Starting with an initial inverse Hessian approximation $H_0$ and starting point $x_0$:
While $\|\nabla f_k\| > \epsilon$:
Compute the search direction $p_k = -H_k \nabla f_k$.
Then find the next iterate $x_{k+1}$ by performing a line search.
Next, define $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$, then compute

$$H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$$

with $\rho_k = \dfrac{1}{y_k^T s_k}$.
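A minimal BFGS loop following this update might look like the sketch below. It again reuses the `wolfe_line_search` sketch; taking the identity matrix as $H_0$ and the stopping tolerance are illustrative assumptions.

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-6, max_iter=500):
    """BFGS with the inverse-Hessian update H_{k+1} given above."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                      # H_0: identity as the initial approximation
    I = np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        p = -H @ g                     # search direction p_k = -H_k grad f_k
        alpha = wolfe_line_search(f, grad, x, p)
        x_new = x + alpha * p
        g_new = grad(x_new)
        s = x_new - x                  # s_k = x_{k+1} - x_k
        y = g_new - g                  # y_k = grad f_{k+1} - grad f_k
        rho = 1.0 / (y @ s)
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x
```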
- Advantages:
- Superlinear rate of convergence
- Has self-correcting properties - if there is a bad estimate for $H_k$, then it will tend to correct itself within a few iterations.
- No need to compute the Jacobian or Hessian.
- Disadvantages:
- Newton's method has quadratic convergence but this is lost with BFGS.
The Gauss-Newton method is a modified Newton's method with line search for non-linear least-squares problems. Instead of solving the standard Newton equations

$$\nabla^2 f(x_k)\, p = -\nabla f(x_k),$$

solve the system

$$J_k^T J_k\, p_k^{GN} = -J_k^T r_k$$

(where $r_k$ is the residual vector and $J_k$ its Jacobian) to obtain the search direction $p_k^{GN}$. The next iterate is then set as $x_{k+1} = x_k + p_k^{GN}$.
Here, the approximation of the Hessian $\nabla^2 f_k \approx J_k^T J_k$ has been made: for the least-squares objective $f(x) = \tfrac{1}{2}\|r(x)\|^2$ the true Hessian is $J_k^T J_k + \sum_j r_j(x_k)\nabla^2 r_j(x_k)$, so dropping the second term saves computation time as second derivatives are not calculated.
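A minimal sketch of this scheme is given below. The names `r` and `jac` (user-supplied residual vector and Jacobian) and the tolerance are placeholders, and solving the normal equations via a least-squares solve on $J_k p = -r_k$ is an implementation choice rather than anything prescribed in the notes.

```python
import numpy as np

def gauss_newton(r, jac, x0, tol=1e-8, max_iter=100):
    """Gauss-Newton for min 0.5 * ||r(x)||^2 using J^T J p = -J^T r."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J = jac(x)                       # Jacobian of the residuals at x_k
        res = r(x)                       # residual vector r_k
        g = J.T @ res                    # gradient of 0.5 * ||r||^2
        if np.linalg.norm(g) < tol:
            break
        # Solve J^T J p = -J^T r; a least-squares solve of J p = -r is the
        # numerically safer way to do this when J is ill-conditioned.
        p, *_ = np.linalg.lstsq(J, -res, rcond=None)
        x = x + p                        # unit step, as in the text
    return x
```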
- Advantages:
- Calculation of second derivatives is not required.
- If the residuals or their second-order partial derivatives are small, then $J_k^T J_k$ is a close approximation to $\nabla^2 f_k$ and convergence of Gauss-Newton is fast.
- The search direction $p_k^{GN}$ is always a descent direction as long as $J_k$ has full rank and the gradient $\nabla f_k$ is nonzero.
- Disadvantages:
- Without a good initial guess, or if the matrix $J_k^T J_k$ is ill-conditioned, the Gauss-Newton algorithm is very slow to converge to a solution.
- If the relative residuals are large, then a large amount of second-order information is lost in the approximation $\nabla^2 f_k \approx J_k^T J_k$, and convergence can be slow.
- $J_k$ must have full rank.
- Floater
Michael S. Floater (2018), Lecture 13: Non-linear least squares and the Gauss-Newton method, University of Oslo
- Nocedal
Jorge Nocedal, Stephen J. Wright (2006), Numerical Optimization
- Poczos
Barnabas Poczos, Ryan Tibshirani (2012), Lecture 10: Optimization, School of Computer Science, Carnegie Mellon University