#  Backtracking Algorithm, Rate of Convergence

We will talk about:
* Review of exact line search and Wolfe conditions
* Backtracking algorithm
* Rate of convergence

---

## Choice of step length

Every line search algorithm beginning at $\mathbf{x}_k$ chooses some direction $\mathbf{p}_k$ in which to step and some positive scalar $\alpha_k$ which determines the step length. That is, set
$$ \mathbf{x}_{k+1}=\mathbf{x}_k+\alpha_k\mathbf{p}_k $$

and repeat until convergence. In the most straightforward algorithm, **steepest descent**, we choose $\mathbf{p}_k=-\nabla f_k$, but the choice of $\alpha_k$ is more complicated. If we want to choose an *optimal* value of $\alpha_k$ in the direction $\mathbf{p}_k$, i.e. **exact line search**, we define

$$\phi(\alpha)=f(\mathbf{x}_k+\alpha \mathbf{p}_k)$$

(a "slice" of the function along the $\mathbf{p}_k$ direction, as shown below) and determine $\alpha_k>0$ such that $\phi'(\alpha_k)=0$.

<img src="https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-11184-7_5/MediaObjects/447852_1_En_5_Fig2_HTML.png" width="75%" />

In general it is difficult to solve this equation for an arbitrary starting point $\mathbf{x}_k$, particularly for very high-dimensional nonlinear, nonconvex functions. We showed last time that even for the simple convex function $g(x,y)=(x-1)^2+(2y-1)^2$, the optimal value of $\alpha_k$ is given by the rather complicated expression
$$ \alpha_k=\frac{(x_k-1)^2+4(2y_k-1)^2}{2(x_k-1)^2+32(2y_k-1)^2}$$
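As a sanity check on the expression above, the following sketch compares it with two alternatives at one sample point: the general exact step for a quadratic along the steepest-descent direction, $\alpha=\nabla f^T\nabla f/(\nabla f^T H\nabla f)$, and a brute-force minimization of $\phi(\alpha)$ on a fine grid. The helper names, the sample point $(3,2)$, and the grid over $[0,1]$ are assumptions made for illustration only.

```python
import numpy as np

# Test function from the text, its gradient, and its (constant) Hessian
g = lambda x, y: (x - 1)**2 + (2*y - 1)**2
Dg = lambda x, y: np.array([2*(x - 1), 4*(2*y - 1)])
H = np.diag([2.0, 8.0])

def alpha_closed_form(x, y):
    """Closed-form exact step length quoted in the text."""
    return ((x - 1)**2 + 4*(2*y - 1)**2) / (2*(x - 1)**2 + 32*(2*y - 1)**2)

def alpha_quadratic(x, y):
    """Exact step for a quadratic along p = -grad: (grad.grad)/(grad.H.grad)."""
    grad = Dg(x, y)
    return grad @ grad / (grad @ H @ grad)

def alpha_brute_force(x, y, n=200001):
    """Minimize phi(alpha) on a fine grid over [0, 1] (illustration only)."""
    grad = Dg(x, y)
    alphas = np.linspace(0.0, 1.0, n)
    phi = g(x - alphas*grad[0], y - alphas*grad[1])
    return alphas[np.argmin(phi)]

xk, yk = 3.0, 2.0  # hypothetical sample point
print(alpha_closed_form(xk, yk), alpha_quadratic(xk, yk), alpha_brute_force(xk, yk))
```

All three values agree at roughly $0.135$ for this point. The brute-force version only works here because $\phi$ is cheap to evaluate, which is exactly what fails for large problems.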


In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
alpha = lambda x,y: ((x-1)**2+4*(2*y-1)**2)/(2*(x-1)**2+32*(2*y-1)**2)

x,y=np.random.rand()*5,np.random.rand()*5
print(f'At {x}, {y}\nalpha={alpha(x,y)}') # random point in [0,5]x[0,5]

In [None]:
# define the function
g = lambda x,y: (x-1)**2 + (2*y-1)**2
# define derivatives of g to make the gradient
Dg = lambda x,y: np.array([2*(x-1), 4*(2*y-1)])

In [None]:
%%time
# plot the figure first
plt.figure(figsize=(6, 6))
X = np.linspace(0,5,300)  # 300 evenly spaced points on x-axis [0,5]
Y = np.linspace(0,5,300)  # 300 evenly spaced points on y-axis [0,5]
Xmesh, Ymesh = np.meshgrid(X,Y)  # 300x300 grid of points defined by X and Y above
Z = g(Xmesh,Ymesh)
CS = plt.contour(Xmesh, Ymesh, Z, 20, cmap='jet')
plt.clabel(CS,inline_spacing=0,fmt='%d')
plt.axis([0,5,0,5])
plt.xlabel('x')
plt.ylabel('y')

x0 = np.random.rand(2)*5  # initial point randomly chosen
x = x0.copy()
print(f'Initial x={x}')
dx = np.array([np.inf,np.inf]) # initial large gradient so while loop runs
tol = 1e-3            # stop when gradient is smaller than this amount
max_steps = 100       # Maximum number of steps to run the iteration
i=0                   # iteration count
while np.linalg.norm(dx)>tol and i<max_steps:
    dx = Dg(x[0],x[1])
    # exact step length at the current point (note alpha is a function here!)
    a = alpha(x[0],x[1])
    # new value of x
    xnew = x - a*dx
    # add arrow to plot
    plt.arrow(x[0],x[1],-a*dx[0],-a*dx[1],color='b',
                      head_width=.1,length_includes_head=True)
    # update old value
    x = xnew
    # update iteration count
    i += 1
    # report the step length actually used in this iteration
    print(f'In iteration {i}, alpha={a}, and newx={x}')

print(f'After {i} iterations, approximate minimum is {g(x[0],x[1])} at {x}')
plt.title('Exact line search')
plt.show()

### Wolfe conditions

It is not feasible to perform exact line search for a general function since determining the optimal $\alpha_k$ requires excessive analytic calculation even for low-dimensional functions. Instead of calculating the optimal step size, we introduced the **Wolfe conditions** to determine a "good enough" step size, i.e. one that guarantees convergence in a reasonable number of iterations. Recall that the Wolfe conditions are given by

* Wolfe I (Armijo condition): $\qquad f_{k+1}\le f_k+c_1\alpha_k\mathbf{p}_k^T\nabla f_k \qquad \leftrightarrow\qquad \phi(\alpha_k)\le l(\alpha_k)$
* Wolfe II (curvature condition): $\quad \mathbf{p}_k^T\nabla f_{k+1}\ge c_2\mathbf{p}_k^T\nabla f_k \qquad\quad \leftrightarrow\qquad \phi'(\alpha_k)\ge c_2\phi'(0)$

where $0<c_1<c_2<1$ and $l(\alpha)=f_k+c_1\alpha\mathbf{p}_k^T\nabla f_k$ is the line on the right-hand side of Wolfe I. Together they generally provide an upper bound (Wolfe I) and a lower bound (Wolfe II) for $\alpha_k$ that guarantee convergence. Note, however, that the "acceptable" values of $\alpha_k$ may form a disjoint set, particularly for highly non-convex functions, as shown in the image below:

<img src="https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-11184-7_5/MediaObjects/447852_1_En_5_Fig7_HTML.png" width="50%" />
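Checking the two conditions for a given $\alpha$ is straightforward. The sketch below evaluates Wolfe I and Wolfe II on the test function $g$ for a few step lengths along the steepest-descent direction. The function name `wolfe_conditions`, the sample point, the trial step lengths, and the default values `c1=1e-4`, `c2=0.9` are assumptions for illustration. The example call uses $c_1=0.1$ and $c_2=0.9$.

```python
import numpy as np

def wolfe_conditions(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Return (wolfe_I, wolfe_II) for step length alpha along direction p."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    slope0 = p @ grad(x)                  # phi'(0) = p^T grad f_k
    x_new = x + alpha*p
    sufficient_decrease = f(x_new) <= f(x) + c1*alpha*slope0   # Wolfe I
    curvature = p @ grad(x_new) >= c2*slope0                   # Wolfe II
    return bool(sufficient_decrease), bool(curvature)

# Test function from the text, written for array input
f = lambda v: (v[0] - 1)**2 + (2*v[1] - 1)**2
grad = lambda v: np.array([2*(v[0] - 1), 4*(2*v[1] - 1)])

x = np.array([3.0, 2.0])   # hypothetical sample point
p = -grad(x)               # steepest-descent direction
for alpha in [0.01, 0.1, 0.2, 0.5]:
    print(alpha, wolfe_conditions(f, grad, x, p, alpha, c1=0.1, c2=0.9))
```

For this point the smallest step fails only Wolfe II, the largest fails only Wolfe I, and the two middle values satisfy both, which matches the lower-bound and upper-bound roles described above.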

It is hard to make a general statement about the step size, but if one could be made, it would be that the upper bound provided by Wolfe I has a much higher relative importance than the Wolfe II lower bound. We are always guaranteed convergence as $\alpha_k\to0$, although for excessively small step sizes, convergence may take forever and/or end up at a suboptimal minimum. If convergence (even suboptimal) is all we care about, we can never have *too small* of a step size.

On the other hand, choosing a large but not *too large* step size can allow the iteration to "jump out" of a local valley and reach a lower minimum, possibly even the global one. However, choosing too large a step size can prevent convergence altogether, even causing the function value to *increase* in successive iterations. Since Wolfe I is the condition that provides an upper limit, it is therefore generally of much higher importance.

### Backtracking algorithm

Though the Wolfe conditions provide theoretical upper and lower bounds on $\alpha_k$, in practice determining these bounds is essentially as difficult as determining the optimal value with exact line search, since we still have to solve a generally nonlinear equation. However, the Wolfe conditions have one advantage over exact line search: although they are difficult to solve analytically, it is very easy to check whether they are satisfied for a given $\alpha_k$. Because we want to be as aggressive as possible to allow the possibility of jumping out of a local valley, we can propose a relatively large $\tilde\alpha_k$ to begin with, then repeatedly shrink it, a process called **backtracking**, until Wolfe I is met, ensuring sufficient decrease and thus convergence. The **backtracking algorithm** is as follows:

1. Set a large initial value $\tilde\alpha$, e.g. $\tilde\alpha=1$.
2. Check if Wolfe I is satisfied, i.e. if $f(\mathbf{x}_k+\tilde\alpha\mathbf{p}_k)\le f_k+c_1\tilde\alpha\mathbf{p}_k^T\nabla f_k$.
3. If true, terminate. If false, set $\tilde\alpha\leftarrow\rho\tilde\alpha$, for $\rho\in(0,1)$.
4. Repeat 2-3 until true.
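The steps above can be sketched as a standalone function for a general descent direction $\mathbf{p}_k$, using the full $\mathbf{p}_k^T\nabla f_k$ term rather than assuming steepest descent. The name `backtracking`, its argument order, and the `max_backtracks` safeguard are assumptions added for illustration.

```python
import numpy as np

def backtracking(f, grad_fk, xk, pk, alpha0=1.0, rho=0.75, c1=0.1, max_backtracks=50):
    """Shrink alpha by rho until Wolfe I holds; return (alpha, number of backtracks)."""
    fk = f(xk)
    slope = pk @ grad_fk          # p_k^T grad f_k, negative for a descent direction
    alpha = alpha0
    for j in range(max_backtracks):
        if f(xk + alpha*pk) <= fk + c1*alpha*slope:   # Wolfe I satisfied
            return alpha, j
        alpha *= rho                                  # backtrack
    return alpha, max_backtracks  # safeguard: stop after max_backtracks shrinks

# Test function from the text, written for array input
f = lambda v: (v[0] - 1)**2 + (2*v[1] - 1)**2
grad = lambda v: np.array([2*(v[0] - 1), 4*(2*v[1] - 1)])

xk = np.array([3.0, 2.0])         # hypothetical sample point
gk = grad(xk)
alpha, j = backtracking(f, gk, xk, -gk)
print(f'alpha={alpha} after {j} backtracks')
```

At this sample point Wolfe I with $c_1=0.1$ requires $\alpha\le 144/592\approx0.243$, so starting from $\tilde\alpha=1$ with $\rho=0.75$ the loop accepts $\alpha=0.75^5\approx0.237$ after 5 backtracks.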

Below we implement the backtracking algorithm on our test function $g(x,y)$ with the same initial starting point as exact line search above and compare the results. We choose $c_1=0.1$ and $\rho=0.75$ with an initial value of $\tilde\alpha=1$.

In [None]:
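def WolfeI(alpha,f,x,p,c1=0.1):
    '''Return True if Wolfe condition I holds for the given alpha, else False.

    Assumes steepest descent, p = -grad f(x), so that p^T grad f = -p.p
    '''
    LHS = f(x[0]+alpha*p[0], x[1]+alpha*p[1])   # f(x_k + alpha*p_k)
    RHS = f(x[0],x[1])-c1*alpha*np.dot(p,p)     # f_k + c1*alpha*p^T grad f_k with p = -grad f_k
    return LHS <= RHS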

In [None]:
%%time
# plot the figure first
plt.figure(figsize=(6, 6))
X = np.linspace(0,5,300)  # 300 evenly spaced points on x-axis [0,5]
Y = np.linspace(0,5,300)  # 300 evenly spaced points on y-axis [0,5]
Xmesh, Ymesh = np.meshgrid(X,Y)  # 300x300 grid of points defined by X and Y above
Z = g(Xmesh,Ymesh)
CS = plt.contour(Xmesh, Ymesh, Z, 20, cmap='jet')
plt.clabel(CS,inline_spacing=0,fmt='%d')
plt.axis([0,5,0,5])
plt.xlabel('x')
plt.ylabel('y')

x = x0.copy()         # same initial point as before
print(f'Initial x={x}')
dx = np.array([np.inf,np.inf]) # initial large gradient so while loop runs
tol = 1e-3            # stop when gradient is smaller than this amount
max_steps = 100       # Maximum number of steps to run the iteration
rho = 0.75            # parameter for backtracking algorithm
i=0                   # iteration count
while np.linalg.norm(dx)>tol and i<max_steps:
    dx = Dg(x[0],x[1])
    
    # backtracking
    a = 1
    j = 0   # keep track of how many backtracking iterations
    while not WolfeI(a,g,x,-dx):
        a *= rho
        j += 1

    # new value of x
    xnew = x - a*dx
    
    # add arrow to plot
    plt.arrow(x[0],x[1],-a*dx[0],-a*dx[1],color='b',head_width=.1,length_includes_head=True)
    # update old value
    x = xnew
    # update iteration count
    i += 1
    print(f'In iteration {i}, alpha={a} after {j} backtracks, and newx={x}')

print(f'After {i} iterations, approximate minimum is {g(x[0],x[1])} at {x}')
plt.title('Backtracking')
plt.show()

We see that even though the backtracking takes more iterations, the total amount of time taken – at least for this very simple function – is comparable to exact line search, and we didn't need to take the time to calculate any values of $\alpha_k$ by hand.

One downside to our backtracking implementation is that we do not check the Wolfe condition II, so we may or may not satisfy it at any given point. We could in principle also check this condition, perhaps iteratively *increasing* $\alpha$ by a small amount after the initial backtracking step if Wolfe II is not satisfied. However, we would then run the risk of breaking Wolfe I satisfaction, thus having to backtrack yet again. Put bluntly, the amount of extra code and computation that would be required to also satisfy Wolfe II is probably more trouble than it is worth since having one or two iterations with step sizes slightly too small will not change the overall convergence.

Moreover, since Wolfe II requires calculating the gradient not just at the current iterate but also at the proposed next iterate, $\nabla f_{k+1}$, it is more computationally costly. For large systems with thousands or even millions of variables, calculating the gradient quickly becomes the bottleneck of the process, so avoiding this calculation as much as possible is a good rule of thumb. This is why, for large functions (e.g. [artificial neural networks](https://en.wikipedia.org/wiki/Artificial_neural_network)), $\alpha_k$ is often chosen to be some "small enough" constant throughout all iterations, avoiding all calculations of the gradient except for the determination of $\mathbf{p}_k$ itself.
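To see what "small enough" means on our test function, here is a minimal sketch of steepest descent with a constant step length. For a quadratic, a fixed step converges only if $\alpha<2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest Hessian eigenvalue. For $g$ the Hessian is $\mathrm{diag}(2,8)$, so the threshold is $0.25$. The function name `fixed_step_descent`, the starting point, and the three trial values of $\alpha$ are assumptions for illustration.

```python
import numpy as np

# Gradient of the test function g from the text, written for array input
Dg = lambda v: np.array([2*(v[0] - 1), 4*(2*v[1] - 1)])

def fixed_step_descent(grad, x0, alpha, tol=1e-3, max_steps=100):
    """Steepest descent with a constant step length; return (x, steps taken)."""
    x = np.asarray(x0, dtype=float)
    for i in range(max_steps):
        dx = grad(x)
        if np.linalg.norm(dx) <= tol:   # converged: gradient is small enough
            return x, i
        x = x - alpha*dx                # only one gradient evaluation per step
    return x, max_steps

for alpha in [0.05, 0.2, 0.3]:
    x, steps = fixed_step_descent(Dg, [3.0, 2.0], alpha)
    print(f'alpha={alpha}: stopped after {steps} steps at {x}')
```

The two step lengths below $0.25$ reach the minimizer $(1, 0.5)$, with the smaller one needing more iterations, while $\alpha=0.3$ makes the $y$ component grow at every step until the iteration limit is hit.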

---

## Rate of Convergence

We have mentioned the idea of converging "fast enough" several times, e.g. in relation to Wolfe condition II, but can we actually quantify the rate of convergence? Indeed, we define the rate of convergence $p\ge1$ to be the value such that

$$ 0< \lim_{k\to\infty} \frac{\|\mathbf{x}_{k+1}-\mathbf{x}^*\|}{\|\mathbf{x}_k-\mathbf{x}^*\|^p} \equiv L < \infty $$

where $\mathbf{x}^*$ is the known minimizer, i.e. the terminal point satisfying $\nabla f(\mathbf{x}^*)=\mathbf{0}$. If $p=1$ and the limit is also equal to 1, we say the convergence is **sub-linear**. If $p=1$ and the limit is smaller than 1, we say the convergence is **linear**. If $p>1$, we say the convergence is **super-linear**. If the limit is equal to 0 for all $p\ge1$, we say the order of convergence is $\infty$.
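To make these definitions concrete, the sketch below evaluates the ratio $e_{k+1}/e_k^p$, with $e_k=\|\mathbf{x}_k-\mathbf{x}^*\|$, near the end of three model error sequences. The sequences and the helper `last_ratio` are hypothetical examples chosen because their behavior is known in closed form.

```python
def last_ratio(e, p):
    """Return e_{k+1} / e_k**p for the last two entries of the error list e."""
    return e[-1] / e[-2]**p

# Model error sequences e_k = ||x_k - x*|| (hypothetical examples)
sublinear = [1.0/k for k in range(1, 1001)]     # e_k = 1/k
linear    = [0.5**k for k in range(1, 31)]      # e_k = (1/2)^k
quadratic = [0.5**(2**k) for k in range(1, 7)]  # e_{k+1} = e_k^2

for name, e in [('sub-linear', sublinear), ('linear', linear), ('quadratic', quadratic)]:
    r1, r2 = last_ratio(e, 1), last_ratio(e, 2)
    print(f'{name:>10}: ratio with p=1 is {r1:.3g}, with p=2 is {r2:.3g}')
```

With $p=1$ the ratio approaches 1 for the first sequence (sub-linear), stays at $0.5$ for the second (linear), and tends to 0 for the third. With $p=2$ the first two ratios blow up, while the third equals 1, so that sequence has order $p=2$.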

We have shown in lecture that convergence for steepest descent is linear at worst, though in certain cases it may converge faster. We will show later that other methods converge more quickly than steepest descent, and generally will investigate the convergence properties of any new method we define. One drawback to this analysis is that we are required to know the eventual minimizer $\mathbf{x}^*$ to even begin calculating. But if we knew the minimizer, we wouldn't have to perform any iterative calculations in the first place! Of course there are certain situations where we can analytically find $\mathbf{x}^*$ (e.g. quadratic functions, as in lecture), but those are few and far between. Here we describe a way to calculate the order of convergence of any method on any objective function, provided that method does converge.

Suppose that we have run an iterative procedure to the point of convergence after $N$ iterations and kept track of the sequence $\{\mathbf{x}_k\}_{k=0}^N$. Then we can simply assume $\mathbf{x}^*\equiv\mathbf{x}_N$ and find a value of $p$ for which the limit above is finite and non-zero. But how to find that $p$?

Note that if we define
$$ e_k = \|\mathbf{x}_k-\mathbf{x}^*\| $$
then according to the limit above, as $k\to\infty$,
$$ e_{k+1}\approx L e_k^p $$

Then, since this is true of any large $k$, we can simply shift indices down and take a ratio to get

$$ \frac{e_{k+1}}{e_k} \approx \frac{Le_k^p}{Le_{k-1}^p} = \left(\frac{e_k}{e_{k-1}}\right)^p $$

which upon solving for $p$ yields
$$ p\approx \frac{\ln(e_{k+1}/e_k)}{\ln(e_k/e_{k-1})} $$

Furthermore, it can be shown that as $k\to\infty$,
$$ \frac{e_{k+1}}{e_k} =\frac{\|\mathbf{x}_{k+1}-\mathbf{x}^*\|}{\|\mathbf{x}_k-\mathbf{x}^*\|} \approx \frac{\|\mathbf{x}_{k+1}-\mathbf{x}_k\|}{\|\mathbf{x}_k-\mathbf{x}_{k-1}\|} $$
so that
$$ p\approx \frac{\ln\left(\frac{\|\mathbf{x}_{k+1}-\mathbf{x}_k\|}{\|\mathbf{x}_k-\mathbf{x}_{k-1}\|}\right)}{\ln\left(\frac{\|\mathbf{x}_{k}-\mathbf{x}_{k-1}\|}{\|\mathbf{x}_{k-1}-\mathbf{x}_{k-2}\|}\right)} $$
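Before applying this estimate to an optimization run, it can be checked on sequences whose order is known. The sketch below wraps the formula in a hypothetical helper, `estimate_order`, and applies it to Newton's iteration for $\sqrt2$ (known to converge quadratically) and to the fixed-point iteration $x\leftarrow x/2+1$, whose error halves at every step. Both test sequences are assumptions chosen for illustration.

```python
import numpy as np

def estimate_order(path):
    """Estimate p from the successive differences ||x_{k+1}-x_k|| of the iterates."""
    path = np.asarray(path, dtype=float).reshape(len(path), -1)
    d = np.linalg.norm(np.diff(path, axis=0), axis=1)
    # p_k = ln(d_{k+1}/d_k) / ln(d_k/d_{k-1}) for every usable k
    return np.log(d[2:]/d[1:-1]) / np.log(d[1:-1]/d[:-2])

# Newton's iteration for sqrt(2): quadratic convergence expected
xs = [3.0]
for _ in range(5):
    xs.append(0.5*(xs[-1] + 2.0/xs[-1]))

# Fixed-point iteration x <- x/2 + 1 (limit 2): linear convergence expected
ys = [0.0]
for _ in range(10):
    ys.append(0.5*ys[-1] + 1.0)

print(estimate_order(xs))   # estimates should approach 2
print(estimate_order(ys))   # estimates should equal 1
```

Only a few Newton steps are used on purpose: once the differences reach round-off level they become zero or noisy, and the logarithms in the estimate are no longer meaningful.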


Below, we calculate the rate of convergence for our favorite function $g(x,y)$, this time decreasing the tolerance in the gradient to $10^{-8}$ and relaxing the value of $c_1$ in the Wolfe condition to $0.001$ to force the algorithm to take a decently large number of steps:

In [None]:
x = x0.copy()         # same initial point as before
path = [x]
print(f'Initial x={x}')
dx = np.array([np.inf,np.inf]) # initial large gradient so while loop runs
tol = 1e-8            # stop when gradient is smaller than this amount
max_steps = 1000      # Maximum number of steps to run the iteration
rho = 0.75            # parameter for backtracking algorithm
i=0                   # iteration count
while np.linalg.norm(dx)>tol and i<max_steps:
    dx = Dg(x[0],x[1])
    
    # backtracking
    a = 1
    j = 0   # keep track of how many backtracking iterations
    while not WolfeI(a,g,x,-dx,c1=0.001):
        a *= rho
        j += 1

    # new value of x
    xnew = x - a*dx
    path.append(xnew)
    
    # update old value
    x = xnew
    # update iteration count
    i += 1

path=np.array(path)
print(f'After {i} iterations, approximate minimum is {g(x[0],x[1])} at {x}')

In [None]:
print(path[:20,:])  # first 20 steps

In [None]:
print(np.diff(path,axis=0)[:20])  # x_{k+1}-x_k

In [None]:
err = np.linalg.norm(np.diff(path,axis=0),axis=1) # ||x_{k+1}-x_k||
print(err[:20]) # error in first 20 steps

In [None]:
pp=np.zeros(len(err)-2)
for i in range(len(pp)):
    pp[i]=(np.log(err[i+2]/err[i+1])/np.log(err[i+1]/err[i]))
    
p=np.mean(pp[-10:])  # p is mean of last 10 iterations
    
plt.plot(pp)
plt.plot(pp*0+p)
plt.xlabel('k')
plt.ylabel('p')
plt.title(f'p={p}')
plt.show()

In [None]:
err[-1]/err[-2]   # limit value

Indeed we see that the algorithm does converge linearly, since $p\approx1$ and the limiting ratio satisfies $L<1$.