# Lecture notes for 2020-04-06

## Nonlinear equations and optimization

If $f : \mathbb{R}^n \rightarrow \mathbb{R}^n$, then solving the system $f(x) = 0$
is equivalent to minimizing $\|f(x)\|^2$ whenever a solution exists.  Similarly, if $g : \mathbb{R}^n
\rightarrow \mathbb{R}$ is continuously differentiable, then any local
minimizer $x_*$ satisfies the nonlinear equations $\nabla g(x_*) = 0$.
There is thus a close connection between nonlinear equation solving
on the one hand and optimization on the other, and methods used for
one problem can serve as the basis for methods for the other.

As with nonlinear equations, the one-dimensional case is the simplest,
and is the right place to start our discussion.  It also matters beyond
one dimension: as with the solution of nonlinear equations, our main strategy for
dealing with multi-variable optimization problems will be to find a
promising search direction and then solve (approximately) a
one-dimensional line search problem.

For illustrating the basics, we will use the objective function
$g(x) = \cos(x) \log(x)$.

In [None]:
using Plots
using LinearAlgebra

In [None]:
gtest(x) = cos(x) * log(x)
xx = range(0.1, stop=2*π, length=100)
plot(xx, gtest.(xx), legend=false)

A little calculus gives us
$$
  g'(x) = -\sin(x) \log(x) + \cos(x)/x
$$
and
$$
  g''(x) = -\cos(x) \log(x) - 2\sin(x)/x - \cos(x)/x^2.
$$

In [None]:
dgtest(x) = -sin(x)*log(x) + cos(x)/x
d2gtest(x) = -cos(x)*log(x) - 2*sin(x)/x - cos(x)/x^2

I usually code up finite difference sanity checks when coding derivatives,
so let's not let this be an exception.  We discussed in the last lecture the
centered finite difference approximation for a first derivative
$$
  f'(x) = \frac{f(x+h)-f(x-h)}{2h} + O(h^2),
$$
and we can similarly derive a finite difference approximation for the second
derivative
$$
  f''(x) = \frac{f(x+h)-2f(x)+f(x-h)}{h^2} + O(h^2).
$$
Some sense of magnitudes is useful in cases like this.  The second derivative
gets pretty big for values of $x$ near zero, and there is a lot of cancellation
in these formulas, so we don't necessarily expect that a tiny finite difference step
will give great results.  Fortunately, we are not currently using the finite
difference formulas in lieu of the true derivatives, but as a sanity check to
make sure that we did not swap a negative sign or add a factor of two somewhere.

In [None]:
deriv_fd(f, h)  = (x) -> (f(x+h)-f(x-h))/2/h
deriv2_fd(f, h) = (x) -> (f(x+h)-2*f(x)+f(x-h))/h^2

dgtest_fd  = deriv_fd(gtest, 1e-4)
d2gtest_fd = deriv2_fd(gtest, 1e-4)

dg_diff = maximum(abs.(dgtest.(xx) - dgtest_fd.(xx)))
d2g_diff = maximum(abs.(d2gtest.(xx) - d2gtest_fd.(xx)))

println("Check on g': $(dg_diff)")
println("Check on g'': $(d2g_diff)")
plot(xx, abs.(d2gtest.(xx) - d2gtest_fd.(xx)), yscale=:log10, legend=false)
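
To get a feel for the trade-off between truncation error and cancellation mentioned above, one can scan over step sizes at a single point. The following is a minimal sketch; the evaluation point `x0 = 1.0`, the list of step sizes, and the helper name `fd_errors` are choices made here for illustration and are not part of the original notes.

```julia
# Test function and its exact derivatives (same formulas as above)
g(x)   = cos(x) * log(x)
dg(x)  = -sin(x)*log(x) + cos(x)/x
d2g(x) = -cos(x)*log(x) - 2*sin(x)/x - cos(x)/x^2

# Absolute errors of the centered difference formulas at x0 for step h
function fd_errors(x0, h)
    d1 = (g(x0+h) - g(x0-h)) / (2h)
    d2 = (g(x0+h) - 2*g(x0) + g(x0-h)) / h^2
    return abs(d1 - dg(x0)), abs(d2 - d2g(x0))
end

x0 = 1.0   # illustrative evaluation point (an assumption)
for h in (1e-1, 1e-2, 1e-3, 1e-4, 1e-6, 1e-8)
    e1, e2 = fd_errors(x0, h)
    println("h = $h: error in g' = $e1, error in g'' = $e2")
end
```

One would typically expect the errors to shrink roughly like $h^2$ for moderate $h$ and then grow again once cancellation dominates, with the second derivative formula degrading earlier because it divides by $h^2$.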

## Minimization via Newton

Suppose $g : {\mathbb{R}}\rightarrow {\mathbb{R}}$ has at least two
continuous derivatives. If we can compute $g'$ and $g''$, then one of
the simplest ways to find a local minimum is to use Newton iteration to
find a stationary point: $$x_{k+1} = x_k - \frac{g'(x_k)}{g''(x_k)}.$$
Geometrically, this is equivalent to finding the maximum (or minimum) of
a second-order Taylor expansion about $x_{k}$; that is, $x_{k+1}$ is
chosen to minimize (or maximize)
$$\hat{g}(x_{k+1}) = g(x_k) + g'(x_k)(x_{k+1}-x_k) + \frac{1}{2} g''(x_k) (x_{k+1}-x_k)^2.$$
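
To see why the Newton step extremizes this model, it may help to differentiate the quadratic with respect to the step. Writing $p = x_{k+1}-x_k$ (a symbol introduced here for brevity), a short sketch of the computation is:

$$
\begin{aligned}
\hat{g}(x_k + p) &= g(x_k) + g'(x_k) p + \tfrac{1}{2} g''(x_k) p^2, \\
\frac{d\hat{g}}{dp} &= g'(x_k) + g''(x_k) p = 0
\quad\Longrightarrow\quad p = -\frac{g'(x_k)}{g''(x_k)}.
\end{aligned}
$$

The stationary point of the model is a minimum when $g''(x_k) > 0$ and a maximum when $g''(x_k) < 0$.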

In [None]:
function newton_v1(x, dg, d2g; dtol=1e-6, atol=1e-6, nsteps=100, monitor=(x)->nothing)
    monitor(x)
    for k = 1:nsteps
        dgx = dg(x)
        p = -dgx/d2g(x)
        x += p
        monitor(x)
        if abs(dgx) < dtol || abs(p) < atol
            return x
        end
    end
    error("Did not converge within $(nsteps) steps")
end

In [None]:
function plot_newton_v1(x0, xrange=(0.1, 2*π))
    xhist = Array{Float64,1}([])
    x = newton_v1(x0, dgtest, d2gtest, monitor=(x)->push!(xhist,x))
    xx = range(xrange[1], stop=xrange[2], length=100)
    l = @layout [a b]  
    p1 = plot(xx, gtest.(xx), xlabel="x", ylabel="g(x)", legend=false)
    plot!([x], [gtest(x)], marker=true)
    yy = abs.(dgtest.(xhist))
    p2 = plot(yy[yy .> 0], yscale=:log10, xlabel="k", ylabel="|g'(x_k)|", legend=false)
    plot(p1, p2, layout=l)
end
plot_newton_v1(3)

As in the case of root finding, we can illustrate the behavior of the method by plotting successive quadratic approximations.

In [None]:
xhist = Array{Float64,1}([])
xref = newton_v1(2.7, dgtest, d2gtest, monitor=(x)->push!(xhist,x))
yy = abs.(dgtest.(xhist))

anim = @animate for (i, x) in enumerate(xhist)
    err = abs(dgtest(x))
    l = @layout [a b]    
    p1 = plot(xx, gtest.(xx), legend=false)
    plot!([xref], [gtest(xref)], marker=true)
    plot!([x], [gtest(x)], marker=true)
    plot!(xx, gtest(x) .+ dgtest(x)*(xx.-x) .+ d2gtest(x)/2*(xx.-x).^2, linestyle=:dash)
    plot!([x - dgtest(x)/d2gtest(x)], [gtest(x)-dgtest(x)^2/d2gtest(x)/2], marker=true, markercolor="white")
    plot!(yaxis=[-2,2])
    p2 = plot(yy[yy .> 0], yscale=:log10, legend=false)
    if err > 0
        plot!([i], [err], marker=true)
    end
    plot(p1, p2, layout=l)   
end
gif(anim, fps=1)

There are two gotchas in using Newton iteration in this way. We have
already run into the first issue: Newton’s method is only locally
convergent. We can take care of that problem by combining Newton with
bisection, or by scaling down the length of the Newton step. But there
is another issue, too: saddle points and local maxima are also
stationary points!

In [None]:
plot_newton_v1(2)

There is a simple precaution we can take to avoid converging to a
maximum: insist that $g(x_{k+1}) < g(x_k)$. If
$x_{k+1} = x_k - \alpha_k u$ for some $\alpha_k > 0$, then
$$g(x_{k+1}) - g(x_k) = -\alpha_k g'(x_k) u + O(\alpha_k^2).$$ So if
$g'(x_k) u > 0$, then $-u$ is a *descent direction*, and thus
$g(x_{k+1}) < g(x_k)$ provided $\alpha_k$ is small enough. 
The direction $-g'(x)$ is always a descent direction when $g'(x) \neq 0$.  Note that if
$x_k$ is not a stationary point, then $-u = -g'(x_k)/g''(x_k)$ is a
descent direction iff $g'(x_k) u = g'(x_k)^2 / g''(x_k) > 0$. That is,
we will only head in the direction of a minimum if $g''(x_k)$ is
positive. Of course, $g''$ will be positive and the Newton step will
take us in the right direction if we are close enough to a strong local
minimum.
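
The descent condition is easy to check numerically. The sketch below evaluates the sign of $g'(x) p$ for the Newton step and for the step $p = -g'(x)$ at $x = 2$; the trial step length `α = 1e-3` is an arbitrary small value chosen for illustration.

```julia
# Test function and derivatives (same formulas as above)
g(x)   = cos(x) * log(x)
dg(x)  = -sin(x)*log(x) + cos(x)/x
d2g(x) = -cos(x)*log(x) - 2*sin(x)/x - cos(x)/x^2

x = 2.0
p_newton = -dg(x)/d2g(x)   # Newton step
p_grad   = -dg(x)          # negative derivative step

# A step p is a descent direction when the slope g'(x)*p is negative
println("g''(x) = ", d2g(x))
println("Newton step:   g'(x)*p = ", dg(x)*p_newton)
println("Gradient step: g'(x)*p = ", dg(x)*p_grad)

# Small trial steps confirm the first-order prediction
α = 1e-3   # illustrative small step length (an assumption)
println("g(x + α*p_newton) - g(x) = ", g(x + α*p_newton) - g(x))
println("g(x + α*p_grad)   - g(x) = ", g(x + α*p_grad) - g(x))
```

At $x = 2$ the second derivative is negative, so the Newton step points uphill while $-g'(x)$ still points downhill.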

In [None]:
function newton_v2(x, g, dg, d2g; dtol=1e-6, atol=1e-6, nsteps=100, monitor=(x)->nothing)
    monitor(x)
    
    # Compute initial value and step
    gx = g(x)
    dgx = dg(x)
    d2gx = d2g(x)
    if d2gx > 0.
        p = -dgx/d2g(x)
    else
        p = -dgx
    end
    α = 1.0

    for k = 1:nsteps

        # Evaluate attempted step
        gxnew = g(x+α*p)
        if gxnew < gx
            x += α*p
            gx = gxnew
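            dgx = dg(x)
            d2gx = d2g(x)  # Refresh the curvature at the new point
            if d2gx > 0.
                p = -dgx/d2gx   # Newton step is a descent direction
            else
                p = -dgx        # Fall back to the negative derivative
            end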
            α = 1.0
            monitor(x)
            if abs(dgx) < dtol || abs(p) < atol
                return x
            end
        else
            α *= 0.5
        end
    end
    error("Did not converge within $(nsteps) steps")
end

function plot_newton_v2(x0, xrange=(0.1,2*π))
    xhist = Array{Float64,1}([])
    x = newton_v2(x0, gtest, dgtest, d2gtest, monitor=(x)->push!(xhist,x))
    xx = range(xrange[1], stop=xrange[2], length=100)
    l = @layout [a b]  
    p1 = plot(xx, gtest.(xx), xlabel="x", ylabel="g(x)", legend=false)
    plot!([x], [gtest(x)], marker=true)
    yy = abs.(dgtest.(xhist))
    p2 = plot(yy[yy .> 0], yscale=:log10, xlabel="k", ylabel="|g'(x_k)|", legend=false)
    plot(p1, p2, layout=l)
end
    
plot_newton_v2(2)

In [None]:
plot_newton_v1(2)

#### Questions

1.  In the animation code above, explain the line
            plot!([x - dgtest(x)/d2gtest(x)], [gtest(x)-dgtest(x)^2/d2gtest(x)/2], marker=true, markercolor="white")

*Answer*: 

2.  What happens if you start the `newton_v2` code at the initial guess of $x_0 = 2$?

*Answer*: 

## Approximate bisection and golden section

Assuming that we can compute first derivatives, minimizing in 1D reduces
to solving a nonlinear equation, possibly with some guards to prevent
the solver from wandering toward a solution that does not correspond to
a minimum. We can solve the nonlinear equation using Newton iteration,
secant iteration, bisection, or any combination thereof, depending how
sanguine we are about computing second derivatives and how much we are
concerned with global convergence. But what if we don’t even want to
compute first derivatives?

To make our life easier, let’s suppose we know that $g$ is twice
continuously differentiable and unimodal on $[a,b]$, with a unique minimum at some
$x_* \in [a,b]$; that is, $g'(x) < 0$ for $a \leq x < x_*$ and
$g'(x) > 0$ for $x_* < x \leq b$. But how can we get a handle on $g'$
without evaluating it? The answer lies in the mean value theorem.
Suppose we evaluate $g(a)$, $g(b)$, and $g(x)$ for some $x \in (a,b)$.
What can happen?

1.  If $g(a)$ is smallest ($g(a) < g(x) \leq g(b)$), then by the mean
    value theorem, $g'$ must be positive somewhere in $(a,x)$.
    Therefore, $x_* < x$.

2.  If $g(b)$ is smallest, $x_* > x$.

3.  If $g(x)$ is smallest, we only know $x_* \in [a,b]$.

Cases 1 and 2 are terrific, since they mean that we can improve our
bounds on the location of $x_*$. But in case 3, we have no improvement.
Still, this is promising. What could we get from evaluating $g$ at
*four* distinct points $a < x_1 < x_2 < b$? There are really two
cases, both of which give us progress.

1.  If $g(x_1) < g(x_2)$ (i.e. $g(a)$ or $g(x_1)$ is smallest) then
    $x_* \in [a, x_2]$.

2.  If $g(x_1) > g(x_2)$ (i.e. $g(b)$ or $g(x_2)$ is smallest) then
    $x_* \in [x_1,b]$.

We could also conceivably have $g(x_1) = g(x_2)$, in which case the
minimum must occur somewhere in $(x_1,x_2)$.

There are now a couple of options. We could choose $x_1$ and $x_2$ to be
very close to each other, thereby nearly bisecting the interval in
every case. This is essentially equivalent to performing a step of
bisection to find a root of $g'$, where $g'$ at the midpoint is
estimated by a finite difference approximation. With this method, we
require two function evaluations to bisect the interval, which means we
narrow the interval by a factor of $1/\sqrt{2} \approx 0.71$ per evaluation.
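
A minimal sketch of this approximate bisection follows. The function name `approx_bisection`, the half-separation `δ` of the two sample points, and the returned evaluation count are all introduced here for illustration; the sketch assumes $g$ is unimodal on $[a,b]$.

```julia
# Approximate bisection: compare two nearby function values around the
# midpoint to estimate the sign of g', then keep the half interval that
# must contain the minimizer.  Assumes g is unimodal on [a,b].
function approx_bisection(g, a, b; atol=1e-6, δ=1e-8)
    nevals = 0
    while b - a > 2*atol
        m = (a + b)/2
        g1, g2 = g(m - δ), g(m + δ)
        nevals += 2
        if g1 < g2
            b = m + δ     # g is rising at m: minimizer in [a, m+δ]
        else
            a = m - δ     # g is falling at m: minimizer in [m-δ, b]
        end
    end
    return (a + b)/2, nevals
end

gtest(x) = cos(x) * log(x)
xmin, nevals = approx_bisection(gtest, 2.0, 5.0)
println("minimizer ≈ $xmin after $nevals evaluations")
```

The separation `δ` must be small relative to `atol` but large enough that the two function values differ by more than rounding error, which is the same tension seen in the finite difference checks earlier.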

We can do a little better with a *golden section search*, which uses
$x_2 = a+(b-a)/\phi$ and $x_1 = b + (a-b)/\phi$, where $\phi =
(1+\sqrt{5})/2$ (the *golden ratio*). We then narrow to the interval
$[a,x_2]$ or to the interval $[x_1,b]$. This only narrows the interval
by a factor of $\phi^{-1}$ (or about 62%) at each step. But in the
narrower interval, we get one of the two interior function values “for
free” from the previous step, since $x_1 = a+(x_2-a)/\phi$ (the old
$x_1$ is the new $x_2$ in $[a,x_2]$) and $x_2 = b+(x_1-b)/\phi$ (the old
$x_2$ is the new $x_1$ in $[x_1,b]$). Thus, each step after the first only
costs one function evaluation.
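
The reuse of an interior point rests on the identity $1/\phi^2 = 1 - 1/\phi$. A quick numerical check, using an arbitrary bracket $[2,5]$ chosen here for illustration, might look like this:

```julia
ϕ = (1 + sqrt(5))/2
a, b = 2.0, 5.0               # illustrative bracket (an assumption)
x1 = b + (a - b)/ϕ
x2 = a + (b - a)/ϕ

# Narrowing to [a, x2]: the new upper interior point is the old x1
new_x2 = a + (x2 - a)/ϕ
# Narrowing to [x1, b]: the new lower interior point is the old x2
new_x1 = b + (x1 - b)/ϕ

println("old x1 = $x1, new x2 in [a,x2] = $new_x2")
println("old x2 = $x2, new x1 in [x1,b] = $new_x1")
println("shrink factor = ", (x2 - a)/(b - a), " vs 1/ϕ = ", 1/ϕ)
```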

In [None]:
function golden_section(g, a, b; atol=1e-6, monitor=(a,b)->nothing)
    monitor(a,b)
    ga = g(a)
    gb = g(b)
    ϕ = (1+sqrt(5))/2
    x1 = b+(a-b)/ϕ
    x2 = a+(b-a)/ϕ
    g1 = g(x1)
    g2 = g(x2)
    while abs(b-a) > 2*atol
        if g1 < g2
            b, x2 = x2, x1
            gb, g2 = g2, g1
            x1 = b+(a-b)/ϕ
            g1 = g(x1)
        elseif g1 > g2
            a, x1 = x1, x2
            ga, g1 = g1, g2
            x2 = a+(b-a)/ϕ
            g2 = g(x2)
        else
            # Tie: the minimizer lies in (x1, x2), so shrink to that interval
            a, b = x1, x2
            ga, gb = g1, g2
            x1 = b+(a-b)/ϕ
            x2 = a+(b-a)/ϕ
            g1 = g(x1)
            g2 = g(x2)
        end
        monitor(a,b)
    end
    return (a+b)/2
end

In [None]:
ab_hist = Array{Tuple{Float64,Float64},1}([])
golden_section(gtest, 2, 5, monitor=(a,b) -> push!(ab_hist, (a,b)))

In [None]:
ϕ = (1+sqrt(5))/2
xmids = [(a+b)/2 for (a,b) in ab_hist]
plot(abs.(dgtest.(xmids)), yscale=:log10, legend=false)
plot!(abs(dgtest(3.5)) * ϕ.^-(0:length(xmids)-1), linestyle=:dash)

## Successive parabolic interpolation

Bisection and golden section searches are only linearly convergent. Of
course, these methods only use coarse information about the relative
sizes of function values at the sample points. In the case of
root-finding, we were able to get a superlinearly convergent algorithm,
the secant iteration, by replacing the linear approximation used in
Newton’s method with a linear interpolant. We can do something similar
in the case of optimization by interpolating $g$ with a *quadratic*
passing through three points, and then finding a new guess based on the
minimum of that quadratic. This *method of successive parabolic interpolation* 
does converge locally superlinearly. But even when $g$
is unimodal, successive parabolic interpolation must generally be
coupled with something slower but more robust (like golden section
search) in order to guarantee good convergence.

In [None]:
function simple_spi(g, a, b; atol=1e-6, nsteps=100, monitor=(a,b,c) -> nothing)
    c = (a+b)/2
    monitor(a, b, c)
    ga = g(a)
    gb = g(b)
    gc = g(c)
    for k = 1:nsteps
        g_ab = (ga-gb)/(a-b)
        g_bc = (gb-gc)/(b-c)
        g_abc = (g_ab-g_bc)/(a-c)
        x = (a+b-g_ab/g_abc)/2
        a, b, c = b, c, x
        ga, gb, gc = gb, gc, g(x)
        monitor(a, b, c)
        if abs(b-c) < atol
            return c
        end
    end
    error("Did not converge within given number of steps")
end
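
The update in `simple_spi` is the stationary point of the interpolating parabola written in Newton form. As a sketch of the derivation, with $g[\cdot,\cdot]$ denoting divided differences (notation introduced here, matching `g_ab`, `g_bc`, and `g_abc` in the code):

$$
\begin{aligned}
p(x) &= g(a) + g[a,b](x-a) + g[a,b,c](x-a)(x-b), \\
p'(x) &= g[a,b] + g[a,b,c](2x - a - b) = 0
\quad\Longrightarrow\quad
x = \frac{1}{2}\left( a + b - \frac{g[a,b]}{g[a,b,c]} \right),
\end{aligned}
$$

where $g[a,b] = (g(a)-g(b))/(a-b)$ and $g[a,b,c] = (g[a,b]-g[b,c])/(a-c)$. The stationary point is a minimum of the parabola only when $g[a,b,c] > 0$, which the simple code does not check.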

In [None]:
chist = Array{Float64,1}([])
x = simple_spi(gtest, 3, 4, atol=1e-8, monitor=(a,b,c) -> push!(chist, c))

In [None]:
plot(abs.(dgtest.(chist)), yscale=:log10)

## Chebyshev surrogates (optional)

This section is *completely* optional, and uses a fair amount of mathematics that is only touched
on in this course.  However, it connects beautifully to our earlier discussion of the uses
of eigenvalue problems (and the closely related problem of polynomial root finding).
We therefore present it for your reading pleasure.

For sufficiently smooth functions, a reasonable strategy for root-finding and optimization is to first
approximate the target function by a polynomial, then solve the equivalent problem for the polynomial.
So far, when talking about polynomial approximation, we have mostly discussed the power basis,
but a more stable approach for approximation on the interval $[-1, 1]$ is to use
the *Chebyshev polynomials*
$$
  T_{j+1}(x) = 2x T_j(x) - T_{j-1}(x)
$$
with $T_0(x) = 1$ and $T_1(x) = x$.  We can fit the coefficients in a Chebyshev series
$$
  f(x) \approx \sum_{j=0}^{N-1} a_j T_j(x)
$$
by a discrete cosine transform.
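
The connection to cosines comes from the identity $T_j(\cos\theta) = \cos(j\theta)$, which is not derived in these notes but is easy to check numerically against the recurrence. In the sketch below, the helper name `cheb_T` and the test angle `θ = 0.7` are choices made for illustration.

```julia
# Evaluate T_0(x), ..., T_n(x) by the three-term recurrence
function cheb_T(n, x)
    T = zeros(n+1)
    T[1] = 1.0
    n >= 1 && (T[2] = x)
    for j = 2:n
        T[j+1] = 2x*T[j] - T[j-1]
    end
    return T
end

θ = 0.7                       # arbitrary angle for the check
T = cheb_T(5, cos(θ))
for j = 0:5
    println("T_$(j)(cos θ) = $(T[j+1]),  cos($(j)θ) = $(cos(j*θ))")
end
```

Sampling $f$ at the points $x_k = \cos(\pi (k + 1/2)/N)$ therefore turns the sums against $T_j(x_k)$ into sums against $\cos(j \pi (k+1/2)/N)$, which is the form of a discrete cosine transform.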

In [None]:
function chebfit(f, N)
    a = zeros(N)
    x = cos.(π*((0:N-1) .+ 0.5)/N)
    fx = f.(x)
    a[1] = sum(fx)/N
    a[2] = 2*sum(x.*fx)/N
    Tm = 0.0*x.+1.0
    Tp = x
    for k = 3:N
        Tp, Tm = 2*(x.*Tp)-Tm, Tp
        a[k] = 2*sum(Tp.*fx)/N
    end
    return a
end

function chebeval(a, x)
    f = a[1] .+ a[2]*x
    Tm = 0.0*x.+1.0
    Tp = x
    for k = 3:length(a)
        Tp, Tm = 2*(x.*Tp)-Tm, Tp
        f += a[k]*Tp
    end
    return f
end

Chebyshev approximation of our test function on an interval not too near zero does a pretty good job.

In [None]:
gtests(x) = gtest(0.5*(1-x)/2 + 5*π*(1+x)/2)
a = chebfit(gtests, 30)
xxs = range(-1.0, stop=1.0, length=100)
maxerr = maximum(abs.(gtests.(xxs) - chebeval(a,xxs)))
println("Max approx error: $maxerr; a[N] = $(a[end])")
plot(xxs, gtests.(xxs), label="g(x)")
plot!(xxs, chebeval(a, xxs), label="Approx g(x)", linestyle=:dash)

In [None]:
plot(abs.(a), yscale=:log10)

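The approximation by a (truncated) Chebyshev series is ultimately a polynomial approximation,
and we can compute the roots of a polynomial as the eigenvalues of an associated matrix for which the
polynomial is the characteristic polynomial (up to scaling).  For a polynomial in the Chebyshev basis, this is
the *colleague matrix*, the analogue of the companion matrix for the power basis.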

In [None]:
function chebzeros(a)
    N = length(a)
    C = zeros(N-1, N-1)
    C[1,:] = -a[end-1:-1:1]/a[end]
    for j = 2:N-1
        C[j,j-1] += 1.0
        C[j-1,j] += 1.0
    end
    C[N-1,N-2] = 2.0
    C /= 2.0
    x = eigvals(C)
    x = real.(x[imag.(x) .== 0])
    x = x[x .>= -1]
    x = x[x .<= 1]
end

In [None]:
xroot = chebzeros(a)
plot(xxs, gtests.(xxs), label="g(x)")
plot!(xxs, chebeval(a, xxs), label="Approx g(x)", linestyle=:dash)
scatter!(xroot, 0*xroot, label="Roots")

Of course, our goal in the current lecture is to find not the zeros of a function, but the extrema (and particularly
the minima).  To do this, we want the roots of the derivative of the approximant $\hat{g}(x)$.  The coefficients of
the derivative can be computed by a backward recurrence on the coefficients of $\hat{g}$; we do not attempt to derive
this rather inscrutable-looking formula here, but rather refer to Appendix A, "A Bestiary of Basis Functions", of
the delightful book of Boyd (*Chebyshev and Fourier Spectral Methods*, second ed).

In [None]:
function chebderiv(a)
    N = length(a)
    b = zeros(N-1)
    for k = N-1:-1:1
        b[k] = 2*k*a[k+1] + (k+2 < N ? b[k+2] : 0)
    end
    b[1] /= 2
    return b[1:N-1]
end
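
For reference, the recurrence that `chebderiv` appears to implement can be stated as follows in zero-based indexing (the code stores $a_k$ in `a[k+1]` and $b_k$ in `b[k+1]`). If $\hat{g}(x) = \sum_{k=0}^{N-1} a_k T_k(x)$ and $\hat{g}'(x) = \sum_{k=0}^{N-2} b_k T_k(x)$, then

$$
  b_{k-1} = b_{k+1} + 2k\, a_k, \quad k = N-1, N-2, \ldots, 1,
  \qquad b_N = b_{N-1} = 0,
$$

followed by halving $b_0$. As a small worked check, take $\hat{g} = T_3 = 4x^3 - 3x$, so $a_3 = 1$ and all other $a_k = 0$:

$$
  b_2 = 6, \quad b_1 = 0, \quad b_0 = \tfrac{1}{2}(b_2 + 2 a_1) = 3,
  \qquad 3 T_0(x) + 6 T_2(x) = 3 + 6(2x^2 - 1) = 12x^2 - 3 = T_3'(x).
$$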

In [None]:
dgtests(x) = (5*π-0.5)/2 * dgtest(0.5*(1-x)/2 + 5*π*(1+x)/2)
b = chebderiv(a)
xxs = range(-1.0, stop=1.0, length=100)
maxerr = maximum(abs.(dgtests.(xxs) - chebeval(b,xxs)))
println("Max approx error: $maxerr; b[N] = $(b[end])")
plot(xxs, dgtests.(xxs), label="g'(x)")
plot!(xxs, chebeval(b, xxs), label="Approx g'(x)", linestyle=:dash)

Having computed the coefficients for $\hat{g}'(x)$, we can now compute all the critical points
of $\hat{g}$, i.e. all $x$ such that $\hat{g}'(x) = 0$.

In [None]:
xroot = chebzeros(b)
plot(xxs, gtests.(xxs), label="g(x)")
plot!(xxs, chebeval(a, xxs), label="Approx g(x)", linestyle=:dash)
scatter!(xroot, gtests.(xroot), label="Extrema")

To find the global minimum of $g(x)$ on $[-1,1]$, then, we have the following proposed approach:

1.  Approximate $g(x)$ by a polynomial $\hat{g}(x)$ expressed in a Chebyshev basis.
2.  Find the critical points of $\hat{g}(x)$.
3.  Return the critical point (or end point) for which $\hat{g}(x)$ is minimal.

One can further refine this computation with a few steps of Newton iteration (or approximate Newton iteration
in which $\hat{g}'(x)$ is used to estimate $g'(x)$).  Of course, this method can be fooled if one uses too coarse
an approximation to $g$.
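
The three steps above can be sketched in a few lines. To keep the example runnable on its own, it uses compact variants of the helpers defined earlier; the names with a trailing underscore and the driver `cheb_minimize` are introduced here for illustration, and the fit uses the identity $T_j(\cos\theta) = \cos(j\theta)$ rather than the recurrence.

```julia
using LinearAlgebra

# Compact, self-contained variants of the helpers above (the trailing
# underscore marks them as names introduced for this sketch).
function chebfit_(f, N)
    θ = π*((0:N-1) .+ 0.5)/N
    fx = f.(cos.(θ))
    a = [2*sum(cos.(j*θ) .* fx)/N for j = 0:N-1]   # T_j(cos θ) = cos(jθ)
    a[1] /= 2
    return a
end

# Evaluate the Chebyshev series at a scalar x in [-1, 1]
chebeval_(a, x) = sum(a[j+1]*cos(j*acos(x)) for j = 0:length(a)-1)

# Coefficients of the derivative by the backward recurrence
function chebderiv_(a)
    N = length(a)
    b = zeros(N-1)
    for k = N-1:-1:1
        b[k] = 2*k*a[k+1] + (k+2 < N ? b[k+2] : 0.0)
    end
    b[1] /= 2
    return b
end

# Real roots in [-1, 1] via the eigenvalues of the associated matrix
function chebzeros_(a)
    N = length(a)
    C = zeros(N-1, N-1)
    C[1,:] = -a[end-1:-1:1]/a[end]
    for j = 2:N-1
        C[j,j-1] += 1.0
        C[j-1,j] += 1.0
    end
    C[N-1,N-2] = 2.0
    λ = eigvals(C/2)
    x = real.(λ[imag.(λ) .== 0])
    return x[(x .>= -1) .& (x .<= 1)]
end

# Steps 1-3: fit, find critical points of the fit, compare with end points
function cheb_minimize(f, N)
    a = chebfit_(f, N)
    candidates = vcat(-1.0, chebzeros_(chebderiv_(a)), 1.0)
    vals = [chebeval_(a, x) for x in candidates]
    return candidates[argmin(vals)]
end

# Test function mapped from [0.5, 5π] to [-1, 1], as in the cells above
tmap(x) = 0.5*(1-x)/2 + 5*π*(1+x)/2
gs(x) = cos(tmap(x)) * log(tmap(x))
xs = cheb_minimize(gs, 30)
println("minimizer on [-1,1]: $xs; original variable: $(tmap(xs))")
```

On this interval the true function is smallest at the right end point $5\pi$, so the sketch should return a point at or very near $x = 1$; this is a reminder that the end points need to be included among the candidates.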

## Problems to ponder

1.  Suppose I know $f(0)$, $f(1)$, and a bound $|f''| < M$ on $[0,1]$.
    Under what conditions could $f$ possibly have a local minimum in
    $[0,1]$?

*Answer*: 

2.  Suppose $f(x)$ is approximated on $[0,1]$ by a polynomial $p(x)$ of degree at most $d$,
    and we know that $|f(x)-p(x)| < \delta$ on the interval. Using a polynomial zero-finding function,
    how would we find tight subintervals of $[0,1]$ in which the global minimum of $f(x)$ might lie?

*Answer*: 