# Lecture notes for 2020-04-17

## Life beyond Newton

Newton’s method has many attractive properties, but Newton steps may not
be cheap. At each step, we need to:

-   Form the function $f$ *and* the Jacobian. This involves not only
    computational work, but also analytical work – someone needs to
    figure out those derivatives!

-   Solve a linear system with the Jacobian. This is no easier than any
    other linear solve problem! Indeed, it may be rather expensive for
    large systems, and factorization costs cannot (in general) be
    amortized across Newton steps.

The Jacobian (or the Hessian if we are looking at optimization problems)
is the main source of difficulty. Now we consider several iterations
that deal with this difficulty in one way or another.

## A running example, redux

It is always helpful to illustrate methods with an actual example.
We will continue to work with the example from last time of a nonlinear
reaction-diffusion problem:
$$
  f_i(v) \equiv \frac{v_{i-1}-2v_i+v_{i+1}}{h^2} + \exp(v_i) = 0,
$$
with $h = (N+1)^{-1}$ and $v_0 = v_{N+1} = 0$.

In [None]:
using LinearAlgebra
using Plots

In [None]:
function ϕ_autocatalytic(v)
    N = length(v)
    C = 0.5*(N+1)^2
    # Boundary term from v_0 = 0; each exp(v[j]) is subtracted exactly once below
    ϕ = C*v[1]^2
    for j = 1:N-1
        ϕ += C*(v[j]-v[j+1])^2 - exp(v[j])
    end
    ϕ += C*v[N]^2 - exp(v[N])
    return ϕ
end

function autocatalytic(v)
    N = length(v)
    fv        = exp.(v)
    fv        -= 2*(N+1)^2*v
    fv[1:N-1] += (N+1)^2*v[2:N  ]
    fv[2:N  ] += (N+1)^2*v[1:N-1]
    fv
end

function Jautocatalytic(v)
    N = length(v)
    SymTridiagonal(exp.(v) .- 2*(N+1)^2, (N+1)^2 * ones(N-1))
end

In the last lecture, we used an initial guess of the form
$$
  v_i^0 = \alpha x_i \left( 1-x_i \right) = \alpha q_i, \quad x_i \equiv \frac{i}{N+1}
$$
and then tried various values of $\alpha$.  An alternative is to characterize the solution
as a stationary point of the objective $\phi$ and try to find $\alpha$ such that $\phi(\alpha q)$
is a minimum (or at least stationary).  It is helpful to get a picture, and an estimate, by first doing a plot.

In [None]:
N = 100
xx = range(1, N, length=N)/(N+1)
q = xx.*(1.0 .- xx)

αs = range(0, 20, length=1001)
ϕs = [ϕ_autocatalytic(α*q) for α in αs]

println("Optimal α (stable eq):   $(αs[argmin(ϕs)])")
println("Optimal α (unstable eq): $(αs[argmax(ϕs)])")

plot(αs, ϕs, legend=false)

#### Questions

1.  Write a one-dimensional Newton iteration for finding critical points of $\phi(\alpha q)$.
    Use the initial guesses of $0$ and $15$.

## Analyzing Newton and almost-Newton iterations (optional)

In these notes, we will be somewhat careful about the analysis, but in
general you are *not* responsible for remembering this level of
detail. We will try to highlight the points that are important in
practice for understanding when solvers might run into trouble, and why.

A common theme in the analysis of “almost Newton” iterations is that we
can build on Newton convergence.  We assume throughout
that $f$ is $C^1$ and the Jacobian is Lipschitz with constant $M$.
To simplify life, we will also assume that $\|f'(x)^{-1}\| \leq B$
in some neighborhood of a desired $x^*$ such that $f(x^*) = 0$.
Consider what happens when we subtract the equation defining
the Newton step from a Taylor expansion *with remainder* of $f(x^*)$
centered at $x$:
$$\begin{aligned}
  f(x) + f'(x) &p(x) = 0 \\
  -[f(x) + f'(x) &(x^*-x) + R(x) = 0] \\\hline
  f'(x) &[p(x) - (x^*-x)] - R(x) = 0
\end{aligned}$$
or
$$
  p(x) = -(x-x^*) + f'(x)^{-1} R(x) = -(x-x^*) + d(x)
$$
Under the bounded inverse hypothesis and Lipschitz continuity of $f'$,
we know
$$
  \|x + p(x) - x^*\| = \|d(x)\| \leq \frac{BM}{2} \|x-x^*\|^2,
$$
and so the iteration $x \mapsto x + p(x)$ converges quadratically from
starting points near enough $x^*$.  Moreover, a sufficient condition for
convergence is that the initial error is less than $2/(BM)$. 
This differs from our earlier bound of $2/(3BM)$ only because we assumed
a uniform bound on the inverse of the Jacobian in the relevant region rather
than assuming a bound at the solution and using the Lipschitz behavior to
get everything else.
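
As a reminder of where the factor $M/2$ comes from, the remainder can be written in integral form and bounded with the Lipschitz condition. This is a sketch that assumes the segment from $x$ to $x^*$ lies in the region where our hypotheses hold, with $B$ the bound on $\|f'(x)^{-1}\|$:
$$\begin{aligned}
  R(x) &= \int_0^1 \left[ f'(x+s(x^*-x)) - f'(x) \right] (x^*-x) \, ds, \\
  \|R(x)\| &\leq \int_0^1 M s \|x^*-x\|^2 \, ds = \frac{M}{2} \|x-x^*\|^2,
\end{aligned}$$
so that $\|d(x)\| \leq \|f'(x)^{-1}\| \|R(x)\| \leq \frac{BM}{2} \|x-x^*\|^2$.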

Now suppose we have an iteration 
$$
  x^{k+1} = x^k + \hat{p}^k
$$ 
where $\hat{p}^k$ is an approximation to the Newton step $p(x^k)$. Subtracting
$x^*$ from both sides and adding $p(x^k)-p(x^k)$ to the right side gives
$$
  e^{k+1} = e^k + p(x^k) - (p(x^k)-\hat{p}^k),
$$
and taking norms gives
$$
  \|e^{k+1}\| \leq \frac{BM}{2} \|e^k\|^2 + \|p(x^k)-\hat{p}^k\|.
$$
Therefore, we can think of our convergence analysis in two steps: 
we first analyze the error in the Newton iteration, 
then analyze how close our approximate Newton step is to a true Newton step.

## Newton iteration

We ran Newton iteration for the autocatalytic problem last time, but let's run it again this
time using an initial guess of $\alpha = 0.5$.  Convergence from this initial guess is extremely rapid.

In [None]:
v = 0.5*q
rhist = []
for k = 1:10
    fv = autocatalytic(v)
    v -= Jautocatalytic(v)\fv
    push!(rhist, norm(fv))
    if norm(fv) < 1e-9
        break
    end
end

rhist_newton = rhist
plot(rhist_newton, yscale=:log10, label="Newton")

#### Questions

1.  What do you observe if you change the tolerance from $10^{-9}$ to $10^{-16}$?  Why?

## Chord iteration

The *chord iteration* is $$x^{k+1} = x^k - f'(x^0)^{-1} f(x^k).$$
Written in this way, the method differs from Newton in only one
character — but what a difference it makes! By re-using the Jacobian at
$x^0$ for all steps, we degrade the progress per step, but each step
becomes cheaper. In particular, we can benefit from re-using a
factorization across several steps (though this is admittedly more of
an issue when the matrix is not tridiagonal!).

In [None]:
rhist = []
v = 0.5*q
J0F = ldlt(Jautocatalytic(v))  # Compute an LDL^T factorization of J
for k = 1:10
    fv = autocatalytic(v)
    v -= J0F\fv
    push!(rhist, norm(fv))
    if norm(fv) < 1e-9
        break
    end
end

rhist_chord = rhist
plot(rhist_newton, yscale=:log10, label="Newton")
plot!(rhist_chord, label="Chord")

In terms of the approximate Newton framework, the chord iteration
uses $f'(x^0)$ in place of $f'(x^k)$, which involves Jacobian errors
$\|E_k\| = \|f'(x^k)-f'(x^0)\| \leq M \|e^0\|$.
Therefore, the iteration is guaranteed to converge for starting points
such that $\|e^0\| < 1/(3BM)$, and the error in successive iterates is
bounded by
$$
  \|e^{k+1}\| \leq \left( \frac{BM(\|e^k\| + \|e^0\|)}{1-BM\|e^0\|}
  \right) \|e^k\| = O(\|e^0\| \|e^k\|).
$$

#### Questions

1.  Run the chord iterations from several different starting values of $\alpha$ to verify that the linear
    rate of convergence depends on the initial error.

## Shamanskii iteration

The chord method involves using one approximate Jacobian forever. The
Shamanskii method involves freezing the Jacobian for $m$ steps before
getting a new Jacobian; that is, one step of Shamanskii looks like
$$\begin{aligned}
  x^{k+1,0} & = x^k \\
  x^{k+1,j+1} &= x^{k+1,j} - f'(x^k)^{-1} f(x^{k+1,j}) \\
  x^{k+1} &= x^{k+1,m}.
\end{aligned}$$ 

In [None]:
rhist = []
v = 0.5*q
JF = ldlt(Jautocatalytic(v))  # Compute an LDL^T factorization of J
for k = 1:10
    fv = autocatalytic(v)
    v -= JF\fv
    push!(rhist, norm(fv))
    if norm(fv) < 1e-9
        break
    end
    if mod(k, 2) == 0
        JF = ldlt(Jautocatalytic(v))
    end
end

rhist_shamanskii = rhist
plot(rhist_newton, yscale=:log10, label="Newton")
plot!(rhist_shamanskii, label="Shamanskii")

Like the chord iteration,
Shamanskii is guaranteed to converge for starting points such that
$\|e^0\| < 1/(3BM)$. The error for each iteration (from $x^k$ to
$x^{k+1}$, not from $x^{k+1,j}$ to $x^{k+1,j+1}$) satisfies
$$
  \|e^{k+1}\|
  \leq \left( \frac{2BM}{1-BM\|e^k\|} \right) \|e^{k}\|^{m+1}
  = O(\|e^k\|^{m+1}).
$$
Beyond the chord and Shamanskii iterations, the
idea of re-using Jacobians occurs in several other methods.
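
The three-line recurrence above can be packaged with the refresh interval $m$ as a parameter. The following is only a sketch: the name `shamanskii`, its keyword arguments, and the toy system `ftoy`/`Jtoy` are made up for illustration and are not part of the lecture code.

```julia
using LinearAlgebra

# Hypothetical helper (not from the lecture code): Shamanskii iteration that
# refreshes the factored Jacobian every m steps.
#   f(x) returns the residual vector, J(x) returns the Jacobian matrix.
function shamanskii(f, J, x0; m=2, tol=1e-9, maxit=50)
    x = copy(x0)
    rhist = Float64[]
    JF = factorize(J(x))            # factor once, then reuse
    for k = 1:maxit
        fx = f(x)
        push!(rhist, norm(fx))
        if norm(fx) < tol
            break
        end
        x -= JF\fx                  # step with the frozen Jacobian
        if mod(k, m) == 0
            JF = factorize(J(x))    # refresh after every m-th step
        end
    end
    return x, rhist
end

# Toy 2-by-2 system, made up for illustration: unit circle intersected with x1 = x2
ftoy(x) = [x[1]^2 + x[2]^2 - 1.0, x[1] - x[2]]
Jtoy(x) = [2*x[1] 2*x[2]; 1.0 -1.0]

x_sham, r_sham = shamanskii(ftoy, Jtoy, [1.0, 0.5], m=2)
```

Setting `m=1` gives Newton, and taking `m` larger than `maxit` gives the chord iteration. Assuming the earlier cells have been run, a call like `shamanskii(autocatalytic, Jautocatalytic, 0.5*q, m=2)` should behave like the loop shown above.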

## Finite-difference Newton

So far, we have assumed that we can compute the Jacobian if we want it.
What if we just don’t want to do the calculus to compute Jacobians? A
natural idea is to approximate each column of the Jacobian by a finite
difference estimate:
$$
  f'(x^k) e_j \approx \frac{f(x^k+he_j)-f(x^k)}{h}.
$$
In general, the more analytic information that we have about the derivatives,
the better off we are.  Even knowing only the sparsity pattern of the Jacobian
gives us a lot of information.  In our example, changing $v_j$ affects
$f_{j-1}$, $f_j$, and $f_{j+1}$, but no other component of $f$.  Hence, we don't actually
need $N+1$ evaluations of $f$ to get the Jacobian; we can do it with four
that are cleverly chosen.
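
For comparison, here is a sketch of the straightforward dense estimate that ignores sparsity and spends one extra evaluation of $f$ per column. The helper name `jacobian_fd` and the toy function `ftoy` are assumptions made for illustration, and `x` is assumed to be a floating-point vector.

```julia
using LinearAlgebra

# Hypothetical helper (not from the lecture code): dense forward-difference
# Jacobian estimate, one extra evaluation of f per column.
function jacobian_fd(f, x, h)
    fx = f(x)
    J = zeros(length(fx), length(x))
    xp = copy(x)
    for j = 1:length(x)
        xp[j] += h                  # perturb coordinate j only
        J[:,j] = (f(xp) - fx)/h     # column j approximates f'(x) e_j
        xp[j] = x[j]                # restore before the next column
    end
    return J
end

# Toy function, made up for illustration, with Jacobian [2x1 2x2; 1 -1]
ftoy(x) = [x[1]^2 + x[2]^2 - 1.0, x[1] - x[2]]
Jfd_toy = jacobian_fd(ftoy, [1.0, 0.5], 1e-6)
err_toy = norm(Jfd_toy - [2.0 1.0; 1.0 -1.0])
```

Here `err_toy` should be on the order of `h`. The cost is `length(x)+1` evaluations of `f`, which is what the tridiagonal structure lets us avoid.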

In [None]:
function Jtridiagonal_fd(f, x, h)
    N = length(x)

    dd = zeros(N)   # Diagonal elements
    dl = zeros(N-1) # Subdiagonal elements
    du = zeros(N-1) # Superdiagonal elements
    
    fx = f(x)
    xp = copy(x)
    for j = 1:3
        xp[:] = x
        xp[j:3:N] .+= h
        df = (f(xp)-fx)/h
        for i = 1:N
            if mod(i-j,3) == 0
                dd[i] = df[i]
            elseif mod(i-j,3) == 1 && i > 1
                dl[i-1] = df[i]
            elseif mod(i-j,3) == 2 && i < N
                du[i] = df[i]
            end
        end
    end

    return Tridiagonal(dl, dd, du)
end

In [None]:
Jfd = Jtridiagonal_fd(autocatalytic, 0.5*q, 1e-6)
Jref = Jautocatalytic(0.5*q)
norm(Jfd-Jref)

Using Lipschitz bounds on $f'$ gives the columnwise error bound
$$
  \left\| f'(x^k) e_j - \frac{f(x^k+he_j)-f(x^k)}{h} \right\| \leq Mh,
$$
and an approximation to $f'(x^k)$ based on finite differences
would therefore have an error $E^k$ with $\|E^k\|_2 \leq \sqrt{n} M h$. The
error in successive iterates is bounded by
$$
  \|e^{k+1}\| \leq
  \left( \frac{BM\|e^k\| + \sqrt{n}BMh}{1-\sqrt{n} BMh} \right) \|e^k\| =
  O(h\|e^k\|).
$$

#### Questions

1.  Can you explain what is going on in the `Jtridiagonal_fd` code above?

## Inexact Newton

So far, we have considered approximations to the Newton step based on
approximation of the Jacobian matrix. What if we instead used the exact
Jacobian matrix, but allowed the update linear systems to be solved
using an iterative solver? In this case, there is a small residual, i.e.
$$
  f'(x^k) \hat{p}^k = -f(x^k) + r^k
$$
where $\|r^k\| \leq \eta_k \|f(x^k)\|$ (i.e. $\eta_k$ is a relative residual
tolerance on the solve). In this case,
$$
  \|\hat{p}^k-p(x^k)\| = \|f'(x^k)^{-1} r^k\| \leq B \|r^k\| \leq
  \eta_k B \|f(x^k)\|.
$$
We also have that
$$
  \|f(x^k)\| = \|f(x^k)-f(x^*)\| = \|f'(\tilde{x}) e^k\| \leq C \|e^k\|
$$
where $C$ is a bound on the norm of $f'$. Thus
$$
  \|\hat{p}^k-p(x^k)\| \leq \eta_k BC \|e^k\|,
$$
which we combine with the bound from the start of the notes to give
$$
  \|e^{k+1}\| \leq B\left(\frac{M}{2} \|e^k\| + \eta_k C\right) \|e^k\|
  = O(\|e^k\|^2) + O(\eta_k \|e^k\|).
$$
Hence, we have the following trade-off. If we solve the systems very 
accurately ($\eta_k$ small), then inexact Newton will behave much like ordinary Newton. 
Thus, we expect to require few steps of the outer, nonlinear iteration; but the
inner iteration (the linear solver) may require many steps to reach an
acceptable residual tolerance. In contrast, if we choose $\eta_k$ to be
some modest constant independent of $k$, then we expect linear
convergence of the outer nonlinear iteration, but each step may run
relatively fast, since the linear systems are not solved to high
accuracy.
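
The trade-off can be tried out with a small experiment. The sketch below is built on several assumptions that are not in the lecture code: the inner solver is a hand-written conjugate gradient loop (which requires a symmetric positive definite Jacobian, so it is not appropriate for every problem), the tolerance $\eta$ is held constant, and the test problem `fcube`/`Jcube` is made up so that the Jacobian is positive definite.

```julia
using LinearAlgebra

# Plain conjugate gradients for a symmetric positive definite A,
# stopped once the residual satisfies ‖b - A*p‖ ≤ η‖b‖.
function cg_solve(A, b, η; maxit=length(b))
    p = zeros(length(b))
    r = copy(b)                     # residual for the zero initial guess
    d = copy(r)                     # search direction
    ρ = dot(r, r)
    for it = 1:maxit
        if sqrt(ρ) <= η*norm(b)
            break
        end
        Ad = A*d
        α = ρ/dot(d, Ad)
        p += α*d
        r -= α*Ad
        ρnew = dot(r, r)
        d = r + (ρnew/ρ)*d
        ρ = ρnew
    end
    return p
end

# Hypothetical helper: Newton with inner solves to relative residual η
function inexact_newton(f, J, x0, η; tol=1e-9, maxit=50)
    x = copy(x0)
    rhist = Float64[]
    for k = 1:maxit
        fx = f(x)
        push!(rhist, norm(fx))
        if norm(fx) < tol
            break
        end
        x += cg_solve(J(x), -fx, η)  # approximate Newton step
    end
    return x, rhist
end

# Made-up test problem with an SPD Jacobian: Tcube*x + x.^3 = 1
n = 20
Tcube = SymTridiagonal(4.0*ones(n), -ones(n-1))
fcube(x) = Tcube*x + x.^3 .- 1.0
Jcube(x) = SymTridiagonal(4.0 .+ 3*x.^2, -ones(length(x)-1))

x_tight, r_tight = inexact_newton(fcube, Jcube, zeros(n), 1e-8)
x_loose, r_loose = inexact_newton(fcube, Jcube, zeros(n), 0.1)
```

Plotting `r_tight` and `r_loose` on a log scale is one way to compare the outer convergence for small and modest $\eta$. For a Jacobian that is not positive definite, a different Krylov method would be needed in place of `cg_solve`.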

One attractive feature of Krylov subspace solvers for the Newton system
is that they only require matrix-vector multiplies with the Jacobian,
also known as directional derivative computations. We can approximate
these directional derivatives by finite differences to get a method that
may be rather more attractive than computing a full Jacobian
approximation by finite differencing. However, it is necessary to use a
Krylov subspace method that tolerates inexact matrix vector multiplies
(e.g. FGMRES).
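
A finite-difference directional derivative takes only a couple of lines. This is a minimal sketch: the helper name `dirderiv_fd`, the default step `h`, and the toy functions `ftoy`/`Jtoy` are assumptions for illustration, and a practical code would choose `h` with more care.

```julia
using LinearAlgebra

# Hypothetical helper: forward-difference estimate of the directional
# derivative f'(x)*u, using two evaluations of f and no Jacobian matrix.
function dirderiv_fd(f, x, u; h=1e-6)
    return (f(x + h*u) - f(x))/h
end

# Toy function and its exact Jacobian, made up for checking the estimate
ftoy(x) = [x[1]^2 + x[2]^2 - 1.0, x[1] - x[2]]
Jtoy(x) = [2*x[1] 2*x[2]; 1.0 -1.0]

xtoy = [1.0, 0.5]
utoy = [1.0, 2.0]
err_dir = norm(dirderiv_fd(ftoy, xtoy, utoy) - Jtoy(xtoy)*utoy)
```

Inside a Krylov solver, `f(x)` would normally be computed once per Newton step and reused, so each matrix-vector product costs one new evaluation of `f`.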