# MATH 405/607 

# Numerical Methods for Differential Equations

[[Instructor: Christoph Ortner]](http://www.math.ubc.ca/~ortner/)  [[course page]](https://github.com/cortner/math405_2022)


## Optimization Problems

* Newton method again 
* Steepest descent
* Line-search 
* Outlook

In [None]:
include("math405.jl")

### General Optimisation 

In optimisation we are given an objective $\Phi : D \to \mathbb{R}$ and wish to solve 
$$ 
    \min_{x \in D} \Phi(x)
$$

### Unconstrained Optimization 

Let $\Phi : \mathbb{R}^N \to \mathbb{R}$ and solve $\min_{x \in \mathbb{R}^N} \Phi(x)$.

By this we mean that we wish to find $x \in \mathbb{R}^N$ such that 

$$
   \Phi(x) \leq \Phi(x') \qquad \forall x' \in \mathbb{R}^N
$$

But usually we must be content with local minimizers - more on this later.

### Applications of Optimization 



### Optimization and Nonlinear Systems

From calculus we know that we can solve the nonlinear system 

$$f(x) = \nabla \Phi(x) = 0$$ 

and then pick those roots that are minima (check that the Hessian $\nabla^2 \Phi(x)$ is positive definite). This is rarely a good strategy: because the objective is scalar, there are much more powerful tools available for solving optimisation problems than for solving general nonlinear systems. 


Nevertheless, some ideas from nonlinear systems remain enormously important in optimization as well; e.g., Newton's method now becomes 

$$
   x_{n+1} = x_n - \big[\nabla^2 \Phi(x_n)\big]^{-1} \nabla \Phi(x_n).
$$

* As in the nonlinear system case, we obtain quadratic convergence, provided the initial guess is sufficiently close to a critical point with invertible Hessian. 
* As before, the challenge remains to evaluate and invert the Hessian matrix. While gradients are normally cheap to compute (cf. adjoint method = backpropagation), Hessians are often very expensive.
* Further, Newton's method may converge to maxima or saddle points, since we have not built any information about minimality into the method!
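The last point can be illustrated with a minimal sketch. The objective `Φ`, its hand-coded derivatives, and the helper `newton_opt` are hypothetical choices made for this illustration only; they are not part of the course code. The function has a saddle at the origin and minima at $(\pm 1, 0)$, and which of these Newton's method finds depends only on the starting point.

```julia
using LinearAlgebra

# Hypothetical test objective: saddle at (0, 0), minima at (±1, 0)
Φ(x) = x[1]^4 / 4 - x[1]^2 / 2 + x[2]^2 / 2
∇Φ(x) = [x[1]^3 - x[1], x[2]]
function ∇²Φ(x)
    h11 = 3 * x[1]^2 - 1
    return [h11 0.0; 0.0 1.0]
end

# Plain Newton iteration applied to ∇Φ(x) = 0 (no safeguards, illustration only)
function newton_opt(grad, hess, x0; tol = 1e-10, maxiter = 20)
    x = copy(x0)
    for n = 1:maxiter
        g = grad(x)
        norm(g, Inf) < tol && return x
        x = x - hess(x) \ g      # solve the Newton system instead of forming the inverse
    end
    return x
end

x_saddle = newton_opt(∇Φ, ∇²Φ, [0.1, 0.3])   # starts near the saddle
x_min    = newton_opt(∇Φ, ∇²Φ, [0.9, 0.3])   # starts near a minimum

# The Hessian eigenvalues reveal the type of critical point that was found
println("x = ", x_saddle, ", eigenvalues: ", eigvals(Symmetric(∇²Φ(x_saddle))))
println("x = ", x_min, ", eigenvalues: ", eigvals(Symmetric(∇²Φ(x_min))))
```

Both runs converge rapidly, but only the second one ends at a point where the Hessian is positive definite.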

### A First Optimization Scheme 

A general idea in optimization that incorporates the idea of minimality is to enforce decrease of the objective at every step: if $x_n \mapsto x_{n+1}$ is an optimization step, then we require that 

$$
   \Phi(x_{n+1}) < \Phi(x_n)
$$

(unless $x_n$ is already a minimizer)

How should we create updates that achieve this? 

## Suggestions? 

A general idea: search in directions of descent: 

$$
   x_{n+1} = x_n + \alpha_n p_n
$$

where $\alpha_n > 0$ is a step size and $p_n$ is called a descent direction if 

$$
   \frac{d}{d\alpha} \Phi(x_n + \alpha p_n) \Big|_{\alpha = 0} < 0.
$$

What is a good descent direction? 

### Direction of Steepest Descent

$$\begin{aligned}
    \arg\min_{\|p\| = 1} \frac{d}{d\alpha} \Phi(x + \alpha p) \Big|_{\alpha = 0}
    &= \arg\min_{\|p\| = 1} \nabla\Phi(x) \cdot p 
    = - \frac{\nabla\Phi(x)}{\| \nabla \Phi(x) \|}
\end{aligned}$$
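
The last equality can be justified by the Cauchy-Schwarz inequality. This is a short sketch that assumes $\|\cdot\|$ is the Euclidean norm and $\nabla\Phi(x) \neq 0$: for every $p$ with $\|p\| = 1$,

$$
    \nabla\Phi(x) \cdot p \geq - \|\nabla\Phi(x)\| \, \|p\| = - \|\nabla\Phi(x)\|,
$$

with equality if and only if $p$ is a negative multiple of $\nabla\Phi(x)$, that is, $p = -\nabla\Phi(x) / \|\nabla\Phi(x)\|$.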

Usually we don't normalize (the reason will become clear in a moment). We call 

$$
     p = - \nabla \Phi(x)
$$

the direction of steepest descent.

**WARNING:** The norm used for normalization $\|p\|=1$ determines what the steepest descent direction is. There are other options that lead to different directions but they are rarely used. (maybe unwisely...)
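
To make the warning concrete, here is one illustrative case. The matrix $A$ is a hypothetical choice, assumed symmetric positive definite, with $\|p\|_A := \sqrt{p \cdot A p}$. Substituting $q = A^{1/2} p$ reduces the problem to the Euclidean case, which gives

$$
    \arg\min_{\|p\|_A = 1} \nabla\Phi(x) \cdot p
    = - \frac{A^{-1} \nabla\Phi(x)}{\| A^{-1} \nabla\Phi(x) \|_A}.
$$

In particular, if $\nabla^2\Phi(x)$ is positive definite and we choose $A = \nabla^2 \Phi(x)$, then the steepest descent direction in the $A$-norm is parallel to the Newton direction $-[\nabla^2\Phi(x)]^{-1} \nabla\Phi(x)$.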

### The Steepest Descent Method 

$$
  x_{n+1} = x_n - \alpha_n \nabla \Phi(x_n)
$$


How should we choose $\alpha_n$? 

Maybe we can first try with fixed $\alpha_n \equiv \alpha$ and just experiment? 

In [None]:
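using Printf, LinearAlgebra   # @printf and norm (harmless if math405.jl already loads them)

# Steepest descent with a fixed step size α.
function steepest_descent1(fgrad, x0, α; 
                           tol = 1e-3, maxiter = 100, verbose = true)
    x = x0 
    if verbose; @printf(" iter |   |∇Φ| \n"); end 
    for n = 1:maxiter 
        g = fgrad(x)
        if verbose; @printf(" %4d |  %.2e \n", n, norm(g, Inf)); end 
        # terminate once the max-norm of the gradient is below tol
        if norm(g, Inf) < tol 
            println("success: |g| < tol, iter = $n")
            return x 
        end 
        x = x - α * g 
    end
    println("failure: iter > maxiter")
    return x   # return the last iterate, as in the success branch
end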

In [None]:
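using ForwardDiff   # automatic differentiation (harmless if math405.jl already loads it)

# Rosenbrock-type test function with minimizer (1, 1); the classical version uses the factor 100 instead of 10
rosenbrock(x) =  (1.0 - x[1])^2 + 10.0 * (x[2] - x[1]^2)^2
grad_rosenbrock(x) = ForwardDiff.gradient(rosenbrock, x)
x0 = zeros(2)

steepest_descent1(grad_rosenbrock, x0, 1e-1) # try other values of α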

In [None]:
# after a lot of fiddling ...
steepest_descent1(grad_rosenbrock, x0, 1e-2, maxiter = 1_000, tol=1e-2, verbose = false)

* It is no fun fiddling with the parameters; we need a robust way to pick $\alpha$.
* We needed a lot of iterations. There are much better choices of the search direction, to be discussed.
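
Why a fixed step is so delicate can be seen on a quadratic model problem. The sketch below is an illustration under stated assumptions: the matrix `A` and the helper `fixed_step_descent` are hypothetical and not part of the course code. For $\Phi(x) = \frac12 x \cdot A x$ with $A$ symmetric positive definite, the iteration reads $x_{n+1} = (I - \alpha A) x_n$, which converges to the minimizer $0$ if and only if $0 < \alpha < 2 / \lambda_{\max}(A)$.

```julia
using LinearAlgebra

A = Diagonal([1.0, 20.0])      # hypothetical Hessian with eigenvalues 1 and 20
gradΦ(x) = A * x               # gradient of Φ(x) = ½ x ⋅ A x

# Steepest descent with fixed step α and a fixed number of iterations
function fixed_step_descent(grad, x0, α; niter = 200)
    x = copy(x0)
    for n = 1:niter
        x = x - α * grad(x)
    end
    return x
end

x0 = [1.0, 1.0]
x_small = fixed_step_descent(gradΦ, x0, 0.09)   # α < 2/20: the iteration contracts
x_large = fixed_step_descent(gradΦ, x0, 0.11)   # α > 2/20: the iteration blows up

println("α = 0.09: |x| = ", norm(x_small))
println("α = 0.11: |x| = ", norm(x_large))
```

The admissible step size is dictated by the largest curvature, while the convergence rate is limited by the smallest one. This is one reason why fixed-step steepest descent needs many iterations on badly scaled problems.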

### Backtracking Line Search 

If $\Phi \in C^2$ then 

$$
    \Phi(x + \alpha p) = \Phi(x) + \alpha \nabla \Phi(x) \cdot p + O(\alpha^2)
$$

If $p$ is a descent direction, i.e., $\nabla \Phi(x) \cdot p < 0$ then it immediately follows that $\Phi(x + \alpha p) < \Phi(x)$ for $\alpha$ sufficiently small. So why not keep reducing $\alpha$ until this descent condition is satisfied? This is called backtracking.

In practice we actually enforce something stronger, called the **Armijo condition**: for some fixed parameter $0 < \theta < 1$ we demand that  

$$
   \Phi(x_n + \alpha_n p_n) \leq \Phi(x_n) + \theta \alpha_n \nabla \Phi(x_n) \cdot p_n.
$$
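
A minimal sketch of backtracking with the Armijo condition along a general descent direction $p$ could look as follows. The helper name `armijo_backtrack`, its keyword arguments, and the hand-coded gradient are assumptions made for this illustration.

```julia
using LinearAlgebra

# Hypothetical helper: halve α until the Armijo condition holds.
# `slope` is the directional derivative ∇Φ(x) ⋅ p, which must be negative.
function armijo_backtrack(Φ, x, p, slope;
                          α0 = 1.0, θ = 0.01, shrink = 0.5, αmin = 1e-8)
    slope < 0 || error("p is not a descent direction")
    Φx = Φ(x)
    α = α0
    while Φ(x + α * p) > Φx + θ * α * slope
        α *= shrink
        α < αmin && error("line search failed: α < αmin")
    end
    return α
end

# Rosenbrock-type objective with a hand-coded gradient
Φ(x) = (1.0 - x[1])^2 + 10.0 * (x[2] - x[1]^2)^2
∇Φ(x) = [-2.0 * (1.0 - x[1]) - 40.0 * x[1] * (x[2] - x[1]^2),
         20.0 * (x[2] - x[1]^2)]

x = [0.0, 0.0]
g = ∇Φ(x)
p = -g                                   # steepest descent direction
α = armijo_backtrack(Φ, x, p, dot(g, p))
println("accepted α = ", α, ",  Φ: ", Φ(x), " → ", Φ(x + α * p))
```

Passing the slope separately keeps the helper independent of how $p$ was constructed, so the same routine could be reused with other search directions.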

In [None]:
function steepest_descent2(ffun, fgrad, x0, α0; 
                           tol = 1e-3, maxiter = 100, verbose = true, 
                           θ = 0.01, αmin = 1e-8)
    x = x0 
    α = 0.0
    if verbose; @printf(" iter   α  |  |∇Φ|   Φ \n"); end 
    for n = 1:maxiter 
        # ------- steepest descent direction 
        f = ffun(x)
        g = fgrad(x)
        
        # ------- termination criterion
        if verbose; @printf(" %4d %.2e |  %.2e  %.2e\n", n, α, norm(g, Inf), f); end 
        if norm(g, Inf) < tol 
            println("success: |g| < tol, iter = $n")
            return x 
        end 

        # ------- backtracking linesearch
        α = α0
        while ffun(x - α * g) > f - θ * α * dot(g, g)
            α *= 0.5 
            if α < αmin
                println("failure: α < αmin")
                return x
            end
        end
        x = x - α * g 
    end
    println("failure: iter > maxiter")
    return x     
end

In [None]:
steepest_descent2(rosenbrock, grad_rosenbrock, x0, 1e-1; maxiter = 1_000, tol=1e-2, verbose=false)

* Less fine-tuning is required, and even the remaining parameters $\alpha_0, \theta$ can be removed with some additional analysis.
* We need fewer iterations, though we now do more work per iteration, and the result is still far from great ... 

**Proposition:** Let $\Phi \in C^2$. The algorithm `steepest_descent2` but with the termination conditions removed produces a sequence $(x_n)_{n = 1, 2, \dots}$ such that, *either* $\Phi(x_n) \downarrow - \infty$, *or*, 

$$
    \nabla \Phi(x_n) \to 0 \qquad \text{as } n \to \infty
$$

**Proof:** see board/tablet/recording.

### Bigger Picture

Numerical optimization is a very mature subject. There is therefore a vast number of possible optimisation algorithms or indeed classes of algorithms available. Some random examples: 

* (Nonlinear) conjugate gradients: enforce approximate orthogonality (in a specific metric) of subsequent search directions
* Quasi-Newton methods: try to "learn" the Hessian from the iteration history 
* Numerous different line-search methods
* Trust-region methods: another class of methods that search in a ball instead of along a line
* Gauss-Newton and Levenberg-Marquardt for nonlinear least squares
* Nelder-Mead: derivative-free optimization
* Stochastic gradient descent: specifically designed for ultra-large-scale parameter estimation problems (cf. ANNs)

... and many more. 

Optimization software is usually so mature that one should almost never write one's own code. 
In Julia, there are several nice optimization packages, with the standard package maybe being [`Optim.jl`](https://github.com/JuliaNLSolvers/Optim.jl): the following result speaks for itself ... 

In [None]:
using Optim
result = optimize(rosenbrock, zeros(2), BFGS())
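
As a hedged usage sketch, the result object can be queried through accessor functions, and a gradient can be supplied explicitly. The in-place gradient `grad_rosenbrock!` is written by hand for this illustration, and the example assumes that `Optim.jl` is installed.

```julia
using Optim, LinearAlgebra

rosenbrock(x) = (1.0 - x[1])^2 + 10.0 * (x[2] - x[1]^2)^2

# Hand-coded gradient, written in place into G (the convention used by Optim)
function grad_rosenbrock!(G, x)
    G[1] = -2.0 * (1.0 - x[1]) - 40.0 * x[1] * (x[2] - x[1]^2)
    G[2] = 20.0 * (x[2] - x[1]^2)
    return G
end

result = optimize(rosenbrock, grad_rosenbrock!, zeros(2), BFGS())

# Query the result through Optim's accessor functions
xmin = Optim.minimizer(result)
println("converged:  ", Optim.converged(result))
println("iterations: ", Optim.iterations(result))
println("minimizer:  ", xmin)
println("minimum:    ", Optim.minimum(result))
```

Supplying the gradient avoids the finite-difference approximation that `optimize` falls back to when only the objective is given.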