# Steepest descent method

Consider $f \in C^1$. The steepest descent method iteratively computes
$$
x_{k+1} = x_k - \alpha^* \nabla f(x_k)
$$
where $\alpha^* \in \arg\min_{\alpha \geq 0} f(x_k - \alpha \nabla f(x_k))$.

In [1]:
using Optim
using Plots
plotly()


In [None]:
import Pkg
Pkg.add("Optim")

We need the `LinearAlgebra` library to access functions such as `det`, which computes the determinant of a matrix.

In [2]:
using LinearAlgebra

## Example 1

Consider the bivariate function
$$
f(x, y) = 4x^2 - 4xy + 2y^2
$$

In [None]:
f1(x) = 4x[1]*(x[1]-x[2])+2*x[2]*x[2]

default(size=(600,600), fc=:heat)
x, y = -2.5:0.1:2.5, 0.5:0.1:2.5
z = Surface((x,y)->f1([x,y]), x, y)
surface(x,y,z)

Its gradient is
$$
\nabla f(x, y) = \begin{pmatrix} 8x - 4y \\ 4y - 4x \end{pmatrix}
$$
The Hessian is
$$
\nabla^2 f(x,y) =
\begin{pmatrix}
8 & -4 \\ -4 & 4
\end{pmatrix}
$$

In [None]:
A = [8 -4; -4 4 ]

The leading principal minors are

In [None]:
8
det( A )

Therefore, the matrix is positive definite. We can confirm this by computing the eigenvalues:

In [None]:
eigvals(A)
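
The two checks above can be combined in a few lines. The following is a minimal sketch, assuming only the standard `LinearAlgebra` library; the names `minors`, `sylvester_ok` and `chol_ok` are illustrative and do not appear elsewhere in the notebook.

```julia
using LinearAlgebra

# Hessian of Example 1 (constant, since f is quadratic).
A = [8.0 -4.0; -4.0 4.0]

# Sylvester's criterion: a symmetric matrix is positive definite
# if and only if all its leading principal minors are positive.
minors = [det(A[1:k, 1:k]) for k in 1:size(A, 1)]
sylvester_ok = all(m -> m > 0, minors)

# Built-in test from LinearAlgebra, for comparison.
chol_ok = isposdef(A)

println("leading principal minors: ", minors)
println("positive definite: ", sylvester_ok && chol_ok)
```

For larger matrices, `isposdef` is usually preferable to computing determinants.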

We compute the gradient as

In [None]:
function f1grad(x)
    return [8*x[1]-4*x[2], 4*x[2]-4*x[1]]
end

Consider $x_0 = (2, 3)$, so that $\nabla f(x_0) = (4, 4)$.

We have to minimize the univariate function
$$
m(\alpha) = f((2, 3) - \alpha(4, 4)) = f(2 - 4\alpha, 3 - 4\alpha)
$$
The derivative of $m(\alpha)$ is
\begin{align*}
m'(\alpha) &= \nabla_{(x,y)} f(2 - 4\alpha, 3 - 4\alpha)^T \nabla_{\alpha} \begin{pmatrix} 2 - 4\alpha \\ 3 - 4\alpha \end{pmatrix} \\
&= \begin{pmatrix} 8(2-4\alpha) - 4(3-4\alpha) & 4(3-4\alpha) - 4(2-4\alpha)\end{pmatrix}\begin{pmatrix} -4 \\ -4 \end{pmatrix} \\
&= -\begin{pmatrix} 4 - 16\alpha & 4\end{pmatrix}\begin{pmatrix} 4 \\ 4 \end{pmatrix} \\
&= -16+64\alpha-16\\
&= 64\alpha-32
\end{align*}

The second derivative of $m(\alpha)$ is
$$
m''(\alpha) = 64
$$
Since $m''(\alpha) > 0$, the one-dimensional model is strictly convex, and its minimizer is found by setting $m'(\alpha^*) = 0$, which gives $\alpha^* = \frac{1}{2}$. Therefore
$$
x_1 = x_0 - \frac{1}{2}\nabla f(x_0) = (2, 3) - \frac{1}{2}(4, 4) = (0, 1),
$$
and
$$
\nabla f(x_1) = \begin{pmatrix} -4 \\ 4 \end{pmatrix}
$$
The univariate function to minimize is now
$$
m(\alpha) = f((0, 1) - \alpha(-4, 4)) = f(4\alpha, 1 - 4\alpha)
$$
and its derivative is
\begin{align*}
m'(\alpha) &= \nabla_{(x,y)} f(4\alpha, 1 - 4\alpha)^T \nabla_{\alpha} \begin{pmatrix} 4\alpha \\ 1 - 4\alpha \end{pmatrix} \\
&= ( 8 \times 4\alpha - 4(1-4\alpha), 4(1-4\alpha) - 4\times(4\alpha))\begin{pmatrix} 4 \\ -4 \end{pmatrix} \\
&= ( -4 + 48\alpha, 4 - 32 \alpha)\begin{pmatrix} 4 \\ -4 \end{pmatrix} \\
&= -32+320\alpha
\end{align*}
The root of $m'(\alpha)$ is $\alpha^* = \frac{1}{10}$, and $m''(\alpha) = 320$, thus $\alpha^*$ is a global minimizer.
We obtain
$$
x_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix} - \frac{1}{10}\begin{pmatrix} -4 \\ 4 \end{pmatrix}
= \begin{pmatrix} \frac{4}{10} \\ \frac{6}{10} \end{pmatrix}
= \begin{pmatrix} \frac{2}{5} \\ \frac{3}{5} \end{pmatrix}
$$
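
These two hand computations can be cross-checked numerically. The sketch below assumes that $f$ is written as the quadratic form $f(x) = \frac{1}{2} x^T A x$ with $A$ the Hessian above, in which case $m'(\alpha) = -g^T g + \alpha\, g^T A g$ and the exact step is $\alpha^* = g^T g / (g^T A g)$, where $g = \nabla f(x_k)$. The helper names `quad_grad`, `exact_step` and `exact_steepest_steps` are hypothetical.

```julia
using LinearAlgebra

A = [8.0 -4.0; -4.0 4.0]          # Hessian of f, so that f(x) = x' * A * x / 2
quad_grad(x) = A * x               # gradient of the quadratic form

# Exact line-search step along -g for a quadratic:
# m(α) = f(x - αg) is minimized where m'(α) = -g'g + α g'Ag = 0.
exact_step(g) = dot(g, g) / dot(g, A * g)

function exact_steepest_steps(x0, nsteps)
    x = copy(x0)
    steps = Float64[]
    for _ in 1:nsteps
        g = quad_grad(x)
        α = exact_step(g)
        push!(steps, α)
        x = x - α * g
    end
    return x, steps
end

x2, steps = exact_steepest_steps([2.0, 3.0], 2)
println("steps = ", steps, ", x2 = ", x2)
```

The closed-form step is specific to quadratic functions; it is not defined when the gradient is zero.
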
We could continue, but such a hand computation is tedious. We will automate the procedure by constructing a Julia function.

In [None]:
function steepestdescent(f::Function, fprime::Function, x0, h::Float64, verbose::Bool = true,
                         record::Bool = false, tol::Float64 = 1e-7, maxiter::Int64 = 1000)

    function fsearch(α::Float64)
        return(f(x-α*grad))
    end

    x = x0
    k = 0

    grad = fprime(x)

    if (verbose || record)
        fx = f(x)
    end
    if (verbose)
        println("$k. x = $x, f($x) = $fx")
    end
    if (record)
        iterates = [ fx x' ]
    end
    
    while ((k < maxiter) && (norm(grad) > tol))
        α = Optim.minimizer(optimize(fsearch, 0, h, GoldenSection()))
        x = x-α*grad
        k += 1
        grad = fprime(x)       
        if (verbose || record)
            fx = f(x)
        end
        if (verbose)
            println("$k. x = $x, f($x) = $fx")
        end
        if (record)
            iterates = [ iterates; fx x' ]
        end
    end

    if (k == maxiter)
        println("WARNING: maximum number of iterations reached")
    end

    if (record)
        return x, iterates
    else
        return x
    end
end

The following variant enlarges the interval used for the one-dimensional search whenever the minimizer is found at its upper bound.

This is only valid for convex functions!

The idea will be adapted and generalized when we discuss trust regions.

In [None]:
function steepestdescent_convex(f::Function, fprime::Function, x0, h::Float64, verbose::Bool = true,
        record::Bool = false, tol::Float64 = 1e-7, maxiter::Int64 = 1000)

    function fsearch(α::Float64)
        return(f(x-α*grad))
    end

    x = x0
    k = 0

    grad = fprime(x)

    if (verbose || record)
        fx = f(x)
    end
    if (verbose)
        println("$k. x = $x, f($x) = $fx")
    end
    if (record)
        iterates = [ fx x' ]
    end

    Δ = 1e-6
    
    while ((k < maxiter) && (norm(grad) > tol))
        α = Optim.minimizer(optimize(fsearch, 0, h, GoldenSection()))
        while ((h-α) <= Δ)
            h *= 2
            α = Optim.minimizer(optimize(fsearch, α, h, GoldenSection()))
        end
        h = α
        x = x-α*grad
        k += 1
        grad = fprime(x)       
        if (verbose || record)
            fx = f(x)
        end
        if (verbose)
            println("$k. x = $x, f($x) = $fx")
        end
        if (record)
            iterates = [ iterates; fx x' ]
        end
    end

    if (k == maxiter)
        println("WARNING: maximum number of iterations reached")
    end

    if (record)
        return x, iterates
    else
        return x
    end
end

Executing this function on the problem, we obtain

In [None]:
sol, iter = steepestdescent(f1, f1grad, [2.0,3.0], 2.0, true, true)

In [None]:
sol, iter = steepestdescent(f1, f1grad, [10.0,10.0], 2.0, true, true)

In [None]:
sol, iter = steepestdescent(f1, f1grad, [100.0,100.0], 2.0, true, true)

We converge to the solution $(0,0)$, but the method was quite slow close to the solution.
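
The slow progress can be related to the conditioning of the Hessian. For a strictly convex quadratic and exact line searches, a classical bound states that $f(x_{k+1}) - f^* \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^2 (f(x_k) - f^*)$, where $\kappa$ is the ratio of the largest to the smallest eigenvalue of the Hessian. The sketch below evaluates this bound for $f_1$ (it equals $5/9$ for this matrix) and compares it with the reduction factors observed from $(2, 3)$. The names `quad` and `observed_ratios` are illustrative, and the closed-form exact step replaces the golden section search used above.

```julia
using LinearAlgebra

A = [8.0 -4.0; -4.0 4.0]              # Hessian of f1
λ = eigvals(Symmetric(A))             # eigenvalues in increasing order
κ = λ[end] / λ[1]                     # condition number
bound = ((κ - 1) / (κ + 1))^2         # worst-case reduction factor per iteration

quad(x) = dot(x, A * x) / 2           # f1 written as a quadratic form

# Observed reduction factors f(x_{k+1}) / f(x_k) with exact steps (f* = 0).
function observed_ratios(x0, niter)
    x = copy(x0)
    ratios = Float64[]
    for _ in 1:niter
        g = A * x
        α = dot(g, g) / dot(g, A * g)
        xnew = x - α * g
        push!(ratios, quad(xnew) / quad(x))
        x = xnew
    end
    return ratios
end

ratios = observed_ratios([2.0, 3.0], 5)
println("κ = ", κ, ", bound = ", bound)
println("observed ratios = ", ratios)
```

The bound is a worst case: the observed factor depends on the starting point and can be smaller.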

In [None]:
sol, iter = steepestdescent(f1, f1grad, [2.0,3.0], 0.1, true, true)

In [None]:
sol, iter = steepestdescent_convex(f1, f1grad, [2.0,3.0], 0.1, true, true)

In [None]:
k = collect(1:size(iter, 1))
Plots.plot(k, iter[:,1])

In [None]:
k

In [None]:
k = collect(10:size(iter, 1))
Plots.plot(k, iter[10:end,1])

## Coordinate descent

In [None]:
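function Jacobi(f::Function, x0, h::Float64, verbose::Bool = true, δ::Float64 = 1e-6, maxiter::Int64 = 1000)

    # Objective along the current coordinate direction d, starting from the previous iterate x0.
    function fsearch(α::Float64)
        return(f(x0-α*d))
    end

    x0 = copy(x0)  # work on a copy so that the caller's vector is not modified
    x = copy(x0)
    n = length(x)
    k = 0
    d = zeros(n)
    
    # Stop after maxiter sweeps at the latest.
    while k < maxiter
        x0[:] = x[:]
        k += 1
        
        for i = 1:n
            d[i] = 1.0  # d is now the i-th vector of the canonical basis
            # Search over [-h, h] so that the coordinate can move in both directions.
            α = Optim.minimizer(optimize(fsearch, -h, h, GoldenSection()))
            x[i] -= α
            d[i] = 0.0
        end
        
        if verbose
            println(k, ". ", f(x), " ", x, " ", x0)
        end
        
        if norm(x-x0) < δ
            break
        end
    end
    
    return x
end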

In [None]:
sol = Jacobi(f1, [2.0,3.0], 1.0)
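
For the quadratic $f_1$, each one-dimensional minimization has a closed form, which gives a way to cross-check the sweeps. The sketch below assumes $f_1(x) = \frac{1}{2} x^T A x$ with $A$ the Hessian of Example 1; the function name `jacobi_sweep` is hypothetical. It should agree with the line-search version up to the search tolerance, as long as the bound $h$ on the step is not binding.

```julia
# Coefficients of f(x, y) = 4x^2 - 4xy + 2y^2, i.e. f(x) = x' * A * x / 2.
A = [8.0 -4.0; -4.0 4.0]

# One Jacobi sweep: every coordinate is minimized independently,
# using only the values of the previous iterate.
function jacobi_sweep(A, x)
    xnew = similar(x)
    for i in eachindex(x)
        # Setting the i-th partial derivative, sum_j A[i,j] * x[j], to zero
        # with the other coordinates frozen gives the new x[i].
        offdiag = sum(A[i, j] * x[j] for j in eachindex(x) if j != i)
        xnew[i] = -offdiag / A[i, i]
    end
    return xnew
end

let x = [2.0, 3.0]
    for k in 1:3
        x = jacobi_sweep(A, x)
        println(k, ". ", x)
    end
end
```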

## Example 2

Consider the bivariate function
$$
f(x,y) = \frac{(2-x)^2}{2y^2}+\frac{(3-x)^2}{2y^2} + \ln y
$$
that is computed in Julia as

In [None]:
f(x) = (2-x[1])*(2-x[1])/(2*x[2]*x[2])+(3-x[1])*(3-x[1])/(2*x[2]*x[2])+log(x[2])

Its gradient is
$$
\nabla f(x) =
\begin{pmatrix}
\frac{-2(2-x)}{2y^2}+\frac{-2(3-x)}{2y^2} \\
-\frac{(2-x)^2}{y^3}-\frac{(3-x)^2}{y^3} + \frac{1}{y}
\end{pmatrix} =
\begin{pmatrix}
\frac{x-2}{y^2}+\frac{x-3}{y^2} \\
-\frac{(2-x)^2}{y^3}-\frac{(3-x)^2}{y^3} + \frac{1}{y}
\end{pmatrix}
$$

In [None]:
function fprime(x)
    return [(x[1]-2)/(x[2]*x[2])+(x[1]-3)/(x[2]*x[2]),
            -(2-x[1])*(2-x[1])/(x[2]*x[2]*x[2])-(3-x[1])*(3-x[1])/(x[2]*x[2]*x[2])+1/x[2]]
end

In [None]:
default(size=(600,600), fc=:heat)
x, y = -2.5:0.1:2.5, 0.5:0.1:2.5
z = Surface((x,y)->f([x,y]), x, y)
surface(x,y,z, linealpha = 0.3)

In [None]:
sol = steepestdescent(f, fprime, [1.0,1.0], 2.0)

The choice of $h$ is important. Consider, for instance, a value that is too small: $h = 0.1$.

In [None]:
sol = steepestdescent(f, fprime, [1.0,1.0], 0.1)

But a value of $h$ that is too large can also cause problems. Consider, for instance, $h = 10$.

In [None]:
sol = steepestdescent(f, fprime, [1.0,1.0], 10.0)

We have to ensure that the iterates satisfy $y > 0$, because of the logarithm.
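
One possible safeguard, sketched below under stated assumptions, is to cap the upper bound of the search interval so that the second coordinate of $x - \alpha \nabla f(x)$ stays strictly positive. The function `max_feasible_step` and its `shrink` margin are hypothetical and not part of the notebook; the numerical values are illustrative and not taken from a run.

```julia
# Largest step α in [0, h] keeping the second coordinate of x - α * grad
# strictly positive. `shrink` < 1 keeps a small margin from the boundary.
function max_feasible_step(x, grad, h; shrink = 0.99)
    if grad[2] <= 0
        return h                      # y does not decrease along -grad
    end
    return min(h, shrink * x[2] / grad[2])
end

# Illustrative values (not taken from a run of the notebook).
x = [1.0, 1.0]
grad = [0.0, 2.0]
hmax = max_feasible_step(x, grad, 10.0)
println("search interval: [0, ", hmax, "]")
println("y at the upper bound: ", x[2] - hmax * grad[2])
```

The returned value could then replace `h` as the upper bound passed to the one-dimensional search.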

The choice of the starting point is also important to ensure that the algorithm converges fast enough. Consider for instance $x_0 = (0.1, 0.1)$.

In [None]:
sol = steepestdescent(f, fprime, [0.1,0.1], 2.0)

Now, take $x_0 = (100, 100)$.

In [None]:
sol = steepestdescent(f, fprime, [100.0,100.0], 5.0)

In practice, we often need some insight into the function to optimize in order to be efficient.

## Rosenbrock function

$$
f(x,y) = (1-x)^2 + 100(y-x^2)^2
$$

$$
\nabla f(x,y) =
\begin{pmatrix}
-2(1-x)-400x(y-x^2) \\
200(y-x^2)
\end{pmatrix}
$$

$$
\nabla^2 f(x,y) =
\begin{pmatrix}
2 - 400(y-x^2) + 800x^2 & -400x \\
-400x & 200
\end{pmatrix}
=
\begin{pmatrix}
2 - 400y + 1200x^2 & -400x \\
-400x & 200
\end{pmatrix}
$$

In [None]:
function rosenbrock(x::Vector)
  return (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
end
 
function rosenbrock_gradient(x::Vector)
  return [-2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1],
          200.0 * (x[2] - x[1]^2)]
end
 
function rosenbrock_hessian(x::Vector)
  h = zeros(2, 2)
  h[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2
  h[1, 2] = -400.0 * x[1]
  h[2, 1] = -400.0 * x[1]
  h[2, 2] = 200.0
  return h
end
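
Hand-derived gradients are error-prone, so a finite-difference comparison is a cheap sanity check. The sketch below uses central differences; `rosen` and `rosen_grad` are illustrative copies of the functions above, `fd_gradient` is a hypothetical helper, and the test point $(-1.2, 1)$ and step `h = 1e-6` are arbitrary choices.

```julia
using LinearAlgebra

# Illustrative copies of the Rosenbrock function and its gradient.
rosen(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
rosen_grad(x) = [-2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1],
                 200.0 * (x[2] - x[1]^2)]

# Central finite-difference approximation of the gradient of f at x.
function fd_gradient(f, x; h = 1e-6)
    g = zeros(length(x))
    for i in eachindex(x)
        e = zeros(length(x))
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2h)
    end
    return g
end

x = [-1.2, 1.0]
err = norm(fd_gradient(rosen, x) - rosen_grad(x))
println("gradient error at ", x, ": ", err)
```

A small error supports the analytical formula; a large one usually points to a sign or coefficient mistake.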

In [None]:
default(size=(600,600))
x, y = 0:0.01:1.0, 0:0.01:1.0
z = Surface((x,y)->rosenbrock([x,y]), x, y)
surface(x,y,z, linealpha = 0.3)

In [None]:
Plots.contour(x,y,z, linealpha = 0.1, levels=2500)

In [None]:
sol, iter = steepestdescent(rosenbrock, rosenbrock_gradient, [0.0,0.0], 10.0, true, true)

The minimizer is located at $(1,1)$. Indeed,
$$
\nabla f(1,1) = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
$$
and
$$
\nabla^2 f(1,1) =
\begin{pmatrix}
802 & -400 \\ -400 & 200
\end{pmatrix}
$$
The leading principal minors are positive, as they are respectively 802 and $802\times200-400^2= 400$, so the Hessian is positive definite.

However the steepest descent method converges very slowly.

In [None]:
plot!(iter[:,2], iter[:,3])

# Exact minimization or approximate minimization?

Exact minimization of the function along the search direction requires assumptions such as unimodality or convexity, which are not necessarily satisfied. It is more practical to minimize the function approximately along the search direction using backtracking. This will be done more explicitly in the linesearch notebook.
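
As a preview, here is a minimal sketch of backtracking with the Armijo sufficient decrease condition along the steepest descent direction. It is not the implementation of the linesearch notebook: the function name `backtracking` and the defaults `α0`, `ρ`, `c` and `maxtrials` are assumptions chosen for illustration.

```julia
using LinearAlgebra

# Backtracking (Armijo) line search along the steepest descent direction.
# α0, ρ, c and maxtrials are illustrative defaults.
function backtracking(f, x, grad; α0 = 1.0, ρ = 0.5, c = 1e-4, maxtrials = 50)
    α = α0
    fx = f(x)
    slope = dot(grad, grad)          # minus the directional derivative along -grad
    for _ in 1:maxtrials
        # Sufficient decrease (Armijo) condition.
        if f(x - α * grad) <= fx - c * α * slope
            return α
        end
        α *= ρ                       # shrink the step and try again
    end
    return α
end

# Same function and gradient as f1 and f1grad.
quadratic(x) = 4 * x[1] * (x[1] - x[2]) + 2 * x[2] * x[2]
quadratic_grad(x) = [8 * x[1] - 4 * x[2], 4 * x[2] - 4 * x[1]]

x0 = [2.0, 3.0]
α = backtracking(quadratic, x0, quadratic_grad(x0))
println("accepted step: ", α)
```

Only function values are compared, so no unimodality assumption is needed along the search direction.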

For nonconvex functions, a first approach is to fix the step length.

In [None]:
function batchdescent(f::Function, fprime::Function, x0, α::Float64, verbose::Bool = true,
                      record::Bool = false, tol::Float64 = 1e-7, maxiter::Int64 = 1000)

    # The step length α is fixed: no line search is performed.

    x = x0
    k = 0

    grad = fprime(x)

    if (verbose || record)
        fx = f(x)
    end
    if (verbose)
        println("$k. x = $x, f($x) = $fx")
    end
    if (record)
        iterates = [ fx x' ]
    end
    
    while ((k < maxiter) && (norm(grad) > tol))
        x = x-α*grad
        k += 1
        grad = fprime(x)       
        if (verbose || record)
            fx = f(x)
        end
        if (verbose)
            println("$k. x = $x, f($x) = $fx")
        end
        if (record)
            iterates = [ iterates; fx x' ]
        end
    end

    if (k == maxiter)
        println("WARNING: maximum number of iterations reached")
    end

    if (record)
        return x, iterates
    else
        return x
    end
end

We can get close to the solution if $\alpha$ is small enough.

In [None]:
sol, iter = batchdescent(f1, f1grad, [2.0,3.0], 0.1, true, true)

But if $\alpha$ is too large, it does not work at all!

In [None]:
sol, iter = batchdescent(f1, f1grad, [2.0,3.0], 2.0, true, true)
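
For the quadratic $f_1$, the threshold between the two behaviours can be computed explicitly. With a fixed step, the iteration reads $x_{k+1} = (I - \alpha A) x_k$, where $A$ is the Hessian, so it converges exactly when all eigenvalues of $I - \alpha A$ have modulus below 1, that is, when $0 < \alpha < 2/\lambda_{\max}(A)$. The sketch below checks this for $\alpha = 0.1$ and $\alpha = 2$; the names `L` and `contraction` are illustrative.

```julia
using LinearAlgebra

A = [8.0 -4.0; -4.0 4.0]                  # Hessian of f1
L = maximum(eigvals(Symmetric(A)))        # Lipschitz constant of the gradient

# For this quadratic, x_{k+1} = (I - αA) x_k, so the iteration contracts
# exactly when every eigenvalue of I - αA has modulus below 1.
contraction(α) = maximum(abs, eigvals(Symmetric(I - α * A)))

println("2 / L = ", 2 / L)
for α in (0.1, 2.0)
    println("α = ", α, ": spectral radius of I - αA = ", contraction(α))
end
```

This analysis is specific to quadratics; for general functions, $L$ is rarely known in advance.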

If $f \in C^1$ is convex and $\nabla f(\cdot)$ is Lipschitz continuous, i.e. $\exists L >0$ such that
$$
\forall x, y,\ \| \nabla f(x) - \nabla f(y) \|_2 \leq L \| x - y\|_2,
$$
we can recover convergence by considering a decreasing sequence of step lengths $\alpha_k > 0$ satisfying
$$
\sum_{k = 1}^{+\infty} \alpha_k = +\infty,\qquad \sum_{k = 1}^{+\infty} \alpha_k^2 < +\infty.
$$
Example: $\alpha_k = \frac{\kappa}{k}$ for some constant $\kappa > 0$.
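
The two conditions can be illustrated numerically for this choice of step lengths. The sketch below takes $\kappa = 1$ as an arbitrary value: the partial sums of $\alpha_k$ keep growing (roughly like $\ln n$), while those of $\alpha_k^2$ stay below $\pi^2/6$.

```julia
# Partial sums of α_k = κ / k and of α_k^2, with κ = 1 for illustration.
κ = 1.0
for n in (10^2, 10^4, 10^6)
    s1 = sum(κ / k for k in 1:n)        # grows like κ * log(n): diverges
    s2 = sum((κ / k)^2 for k in 1:n)    # stays below κ^2 * π^2 / 6: converges
    println("n = ", n, ": sum α_k = ", s1, ", sum α_k^2 = ", s2)
end
```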

In [None]:
function rbdescent(f::Function, fprime::Function, x0, α0::Float64, verbose::Bool = true,
                   record::Bool = false, tol::Float64 = 1e-7, maxiter::Int64 = 1000)

    # The step length follows α_k = α0/k: no line search is performed.

    x = x0
    k = 0
    α = α0

    grad = fprime(x)

    if (verbose || record)
        fx = f(x)
    end
    if (verbose)
        println("$k. x = $x, f($x) = $fx")
    end
    if (record)
        iterates = [ fx x' ]
    end
    
    while ((k < maxiter) && (norm(grad) > tol))
        k += 1
        α = α0/k 
        x = x-α*grad
        grad = fprime(x)       
        if (verbose || record)
            fx = f(x)
        end
        if (verbose)
            println("$k. x = $x, f($x) = $fx", ", α = ", α)
        end
        if (record)
            iterates = [ iterates; fx x' ]
        end
    end

    if (k == maxiter)
        println("WARNING: maximum number of iterations reached")
    end

    if (record)
        return x, iterates
    else
        return x
    end
end

In [None]:
sol, iter = rbdescent(f1, f1grad, [2.0,3.0], 2.0, true, true)

In [None]:
sol, iter = rbdescent(f1, f1grad, [10.0,10.0], 2.0, true, true)

In [None]:
sol, iter = rbdescent(f1, f1grad, [100.0,100.0], 2.0, true, true)

In [None]:
sol, iter = rbdescent(f1, f1grad, [100.0,100.0], 0.1, true, true)

This technique was proposed by Robbins and Monro in 1951 in the context of stochastic approximation, where the objective is
$$
f(x) = E[g(x,\xi)]
$$
and at each iteration, the next iterate is computed as
$$
x_{k+1} = x_k - \alpha_k \nabla_x g(x_k,\xi_k)
$$
where $\xi_k$ is drawn from the distribution of $\xi$.
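
A toy instance makes the recursion concrete. The sketch below assumes $g(x, \xi) = \frac{1}{2}(x - \xi)^2$ with $\xi \sim N(\mu, 1)$, so that $f(x) = E[g(x,\xi)]$ is minimized at $x = \mu$. With $\alpha_k = 1/k$, the iterate is exactly the running mean of the samples. The function name `robbins_monro`, the value of $\mu$, the seed and the sample size are illustrative choices.

```julia
using Random

# Toy problem: g(x, ξ) = (x - ξ)^2 / 2 with ξ ~ N(μ, 1), so that
# f(x) = E[g(x, ξ)] is minimized at x = μ.
function robbins_monro(x0, ξ; α0 = 1.0)
    x = x0
    for (k, ξk) in enumerate(ξ)
        αk = α0 / k                  # satisfies both series conditions
        x -= αk * (x - ξk)           # ∇_x g(x, ξ_k) = x - ξ_k
    end
    return x
end

rng = MersenneTwister(2024)
μ = 3.0
ξ = μ .+ randn(rng, 10_000)
xfinal = robbins_monro(0.0, ξ)
println("estimate = ", xfinal, ", sample mean = ", sum(ξ) / length(ξ))
```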

This technique, as well as some extensions (mini-batch, stochastic average gradient, ...), is still very popular in machine learning.