## Chapter 6: First order methods

# 6.0 Introduction

In the previous Chapter we introduced zero order algorithms, the most naive set of *mathematical optimization* tools, which try to locate minima of a function by repeatedly evaluating the function itself.  Even though neither of the methods discussed there can be effectively applied to the vast majority of modern machine learning / deep learning problems, we saw that the concept of a *local optimization method* (introduced via the *random local search* algorithm) has some very appealing characteristics.

In this Chapter we mirror the structure of our discussion in the previous Chapter in describing *first order optimization methods*.  In particular, our main focus will be on describing the first order local method called *gradient descent*, which unlike its zero order counterparts scales gracefully with input dimension and is used extensively in machine learning / deep learning.

## 6.0.1  Big picture view of the gradient descent algorithm

As we saw previously, a local optimization method is one where we aim to find minima of a given function by beginning at some point $\mathbf{w}^0$ and taking a number of steps $\mathbf{w}^1, \mathbf{w}^2, \mathbf{w}^3,...,\mathbf{w}^{K}$ of the generic form

\begin{equation}
\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} + \alpha \mathbf{d}^{\,k}
\end{equation}

where $\mathbf{d}^{\,k}$ are direction vectors (ideally *descent directions* that lead us to lower and lower parts of a function) and $\alpha$ is called the *steplength* parameter.  We saw how the random local search algorithm follows this framework precisely, but is fatally flawed due to the way it determines each descent direction $\mathbf{d}^{\,k}$, i.e., via random search, which grows exponentially more inefficient as the dimension of a function's input increases (see Section 5.3 for a complete introduction to this concept).
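To make the generic update concrete, here is a minimal sketch of a local method that chooses each direction by random search. It is an illustration under stated assumptions, not the exact procedure from the previous Chapter. The function names, the `num_samples` parameter, the use of unit-length candidate directions, and the rule of stepping only when a candidate lowers the function are all choices made for this example.

```python
import math
import random


def random_unit_direction(dim, rng):
    """Sample a random direction and normalize it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def random_local_search(g, w0, alpha, K, num_samples=10, seed=0):
    """Take K steps of the form w^k = w^(k-1) + alpha * d^k.

    Assumption for this sketch: d^k is the best of `num_samples` random
    unit directions, and we only move if it actually decreases g.
    """
    rng = random.Random(seed)
    w = list(w0)
    history = [g(w)]
    for _ in range(K):
        # Candidate directions d^k, found by random search.
        candidates = [random_unit_direction(len(w), rng) for _ in range(num_samples)]
        # Apply the generic update to every candidate direction.
        trials = [[wi + alpha * di for wi, di in zip(w, d)] for d in candidates]
        best = min(trials, key=g)
        # Keep the step only if it is a genuine descent step.
        if g(best) < g(w):
            w = best
        history.append(g(w))
    return w, history


def g(w):
    """A simple test function: the sum of squares, minimized at the origin."""
    return sum(x * x for x in w)
```

Calling `random_local_search(g, [1.0, 1.0], alpha=0.1, K=50)` returns the final point together with the function value recorded at every step. The only part of this loop that depends on random sampling is the construction of `candidates`, and that is the component a first order method replaces.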

Of course this is a flaw in how that particular algorithm determines a proper descent direction, and does not damn the entire local optimization framework itself.  If we could somehow replace the random search component with a significantly more computationally efficient way of finding good descent directions, then we might be able to produce very useful local optimization algorithms.  This is indeed what we do here, with the ultimate aim in this Chapter being to introduce the *gradient descent algorithm*.  Gradient descent is the foremost first order optimization algorithm and one of the most widely used optimization methods in machine learning / deep learning today.

Here is how the gradient descent algorithm works at a high level (in this Chapter we of course explain each of these concepts in complete detail).  As we saw in Chapter 2, the first derivative of a function helps form the best *linear* approximation to the function locally (called the *first order Taylor series approximation*).  Because this approximation matches the function closely near the point of expansion, and because it is extremely easy to compute the descent direction of a line or hyperplane regardless of its dimension (as we will see in this Chapter), the descent direction of the first order Taylor series approximation typically provides descent in the function itself, at least for a sufficiently small step.  With proper steplength control the repeated use of these descent directions can minimize generic functions, and this is the essence of the gradient descent algorithm (as illustrated figuratively below).  As opposed to the random local search algorithm discussed in the previous Chapter, gradient descent scales extremely well with the dimension of input and so is much more widely applicable.

<figure>
<img src="../../mlrefined_images/math_optimization_images/Fig_2_7.png" width=700 height=250/>
  <figcaption>   
<strong>Figure 1:</strong> <em> A figurative drawing of the gradient descent algorithm.  The first order Taylor series approximation provides an excellent and easily computed descent direction at each step of this local method of optimization (here a number of approximations are shown in green).   Employing these directions at each step, the gradient descent algorithm can be used to properly minimize generic functions.  Moreover, unlike the random local search algorithm, gradient descent scales very well with input dimension since the descent direction of a hyperplane is much more easily computed in high dimensions.
</em>  </figcaption> 
</figure>
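The high-level description above can be sketched in a few lines of code. This is a minimal illustration, not the full algorithm developed later in the Chapter. It assumes that the descent direction of the first order Taylor series approximation is taken to be the negative gradient, that the gradient is supplied by hand, and that the steplength $\alpha$ is held fixed. The names `gradient_descent`, `g`, and `grad_g` are hypothetical and chosen only for this example.

```python
def gradient_descent(grad, w0, alpha, K):
    """Take K steps of the form w^k = w^(k-1) + alpha * d^k,
    with d^k set to the negative gradient at w^(k-1) (an assumption
    of this sketch) and a fixed steplength alpha.
    """
    w = list(w0)
    path = [list(w)]
    for _ in range(K):
        # Descent direction of the local linear approximation.
        d = [-gi for gi in grad(w)]
        # The same generic local update used by every local method.
        w = [wi + alpha * di for wi, di in zip(w, d)]
        path.append(list(w))
    return path


def g(w):
    """A simple test function: the sum of squares, minimized at the origin."""
    return sum(x * x for x in w)


def grad_g(w):
    """Hand-computed gradient of g: each entry is 2 * w_n."""
    return [2.0 * x for x in w]


# Example run in N = 10 dimensions, starting from the all-ones point.
path = gradient_descent(grad_g, [1.0] * 10, alpha=0.1, K=25)
```

Each step costs a single gradient evaluation, however many inputs the function has. This gives a rough sense of why the approach scales well with input dimension, in contrast to sampling many random directions at every step.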

## 6.0.2  Chapter organization

The contents of this Chapter are organized as follows.  We begin with a short discussion of the *first order condition for optimality*, which codifies how the first derivative(s) of a function characterize its minima.  Then we discuss some fundamental concepts related to the geometric nature of hyperplanes, and in particular the first order Taylor series approximation, in preparation for discussing the *gradient descent algorithm*, which is the fundamental first order local optimization algorithm.  With these ideas in hand we can then formally detail the gradient descent algorithm, examining a number of examples that help exhibit the algorithm's general behavior.  Finally, as with any local optimization method, we must worry about the selection of the steplength parameter.  We give both informal advice on its general setting and more formal, rigorous approaches to steplength selection.