## Chapter 6: First order algorithms

In the previous Chapter we introduced zero order algorithms, the most naive set of *mathematical optimization* tools, which try to locate minima of a function by repeatedly evaluating the function itself.  Even though neither of the methods discussed there can be effectively applied to the vast majority of modern machine learning / deep learning problems, we saw that the concept of a *local method* of optimization (introduced via the *random local search* algorithm) has some very appealing characteristics.

As we saw previously, a local optimization method is one where we aim to find minima of a given function by beginning at some point $\mathbf{w}^0$ and taking a number of steps $\mathbf{w}^1, \mathbf{w}^2, \mathbf{w}^3,...,\mathbf{w}^{K}$ of the generic form

\begin{equation}
\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} + \alpha \mathbf{d}^{\,k}
\end{equation}

where $\mathbf{d}^{\,k}$ are direction vectors (ideally *descent directions* that lead us to lower and lower parts of a function) and $\alpha$ is called the *steplength* parameter.  We saw how the random local search algorithm follows this framework precisely, but is fatally flawed due to the way it determines each descent direction $\mathbf{d}^{\,k}$, i.e., via random search, which grows exponentially more inefficient as the dimension of a function's input increases (see Section 5.3 for a complete introduction to this concept).
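To make the generic update concrete, here is a minimal sketch of the local optimization loop with the direction-finding step left as a pluggable function. The names `local_optimize`, `make_random_direction`, and `direction_fn` are hypothetical and chosen only for illustration, and the random direction rule shown (sample a handful of unit-length directions and keep the best one only if it lowers the function) is one assumed variant of random local search, not necessarily the exact procedure of the previous Chapter.

```python
import numpy as np

def local_optimize(g, w0, direction_fn, alpha=0.1, max_steps=100):
    """Generic local method: repeat w^k = w^{k-1} + alpha * d^k."""
    w = np.asarray(w0, dtype=float)
    history = [w.copy()]
    for _ in range(max_steps):
        d = direction_fn(g, w, alpha)   # hypothetical hook that supplies d^k
        w = w + alpha * d
        history.append(w.copy())
    return history

def make_random_direction(num_samples=20, seed=0):
    """Assumed random-search rule: try a few random unit directions per step."""
    rng = np.random.default_rng(seed)

    def direction_fn(g, w, alpha):
        # Sample candidate directions and normalize each one to unit length.
        candidates = rng.standard_normal((num_samples, w.size))
        candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
        # Evaluate the function at every candidate step.
        values = [g(w + alpha * d) for d in candidates]
        best = int(np.argmin(values))
        # Keep the best candidate only if it is a genuine descent step.
        if values[best] < g(w):
            return candidates[best]
        return np.zeros_like(w)

    return direction_fn

# Toy quadratic used purely as an illustration.
g = lambda w: float(np.dot(w, w))
history = local_optimize(g, [3.0, 4.0], make_random_direction(), alpha=0.1)
```

Only `direction_fn` knows how a direction is chosen; the loop itself is the same for any local method, which is why swapping in a cheaper direction rule is enough to obtain a different algorithm.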

Of course this is a flaw in how that particular algorithm determines a proper descent direction, and does not damn the local optimization framework itself.  If we could somehow replace the random search component with a significantly more computationally efficient way of finding good descent directions, then we might be able to produce very useful local optimization algorithms.  This is indeed what we do here.

In this Chapter we discuss the fundamentals of *first order algorithms*, which are local optimization methods that employ a function's *first derivative or gradient* to determine a descent direction instead of the function itself.  Because the descent direction we will devise is based entirely on the function's derivative / gradient, these methods are synonymously referred to as *gradient descent algorithms*.
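As a rough preview, the sketch below uses the standard choice of the negative gradient as the direction in the generic local update. The function name `gradient_descent`, the toy quadratic, its hand-coded gradient, and the values of `alpha` and `max_steps` are all assumptions made for illustration; this is a minimal sketch and not the full algorithm developed in this Chapter.

```python
import numpy as np

def gradient_descent(grad, w0, alpha=0.1, max_steps=50):
    """Local method whose direction is the negative gradient at the current point."""
    w = np.asarray(w0, dtype=float)
    history = [w.copy()]
    for _ in range(max_steps):
        d = -grad(w)          # direction from the first derivative, no function evaluations
        w = w + alpha * d     # same generic update form as any local method
        history.append(w.copy())
    return history

# Toy example (assumed for illustration): g(w) = w^T w, whose gradient is 2w.
g = lambda w: float(np.dot(w, w))
grad_g = lambda w: 2.0 * w

gd_history = gradient_descent(grad_g, [3.0, 4.0])
```

Each step here costs a single gradient evaluation regardless of how many directions one might otherwise have to sample, which is the efficiency gain over random search that motivates first order methods.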

We begin by discussing the first derivative / gradient in the context of the classical *first order Taylor series approximation*, which provides the underpinning of gradient descent algorithms.  To ensure comfort with both single input and multi-input versions of gradient descent we take some care in detailing the general multi-input setup, including a review of some fundamental attributes of hyperplanes and tangent hyperplanes generated by a function's gradient.  After detailing the basic gradient descent algorithm we then look at a number of simple examples that help exhibit its general behavior.  Finally, as with any local optimization method, we must worry about the selection of the steplength parameter.  We give both informal advice on its general setting and more formal, rigorous approaches to steplength selection.