# Gradient Descent Methods
## Introduction
Gradient descent methods constitute a large family of heuristic methods based on the gradient definitions presented in the setup of unconstrained NLP. They are
useful for finding critical points and testing whether these are global minima (for minimisation problems) or global maxima (for maximisation problems).

Basically, they consist of two types of functions:

- **search functions**: to find points where the gradient is zero or close to zero, i.e. functions to find critical points. 
- **test functions**: to verify whether the critical point is a global or local maximum or minimum.
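
As an illustration of a test function, the sketch below applies the standard second-order test to a critical point. It assumes the Hessian matrix at that point is available; the function name `classify_critical_point` and the tolerance `tol` are hypothetical choices made for this example. Note that this test can only certify a *local* minimum or maximum; concluding that a point is global needs additional information, such as convexity of the objective function.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-9):
    """Classify a critical point from the eigenvalues of the Hessian at that point."""
    # eigvalsh assumes a symmetric matrix, which holds for Hessians of smooth functions.
    eigenvalues = np.linalg.eigvalsh(np.asarray(hessian, dtype=float))
    if np.all(eigenvalues > tol):
        return "local minimum"
    if np.all(eigenvalues < -tol):
        return "local maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"
    # Some eigenvalue is (numerically) zero: the second-order test cannot decide.
    return "inconclusive"

# f(x, y) = x^2 - y^2 has a critical point at (0, 0) with Hessian [[2, 0], [0, -2]].
print(classify_critical_point([[2.0, 0.0], [0.0, -2.0]]))  # saddle point
```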

The different implementations have different performance trade-offs depending on the characteristics of the optimisation problem.
In this notebook, we are going to cover two different methods, the simple gradient descent method and the stochastic gradient descent, with applications in machine learning.

## Simple Gradient Descent
In the simple gradient descent method, the algorithm starts at an initial point, calculates the gradient and, if it is different from zero, looks for the next point to test *in the direction opposite to the gradient*. Recall that, at any given point $x^*$, the gradient of the objective function $\nabla f(x^*)$ is a vector that points in the direction in which $f$ increases fastest in the n-dimensional space. Therefore, the algorithm continues the search in the opposite direction to this vector, which is the direction in which the function decreases fastest locally, towards its minimum value.
With this, the Simple Gradient descent method can be defined as:

**Start**

- Choose an acceptable error $\epsilon$
- Choose a starting point $x'$

**Iterate**

- Calculate $\nabla f(x')$
- If $\left|\nabla f(x')\right|\leq \epsilon$ then exit with $x'$ as the solution, else
- Set $x''=x'-t \cdot \nabla f(x')$, with $t > 0, t \in \mathbb{R}$, replace $x'$ with $x''$ and iterate again.
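
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes a fixed step `t`, and the cap `max_iter` is a safeguard added here that is not part of the definition above. The function name `gradient_descent` and the example objective are hypothetical.

```python
import numpy as np

def gradient_descent(grad_f, x0, t=0.1, eps=1e-6, max_iter=10_000):
    """Fixed-step gradient descent: stop when the gradient norm is at most eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):          # max_iter is an added safeguard (assumption)
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:   # exit test: |grad f(x')| <= eps
            break
        x = x - t * g                  # x'' = x' - t * grad f(x')
    return x

# Example objective (assumption): f(x, y) = (x - 1)^2 + 2 (y + 3)^2, minimum at (1, -3).
def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
print(x_min)  # approximately [ 1. -3.]
```

With a fixed step, convergence depends on the choice of `t`: a value that is too large can make the iterates diverge, while a value that is too small makes progress slow.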

Note that since $t$ is a positive value and $\nabla f(x')$ is a vector, we are changing the value of $x$ in the direction opposite to the gradient. The size of the change is proportional to $t$, which can be a fixed parameter or a calculated parameter depending on the implementation.
One implementation alternative is to use a one-dimensional search (e.g. using a bisection algorithm) to find $t^*$ such that $f(x'')$ is a local minimum along the search direction. Note that since both $x'$ and $\nabla f(x')$ are known vectors, the only unknown is $t$, so this search reduces the multi-dimensional minimisation problem to a one-dimensional minimisation problem.
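
One possible way to realise the one-dimensional search is sketched below. It applies bisection to the derivative of $\phi(t) = f(x' - t \cdot \nabla f(x'))$, which is $\phi'(t) = -\nabla f(x' - t \cdot \nabla f(x')) \cdot \nabla f(x')$. The sketch assumes that $\phi'$ changes sign only once along the search direction (as it does for a convex $f$); the names `bisection_step` and `gradient_descent_line_search`, the initial bracket `t_max` and the tolerances are all hypothetical choices.

```python
import numpy as np

def bisection_step(grad_f, x, g, t_max=1.0, tol=1e-8, max_iter=200):
    """Approximate t >= 0 with d/dt f(x - t g) = 0 by bisection on the derivative."""
    def dphi(t):
        # phi'(t) = -grad f(x - t g) . g ; negative at t = 0 whenever g != 0.
        return -np.dot(grad_f(x - t * g), g)

    lo, hi = 0.0, t_max
    for _ in range(60):                # grow the bracket until the sign changes
        if dphi(hi) >= 0.0:
            break
        hi *= 2.0
    for _ in range(max_iter):          # standard bisection on the sign of phi'
        mid = 0.5 * (lo + hi)
        if dphi(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol:
            break
    return 0.5 * (lo + hi)

def gradient_descent_line_search(grad_f, x0, eps=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - bisection_step(grad_f, x, g) * g   # step length from the 1-D search
    return x

# Example objective (assumption): f(x, y) = (x - 1)^2 + 2 (y + 3)^2, minimum at (1, -3).
def grad_quadratic(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

x_ls = gradient_descent_line_search(grad_quadratic, x0=[0.0, 0.0])
```

Compared with a fixed step, each iteration costs several extra gradient evaluations, but no step size has to be tuned by hand.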


## Stochastic Gradient Descent (SGD)
The stochastic gradient descent method can be applied in problems where the gradient of the objective function can be estimated from a subset of the data needed to compute the closed form of the gradient.
Stochastic gradient descent is one of the fundamental optimisation methods used in machine learning and, given its importance, we will illustrate some applications of this method in the field of machine learning using quadratic optimisation examples.
Let us consider the following objective function:

$\min R_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}{L_t(\theta)}$
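
To make this objective concrete, the sketch below builds one possible instance: a least-squares problem in which each term is $L_t(\theta) = \frac{1}{2}(a_t^\top \theta - b_t)^2$. The synthetic data `A`, `b` and `theta_true` are assumptions made purely for illustration. The final check shows why a single randomly chosen term is a reasonable estimate of the gradient: the average of the per-term gradients equals the gradient of $R_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
A = rng.normal(size=(n, d))              # rows a_t (synthetic data, assumption)
theta_true = np.array([1.0, -2.0, 0.5])
b = A @ theta_true                       # noiseless targets b_t

def loss_t(theta, t):
    """L_t(theta) = 0.5 * (a_t . theta - b_t)^2"""
    return 0.5 * (A[t] @ theta - b[t]) ** 2

def grad_loss_t(theta, t):
    """Gradient of L_t with respect to theta."""
    return (A[t] @ theta - b[t]) * A[t]

def risk(theta):
    """R_n(theta): the average of the n terms L_t(theta)."""
    return np.mean([loss_t(theta, t) for t in range(n)])

def grad_risk(theta):
    """Closed-form gradient of R_n, using all n terms."""
    return A.T @ (A @ theta - b) / n

theta = np.zeros(d)
avg_of_term_grads = np.mean([grad_loss_t(theta, t) for t in range(n)], axis=0)
print(np.allclose(avg_of_term_grads, grad_risk(theta)))  # True
```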

The stochastic gradient descent method is defined as: 

**Start**

- Choose an acceptable error $\epsilon$

- Choose a starting point $\theta_0$ (e.g. $\theta_0=0$)

- Choose a learning factor $\zeta_k$ for each iteration $k$
 
**Iterate**

- Select a random index $t^*$ from $\{1, \dots, n\}$

- Calculate the gradient $\nabla L_{t^*}(\theta_k)$

- Update $\theta_k$ in the direction contrary to the gradient at $t^*$: $\theta_{k+1} = \theta_{k} - \zeta_k \cdot \nabla L_{t^*}(\theta_k)$
   
- Repeat until the difference between two consecutive estimates is smaller than $\epsilon$, or until the maximum number of iterations is reached.
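
The procedure above can be sketched as follows for a least-squares problem with terms $L_t(\theta) = \frac{1}{2}(a_t^\top \theta - b_t)^2$. Everything specific here is an assumption for illustration: the synthetic data, the function names `sgd` and `learning_factor`, the decaying schedule used for $\zeta_k$, and the default values of `eps` and `max_iter`.

```python
import numpy as np

rng_data = np.random.default_rng(0)
n, d = 200, 3
A = rng_data.normal(size=(n, d))         # rows a_t (synthetic data, assumption)
theta_true = np.array([1.0, -2.0, 0.5])
b = A @ theta_true                       # noiseless targets b_t

def grad_loss_t(theta, t):
    """Gradient of L_t(theta) = 0.5 * (a_t . theta - b_t)^2."""
    return (A[t] @ theta - b[t]) * A[t]

def learning_factor(k):
    """Hypothetical schedule for the learning factor: slowly decaying with k."""
    return 0.1 / (1.0 + 0.01 * k)

def sgd(grad_loss_t, n, theta0, eps=1e-9, max_iter=20_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iter):
        t_star = rng.integers(n)         # random index in 0..n-1 (0-based)
        theta_next = theta - learning_factor(k) * grad_loss_t(theta, t_star)
        # Stop when two consecutive estimates are closer than eps.
        converged = np.linalg.norm(theta_next - theta) <= eps
        theta = theta_next
        if converged:
            break
    return theta

theta_hat = sgd(grad_loss_t, n, theta0=np.zeros(d))
print(theta_hat)  # expected to end close to theta_true in this noiseless example
```

Because each update uses a single term, the stopping test on consecutive estimates can in principle trigger early when the sampled term happens to have a very small gradient; averaging the test over several iterations is a common way to make it more robust.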
