# Gradient Descent

<hr>

A general method for finding the minimum of a differentiable function is to set its gradient (the first derivative) to zero and solve for the parameters:

$\nabla f(w) = 0$

For example, in Ordinary Least Squares, we would like to estimate the weights, $\hat{w}$, that minimize the loss function:

$\hat{w} = \displaystyle\arg \min_{\substack{w}} \sum_{i=1}^{N}(y_i - x_i w)^2$
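
As a minimal sketch of the "set the derivative to zero" recipe, assume scalar inputs $x_i$ and a single weight $w$ with no intercept (a simplification made here for brevity). Setting the derivative of the loss to zero then gives $\hat{w} = \sum_i x_i y_i / \sum_i x_i^2$. The function name `ols_1d` and the data are hypothetical.

```python
def ols_1d(xs, ys):
    """Closed-form minimizer of sum_i (y_i - x_i * w)^2 for scalar x_i and w."""
    # d/dw sum_i (y_i - x_i w)^2 = -2 sum_i x_i (y_i - x_i w) = 0
    # => w = sum_i x_i y_i / sum_i x_i^2
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Made-up, noise-free data generated from y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w_hat = ols_1d(xs, ys)
print(w_hat)
```

In this toy case the data lie exactly on a line through the origin, so the estimate recovers the slope used to generate them.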

<hr>

**Convexity**<br>

If a function, $f$, is twice differentiable, then $f$ is convex if and only if, for all $w$:

$\nabla^2 f(w)$ is a positive semidefinite (PSD) Hessian matrix (in 1D: $f''(w) \geq 0$)

A symmetric matrix, $A$, is PSD if either of the following equivalent conditions holds:
1. $v^T A v \geq 0$ for any vector, $v$
1. All eigenvalues of $A$ are $\geq 0$
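
The eigenvalue condition can be sketched for the smallest interesting case, a symmetric $2 \times 2$ matrix, where the eigenvalues have a closed form. The helper names below are hypothetical, and the tolerance `tol` is an assumption added to absorb floating-point error. For larger matrices a numerical routine such as `numpy.linalg.eigvalsh` would be the usual choice.

```python
import math


def eigenvalues_sym_2x2(a, b, c):
    """Eigenvalues (smallest, largest) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius


def is_psd_2x2(a, b, c, tol=1e-12):
    """True if [[a, b], [b, c]] has no eigenvalue below -tol."""
    smallest, _ = eigenvalues_sym_2x2(a, b, c)
    return smallest >= -tol


psd_example = is_psd_2x2(2.0, 1.0, 2.0)         # eigenvalues 1 and 3
indefinite_example = is_psd_2x2(1.0, 2.0, 1.0)  # eigenvalues -1 and 3
print(psd_example, indefinite_example)
```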

If $f'(w) = 0$ and $f''(w) = 0$, then $w$ might be a **saddle point** rather than a minimum or maximum. In multiple dimensions, a Hessian matrix with a mixture of positive and negative eigenvalues (an indefinite matrix) at a critical point indicates that the function curves upwards in some directions and downwards in others, which is the defining feature of a saddle point.
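
As a short worked example (chosen here for illustration, not taken from the original notes), consider $f(w_1, w_2) = w_1^2 - w_2^2$:

$\nabla f(w) = (2 w_1, -2 w_2)^T$, which is zero at $w = (0, 0)$

$\nabla^2 f(w) = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}$, with eigenvalues $2$ and $-2$

The Hessian is indefinite, so the origin is a critical point that is neither a minimum nor a maximum: $f$ curves upwards along $w_1$ and downwards along $w_2$.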

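$\therefore$ if the second derivative is non-negative everywhere, $f''(w) \geq 0$ for all $w$, then $f$ is a convex function and any local minimum is also a global minimum. If $f''(w) > 0$ for all $w$ ($f$ is strictly convex), the minimizer, if one exists, is unique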

<hr>

**General workings of Gradient Descent**<br>
Solving $\nabla f(w) = 0$ directly might be difficult. <br>

Gradient descent instead finds a $w$ that minimizes the function iteratively: it starts with an arbitrary $w$ and moves closer to the critical point with each iteration.

General scheme:
1. Start with some $w^0$
1. For $t = 0, 1, \ldots$: $w^{t+1} \leftarrow w^t + \alpha_t d^t$, where $\alpha_t$ is the step size and $d^t$ is the direction of descent
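
The scheme above can be sketched as a short loop. This version assumes a scalar $w$, a fixed step size, and the negative gradient as the descent direction (the choice the next section arrives at). The function name `gradient_descent`, its parameters, and the test function $f(w) = (w - 3)^2$ are hypothetical.

```python
def gradient_descent(grad, w0, alpha=0.1, num_steps=100):
    """Iterate w <- w + alpha * d with d = -grad(w), starting from w0."""
    w = w0
    for _ in range(num_steps):
        d = -grad(w)       # direction of descent, d^t (assumed: negative gradient)
        w = w + alpha * d  # w^{t+1} <- w^t + alpha_t * d^t, with a constant alpha_t
    return w


# Illustrative function f(w) = (w - 3)^2, whose derivative is 2 (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

With these settings each step shrinks the distance to the minimizer $w = 3$ by a constant factor, so the iterate ends up very close to 3.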

<hr>

**Direction, $d^t$**<br>
To move one step in the right direction, we construct a quadratic function, $g_t(u)$, that is an upper bound on $f(u)$ for all values of $u$ and touches $f$ at the current iterate, $w^t$.

$g_t (u) = f (w^t) + f' (w^t) (u - w^t) + \frac{L}{2} (u - w^t)^2$

where $L$ is an upper bound on the curvature of $f$ (the largest eigenvalue of the Hessian over all $w$); the resulting step size is $\frac{1}{L}$

such that:
1. $g_t (u) \geq f (u)$ for all values of $u$
2. $g_t (w^t) = f (w^t)$

<img alt="Quadratic Minimization" src="assets/quadratic_minimization.png" width="300">

Given $g_t (u)$, take the derivative with respect to $u$ and set it to zero to find the $u$ that minimizes $g_t (u)$:

$u_t = w^t - \frac{1}{L}f'(w^t) = w^{t+1}$
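
Spelling out the intermediate step (a short derivation using the same symbols as above):

$\frac{d}{du} g_t(u) = f'(w^t) + L (u - w^t) = 0$

$\Rightarrow u - w^t = -\frac{1}{L} f'(w^t)$

Comparing with the general scheme $w^{t+1} \leftarrow w^t + \alpha_t d^t$, this corresponds to the direction $d^t = -f'(w^t)$ (the negative gradient) with step size $\alpha_t = \frac{1}{L}$.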

Example: *Least Squares Regression*

$w^{t+1} \leftarrow w^t - \alpha_t \nabla f(w^t)$, where $\alpha_t$ is the step size

Gradient for squared loss: 

$\nabla_w (\sum_{i=1}^{N} (y_i - x_i w)^2) = \sum_{i=1}^{N} \nabla_w (y_i - x_i w)^2$

$= -2 \sum_{i=1}^{N} (y_i - x_i w) \cdot x_i^T$

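$w^{t+1} = w^{t} + 2 \alpha_t \sum_{i=1}^{N} (y_i - x_i w^t) \cdot x_i^T$

$\therefore$ each update adds a linear combination of the $x_i$'s to $w^t$, so starting from $w^0 = 0$, $w^{t+1}$ is itself a linear combination of the $x_i$'s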
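
The least squares update can be sketched in plain Python, with each $x_i$ stored as a list of features. The function names, the fixed step size `alpha=0.01`, the number of steps, and the data are all assumptions chosen for illustration; the step size was picked small enough for this particular data set and would need tuning for others.

```python
def squared_loss_gradient(X, y, w):
    """Gradient of sum_i (y_i - x_i . w)^2, i.e. -2 * sum_i (y_i - x_i . w) * x_i."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        residual = y_i - sum(x_ij * w_j for x_ij, w_j in zip(x_i, w))
        for j, x_ij in enumerate(x_i):
            grad[j] += -2.0 * residual * x_ij
    return grad


def least_squares_gd(X, y, alpha=0.01, num_steps=2000):
    """Gradient descent on the squared loss with a constant step size."""
    w = [0.0] * len(X[0])  # w^0 = 0
    for _ in range(num_steps):
        grad = squared_loss_gradient(X, y, w)
        w = [w_j - alpha * g_j for w_j, g_j in zip(w, grad)]
    return w


# Made-up, noise-free data generated from w = [1.0, 2.0];
# the first feature is a constant 1 and acts as an intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w_hat = least_squares_gd(X, y)
print(w_hat)
```

Because the data are noise-free, the iterates approach the weights used to generate them.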

<hr>

**Step sizes, $\alpha_t$**

The step size $\alpha_t = \frac{1}{L}$ requires $L$, which may be hard to compute as it is the maximum eigenvalue of the Hessian.

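If the step size is small enough ($\alpha_t \leq \frac{1}{L}$), each step is guaranteed to decrease the loss, and for $\alpha_t = \frac{1}{L}$ the inequality below holds:

$f(w^{t+1}) \leq f(w^t) - \frac{1}{2L} \lVert \nabla f(w^t) \rVert^2$

Since $L$ is unknown, replace $\frac{1}{L}$ with the trial step size, $\alpha_t$: start with an optimistic $\alpha_t$ and check whether $f(w^{t+1}) \leq f(w^t) - \frac{\alpha_t}{2} \lVert \nabla f(w^t) \rVert^2$ holds. If yes, use $\alpha_t$; otherwise try $\frac{\alpha_t}{2}$ and check again.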
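
A minimal sketch of this halving procedure for a scalar $w$ follows. It assumes the decrease check is applied with $\frac{1}{L}$ replaced by the trial step size, i.e. $f(w - \alpha f'(w)) \leq f(w) - \frac{\alpha}{2} f'(w)^2$, since $L$ itself is not available. The function name `backtracking_step`, the cap `max_halvings`, and the test function are hypothetical.

```python
def backtracking_step(f, grad, w, alpha0=1.0, max_halvings=50):
    """Take one gradient step, halving the step size until the decrease check passes."""
    g = grad(w)
    alpha = alpha0  # optimistic initial step size
    for _ in range(max_halvings):
        w_next = w - alpha * g
        # Decrease check, with 1/L replaced by the trial step size alpha.
        if f(w_next) <= f(w) - 0.5 * alpha * g * g:
            return w_next, alpha
        alpha /= 2.0
    return w, 0.0  # no acceptable step size found; stay where we are


# Illustrative function f(w) = 5 (w - 1)^2, so f''(w) = 10 and 1/L = 0.1.
f = lambda w: 5.0 * (w - 1.0) ** 2
grad = lambda w: 10.0 * (w - 1.0)
w_next, alpha = backtracking_step(f, grad, w=0.0)
print(w_next, alpha)
```

Starting from `alpha0 = 1.0`, the trial step is halved until it falls below $\frac{1}{L} = 0.1$, which happens at $\alpha = 0.0625$.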

<hr>

**Stochastic Gradient Descent**<br>
This is the most frequently used variant of gradient descent in ML applications.

Problem: In linear regression, each iteration computes the gradient of a sum of squared residuals across all $N$ data points. When $N$ is large, this becomes computationally expensive.

Solution: Instead of summing over all data points, estimate the gradient of the sum from one (or a few) data point(s)

1. Start with some $w^0$, for $t = 0, 1, 2, \ldots$
1. Draw a data point, $i$, uniformly at random
1. Compute the next step: $w^{t+1} \leftarrow w^t - \alpha_t \nabla f_i(w^t)$, where $f_i(w) = (y_i - x_i w)^2$ is the loss on data point $i$
1. Shrink the step size, $\alpha_t$, as $t$ increases (e.g. $\alpha_t \approx \frac{1}{t+1}$)
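
The steps above can be sketched for least squares with scalar $x_i$ and $w$, taking $f_i(w) = (y_i - x_i w)^2$ as the per-point loss. The schedule $\alpha_t = \frac{\alpha_0}{t+1}$ with `alpha0=0.5`, the fixed random seed, the function name, and the data are assumptions made for illustration only.

```python
import random


def sgd_least_squares_1d(xs, ys, alpha0=0.5, num_steps=2000, seed=0):
    """Stochastic gradient descent on sum_i (y_i - x_i * w)^2, one point per step."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = 0.0                    # w^0
    for t in range(num_steps):
        i = rng.randrange(len(xs))                   # draw a data point uniformly at random
        grad_i = -2.0 * (ys[i] - xs[i] * w) * xs[i]  # gradient of f_i at the current w
        alpha_t = alpha0 / (t + 1)                   # shrinking step size
        w = w - alpha_t * grad_i
    return w


# Made-up, noise-free data generated from y = 2x.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.0, 3.0, 4.0]
w_hat = sgd_least_squares_1d(xs, ys)
print(w_hat)
```

Each step touches a single data point, so its cost does not grow with $N$. With noisy data the iterates would fluctuate around the minimizer rather than settle on it exactly, which is why the step size is shrunk over time.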

In practice, a single step of stochastic gradient descent may not decrease the loss function, but in expectation each step moves in a descent direction and the iterates converge towards the minimum. It typically needs more iterations than full gradient descent, but each iteration is much less computationally expensive.

<img alt="Gradient Descent vs Stochastic Gradient Descent" src="assets/gd_vs_sgd.png" width="300">

<hr>

# Basic code
A `minimal, reproducible example`