# Intro to gradient descent and SGD

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
data = np.random.randn(15) * 4 + 12

In [None]:
data

In [None]:
np.average(data)

### The mean as a minimization problem

We talked about this back when we started doing curve fitting: the mean, the simplest model for a numerical variable, can be thought of as the solution to a least-squares minimization problem.

Define the *cost function*

$$ SSE(x^*; \mathbf x) = \sum_{i=0}^{m-1} (x_i - x^*)^2 $$

(sum of squared errors). When $x^*$ is the mean, this is proportional to the variance of the data (it is $m$ times the population variance). We think of this as a function of the estimate $x^*$, with the data $\mathbf x$ held fixed.
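
To make the cost function concrete, here is a minimal sketch that evaluates the SSE on a grid of candidate estimates and checks that the smallest value lands at the mean. The helper name `sse` and the five-point `sample` are made up for illustration; they are not part of the notebook's data.

```python
import numpy as np

def sse(data, candidate):
    """Sum of squared errors between each data point and a candidate estimate."""
    data = np.asarray(data, dtype=float)
    return float(np.sum((data - candidate) ** 2))

# Small made-up sample, used only for illustration.
sample = np.array([9.0, 11.0, 12.0, 13.0, 15.0])

# Evaluate the cost on an evenly spaced grid of candidate estimates.
grid = np.linspace(sample.min(), sample.max(), 601)
costs = np.array([sse(sample, c) for c in grid])

# The grid point with the smallest cost should sit at (or very near) the mean.
best = grid[np.argmin(costs)]
print(best, np.mean(sample))
```

A grid search like this only works because there is a single unknown; the sections below find the same minimum without trying every candidate.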

### Minimizing a function by calculus

<font color = 'green'> __Derivative:__ <font color = 'black'> An operation in calculus that calculates the rate of change (slope) of a function at a point. If we have a function $f(x)$, we can find a function $f'(x)$ with the property that the "slope" of the graph of $f$ at a specific point $x_0$ is $f'(x_0)$. This function is called the derivative of $f$.
    
There is a whole list of rules for calculating the derivative of a function if you have a formula for it.

<font color = 'green'> __Gradient:__ <font color = 'black'> For a function $f(\mathbf x)$ that depends on several variables, the gradient is a vector built out of its derivative with respect to each of the variables. The important thing to know about the gradient is that as a vector, it points in the direction of greatest increase of $f$.

<font color = 'green'> __Critical point:__ <font color = 'black'> For any function $f(\mathbf x)$, a critical point is a value of the input $\mathbf x$ where the derivative/gradient is 0 (or undefined). A local minimum/maximum of $f$ in the interior of its domain must occur at a critical point, although not every critical point is a minimum or maximum.
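
Applied to the SSE defined earlier, the calculus approach can be worked out by hand. Differentiating with respect to the estimate $x^*$ gives

$$ \frac{d}{dx^*} SSE(x^*; \mathbf x) = \sum_{i=0}^{m-1} -2 (x_i - x^*) = -2 \sum_{i=0}^{m-1} x_i + 2 m x^* $$

Setting this to 0 to find the critical point:

$$ x^* = \frac{1}{m} \sum_{i=0}^{m-1} x_i $$

which is the mean. The second derivative is $2m > 0$, so this critical point is a minimum.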

### Minimizing a function iteratively

<font color = 'green'> __Gradient descent:__ <font color = 'black'> An iterative approach to minimizing a function. *Iterative* means we start with a guess and then try to improve it. In gradient descent, we improve the guess by taking a small step in the direction of the (negative) gradient.

In [None]:
def gradient_mean(data, candidate):
    return sum(-2 * (x - candidate) for x in data)
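
One way to sanity-check a hand-derived gradient is to compare it with a finite-difference slope. The sketch below does that for `gradient_mean`; the `sse` helper, the made-up `sample`, and the step size `h` are assumptions chosen for illustration.

```python
import numpy as np

def sse(data, candidate):
    # Sum of squared errors for a candidate estimate.
    return sum((x - candidate) ** 2 for x in data)

def gradient_mean(data, candidate):
    # Same formula as the cell above: derivative of the SSE w.r.t. the candidate.
    return sum(-2 * (x - candidate) for x in data)

sample = np.array([9.0, 11.0, 12.0, 13.0, 15.0])  # made-up data
candidate = 10.0
h = 1e-5  # small offset for the central difference

# Slope estimated from two nearby evaluations of the cost.
numeric = (sse(sample, candidate + h) - sse(sample, candidate - h)) / (2 * h)
analytic = gradient_mean(sample, candidate)
print(numeric, analytic)
```

The two numbers should agree closely. The gradient is negative here because the candidate is below the mean, so stepping against the gradient moves the candidate up.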

In [None]:
def gradient_descent_mean(data, learn_rate = 0.01, tol = 0.0001, verbose = False):
    # pick a random starting point
    candidate = np.random.uniform(np.min(data), np.max(data))
    # delta holds the most recent step; start above tol so the loop runs
    delta = 1
    # stop once a step is too small to matter
    while abs(delta) > tol:
        grad = gradient_mean(data, candidate)
        delta = learn_rate * grad
        # step against the gradient, i.e. downhill
        candidate -= delta
        if verbose:
            print(candidate)
    return candidate

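<font color = 'green'> __Learning rate:__ <font color = 'black'> The multiplier we apply to the gradient in gradient descent. Higher learning rate means we take bigger steps. In principle this means our optimizer should converge faster. But pushing it too high makes each step overshoot the minimum, and past a certain threshold every step lands farther from the minimum than the last, so the optimizer diverges instead of converging.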
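
For the SSE this threshold can be worked out exactly: the gradient is $2m(x^* - \bar x)$, so each step multiplies the distance to the mean by $1 - 2 \cdot \text{learn\_rate} \cdot m$. The sketch below tries a few learning rates on a made-up five-point sample; `run_steps` is a hypothetical helper that takes a fixed number of steps rather than using a tolerance.

```python
import numpy as np

def run_steps(data, start, learn_rate, n_steps):
    """Take a fixed number of full-gradient steps on the SSE; return the final candidate."""
    candidate = start
    for _ in range(n_steps):
        grad = np.sum(-2 * (data - candidate))
        candidate -= learn_rate * grad
    return candidate

sample = np.array([9.0, 11.0, 12.0, 13.0, 15.0])  # made-up data, mean 12
m = len(sample)

for lr in (0.01, 0.05, 0.15, 0.25):
    final = run_steps(sample, start=9.0, learn_rate=lr, n_steps=20)
    factor = 1 - 2 * lr * m  # per-step multiplier on the distance to the mean
    print(lr, factor, abs(final - sample.mean()))
```

With $m = 5$ the multiplier has absolute value below 1 only when the learning rate is under $1/m = 0.2$, so the first three rates converge (the third one oscillating around the mean) and the last one blows up. For the 15-point data set in this notebook the same reasoning puts the threshold at $1/15 \approx 0.067$.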

In [None]:
def grad_descent_path(data, learn_rate = 0.01, tol = 0.0001):
    # pick a random starting point
    candidate = np.random.uniform(np.min(data), np.max(data))
    path = [candidate]
    delta = 1
    while abs(delta) > tol:
        grad = gradient_mean(data, candidate)
        delta = learn_rate * grad
        candidate -= delta
        path.append(candidate)
    return np.array(path)

In [None]:
gradient_descent_mean(data)

In [None]:
path = grad_descent_path(data)
plt.plot(np.arange(1, path.shape[0] + 1), path, 'o')
plt.show()

### Stochastic gradient descent

Notice that the gradient of the SSE is actually a sum of many terms, each of which depends on just one instance. This is a pretty common form for loss functions, because the loss measures the difference between the model prediction and the true value, and the simplest way to combine these differences is to just add them up (or average them, etc.).

The problem is that if you have a really big data set, and if your gradient is more complicated to calculate than what we have above, computing the full gradient can be really computationally expensive, because every single step requires a pass over the entire data set.

<font color = 'green'> __Stochastic gradient descent:__ <font color = 'black'> A variation on gradient descent where we estimate the gradient by using just a single example at a time. We shuffle the data set and then step through it, calculating the gradient for one example and updating the estimate after each one.
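
To see why a single example can stand in for the whole data set, the sketch below splits the SSE gradient into its per-example terms. The `sample` values and the `candidate` are made up for illustration.

```python
import numpy as np

sample = np.array([9.0, 11.0, 12.0, 13.0, 15.0])  # made-up data
candidate = 10.0

# One gradient term per data point; the full SSE gradient is their sum.
per_example = -2 * (sample - candidate)
full = per_example.sum()

# Scaling a single term by m gives an estimate of the full gradient.
# Individual estimates are noisy, but averaged over all examples they match it.
estimates = len(sample) * per_example
print(per_example, full, estimates.mean())
```

Each single-example term points in roughly the right direction on average, which is what SGD relies on. Dropping the factor of $m$ only rescales the step, so it can be absorbed into the learning rate.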

In [None]:
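def sgd_mean(data, learn_rate = 0.001, tol = 0.0001, verbose = False):
    # pick a random starting point
    candidate = np.random.uniform(np.min(data), np.max(data))
    delta = 1
    # shuffle a copy so the caller's data keeps its order
    temp_data = data.copy()
    # note: the stopping test only looks at the last step of each epoch
    while abs(delta) > tol:
        np.random.shuffle(temp_data)
        for x in temp_data:
            # gradient of the single term (x - candidate)**2
            grad = -2*(x - candidate)
            delta = grad * learn_rate
            candidate -= delta
            if verbose:
                print(candidate)
    return candidate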

In [None]:
sgd_mean(data, learn_rate = 0.001, verbose = True)

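<font color = 'green'> __Learning rate decay:__ <font color = 'black'> Reducing the learning rate over time. Early on, larger steps let the estimate move quickly toward the minimum; later, smaller steps keep the noisy single-example updates from bouncing around the minimum or diverging.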

<font color = 'green'> __Epoch:__ <font color = 'black'> A single pass through the training data set.
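
The cells below shrink the learning rate by dividing it by the square root of the epoch number. As a rough illustration of how quickly that schedule decays, here is a small sketch; `decayed_rate` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
def decayed_rate(learn_rate, epoch):
    """Inverse-square-root decay: the learning rate used during a given (1-based) epoch."""
    return learn_rate / epoch ** 0.5

# The rate halves by epoch 4 and is ten times smaller by epoch 100.
for epoch in (1, 4, 25, 100):
    print(epoch, decayed_rate(0.05, epoch))
```

Starting from 0.05, the printed rates are roughly 0.05, 0.025, 0.01 and 0.005. The rate is held fixed within an epoch, so every example in one pass through the data gets the same step size.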

In [None]:
def sgd_mean_decay(data, learn_rate = 0.05, tol = 0.0001, verbose = False):
    candidate = np.random.uniform(np.min(data), np.max(data))
    delta = 1
    temp_data = data.copy()
    epoch = 1
    while abs(delta) > tol:
        np.random.shuffle(temp_data)
        for x in temp_data:
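            # per-example gradient, scaled by 1/len(data); the scaling acts like a smaller learning rate
            # (sgd_mean_decay_path below does not apply it)
            grad = -2*(x - candidate) / len(data)
            # the step size shrinks like 1/sqrt(epoch)
            delta = grad * learn_rate / epoch ** (1/2)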
            candidate -= delta
            if verbose:
                print(candidate)
        epoch += 1
    return candidate

In [None]:
def sgd_mean_decay_path(data, learn_rate = 0.02, tol = 0.0001, verbose = False):
    candidate = np.random.uniform(np.min(data), np.max(data))
    delta = 1
    temp_data = data.copy()
    epoch = 1
    path = [candidate]
    while abs(delta) > tol:
        np.random.shuffle(temp_data)
        for x in temp_data:
            grad = -2*(x - candidate)
            delta = grad * learn_rate / epoch ** (1/2)
            candidate -= delta
            path.append(candidate)
            if verbose:
                print(candidate)
        epoch += 1
    return np.array(path)

In [None]:
path = sgd_mean_decay_path(data, learn_rate = 0.01)
plt.plot(np.arange(1, path.shape[0] + 1), path, 'o')
plt.show()