## Stochastic gradient descent

Stochastic gradient descent is a useful technique when it would be prohibitively expensive to run all of the training examples at once, or when we wish to update our model in a sequential way.  Here we demonstrate the difference in the convergence properties of batch versus stochastic gradient descent for the simple problem of linear regression.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.figsize'] = (9,9)

np.random.seed(0)

We will use a dataset generated from a line with noise added:
$$ y = x + 1 + \epsilon $$
$$ x \in [0,1]. $$

In [None]:
x = np.linspace(0,1,21)
y = x + np.random.randn(len(x))*0.1 + 1.0
plt.plot(x,y,'k.')
plt.show()

Let's initialize our weights to zero:

In [None]:
w = np.array([0.,0.])

Our cost function (equivalently, up to constants, the negative log-likelihood under Gaussian noise) is simple least squares:

In [None]:
def L(w,x,y):
    # Half the sum of squared residuals for intercept w[0] and slope w[1]
    return 1./2.*np.sum((y - w[0] - w[1]*x)**2)

Recall that to solve the least squares problem, we take the derivative of the cost function with respect to the weights and set it equal to zero.  This produces the normal equations:
$$
X^T X w = X^T y
$$
where $X$ is the design matrix (a column of ones next to a column of the $x_i$) and $w = [w_0, w_1]^T$.  These have an analytical solution.  However, for the purposes of illustration, we will assume that we can't solve them directly and have to use gradient descent.  The components of the gradient with respect to the intercept and the slope of the line we want to fit are 
$$
\frac{\partial L}{\partial w_0} = -\sum_{i=1}^m (y_i - w_0 - w_1 x_i)
$$
$$
\frac{\partial L}{\partial w_1} = -\sum_{i=1}^m (y_i - w_0 - w_1 x_i) x_i
$$
Writing a Python function for this gives us:

In [None]:
def G(w,x,y):
    # Residuals of the current fit; the gradient is minus their sum and minus their x-weighted sum
    r = y - w[0] - w[1]*x
    return np.array([-np.sum(r), -np.sum(r*x)])
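
Before relying on a hand-derived gradient, it can be worth comparing it against a finite-difference estimate. The following is a minimal sketch of such a check; the helper `numerical_gradient`, the step size `h`, and the test values `x_check`, `y_check`, and `w_check` are hypothetical names introduced here for illustration and are not used elsewhere in this notebook.

```python
import numpy as np

def L(w, x, y):
    """Least-squares cost: half the sum of squared residuals."""
    return 0.5 * np.sum((y - w[0] - w[1] * x) ** 2)

def G(w, x, y):
    """Analytic gradient of L with respect to (intercept, slope)."""
    r = y - w[0] - w[1] * x
    return np.array([-np.sum(r), -np.sum(r * x)])

def numerical_gradient(w, x, y, h=1e-6):
    """Central-difference estimate of the gradient of L (illustrative helper)."""
    grad = np.zeros_like(w)
    for k in range(len(w)):
        step = np.zeros_like(w)
        step[k] = h
        grad[k] = (L(w + step, x, y) - L(w - step, x, y)) / (2 * h)
    return grad

# Arbitrary test data and weights, chosen only for this check
rng = np.random.default_rng(0)
x_check = np.linspace(0, 1, 21)
y_check = x_check + 1.0 + 0.1 * rng.standard_normal(len(x_check))
w_check = np.array([0.3, -0.2])

analytic = G(w_check, x_check, y_check)
numeric = numerical_gradient(w_check, x_check, y_check)
print("analytic:", analytic)
print("numeric: ", numeric)
```

Because the cost is quadratic in the weights, the central difference should agree with the analytic gradient up to floating-point rounding.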

First, we can run so-called batch gradient descent, in which we look at all the data points at once.  We'll save our weights at each epoch, so that we can plot them.  A learning rate $\eta=0.01$ works well for this problem.

In [None]:
w_batch = [w.copy()]
eta = 0.01
for i in range(10000):
    w -= eta*G(w,x,y)
    w_batch.append(w.copy())
    
w_batch = np.array(w_batch)
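
As a sanity check on where gradient descent should end up, the normal equations mentioned earlier can be solved directly. This is a small sketch, assuming the data are regenerated in the same way as above; the names `rng`, `x_ne`, `y_ne`, `X`, and `w_exact` are introduced here for illustration, and a local `RandomState` is used so that the notebook's global random state is left untouched.

```python
import numpy as np

# Regenerate the data with a local random state (does not disturb np.random's global state)
rng = np.random.RandomState(0)
x_ne = np.linspace(0, 1, 21)
y_ne = x_ne + rng.randn(len(x_ne)) * 0.1 + 1.0

# Design matrix: a column of ones (intercept) next to the x values (slope)
X = np.column_stack([np.ones_like(x_ne), x_ne])

# Solve X^T X w = X^T y for w = [intercept, slope]
w_exact = np.linalg.solve(X.T @ X, X.T @ y_ne)
print("intercept, slope:", w_exact)
```

In the notebook, the last row of `w_batch` can be compared against `w_exact`; with enough iterations the two should be close.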

Let's plot our path to the minimum:

In [None]:
plt.plot(w_batch[:,0],w_batch[:,1],'go-')
plt.xlabel('Intercept')
plt.ylabel('Slope')
plt.show()

Now we'll do so-called stochastic gradient descent, where we compute the gradient, and update the weights, based on a single data point sampled at random without replacement from the dataset.  We'll also save the weights once per *epoch*, which is the time period after which all of the data points have been observed.

In [None]:
w = np.array([0.,0.])
w_stoch = [w.copy()]
w_epoch = [w.copy()]
eta = 0.01
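for i in range(10000):
    # Record the weights at the start of this epoch (i.e. the end of the previous one)
    w_epoch.append(w.copy())
    # Visit every data point exactly once, in a fresh random order
    random_indices = np.random.choice(range(len(x)),len(x),replace=False)
    for j in random_indices:
        x_sample = x[j]
        y_sample = y[j]
        # Take a gradient step using this single sample
        w -= eta*G(w,np.array([x_sample]),np.array([y_sample]))
        w_stoch.append(w.copy())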

w_stoch = np.array(w_stoch)
w_epoch = np.array(w_epoch)


Let's plot this on top of the results from batch gradient descent:

In [None]:
plt.plot(w_batch[:,0],w_batch[:,1],'go-')
plt.plot(w_stoch[:,0],w_stoch[:,1],'b-')
plt.plot(w_epoch[:,0],w_epoch[:,1],'ro')

plt.xlabel('Intercept')
plt.ylabel('Slope')
plt.show()

There are some wiggles, but the stochastic value after each epoch falls remarkably close to the batch descent line.  This is even more interesting if we zoom in on the upper right region (near convergence).

In [None]:
plt.plot(w_batch[:,0],w_batch[:,1],'go-')
plt.plot(w_stoch[:,0],w_stoch[:,1])
plt.plot(w_epoch[:,0],w_epoch[:,1],'ro')

plt.xlim(1.1,1.2)
plt.ylim(0.7,0.8)

plt.xlabel('Intercept')
plt.ylabel('Slope')
plt.show()

Why does this work?  Let's look at the sum of the individual weight updates in SGD over an epoch.
$$
\Delta w_{SGD} = \eta [\sum_{i=1}^m (y_i - w_{0,i} - w_{1,i} x_i),\sum_{i=1}^m (y_i - w_{0,i} - w_{1,i} x_i)x_i]
$$
Compare this to the update for batch gradient descent
$$
\Delta w_{BGD} = \eta [\sum_{i=1}^m (y_i - w_{0} - w_{1} x_i),\sum_{i=1}^m (y_i - w_{0} - w_{1} x_i)x_i]
$$
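You'll note that it's exactly the same, with the exception of the subscripts on the weights: in SGD, $w_{0,i}$ and $w_{1,i}$ are the weights at the moment data point $i$ is visited, whereas in batch gradient descent the weights are held fixed across the whole sum.  However, since the weights aren't changing very rapidly (we're taking small steps after all), the resulting updates are very close to identical.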
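
The argument above can be checked numerically by summing the SGD updates over one epoch and comparing the total with a single batch update from the same starting weights. This is a sketch under stated assumptions: the helper `epoch_updates` and the names `x_demo`, `y_demo`, `order`, and `rel_diffs` are introduced here for illustration, and both updates start from zero weights.

```python
import numpy as np

def G(w, x, y):
    """Gradient of the least-squares cost with respect to (intercept, slope)."""
    r = y - w[0] - w[1] * x
    return np.array([-np.sum(r), -np.sum(r * x)])

def epoch_updates(eta, x, y, order):
    """Return (total SGD update over one epoch, one batch update), both from w = 0."""
    w_start = np.zeros(2)
    delta_bgd = -eta * G(w_start, x, y)
    w = w_start.copy()
    for j in order:
        # Single-sample step; the weights change between samples
        w -= eta * G(w, x[j:j + 1], y[j:j + 1])
    return w - w_start, delta_bgd

# Data generated in the same way as the notebook, with a local random state
rng = np.random.RandomState(0)
x_demo = np.linspace(0, 1, 21)
y_demo = x_demo + rng.randn(len(x_demo)) * 0.1 + 1.0
order = rng.permutation(len(x_demo))

rel_diffs = []
for eta in (0.01, 0.001):
    d_sgd, d_bgd = epoch_updates(eta, x_demo, y_demo, order)
    rel = np.linalg.norm(d_sgd - d_bgd) / np.linalg.norm(d_bgd)
    rel_diffs.append(rel)
    print("eta =", eta, "SGD:", d_sgd, "batch:", d_bgd, "relative difference:", rel)
```

The relative difference should shrink as the learning rate is reduced, since smaller steps mean the weights drift less within an epoch.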

These are two end-member options for gradient descent.  The best solution for the purposes of machine learning often lies somewhere in the middle, via a technique called mini-batch gradient descent.  In mini-batch gradient descent, at each epoch we split the dataset into $k$ subsets of a specified size, known as the *batch size*, and take one gradient step per subset.

In [None]:
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

w = np.array([0.,0.])
w_mini = [w.copy()]
w_epoch_mini = [w.copy()]
eta = 0.01
#batch_size = 1 # Stochastic Gradient Descent
batch_size = 7 # Mini-batch Gradient Descent
#batch_size = 21 # Batch Gradient Descent
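for i in range(10000):
    # Record the weights at the start of this epoch (i.e. the end of the previous one)
    w_epoch_mini.append(w.copy())
    # Shuffle the indices, then split them into consecutive mini-batches
    random_indices = np.random.permutation(len(x))
    mini_batches = chunks(random_indices,batch_size)
    for indices in mini_batches:
        x_sample = x[indices]
        y_sample = y[indices]
        # Take one gradient step per mini-batch
        w -= eta*G(w,x_sample,y_sample)
        w_mini.append(w.copy())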

w_mini = np.array(w_mini)
w_epoch_mini = np.array(w_epoch_mini)

In [None]:
plt.plot(w_batch[:,0],w_batch[:,1],'go-')
plt.plot(w_mini[:,0],w_mini[:,1],'b-')
plt.plot(w_epoch_mini[:,0],w_epoch_mini[:,1],'ro')

plt.xlabel('Intercept')
plt.ylabel('Slope')
plt.show()

The mini-batch gradient descent produces results which are somewhat intermediate between the stochastic and batch versions.  
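
To make the role of the batch size concrete, the short sketch below shows how the `chunks` helper splits one shuffled epoch of 21 indices for a few batch sizes. The names `rng`, `indices`, and `batch_layout` are introduced here for illustration, and the batch size of 8 is an extra hypothetical case that does not divide the dataset evenly.

```python
import numpy as np

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

rng = np.random.default_rng(0)
indices = rng.permutation(21)  # one shuffled pass over 21 data points

batch_layout = {}
for batch_size in (1, 7, 8, 21):
    # Sizes of the mini-batches produced for this batch size
    sizes = [len(batch) for batch in chunks(indices, batch_size)]
    batch_layout[batch_size] = sizes
    print("batch_size =", batch_size, "->", len(sizes), "updates per epoch, sizes", sizes)
```

When the batch size does not divide the dataset evenly, the last mini-batch is smaller. Because `G` sums rather than averages over its samples, that final step is based on a gradient of correspondingly smaller magnitude.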

It's worth noting that for this example, stochastic gradient descent takes quite a bit more time to run.  This is because our dataset is relatively small and the problem we are trying to solve is relatively simple, so a full-batch gradient is cheap to compute and the per-sample updates mostly add overhead.  However, in large-scale problems (think running neural networks over millions of images), it's often not possible to fit the training set into memory, and the computation for a full-batch gradient becomes overwhelming.  In addition, in cases where there are many local minima, SGD may perform better, because some local minima of the full cost function may only form when a large number of data points are considered simultaneously.  In this sense, it may also be viewed as a form of *regularization*, because it can help the model avoid overfitting.

### Momentum

Another popular variant on gradient descent is the inclusion of momentum.  Momentum utilizes the following parameter update, where $m \in [0,1)$ is the momentum coefficient and $\nabla L(\mathbf{w}_i)$ is the gradient of the cost function evaluated at the current weights:
$$
\Delta \mathbf{w}_i = m \Delta \mathbf{w}_{i-1} + (1-m) \nabla L(\mathbf{w}_i)
$$
$$
\mathbf{w}_{i+1} = \mathbf{w}_i - \eta \Delta \mathbf{w}_i
$$
This effectively makes the update direction slower to change, and can help to carry the model through shallow local minima.  Let's illustrate its function using stochastic gradient descent (mini-batch size 1).

In [None]:
w = np.array([0.,0.])
w_momen = [w.copy()]
w_mepoch = [w.copy()]
eta = 0.01
momentum = 0.9
delta_w = 0.0
for i in range(10000):
    # Record the weights at the start of this epoch
    w_mepoch.append(w.copy())
    random_indices = np.random.choice(range(len(x)),len(x),replace=False)
    for j in random_indices:
        x_sample = x[j]
        y_sample = y[j]
        # Blend the previous update direction with the new single-sample gradient
        delta_w = momentum*delta_w + (1.-momentum)*G(w,np.array([x_sample]),np.array([y_sample]))
        w -= eta*delta_w
        w_momen.append(w.copy())
        
w_momen = np.array(w_momen)

In [None]:
plt.plot(w_batch[:,0],w_batch[:,1],'go-')
plt.plot(w_stoch[:,0],w_stoch[:,1],'b-')
plt.plot(w_momen[:,0],w_momen[:,1],'r-')

#plt.xlim(1.1,1.2)
#plt.ylim(0.7,0.8)

plt.xlabel('Intercept')
plt.ylabel('Slope')
plt.show()

In [None]:
plt.plot(w_batch[:,0],w_batch[:,1],'go-')
plt.plot(w_stoch[:,0],w_stoch[:,1],'b-')
plt.plot(w_momen[:,0],w_momen[:,1],'r-')

plt.xlim(1.1,1.2)
plt.ylim(0.7,0.8)

plt.xlabel('Intercept')
plt.ylabel('Slope')
plt.show()

Momentum reduces the size of the wiggles due to the stochasticity in stochastic gradient descent.  
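
The smoothing effect can be seen in isolation by applying the momentum recursion to a synthetic stream of noisy gradients. This is a minimal sketch, not part of the regression example: the constant `true_gradient`, the unit-variance noise, and the names `noisy`, `smoothed`, and `burn_in` are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 0.9                # momentum coefficient, as in the update rule above
true_gradient = 1.0    # hypothetical constant underlying gradient
noisy = true_gradient + rng.standard_normal(5000)  # noisy gradient estimates

# Apply delta_i = m * delta_{i-1} + (1 - m) * g_i to the noisy stream
smoothed = np.empty_like(noisy)
delta = 0.0
for i, g in enumerate(noisy):
    delta = m * delta + (1.0 - m) * g
    smoothed[i] = delta

burn_in = 100  # skip the start-up phase where delta is still biased towards zero
print("std of raw gradients:     ", noisy[burn_in:].std())
print("std of smoothed gradients:", smoothed[burn_in:].std())
print("mean of smoothed gradients:", smoothed[burn_in:].mean())
```

For independent noise, the running average keeps the mean of the gradient while reducing its standard deviation by a factor of roughly $\sqrt{(1-m)/(1+m)}$, which is about 0.23 for $m = 0.9$.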

### RMSprop

One final popular variant of gradient descent is called RMSprop.  It is similar to gradient descent with momentum, but with a twist: instead of keeping a running average of the gradient, RMSprop keeps a running average of the squared gradient.  Then, when it comes time to update the weights, it normalizes the gradient by the square root of this average squared gradient.  What does this do?  It largely eliminates the scale of the gradient from the problem, so that we go downhill based mostly on the sign.  The averaging is necessary because the gradient can jump around a lot, so it's better to know generally which direction is down (this is especially true in stochastic gradient descent).

In [None]:
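w = np.array([0.,0.])
# Note: w_momen and w_mepoch are reused here, so from this point on they hold the RMSprop path
w_momen = [w.copy()]
w_mepoch = [w.copy()]
eta = 0.0001
momentum = 0.9  # decay rate of the running average of the squared gradient
delta_w = 0     # running average of the squared gradient
for i in range(10000):
    w_mepoch.append(w.copy())
    random_indices = np.random.choice(range(len(x)),len(x),replace=False)
    for j in random_indices:
        x_sample = x[j]
        y_sample = y[j]
        gradient = G(w,np.array([x_sample]),np.array([y_sample]))
        # Update the running average of the squared gradient (elementwise)
        delta_w = momentum*delta_w + (1.-momentum)*gradient**2
        # Normalize the step by the root-mean-square gradient; 1e-8 avoids division by zero
        w -= eta*gradient/(np.sqrt(delta_w) + 1e-8)
        w_momen.append(w.copy())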
        
w_momen = np.array(w_momen)
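
The claim that RMSprop removes the scale of the gradient can be checked directly: feeding the same gradient sequence, multiplied by a large constant, through the update should produce nearly identical steps. The sketch below assumes a hypothetical helper `rmsprop_steps` and a synthetic gradient sequence `grads`; neither is part of the regression example.

```python
import numpy as np

def rmsprop_steps(gradients, eta=1e-4, decay=0.9, eps=1e-8):
    """Return the RMSprop step taken for each gradient in a sequence (illustrative helper)."""
    avg_sq = np.zeros_like(gradients[0])
    steps = []
    for g in gradients:
        # Running average of the squared gradient, elementwise
        avg_sq = decay * avg_sq + (1.0 - decay) * g ** 2
        # Step normalized by the root-mean-square gradient
        steps.append(eta * g / (np.sqrt(avg_sq) + eps))
    return np.array(steps)

rng = np.random.default_rng(0)
grads = rng.standard_normal((50, 2))   # synthetic two-component gradients

steps_small = rmsprop_steps(grads)
steps_large = rmsprop_steps(1000.0 * grads)   # same directions, 1000 times the scale

print("largest difference between the two runs:", np.abs(steps_small - steps_large).max())
print("largest step in any component:", np.abs(steps_small).max())
```

With these settings no single step can exceed `eta / sqrt(1 - decay)` in any component, which is one reason the RMSprop run above uses a much smaller learning rate than the earlier examples.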

In [None]:
plt.plot(w_batch[:,0],w_batch[:,1],'go-')
plt.plot(w_stoch[:,0],w_stoch[:,1],'b-')
plt.plot(w_momen[::1000,0],w_momen[::1000,1],'ro-')

#plt.xlim(1.1,1.2)
#plt.ylim(0.7,0.8)

plt.xlabel('Intercept')
plt.ylabel('Slope')
plt.show()

In [None]:
plt.plot(w_batch[:,0],w_batch[:,1],'go-')
plt.plot(w_stoch[:,0],w_stoch[:,1],'b-')
plt.plot(w_momen[:,0],w_momen[:,1],'r-')

plt.xlim(1.1,1.2)
plt.ylim(0.7,0.9)

plt.xlabel('Intercept')
plt.ylabel('Slope')
plt.show()

In practice, RMSprop can be combined with normal momentum.  
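
One plausible way to combine the two ideas is to keep a running average of the gradient (momentum) and of the squared gradient (RMSprop), and to step along the first divided by the square root of the second. The sketch below is an assumption about how such a combination might look, not a method taken from this notebook; it resembles the Adam optimizer without its bias-correction terms. The names `beta_m`, `beta_v`, `avg_grad`, and `avg_sq` are hypothetical, and the full-batch gradient is used to keep the demonstration deterministic.

```python
import numpy as np

def L(w, x, y):
    """Least-squares cost: half the sum of squared residuals."""
    return 0.5 * np.sum((y - w[0] - w[1] * x) ** 2)

def G(w, x, y):
    """Gradient of L with respect to (intercept, slope)."""
    r = y - w[0] - w[1] * x
    return np.array([-np.sum(r), -np.sum(r * x)])

# Data generated in the same way as the notebook, with a local random state
rng = np.random.RandomState(0)
x_demo = np.linspace(0, 1, 21)
y_demo = x_demo + rng.randn(len(x_demo)) * 0.1 + 1.0

w = np.zeros(2)
eta = 0.001
beta_m = 0.9            # decay rate for the running average of the gradient
beta_v = 0.9            # decay rate for the running average of the squared gradient
avg_grad = np.zeros(2)
avg_sq = np.zeros(2)

loss_start = L(w, x_demo, y_demo)
for step in range(5000):
    g = G(w, x_demo, y_demo)
    avg_grad = beta_m * avg_grad + (1.0 - beta_m) * g       # momentum part
    avg_sq = beta_v * avg_sq + (1.0 - beta_v) * g ** 2      # RMSprop part
    w -= eta * avg_grad / (np.sqrt(avg_sq) + 1e-8)
loss_end = L(w, x_demo, y_demo)

print("weights:", w)
print("loss at start:", loss_start, "loss at end:", loss_end)
```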

These are just a few examples of the large-scale gradient descent schemes that can be used for general optimization problems, but especially neural networks.  There are many, many other methods (a good overview can be found [here](http://ruder.io/optimizing-gradient-descent/)).  However, effectively, these are all just slight variations on the general theme of figuring out which way is down, and going in that direction.