---
slug: "/blog/optimizationfordeeplearning"
date: "2021-04-12"
title: "Optimization for Deep Learning"
category: "2 Deep Learning"
order: 4
---

### Optimization

Optimization is the process of minimizing the loss function while training a neural network.
There are many approaches to optimization.
This post discusses several of them, including gradient descent, stochastic gradient descent, AdaGrad, RMSProp, and Adam.
It also covers general techniques that can be combined with these optimizers, such as momentum and adaptive learning rates.

### Gradient Descent

Vanilla gradient descent is one of the simplest approaches to optimization.
Each step reduces the loss by moving the weights in the direction opposite to the gradient of the loss with respect to the weights, scaled by the hyperparameter $\alpha$.
The weight update performed in gradient descent is shown below:

$$
\begin{aligned}
    w &= w - \alpha \frac{\partial L}{\partial w}&
    \text{Weight update}\\
    \alpha &\rightarrow \text{learning rate (model hyperparameter)} &\\
\end{aligned}
$$
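As a minimal sketch of the update above, the snippet below runs gradient descent on a hypothetical one-dimensional loss $L(w) = (w - 3)^2$. The loss, the helper name `grad`, the starting point, and the learning rate are illustrative assumptions rather than part of the method itself.

```python
def grad(w):
    """Gradient of the illustrative loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, alpha=0.1, steps=100):
    """Repeatedly apply the update w = w - alpha * dL/dw."""
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


w_final = gradient_descent(w=0.0)
print(w_final)  # approaches the minimizer w = 3
```

With this loss, each step shrinks the distance to the minimizer by a constant factor of $1 - 2\alpha$, so a learning rate above $1$ would make the iterates diverge.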

### Stochastic Gradient Descent

Stochastic gradient descent (SGD) is a variant of gradient descent and one of the most popular optimization techniques in machine learning.
In plain gradient descent, the weights are updated only after the entire dataset has been seen; in stochastic gradient descent, they are updated after each randomly drawn data point.
Minibatch stochastic gradient descent is a variant of stochastic gradient descent that updates the weights using a batch of $B$ randomly drawn data points rather than a single one.
The weight updates for stochastic gradient descent and minibatch stochastic gradient descent are shown below:

$$
\begin{aligned}
    w &= w - \alpha \frac{\partial L(y_i, f(x_i; w))}{\partial w}
    &\text{Weight update}\\
    w &= w - \alpha \left[\frac{1}{B}\sum^B_{i=1} \frac{\partial L(y_i, f(x_i;w))}{\partial w}\right]
    &\text{Minibatch weight update}\\
    \alpha &\rightarrow \text{learning rate (model hyperparameter)} &\\
    B &\rightarrow \text{minibatch size} &\\
\end{aligned}
$$
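The two updates above can be sketched in a few lines of plain Python. The example assumes a hypothetical one-parameter linear model $f(x; w) = wx$ with loss $\frac{1}{2}(wx - y)^2$ and noise-free toy data; the function names, data, and hyperparameter values are illustrative only.

```python
import random


def grad_single(w, x, y):
    """Gradient of 0.5 * (w * x - y)^2 with respect to w for one example."""
    return (w * x - y) * x


def minibatch_sgd(data, w=0.0, alpha=0.05, batch_size=4, steps=500, seed=0):
    """Minibatch SGD; batch_size=1 recovers single-example SGD."""
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # randomly drawn minibatch
        # Average the per-example gradients over the minibatch.
        g = sum(grad_single(w, x, y) for x, y in batch) / batch_size
        w = w - alpha * g
    return w


# Hypothetical noise-free data generated from y = 2x.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]
w_sgd = minibatch_sgd(data)
print(w_sgd)  # close to the true slope 2
```

Because the toy data contain no noise, every minibatch gradient points toward the same solution; on real data the updates would be noisier, and the learning rate is often decayed over time.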

### Momentum

Momentum is a general method for accelerating optimization: the weights are updated using a moving average of past gradients rather than only the most recently calculated gradient.
The exponentially decaying moving average of gradients, $v$, is the velocity at which the weights of the model move, and the hyperparameter $\epsilon$ controls how quickly the contribution of past gradients decays.
Nesterov momentum is a variant that evaluates the gradient after applying the current velocity, rather than before, as in standard momentum.

$$
\begin{aligned}
    v &= \epsilon v - \alpha \left[\frac{1}{B}\sum^B_{i=1}\frac{\partial L(y_i, f(x_i; w))}{\partial w}\right]
    & \text{SGD with Momentum}\\
    v &= \epsilon v - \alpha \left[\frac{1}{B}\sum^B_{i=1}\frac{\partial L(y_i, f(x_i; w + \epsilon v))}{\partial w}\right]
    & \text{SGD with Nesterov Momentum}\\
    w &= w + v 
    &\text{Weight update}\\
    \alpha &\rightarrow \text{learning rate (model hyperparameter)}&\\
    \epsilon &\rightarrow \text{momentum coefficient in } [0, 1) \text{ (model hyperparameter)}&\\
    v & \rightarrow \text{velocity}&\\
\end{aligned}
$$
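The following sketch applies both momentum updates to a hypothetical loss $L(w) = (w - 3)^2$, using the full gradient in place of a minibatch average to keep the example short. The only difference between the two variants is where the gradient is evaluated: at $w$ for standard momentum and at the look-ahead point $w + \epsilon v$ for Nesterov momentum. All names and hyperparameter values are illustrative assumptions.

```python
def grad(w):
    """Gradient of the illustrative loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def momentum_sgd(w, alpha=0.05, eps=0.9, steps=500, nesterov=False):
    """Gradient descent with standard or Nesterov momentum."""
    v = 0.0  # velocity starts at zero
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point w + eps * v.
        point = w + eps * v if nesterov else w
        v = eps * v - alpha * grad(point)
        w = w + v
    return w


w_standard = momentum_sgd(0.0)
w_nesterov = momentum_sgd(0.0, nesterov=True)
print(w_standard, w_nesterov)  # both approach the minimizer w = 3
```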

### AdaGrad

$$
\begin{aligned}
    r &= r + \left[\frac{1}{B}\sum^B_{i=1}\frac{\partial L(y_i, f(x_i; w))}{\partial w}\right]^2\\
    w &= w - \frac{\alpha}{\delta + \sqrt{r}} \left[
         \frac{1}{B} \sum^B_{i=1} \frac{\partial L(y_i, f(x_i; w))}{\partial w}
     \right]\\
     \alpha &\rightarrow \text{learning rate (model hyperparameter)}\\
     r &\rightarrow \text{running sum of element-wise squared gradients}\\
     \delta &\rightarrow \text{small constant, usually }10^{-6}\\
\end{aligned}
$$
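AdaGrad scales the learning rate of each weight by the inverse square root of the sum of all squared gradients seen so far for that weight, so weights with consistently large gradients take smaller steps. A minimal sketch of the update above, assuming a hypothetical two-parameter loss $\frac{1}{2}(10 w_0^2 + w_1^2)$ whose curvature differs by a factor of ten between the two weights; the names and hyperparameter values are illustrative.

```python
import math


def grad(w):
    """Gradient of the illustrative loss 0.5 * (10 * w[0]^2 + w[1]^2)."""
    return [10.0 * w[0], 1.0 * w[1]]


def adagrad(w, alpha=0.5, delta=1e-6, steps=200):
    """AdaGrad with a separate accumulator r for every weight."""
    r = [0.0] * len(w)  # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        r = [ri + gi * gi for ri, gi in zip(r, g)]
        w = [wi - alpha / (delta + math.sqrt(ri)) * gi
             for wi, ri, gi in zip(w, r, g)]
    return w


w_adagrad = adagrad([1.0, 1.0])
print(w_adagrad)  # both weights approach the minimizer at 0
```

Because $r$ only grows, the effective learning rate shrinks monotonically over training, which is the behaviour RMSProp modifies by replacing the sum with a moving average.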

### RMSProp

$$
\begin{aligned}
    r &= \rho r + (1-\rho) \left[
         \frac{1}{B} \sum^B_{i=1} \frac{\partial L(y_i, f(x_i; w))}{\partial w}
     \right]^2\\
     w &= w - \frac{\alpha}{\sqrt{\delta + r}} \left[
         \frac{1}{B} \sum^B_{i=1} \frac{\partial L(y_i, f(x_i; w))}{\partial w}
     \right]\\
     \alpha &\rightarrow \text{learning rate (model hyperparameter)}\\
     \rho &\rightarrow \text{decay rate of the moving average (model hyperparameter)}\\
     \delta &\rightarrow \text{small constant, usually }10^{-6}\\
\end{aligned}
$$
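RMSProp replaces AdaGrad's ever-growing sum with an exponentially decaying average of squared gradients, so old gradients are gradually forgotten. The sketch below implements the update above on the same kind of hypothetical two-parameter quadratic loss; the names and hyperparameter values are illustrative assumptions.

```python
import math


def grad(w):
    """Gradient of the illustrative loss 0.5 * (10 * w[0]^2 + w[1]^2)."""
    return [10.0 * w[0], 1.0 * w[1]]


def rmsprop(w, alpha=0.01, rho=0.9, delta=1e-6, steps=1000):
    """RMSProp with a per-weight moving average r of squared gradients."""
    r = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, g)]
        w = [wi - alpha / math.sqrt(delta + ri) * gi
             for wi, ri, gi in zip(w, r, g)]
    return w


w_rmsprop = rmsprop([1.0, 1.0])
print(w_rmsprop)  # both weights end up close to the minimizer at 0
```

Since $r \geq (1-\rho) g^2$ at every step, each individual update is bounded in magnitude by $\alpha / \sqrt{1-\rho}$. With a constant learning rate the iterates therefore tend to hover within a small band around the minimum rather than converging exactly.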

### RMSProp with Nesterov Momentum

$$
\begin{aligned}
    r &= \rho r + (1-\rho) \left[
         \frac{1}{B} \sum^B_{i=1} \frac{\partial L(y_i, f(x_i; w + \epsilon v))}{\partial w}
     \right]^2\\
    v &= \epsilon v - \frac{\alpha}{\sqrt{r}} \left[
         \frac{1}{B} \sum^B_{i=1} \frac{\partial L(y_i, f(x_i; w + \epsilon v))}{\partial w}
     \right]\\
    w &= w + v\\
    \alpha &\rightarrow \text{learning rate (model hyperparameter)}\\
    \rho &\rightarrow \text{decay rate of the moving average (model hyperparameter)}\\
    \epsilon &\rightarrow \text{momentum coefficient (model hyperparameter)}\\
    v &\rightarrow \text{velocity}\\
\end{aligned}
$$
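The update above combines the RMSProp scaling with a Nesterov-style look-ahead: the gradient is evaluated at $w + \epsilon v$, and the velocity accumulates the scaled gradient. A minimal sketch on a hypothetical quadratic loss follows. As an assumption beyond the equations, the code adds a small constant `delta` inside the square root to guard against division by zero; all names and hyperparameter values are illustrative.

```python
import math


def grad(w):
    """Gradient of the illustrative loss 0.5 * (10 * w[0]^2 + w[1]^2)."""
    return [10.0 * w[0], 1.0 * w[1]]


def rmsprop_nesterov(w, alpha=0.01, rho=0.9, eps=0.5, delta=1e-6, steps=1000):
    """RMSProp with Nesterov momentum; delta is an added safety guard."""
    r = [0.0] * len(w)
    v = [0.0] * len(w)
    for _ in range(steps):
        # Evaluate the gradient at the look-ahead point w + eps * v.
        g = grad([wi + eps * vi for wi, vi in zip(w, v)])
        r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, g)]
        v = [eps * vi - alpha / math.sqrt(delta + ri) * gi
             for vi, ri, gi in zip(v, r, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w


w_rms_nag = rmsprop_nesterov([1.0, 1.0])
print(w_rms_nag)  # both weights end up near the minimizer at 0
```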

### Adam

$$
\begin{aligned}
    s &= \rho_1 s + (1 - \rho_1) \left[ 
        \frac{1}{B} \sum^B_{i=1} 
        \frac{\partial L(y_i, f(x_i; w))}{\partial w} 
    \right]\\
    r &= \rho_2 r + (1 - \rho_2) \left[ 
        \frac{1}{B} \sum^B_{i=1} 
        \frac{\partial L(y_i, f(x_i; w))}{\partial w} 
    \right]^2\\
    \hat{s} &= \frac{s}{1-\rho_1^t}\\
    \hat{r} &= \frac{r}{1-\rho_2^t}\\
    w &= w - \alpha \left[ 
        \frac{\hat{s}}{\sqrt{\hat{r}} + \delta}
    \right]\\
    \alpha &\rightarrow \text{learning rate (model hyperparameter)}\\
    \rho_1 &\rightarrow \text{decay rate of the first-moment estimate (model hyperparameter)}\\
    \rho_2 &\rightarrow \text{decay rate of the second-moment estimate (model hyperparameter)}\\
    t &\rightarrow \text{time step, starting at } 1\\
    \delta &\rightarrow \text{small constant, usually }10^{-8}\\
\end{aligned}
$$
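Adam keeps a moving average of the gradient ($s$, the first moment) and of the squared gradient ($r$, the second moment), and corrects both for their bias toward zero early in training before using them in the update. The sketch below assumes the standard form of Adam, in which the gradient is evaluated at the current weights $w$ and the bias correction uses $\rho_1^t$ and $\rho_2^t$ at step $t$. The quadratic loss, names, and hyperparameter values are illustrative assumptions.

```python
import math


def grad(w):
    """Gradient of the illustrative loss 0.5 * (10 * w[0]^2 + w[1]^2)."""
    return [10.0 * w[0], 1.0 * w[1]]


def adam(w, alpha=0.05, rho1=0.9, rho2=0.999, delta=1e-8, steps=1000):
    """Adam with bias-corrected first (s) and second (r) moment estimates."""
    s = [0.0] * len(w)
    r = [0.0] * len(w)
    for t in range(1, steps + 1):  # t starts at 1 for the bias correction
        g = grad(w)
        s = [rho1 * si + (1.0 - rho1) * gi for si, gi in zip(s, g)]
        r = [rho2 * ri + (1.0 - rho2) * gi * gi for ri, gi in zip(r, g)]
        s_hat = [si / (1.0 - rho1 ** t) for si in s]
        r_hat = [ri / (1.0 - rho2 ** t) for ri in r]
        w = [wi - alpha * sh / (math.sqrt(rh) + delta)
             for wi, sh, rh in zip(w, s_hat, r_hat)]
    return w


w_adam = adam([1.0, 1.0])
print(w_adam)  # both weights approach the minimizer at 0
```

Without the bias correction, $s$ and $r$ would start near zero and the first few updates would be poorly scaled; dividing by $1 - \rho^t$ compensates for the zero initialization.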

### Resources

- Goodfellow, Ian, et al. *Deep Learning*. MIT Press, 2016.