# Deep Learning
## Optimization

Author: Bingchen Wang

Last Updated: 9 Nov, 2022


<nav>
    <b>Deep learning navigation:</b> <a href="./Deep Learning Basics.ipynb">Basics</a>
</nav>

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="../Supervised Learning/Supervised%20Learning.ipynb">Supervised Learning</a>
</nav>

---

## Contents
Algorithms:
- [Batch Gradient Descent](#BGD)
- [Stochastic Gradient Descent](#SGD)
- [Mini-batch Gradient Descent](#MBGD)
- [Gradient Descent with Momentum](#GDwM)
- [RMSprop](#RMSp)
- [Adam](#ADAM)

Technique:
- [Learning Rate Decay](#LRD)

## Algorithms
Consider the linear model/hypothesis
$$
h_{\theta}(x^{(i)}) = \theta^{\mathsf{T}} x^{(i)}
$$
and the squared loss function on a single example $(x^{(i)}, y^{(i)})$
$$
L(\theta) = \frac{1}{2} {\left(y^{(i)} - h_{\theta}(x^{(i)})\right)}^2
$$

<a name = "BGD"></a>
### Batch Gradient Descent
Use the entire training set of $m$ examples in every iteration of the training process.
<blockquote>
    Start with some $\theta \in \mathbb{R}^P$ <br>
    Repeat until convergence:
    <blockquote>
        Compute $h_\theta(x^{(i)})$ for <b>all</b> examples $i = 1, \dots, m$. <br>
        Calculate the cost $$J(\theta) = \frac{1}{2m} \sum_{i=1}^m {\left(y^{(i)} - h_{\theta}(x^{(i)})\right)}^2.$$ <br>
        Update the parameters $$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}, \;\;\; \text{for } j = 1, \dots, P$$
    </blockquote>
</blockquote>    
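
For the linear model and squared loss above, the gradient is $\frac{\partial J}{\partial \theta} = -\frac{1}{m} X^{\mathsf{T}}(y - X\theta)$. The procedure can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `batch_gradient_descent`, the synthetic data, the fixed iteration count (used in place of a convergence test), and the value of $\alpha$ are all assumptions made for the example.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=500):
    """Minimise J(theta) = 1/(2m) * sum((y - X @ theta)**2) using the full training set."""
    m, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        residual = y - X @ theta        # y^(i) - h_theta(x^(i)) for all i
        grad = -(X.T @ residual) / m    # dJ/dtheta_j for every j at once
        theta -= alpha * grad
    return theta

# Synthetic data (assumed for illustration): y = 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.01 * rng.normal(size=200)

theta_bgd = batch_gradient_descent(X, y)
print(theta_bgd)  # close to true_theta
```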

<a name ="SGD"></a>
### Stochastic Gradient Descent
Use a single example in every iteration of the training process.

<blockquote>
    Start with some $\theta \in \mathbb{R}^P$ <br>
    Repeat until an approximate minimum is obtained or a stopping threshold is met (e.g., a maximum number of epochs):
    <blockquote>
        Randomly shuffle examples in the training set. <br>
        For $i = 1, \dots, m$ do:
        <blockquote>
            Compute $h_\theta(x^{(i)})$. <br>
            Calculate the cost $$J(\theta) = \frac{1}{2} {\left(y^{(i)} - h_{\theta}(x^{(i)})\right)}^2.$$ <br>
            Update the parameters $$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}, \;\;\; \text{for } j = 1, \dots, P$$
        </blockquote>
    </blockquote>
</blockquote>    
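
A minimal NumPy sketch of the loop above, under the same linear model and squared loss. The function name `stochastic_gradient_descent`, the synthetic data, and the values of $\alpha$ and the epoch count are assumptions chosen for illustration.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=20, seed=0):
    """Update theta after every single example; reshuffle at the start of each epoch."""
    rng = np.random.default_rng(seed)
    m, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        for i in rng.permutation(m):        # random order over the training set
            residual = y[i] - X[i] @ theta  # y^(i) - h_theta(x^(i))
            # dJ/dtheta = -residual * x^(i), so the descent step adds alpha * residual * x^(i)
            theta += alpha * residual * X[i]
    return theta

# Synthetic data (assumed for illustration): y = 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.01 * rng.normal(size=200)

theta_sgd = stochastic_gradient_descent(X, y)
print(theta_sgd)  # close to true_theta, with some residual noise
```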

<a name = "MBGD"></a>
### Mini-batch Gradient Descent
Use a subset (a mini-batch) of the training set in every iteration of the training process. Denote the size of a mini-batch as $m_t$ and the total number of mini-batches as $T$, such that:
$$
m = m_t T
$$
<div class = "alert alert-block alert-info"> Typical choices for $m_t$: 64, 128, 256, 512, 1024. (Make sure that the mini-batch fits in the CPU/GPU memory.)
</div>
Further, denote the data in the mini-batch $t = 1, \dots, T$ as $\{X^{\{t\}}, y^{\{t\}}\}$.

<blockquote>
    Start with some $\theta \in \mathbb{R}^P$ <br>
    Repeat until an approximate minimum is obtained or a stopping threshold is met (e.g., a maximum number of epochs):
    <blockquote>
        Randomly shuffle examples in the training set. Split the training set into $T$ mini-batches.<br>
        For $t = 1, \dots, T$ do:
        <blockquote>
            Compute $h_\theta(x^{\{t\}(i)})$ for all examples in the mini-batch $t$. <br>
            Calculate the cost $$J(\theta) = \frac{1}{2m_t} \sum_{i = 1}^{m_t}{\left(y^{\{t\}(i)} - h_{\theta}(x^{\{t\}(i)})\right)}^2.$$ <br>
            Update the parameters $$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}, \;\;\; \text{for } j = 1, \dots, P$$
        </blockquote>
    </blockquote>
</blockquote>
<div class = "alert alert-block alert-success"><b>Advantages of mini-batch gradient descent</b>:
<ul>
    <li> Make use of vectorization to speed up the training.
    <li> Make progress without processing the entire training set.
</ul>
</div>
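
The mini-batch loop can be sketched as below. This is an illustrative sketch only: the function name `minibatch_gradient_descent`, the synthetic data, and the hyperparameter values are assumptions. The example uses $m = 256$ and $m_t = 64$, so $T = 4$; if $m$ is not a multiple of $m_t$, the slicing simply makes the last mini-batch smaller.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size=64, alpha=0.1, n_epochs=50, seed=0):
    """Shuffle each epoch, split into mini-batches, and update once per mini-batch."""
    rng = np.random.default_rng(seed)
    m, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]     # indices of mini-batch t
            residual = y[idx] - X[idx] @ theta        # vectorised over the mini-batch
            grad = -(X[idx].T @ residual) / len(idx)  # gradient of the mini-batch cost
            theta -= alpha * grad
    return theta

# Synthetic data (assumed for illustration): y = 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.01 * rng.normal(size=256)

theta_mbgd = minibatch_gradient_descent(X, y)
print(theta_mbgd)  # close to true_theta
```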

#### Performance Comparisons: Batch, Stochastic & Mini-batch Gradient Descent
<div style = "text-align: center;">
    <img src="./images/gradient descent first 100 epochs.png" style="width:60%;" >
    <img src="./images/gradient descent last 10 epochs.png" style="width:60%;" >
</div>

<a name = "GDwM"></a>
### Gradient Descent with Momentum

#### Exponentially Weighted Moving Averages
Denote the original time series examples as $\{x_t\}_{t=1}^T$.

The **exponentially weighted moving averages** are computed using the recursive formula:
$$
\begin{align}
v_0 = & 0 \\
v_t = & \beta v_{t-1} + (1-\beta) x_t, \text{ for } t=1, \dots, T
\end{align}
$$
where $\beta$ is a hyperparameter that governs the smoothness of the averages. Roughly speaking, $v_t$ averages over the last $\frac{1}{1-\beta}$ observations (e.g., days for daily data). To see why, write $n = \frac{1}{1-\beta}$: the weight on an observation $n$ steps back is scaled by $\beta^n = (1-\frac{1}{n})^n \approx \frac{1}{e}$, which is considered small enough to neglect in the average.
<div class = "alert alert-block alert-info"> <b>Advantage over simple moving averages/windows:</b> Saves on memory (just need to keep one running variable and keep overwriting it), and computation.<br>
<b>Disadvantage vis-à-vis simple moving averages/windows:</b> not the most accurate way to compute an average.
</div>

A problem with exponentially weighted moving averages is that they require a **burn-in period** before they give sensible estimates of the average, which is evident from the following equations:
$$
\begin{align}
v_1 = & (1-\beta) x_1 \\
v_2 = & (1-\beta)\beta x_1 + (1-\beta) x_2 \\
v_3 = & (1-\beta)\beta^2 x_1 + (1-\beta)\beta x_2 + (1-\beta) x_3 \\
\cdots &
\end{align}
$$
The closer $\beta$ is to 1, the longer the burn-in period needs to be. To solve this problem and ensure better estimates near the beginning of the time series, bias correction can be used to improve the algorithm:
$$
\tilde v_t = \frac{v_t}{1-\beta^t}, \text{ for } t=1, \dots, T
$$
<div class = "alert alert-block alert-danger"> <b>Note:</b> To apply bias correction, compute the uncorrected series first and then divide each value $v_t$ of the uncorrected series by its bias correction term $1-\beta^t$. The recursion itself keeps using the uncorrected values.
</div>
<div class = "alert alert-block alert-success"> This makes use of the geometric series: the weights in $v_t$ sum to
    $$
    s_t = \sum_{k = 1}^t (1-\beta) \beta^{k-1} = (1-\beta) \frac{1 - \beta^t}{1-\beta} = 1 - \beta^t
    $$
so dividing by $1-\beta^t$ rescales the weights to sum to one:
    $$
    \tilde s_t = \frac{s_t}{1-\beta^t} = 1
    $$
</div>

<div style = "text-align: center;">
    <img src="./images/exponentially weighted moving averages.png" style="width:80%;" > <br>
    Applying exponentially weighted moving averages to the Oxford daily temperature data. 
</div>
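
The recursion and the bias correction can be sketched in a few lines of NumPy. The function name `ewma` and the constant test series are assumptions for illustration; a constant series makes the burn-in effect easy to see, because the true average is known.

```python
import numpy as np

def ewma(x, beta=0.9, bias_correction=True):
    """Exponentially weighted moving average with v_0 = 0 and optional bias correction."""
    x = np.asarray(x, dtype=float)
    v = np.zeros(len(x))
    running = 0.0                                      # v_0 = 0
    for t, x_t in enumerate(x, start=1):
        running = beta * running + (1 - beta) * x_t    # uncorrected recursion
        v[t - 1] = running
    if bias_correction:
        t = np.arange(1, len(x) + 1)
        v = v / (1 - beta ** t)                        # divide v_t by 1 - beta^t
    return v

series = np.full(5, 10.0)                    # constant series: the true average is 10
print(ewma(series, bias_correction=False))   # starts near 1 and climbs slowly towards 10
print(ewma(series))                          # approximately 10 at every step
```

Note that the correction is applied to the stored series after the recursion, so the running variable is never overwritten with a corrected value.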

#### Gradient Descent with Momentum
<blockquote>
    Initialize $v_{d\theta_j} = 0, \text{ for } j = 1, \dots, P$. <br>
    On iteration $t$ (e.g., with mini-batch gradient descent):
    <blockquote>
        Compute $d \theta_j := \frac{\partial J}{\partial \theta_j}, \text{ for } j = 1, \dots, P$ on the current mini-batch. <br>
        Update the momentum terms:
        $$
        v_{d\theta_j} := \beta v_{d\theta_j} + (1-\beta) d \theta_j, \text{ for } j = 1, \dots, P
        $$
        Update the parameter:
        $$
        \theta_j := \theta_j - \alpha v_{d\theta_j}, \text{ for } j = 1, \dots, P
        $$
    </blockquote>
</blockquote>

**Hyperparameters:** $\alpha$ and $\beta$ ($\beta = 0.9$ usually works well; i.e., averaging over roughly the last 10 gradients).
<div class = "alert alert-block alert-info"> <b>Note:</b> Sometimes the factor $(1-\beta)$ is dropped from the momentum update:
    $$
    v_{d\theta_j} := \beta v_{d\theta_j} + d \theta_j, \text{ for } j = 1, \dots, P
    $$
This rescales $v_{d\theta_j}$ by $\frac{1}{1-\beta}$, so the two rules are equivalent only if $\alpha$ is rescaled by a factor of $(1-\beta)$; in practice, $\alpha$ needs to be re-tuned.
</div>
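
A minimal sketch of the momentum update, applied to a toy quadratic cost $J(\theta) = \frac{1}{2}\theta^{\mathsf{T}} A \theta$ with gradient $A\theta$. The function name `momentum_step`, the matrix `A`, the starting point, and the value of $\alpha$ are assumptions made for illustration; in practice the gradient would come from the current mini-batch.

```python
import numpy as np

def momentum_step(theta, grad, v, alpha=0.1, beta=0.9):
    """One update of gradient descent with momentum (the (1 - beta) form)."""
    v = beta * v + (1 - beta) * grad   # exponentially weighted average of gradients
    theta = theta - alpha * v          # step along the averaged gradient
    return theta, v

# Toy problem (assumed): an elongated quadratic bowl with its minimum at the origin
A = np.diag([1.0, 10.0])
theta = np.array([5.0, 5.0])
v = np.zeros_like(theta)               # v_{d theta_j} = 0 for all j

for _ in range(300):
    grad = A @ theta
    theta, v = momentum_step(theta, grad, v)

print(theta)  # close to the minimiser [0, 0]
```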

<a name = "RMSp"></a>
### RMSprop
<blockquote>
    Initialize $s_{d\theta_j} = 0, \text{ for } j = 1, \dots, P$ and $\epsilon = 10^{-8}$. <br>
    On iteration $t$ (e.g., with mini-batch gradient descent):
    <blockquote>
        Compute $d \theta_j := \frac{\partial J}{\partial \theta_j}, \text{ for } j = 1, \dots, P$ on the current mini-batch. <br>
        Update the mean squared term:
        $$
        s_{d\theta_j} := \beta_2 s_{d\theta_j} + (1-\beta_2) d \theta_j^2, \text{ for } j = 1, \dots, P
        $$
        Update the parameter:
        $$
        \theta_j := \theta_j - \alpha \frac{d\theta_j}{\sqrt{s_{d\theta_j} + \epsilon}}, \text{ for } j = 1, \dots, P
        $$
    </blockquote>
</blockquote>

<div class = "alert alert-block alert-success"> <b>Note:</b> We add $\epsilon$ to the denominator to avoid the divide-by-zero problem.
</div>
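
The RMSprop update can be sketched as below, keeping $\epsilon$ inside the square root as in the formula above. The function name `rmsprop_step`, the toy quadratic cost, and the values $\alpha = 0.01$ and $\beta_2 = 0.9$ are assumptions chosen for illustration. The first step shows the main effect: each coordinate moves by a similar amount even though the gradient components differ by a factor of eight.

```python
import numpy as np

def rmsprop_step(theta, grad, s, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSprop update: scale each gradient component by its running RMS."""
    s = beta2 * s + (1 - beta2) * grad ** 2          # running mean of squared gradients
    theta = theta - alpha * grad / np.sqrt(s + eps)  # per-coordinate normalised step
    return theta, s

# Single step from s = 0 with gradients of very different magnitude
theta0 = np.array([1.0, -2.0])
grad0 = np.array([0.5, -4.0])
theta1, s1 = rmsprop_step(theta0, grad0, np.zeros(2))
print(theta1 - theta0)  # both components have magnitude alpha / sqrt(1 - beta2)

# Toy quadratic J(theta) = 0.5 * theta^T A theta (assumed), gradient A @ theta
A = np.diag([1.0, 10.0])
theta = np.array([5.0, 5.0])
s = np.zeros_like(theta)
initial_loss = 0.5 * theta @ A @ theta
for _ in range(1000):
    theta, s = rmsprop_step(theta, A @ theta, s)
final_loss = 0.5 * theta @ A @ theta
print(initial_loss, final_loss)
```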

<a name = "ADAM"></a>
### Adaptive Moment Estimation (Adam)
<blockquote>
    Initialize $v_{d\theta_j} = 0, s_{d\theta_j} = 0, \text{ for } j = 1, \dots, P$ and $\epsilon = 10^{-8}$. <br>
    On iteration $t$ (e.g., with mini-batch gradient descent):
    <blockquote>
        Compute $d \theta_j := \frac{\partial J}{\partial \theta_j}, \text{ for } j = 1, \dots, P$ on the current mini-batch. <br>
        Update the momentum and the mean squared term:
        $$
        \begin{align}
        v_{d\theta_j} := & \beta_1 v_{d\theta_j} + (1-\beta_1) d \theta_j \\
        s_{d\theta_j} := & \beta_2 s_{d\theta_j} + (1-\beta_2) d \theta_j^2
        \end{align}
        $$
        for $j = 1, \dots, P$. <br><br>
        Apply bias corrections:
        $$
        \begin{align}
        v_{d\theta_j}^{\text{corrected}} = & \frac{v_{d\theta_j}}{1-\beta_1^t} \\
        s_{d\theta_j}^{\text{corrected}} = & \frac{s_{d\theta_j}}{1-\beta_2^t}
        \end{align}
        $$
        for $j = 1, \dots, P$. <br><br>
        Update the parameter:
        $$
        \theta_j := \theta_j - \alpha \frac{v_{d\theta_j}^{\text{corrected}}}{\sqrt{s_{d\theta_j}^{\text{corrected}} + \epsilon}}, \text{ for } j = 1, \dots, P
        $$
    </blockquote>
</blockquote>

**Typical values for the hyperparameters:**
- $\alpha$: needs to be tuned
- $\beta_1$ (momentum update): 0.9
- $\beta_2$ (mean squared update): 0.999
- $\epsilon$ : $10^{-8}$
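
A minimal sketch of one Adam update, following the steps above and the typical values for $\beta_1$, $\beta_2$ and $\epsilon$. The function name `adam_step`, the toy quadratic cost, and the values of $\alpha$ are assumptions for illustration. The sketch keeps $\epsilon$ inside the square root to match the formula above; some library implementations add it outside the square root instead.

```python
import numpy as np

def adam_step(theta, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * grad         # momentum term
    s = beta2 * s + (1 - beta2) * grad ** 2    # mean squared term
    v_hat = v / (1 - beta1 ** t)               # bias corrections; v and s themselves
    s_hat = s / (1 - beta2 ** t)               # are stored uncorrected
    theta = theta - alpha * v_hat / np.sqrt(s_hat + eps)
    return theta, v, s

# On the first step the bias corrections give v_hat = grad and s_hat = grad**2,
# so every coordinate moves by roughly alpha in the direction of -sign(grad).
theta0 = np.array([1.0, -2.0])
grad0 = np.array([0.5, -4.0])
theta1, _, _ = adam_step(theta0, grad0, np.zeros(2), np.zeros(2), t=1, alpha=0.1)
print(theta1)  # approximately [0.9, -1.9]

# Toy quadratic J(theta) = 0.5 * theta^T A theta (assumed), gradient A @ theta
A = np.diag([1.0, 10.0])
theta = np.array([5.0, 5.0])
v = np.zeros_like(theta)
s = np.zeros_like(theta)
initial_loss = 0.5 * theta @ A @ theta
for t in range(1, 1001):
    theta, v, s = adam_step(theta, A @ theta, v, s, t, alpha=0.05)
final_loss = 0.5 * theta @ A @ theta
print(initial_loss, final_loss)
```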

## Technique
<a name = "LRD"></a>
### Learning Rate Decay
Denote the initial learning rate as $\alpha_0$, the current epoch as $i = 1, 2, \dots$, the decay rate as $k$, and the current mini-batch (iteration) number as $t$.

#### Time-based decay
$$
\alpha = \frac{1}{1 + k \times (i-1)} \alpha_0
$$
#### Exponential decay
$$
\alpha = k^{i-1}\alpha_0
$$
#### Inverse square root decay
**Version 1**: Based on the epoch number,
$$
\alpha = \frac{k}{\sqrt{i}} \alpha_0
$$
**Version 2**: Based on the mini-batch number,
$$
\alpha = \frac{k}{\sqrt{t}} \alpha_0
$$
#### Discrete staircase decay
Denote the length of a step as $L_\text{step}$.
$$
\alpha = k^{\left\lfloor \frac{i-1}{L_\text{step}} \right\rfloor}\alpha_0
$$
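
The schedules above translate directly into small Python functions. The function and argument names are assumptions for illustration; epochs and mini-batch numbers are taken to start at 1, matching the formulas.

```python
import math

def time_based_decay(alpha0, k, epoch):
    """alpha = alpha0 / (1 + k * (epoch - 1))"""
    return alpha0 / (1 + k * (epoch - 1))

def exponential_decay(alpha0, k, epoch):
    """alpha = k^(epoch - 1) * alpha0"""
    return alpha0 * k ** (epoch - 1)

def inverse_sqrt_decay(alpha0, k, step):
    """alpha = k / sqrt(step) * alpha0, where step is an epoch or mini-batch number."""
    return alpha0 * k / math.sqrt(step)

def staircase_decay(alpha0, k, epoch, step_length):
    """alpha = k^floor((epoch - 1) / step_length) * alpha0"""
    return alpha0 * k ** ((epoch - 1) // step_length)

# Compare the schedules over a few epochs (values chosen only for illustration)
for epoch in (1, 2, 5, 10, 11):
    print(
        epoch,
        time_based_decay(0.1, 1.0, epoch),
        exponential_decay(0.1, 0.5, epoch),
        inverse_sqrt_decay(0.1, 1.0, epoch),
        staircase_decay(0.1, 0.5, epoch, step_length=5),
    )
```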

#### Comparisons of different decay methods
<div style = "text-align: center;">
    <img src="./images/Learning rate decay.png" style="width:60%;" > <br>
</div>