# Deep Learning
## Optimization

Author: Bingchen Wang

Last Updated: 9 Nov, 2022


<nav>
    <b>Deep learning navigation:</b> <a href="./Deep Learning Basics.ipynb">Basics</a>
</nav>

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="../Supervised Learning/Supervised%20Learning.ipynb">Supervised Learning</a>
</nav>

---

## Contents
Algorithms:
- [Batch Gradient Descent](#BGD)
- [Stochastic Gradient Descent](#SGD)
- [Mini-batch Gradient Descent](#MBGD)
- Gradient Descent with Momentum
- RMSprop
- Adam

Technique:
- Learning Rate Decay

## Algorithms
Consider the linear model/hypothesis
$$
h_{\theta}(x^{(i)}) = \theta^{\mathsf{T}} x^{(i)}
$$
and the squared loss function for a single example $(x^{(i)}, y^{(i)})$
$$
L(\theta) = \frac{1}{2} {\left(y^{(i)} - h_{\theta}(x^{(i)})\right)}^2
$$
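
The update rules in the algorithms below all rely on the partial derivatives of this loss. For the linear hypothesis they follow from the chain rule. This is a short derivation, where $x_j^{(i)}$ (a notation introduced here for illustration) denotes the $j$-th feature of example $i$:
$$
\frac{\partial L}{\partial \theta_j} = -\left(y^{(i)} - h_{\theta}(x^{(i)})\right) \frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_j} = \left(h_{\theta}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}
$$
Averaging this expression over a set of examples gives the gradient of the corresponding cost $J(\theta)$.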

<a name = "BGD"></a>
### Batch Gradient Descent
Use the entire training set of $m$ examples in every iteration of the training process.
<blockquote>
    Start with some $\theta \in \mathbb{R}^p$ <br>
    Repeat until convergence:
    <blockquote>
        Compute $h_\theta(x^{(i)})$ for <b>all</b> examples $i = 1, \dots, m$. <br>
        Calculate the cost $$J(\theta) = \frac{1}{2m} \sum_{i=1}^m {\left(y^{(i)} - h_{\theta}(x^{(i)})\right)}^2.$$ <br>
        Update the parameters with learning rate $\alpha$: $$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}, \;\;\; \text{for } j = 1, \dots, p$$
    </blockquote>
</blockquote>    
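
The steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `batch_gradient_descent`, the synthetic noiseless data, the learning rate, and the fixed iteration budget (standing in for "repeat until convergence") are all assumptions made for the example.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=500):
    """Minimise J(theta) = 1/(2m) * sum((y - X @ theta)**2), using all m examples per step."""
    m, p = X.shape
    theta = np.zeros(p)                  # start with some theta in R^p
    for _ in range(n_iters):             # assumed stopping rule: fixed number of iterations
        residual = X @ theta - y         # h_theta(x^(i)) - y^(i) for all i = 1, ..., m
        grad = X.T @ residual / m        # vector of dJ/dtheta_j for j = 1, ..., p
        theta -= alpha * grad            # simultaneous update of every theta_j
    return theta

# Synthetic, noiseless data (hypothetical) so the minimiser is known exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta_bgd = batch_gradient_descent(X, y)
print(theta_bgd)
```

Every update touches all $m$ rows of `X`, so the cost of one iteration grows linearly with the size of the training set.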

<a name ="SGD"></a>
### Stochastic Gradient Descent
Use a single example in every iteration of the training process.

<blockquote>
    Start with some $\theta \in \mathbb{R}^p$ <br>
    Repeat until an approximate minimum is obtained or a stopping threshold is met (e.g., a maximum number of epochs):
    <blockquote>
        Randomly shuffle examples in the training set. <br>
        For $i = 1, \dots, m$ do:
        <blockquote>
            Compute $h_\theta(x^{(i)})$. <br>
            Calculate the cost $$J(\theta) = \frac{1}{2} {\left(y^{(i)} - h_{\theta}(x^{(i)})\right)}^2.$$ <br>
            Update the parameters $$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}, \;\;\; \text{for } j = 1, \dots, p$$
        </blockquote>
    </blockquote>
</blockquote>    
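
A possible NumPy sketch of the loop above follows. The function name `stochastic_gradient_descent`, the synthetic noiseless data, the learning rate, and the epoch count are assumptions chosen for illustration; the maximum number of epochs is used as the stopping threshold.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50, seed=0):
    """Update theta from one example at a time, reshuffling the data every epoch."""
    rng = np.random.default_rng(seed)
    m, p = X.shape
    theta = np.zeros(p)                          # start with some theta in R^p
    for _ in range(n_epochs):                    # assumed threshold: max number of epochs
        for i in rng.permutation(m):             # random shuffle, then visit each example once
            residual = X[i] @ theta - y[i]       # h_theta(x^(i)) - y^(i) for a single example
            theta -= alpha * residual * X[i]     # gradient of 1/2 * (y^(i) - h_theta(x^(i)))**2
    return theta

# Synthetic, noiseless data (hypothetical) so the minimiser is known exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta_sgd = stochastic_gradient_descent(X, y)
print(theta_sgd)
```

Each update is cheap, but the inner loop runs in Python one example at a time, so it cannot benefit from vectorization across examples.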

<a name = "MBGD"></a>
### Mini-batch Gradient Descent
Use a subset (a mini-batch) of the training set in every iteration of the training process. Denote the size of a mini-batch as $m_t$ and the total number of mini-batches as $T$, such that:
$$
m = m_t T
$$
<div class = "alert alert-block alert-info"> Typical choices for $m_t$: 64, 128, 256, 512, 1024. (Make sure that the mini-batch fits in the CPU/GPU memory.)
</div>
Further, denote the data in the mini-batch $t = 1, \dots, T$ as $\{X^{\{t\}}, y^{\{t\}}\}$.

<blockquote>
    Start with some $\theta \in \mathbb{R}^p$ <br>
    Repeat until an approximate minimum is obtained or a stopping threshold is met (e.g., a maximum number of epochs):
    <blockquote>
        Randomly shuffle examples in the training set. Split the training set into $T$ mini-batches.<br>
        For $t = 1, \dots, T$ do:
        <blockquote>
            Compute $h_\theta(x^{\{t\}(i)})$ for all examples $i = 1, \dots, m_t$ in mini-batch $t$. <br>
            Calculate the cost $$J(\theta) = \frac{1}{2m_t} \sum_{i = 1}^{m_t}{\left(y^{\{t\}(i)} - h_{\theta}(x^{\{t\}(i)})\right)}^2.$$ <br>
            Update the parameters $$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}, \;\;\; \text{for } j = 1, \dots, p$$
        </blockquote>
    </blockquote>
</blockquote>
<div class = "alert alert-block alert-success"><b>Advantages of mini-batch gradient descent</b>:
<ul>
    <li> It makes use of vectorization to speed up training.</li>
    <li> It makes progress without having to process the entire training set first.</li>
</ul>
</div>
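
The mini-batch loop can be sketched in NumPy along the following lines. The function name `minibatch_gradient_descent`, the synthetic noiseless data, and the hyperparameter values are assumptions made for the example. Slicing the shuffled index array also handles the case where $m$ is not an exact multiple of the batch size, with the last mini-batch simply being smaller.

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.1, batch_size=64, n_epochs=100, seed=0):
    """Update theta from one mini-batch at a time, reshuffling the data every epoch."""
    rng = np.random.default_rng(seed)
    m, p = X.shape
    theta = np.zeros(p)                              # start with some theta in R^p
    for _ in range(n_epochs):                        # assumed threshold: max number of epochs
        order = rng.permutation(m)                   # randomly shuffle the training set
        for start in range(0, m, batch_size):        # split into mini-batches t = 1, ..., T
            idx = order[start:start + batch_size]
            X_t, y_t = X[idx], y[idx]                # data of the current mini-batch
            residual = X_t @ theta - y_t             # vectorized over the mini-batch
            theta -= alpha * (X_t.T @ residual) / len(idx)   # average gradient over m_t examples
    return theta

# Synthetic, noiseless data (hypothetical): m = 256 and m_t = 64 give T = 4 mini-batches.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta_mbgd = minibatch_gradient_descent(X, y)
print(theta_mbgd)
```

Setting `batch_size=m` in this sketch recovers batch gradient descent, and `batch_size=1` recovers stochastic gradient descent.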

#### Performance Comparison: Batch, Stochastic & Mini-batch Gradient Descent
<div style = "text-align: center;">
    <img src="./images/gradient descent first 100 epochs.png" style="width:60%;" >
    <img src="./images/gradient descent last 10 epochs.png" style="width:60%;" >
</div>