# Deep Learning
## Basics

Author: Bingchen Wang

Last Updated: 1 Nov, 2022

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="../Supervised Learning/Supervised%20Learning.ipynb">Supervised Learning</a>
</nav>

---

## Contents
Section 1: Getting started:
- [Motivating example: Logistic regression](#LR)
- [Standard notation](#SN)
- [Activation functions](#AF)
- [Cost functions](#CF)

Section 2: Backpropagation:
- [Backpropagation](#Backprop)
- [Gradient Checking](#GC)

Section 3: Practical aspects:
- [Initialization](#Init)
- [Regularization](#Reg)


# Section 1: Getting Started
<a name = "LR"></a>
## Logistic regression
### Logistic model
**Data**: $\{x^{(i)}, y^{(i)}\}_{i=1}^m$. <br>
**Parameters**: $w$, $b$. <br>
**Hypothesis/model fit**:
$$
h(x^{(i)}) = \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}
$$
where
$$
z^{(i)} = w^T x^{(i)} + b.
$$

### Logistic loss
$$
L(h(x^{(i)}), y^{(i)}) = - y^{(i)}\log\left(h(x^{(i)})\right) - (1 -  y^{(i)}) \log\left(1-h(x^{(i)})\right)
$$

### Logistic cost
$$
J(w,b) = - \frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}\log\left(h(x^{(i)})\right) + (1 -  y^{(i)}) \log\left(1-h(x^{(i)})\right)\right]
$$
<br>
Objective:
$$
    \min_{w,b} J(w,b) \;\;\;(+ \;\;\;\;\text{regularization})
$$
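
The model, loss, and cost above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's own implementation: the function names and the toy data are made up, and `X` is assumed to store one example per column.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(w, b, X, y):
    """Average logistic loss J(w, b).

    X has shape (n_x, m) with one example per column; y has shape (m,).
    """
    z = w @ X + b          # z^(i) = w^T x^(i) + b, shape (m,)
    h = sigmoid(z)         # h(x^(i))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data (made up for illustration): 2 features, 4 examples.
X = np.array([[0.5, -1.0,  2.0,  0.0],
              [1.0,  0.5, -0.5, -2.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

cost_at_zero = logistic_cost(np.zeros(2), 0.0, X, y)          # every h equals 0.5 here
cost_at_ones = logistic_cost(np.array([1.0, 1.0]), 0.0, X, y)
```

With $w = 0$ and $b = 0$ every prediction is $0.5$, so the cost equals $\log 2$; parameters that separate the toy data give a lower cost.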

#### Why use the logistic cost?
1. Connection to **maximum likelihood estimation**.
2. A cost function matched to the model can facilitate **optimization**: the logistic cost is convex in $(w, b)$, whereas the least squares cost applied to the sigmoid output is in general not. For example, compare the logistic cost function with the least squares cost function, both used to train the logistic model on a 20-example data set. Below is a visualization of the two cost functions for different combinations of $w$ and $b$:
<div style = "text-align: center;">
    <img src="./images/Logistic cost vs least squares cost.png" style="width:80%;" >
    <br>
    Detailed implementation: <a href = "./Deep Learning simulations/Logistic cost function.ipynb">here</a>.
</div>

### Connection to neural networks
Logistic regression can be viewed as a special case of a neural network: one with no hidden layers and an output layer consisting of a single neuron with sigmoid activation.
<div style = "text-align: center;">
    <img src="./images/Logistic Regression (NN).jpg" style="width:40%;" >
    <br>
</div>

<a name = "SN"></a>
## Standard notation
<table>
    <tr>
        <th colspan = "2" style="font-size:16px"> Sizes </th>
    </tr>
    <tr>
        <td> $$m$$ </td>
        <td> number of examples in the dataset </td>
    </tr>
    <tr>
        <td> $$n_x$$ </td>
        <td> input size </td>
    </tr>
    <tr>
        <td> $$n_y$$ </td>
        <td> output size (= number of output classes $K$) </td>
    </tr>
    <tr>
        <td> $$n_h^{[l]}$$ </td>
        <td> number of hidden units/neurons of the $l$th layer </td>
    </tr>
    <tr>
        <td> $$L$$ </td>
        <td> number of layers in the neural network (hidden + output)</td>
    </tr>
</table>

<table>
    <tr>
        <th colspan = "3" style="font-size:16px"> Observations </th>
    </tr>
    <tr>
        <th> Object</th>
        <th> Dimension</th>
        <th> Meaning</th>
    </tr>
    <tr>
        <td> $$X$$ </td>
        <td> $$\mathbb{R}^{n_x \times m}$$ </td>
        <td> input matrix </td>
    </tr>
    <tr>
        <td> $$Z^{[l]}$$ </td>
        <td> $$\mathbb{R}^{n_h^{[l]} \times m}$$ </td>
        <td> matrix of weighted sums terms for the $l$th layer </td>
    </tr>
    <tr>
        <td> $$A^{[l]}$$ </td>
        <td> $$\mathbb{R}^{n_h^{[l]} \times m}$$ </td>
        <td> output matrix of the $l$th layer </td>
    </tr>
    <tr>
        <td> $$Y$$ </td>
        <td> $$\mathbb{R}^{n_y \times m}$$ </td>
        <td> label matrix </td>
    </tr>
    <tr>
        <td> $$\widehat Y$$ </td>
        <td> $$\mathbb{R}^{n_y \times m}$$ </td>
        <td> predicted label matrix ($A^{[L]}$) </td>
    </tr> 
    <tr>
        <td> $$ x^{(i)}_{j}$$ </td>
        <td> $$\mathbb{R}$$ </td>
        <td> $j$th feature of the $i$th example </td>
    </tr>
    <tr>
        <td> $$z^{[l](i)}_{j}$$ </td>
        <td> $$\mathbb{R}$$ </td>
        <td> weighted sum of the $j$th hidden unit in the $l$th layer for the $i$th example  </td>
    </tr>
    <tr>
        <td> $$a^{[l](i)}_{j}$$ </td>
        <td> $$\mathbb{R}$$ </td>
        <td> output of the $j$th hidden unit in the $l$th layer for the $i$th example  </td>
    </tr>
    <tr>
        <td> $$ y^{(i)}$$ </td>
        <td> $$\mathbb{R}^{n_y}$$ </td>
        <td> label of the $i$th example </td>
    </tr>
    <tr>
        <td> $$ \hat y^{(i)}$$ </td>
        <td> $$\mathbb{R}^{n_y}$$ </td>
        <td> predicted label for the $i$th example </td>
    </tr>
</table>

<table>
    <tr>
        <th colspan = "3" style="font-size:16px"> Parameters </th>
    </tr>
    <tr>
        <th> Object</th>
        <th> Dimension</th>
        <th> Meaning</th>
    </tr>
    <tr>
        <td> $$ W^{[l]}$$ </td>
        <td> $$ \mathbb{R}^{n_h^{[l]} \times n_h^{[l-1]}}$$ </td>
        <td> weight matrix in the $l$th layer </td>
    </tr>
    <tr>
        <td> $$ b^{[l]}$$ </td>
        <td> $$ \mathbb{R}^{n_h^{[l]}}$$ </td>
        <td> bias vector in the $l$th layer </td>
    </tr>
    <tr>
        <td> $$ g^{[l]}$$ </td>
        <td> $$ g: S \rightarrow S,$$ for some space $S$ </td>
        <td> activation function(s) in the $l$th layer </td>
    </tr>
</table>

### Relations between the objects

$$
\begin{align}
A^{[0]} =& X \\
Z^{[l]} =& W^{[l]}A^{[l-1]} + b^{[l]} \\
A^{[l]} =& g^{[l]}(Z^{[l]}) \\
\widehat Y =& A^{[L]}
\end{align}
$$
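
A forward pass following these relations, in which each layer takes the previous layer's output as its input, might look like the sketch below. The layer sizes, the choice of ReLU and sigmoid, and the helper name `forward` are assumptions for illustration. The bias is stored as a column of shape $(n_h^{[l]}, 1)$ so that NumPy broadcasts it across the $m$ examples.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward(X, params, activations):
    """Forward pass. params is a list of (W, b) pairs, one per layer;
    activations is a list of functions g^[l] of the same length."""
    A = X                                  # A^[0] = X, shape (n_x, m)
    for (W, b), g in zip(params, activations):
        Z = W @ A + b                      # weighted sums; b broadcasts over the m columns
        A = g(Z)                           # layer output
    return A                               # Y_hat = A^[L]

rng = np.random.default_rng(0)
n_x, n_h, n_y, m = 3, 4, 1, 5              # hypothetical layer sizes
params = [(rng.standard_normal((n_h, n_x)), np.zeros((n_h, 1))),
          (rng.standard_normal((n_y, n_h)), np.zeros((n_y, 1)))]
X = rng.standard_normal((n_x, m))
Y_hat = forward(X, params, [relu, sigmoid])
print(Y_hat.shape)  # (1, 5)
```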

### A simple feedforward neural network
<div style = "text-align: center;">
    <img src="./images/Standard notation.jpg" style="width:70%;" >
    <br>
</div>

<a name = "AF"></a>
## Activation functions
The **sigmoid**/logistic function $a(z) = \frac{1}{1+ e^{-z}}$ is but one of the activation functions for neural networks. Other commonly used activation functions include:
- **linear**: $a(z) = z$
- **tanh**: $a(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
- **Rectified Linear Unit (ReLU)**: $a(z) = \max(0, z)$ (most popular choice for hidden layers)
- **Leaky ReLU**: $a(z) = \max(0.01z,z)$

<div style = "text-align: center;">
    <img src="./images/activation functions.png" style="width:70%;" >
    <br>
</div>

- **softmax**: used as the activation function for the output layer for multi-class classification.

$$a_i(z) = \frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}}$$

Note that each neuron in the output layer (with softmax as the activation function) depends on the other neurons as well.

### Derivatives/Jacobians
**sigmoid**: 
$$
a^\prime(z) = \frac{e^{-z}}{(1+e^{-z})^2} = a(z) \left(1-a(z)\right)
$$
**linear**:
$$
a^\prime(z) = 1
$$
**tanh**:
$$
a^\prime(z) = \frac{{(e^z+e^{-z})}^2 - {(e^z-e^{-z})}^2}{{(e^z+e^{-z})}^2} = 1 - a^2(z)
$$
**ReLU**:
$$
a^\prime(z) = \left\{ \begin{array}{cc} 1 & z > 0 \\ 0 & z < 0 \end{array} \right.
$$
**Leaky ReLU**:
$$
a^\prime(z) = \left\{ \begin{array}{cc} 1 & z > 0 \\ 0.01 & z < 0 \end{array} \right.
$$
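
As a sanity check, the closed-form derivatives above can be compared with central differences. The sketch below is illustrative only; the test points are made up and deliberately avoid $z = 0$, where ReLU and Leaky ReLU are not differentiable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# (activation, closed-form derivative) pairs taken from the formulas above
activations = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))),
    "linear": (lambda z: z, lambda z: np.ones_like(z)),
    "tanh": (np.tanh, lambda z: 1 - np.tanh(z) ** 2),
    "relu": (lambda z: np.maximum(0, z), lambda z: (z > 0).astype(float)),
    "leaky_relu": (lambda z: np.maximum(0.01 * z, z),
                   lambda z: np.where(z > 0, 1.0, 0.01)),
}

z = np.array([-2.0, -0.5, 0.3, 1.7])   # no point at z = 0
eps = 1e-6
max_errors = {}
for name, (a, a_prime) in activations.items():
    numeric = (a(z + eps) - a(z - eps)) / (2 * eps)   # central difference
    max_errors[name] = np.max(np.abs(numeric - a_prime(z)))
```

Each entry of `max_errors` should be tiny (limited by floating-point rounding), which confirms the formulas at the chosen points.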

**softmax**: $a$ and $z$ are $K \times 1$.
$$
J = \left[\begin{array}{cccc} \frac{\partial a}{\partial z_1} & \frac{\partial a}{\partial z_2} & \cdots & \frac{\partial a}{\partial z_K} \end{array}\right] 
= \left[\begin{array}{c} \nabla_z a_1 \\ \nabla_z a_2 \\ \vdots \\ \nabla_z a_K \end{array}\right] 
= \left[\begin{array}{cccc} 
\frac{\partial a_1}{\partial z_1} & \frac{\partial a_1}{\partial z_2}& \cdots& \frac{\partial a_1}{\partial z_K}\\
\frac{\partial a_2}{\partial z_1} & \frac{\partial a_2}{\partial z_2}& \cdots& \frac{\partial a_2}{\partial z_K}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial a_K}{\partial z_1} & \frac{\partial a_K}{\partial z_2}& \cdots& \frac{\partial a_K}{\partial z_K}\\
\end{array}\right]
$$
where
$$
\begin{align}
\frac{\partial a_i}{\partial z_j} 
=& \left\{\begin{array}{cc} 
\frac{e^{z_i}\sum_{k=1}^{K}e^{z_k} - {(e^{z_i})}^2}{{(\sum_{k=1}^{K}e^{z_k})}^2} & \text{if } i = j \\
- \frac{e^{z_i}e^{z_j}}{{(\sum_{k=1}^{K}e^{z_k})}^2} & \text{if } i \neq j
\end{array}\right. \\
=& \left\{\begin{array}{cc} 
a_i (1-a_i)  & \text{if } i = j \\
- a_i a_j & \text{if } i \neq j
\end{array}\right.
\end{align}
$$
Namely,
$$
J = \left[\begin{array}{cccc} 
a_1 (1-a_1) &  - a_1 a_2 & \cdots& - a_1 a_K\\
- a_2 a_1 & a_2 (1-a_2)& \cdots& - a_2 a_K\\
\vdots & \vdots & \ddots & \vdots\\
- a_K a_1 & - a_K a_2 & \cdots& a_K (1 - a_K)\\
\end{array}\right]
$$
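
In matrix form the softmax Jacobian is $\mathrm{diag}(a) - a a^{\mathsf{T}}$. The sketch below builds it that way and compares it with a numerical Jacobian; the input vector is made up, and subtracting $\max(z)$ inside `softmax` is a standard stability trick that leaves the output unchanged.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector; subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """K x K Jacobian: a_i (1 - a_i) on the diagonal, -a_i a_j elsewhere."""
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)

# Numerical Jacobian: column j is the central difference of a with respect to z_j.
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
```

Two properties are visible in `J`: it is symmetric, and each column sums to zero because the softmax outputs always sum to one.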

<a name = "CF"></a>
## Cost functions

### Cross-entropy cost
$$
J_{CE} = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log \hat y_k^{(i)}
$$
#### Special case: logistic loss ($K=2$)
$$
J_{logistic} = - \frac{1}{m} \sum_{i=1}^m \left(y^{(i)} \log \hat y^{(i)} + (1-y^{(i)}) \log (1- \hat y^{(i)}) \right)
$$

### Mean squared error cost
$$
J_{MSE} = \frac{1}{m} \sum_{i=1}^m {(\hat y^{(i)} - y^{(i)})}^2
$$
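
A minimal NumPy sketch of these costs follows. The function names and the three-example data are made up; the mean squared error is computed as the (non-negative) average of the squared residuals. The example also checks that the cross-entropy cost with $K = 2$ one-hot labels reduces to the logistic cost.

```python
import numpy as np

def cross_entropy_cost(Y_hat, Y):
    """Y_hat and Y have shape (K, m); each column of Y is one-hot."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat)) / m

def logistic_cost(y_hat, y):
    """y_hat and y have shape (m,); y_hat is the predicted probability of class 1."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mse_cost(y_hat, y):
    """Average of the squared residuals."""
    return np.mean((y_hat - y) ** 2)

# Made-up binary example, written both as vectors and as 2 x m matrices.
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
Y = np.vstack([y, 1 - y])
Y_hat = np.vstack([y_hat, 1 - y_hat])

ce = cross_entropy_cost(Y_hat, Y)
lg = logistic_cost(y_hat, y)
mse = mse_cost(y_hat, y)
```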

# Section 2: Backpropagation
<a name = "Backprop"></a>
## Backpropagation
Wikipedia: <a href = "https://en.wikipedia.org/wiki/Backpropagation">here</a>. <br>

<div class = "alert alert-block alert-success"><b>Tips:</b><br>
1. When in doubt, check the <b>dimensions</b> of each term and make sure they are compatible. <br>
2. If still in doubt, focus on the <b>individual elements</b> of a vector/matrix and use scalar calculus.
</div>

Write the loss function for a single observation as $\mathcal{L}(\hat y, y)$. <br>
**Define**
$$
\delta^{[l]} = \nabla_{z^{[l]}} \mathcal{L}(\hat y , y)
$$
and let $\circ$ denote the Hadamard (element-wise) product.
1. For the output layer $L$,
    - if the loss function is binary cross entropy, and the activation function is sigmoid:
        $$
        \begin{align}
        \nabla_{a^{[L]}} \mathcal{L}  =&  - \frac{y}{a^{[L]}} + \frac{1-y}{1-a^{[L]}} \\
        \nabla_{z^{[L]}} \mathcal{L} =& g^{[L]\prime}(z^{[L]}) \circ \nabla_{a^{[L]}} \mathcal{L} \\
        =& a^{[L]} (1-a^{[L]})  \cdot \left(- \frac{y}{a^{[L]}} + \frac{1-y}{1-a^{[L]}}\right) \\
        =& a^{[L]} - y
        \end{align}
        $$
    - if the loss function is cross entropy, and the activation function is softmax:
        $$
        \begin{align}
        \nabla_{a^{[L]}} \mathcal{L} =&  - \left[\begin{array}{c} \frac{y_1}{a^{[L]}_1} \\ \frac{y_2}{a^{[L]}_2}\\ \vdots \\ \frac{y_K}{a^{[L]}_K}\end{array}\right] \\
        \nabla_{z^{[L]}} \mathcal{L} =&  \underbrace{J^T}_{\mathbb{R}^{n_y \times n_y}} \underbrace{\nabla_{a^{[L]}} \mathcal{L}}_{\mathbb{R}^{n_y \times 1}} \\
        = & - \left[\begin{array}{cccc} 
a_1^{[L]} (1-a_1^{[L]}) &  - a_2^{[L]} a_1^{[L]} & \cdots& - a_K^{[L]} a_1^{[L]}\\
- a_1^{[L]} a_2^{[L]} & a_2^{[L]} (1-a_2^{[L]})& \cdots& - a_K^{[L]} a_2^{[L]}\\
\vdots & \vdots & \ddots & \vdots\\
- a_1^{[L]} a_K^{[L]} & - a_2^{[L]} a_K^{[L]} & \cdots& a_K^{[L]} (1 - a_K^{[L]})\\
\end{array}\right]\left[\begin{array}{c} \frac{y_1}{a^{[L]}_1} \\ \frac{y_2}{a^{[L]}_2}\\ \vdots \\ \frac{y_K}{a^{[L]}_K}\end{array}\right] \\
        = & \left[\begin{array}{c} a^{[L]}_1 - y_1 \\ a^{[L]}_2 - y_2\\ \vdots \\ a^{[L]}_K - y_K \end{array}\right] \\
        = & a^{[L]} - y
        \end{align}
        $$
Therefore, **for both cases**, $$ \delta^{[L]} = \nabla_{z^{[L]}} \mathcal{L} = a^{[L]} - y.$$
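2. For layers $l = 1, \dots, L-1$ (the expressions for $\nabla_{W^{[l]}} \mathcal{L}$ and $\nabla_{b^{[l]}} \mathcal{L}$ also hold for $l = L$),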
    $$
    \begin{align}
    \nabla_{a^{[l]}} \mathcal{L} = & \underbrace{W^{[l+1]\mathsf{T}}}_{n^{[l]}_h \times n^{[l+1]}_h} \underbrace{\delta^{[l+1]}}_{n^{[l+1]}_h \times 1} \\
    \nabla_{z^{[l]}} \mathcal{L} = & g^{[l]\prime}(z^{[l]}) \circ \nabla_{a^{[l]}} \mathcal{L} \;\;\; (:=\delta^{[l]})\\
    \nabla_{W^{[l]}} \mathcal{L} = & \underbrace{\delta^{[l]}}_{n^{[l]}_h \times 1} \underbrace{a^{[l-1]\mathsf{T}}}_{1 \times n^{[l-1]}_h} \\
    \nabla_{b^{[l]}} \mathcal{L} = & \underbrace{\delta^{[l]}}_{n^{[l]}_h \times 1}
    \end{align}
    $$
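
The recursion can be sketched for a small two-layer network and a single example. Everything specific here is an assumption for illustration: the tanh hidden layer, the sigmoid output with binary cross-entropy loss, the layer sizes, and the random data. One entry of the weight gradient is then compared with a central difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, b1, W2, b2, x, y):
    """Binary cross-entropy of a 2-layer network (tanh hidden, sigmoid output) on one example."""
    a1 = np.tanh(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return (-(y * np.log(a2) + (1 - y) * np.log(1 - a2))).item()

def backprop(W1, b1, W2, b2, x, y):
    # Forward pass, keeping the intermediate values
    a1 = np.tanh(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    # Output layer: delta^[2] = a^[2] - y
    delta2 = a2 - y
    dW2, db2 = delta2 @ a1.T, delta2
    # Hidden layer: delta^[1] = g'(z^[1]) o (W^[2]T delta^[2]), with tanh' = 1 - a^2
    delta1 = (1 - a1 ** 2) * (W2.T @ delta2)
    dW1, db1 = delta1 @ x.T, delta1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
x, y = rng.standard_normal((3, 1)), 1.0          # one made-up example, n_x = 3
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))
dW1, db1, dW2, db2 = backprop(W1, b1, W2, b2, x, y)

# Spot check one entry of dW1 against a central difference.
eps = 1e-6
E = np.zeros_like(W1)
E[2, 1] = eps
numeric = (loss(W1 + E, b1, W2, b2, x, y) - loss(W1 - E, b1, W2, b2, x, y)) / (2 * eps)
```

Following the tip on dimensions, each gradient has the same shape as the parameter it belongs to, and `numeric` should agree with `dW1[2, 1]` up to rounding error.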
    


<a name = "GC"></a>
## Gradient Checking
To check if backpropagation is running correctly, we can use gradient checking. It is motivated by the definition of derivatives:
$$
\frac{\partial J}{\partial \theta} = \lim_{\epsilon \rightarrow 0}\frac{J(\theta+\epsilon) - J(\theta - \epsilon)}{2\epsilon}
$$
When $\epsilon$ is small (say, `epsilon = 1e-7`), ${\left[\frac{\partial J}{\partial \theta}\right]}_{approx} = \frac{J(\theta+\epsilon) - J(\theta - \epsilon)}{2\epsilon}$ should be very close to $\frac{\partial J}{\partial \theta}$.

<section class = "section--algorithm">
    <div class = "algorithm--header"> Gradient Checking Algorithm</div>
    <div class = "algorithm--content">
        Write the parameters in a model (e.g., neural network) as $\theta = {\left[\begin{array}{cccc} \theta_1, \theta_2, \dots, \theta_P \end{array}\right]}^{\mathsf{T}}$, where $P$ is the total number of parameters. <br>
        Specify a small $\epsilon$ (say $10^{-7}$). <br>
        For each $p = 1, \dots, P$:
        <blockquote>
            <ol>
                <li> compute the approximated gradient,
                    $$
                    d\theta_{p,approx} = \frac{J(\theta_1, \theta_2, \dots, \theta_p + \epsilon, \dots, \theta_P) -J(\theta_1, \theta_2, \dots, \theta_p - \epsilon, \dots, \theta_P)}{2\epsilon}
                    $$
                <li> compute the distance between the backprop result $d\theta_p$ and the approximated gradient $d\theta_{p,approx}$,
                    $$
                    D_p = \frac{{\Vert d\theta_p - d\theta_{p,approx} \Vert}_2}{{\Vert d\theta_p \Vert}_2 + {\Vert d\theta_{p,approx} \Vert}_2}
                    $$
                 <li> determine if there is a possible error:
                     <ul>
                         <li> if $D_p$ is around $10^{-7}$ or smaller, it looks fine.
                         <li> if $D_p$ is around $10^{-3}$ or larger, there is probably something wrong.
                     </ul>
            </ol>
        </blockquote>
    </div>
</section>
<br>
<div class = "alert alert-block alert-warning"><b>Gradient checking dos and don'ts </b>:
<ol>
    <li> Don't use it during training; use it only for debugging, as it is computationally expensive.
    <li> Don't use it with dropout: turn dropout off while running the check.
    <li> Do check the individual components of the backprop (e.g., $db^{[l]}$ and $dW^{[l]}$) when a check fails.
    <li> Do remember the regularization term if used in the training process, i.e. $\tilde J = J + \Omega$ rather than $J$.
</ol>
</div>
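
The gradient checking algorithm can be sketched as follows. Two things here are assumptions rather than part of the notes above: the function name `gradient_check`, and the choice to compute one relative distance over the whole parameter vector instead of one $D_p$ per parameter. The test problem $J(\theta) = \sum_p \theta_p^3$ is made up so that the true gradient $3\theta^2$ is known.

```python
import numpy as np

def gradient_check(J, grad, theta, eps=1e-7):
    """Compare an analytic gradient with central differences.

    J: function theta -> scalar cost.
    grad: function theta -> gradient with the same shape as theta (1-D float array).
    Returns the relative distance between the two gradients.
    """
    d_theta = grad(theta)
    d_theta_approx = np.zeros_like(theta)
    for p in range(theta.size):
        step = np.zeros_like(theta)
        step[p] = eps                       # perturb only the p-th parameter
        d_theta_approx[p] = (J(theta + step) - J(theta - step)) / (2 * eps)
    num = np.linalg.norm(d_theta - d_theta_approx)
    den = np.linalg.norm(d_theta) + np.linalg.norm(d_theta_approx)
    return num / den

# Hypothetical test problem: J(theta) = sum(theta^3), so the gradient is 3 * theta^2.
J = lambda theta: np.sum(theta ** 3)
theta = np.array([1.0, -2.0, 0.5])

D_ok = gradient_check(J, lambda t: 3 * t ** 2, theta)    # correct gradient
D_bad = gradient_check(J, lambda t: 2 * t ** 2, theta)   # deliberately wrong gradient
```

The correct gradient gives a distance well below $10^{-7}$, while the deliberately wrong one gives a distance far above $10^{-3}$. The loop evaluates $J$ twice per parameter, which is why the check is too expensive to run during training.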

# Section 3: Practical aspects
<a name = "Init"></a>
## Initialization
- He initialization
- Xavier initialization
- Xavier initialization (Version 2)

<br>
To use optimization methods such as gradient descent, we need to initialize the parameters in the neural networks. A good initialization is important, as it can:
1. **speed up the convergence** of the optimization algorithm, e.g. gradient descent.
2. increase the chance that the optimization algorithm, e.g. gradient descent, converges to a **lower training error** (and generalization error).
3. partially solve the vanishing/exploding gradients problem. For another commonly used technique, see *gradient clipping*.

### He initialization
$$
W^{[l]}_{ij} \sim \mathcal{N}\left(0, \frac{2}{n^{[l-1]}_h}\right) 
$$
<br>
**Ideal activation functions**: ReLU <br>
**Implementation**: `W = np.random.randn(size_current, size_prev)*np.sqrt(2/size_prev)`

### Xavier initialization
$$
W^{[l]}_{ij} \sim \mathcal{N}\left(0, \frac{1}{n^{[l-1]}_h}\right) 
$$
<br>
**Ideal activation functions**: tanh <br>
**Implementation**: `W = np.random.randn(size_current, size_prev)*np.sqrt(1/size_prev)`

### Xavier initialization (Version 2)
$$
W^{[l]}_{ij} \sim \mathcal{N}\left(0, \frac{2}{n^{[l-1]}_h + n^{[l]}_h}\right) 
$$
<br>
**Interpretation of the variance**: the harmonic mean of $\frac{1}{n^{[l-1]}_h}$ and $\frac{1}{n^{[l]}_h}$, i.e. the reciprocal of the arithmetic mean of $n^{[l-1]}_h$ and $n^{[l]}_h$. <br>
**Implementation**: `W = np.random.randn(size_current, size_prev)*np.sqrt(2/(size_prev+size_current))`
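
The three schemes differ only in the variance, so they can share one helper. The sketch below is an illustration under a few assumptions: the function name `initialize_weights` and the method labels are made up, biases are initialized to zero, and `np.random.default_rng` is used in place of `np.random.randn` so that the draw is reproducible.

```python
import numpy as np

def initialize_weights(layer_sizes, method="he", seed=0):
    """Return a list of (W, b) pairs for layer sizes [n_x, n_h^[1], ..., n_y]."""
    rng = np.random.default_rng(seed)
    params = []
    for size_prev, size_current in zip(layer_sizes[:-1], layer_sizes[1:]):
        if method == "he":
            variance = 2 / size_prev
        elif method == "xavier":
            variance = 1 / size_prev
        elif method == "xavier_v2":
            variance = 2 / (size_prev + size_current)
        else:
            raise ValueError(f"unknown method: {method}")
        # Scale standard normal draws by the standard deviation sqrt(variance).
        W = rng.standard_normal((size_current, size_prev)) * np.sqrt(variance)
        b = np.zeros((size_current, 1))     # zero biases (an assumption of this sketch)
        params.append((W, b))
    return params

params = initialize_weights([500, 400, 10], method="he")
W1, b1 = params[0]
```

For the first layer, the sample variance `W1.var()` should be close to $2/500 = 0.004$.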

<a name = "Reg"></a>
## Regularization
- $L_q$ penalties
- Dropout regularization
- Data augmentation
- Early stopping

<br>
Regularization is an important technique often used in the training of neural networks to prevent overfitting. Here we focus on a few approaches.

### $L_q$ penalties
$$
\min_{W,b} J(W,b) = \frac{1}{m} \sum_{i=1}^m L(\hat y^{(i)}, y^{(i)}) + \underbrace{\lambda}_{\text{regularization parameter}} \Omega(W)
$$
Commonly used regularizations include:
- **$L_2$ regularization**: $$\Omega(W) = \frac{1}{2}\Vert W \Vert_2^2 = \frac{1}{2} \sum_l \sum_i \sum_j {(W_{ij}^{[l]})}^2$$
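    - Weight decay (consider gradient descent with regularization): $$W^{[l]} := \left(1-\alpha \lambda\right)W^{[l]} - \alpha \frac{\partial J_0}{\partial W^{[l]}},$$ where $J_0 = \frac{1}{m} \sum_{i=1}^m L(\hat y^{(i)}, y^{(i)})$ is the cost without the penalty term and $\alpha$ is the learning rate.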
- **$L_1$ regularization**: $$\Omega(W) = \Vert W \Vert_1  = \sum_l \sum_i \sum_j \vert W_{ij}^{[l]} \vert$$
    - $W$ tends to be sparse, since the penalty can drive individual weights exactly to zero.
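
The name "weight decay" comes from the fact that a gradient step on the $L_2$-regularized cost is the same as first shrinking the weights by a factor $(1 - \alpha\lambda)$ and then taking the usual step on the unregularized cost. A small numerical check, with made-up values for the weights, the gradient, the learning rate $\alpha$, and $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW_data = rng.standard_normal((3, 4))   # stand-in for the gradient of the unregularized cost
alpha, lam = 0.1, 0.01                  # learning rate and regularization parameter (made up)

# Gradient step on the regularized cost: the term (lam / 2) * ||W||^2 contributes lam * W.
W_regularized_step = W - alpha * (dW_data + lam * W)

# Weight-decay form: shrink W by (1 - alpha * lam), then take the usual gradient step.
W_weight_decay = (1 - alpha * lam) * W - alpha * dW_data
```

The two arrays agree up to floating-point rounding.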
    
#### $L_q$ penalties in linear regression
In the context of linear regression, many more forms of regularization terms have been studied. Consider the general criterion:
$$
 \min_\beta \left\{\sum_{i=1}^m \left(y^{(i)} - x^{(i)\mathsf{T}}\beta\right)^2 + \lambda \sum_{j=1}^p \vert \beta_j \vert^q \right\}
$$
- $q = 0$ amounts to variable subset selection.
- $q = 1$ corresponds to the lasso.
- $q = 2$ corresponds to ridge regression.

<div style = "text-align: center;">
    <img src="./images/L_q regularization.png" style="width:100%;" >
    <br>
</div>

<blockquote> 
    Values of $q \in (1,2)$ suggest a compromise between the lasso and ridge regression. Although this is the case, with $q >1$, $\vert \beta_j \vert^q$ is <b>differentiable</b> at $0$, and so <b>does not share the ability of the lasso ($q = 1$) for setting coefficients exactly to zero.</b> Partly for this reason as well as for computational tractability, Zou and Hastie (2005) introduced the <b>elastic net</b> penalty. -- The Elements of Statistical Learning, p.73.  
</blockquote>

The elastic net penalty:
$$
\lambda \sum_{j=1}^p \left(\alpha \beta_j^2 + (1-\alpha) \vert \beta_j \vert \right)
$$
where $\alpha \in [0,1]$ governs how close the penalty is to $L_2$.
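
To make the penalties concrete, the sketch below evaluates the $L_q$ penalty and the elastic net penalty (both without the factor $\lambda$) for a made-up coefficient vector. The function names are hypothetical, and $q = 0$ is treated as counting the non-zero coefficients.

```python
import numpy as np

def lq_penalty(beta, q):
    """Sum of |beta_j|^q; for q = 0 this counts the non-zero coefficients."""
    beta = np.asarray(beta, dtype=float)
    if q == 0:
        return float(np.count_nonzero(beta))
    return float(np.sum(np.abs(beta) ** q))

def elastic_net_penalty(beta, alpha):
    """Sum of alpha * beta_j^2 + (1 - alpha) * |beta_j| (without the factor lambda)."""
    beta = np.asarray(beta, dtype=float)
    return float(np.sum(alpha * beta ** 2 + (1 - alpha) * np.abs(beta)))

beta = np.array([0.0, -1.5, 2.0])       # made-up coefficients

subset = lq_penalty(beta, 0)            # number of non-zero coefficients
lasso = lq_penalty(beta, 1)             # sum of absolute values
ridge = lq_penalty(beta, 2)             # sum of squares
mixed = elastic_net_penalty(beta, 0.5)  # halfway between lasso and ridge
```

Setting `alpha = 0` recovers the lasso penalty and `alpha = 1` recovers the ridge penalty, which matches the role of $\alpha$ described above.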

The elastic net selects variables (by setting some coefficients to zero) like the lasso, and shrinks together the coefficients of correlated predictors like ridge, while also having considerable computational advantages over the $L_q$ penalties.
<div style = "text-align: center;">
    <img src="./images/Elastic nets.png" style="width:90%;" >
    <br>
    <img src="./images/Elastic nets vs L_q.png" style="width:30%;" >
</div>