# Coursera Deeplearning.ai Notes

[link](https://www.coursera.org/learn/neural-networks-deep-learning/home/welcome)

* [Logistic Regression](#logistic)
* [Logistic Loss Function](#logistic_loss)
* [Backprop for Logistic Regression](#logistic_deriv)
* [Activation & Derivatives](#activation)
* [Gradient Descent](#grad)
* [Dimensions](#dimension)
* [Dimensions Again](#dimension_again)
* [Random Initialization](#random_init)
* [Regularization](#regularization)
* [Dropout](#dropout)
* [Optimization](#opt)
* [Weight Initialization](#weight_init)
* [Gradient Checking](#grad_checking)
* [Minibatch](#minibatch)
* [Momentum, RMSprop, Adam](#opt_algo)
* [Random Search](#random_search)
* [Batch Norm](#batchnorm)

# Course 1

# Week 1

Some points from Hinton's interview

* Residual Networks - linked to initialization with the identity matrix.
* Hinton showed that ReLU is approximately a stack of logistic units.
* Hinton is working on capsules: essentially a DL network that partitions neurons into groups that represent features of a single input.

# Week 2

## Binary Classification

A 64 x 64 RGB picture is represented by a 3 x 64 x 64 tensor. Unrolling it flattens the tensor into a 1D vector of length $3 \times 64 \times 64 = 12288$.

## Notation

$ (x,y), x \in \mathbb{R}^{n_x}, y \in \{0, 1\} $

For $m$ training examples: $\big \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)}) \big \} $ 

**Data matrix dimension** is $n_x \times m$, i.e. $X \in \mathbb{R}^{n_x \times m}$. This matches the `numpy` convention: `X.shape == (n_x, m)`.

$Y = [y^{(1)}, y^{(2)}, \cdots, y^{(m)}], Y \in \mathbb{R}^{1 \times m}$, hence `Y.shape == (1, m)`.
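
As a rough illustration of this layout, the sketch below unrolls a few fake 3 x 64 x 64 images into the columns of $X$. The array names and the random data are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                     # number of examples (arbitrary)
images = rng.random((m, 3, 64, 64))       # fake RGB images, one per row
labels = rng.integers(0, 2, size=m)       # fake binary labels

# Unroll each image into one column, so X has shape (n_x, m)
X = images.reshape(m, -1).T
Y = labels.reshape(1, m)

n_x = X.shape[0]
print(X.shape, Y.shape)                   # (12288, 5) (1, 5)
```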

<a id='logistic'></a>
## Logistic Regression

$$ \hat{y} = \sigma\big(w^T x + b\big), \sigma(z) = \frac{1}{1 + e^{-z}}$$

**Loss function**: squared loss does not work well here because the optimization problem becomes non-convex. The logistic (cross-entropy) loss is used instead:

$$ \mathcal{L}(\hat{y}, y) = - \big(y \log \hat{y} + (1 - y)\log(1-\hat{y})\big) $$

This is derived as follows:

$$
\begin{aligned}
\hat{y} &= p(y=1 \mid x) \\
\text{If } y=1 &: p(y\mid x) = \hat{y}\\
\text{If } y=0 &: p(y\mid x) = 1 - \hat{y}\\
p(y \mid x) &= \hat{y}^y (1-\hat{y})^{(1-y)} \\
\log p(y \mid x) &= y\log\hat{y} + (1-y)\log(1-\hat{y})
\end{aligned}
$$

<a id='logistic_loss'></a>
**Cost function**:

$$ J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) $$

Since $\mathcal{L} = -\log p(y \mid x)$, minimizing the cost function over i.i.d. training examples is the same as **maximum likelihood estimation**.

**Gradient Descent**

    Repeat {
        w := w - alpha * dJ(w,b)/dw
        b := b - alpha * dJ(w,b)/db
    } until minimum reached

## Computational Graph

A graph that represents all temp variables for a formula.

**Notation**: $dz$ always refers to the derivative of the final loss function $\mathcal{L}$ with respect to $z$.

For logistic regression with a single input $x \in \mathbb{R}^{2}$:

**Forward pass**:

$$
\begin{aligned}
z &= w^T x + b = w_1 x_1 + w_2 x_2 + b \\
a &= \sigma(z) \\
\mathcal{L}(a, y) &= -\big(y\log a + (1-y) \log (1-a)\big)\\
\end{aligned}
$$

<a id='logistic_deriv'></a>
**Backprop**:

Since $\frac{d}{dx}\log x = \frac{1}{x}$:

$$
\begin{aligned}
da &= \frac{\partial \mathcal{L}}{\partial a} 
= -\big[ \frac{y}{a} + \frac{1-y}{1-a} (-1) \big] \\
&= -\frac{y}{a} + \frac{1 - y}{1 - a} \\
\frac{\partial a}{\partial z} &= a(1-a) \\
dz &= \frac{\partial \mathcal{L}}{\partial z} 
= \frac{\partial \mathcal{L}}{\partial a} \frac{\partial a}{\partial z}
= \bigg[-\frac{y}{a} + \frac{1 - y}{1 - a}\bigg] \big[a(1-a)\big] \\
&= -y(1-a) + (1-y)a = -y + ay + a - ay \\
&= a - y \\
dw_1 &= dz \frac{\partial z}{\partial w_1} = x_1 dz \\
dw_2 &= dz \frac{\partial z}{\partial w_2} = x_2 dz \\
db &= dz \frac{\partial z}{\partial b} = dz
\end{aligned}
$$

Then update $w$ and $b$.

## Vectorized Implementation

**Forward Pass**:

Avoid the for loop to go through all $m$ training examples by using `np.dot(w.T, X)`.

**Backward Pass**:

$A = [a^{(1)}, a^{(2)}, \cdots, a^{(m)}], Y=[y^{(1)}, y^{(2)}, \cdots, y^{(m)}]$

Therefore $dz = A - Y$, $dw = \frac{1}{m}X dz^T$, and $db = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)}$.
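
A minimal sketch of the vectorized forward and backward pass, followed by a few gradient descent steps. The function name `propagate`, the toy data, and the learning rate are assumptions chosen for illustration, not the course's reference code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    """One forward and backward pass. X: (n_x, m), Y: (1, m), w: (n_x, 1)."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                            # (1, m)
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))   # J(w, b)
    dZ = A - Y                                                 # (1, m)
    dw = np.dot(X, dZ.T) / m                                   # (n_x, 1)
    db = np.sum(dZ) / m
    return cost, dw, db

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1] + X[1:2] > 0).astype(float)    # toy labels, linearly separable
w, b, alpha = np.zeros((2, 1)), 0.0, 0.1

costs = []
for _ in range(100):
    cost, dw, db = propagate(w, b, X, Y)
    costs.append(cost)
    w -= alpha * dw
    b -= alpha * db
```

With zero-initialized parameters the first cost is $\log 2$, since every prediction starts at 0.5.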

**Python tip**: 

* Don't use `np.random.rand(5)`; use `np.random.rand(5,1)` instead. The former gives shape `(5,)`, the latter gives shape `(5,1)`. Use the `keepdims=True` parameter in `numpy` reduction functions to avoid rank-1 arrays.
* Use `assert` to check shapes in code, e.g. `assert a.shape == (5, 1)`.
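
The pitfall can be seen directly in a short sketch (variable names are arbitrary):

```python
import numpy as np

a = np.random.rand(5)        # rank-1 array
b = np.random.rand(5, 1)     # column vector

print(a.shape, a.T.shape)    # (5,) (5,)      transposing a rank-1 array does nothing
print(b.shape, b.T.shape)    # (5, 1) (1, 5)

# With a rank-1 array, a "times a transpose" is a scalar, not an outer product
inner = np.dot(a, a.T)
outer = np.dot(b, b.T)
assert np.ndim(inner) == 0 and outer.shape == (5, 5)

# keepdims=True keeps the reduced axis, so the result stays 2-D
M = np.random.rand(3, 4)
assert np.sum(M, axis=1).shape == (3,)
assert np.sum(M, axis=1, keepdims=True).shape == (3, 1)
```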

# Week 3 Neural Nets

Counting layers: input layer not counted. 

## Vectorization

**Notation** 

$W^{[i]}$ represents the weight matrix for layer $i$, where **each row $j$**, $(w_{j}^{[i]})^{T}$, stores the weights for one neuron in this layer. 

$[\cdot]$ brackets are used to denote **layers**, $(\cdot)$ brackets are used to denote **training examples**, $\{\cdot\}$ denotes **minibatches**.

<a id='activation'></a>
## Activation

$\tanh$ **almost always works better than** the sigmoid function in hidden layers. The **exception** is the output layer of a binary classifier, where sigmoid gives an output in $(0, 1)$.

**ReLU** tends to be the default these days. **Leaky ReLU** often works slightly better but is not widely used.

$$
\begin{aligned}
\tanh(z) &= \frac{e^z - e^{-z}}{e^z + e^{-z}} \\
\text{ReLU}(z) &= \max(0, z) \\
\text{Leaky ReLU}(z) &= \max(0.01z, z)
\end{aligned}
$$

## Derivatives of Activation Functions

Let $g(z)$ be an activation function.

$$
\begin{aligned}
g_{sigmoid}'(z) &= g_{sigmoid}(z)(1-g_{sigmoid}(z)) \\
g_{tanh}'(z) &= 1 - (\tanh(z))^2 \\
g_{ReLU}'(z) &= \begin{cases}
0 &\text{for } z < 0 \\
1 &\text{for } z > 0 \\
\text{Undefined} &\text{for } z = 0
\end{cases} \\
g_{Leaky ReLU}'(z) &= \begin{cases}
0.01 &\text{for } z < 0 \\
1 &\text{for } z > 0 \\
\text{Undefined} &\text{for } z = 0
\end{cases}
\end{aligned}
$$

In practice, for ReLU and Leaky ReLU the $z=0$ case is ignored in code and the gradient is set to 1 for $z \geq 0$. With this choice $g'(z)$ is a **subgradient** of the activation function $g(z)$, which is why gradient descent still works.
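
The activations and their derivatives can be sketched in numpy and compared against a centered finite difference. The function names are assumptions, and the check deliberately avoids $z = 0$, where ReLU is not differentiable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2

def d_relu(z):
    return (z >= 0).astype(float)           # convention: gradient 1 at z == 0

def d_leaky_relu(z, slope=0.01):
    return np.where(z >= 0, 1.0, slope)

# Compare each derivative with a centered finite difference away from z = 0
z = np.array([-2.0, -0.5, 0.5, 2.0])
eps = 1e-6
pairs = [(sigmoid, d_sigmoid), (np.tanh, d_tanh),
         (relu, d_relu), (leaky_relu, d_leaky_relu)]
for f, df in pairs:
    numeric = (f(z + eps) - f(z - eps)) / (2 * eps)
    assert np.allclose(numeric, df(z), atol=1e-6)
```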

<a id='grad'></a>
## Gradient Descent 

Assume that activations are represented as $A^{[l]} = g^{[l]}\big( Z^{[l]} \big)$.


$\odot$ - Hadamard product / element-wise product

**Forward Propagation** for a 2-layer network in matrix form:

$$
\begin{aligned}
Z^{[1]} &= W^{[1]}X + b^{[1]} \\
A^{[1]} &= g^{[1]}\big(Z^{[1]}\big)\\
Z^{[2]} &= W^{[2]}A^{[1]} + b^{[2]} \\
A^{[2]} &= g^{[2]}\big(Z^{[2]}\big)
\end{aligned}
$$

To generalize:

$$
\begin{aligned}
Z^{[l]} &= W^{[l]}A^{[l-1]} + b^{[l]} \\
A^{[l]} &= g^{[l]}\big(Z^{[l]}\big)
\end{aligned}
$$



**Backpropagation** (refer to derivations [above](#logistic_deriv) for vector form) in matrix form:

$$
\begin{aligned}
dZ^{[2]} &= A^{[2]} - Y \\
dW^{[2]} &= \frac{1}{m} dZ^{[2]}\big(A^{[1]}\big)^T \\
db^{[2]} &= \frac{1}{m} \mathsf{np.sum}\big(dZ^{[2]}, \mathsf{axis=1, keepdims=True} \big) \\
dZ^{[1]} &= (W^{[2]})^T dZ^{[2]} \odot g^{[1]'}\big(Z^{[1]}\big) \\
dW^{[1]} &= \frac{1}{m} dZ^{[1]}X^T \\
db^{[1]} &= \frac{1}{m} np.sum(dZ^{[1]}, axis=1, keepdims=True)
\end{aligned}
$$

To generalize:

$$
\begin{aligned}
dZ^{[L]} &= A^{[L]} - Y \\
dW^{[L]} &= \frac{1}{m} dZ^{[L]}(A^{[L-1]})^T \\
db^{[L]} &= \frac{1}{m} \mathsf{np.sum}\big(dZ^{[L]}, \mathsf{axis=1, keepdims=True} \big) \\
dZ^{[l]} &= dA^{[l]} \odot g^{[l]'} \big( Z^{[l]} \big) \\
dW^{[l]} &= \frac{1}{m} dZ^{[l]}\big(A^{[l-1]}\big)^T \\
db^{[l]} &= \frac{1}{m} \mathsf{np.sum}\big(dZ^{[l]}, \mathsf{axis=1, keepdims=True} \big) \\
dA^{[l-1]} &= (W^{[l]})^T dZ^{[l]} \\
dZ^{[l]} &= (W^{[l+1]})^T dZ^{[l+1]} \odot g^{[l]'}\big(Z^{[l]}\big)
\end{aligned}
$$


Note that the bias derivative terms are **summed across all training examples** (and divided by $m$) because the same bias is shared by every example, so each example contributes to its gradient.
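
A minimal numpy sketch of the 2-layer equations above, assuming a $\tanh$ hidden layer and a sigmoid output. The function names, layer sizes, and random data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    return A2, (Z1, A1)

def backward(X, Y, params, A2, cache):
    W1, b1, W2, b2 = params
    Z1, A1 = cache
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)       # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 10
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
params = [rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1)),
          rng.standard_normal((1, n_h)) * 0.01, np.zeros((1, 1))]

A2, cache = forward(X, params)
grads = backward(X, Y, params, A2, cache)
for p, g in zip(params, grads):
    assert p.shape == g.shape                # each gradient matches its parameter
```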

<a id='dimension'></a>
### Dimensions

`foo` and `dfoo` **always** have the same dimension.

The transpose of $A^{[1]}$ in the $dW^{[2]}$ equation above is there because $w$ in logistic regression was a **column vector**, whereas $W$ is a matrix stacking the $w$'s as rows. 

**Dimension** of $W^{[i]}$ is $\big(n^{[i]}, n_{A^{[i-1]}}\big)$, where $n^{[i]}$ is the no. of neurons in this layer, i.e. its width, and $n_{A^{[i-1]}}$ is the no. of rows of $A^{[i-1]}$, e.g. $n_{A^{[0]}} = n_x$.

Dimension of $dZ^{[1]}$ is $(n^{[1]}, m)$. Both factors in its equation, $(W^{[2]})^T dZ^{[2]}$ and $g^{[1]'}\big(Z^{[1]}\big)$, have this same dimension, which is why they are combined with an element-wise product.

### Pattern

1. forward propagation, compute output and store intermediate outputs of each layer in cache. 
    * Inputs: $A^{[l-1]}, W^{[l]}, b^{[l]}$, 
    * output: $Z^{[l]}$ and $A^{[l]}$ and cache $Z^{[l]}, W^{[l]}, b^{[l]}$.
2. compute loss
3. backprop.
    * inputs: $dA^{[l]}, dZ^{[l]}$,  cache of $Z^{[l]}, W^{[l]}, b^{[l]}$
    * output: $dA^{[l-1]}$, caches $dW^{[l]}, db^{[l]}$.
4. update parameters

## Working out the dimensions

Follow the shape of the network, the number of rows for the weight matrix $W$ for a layer should be the same as the no. of neurons in the layer. 

For the output layer, the dimension must match the final output.

<a id='random_init'></a>
## Random Initialization

Bias terms can be initialized to zero.

Initialize weights randomly to **small numbers**, e.g. `W1 = np.random.randn(2, 2) * 0.01`. The reason is that if the weights are large, the activations (e.g. tanh) tend to saturate, so the derivatives are close to zero and training is slow.

For deep networks, a **different** multiplying constant may be needed.

Logistic regression doesn't have a hidden layer. So even if the weights are initialized to zero, the derivatives depend on the input value $X$, therefore it can still work. With a hidden layer, identical initial weights make every hidden unit compute the same function and receive the same update, so the symmetry is never broken.
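
The symmetry argument can be checked numerically. The sketch below uses a constant value of 0.5 as a stand-in for identical initial weights (with all zeros the hidden-layer gradients are simply all zero, which is also identical across units). The helper `hidden_grad` and the toy data are assumptions.

```python
import numpy as np

def hidden_grad(W1, W2, X, Y):
    """dW1 for a 2-layer net with tanh hidden units, sigmoid output, zero biases."""
    m = X.shape[1]
    A1 = np.tanh(W1 @ X)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1)))
    dZ1 = (W2.T @ (A2 - Y)) * (1 - A1 ** 2)
    return dZ1 @ X.T / m

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50))
Y = rng.integers(0, 2, size=(1, 50)).astype(float)

# Identical weights: every hidden unit receives the same gradient row
dW1_same = hidden_grad(np.full((3, 2), 0.5), np.full((1, 3), 0.5), X, Y)
assert np.allclose(dW1_same, dW1_same[0])

# Small random weights: the rows differ, so units can learn different features
dW1_rand = hidden_grad(rng.standard_normal((3, 2)) * 0.01,
                       rng.standard_normal((1, 3)) * 0.01, X, Y)
assert not np.allclose(dW1_rand, dW1_rand[0])
```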

# Week 4 DL Networks

<a id='dimension_again'></a>
## Dimensions Again

$n^{[i]}$ - no. of activations (units) in layer $i$.

Dimension of $W^{[i]}$ is $\big(n^{[i]}, n^{[i-1]}\big)$

Dimension of $b^{[i]}$ is $\big(n^{[i]}, 1\big)$. In vectorized versions the dimension of this doesn't change due to broadcasting.

Dimension of $z^{[i]}$ and $a^{[i]}$ is $\big(n^{[i]}, 1 \big)$

Dimension of $Z^{[i]}$ and $A^{[i]}$ is $\big(n^{[i]}, m \big)$

## Circuit Theory, why deep? 

There are functions that you can compute with a **small** L-layer deep network that shallower networks require **exponentially** more hidden units to compute. Example: the XOR of $n$ inputs. A deep network needs on the order of $\log n$ layers, whereas a network with a single hidden layer would need on the order of $2^n$ neurons.

## Hyperparameters

* Learning rate
* no. of epochs
* no. of hidden layers
* no. of units per layer
* activation function, ReLU, tanh, etc.
* momentum
* minibatch size
* regularization

The best values for these hyperparameters may change over time, due to hardware changes for example. Therefore it is useful to re-evaluate them every once in a while and try a number of values.

# Course 2

# Week 1

## Train/Dev/Test

In the big data era, where millions of examples are available, the train/dev/test split ratios can be changed. For 1 million examples, maybe reserve 10k each for dev and test and use the rest for training, i.e. a 98%/1%/1% split.

**Mismatched train/test distribution**

E.g. high resolution images versus low resolution images.


## Bias/Variance

Should be viewed in context with Bayes (optimal) error. 


## Basic Recipe

High bias (poor training set performance) -> Bigger network, train longer, NN architecture search, until bias is reduced to an acceptable level.

High variance (poor dev set performance) -> More data, regularization, NN architecture search.


**Bias/Variance tradeoff**: Much less of a constraint in the DL era. There are tools to drive down one without hurting the other much, e.g. a bigger network for bias and more data for variance.

<a id='regularization'></a>
## Regularization

### Neural Networks

$$ J\big(W^{[1]}, b^{[1]}, \cdots, W^{[L]}, b^{[L]}\big) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + 
\frac{\lambda}{2m}\sum_{l=1}^{L} \| W^{[l]} \|^2_F$$

$$ \| W^{[l]} \|^2_F = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\big( w_{ij}^{[l]}\big)^2 $$

$\| W^{[l]} \|_F$ is called the **Frobenius norm**, where $W^{[l]}$ has dimension $(n^{[l]}, n^{[l-1]})$.

For backprop:

$$ 
\begin{aligned}
dW^{[l]'} &= dW^{[l]} + \frac{\lambda}{m}W^{[l]} \\
W^{[l]} &= W^{[l]} - \alpha \times dW^{[l]'} \\
&= W^{[l]} - \alpha \big( dW^{[l]} + \frac{\lambda}{m}W^{[l]} \big) \\
&= W^{[l]} - \frac{\alpha \lambda}{m}W^{[l]} - \alpha \times dW^{[l]} \\
&= \big(1-\frac{\alpha \lambda}{m}\big) W^{[l]} - \alpha \times dW^{[l]}
\end{aligned}
$$

This is also known as **weight decay**. Most used method by Andrew Ng.

Regularization constrains the values of $W$ to a smaller range, so $Z$ stays small and activation functions such as tanh operate mostly in their near-linear region rather than the saturated non-linear areas.

In debugging, always plot the full loss value including both terms, including the regularization term.
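
A short sketch of the regularization term and the weight-decay form of the update. The helper names and the numeric values are assumptions for illustration only.

```python
import numpy as np

def l2_cost(weights, lambd, m):
    """(lambda / 2m) * sum of squared Frobenius norms over all layers."""
    return lambd / (2 * m) * sum(np.sum(np.square(W)) for W in weights)

def update_with_decay(W, dW, alpha, lambd, m):
    """One step using dW + (lambda/m) W, written in its weight-decay form."""
    return (1 - alpha * lambd / m) * W - alpha * dW

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
dW = rng.standard_normal((4, 3))
alpha, lambd, m = 0.1, 0.7, 100

# The two forms of the update in the derivation above agree
direct = W - alpha * (dW + lambd / m * W)
assert np.allclose(direct, update_with_decay(W, dW, alpha, lambd, m))

# np.linalg.norm defaults to the Frobenius norm for 2-D arrays
reg = l2_cost([W], lambd, m)
assert np.isclose(reg, lambd / (2 * m) * np.linalg.norm(W) ** 2)
```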

<a id='dropout'></a>
### Dropout

**Inverted Dropout**

At **training** time:

Generate dropout matrix, with `keep_prob=0.8` for example.

    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # uniform [0, 1), not randn
    a3 = np.multiply(a3, d3)
    a3 /= keep_prob  # scale up the kept units so the expected activation is unchanged

From this you can see that different examples use different dropout vectors. This ties back to what the DL book says about forcing the network to focus on different parts of the data. 

In each gradient descent iteration, use a different dropout matrix.

**For backprop, shut down the same neurons using the same dropout matrix, then scale back:**

    da3 = np.multiply(da3, d3)
    da3 = np.divide(da3, keep_prob)


**At test time, no drop out.** 
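
A small sketch showing why the division by `keep_prob` matters: the mean activation is roughly preserved. The function names and the all-ones input are assumptions for the demonstration.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    mask = rng.random(a.shape) < keep_prob     # True with probability keep_prob
    return a * mask / keep_prob, mask

def dropout_backward(da, mask, keep_prob):
    # Same mask and same scaling as the forward pass
    return da * mask / keep_prob

rng = np.random.default_rng(0)
a3 = np.ones((1000, 1000))
a3_drop, d3 = dropout_forward(a3, keep_prob=0.8, rng=rng)
da3 = dropout_backward(np.ones_like(a3), d3, keep_prob=0.8)

print(d3.mean())        # close to 0.8
print(a3_drop.mean())   # close to 1.0, the mean before dropout
```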

Dropout **shrinks the weights**, similar to $L^2$ regularization.

Different layers can have different `keep_prob` values; smaller layers may have larger `keep_prob` values. 

Dropout is commonly used in computer vision to combat overfitting, because there is rarely enough data and models tend to overfit. This does not always generalize to other areas. **Unless you have an overfitting problem, no need to bother with dropout.**

With dropout, the loss curve is no longer guaranteed to decrease monotonically with the no. of iterations. **Trick is to turn off dropout first to make sure that the loss curve is monotonically decreasing, and then turn it back on for real training.**

### Data Augmentation

E.g. flip, rotate, crop, or distort your images. 

### Early Stopping

Plot both the training and dev set loss curves, and stop when the dev loss starts to increase.

### Orthogonalization

Separate the tasks in ML. Optimizing to reduce $J$ and regularizing to avoid overfitting are separate tasks; work on them independently.

<a id='opt'></a>
## Optimization 

### Normalization

Use the $\mu$ and $\sigma$ from training set to normalize dev/test sets.

Normalization makes the optimization problem easier to solve.
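
A minimal sketch of normalizing with training-set statistics only; the synthetic data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 1000))   # (n_x, m)
X_test = rng.normal(loc=5.0, scale=3.0, size=(2, 200))

# Statistics come from the training set only
mu = X_train.mean(axis=1, keepdims=True)
sigma = X_train.std(axis=1, keepdims=True)

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma       # reuse the training mu and sigma
```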

### Vanishing / Exploding Gradients

If the weights are much larger or much smaller than 1, then with many layers the activations and gradients scale roughly like the weights raised to the power $L-1$, so they can either explode or vanish.

<a id='weight_init'></a>
### Weight Initialization

This helps with vanishing / exploding gradients but won't fully resolve the issue.

**Single Neuron**

$z = w^T \times x$, when you have lots of weights, i.e. $w_n$ when $n$ is large, set: 

$$Variance(w_i) = \frac{1}{n} $$

Therefore: $W^{[l]} = np.random.randn(shape) \times np.sqrt\big(\frac{1}{n^{[l-1]}}\big)$

For **ReLU**, $variance(w_i) = \frac{2}{n}$ works better.

**He initialization** (He et al., 2015), $variance = \frac{2}{n^{[l-1]}}$.

For $\tanh$, use $Variance(w_i) = \frac{1}{n}$. Known as Xavier initialization.

Some people also use $\frac{2}{n^{[l-1]} + n^{[l]}}$.

This variance measure can be a hyperparameter to tune, Andrew Ng would rank it as a lower priority though.
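
The variance rules above can be sketched as a single helper. The function name and the scheme labels (`"he"`, `"xavier"`, `"fan_avg"`) are hypothetical names chosen for this example.

```python
import numpy as np

def init_weights(n_out, n_in, scheme, rng):
    if scheme == "he":            # variance 2 / n_in, suggested for ReLU
        scale = np.sqrt(2.0 / n_in)
    elif scheme == "xavier":      # variance 1 / n_in, suggested for tanh
        scale = np.sqrt(1.0 / n_in)
    elif scheme == "fan_avg":     # variance 2 / (n_in + n_out)
        scale = np.sqrt(2.0 / (n_in + n_out))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.standard_normal((n_out, n_in)) * scale

rng = np.random.default_rng(0)
W = init_weights(500, 400, "he", rng)
print(W.var())    # roughly 2 / 400 = 0.005
```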

<a id='grad_checking'></a>
### Gradient Checking

Given $f(\theta)$ and its derivative $g(\theta)$:

$$ g(\theta) = \lim_{\epsilon \rightarrow 0} \frac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon} $$

For small finite $\epsilon$, the error of this two-sided difference is of order $\mathcal{O}(\epsilon^2)$. This is better than the one-sided difference $\frac{f(\theta+\epsilon) - f(\theta)}{\epsilon}$, which has an error of order $\mathcal{O}(\epsilon)$.

#### Procedure:

Take all parameters, $W^{[1]}, b^{[1]}, \cdots, W^{[L]}, b^{[L]}$, reshape into a big vector $\theta$.

Take all $dW^{[1]}, db^{[1]}, \cdots, dW^{[L]}, db^{[L]}$, reshape into a big vector $d\theta$.

For each $i$, compute:

$$
\begin{aligned}
d\theta_{approx}[i] &= \frac{J(\theta_1, \theta_2, \cdots, \theta_i + \epsilon, \cdots) - 
J(\theta_1, \theta_2, \cdots, \theta_i - \epsilon, \cdots)}{2\epsilon} \\
&\approx d\theta[i] = \frac{\partial J}{\partial \theta_i}
\end{aligned}
$$

Check: 

$$ 
\begin{aligned}
\frac{\| d\theta_{approx} - d\theta \|_2}{\| d\theta_{approx} \|_2 + \| d\theta \|_2} &\approx 
\begin{cases}
10^{-7} &\text{great!} \\
10^{-5} &\text{worth checking, probably ok, check all components of the vector} \\
10^{-3} &\text{worry, must be a bug!}
\end{cases}
\end{aligned}
$$

Note $\| \cdot \|_2$ is computed with `np.linalg.norm()`.

Create a vector to store all $d\theta_{approx}[i]$: iterate over every component $\theta_i$ of $\theta$, calculate $d\theta_{approx}[i]$ and store it in this vector. See course assignment for details.

Then calculate the check with `np.linalg.norm()` and compare against the threshold levels above.
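
The procedure can be sketched on a toy cost function. This is a minimal illustration, not the course assignment: `grad_check`, the cubic cost, and the deliberately wrong gradient are all assumptions.

```python
import numpy as np

def grad_check(J, grad_J, theta, eps=1e-7):
    """Relative difference between analytic and centered-difference gradients."""
    d_theta = grad_J(theta)
    d_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        d_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(d_approx - d_theta)
    den = np.linalg.norm(d_approx) + np.linalg.norm(d_theta)
    return num / den

# Toy cost J(theta) = sum(theta^3), whose gradient is 3 * theta^2
J = lambda th: np.sum(th ** 3)
good_grad = lambda th: 3 * th ** 2
bad_grad = lambda th: 2 * th ** 2          # deliberately wrong

theta = np.array([1.0, -2.0, 0.5])
assert grad_check(J, good_grad, theta) < 1e-7
assert grad_check(J, bad_grad, theta) > 1e-3
```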

#### Implementation Notes

Don't use in training, only to debug.

If the algorithm fails the grad check, look at the individual components of $d\theta_{approx}$ and $d\theta$ to identify the bug. Is the mismatch in a $W$ or a $b$?

Remember to include the regularization term in both $J$ and the gradients. 

Doesn't work with dropout. Turn off dropout to check first, then turn it back on.

Run at random initialization, and perhaps again after some training. It is possible that an implementation is only correct when the weights are close to 0, but not when they move away from zero.

# Week 2 

<a id='minibatch'></a>
## Minibatch Gradient Descent

**Notation**: $X^{\{i\}}$ to denote minibatch $i$

    for t in range(5000):  # t indexes the minibatches
        y_hat, cache = forward_prop(X_batches[t], W, b)  # X^{t}
        cost = compute_cost(Y_batches[t], y_hat)         # Y^{t}
        grads = back_prop(cache)
        W, b = update_parameters(W, b, grads)

In batch gradient descent, you'd expect the cost function to decrease monotonically. 

In minibatch gradient descent, costs may go up or down but should trend down over iterations.

If minibatch size == $m$ : Batch gradient descent. It usually takes large, non-noisy steps towards the minimum, but each iteration takes too long.

If minibatch size == 1 : Stochastic Gradient Descent (SGD). It won't converge exactly but wanders around the minimum, and it loses the speed-up from vectorization. 

**Small training sets**, $m \le 2000$ -> batch gradient descent

Typical minibatch size: 64, 128, 256, 512, 1024... Make sure data fits into GPU/CPU memory.

When partitioning the dataset into minibatches, the last batch may be smaller than the batch size, this last batch is also used for training (see assignment).
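
A minimal sketch of the partitioning step, assuming examples are stored as columns. The function name and sizes are hypothetical.

```python
import numpy as np

def random_minibatches(X, Y, batch_size, rng):
    """Shuffle columns, then split into batches; the last one may be smaller."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 150))
Y = rng.integers(0, 2, size=(1, 150))
batches = random_minibatches(X, Y, batch_size=64, rng=rng)

print([xb.shape[1] for xb, _ in batches])   # [64, 64, 22]
```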

<a id='opt_algo'></a>
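## EWMA (Exponentially Weighted Moving Average)

$$ V_t = \beta V_{t-1} + (1-\beta) \theta_t \approx \text{averaging over } \frac{1}{1-\beta} \text{ days} $$

Rule of thumb: $(1 - \epsilon)^{1/\epsilon} \approx \frac{1}{e}$, so with $\epsilon = 1 - \beta$ the weight on a past value decays to about $\frac{1}{e}$ after $\frac{1}{1-\beta}$ steps. A **larger** $\beta$ means averaging over a **longer** window.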

## Bias Correction

Bias correction deals with the first few EWMA values, which start too low because $V_0 = 0$. To correct this: $V'_t = \frac{V_t}{1-\beta^t}$
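
The effect is easiest to see on a constant signal. In this sketch (the `ewma` helper is an assumption), the uncorrected average starts far too low while the corrected one recovers the true value immediately.

```python
import numpy as np

def ewma(values, beta, bias_correction=False):
    v, out = 0.0, []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)

theta = np.full(20, 10.0)                   # constant signal
raw = ewma(theta, beta=0.9)
corrected = ewma(theta, beta=0.9, bias_correction=True)

print(raw[0], corrected[0])                 # about 1.0 versus about 10.0
```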

## Gradient Descent with Momentum

On iteration $t$:
    
    compute dW, db on current minibatch
    VdW = beta * VdW + (1-beta) * dW
    Vdb = beta * Vdb + (1-beta) * db
    
    W = W - learning_rate * VdW
    b = b - learning_rate * Vdb
    
The idea is to use EWMA to smooth out the gradients across minibatches, i.e. smooth out the randomness in update directions. Typically use $\beta = 0.9$, which averages over roughly the last 10 iterations. Possible values range from 0.8 to 0.999.  Normally people don't use bias correction for `VdW`.

## Nesterov Momentum 

Based on Stanford CS231n Lecture 6.

The idea is that if we know the next update will be momentum + gradient, then instead of evaluating the gradient at the current position we can evaluate it at current position + momentum (look ahead).

$$
\begin{aligned}
v_t &= \mu v_{t-1} - \epsilon \triangledown f(\theta_{t-1} + \mu v_{t-1}) \\
\theta_t &= \theta_{t-1} + v_t
\end{aligned}
$$

From normal backprop, we have $\theta_{t-1}, \triangledown f(\theta_{t-1})$. Let $\phi_{t-1} = \theta_{t-1} + \mu v_{t-1}$:

$$
\begin{aligned}
v_t &= \mu v_{t-1} - \epsilon \triangledown f(\phi_{t-1}) \\
\phi_t &= \theta_t + \mu v_t \\
&= \theta_{t-1} + v_t + \mu v_t \\
&= \phi_{t-1} - \mu v_{t-1} + (1 + \mu) v_t
\end{aligned}
$$

Therefore:

```
v_prev = v
v = mu * v - learning_rate * dx
x += -mu * v_prev + (1 + mu) * v
```

**Almost always** better than simple momentum. 

## RMSprop - Hinton proposed this in a Coursera course!

On iteration $t$:

    compute dW, db on current minibatch
    SdW = beta * SdW + (1-beta) * dW**2  # element-wise square
    Sdb = beta * Sdb + (1-beta) * db**2
    
    
    W = W - learning_rate * dW / np.sqrt(SdW + 1e-8)
    b = b - learning_rate * db / np.sqrt(Sdb + 1e-8)
    
The intuition is that when the updates in some parameter directions are large compared to others, you may want to slow down learning in those directions, e.g. you want to slow down learning in `b` but have fast learning in `W`. Dividing by the root of the averaged squared gradients achieves this.

RMSprop overcomes a weakness of AdaGrad, whose updates vanish over time because its sum of squared gradients keeps growing. [Stanford CS231n](https://www.youtube.com/watch?v=hd_KFJ5ktUc&t=525s&list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC&index=6).

## Adam (Adaptive Moment Estimation)

[paper](https://arxiv.org/pdf/1412.6980.pdf)

    VdW, SdW, Vdb, Sdb = 0, 0, 0, 0
    On iteration t (t = 1, 2, ...):
        compute dW, db on current minibatch
        
        VdW = beta1 * VdW + (1-beta1) * dW
        Vdb = beta1 * Vdb + (1-beta1) * db
        
        SdW = beta2 * SdW + (1-beta2) * dW**2
        Sdb = beta2 * Sdb + (1-beta2) * db**2
        
        # bias correction
        VdW_corrected = VdW / (1 - np.power(beta1, t))
        Vdb_corrected = Vdb / (1 - np.power(beta1, t))
        
        SdW_corrected = SdW / (1 - np.power(beta2, t))
        Sdb_corrected = Sdb / (1 - np.power(beta2, t))
        
        W = W - learning_rate * VdW_corrected / np.sqrt(SdW_corrected + 1e-8)
        b = b - learning_rate * Vdb_corrected / np.sqrt(Sdb_corrected + 1e-8)
        
Hyperparameter choices: 

* `learning_rate` : needs to be tuned
* `beta1` : 0.9
* `beta2` : 0.999 (proposed by original paper)
* `epsilon` : $10^{-8}$ (proposed by original paper), rarely tuned

[Melis, et al. 2017](https://arxiv.org/abs/1707.05589) used Adam with $\beta_1 = 0$ and $\beta_2 = 0.999$ and $\epsilon = 10^{-9}$. This turns off the exponential moving average for the estimates of the means of the gradients and brings Adam very close to RMSProp without momentum, but due to Adam's bias correction, **larger** learning rates can be used. 
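
A runnable sketch of one Adam step on a toy quadratic. The function name, the `state` tuple, and the toy problem are assumptions. Note that $\epsilon$ is added outside the square root here, as in the paper, whereas the pseudocode above adds it inside; both guard against division by zero.

```python
import numpy as np

def adam_step(w, dw, state, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array; state holds (V, S)."""
    V, S = state
    V = beta1 * V + (1 - beta1) * dw
    S = beta2 * S + (1 - beta2) * dw ** 2
    V_hat = V / (1 - beta1 ** t)           # bias correction
    S_hat = S / (1 - beta2 ** t)
    w = w - lr * V_hat / (np.sqrt(S_hat) + eps)
    return w, (V, S)

# Toy problem: minimize f(w) = sum(w^2), whose gradient is 2w
w = np.array([3.0, -2.0])
state = (np.zeros_like(w), np.zeros_like(w))
for t in range(1, 2001):                   # t starts at 1 so 1 - beta**t is nonzero
    w, state = adam_step(w, 2 * w, state, t)
```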

## Learning Rate Decay

1 epoch = 1 pass through all data $m$, $\alpha$ is the learning rate.

$$ \alpha = \frac{1}{1 + \text{decay_rate} \times \text{epoch_num}}\alpha_0 $$

Try $\alpha_0 = .2$, $\text{decay_rate} = 1$.

Some variations:

$$
\begin{aligned}
\alpha &= 0.95^{\text{epoch_num}} \alpha_0 \\
\alpha &= \frac{k}{\sqrt{\text{epoch_num}}} \alpha_0
\end{aligned}
$$

Or discrete staircase, manual decay.
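
The first two schedules can be written as small helpers (the names are assumptions), using the suggested $\alpha_0 = 0.2$ and a decay rate of 1:

```python
import numpy as np

def inverse_decay(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, base, epoch_num):
    return base ** epoch_num * alpha0

epochs = np.arange(0, 4)
print(inverse_decay(0.2, 1.0, epochs))      # 0.2, 0.1, 0.0667, 0.05
print(exponential_decay(0.2, 0.95, epochs))
```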

## Local Optima

In deep learning, a point with zero gradient is much more likely to be a saddle point than a local optimum.

**Plateaus** slow down learning significantly.

# Week 3

## Hyperparameters

Secondary importance: momentum, minibatch size, no. of hidden units. 

Grid search is outdated now. It is OK when the # of hyperparameters is small.

Better to use **random search**, as suggested by the DL book. 

Use a **coarse to fine** scheme, like what's suggested by Goodfellow. 

<a id='random_search'></a>
## Appropriate Scale for Random Search

**Uniform sampling** may be reasonable for # of layers, # of hidden units in a layer.

Use **log** scale for **learning rate**. 

    # r is uniform
    r = -4 * np.random.rand()
    alpha = np.power(10, r)
    
**Momentum $\beta$**: 
    
    # r in (-3, 0], so beta = 1 - 10**r lies in [0, 0.999)
    r = -3 * np.random.rand()
    beta = 1 - np.power(10, r)
    
The theoretical reason is that when $\beta$ is close to 1, small changes have a huge impact, e.g. $\beta$ from 0.999 to 0.9995 is moving from averaging 1000 samples to 2000 samples. 
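
A vectorized sketch of log-scale sampling. The ranges are assumptions for illustration: $\alpha \in [10^{-4}, 1]$ as above, and $1 - \beta \in [10^{-3}, 10^{-1}]$ so that $\beta$ stays between 0.9 and 0.999.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# Learning rate: log-uniform on [1e-4, 1]
alpha = 10 ** rng.uniform(-4, 0, size=n)

# Momentum: sample 1 - beta log-uniformly on [1e-3, 1e-1]
beta = 1 - 10 ** rng.uniform(-3, -1, size=n)

# Each decade of alpha receives roughly the same share of samples
share = np.mean((alpha >= 1e-4) & (alpha < 1e-3))
print(share)    # about 0.25
```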

## Babysitting One Model

Watch the cost function over a number of days and adjust the model each day when you don't have a lot of resources. (Panda approach)

Alternatively, train many models in parallel. (Caviar approach)

<a id='batchnorm'></a>
## Batch Normalization

[paper](https://arxiv.org/abs/1502.03167) 

[R2RT Tensorflow Example](https://r2rt.com/implementing-batch-normalization-in-tensorflow.html)

In practice, people tend to normalize $Z^{[l]}$ more often than $A^{[l]}$. This is the approach used in this course.

### Implementation

Given intermediate values in a network, $Z^{(1)}, Z^{(2)}, \cdots, Z^{(m)}$, from some hidden layer $l$, data $i$, i.e. $Z^{[l](i)}$. 

$$
\begin{aligned}
\mu &= \frac{1}{m} \sum_{i=1}^{m} Z^{[l](i)} \\
\sigma^2 &= \frac{1}{m} \sum_{i=1}^{m} \big( Z^{[l](i)} - \mu \big)^2 \\
Z^{[l](i)}_{norm} &= \frac{Z^{[l](i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} \\
\widetilde{Z}^{[l](i)} &= \gamma \times Z^{[l](i)}_{norm} + \beta 
\end{aligned}
$$

Use $\widetilde{Z}^{[l](i)}$ instead of $Z^{[l](i)}$. $\gamma$ and $\beta$ are training parameters.

In a deep network, batch norm is done between computing $Z^{[l]}$ and $A^{[l]}$, over a minibatch.

Parameters in the network become: $W^{[l]}, b^{[l]}, \gamma^{[l]}, \beta^{[l]}$.

$b^{[l]}$ can be eliminated because BN subtracts the mean, which cancels any constant offset, and $\beta^{[l]}$ takes over its role. It can therefore be set to 0 permanently, i.e. $z^{[l]} = W^{[l]}a^{[l-1]}$.

**Dimensions**: $b^{[l]}, \gamma^{[l]}, \beta^{[l]}$ and, for one training example, $z^{[l]}$ all have dimension $(n^{[l]}, 1)$, where $n^{[l]}$ is the number of hidden units in layer $l$. $W^{[l]}$ keeps its dimension $(n^{[l]}, n^{[l-1]})$.
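
A minimal sketch of the BN forward computation for one layer and one minibatch (training-time statistics only). The function name, shapes, and values are assumptions. It also checks that a bias added before BN has no effect.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n_l, m) pre-activations for one minibatch; gamma, beta: (n_l, 1)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(4, 64))
gamma, beta = np.full((4, 1), 2.0), np.full((4, 1), -1.0)
Z_tilde = batchnorm_forward(Z, gamma, beta)

# Adding a bias b before BN changes nothing, which is why b can be dropped
b = rng.standard_normal((4, 1))
assert np.allclose(batchnorm_forward(Z + b, gamma, beta), Z_tilde)
```

After the transform each row of `Z_tilde` has mean $\beta$ and standard deviation close to $\gamma$.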



### Code

    for t in range(num_minibatch):
    
        Compute forward prop on X_minibatch_t
            In each hidden layer, use BN to replace Z^{[l]} with \widetilde{Z}^{[l]}
        
        Use backprop to compute dW^{[l]}, dbeta^{[l]}, dgamma^{[l]}
        
        Update parameters
        
Works with momentum, RMSprop, Adam, etc.



### Why BatchNorm works

Makes weights in later layers more robust to changes in earlier layers. 

**Covariate Shift**: the input distribution changes, e.g. between training and test, or for a hidden layer as the earlier layers are updated. BN reduces this effect.

### Regularization Effect

Mean/variance computed on just one minibatch, therefore there is **noise** in the values of $Z$. This has a slight regularization effect.

**Andrew Ng: the regularization effect is light, so don't rely on it. Use dropout or other methods.**

### BN At Test Time

$\mu, \sigma^2$ at test time are estimated using an EWMA across the training minibatches.

Why though? Doesn't that favour the later batches? Posted question in Discussion Forum [here](https://www.coursera.org/learn/deep-neural-network/discussions/weeks/3/threads/fGY3XZWnEeeImxLZqEy9Mg)

Original paper uses running average collected during training, using the full population.

## Softmax Regression

Used for multi-class classification problems. 

### Activation Function

Given an $N$ class problem, dimension in brackets:

$$
\begin{aligned}
t &= e^{Z^{[l]}}, (N, 1) \\
a^{[l]}_i &= \frac{t_i}{\sum_{j=1}^{N} t_j}, \quad a^{[l]} : (N, 1)
\end{aligned}
$$

### Loss Function

$$ \mathcal{L}(\hat{y}, y) = -\sum_{j=1}^{N} y_j \log \hat{y}_j $$

Minimizing this loss is equivalent to maximum likelihood estimation.
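
A small numpy sketch of the softmax activation and its loss. Subtracting the column maximum before exponentiating is an addition for numerical stability and does not change the result; the function names and example values are assumptions.

```python
import numpy as np

def softmax(z):
    """z: (N, m) logits; returns column-wise probabilities."""
    t = np.exp(z - z.max(axis=0, keepdims=True))   # shift for numerical stability
    return t / t.sum(axis=0, keepdims=True)

def cross_entropy(y_hat, y):
    """y is one-hot with shape (N, m); returns the loss per example."""
    return -np.sum(y * np.log(y_hat), axis=0)

z = np.array([[5.0], [2.0], [-1.0], [3.0]])        # N = 4 classes, one example
y = np.array([[0.0], [1.0], [0.0], [0.0]])         # true class is the second one
a = softmax(z)
loss = cross_entropy(a, y)

print(a.ravel().round(3))    # approximately [0.842 0.042 0.002 0.114]
```

With a one-hot $y$, the loss reduces to $-\log \hat{y}_j$ for the true class $j$.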

## TensorFlow

`tf.placeholder()` is used for data input (TensorFlow 1.x API). 

Define parameters with `tf.Variable()`; the cost function is specified in terms of these variables. 