# Week 1

Hyperparameters:
* The learning rate $\alpha$
* The number of iterations of gradient descent
* The number of layers
* The number of units in each layer
* The activation function

## Train / Dev / Test sets

Applied ML is an iterative process:

**IDEA $\rightarrow$ CODE $\rightarrow$ EXPERIMENT $\rightarrow$ IDEA $\rightarrow$ ...**

Applied deep learning is a very iterative process where you just have to go around this cycle many times to hopefully find a good choice of network for your application. So one of the things that determine how quickly you can make progress is how **efficiently** you can go around this cycle. And setting up your data sets well in terms of your **train, development and test sets** can make you much more efficient at that.

* You keep training algorithms on your training set.
* You use your dev set (also called the hold-out cross-validation set) to see which of many different models performs best.
* After having done this long enough, you take the best model you have found and evaluate it on your test set, in order to get an unbiased estimate of how well your algorithm is doing.

It used to be common practice to take all your data and split it roughly 70/30%; people often talk about the 70/30 train/test split.

But in the modern big data era, where you might have a million examples in total, the trend is for the dev and test sets to be a much smaller percentage of the total. Remember, the goal of the dev set is to test different algorithms on it and see which one works better. So the dev set just needs to be big enough for you to evaluate, say, two or ten different algorithm choices and quickly decide which one is doing better. For example, if you have a million examples and need just 10,000 for your dev set and 10,000 for your test set, each of them is 1% of the total, so you'll have 98% train, 1% dev, 1% test.
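The 98/1/1 split from the example can be sketched as follows. This is a minimal illustration, assuming all examples come from one distribution and can simply be shuffled; the helper name `split_indices` and the fixed seed are hypothetical choices, not part of the course material.

```python
import numpy as np

def split_indices(m, dev_size, test_size, seed=0):
    """Shuffle the indices 0..m-1 and cut them into train/dev/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    dev_idx = perm[:dev_size]
    test_idx = perm[dev_size:dev_size + test_size]
    train_idx = perm[dev_size + test_size:]
    return train_idx, dev_idx, test_idx

m = 1_000_000
train_idx, dev_idx, test_idx = split_indices(m, dev_size=10_000, test_size=10_000)
fractions = [len(part) / m for part in (train_idx, dev_idx, test_idx)]
print(fractions)  # [0.98, 0.01, 0.01]
```

The index arrays can then be used to slice the columns of the data matrix, e.g. `X[:, train_idx]`.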

One other trend we're seeing in the era of modern deep learning is that more and more people train on **mismatched train and test distributions**. So maybe your training set has a lot of pictures crawled off the Internet but the dev and test sets are pictures uploaded by users. Turns out a lot of webpages have very high resolution, very professional, very nicely framed pictures of cats. But maybe your users are uploading blurrier, lower res images just taken with a cell phone camera in a more casual condition. And so these two distributions of data may be different. The rule of thumb I'd encourage you to follow in this case is to make sure that the dev and test sets come from the same distribution.

## Bias / Variance

For a model $Y = f(x) + \epsilon$, the expected MSE can be decomposed into the squared *bias* of $\hat{f}(x)$, the *variance* of $\hat{f}(x)$ and the variance of the error term $\epsilon$:

$$E\big[\big(Y-\hat{f}(x)\big)^2\big] = \underbrace{\big(E\big[\hat{f}(x)\big]-f(x)\big)^2}_{\text{bias}^2} + \underbrace{E\big[\big(\hat{f}(x)-E[\hat{f}(x)]\big)^2\big]}_{\text{variance}} + \sigma^2_{\epsilon}$$

* *Bias* refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between $Y$ and $X_1, X_2, ..., X_p$. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of $f$. $\rightarrow$ **underfitting**

* The *variance* of $\hat{f}$ refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different $\hat{f}$. But ideally the estimate for $f$ should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in $\hat{f}$. $\rightarrow$ **overfitting**


* High bias?  $\rightarrow$ Bigger network (more hidden layers, more hidden units), different architectures
* High variance?  $\rightarrow$  More training data, **regularization**

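Regarding the **bias-variance tradeoff**: in deep learning, fixing one generally does not have to hurt the other. A bigger network mostly reduces bias without increasing variance much (as long as it is regularized), and more data mostly reduces variance without increasing bias.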

### Regularization

$L_2$ regularization is also known as **weight decay**. Examples of regularization in linear regression are the Lasso ($L_1$ penalty) and Ridge regression ($L_2$ penalty). With an $L_2$ penalty the objective is:

$$\min_{w,b}J(w,b)=\frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}||w||_2^2$$

where $\lambda$ is the regularization parameter and should be chosen using CV or the dev set.

In a neural network the cost function becomes

$$J(w^{[1]},b^{[1]},...,w^{[L]},b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^L||w^{[l]}||_F^2$$

where $||w^{[l]}||_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}}(w_{i,j}^{[l]})^2$ is the squared *Frobenius norm* of matrix $w^{[l]}$, of dimension $n^{[l]} \times n^{[l-1]}$.

Remember: $\underbrace{x^{(i)}}_{n_x \times 1} \rightarrow \underbrace{\overbrace{w^{[1]}}^{n^{[1]} \times n_x}x^{(i)}+b^{[1]}}_{n^{[1]} \times 1} \rightarrow  \underbrace{\overbrace{w^{[2]}}^{n^{[2]} \times n^{[1]}}a^{[1](i)}+b^{[2]}}_{n^{[2]} \times 1} \rightarrow ...$

Intuition: the penalty adds $\frac{\lambda}{m}w^{[l]}$ to the gradient, so every update also shrinks $w^{[l]}$ by a factor $(1-\frac{\alpha\lambda}{m})$, hence the name weight decay. Each hidden unit has a smaller effect: higher $\lambda \rightarrow$ smaller $w^{[l]} \rightarrow$ smaller $z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]} \rightarrow$ the activation function (e.g. tanh) works in its nearly linear region $\rightarrow$ the network is closer to a linear model.

<img src="linear_af.PNG" width="300px" />
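As a rough sketch of the regularized cost, the penalty term and the corresponding update can be written in NumPy as below. The function names `l2_penalty` and `weight_decay_step`, and the toy weight matrices, are illustrative assumptions; `dW_backprop` stands for the gradient of the unregularized cost.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lambda / 2m) * sum over layers of the squared Frobenius norm."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def weight_decay_step(W, dW_backprop, alpha, lam, m):
    """One gradient step on the regularized cost for a single layer."""
    dW = dW_backprop + (lam / m) * W   # gradient of the penalty is (lambda/m) W
    return W - alpha * dW

# Toy weights for a two-layer network (hypothetical values).
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
print(l2_penalty([W1, W2], lam=0.1, m=10))

# With a zero data gradient the step only shrinks W by (1 - alpha * lam / m).
print(weight_decay_step(W1, np.zeros_like(W1), alpha=0.5, lam=0.1, m=10))
```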

### Dropout

**Dropout** consists of randomly setting hidden units to zero during training in order to reduce overfitting:

<img src="dropout.png" width="600px" />

The probability of keeping a unit can be different for every hidden layer.

**Inverted dropout** consists of dividing the remaining hidden units by the probability of being kept, say $80\%$. This is done because in $z^{[l]} = w^{[l]}a^{[l-1]}+b^{[l]}$ the expected value of $a^{[l-1]}$ has been reduced by $20\%$, hence using $a^{[l-1]}/0.8$ makes sure that the expected value of $a^{[l-1]}$ stays the same.

Notice that at test time you're not using dropout: you're not flipping coins to decide which hidden units to eliminate. That's because when you are making predictions at test time, you don't want your output to be random.
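A minimal NumPy sketch of inverted dropout at training time, assuming a keep probability of 0.8 as in the text. The function name and the all-ones activations are hypothetical and only serve to make the rescaling visible.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, then rescale."""
    mask = rng.random(a.shape) < keep_prob   # True where the unit is kept
    return a * mask / keep_prob, mask

rng = np.random.default_rng(0)
a_prev = np.ones((50, 1000))                 # hypothetical activations a^{[l-1]}
a_drop, mask = inverted_dropout(a_prev, keep_prob=0.8, rng=rng)
print(a_prev.mean(), a_drop.mean())          # the two means should be close
```

At test time the function is simply not called, so the activations pass through unchanged.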

### Data Augmentation

If getting additional data is costly, you can augment your data set by applying random distortions and translations to the images you have, creating additional fake training examples.

<img src="data_aug.png" width="600" />
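One possible way to generate such fake examples is sketched below, assuming images are stored as `(H, W, C)` NumPy arrays. The `augment` helper, the choice of a horizontal flip plus a small zero-padded shift, and the random stand-in image are all assumptions made for illustration.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of an (H, W, C) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                     # horizontal flip
    # Pad with zeros, then crop an H x W window at a random offset (translation).
    h, w, _ = image.shape
    padded = np.pad(image, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)))
    dy, dx = rng.integers(0, 2 * max_shift + 1, size=2)
    return padded[dy:dy + h, dx:dx + w, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                       # stand-in for a real picture
fake_examples = [augment(image, rng) for _ in range(4)]
print([x.shape for x in fake_examples])
```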

### Early stopping

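**Early stopping** consists of stopping the minimization of the cost function (on the train set) before it reaches the minimum, typically at the point where the error on the dev set starts to increase.

The problem with this method is that it mixes two different tasks, optimizing the cost function and reducing overfitting, so they can no longer be tuned independently.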
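A small sketch of one common stopping rule, assuming the dev-set loss is recorded after every epoch and training stops once it has not improved for `patience` epochs. The notes do not specify a criterion, so this rule, the function name and the loss values are assumptions.

```python
def early_stopping_epoch(dev_losses, patience=2):
    """Index of the epoch to keep: the best dev loss seen before patience runs out."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break                      # dev loss has not improved for `patience` epochs
    return best_epoch

# Hypothetical dev-set losses recorded after each epoch of training.
dev_losses = [0.90, 0.70, 0.60, 0.58, 0.61, 0.65, 0.50]
print(early_stopping_epoch(dev_losses))   # 3
```

Note that the better value at the last epoch is never reached, which illustrates how stopping early also cuts the optimization short.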

## Setting up the Optimization

### Normalizing input features

When the features have very different scales, so do the weights $w^{[l]}$. In this case the cost function is "elongated", and gradient descent needs a small learning rate and many steps to finally find the minimum.

<img src="normalize.PNG" width="600" />

If you scale the train set by subtracting the mean and dividing by the standard deviation, then you have to scale the test set with the same values (the mean and standard deviation computed on the train set). Both sets have to go through the same transformation.
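The point about reusing the train-set statistics can be sketched as follows, assuming data matrices of shape $(n_x, m)$ with one column per example. The synthetic data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data with shape (n_x, m): two features on very different scales.
X_train = rng.normal(loc=[[5.0], [100.0]], scale=[[1.0], [20.0]], size=(2, 500))
X_test = rng.normal(loc=[[5.0], [100.0]], scale=[[1.0], [20.0]], size=(2, 100))

mu = X_train.mean(axis=1, keepdims=True)       # computed on the train set only
sigma = X_train.std(axis=1, keepdims=True)

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma            # same mu and sigma as the train set
print(X_train_norm.mean(axis=1), X_train_norm.std(axis=1))
```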

### Vanishing/Exploding gradients

With **deep** networks the gradient may decrease or increase exponentially.

A partial solution to this problem is a careful choice of the random initialization. For example, you can set the variance of the weights to $\frac{1}{n^{[l-1]}}$ for the tanh activation function (*Xavier initialization*), to $\frac{2}{n^{[l-1]}}$ for the ReLU activation function, or to $\frac{2}{n^{[l-1]}+n^{[l]}}$. For example, drawing from a standard normal distribution with `np.random.randn`:

$$w^{[l]} = \text{np.random.randn}(\text{shape}) \cdot \text{np.sqrt}\bigg(\frac{2}{n^{[l-1]}}\bigg)$$
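A short sketch of the $\frac{2}{n^{[l-1]}}$ variant for a whole network, assuming a list `layer_dims` that holds $n^{[0]}, n^{[1]}, ..., n^{[L]}$. The function name, the dictionary layout and the layer sizes are illustrative assumptions.

```python
import numpy as np

def initialize_he(layer_dims, seed=0):
    """Draw W[l] from N(0, 2 / n[l-1]) and set b[l] = 0."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

params = initialize_he([1000, 500, 1])         # hypothetical layer sizes
print(params["W1"].shape, params["W1"].var())  # variance should be near 2/1000
```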

# Week 2

## Optimization algorithms

### Mini-batch gradient descent

The idea is that instead of computing the gradient on the whole training sample of dimension $(n_x,m)$ one can split the observations in train set in many *batches* of smaller dimension $X^{\{1\}}, X^{\{2\}}, ...$ and $y^{\{1\}}, y^{\{2\}}, ...$.

**Epoch**: one complete pass of the whole training set, forward and backward, through the learning algorithm. The number of epochs is the number of such passes.

For each layer $l=1,...,L$ the parameters are $W^{[l]}$ of dimension $(n^{[l]} \times n^{[l-1]})$, where $n^{[0]}=n_x$ is the number of input features.

It works as follows:

a) First randomly initialize the parameters $W^{[l]}$ for $l=1,...L$.

b) Take a sample of size $p\in[1,m]$ of the training data, denoted by $(X^{\{1\}},y^{\{1\}})$ for mini-batch number $1$.

c) Compute the cost $J^{\{1\}}(W)$ with the current parameters and the first mini-batch of the train data.

d) In back-propagation, update the parameters for $l=L,...,1$ according to a learning rate $\alpha$:
$$ W^{[l]} = W^{[l]} - \alpha \text{ } \frac{\partial J^{\{1\}}(W)}{\partial W^{[l]}}$$

This is one step of gradient descent with one mini-batch of the train data. Now repeat steps (c) and (d) with the "new" $W^{[l]}$ on a second mini-batch $(X^{\{2\}},y^{\{2\}})$, and so on, so that every update is computed on a different mini-batch. Once all mini-batches have been used, one epoch is complete; the process continues over further epochs until convergence.



There are three main kinds of gradient descent:

* Batch Gradient Descent: the batch size is equal to the entire dataset, so the model is updated only once the whole dataset has been passed. It is an option when the training data fits in memory (RAM / VRAM).

* Stochastic Gradient Descent: the batch size is equal to 1. This means that the model is updated with only one training instance at a time.

* Mini-batch Gradient Descent: the batch size is equal to a value $p \in (1,m)$. This means that the model is updated once per mini-batch.

The rule of thumb is to use batch gradient descent if the whole dataset fits in memory. Otherwise, depending on the size of each instance, choose mini-batch gradient descent with a fixed batch size that fits entirely in memory. With mini-batch gradient descent the cost usually converges more noisily than with batch gradient descent, because the content of the batches varies.

<img src="gd.png" width="600" />
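Splitting the training set into mini-batches $X^{\{1\}}, X^{\{2\}}, ...$ could look like the sketch below, assuming `X` has shape $(n_x, m)$ and `Y` has shape $(1, m)$. The function name, the reshuffling before each split and the toy data are assumptions.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, rng):
    """Shuffle the columns of X (n_x, m) and Y (1, m) and cut them into batches."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]
    return [
        (X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10))               # hypothetical data, m = 10
Y = rng.integers(0, 2, size=(1, 10))
batches = random_mini_batches(X, Y, batch_size=4, rng=rng)
print([xb.shape[1] for xb, _ in batches])      # [4, 4, 2]
```

Setting `batch_size=m` gives batch gradient descent and `batch_size=1` gives stochastic gradient descent.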

### Gradient Descent with Momentum

**Exponential weighted averages:**

Consider a series $\theta_1, \theta_2, ...,\theta_n$ and let $v_t = \beta v_{t-1} + (1-\beta) \theta_t$, with $v_0 = 0$. The value $v_t$ is approximately the average over the last $\frac{1}{1-\beta}$ observations of $\theta_t$; for example, when $\beta=0.9$, $v_t$ is roughly the average of the last 10 values of $\theta_t$. The larger $\beta$, the smoother the series $v_t$.

Since at the beginning $v_t$ is artificially close to zero for a few iterations, one can correct the update using $v^*_t = \frac{v_t}{1-\beta^t}$. For example, let $\beta = 0.98$:

$$v_0 = 0$$
$$v_1 = 0.02 \theta_1$$
$$v_2 = 0.98 v_1 + 0.02 \theta_2 = 0.0196 \theta_1 + 0.02 \theta_2$$
while $(1-\beta^2) = 0.0396$ so that
$$v^*_2 = \frac{0.0196 \theta_1 + 0.02 \theta_2}{0.0396}$$
which is a weighted average of the two $\theta$s, since the weights $0.0196$ and $0.02$ add up to $0.0396$.
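The worked example with $\beta = 0.98$ can be checked numerically with a short sketch. The function name `ewa` and the two observation values are hypothetical.

```python
def ewa(thetas, beta, bias_correction=True):
    """Exponentially weighted average v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

thetas = [10.0, 12.0]                           # hypothetical observations
print(ewa(thetas, beta=0.98, bias_correction=False))  # starts far below the data
print(ewa(thetas, beta=0.98))                          # corrected values lie between 10 and 12
```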

In order to avoid oscillations in gradient descent (blue) or overshooting (purple), the idea is to apply this smoothing mechanism in the update of the weights (red).

<img src="gd_momentum.png" width="600" />

On iteration $t$:
 
$$v_{dW} = \beta v_{dW} + (1-\beta)dW$$
$$v_{db} = \beta v_{db} + (1-\beta)db$$

$$W = W - \alpha v_{dW}$$
$$b = b - \alpha v_{db}$$

This adds the hyperparameter $\beta$, which works like friction. It is generally set to $\beta = 0.9$, so that $v_{dW}$ is roughly the average of the gradients of the last 10 iterations. The matrices $v_{dW}$ and $v_{db}$ are initialized as zeros.
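The update rule above, written as a small NumPy function and tried on a toy problem. The quadratic cost $J(W)=\sum W^2$ (gradient $2W$), the learning rate and the function name are assumptions chosen only to make the sketch runnable.

```python
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    """One update of W using the exponentially weighted average of gradients."""
    v_dW = beta * v_dW + (1 - beta) * dW
    W = W - alpha * v_dW
    return W, v_dW

# Toy cost J(W) = sum(W**2), whose gradient is 2W (an assumption for the demo).
W = np.array([1.0, -2.0])
v_dW = np.zeros_like(W)                         # initialized as zeros
for _ in range(500):
    W, v_dW = momentum_step(W, 2 * W, v_dW, alpha=0.1)
print(W)                                        # close to the minimum at [0, 0]
```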

### RMSprop

Root Mean Square Prop (RMSprop) is another algorithm that slows down the update of the parameters in the directions where the gradient is large, in order to make it more stable and reach the minimum faster. Applied to mini-batches, at every iteration you compute $dW$ and $db$ and update the parameters in the following way (the squares are element-wise):

$$S_{dW} = \beta S_{dW} + (1-\beta) dW^2$$
$$S_{db} = \beta S_{db} + (1-\beta) db^2$$

$$W = W - \alpha \frac{dW}{\sqrt{S_{dW}}}$$
$$b = b - \alpha \frac{db}{\sqrt{S_{db}}}$$

So the larger the gradient in a given direction (numerator), the larger the adjustment that slows it down (denominator). One can add a small $\epsilon =10^{-8}$ to the denominator, i.e. divide by $\sqrt{S_{dW}}+\epsilon$, so that it is never zero.
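A minimal sketch of one RMSprop step, with $\epsilon$ added to the denominator. The function name and the toy gradient with two very different magnitudes are assumptions, used to show that both directions end up with a step of similar size.

```python
import numpy as np

def rmsprop_step(W, dW, S_dW, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update; the squares and the division are element-wise."""
    S_dW = beta * S_dW + (1 - beta) * dW ** 2
    W = W - alpha * dW / (np.sqrt(S_dW) + eps)
    return W, S_dW

# Two directions with very different gradient magnitudes (toy values).
W = np.array([1.0, 1.0])
dW = np.array([100.0, 0.01])
W_new, S_dW = rmsprop_step(W, dW, np.zeros_like(W))
print(W - W_new)                                # both steps have similar size
```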

### Adam optimization algorithm

Adam stands for Adaptive moment estimation. It combines momentum and RMSprop on mini-batch gradient descent. On every mini-batch step:

* Compute $dW$, $db$ with the current mini-batch
* Let $v_{dW} = \beta_1 v_{dW} + (1-\beta_1) dW$
* Let $S_{dW} = \beta_2 S_{dW} + (1-\beta_2) dW^2$
* Let $v^{*}_{dW} = \frac{v_{dW}}{(1-\beta_1^t)}$, where $t$ is the iteration number, as we correct the bias of the exponentially weighted average
* Let $S^{*}_{dW} = \frac{S_{dW}}{(1-\beta_2^t)}$, same as above
* Let $W = W - \alpha \frac{v^{*}_{dW}}{\sqrt{S^{*}_{dW}}+\epsilon}$

and the same for $b$, using $v_{db}$ and $S_{db}$.

Regarding the hyperparameters: the learning rate $\alpha$ generally needs to be tuned, while $\beta_1 = 0.9$ (the moving average of $dW$), $\beta_2=0.999$ (the moving average of $dW^2$) and $\epsilon = 10^{-8}$ are common default values.
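The steps listed above can be collected into one function, shown here for $W$ only. The toy cost $J(W)=\sum (W-5)^2$, the learning rate of 0.05 and the function name are assumptions for the demo; the bias correction uses the step counter `t`, starting at 1.

```python
import numpy as np

def adam_step(W, dW, v_dW, S_dW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for W at mini-batch step t (t starts at 1)."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW            # momentum term
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2       # RMSprop term
    v_hat = v_dW / (1 - beta1 ** t)                   # bias correction
    S_hat = S_dW / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(S_hat) + eps)
    return W, v_dW, S_dW

# Toy cost J(W) = sum((W - 5)**2), minimized at W = 5 (an assumption for the demo).
W = np.zeros(1)
v_dW, S_dW = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 2001):
    W, v_dW, S_dW = adam_step(W, 2 * (W - 5), v_dW, S_dW, t, alpha=0.05)
print(W)                                              # should end up near 5
```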


### Learning rate decay

One of the things that might help speed up your learning algorithm is to slowly reduce your learning rate over time.

Suppose a mini-batch has just 64 or 128 examples. Then as you iterate, your steps will be a little bit noisy. The algorithm will tend towards the minimum, but it won't exactly converge: it might just end up wandering around (blue), because you're using some fixed value for $\alpha$ and there's some noise in your different mini-batches. If instead $\alpha$ gets smaller over time, the steps you take become smaller, and you end up oscillating in a tighter region around the minimum (green), rather than wandering far away.

<img src="lrd.png" width="500" /> 

Let $\alpha_0 = 0.2$ and set

$$\alpha = \frac{1}{1+\text{decay}\_\text{rate} \cdot \text{epoch}} \alpha_0$$

Other versions are exponential decay:

$$\alpha = 0.95^{\text{epoch}} \cdot \alpha_0$$

or decay with the square root of the epoch number or of the mini-batch number:

$$\alpha = \frac{k}{\sqrt{\text{epoch}}} \cdot \alpha_0$$

$$\alpha = \frac{k}{\sqrt{t}} \cdot \alpha_0$$

where $k$ is a constant and $t$ is the number of the mini-batch.
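The schedules can be compared side by side with a few lines of Python. The function names, `decay_rate=1.0` and `k=1.0` are assumptions picked for illustration; only $\alpha_0 = 0.2$ and the base $0.95$ come from the text.

```python
import math

def inverse_decay(alpha0, decay_rate, epoch):
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.95):
    return base ** epoch * alpha0

def sqrt_decay(alpha0, epoch, k=1.0):
    return k / math.sqrt(epoch) * alpha0      # defined for epoch >= 1

alpha0 = 0.2
for epoch in range(1, 5):
    print(epoch,
          inverse_decay(alpha0, decay_rate=1.0, epoch=epoch),
          exponential_decay(alpha0, epoch),
          sqrt_decay(alpha0, epoch))
```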

### Local Optima

In high-dimensional spaces it is very unlikely to get stuck in a local optimum. Generally, a point where the gradient is zero is a saddle point.

The real problem is **plateaus**, which are regions where the gradient is close to zero over a vast area. This slows down the learning.

Algorithms like momentum, RMSprop and Adam can really help the learning algorithm get off a plateau.

<img src="gd_performance.gif" width="400" />  
 
In the case of a saddle point, RMSprop (black line) goes straight down: it doesn't really matter how small the gradients are, RMSprop rescales the learning rate so the algorithm gets through the saddle point faster than most.

# Week 3

## Hyperparameter Tuning

* $\alpha$: learning rate $\rightarrow$ **most important**
* $\beta$: for momentum $\rightarrow$ **second importance** ($\beta=0.9$)
* $\beta_1$, $\beta_2$, $\epsilon$: for Adam $\rightarrow$ **last importance** ($\beta_1 = 0.9, \beta_2=0.999, \epsilon = 10^{-8}$)
* \# of layers $\rightarrow$ **third importance**
* \# of hidden units $\rightarrow$ **second importance**
* learning rate decay $\rightarrow$ **third importance**
* mini-batch size $\rightarrow$ **second importance**


Using random search instead of grid search allows you to explore more parameters at the same cost (time). You can run an additional random search in a finer area where "good" parameters were found $\rightarrow$ *coarse to fine search*.

### Choosing the right scale

It is important to sample using an appropriate scale for the hyperparameters:

If $\alpha \in [0.0001,1]$, then picking uniformly in this interval means that about $90\%$ of the time we're picking a number between $0.1$ and $1$. It is better to pick uniformly on a logarithmic scale, so that the intervals $[0.0001,0.001], [0.001,0.01], [0.01,0.1], \text{ and } [0.1,1]$ are equally likely:

    import numpy as np
    r = -4 * np.random.rand() # r in (-4,0]
    alpha = 10 ** r # alpha in (0.0001,1]


For exponentially weighted averages with $\beta \in [0.9,0.999]$, or equivalently $1-\beta \in [0.001,0.1]$, we set $\beta = 1-10^{(-2r-1)}$, with

    r = np.random.rand()

It is important to avoid sampling from a linear scale because when $\beta$ is close to 1, the results become very sensitive even to very small changes in $\beta$. That is, if $\beta$ goes from $0.9$ to $0.9005$ (averaging over roughly 10 values in both cases), it's no big deal: this is hardly any change in your results. But if $\beta$ goes from $0.999$ to $0.9995$ (averaging over roughly 1000 versus 2000 values), this will have a huge impact.
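Both sampling schemes can be checked empirically with a short sketch. The sample size, the seed and the use of NumPy's `Generator.uniform` are assumptions; the ranges are the ones from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# alpha in [1e-4, 1]: sample the exponent uniformly in [-4, 0].
alpha = 10 ** rng.uniform(-4, 0, size=n)

# beta in [0.9, 0.999]: sample log10(1 - beta) uniformly in [-3, -1].
beta = 1 - 10 ** rng.uniform(-3, -1, size=n)

# Each decade of alpha should receive roughly a quarter of the samples.
counts, _ = np.histogram(np.log10(alpha), bins=[-4, -3, -2, -1, 0])
print(counts / n)
print(beta.min(), beta.max())
```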

## Batch Normalization

Batch normalization (BN) makes your hyperparameter search problem much easier and makes your neural network much more robust.

Just as normalizing the input features helps learning because it "regularizes" the shape of the region around the optimum, the same idea can be applied to every hidden layer of the neural network.

Since we do not want to force every hidden unit to have zero mean and unit variance, we apply the following transformation, in which the mean and variance are learned:

$$z^{[l](i)} = w^{[l]}a^{[l-1](i)}$$
$$z^{[l](i)}_{\text{norm}} = \frac{z^{[l](i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$$
$$\tilde{z}^{[l](i)} = \gamma^{[l]} z^{[l](i)}_{\text{norm}} + \beta^{[l]}$$

where $\mu$ and $\sigma^2$ are the mean and variance of $z^{[l]}$ over the mini-batch, and $\gamma$ and $\beta$ are parameters to be learned that set the mean and variance of each hidden layer (this $\beta$ is not the momentum hyperparameter). Notice that the parameter $b^{[l]}$ is not included anymore because it would disappear when subtracting the mean, so $\beta^{[l]}$ is enough. In backpropagation you use an optimization algorithm to update them, for example:

$$\beta^{[l]} = \beta^{[l]} - \alpha d\beta^{[l]}$$

Generally this is done with mini-batches:

* for each mini-batch $t=1,...T$
    * compute forward prop $X^{\{t\}}$
        * in each hidden layer use BN to replace $z^{[l]}$ with $\tilde{z}^{[l]}$
    * use back prop to compute $dw^{[l]}, d\gamma^{[l]}, d\beta^{[l]}$
    * update parameters $w, \gamma \text{ and } \beta$ with GD, RMSprop or Adam with a learning rate $\alpha$ 

Batch Normalization also helps a model generalize when the distribution of the input features shifts $\rightarrow$ **covariate shift**

Going through a NN, from the perspective of hidden layer $l$, the hidden unit values of the previous layer $a^{[l-1]}$ are changing all the time, and so it's suffering from the problem of covariate shift. So what batch norm does, is it reduces the amount that the distribution of these hidden unit values shifts around.

It also contributes to regularization: in mini-batch GD each mean and variance is computed on the mini-batch only, which adds some noise to the values $\tilde{z}^{[l]}$ within that mini-batch.

Therefore, at training time, for each mini-batch of size $m$ (with examples indexed by $i$):
* $\mu = \frac{1}{m}\sum_i z^{(i)}$
* $\sigma^2 = \frac{1}{m}\sum_i (z^{(i)}-\mu)^2$
* $z^{(i)}_{\text{norm}} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$
* $\tilde{z}^{(i)} = \gamma z^{(i)}_{\text{norm}} + \beta$

But at test time you may not have a mini-batch of the same size (you may even process a single example), and with a limited number of observations using their mean and variance can be suboptimal.

An alternative is to compute $\mu$ and $\sigma^2$ as exponentially weighted averages for each layer across the training mini-batches, and then use those estimated values at test time.
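A rough NumPy sketch of both modes for a single layer, assuming `z` has shape `(n_units, m)`. The function names, the `running` dictionary, the averaging coefficient called `momentum` here, and the synthetic mini-batches are all assumptions for illustration; only the forward pass is shown.

```python
import numpy as np

def batchnorm_train(z, gamma, beta, running, momentum=0.9, eps=1e-8):
    """Normalize z (n_units, m) with mini-batch statistics and update running ones."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)
    # Exponentially weighted averages of mu and sigma^2, kept for test time.
    running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return gamma * z_norm + beta

def batchnorm_test(z, gamma, beta, running, eps=1e-8):
    """Normalize with the running estimates instead of batch statistics."""
    z_norm = (z - running["mu"]) / np.sqrt(running["var"] + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
n_units = 3
gamma, beta = np.ones((n_units, 1)), np.zeros((n_units, 1))
running = {"mu": np.zeros((n_units, 1)), "var": np.ones((n_units, 1))}
for _ in range(200):                                   # 200 hypothetical mini-batches
    z = rng.normal(loc=4.0, scale=2.0, size=(n_units, 64))
    z_tilde = batchnorm_train(z, gamma, beta, running)
print(running["mu"].ravel(), running["var"].ravel())   # roughly 4 and 4
single = batchnorm_test(np.full((n_units, 1), 4.0), gamma, beta, running)
print(single.ravel())                                  # values close to 0
```

The last call shows that a single test example can be normalized without any mini-batch statistics.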

### Multi-class classification

In a multi-class classification problem with $C$ classes the last layer $L$ has $C$ output units (so $\hat{y}^{(i)}$ has dimension $C \times 1$), where each unit estimates the probability of a class: $P(Y=c|X)$.

This last layer $L$ is called the **softmax layer**. Once you compute $z^{[L]} = w^{[L]} a^{[L-1]}+b^{[L]}$, it applies the element-wise exponential $t = e^{(z^{[L]})}$ and then computes

$$\hat{y} = a^{[L]} = \frac{e^{(z^{[L]})}}{\sum_{j=1}^Ct_j}$$

so that the units of $a^{[L]}$ sum to $1$. The difference from the other activation functions is that it takes a vector and returns a vector (instead of mapping a single number to a single number).

Softmax regression is a generalization of logistic regression: if $C=2$ it would be the same.

Softmax regression uses the following loss function over the classes:

$$L(\hat{y},y) = -\sum_{j=1}^C y_j \log(\hat{y_j})$$

where, for example, for $y = [0, 1, 0, 0]^T$ (second class) it becomes $-\log(\hat{y}_2)$, so it tries to increase $\hat{y}_2$, i.e. the probability of $y$ being the second class.

The cost function is defined as 

$$J(w^{[1]},b^{[1]},...) = \frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)})$$

Notice that $y$ is a matrix $C \times m$.
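The softmax activation and the cost can be sketched in NumPy as below, assuming `Z` and the one-hot label matrix `Y` both have shape $(C, m)$. The function names and toy values are hypothetical, and subtracting the column maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax of Z with shape (C, m)."""
    T = np.exp(Z - Z.max(axis=0, keepdims=True))   # shift for numerical stability
    return T / T.sum(axis=0, keepdims=True)

def cross_entropy_cost(Y_hat, Y):
    """J = (1/m) * sum_i L(y_hat_i, y_i), with one-hot Y of shape (C, m)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat)) / m

Z = np.array([[5.0, 0.0],
              [2.0, 0.0],
              [-1.0, 0.0],
              [3.0, 0.0]])                        # hypothetical z^{[L]}, C = 4, m = 2
Y = np.array([[1, 0], [0, 1], [0, 0], [0, 0]])    # classes 1 and 2, one-hot
Y_hat = softmax(Z)
print(Y_hat.sum(axis=0))                          # each column sums to 1
print(cross_entropy_cost(Y_hat, Y))
```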

## Tensorflow

In [1]:

    # conda create -n myenv37tf python=3.7 tensorflow=1
    import numpy as np
    import tensorflow as tf

Let's define the cost function $J(w) = w^2-10w+25 = (w-5)^2$.

It has its minimum at $w^* = 5$.

In [2]:

    # let's define the parameter
    w = tf.Variable(0,dtype=tf.float32)
    # cost = tf.add(tf.add(w**2, tf.multiply(-10.,w)),25) # old syntax
    cost = w**2-10*w+25
    train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)



In [3]:

    init = tf.global_variables_initializer()
    session = tf.Session()
    session.run(init)        # initialize the variables
    print(session.run(w))    # evaluate and print w

0.0


In [4]:

    # run one step of GD and evaluate w
    session.run(train)
    print(session.run(w))

0.099999994


In [5]:

    # run one thousand steps of GD and evaluate w
    for i in range(1000):
        session.run(train)
    print(session.run(w))

4.9999886


It is only necessary to implement forward propagation with a cost function, because TensorFlow is able to do the back-propagation by itself.

Let's now feed in some input data.

In [6]:

    coefficient = np.array([[1.],[-10.],[25.]])
    w = tf.Variable(0,dtype=tf.float32)
    x = tf.placeholder(tf.float32, [3,1]) # A placeholder is an object whose value you can specify only later.
    cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]
    train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

    init = tf.global_variables_initializer()
    session = tf.Session()
    session.run(init)        # initialize the variables
    print(session.run(w))    # evaluate and print w

0.0


In [7]:

    # run one step of GD and evaluate w
    session.run(train, feed_dict={x:coefficient}) # To specify values for a placeholder, you can pass in values by using a "feed dictionary"
    print(session.run(w))

0.099999994


In [8]:

    # run one thousand steps of GD and evaluate w
    for i in range(1000):
        session.run(train, feed_dict={x:coefficient})
    print(session.run(w))

4.9999886


Alternatively, one can use this syntax:

In [12]:

    coefficient = np.array([[1.],[-10.],[25.]])
    w = tf.Variable(0,dtype=tf.float32)
    x = tf.placeholder(tf.float32, [3,1]) # input data
    cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]
    train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

    init = tf.global_variables_initializer()
    with tf.Session() as session:                      # create a session
        session.run(init)                              # initialize the variables
        session.run(train, feed_dict={x:coefficient})  # run one step of GD
        print(session.run(w))                          # print the updated w

0.099999994
