
# Deep Learning 


* Choice of hyperparameters: the way we choose to train the model

* Model architecture: structure of the layers

# Gradient descent
* Stochastic gradient descent
* Batch gradient descent
* Mini-batch gradient descent

<center><img src="https://nurserytovarsity.com/wp-content/uploads/2017/07/imagg6-1280x720.png" width="600"></center>


# Gradient descent
* Stochastic gradient descent
* Batch gradient descent
* Mini-batch gradient descent

<center><img src="https://cdn-images-1.medium.com/max/800/1*0_bWlD9aWRvZOzT7QJvHWw.png" width="700"></center>
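
The three variants differ only in how many samples feed each gradient estimate. A minimal NumPy sketch, assuming a toy linear-regression problem; the function name `gradient_descent`, the data and the learning rates are illustrative choices, not part of the original slides:

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.1, epochs=100, seed=0):
    """Fit y ~ X @ w under MSE; batch_size selects the variant."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    n_updates = 0
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # MSE gradient on this batch
            w -= lr * grad
            n_updates += 1
    return w, n_updates

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w  # noise-free toy targets

w_batch, updates_batch = gradient_descent(X, y, batch_size=200)     # batch: all samples per update
w_mini, updates_mini = gradient_descent(X, y, batch_size=32)        # mini-batch: 32 samples per update
w_sgd, updates_sgd = gradient_descent(X, y, batch_size=1, lr=0.01)  # stochastic: 1 sample per update
```

All three see the same data each epoch; smaller batches simply trade a noisier gradient for more weight updates per epoch.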


# Loss function

#### Classification:
- Binary cross entropy: $L_{binary} = -\frac{1}{N}\sum^N_{i=1} [y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)]$
    
    
- Categorical cross entropy: $L_{categorical} = -\frac{1}{N}\sum^N_{i=1}\sum^c_{j=1} [y_{ij}\log(\hat{y}_{ij})]$
    
#### Regression

- Root mean squared error (RMSE):  $L_{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}{(y_i - \hat{y}_i)^2}}$
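
A minimal pure-Python sketch of the three losses, using the usual sign convention in which the averaged log-likelihood is negated so that the cross-entropy is non-negative. The function names and the `eps` clipping constant are assumptions made for illustration:

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true holds one-hot rows, y_pred holds one probability row per sample
    total = 0.0
    for target_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(target_row, pred_row):
            total += y * math.log(max(p, eps))
    return -total / len(y_true)

def rmse(y_true, y_pred):
    squared = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

bce = binary_cross_entropy([1, 0], [0.5, 0.5])
cce = categorical_cross_entropy([[1, 0, 0]], [[0.5, 0.25, 0.25]])
err = rmse([0.0, 0.0], [3.0, 4.0])
```

An uninformative prediction of 0.5 gives a cross-entropy of $\log 2$ in both the binary and the categorical example.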

# Loss function


- **Multi-class classification**
    - **softmax** output layer with **categorical** cross-entropy and **one-hot** targets.
- **Binary or multi-label classification**
    - **sigmoid** output layer with **binary** cross-entropy and **binary** vector targets.
- **Regression**
    - **linear** output layer with **RMSE**
    - Not performing? Try **discretizing output** through binning. Otherwise, go for a different learning algorithm.
    
<center><img src="../images/loss_functions_table.png" width="700"></center>
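
The difference between the output layers can be sketched in NumPy. The logits are made-up values and the helper names are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stabilise exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

logits = np.array([[2.0, 1.0, 0.1]])  # raw outputs of the last layer for one sample
multi_class = softmax(logits)   # classes compete: probabilities sum to 1
multi_label = sigmoid(logits)   # each class gets its own independent probability
regression = logits             # linear output: values are passed through unchanged
```

Softmax suits exactly-one-class targets, while sigmoid lets several labels be "on" at once.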


# Weight updates

**Learning rate**: a small value $\eta$, typically between 1.0 and $10^{-6}$

![](../images/optimal_learning_rate.jpg)


# Weight updates

**Learning rate**: a small value $\eta$, typically between 1.0 and $10^{-6}$

![](../images/learningrates.jpeg)
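
The effect of the learning rate can be reproduced on a one-dimensional toy problem. A minimal sketch, assuming the objective $f(w) = w^2$ and three hand-picked learning rates:

```python
def minimise(lr, steps=20, w0=5.0):
    """Plain gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

too_small = minimise(lr=0.001)  # barely moves away from the start
reasonable = minimise(lr=0.1)   # approaches the minimum at w = 0
too_large = minimise(lr=1.1)    # overshoots further on every step and diverges
```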


# Weight updates 

**Momentum:** take into account the gradient estimates from previous batches

_SGD with momentum, Nesterov momentum_
    
![](../images/nesterov_momentum.jpeg)

## Momentum (further reading)

The main difference is that in classical momentum you first correct your velocity and then take a big step according to that velocity (and then repeat), whereas in Nesterov momentum you first take a step in the velocity direction and then correct the velocity vector based on the new location (then repeat).

i.e. without momentum:

`vW(t+1) = - scaling .* gradient_F( W(t) )`

`W(t+1) = W(t) + vW(t+1)`

Classical momentum:

`vW(t+1) = momentum.*vW(t) - scaling .* gradient_F( W(t) )`

`W(t+1) = W(t) + vW(t+1)`

While Nesterov momentum is this:

`vW(t+1) = momentum.*vW(t) - scaling .* gradient_F( W(t) + momentum.*vW(t) )`

`W(t+1) = W(t) + vW(t+1)`
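
The three update rules above can be sketched in plain Python. This assumes the toy objective $F(W) = W^2$, with `lr` playing the role of `scaling`; the function name `run` and all constants are illustrative:

```python
def run(update, steps=100, lr=0.02, momentum=0.9):
    grad = lambda w: 2.0 * w  # gradient of the toy objective F(w) = w**2
    w, v = 5.0, 0.0
    for _ in range(steps):
        if update == "plain":
            v = -lr * grad(w)
        elif update == "classical":
            v = momentum * v - lr * grad(w)
        elif update == "nesterov":
            v = momentum * v - lr * grad(w + momentum * v)  # gradient at the look-ahead point
        else:
            raise ValueError(f"unknown update rule: {update}")
        w = w + v
    return w

w_plain = run("plain")
w_classical = run("classical")
w_nesterov = run("nesterov")
```

With this small learning rate both momentum variants end up closer to the minimum at 0 than the plain update does after the same number of steps.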

![](https://miro.medium.com/max/600/1*4F6O5Jo936tykq0M1E_Z2Q.png)

[source](https://stats.stackexchange.com/questions/179915/whats-the-difference-between-momentum-based-gradient-descent-and-nesterovs-acc#:~:text=The%20main%20difference%20is%20in,new%20location%20(then%20repeat).)
[source](https://medium.com/konvergen/momentum-method-and-nesterov-accelerated-gradient-487ba776c987)

# Weight updates 

**Adaptive learning rate**: adapt the learning rate per parameter based on the gradient history (reducing the sensitivity to the hyperparameter choice).

_AdaGrad, AdaDelta, RMSprop_

**Momentum & adaptive learning rate**: _Adam, Nadam_

![](../images/optimizers_1.gif)
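
How Adam combines the two ideas can be sketched for a single weight. The default values for `beta1`, `beta2` and `eps` follow common practice, while the toy objective $f(w) = w^2$ and the learning rate are assumptions made for illustration:

```python
import math

def adam(grad, w, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # adaptive part: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_final = adam(lambda w: 2.0 * w, w=5.0)
# The first step has size ~lr no matter how large the gradient is:
first_step_steep = adam(lambda w: 2000.0 * w, w=5.0, steps=1)
```

Dividing by the root of the squared-gradient average is what makes the step size largely independent of the gradient scale.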

## Optimizers (further reading)

[UvA notes on optimizers](https://uvadlc.github.io/lectures/dec2020/lecture3.1.pdf)

# Weight updates

![half center](../images/optimizers.png)


# Weight initialization

There are a few contradictory requirements:

- Weights need to be small enough in magnitude $\rightarrow$ otherwise the output values explode
- Weights need to be large enough in magnitude $\rightarrow$ otherwise the signal is too weak to propagate

![center](../images/backpropagation_0.gif)

<sub>*Ryszard Tadeusiewicz "Sieci neuronowe", Kraków 1992*</sub>

# Weight initialization

**Naive approaches**: All zero

Every hidden unit will get zero signal. No matter what the input was, the output would be the same!

![center](../images/forward_pass_0.gif)


# Weight initialization

**Naive approaches**: All constant (e.g. all 1.0)

- The weighted input to each neuron in a layer will be the same,
- therefore the update each neuron in a layer receives will be the same,
- this will prevent different neurons in a layer from learning different things.

![center](../images/forward_pass_0.gif)
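
The symmetry problem of the naive approaches can be checked numerically. A minimal NumPy sketch, assuming a network with one tanh hidden layer, no biases and an MSE loss; the shapes, data and function name are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))  # 8 samples, 4 input features
y = rng.normal(size=(8, 1))

def hidden_gradient(W1, W2):
    """Gradient of the MSE with respect to W1 (one column per hidden unit)."""
    h = np.tanh(x @ W1)
    y_hat = h @ W2
    d_out = 2.0 * (y_hat - y) / len(x)
    d_hidden = (d_out @ W2.T) * (1.0 - h ** 2)  # backpropagate through tanh
    return x.T @ d_hidden

grad_zero = hidden_gradient(np.zeros((4, 3)), np.zeros((3, 1)))
grad_const = hidden_gradient(np.full((4, 3), 0.5), np.full((3, 1), 0.5))
grad_random = hidden_gradient(rng.normal(size=(4, 3)), rng.normal(size=(3, 1)))
```

With all-zero weights the hidden layer receives no gradient at all, with constant weights every hidden unit receives the same gradient, and only the random start gives each unit its own update.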


# Weight initialization

**Solution**: Break symmetry with a random initialization.
- Xavier init: 
$w\sim\sqrt{\frac{2}{n_{in}+n_{out}}}\cdot N(0,1)$ ([Glorot et al.](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf))
- He init: $w\sim\sqrt{\frac{2}{n_{in}}}\cdot N(0,1)$ ([He et al.](http://arxiv-web3.library.cornell.edu/abs/1502.01852))

_where $n_{in}$ and $n_{out}$ represent the number of incoming and outgoing connections_
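
Both formulas, and the reason the scale matters, can be sketched in NumPy. The depth, width and the two "bad" scales (0.01 and 1.0) are assumptions chosen only to make the effect visible:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    return np.sqrt(2.0 / (n_in + n_out)) * rng.standard_normal((n_in, n_out))

def he_init(n_in, n_out):
    return np.sqrt(2.0 / n_in) * rng.standard_normal((n_in, n_out))

def final_std(init, activation, depth=10, width=256):
    """Standard deviation of the activations after `depth` layers."""
    h = rng.standard_normal((512, width))
    for _ in range(depth):
        h = activation(h @ init(width, width))
    return h.std()

relu = lambda z: np.maximum(z, 0.0)
too_small = lambda n_in, n_out: 0.01 * rng.standard_normal((n_in, n_out))
too_large = lambda n_in, n_out: 1.0 * rng.standard_normal((n_in, n_out))

std_he = final_std(he_init, relu)             # stays at a usable scale
std_xavier = final_std(xavier_init, np.tanh)  # stays at a usable scale
std_small = final_std(too_small, relu)        # signal vanishes
std_large = final_std(too_large, relu)        # signal explodes
```

In this sketch He is paired with ReLU and Xavier with tanh, which is the usual pairing.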

# Recap Hyperparameters
* **Gradient descent**: dependent on data, usually mini-batch
* **Error function**: dependent on the type of problem; influences final activation
* **Weight updates**: dependent on data, usually Adam is a good choice
* **Weight initialisation**: He or Xavier

# Hyperparameters Questions

* Which is computationally more expensive: SGD, batch gradient descent or mini-batch gradient descent?

 
* Why is cross-entropy preferred over e.g. classification error (/accuracy)? 

Accuracy is not a continuously differentiable function of the weights!


* Which optimizer combines both an adaptive learning rate with momentum? 


* Why would you not initialize the weights to 0? 


* In what cases would you use He/Xavier initialization over random initialization? 
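
The cross-entropy question above can be illustrated with a one-weight logistic model. A minimal sketch on made-up data; the helper names are hypothetical:

```python
import math

def predict(w, x):
    return 1.0 / (1.0 + math.exp(-w * x))  # one-weight logistic model

def accuracy(w, xs, ys):
    hits = sum((predict(w, x) > 0.5) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)

def cross_entropy(w, xs, ys):
    losses = [-(y * math.log(predict(w, x)) + (1 - y) * math.log(1 - predict(w, x)))
              for x, y in zip(xs, ys)]
    return sum(losses) / len(xs)

xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
acc_a, acc_b = accuracy(1.0, xs, ys), accuracy(1.1, xs, ys)
ce_a, ce_b = cross_entropy(1.0, xs, ys), cross_entropy(1.1, xs, ys)
```

Nudging the weight from 1.0 to 1.1 leaves the accuracy unchanged, so it provides no gradient signal, while the cross-entropy decreases smoothly.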

# Architecture 
* Fully-connected
* Activation 
* Bias
* Normalization
* Regularization

# Fully-Connected layer 

All inputs of one layer are connected to every activation unit of the next layer.

Also known as _Linear_ or _Dense_ layer

<center><img src="https://cs231n.github.io/assets/nn1/neural_net2.jpeg" width="900"></center>
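
In code, a fully-connected layer is a matrix multiplication plus a bias vector. A minimal NumPy sketch with made-up sizes (4 inputs, 3 output units); the function name `dense` is an illustrative choice:

```python
import numpy as np

def dense(x, W, b):
    """Fully-connected layer: every input feeds every output unit."""
    return x @ W + b

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))  # batch of 5 samples with 4 features each
W = rng.normal(size=(4, 3))  # one weight per (input, output unit) pair
b = np.zeros(3)              # one bias per output unit
out = dense(x, W, b)
n_params = W.size + b.size   # 4 * 3 weights + 3 biases
```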



# Activation

Introduces non-linearity into the network. 

No trainable parameters.
![](../images/activation_sigmoid_tanh_relu.png)
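
Why the non-linearity matters can be checked directly: without it, stacked layers collapse into a single linear layer. A minimal NumPy sketch with made-up shapes and bias terms omitted for brevity:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))

linear_stack = (x @ W1) @ W2         # two layers without an activation ...
single_layer = x @ (W1 @ W2)         # ... equal one layer with merged weights
nonlinear_stack = relu(x @ W1) @ W2  # a ReLU in between breaks the equivalence
```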


# Bias

An additional parameter that allows you to shift the activation function to the left or right (which may be critical for successful learning).

![](https://matthewmazur.files.wordpress.com/2018/03/neural_network-7.png?w=584)
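
The shift can be seen on a single sigmoid neuron. A minimal sketch with made-up values for the weight and bias:

```python
import math

def neuron(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

w = 2.0
at_origin = neuron(0.0, w, 0.0)  # without a bias the sigmoid crosses 0.5 at x = 0
b = -4.0
midpoint_with_bias = -b / w      # with a bias the crossing moves to x = -b / w
shifted = neuron(midpoint_with_bias, w, b)
```

Changing `w` alone only makes the curve steeper or flatter; the bias is what moves it along the x-axis.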

# Batch Normalization

Batch normalization ([Ioffe et al.](http://arxiv.org/abs/1502.03167))

- Normalize the layer inputs over each mini-batch.

- This helps keep all layers operating in a near-optimal “regime” of their activation functions.

- Because the gradients depend less on the scale of the weights, higher learning rates can be used.

- Training is accelerated as a result: fewer iterations are required to converge to a given loss value.

<center><img src="../images/bn_algorithm.png" width="700"></center>

## Batch Norm (notes)

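Batch Norm keeps 4 parameters per feature:
- scale $\gamma$ (trainable)
- shift $\beta$ (trainable)
- running mean $\mu$ (not trainable; used at the inference stage)
- running variance $\sigma^2$ (not trainable; used at the inference stage)

$\epsilon$ is a small fixed constant added for numerical stability; it is not learned.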
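
A minimal NumPy sketch of a batch-norm layer, assuming an exponential moving average with `momentum=0.9` for the running statistics. The class name and default values are illustrative, not a specific library's API:

```python
import numpy as np

class BatchNorm:
    def __init__(self, n_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(n_features)          # scale, updated by gradient descent
        self.beta = np.zeros(n_features)          # shift, updated by gradient descent
        self.running_mean = np.zeros(n_features)  # tracked statistics, no gradient
        self.running_var = np.ones(n_features)
        self.momentum, self.eps = momentum, eps   # eps is a fixed constant

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)  # statistics of this mini-batch
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var  # inference stage
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=5.0, size=(64, 4))
bn = BatchNorm(4)
out = bn(batch, training=True)
```

During training the output of each feature has roughly zero mean and unit variance before $\gamma$ and $\beta$ rescale it.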

## Normalization (further reading)

- Weight normalization ([Salimans et al.](https://arxiv.org/abs/1602.07868))

- Layer normalization ([Ba et al.](https://arxiv.org/abs/1607.06450))

# Regularization

The great flexibility of neural networks makes them very powerful; however, this comes at the price of easily overfitting the data.

![center](../images/overfitting.jpeg)

# Dropout

- "Drop" neurons in the network with probability $p$; a new random set is dropped for every mini-batch during training

- No trainable parameters
![](../images/dropout.png)



# Dropout

- The gradient for each unit is computed with respect to the error, but it also depends on what all other units are doing. Therefore certain neurons may end up fixing the mistakes of other neurons.
- Dropout prevents over-reliance on a subset of the neurons in a layer.
- Every neuron becomes more robust.
![](../images/dropout.png)
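
A minimal NumPy sketch of dropout, assuming the common "inverted" variant in which the surviving activations are rescaled during training so that nothing needs to change at inference. The function name and signature are illustrative:

```python
import numpy as np

def dropout(x, p, training, rng):
    if not training or p == 0.0:
        return x                      # inference: all neurons are kept
    keep = rng.random(x.shape) >= p   # each unit is dropped with probability p
    return x * keep / (1.0 - p)       # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
train_out = dropout(activations, p=0.5, training=True, rng=rng)
test_out = dropout(activations, p=0.5, training=False, rng=rng)
```

Calling the function again during training draws a fresh mask, so a different set of neurons is dropped each time.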



# Conclusion

**Hyperparameters**: 
- gradient descent type 
- loss function
- weight updates
- weight initialisation

**Architecture**: 
- Fully-connected layers
- Activation function
- Bias
- (batch) normalization
- Regularization with dropout

![footer_logo](../images/logo.png)

# Neural Networks in Keras
Let's put our knowledge into practice.

## [Keras basics](01-04-keras_basics.ipynb)
