
# Deep Learning 


* Choice of hyperparameters: the way we choose to train the model

* Model architecture: structure of the layers

# Gradient descent
* Stochastic gradient descent
* Batch gradient descent
* Mini-batch gradient descent

<center><img src="https://nurserytovarsity.com/wp-content/uploads/2017/07/imagg6-1280x720.png" width="600"></center>


# Gradient descent
* Stochastic gradient descent
* Batch gradient descent
* Mini-batch gradient descent

<center><img src="https://cdn-images-1.medium.com/max/800/1*0_bWlD9aWRvZOzT7QJvHWw.png" width="700"></center>
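
The three variants differ only in how many samples feed each weight update. A minimal sketch, assuming a toy one-parameter model `y = w * x` with a squared-error loss; the helper names `gradient` and `train`, the data, and the learning rate are illustrative choices, not part of the original slides:

```python
import random

def gradient(w, batch):
    """Mean gradient of 0.5 * (w * x - y)^2 over the samples in a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.1, epochs=100, seed=0):
    """Fit y = w * x; batch_size selects the gradient descent variant."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new sample order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)  # one weight update per batch
    return w

# Toy, noise-free data generated with a true weight of 3 (an assumption).
data = [(x / 10, 3.0 * x / 10) for x in range(1, 21)]

w_sgd = train(data, batch_size=1)              # stochastic: 1 sample per update
w_mini = train(data, batch_size=4)             # mini-batch: 4 samples per update
w_batch = train(data, batch_size=len(data))    # batch: whole dataset per update
```

With a batch size of 1 there are 20 updates per epoch, with the full dataset only one; mini-batch sits in between, which is why it is the usual compromise.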


# Loss function

#### Classification
- Binary cross entropy: $L_{binary} = -\frac{1}{N}\sum^N_{i=1} [y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)]$
    
    
- Categorical cross entropy: $L_{categorical} = -\frac{1}{N}\sum^N_{i=1}\sum^c_{j=1} y_{ij}\log(\hat{y}_{ij})$
    
#### Regression

- Root mean squared error (RMSE):  $L_{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}{(y_i - \hat{y}_i)^2}}$
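
The three losses can be sketched in plain Python as follows. The sketch uses the usual convention that the cross-entropy losses carry a leading minus sign, so smaller values mean better predictions. The clipping constant `eps` is an added assumption to avoid `log(0)`; it is not part of the formulas.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; y_true holds 0/1 labels, y_pred probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy; each row of y_true is one-hot."""
    total = 0.0
    for target_row, pred_row in zip(y_true, y_pred):
        total += sum(t * math.log(max(p, eps))
                     for t, p in zip(target_row, pred_row))
    return -total / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error over N samples."""
    squared = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))
```

For example, `binary_cross_entropy([1, 0], [0.5, 0.5])` evaluates to `log(2)`, the loss of a maximally uncertain classifier.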

# Loss function


- **Multi-class classification**
    - **softmax** output layer with **categorical** cross-entropy and **one-hot** targets.
- **Binary or multi-label classification**
    - **sigmoid** output layer with **binary** cross-entropy and **binary** vector targets.
- **Regression**
    - **linear** output layer with **RMSE**
    - Not performing well? Try **discretizing the output** through binning. Otherwise, try a different learning algorithm.
    
<center><img src="../images/loss_functions_table.png" width="700"></center>
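
The difference between the two classification output layers can be sketched as follows: softmax normalizes over all classes so the outputs sum to 1, while sigmoid treats every output independently. The example logits are made-up values for illustration.

```python
import math

def softmax(logits):
    """Turn a list of logits into class probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Squash one logit into (0, 1), independently of the other outputs."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

logits = [2.0, 1.0, 0.1]  # hypothetical raw outputs of the last layer

# Multi-class: exactly one class is correct, so probabilities compete.
class_probs = softmax(logits)

# Multi-label: each label is its own yes/no decision.
label_probs = [sigmoid(z) for z in logits]
```

`class_probs` pairs with categorical cross-entropy and one-hot targets; `label_probs` pairs with binary cross-entropy and a binary target vector.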


# Weight updates

**Learning rate**: small value $\eta$, typically between $10^{-6}$ and 1.0

![](../images/optimal_learning_rate.jpg)


# Weight updates

**Learning rate**: small value $\eta$, typically between $10^{-6}$ and 1.0

![](../images/learningrates.jpeg)
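
The effect of the learning rate can be reproduced on a toy problem. A minimal sketch, assuming the loss $f(w) = w^2$ (gradient $2w$, minimum at $w = 0$); the function name `descend` and the three rates are illustrative:

```python
def descend(lr, steps=20, w=1.0):
    """Run plain gradient descent on f(w) = w^2 and return the final w."""
    for _ in range(steps):
        w -= lr * 2.0 * w  # gradient of w^2 is 2w
    return w

too_low = descend(lr=0.01)   # barely moves away from the start
good = descend(lr=0.4)       # converges quickly to the minimum
too_high = descend(lr=1.1)   # overshoots further every step and diverges
```

Each step multiplies `w` by `1 - 2 * lr`, so the run converges only while that factor stays between -1 and 1.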


# Weight updates 

**Momentum:** take into account the gradient estimates from previous batches

_SGD with momentum, Nesterov momentum_
    
![](../images/nesterov_momentum.jpeg)
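
A minimal sketch of both update rules for a single weight, assuming the common velocity formulation; `sgd_momentum`, the toy loss $(w - 2)^2$, and the hyperparameter values are illustrative assumptions:

```python
def sgd_momentum(grad_fn, w, lr=0.1, momentum=0.9, steps=300, nesterov=False):
    """Gradient descent with classical or Nesterov momentum on one weight."""
    v = 0.0  # velocity: decaying sum of past gradient steps
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead position.
        position = w + momentum * v if nesterov else w
        v = momentum * v - lr * grad_fn(position)
        w = w + v
    return w

def grad(w):
    """Gradient of the toy loss (w - 2)^2, which has its minimum at w = 2."""
    return 2.0 * (w - 2.0)

w_classical = sgd_momentum(grad, w=0.0)
w_nesterov = sgd_momentum(grad, w=0.0, nesterov=True)
```

The only difference between the two variants is where the gradient is evaluated: at the current weight, or at the point the accumulated velocity is about to carry it to.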

# Weight updates 

**Adaptive learning rate**: adapt the learning rate based on the gradient history

_AdaGrad, AdaDelta, RMSprop_

**Momentum & adaptive learning rate**: _Adam, Nadam_

![](../images/optimizers_1.gif)
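
How Adam combines the two ideas can be sketched for a single weight. This follows the standard Adam update with bias correction; the toy loss $(w - 2)^2$ and the learning rate of 0.1 are assumptions chosen for the demonstration:

```python
import math

def adam(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam update rule applied to one weight."""
    m = 0.0  # first moment: running mean of gradients (the momentum part)
    v = 0.0  # second moment: running mean of squared gradients (the adaptive part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy loss (w - 2)^2 with gradient 2 * (w - 2).
w_final = adam(lambda w: 2.0 * (w - 2.0), w=0.0)
```

Dividing by the square root of `v_hat` scales the step per weight, which is the adaptive learning rate; `m_hat` replaces the raw gradient, which is the momentum.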

# Weight updates

![half center](../images/optimizers.png)


# Weight initialization

- Determines where your search starts
- Too small: gradients in first layer will become small
- Too big: activations will be extreme

# Weight initialization

Naive approaches: 
* All zero
* Random uniform distribution

Solution: **Fan-scaled random** initialization to limit output variance
- Xavier init: 
$w\sim\sqrt{\frac{2}{n_{in}+n_{out}}}\cdot N(0,1)$ ([Glorot et al.](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf))
- He init: $w\sim\sqrt{\frac{2}{n_{in}}}\cdot N(0,1)$ ([He et al.](http://arxiv-web3.library.cornell.edu/abs/1502.01852))

_where $n_{in}$ and $n_{out}$ represent the number of incoming and outgoing connections_
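
Both schemes scale standard-normal samples by a factor that depends on the layer size. A minimal sketch using only the standard library; the layer size of 256 by 128 and the helper `sample_std` are assumptions for the demonstration:

```python
import math
import random

def xavier_init(n_in, n_out, rng):
    """Weights ~ sqrt(2 / (n_in + n_out)) * N(0, 1)."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, 1.0) * std for _ in range(n_out)] for _ in range(n_in)]

def he_init(n_in, n_out, rng):
    """Weights ~ sqrt(2 / n_in) * N(0, 1)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, 1.0) * std for _ in range(n_out)] for _ in range(n_in)]

def sample_std(matrix):
    """Empirical standard deviation of all entries in a weight matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((w - mean) ** 2 for w in values) / len(values))

rng = random.Random(0)
w_xavier = xavier_init(256, 128, rng)  # target std: sqrt(2 / 384), about 0.072
w_he = he_init(256, 128, rng)          # target std: sqrt(2 / 256), about 0.088
```

In Keras these roughly correspond to the `glorot_normal` and `he_normal` initializers, although the built-in versions draw from a truncated normal distribution.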

# Recap Hyperparameters
* **Gradient descent**: dependent on data, usually mini-batch
* **Loss function**: dependent on the type of problem; influences final activation
* **Weight updates**: dependent on data, usually Adam is a good choice
* **Weight initialization**: He or Xavier

# Hyperparameters Questions

* Which is computationally more expensive: SGD, batch gradient descent or mini-batch gradient descent?

 
* Why is cross-entropy preferred over e.g. classification error (or accuracy)? 


* Which optimizer combines an adaptive learning rate with momentum? 


* Why would you not initialize the weights to 0? 


* In what cases would you use He/Xavier initialization over random initialization? 

# Architecture 
* Fully-connected
* Activation 
* Dropout
* Normalization

# Fully-Connected layer 

Every input of one layer is connected to every activation unit of the next layer. 

Also known as _Linear_ or _Dense_ layer

<center><img src="https://cs231n.github.io/assets/nn1/neural_net2.jpeg" width="900"></center>
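
The forward pass of such a layer is one weighted sum plus a bias per output unit. A minimal sketch with made-up numbers; the function name `dense` and the weight layout (one row per output unit) are illustrative choices:

```python
def dense(inputs, weights, biases):
    """Fully-connected forward pass; weights[j][i] links input i to unit j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

x = [1.0, 2.0, 3.0]          # 3 inputs
W = [[0.1, 0.2, 0.3],        # weights of output unit 0
     [-0.5, 0.0, 0.5]]       # weights of output unit 1
b = [0.0, 1.0]               # one bias per output unit

out = dense(x, W, b)         # approximately [1.4, 2.0]

# Every input connects to every unit: 3 * 2 weights + 2 biases.
n_params = len(W) * len(W[0]) + len(b)
```

The parameter count grows with the product of input and output size, which is why wide fully-connected layers dominate the size of many models.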



# Activation

Introduces non-linearity into the network. 

No trainable parameters.
![](../images/activation_sigmoid_tanh_relu.png)
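
The three functions in the figure can be written in a few lines each. This is a plain-Python sketch for single values; frameworks apply the same functions element-wise to whole tensors.

```python
import math

def sigmoid(z):
    """Maps any real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)

def tanh(z):
    """Maps any real number into (-1, 1), centered on 0."""
    return math.tanh(z)

def relu(z):
    """Passes positive values through unchanged and clips negatives to 0."""
    return max(0.0, z)

values = [-2.0, 0.0, 2.0]
activated = {
    "sigmoid": [sigmoid(z) for z in values],
    "tanh": [tanh(z) for z in values],
    "relu": [relu(z) for z in values],
}
```

None of these functions has weights of its own, which is why the layer adds no trainable parameters.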


# Dropout

No trainable parameters.
![](../images/dropout.png)
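
A minimal sketch of the mechanism, assuming the common "inverted dropout" formulation in which the surviving activations are rescaled during training so that nothing has to change at inference time. The function signature and the rate of 0.5 are illustrative.

```python
import random

def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(activations)  # inference: pass everything through
    keep = 1.0 - rate
    # Scale survivors by 1 / keep so the expected activation stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10000

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
test_out = dropout(hidden, rate=0.5, training=False, rng=rng)
```

The drop rate is a fixed hyperparameter, not something the network learns, so the layer has no trainable parameters.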



# Normalization


- Batch normalization ([Ioffe et al.](http://arxiv.org/abs/1502.03167))

- Weight normalization ([Salimans et al.](https://arxiv.org/abs/1602.07868))

- Layer normalization ([Ba et al.](https://arxiv.org/abs/1607.06450))

<center><img src="../images/bn_algorithm.png" width="700"></center>
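
The training-time batch normalization transform can be sketched for a single feature as follows. `gamma` and `beta` are the learnable scale and shift; the default values and the example batch are assumptions, and the running statistics used at inference time are left out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n                          # mini-batch mean
    var = sum((x - mean) ** 2 for x in batch) / n  # mini-batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

batch = [1.0, 2.0, 3.0, 4.0]  # one feature across four samples
normalized = batch_norm(batch)

out_mean = sum(normalized) / len(normalized)
out_var = sum((x - out_mean) ** 2 for x in normalized) / len(normalized)
```

After the transform the feature has a mean of about 0 and a variance of about 1 within the batch; `gamma` and `beta` let the network undo the normalization if that turns out to be useful.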

# Conclusion

**Hyperparameters**: 
- gradient descent type 
- loss function
- weight updates
- weight initialization

**Architecture**: 
- Fully-connected layers
- Activation function
- Dropout & (batch) normalization

![footer_logo](../images/logo.png)

# Neural Networks in Keras
Let's put our knowledge to practice.

## [Keras basics](01-03-keras_basics.ipynb)
