# Training Deep Neural Networks
The NNs developed so far have been shallow, with only a few layers. What if we're tackling a much more complex problem, such as detecting hundreds of types of objects in high-res images?

Training deep NNs can be problematic, for example:
- You may face the *vanishing/exploding gradients* problem. This is when the gradients grow smaller and smaller, or larger and larger, when flowing backwards through the DNN during training. This makes it difficult to train the lower layers
- You might not have enough training data, or it may be too costly to label
- Training may be extremely slow
- A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or the dataset is too noisy.

## The vanishing/exploding gradients problem

Recall the backpropagation algorithm used to train neural nets. The gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged and training never converges to a good solution. This is the *vanishing gradients* problem.

The opposite can also happen: the gradients can grow bigger and bigger until layers get extremely large weight updates and the algorithm diverges. This is the *exploding gradients* problem, which surfaces mostly in recurrent NNs. More generally, deep networks suffer from unstable gradients: different layers may learn at widely different speeds.

In a [2010 paper](https://homl.info/47) the authors found a few suspects for why gradients can be so unstable, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was popular at the time (a normal distribution with mean 0 and standard deviation 1). They showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing until the activation function saturates at the top layers. (Fig. 11-1 on pg 333 illustrates this.)

## Glorot and He initialization

The authors of the paper, Xavier Glorot and Yoshua Bengio, propose a way to mitigate the unstable gradients problem. They point out that we need the signal to flow properly in both directions: forwards when making predictions and in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor to explode and saturate. They argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction.

It is not actually possible to guarantee both, unless a layer has an equal number of inputs and neurons (these numbers are called *fan-in* and *fan-out* of the layer), but the authors proposed a good compromise: the connection weights of each layer must be initialized randomly as described by the equation below:

$$\text{Normal distribution with mean 0 and variance }\sigma^2 = \frac{1}{fan_{\text{avg}}}$$
or
$$\text{Uniform distribution between -r and +r with }r = \sqrt{\frac{3}{fan_{\text{avg}}}}$$

where $fan_{\text{avg}} = (fan_{in} + fan_{out})/2$. This strategy is called *Xavier* or *Glorot initialization*. Using Glorot initialization can speed up training considerably.
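As a rough illustration of the two formulas above, the NumPy sketch below samples Glorot-initialized weights for a layer with 300 inputs and 100 neurons, then compares the variance of the layer's weighted sums with what a standard-deviation-1 initialization gives. The helper names `glorot_normal` and `glorot_uniform` and the layer sizes are assumptions made for this example; they are not the Keras implementation.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    """Normal weights with mean 0 and variance 1 / fan_avg."""
    fan_avg = (fan_in + fan_out) / 2
    return rng.normal(0.0, np.sqrt(1.0 / fan_avg), size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    """Uniform weights in [-r, r] with r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = np.sqrt(3.0 / fan_avg)
    return rng.uniform(-r, r, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
fan_in, fan_out = 300, 100   # hypothetical layer sizes
W_normal = glorot_normal(fan_in, fan_out, rng)
W_uniform = glorot_uniform(fan_in, fan_out, rng)
target_var = 1.0 / ((fan_in + fan_out) / 2)

# Both samplers should give approximately the same weight variance
print(W_normal.var(), W_uniform.var(), target_var)

# Variance of the weighted sums for standardized inputs
X = rng.normal(size=(1000, fan_in))
W_naive = rng.normal(0.0, 1.0, size=(fan_in, fan_out))  # standard deviation 1
var_naive = (X @ W_naive).var()     # grows roughly like fan_in
var_glorot = (X @ W_normal).var()   # stays close to fan_in / fan_avg
print(var_naive, var_glorot)
```

With the naive scheme the weighted sums have a variance on the order of the number of inputs, which is what pushes a sigmoid into saturation; with Glorot initialization the variance stays close to 1.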

If we replace $fan_{\text{avg}}$ with $fan_{\text{in}}$ we get *LeCun initialization*, which was proposed in the 90s. 

Some papers have provided similar initialization strategies for different activation functions. They differ only by the scale of the variance and whether they use $fan_{\text{avg}}$ or $fan_{\text{in}}$.

| Initialization | Activation Functions           | $\sigma^2$ (Normal)     |
| -------------- | ------------------------------ | ----------------------- |
| Glorot         | None, tanh, logistic, softmax  | 1/$fan_{\text{avg}}$    |
| He             | ReLU and variants              | 2/$fan_{\text{in}}$     |
| LeCun          | SELU                           | 1/$fan_{\text{in}}$     |

For the uniform distribution just compute $r=\sqrt{3\sigma^2}$. Note that the initialization strategy for ReLU and its variants is called *He initialization*.

By default, Keras uses Glorot with a uniform distribution. When creating a layer we can pass in the initialization by setting ```kernel_initializer="he_uniform"``` or ```kernel_initializer="he_normal"```, for example

If you want He initialization with a uniform distribution but based on $fan_\text{avg}$ rather than $fan_\text{in}$, you can use the ```VarianceScaling``` initializer as follows:

In [1]:
import keras
from keras.layers import Dense

he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
Dense(10, activation='sigmoid', kernel_initializer=he_avg_init)

Using TensorFlow backend.


<keras.layers.core.Dense at 0x7f3b740e7ee0>

## Nonsaturating Activation Functions

ReLU is a great choice of activation function for NNs because it doesn't saturate for positive values (unlike the sigmoid function) and it is fast to compute. It suffers, however, from the *dying ReLU* problem: during training some neurons *'die'* and stop outputting anything other than 0. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting zeros and Gradient Descent does not affect it anymore, because the gradient of the ReLU function is zero when its input is negative.

A variant of ReLU, *leaky ReLU*, can help solve this problem.

$$ \text{LeakyReLU}_\alpha(z) = \max(\alpha z, z) $$

The $\alpha$ hyperparameter defines how much the function 'leaks': it is the slope of the function for $z<0$ and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.
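A minimal NumPy sketch of the leaky ReLU formula and its slope may make the "never dies" argument concrete. The function names and the sample inputs are assumptions made for illustration only.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z), valid for 0 < alpha < 1."""
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope of leaky ReLU: alpha for z < 0, 1 otherwise."""
    return np.where(z < 0, alpha, 1.0)

z = np.array([-10.0, -1.0, 0.0, 2.0])
print(leaky_relu(z))       # values: -0.1, -0.01, 0.0, 2.0
print(leaky_relu_grad(z))  # values: 0.01, 0.01, 1.0, 1.0
```

Because the slope for negative inputs is $\alpha$ rather than 0, a neuron whose weighted sum is negative for every instance still receives a small gradient.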

A [2015 paper](https://homl.info/49) compared several variants of the ReLU function, and one of its conclusions was that the leaky variants always outperformed the strict ReLU. Setting $\alpha=0.2$ (a huge leak) seemed to result in better performance than $\alpha=0.01$ (a small leak). The paper also evaluated *randomized leaky ReLU* (RReLU), where $\alpha$ is picked randomly in a given range during training and is fixed to an average value during testing. It performed well and acted as a regularizer. Finally it evaluated *parametric leaky ReLU* (PReLU), where $\alpha$ is authorized to be learned during training (i.e. it becomes a parameter that can be modified by backpropagation). PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.

In 2015 the [*exponential linear unit (ELU)*](https://homl.info/50) was introduced and outperformed all other ReLU variants in the authors' experiments: training time was reduced, and the neural network performed better on the test set.

$$ \text{ELU}_\alpha(z) = \begin{cases}
                          \alpha(\exp(z) - 1) & \text{if } z<0\\
                          z & \text{if } z\geq0
                          \end{cases} $$
                          
Where $\alpha$ is the hyperparameter that defines the value the ELU function approaches when $z$ is a large negative number. The ELU function looks like the ReLU (fig 11-3 on pg 336) with a few major differences:
- It takes on negative values when $z<0$, allowing units to have an average output closer to zero, which alleviates the vanishing gradients problem
- It has a non-zero gradient for $z<0$, which avoids the dead neurons problem
- If $\alpha=1$ then the function is smooth everywhere, including around $z=0$, which helps speed up Gradient Descent, since it does not bounce as much to the left and right of $z=0$

The main drawback of ELU is that it is slower to compute than ReLU and its variants (because of the exponential). Its faster convergence rate during training compensates for that slow computation, but at test time an ELU network will still be slower than a ReLU network.
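The ELU definition and its gradient can be sketched in a few lines of NumPy. The function names and test inputs are assumptions chosen for the example; the point is that the output approaches $-\alpha$ for very negative $z$ while the gradient stays strictly positive.

```python
import numpy as np

def elu(z, alpha=1.0):
    """alpha * (exp(z) - 1) for z < 0, z otherwise."""
    z = np.asarray(z, dtype=float)
    # np.minimum keeps expm1 away from large positive z (avoids overflow warnings)
    return np.where(z < 0, alpha * np.expm1(np.minimum(z, 0.0)), z)

def elu_grad(z, alpha=1.0):
    """alpha * exp(z) for z < 0, 1 otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)

z = np.array([-100.0, -1.0, 0.0, 3.0])
print(elu(z))       # close to -alpha for very negative z, identity for z >= 0
print(elu_grad(z))  # small but non-zero for z < 0
```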

In 2017 the [Scaled ELU (SELU)](https://homl.info/selu) was introduced. The authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU function, then the network will *self-normalize*: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training. As a result SELU often significantly outperforms other activation functions for such networks. However, there are certain conditions for self-normalization to happen:
- Input features must be standardized
- Every hidden layer must be initialized with LeCun normal initialization
- The network's architecture must be sequential
- All layers must be dense: the paper only guarantees self-normalization in that case

Note: For non-sequential architectures such as recurrent networks or networks with *skip connections*, self-normalization is not guaranteed. Some researchers have nevertheless noted that SELU can improve performance in convolutional neural networks as well.
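The self-normalizing behaviour can be checked empirically with a small NumPy experiment: standardized inputs are pushed through a deep stack of dense layers with LeCun normal weights and SELU activations. This is only a sketch; the width (256), depth (100), batch size and the omission of bias terms are assumptions made for the example, and the two constants are the rounded values from the SELU paper.

```python
import numpy as np

# SELU constants from the paper (rounded)
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    """scale * ELU_alpha(z)."""
    return SCALE * np.where(z < 0, ALPHA * np.expm1(np.minimum(z, 0.0)), z)

rng = np.random.default_rng(0)
n_units = 256
a = rng.normal(size=(1000, n_units))   # standardized inputs: mean 0, std 1

for layer in range(100):
    # LeCun normal initialization: variance 1 / fan_in
    W = rng.normal(0.0, np.sqrt(1.0 / n_units), size=(n_units, n_units))
    a = selu(a @ W)

print(a.mean(), a.std())   # should remain close to 0 and 1
```

Replacing `selu` with a plain ReLU in the same loop is an easy way to see the activations drift away from mean 0 and standard deviation 1.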

In general
$$ \text{SELU} > \text{ELU} >\text{leaky ReLU (and variants)} > \text{ReLU} > \text{tanh} > \text{sigmoid}$$

The network's architecture might prevent it from self-normalizing, in which case ELU may be a better choice than SELU. If you care about runtime latency, use leaky ReLU instead. If you don't want to tweak $\alpha$, use the Keras defaults. If you have spare time and computing power, use cross-validation to evaluate other activation functions such as RReLU and PReLU. That said, because ReLU is the most commonly used activation function, many libraries and hardware accelerators provide ReLU-specific optimizations.


To use the leaky ReLU activation function, create a ```LeakyReLU``` layer and add it to the model just after the layer you want to apply it to:

In [2]:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(alpha=0.2)
])

For PReLU replace LeakyReLU with ```PReLU()```. There's currently no implementation of RReLU in keras but you can easily implement your own.

For SELU set ```activation='selu'``` and ```kernel_initializer='lecun_normal'``` when creating a layer.

### Batch normalization

While He initialization along with ELU (or any ReLU variant) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that they won't come back during training. [*Batch normalization*](https://homl.info/51) was introduced in 2015 to address these problems.

It consists of zero-centering and normalizing each input, then scaling and shifting the result using two new parameter vectors per layer: one for scaling and the other for shifting. This way the model is allowed to learn the optimal scale and mean of each of the layer's inputs.
In many cases, adding a BN layer as the very first layer of the network means you don't need to standardize your training set.

The algorithm computes the mean and standard deviation of the input over the current mini-batch. The operation is summarized below

1. $$\boldsymbol{\mu}_B = \frac{1}{m_B}\sum_{i=1}^{m_B}\textbf{x}^{(i)} $$
2. $$\boldsymbol{\sigma}_B^2 = \frac{1}{m_B}\sum_{i=1}^{m_B}(\textbf{x}^{(i)} - \boldsymbol{\mu}_B)^2 $$
3. $$\hat{\textbf{x}}^{(i)} = \frac{\textbf{x}^{(i)} - \boldsymbol{\mu}_B}{\sqrt{\boldsymbol{\sigma}_B^2+\epsilon}} $$
4. $$\textbf{z}^{(i)} = \boldsymbol{\gamma}\otimes\hat{\textbf{x}}^{(i)}+ \boldsymbol{\beta}$$

Where 
- $\boldsymbol{\mu}_B$ is the vector of input means, evaluated over the whole mini-batch $B$
- $\boldsymbol\sigma_B$ is the vector of input standard deviations over mini-batch $B$
- $m_B$ is the number of instances in the mini batch
- $\hat{\textbf{x}}^{(i)}$ is the vector of zero centered and normalized inputs for instance $i$
- $\boldsymbol\gamma$ is the output scale parameter vector for the layer
- $\otimes$ is element-wise multiplication (each input is multiplied by its corresponding scale parameter)
- $\boldsymbol\beta$ is the output shift parameter vector for the layer. Each input is offset by its corresponding shift parameter
- $\epsilon$ is a tiny number that avoids division by zero, called a *smoothing term*
- $\textbf{z}^{(i)}$ is the output of the Batch Normalization
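The four steps above translate almost line by line into NumPy. The sketch below covers the training-time forward pass only; the function name `batch_norm_forward`, the value of `eps` and the toy batch are assumptions made for illustration.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-3):
    """Training-time Batch Normalization of a 2D batch X with shape [m_B, n_inputs]."""
    mu_B = X.mean(axis=0)                       # step 1: per-input mean
    var_B = X.var(axis=0)                       # step 2: per-input variance
    X_hat = (X - mu_B) / np.sqrt(var_B + eps)   # step 3: zero-center and normalize
    Z = gamma * X_hat + beta                    # step 4: scale and shift
    return Z, mu_B, var_B

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=10.0, size=(64, 5))   # unscaled toy inputs
gamma, beta = np.ones(5), np.zeros(5)               # identity scale and shift

Z, mu_B, var_B = batch_norm_forward(X, gamma, beta)
print(Z.mean(axis=0), Z.std(axis=0))   # close to 0 and 1 for every input
```

With `gamma` set to ones and `beta` to zeros the layer simply standardizes each input; during training these two vectors are learned like any other weights.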

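You might ask *'but what mean and standard deviation do I use at test time?'* You might have only one test instance, and even with a batch of test instances, the batch may be too small or the instances may not be independent and identically distributed (I.I.D.), so batch statistics would be unreliable.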

One solution would be to wait until end of training, then run the whole training set through the NN to compute the mean and deviation of each input of the BN layer. These "final" input means and deviations could then be used instead of the batch means/deviation when making predictions. 

However, most implementations of BN estimate these final statistics during training by using a moving average of the layer's input means and standard deviations. Keras does this automatically.

Batch Normalization also acts as a regularizer, reducing the need for other regularization techniques. It does, however, add some complexity to the model. It also makes predictions slower due to the extra computations required at each layer.

#### Batch Normalization with keras

The following model applies a BN layer after every hidden layer and as the first layer in the model (after flattening the input images):

In [3]:
from keras.models import Sequential
from keras.layers import Flatten, BatchNormalization

model = Sequential([
    Flatten(input_shape=[28, 28]),
    BatchNormalization(),
    Dense(300, activation='elu', kernel_initializer='he_normal'),
    BatchNormalization(),
    Dense(100, activation='elu', kernel_initializer='he_normal'),
    BatchNormalization(),
    Dense(10, activation='softmax')
])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 784)               3136      
_________________________________________________________________
dense_3 (Dense)              (None, 300)               235500    
_________________________________________________________________
batch_normalization_2 (Batch (None, 300)               1200      
_________________________________________________________________
dense_4 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_3 (Batch (None, 100)               400       
_________________________________________________________________
dense_5 (Dense)              (None, 10)               

For such a small model, it is unlikely BN will have a very positive impact, but for deeper networks it can make a tremendous difference. 

For each layer, BN adds 4 parameters per input: $\boldsymbol{\gamma, \beta, \mu, \sigma}$. E.g. the first BN layer adds 3,136 parameters, which is $4\times784$. Since $\boldsymbol{\mu, \sigma}$ are the moving averages, they are not affected by backpropagation, so Keras calls them "non-trainable".
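The parameter counts in the summary can be reproduced with simple arithmetic. The list of input sizes below is read off the model above (784, 300 and 100 inputs for the three BN layers); the variable names are only for illustration.

```python
bn_input_sizes = [784, 300, 100]   # inputs seen by the three BN layers above

# gamma, beta, mu and sigma: four values per input
params_per_layer = [4 * n for n in bn_input_sizes]
# mu and sigma are moving averages, so they are non-trainable
non_trainable = sum(2 * n for n in bn_input_sizes)
total = sum(params_per_layer)

print(params_per_layer)        # [3136, 1200, 400]
print(total, non_trainable)    # 4736 2368
```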

The authors of the BN paper argued in favor of adding the BN layers before the activation functions, rather than after (as we just did). It is a good idea to experiment to see which option works best for your data. To add the BN layers before the activation functions, we need to remove the activation function from the hidden layers and add it as a separate layer after each BN layer. Moreover, since a BN layer includes one offset parameter per input, we can remove the bias term from the previous layer by passing ```use_bias=False```:

In [4]:
from keras.layers import Activation

model = Sequential([
    Flatten(input_shape=[28, 28]),
    BatchNormalization(),
    Dense(300, kernel_initializer='he_normal', use_bias=False),
    BatchNormalization(),
    Activation('elu'),
    Dense(100, kernel_initializer='he_normal', use_bias=False),  # no activation here: it is applied after BN
    BatchNormalization(),
    Activation('elu'),
    Dense(10, activation='softmax')
])
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_4 (Batch (None, 784)               3136      
_________________________________________________________________
dense_6 (Dense)              (None, 300)               235200    
_________________________________________________________________
batch_normalization_5 (Batch (None, 300)               1200      
_________________________________________________________________
activation_1 (Activation)    (None, 300)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 100)               30000     
_________________________________________________________________
batch_normalization_6 (Batch (None, 100)              

Usually the default values of the BN layer hyperparameters are fine, but you may occasionally need to tweak the ```momentum```. The BN layer uses this hyperparameter when it updates the exponential moving averages: given a new value $\textbf{v}$ (i.e. a new vector of input means or standard deviations computed over the current batch), the layer updates the running average $\hat{\textbf{v}}$ using
$$ \hat{\textbf{v}} \leftarrow \hat{\textbf{v}} \times \text{momentum} + \textbf{v}\times (1 - \text{momentum})$$
Typical values are close to 1: 0.9, 0.99, 0.999 (adding more 9s for larger datasets and smaller mini-batches).
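The moving-average update is easy to try out in isolation. In this sketch the function name, the starting value of zero and the constant batch mean are assumptions made for the example; in a real layer the batch statistics change at every step.

```python
import numpy as np

def update_moving_average(running, new_value, momentum=0.99):
    """Exponential moving average update used for the BN statistics."""
    return running * momentum + new_value * (1.0 - momentum)

running_mean = np.zeros(3)
batch_mean = np.array([1.0, 2.0, 3.0])   # pretend every batch has this mean

for step in range(1000):
    running_mean = update_moving_average(running_mean, batch_mean, momentum=0.99)

print(running_mean)   # converges toward batch_mean
```

A momentum closer to 1 makes the running average smoother but slower to react, which is why more 9s suit long training runs with small, noisy mini-batches.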

Another important hyperparameter is the ```axis```. It determines which axis should be normalized, with default -1, i.e. normalizing the last axis. 

When the input batch is 2D (i.e. the batch shape is [batch size, features]) this means each input feature will be normalized based on the mean and standard deviation computed across all the instances in the batch. For example, the first BN layer in the previous example will independently normalize, rescale and shift each of the 784 input features.

If we move the first BN layer before the Flatten layer, then the input batches will be 3D with shape [batch size, height, width]: the BN layer will therefore compute 28 means and 28 standard deviations (one per column of pixels, computed across all instances in the batch and all rows in the column) and will normalize, rescale and shift all pixels in a given column using the same mean and standard deviation. If instead you want to treat each of the 784 pixels independently, you should set ```axis=[1,2]```.
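The effect of the axis choice on the number of statistics can be mimicked with plain NumPy reductions. This is a sketch of the shapes involved, not of the Keras layer itself; the random batch of 32 images is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 28, 28))   # [batch size, height, width]

# axis=-1: one statistic per entry of the last axis (per pixel column),
# computed over the batch and over all rows
mean_per_column = X.mean(axis=(0, 1))

# axis=[1, 2]: one statistic per pixel, computed over the batch only
mean_per_pixel = X.mean(axis=0)

print(mean_per_column.shape)   # (28,)
print(mean_per_pixel.shape)    # (28, 28), i.e. 784 values
```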

BN is so popular that it is often omitted in model diagrams, as it is assumed that BN is added after every layer. A recent [paper](https://homl.info/fixup), however, used a novel *fixed-update* weight initialization technique to train a very deep neural network (10,000 layers) without BN. This is bleeding-edge research, so wait for additional results before dropping BN.

#### Gradient Clipping

Gradient clipping is another [technique](https://homl.info) for mitigating the exploding gradients problem: the gradients are clipped during backpropagation so that they never exceed some threshold. It is most often used in recurrent neural nets, as Batch Normalization is tricky to use in RNNs. In Keras, adding Gradient Clipping is simply a matter of setting the ```clipvalue``` or ```clipnorm``` argument when creating an optimizer.

In [5]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss='mse', optimizer=optimizer)

This optimizer clips every component of the gradient vector to a value between -1 and 1. The threshold is a hyperparameter you can tune. 

Note that it may change the orientation of the gradient vector. For instance, if the original gradient vector is [0.9, 100], it points mostly in the direction of the second axis. However, clipping it by value gives [0.9, 1.0], which points roughly along the diagonal between the two axes. If you want to ensure clipping doesn't change the direction of the gradient vector, you should clip by norm by setting ```clipnorm``` instead of ```clipvalue```. This will clip the whole gradient if its $\ell_2$ norm is greater than the threshold you picked.
For example, with ```clipnorm=1.0``` the vector [0.9, 100] becomes [0.00899964, 0.9999595], preserving its orientation but almost eliminating the first component. You can track the size of the gradients using TensorBoard, and you may want to try both clipping by value and clipping by norm, with different thresholds, to see which option performs best on the validation set.
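Both kinds of clipping can be reproduced on the [0.9, 100] example with a few lines of NumPy. The helper names `clip_by_value` and `clip_by_norm` are assumptions for this sketch; they operate on a single gradient vector and are not the Keras optimizer code.

```python
import numpy as np

def clip_by_value(grad, threshold):
    """Clip every component to the range [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, threshold):
    """Rescale the whole vector if its l2 norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

g = np.array([0.9, 100.0])
by_value = clip_by_value(g, 1.0)
by_norm = clip_by_norm(g, 1.0)

print(by_value)   # direction changes: roughly diagonal
print(by_norm)    # direction preserved, length 1
```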

### Reusing Pretrained Layers

Because very deep Neural nets take so long to train and run, it is often a good idea to find one that was built for a problem similar to yours and re-use it. This technique is called *transfer learning* and it speeds up training and requires less data. See diagram on pg. 346

We'll use Fashion MNIST as an example: Suppose the dataset only contained eight classes (e.g. all but sandal and shirt). Someone built a Keras model that achieved good performance (>90% accuracy). Call this model A.

Our task is to train a binary classifier that differentiates between sandals and shirts (positive=shirt, negative=sandal). We only have 200 labeled images. We train a model (call it model B) and, in the run below, it reaches 98.25% accuracy on the test set.

We then realise the two tasks are quite similar and we can use transfer learning.

In [6]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [7]:
import numpy as np

def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

Train model A

In [8]:
import tensorflow as tf
import keras

tf.random.set_seed(42)
np.random.seed(42)

model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))

model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(lr=1e-3),
                metrics=["accuracy"])

history = model_A.fit(X_train_A, y_train_A, epochs=20,
                    validation_data=(X_valid_A, y_valid_A))

Train on 43986 samples, validate on 4014 samples


In [9]:
model_A.save('saved_models/11_training_dnns/model_A.h5')

Now we prepare model B

In [10]:
model_B = keras.models.Sequential()
model_B.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(keras.layers.Dense(n_hidden, activation="selu"))
model_B.add(keras.layers.Dense(1, activation="sigmoid"))
model_B.compile(loss="binary_crossentropy",
                optimizer=keras.optimizers.SGD(lr=1e-3),
                metrics=["accuracy"])
history = model_B.fit(X_train_B, y_train_B, epochs=20,
                      validation_data=(X_valid_B, y_valid_B))
model_B.save('saved_models/11_training_dnns/model_B.h5')

Train on 200 samples, validate on 986 samples


In [11]:
model_B.evaluate(X_test_B, y_test_B)



[0.10593942695856094, 0.9825000166893005]

To re-use model A, we need to load it and create a new model based on that model's layers

In [12]:
model_A = keras.models.load_model('saved_models/11_training_dnns/model_A.h5')
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation='sigmoid'))

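Note that model_A and model_B_on_A now share layers, so training model_B_on_A will also affect model_A. If you want to avoid that, you need to clone model_A before re-using its layers. Since ```clone_model``` copies the architecture but not the weights, the weights have to be copied separately: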

In [13]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

Since the output layer for model_B_on_A was randomly initialized, training it now would cause it to make large errors and wreck the pre-trained weights. To avoid this, we'll freeze the reused layers for a few epochs giving the output layer time to learn reasonable weights.

In [14]:
# Freeze layers and re-compile
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False
    
model_B_on_A.compile(loss='binary_crossentropy', optimizer='SGD', metrics=['accuracy'])

In [15]:
# Train for a few epochs
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4, 
                           validation_data=(X_valid_B, y_valid_B))

Train on 200 samples, validate on 986 samples


In [16]:
# Unfreeze the layers
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True
    
# Freezing/unfreezing requires the model to be re-compiled
optimizer = keras.optimizers.SGD(lr=1e-4)
model_B_on_A.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16, 
                           validation_data=(X_valid_B, y_valid_B))

Train on 200 samples, validate on 986 samples


In [17]:
model_B_on_A.evaluate(X_test_B, y_test_B)



[0.15868764865398408, 0.972000002861023]

Results are not as good. Why? Transfer learning does not work very well with small dense networks, presumably because they learn very few patterns.

Transfer learning works best with Deep Convolutional Neural Networks which tend to learn feature detectors that are much more general. Transfer learning will be revisited in chapter 14

### Unsupervised pretraining

The book explains this concept on page 349. It covers the case where there is not enough labelled data for a supervised problem, and describes using other DNNs such as Autoencoders or GANs to pre-train a model on unlabelled data.

### Pretraining on Auxiliary task

Another alternative for when we don't have enough labeled training data, is to train a first neural net on an auxiliary task for which we can obtain labeled data. Explanation on pg 350

## Faster Optimizers

We present some popular algorithms for optimizers that are faster than SGD

### Momentum Optimization

Recall that Gradient Descent updates the weights by subtracting the gradient of the cost function with regard to the weights, multiplied by the learning rate $\eta$. It does not care about what previous gradients were.

$$ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta}) $$

In [momentum optimization](https://homl.info/54), we introduce the *momentum vector* $\textbf{m}$ and the *momentum hyperparameter* $\beta$, which together take previous gradients into account. At each iteration, the local gradient (multiplied by the learning rate $\eta$) is subtracted from $\textbf{m}$, and the weights are updated by adding $\textbf{m}$. The gradient is thus used as an acceleration rather than as a speed. $\beta$ acts as friction that stops the momentum from growing too large, and is set between 0 (high friction) and 1 (no friction).

1. $$ \textbf{m} \leftarrow \beta\textbf{m} - \eta \nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta}) $$
2. $$ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \textbf{m} $$

Note: Recall that the idea behind GD is to take steps towards the bottom of a hill. For momentum optimization imagine a ball rolling down the hill instead.
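To make the two update steps concrete, here is a small NumPy sketch that runs plain Gradient Descent and momentum optimization side by side. The quadratic "elongated bowl" cost, the starting point, the learning rate and the number of steps are all assumptions chosen for the example.

```python
import numpy as np

def grad_J(theta, curvature):
    """Gradient of the toy cost J(theta) = 0.5 * sum(curvature * theta**2)."""
    return curvature * theta

def momentum_step(theta, m, grad, lr, beta):
    m = beta * m - lr * grad   # step 1: update the momentum vector
    theta = theta + m          # step 2: move the weights by the momentum
    return theta, m

curvature = np.array([1.0, 50.0])   # gentle slope on axis 0, steep on axis 1
lr, beta = 0.01, 0.9

theta_gd = np.array([5.0, 5.0])
theta_mom, m = np.array([5.0, 5.0]), np.zeros(2)
for step in range(300):
    theta_gd = theta_gd - lr * grad_J(theta_gd, curvature)
    theta_mom, m = momentum_step(theta_mom, m, grad_J(theta_mom, curvature), lr, beta)

# Distance to the optimum (the origin) after the same number of steps
print(np.linalg.norm(theta_gd), np.linalg.norm(theta_mom))
```

On this toy problem plain Gradient Descent is still crawling along the gentle axis after 300 steps, while the momentum version has essentially reached the optimum.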

In deep Neural nets that don't use Batch Normalization, upper layers have inputs with very different scales so using momentum optimization helps a lot.

In keras we can use the SGD optimizer and set its ```momentum``` hyperparameter

In [18]:
optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9)

The one drawback of momentum optimization is that it adds yet another hyperparameter to tune. In practice, ```momentum=0.9``` usually works well.

### Nesterov Accelerated Gradient

[Nesterov Accelerated Gradient (NAG)](https://homl.info/55) is a slight variant of momentum optimization. It measures the gradient of the cost function not at the local position $\boldsymbol{\theta}$ but slightly ahead in the direction of the momentum, at $\boldsymbol{\theta} + \beta\textbf{m}$.

1. $$ \textbf{m} \leftarrow \beta\textbf{m} - \eta \nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta} + \beta\textbf{m}) $$
2. $$ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \textbf{m} $$


This small tweak works because in general the momentum vector will be pointing in the right direction (towards the optimum). Using the gradient measured a bit farther in that direction, rather than the gradient at the original position, makes our push slightly more accurate (see Figure 11-6 pg 343).

After many iterations, the slight improvements add up and NAG ends up being significantly faster than regular momentum optimization. Moreover, when the momentum pushes the weights across a valley, NAG pushes back towards the bottom of the valley, which helps reduce oscillations, so NAG converges faster.
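The only change relative to momentum optimization is where the gradient is evaluated, which the sketch below makes explicit. As before, the quadratic toy cost, the starting point and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def grad_J(theta, curvature):
    """Gradient of the toy cost J(theta) = 0.5 * sum(curvature * theta**2)."""
    return curvature * theta

def nesterov_step(theta, m, curvature, lr, beta):
    lookahead = theta + beta * m                      # peek ahead along the momentum
    m = beta * m - lr * grad_J(lookahead, curvature)  # step 1: gradient at the look-ahead point
    theta = theta + m                                 # step 2
    return theta, m

curvature = np.array([1.0, 50.0])
theta, m = np.array([5.0, 5.0]), np.zeros(2)
for step in range(300):
    theta, m = nesterov_step(theta, m, curvature, lr=0.01, beta=0.9)

print(np.linalg.norm(theta))   # distance to the optimum at the origin
```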

With Keras, we simply set ```nesterov=True``` when creating the ```SGD``` optimizer:





In [19]:
optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9, nesterov=True)

### AdaGrad

Consider an elongated bowl: Gradient Descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, and then slowly goes down to the bottom of the valley.

[AdaGrad](https://homl.info/56) corrects the direction of descent to point a bit more toward the global optimum by scaling down the gradient vector along the steepest dimensions.

1. $$ \textbf{s} \leftarrow \textbf{s} + \nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta})\otimes \nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta})$$
2. $$ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta})\oslash \sqrt{\textbf{s} + \epsilon} $$

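Step 1 accumulates the square of the gradients into the vector $\textbf{s}$ (recall $\otimes$ is element-wise multiplication).

Step 2 is almost identical to Gradient Descent, except that the gradient vector is scaled down by a factor of $\sqrt{\textbf{s} + \epsilon}$ ($\oslash$ represents element-wise division, and $\epsilon$ is a smoothing term that avoids division by zero).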

In short the algorithm decays the learning rate, doing it faster for steeper dimensions than for dimensions with gentler slopes. This is called *adaptive learning rate*. An additional benefit is that it requires less tuning of the learning rate $\eta$.

Adagrad works well for simple quadratic problems, but often stops too early when training neural nets. While keras has an Adagrad optimizer, it is not recommended to use it for training deep neural networks (it may be efficient for simpler tasks such as Linear Regression though).
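The two AdaGrad steps can be sketched in NumPy as follows. The toy elongated bowl (gradient 50 times steeper along the second axis), the learning rate and the helper name `adagrad_step` are assumptions made for this example.

```python
import numpy as np

def adagrad_step(theta, s, grad, lr=0.5, eps=1e-10):
    s = s + grad * grad                           # step 1: accumulate squared gradients
    theta = theta - lr * grad / np.sqrt(s + eps)  # step 2: per-dimension scaled update
    return theta, s

curvature = np.array([1.0, 50.0])   # axis 1 is 50 times steeper than axis 0
theta, s = np.array([5.0, 5.0]), np.zeros(2)
for step in range(100):
    grad = curvature * theta
    theta, s = adagrad_step(theta, s, grad)

effective_lr = 0.5 / np.sqrt(s + 1e-10)
print(theta)          # both coordinates shrink along the same path
print(effective_lr)   # much smaller along the steep dimension
```

On this separable toy problem the per-dimension scaling cancels the difference in steepness, so the trajectory heads straight for the optimum. The ever-growing `s` also shows why the updates keep shrinking, which is the early-stopping behaviour described above.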

### RMSProp

RMSProp works similarly to AdaGrad but it only accumulates the gradients from the most recent iterations of training. It does so by using exponential decay in the first step.

1. $$ \textbf{s} \leftarrow \beta\textbf{s} + (1-\beta)\nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta})\otimes \nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta})$$
2. $$ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta})\oslash \sqrt{\textbf{s} + \epsilon} $$

A typical value for decay ($\beta$) is 0.9 and this default usually works well. Except for very simple problems, this optimizer is faster than AdaGrad. Keras has the ```RMSprop``` optimizer. 

In [20]:
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9)
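To see how the exponential decay changes the accumulator, the sketch below feeds the same gradient sequence (one large gradient followed by many small ones) to an RMSProp-style update and to an AdaGrad-style running sum. The gradient values and the helper name `rmsprop_step` are assumptions made for the example.

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-10):
    s = beta * s + (1 - beta) * grad * grad       # step 1: decaying average of squared gradients
    theta = theta - lr * grad / np.sqrt(s + eps)  # step 2: same scaled update as AdaGrad
    return theta, s

# One large gradient followed by many small ones
grads = [np.array([10.0])] + [np.array([0.1])] * 50

theta, s_rms = np.array([0.0]), np.zeros(1)
s_ada = np.zeros(1)   # AdaGrad-style running sum, kept only for comparison
for g in grads:
    theta, s_rms = rmsprop_step(theta, s_rms, g)
    s_ada = s_ada + g * g

print(s_rms, s_ada)
```

The RMSProp accumulator has mostly forgotten the early large gradient, whereas the AdaGrad sum is still dominated by it, which keeps AdaGrad's effective learning rate small for the rest of training.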

### Adam and Nadam Optimization

[Adam](https://homl.info/59) stands for *Adaptive moment estimation* and combines ideas of momentum optimization and RMSprop: like momentum optimization, it keeps track of an exponentially decaying average of past gradients, and like RMSprop, it keeps track of an exponentially decaying average of past squared gradients.

1. $$ \textbf{m} \leftarrow \beta_1\textbf{m} - (1-\beta_1)\nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta}) $$
2. $$ \textbf{s} \leftarrow \beta_2\textbf{s} + (1-\beta_2)\nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta})\otimes \nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta})$$
3. $$\hat{\textbf{m}} \leftarrow \frac{\textbf{m}}{1 - \beta_1^{t}}$$
4. $$ \hat{\textbf{s}} \leftarrow \frac{\textbf{s}}{1-\beta_2^{t}} $$
5. $$ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \eta\hat{\textbf{m}}\oslash\sqrt{\hat{\textbf{s}} + \epsilon} $$

In this equation, $t$ represents the iteration number (starting at 1).


Steps 1, 2 and 5 are familiar and look like momentum optimization and RMSprop, with the only difference being that step 1 computes an exponentially decaying average, rather than an exponentially decaying sum.

Steps 3 and 4 are a technical detail: since **m** and **s** are initialized at 0, they will be biased toward 0 at the beginning of training, so these steps help boost **m** and **s** early on.

The momentum decay hyperparameter $\beta_1$ is typically initialized at 0.9, while the scaling decay hyperparameter $\beta_2$ is often initialized to 0.999.

In [21]:
optimizer = keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999)
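The five steps can be written out in NumPy as a rough sketch. It keeps the sign convention used in the equations of this section (a minus sign in step 1 and a plus sign in step 5) and uses $\beta_1$ in the step 3 correction, as in the Adam paper. The function name `adam_step`, the value of `eps` and the example gradient are assumptions for illustration, not the Keras internals.

```python
import numpy as np

def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; t is the iteration number, starting at 1."""
    m = beta1 * m - (1 - beta1) * grad                   # step 1: decaying average of gradients
    s = beta2 * s + (1 - beta2) * grad * grad            # step 2: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                         # step 3: bias correction for m
    s_hat = s / (1 - beta2 ** t)                         # step 4: bias correction for s
    theta = theta + eta * m_hat / np.sqrt(s_hat + eps)   # step 5: parameter update
    return theta, m, s

theta = np.array([1.0, 1.0])
m = np.zeros(2)
s = np.zeros(2)
grad = np.array([0.01, 100.0])   # made-up gradient with very different magnitudes

theta, m, s = adam_step(theta, m, s, grad, t=1)
print(theta)  # each coordinate moved by roughly eta, whatever the gradient magnitude
```

Thanks to the bias correction, the very first update already has a size close to $\eta$ in every dimension instead of being shrunk toward 0.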

Since Adam is an adaptive learning rate algorithm, it requires less tuning of the learning rate $\eta$. Using the value of $\eta=0.001$ is generally a good choice.

One modification of Adam is *Adamax*, which is described on pg 357. Adam typically performs better than Adamax, so Adamax is just one more optimizer you can try if you experience problems with Adam.

Nadam is Adam optimization plus the Nesterov trick, so it often converges slightly faster than Adam. The report that introduced it found that Nadam generally outperforms Adam, but is sometimes outperformed by RMSprop.

Note: Adaptive optimization methods are great and often converge fast. However a [2017 paper](https://homl.info/60) showed they can lead to solutions that generalize poorly on some datasets. If this is the case for you, try using NAG instead. Also, keep an eye on the latest research as it is moving fast.

See note on pg 358 on Jacobians and Hessians.

Pg 359 for Note on Sparse models and a round-up of optimizers.

## Learning Rate Scheduling

A good learning rate is very important. Set it too high and training may diverge. Set it too low and training will converge, but it will take a very long time. Learning schedules are techniques for training a model with a variable learning rate. We present the most commonly used schedules.

#### Power Scheduling
Set the learning rate to a function of the iteration number $t$: $\eta(t) = \eta_0 / (1 +t/s)^c$. The initial learning rate $\eta_0$, the power $c$ (usually set to 1) and the steps $s$ are hyperparameters.
This method drops the learning rate quickly at first, then more and more slowly. Requires tuning of $\eta_0, s$ and possibly $c$.

In Keras we set the decay hyperparameter, which is the inverse of $s$. Keras assumes $c=1$

In [22]:
optimizer = keras.optimizers.SGD(lr=0.01, decay=1e-4)
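As a quick check of the formula, here is a plain Python sketch of power scheduling. The function name `power_schedule` is hypothetical; the default `s=10_000` is simply the inverse of `decay=1e-4` from the cell above.

```python
def power_schedule(t, lr0=0.01, s=10_000, c=1):
    """Learning rate at iteration t: lr0 / (1 + t/s)**c."""
    return lr0 / (1 + t / s) ** c

# With c=1 the rate falls to lr0/2 after s steps, lr0/3 after 2s steps, and so on.
for t in (0, 10_000, 20_000, 30_000):
    print(t, power_schedule(t))
```

The drop from $\eta_0$ to $\eta_0/2$ takes $s$ steps, but going from $\eta_0/2$ to $\eta_0/4$ takes another $2s$ steps, which is the "quickly at first, then more and more slowly" behaviour.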

#### Exponential Scheduling
Set the learning rate to $\eta(t) = \eta_0 0.1^{t/s}$. This way $\eta$ drops gradually by a factor of 10 every $s$ steps. 

In Keras, we define the exponential decay function and use the ```LearningRateScheduler``` callback to pass it when fitting

In [23]:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

When you save a model, the optimizer and the learning rate get saved along with it. Meaning we can load the model and continue training from where it left off.

However the epoch argument does not get saved, it gets reset to 0 every time we call ```fit()```. One solution will be to set the ```fit()``` method's ```initial_epoch``` argument so that the epoch starts from where we left off.

#### Piecewise constant scheduling
Use a constant learning rate $\eta_0$ for $e_0$ epochs, then use another learning rate $\eta_1$ for some $e_1$ epochs, with $\eta_0 > \eta_1 > ... > \eta_n$. Although this can work very well, it requires fiddling with the right sequence of learning rates and how long to use them. 

In [24]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

lr_scheduler = keras.callbacks.LearningRateScheduler(piecewise_constant_fn)
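To avoid hard-coding the thresholds, the same schedule can be built from a list of boundaries and values. This is a small sketch in plain Python; the helper name `piecewise_constant` and its signature are assumptions for illustration.

```python
from bisect import bisect_right

def piecewise_constant(boundaries, values):
    """Build a schedule where values[i] applies until the epoch reaches boundaries[i]."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("need exactly one more value than boundaries")

    def piecewise_constant_fn(epoch):
        # bisect_right counts how many boundaries are <= epoch
        return values[bisect_right(boundaries, epoch)]

    return piecewise_constant_fn

# Same schedule as the hard-coded function above
schedule_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])
```

The returned function takes an epoch index and returns a learning rate, so it could be passed to ```LearningRateScheduler``` in the same way as the hard-coded version.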

#### Performance Scheduling
Measure the validation error every $N$ steps and reduce the learning rate by a factor of $\lambda$ when error stops dropping.

We can use the ```ReduceLROnPlateau``` callback.

In [25]:
# Multiply lr by 0.5 if the monitored loss (val_loss by default) does not improve for 5 consecutive epochs
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
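The logic behind performance scheduling can be sketched in a few lines of plain Python. This is a deliberately simplified model of the idea, not the callback's actual code: it ignores options such as a minimum improvement threshold, a cooldown period or a lower bound on the learning rate, and the function name and loss values are made up.

```python
def reduce_on_plateau(val_losses, lr0=0.01, factor=0.5, patience=5):
    """Return the learning rate used at each epoch, given the validation losses."""
    lr, best, wait, history = lr0, float("inf"), 0, []
    for loss in val_losses:
        history.append(lr)          # rate used during this epoch
        if loss < best:             # improvement: remember it and reset the counter
            best, wait = loss, 0
        else:                       # no improvement at the end of this epoch
            wait += 1
            if wait >= patience:    # plateau detected: shrink the rate
                lr *= factor
                wait = 0
    return history

# Made-up loss curve: improves twice, then stalls
losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]
history = reduce_on_plateau(losses)
print(history)
```

With these made-up losses the rate stays at 0.01 for the first seven epochs and is halved for the eighth, after five epochs in a row without improvement.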

#### 1cycle scheduling
See book pg 361

A [2013 paper by Andrew Senior et al](https://homl.info/63) compared the performance of popular learning schedules when training DNNs for speech recognition. The authors concluded that performance scheduling and exponential scheduling both performed well, favouring exponential scheduling for its simplicity and slightly faster convergence.

Still, it seems that 1cycle scheduling is a better approach. See [notebook](https://github.com/ageron/handson-ml2/blob/master/11_training_deep_neural_networks.ipynb) for an implementation which uses the same approach for finding the optimal learning rate for the starting learning rate.

#### tf.keras
```tf.keras``` offers an alternative way to implement LR scheduling: define the LR using one of the schedules available in ```keras.optimizers.schedules```, then pass this learning rate to any optimizer. This approach updates the learning rate at each step, instead of each epoch. For example, the exponential_decay_fn is implemented below

In [None]:
s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch_size=32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

With this approach, the schedule and its state get saved as well. (Note, this is specific to tf.keras, i.e. the TensorFlow implementation of Keras)

### Regularization

One of the best techniques for regularization is early stopping. Moreover, even though batch normalization was designed to solve the vanishing/exploding gradients problem, it also does a really good job of regularizing a model. Below we'll explore other techniques for regularization.

#### $l_1$ and $l_2$ regularization

$l_2$ regularization can be used to constrain a NN's connection weights, and/or $l_1$ regularization can be used if you want a sparse model (i.e. many weights equal to 0). 

In [29]:
# example of l2 regularization with a factor of 0.01
layer = keras.layers.Dense(100, activation='elu', 
                           kernel_initializer='he_normal',
                           kernel_regularizer=keras.regularizers.l2(0.01))

```keras.regularizers.l2()``` returns a regularizer that is called at each step during training to compute the regularization loss, which is then added to the final loss. For $l_1$ use ```keras.regularizers.l1()```, and if you want both use ```keras.regularizers.l1_l2()```, specifying both regularization factors.
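To make the penalty concrete, the sketch below computes the $l_2$ and $l_1$ terms for a small weight matrix in NumPy. The kernel values are made up and the function names are hypothetical; the formulas (factor times the sum of squared weights, and factor times the sum of absolute weights) are the standard definitions.

```python
import numpy as np

def l2_penalty(weights, factor=0.01):
    """l2 regularization loss: factor * sum of squared weights."""
    return factor * np.sum(np.square(weights))

def l1_penalty(weights, factor=0.01):
    """l1 regularization loss: factor * sum of absolute weights."""
    return factor * np.sum(np.abs(weights))

kernel = np.array([[0.5, -1.0],
                   [2.0,  0.0]])   # made-up connection weights

l2_loss = l2_penalty(kernel)   # 0.01 * (0.25 + 1 + 4 + 0)
l1_loss = l1_penalty(kernel)   # 0.01 * (0.5 + 1 + 2 + 0)
print(l2_loss, l1_loss)
```

The $l_2$ term is dominated by the largest weight (2.0 contributes 4 of the 5.25 total), which is why it mainly discourages large weights, while the $l_1$ term penalizes every nonzero weight in proportion to its size.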

For implementing this in practice, we can use ```functools.partial()``` to create a thin wrapper around a callable. This makes the code easier to read and less error-prone.

In [31]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation='elu',
                           kernel_initializer='he_normal',
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation='softmax', 
                     kernel_initializer='glorot_uniform')    
])

#### Dropout
A very popular and effective regularization technique introduced in [2012](https://homl.info/64) and further detailed in [2014](https://homl.info/65).

At every training step, every neuron (including input neurons, but always excluding output neurons), has a probability $p$ of being dropped out. Meaning it will be completely ignored during this training step, but may be active in the next. The *dropout rate* $p$ is typically set between 10% and 50% (20-30% for recurrent neural nets, 40-50% for convolutional neural nets). After training, no neurons are dropped. That's it!

Neurons trained with dropout cannot co-adapt with their neighbouring neurons; they have to be as useful as possible on their own. They cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes to the inputs and in the end we get a more robust network that generalizes better.

Another way to understand the power of dropout is to realize that a unique neural net is generated at each training step. There are $2^N$ possible networks (where $N$ is the number of droppable neurons), which is such a large number that it is extremely unlikely the same network will be sampled twice. Once we have run 10,000 training steps we have essentially trained 10,000 different neural nets. They are not independent, since they share many of their weights, but they are all different. The resulting NN can be seen as an averaging ensemble of all these smaller networks.

One technical detail is that for testing, we need to multiply each input connection weight by the *keep probability* $(1-p)$. If we don't, each neuron will get a total input signal much larger than what the network was trained on, and the network is unlikely to perform well. Alternatively, we can divide each neuron's output by the keep probability during training (these alternatives are not perfectly equivalent, but work equally well).

In Keras, we can use ```keras.layers.Dropout```, which will randomly drop some inputs (setting them to 0) and divide the remaining inputs by the keep probability. This only applies during training.
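The drop-and-rescale behaviour can be mimicked in NumPy. The sketch below is an illustration of the idea under stated assumptions, not the layer's actual code: the function name `inverted_dropout`, the seed and the all-ones input are chosen for the example.

```python
import numpy as np

def inverted_dropout(inputs, rate, rng, training=True):
    """Zero each input with probability `rate` and rescale the survivors."""
    if not training:
        return inputs                              # nothing is dropped after training
    keep_prob = 1.0 - rate
    mask = rng.random(inputs.shape) < keep_prob    # True where the input is kept
    return inputs * mask / keep_prob               # divide by the keep probability

rng = np.random.default_rng(42)
x = np.ones((1000, 100))
dropped = inverted_dropout(x, rate=0.2, rng=rng)

print(np.unique(dropped))   # only two distinct values: 0 and 1/keep_prob
print(dropped.mean())       # close to 1.0, the mean of the original inputs
```

Dividing by the keep probability keeps the expected input signal the same during training and testing, which is why no weight rescaling is needed afterwards.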

In [35]:
p = 0.2
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.Dropout(rate=p),
    keras.layers.Dense(300, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dropout(rate=p),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dropout(rate=p),
    keras.layers.Dense(10, activation='softmax')
])

**Note** on overfitting, underfitting and using dropout with SELU on pg. 367

#### Monte Carlo (MC) Dropout
An improvement to dropout was made in a [2016 paper](https://homl.info/mcdroput) with two main points:
- The paper established a connection between dropout networks and approximate Bayesian inference, giving dropout a solid mathematical justification
- *MC Dropout* was introduced, which can boost the performance of a trained dropout model without retraining it or modifying it at all. It also provides a much better measure of the uncertainty of the model

Below is its implementation

In [None]:
y_probas = np.stack([model(X_test_scaled, training=True)
                           for sample in range(100)])
y_proba = y_probas.mean(axis=0)

We simply make 100 predictions over the test set (setting training=True to ensure the Dropout layers are active) and stack the predictions; averaging over the first dimension gives the MC Dropout estimate. Page 369 has a lengthy comparison of the Dropout model vs the MC Dropout model. It also discusses the uncertainty in a model's probability estimates.

Note that the number of Monte Carlo samples you take (100 above) is a hyperparameter you can tweak. The more, the better. However inference time will also be increased and above a certain number of samples, we will get little improvement. 
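The stacked predictions can also be used to estimate uncertainty, by looking at their spread across the Monte Carlo samples. The sketch below uses a toy NumPy function as a stand-in for `model(X, training=True)`; the stand-in, its weights and its noise level are entirely made up, only the stacking and the mean/standard deviation computations mirror the approach above.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_stochastic_model(X):
    """Hypothetical stand-in for model(X, training=True): a noisy 3-class softmax."""
    logits = X @ np.array([[2.0, 0.0, -2.0]]) + rng.normal(scale=0.5, size=(len(X), 3))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

X = np.array([[1.0], [0.0]])   # two made-up instances

y_probas = np.stack([toy_stochastic_model(X) for _ in range(100)])  # shape (100, 2, 3)
y_proba = y_probas.mean(axis=0)   # averaged class probabilities per instance
y_std = y_probas.std(axis=0)      # spread across samples: a rough uncertainty measure
y_pred = y_proba.argmax(axis=1)   # predicted class per instance
```

A large standard deviation for a class probability suggests the averaged estimate should not be trusted too much for that instance.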

If your model contains other layers that behave in a special way during training (e.g. batch normalization), you should not force training mode like above. Instead, replace the Dropout layers with the following MCDropout class

In [37]:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

#### Max-Norm Regularization

For each neuron, max-norm regularization constrains the weights $\textbf{w}$ of the incoming connections such that $||\textbf{w}||_2 \leq r$, where $r$ is the max-norm hyperparameter and $||\cdot||_2$ is the $l_2$ norm. This does not add a regularization term to the overall loss function; instead it is typically implemented by computing $||\textbf{w}||_2$ after each training step and rescaling $\textbf{w}$ if needed ($\textbf{w} \leftarrow \textbf{w}\frac{r}{||\textbf{w}||_2 }$). Reducing $r$ increases the amount of regularization and helps reduce overfitting. It can also help alleviate the vanishing/exploding gradients problem (if not using BN).

In [42]:
keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal',
                   kernel_constraint=keras.constraints.max_norm(1.))

<keras.layers.core.Dense at 0x7f3ab03e5af0>
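The rescaling step can be sketched in NumPy for a Dense kernel of shape (inputs, neurons), where each column holds one neuron's incoming weights. The function name `max_norm_rescale`, the small `eps` term and the example kernel are assumptions for illustration.

```python
import numpy as np

def max_norm_rescale(kernel, r=1.0, eps=1e-7):
    """Rescale each column whose l2 norm exceeds r so that its norm becomes r."""
    norms = np.sqrt(np.sum(np.square(kernel), axis=0, keepdims=True))
    desired = np.clip(norms, 0, r)             # norms above r are clipped down to r
    return kernel * desired / (eps + norms)    # columns within the limit are left (almost) unchanged

kernel = np.array([[3.0, 0.3],
                   [4.0, 0.4]])   # made-up weights, column norms 5.0 and 0.5

constrained = max_norm_rescale(kernel, r=1.0)
new_norms = np.linalg.norm(constrained, axis=0)
print(new_norms)  # first column shrunk to norm 1, second column still at 0.5
```

Only the direction-preserving rescaling is applied, so the first neuron's weights keep their 3:4 ratio while their norm drops from 5 to 1.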

# Exercise 8
We'll train various networks on the [CIFAR10](https://keras.io/api/datasets/cifar10/#load_data-function) dataset.

In [3]:
import keras
import tensorflow as tf
import numpy as np
tf.random.set_seed(58)
np.random.seed(58)

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train_full.shape

Using TensorFlow backend.


(50000, 32, 32, 3)

The training set is composed of 50,000 32x32 images with RGB channels.

In [4]:
np.unique(y_test.flatten())

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

We have 10 target classes, labelled 0 through 9. The names of the classes are, in order: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck

In [58]:
train_full_size = X_train_full.shape[0]
val_size = int(train_full_size*0.1)

X_train, y_train = X_train_full[:-val_size], y_train_full[:-val_size]
X_val, y_val = X_train_full[-val_size:], y_train_full[-val_size:]
X_val.shape

(5000, 32, 32, 3)


## a / b)
Build a DNN with 20 hidden layers of 100 neurons each. Use He initialization and the ELU activation function. Using Nadam optimization and early stopping, train the network on the dataset.

In [36]:
# setup for tensorboard
import os
root_logdir = os.path.join(os.curdir, 'my_logs')

def get_run_logdir():
    import time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Run ```tensorboard --logdir=./my_logs --port=6006``` on terminal

In [59]:
from keras.layers import Dense

def build_hidden_layers(inputs, units, n_layers,
                       kernel_initializer='he_normal', activation='elu'):
    h = Dense(units, kernel_initializer=kernel_initializer, activation=activation)(inputs)
    for idx in range(1, n_layers):
        h = Dense(units, kernel_initializer=kernel_initializer, activation=activation)(h)
    return h

In [60]:
from keras.layers import Input, Flatten
from keras.models import Model
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, TensorBoard

units = 100
n_layers = 20
epochs = 50
batch_size = 32
learning_rate = 3e-5

inputs = Input(shape=X_train.shape[1:])
flatten = Flatten()(inputs)
hidden = build_hidden_layers(flatten, units=units, n_layers=n_layers)
outputs = Dense(10, activation='softmax', name='output')(hidden)
model = Model(inputs=inputs, outputs=outputs)

optimizer = Adam(learning_rate=learning_rate)
model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()

Model: "model_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_14 (InputLayer)        (None, 32, 32, 3)         0         
_________________________________________________________________
flatten_14 (Flatten)         (None, 3072)              0         
_________________________________________________________________
dense_140 (Dense)            (None, 100)               307300    
_________________________________________________________________
dense_141 (Dense)            (None, 100)               10100     
_________________________________________________________________
dense_142 (Dense)            (None, 100)               10100     
_________________________________________________________________
dense_143 (Dense)            (None, 100)               10100     
_________________________________________________________________
dense_144 (Dense)            (None, 100)               1010

In [61]:
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
          validation_data=(X_val, y_val),
          callbacks=[TensorBoard(get_run_logdir()), EarlyStopping(patience=5)])

Train on 45000 samples, validate on 5000 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50


<keras.callbacks.callbacks.History at 0x7f21983ff7f0>

In [63]:
model.evaluate(X_test, y_test)



[1.5567552700042724, 0.45100000500679016]

This initial attempt without fine-tuning yields 45% accuracy on the test set. Also note that the training accuracy is 51% and validation accuracy 45%, meaning the model overfit.

## c) Add Batch Normalization and compare the results

In [73]:
from keras.layers import Dense, BatchNormalization, Activation

def build_hidden_layers_bn(inputs, units, n_layers,
                           kernel_initializer='he_normal', activation='elu'):
    inputs = BatchNormalization()(inputs)
    h = Dense(units, kernel_initializer=kernel_initializer)(inputs)
    h = BatchNormalization()(h)
    h = Activation(activation)(h)
    for idx in range(1, n_layers):
        h = Dense(units, kernel_initializer=kernel_initializer)(h)
        h = BatchNormalization()(h)
        h = Activation(activation)(h)
    return h

In [74]:
inputs = Input(shape=X_train.shape[1:])
flatten = Flatten()(inputs)
hidden = build_hidden_layers_bn(flatten, units=units, n_layers=n_layers, activation='elu')
outputs = Dense(10, activation='softmax', name='output')(hidden)
model = Model(inputs=inputs, outputs=outputs)

optimizer = Adam(learning_rate=learning_rate)
model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()

Model: "model_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_18 (InputLayer)        (None, 32, 32, 3)         0         
_________________________________________________________________
flatten_18 (Flatten)         (None, 3072)              0         
_________________________________________________________________
batch_normalization_109 (Bat (None, 3072)              12288     
_________________________________________________________________
dense_182 (Dense)            (None, 100)               307300    
_________________________________________________________________
batch_normalization_110 (Bat (None, 100)               400       
_________________________________________________________________
activation_2 (Activation)    (None, 100)               0         
_________________________________________________________________
dense_183 (Dense)            (None, 100)               101

In [75]:
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
          validation_data=(X_val, y_val),
          callbacks=[TensorBoard(get_run_logdir()), EarlyStopping(patience=5)])

Train on 45000 samples, validate on 5000 samples
Epoch 1/50
  352/45000 [..............................] - ETA: 9:52 - loss: 2.8518 - accuracy: 0.1023 



Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50


<keras.callbacks.callbacks.History at 0x7f217a28d3d0>

In [76]:
model.evaluate(X_test, y_test)



[1.3627665489196776, 0.515999972820282]

The first model took 37 epochs to converge, at around 12s per epoch. With batch normalization the model took 43 epochs to converge, and each epoch took around an extra 9s (about 21s per epoch on average). The BN model nevertheless performs better, with 51.6% accuracy on the test set versus 45.1% for the first model.



## d) 
Replace Batch Normalization with SELU, making the necessary adjustments to ensure the network self-normalizes (i.e. standardize the input features, use LeCun normal initialization, and use a sequential architecture with dense layers only)

In [77]:
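# The Normalization preprocessing layer is not available in this standalone Keras
# install (importing keras.layers.experimental raises ModuleNotFoundError), so we
# standardize the inputs with NumPy instead, using training-set statistics only.
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_val_scaled = (X_val - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

# The model must be trained and evaluated on the *_scaled arrays
inputs = Input(shape=X_train.shape[1:])
flatten = Flatten()(inputs)
hidden = build_hidden_layers(flatten, units=units, n_layers=n_layers, 
                             kernel_initializer='lecun_normal', activation='selu')
outputs = Dense(10, activation='softmax', name='output')(hidden)
model = Model(inputs=inputs, outputs=outputs)

optimizer = Adam(learning_rate=learning_rate)
model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()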