# Training Deep Neural Networks

## Practical Guidelines

<img src="../img/default_dnn.png" width="30%">

**Additional considerations**
- Standardize input features
- Reuse parts of a pre-trained NN that solves a similar problem
- Use unsupervised pretraining if you have a lot of unlabeled data
- If your model self-normalizes: If it overfits the training set, then you should add alpha dropout (and always use early stopping as well). Do not use other regularization methods, or else they would break self-normalization.
- If your model cannot self-normalize (e.g., it is a recurrent net or it contains skip connections):

-- You can try using ELU (or another activation function) instead of SELU; it may perform better. Make sure to change the initialization method accordingly (e.g., He init for ELU or ReLU).

-- If it is a deep network, you should use Batch Normalization after every hidden layer. If it overfits the training set, you can also try using max-norm or l2 regularization.

- If you need a sparse model, you can use l1 regularization (and optionally zero out the tiny weights after training). If you need an even sparser model, you can try using FTRL instead of Nadam optimization, along with l1 regularization. In any case, this will break self-normalization, so you will need to switch to BN if your model is deep.
- If you need a low-latency model (one that performs lightning-fast predictions), you may need to use fewer layers, avoid Batch Normalization, and possibly replace the SELU activation function with the leaky ReLU. Having a sparse model will also help. You may also want to reduce the float precision from 32 bits to 16 bits (or even 8 bits).
- If you are building a risk-sensitive application, or inference latency is not very important in your application, you can use MC Dropout to boost performance and get more reliable probability estimates, along with uncertainty estimates.



**Vanishing / Exploding Gradients Problem:** 
The gradients become too small (vanishing) or too large (exploding) to be propagated properly through the NN during backpropagation.
One solution: add a BatchNormalization layer after each layer, except the output layer.



In [None]:
from tensorflow import keras  # import needed by the `keras.` snippets in these notes

model = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(10, activation="softmax")
])
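
As a rough illustration of what each BatchNormalization layer computes during training, the sketch below normalizes one feature over a mini-batch and then scales and shifts it. The helper name `batch_norm_train` is hypothetical, `gamma` and `beta` stand for the learned scale and offset, and the moving averages that Keras keeps for inference are left out.

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # epsilon avoids a division by zero when the batch variance is tiny
    return [gamma * (x - mean) / math.sqrt(var + epsilon) + beta for x in batch]

activations = [10.0, 20.0, 30.0, 40.0]
normalized = batch_norm_train(activations)
print([round(v, 3) for v in normalized])  # zero mean, roughly unit variance
```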

**WAYS TO SPEED UP TRAINING**:

- Good initialisation strategy for connection weights
- Good activation function
- Batch normalisation / Gradient Clipping
- Reuse parts of a pre-trained network
- Faster optimisers

## Activation Functions
So which activation function should you use for the hidden layers of your deep neural networks? 

In general **SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.** 
- If the network’s architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0). 
- If you care a lot about runtime latency, then you may prefer leaky ReLU. If you don’t want to tweak yet another hyperparameter, you may just use the default α values used by Keras (e.g., 0.3 for the leaky ReLU). 
- If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.

- To use the leaky ReLU activation function, you must create a LeakyReLU instance like this:

> leaky_relu = keras.layers.LeakyReLU(alpha=0.2)

> layer = keras.layers.Dense(10, activation=leaky_relu, kernel_initializer="he_normal")

- For PReLU, just replace LeakyReLU(alpha=0.2) with PReLU().

- For SELU activation, just set activation="selu" and kernel_initializer="lecun_normal" when creating a layer:
> layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")
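
To make the differences between these activation functions concrete, here is a minimal framework-free sketch of their scalar definitions. The function names are illustrative rather than Keras API, and the two SELU constants are the standard published values.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.2):
    # alpha is the slope for z < 0 (the "leak")
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    # smooth exponential curve for z < 0, saturating at -alpha
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def selu(z):
    # a scaled ELU with fixed alpha and scale
    return SELU_SCALE * elu(z, SELU_ALPHA)

for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z), round(elu(z), 4), round(selu(z), 4))
```

Unlike ReLU, the other three keep a nonzero output (and gradient) for negative inputs, which is why they are less prone to "dying" neurons.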

## Gradient Clipping

To mitigate exploding gradients in RNNs, Gradient Clipping is often preferred, because Batch Normalisation is tricky to use in recurrent networks.
To do so, set the clipvalue argument when creating an optimiser:

> optimiser = keras.optimizers.SGD(clipvalue=1.0)  # clip every component of the gradient vector to a value between –1.0 and 1.0


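You can also clip by norm instead of by value, by setting clipnorm instead of clipvalue (e.g., clipnorm=1.0). This rescales the whole gradient whenever its ℓ2 norm exceeds the threshold, so the gradient keeps its direction. If you notice the gradients exploding during training, it is worth trying both options.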
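
The difference between the two kinds of clipping can be sketched without any framework. The helpers below are hypothetical stand-ins for what the `clipvalue` and `clipnorm` optimizer arguments do to a gradient vector.

```python
import math

def clip_by_value(grad, clip_value):
    # clip each component independently: this may change the gradient's direction
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    # rescale the whole vector if its l2 norm is too large: direction is preserved
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

grad = [0.9, 100.0]
print(clip_by_value(grad, 1.0))  # [0.9, 1.0]
print(clip_by_norm(grad, 1.0))   # first component becomes tiny, second close to 1
```

With value clipping, the two components end up with similar magnitudes even though the original gradient pointed almost entirely along the second axis; norm clipping keeps that ratio intact.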

## Faster Optimizers

**Momentum optimization:** you just have to set the *momentum* hyperparameter (0.9 is usually a good value).
>> optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9) 

**Nesterov Accelerated Gradient** is a variation of momentum optimization that measures the gradient slightly ahead, in the direction of the momentum; it sometimes improves performance. To use it, set *nesterov=True*.
>> optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
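
A minimal sketch of the two update rules on a one-dimensional problem may help. It uses the textbook formulation (Keras implements an equivalent but differently arranged update), and the function name and the toy objective f(x) = x² are assumptions made for illustration.

```python
def sgd_momentum(grad_fn, theta, lr=0.001, momentum=0.9, nesterov=False, steps=100):
    m = 0.0  # momentum vector (a scalar here)
    for _ in range(steps):
        # Nesterov evaluates the gradient slightly ahead, at theta + momentum * m
        point = theta + momentum * m if nesterov else theta
        m = momentum * m - lr * grad_fn(point)
        theta = theta + m
    return theta

grad = lambda x: 2.0 * x  # gradient of f(x) = x**2, minimum at x = 0

print(sgd_momentum(grad, 5.0, lr=0.1, steps=300))
print(sgd_momentum(grad, 5.0, lr=0.1, nesterov=True, steps=300))
```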

**AdaGrad** decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes (this is known as an *adaptive learning rate*). It often stops too early when training NNs, because the learning rate gets scaled down too much, but it can be sufficient for simpler tasks such as Linear Regression.

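**RMSProp** is a variation of AdaGrad that only accumulates the gradients from the most recent iterations (using exponential decay), so it does not slow down as quickly and generally performs better. The default value of rho (0.9) usually works well.

>> optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)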

**Adam** (adaptive moment estimation) combines the ideas of Momentum optimisation and RMSProp. Because it is an adaptive learning rate algorithm, it requires less tuning of the learning rate, and the default values usually work well.

>> optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
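
To show how Adam combines the two ideas, here is a hedged scalar sketch of its update rule, with a momentum-like running mean `m` and an RMSProp-like running mean of squared gradients `s`. The function name is hypothetical and the epsilon value is an assumption; this is not the Keras implementation.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.001, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-7, steps=1000):
    m = 0.0  # running mean of gradients (momentum part)
    s = 0.0  # running mean of squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta_1 * m + (1 - beta_1) * g
        s = beta_2 * s + (1 - beta_2) * g * g
        # bias correction: m and s start at 0, so boost them early in training
        m_hat = m / (1 - beta_1 ** t)
        s_hat = s / (1 - beta_2 ** t)
        theta -= lr * m_hat / (math.sqrt(s_hat) + epsilon)
    return theta

grad = lambda x: 2.0 * x  # gradient of f(x) = x**2
print(adam_minimize(grad, 1.0, steps=1))           # first step has size close to lr
print(adam_minimize(grad, 1.0, lr=0.1, steps=500))
```

Note that the first step has a magnitude of about `lr` whatever the scale of the gradient, which is part of why the default learning rate transfers well across problems.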




## Learning Schedules

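A fixed learning rate is rarely optimal. To find a reasonable starting value, begin with a high learning rate and divide it by 3 until training stops diverging; a learning schedule can then reduce it as training progresses.

Note: you can also apply a learning schedule through a callback (e.g., `tf.keras.callbacks.LearningRateScheduler`). 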

Other techniques:

| Learning Rate Schedule  | Use Case                                         | Implementation Example                                                                                       |
|--------------------------|-------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
| **Constant Learning Rate** | Default choice for many models. Use when unsure about dynamic adjustments. | `optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)`                                                |
| **Time-based Decay**      | Reduce learning rate over time (e.g., large datasets). | `lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(initial_learning_rate=0.001, decay_steps=10000, decay_rate=0.5)`<br>`optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)` |
| **Step Decay**            | Reduce learning rate at specific epochs.       | `lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay([10000, 20000], [0.001, 0.0005, 0.0001])`<br>`optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)` |
| **Exponential Decay**     | Gradually reduce the learning rate exponentially. | `lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=0.001, decay_rate=0.96, decay_steps=10000)`<br>`optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)` |
| **Cosine Decay**          | Smoothly decrease the learning rate along a cosine curve over `decay_steps`. | `lr_schedule = tf.keras.optimizers.schedules.CosineDecay(initial_learning_rate=0.001, decay_steps=10000)`<br>`optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)` |
| **Cosine Decay with Warm Restarts** | Periodically reset the learning rate to its initial value, then decay it again along a cosine curve. | `lr_schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(initial_learning_rate=0.001, first_decay_steps=1000)`<br>`optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)` |
| **Learning Rate Finder** | Find an optimal learning rate by experimenting. | Use `tf.keras.callbacks.LearningRateScheduler` with a custom function to adjust the learning rate dynamically. |
| **Reduce on Plateau**    | Automatically reduce the learning rate when a metric (e.g., loss) stops improving. | `callback = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10)`<br>`model.fit(..., callbacks=[callback])` |
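
To see what some of the schedules in the table compute, the sketch below reimplements four of them as plain functions of the training step, using the same hyperparameter values as the table. The function names are hypothetical; the formulas follow the documented behaviour of the corresponding Keras schedules with their default options (no staircase, `alpha=0`).

```python
import math

def exponential_decay(step, initial_lr=0.001, decay_rate=0.96, decay_steps=10000):
    return initial_lr * decay_rate ** (step / decay_steps)

def inverse_time_decay(step, initial_lr=0.001, decay_rate=0.5, decay_steps=10000):
    return initial_lr / (1.0 + decay_rate * step / decay_steps)

def piecewise_constant(step, boundaries=(10000, 20000),
                       values=(0.001, 0.0005, 0.0001)):
    # values[i] applies up to and including boundaries[i]
    for boundary, value in zip(boundaries, values):
        if step <= boundary:
            return value
    return values[-1]

def cosine_decay(step, initial_lr=0.001, decay_steps=10000):
    # follows half a cosine wave from initial_lr down to 0, then stays there
    progress = min(step, decay_steps) / decay_steps
    return initial_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 5000, 10000, 20000):
    print(step, exponential_decay(step), inverse_time_decay(step),
          piecewise_constant(step), cosine_decay(step))
```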

---

### Notes
- Replace `tf.keras.optimizers.Adam` with any optimizer you're using.
- Learning rate schedules help improve model performance and convergence.
- Start simple (e.g., constant learning rate) and experiment with others as needed.

## Avoiding Overfitting Through Regularisation

### L1 & L2 
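- You should apply the regularizer to each layer you want to regularize (typically every hidden layer).
- To avoid repeating the same arguments for every layer, you can use `functools.partial`, as in the second example below.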


In [None]:
# example of l2 regularization (use keras.regularizers.l1() for l1, or l1_l2() for both)
layer = keras.layers.Dense(100, activation="elu",
                               kernel_initializer="he_normal",
                               kernel_regularizer=keras.regularizers.l2(0.01))

In [None]:
# example of how you can implement regularisation in all layers

from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                                activation="elu",
                                kernel_initializer="he_normal",
                                kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
         keras.layers.Flatten(input_shape=[28, 28]),
         RegularizedDense(300),
         RegularizedDense(100),
         RegularizedDense(10, activation="softmax",
                          kernel_initializer="glorot_uniform")
])
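
As a rough sketch of what the regularizer contributes, the penalty added to the loss for one layer is just a scaled sum over that layer's weights. The helper names below are illustrative, and the factor 0.01 matches the examples above.

```python
def l2_penalty(weights, factor=0.01):
    # factor * sum of squared weights: shrinks all weights smoothly
    return factor * sum(w * w for w in weights)

def l1_penalty(weights, factor=0.01):
    # factor * sum of absolute weights: pushes small weights to exactly zero
    return factor * sum(abs(w) for w in weights)

layer_weights = [1.0, -2.0, 3.0]
print(l2_penalty(layer_weights), l1_penalty(layer_weights))
```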

### Dropout

This is a popular technique. At every training step, every neuron (excluding the output neurons) has a probability p of being temporarily dropped out. p is called the dropout rate; it is typically set between 10% and 50%.

- Dropout is only applied during training, *so comparing the training loss with the validation loss can be misleading*. Make sure to evaluate the training loss without dropout (i.e., after training) in this case.

- If you see the model is overfitting, you can increase the dropout rate; if it is underfitting, decrease it.

- You can also implement dropout only after the last hidden layer (this is common). 

- It can slow the training, but the performance usually pays off. 

- If you want to regularize a self-normalizing network based on the SELU activation function, use AlphaDropout.
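
The mechanics described above can be sketched in a few lines: during training each activation is zeroed with probability `rate` and the survivors are scaled up so the expected sum stays the same, while at inference the layer does nothing. The `dropout` helper is a hypothetical illustration, not the Keras layer.

```python
import random

def dropout(activations, rate, training, rng=random):
    if not training or rate == 0.0:
        return list(activations)  # inference: dropout is a no-op
    keep_prob = 1.0 - rate
    # drop each unit with probability `rate`; scale the kept ones by 1 / keep_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(42)
layer_output = [1.0] * 10000
dropped = dropout(layer_output, rate=0.2, training=True, rng=rng)
print(sum(dropped) / len(dropped))   # close to 1.0 on average
print(dropped.count(0.0) / len(dropped))  # close to the dropout rate
```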

In [None]:
# example of dropout implementation
model = keras.models.Sequential([
         keras.layers.Flatten(input_shape=[28, 28]),
         keras.layers.Dropout(rate=0.2),
         keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
         keras.layers.Dropout(rate=0.2),
         keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
         keras.layers.Dropout(rate=0.2),
         keras.layers.Dense(10, activation="softmax")
])

### Monte-Carlo Dropout

It can be applied to a model without the need to retrain it.

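- The number of samples you use (100 in the example below) is a hyperparameter you can tweak. The more samples, the more accurate the predictions and their uncertainty estimates, but inference time grows proportionally, and beyond a certain number of samples the improvement is negligible.

- What the code does is generate 100 predictions for every instance in the test set (with dropout active), stack them, and then average them. 

- If the model contains other layers that behave differently during training (such as BatchNormalization layers), you should not force training mode as in the code below; one option is to replace the Dropout layers with a subclass whose call() method always uses training=True.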

In [None]:
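# example of Monte Carlo dropout
import numpy as np

# calling the model with training=True keeps dropout active at inference time
# (keras.backend.learning_phase_scope is not available in recent Keras versions)
y_probas = np.stack([model(X_test_scaled, training=True)
                     for sample in range(100)])

y_proba = y_probas.mean(axis=0)  # average over the 100 stochastic predictions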
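
The same stack-and-average idea can be illustrated without a trained network. In this sketch, `toy_predict` is a hypothetical stand-in for a model with dropout left on (a weighted sum whose inputs are randomly dropped), and `mc_dropout_predict` returns both the mean prediction and its spread, which serves as a crude uncertainty estimate.

```python
import random
import statistics

def mc_dropout_predict(stochastic_predict, x, n_samples=100):
    """Average several stochastic forward passes and report their spread."""
    samples = [stochastic_predict(x) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)

rng = random.Random(0)
weights = [0.5, -0.2, 0.8, 0.1]  # made-up weights for the toy model

def toy_predict(x, rate=0.2):
    keep_prob = 1.0 - rate
    # each input is dropped with probability `rate`, survivors are rescaled
    return sum(w * xi / keep_prob
               for w, xi in zip(weights, x) if rng.random() < keep_prob)

mean, std = mc_dropout_predict(toy_predict, [1.0, 2.0, 3.0, 4.0], n_samples=1000)
print(mean, std)  # the mean approaches the deterministic output (2.9)
```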


### Max-Norm Regularisation

To implement max-norm regularization in Keras, just set every hidden layer’s kernel_constraint argument to a max_norm() constraint, with the appropriate max value, for example:

In [None]:
keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                       kernel_constraint=keras.constraints.max_norm(1.))
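
Conceptually, the constraint is applied after each training step: if the ℓ2 norm of a neuron's incoming weight vector exceeds the max value, the vector is rescaled back onto that norm. The helper below is a hypothetical sketch of this rescaling for a single weight vector.

```python
import math

def apply_max_norm(weights, max_value=1.0):
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)  # already within the constraint
    # rescale so that the l2 norm equals max_value
    return [w * max_value / norm for w in weights]

print(apply_max_norm([3.0, 4.0]))   # norm 5 is rescaled to norm 1
print(apply_max_norm([0.3, 0.4]))   # norm 0.5 is left unchanged
```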

## NOTES

- Glorot initialization and He initialization were designed to make the output standard deviation as close as possible to the input standard deviation, at least at the beginning of training. This reduces the vanishing/exploding gradients problem.

- **All weights should be sampled independently; they should not all have the same initial value.** One important goal of sampling weights randomly is to break symmetry: if all the weights have the same initial value, even if that value is not zero, then symmetry is not broken (i.e., all neurons in a given layer are equivalent), and backpropagation will be unable to break it. Concretely, this means that all the neurons in any given layer will always have the same weights. It's like having just one neuron per layer, and much slower. It is virtually impossible for such a configuration to converge to a good solution.

- **It is perfectly fine to initialize the bias terms to zero.** Some people like to initialize them just like weights, and that's OK too; it does not make much difference.

- ReLU is usually a good default for the hidden layers, as it is fast and yields good results. Its ability to output precisely zero can also be useful in some cases (e.g., see Chapter 17). Moreover, it can sometimes benefit from optimized implementations as well as from hardware acceleration. 

- The leaky ReLU variants of ReLU can improve the model's quality without hindering its speed too much compared to ReLU. For large neural nets and more complex problems, GLU, Swish and Mish can give you a slightly higher quality model, but they have a computational cost. 

- The hyperbolic tangent (tanh) can be useful in the output layer if you need to output a number in a fixed range (by default between –1 and 1), but nowadays it is not used much in hidden layers, except in recurrent nets. 

- The sigmoid activation function is also useful in the output layer when you need to estimate a probability (e.g., for binary classification), but it is rarely used in hidden layers (there are exceptions, for example the coding layer of variational autoencoders).

- The softplus activation function is useful in the output layer when you need to ensure that the output will always be positive. 

- The softmax activation function is useful in the output layer to estimate probabilities for mutually exclusive classes, but it is rarely (if ever) used in hidden layers.

- If you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer, then the algorithm will likely pick up a lot of speed, hopefully moving roughly toward the global minimum, but its momentum will carry it right past the minimum. Then it will slow down and come back, accelerate again, overshoot again, and so on. It may oscillate this way many times before converging, so overall it will take much longer to converge than with a smaller momentum value.

- One way to produce a sparse model (i.e., with most weights equal to zero) is to train the model normally, then zero out tiny weights. For more sparsity, you can apply ℓ1 regularization during training, which pushes the optimizer toward sparsity. A third option is to use the TensorFlow Model Optimization Toolkit.

- Dropout does slow down training, in general roughly by a factor of two. However, it has no impact on inference speed since it is only turned on during training. MC Dropout is exactly like dropout during training, but it is still active during inference, so each inference is slowed down slightly. More importantly, when using MC Dropout you generally want to run inference 10 times or more to get better predictions. This means that making predictions is slowed down by a factor of 10 or more.

## PRACTICE 1


In [1]:
# Build a DNN with 20 hidden layers of 100 neurons each, He initialization, and the Swish activation function.
import tensorflow as tf

tf.random.set_seed(42) # ensure reproducibility

model = tf.keras.models.Sequential()

# input layer
model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))

# hidden layers
for i in range(20):
    model.add(tf.keras.layers.Dense(100, # 100 neurons
                                    activation="swish", #activation function
                                    kernel_initializer="he_normal")) # He initialization



**Exercise:** Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset. 

The dataset is composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for testing) with 10 classes, so you'll need a softmax output layer with 10 neurons. 

Remember to search for the right learning rate each time you change the model's architecture or hyperparameters.

In [2]:
# add output layer
model.add(tf.keras.layers.Dense(10, activation="softmax")) # 10 neurons for 10 classes


In [3]:
# set Nadam optimizer
optimizer = tf.keras.optimizers.Nadam(learning_rate=5e-5)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

In [4]:
# loads CIFAR-10 dataset

cifar10 = tf.keras.datasets.cifar10.load_data()

(X_train_full, y_train_full), (X_test, y_test) = cifar10

X_train = X_train_full[5000:]
y_train = y_train_full[5000:]
X_valid = X_train_full[:5000]
y_valid = y_train_full[:5000]


Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170498071/170498071 ━━━━━━━━━━━━━━━━━━━━ 9s 0us/step


In [5]:
# create callbacks: early stopping, checkpointing of the best model, and TensorBoard logging
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model.keras", save_best_only=True)

run_index = 1 # increment every time you train the model
# set a root log directory for TensorBoard
import os
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_{:03d}".format(run_index))
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir)

callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]
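
The behaviour of the early stopping callback with a given `patience` can be sketched as a small function over a list of validation losses. The function name and the sample losses are made up for illustration; with `restore_best_weights=True`, the model would be rolled back to the weights of the returned best epoch.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # no improvement for `patience` epochs in a row: stop here
            return epoch, best_epoch
    return len(val_losses), best_epoch

losses = [2.1, 1.9, 1.8, 1.85, 1.9, 1.95, 1.7]
print(early_stopping_epoch(losses, patience=2))  # (5, 3)
print(early_stopping_epoch(losses, patience=5))  # (7, 7)
```

A small patience stops before the late improvement at epoch 7, which is why the cells in this section use a fairly generous value of 20.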

In [6]:
%load_ext tensorboard
%tensorboard --logdir=./my_cifar10_logs --port=6006


In [7]:
model.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid), callbacks=callbacks)

Epoch 1/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 18s 9ms/step - accuracy: 0.1349 - loss: 9.6913 - val_accuracy: 0.2338 - val_loss: 2.1182
Epoch 2/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 15s 11ms/step - accuracy: 0.2315 - loss: 2.0920 - val_accuracy: 0.2730 - val_loss: 1.9874
Epoch 3/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 16s 11ms/step - accuracy: 0.2766 - loss: 1.9717 - val_accuracy: 0.3080 - val_loss: 1.8826
Epoch 4/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 18s 13ms/step - accuracy: 0.3097 - loss: 1.8935 - val_accuracy: 0.3168 - val_loss: 1.8736
Epoch 5/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 18s 12ms/step - accuracy: 0.3329 - loss: 1.8328 - val_accuracy: 0.3438 - val_loss: 1.8077
Epoch 6/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 19s 13ms/step - accuracy: 0.3540 - loss: 1.7777 - val_accuracy: 0.3648 - val_loss: 1.7487

<keras.src.callbacks.history.History at 0x149ae4a00>

In [8]:
# evaluate the model
model.evaluate(X_valid, y_valid)

157/157 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.4407 - loss: 1.5426


[1.5400021076202393, 0.44859999418258667]

Now try adding Batch Normalization and compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?

In [10]:
tf.random.set_seed(42)

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(tf.keras.layers.Dense(100, kernel_initializer="he_normal"))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation("swish"))

model.add(tf.keras.layers.Dense(10, activation="softmax"))

optimizer = tf.keras.optimizers.Nadam(learning_rate=5e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=20,
                                                     restore_best_weights=True)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_bn_model.keras",
                                                         save_best_only=True)
run_index = 1 # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_normalisation_logs", "run_{:03d}".format(run_index))
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

model.evaluate(X_valid, y_valid)

Epoch 1/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 37s 18ms/step - accuracy: 0.1886 - loss: 2.2063 - val_accuracy: 0.2716 - val_loss: 2.0190
Epoch 2/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 24s 17ms/step - accuracy: 0.3430 - loss: 1.8159 - val_accuracy: 0.3140 - val_loss: 1.8905
Epoch 3/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 24s 17ms/step - accuracy: 0.3946 - loss: 1.6836 - val_accuracy: 0.3318 - val_loss: 1.8197
Epoch 4/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 24s 17ms/step - accuracy: 0.4322 - loss: 1.5913 - val_accuracy: 0.3348 - val_loss: 1.8526
Epoch 5/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 27s 19ms/step - accuracy: 0.4626 - loss: 1.5192 - val_accuracy: 0.4040 - val_loss: 1.6635
Epoch 6/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 26s 19ms/step - accuracy: 0.4918 - loss: 1.4456 - val_accuracy: 0.4156 - val_loss: 1.645

[1.6451835632324219, 0.4156000018119812]

- *Is the model converging faster than before?*
 Much faster! The previous model took 29 epochs to reach the lowest validation loss, while the new model achieved that same loss in just 12 epochs and continued to make progress until the 17th epoch. The BN layers stabilized training and allowed us to use a much larger learning rate, so convergence was faster.

- *Does BN produce a better model?*
 Yes! The final model is also much better, with 50.7% validation accuracy instead of 46.7%. It's still not a very good model, but at least it's much better than before (a Convolutional Neural Network would do much better, but that's a different topic, see chapter 14).

- *How does BN affect training speed?*
 Although the model converged much faster, each epoch took about 15s instead of 10s, because of the extra computations required by the BN layers. But overall the training time (wall time) to reach the best model was shortened by about 10%.

**Exercise:** Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).

In [11]:
tf.random.set_seed(42)

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(tf.keras.layers.Dense(100,
                                    kernel_initializer="lecun_normal",
                                    activation="selu"))

model.add(tf.keras.layers.Dense(10, activation="softmax"))

optimizer = tf.keras.optimizers.Nadam(learning_rate=7e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    patience=20, restore_best_weights=True)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "my_cifar10_selu_model.keras", save_best_only=True)
run_index = 1 # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_selu_logs", "run_{:03d}".format(run_index))

tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

X_means = X_train.mean(axis=0)
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)

model.evaluate(X_valid_scaled, y_valid)

Epoch 1/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 23s 13ms/step - accuracy: 0.2824 - loss: 2.0171 - val_accuracy: 0.3724 - val_loss: 1.7766
Epoch 2/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 17s 12ms/step - accuracy: 0.3915 - loss: 1.7181 - val_accuracy: 0.4300 - val_loss: 1.6458
Epoch 3/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 17s 12ms/step - accuracy: 0.4270 - loss: 1.6222 - val_accuracy: 0.4438 - val_loss: 1.6002
Epoch 4/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 16s 12ms/step - accuracy: 0.4537 - loss: 1.5540 - val_accuracy: 0.4502 - val_loss: 1.6028
Epoch 5/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 17s 12ms/step - accuracy: 0.4761 - loss: 1.4978 - val_accuracy: 0.4634 - val_loss: 1.5596
Epoch 6/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 17s 12ms/step - accuracy: 0.4980 - loss: 1.4399 - val_accuracy: 0.4714 - val_loss: 1.558

[1.5177444219589233, 0.4860000014305115]

**Exercise:** Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.

Warning: there are now two versions of AlphaDropout. One is deprecated and also broken in some recent versions of TF, and unfortunately that's the version in the tensorflow library. Luckily, there's a perfectly fine version in the keras library (i.e., keras, not tf.keras). It's neither deprecated nor broken, so let's import and use that one:

In [13]:
import keras.layers

tf.random.set_seed(42)

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(tf.keras.layers.Dense(100,
                                    kernel_initializer="lecun_normal",
                                    activation="selu"))

model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(tf.keras.layers.Dense(10, activation="softmax"))

optimizer = tf.keras.optimizers.Nadam(learning_rate=5e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    patience=20, restore_best_weights=True)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "my_cifar10_alpha_dropout_model.keras", save_best_only=True)
run_index = 1 # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_dropout_logs", "run_{:03d}".format(run_index))
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

X_means = X_train.mean(axis=0)
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)

model.evaluate(X_valid_scaled, y_valid)

Epoch 1/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 19s 10ms/step - accuracy: 0.2818 - loss: 2.0511 - val_accuracy: 0.3912 - val_loss: 1.7355
Epoch 2/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 15s 11ms/step - accuracy: 0.4002 - loss: 1.6964 - val_accuracy: 0.4392 - val_loss: 1.6386
Epoch 3/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 16s 12ms/step - accuracy: 0.4397 - loss: 1.5966 - val_accuracy: 0.4534 - val_loss: 1.6257
Epoch 4/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 18s 13ms/step - accuracy: 0.4662 - loss: 1.5295 - val_accuracy: 0.4714 - val_loss: 1.6026
Epoch 5/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 17s 12ms/step - accuracy: 0.4916 - loss: 1.4606 - val_accuracy: 0.4724 - val_loss: 1.6307
Epoch 6/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 17s 12ms/step - accuracy: 0.5145 - loss: 1.4089 - val_accuracy: 0.4794 - val_loss: 1.587

[1.5872766971588135, 0.47940000891685486]