Answers
-------
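1. Both are weight initialization techniques that scale the initial weights of a NN to each layer's size: He
  initialization uses the fan-in, Glorot initialization the average of fan-in and fan-out. The purpose is to speed up
  training and to mitigate vanishing and exploding gradients. That problem arises when the variance of a layer's outputs
  differs greatly from the variance of its inputs (and likewise for the gradients flowing backward), which these
  initialization techniques attempt to fix.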
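
To make the variance argument concrete, the sketch below pushes random inputs through a stack of ReLU layers and compares He initialization with a fixed small Gaussian scale. The depth, width, the 0.01 scale, and the `forward_std` helper are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def forward_std(weight_std_fn, n_layers=20, width=100, n_samples=1000, seed=0):
    """Return the std of the activations after each ReLU layer (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_samples, width))
    stds = []
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        a = np.maximum(a @ W, 0.0)  # ReLU
        stds.append(float(a.std()))
    return stds

he = forward_std(lambda fan_in: np.sqrt(2.0 / fan_in))  # He: Var(w) = 2 / fan_in
naive = forward_std(lambda fan_in: 0.01)                # fixed small scale, ignores layer size

print(f"He init, last-layer std:    {he[-1]:.3e}")
print(f"Naive init, last-layer std: {naive[-1]:.3e}")
```

With He initialization the activation scale should stay of order 1 through all 20 layers, while the fixed small scale shrinks the signal layer after layer until it is numerically negligible.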

2. No, this would defeat the purpose of the initialization. The network won't be able to learn properly since there is
  no variability between the weight values, so symmetry is never broken. All neurons in a given layer will always
  compute the same output and receive the same gradient updates (symmetry), which is equivalent to training a network
  with a single neuron per layer.
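
A small NumPy sketch of the symmetry problem: a two-layer network whose weights all start at the same value is trained with plain gradient descent, and the hidden units never diverge from one another. The toy data, layer sizes, and learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))   # 8 samples, 3 features
y = rng.standard_normal((8, 1))

W1 = np.full((3, 4), 0.5)         # every hidden weight starts identical
W2 = np.full((4, 1), 0.5)

for _ in range(100):              # plain gradient descent on the squared error
    h = np.tanh(X @ W1)
    err = h @ W2 - y
    grad_W2 = h.T @ err / len(X)
    grad_h = err @ W2.T * (1 - h**2)
    grad_W1 = X.T @ grad_h / len(X)
    W1 -= 0.1 * grad_W1
    W2 -= 0.1 * grad_W2

# The weights have moved, but all four hidden units still share one weight vector.
print(np.allclose(W1, W1[:, :1]))
```

Because every hidden unit sees the same inputs and receives the same gradient, the columns of `W1` remain equal after training, so the layer behaves like a single neuron.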

3. It's OK to initialize the biases to 0. It makes little difference, since the random weights already break the symmetry.

4. Depending on the problem and the other techniques used for training:
  - Sigmoid and tanh
    * Both saturate for large |z|, which makes them prone to vanishing gradients, so it helps to pair them with
      batch normalization.
    * They also need an exponential, which makes them more expensive to compute and can slow down larger networks.
  - ReLU
    * Fast, but risks creating dead neurons (once a neuron's pre-activation is negative for every input, it outputs 0
      and its gradient is 0, so it stops learning).
    * Leaky ReLU fixes this problem by keeping a small slope for negative inputs.
    * Good for smaller networks
  - Swish
    * Good activation function for larger networks
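
The saturation and dead-neuron points can be checked numerically. The sketch below estimates the slope of each activation at a strongly negative, a small positive, and a strongly positive input; the leaky ReLU slope of 0.01 and the probe points are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

activations = {
    "sigmoid":    sigmoid,
    "tanh":       np.tanh,
    "relu":       lambda z: np.maximum(z, 0.0),
    "leaky_relu": lambda z: np.where(z > 0, z, 0.01 * z),  # slope 0.01 is an arbitrary choice
    "swish":      lambda z: z * sigmoid(z),
}

def slope(f, z, eps=1e-5):
    """Central-difference estimate of f'(z)."""
    return float((f(z + eps) - f(z - eps)) / (2 * eps))

print(f"{'':<11}  z=-5.0    z=0.5    z=5.0")
for name, f in activations.items():
    grads = [slope(f, z) for z in (-5.0, 0.5, 5.0)]
    print(f"{name:<11}" + " ".join(f"{g:8.4f}" for g in grads))
```

Sigmoid and tanh have near-zero slopes at both extremes, ReLU has exactly zero slope for negative inputs, and leaky ReLU and Swish keep a non-zero slope there.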

5. If we set the momentum hyperparameter β closer to 1, the gradients from previous steps are taken into account more,
  and the current gradient has relatively less influence on each update. The updates can then accumulate so much speed
  that the optimizer overshoots the minimum and oscillates around it before settling.
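
The overshoot is easy to reproduce on a one-dimensional quadratic. This sketch minimizes f(w) = w**2 with the update m = beta * m - lr * grad, w = w + m, and counts how often w crosses the minimum at 0; the learning rate, starting point, and step count are arbitrary illustrative values.

```python
def momentum_descent(beta, lr=0.1, steps=30, w0=5.0):
    """Minimize f(w) = w**2 with momentum and return the trajectory of w."""
    w, m = w0, 0.0
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w             # gradient of w**2
        m = beta * m - lr * grad   # momentum vector accumulates past gradients
        w = w + m
        path.append(w)
    return path

for beta in (0.0, 0.5, 0.9, 0.99):
    path = momentum_descent(beta)
    crossings = sum(1 for a, b in zip(path, path[1:]) if a * b < 0)
    print(f"beta={beta:<4}  final w={path[-1]:+.4f}  sign changes={crossings}")
```

With `beta=0.0` the iterate approaches 0 from one side only; with larger `beta` it shoots past 0 and swings back and forth, and the swings die out more slowly as `beta` approaches 1.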

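6. 3 ways to produce a sparse model:
  - l1 regularization, which pushes as many weights as possible toward zero during training
  - Train the model normally, then set the smallest weights to zero (magnitude pruning)
  - The TensorFlow Model Optimization Toolkit, which prunes weights gradually during training
  Dropout does not belong on this list: it zeroes activations at random during training only, and the trained
  weights stay dense.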
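
A minimal sketch of magnitude pruning, one common way to sparsify a trained model: weights whose magnitude falls below a threshold are set to zero. The `prune_by_magnitude` helper, the random stand-in weight matrix, and the 90% sparsity target are illustrative assumptions.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out (roughly) the smallest-magnitude fraction `sparsity` of the weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(42)
W = rng.standard_normal((100, 100))   # stand-in for a trained weight matrix
W_sparse = prune_by_magnitude(W, sparsity=0.9)

print(f"fraction of zeros: {np.mean(W_sparse == 0):.2f}")
```

In practice the pruned model is usually fine-tuned afterwards to recover accuracy, and the zeros only save memory or time if the storage format or runtime exploits them.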

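7. Dropout does slow down training: convergence usually takes roughly twice as many iterations. It has no effect on
  inference speed, because dropout is only active during training, as long as the surviving activations are rescaled
  during training (the alternative is to scale by the keep probability at inference instead). With MC dropout,
  inference is slowed down as well, roughly in proportion to the number of stochastic forward passes.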
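
The scaling mentioned above can be sketched in a few lines of NumPy. This is "inverted" dropout: units are dropped with probability `rate` and the survivors are divided by `1 - rate`, so the expected activation is unchanged and nothing needs rescaling at inference. The `dropout_train` helper and the all-ones input are illustrative assumptions.

```python
import numpy as np

def dropout_train(a, rate, rng):
    """Inverted dropout: drop units with probability `rate`, rescale the survivors."""
    keep = rng.random(a.shape) >= rate
    return a * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((10_000, 50))
out = dropout_train(a, rate=0.25, rng=rng)

print(f"mean activation without dropout: {a.mean():.3f}")
print(f"mean activation with dropout:    {out.mean():.3f}")
print(f"fraction of units dropped:       {np.mean(out == 0):.3f}")
```

The two means should agree closely, which is why a layer trained this way can simply be switched off at inference.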

In [79]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

In [52]:
# Load the CIFAR-10 dataset
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Normalize the pixel values to a range between 0 and 1
X_train, X_test = X_train / 255.0, X_test / 255.0

### Exercise 8.a

In [77]:
def build_dnn_model(input_shape, num_hidden_layers, num_neurons_hidden,
                    output_num, optimizer, loss, use_batch_norm=False, use_dropout=False):
  """Build and compile a dense classifier with He-initialized Swish hidden layers."""
  model = tf.keras.models.Sequential()
  model.add(tf.keras.layers.Flatten(input_shape=input_shape))

  for i in range(num_hidden_layers):
    model.add(tf.keras.layers.Dense(num_neurons_hidden, kernel_initializer="he_normal"))
    # Dropout is applied only in the last two hidden layers, on the pre-activations
    if use_dropout and num_hidden_layers - i < 3:
      model.add(tf.keras.layers.Dropout(rate=0.25))
    # Batch normalization goes before the activation
    if use_batch_norm:
      model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation("swish"))

  # Extra batch normalization on the output of the last hidden layer
  if use_batch_norm:
    model.add(tf.keras.layers.BatchNormalization())

  model.add(tf.keras.layers.Dense(output_num, activation="softmax"))
  model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

  return model

def fit_dnn_model(model, X_train, y_train, epochs=20, validation_split=0.2, callbacks=None):
  # callbacks defaults to None rather than a mutable [] shared across calls
  return model.fit(X_train, y_train,
                   epochs=epochs,
                   validation_split=validation_split,
                   callbacks=callbacks)



### Exercise 8.b

In [69]:
input_shape = list(X_train.shape[1:])
optimizer = tf.keras.optimizers.legacy.Nadam()
model = build_dnn_model(input_shape, 20, 100, 10, optimizer, "sparse_categorical_crossentropy")

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model", save_best_only=True)

history = fit_dnn_model(model, X_train, y_train, epochs=20, validation_split=0.2, callbacks=[early_stopping_cb, model_checkpoint_cb])

Epoch 1/20 through Epoch 20/20 (all 20 epochs ran)
INFO:tensorflow:Assets written to: my_cifar10_model/assets  (logged after epochs 1, 2, 3, 4, 5, 9, 13, 16)


### Exercise 8.c

In [73]:
input_shape = list(X_train.shape[1:])
optimizer = tf.keras.optimizers.legacy.Nadam()
model = build_dnn_model(input_shape, 20, 100, 10, optimizer, "sparse_categorical_crossentropy", use_batch_norm=True)

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model", save_best_only=True)

history = fit_dnn_model(model, X_train, y_train, epochs=20, validation_split=0.2, 
                        callbacks=[early_stopping_cb, model_checkpoint_cb])

Epoch 1/20 through Epoch 18/20 (the log ends after epoch 18)
INFO:tensorflow:Assets written to: my_cifar10_model/assets  (logged after epochs 1, 2, 4, 6, 8, 13)


In [78]:
input_shape = list(X_train.shape[1:])
optimizer = tf.keras.optimizers.legacy.Nadam()
model = build_dnn_model(input_shape, 20, 100, 10, optimizer, "sparse_categorical_crossentropy", 
                        use_batch_norm=True, use_dropout=True)

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model", save_best_only=True)

history = fit_dnn_model(model, X_train, y_train, epochs=20, validation_split=0.2, 
                        callbacks=[early_stopping_cb, model_checkpoint_cb])

Epoch 1/20 through Epoch 18/20 (the log ends after epoch 18)
INFO:tensorflow:Assets written to: my_cifar10_model/assets  (logged after epochs 1, 2, 5, 6, 7, 8, 10, 11, 13)


In [86]:
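# MC dropout
# training=True keeps the Dropout layers active at inference, so each pass gives a different prediction.
# Caveat: it also puts the BatchNormalization layers in training mode (batch statistics instead of
# moving averages), so the result reflects both effects.
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])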
y_proba = y_probas.mean(axis=0)
y_pred = np.argmax(y_proba, axis=1)

In [91]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.5207
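
The averaging step of MC dropout can be illustrated without Keras. The sketch below runs 100 stochastic forward passes through a toy two-layer network with dropout left on, then averages the class probabilities; the random weights stand in for a trained model, and `stochastic_forward`, the layer sizes, and the dropout rate are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def stochastic_forward(X, W1, W2, rate, rng):
    """One forward pass of a toy two-layer network with dropout kept active."""
    h = np.maximum(X @ W1, 0.0)
    keep = rng.random(h.shape) >= rate
    h = h * keep / (1.0 - rate)
    return softmax(h @ W2)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))            # 5 toy inputs
W1 = rng.standard_normal((8, 16)) * 0.5    # random stand-in weights, not a trained model
W2 = rng.standard_normal((16, 3)) * 0.5

y_probas = np.stack([stochastic_forward(X, W1, W2, 0.25, rng) for _ in range(100)])
y_proba = y_probas.mean(axis=0)   # MC estimate of the class probabilities
y_std = y_probas.std(axis=0)      # spread across passes: a rough uncertainty signal
y_pred = y_proba.argmax(axis=1)

print(y_probas.shape, y_proba.shape, y_pred.shape)
```

Besides the averaged prediction, the per-class standard deviation across passes gives a rough indication of how confident the model is on each input.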