# Chapter 11: Training Deep Neural Nets 

## Exercise 1
Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

No. If you initialize all the weights to the same value, even one drawn using He initialization, you won't break the symmetry within each layer: every neuron in a layer computes the same output and receives the same gradient, so the neural network behaves as if it had just one neuron per layer.
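
The symmetry argument can be checked numerically. The following is a minimal NumPy sketch, not the chapter's TensorFlow code: the tiny 4-3-1 tanh network, the random data, and the helper name `hidden_grad` are all assumptions made for illustration. It compares the gradient columns (one per hidden neuron) under a shared constant initialization and under independent random draws.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))           # 8 instances, 4 features
y = rng.normal(size=(8, 1))           # regression targets

def hidden_grad(W1, W2):
    """Gradient of 0.5 * MSE w.r.t. W1 for a 4-3-1 tanh network (hypothetical helper)."""
    H = np.tanh(X @ W1)               # hidden activations, shape (8, 3)
    err = H @ W2 - y                  # output error, shape (8, 1)
    dH = (err @ W2.T) * (1 - H ** 2)  # backpropagate through tanh
    return X.T @ dH / len(X)          # shape (4, 3): one column per hidden neuron

# One random constant shared by every weight (scaled as He initialization would scale it)
c = rng.normal() * np.sqrt(2 / 4)
g_same = hidden_grad(np.full((4, 3), c), np.full((3, 1), c))

# Independent random draw for every weight
g_rand = hidden_grad(rng.normal(size=(4, 3)) * np.sqrt(2 / 4),
                     rng.normal(size=(3, 1)))

same_columns_equal = np.allclose(g_same, g_same[:, :1])
rand_columns_equal = np.allclose(g_rand, g_rand[:, :1])
print("identical gradients with shared constant:", same_columns_equal)
print("identical gradients with random weights: ", rand_columns_equal)
```

With the shared constant, every hidden neuron gets the same update at every step, so the neurons never become different from one another.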

## Exercise 2
Is it okay to initialize the bias terms to 0?

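Yes, it is fine. Symmetry only needs to be broken by the weights, so initializing the bias terms to 0 causes no problem.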

## Exercise 3
Name three advantages of the ELU activation function over ReLU.

* It has a non-zero gradient when z < 0, which avoids the dying units problem.
* With α = 1 the function is smooth everywhere, including around z = 0, which speeds up gradient descent since it does not bounce as much left and right of z = 0.
* It takes negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem.
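
To make the first and third points concrete, here is a small NumPy sketch (an illustration, assuming α = 1 and standard-normal pre-activations; the function names are chosen here, not taken from the notebook). It compares the gradients at a negative input and the mean output of both activations.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # np.minimum keeps expm1 away from large positive inputs (avoids overflow)
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def relu_grad(z):
    return (z > 0).astype(float)

def elu_grad(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(np.minimum(z, 0.0)))

z = np.random.default_rng(0).normal(size=100_000)  # zero-mean pre-activations

neg = np.array([-1.0])
print("gradient at z = -1  ReLU:", relu_grad(neg)[0], " ELU:", elu_grad(neg)[0])
print("mean output         ReLU:", relu(z).mean(), " ELU:", elu(z).mean())
```

ReLU's gradient is exactly 0 for negative inputs, while ELU's stays positive, and ELU's mean output sits noticeably closer to 0.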

## Exercise 4
In which cases would you want to use each of the following activation functions: ELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

* ELU: A good default in most cases. Its main drawback is that it is slower to compute than ReLU.
* Leaky ReLU (and its variants): When you want to avoid the dying units problem that ReLU has but need something faster than ELU.
* ReLU: When speed matters most. It is a reasonable default, but ELU and leaky ReLU often perform better.
* Tanh: When you need to output a number between -1 and 1. Rarely used in hidden layers nowadays.
* Logistic: When you need to estimate a probability, for example in the output layer for binary classification. Rarely used in hidden layers.
* Softmax: When you need to output probabilities for mutually exclusive classes. Usually used in the output layer for classification tasks.

## Exercise 5
What may happen if you set the momentum hyperparameter too close to 1 (e.g. 0.99999) when using a MomentumOptimizer?

If you set the momentum hyperparameter too close to 1, the system will have almost no friction. The updates can pick up a lot of speed and overshoot the minimum, then oscillate around it many times, so the algorithm takes much longer to converge and may not settle on a good solution.
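
The effect is easy to reproduce on a toy problem. The sketch below is plain Python, not TensorFlow's `MomentumOptimizer`; the quadratic objective, learning rate, and step count are assumptions chosen for illustration. It applies the momentum update to f(θ) = 0.5 θ², whose minimum is at θ = 0.

```python
def run_momentum(beta, eta=0.1, n_steps=200, theta0=1.0):
    """Momentum optimization on f(theta) = 0.5 * theta**2 (minimum at 0)."""
    theta, m = theta0, 0.0
    history = []
    for _ in range(n_steps):
        grad = theta               # f'(theta) = theta
        m = beta * m - eta * grad  # momentum vector accumulates past gradients
        theta = theta + m
        history.append(theta)
    return history

damped = run_momentum(beta=0.9)
undamped = run_momentum(beta=0.99999)

final_damped = abs(damped[-1])
late_amplitude = max(abs(t) for t in undamped[-100:])
print("beta=0.9      final |theta|:", final_damped)
print("beta=0.99999  max |theta| over the last 100 steps:", late_amplitude)
```

With β = 0.9 the iterate has essentially reached the minimum after 200 steps, whereas with β = 0.99999 it is still swinging back and forth with nearly its initial amplitude.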

## Exercise 6
Name three ways you can produce a sparse model.

* Train the model normally, then set all the weights with very small magnitudes to 0.
* Use strong $\ell_1$ regularization during training, which pushes the optimizer to zero out as many weights as it can.
* Apply other techniques, such as Follow The Regularized Leader (FTRL).
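
The first two approaches can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the weight vector and thresholds are made up, and `l1_soft_threshold` shows the soft-thresholding step that proximal ℓ1 methods use to produce exact zeros, which is not necessarily what a given TensorFlow optimizer does internally.

```python
import numpy as np

def prune_small_weights(w, threshold):
    """Set every weight whose magnitude is below `threshold` to exactly 0."""
    return np.where(np.abs(w) < threshold, 0.0, w)

def l1_soft_threshold(w, strength):
    """Shrink weights toward 0 by `strength`; weights within `strength` of 0 become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - strength, 0.0)

w = np.array([0.8, -0.03, 0.002, -1.2, 0.04, 0.0005])  # hypothetical trained weights

pruned = prune_small_weights(w, threshold=0.05)
shrunk = l1_soft_threshold(w, strength=0.05)

print("pruned:", pruned)
print("shrunk:", shrunk)
print("non-zero weights left:", np.count_nonzero(pruned), "of", w.size)
```

Pruning leaves the large weights untouched, while soft-thresholding also shrinks them, which is the price ℓ1 regularization pays for sparsity.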

## Exercise 7
Does dropout slow down training? Does it slow down inference (i.e. making predictions on new instances)?

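Dropout slows training down a bit, but it does not slow inference, because it is only active during training. At inference time the only adjustment is to scale each neuron's output by the keep probability (or, equivalently, to divide by the keep probability during training so that inference needs no change at all).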
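
A minimal NumPy sketch of the "divide during training" variant (often called inverted dropout) shows why inference costs nothing extra. The function name and the all-ones activations are assumptions for illustration; this is not the `tf.layers.dropout` implementation.

```python
import numpy as np

def dropout_forward(activations, keep_prob, training, rng):
    """Inverted dropout: rescale at training time so inference is a no-op."""
    if not training:
        return activations                     # inference: nothing to compute
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob      # keep the expected output unchanged

rng = np.random.default_rng(42)
a = np.ones((1000, 100))                       # stand-in layer activations

train_out = dropout_forward(a, keep_prob=0.5, training=True, rng=rng)
test_out = dropout_forward(a, keep_prob=0.5, training=False, rng=rng)

print("fraction dropped during training:", np.mean(train_out == 0))
print("mean activation  train:", train_out.mean(), " inference:", test_out.mean())
```

About half of the units are zeroed during training, yet the average activation matches the one seen at inference, where the layer simply passes its input through.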

## Exercise 8
Deep Learning
* a) Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.

In [58]:
import tensorflow as tf

he_init = tf.contrib.layers.variance_scaling_initializer()

def build_dnn_ex8a(X, n_hidden_layers=5, n_neurons=100):
    # Stack identical ELU layers with He initialization.
    # Layer names stay "hidden1" ... "hidden5", as in a hand-written stack.
    layer = X
    for i in range(n_hidden_layers):
        layer = tf.layers.dense(layer, n_neurons, activation=tf.nn.elu,
                                kernel_initializer=he_init, name="hidden{}".format(i + 1))
    return layer

* b) Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 in the next exercise. You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.

In [76]:
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data


class MnistData():
    def __init__(self, min_digit=0, max_digit=9):
        mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
        # Boolean mask selecting the instances whose digit lies in [min_digit, max_digit]
        def digit_filter(one_hot_labels):
            digits = np.argmax(one_hot_labels, axis=1)
            return (digits >= min_digit) & (digits <= max_digit)
        train_idx = digit_filter(mnist.train.labels)
        validation_idx = digit_filter(mnist.validation.labels)
        test_idx = digit_filter(mnist.test.labels)
        self.train_images = mnist.train.images[train_idx]
        self.train_labels = mnist.train.labels[train_idx, min_digit:max_digit+1]
        self.validation_images = mnist.validation.images[validation_idx]
        self.validation_labels = mnist.validation.labels[validation_idx, min_digit:max_digit+1]
        self.test_images = mnist.test.images[test_idx]
        self.test_labels = mnist.test.labels[test_idx, min_digit:max_digit+1]
        
def fetch_batch(batch_size, batch_idx, X, y):
    start = batch_idx * batch_size
    end = (batch_idx + 1) * batch_size
    batch_x = X[start:end]
    batch_y = y[start:end]
    return batch_x, batch_y

tf.reset_default_graph()
sess = tf.InteractiveSession()

# build dnn
mnist = MnistData(min_digit=0, max_digit=4)
num_samples = np.shape(mnist.train_images)[0]
num_features = np.shape(mnist.train_images)[1]
num_classes = np.shape(mnist.train_labels)[1]

X = tf.placeholder(tf.float32, shape=[None, num_features], name="x_input")
y = tf.placeholder(tf.float32, shape=[None, num_classes], name="y_input")
dnn = build_dnn_ex8a(X)
output = tf.layers.dense(dnn, num_classes, activation=None, kernel_initializer=he_init, name="logits")

# training  
with tf.name_scope('loss'):
    loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=output))

with tf.name_scope('train'):
    optimizer = tf.train.AdamOptimizer()
    training_op = optimizer.minimize(loss_op)

with tf.name_scope('accuracy'):
    softmax = tf.nn.softmax(output)
    correct = tf.equal(tf.argmax(softmax, axis=1), tf.argmax(y, axis=1))
    accuracy_op = tf.reduce_mean(tf.cast(correct, tf.float32))
    
saver = tf.train.Saver()
initializer = tf.global_variables_initializer()

sess.run(initializer)

num_epochs = 1000
batch_size = 100
num_batches = num_samples // batch_size

MAX_CHECKS_NO_PROGRESS = 20
checks_no_progress = 0
best_loss = np.inf
for epoch in range(num_epochs):
    for batch in range(num_batches):
        # training step
        X_batch, y_batch = fetch_batch(batch_size, batch, mnist.train_images, mnist.train_labels)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    
    loss, acc = sess.run([loss_op, accuracy_op], feed_dict={X: mnist.validation_images, y: mnist.validation_labels})
    if loss < best_loss:
        best_loss = loss
        checks_no_progress = 0
        saver.save(sess, "./mnist_digits_0-4.ckpt")
    else:
        checks_no_progress += 1
        if checks_no_progress >= MAX_CHECKS_NO_PROGRESS:
            print("No progress for {} epochs. Stopping at epoch {}...".format(MAX_CHECKS_NO_PROGRESS, epoch))
            break
    print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.3f}%".format(
        epoch, loss, best_loss, acc * 100))
    
saver.restore(sess, "./mnist_digits_0-4.ckpt")
acc_test = accuracy_op.eval(feed_dict={X: mnist.test_images, y: mnist.test_labels})
print("Final test accuracy: {:.3f}%".format(acc_test * 100))

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
0	Validation loss: 0.073015	Best loss: 0.073015	Accuracy: 97.498%
1	Validation loss: 0.059028	Best loss: 0.059028	Accuracy: 98.280%
2	Validation loss: 0.044040	Best loss: 0.044040	Accuracy: 98.632%
3	Validation loss: 0.040519	Best loss: 0.040519	Accuracy: 98.944%
4	Validation loss: 0.040727	Best loss: 0.040519	Accuracy: 98.866%
5	Validation loss: 0.035412	Best loss: 0.035412	Accuracy: 98.905%
6	Validation loss: 0.045298	Best loss: 0.035412	Accuracy: 98.866%
7	Validation loss: 0.034796	Best loss: 0.034796	Accuracy: 99.062%
8	Validation loss: 0.036038	Best loss: 0.034796	Accuracy: 98.984%
9	Validation loss: 0.039586	Best loss: 0.034796	Accuracy: 99.023%
10	Validation loss: 0.040684	Best loss: 0.034796	Accuracy: 99.101%
11	Validation loss: 0.030625	Best loss: 0.030625	Accuracy: 99.062%
12	Validatio

* c) Tune the hyperparameters using cross-validation and see what precision you can achieve.

In [None]:
from sklearn.exceptions import NotFittedError

class DNNClassifier():
    def __init__(self, hidden_layers=5, num_neurons=100, num_epochs=500, optimizer=tf.train.AdamOptimizer,
                 batch_size=100, learning_rate=1e-4, activation=tf.nn.elu, initializer=he_init,
                 batch_norm_momentum=None, dropout_rate=None, random_seed=42):
        self.hidden_layers = hidden_layers
        self.num_neurons = num_neurons
        self.num_epochs = num_epochs  # was accepted by __init__ but never stored
        self.optimizer = optimizer
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.activation = activation
        self.initializer = initializer
        self.batch_norm_momentum = batch_norm_momentum
        self.dropout_rate = dropout_rate
        self.random_seed = random_seed
        self.session = None
    
    def fit(self, X, y):
        pass
    
    def predict(self, X):
        if self.session is None:
            raise NotFittedError()
        else:
            pass
    
    def _predict_proba(self, X):
        pass
    
    def _build_graph(self):
        pass

* d) Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?
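
As a reminder of what the Batch Normalization layer computes during training, here is a minimal NumPy sketch of its forward pass. It is an illustration only: the function name, `eps` value, and input data are assumptions, and the moving averages that are used at inference time are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-time forward pass: normalize each feature over the mini-batch."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta, mu, var   # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))  # badly scaled layer inputs

out, mu, var = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print("input  mean/std per feature:", x.mean(axis=0), x.std(axis=0))
print("output mean/std per feature:", out.mean(axis=0), out.std(axis=0))
```

Whatever the scale of the inputs, each feature leaves the layer centered and normalized before the learned `gamma` and `beta` are applied, which is what tends to make training less sensitive to initialization and learning rate.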

* e) Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?

## Exercise 9

Transfer learning:
* Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a new one.
* Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?
* Try caching the frozen layers, and train the model again: how much faster is it now?
* Try again reusing just four hidden layers instead of five. Can you achieve a higher precision?
* Now unfreeze the top two hidden layers and continue training: can you get the model to perform even better?
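
The caching step can be sketched without TensorFlow. In the NumPy illustration below, everything is an assumption made for the example: `W_frozen` is a random stand-in for the pretrained hidden layers, the data is synthetic, and the output layer is trained with plain gradient descent rather than Adam. The point is only that the frozen layers are evaluated once, and every epoch then trains the new output layer on the cached activations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_hidden, n_classes = 500, 20, 16, 5
X = rng.normal(size=(n_samples, n_features))
y = np.argmax(X[:, :n_classes], axis=1)      # synthetic, learnable labels

# Stand-in for the pretrained, frozen hidden layers (random here, for illustration)
W_frozen = rng.normal(size=(n_features, n_hidden)) * np.sqrt(2 / n_features)

def frozen_forward(inputs):
    z = inputs @ W_frozen
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))  # ELU

# Cache: one pass through the frozen layers instead of one pass per epoch
H_cached = frozen_forward(X)

# Only the new softmax output layer is trained, directly on the cached activations
W_out = np.zeros((n_hidden, n_classes))
b_out = np.zeros(n_classes)
Y_onehot = np.eye(n_classes)[y]
learning_rate = 0.05
losses = []
for epoch in range(50):
    logits = H_cached @ W_out + b_out
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    exp_logits = np.exp(logits)
    probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)
    losses.append(-np.mean(np.log(probs[np.arange(n_samples), y])))
    grad_logits = (probs - Y_onehot) / n_samples  # softmax cross-entropy gradient
    W_out -= learning_rate * H_cached.T @ grad_logits
    b_out -= learning_rate * grad_logits.sum(axis=0)

print("loss at first epoch:", losses[0], " loss at last epoch:", losses[-1])
```

Because the frozen layers never change, their outputs for a given image never change either, so recomputing them each epoch is wasted work.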

## Exercise 10

Pretraining on an auxiliary task.
* In this exercise you will build a DNN that compares two MNIST digit images and predicts whether they represent the same digit or not. Then you will reuse the lower layers of this network to train an MNIST classifier using very little training data. Start by building two DNNs (let's call them DNN A and B), both similar to the one you built earlier but without the output layer: each DNN should have five hidden layers of 100 neurons each, He initialization, and ELU activation. Next, add a single output layer on top of both DNNs. You should use TensorFlow's concat() function with axis=1 to concatenate the output of both DNNs along the horizontal axis, then feed the result to the output layer. This output layer should contain a single neuron using the logistic activation function.
* Split the MNIST training set in two sets: split #1 should contain 55,000 images, and split #2 should contain 5,000 images. Create a function that generates a training batch where each instance is a pair of MNIST images picked from split #1. Half of the training instances should be pairs of images that belong to the same class, while the other half should be images from different classes. For each pair, the training label should be 0 if the images are from the same class, or 1 if they are from different classes.
* Train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.
* Now create a new DNN by reusing and freezing the hidden layers of DNN A, and adding a softmax output layer on top with 10 neurons. Train this network on split #2 and see if you can achieve high performance despite having only 500 images per class.
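
One possible shape for the pair-generating function is sketched below. It is a sketch under assumptions, not a reference solution: the name `generate_pair_batch` is hypothetical, labels are expected as integer class ids rather than one-hot vectors, a same-class pair may occasionally pair an image with itself, and random arrays stand in for split #1.

```python
import numpy as np

def generate_pair_batch(images, labels, batch_size, rng):
    """Return (X1, X2, y): y is 0 for same-class pairs, 1 for different-class pairs."""
    half = batch_size // 2
    idx1 = rng.integers(0, len(images), size=batch_size)
    idx2 = np.empty(batch_size, dtype=int)
    for i, first in enumerate(idx1):
        same_class = labels == labels[first]
        # First half: partner from the same class; second half: from another class
        candidates = np.flatnonzero(same_class if i < half else ~same_class)
        idx2[i] = rng.choice(candidates)
    y = (labels[idx1] != labels[idx2]).astype(np.float32).reshape(-1, 1)
    order = rng.permutation(batch_size)  # mix same/different pairs within the batch
    return images[idx1][order], images[idx2][order], y[order]

rng = np.random.default_rng(0)
fake_images = rng.random((1000, 784)).astype(np.float32)  # stand-in for split #1
fake_labels = rng.integers(0, 10, size=1000)

X1, X2, y = generate_pair_batch(fake_images, fake_labels, batch_size=100, rng=rng)
print(X1.shape, X2.shape, y.shape, "different-class pairs:", int(y.sum()))
```

`X1` would be fed to DNN A and `X2` to DNN B, with `y` as the target for the single logistic output neuron.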