# Chapter 11: Training Deep Neural Nets 

## Exercise 1
Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

No. If you initialize all the weights to the same value, even one obtained using He initialization, you won't break the symmetry within each layer: every neuron in a layer receives the same gradient and stays identical to the others. The neural network will behave as if it had just one neuron per layer.
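
The symmetry argument can be checked on a toy layer. The sketch below is a pure-Python illustration, not part of the exercise code: the helper names `forward` and `gradient_step`, the tanh activation, and the squared-error loss are all assumptions chosen to keep it short.

```python
import math

def forward(x, W):
    """Outputs of a dense tanh layer; W[j] holds the weights of neuron j."""
    return [math.tanh(sum(w_i * x_i for w_i, x_i in zip(w, x))) for w in W]

def gradient_step(x, W, target, lr=0.1):
    """One gradient descent step on the squared error of each neuron."""
    out = forward(x, W)
    new_W = []
    for w, o in zip(W, out):
        delta = 2 * (o - target) * (1 - o ** 2)  # dLoss/dz for this neuron
        new_W.append([w_i - lr * delta * x_i for w_i, x_i in zip(w, x)])
    return new_W

x = [0.5, -1.0, 2.0]
W = [[0.3, 0.3, 0.3] for _ in range(4)]  # four neurons, every weight equal

for _ in range(10):
    W = gradient_step(x, W, target=1.0)

outputs = forward(x, W)
print(outputs)  # the four neurons still produce the same value
```

The weights inside a neuron drift apart because the inputs differ, but the four neurons remain copies of each other, so the layer has the capacity of a single neuron.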

## Exercise 2
Is it okay to initialize the bias terms to 0?

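Yes, that is fine. The symmetry is already broken by the randomly initialized weights, so the bias terms can safely start at 0.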

## Exercise 3
Name three advantages of the ELU activation function over ReLU

* It has a non-zero gradient when z < 0, which avoids the dying units problem.
* With α = 1 the function is smooth everywhere, including around z = 0, which helps speed up gradient descent, since it does not bounce as much left and right of z = 0.
* It takes negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem.
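
As a small illustration of these points, here is a pure-Python sketch of both activations and their derivatives. The function names are hypothetical, and α = 1 is assumed as the default.

```python
import math

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def elu(z, alpha=1.0):
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    return 1.0 if z > 0 else alpha * math.exp(z)

# Compare values and gradients on both sides of z = 0
for z in (-2.0, -0.5, 0.5, 2.0):
    print(z, relu(z), relu_grad(z), round(elu(z), 4), round(elu_grad(z), 4))
```

For negative inputs the ReLU gradient is exactly 0, while the ELU gradient stays positive and its output is negative.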

## Exercise 4
In which cases would you want to use each of the following activation functions: ELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

* ELU: A good default for hidden layers. Its only drawback is that it is quite slow to compute compared with ReLU and its variants.
* Leaky ReLU: When you want to avoid the dying units problem that ReLU has.
* ReLU: When you need speed. It is a good default, but ELU and leaky ReLU can perform better.
* Tanh: When you need to output a number between -1 and 1. Rarely used in hidden layers.
* Logistic: When you need to estimate a probability, for example in the output layer for binary classification. Also rarely used in hidden layers.
* Softmax: When you need to output probabilities of mutually exclusive classes. Usually used in the output layer for classification tasks.
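
To make the two output activations concrete, the following pure-Python sketch implements them. The function names are illustrative, and subtracting the maximum logit is a common stability trick assumed here rather than something taken from the exercise.

```python
import math

def logistic(z):
    """Squash a single logit into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(logistic(0.0))  # 0.5
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The logistic function handles one independent probability, whereas softmax spreads a total probability of 1 across mutually exclusive classes.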

## Exercise 5
What may happen if you set the momentum hyperparameter too close to 1 (e.g. 0.99999) when using a MomentumOptimizer?

If you set the momentum hyperparameter too close to 1, the system will have almost no friction. The optimizer picks up a lot of speed and tends to overshoot the minimum, oscillating around it many times, so it may take much longer to converge or fail to settle on a good solution.
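
This behaviour can be reproduced on a one-dimensional quadratic. The sketch below assumes the update rule m ← βm + η∇f(x), x ← x − m and the toy loss f(x) = x²/2. The function name and the values of the learning rate and step count are illustrative choices.

```python
def run_momentum(beta, lr=0.1, steps=500, x0=1.0):
    """Minimize f(x) = x**2 / 2 with momentum; return the trajectory of x."""
    x, m = x0, 0.0
    xs = []
    for _ in range(steps):
        grad = x                  # derivative of x**2 / 2
        m = beta * m + lr * grad  # accumulate momentum
        x = x - m
        xs.append(x)
    return xs

for beta in (0.9, 0.99999):
    tail = run_momentum(beta)[-100:]
    # largest distance from the minimum during the last 100 steps
    print(beta, max(abs(v) for v in tail))
```

With a momentum of 0.9 the iterate has settled at the minimum after 500 steps, while with 0.99999 it is still swinging back and forth with almost its initial amplitude.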

## Exercise 6
Name three ways you can produce a sparse model.

* Training the model normally, then setting all the weights with very small values to 0.
* Using strong $\ell_1$ regularization during training, which will push the optimizer to zero out as many weights as it can.
* Applying other techniques, such as Follow The Regularized Leader (FTRL).
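
The first two ideas can be sketched in a few lines of pure Python. The helper names and thresholds are hypothetical, and the soft-thresholding function is only one simple way to show how an ℓ1 penalty drives weights to exactly zero. It is not how any particular TensorFlow optimizer is implemented.

```python
def prune_small_weights(weights, threshold=1e-2):
    """Zero out every weight whose magnitude is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def soft_threshold(weights, strength):
    """Shrink each weight toward 0 by `strength`, clipping at 0 (l1-style shrinkage)."""
    shrunk = []
    for w in weights:
        magnitude = abs(w) - strength
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

weights = [0.8, -0.003, 0.05, 0.0007, -0.4, 0.009]
pruned = prune_small_weights(weights)
shrunk = soft_threshold(weights, strength=0.05)
print(pruned, sparsity(pruned))
print(shrunk, sparsity(shrunk))
```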

## Exercise 7
Does dropout slow down training? Does it slow down inference (i.e. making predictions on new instances)?

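Dropout slows down training a bit, since the network generally needs more iterations to converge. It does not slow down inference, because dropout is only active during training. At most, the output of each neuron has to be rescaled by the keep probability, and `tf.layers.dropout` already applies that rescaling at training time.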
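
A minimal pure-Python sketch of this "inverted" dropout scheme is shown below. The function name and signature are made up for illustration. The point is only that all the extra work happens when `training` is true.

```python
import random

def dropout(inputs, rate, training, rng=random):
    """Inverted dropout: rescale kept units at training time, do nothing at inference."""
    if not training:
        return list(inputs)  # inference: inputs pass through unchanged
    keep_prob = 1.0 - rate
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in inputs]

rng = random.Random(42)
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False)
print(train_out)
print(test_out)
```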

## Exercise 8
Deep Learning
* a) Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.

In [None]:
import tensorflow as tf

he_init = tf.contrib.layers.variance_scaling_initializer()

def build_dnn_ex8a(X):
    # stack five identical hidden layers named hidden1 ... hidden5
    hidden = X
    for i in range(1, 6):
        hidden = tf.layers.dense(hidden, 100, activation=tf.nn.elu, kernel_initializer=he_init,
                                 name="hidden%d" % i)
    return hidden
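
For reference, He initialization as it is commonly defined draws each weight with a variance of 2 / fan_in. The pure-Python sketch below checks this empirically. It uses a plain normal distribution and a hypothetical `he_normal` helper, so it may differ in detail from what the TensorFlow initializer above samples.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix with standard deviation sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

rng = random.Random(0)
W = he_normal(fan_in=784, fan_out=100, rng=rng)

# Empirical variance of the sampled weights versus the target 2 / fan_in
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
var = sum((w - mean) ** 2 for w in flat) / len(flat)
print(var, 2.0 / 784)
```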

* b) Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 in the next exercise. You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.

In [None]:
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data


class MnistData():
    def __init__(self, min_digit=0, max_digit=9):
        mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
        digit_filter = np.vectorize(lambda t: t >= min_digit and t <= max_digit)
        train_idx = digit_filter(np.argmax(mnist.train.labels, axis=1))
        validation_idx = digit_filter(np.argmax(mnist.validation.labels, axis=1))
        test_idx = digit_filter(np.argmax(mnist.test.labels, axis=1))
        self.train_images = mnist.train.images[train_idx]
        self.train_labels = np.argmax(mnist.train.labels[train_idx, min_digit:max_digit+1], axis=1)
        self.validation_images = mnist.validation.images[validation_idx]
        self.validation_labels = np.argmax(mnist.validation.labels[validation_idx, min_digit:max_digit+1], axis=1)
        self.test_images = mnist.test.images[test_idx]
        self.test_labels = np.argmax(mnist.test.labels[test_idx, min_digit:max_digit+1], axis=1)
        
def fetch_batch(batch_size, batch_idx, X, y):
    start = batch_idx * batch_size
    end = (batch_idx + 1) * batch_size
    batch_x = X[start:end]
    batch_y = y[start:end]
    return batch_x, batch_y

tf.reset_default_graph()

# build dnn
mnist = MnistData(min_digit=0, max_digit=4)
num_samples = np.shape(mnist.train_images)[0]
num_classes = np.shape(np.unique(mnist.train_labels))[0]
num_features = np.shape(mnist.train_images)[1]

X = tf.placeholder(tf.float32, shape=[None, num_features], name="x_input")
y = tf.placeholder(tf.int64, shape=[None], name="y_input")
dnn = build_dnn_ex8a(X)
output = tf.layers.dense(dnn, num_classes, activation=None, kernel_initializer=he_init, name="logits")

# training  
with tf.name_scope('loss'):
    loss_op = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=output))

with tf.name_scope('train'):
    optimizer = tf.train.AdamOptimizer()
    training_op = optimizer.minimize(loss_op)

with tf.name_scope('accuracy'):
    softmax = tf.nn.softmax(output)
    correct = tf.equal(tf.argmax(softmax, axis=1), y)
    accuracy_op = tf.reduce_mean(tf.cast(correct, tf.float32))
    
saver = tf.train.Saver()
initializer = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(initializer)

    num_epochs = 1000
    batch_size = 100
    num_batches = num_samples // batch_size

    MAX_CHECKS_NO_PROGRESS = 20
    checks_no_progress = 0
    best_loss = np.inf
    for epoch in range(num_epochs):
        for batch in range(num_batches):
            # training step
            X_batch, y_batch = fetch_batch(batch_size, batch, mnist.train_images, mnist.train_labels)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

        loss, acc = sess.run([loss_op, accuracy_op], feed_dict={X: mnist.validation_images, y: mnist.validation_labels})
        if loss < best_loss:
            best_loss = loss
            checks_no_progress = 0
            saver.save(sess, "./mnist_digits_0-4.ckpt")
        else:
            checks_no_progress += 1
            if checks_no_progress >= MAX_CHECKS_NO_PROGRESS:
                print("No progress after {} epochs. Stopping...".format(epoch))
                break
        print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.3f}%".format(
            epoch, loss, best_loss, acc * 100))

    saver.restore(sess, "./mnist_digits_0-4.ckpt")
    acc_test = accuracy_op.eval(feed_dict={X: mnist.test_images, y: mnist.test_labels})
    print("Final test accuracy: {:.3f}%".format(acc_test * 100))
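
The early stopping logic used in the training loop above can be isolated from TensorFlow. The sketch below replays a made-up sequence of validation losses through the same patience counter. The function name and the loss values are illustrative only.

```python
def early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    checks_no_progress = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            checks_no_progress = 0  # this is where a checkpoint would be saved
        else:
            checks_no_progress += 1
            if checks_no_progress >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.7, 0.5]
print(early_stopping(losses, patience=3))   # (2, 5)
print(early_stopping(losses, patience=10))  # (7, 7)
```

A small patience stops at epoch 5 and keeps the model from epoch 2, missing the better loss reached at epoch 7. This is why the patience value is itself worth tuning.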

* c) Tune the hyperparameters using cross-validation and see what precision you can achieve.

First, we move most of the previous code into a custom class that holds all the hyperparameters that can be tweaked. After that, we can use the RandomizedSearchCV class from scikit-learn to easily find the best hyperparameters for our DNN.

In [None]:
import os
from datetime import datetime  # both are needed for the TensorBoard log directory

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import train_test_split

class DNNClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, hidden_layers=5, num_neurons=100, optimizer=tf.train.AdamOptimizer,
                 batch_size=50, learning_rate=1e-4, activation=tf.nn.elu, initializer=he_init,
                 batch_norm_momentum=None, dropout_rate=None, tensorboard_logdir=None, random_seed=42):
        self.hidden_layers = hidden_layers
        self.num_neurons = num_neurons
        self.optimizer = optimizer
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.activation = activation
        self.initializer = initializer
        self.batch_norm_momentum = batch_norm_momentum
        self.dropout_rate = dropout_rate
        self.tensorboard_logdir = tensorboard_logdir
        self.random_seed = random_seed
        self.session = None
    
    def fit(self, X, y, num_epochs=1000):
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=self.random_seed)
        
        num_features = np.shape(X)[1]
        classes = np.unique(y)
        num_classes = np.shape(classes)[0]
        
        self._graph = tf.Graph()
        with self._graph.as_default():
            self._build_graph(num_features, num_classes)
        self.session = tf.Session(graph=self._graph)
        self.session.run(self._init)

        num_samples = np.shape(X_train)[0]
        num_batches = num_samples // self.batch_size
        MAX_CHECKS_NO_PROGRESS = 20
        checks_no_progress = 0
        best_loss = np.inf
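        # batch normalization (if any) registers its moving-average updates in UPDATE_OPS
        extra_update_ops = self._graph.get_collection(tf.GraphKeys.UPDATE_OPS)
        with self.session.as_default() as sess:
            for epoch in range(num_epochs):
                for batch in range(num_batches):
                    # training step
                    X_batch, y_batch = fetch_batch(self.batch_size, batch, X_train, y_train)
                    feed_dict = {self._X: X_batch, self._y: y_batch}
                    if self._training is not None:
                        # switch dropout / batch normalization to training mode
                        feed_dict[self._training] = True
                    sess.run([self._training_op] + extra_update_ops, feed_dict=feed_dict)
                    if self.tensorboard_logdir is not None and batch % 3 == 0:
                        step = epoch * num_batches + batch
                        s = sess.run(self.summaries, feed_dict={self._X: X_batch,
                                                                self._y: y_batch})
                        self.writer.add_summary(s, step)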

                loss, acc = sess.run([self._loss_op, self._accuracy_op], feed_dict={self._X: X_val, self._y: y_val})
                if loss < best_loss:
                    best_loss = loss
                    checks_no_progress = 0
                else:
                    checks_no_progress += 1
                    if checks_no_progress >= MAX_CHECKS_NO_PROGRESS:
                        print("No progress after {} epochs. Stopping...".format(epoch))
                        break
                print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.3f}%".format(
                    epoch, loss, best_loss, acc * 100))
    
    def predict(self, X):
        if self.session is None:
            raise NotFittedError()
        else:
            probabilities = self._predict_proba(X)
            return np.argmax(probabilities, axis=1)

    
    def _predict_proba(self, X):
        with self.session.as_default() as sess:
            return self._Y_proba.eval(feed_dict={self._X: X})
    
    def _dnn(self, inputs):
        for layer in range(self.hidden_layers):
            if self.dropout_rate:
                inputs = tf.layers.dropout(inputs, self.dropout_rate, training=self._training, seed=self.random_seed)
            
            inputs = tf.layers.dense(inputs, self.num_neurons, 
                                     kernel_initializer=self.initializer, name="hidden%d" % (layer + 1))
            if self.batch_norm_momentum:
                inputs = tf.layers.batch_normalization(inputs, momentum=self.batch_norm_momentum,
                                                       training=self._training)
                
            inputs = self.activation(inputs, name="hidden%d_out" % (layer + 1))
            
        return inputs
    
    def _build_graph(self, num_inputs, num_outputs):
        if self.random_seed is not None:
            tf.set_random_seed(self.random_seed)
            np.random.seed(self.random_seed)

        X = tf.placeholder(tf.float32, shape=[None, num_inputs], name="x_input")
        y = tf.placeholder(tf.int64, shape=[None], name="y_input")
        if self.batch_norm_momentum or self.dropout_rate:
            self._training = tf.placeholder_with_default(False, shape=(), name='training')
        else:
            self._training = None
        
        dnn = self._dnn(X)
        logits = tf.layers.dense(dnn, num_outputs, activation=None, 
                                      kernel_initializer=self.initializer, name="logits")
        with tf.name_scope('loss'):
            loss_op = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits), name="loss")

        with tf.name_scope('train'):
            optimizer = self.optimizer(self.learning_rate)
            training_op = optimizer.minimize(loss_op, name="training")

        with tf.name_scope('accuracy'):
            y_proba = tf.nn.softmax(logits, name="y_proba")
            correct = tf.equal(tf.argmax(y_proba, axis=1), y)
            accuracy_op = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")
            
        if self.tensorboard_logdir is not None:
            now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
            log_dir = os.path.join(self.tensorboard_logdir, "run-{}".format(now))
            self.writer = tf.summary.FileWriter(log_dir)
            tf.summary.scalar('loss', loss_op)
            tf.summary.scalar('accuracy', accuracy_op)
            self.summaries = tf.summary.merge_all()
        
        saver = tf.train.Saver()
        init = tf.global_variables_initializer()
        
        # Make the important operations available
        self._X, self._y = X, y
        self._Y_proba, self._loss_op = y_proba, loss_op
        self._training_op, self._accuracy_op = training_op, accuracy_op
        self._init, self._saver = init, saver
        
    def save(self, path):
        self._saver.save(self.session, path)

Now we can use the RandomizedSearchCV class to search for the best hyperparameters. The number of iterations and the hyperparameter distributions can be tweaked depending on your computing resources:

In [None]:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV

from functools import partial


def leaky_relu(alpha=0.01):
    def parametrized_leaky_relu(z, name=None):
        return tf.maximum(alpha * z, z, name=name)
    return parametrized_leaky_relu

tf.reset_default_graph()

mnist = MnistData(min_digit=0, max_digit=4)
X = np.concatenate((mnist.train_images, mnist.validation_images), axis=0)
y = np.concatenate((mnist.train_labels, mnist.validation_labels), axis=0)

# specify parameters and distributions to sample from
param_dist = {"optimizer": [tf.train.GradientDescentOptimizer, tf.train.AdamOptimizer,
                            tf.train.AdagradOptimizer, partial(tf.train.MomentumOptimizer, momentum=0.9)],
              "hidden_layers": sp_randint(3, 8),
              "num_neurons": sp_randint(50, 250),
              "batch_size": sp_randint(20, 200),
              "activation": [tf.nn.elu, leaky_relu(alpha=0.01), leaky_relu(alpha=0.1)],
              "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
              "initializer": [he_init]
             }

# run randomized search
n_iter_search = 15
dnn = DNNClassifier()
random_search = RandomizedSearchCV(dnn, param_distributions=param_dist,
                                   n_iter=n_iter_search, verbose=2)

random_search.fit(X, y)
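
Conceptually, a randomized search samples hyperparameter combinations, scores each one, and keeps the best. The pure-Python sketch below shows only that outer loop, with a toy scoring function in place of cross-validated training. `randomized_search`, `param_space` and `toy_score` are hypothetical names, not scikit-learn APIs.

```python
import random

def randomized_search(score_fn, param_space, n_iter, seed=0):
    """Sample n_iter parameter combinations and return the best one with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sample(rng) for name, sample in param_space.items()}
        score = score_fn(params)  # RandomizedSearchCV would cross-validate here
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space: each entry draws one value from its distribution
param_space = {
    "hidden_layers": lambda rng: rng.randint(3, 7),
    "learning_rate": lambda rng: rng.choice([1e-4, 3e-4, 1e-3, 3e-3, 1e-2]),
}

def toy_score(params):
    """Toy stand-in for validation accuracy, highest at 5 layers and a 1e-3 learning rate."""
    return -abs(params["hidden_layers"] - 5) - abs(params["learning_rate"] - 1e-3)

best_params, best_score = randomized_search(toy_score, param_space, n_iter=15)
print(best_params, best_score)
```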

In [None]:
from sklearn.metrics import accuracy_score

y_pred = random_search.best_estimator_.predict(mnist.test_images)
accuracy = accuracy_score(y_pred, mnist.test_labels)
print('Accuracy of best classifier: {:.3f}'.format(accuracy * 100))

In [None]:
random_search.best_params_

Now we can save the best model to disk, and restore it later when needed:

In [None]:
random_search.best_estimator_.save('./best_model_ex8.ckpt')

* d) Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?

In [None]:
dnn_with_bn = DNNClassifier(hidden_layers=5, activation=leaky_relu(alpha=0.01), initializer=he_init, 
                            num_neurons=191, optimizer=partial(tf.train.MomentumOptimizer, momentum=0.9),
                            learning_rate=0.01, batch_size=42, batch_norm_momentum=0.9)
dnn_with_bn.fit(X, y)

In [None]:
y_pred = dnn_with_bn.predict(mnist.test_images)
accuracy = accuracy_score(y_pred, mnist.test_labels)
print('Accuracy of best classifier with batch normalization: {:.3f}'.format(accuracy * 100))

If we use batch normalization, the model takes longer to converge, which slows down training a bit. However, the performance of the model increased.
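
For intuition about the extra work done at each step, here is a pure-Python sketch of the training-time batch normalization computation for a single feature. The function name and default values are illustrative. At inference time, moving averages of the mean and variance would be used instead of the batch statistics.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print(normalized, norm_mean, norm_var)
```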

* e) Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?

The model doesn't seem to be overfitting the training set (we even get a higher accuracy on the test set than on the validation set used during training).

In [None]:
dnn_with_bn_and_dropout = DNNClassifier(hidden_layers=5, activation=leaky_relu(alpha=0.01), initializer=he_init, 
                            num_neurons=191, optimizer=partial(tf.train.MomentumOptimizer, momentum=0.9),
                            learning_rate=0.01, batch_size=42, batch_norm_momentum=0.9, dropout_rate=0.5)
dnn_with_bn_and_dropout.fit(X, y)

In [None]:
y_pred = dnn_with_bn_and_dropout.predict(mnist.test_images)
accuracy = accuracy_score(y_pred, mnist.test_labels)
print('Accuracy of best classifier with batch normalization and dropout: {:.3f}'.format(accuracy * 100))

In this case, using dropout hasn't increased the performance of our model.

## Exercise 9

Transfer learning:
* Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a new one.

In [None]:
tf.reset_default_graph()

saver = tf.train.import_meta_graph('best_model_ex8.ckpt.meta')

graph = tf.get_default_graph()
X = graph.get_tensor_by_name("x_input:0")
y = graph.get_tensor_by_name("y_input:0")
loss_op = graph.get_tensor_by_name("loss/loss:0")
accuracy_op = graph.get_tensor_by_name("accuracy/accuracy:0")
optimizer = tf.train.AdamOptimizer()
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="logits")
training_op = optimizer.minimize(loss_op, var_list=train_vars)
init = tf.global_variables_initializer()

* Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?

In [None]:
def sample_n_instances_per_class(X, y, n=100):
    Xs, ys = [], []
    for label in np.unique(y):
        idx = (y == label)
        Xc = X[idx][:n]
        yc = y[idx][:n]
        Xs.append(Xc)
        ys.append(yc)
    return np.concatenate(Xs), np.concatenate(ys)

In [None]:
mnist_5_9 = MnistData(min_digit=5, max_digit=9)
mnist_5_9.train_images, mnist_5_9.train_labels = sample_n_instances_per_class(mnist_5_9.train_images, 
                                                                              mnist_5_9.train_labels, 100)

with tf.Session() as sess:
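    # initialize first (this covers the new optimizer variables), then restore the
    # pretrained weights; running init after restore would overwrite them
    sess.run(init)
    saver.restore(sess, 'best_model_ex8.ckpt')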

    num_samples = 500
    num_epochs = 1000
    batch_size = 100
    num_batches = num_samples // batch_size

    MAX_CHECKS_NO_PROGRESS = 20
    checks_no_progress = 0
    best_loss = np.inf
    for epoch in range(num_epochs):
        for batch in range(num_batches):
            # training step
            X_batch, y_batch = fetch_batch(batch_size, batch, mnist_5_9.train_images, mnist_5_9.train_labels)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

        loss, acc = sess.run([loss_op, accuracy_op], feed_dict={X: mnist_5_9.validation_images,
                                                                y: mnist_5_9.validation_labels})
        if loss < best_loss:
            best_loss = loss
            checks_no_progress = 0
        else:
            checks_no_progress += 1
            if checks_no_progress >= MAX_CHECKS_NO_PROGRESS:
                print("No progress after {} epochs. Stopping...".format(epoch))
                break
        print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.3f}%".format(
            epoch, loss, best_loss, acc * 100))

    acc_test = sess.run(accuracy_op, feed_dict={X: mnist_5_9.test_images, y: mnist_5_9.test_labels})
    print("Final test accuracy: {:.3f}%".format(acc_test * 100))

* Try caching the frozen layers, and train the model again: how much faster is it now?

In [None]:
hidden5 = graph.get_tensor_by_name("hidden5_out:0")

with tf.Session() as sess:
    # initialize first, then restore, so the pretrained weights are not overwritten
    sess.run(init)
    saver.restore(sess, 'best_model_ex8.ckpt')
    hidden5_cache = sess.run(hidden5, feed_dict={X: mnist_5_9.train_images, y: mnist_5_9.train_labels})
    hidden5_val_cache = sess.run(hidden5, feed_dict={X: mnist_5_9.validation_images, y: mnist_5_9.validation_labels})

    num_samples = 500
    num_epochs = 1000
    batch_size = 100
    num_batches = num_samples // batch_size

    MAX_CHECKS_NO_PROGRESS = 20
    checks_no_progress = 0
    best_loss = np.inf
    for epoch in range(num_epochs):
        for batch in range(num_batches):
            # training step
            # feed the cached activations directly into hidden5, skipping the frozen layers
            hidden5_batch, y_batch = fetch_batch(batch_size, batch, hidden5_cache, mnist_5_9.train_labels)
            sess.run(training_op, feed_dict={hidden5: hidden5_batch, y: y_batch})

        loss, acc = sess.run([loss_op, accuracy_op], feed_dict={hidden5: hidden5_val_cache,
                                                                y: mnist_5_9.validation_labels})
        if loss < best_loss:
            best_loss = loss
            checks_no_progress = 0
        else:
            checks_no_progress += 1
            if checks_no_progress >= MAX_CHECKS_NO_PROGRESS:
                print("No progress after {} epochs. Stopping...".format(epoch))
                break
        print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.3f}%".format(
            epoch, loss, best_loss, acc * 100))

    acc_test = sess.run(accuracy_op, feed_dict={X: mnist_5_9.test_images, y: mnist_5_9.test_labels})
    print("Final test accuracy: {:.3f}%".format(acc_test * 100))
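
The benefit of caching comes from evaluating the frozen layers once per training instance instead of once per instance per epoch. The pure-Python sketch below counts those evaluations. `frozen_layers` is a made-up stand-in for the frozen part of the network, and the output-layer update is left out.

```python
def frozen_layers(x, counter):
    """Stand-in for the frozen hidden layers; counts how often it is evaluated."""
    counter["calls"] += 1
    return [2.0 * v + 1.0 for v in x]

def train_without_cache(data, epochs):
    counter = {"calls": 0}
    for _ in range(epochs):
        for x in data:
            features = frozen_layers(x, counter)  # recomputed at every epoch
            # (the output layer would be updated from `features` here)
    return counter["calls"]

def train_with_cache(data, epochs):
    counter = {"calls": 0}
    cache = [frozen_layers(x, counter) for x in data]  # computed once, up front
    for _ in range(epochs):
        for features in cache:
            pass  # (the output layer would be updated from `features` here)
    return counter["calls"]

data = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
print(train_without_cache(data, epochs=10))  # 50
print(train_with_cache(data, epochs=10))     # 5
```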

* Try again reusing just four hidden layers instead of five. Can you achieve a higher precision?

In [None]:
tf.reset_default_graph()

saver = tf.train.import_meta_graph('best_model_ex8.ckpt.meta')
n_outputs = 5

graph = tf.get_default_graph()
X = graph.get_tensor_by_name("x_input:0")
y = graph.get_tensor_by_name("y_input:0")
hidden4_out = graph.get_tensor_by_name("hidden4_out:0")
logits = tf.layers.dense(hidden4_out, n_outputs, kernel_initializer=he_init, name="new_logits")
loss_op = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
softmax = tf.nn.softmax(logits)
correct = tf.equal(tf.argmax(softmax, axis=1), y)
accuracy_op = tf.reduce_mean(tf.cast(correct, tf.float32))

optimizer = tf.train.AdamOptimizer()
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="new_logits")
training_op = optimizer.minimize(loss_op, var_list=train_vars)
init = tf.global_variables_initializer()

In [None]:
mnist_5_9 = MnistData(min_digit=5, max_digit=9)
mnist_5_9.train_images, mnist_5_9.train_labels = sample_n_instances_per_class(mnist_5_9.train_images, 
                                                                              mnist_5_9.train_labels, 100)

with tf.Session() as sess:
    # initialize first (this covers the new output layer), then restore the pretrained weights
    sess.run(init)
    saver.restore(sess, 'best_model_ex8.ckpt')

    num_samples = 500
    num_epochs = 1000
    batch_size = 100
    num_batches = num_samples // batch_size

    MAX_CHECKS_NO_PROGRESS = 20
    checks_no_progress = 0
    best_loss = np.inf
    for epoch in range(num_epochs):
        for batch in range(num_batches):
            # training step
            X_batch, y_batch = fetch_batch(batch_size, batch, mnist_5_9.train_images, mnist_5_9.train_labels)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

        loss, acc = sess.run([loss_op, accuracy_op], feed_dict={X: mnist_5_9.validation_images,
                                                                y: mnist_5_9.validation_labels})
        if loss < best_loss:
            best_loss = loss
            checks_no_progress = 0
        else:
            checks_no_progress += 1
            if checks_no_progress >= MAX_CHECKS_NO_PROGRESS:
                print("No progress after {} epochs. Stopping...".format(epoch))
                break
        print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.3f}%".format(
            epoch, loss, best_loss, acc * 100))

    acc_test = sess.run(accuracy_op, feed_dict={X: mnist_5_9.test_images, y: mnist_5_9.test_labels})
    print("Final test accuracy: {:.3f}%".format(acc_test * 100))

* Now unfreeze the top two hidden layers and continue training: can you get the model to perform even better?

In [None]:
tf.reset_default_graph()

saver = tf.train.import_meta_graph('best_model_ex8.ckpt.meta')
n_outputs = 5

graph = tf.get_default_graph()
X = graph.get_tensor_by_name("x_input:0")
y = graph.get_tensor_by_name("y_input:0")
hidden4_out = graph.get_tensor_by_name("hidden4_out:0")
logits = tf.layers.dense(hidden4_out, n_outputs, kernel_initializer=he_init, name="new_logits")
loss_op = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
softmax = tf.nn.softmax(logits)
correct = tf.equal(tf.argmax(softmax, axis=1), y)
accuracy_op = tf.reduce_mean(tf.cast(correct, tf.float32))

optimizer = tf.train.AdamOptimizer()
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="hidden[34]|new_logits")
training_op = optimizer.minimize(loss_op, var_list=train_vars)
init = tf.global_variables_initializer()

In [None]:
mnist_5_9 = MnistData(min_digit=5, max_digit=9)
mnist_5_9.train_images, mnist_5_9.train_labels = sample_n_instances_per_class(mnist_5_9.train_images, 
                                                                              mnist_5_9.train_labels, 100)

with tf.Session() as sess:
    # initialize first (this covers the new output layer), then restore the pretrained weights
    sess.run(init)
    saver.restore(sess, 'best_model_ex8.ckpt')

    num_samples = 500
    num_epochs = 1000
    batch_size = 100
    num_batches = num_samples // batch_size

    MAX_CHECKS_NO_PROGRESS = 20
    checks_no_progress = 0
    best_loss = np.inf
    for epoch in range(num_epochs):
        for batch in range(num_batches):
            # training step
            X_batch, y_batch = fetch_batch(batch_size, batch, mnist_5_9.train_images, mnist_5_9.train_labels)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

        loss, acc = sess.run([loss_op, accuracy_op], feed_dict={X: mnist_5_9.validation_images,
                                                                y: mnist_5_9.validation_labels})
        if loss < best_loss:
            best_loss = loss
            checks_no_progress = 0
        else:
            checks_no_progress += 1
            if checks_no_progress >= MAX_CHECKS_NO_PROGRESS:
                print("No progress after {} epochs. Stopping...".format(epoch))
                break
        print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.3f}%".format(
            epoch, loss, best_loss, acc * 100))

    acc_test = sess.run(accuracy_op, feed_dict={X: mnist_5_9.test_images, y: mnist_5_9.test_labels})
    print("Final test accuracy: {:.3f}%".format(acc_test * 100))

## Exercise 10

Pretraining on an auxiliary task.
* In this exercise you will build a DNN that compares two MNIST digit images and predicts whether they represent the same digit or not. Then you will reuse the lower layers of this network to train an MNIST classifier using very little training data. Start by building two DNNs (let's call them DNN A and B), both similar to the one you built earlier but without the output layer: each DNN should have five hidden layers of 100 neurons each, He initialization, and ELU activation. Next, add a single output layer on top of both DNNs. You should use TensorFlow's concat() function with axis=1 to concatenate the output of both DNNs along the horizontal axis, then feed the result to the output layer. This output layer should contain a single neuron using the logistic activation function.

In [None]:
def build_dnn_ex9(inputs, num_layers=5, num_neurons=100, name="dnn"):
    with tf.name_scope(name):
        for i in range(num_layers):
            inputs = tf.layers.dense(inputs, num_neurons, activation=tf.nn.elu,
                                     kernel_initializer=he_init, name="hidden{}_{}".format(i, name))        
    return inputs

tf.reset_default_graph()
mnist = MnistData()
num_inputs = np.shape(mnist.train_images)[1]
X = tf.placeholder(tf.float32, shape=[None, 2, num_inputs], name="input_x")
X1, X2 = tf.unstack(X, axis=1)
y = tf.placeholder(tf.float32, shape=[None, 1], name="input_y")
dnn_a = build_dnn_ex9(X1, name="dnn_a")
dnn_b = build_dnn_ex9(X2, name="dnn_b")
dnn_outputs = tf.concat([dnn_a, dnn_b], axis=1)
hidden = tf.layers.dense(dnn_outputs, units=10, activation=tf.nn.elu, kernel_initializer=he_init)
output = tf.layers.dense(hidden, units=1, activation=None, kernel_initializer=he_init, name="logits")
y_proba = tf.nn.sigmoid(output)

* Split the MNIST training set in two sets: split #1 should contain 55,000 images, and split #2 should contain 5,000 images. Create a function that generates a training batch where each instance is a pair of MNIST images picked from split #1. Half of the training instances should be pairs of images that belong to the same class, while the other half should be images from different classes. For each pair, the training label should be 0 if the images are from the same class, or 1 if they are from different classes.

The MnistData class already creates a training set and a validation set, with 55,000 and 5,000 images respectively, so that part is already done. We only need to create the function that generates the training batch:

In [None]:
def generate_training_batch_ex9(X, y, batch_size):
    newX, newY = [], []
    
    while len(newX) != batch_size // 2:
        rnd_idx1, rnd_idx2 = np.random.randint(0, len(X), 2)
        if rnd_idx1 == rnd_idx2:
            continue
            
        if y[rnd_idx1] == y[rnd_idx2]:
            newX.append(np.array([X[rnd_idx1], X[rnd_idx2]]))
            newY.append([0])
            
    while len(newX) != batch_size:
        rnd_idx1, rnd_idx2 = np.random.randint(0, len(X), 2)
        if rnd_idx1 == rnd_idx2:
            continue
            
        if y[rnd_idx1] != y[rnd_idx2]:
            newX.append(np.array([X[rnd_idx1], X[rnd_idx2]]))
            newY.append([1])
    rnd_indices = np.random.permutation(batch_size)
    return np.array(newX)[rnd_indices], np.array(newY)[rnd_indices]

* Train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.

In [None]:
X_test, y_test = generate_training_batch_ex9(mnist.test_images, mnist.test_labels, len(mnist.test_images))

with tf.name_scope('loss'):
    loss_op = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=output))

with tf.name_scope('train'):
    optimizer = tf.train.AdamOptimizer()
    training_op = optimizer.minimize(loss_op)

with tf.name_scope('accuracy'):
    y_pred = tf.cast(tf.greater_equal(y_proba, 0.5), tf.float32)
    correct = tf.equal(y_pred, y)
    accuracy_op = tf.reduce_mean(tf.cast(correct, tf.float32))

saver = tf.train.Saver()
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)

    num_samples = 500
    num_epochs = 1000
    batch_size = 100
    num_batches = num_samples // batch_size

    MAX_CHECKS_NO_PROGRESS = 20
    checks_no_progress = 0
    best_loss = np.inf
    for epoch in range(num_epochs):
        for batch in range(num_batches):
            # training step
            X_batch, y_batch = generate_training_batch_ex9(mnist.train_images, mnist.train_labels, batch_size)
            _, loss_val = sess.run([training_op, loss_op], feed_dict={X: X_batch, y: y_batch})
        if epoch % 5 == 0:
            loss, acc = sess.run([loss_op, accuracy_op], feed_dict={X: X_test,
                                                                    y: y_test})
            print("{} Test loss: {:.6f} Test accuracy: {:.3f}%".format(
                  epoch, loss, acc * 100))

    acc_test = sess.run(accuracy_op, feed_dict={X: X_test, y: y_test})
    print("Final test accuracy: {:.3f}%".format(acc_test * 100))
    
    save_path = saver.save(sess, "./my_digit_comparison_model.ckpt")

* Now create a new DNN by reusing and freezing the hidden layers of DNN A, and adding a softmax output layer on top with 10 neurons. Train this network on split #2 and see if you can achieve high performance despite having only 500 images per class.

In [None]:
tf.reset_default_graph()

n_inputs = 28 * 28  # MNIST
n_outputs = 10

restore_saver = tf.train.import_meta_graph('my_digit_comparison_model.ckpt.meta')

graph = tf.get_default_graph()
X = graph.get_tensor_by_name("input_x:0")
y = tf.placeholder(tf.int64, [None])
last_hidden = graph.get_tensor_by_name("dnn_a/hidden4_dnn_a/Elu:0")
new_output = tf.layers.dense(last_hidden, units=10, activation=None, kernel_initializer=he_init, name="new_logits")

with tf.name_scope('loss'):
    loss_op = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=new_output))

with tf.name_scope('train'):
    optimizer = tf.train.AdamOptimizer()
    train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="new_logits")
    training_op = optimizer.minimize(loss_op, var_list=train_vars)

with tf.name_scope('accuracy'):
    y_proba = tf.nn.softmax(new_output, name="y_proba")
    correct = tf.equal(tf.argmax(y_proba, axis=1), y)
    accuracy_op = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

init = tf.global_variables_initializer()

saver = tf.train.Saver()

In [None]:
n_epochs = 100
batch_size = 50

mnist = MnistData()
X_test, y_test = mnist.test_images, mnist.test_labels
X_train2, y_train2 = mnist.validation_images, mnist.validation_labels
X_train2 = np.stack((X_train2, np.zeros([5000, 784])), axis=1)
X_test = np.stack((X_test, np.zeros([10000, 784])), axis=1)

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_digit_comparison_model.ckpt")

    for epoch in range(n_epochs):
        rnd_idx = np.random.permutation(len(X_train2))
        for rnd_indices in np.array_split(rnd_idx, len(X_train2) // batch_size):
            X_batch, y_batch = X_train2[rnd_indices], y_train2[rnd_indices]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if epoch % 10 == 0:
            acc_test = accuracy_op.eval(feed_dict={X: X_test, y: y_test})
            print(epoch, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./my_mnist_model_final.ckpt")