# Chapter 11. Training Deep Neural Nets





## Challenges in Training DNNs
* Vanishing gradients
* Slow convergence
* Overfitting

### Vanishing Gradient Problems
* Gradients often get smaller and smaller as backpropagation moves down to the lower layers.
* As a result, the weights of the lower layers remain virtually unchanged.
* Training therefore never converges to a good solution.
* Main suspects (paper by Xavier Glorot and Yoshua Bengio, published in 2010):
  * Sigmoid activation
  * Random weight initialization from a normal distribution with a standard deviation of 1
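
The two suspects above can be illustrated with a small NumPy sketch. It only measures how saturated the sigmoid units become in a forward pass; it does not compute the full backpropagated gradient. The layer width, depth, and batch size are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_units, n_layers = 100, 10                 # illustrative sizes (assumptions)
a = rng.normal(size=(256, n_units))         # a batch of random inputs

mean_derivatives = []
for _ in range(n_layers):
    # weights drawn from a normal distribution with std dev 1
    W = rng.normal(0.0, 1.0, size=(n_units, n_units))
    a = sigmoid(a @ W)
    # the sigmoid derivative is a * (1 - a): at most 0.25, near 0 when saturated
    mean_derivatives.append(float(np.mean(a * (1.0 - a))))

print(mean_derivatives)
```

Each backpropagation step through a sigmoid layer multiplies the error signal by this local derivative, so values far below 0.25 in every layer indicate how little signal is left for the lower layers.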

#### 1. Xavier and He Initialization
* Xavier initialization (for logistic activation)
  * Random initialization, but the standard deviation is scaled according to the number of inputs and outputs of the layer
  * Speeds up training with logistic activation
  * Default kernel initializer of the TensorFlow dense layer (`tf.layers.dense`)
* He initialization (for ReLU)
  * Similar to Xavier initialization
  * but scaled for ReLU activation
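
A minimal sketch of the two schemes, for the normal-distribution variants. The formulas are assumptions taken from the commonly cited definitions (Xavier: `sqrt(2 / (n_in + n_out))`; He, fan-in form: `sqrt(2 / n_in)`); other references scale He initialization slightly differently. The helper names `xavier_std` and `he_std` are hypothetical.

```python
import numpy as np

def xavier_std(n_in, n_out):
    # Xavier (Glorot) normal initialization: std depends on fan-in and fan-out
    return np.sqrt(2.0 / (n_in + n_out))

def he_std(n_in):
    # He normal initialization (fan-in form), intended for ReLU-like activations
    return np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
n_in, n_out = 300, 100                      # e.g. hidden layer 1 -> hidden layer 2
W_xavier = rng.normal(0.0, xavier_std(n_in, n_out), size=(n_in, n_out))
W_he = rng.normal(0.0, he_std(n_in), size=(n_in, n_out))

print(W_xavier.std(), W_he.std())
```

Both standard deviations are far below 1, which is the point: the wider the layer, the smaller the initial weights need to be to keep the signal variance stable.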

#### 2. Non-Saturating Activation Functions
* Before 2010, many assumed that sigmoid was a good choice, because the activation of biological neurons looks roughly like a sigmoid.
* However, this turned out not to hold for artificial neurons.
* Better activations for artificial neurons
  * ReLU
    * dying ReLU: once a neuron's weighted sum is negative for every training input, it outputs 0 and its gradient is 0, so it stops learning.
  * Leaky ReLU
    * addresses dying ReLU by using a small constant slope for negative inputs
    * often outperforms plain ReLU
  * RReLU (Randomized Leaky ReLU)
    * picks a random leak (slope for negative inputs) during training and uses a fixed average leak during testing
  * PReLU (Parametric Leaky ReLU)
    * the leak is also learned by backpropagation
  * ELU (Exponential Linear Unit)
    * average output closer to 0 (which helps against vanishing gradients)
    * non-zero gradient over the whole input range
    * smooth everywhere, including around 0, when its hyperparameter alpha is 1
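
The activations listed above can be written in a few lines of NumPy. This is a sketch for comparison only; the default `alpha` values (0.01 for the leak, 1.0 for ELU) are common choices assumed here, not values taken from the notes.

```python
import numpy as np

def relu(z):
    # outputs 0 (and has zero gradient) for every negative input
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # small constant slope alpha for negative inputs
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # exponential curve for negative inputs, saturating at -alpha;
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

z = np.linspace(-3.0, 3.0, 7)
print(relu(z))
print(leaky_relu(z))
print(elu(z))
```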

#### 3. Batch Normalization 
* Even with all the improvements mentioned above, vanishing gradients can still come back during training.
* **Normalize the inputs just before the activation function of each layer during training. Each batch norm layer has 4 parameter vectors: scale and offset are learned by backpropagation, while mean and std. dev are tracked as moving averages and used at test time.**
* key benefits
  * strong reduction of vanishing gradients
  * less sensitive to weight initialization
  * faster learning
  * significant performance improvement (especially in image classification)
  * regularization effect (helps prevent overfitting)
* drawback
  * runtime penalty
* ELU + He init vs. Batch Norm
  * when runtime performance matters, first check how well the ELU + He init combination works before adding batch norm
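
The training-time and test-time behaviour described above can be sketched in NumPy as follows. The function names, the `eps` value, and the toy data are assumptions for illustration; this version tracks the moving variance rather than the std. dev, and it is not the TensorFlow implementation.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, moving_mean, moving_var, momentum=0.9, eps=1e-3):
    # normalize with the statistics of the current mini-batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # update the moving averages that will be used at test time
    new_mean = momentum * moving_mean + (1.0 - momentum) * mu
    new_var = momentum * moving_var + (1.0 - momentum) * var
    # gamma (scale) and beta (offset) are the parameters learned by backpropagation
    return gamma * x_hat + beta, new_mean, new_var

def batch_norm_infer(x, gamma, beta, moving_mean, moving_var, eps=1e-3):
    # at test time, use the moving averages instead of batch statistics
    x_hat = (x - moving_mean) / np.sqrt(moving_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(50, 4))      # toy mini-batch: 50 samples, 4 units
gamma, beta = np.ones(4), np.zeros(4)
moving_mean, moving_var = np.zeros(4), np.ones(4)

out_train, new_mean, new_var = batch_norm_train(x, gamma, beta, moving_mean, moving_var)
out_test = batch_norm_infer(x, gamma, beta, new_mean, new_var)
print(out_train.mean(axis=0), out_train.std(axis=0))
```

The `momentum=0.9` default mirrors the value passed to `tf.layers.batch_normalization` in the cells of this chapter.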
  

    
    


In [1]:
# Simple DNN with Batch Normalization 
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedShuffleSplit
import tensorflow as tf
import numpy as np
from datetime import datetime

tf.reset_default_graph()

digits = load_digits()
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2)
digits_data = digits['data']
digits_label = digits['target']

for train_indices, test_indices in  sss.split(digits_data, digits_label):
    train_data , test_data = digits_data[train_indices], digits_data[test_indices]
    train_label, test_label = digits_label[train_indices], digits_label[test_indices]

n_features = digits_data.shape[1]
n_cls = 10
n_epochs = 100
n_hidden_1 = 300
n_hidden_2 = 100
batch_size = 10

class RandomBatch():
    def __init__(self, data, size):
        self.data = data
        self.b_sz = size
        self.iter_cnt = 0
        self.d_len = len(data)
        
    def __iter__(self):
        self.iter_cnt = 0
        return self
        
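    def __next__(self):
        # Python 3 iterator protocol: one epoch yields about d_len / b_sz random batches
        if self.iter_cnt < (self.d_len / self.b_sz):
            self.iter_cnt = self.iter_cnt + 1
            return self.data[np.random.choice(self.d_len, self.b_sz, replace=False)]
        else:
            raise StopIteration

    next = __next__  # alias so the class also iterates under Python 2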
            
            
# NOTE: this iterator yields batches of 50 samples, not `batch_size` (10)
random_batch = RandomBatch(np.c_[train_data, train_label], 50)

X = tf.placeholder(dtype=tf.float32, shape=(None, n_features), name="X")
Y = tf.placeholder(dtype=tf.int32, shape=(None,), name="Y")  # 1-D vector of class ids
training = tf.placeholder_with_default(False, shape=(), name="training")


with tf.name_scope("dnn_w_batchnorm"):
    network = "complex"
    hidden_1 = tf.layers.dense(X, n_hidden_1, name="hidden1")
    bn1 = tf.layers.batch_normalization(hidden_1, training=training, momentum=0.9)
    bn1_act = tf.nn.elu(bn1)
    hidden_2 = tf.layers.dense(bn1_act, n_hidden_2, name="hidden2")
    bn2 = tf.layers.batch_normalization(hidden_2, training=training, momentum=0.9)
    b2_act = tf.nn.elu(bn2)
    logits = tf.layers.dense(b2_act, n_cls, name="output")

# with tf.name_scope("dnn_simple"):
#     network = "simple"
#     hidden_1 = tf.layers.dense(X, n_hidden_1, name="hidden_1", activation=tf.nn.elu)
#     hidden_2 = tf.layers.dense(hidden_1, n_hidden_2, name="hidden_2", activation=tf.nn.elu)
#     logits = tf.layers.dense(hidden_2, n_cls, name="output")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer()
    train_op = optimizer.minimize(loss)
    
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, Y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

saver = tf.train.Saver()
log_writer = tf.summary.FileWriter('/home/tf_logs/mnist_dnn_{}_{}_b{}_e{}'.format(network,
                                                                                  datetime.utcnow().strftime("%Y%m%d%H%M%S"),
                                                                                  batch_size, 
                                                                                  n_epochs), graph=tf.get_default_graph())

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for sample in iter(random_batch):
            train_x = sample[:,:-1]
            train_y = sample[:,-1:]
            sess.run([train_op, extra_update_ops], feed_dict={X:train_x, Y: train_y.flatten(), training: True})
        summary = tf.summary.Summary()
        
        acc_train = sess.run(accuracy, feed_dict={X: train_data, Y: train_label, training: False})
        acc_test = sess.run(accuracy, feed_dict={X: test_data, Y: test_label, training: False})
        summary.value.add(tag="acc_train", simple_value=acc_train)
        summary.value.add(tag="acc_test", simple_value=acc_test)
        log_writer.add_summary(summary, epoch * batch_size)
    saver.save(sess,'/home/tf_logs/mnist_dnn.ckpt')

log_writer.close()
    





#### 4. Gradient Clipping
* clip the gradients during backpropagation so that they never exceed a given threshold; this mitigates exploding gradients (rather than vanishing gradients)
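
A small NumPy sketch of two common clipping styles. Clipping by value is the one that matches the `tf.clip_by_value` call used in this chapter; clipping by norm is shown only for comparison and is an addition, as are the helper names and the toy gradient.

```python
import numpy as np

def clip_by_value(grad, threshold):
    # clip every component independently to [-threshold, threshold]
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm):
    # rescale the whole vector if its L2 norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

grad = np.array([0.9, 100.0])               # toy gradient with one huge component
print(clip_by_value(grad, 1.0))             # components are capped, direction changes
print(clip_by_norm(grad, 1.0))              # direction is kept, length is capped
```

Clipping by value can change the direction of the gradient, while clipping by norm preserves it; which one works better is an empirical question.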


In [2]:

tf.reset_default_graph()

X = tf.placeholder(dtype=tf.float32, shape=(None, n_features), name="X")
Y = tf.placeholder(dtype=tf.int32, shape=(None,), name="Y")  # 1-D vector of class ids
training = tf.placeholder_with_default(False, shape=(), name="training")
threshold = 1.0

with tf.name_scope("dnn_w_batchnorm"):
    network = "complex"
    hidden_1 = tf.layers.dense(X, n_hidden_1, name="hidden_1")
    bn1 = tf.layers.batch_normalization(hidden_1, training=training, momentum=0.9)
    bn1_act = tf.nn.elu(bn1)
    hidden_2 = tf.layers.dense(bn1_act, n_hidden_2, name="hidden_2")
    bn2 = tf.layers.batch_normalization(hidden_2, training=training, momentum=0.9)
    b2_act = tf.nn.elu(bn2)
    logits = tf.layers.dense(b2_act, n_cls, name="output")


with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
with tf.name_scope("train"):
    ## gradient clipping on optimization process
    optimizer = tf.train.AdamOptimizer()
    grads_and_vars = optimizer.compute_gradients(loss)
    capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var) for grad, var in grads_and_vars]
    train_op = optimizer.apply_gradients(capped_gvs, name="train_op")
    
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, Y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

saver = tf.train.Saver()
log_writer = tf.summary.FileWriter('/home/tf_logs/mnist_dnn_{}_{}_b{}_e{}'.format(network,
                                                                                  datetime.utcnow().strftime("%Y%m%d%H%M%S"),
                                                                                  batch_size, 
                                                                                  n_epochs), graph=tf.get_default_graph())

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for sample in iter(random_batch):
            train_x = sample[:,:-1]
            train_y = sample[:,-1:]
            sess.run([train_op, extra_update_ops], feed_dict={X:train_x, Y: train_y.flatten(), training: True})
        summary = tf.summary.Summary()
        
        acc_train = sess.run(accuracy, feed_dict={X: train_data, Y: train_label, training: False})
        acc_test = sess.run(accuracy, feed_dict={X: test_data, Y: test_label, training: False})
        summary.value.add(tag="acc_train", simple_value=acc_train)
        summary.value.add(tag="acc_test", simple_value=acc_test)
        log_writer.add_summary(summary, epoch * batch_size)
    saver.save(sess,'/home/tf_logs/mnist_dnn.ckpt')

log_writer.close()
    

#### 5. Reusing Pretrained Layers
> Part of a pretrained model's network (typically the lower layers) can be reused in a new model.
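
Which variables get reused is decided by the `scope` regular expression passed to `tf.get_collection`, which filters items by name with `re.match`. The sketch below imitates that filtering on plain strings; the helper `filter_by_scope` and the variable names in the list are illustrative assumptions, not output from a real graph.

```python
import re

def filter_by_scope(names, scope):
    # keep the names that match the pattern from the start of the string
    pattern = re.compile(scope)
    return [name for name in names if pattern.match(name)]

# hypothetical variable names for a network like the one in this chapter
names = [
    "hidden_1/kernel:0", "hidden_1/bias:0",
    "batch_normalization/gamma:0",
    "hidden_2/kernel:0", "hidden_2/bias:0",
    "hidden_3/kernel:0", "output/kernel:0",
]

reused = filter_by_scope(names, "hidden_[12]")       # layers to restore
trained = filter_by_scope(names, "hidden_3|output")  # new layers
print(reused)
print(trained)
```

Under these assumed names, a pattern such as `hidden_[12]` does not select the batch normalization variables, so they would not be restored from the checkpoint.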

In [3]:
tf.reset_default_graph()

X = tf.placeholder(dtype=tf.float32, shape=(None, n_features), name="X")
Y = tf.placeholder(dtype=tf.int32, shape=(None,), name="Y")  # 1-D vector of class ids
training = tf.placeholder_with_default(False, shape=(), name="training")

with tf.name_scope("dnn_w_batchnorm"):
    network = "complex"
    hidden_1 = tf.layers.dense(X, n_hidden_1, name="hidden_1")
    bn1 = tf.layers.batch_normalization(hidden_1, training=training, momentum=0.9)
    bn1_act = tf.nn.elu(bn1)
    hidden_2 = tf.layers.dense(bn1_act, n_hidden_2, name="hidden_2")
    bn2 = tf.layers.batch_normalization(hidden_2, training=training, momentum=0.9)
    b2_act = tf.nn.elu(bn2)
    ## new hidden layer  is added
    hidden_3 = tf.layers.dense(b2_act, 100, name="hidden_3", activation=tf.nn.elu)
    logits = tf.layers.dense(hidden_3, n_cls, name="output")



extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

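# select only the dense-layer variables of hidden_1 and hidden_2;
# the batch norm variables do not match this pattern and keep their fresh initial values
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden_[12]")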
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore = tf.train.Saver(reuse_vars_dict)

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
       
with tf.name_scope("train"):
    ## plain Adam optimizer (no gradient clipping in this cell)
    optimizer = tf.train.AdamOptimizer()
    train_op = optimizer.minimize(loss)
    
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, Y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()
log_writer = tf.summary.FileWriter('/home/tf_logs/new_mnist_dnn_{}_{}_b{}_e{}'.format(network,
                                                                                  datetime.utcnow().strftime("%Y%m%d%H%M%S"),
                                                                                  batch_size, 
                                                                                  n_epochs), graph=tf.get_default_graph())



with tf.Session() as sess:
    sess.run(init)
    restore.restore(sess, '/home/tf_logs/mnist_dnn.ckpt')
    
    for epoch in range(n_epochs):
        for sample in iter(random_batch):
            train_x = sample[:,:-1]
            train_y = sample[:,-1:]
            sess.run([train_op, extra_update_ops], feed_dict={X:train_x, Y: train_y.flatten(), training: True})
        summary = tf.summary.Summary()
        
        acc_train = sess.run(accuracy, feed_dict={X: train_data, Y: train_label, training: False})
        acc_test = sess.run(accuracy, feed_dict={X: test_data, Y: test_label, training: False})
        summary.value.add(tag="acc_train", simple_value=acc_train)
        summary.value.add(tag="acc_test", simple_value=acc_test)
        log_writer.add_summary(summary, epoch * batch_size)
    saver.save(sess,'/home/tf_logs/new_mnist_dnn.ckpt')
log_writer.close()



INFO:tensorflow:Restoring parameters from /home/tf_logs/mnist_dnn.ckpt


##### Freezing Lower Layers 
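> The weights of the reused lower layers can be frozen (left unchanged) while the new upper layers are trained. This makes training more efficient, and the outputs of the frozen layers can be computed once and cached.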
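
The idea can be sketched without TensorFlow: a frozen lower layer produces features once, and only a small top layer is trained on the cached features. Everything here (the toy data, the random stand-in for pretrained weights, the logistic top layer, the learning rate) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                    # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)        # toy binary labels

# stand-in for restored, pretrained lower-layer weights (random here)
W1 = rng.normal(0.0, np.sqrt(2.0 / 8), size=(8, 16))
W1_before = W1.copy()

# the frozen layer never changes, so its output is computed once and cached
H = np.maximum(0.0, X @ W1)

# trainable top layer: logistic regression on the cached features
w2, b2 = np.zeros(16), 0.0

def loss_and_grads(H, y, w2, b2):
    p = 1.0 / (1.0 + np.exp(-(H @ w2 + b2)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12))
    err = p - y
    return loss, H.T @ err / len(y), err.mean()

initial_loss, _, _ = loss_and_grads(H, y, w2, b2)
for _ in range(200):
    _, grad_w, grad_b = loss_and_grads(H, y, w2, b2)
    w2 -= 0.1 * grad_w                           # only the top layer is updated
    b2 -= 0.1 * grad_b
final_loss, _, _ = loss_and_grads(H, y, w2, b2)
print(initial_loss, final_loss)
```

Because `H` is computed once, every training step skips the forward pass through the frozen layer, which is where the efficiency gain comes from.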

In [4]:
# building base model


tf.reset_default_graph()

X = tf.placeholder(dtype=tf.float32, shape=(None, n_features), name="X")
Y = tf.placeholder(dtype=tf.int32, shape=(None,), name="Y")  # 1-D vector of class ids

with tf.name_scope("dnn"):
    network = "simple"
    hidden_1 = tf.layers.dense(X, n_hidden_1, name="hidden_1", activation=tf.nn.elu)
    hidden_2 = tf.layers.dense(hidden_1, n_hidden_2, name="hidden_2", activation=tf.nn.elu)
    logits = tf.layers.dense(hidden_2, n_cls, name="output")


with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
with tf.name_scope("train"):
    ## plain Adam optimizer (no gradient clipping in this cell)
    optimizer = tf.train.AdamOptimizer()
    train_op = optimizer.minimize(loss)
    
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, Y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()
log_writer = tf.summary.FileWriter('/home/tf_logs/base_mnist_dnn_{}_{}_b{}_e{}'.format(network,
                                                                                  datetime.utcnow().strftime("%Y%m%d%H%M%S"),
                                                                                  batch_size, 
                                                                                  n_epochs), graph=tf.get_default_graph())



with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for sample in iter(random_batch):
            train_x = sample[:,:-1]
            train_y = sample[:,-1:]
            sess.run([train_op], feed_dict={X: train_x, Y: train_y.flatten()})
        summary = tf.summary.Summary()
        
        acc_train = sess.run(accuracy, feed_dict={X: train_data, Y: train_label})
        acc_test = sess.run(accuracy, feed_dict={X: test_data, Y: test_label})
        summary.value.add(tag="acc_train", simple_value=acc_train)
        summary.value.add(tag="acc_test", simple_value=acc_test)
        log_writer.add_summary(summary, epoch * batch_size)
    saver.save(sess,'/home/tf_logs/base_mnist_dnn.ckpt')
log_writer.close()



In [7]:

tf.reset_default_graph()

X = tf.placeholder(dtype=tf.float32, shape=(None, n_features), name="X")
Y = tf.placeholder(dtype=tf.int32, shape=(None,), name="Y")  # 1-D vector of class ids
he_init = tf.contrib.layers.variance_scaling_initializer()

with tf.name_scope("dnn"):
    network = "simple"
    hidden_1 = tf.layers.dense(X, n_hidden_1, name="hidden_1", activation=tf.nn.elu)
    hidden_2 = tf.layers.dense(hidden_1, n_hidden_2, name="hidden_2", activation=tf.nn.elu)
    ## new hidden layer is added (He initialization)
    hidden_3 = tf.layers.dense(hidden_2, n_hidden_2, name="hidden_3", activation=tf.nn.elu, kernel_initializer=he_init)
    logits = tf.layers.dense(hidden_3, n_cls, name="output")
    
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="hidden_3|output")
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden_[12]")

restore_saver = tf.train.Saver(reuse_vars)
                               
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
with tf.name_scope("train"):
    ## only the new layers (hidden_3 and output) are trained; hidden_1 and hidden_2 stay frozen
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9, use_nesterov=True)
    train_op = optimizer.minimize(loss, var_list=train_vars)
    
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, Y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()
log_writer = tf.summary.FileWriter('/home/tf_logs/frz_mnist_dnn_{}_{}_b{}_e{}'.format(network,
                                                                                  datetime.utcnow().strftime("%Y%m%d%H%M%S"),
                                                                                  batch_size, 
                                                                                  n_epochs), graph=tf.get_default_graph())



with tf.Session() as sess:
    sess.run(init)
    restore_saver.restore(sess, '/home/tf_logs/base_mnist_dnn.ckpt')
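    # the frozen layers never change, so compute the output of hidden_2 once and reuse it every epoch
    h2_cached = sess.run(hidden_2, feed_dict={X: train_data})
    cached_random_batch = RandomBatch(np.c_[h2_cached, train_label], batch_size)
    for epoch in range(n_epochs):
        for sample in iter(cached_random_batch):
            train_x = sample[:,:-1]
            train_y = sample[:,-1:]
            # feed the cached activations directly into hidden_2, bypassing the frozen layers
            sess.run([train_op], feed_dict={hidden_2: train_x, Y: train_y.flatten()})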
        summary = tf.summary.Summary()
        
        acc_train = sess.run(accuracy, feed_dict={X: train_data, Y: train_label})
        acc_test = sess.run(accuracy, feed_dict={X: test_data, Y: test_label})
        summary.value.add(tag="acc_train", simple_value=acc_train)
        summary.value.add(tag="acc_test", simple_value=acc_test)
        log_writer.add_summary(summary, epoch * batch_size)
    saver.save(sess,'/home/tf_logs/frz_mnist_dnn.ckpt')
log_writer.close()


INFO:tensorflow:Restoring parameters from /home/tf_logs/base_mnist_dnn.ckpt
