The vanishing (or exploding) gradient problem: as error signals are back-propagated through the network, they either shrink exponentially with the number of layers (vanish) or grow without bound (explode). In either case, the trained network tends to converge to a poor local minimum.
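
The exponential decay can be illustrated with a minimal sketch that is not part of the original notebook. It assumes sigmoid activations, whose derivative never exceeds 0.25, and ignores the weight matrices, which can shrink or amplify the signal further. The function names are hypothetical.

```python
import numpy as np

def sigmoid_derivative(x):
    # Derivative of the logistic sigmoid; its maximum, 0.25, is reached at x = 0
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def max_gradient_factor(n_layers):
    # Best-case product of sigmoid derivatives accumulated over n_layers
    return sigmoid_derivative(0.0) ** n_layers

for n in (1, 3, 10):
    print(n, "layers -> factor of at most", max_gradient_factor(n))
```

Even in this best case the factor falls below one millionth after ten layers, which is why non-saturating activations are attractive for deep networks.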

Non-linear activation functions that do not saturate for positive inputs:

Rectified Linear Unit, ReLU: $y=\max(0,x)$, $y \in [0,\infty)$, used with a small learning rate, $\alpha \rightarrow 0$

Leaky ReLU: $y=\max(s \cdot x,x)$, typically $s=0.01$

Exponential Linear Unit, ELU: $y=s(e^{x}-1)$ for $x<0$ and $y=x$ otherwise, usually $s=1$. If $s=1$, then $y \in (-1,\infty)$
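
The three definitions above can be sketched in NumPy as follows. This is an illustrative version, not code from the notebook; the function names and the sample vector are assumptions.

```python
import numpy as np

def relu(x):
    # y = max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, s=0.01):
    # y = max(s * x, x); the small slope s keeps a gradient for x < 0
    return np.maximum(s * x, x)

def elu(x, s=1.0):
    # y = x for x > 0, s * (exp(x) - 1) otherwise.
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return np.where(x > 0, x, s * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print("ReLU:      ", relu(x))
print("Leaky ReLU:", leaky_relu(x))
print("ELU:       ", elu(x))
```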

# Reading data

Same function as previous examples:

In [3]:
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle
from sklearn.preprocessing import OneHotEncoder
from time import time

In [4]:
file_name = "notMNIST.pickle"
def make_datasets (file, n_training_samples=0, n_dev_samples=0, 
                   n_testing_samples=0, one_hot=False):
    # The with-statement closes the file automatically
    with open (file,'rb') as f:
        dataset = pickle.load(f)

    train_dataset = dataset['train_dataset']
    train_labels = dataset['train_labels']
    dev_dataset = dataset['valid_dataset']
    dev_labels = dataset['valid_labels']
    test_dataset = dataset['test_dataset']
    test_labels = dataset['test_labels']

    # Prepare training, dev (validation) and final testing data. 
    # Images are flattened because arrays of shape (n_samples, n_features) are expected

    all_training_samples, width, height = train_dataset.shape
    train_attributes = np.reshape(train_dataset, (all_training_samples, 
                                                  width * height))
    if (n_training_samples != 0):
        train_attributes = train_attributes[0:n_training_samples]
        train_labels = train_labels[0:n_training_samples]

    all_dev_samples, width, height = dev_dataset.shape
    dev_attributes = np.reshape(dev_dataset,
                                       (all_dev_samples, width * height))
    if (n_dev_samples != 0):
        dev_attributes = dev_attributes[0:n_dev_samples]
        dev_labels = dev_labels[0:n_dev_samples]

    all_testing_samples, width, height = test_dataset.shape
    test_attributes = np.reshape(test_dataset, (all_testing_samples, width * height))
    if (n_testing_samples != 0):
        test_attributes = test_attributes[0:n_testing_samples]
        test_labels = test_labels[0:n_testing_samples]

    # If one-hot encoding is requested, OneHotEncoder 
    # from scikit-learn is used
    if one_hot:
        enc = OneHotEncoder(sparse=False)
        # Labels are one-dimensional vectors, 
        # and are reshaped to matrices of one column.
        # The encoder is fitted on the training labels only, so that
        # the dev and test labels share the same column order.
        train_labels = enc.fit_transform(train_labels.reshape(len(train_labels),1))
        dev_labels = enc.transform(dev_labels.reshape(len(dev_labels), 1))
        test_labels = enc.transform(test_labels.reshape(len(test_labels), 1))

    return (train_attributes, train_labels, dev_attributes, 
            dev_labels, test_attributes, test_labels)

In [5]:
NUM_TRAINING_SAMPLES = 10000
NUM_DEV_SAMPLES = 1000
NUM_TESTING_SAMPLES = 1000

In [6]:
x_train, y_train, x_dev, y_dev, x_test, y_test = make_datasets(file_name, 
                                 n_training_samples=NUM_TRAINING_SAMPLES,
                                 n_dev_samples=NUM_DEV_SAMPLES, 
                                 n_testing_samples=NUM_TESTING_SAMPLES,
                                 one_hot=True)
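
As a rough illustration of what `make_datasets` does to each split, the sketch below flattens a toy batch of 28x28 images and one-hot encodes its labels with plain NumPy. The toy arrays and the fixed `n_classes = 10` are assumptions made for the example; the notebook itself relies on `OneHotEncoder`, which infers the classes from the data.

```python
import numpy as np

# Toy stand-ins for two 28x28 images and their integer labels (hypothetical data)
images = np.arange(2 * 28 * 28, dtype=np.float32).reshape(2, 28, 28)
labels = np.array([3, 0])

# Flatten each image into a row vector: (n_samples, width * height)
n_samples, width, height = images.shape
flat = images.reshape(n_samples, width * height)

# One-hot encoding: pick row `label` of the identity matrix for each sample
n_classes = 10
one_hot = np.eye(n_classes, dtype=np.float32)[labels]

print(flat.shape, one_hot.shape)
print(one_hot)
```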

# Building the 28x28-300-200-100-10 deep neural network

Hyper-parameter configuration:

In [7]:
n_epochs = 10000
epochs_to_display = 200
batch_size = 200
learning_rate = 0.01

n_inputs = len(x_train[0])
n_hidden1 = 300
n_hidden2 = 200
n_hidden3 = 100
n_outputs = len(y_train[0])

First, the input __X__ and target __t__ matrices are defined as placeholders:

In [8]:
with tf.name_scope("io"):
    X = tf.placeholder(dtype=tf.float32, shape=(None,n_inputs), name="X")
    t = tf.placeholder(dtype=tf.float32, shape=(None,n_outputs), name="t")

Then, the neural network topology is defined: a fully connected 28x28-300-200-100-10 deep neural network. Note that ReLU is the activation function for the hidden layers, while the output layer computes linear logits followed by softmax. `net_out` holds the logits of the output layer.

In [9]:
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3")
    net_out = tf.layers.dense(hidden3, n_outputs, name="net_out")
    y = tf.nn.softmax(logits=net_out, name="y")
    rounded_y = tf.round(y)
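
For readers who want to see what these dense layers compute, here is a minimal NumPy sketch of the same forward pass. It is not the TensorFlow implementation: the random weight initialization, the `forward` helper, and the random input batch are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [784, 300, 200, 100, 10]   # 28x28-300-200-100-10

# One (weights, biases) pair per dense layer; small random weights are an assumption
params = [(rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    a = x
    for W, b in params[:-1]:
        a = np.maximum(0.0, a @ W + b)        # hidden layers: ReLU
    W, b = params[-1]
    logits = a @ W + b                        # output layer: linear logits
    # Softmax, shifted by the row maximum for numerical stability
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return logits, exp / exp.sum(axis=1, keepdims=True)

x = rng.random((5, 784))                      # a batch of 5 fake flattened images
logits, probs = forward(x, params)
print(logits.shape, probs.sum(axis=1))
```

Each row of `probs` sums to one, which is what lets the output be read as class probabilities.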

# Loss and cost functions with cross entropy and log-loss

In [10]:
with tf.name_scope("loss"):
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=t, logits=net_out)
    mean_log_loss = tf.reduce_mean(cross_entropy, name="mean_loss")

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.
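
The quantity computed by the loss cell can be sketched in NumPy as below. This is an illustrative re-derivation, not TensorFlow's implementation; the helper name and the toy logits and labels are assumptions.

```python
import numpy as np

def softmax_cross_entropy(labels, logits):
    # Log-softmax computed in a numerically stable way (subtract the row maximum)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross entropy per sample: -sum_k t_k * log(y_k)
    return -(labels * log_softmax).sum(axis=1)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

per_sample = softmax_cross_entropy(labels, logits)
mean_loss = per_sample.mean()     # counterpart of mean_log_loss in the notebook
print(per_sample, mean_loss)
```

With uniform logits over three classes, as in the second row, the loss equals $\ln 3$, the cost of a completely uninformed prediction.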



# Defining the learning algorithm: gradient descent with back-prop

In [11]:
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(mean_log_loss)
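
Conceptually, each run of `train_step` moves every weight a small step against its gradient. The sketch below shows that update rule on a toy quadratic, assuming the same learning rate of 0.01; the objective and the function name are hypothetical, and back-propagation itself is replaced here by an analytic gradient.

```python
import numpy as np

def gradient_descent_step(w, grad, learning_rate=0.01):
    # w <- w - alpha * dL/dw
    return w - learning_rate * grad

w = np.array([1.0, -2.0])
for _ in range(1000):
    grad = 2.0 * w                # gradient of the toy loss L(w) = ||w||^2
    w = gradient_descent_step(w, grad)

print(w)                          # close to the minimum at the origin
```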

# Evaluating the model

In [12]:
correct_predictions = tf.equal(tf.argmax(y, 1), tf.argmax(t, 1))
accuracy = tf.reduce_mean(tf.cast(correct_predictions,tf.float32))
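
The same accuracy measure can be written in NumPy as a quick sanity check. This is a sketch with made-up probabilities and targets, not output from the model.

```python
import numpy as np

def accuracy(probs, targets):
    # A prediction is correct when the most probable class matches the one-hot target
    correct = np.argmax(probs, axis=1) == np.argmax(targets, axis=1)
    return correct.astype(np.float32).mean()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.4, 0.5, 0.1]])
targets = np.array([[1, 0, 0],
                    [0, 0, 1],
                    [1, 0, 0]])

print(accuracy(probs, targets))   # two of the three samples are classified correctly
```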

# Executing the model

In [13]:
init = tf.global_variables_initializer()

In [14]:
start_time = time()
with tf.Session() as sess:
    sess.run(init)
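    for epoch in range (int(n_epochs / epochs_to_display)):
        for iteration in range (epochs_to_display):
            # Start index of the mini-batch within the training set.
            # Note that iteration * epoch is 0 throughout the first outer pass,
            # so those updates all reuse the first mini-batch.
            offset = (iteration * epoch * batch_size) % (y_train.shape[0] - batch_size)
            sess.run(train_step, feed_dict={X: x_train[offset:(offset+batch_size),:],
                                            t: y_train[offset:(offset+batch_size),:]})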
        accuracy_train = accuracy.eval(feed_dict={X: x_train, t: y_train})
        accuracy_dev = accuracy.eval(feed_dict={X: x_dev, t: y_dev})
        print((epoch+1)*epochs_to_display, "Train accuracy: ", accuracy_train, 
              "Development accuracy: ", accuracy_dev)

    accuracy_test = accuracy.eval(feed_dict={X: x_test, t: y_test})
    print ("Test accuracy: ", accuracy_test)
    print ("Target values:\n", y_test[0:10], "\nComputed values:\n", 
           rounded_y.eval(feed_dict={X: x_test[0:10]}))
    print ("First 10 Predictions: ", 
           correct_predictions.eval(feed_dict={X: x_test[0:10], t: y_test[0:10]}))
print ("Elapsed time: ", time()-start_time, "secs.")

200 Train accuracy:  0.6782 Development accuracy:  0.674
400 Train accuracy:  0.7992 Development accuracy:  0.799
600 Train accuracy:  0.8182 Development accuracy:  0.811
800 Train accuracy:  0.8297 Development accuracy:  0.812
1000 Train accuracy:  0.8374 Development accuracy:  0.821
1200 Train accuracy:  0.8439 Development accuracy:  0.83
1400 Train accuracy:  0.8486 Development accuracy:  0.828
1600 Train accuracy:  0.8505 Development accuracy:  0.831
1800 Train accuracy:  0.8574 Development accuracy:  0.83
2000 Train accuracy:  0.8642 Development accuracy:  0.831
2200 Train accuracy:  0.8678 Development accuracy:  0.832
2400 Train accuracy:  0.8733 Development accuracy:  0.835
2600 Train accuracy:  0.8773 Development accuracy:  0.836
2800 Train accuracy:  0.8795 Development accuracy:  0.839
3000 Train accuracy:  0.8782 Development accuracy:  0.834
3200 Train accuracy:  0.8863 Development accuracy:  0.837
3400 Train accuracy:  0.8918 Development accuracy:  0.836
3600 Train accuracy:
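
The training loop derives each mini-batch start index from `iteration * epoch`. A common alternative, shown here only as a sketch and not as the notebook's method, derives it from a single global step counter so that consecutive updates walk through the training set. The helper name `batch_offset` is hypothetical.

```python
def batch_offset(step, batch_size, n_samples):
    # Start index of the mini-batch for a given global step, wrapping around
    # before the slice would run past the end of the data
    return (step * batch_size) % (n_samples - batch_size)

# With 10,000 training samples and batches of 200, as configured in the notebook
offsets = [batch_offset(step, 200, 10000) for step in range(5)]
print(offsets)
```

In the nested loop, the global step would be `epoch * epochs_to_display + iteration`.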

__Results compared with the one-hidden-layer model:__

Train: $85\% \rightarrow 97\%$; Development: $81\% \rightarrow 84\%$; Final test: $87\% \rightarrow 90\%$

This deep model uses a total of __600 neurons__ distributed across three hidden layers, __instead of the 1,000__ neurons of the one-hidden-layer model.
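
The neuron counts can also be translated into trainable parameters. The sketch below counts weights and biases for fully connected layers; it assumes the one-hidden-layer model has the shape 784-1000-10, which is inferred from the figures above rather than stated explicitly, and the helper name is hypothetical.

```python
def dense_param_count(layer_sizes):
    # Each dense layer has n_in * n_out weights plus n_out biases
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

deep = dense_param_count([784, 300, 200, 100, 10])   # this notebook's network
shallow = dense_param_count([784, 1000, 10])         # assumed one-hidden-layer shape

print("deep:", deep, "shallow:", shallow)
```

Under that assumption the deep network has well under half the parameters of the shallow one.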