## Udacity course - Deep Learning
### Assignment 3 
Problems and sample code are taken from Udacity's free course on Deep Learning.

First the necessary packages are imported. Then the notMNIST dataset is loaded (containing characters A to J).

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle
import os

In [2]:
data_root = "C:\\Programming\\Udacity_Deep_Learning\\Exercise_01" # Change me to store data elsewhere

pickle_file = os.path.join(data_root, 'notMNIST.pickle')

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Changes to data format: 
    1) Images from 2D to 1D
    2) One hot encoding of labels

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)


In [4]:
def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

## Problem 1

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.


So, first the logistic model:

In [17]:
# logistic model (only minor modifications to original code to implement regularization)

batch_size = 128

reg_scale = 1e-3 #strength of regularization --> important hyperparameter --> tuning

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size, image_size * image_size))                                   
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Weights.
    weights = tf.Variable(
        tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    logits = tf.matmul(tf_train_dataset, weights) + biases
    base_loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    reg_loss = tf.reduce_sum(tf.nn.l2_loss(weights)) #nothing more then ~ sum(weights ** 2) / 2
    loss = tf.add(base_loss, reg_scale * reg_loss)

    # Optimizer.
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9).minimize(loss) #Momentum Optimizer instead of GradientDescent

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

In [18]:
num_steps = 5001 

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size) 
        
        # Generate a minibatch:
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
          [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 1000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 21.773857
Minibatch accuracy: 7.8%
Validation accuracy: 7.7%
Minibatch loss at step 1000: 0.965446
Minibatch accuracy: 79.7%
Validation accuracy: 80.7%
Minibatch loss at step 2000: 0.621625
Minibatch accuracy: 85.9%
Validation accuracy: 80.9%
Minibatch loss at step 3000: 0.756445
Minibatch accuracy: 80.5%
Validation accuracy: 80.9%
Minibatch loss at step 4000: 0.831370
Minibatch accuracy: 72.7%
Validation accuracy: 80.8%
Minibatch loss at step 5000: 0.702688
Minibatch accuracy: 82.0%
Validation accuracy: 81.3%
Test accuracy: 88.4%


#### next step:
Now add l2-regularization to a 1-layer neural network.

In [22]:
# simple neural networks:

reg_scale = 1e-3 #strength of regularization --> important hyperparameter --> tuning

batch_size = 128
num_hidden1 = 1024 #no of elements in hidden layer 1

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Weights
    weights1 = tf.Variable(tf.truncated_normal([image_size * image_size, num_hidden1]))
    biases1 = tf.Variable(tf.zeros([num_hidden1]))
    weights2 = tf.Variable(tf.truncated_normal([num_hidden1, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))
  
    # Training computation.
    # Hidden layer 1
    hidden1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1) 
    
    # Output
    logits = tf.matmul(hidden1, weights2) + biases2 
    
    # Loss
    base_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    reg_loss = tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) #l2 regularization
    loss = tf.add(base_loss, reg_scale * reg_loss)

    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    hidden1_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1) #compute hidden layer
    valid_prediction = tf.nn.softmax(tf.matmul(hidden1_valid, weights2) + biases2)
    hidden1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1) #compute hidden layer
    test_prediction = tf.nn.softmax(tf.matmul(hidden1_test, weights2) + biases2)

In [25]:
num_steps = 10001 
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    
    print("Initialized")
    
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        
        # Generate a minibatch:
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 1000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
            print("Mean weights1: ", np.mean(tf.abs(weights1).eval()), 
                  " | Mean weights2: ", np.mean(tf.abs(weights2).eval()))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 810.279602
Minibatch accuracy: 10.2%
Validation accuracy: 30.8%
Mean weights1:  0.722253  | Mean weights2:  0.83402
Minibatch loss at step 1000: 113.542099
Minibatch accuracy: 85.2%
Validation accuracy: 82.0%
Mean weights1:  0.431994  | Mean weights2:  0.243169
Minibatch loss at step 2000: 41.104588
Minibatch accuracy: 88.3%
Validation accuracy: 84.7%
Mean weights1:  0.260832  | Mean weights2:  0.132334
Minibatch loss at step 3000: 15.566499
Minibatch accuracy: 85.2%
Validation accuracy: 86.6%
Mean weights1:  0.158188  | Mean weights2:  0.0903077
Minibatch loss at step 4000: 6.059772
Minibatch accuracy: 87.5%
Validation accuracy: 87.2%
Mean weights1:  0.0964016  | Mean weights2:  0.0755019
Minibatch loss at step 5000: 2.448214
Minibatch accuracy: 89.8%
Validation accuracy: 88.0%
Mean weights1:  0.0592797  | Mean weights2:  0.0661187
Minibatch loss at step 6000: 1.164128
Minibatch accuracy: 90.6%
Validation accuracy: 88.0%
Mean weights1:  0.0371566 

Here we immediately get > 94% with a -still super-simple- neural network.... much better than logistic regression.

## Problem 2

Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

In [30]:
num_steps = 3001 

num_batches = 4 # number of mini-batches to train on

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    
    print("Initialized")
    
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = ((step % num_batches) * batch_size) % (train_labels.shape[0] - batch_size)
        
        # Generate a minibatch:
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 1000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
            print("Mean weights1: ", np.mean(tf.abs(weights1).eval()), 
                  " | Mean weights2: ", np.mean(tf.abs(weights2).eval()))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 701.815125
Minibatch accuracy: 7.0%
Validation accuracy: 36.4%
Mean weights1:  0.721942  | Mean weights2:  0.766291
Minibatch loss at step 1000: 115.419258
Minibatch accuracy: 100.0%
Validation accuracy: 75.5%
Mean weights1:  0.437205  | Mean weights2:  0.453178
Minibatch loss at step 2000: 42.449886
Minibatch accuracy: 100.0%
Validation accuracy: 75.5%
Mean weights1:  0.265145  | Mean weights2:  0.274849
Minibatch loss at step 3000: 15.614120
Minibatch accuracy: 100.0%
Validation accuracy: 75.8%
Mean weights1:  0.160802  | Mean weights2:  0.166925
Test accuracy: 83.7%


Not surprisingly, the model performs much worth. 
It is now overfitting on the little data of the few mini-batches provided for training.

## Problem 3

Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

In [33]:
# simple neural networks + regularization + dropout

reg_scale = 1e-3 #strength of regularization --> important hyperparameter --> tuning
dropout_rate = 0.5 

batch_size = 128
num_hidden1 = 1024 #no of elements in hidden layer 1

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    #only use DROPOUT during training (default is training=False)
    training = tf.placeholder_with_default(False, shape=(), name='training') 
    tf_train_dataset_drop = tf.layers.dropout(tf_train_dataset, dropout_rate, training=training)

    # Weights.
    weights1 = tf.Variable(tf.truncated_normal([image_size * image_size, num_hidden1]))
    biases1 = tf.Variable(tf.zeros([num_hidden1]))
    weights2 = tf.Variable(tf.truncated_normal([num_hidden1, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))
  
    # Training computation.
    # Hidden layer 1
    hidden1 = tf.nn.relu(tf.matmul(tf_train_dataset_drop, weights1) + biases1) 
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training) #only include in training!
    
    # Output
    logits = tf.matmul(hidden1_drop, weights2) + biases2 
    
    # Loss    
    base_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    reg_loss = tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) 
    loss = tf.add(base_loss, reg_scale * reg_loss)

    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    hidden1_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1) #compute hidden layer
    valid_prediction = tf.nn.softmax(tf.matmul(hidden1_valid, weights2) + biases2)
    hidden1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1) #compute hidden layer
    test_prediction = tf.nn.softmax(tf.matmul(hidden1_test, weights2) + biases2)

In [34]:
num_steps = 3001 

num_batches = 4 # number of mini-batches to train on

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    
    print("Initialized")
    
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = ((step % num_batches) * batch_size) % (train_labels.shape[0] - batch_size)
        
        # Generate a minibatch:
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {training : True, tf_train_dataset : batch_data, tf_train_labels : batch_labels} #set training -> True
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 1000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
            print("Mean weights1: ", np.mean(tf.abs(weights1).eval()), 
                  " | Mean weights2: ", np.mean(tf.abs(weights2).eval()))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 1011.486694
Minibatch accuracy: 11.7%
Validation accuracy: 19.6%
Mean weights1:  0.722651  | Mean weights2:  0.765303
Minibatch loss at step 1000: 133.861786
Minibatch accuracy: 97.7%
Validation accuracy: 77.2%
Mean weights1:  0.445452  | Mean weights2:  0.667341
Minibatch loss at step 2000: 46.071102
Minibatch accuracy: 98.4%
Validation accuracy: 78.6%
Mean weights1:  0.271446  | Mean weights2:  0.423325
Minibatch loss at step 3000: 17.002235
Minibatch accuracy: 100.0%
Validation accuracy: 77.9%
Mean weights1:  0.164989  | Mean weights2:  0.261948
Test accuracy: 85.3%


Again, when restricted to few mini-batches, we see drastic overfitting!

## Problem 4

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is 97.1%.

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:






First, let's start with a 3-layer neural network:

In [44]:
# deeper NN + regularization + learning rate decay

# hyperparameters
scale_reg = 1e-4 #strength of regularization

initial_learning_rate = 0.1
decay_steps = 5000
decay_rate = 1/4

batch_size = 128

# network layers
num_hidden1 = 800 #no of elements in hidden layer 1
num_hidden2 = 400
num_hidden3 = 200


# weight initialization
def weight_variable(shape, name = None):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial, name = name)


graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size))                               
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    global_step = tf.Variable(0, trainable=False, name="global_step") 

    #Layer 1
    weights1 = weight_variable([image_size * image_size, num_hidden1])
    biases1 = tf.Variable(tf.zeros([num_hidden1]))
    
    #Layer 2
    weights2 = weight_variable([num_hidden1, num_hidden2])
    biases2 = tf.Variable(tf.zeros([num_hidden2]))
    
    #Layer 3
    weights3 = weight_variable([num_hidden2, num_hidden3])
    biases3 = tf.Variable(tf.zeros([num_hidden3]))
    
    #Output layer
    weights4 = weight_variable([num_hidden3, num_labels])
    biases4 = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    lay1_train = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    lay2_train = tf.nn.relu(tf.matmul(lay1_train, weights2) + biases2)
    lay3_train = tf.nn.relu(tf.matmul(lay2_train, weights3) + biases3)
    
    # Output
    logits = tf.matmul(lay3_train, weights4) + biases4
    
    # Loss
    base_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    reg_loss = tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
    loss = tf.add(base_loss, scale_reg * reg_loss)

    # Optimizer.
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step, decay_steps, decay_rate, staircase=True)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9).minimize(loss, global_step=global_step)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    lay1_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    lay2_valid = tf.nn.relu(tf.matmul(lay1_valid, weights2) + biases2)
    lay3_valid = tf.nn.relu(tf.matmul(lay2_valid, weights3) + biases3)
    valid_prediction = tf.nn.softmax(tf.matmul(lay3_valid, weights4) + biases4)
    lay1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    lay2_test = tf.nn.relu(tf.matmul(lay1_test, weights2) + biases2)
    lay3_test = tf.nn.relu(tf.matmul(lay2_test, weights3) + biases3)
    test_prediction = tf.nn.softmax(tf.matmul(lay3_test, weights4) + biases4)

In [45]:
num_steps = 10001 
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    
    print("Initialized")
    
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        
        # Generate a minibatch:
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 1000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
            print("Mean weights1: ", np.mean(tf.abs(weights1).eval()), 
                  " | Mean weights2: ", np.mean(tf.abs(weights2).eval()))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 6.234508
Minibatch accuracy: 12.5%
Validation accuracy: 17.3%
Mean weights1:  0.0722632  | Mean weights2:  0.072213
Minibatch loss at step 1000: 0.788205
Minibatch accuracy: 87.5%
Validation accuracy: 85.5%
Mean weights1:  0.0670178  | Mean weights2:  0.0686215
Minibatch loss at step 2000: 0.617137
Minibatch accuracy: 90.6%
Validation accuracy: 86.5%
Mean weights1:  0.0620059  | Mean weights2:  0.0636479
Minibatch loss at step 3000: 0.692472
Minibatch accuracy: 86.7%
Validation accuracy: 87.6%
Mean weights1:  0.0576146  | Mean weights2:  0.0593083
Minibatch loss at step 4000: 0.639147
Minibatch accuracy: 88.3%
Validation accuracy: 88.3%
Mean weights1:  0.0539482  | Mean weights2:  0.0556504
Minibatch loss at step 5000: 0.438611
Minibatch accuracy: 94.5%
Validation accuracy: 88.9%
Mean weights1:  0.0508358  | Mean weights2:  0.052532
Minibatch loss at step 6000: 0.405787
Minibatch accuracy: 93.0%
Validation accuracy: 90.3%
Mean weights1:  0.0497408 

That's becoming quite good ...


With increasing numbers of layers as well as additional optimization (dropout, batch normalization etc.), the way of constructing the network layers should better change. It simply becomes too messy if we go on as in the examples so far.

So now, instead of using the `tf.nn.` elements I'll switch to `tf.layers.`since those already include a lot of functions. And this way the weights will also be initialized automatically.

Next, dropout is added (`nn.layers.dropout`) and the number of neurons is slightly higher.

In [73]:
# deeper NN + dropout + regularization + learning rate decay

# hyperparameters
scale_reg = 0#1e-4 #strength of regularization
dropout_rate = 0.3 #0.5

initial_learning_rate = 0.1
decay_steps = 5000
decay_rate = 1/4

batch_size = 128

# network layers
num_hidden1 = 1200 #no of elements in hidden layer 1
num_hidden2 = 600
num_hidden3 = 300

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(None, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.int64, shape=(None))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    global_step = tf.Variable(0)
    
    training = tf.placeholder_with_default(False, shape=(), name='training') #only use DROPOUT during training!

    # First dropout layer:
    tf_train_dataset_drop = tf.layers.dropout(tf_train_dataset, dropout_rate, training=training)
  
    # Network layers.
    hidden1 = tf.layers.dense(tf_train_dataset_drop, num_hidden1, activation=tf.nn.relu, name="hidden1") #change: not tf.layers... will automatically initiate weights
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)                          
    
    hidden2 = tf.layers.dense(hidden1_drop, num_hidden2, activation=tf.nn.relu, name="hidden2") #change: not tf.layers... will automatically initiate weights
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    
    hidden3 = tf.layers.dense(hidden2_drop, num_hidden3, activation=tf.nn.relu, name="hidden3") #change: not tf.layers... will automatically initiate weights
    hidden3_drop = tf.layers.dropout(hidden3, dropout_rate, training=training)
        
    # Output
    logits = tf.layers.dense(hidden3_drop, num_labels, name="outputs") 
    
    # Loss
    #extract weights:
    weights1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
    weights2 = tf.get_default_graph().get_tensor_by_name("hidden2/kernel:0")
    weights3 = tf.get_default_graph().get_tensor_by_name("hidden3/kernel:0")
    weights4 = tf.get_default_graph().get_tensor_by_name("outputs/kernel:0")
    
    base_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    reg_loss = tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
    loss = tf.add(base_loss, reg_scale * reg_loss)

    # Optimizer.
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step, decay_steps, decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)    
    
    # Evaluate
    correct = tf.equal(tf.argmax(logits, 1), tf.argmax(tf_train_labels, 1))
    accuracy_measure = tf.reduce_mean(tf.cast(correct, dtype=tf.float32))

I will also switch from "num_steps" to "epochs" since this is more easily comparable if we start playing with different batch-sizes.

In [74]:
num_epochs = 10 #one epoch = 1x forward pass + backward pass of all training examples
num_batches = train_labels.shape[0] // batch_size 

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    
    print("Initialized")
    
    for epoch in range(num_epochs):
        for step in range(num_batches):
            # Pick an offset within the training data, which has been randomized.
            # Note: we could use better randomization across epochs.
            offset = (step * batch_size) % (train_labels.shape[0] - batch_size)

            # Generate a minibatch:
            batch_data = train_dataset[offset:(offset + batch_size), :]
            batch_labels = train_labels[offset:(offset + batch_size), :]
            feed_dict={training: True, tf_train_dataset : batch_data, tf_train_labels : batch_labels}
            
            # Train minibatch
            session.run(training_op, feed_dict) 
        
        acc_batch = accuracy_measure.eval(session = session, feed_dict={tf_train_dataset: batch_data, tf_train_labels: batch_labels})
        acc_valid = accuracy_measure.eval(session = session, feed_dict={tf_train_dataset: valid_dataset, tf_train_labels: valid_labels})
        loss_minibatch = loss.eval(session = session, feed_dict={tf_train_dataset: batch_data, tf_train_labels: batch_labels})
        print("Minibatch loss after epoch %d: %f" % (epoch+1, loss_minibatch))
        print("Minibatch accuracy: ", acc_batch, " | Validation accuracy: ", acc_valid)
    
    acc_test = accuracy_measure.eval(session = session, feed_dict={tf_train_dataset: test_dataset, tf_train_labels: test_labels})
    print("Test accuracy: ", acc_test)

Initialized
Minibatch loss after epoch 1: 0.702568
Minibatch accuracy:  0.84375  | Validation accuracy:  0.8598
Minibatch loss after epoch 2: 0.574617
Minibatch accuracy:  0.859375  | Validation accuracy:  0.872
Minibatch loss after epoch 3: 0.510364
Minibatch accuracy:  0.875  | Validation accuracy:  0.8811
Minibatch loss after epoch 4: 0.494948
Minibatch accuracy:  0.875  | Validation accuracy:  0.8907
Minibatch loss after epoch 5: 0.484861
Minibatch accuracy:  0.882813  | Validation accuracy:  0.8937
Minibatch loss after epoch 6: 0.458294
Minibatch accuracy:  0.890625  | Validation accuracy:  0.898
Minibatch loss after epoch 7: 0.444091
Minibatch accuracy:  0.898438  | Validation accuracy:  0.9009
Minibatch loss after epoch 8: 0.443158
Minibatch accuracy:  0.898438  | Validation accuracy:  0.903
Minibatch loss after epoch 9: 0.440389
Minibatch accuracy:  0.914063  | Validation accuracy:  0.9022
Minibatch loss after epoch 10: 0.438266
Minibatch accuracy:  0.90625  | Validation accura

Well, that's not really an improvement to the simplear case earlier on.

Now let's try some other networks then...

Here's a deeper network (4 hidden layers with 1000, 400, 100, 80 neurons). And I've added "early stopping" to avoid overfitting.

In [141]:
# DNN + dropout + learning rate decay + early stopping
# ELUs (instead of reLUs)

# Network architecture:
num_neurons_layers = np.array([1000, 400, 100, 80])

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(None, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.int64, shape=(None))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    tf_dropout_rate = tf.placeholder(tf.float32, shape=(None), name='tf_dropout_rate')
    #tf_initial_learning_rate = tf.placeholder(tf.float32, shape=(None), name='tf_initial_learning_rate')
    tf_decay_rate = tf.placeholder(tf.float32, shape=(None), name='tf_decay_rate')    
    tf_learning_rate = tf.placeholder(tf.float32, shape=(None), name='tf_decay_rate')  
    
    global_step = tf.Variable(0)
    
    training = tf.placeholder_with_default(False, shape=(), name='training') #only use DROPOUT during training!
    
    # Define DNN model
    def DNN_model(inputs):
        
        # go through dense layers:
        for layer in range(num_neurons_layers.shape[0]):
            num_neurons = num_neurons_layers[layer]
            if tf_dropout_rate is not None:
                inputs = tf.layers.dropout(inputs, tf_dropout_rate, training=training) 
            inputs = tf.layers.dense(inputs, num_neurons, activation=tf.nn.elu, name="hidden%d" % (layer + 1))
        
        # Output
        if tf_dropout_rate is not None:
            inputs = tf.layers.dropout(inputs, tf_dropout_rate, training=training) 
        return tf.layers.dense(inputs, num_labels, name="outputs") #10 output layers (logits, before softmax)
        
    # Training data
    logits = DNN_model(tf_train_dataset)
    
    
    # Loss 
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    
    # Optimizer.
    #learning_rate = tf.train.exponential_decay(tf_initial_learning_rate, global_step, decay_steps, tf_decay_rate)
    optimizer = tf.train.AdamOptimizer(learning_rate=tf_learning_rate) #optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)#.minimize(loss, global_step=global_step)
    training_op = optimizer.minimize(loss) #, global_step=global_step)#.minimize(loss)    
    
    # Evaluate
    correct = tf.equal(tf.argmax(logits, 1), tf.argmax(tf_train_labels, 1))
    accuracy_measure = tf.reduce_mean(tf.cast(correct, dtype=tf.float32))
    
    # Save model in early stopping
    model_saver = tf.train.Saver()
    

In [142]:
max_checks_without_progress = 10
checks_without_progress = 0
best_loss = np.infty

num_epochs = 40 #one epoch = 1x forward pass + backward pass of all training examples
#num_batches = train_labels.shape[0] // batch_size 

# Hyperparameters
DROP_rate = [0.1] 
LEARN_rate = np.array([1e-4], dtype='float')
#DECAY_rate = np.array([0.5, 0.25, 0.1], dtype='float32')
decay_rate = 1
BATCH_size = np.array([150], dtype='int')
decay_steps = 5000

#Simple (and not very elegant) hyper parameter search:
for dropout_rate in DROP_rate:
    for learning_rate in LEARN_rate:
        for batch_size in BATCH_size:
            num_batches = (train_labels.shape[0] // batch_size).astype(int) 
            
            with tf.Session(graph=graph) as session:
                tf.global_variables_initializer().run()

                print("Learning rate: ", learning_rate, " | Dropout rate: ", dropout_rate, 
                     " | batch_size: ", batch_size)

                for epoch in range(num_epochs):
                    for step in range(num_batches):
                        # Pick an offset within the training data, which has been randomized.
                        # Note: we could use better randomization across epochs.
                        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
                        # Generate a minibatch.
                        batch_data = train_dataset[offset:(offset + batch_size), :]
                        batch_labels = train_labels[offset:(offset + batch_size), :]

                        feed_dict={training: True, tf_train_dataset : batch_data, tf_train_labels : batch_labels, 
                           tf_learning_rate : learning_rate, tf_decay_rate : decay_rate, 
                           tf_dropout_rate : dropout_rate}
                        
                        # Train model:
                        session.run(training_op, feed_dict)
                   
                
                    feed_dict={tf_train_dataset: valid_dataset, tf_train_labels: valid_labels, tf_learning_rate : 0, tf_decay_rate : 0, tf_dropout_rate : 0}
                    loss_valid, acc_valid = session.run([loss, accuracy_measure], feed_dict)
                    if loss_valid < best_loss:
                        save_path = model_saver.save(session, "C:\\Programming\\Udacity_Deep_Learning\\notMNIST_model_01.ckpt")
                        best_loss = loss_valid
                        checks_without_progress = 0
                    else:
                        checks_without_progress += 1
                        if checks_without_progress > max_checks_without_progress:
                            print("Early stopping!")
                            break
                    print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.2f}%".format(
                        (epoch+1), loss_valid, best_loss, acc_valid * 100))

                acc_batch = accuracy_measure.eval(session = session, feed_dict={tf_train_dataset: batch_data, tf_train_labels: batch_labels, tf_learning_rate : 0, tf_decay_rate : 0, tf_dropout_rate : 0})
                acc_valid = accuracy_measure.eval(session = session, feed_dict={tf_train_dataset: valid_dataset, tf_train_labels: valid_labels, tf_learning_rate : 0, tf_decay_rate : 0, tf_dropout_rate : 0})
                acc_test = accuracy_measure.eval(session = session, feed_dict={tf_train_dataset: test_dataset, tf_train_labels: test_labels, tf_learning_rate : 0, tf_decay_rate : 0, tf_dropout_rate : 0})
                print("Test accuracy: ", acc_test, " | Batch accuracy: ", acc_batch, " | Valid accuracy: ", acc_valid)

Learning rate:  0.0001  | Dropout rate:  0.1  | batch_size:  150
1	Validation loss: 0.493366	Best loss: 0.493366	Accuracy: 85.15%
2	Validation loss: 0.445149	Best loss: 0.445149	Accuracy: 86.32%
3	Validation loss: 0.415768	Best loss: 0.415768	Accuracy: 87.41%
4	Validation loss: 0.393848	Best loss: 0.393848	Accuracy: 88.13%
5	Validation loss: 0.383416	Best loss: 0.383416	Accuracy: 88.38%
6	Validation loss: 0.372152	Best loss: 0.372152	Accuracy: 88.75%
7	Validation loss: 0.358697	Best loss: 0.358697	Accuracy: 89.07%
8	Validation loss: 0.350609	Best loss: 0.350609	Accuracy: 89.38%
9	Validation loss: 0.346526	Best loss: 0.346526	Accuracy: 89.58%
10	Validation loss: 0.342017	Best loss: 0.342017	Accuracy: 89.65%
11	Validation loss: 0.335934	Best loss: 0.335934	Accuracy: 89.96%
12	Validation loss: 0.330833	Best loss: 0.330833	Accuracy: 90.07%
13	Validation loss: 0.327956	Best loss: 0.327956	Accuracy: 90.35%
14	Validation loss: 0.323960	Best loss: 0.323960	Accuracy: 90.20%
15	Validation loss: 

This is slowly getting real good!

For now, let's do one last try by increasing the dropout rate and decreasing the learning rate:



In [143]:
# DNN + dropout + learning rate decay + early stopping
# ELUs (instead of reLUs)

# Network architecture:
num_neurons_layers = np.array([1000, 300, 100, 80])

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(None, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.int64, shape=(None))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    tf_dropout_rate = tf.placeholder(tf.float32, shape=(None), name='tf_dropout_rate')
    #tf_initial_learning_rate = tf.placeholder(tf.float32, shape=(None), name='tf_initial_learning_rate')
    tf_decay_rate = tf.placeholder(tf.float32, shape=(None), name='tf_decay_rate')    
    tf_learning_rate = tf.placeholder(tf.float32, shape=(None), name='tf_decay_rate')  
    
    global_step = tf.Variable(0)
    
    training = tf.placeholder_with_default(False, shape=(), name='training') #only use DROPOUT during training!
    
    # Define DNN model
    def DNN_model(inputs):
        
        # go through dense layers:
        for layer in range(num_neurons_layers.shape[0]):
            num_neurons = num_neurons_layers[layer]
            if tf_dropout_rate is not None:
                inputs = tf.layers.dropout(inputs, tf_dropout_rate, training=training) 
            inputs = tf.layers.dense(inputs, num_neurons, activation=tf.nn.elu, name="hidden%d" % (layer + 1))
        
        # Output
        if tf_dropout_rate is not None:
            inputs = tf.layers.dropout(inputs, tf_dropout_rate, training=training) 
        return tf.layers.dense(inputs, num_labels, name="outputs") #10 output layers (logits, before softmax)
        
    # Training data
    logits = DNN_model(tf_train_dataset)
    
    
    # Loss 
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    
    # Optimizer.
    #learning_rate = tf.train.exponential_decay(tf_initial_learning_rate, global_step, decay_steps, tf_decay_rate)
    optimizer = tf.train.AdamOptimizer(learning_rate=tf_learning_rate) #optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)#.minimize(loss, global_step=global_step)
    training_op = optimizer.minimize(loss) #, global_step=global_step)#.minimize(loss)    
    
    # Evaluate
    correct = tf.equal(tf.argmax(logits, 1), tf.argmax(tf_train_labels, 1))
    accuracy_measure = tf.reduce_mean(tf.cast(correct, dtype=tf.float32))
    
    # Save model in early stopping
    model_saver = tf.train.Saver()
    

In [144]:
max_checks_without_progress = 10
checks_without_progress = 0
best_loss = np.infty

num_epochs = 70 #one epoch = 1x forward pass + backward pass of all training examples
#num_batches = train_labels.shape[0] // batch_size 

# Hyperparameters
DROP_rate = [0.3] 
LEARN_rate = np.array([5e-4], dtype='float')
#DECAY_rate = np.array([0.5, 0.25, 0.1], dtype='float32')
decay_rate = 1
BATCH_size = np.array([150], dtype='int')

#Simple (and not very elegant) hyper parameter search:
for dropout_rate in DROP_rate:
    for learning_rate in LEARN_rate:
        for batch_size in BATCH_size:
            num_batches = (train_labels.shape[0] // batch_size).astype(int) 
            
            with tf.Session(graph=graph) as session:
                tf.global_variables_initializer().run()

                print("Learning rate: ", learning_rate, " | Dropout rate: ", dropout_rate, 
                     " | batch_size: ", batch_size)

                for epoch in range(num_epochs):
                    for step in range(num_batches):
                        # Pick an offset within the training data, which has been randomized.
                        # Note: we could use better randomization across epochs.
                        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
                        # Generate a minibatch.
                        batch_data = train_dataset[offset:(offset + batch_size), :]
                        batch_labels = train_labels[offset:(offset + batch_size), :]

                        feed_dict={training: True, tf_train_dataset : batch_data, tf_train_labels : batch_labels, 
                           tf_learning_rate : learning_rate, tf_decay_rate : decay_rate, 
                           tf_dropout_rate : dropout_rate}
                        
                        # Train model:
                        session.run(training_op, feed_dict)
                   
                
                    feed_dict={tf_train_dataset: valid_dataset, tf_train_labels: valid_labels, tf_learning_rate : 0, tf_decay_rate : 0, tf_dropout_rate : 0}
                    loss_valid, acc_valid = session.run([loss, accuracy_measure], feed_dict)
                    if loss_valid < best_loss:
                        save_path = model_saver.save(session, "C:\\Programming\\Udacity_Deep_Learning\\notMNIST_model_01.ckpt")
                        best_loss = loss_valid
                        checks_without_progress = 0
                    else:
                        checks_without_progress += 1
                        if checks_without_progress > max_checks_without_progress:
                            print("Early stopping!")
                            break
                    print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.2f}%".format(
                        (epoch+1), loss_valid, best_loss, acc_valid * 100))

                acc_batch = accuracy_measure.eval(session = session, feed_dict={tf_train_dataset: batch_data, tf_train_labels: batch_labels, tf_learning_rate : 0, tf_decay_rate : 0, tf_dropout_rate : 0})
                acc_valid = accuracy_measure.eval(session = session, feed_dict={tf_train_dataset: valid_dataset, tf_train_labels: valid_labels, tf_learning_rate : 0, tf_decay_rate : 0, tf_dropout_rate : 0})
                acc_test = accuracy_measure.eval(session = session, feed_dict={tf_train_dataset: test_dataset, tf_train_labels: test_labels, tf_learning_rate : 0, tf_decay_rate : 0, tf_dropout_rate : 0})
                print("Test accuracy: ", acc_test, " | Batch accuracy: ", acc_batch, " | Valid accuracy: ", acc_valid)

Learning rate:  0.0005  | Dropout rate:  0.3  | batch_size:  150
1	Validation loss: 0.493042	Best loss: 0.493042	Accuracy: 84.67%
2	Validation loss: 0.439880	Best loss: 0.439880	Accuracy: 86.15%
3	Validation loss: 0.411139	Best loss: 0.411139	Accuracy: 87.14%
4	Validation loss: 0.396362	Best loss: 0.396362	Accuracy: 87.40%
5	Validation loss: 0.377653	Best loss: 0.377653	Accuracy: 88.17%
6	Validation loss: 0.366886	Best loss: 0.366886	Accuracy: 88.48%
7	Validation loss: 0.356058	Best loss: 0.356058	Accuracy: 88.74%
8	Validation loss: 0.346375	Best loss: 0.346375	Accuracy: 89.03%
9	Validation loss: 0.338912	Best loss: 0.338912	Accuracy: 89.27%
10	Validation loss: 0.335615	Best loss: 0.335615	Accuracy: 89.38%
11	Validation loss: 0.332890	Best loss: 0.332890	Accuracy: 89.62%
12	Validation loss: 0.332380	Best loss: 0.332380	Accuracy: 89.67%
13	Validation loss: 0.326014	Best loss: 0.326014	Accuracy: 89.96%
14	Validation loss: 0.318966	Best loss: 0.318966	Accuracy: 90.19%
15	Validation loss: 

That looks fairly OK. 96,56 % accuracy for a not very clean dataset (at least not as clean as the classical MNIST).
I'll come back to it later and see if I can further improve stuff...