Deep Learning
=============

Assignment 3
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function

import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle
from time import time

First reload the data we generated in _notmnist.ipynb_.

In [2]:
pickle_file = './pickled/notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)
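The one-hot step in `reformat` relies on NumPy broadcasting, which is easy to misread. The following is a minimal sketch on a toy label vector; the helper name `one_hot` and the three-class toy data are illustrative assumptions, not part of the assignment code.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class ids of shape (N,) into float32 one-hot rows of shape (N, num_classes)."""
    # labels[:, None] has shape (N, 1) and np.arange(num_classes) has shape (num_classes,).
    # Broadcasting the == comparison yields an (N, num_classes) boolean matrix
    # that is True only in the column matching each label.
    return (np.arange(num_classes) == labels[:, None]).astype(np.float32)

toy_labels = np.array([0, 2, 1])   # hypothetical labels for three examples
encoded = one_hot(toy_labels, 3)
print(encoded)
```

Taking `np.argmax(encoded, 1)` recovers the original integer labels, which is what the `accuracy` function does with both predictions and labels.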


In [73]:
def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

alpha = 0.001     # Global regularization rate
num_steps = 10001  # Training iterations

In [13]:
def Graph1():
    graph = tf.Graph()
    with graph.as_default():
        # Parameters
        learning_rate = 0.01
        batch_size = 128

        # tf Graph Input
        X = tf.placeholder(tf.float32, [None, 784]) # flattened notMNIST image of shape 28*28=784
        y = tf.placeholder(tf.float32, [None, 10]) # 1-hot labels => 10 classes

        # Set model weights
        W = tf.Variable(tf.zeros([784, 10]))
        b = tf.Variable(tf.zeros([10]))

        # Construct model
        pred = tf.nn.softmax(tf.matmul(X, W) + b) # Softmax

        # Minimize error using cross entropy    
        reg_loss = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1)) + alpha*tf.nn.l2_loss(W)
        # Gradient Descent
        optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(reg_loss)

        # Predictions
        valid_reg_prediction = tf.nn.softmax(tf.matmul(valid_dataset, W) + b)
        test_reg_prediction  = tf.nn.softmax(tf.matmul(test_dataset, W) + b)

######################################## Run the Graph #########################################################
    start = time()
    with tf.Session(graph=graph) as session1:

      tf.initialize_all_variables().run()
      print("Initialized")

      for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size),  :]
        batch_labels = train_labels[offset:(offset + batch_size), :]

        # Prepare  dictionary
        _, l, predictions = session1.run(
          [optimizer, reg_loss, pred],
          feed_dict={X: batch_data, y: batch_labels})
        if (step % 500 == 0):
          print("Minibatch loss at step %d: %f" % (step, l))
          print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
          print("Validation accuracy: %.1f%%" % accuracy(
            valid_reg_prediction.eval(), valid_labels))
      print("Test accuracy: %.1f%%" % accuracy(test_reg_prediction.eval(), test_labels))

    print( time() - start )
    tf.reset_default_graph()

In [14]:
Graph1()

Initialized
Minibatch loss at step 0: 2.302585
Minibatch accuracy: 7.0%
Validation accuracy: 51.3%
Minibatch loss at step 500: 0.880955
Minibatch accuracy: 76.6%
Validation accuracy: 80.5%
Minibatch loss at step 1000: 0.759210
Minibatch accuracy: 76.6%
Validation accuracy: 81.6%
Minibatch loss at step 1500: 0.801483
Minibatch accuracy: 76.6%
Validation accuracy: 82.0%
Minibatch loss at step 2000: 0.709083
Minibatch accuracy: 82.0%
Validation accuracy: 82.2%
Minibatch loss at step 2500: 0.697774
Minibatch accuracy: 82.0%
Validation accuracy: 82.4%
Minibatch loss at step 3000: 0.666336
Minibatch accuracy: 82.0%
Validation accuracy: 82.7%
Minibatch loss at step 3500: 0.636356
Minibatch accuracy: 81.2%
Validation accuracy: 82.8%
Minibatch loss at step 4000: 0.729477
Minibatch accuracy: 82.8%
Validation accuracy: 82.8%
Minibatch loss at step 4500: 0.646472
Minibatch accuracy: 84.4%
Validation accuracy: 83.0%
Minibatch loss at step 5000: 0.830461
Minibatch accuracy: 79.7%
Validation accuracy: 82.9%
Minibatch loss at step 5500: 0.669042
Minibatch accuracy: 85.2%
Validation accuracy: 83.0%
Minibatch loss at step 6000: 0.578683
Minibatch accuracy: 85.2%
Validation accuracy: 83.0%
Minibatch loss at step 6500: 0.635417
Minibatch accuracy: 82.8%
Validation accuracy: 83.0%
Minibatch loss at step 7000: 0.753507
Minibatch accuracy: 78.1%
Validation accuracy: 83.2%
Minibatch loss at step 7500: 0.679468
Minibatch accuracy: 84.4%
Validation accuracy: 83.2%
Minibatch loss at step 8000: 0.588391
Minibatch accuracy: 85.2%
Validation accuracy: 83.2%
Test accuracy: 89.2%
8.07359600067


In [74]:
def Graph2():
    batch_size  =  128
    graph = tf.Graph()
    with graph.as_default():
      ############################################
      # Helper function: crunch through one layer
      def crunch(data, param):
          w, b = param
          hid_logits = tf.matmul(data, w) + b
          ReLUed = tf.nn.relu(hid_logits)
          return ReLUed
      ###############

      ######## Parameters #######
      hidden_size = 1024
      starter_learning_rate = 0.0005

      ####### Input data ##########
      # Runtime placeholders for training minibatches
      X_train = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size))
      y_train = tf.placeholder(tf.float32, shape=(batch_size, num_labels))

      # Probability switch for drop_out
      keep_prob = tf.placeholder(tf.float32)

      # The dataset itself is a constant
      tf_train_data = tf.constant(train_dataset)
      tf_valid_data = tf.constant(valid_dataset)
      tf_test_data  = tf.constant(test_dataset)

      ####### Variables ###########
      # decay rate step counter
      global_step = tf.Variable(0, trainable=False)
      
      # hidden layers
      W1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_size]))
      b1 = tf.Variable(tf.zeros([hidden_size]))
      
      lW = tf.Variable(
        tf.truncated_normal([hidden_size, hidden_size]))
      lb = tf.Variable(tf.zeros([hidden_size]))
      
      # output layer
      W = tf.Variable(
        tf.truncated_normal([hidden_size, num_labels]))
      b = tf.Variable(tf.zeros([num_labels]))

       
      ####### Training computation #######

      # Dropout is applied to the pre-activations, then ReLU
      hid1 = tf.nn.relu(
          tf.nn.dropout(tf.matmul(X_train, W1) + b1, keep_prob))

      lhid = tf.nn.relu(
          tf.nn.dropout(tf.matmul(hid1, lW) + lb, keep_prob))

      logits = tf.matmul(lhid, W)  + b
      
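      # Cross-entropy data loss plus an L2 penalty on every weight matrix.
      # Keyword arguments avoid depending on the positional order of logits/labels.
      loss = (tf.reduce_mean(
                  tf.nn.softmax_cross_entropy_with_logits(
                      labels=y_train, logits=logits))
              + alpha * tf.nn.l2_loss(W1)
              + alpha * tf.nn.l2_loss(lW)
              + alpha * tf.nn.l2_loss(W))  # Regularization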
      # Optimizer
      learning_rate = tf.train.exponential_decay(
          starter_learning_rate, global_step, 800, 0.75, staircase=True)
      
      optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
      
      ###### Predictions for the training, validation, and test data  ######
      
      # current batch prediction
      batch_pred = tf.nn.softmax(logits)
      # valid and test
      # Run both constants through the hidden layers without dropout
      def v_t(vp, tp):
          for param in [(W1, b1), (lW, lb)]:
              vp = crunch(vp, param)
              tp = crunch(tp, param)
          return vp, tp

      valid_hidden, test_hidden = v_t(tf_valid_data, tf_test_data)
      valid_prediction = tf.nn.softmax(tf.matmul(valid_hidden, W) + b)
      test_prediction  = tf.nn.softmax(tf.matmul(test_hidden, W) + b)


    ############################ Run the Graph ##############################
    start = time()
    with tf.Session(graph=graph) as session:
      tf.initialize_all_variables().run()
      print("Initialized")
      for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)

        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]

        # Feed  dictionary for computation
        _, l, predictions = session.run(
            [optimizer, loss, batch_pred], feed_dict={X_train : batch_data,
                                                      y_train : batch_labels,
                                                      keep_prob: 0.95})
        if (step % 1000 == 0):
          print("Minibatch loss at step %d: %f" % (step, l))
          print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
          print("Validation accuracy: %.1f%%" % accuracy(
            valid_prediction.eval(), valid_labels))
      print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

    print( time() - start )
    tf.reset_default_graph()

In [75]:
Graph2()

Initialized
Minibatch loss at step 0: 6381.747070
Minibatch accuracy: 18.0%
Validation accuracy: 13.5%
Minibatch loss at step 1000: 1352.035889
Minibatch accuracy: 76.6%
Validation accuracy: 80.6%
Minibatch loss at step 2000: 1231.240356
Minibatch accuracy: 81.2%
Validation accuracy: 81.6%
Minibatch loss at step 3000: 1130.191162
Minibatch accuracy: 83.6%
Validation accuracy: 82.4%
Minibatch loss at step 4000: 1175.996582
Minibatch accuracy: 77.3%
Validation accuracy: 82.5%
Minibatch loss at step 5000: 1122.982666
Minibatch accuracy: 78.9%
Validation accuracy: 82.6%
Minibatch loss at step 6000: 1212.757935
Minibatch accuracy: 79.7%
Validation accuracy: 82.8%
Minibatch loss at step 7000: 1166.821777
Minibatch accuracy: 74.2%
Validation accuracy: 82.8%
Minibatch loss at step 8000: 1054.619629
Minibatch accuracy: 83.6%
Validation accuracy: 83.0%
Minibatch loss at step 9000: 1029.190430
Minibatch accuracy: 81.2%
Validation accuracy: 83.0%
Minibatch loss at step 10000: 1177.946899
Minibatch accuracy: 77.3%
Validation accuracy: 83.0%
Test accuracy: 89.2%
318.882736921

---
Problem 1
---------

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.
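As a rough illustration of what the penalty adds, the sketch below reproduces the quantity `tf.nn.l2_loss` computes (half the sum of squared entries) in NumPy and combines it with a data loss. The helper names `l2_loss` and `regularized_loss`, the toy weights, and the data-loss value 0.7 are assumptions made for the example.

```python
import numpy as np

def l2_loss(t):
    """Half the sum of squared entries, the same quantity tf.nn.l2_loss returns."""
    return np.sum(np.square(t)) / 2.0

def regularized_loss(data_loss, weights, alpha):
    """Add alpha times the L2 penalty of every weight matrix to the data loss."""
    return data_loss + alpha * sum(l2_loss(w) for w in weights)

# Hypothetical toy weights standing in for the layer matrices.
toy_weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0], [4.0]])]

penalty = sum(l2_loss(w) for w in toy_weights)   # 2.625 + 12.5
total = regularized_loss(data_loss=0.7, weights=toy_weights, alpha=0.001)
print(penalty, total)
```

Because the penalty grows with the number and size of the weights, a rate that suits a 784x10 logistic model can dominate the loss of a network with 1024-wide layers, so the rate generally needs tuning per model.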

---
On logistic regression, adding `nn.l2_loss` at first reduced the test accuracy from 88.5% to 84%. The cause was that I had not set a regularization rate.
Even so, to avoid losing accuracy the rate has to be 0.001 or smaller, in which case it merely matches the performance of having no
regularization term at all. In the network with one hidden ReLU layer, test accuracy jumped from 88.5% to 92.7% after introducing regularization.

---
Problem 2
---------
Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

---
This is probably not the intended result: all scores drop (minibatch, validation and test). Test accuracy is still fairly good, though, at
83.8% with just 500 batches.
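One possible reading of "just a few batches" is to keep only a handful of batches' worth of rows and let the existing wrap-around offset rule cycle over them. The sketch below only computes the offsets such a loop would visit; `batch_offsets`, `num_batches` and `subset_size` are hypothetical names, and three batches is an arbitrary choice.

```python
def batch_offsets(num_examples, batch_size, num_steps):
    """Start index of each minibatch, using the same wrap-around rule as the training loops."""
    return [(step * batch_size) % (num_examples - batch_size)
            for step in range(num_steps)]

batch_size = 128
num_batches = 3                          # hypothetical: keep three batches' worth of rows
subset_size = num_batches * batch_size   # 384 rows, e.g. train_dataset[:subset_size]

offsets = batch_offsets(subset_size, batch_size, 6)
print(offsets)
```

With so few distinct rows the same slices repeat every couple of steps, which is the condition under which minibatch accuracy would be expected to climb while validation accuracy lags. Note that with this modulo rule and a subset that is an exact multiple of the batch size, the last batch of rows is never selected.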

---
Problem 3
---------
Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training. 

What happens to our extreme overfitting case?

---
At first, dropout seems to slightly decrease accuracy. The highest observed accuracy without dropout was 92.7%. With the following "keep"
probabilities the accuracies were: 0.45 -> 91.8%, 0.55 -> 92.3%, 0.85 -> 92.5%. So as the keep probability approached 100%, accuracy approached the
no-dropout value of 92.7%. However, with a keep probability of 0.95 the observed accuracy was 93.2%, so there might be a slight improvement with dropout.
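To make the role of the keep probability concrete, here is a minimal NumPy sketch of inverted dropout, the scheme `tf.nn.dropout` follows: kept units are scaled by `1 / keep_prob` and the layer is left untouched at evaluation time. The function name, the `training` flag and the all-ones toy activations are assumptions for illustration only.

```python
import numpy as np

def dropout(activations, keep_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability 1 - keep_prob during training."""
    if not training or keep_prob >= 1.0:
        # Evaluation must be deterministic, so the input passes through unchanged.
        return activations
    mask = rng.random_sample(activations.shape) < keep_prob
    # Scaling the survivors by 1 / keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
hidden = np.ones((4, 1000), dtype=np.float32)   # hypothetical hidden-layer output

train_out = dropout(hidden, 0.5, rng, training=True)
eval_out = dropout(hidden, 0.5, rng, training=False)
print(train_out.mean(), eval_out.mean())
```

In the TensorFlow graph the same switch can be obtained by feeding `keep_prob: 1.0` at evaluation time, or, as `Graph2` does, by building separate evaluation ops that never call dropout.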


---
Problem 4
---------
Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).
One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

According to the docs, the decayed learning rate is calculated as:

    decayed_learning_rate = learning_rate *
                            decay_rate ^ (global_step / decay_steps)
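The formula can be checked by hand with a few lines of plain Python. This is a sketch of the arithmetic only, not TensorFlow's implementation; the function name mirrors the formula, and the values plugged in are the ones used in `Graph2` (starting rate 0.0005, decay every 800 steps, decay rate 0.75, staircase on).

```python
def decayed_learning_rate(learning_rate, global_step, decay_steps, decay_rate,
                          staircase=False):
    """learning_rate * decay_rate ** (global_step / decay_steps)."""
    if staircase:
        # Integer division makes the rate drop in discrete steps.
        exponent = global_step // decay_steps
    else:
        exponent = global_step / float(decay_steps)
    return learning_rate * decay_rate ** exponent

for step in (0, 799, 800, 1600):
    print(step, decayed_learning_rate(0.0005, step, 800, 0.75, staircase=True))
```

With `staircase=True` the rate stays at 0.0005 through step 799, drops to 0.000375 at step 800 and to 0.00028125 at step 1600; without it the rate shrinks a little on every step.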
 
----


First attempts at adding a second hidden layer have not produced good results. The model did not learn at all with an initial learning rate of 0.5;
once it was lowered to 0.005, the loss ceased to explode into NaN values. Still, after increasing num_steps from 3001 to 8001 in an attempt to expose
the model to more data, test accuracy hovers at just 90%, down from the 93.2% of a neural network with a single hidden ReLU layer and light dropout
(keep_prob=0.95). Increasing the steps to 8001 raised the logistic regression's accuracy from 88.5% to 89.2%, which is some evidence of its benefit.
The layers are naively implemented: the larger first layer (2048 nodes) has dropout applied but no ReLU, and the smaller second layer (1024 nodes)
has ReLU but no dropout. My intuition was to use dropout on the larger layer with "fresher" input and to apply the non-linearity afterwards. What all
this seems to indicate, to me, is that increasing exposure to data by raising the steps to 8001 was the only thing that had a positive effect. Indeed,
returning to a single ReLU layer with the slight 0.95 keep probability and retaining the 8001 steps increased the NN test accuracy to 94.2%.

The addition of a decaying learning rate has so far increased accuracy by only 0.02%, with an exponential decay base of 0.85 and a decay step every
800 iterations. Adding a second identical hidden layer again results in a drop in accuracy. A reduced starter learning rate of 0.005 is necessary for
the NN to learn at all.