## NLP Q&A 09.08. - assignment 9

outline:
- train a fully connected NN with tensorflow
- add stochastic gradient descent
- add a hidden layer using `nn.relu`
- regularization (L2 & dropout)


In [3]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression
from six.moves.urllib.request import urlretrieve
from six.moves import cPickle as pickle
from six.moves import range

In [4]:
# import pickle file

pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat the data into a shape better suited to the models we are going to train:
+ data as a flat matrix;
+ labels as float 1-hot encodings

More info on softmax regression:
https://www.geeksforgeeks.org/softmax-regression-using-tensorflow/

In [5]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
    # One shape dimension can be -1. 
    # In this case, the value is inferred from the length of the array 
    # and remaining dimensions.
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)
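
The broadcasting comparison inside `reformat` can be checked on a tiny array. The sketch below is only an illustration: it uses three classes instead of ten, and the toy `labels` values are made up for the example.

```python
import numpy as np

num_labels = 3                     # toy value; the notebook uses 10
labels = np.array([0, 2, 1])       # made-up integer class labels

# labels[:, None] has shape (3, 1) and np.arange(num_labels) has shape (3,).
# Broadcasting compares every label with every class index, giving a (3, 3)
# boolean matrix with exactly one True per row.
one_hot = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
print(one_hot)

# argmax along each row recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
```

The same `argmax` step is what the `accuracy` helper later relies on to turn one-hot rows back into class indices.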


Step 1 - specify the model

In [6]:
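# With full-batch gradient descent, every step processes the whole training
# set, which is slow. train_subset controls how many examples are used:
# 200000 is the full set, so lower it (e.g. to 10000) for faster turnaround.
train_subset = 200000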

# Create graph object: instantiate
graph = tf.Graph()
with graph.as_default():

    '''INPUT DATA'''
    # Load the training, validation and test data into constants that are
    # attached to the graph.
    tf_train_dataset = tf.constant(train_dataset[:train_subset, :])
    tf_train_labels = tf.constant(train_labels[:train_subset])
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    '''VARIABLES'''
    # These are the parameters that we are going to be training. The weight
    # matrix will be initialized using random values following a (truncated)
    # normal distribution. The biases get initialized to zero.
    weights = tf.Variable(tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))

    '''TRAINING COMPUTATION'''
    # We multiply the inputs with the weight matrix, and add biases. We compute
    # the softmax and cross-entropy (it's one operation in TensorFlow, because
    # it's very common, and it can be optimized)
    logits = tf.matmul(tf_train_dataset, weights) + biases
    # We take the average of this cross-entropy across all training examples:
    # that's our loss.
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=tf_train_labels))

    '''OPTIMIZER'''
    # We are going to find the minimum of this loss using gradient descent.
    # 0.5 is the learning rate
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    '''PREDICTIONS for the training, validation, and test data.'''
    # These are not part of training, but merely here so that we can report
    # accuracy figures as we train.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)
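
To make the combined softmax and cross-entropy step concrete, here is a minimal NumPy sketch of the quantity being averaged into the loss. It is an illustration rather than TensorFlow's implementation; the helper names `softmax` and `softmax_cross_entropy` and the toy inputs are assumptions made for this example.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row-wise max avoids overflow and does not change the result.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def softmax_cross_entropy(logits, one_hot_labels):
    # Per-example cross-entropy -sum_k y_k * log(p_k), computed from
    # log-probabilities directly (log-sum-exp) for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1)

# Toy batch: 2 examples, 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

per_example = softmax_cross_entropy(logits, labels)
loss = per_example.mean()          # the role played by tf.reduce_mean
print(per_example, loss)
```

For the second example all logits are equal, so every class gets probability 1/3 and the cross-entropy is log(3).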

Step 2 - run

In [None]:
num_steps = 801

def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

with tf.Session(graph=graph) as session:
    # This is a one-time operation which ensures the parameters get initialized as
    # we described in the graph: random weights for the matrix, zeros for the
    # biases. 
    session.run(tf.global_variables_initializer())
    print('Initialized')
    for step in range(num_steps):
        # Run the computations. We tell .run() that we want to run the optimizer,
        # and get the loss value and the training predictions returned as numpy
        # arrays.
        _, l, predictions = session.run([optimizer, loss, train_prediction])
        if (step % 100 == 0):
            print('Loss at step {}: {}'.format(step, l))
            print('Training accuracy: {:.1f}'.format(accuracy(predictions, 
                                                              train_labels[:train_subset, :])))
            # Calling .eval() on valid_prediction is basically like calling run(), but
            # just to get that one numpy array. Note that it recomputes all its graph
            # dependencies.
            # The training predictions need no .eval(): session.run() above already
            # returned them as a numpy array.
            print('Validation accuracy: {:.1f}'.format(accuracy(valid_prediction.eval(), 
                                                                valid_labels)))
    print('Test accuracy: {:.1f}'.format(accuracy(test_prediction.eval(), test_labels))) 

Initialized
Loss at step 0: 19.6794376373291
Training accuracy: 8.1
Validation accuracy: 10.4
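
As a sanity check on the `accuracy` helper, the sketch below applies the same definition to a hand-made batch. The prediction and label values are invented for illustration.

```python
import numpy as np

def accuracy(predictions, labels):
    # Percentage of rows where the most probable class matches the one-hot label.
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

# Four made-up softmax outputs over three classes.
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6],
                        [0.5, 0.4, 0.1],
                        [0.2, 0.2, 0.6]])
# True classes: 0, 2, 1, 2 (the third prediction is wrong).
labels = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=np.float32)

acc = accuracy(predictions, labels)
print(acc)
```

Three of the four rows match, so the helper returns 75.0. An untrained 10-class model should land near 10%, which is consistent with the step-0 figures above.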


Let's switch to using stochastic gradient descent (SGD).

The graph will be similar, except that instead of holding all the training data in a constant node, we create a `placeholder` node which will be fed actual data at every call of `session.run()`.
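
Before building the TensorFlow graph, the idea of minibatch updates can be sketched in plain NumPy. This is an assumption-laden toy, not the notebook's model: the synthetic data, the sizes `n`, `d`, `k`, and the helper `loss_and_grads` are all invented for the example. Only the batch-offset formula mirrors the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 512, 5, 3                        # toy sizes (assumed)
X = rng.normal(size=(n, d)).astype(np.float32)
true_W = rng.normal(size=(d, k))
y = (X @ true_W).argmax(axis=1)            # synthetic, linearly separable labels
Y = np.eye(k, dtype=np.float32)[y]         # one-hot labels

def loss_and_grads(W, b, xb, yb):
    # Softmax regression: mean cross-entropy and its gradients w.r.t. W and b.
    logits = xb @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(yb * np.log(probs + 1e-12), axis=1))
    dlogits = (probs - yb) / xb.shape[0]
    return loss, xb.T @ dlogits, dlogits.sum(axis=0)

W = np.zeros((d, k))
b = np.zeros(k)
batch_size, learning_rate = 64, 0.5

initial_loss, _, _ = loss_and_grads(W, b, X, Y)
for step in range(200):
    # Each step sees only one minibatch, like feeding a placeholder.
    offset = (step * batch_size) % (n - batch_size)
    xb, yb = X[offset:offset + batch_size], Y[offset:offset + batch_size]
    _, dW, db = loss_and_grads(W, b, xb, yb)
    W -= learning_rate * dW
    b -= learning_rate * db
final_loss, _, _ = loss_and_grads(W, b, X, Y)
print(initial_loss, final_loss)
```

Each update costs a pass over 64 rows rather than all 512, which is the trade-off SGD makes: noisier gradient estimates in exchange for much cheaper steps.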

Step 1 - specify the model

In [17]:
batch_size = 128

graph = tf.Graph()
with graph.as_default():
    
    '''INPUT DATA'''
    # For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    # This is the key difference between GD and SGD.
    tf_train_dataset = tf.placeholder(tf.float32, 
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    '''VARIABLES'''
    # These are the parameters that we are going to be training. The weight
    # matrix will be initialized using random values following a (truncated)
    # normal distribution. The biases get initialized to zero.
    weights = tf.Variable(tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))

    '''TRAINING COMPUTATION'''
    # We multiply the inputs with the weight matrix, and add biases. We compute
    # the softmax and cross-entropy (it's one operation in TensorFlow, because
    # it's very common, and it can be optimized)
    logits = tf.matmul(tf_train_dataset, weights) + biases
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = tf_train_labels))

    '''OPTIMIZER'''
    # We are going to find the minimum of this loss using gradient descent.
    # 0.5 is the learning rate
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    '''PREDICTIONS for the training, validation, and test data'''
    # These are not part of training, but merely here so that we can report
    # accuracy figures as we train.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

Step 2 - run the model

In [18]:
num_steps = 3001

with tf.Session(graph=graph) as session:
    session.run(tf.global_variables_initializer())
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step {}: {}".format(step, l))
            print("Minibatch accuracy: {:.1f}".format(accuracy(predictions, batch_labels)))
            print("Validation accuracy: {:.1f}".format(accuracy(valid_prediction.eval(), valid_labels)))
    print("Test accuracy: {:.1f}".format(accuracy(test_prediction.eval(), test_labels)))

Initialized
Minibatch loss at step 0: 20.22445297241211
Minibatch accuracy: 10.2
Validation accuracy: 13.7
Minibatch loss at step 500: 1.327433466911316
Minibatch accuracy: 78.1
Validation accuracy: 75.1
Minibatch loss at step 1000: 1.3761510848999023
Minibatch accuracy: 71.9
Validation accuracy: 76.6
Minibatch loss at step 1500: 1.4343000650405884
Minibatch accuracy: 73.4
Validation accuracy: 77.2
Minibatch loss at step 2000: 0.9583243131637573
Minibatch accuracy: 82.0
Validation accuracy: 77.0
Minibatch loss at step 2500: 1.2906465530395508
Minibatch accuracy: 74.2
Validation accuracy: 77.8
Minibatch loss at step 3000: 1.0107319355010986
Minibatch accuracy: 76.6
Validation accuracy: 78.2
Test accuracy: 84.8
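
The offset formula in the training loop determines which rows each step sees. The sketch below reproduces it with the notebook's own numbers (200000 training examples, `batch_size = 128`, `num_steps = 3001`); the helper name `batch_offset` is made up for the example.

```python
batch_size = 128
num_examples = 200000      # rows in train_dataset
num_steps = 3001

def batch_offset(step):
    # Same expression as in the training loop: the window slides forward by one
    # batch per step and wraps around before it could run past the last row.
    return (step * batch_size) % (num_examples - batch_size)

offsets = [batch_offset(step) for step in range(num_steps)]
examples_seen = num_steps * batch_size

print(offsets[:3])         # first few windows
print(offsets[1562])       # first offset after the wrap-around
print(examples_seen)
```

With these settings the run touches 384,128 examples in total, a little under two passes over the training set, and the second pass starts at offset 64, so its batches are shifted relative to the first pass.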


### Problem - SGD with ReLU
Turn the logistic regression example with SGD into a neural network with one hidden layer of 1024 rectified linear units ([`nn.relu()`](https://www.tensorflow.org/api_docs/python/tf/nn/relu)). This model should improve your validation / test accuracy.
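
The shapes involved in this network can be checked with a NumPy forward pass. This is a sketch under assumed toy inputs (random data, a batch of 4); only the layer sizes 784, 1024 and 10 come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_inputs, num_hidden, num_labels = 4, 28 * 28, 1024, 10

x = rng.normal(size=(batch, num_inputs)).astype(np.float32)      # fake images
w1 = rng.normal(size=(num_inputs, num_hidden)).astype(np.float32)
b1 = np.zeros(num_hidden, dtype=np.float32)
w2 = rng.normal(size=(num_hidden, num_labels)).astype(np.float32)
b2 = np.zeros(num_labels, dtype=np.float32)

hidden = np.maximum(x @ w1 + b1, 0.0)    # ReLU: negative activations become 0
logits = hidden @ w2 + b2                # one score per class

# Trainable parameters: two weight matrices plus two bias vectors.
num_params = w1.size + b1.size + w2.size + b2.size
print(hidden.shape, logits.shape, num_params)
```

The hidden layer raises the parameter count from 7,850 for plain logistic regression (784 x 10 weights plus 10 biases) to 814,090, which is why this model can fit the data more closely and also why regularization matters more for it.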

Step 1 - specify the model

In [25]:
num_nodes = 1024
batch_size = 128

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables.
    weights_1 = tf.Variable(tf.truncated_normal([image_size * image_size, num_nodes]))
    biases_1 = tf.Variable(tf.zeros([num_nodes]))
    weights_2 = tf.Variable(tf.truncated_normal([num_nodes, num_labels]))
    biases_2 = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    logits_1 = tf.matmul(tf_train_dataset, weights_1) + biases_1
    relu_layer= tf.nn.relu(logits_1) # add relu layer
    logits_2 = tf.matmul(relu_layer, weights_2) + biases_2
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits_2, labels = tf_train_labels))

    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training
    train_prediction = tf.nn.softmax(logits_2)
    
    # Predictions for validation 
    logits_1 = tf.matmul(tf_valid_dataset, weights_1) + biases_1
    relu_layer= tf.nn.relu(logits_1)
    logits_2 = tf.matmul(relu_layer, weights_2) + biases_2
    
    valid_prediction = tf.nn.softmax(logits_2)
    
    # Predictions for test
    logits_1 = tf.matmul(tf_test_dataset, weights_1) + biases_1
    relu_layer= tf.nn.relu(logits_1)
    logits_2 = tf.matmul(relu_layer, weights_2) + biases_2
    
    test_prediction =  tf.nn.softmax(logits_2)

Step 2 - run the model

In [26]:
num_steps = 3001

with tf.Session(graph=graph) as session:
    session.run(tf.global_variables_initializer())
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step {}: {}".format(step, l))
            print("Batch accuracy: {:.1f}".format(accuracy(predictions, batch_labels)))
            print("Validation accuracy: {:.1f}".format(accuracy(valid_prediction.eval(), valid_labels)))
    print("Test accuracy: {:.1f}".format(accuracy(test_prediction.eval(), test_labels)))

Initialized
Minibatch loss at step 0: 285.65771484375
Batch accuracy: 9.4
Validation accuracy: 30.4
Minibatch loss at step 500: 27.626670837402344
Batch accuracy: 80.5
Validation accuracy: 79.3
Minibatch loss at step 1000: 5.7307891845703125
Batch accuracy: 83.6
Validation accuracy: 79.2
Minibatch loss at step 1500: 40.735328674316406
Batch accuracy: 71.9
Validation accuracy: 74.9
Minibatch loss at step 2000: 5.909951210021973
Batch accuracy: 86.7
Validation accuracy: 82.1
Minibatch loss at step 2500: 11.044862747192383
Batch accuracy: 79.7
Validation accuracy: 81.2
Minibatch loss at step 3000: 2.101963758468628
Batch accuracy: 82.8
Validation accuracy: 82.9
Test accuracy: 88.6


More info:
+ Comparing Stochastic Gradient Descent vs. Mini-Batch Gradient Descent:
https://adventuresinmachinelearning.com/stochastic-gradient-descent/
+ Implementing Shallow NN with SGD: https://leemeng.tw/using-tensorflow-to-train-a-shallow-nn-with-stochastic-gradient-descent.html
+ Mathematical explanation for a fully-connected layer:
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/fc_layer.html

### Problems - Regularization
+ 1. Adding L2 regularization

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor $t$ using [`nn.l2_loss(t)`](https://www.tensorflow.org/api_docs/python/tf/nn/l2_loss). The right amount of regularization should improve your validation / test accuracy.

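Note that the regularized loss is $L_{reg} = L + \beta \cdot \frac{1}{2}\sum_i w_i^2$, where $L$ is the original cross-entropy loss and `nn.l2_loss` computes the $\frac{1}{2}\sum_i w_i^2$ term.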
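
As a quick check of what the penalty term computes, here is a NumPy sketch. The function `l2_loss` is a stand-in written for this example that mirrors the `sum(t ** 2) / 2` definition of `nn.l2_loss`; the weight matrix and the `data_loss` value are hypothetical.

```python
import numpy as np

def l2_loss(t):
    # Half the sum of squared entries, matching the definition of nn.l2_loss.
    return 0.5 * np.sum(np.square(t))

weights = np.array([[1.0, -2.0],
                    [3.0, 0.0]])   # hypothetical weights
beta = 0.01                        # regularization strength used in the notebook
data_loss = 0.8                    # hypothetical cross-entropy value

penalty = l2_loss(weights)                 # 0.5 * (1 + 4 + 9 + 0)
total_loss = data_loss + beta * penalty

# The penalty's gradient w.r.t. each weight is beta * weight, so every
# gradient step shrinks the weights toward zero in proportion to their size.
penalty_grad = beta * weights
print(penalty, total_loss)
```

The factor of one half is what makes the gradient come out as exactly `beta * weights`, with no stray factor of 2.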

Step 1 - specify the model (the only difference is that we add a `nn.l2_loss` term to the loss function)

In [30]:
# take a small subset
train_subset = 10000
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)


In [32]:
# This is a good beta value to start with
beta = 0.01

graph = tf.Graph()
with graph.as_default():
    # Input data.
    tf_train_dataset = tf.constant(train_dataset[:train_subset, :])
    tf_train_labels = tf.constant(train_labels[:train_subset])
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Variables    
    # Declare the variables we want to update and optimize.
    weights = tf.Variable(tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
    
    # Training computation.
    logits = tf.matmul(tf_train_dataset, weights) + biases 
    # Original loss function
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = tf_train_labels))
    # Loss function using L2 regularization: add beta * l2_loss(weights).
    # The loss is already a scalar, so no further reduce_mean is needed.
    regularizer = tf.nn.l2_loss(weights)
    loss = loss + beta * regularizer
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

Step 2 - run the model (same procedure as before)

**Challenge: try adding L2 regularization to our previous model, the SGD network with a ReLU hidden layer.**

+ 2. An example for extreme overfitting

Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

Continuing with the simple L2-regularized model specified above, we now train it on a small subsample of 500 examples.

In [40]:
train_dataset_2 = train_dataset[:500, :]
train_labels_2 = train_labels[:500]

valid_dataset_2 = valid_dataset[:1000, :]
valid_labels_2 = valid_labels[:1000, :]

test_dataset_2 = test_dataset[:1000, :]
test_labels_2 = test_labels[:1000, :]

print('Training set', train_dataset_2.shape, train_labels_2.shape)
print('Validation set', valid_dataset_2.shape, valid_labels_2.shape)
print('Test set', test_dataset_2.shape, test_labels_2.shape)

Training set (500, 784) (500, 10)
Validation set (1000, 784) (1000, 10)
Test set (1000, 784) (1000, 10)


In [41]:
# specify the model -- change input data to the small subsample
beta = 0.01 # define the hyper-parameter, beta, for L2 regularization

graph = tf.Graph()
with graph.as_default():
    # Input data.
    tf_train_dataset = tf.constant(train_dataset_2)
    tf_train_labels = tf.constant(train_labels_2)
    tf_valid_dataset = tf.constant(valid_dataset_2)
    tf_test_dataset = tf.constant(test_dataset_2)
    
    # Variables    
    # Declare the variables we want to update and optimize.
    weights = tf.Variable(tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
    
    # Training computation.
    logits = tf.matmul(tf_train_dataset, weights) + biases 
    # Original loss function
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = tf_train_labels))
    # Loss function using L2 regularization: add beta * l2_loss(weights).
    # The loss is already a scalar, so no further reduce_mean is needed.
    regularizer = tf.nn.l2_loss(weights)
    loss = loss + beta * regularizer
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

In [43]:
# Run the model
num_steps = 3001

with tf.Session(graph=graph) as session:
    session.run(tf.global_variables_initializer())
    print("Initialized")
    for step in range(num_steps):
        # Run the computations. We tell .run() that we want to run the optimizer,
        # and get the loss value and the training predictions returned as numpy
        # arrays.
        _, l, predictions = session.run([optimizer, loss, train_prediction])
        if (step % 100 == 0):
            print('Loss at step {}: {}'.format(step, l))
            print('Training accuracy: {:.1f}'.format(accuracy(predictions, 
                                                              train_labels_2)))
            # Calling .eval() on valid_prediction is basically like calling run(), but
            # just to get that one numpy array. Note that it recomputes all its graph
            # dependencies.
            
            # You don't have to do .eval above because we already ran the session for the
            # train_prediction
            print('Validation accuracy: {:.1f}'.format(accuracy(valid_prediction.eval(), 
                                                                valid_labels_2)))
    print('Test accuracy: {:.1f}'.format(accuracy(test_prediction.eval(), test_labels_2))) 

Initialized
Loss at step 0: 47.51041793823242
Training accuracy: 10.2
Validation accuracy: 15.2
Loss at step 100: 10.971570014953613
Training accuracy: 90.8
Validation accuracy: 68.4
Loss at step 200: 4.081431865692139
Training accuracy: 97.2
Validation accuracy: 71.9
Loss at step 300: 1.6595813035964966
Training accuracy: 99.0
Validation accuracy: 74.5
Loss at step 400: 0.7914624214172363
Training accuracy: 99.4
Validation accuracy: 75.9
Loss at step 500: 0.4788540005683899
Training accuracy: 99.4
Validation accuracy: 76.8
Loss at step 600: 0.3658328652381897
Training accuracy: 99.4
Validation accuracy: 76.9
Loss at step 700: 0.32476598024368286
Training accuracy: 99.4
Validation accuracy: 76.6
Loss at step 800: 0.3097442388534546
Training accuracy: 99.4
Validation accuracy: 76.8
Loss at step 900: 0.30419057607650757
Training accuracy: 99.4
Validation accuracy: 76.7
Loss at step 1000: 0.3020966947078705
Training accuracy: 99.4
Validation accuracy: 76.7
Loss at step 1100: 0.30127716064

Extreme overfitting: training accuracy reaches 99.4% while validation accuracy plateaus around 77%, so the model has memorized the 500 training examples rather than learned to generalize.

+ 3. Introducing dropout

Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

The logic behind dropout: during training we randomly deactivate a fraction of the neurons at each step, so every step effectively trains a different "thinned" sub-network sampled from the full one. At evaluation time all neurons are active, which approximates averaging the predictions of those sub-networks. This averaging reduces the error related to overfitting (i.e., the model generalizes better to the test data).
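
The masking and rescaling involved can be sketched in NumPy. This is an illustrative stand-in, not TensorFlow's code: the function `dropout` below is written for this example and follows the "keep with probability `keep_prob`, scale survivors by `1 / keep_prob`" behaviour of `nn.dropout`; the all-ones activations are a toy input.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    # Keep each unit independently with probability keep_prob, zero the rest,
    # and rescale survivors by 1 / keep_prob so the expected value is unchanged.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 100))              # toy hidden-layer activations

train_out = dropout(hidden, keep_prob=0.5, rng=rng)   # training: half the units dropped
eval_out = dropout(hidden, keep_prob=1.0, rng=rng)    # evaluation: nothing dropped

print(np.unique(train_out))    # surviving units are doubled, dropped ones are 0
print(train_out.mean())        # close to the original mean of 1.0
```

Because of the rescaling, the layer that follows sees activations of the same average magnitude in training and evaluation, which is why the validation and test paths can simply skip the dropout op.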

In [49]:
# specify the model -- a fully connected graph + a hidden ReLU layer (with dropout)
num_nodes= 1024
batch_size = 128

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables.
    weights_1 = tf.Variable(tf.truncated_normal([image_size * image_size, num_nodes]))
    biases_1 = tf.Variable(tf.zeros([num_nodes]))
    weights_2 = tf.Variable(tf.truncated_normal([num_nodes, num_labels]))
    biases_2 = tf.Variable(tf.zeros([num_labels]))
    
    # Training computation.
    logits_1 = tf.matmul(tf_train_dataset, weights_1) + biases_1
    relu_layer= tf.nn.relu(logits_1)
    # Dropout on the hidden ReLU layer; keep_prob is fed at run time
    keep_prob = tf.placeholder(tf.float32)
    relu_layer_dropout = tf.nn.dropout(relu_layer, keep_prob)
    
    logits_2 = tf.matmul(relu_layer_dropout, weights_2) + biases_2
    # Normal loss function
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits_2, labels = tf_train_labels))
    # Loss function with L2 regularization (beta = 0.01, defined in an earlier cell).
    # The loss is already a scalar, so no further reduce_mean is needed.
    regularizers = tf.nn.l2_loss(weights_1) + tf.nn.l2_loss(weights_2)
    loss = loss + beta * regularizers

    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training
    train_prediction = tf.nn.softmax(logits_2)
    
    # Predictions for validation 
    logits_1 = tf.matmul(tf_valid_dataset, weights_1) + biases_1
    relu_layer= tf.nn.relu(logits_1)
    logits_2 = tf.matmul(relu_layer, weights_2) + biases_2
    
    valid_prediction = tf.nn.softmax(logits_2)
    
    # Predictions for test
    logits_1 = tf.matmul(tf_test_dataset, weights_1) + biases_1
    relu_layer= tf.nn.relu(logits_1)
    logits_2 = tf.matmul(relu_layer, weights_2) + biases_2
    
    test_prediction =  tf.nn.softmax(logits_2)

In [50]:
num_steps = 3001

with tf.Session(graph=graph) as session:
    # initialize_all_variables() is deprecated; use the same call as the other cells
    session.run(tf.global_variables_initializer())
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob : 0.5}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step {}: {}".format(step, l))
            print("Minibatch accuracy: {:.1f}".format(accuracy(predictions, batch_labels)))
            print("Validation accuracy: {:.1f}".format(accuracy(valid_prediction.eval(), valid_labels)))
    print("Test accuracy: {:.1f}".format(accuracy(test_prediction.eval(), test_labels)))

Initialized
Minibatch loss at step 0: 3674.429443359375
Minibatch accuracy: 11.7
Validation accuracy: 30.6
Minibatch loss at step 500: 21.52471923828125
Minibatch accuracy: 83.6
Validation accuracy: 84.2
Minibatch loss at step 1000: 1.0514339208602905
Minibatch accuracy: 82.8
Validation accuracy: 83.8
Minibatch loss at step 1500: 0.9212013483047485
Minibatch accuracy: 77.3
Validation accuracy: 83.5
Minibatch loss at step 2000: 0.7128937840461731
Minibatch accuracy: 86.7
Validation accuracy: 83.5
Minibatch loss at step 2500: 0.8624097108840942
Minibatch accuracy: 79.7
Validation accuracy: 83.2
Minibatch loss at step 3000: 0.8109461069107056
Minibatch accuracy: 82.8
Validation accuracy: 83.2
Test accuracy: 89.6


More info:
+ Illustrating the difference between L1 and L2 regularization:
https://towardsdatascience.com/regularization-the-path-to-bias-variance-trade-off-b7a7088b4577

+ Demonstration of regularization with TensorFlow & an example for multi-layer NN:
https://www.ritchieng.com/machine-learning/deep-learning/tensorflow/regularization/

+ A code-efficient way to build a more complex multi-layer NN -- using [Keras](https://www.tensorflow.org/beta/guide/keras/overview)

+ Hinton's original paper on the logic behind using random "dropout" as a technique to overcome overfitting:
https://arxiv.org/pdf/1207.0580.pdf


+ Comparing different learning rates with different optimizers:
https://medium.com/octavian-ai/which-optimizer-and-learning-rate-should-i-use-for-deep-learning-5acb418f9b2
