## Table of Content: <a class="anchor" id="top"/>
* [Problem 1 - Regularization](#problem-1)
* [Problem 2 - Overtraining](#problem-2)
* [Problem 3 - Dropout](#problem-3)
* [Problem 4 - Best Accuracy](#problem-4)


Deep Learning
=============

Assignment 3
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

First reload the data we generated in `1_notmnist.ipynb`.

In [3]:
# file for home computer
# pickle_file = '/Users/gundeep/Documents/Notebooks/deep-learning/notMNIST.pickle'

# file for office computer
pickle_file = '/Users/gundeepsingh/Documents/Notebook/notMNIST.pickle'

# file for paperspace
pickle_file = '/home/paperspace/Notebook/notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [4]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

print(train_labels[0])

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)
[ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]


In [5]:
def accuracy(predictions, labels):
    '''print('calculating accuracy...prediction 0 - ' + str(predictions[0]) + ', label 0 = ' + str(labels[0]))
    print('calculating accuracy...prediction 1 - ' + str(predictions[1]) + ', label 1 = ' + str(labels[1]))
    print('calculating accuracy...prediction 2 - ' + str(predictions[2]) + ', label 2 = ' + str(labels[2]))
    print('calculating accuracy...prediction 3 - ' + str(predictions[3]) + ', label 3 = ' + str(labels[3]))'''
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

---
Problem 1 <a class="anchor" id="problem-1"></a>
---

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.

---

## 01. Stochastic Descent Optimizer + relu + l2 regularization

### Steps:
1. Load up the __constants, placeholders and variables(weights, biases)__ with graph = tf.Graph() and graph.as_default():
2. __Calculate cross entropy and loss__ and __minimize the loss__ using optimizer
2. __Define the optimizer__ using optimizer like tf.train.GradientDescentOptimizer with some learning rate and minimize the loss
4. After every __session.run__, the variable values will change and we can print the accuracy after every few 100 runs

### Load data:

In [19]:
batch_size = 128
image_size = 28
num_nodes = 1024
# num_nodes = 1280
num_labels = 10
learning_rate = 0.5
lambda_parameter = 0.003

graph = tf.Graph()
with graph.as_default():
    # input the data as placeholders and constants
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size*image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size,num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # initialize the variables like weights and biases
    weights1 = tf.Variable(tf.truncated_normal([image_size*image_size, num_nodes]))
    biases1 = tf.Variable(tf.zeros([num_nodes]))
    
    weights2 = tf.Variable(tf.truncated_normal([num_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))
    
    relufied_weights = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    
    # calculate the logits, cross_entropy and loss
    # and then minimize the loss
    logits = tf.matmul(relufied_weights, weights2) + biases2
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits)
    loss = tf.reduce_mean(cross_entropy)
    
    # regularization
    loss = loss + 0.5 * lambda_parameter * (tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2))
    
    # minimize the loss using an optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    # calculate the predictions
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1), weights2) + biases2)
    test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1), weights2) + biases2)

Now run the tf session and train the network
### Run session:

In [20]:
num_steps = (train_dataset.shape[0]/batch_size) * 2

with tf.Session(graph=graph) as session:
    # initialize global variables
    tf.global_variables_initializer().run()
    print('Initialized.')
    # run the session with the optimizer, prediction and loss
    # print the loss and accuracy every 100 steps
    for step in range(int(num_steps)):
        # 01. calculate the offset for the current batch of the training data
        offset = (batch_size * step) % (train_dataset.shape[0] - batch_size)
        
        # 02. generate a mini batch
        batch_data = train_dataset[offset:offset+batch_size, :]
        batch_labels = train_labels[offset:offset+batch_size, :]
        
        # 03. create a feed dict for the placeholders
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        
        # 04. run the optimizer
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict = feed_dict )
        
        if(step % 500 == 0):
            print('loss at step %d = %f' % (step, l))
            print('train accuracy at step %d = %f' % (step, accuracy(predictions, batch_labels)))
            print('test accuracy at step %d = %f' % (step, accuracy(
                valid_prediction.eval(), valid_labels)))
    print('test accuracy = %f' % accuracy(test_prediction.eval(), test_labels))

Initialized.
loss at step 0 = 736.515625
train accuracy at step 0 = 14.843750
test accuracy at step 0 = 18.760000
loss at step 500 = 225.902008
train accuracy at step 500 = 76.562500
test accuracy at step 500 = 78.460000
loss at step 1000 = 106.617416
train accuracy at step 1000 = 83.593750
test accuracy at step 1000 = 82.670000
loss at step 1500 = 48.897873
train accuracy at step 1500 = 85.156250
test accuracy at step 1500 = 83.140000
loss at step 2000 = 23.003733
train accuracy at step 2000 = 88.281250
test accuracy at step 2000 = 86.110000
loss at step 2500 = 11.181416
train accuracy at step 2500 = 84.375000
test accuracy at step 2500 = 86.830000
loss at step 3000 = 5.481342
train accuracy at step 3000 = 89.062500
test accuracy at step 3000 = 86.810000
test accuracy = 93.180000


---

## 02. L2 Regularized Stochastic GD Network

### Steps:
1. Load data just like before
2. Calculate the new loss as $\mathcal{L}' = \mathcal{L} + \beta \frac{1}{2} ||w||^2_{2}$
3. Use optimizer to train the network
### Load data:

In [21]:
batch_size = 256
image_size = 28
# num_nodes = 1024
num_labels = 10
learning_rate = 0.6
lambda_parameter = 0.004

graph = tf.Graph()
with graph.as_default():
    # input the data as placeholders and constants
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size*image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size,num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    # beta = tf.constant()
    
    # initialize the variables like weights and biases
    weights = tf.Variable(tf.truncated_normal([image_size*image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
    
    # calculate the logits, cross_entropy and loss
    # and then minimize the loss
    logits = tf.matmul(tf_train_dataset, weights) + biases
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits)
    old_loss = tf.reduce_mean(cross_entropy)
    loss = old_loss + lambda_parameter/2 * tf.nn.l2_loss(weights)
    
    # minimize the loss using an optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    # calculate the predictions
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset,weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

Now run the tf session and train the network
### Run session:

In [22]:
num_steps = (train_dataset.shape[0]/batch_size) * 2 + 15

with tf.Session(graph=graph) as session:
    # initialize global variables
    tf.global_variables_initializer().run()
    print('Initialized.')
    # run the session with the optimizer, prediction and loss
    # print the loss and accuracy every 100 steps
    for step in range(int(num_steps)):
        # 01. calculate the offset for the current batch of the training data
        offset = (batch_size * step) % (train_dataset.shape[0] - batch_size)
        
        # 02. generate a mini batch
        batch_data = train_dataset[offset:offset+batch_size, :]
        batch_labels = train_labels[offset:offset+batch_size, :]
        
        # 03. create a feed dict for the placeholders
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        
        # 04. run the optimizer
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict = feed_dict )
        
        if(step % 500 == 0 or step == int(num_steps)-1):
            print('loss at step %d = %f' % (step, l))
            print('train accuracy at step %d = %f' % (step, accuracy(predictions, batch_labels)))
            print('test accuracy at step %d = %f' % (step, accuracy(
                valid_prediction.eval(), valid_labels)))
    print('test accuracy = %f' % accuracy(test_prediction.eval(), test_labels))
    # print('test accuracy = %f' % accuracy(train_prediction.eval(), train_labels))

Initialized.
loss at step 0 = 21.187765
train accuracy at step 0 = 8.203125
test accuracy at step 0 = 11.720000
loss at step 500 = 2.352277
train accuracy at step 500 = 81.250000
test accuracy at step 500 = 77.840000
loss at step 1000 = 1.075831
train accuracy at step 1000 = 82.031250
test accuracy at step 1000 = 80.570000
loss at step 1500 = 0.855944
train accuracy at step 1500 = 82.031250
test accuracy at step 1500 = 81.720000
loss at step 1576 = 0.975441
train accuracy at step 1576 = 75.000000
test accuracy at step 1576 = 81.910000
test accuracy = 88.970000


---
## Problem 2 <a class="anchor" id="problem-2"></a>

Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

Choses á faire pour l'accaparement:
1. Remove the regularization (probably anti regularize)
2. Reduce the training dataset
3. Increase the learning rate maybe

---

### Load Data (Overfitting example)

In [212]:
batch_size = 1
image_size = 28
# num_nodes = 1024
num_labels = 10
learning_rate = 0.5
lambda_parameter = 0.0045

graph = tf.Graph()
with graph.as_default():
    # input the data as placeholders and constants
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size*image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size,num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    # beta = tf.constant()
    
    # initialize the variables like weights and biases
    weights = tf.Variable(tf.truncated_normal([image_size*image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
    
    # calculate the logits, cross_entropy and loss
    # and then minimize the loss
    logits = tf.matmul(tf_train_dataset, weights) + biases
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits)
    loss = tf.reduce_mean(cross_entropy)
    loss = loss + lambda_parameter/2 * tf.nn.l2_loss(weights)
    
    # minimize the loss using an optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    # calculate the predictions
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset,weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

### Run the session (overfitting)

In [9]:
# reduce the dataset size to half
# dataset_size = train_dataset.shape[0]
dataset_size = 257
num_steps = (dataset_size/batch_size) * 4  + 15

with tf.Session(graph=graph) as session:
    # initialize global variables
    tf.global_variables_initializer().run()
    print('Initialized.')
    # run the session with the optimizer, prediction and loss
    # print the loss and accuracy every 100 steps
    for step in range(int(num_steps)):
        # 01. calculate the offset for the current batch of the training data
        offset = (batch_size * step) % (dataset_size - batch_size)
        
        # 02. generate a mini batch
        batch_data = train_dataset[offset:offset+batch_size, :]
        batch_labels = train_labels[offset:offset+batch_size, :]
        
        # 03. create a feed dict for the placeholders
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        
        # 04. run the optimizer
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict = feed_dict )
        
        if(step % 500 == 0 or step == int(num_steps)-1):
            print('loss at step %d = %f' % (step, l))
            print('train accuracy at step %d = %f' % (step, accuracy(predictions, batch_labels)))
            print('test accuracy at step %d = %f' % (step, accuracy(
                valid_prediction.eval(), valid_labels)))
    print('num_steps = ' + str(num_steps))
    print('batch_size = ' + str(batch_size))
    print('dataset_size = ' + str(dataset_size))
    print('test accuracy = %f' % accuracy(test_prediction.eval(), test_labels))
    # print('test accuracy = %f' % accuracy(train_prediction.eval(), train_labels))

Initialized.
loss at step 0 = 1438.563965
train accuracy at step 0 = 0.000000
test accuracy at step 0 = 10.000000
loss at step 500 = nan
train accuracy at step 500 = 0.000000
test accuracy at step 500 = 10.000000
loss at step 1000 = nan
train accuracy at step 1000 = 0.000000
test accuracy at step 1000 = 10.000000
loss at step 1042 = nan
train accuracy at step 1042 = 0.000000
test accuracy at step 1042 = 10.000000
num_steps = 1043.0
batch_size = 1
dataset_size = 257
test accuracy = 10.000000


---
## Problem 3 <a class="anchor" id="problem-3"></a>

Dropout & Regularization on NN with 1 hidden layer
---

Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

---

In [24]:
# """
# added dropout on relu output
batch_size = 128
image_size = 28
num_nodes = 2048
# num_nodes = 1280
num_labels = 10
learning_rate = 0.5
lambda_parameter = 0.003
keep_probability = 0.9
# """

# params for overfitting
'''
batch_size = 10
image_size = 28
num_nodes = 1024
num_labels = 10
learning_rate = 0.5
lambda_parameter = 0.0045
keep_probability = 0.9
'''

graph = tf.Graph()
with graph.as_default():
    # input the data as placeholders and constants
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size*image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size,num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # initialize the variables like weights and biases
    weights1 = tf.Variable(tf.truncated_normal([image_size*image_size, num_nodes]))
    biases1 = tf.Variable(tf.zeros([num_nodes]))
    
    weights2 = tf.Variable(tf.truncated_normal([num_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))
    
    relufied_weights = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    
    # introduce dropout here
    dropped_output = tf.nn.dropout(relufied_weights, keep_prob=keep_probability)
    
    # calculate the logits, cross_entropy and loss
    # and then minimize the loss
    logits = tf.matmul(dropped_output, weights2) + biases2
    # logits = tf.matmul(relufied_weights, weights2) + biases2
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits)
    loss = tf.reduce_mean(cross_entropy)
    
    # regularization
    loss = loss + 0.5 * lambda_parameter * (tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2))
    
    # minimize the loss using an optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    # calculate the predictions
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1), weights2) + biases2)
    test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1), weights2) + biases2)

Now run the tf session and train the network
### Run session:

In [25]:
num_steps = (train_dataset.shape[0]/batch_size) * 2

# overfitting params
# dataset_size = 2570
# num_steps = (dataset_size/batch_size) * 10  + 15

with tf.Session(graph=graph) as session:
    # initialize global variables
    tf.global_variables_initializer().run()
    print('Initialized.')
    # run the session with the optimizer, prediction and loss
    # print the loss and accuracy every 100 steps
    for step in range(int(num_steps)):
        # 01. calculate the offset for the current batch of the training data
        offset = (batch_size * step) % (train_dataset.shape[0] - batch_size)
        
        # 02. generate a mini batch
        batch_data = train_dataset[offset:offset+batch_size, :]
        batch_labels = train_labels[offset:offset+batch_size, :]
        
        # 03. create a feed dict for the placeholders
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        
        # 04. run the optimizer
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict = feed_dict )
        
        if(step % 500 == 0):
            print('loss at step %d = %f' % (step, l))
            print('train accuracy at step %d = %f' % (step, accuracy(predictions, batch_labels)))
            print('test accuracy at step %d = %f' % (step, accuracy(
                valid_prediction.eval(), valid_labels)))
    print('test accuracy = %f' % accuracy(test_prediction.eval(), test_labels))

Initialized.
loss at step 0 = 1456.982422
train accuracy at step 0 = 10.156250
test accuracy at step 0 = 39.540000
loss at step 500 = 455.016602
train accuracy at step 500 = 77.343750
test accuracy at step 500 = 82.150000
loss at step 1000 = 210.602020
train accuracy at step 1000 = 81.250000
test accuracy at step 1000 = 82.530000
loss at step 1500 = 98.541847
train accuracy at step 1500 = 82.031250
test accuracy at step 1500 = 83.040000
loss at step 2000 = 45.807381
train accuracy at step 2000 = 92.187500
test accuracy at step 2000 = 86.240000
loss at step 2500 = 21.967333
train accuracy at step 2500 = 85.937500
test accuracy at step 2500 = 86.980000
loss at step 3000 = 10.574685
train accuracy at step 3000 = 91.406250
test accuracy at step 3000 = 86.970000
test accuracy = 93.480000


### Observations
#### Params:
batch_size = 128
image_size = 28
num_nodes = 1024
num_labels = 10
learning_rate = 0.5
lambda_parameter = 0.003
keep_probability = 0.9


---

#### test accuracy = 93.21, train accuracy = 87.5 with params: (no dropout)

---

#### test accuracy = 93.22, train accuracy = 85.15 with params: (with dropout) and same params

---

when nodes were increased to 2048, then
#### test accuracy = 93.24, train accuracy = 86.71


---
## Problem 4 <a class="anchor" id="problem-4"/>

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
 
 ---


### Trying to get the best accuracy

Things to try:
1. adding more hidden layers
2.

#### 01. Add 1 more layer


In [6]:
# """
# added dropout on relu output
batch_size = 128
image_size = 28
num_nodes = 2048
num_nodes = 1024
num_nodes2 = 500
num_labels = 10
learning_rate = 0.5
lambda_parameter = 0.0025
lambda_parameter2 = 5e-6
keep_probability = 0.9
use_layer_3 = True
use_stddev = False
base_learning_rate = 0.4
decay_rate = 0.65
decay_steps = 500
train_size = train_dataset.shape[0]
num_steps = 5001
# """

graph = tf.Graph()
with graph.as_default():
    # input the data as placeholders and constants
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size*image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size,num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    if use_stddev:
        std1 = (2/(image_size*image_size))**.5
        std2 = (2/num_nodes)**.5
        std3 = (2/num_nodes2)**.5
    else:
        std1 = 0.1
        std2 = 0.1
        std3 = 0.1
    
    # initialize the variables like weights and biases
    weights1 = tf.Variable(tf.truncated_normal([image_size*image_size, num_nodes], stddev=std1))
    biases1 = tf.Variable(tf.zeros([num_nodes]))
    
    if use_layer_3:
        weights2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes2], stddev=std2))
        biases2 = tf.Variable(tf.zeros([num_nodes2]))
    else:
        weights2 = tf.Variable(tf.truncated_normal([num_nodes, num_labels], stddev=std3))
        biases2 = tf.Variable(tf.zeros([num_labels]))
        
    print('weights2.shape = ' + str(weights2.shape))
    
    if use_layer_3:
        weights3 = tf.Variable(tf.truncated_normal([num_nodes2, num_labels]))
        biases3 = tf.Variable(tf.zeros([num_labels]))
    
    layer_1_logits = tf.matmul(tf_train_dataset, weights1) + biases1
    layer_1_relufied = tf.nn.relu(layer_1_logits)
    # logits = relufied_weights
    
    # introduce dropout here
    # dropped_output = tf.nn.dropout(relufied_weights, keep_prob=keep_probability)
    
    # calculate the logits, cross_entropy and loss
    # and then minimize the loss
    # logits = tf.matmul(dropped_output, weights2) + biases2
    # logits = tf.matmul(tf.nn.relu(logits1), weights3) + biases3
    layer_2_logits = tf.matmul(layer_1_relufied, weights2) + biases2
    layer_2_relufied = tf.nn.relu(layer_2_logits)
    if use_layer_3:
        logits = tf.matmul(layer_2_relufied, weights3) + biases3
    else:
        logits = layer_2_logits
    
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits)
    loss = tf.reduce_mean(cross_entropy)
    
    # regularization
    loss = loss + 0.5 * lambda_parameter * tf.nn.l2_loss(weights1) + 0.5 * lambda_parameter2 * tf.nn.l2_loss(weights2)
    if use_layer_3:
        loss = loss + 0.5 * lambda_parameter2 * tf.nn.l2_loss(weights3)
        
    # minimize the loss using an optimizer
    global_step = tf.Variable(0)  # count the number of steps taken...update
    
    # Optimizer: set up a variable that's incremented once per batch and
    # controls the learning rate decay.
    batch = tf.Variable(0, dtype=tf.float32)
    
    learning_rate = tf.train.exponential_decay(
      base_learning_rate,                # Base learning rate.
      batch * batch_size,  # Current index into the dataset.
      decay_steps,          # Decay steps.
      decay_rate,                # Decay rate.
      staircase=True)
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    # optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    # minimize the loss using an optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    # calculate the predictions
    train_prediction = tf.nn.softmax(logits)
    valid_pre = tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1), weights2) + biases2
    test_pre = tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1), weights2) + biases2
    if not use_layer_3:
        valid_prediction = tf.nn.softmax(valid_pre)
        test_prediction = tf.nn.softmax(test_pre)
    else:
        valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(valid_pre), weights3) + biases3)
        test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(test_pre), weights3) + biases3)

weights2.shape = (1024, 500)


In [7]:
with tf.Session(graph=graph) as session:
    # initialize global variables
    tf.global_variables_initializer().run()
    print('Initialized.')
    # run the session with the optimizer, prediction and loss
    # print the loss and accuracy every 100 steps
    for step in range(int(num_steps)):
        # 01. calculate the offset for the current batch of the training data
        offset = (batch_size * step) % (train_dataset.shape[0] - batch_size)
        
        # 02. generate a mini batch
        batch_data = train_dataset[offset:offset+batch_size, :]
        batch_labels = train_labels[offset:offset+batch_size, :]
        
        # 03. create a feed dict for the placeholders
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        
        # 04. run the optimizer
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict = feed_dict )
        
        batch = batch + 1
        
        if(step % 500 == 0):
            print('loss at step %d = %f' % (step, l))
            print('train accuracy at step %d = %f' % (step, accuracy(predictions, batch_labels)))
            print('test accuracy at step %d = %f' % (step, accuracy(
                valid_prediction.eval(), valid_labels)))
    print('test accuracy = %f' % accuracy(test_prediction.eval(), test_labels))

Initialized.


ResourceExhaustedError: OOM when allocating tensor with shape[784,1024]
	 [[Node: gradients_1/L2Loss_grad/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable/read, gradients_1/mul_grad/tuple/control_dependency_1)]]
	 [[Node: Softmax/_21 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2_Softmax", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op u'gradients_1/L2Loss_grad/mul', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 887, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2717, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2821, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-5893e6784616>", line 100, in <module>
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 315, in minimize
    grad_loss=grad_loss)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 386, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 560, in gradients
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 368, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 560, in <lambda>
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 556, in _L2LossGrad
    return op.inputs[0] * grad
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 821, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1044, in _mul_dispatch
    return gen_math_ops._mul(x, y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1434, in _mul
    result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'L2Loss', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
[elided 18 identical lines from previous traceback]
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-5893e6784616>", line 78, in <module>
    loss = loss + 0.5 * lambda_parameter * tf.nn.l2_loss(weights1) + 0.5 * lambda_parameter2 * tf.nn.l2_loss(weights2)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1511, in l2_loss
    result = _op_def_lib.apply_op("L2Loss", t=t, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[784,1024]
	 [[Node: gradients_1/L2Loss_grad/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable/read, gradients_1/mul_grad/tuple/control_dependency_1)]]
	 [[Node: Softmax/_21 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2_Softmax", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]


# Try the following later

In [38]:
# """
# added dropout on relu output
batch_size = 128
image_size = 28
num_nodes = 1024
num_nodes2 = 500
# num_nodes = 1280
num_labels = 10
learning_rate = 0.5
lambda_parameter = 0.003
keep_probability = 0.9
# """

graph = tf.Graph()
with graph.as_default():
    # input the data as placeholders and constants
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size*image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size,num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # initialize the variables like weights and biases
    weights1 = tf.Variable(tf.truncated_normal([image_size*image_size, num_nodes], stddev=(2/(image_size*image_size))**.5))
    biases1 = tf.Variable(tf.zeros([num_nodes]))
    
    weights2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes2], stddev=(2/(num_nodes))**.5))
    biases2 = tf.Variable(tf.zeros([num_nodes2]))
    
    weights3 = tf.Variable(tf.truncated_normal([num_nodes2, num_labels], stddev=(2/(num_nodes2))**.5))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    relufied_weights = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    
    # introduce dropout here
    # dropped_output = tf.nn.dropout(relufied_weights, keep_prob=keep_probability)
    
    # calculate the logits, cross_entropy and loss
    # and then minimize the loss
    
    # find logits from layer 1 & 2
    pre_logits = tf.matmul(relufied_weights, weights2) + biases2
    
    # insert a relu to the results
    relu_logits = tf.nn.relu(pre_logits)
    
    # update logits from previous result and layer 3
    logits = tf.matmul(relufied_weights, weights2) + biases2
    # logits = tf.matmul(relu_logits, weights3) + biases3
    
    # logits = tf.matmul(relufied_weights, weights2) + biases2
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits)
    loss = tf.reduce_mean(cross_entropy)
    
    # regularization
    loss = loss + 0.5 * lambda_parameter * (tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2))
    
    # minimize the loss using an optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    # calculate the predictions
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1), weights2) + biases2), weights3) + biases3) 
    test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1), weights2) + biases2), weights3) + biases3)

In [53]:
num_steps = (train_dataset.shape[0]/batch_size) * 2

# overfitting params
# dataset_size = 2570
# num_steps = (dataset_size/batch_size) * 10  + 15

with tf.Session(graph=graph) as session:
    # initialize global variables
    tf.global_variables_initializer().run()
    print('Initialized.')
    # run the session with the optimizer, prediction and loss
    # print the loss and accuracy every 100 steps
    for step in range(int(num_steps)):
        # 01. calculate the offset for the current batch of the training data
        offset = (batch_size * step) % (train_dataset.shape[0] - batch_size)
        
        # 02. generate a mini batch
        batch_data = train_dataset[offset:offset+batch_size, :]
        batch_labels = train_labels[offset:offset+batch_size, :]
        
        # 03. create a feed dict for the placeholders
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        
        # 04. run the optimizer
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict = feed_dict )
        
        if(step % 500 == 0):
            print('loss at step %d = %f' % (step, l))
            print('train accuracy at step %d = %f' % (step, accuracy(predictions, batch_labels)))
            print('test accuracy at step %d = %f' % (step, accuracy(
                valid_prediction.eval(), valid_labels)))
    print('test accuracy = %f' % accuracy(test_prediction.eval(), test_labels))

Initialized.
loss at step 0 = 862.584412
train accuracy at step 0 = 7.812500
test accuracy at step 0 = 32.920000
loss at step 500 = 229.147766
train accuracy at step 500 = 77.343750
test accuracy at step 500 = 79.250000
loss at step 1000 = 105.120415
train accuracy at step 1000 = 78.125000
test accuracy at step 1000 = 79.940000
loss at step 1500 = 49.085506
train accuracy at step 1500 = 81.250000
test accuracy at step 1500 = 82.950000
loss at step 2000 = 22.985241
train accuracy at step 2000 = 88.281250
test accuracy at step 2000 = 86.150000
loss at step 2500 = 11.220746
train accuracy at step 2500 = 82.812500
test accuracy at step 2500 = 86.750000
loss at step 3000 = 5.443682
train accuracy at step 3000 = 92.187500
test accuracy at step 3000 = 86.760000
test accuracy = 93.260000


#### 02. learning rate decay

```
global_step = tf.Variable(0)  # count the number of steps taken.
learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
```

In [17]:
# """
# added dropout on relu output
batch_size = 128
image_size = 28
num_nodes = 1024
num_nodes2 = 500
num_nodes3 = 300
# num_nodes = 1280
num_labels = 10
base_learning_rate = 0.4
lambda_parameter = 5e-4#0.0015
lambda_parameter2 = 1e-5
decay_rate = 0.65
keep_probability = 0.9
keep_probability2 = 0.6
train_size = train_dataset.shape[0]
num_steps = 3001#(train_size/batch_size) * 4
# """

graph = tf.Graph()
with graph.as_default():
    # input the data as placeholders and constants
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size*image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size,num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # initialize the variables like weights and biases
    weights1 = tf.Variable(tf.truncated_normal([image_size*image_size, num_nodes],stddev=(2/(image_size*image_size))**.5))
    biases1 = tf.Variable(tf.zeros([num_nodes]))
    weights2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes2],stddev=(2/num_nodes)**.5))
    biases2 = tf.Variable(tf.zeros([num_nodes2]))
    # weights2 = tf.Variable(tf.truncated_normal([num_nodes, num_labels])) -- old weights
    # biases2 = tf.Variable(tf.zeros([num_labels])) -- old biases
    weights3 = tf.Variable(tf.truncated_normal([num_nodes2, num_nodes3], stddev=(2/num_nodes2)**.5))
    biases3 = tf.Variable(tf.zeros([num_nodes3]))
    weights4 = tf.Variable(tf.truncated_normal([num_nodes3, num_labels], stddev=(2/num_nodes3)**.5))
    biases4 = tf.Variable(tf.zeros([num_labels]))
    
    # ----- layer 1 start (relu & dropout)-----
    relufied_weights = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    result_layer_1 = tf.nn.dropout(relufied_weights, keep_prob=keep_probability)
    
    # ----- layer 2 start (relu & dropout)-----
    
    prelogits1 = tf.matmul(result_layer_1, weights2) + biases2
    relufied_weights2 = tf.nn.relu(prelogits1)
    result_layer_2 = tf.nn.dropout(relufied_weights2, keep_prob=keep_probability2)
    
    # ----- layer 3 start (relu) -----
    prelogits2 = tf.matmul(relufied_weights2, weights3) + biases3
    relufied_weights3 = tf.nn.relu(prelogits2)
    
    

    # calculate the logits, cross_entropy and loss
    # and then minimize the loss
    logits = tf.matmul(relufied_weights3, weights4) + biases4
    # logits = tf.matmul(relufied_weights, weights2) + biases2 -- old logits
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits)
    loss = tf.reduce_mean(cross_entropy)
    
    # regularization
    loss = loss + lambda_parameter2 * (tf.nn.l2_loss(weights1)) + lambda_parameter2 * (tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4))
    
    # minimize the loss using an optimizer
    global_step = tf.Variable(0)  # count the number of steps taken...update
    
    # Optimizer: set up a variable that's incremented once per batch and
    # controls the learning rate decay.
    batch = tf.Variable(0, dtype=tf.float32)
    
    learning_rate = tf.train.exponential_decay(
      base_learning_rate,                # Base learning rate.
      batch * batch_size,  # Current index into the dataset.
      train_size/100,          # Decay steps.
      decay_rate,                # Decay rate.
      staircase=True)
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    # optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    
    # calculate the predictions
    train_prediction = tf.nn.softmax(logits)
    layer_2_valid_prediction = tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1), weights2) + biases2
    layer_2_test_prediction = tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1), weights2) + biases2
    layer_3_valid_prediction = tf.matmul(tf.nn.relu(layer_2_valid_prediction),weights3) + biases3
    layer_3_test_prediction = tf.matmul(tf.nn.relu(layer_2_test_prediction), weights3) + biases3
    valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(layer_3_valid_prediction), weights4) + biases4)
    test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(layer_3_test_prediction), weights4) + biases4)

In [18]:


# overfitting params
# dataset_size = 2570
# num_steps = (dataset_size/batch_size) * 10  + 15

with tf.Session(graph=graph) as session:
    # initialize global variables
    tf.global_variables_initializer().run()
    print('Initialized.')
    # run the session with the optimizer, prediction and loss
    # print the loss and accuracy every 100 steps
    for step in range(int(num_steps)):
        # 01. calculate the offset for the current batch of the training data
        offset = (batch_size * step) % (train_dataset.shape[0] - batch_size)
        
        # 02. generate a mini batch
        batch_data = train_dataset[offset:offset+batch_size, :]
        batch_labels = train_labels[offset:offset+batch_size, :]
        
        # 03. create a feed dict for the placeholders
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        
        # 04. run the optimizer
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict = feed_dict )
        
        batch = batch + 1
        
        if(step % 500 == 0):
            print('loss at step %d = %f' % (step, l))
            print('train accuracy at step %d = %f' % (step, accuracy(predictions, batch_labels)))
            print('test accuracy at step %d = %f' % (step, accuracy(
                valid_prediction.eval(), valid_labels)))
    print('test accuracy = %f' % accuracy(test_prediction.eval(), test_labels))

Initialized.
loss at step 0 = 2.302585
train accuracy at step 0 = 7.031250
test accuracy at step 0 = 10.000000
loss at step 500 = 2.311472
train accuracy at step 500 = 9.375000
test accuracy at step 500 = 10.000000
loss at step 1000 = 2.302636
train accuracy at step 1000 = 10.937500
test accuracy at step 1000 = 10.000000
loss at step 1500 = 2.305499
train accuracy at step 1500 = 11.718750
test accuracy at step 1500 = 10.000000
loss at step 2000 = 2.306109
train accuracy at step 2000 = 6.250000
test accuracy at step 2000 = 10.000000
loss at step 2500 = 2.304066
train accuracy at step 2500 = 10.156250
test accuracy at step 2500 = 10.000000
loss at step 3000 = 2.300430
train accuracy at step 3000 = 10.156250
test accuracy at step 3000 = 10.000000
test accuracy = 10.000000


### Results
```
batch_size = 128
image_size = 28
num_nodes = 2048
num_nodes = 1280
num_labels = 10
base_learning_rate = 0.5
lambda_parameter = 0.003
keep_probability = 0.9
train_size = train_dataset.shape[0]
num_steps = (train_size/batch_size) * 2
```

test accuracy | train accuracy | change
---|---|---
93.32 | 86.71 | no change
93.03 | 85.93 | base_learning_rate = 0.45, lambda_parameter = 0.005
93.39 | 86.75 | learning decay rate = .90
93.34 | 86.71 | weights2 initialized with sqrt(2/num_nodes)
93.78 | 88.28 | added a second hidden layer with 500 nodes
94.26 | 88.66 | base_learning_rate = 0.4, lambda_parameter2 = 1e-5
94.48 | 86.71 | number of steps = 4000
94.77 | 89.06 | base_learning_rate = 0.4, lambda_parameter = 5e-4, decay_rate = 0.65, num_steps = 3001

* [Table of Contents](#top)