<h1>Digit Recognizer Kaggle Competition Part 2</h1>
<h2>Neural Nets with TensorFlow & Keras</h2>
<h3>Bryan Bruno</h3>

<h3>Building Environment</h3>

In [1]:
import tensorflow as tf

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [3]:
import keras
from keras.models import Model
from keras.layers import *
from keras import optimizers

Using TensorFlow backend.


In [4]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [5]:
train_ft = train.iloc[:, 1:785]
train_lbl = train.iloc[:, 0]
X_test = test.iloc[:, 0:784]

In [6]:
# splitting the labeled data into training and validation sets (default 75/25 split)

X_train, X_val, y_train, y_val = train_test_split(train_ft, train_lbl, random_state = 12)

In [7]:
# converting the split DataFrames to NumPy arrays
# (.values replaces as_matrix(), which was deprecated and later removed from pandas)

X_train = X_train.values.reshape(-1, 784) # 31,500 rows (.75)
X_val = X_val.values.reshape(-1, 784) # 10,500 rows (.25)

X_test = X_test.values.reshape(-1, 784) # 28,000 rows
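
The row counts used in this cell (31,500 and 10,500) follow from a 25% hold-out on 42,000 labeled images. A minimal sketch of that arithmetic, assuming the hold-out size is rounded up and the remainder goes to training; `split_sizes` is a hypothetical helper, not part of scikit-learn:

```python
import math

def split_sizes(n_samples, test_fraction=0.25):
    # Hypothetical helper: hold out a fraction (rounded up), train on the rest.
    n_val = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_val
    return n_train, n_val

n_train, n_val = split_sizes(42000)
print(n_train, n_val)
```

Deriving the sizes this way (or reshaping with -1) avoids hard-coding numbers that break if the split fraction changes.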



In [8]:
# normalizing data

X_train = X_train.astype("float32")
X_val = X_val.astype("float32")
X_test = X_test.astype("float32")

X_train /= 255
X_val /= 255
X_test /= 255

In [9]:
# one-hot encoding the labels (10 classes, one column per digit)

y_train = keras.utils.to_categorical(y_train, 10)
y_val = keras.utils.to_categorical(y_val, 10)

In [10]:
# printing the first twelve one-hot labels; the position of the 1 marks the digit

print(y_train[0], y_train[1], y_train[2])
print(y_train[3], y_train[4], y_train[5])
print(y_train[6], y_train[7], y_train[8])
print(y_train[9], y_train[10], y_train[11])

[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
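
For readers who want to see what the one-hot encoding does, here is a small NumPy sketch. It is not the Keras implementation; `one_hot` is a hypothetical stand-in, and the labels 8, 1, 5 are read off the first row of the output above.

```python
import numpy as np

def one_hot(labels, n_classes=10):
    # Hypothetical stand-in for keras.utils.to_categorical.
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, n_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0  # one 1 per row, at the label's index
    return encoded

example = one_hot([8, 1, 5])
print(example)

# argmax along each row recovers the original digit
decoded = example.argmax(axis=1)
print(decoded)
```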


<h3>Neural Network Models</h3>

In [11]:
# setting up standard params

n_inputs = 28*28
n_hidden1 = 400
n_hidden2 = 300
n_hidden3 = 200
n_hidden4 = 100
n_hidden5 = 50
n_outputs = 10

In [12]:
# because of computational time and the number of neural networks, only using a single epoch 
# this is for very basic benchmarking and will produce less accurate models than additional runs

n_epochs = 1
n_batch = 50
sgd = optimizers.SGD(lr = 0.1)

In [13]:
# four ReLU hidden layers followed by a softmax output layer

Inp = Input(shape=(784,))
l = Dense(n_hidden1, activation="relu", name = "hidden1")(Inp)
l = Dense(n_hidden2, activation="relu", name = "hidden2")(l)
l = Dense(n_hidden3, activation="relu", name = "hidden3")(l)
l = Dense(n_hidden4, activation="relu", name = "hidden4")(l)
output = Dense(n_outputs, activation = "softmax", name = "outputs")(l)

In [14]:
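# first neural network is built using stochastic gradient descent
# note: the string "sgd" makes Keras create an SGD optimizer with default settings,
# so the sgd object defined above (lr = 0.1) is not applied; pass optimizer = sgd to use it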

nn1 = Model(Inp, output)
nn1.compile(loss = "categorical_crossentropy", optimizer = "sgd", metrics = ["accuracy"])
nn1.summary() 

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 784)               0         
_________________________________________________________________
hidden1 (Dense)              (None, 400)               314000    
_________________________________________________________________
hidden2 (Dense)              (None, 300)               120300    
_________________________________________________________________
hidden3 (Dense)              (None, 200)               60200     
_________________________________________________________________
hidden4 (Dense)              (None, 100)               20100     
_________________________________________________________________
outputs (Dense)              (None, 10)                1010      
Total params: 515,610
Trainable params: 515,610
Non-trainable params: 0
_________________________________________________________________
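
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit. A short sketch of that arithmetic, using the layer sizes defined above (`dense_params` is a hypothetical helper name):

```python
def dense_params(n_in, n_out):
    # weights (n_in * n_out) plus biases (n_out)
    return n_in * n_out + n_out

layer_sizes = [784, 400, 300, 200, 100, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

The first layer holds well over half of the parameters, which is typical when a wide input feeds directly into a dense layer.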


In [15]:
# second neural network built with Adam

l = Dense(n_hidden1, activation = "relu", name = "hidden1")(Inp)
l = Dense(n_hidden2, activation = "relu", name = "hidden2")(l)
l = Dense(n_hidden3, activation = "relu", name = "hidden3")(l)
l = Dense(n_hidden4, activation = "relu", name = "hidden4")(l)
output = Dense(n_outputs, activation = "softmax", name = "outputs")(l)

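# note: compile() below receives the string "adam", so Keras uses a default Adam optimizer
# and this adam object (lr = 0.01) is not applied; pass optimizer = adam to use it
adam = keras.optimizers.Adam(lr = 0.01)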
nn2 = Model(Inp, output)

nn2.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics=["accuracy"])

In [16]:
# third neural network built with Adam and two hidden layers

l = Dense(n_hidden1, activation = "relu", name = "hidden1")(Inp)
l = Dense(n_hidden4, activation = "relu", name = "hidden4")(l)
output = Dense(n_outputs, activation = "softmax", name = "outputs")(l)

# note: compile() below is given the string "adam", so this object (lr = 0.1) is not used
adam = keras.optimizers.Adam(lr = 0.1)
nn3 = Model(Inp, output)

nn3.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])

In [17]:
# fourth neural network built with Adam and five hidden layers

l = Dense(n_hidden1, activation = "relu", name = "hidden1")(Inp)
l = Dense(n_hidden2, activation = "relu", name = "hidden2")(l)
l = Dense(n_hidden3, activation = "relu", name = "hidden3")(l)
l = Dense(n_hidden4, activation = "relu", name = "hidden4")(l)
l = Dense(n_hidden5, activation = "relu", name = "hidden5")(l)
output = Dense(n_outputs, activation = "softmax", name = "outputs")(l)

# note: compile() below is given the string "adam", so this object (lr = 0.01) is not used
adam = keras.optimizers.Adam(lr = 0.01)
nn4 = Model(Inp, output)

nn4.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics=["accuracy"])

In [18]:
print("Neural Nets Benchmark Experiment\n--------------------------------------------------------------------------")
print("Neural Net 1: Stochastic Gradient Descent")
print("0.1 Learning Rate | 4 Layers | Batches of 50")
nn1_fit = nn1.fit(X_train, y_train, batch_size = n_batch, verbose = 2,
                   epochs = n_epochs, validation_data=(X_val, y_val))

print("\nNeural Net 2: Adam")
print("0.01 Learning Rate | 4 Layers | Batches of 50")
nn2_fit = nn2.fit(X_train, y_train, batch_size = n_batch, verbose = 2,
                   epochs = n_epochs, validation_data=(X_val, y_val))

print("\nNeural Net 3: Adam")
print("0.1 Learning Rate | 2 Layers | Batches of 100")
nn3_fit = nn3.fit(X_train, y_train, batch_size = 100, verbose = 2,
                   epochs = n_epochs, validation_data=(X_val, y_val))

print("\nNeural Net 4: Adam")
print("0.01 Learning Rate | 5 Layers | Batches of 50")
nn4_fit = nn4.fit(X_train, y_train, batch_size = n_batch, verbose = 2,
                   epochs = n_epochs, validation_data=(X_val, y_val))

Neural Nets Benchmark Experiment
--------------------------------------------------------------------------
Neural Net 1: Stochastic Gradient Descent
0.1 Learning Rate | 4 Layers | Batches of 50
Train on 31500 samples, validate on 10500 samples
Epoch 1/1
 - 5s - loss: 1.0368 - acc: 0.7281 - val_loss: 0.4239 - val_acc: 0.8796

Neural Net 2: Adam
0.01 Learning Rate | 4 Layers | Batches of 50
Train on 31500 samples, validate on 10500 samples
Epoch 1/1
 - 7s - loss: 0.2755 - acc: 0.9172 - val_loss: 0.1508 - val_acc: 0.9534

Neural Net 3: Adam
0.1 Learning Rate | 2 Layers | Batches of 100
Train on 31500 samples, validate on 10500 samples
Epoch 1/1
 - 4s - loss: 0.3143 - acc: 0.9101 - val_loss: 0.1648 - val_acc: 0.9508

Neural Net 4: Adam
0.01 Learning Rate | 5 Layers | Batches of 50
Train on 31500 samples, validate on 10500 samples
Epoch 1/1
 - 8s - loss: 0.2917 - acc: 0.9123 - val_loss: 0.1527 - val_acc: 0.9544


<h3>Benchmark Results Conclusion</h3>

As we can see, all four neural networks produced disappointing results, and there are several points to discuss. Neural networks are highly intricate in design, which allows for an immense amount of customization, most often through parameter tuning.

The first item I’d like to bring up is the number of epochs allotted for testing purposes. A single epoch simply does not allow for enough training, resulting in underfit models. Conversely, increasing the number of epochs will eventually cause overfitting. The goal is to train our NN, not to have the model memorize the data. As is, these models are underfitting the data.
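
One common way to balance underfitting against overfitting is to watch the validation loss and stop once it has not improved for a few epochs. The sketch below illustrates the idea in plain Python; it is not something this notebook runs, `best_epoch` and `patience` are hypothetical names, and the loss values are made up purely for illustration.

```python
def best_epoch(val_losses, patience=2):
    """Return (index of the best epoch, index where training would stop)."""
    best_idx = 0
    best_loss = float("inf")
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = idx, loss
        elif idx - best_idx >= patience:
            # no improvement for `patience` epochs in a row: stop here
            return best_idx, idx
    return best_idx, len(val_losses) - 1

# Made-up validation losses: improving at first, then rising as the model overfits.
illustrative_losses = [0.42, 0.25, 0.18, 0.16, 0.17, 0.19, 0.22]
print(best_epoch(illustrative_losses))
```

With a single epoch there is nothing to compare against, which is why the benchmark above cannot say where the turning point lies.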

The accuracies and losses of each NN were poor after one epoch. The Stochastic Gradient Descent model (validation accuracy 0.8796) was too far behind to justify tuning its parameters. While the Adam models produced more consistent results (validation accuracy between 0.9508 and 0.9544), the single epoch makes it very difficult to judge how well they would perform with more training.

The three Adam models were simple adaptations of one another. I found little significance in changing the learning rate below 0.1, although each compile call above receives the optimizer as a string ("sgd" or "adam"), which means Keras uses its default learning rates rather than the configured ones, so this comparison is inconclusive. The number of layers appeared to have some significance; however, more tests would be needed to verify this. From the very limited differences between these models, I found the most promise in the second NN iteration: its validation loss is the lowest of the four (0.1508) and its validation accuracy (0.9534) is within 0.001 of the best. This suggests that with more testing, and especially additional epochs, it may perform significantly better.

I’m intrigued enough to take the second NN and increase the number of epochs to five.


In [19]:
# demonstration of the second Adam model with five epochs

l = Dense(n_hidden1, activation = "relu", name = "hidden1")(Inp)
l = Dense(n_hidden2, activation = "relu", name = "hidden2")(l)
l = Dense(n_hidden3, activation = "relu", name = "hidden3")(l)
l = Dense(n_hidden4, activation = "relu", name = "hidden4")(l)
output = Dense(n_outputs, activation = "softmax", name = "outputs")(l)

# note: compile() below is given the string "adam", so this object (lr = 0.01) is not used
adam = keras.optimizers.Adam(lr = 0.01)
nn2 = Model(Inp, output)

nn2.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics=["accuracy"])

nn2_fit = nn2.fit(X_train, y_train, batch_size = n_batch, verbose = 1,
                   epochs = 5, validation_data=(X_val, y_val))

Train on 31500 samples, validate on 10500 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


As expected, simply running five epochs significantly improved each performance category. It’s also worth noting the variance between the first epoch of this run and the single-epoch run above, even though the architecture is identical. This is an important consideration when judging the validity of underfit models.

While these results are much better and could be used as a model, it’s important to remember that it was much simpler to achieve an accuracy score of around 0.96 using Random Forests. Not only were Random Forests simpler to implement and less computationally expensive, but they performed better as well. Based on this, I cannot recommend any of these neural network models as a replacement for Random Forests. Of these four, however, I would choose the second NN optimized with Adam.

I’m not finished yet. Neural networks are immensely powerful, so I’m committed to building a new one from scratch and (hopefully) blowing away Random Forests…

<h3>TensorFlow Neural Network</h3>
<h4>Built for Competition!</h4>

In [20]:
# starting from scratch...

n_inputs = 28*28
n_hidden1 = 400
n_hidden2 = 300
n_hidden3 = 200
n_hidden4 = 100
n_outputs = 10

In [21]:
X = tf.placeholder(tf.float32, shape = (None, n_inputs), name = "X")
y = tf.placeholder(tf.int64, shape = (None,), name = "y")

In [22]:
def neuron_layer(X, n_neurons, name, activation=None):
    # fully connected layer: z = X.W + b, optionally followed by ReLU
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        # small random weights, scaled by the input size, break symmetry between neurons
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name = "weights")
        b = tf.Variable(tf.zeros([n_neurons]), name = "biases")
        z = tf.matmul(X, W) + b
        if activation == "relu":
            return tf.nn.relu(z)
        else:
            return z
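
To make the layer's arithmetic concrete without a TensorFlow session, here is a NumPy sketch of the same forward pass. It is an illustration under assumptions, not the TensorFlow code: `truncated_normal` and `dense_forward` are hypothetical helpers, and the truncation simply redraws any sample that lands more than two standard deviations from zero.

```python
import numpy as np

def truncated_normal(shape, stddev, rng):
    # Redraw any sample farther than two standard deviations from zero.
    values = rng.normal(0.0, stddev, size=shape)
    outside = np.abs(values) > 2 * stddev
    while outside.any():
        values[outside] = rng.normal(0.0, stddev, size=int(outside.sum()))
        outside = np.abs(values) > 2 * stddev
    return values

def dense_forward(X, W, b, activation=None):
    # z = X.W + b, optionally passed through ReLU
    z = X @ W + b
    return np.maximum(z, 0.0) if activation == "relu" else z

rng = np.random.default_rng(0)
n_inputs, n_neurons = 784, 400
stddev = 2 / np.sqrt(n_inputs)
W = truncated_normal((n_inputs, n_neurons), stddev, rng)
b = np.zeros(n_neurons)
X_demo = rng.random((5, n_inputs))  # five fake "images" with pixels in [0, 1)
hidden = dense_forward(X_demo, W, b, activation="relu")
print(hidden.shape, bool((hidden >= 0).all()))
```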

In [23]:
with tf.name_scope("dnn"):
    hidden1 = neuron_layer(X, n_hidden1, "hidden1", activation="relu")
    hidden2 = neuron_layer(hidden1, n_hidden2, "hidden2", activation="relu")
    hidden3 = neuron_layer(hidden2, n_hidden3, "hidden3", activation="relu")
    hidden4 = neuron_layer(hidden3, n_hidden4, "hidden4", activation="relu")
    logits = neuron_layer(hidden4, n_outputs, "outputs") 

In [24]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels = y, logits = logits)
    loss = tf.reduce_mean(xentropy, name = "loss")
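
The loss above takes integer labels and raw logits. As a rough NumPy sketch of what is being computed (a hand-written equivalent for illustration, not TensorFlow's implementation), the per-example loss is the negative log of the softmax probability assigned to the true class:

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # subtract the row maximum first so exp() cannot overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick the log-probability of the true class for each row
    return -log_probs[np.arange(len(labels)), labels]

uniform = np.zeros((1, 10))                  # no preference among 10 digits
confident = np.array([[0.0] * 9 + [20.0]])   # strongly favors class 9
loss_uniform = sparse_softmax_cross_entropy(uniform, [3])
loss_confident = sparse_softmax_cross_entropy(confident, [9])
print(loss_uniform, loss_confident)
```

A network that guesses uniformly among ten digits has a loss of ln(10), roughly 2.30, which gives a baseline for reading loss values.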

In [25]:
learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate) 
    training_op = optimizer.minimize(loss) 

In [26]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32)) 
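
With k = 1, the evaluation step amounts to checking whether the largest logit sits at the true label. A small NumPy sketch of that idea (ignoring how ties are handled; `top1_accuracy` and the demo values are hypothetical):

```python
import numpy as np

def top1_accuracy(logits, labels):
    # a prediction is correct when the highest-scoring class equals the label
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == np.asarray(labels)))

demo_logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [0.0, 3.0, 0.5],
    [1.0, 0.9, 0.8],
])
demo_labels = [0, 2, 2, 0]
print(top1_accuracy(demo_logits, demo_labels))
```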

In [27]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [28]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/") 

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [30]:
n_epochs = 60 # some real epochs for learning!
batch_size = 50

In [31]:
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        # note: train accuracy is measured on the last mini-batch only (50 images)
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)
    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Train accuracy: 0.98 Test accuracy: 0.9344
1 Train accuracy: 0.96 Test accuracy: 0.9478
2 Train accuracy: 1.0 Test accuracy: 0.955
3 Train accuracy: 1.0 Test accuracy: 0.9614
4 Train accuracy: 0.98 Test accuracy: 0.9649
5 Train accuracy: 0.96 Test accuracy: 0.9672
6 Train accuracy: 0.98 Test accuracy: 0.9693
7 Train accuracy: 1.0 Test accuracy: 0.9699
8 Train accuracy: 1.0 Test accuracy: 0.9715
9 Train accuracy: 1.0 Test accuracy: 0.9735
10 Train accuracy: 1.0 Test accuracy: 0.9728
11 Train accuracy: 1.0 Test accuracy: 0.9737
12 Train accuracy: 1.0 Test accuracy: 0.9744
13 Train accuracy: 1.0 Test accuracy: 0.972
14 Train accuracy: 1.0 Test accuracy: 0.9754
15 Train accuracy: 1.0 Test accuracy: 0.9772
16 Train accuracy: 1.0 Test accuracy: 0.9735
17 Train accuracy: 1.0 Test accuracy: 0.9761
18 Train accuracy: 1.0 Test accuracy: 0.9758
19 Train accuracy: 1.0 Test accuracy: 0.9762
20 Train accuracy: 1.0 Test accuracy: 0.9757
21 Train accuracy: 1.0 Test accuracy: 0.9767
22 Train accuracy

Look at these scores! Much, much better!
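
The train accuracy printed by the loop is computed on the last mini-batch of 50 images, so it can only move in steps of 1/50 = 0.02 (hence values such as 0.96, 0.98 and 1.0). A sketch of how accuracy over a whole dataset could be accumulated batch by batch instead; `accuracy_in_batches`, `toy_predict`, and the random data are all hypothetical stand-ins for the real model and images:

```python
import numpy as np

def accuracy_in_batches(predict_fn, images, labels, batch_size=50):
    # count correct predictions batch by batch, then divide by the dataset size
    correct = 0
    for start in range(0, len(images), batch_size):
        batch_pred = predict_fn(images[start:start + batch_size])
        correct += int(np.sum(batch_pred == labels[start:start + batch_size]))
    return correct / len(images)

def toy_predict(batch):
    # hypothetical stand-in for the trained network
    return (batch.sum(axis=1) > 0).astype(int)

rng = np.random.default_rng(1)
images = rng.normal(size=(230, 4))
labels = (images.sum(axis=1) > 0).astype(int)
labels[:23] = 1 - labels[:23]  # flip 10% of the labels so the toy model is wrong there

full_accuracy = accuracy_in_batches(toy_predict, images, labels)
print(full_accuracy)
print(1 / 50)  # smallest possible change in a 50-image batch accuracy
```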

In [32]:
test = pd.read_csv("test.csv")

In [33]:
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
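    # note: despite the name, test[:] holds raw 0-255 pixel values, while the MNIST
    # training images are scaled to [0, 1]; dividing by 255 would match the training scale
    X_new_scaled = test[:]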
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1) 

INFO:tensorflow:Restoring parameters from ./my_model_final.ckpt


In [34]:
print("Predicted classes:", y_pred)

Predicted classes: [2 0 9 ... 3 9 2]


In [35]:
# raw dump of the predictions; overwritten by the Kaggle-formatted file in the next cell
np.savetxt("testout.csv", y_pred, delimiter = ",")

In [36]:
pd.DataFrame({"ImageId": list(range(1, len(y_pred) + 1)),
              "Label": y_pred}).to_csv("testout.csv", index = False, header = True)

Update from original 1600 score w/ 60 epochs

Submitted to Kaggle.com for a score of 0.99485.

Rank: 525

User ID: 2698396

<h3>The "Real" Conclusion</h3>

I think it goes without saying that I would highly recommend a Gradient Descent optimized Neural Network to identify handwritten digits. In the context of this data, it is superior to Random Forests. Each update follows the gradient of the differentiable loss, so the loss is driven down steadily over the epochs. There are, however, other topics that need to be addressed.

This recommendation holds when accuracy matters more than computation time and resources. I can’t believe I forgot to time it in the code, but the 24 epochs took just under an hour to run. This is not a quick solution, but it is extremely accurate.

On the same note, while the session was running, I was concerned about overfitting. At just about the halfway mark of the epochs, this NN hit a 1.0 accuracy score on the training data. I was very worried that this model had memorized the training data… Keep in mind, though, that the training accuracy printed above is computed on the final mini-batch of 50 images only, so 1.0 means that one batch was classified perfectly. I also noticed that the training scores started to slightly decrease while the test accuracy continued to climb. This allowed me to remain optimistic and run the model against the submission test data.

I would love to run some more tests revolving around the number of epochs. The final results suggest that this model is not overfit, and I would like to verify that. As of now, I’m extremely satisfied with this score, but I may be coming back to this in the very near future!