## MNIST data with a fully connected neural network 

The MNIST dataset was downloaded from kaggle.com in .csv format.

train.csv contains 42,000 training examples. Each row holds the digit label in the first column, followed by the 784 pixel values of the image in the remaining columns.

When classifying this data with a random forest classifier, we achieved an accuracy of ca. 96.4% on the test set. Let's see if we can do better with a fully connected neural network!


In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import metrics
from sklearn.model_selection import train_test_split

Load data and convert to numpy arrays 'labels' and 'train':

In [3]:
dataset = pd.read_csv(r"~/Python/Kaggle Digits/train.csv")
labels = dataset['label'].values
train = dataset.iloc[:, 1:].values

In [4]:
print('Train set: ' + str(train.shape))
print('Labels: ' + str(labels.shape))

Train set: (42000, 784)
Labels: (42000,)


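Rescale the pixel values from 0..255 to roughly the range [-0.5, 0.5], so that they are approximately centered around zero (this is a fixed shift and scale, not a true standardization to zero mean and unit variance):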

In [5]:
def normalizeData(X):
    # input: X: 2D numpy array with shape (#datapoints, #features), raw pixel values 0..255
    # output: X shifted by 128 and divided by 255, i.e. values in roughly [-0.5, 0.5]
    return (X - 128.0) / 255.0
X = normalizeData(train)
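
This rescaling maps raw pixel values to roughly [-0.5, 0.5] rather than standardizing them. A minimal sketch on a toy array shows the difference; the array `toy` and the helper `standardize` are hypothetical and only serve as illustration, they are not used in the rest of the notebook:

```python
import numpy as np

def normalize_data(X):
    # Same rescaling as normalizeData above: shift by 128, divide by 255
    return (X - 128.0) / 255.0

def standardize(X, eps=1e-8):
    # Hypothetical alternative: zero mean and unit variance over the whole array
    return (X - X.mean()) / (X.std() + eps)

toy = np.array([[0.0, 64.0, 128.0, 255.0]])  # made-up pixel values
scaled = normalize_data(toy)
standardized = standardize(toy)

print(scaled.min(), scaled.max())              # bounds of the rescaled values
print(standardized.mean(), standardized.std())  # approximately 0 and 1
```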

For a neural network, we need to reformat our labels (0, 1, 2, 3, ..., 9) into one-hot encoded labels, i.e.:
label 0 is represented as [1.0, 0.0, 0.0, ...],
label 1 is represented as [0.0, 1.0, 0.0, ...], etc.

In [6]:
def reformatLabels(labels):
    #input: 1D numpy array labels with shape (#samples)
    #output: 2D numpy array of encoded labels with shape (#samples, #labels)
    num_labels = len(np.unique(labels))
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return labels
labels = reformatLabels(labels)
print('Labels: ' + str(labels.shape))

Labels: (42000, 10)
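
To see what the broadcasting comparison in `reformatLabels` does, here is a small sketch on three made-up labels. The helper name `one_hot` and the toy data are assumptions for illustration:

```python
import numpy as np

def one_hot(labels, num_labels):
    # Compare each label (as a column) against the row [0, 1, ..., num_labels-1];
    # broadcasting yields a (#samples, #labels) boolean matrix, cast to float
    return (np.arange(num_labels) == labels[:, None]).astype(np.float32)

toy_labels = np.array([0, 2, 1])
encoded = one_hot(toy_labels, 3)
print(encoded)

# The original labels can be recovered with argmax along axis 1
decoded = np.argmax(encoded, axis=1)
print(decoded)
```

The same `argmax` trick is what the accuracy metric further down relies on.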


Also, the TensorFlow graph below works with float32 tensors, while our numpy array is float64, so we need to convert:

In [7]:
X = X.astype(np.float32)

Divide the data randomly into a training set (90%) and a test set (10%):

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.1, random_state=0)

Before we can build the network architecture in the TensorFlow graph, we need to define some hyperparameters:

In [9]:
image_size  = 28     #image size of MNIST data: 28x28 pixels
num_labels = 10      #number of classes in the data set
batch_size = 128     #batch used in each training pass
learningrate = 0.02  #learning rate 
alpha =  0.0002      #regularization strength
train_epochs = 200    #for how many epochs we want to train the network
totalsteps = int((train_epochs * X_train.shape[0])/batch_size) #calculate #train passes
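
As a quick sanity check of the step count, the arithmetic can be done by hand. The sketch assumes 37800 training rows, which is what `test_size=0.1` leaves of the 42000 rows:

```python
train_examples = 37800   # assumed: 90% of the 42000 rows, given test_size=0.1
batch_size = 128
train_epochs = 200

# Full batches that fit into one pass over the training data
steps_per_epoch = train_examples // batch_size
# Same formula as totalsteps above
total_steps = int((train_epochs * train_examples) / batch_size)

print(steps_per_epoch, total_steps)
```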

In [10]:
graph = tf.Graph()
with graph.as_default():
    def weight_variable(shape):
        #returns weight variables of the given shape, drawn from a truncated normal distribution (stddev 0.1)
        return tf.Variable(tf.truncated_normal(shape=shape, stddev=0.1))

    def bias_variable(shape):
        #returns bias variables of shape shape, of constant value 0.1
        return tf.Variable(tf.constant(0.1, shape = shape))

    # Create placeholders for batches of input data
    tf_X_train = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size))
    tf_y_train = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    # Create constant for test data
    tf_X_test = tf.constant(X_test)
    
    # define neural network: input layer of image_size * image_size, then
    # two hidden layers of 1024 nodes each, and then the output layer of size num_labels
    layersetup = [image_size * image_size, 1024, 1024, num_labels]
 
    # initialize weights and biases
    weights = []
    biases = []
    for i in range(len(layersetup)-1):
        weights.append(weight_variable([layersetup[i], layersetup[i+1]]))
        biases.append(bias_variable([layersetup[i+1]]))
   
    # Forward propagation through the network:
    def forwardProp(X, dropout=False):
        inputlayer = X
        # hidden layers: affine transform, ReLU, optional dropout
        for i in range(len(weights)-1):
            h0 = tf.matmul(inputlayer, weights[i]) + biases[i]  # affine transform
            h1 = tf.nn.relu(h0)  # ReLU nonlinearity
            if dropout:
                h1 = tf.nn.dropout(h1, 0.7)  # keep each unit with probability 0.7
            inputlayer = h1
        # output layer: affine transform only, returns the logits
        outputlayer = tf.matmul(inputlayer, weights[-1]) + biases[-1]
        if dropout:
            # dropout is applied to the output layer as well
            outputlayer = tf.nn.dropout(outputlayer, 0.7)
        return outputlayer
    
    # Calculate forward propagation for training: 
    # (dropout usually only needed for large networks, so it probably won't improve the performance in our case)  
    logits = forwardProp(tf_X_train, dropout = True) 
    
    # calculate regularization loss (weight loss):
    l2 = tf.nn.l2_loss(weights[0])
    for i in range(1, len(weights)):
        l2 += tf.nn.l2_loss(weights[i])
    # total loss: mean data loss over the batch + weighted regularization loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_y_train)) + alpha*l2
    
    # Set up learning rate decay
    global_step = tf.Variable(0)  # count the number of steps taken
    learning_rate = tf.train.exponential_decay(learningrate, global_step, 1000, 0.9)

    # Optimizer to minimize the loss function which we defined above
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    
    # Predictions for the training and test data.
    # Note: dropout only used for training, not for prediction!
    train_prediction = tf.nn.softmax(forwardProp(tf_X_train, dropout = False))
    test_prediction = tf.nn.softmax(forwardProp(tf_X_test, dropout = False))
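
To make the graph above less of a black box, here is a minimal NumPy sketch of the same ideas: ReLU layers, dropout during training, and a loss made of softmax cross-entropy plus an L2 penalty. It is an illustration under simplifying assumptions, not the code that produced the results below: it uses a toy network with one hidden layer and made-up data, and it applies dropout to the hidden layer only. All function names are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)

def relu(x):
    return np.maximum(x, 0.0)

def dropout(x, keep_prob, rng):
    # Inverted dropout: zero each unit with probability 1 - keep_prob and
    # scale the survivors by 1 / keep_prob so the expected activation is unchanged
    mask = rng.uniform(size=x.shape) < keep_prob
    return x * mask / keep_prob

def forward(X, weights, biases, keep_prob=None):
    h = X
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h.dot(W) + b)
        if keep_prob is not None:
            h = dropout(h, keep_prob, rng)
    return h.dot(weights[-1]) + biases[-1]  # logits

def softmax_cross_entropy(logits, onehot):
    # Subtract the row maximum before exponentiating for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(onehot * log_probs).sum(axis=1)

def total_loss(logits, onehot, weights, alpha):
    # tf.nn.l2_loss(w) computes sum(w ** 2) / 2
    l2 = sum(0.5 * np.sum(W ** 2) for W in weights)
    return np.mean(softmax_cross_entropy(logits, onehot)) + alpha * l2

# Toy dimensions (hypothetical): 4 inputs, 5 hidden units, 3 classes
sizes = [4, 5, 3]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.full(n, 0.1) for n in sizes[1:]]
X_toy = rng.normal(size=(6, 4))
y_toy = np.eye(3)[rng.randint(0, 3, size=6)]

logits_toy = forward(X_toy, weights, biases, keep_prob=0.7)
loss_toy = total_loss(logits_toy, y_toy, weights, alpha=0.0002)
print(logits_toy.shape, loss_toy)
```

Passing `keep_prob=None` to `forward` switches dropout off, which mirrors how the predictions in the graph are computed with `dropout = False`.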

Also, we need to define a function that yields the input/indices for each batch of training data:

In [11]:
def yield_next_batch_generator(sizeX, batch_size):
    # Generator function which yields the indices of the samples (batch) used for each training pass.
    # Only full batches are yielded; samples left over at the end of a pass are skipped for that pass.

    currentindex = 0
    indices = np.arange(0, sizeX)
    np.random.shuffle(indices)
    while True:
        start = currentindex
        currentindex += batch_size
        # when the next batch would run past the end of the training data,
        # the indices are reshuffled and currentindex is reset
        if currentindex > sizeX:
            np.random.shuffle(indices)
            start = 0
            currentindex = batch_size
        end = currentindex
        yield indices[start:end]
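
A small demonstration of how the generator behaves, using a made-up data set of 10 samples and a batch size of 4. The function `batch_indices` is a renamed copy of `yield_next_batch_generator` so that the sketch runs on its own:

```python
import numpy as np

def batch_indices(size, batch_size):
    # Same logic as yield_next_batch_generator above
    current = 0
    indices = np.arange(size)
    np.random.shuffle(indices)
    while True:
        start = current
        current += batch_size
        if current > size:
            np.random.shuffle(indices)
            start = 0
            current = batch_size
        yield indices[start:current]

gen = batch_indices(10, 4)
# The yielded slices are views on the shuffled index array, which is reshuffled
# in place later on, so we copy each batch before storing it
batches = [next(gen).copy() for _ in range(6)]
for b in batches:
    print(b)
```

Every batch has exactly `batch_size` entries, which is what the fixed-size placeholders in the graph require. The price is that the last 2 of the 10 samples in each shuffle are not used in that pass; because the order is reshuffled every time, different samples are skipped in different passes.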

We need to define a metric for the classifier. For this problem, we can use classification accuracy:

In [12]:
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])
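
A tiny worked example of this metric, with made-up predictions for four samples and three classes (the arrays `toy_predictions` and `toy_labels` are hypothetical):

```python
import numpy as np

def accuracy(predictions, labels):
    # Percentage of rows where the predicted class (argmax of the probabilities)
    # equals the true class (argmax of the one-hot label)
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

toy_predictions = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.3, 0.4, 0.3],   # predicted class 1, true class 0
                            [0.2, 0.2, 0.6]])
toy_labels = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [1, 0, 0],
                       [0, 0, 1]])

score = accuracy(toy_predictions, toy_labels)
print(score)  # 3 of 4 rows match, i.e. 75.0
```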

Now we can initialize and train the network:

In [13]:
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    print('Training for a total number of steps of: ' + str(totalsteps))
    # Initialize generator:
    g = yield_next_batch_generator(X_train.shape[0], batch_size)
    for step in range(totalsteps):
        indices = next(g)  # indices of the next batch from the generator
        feed_dict = {tf_X_train : X_train[indices, :], tf_y_train : y_train[indices, :]}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)

        # report minibatch loss and accuracy every 500 steps
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, y_train[indices, :]))
            print(' ')
        # report test accuracy every 2000 steps
        if (step % 2000 == 0):
            print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), y_test))
            print(' ')
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), y_test))
    print('Done training!')

Initialized
Training for a total number of steps of: 59062
Minibatch loss at step 0: 18.380714
Minibatch accuracy: 4.7%
 
Test accuracy: 10.0%
 
Minibatch loss at step 500: 2.247907
Minibatch accuracy: 93.8%
 
Minibatch loss at step 1000: 2.114293
Minibatch accuracy: 94.5%
 
Minibatch loss at step 1500: 2.149573
Minibatch accuracy: 97.7%
 
Minibatch loss at step 2000: 2.263469
Minibatch accuracy: 93.8%
 
Test accuracy: 94.2%
 
Minibatch loss at step 2500: 2.040787
Minibatch accuracy: 96.1%
 
Minibatch loss at step 3000: 2.015380
Minibatch accuracy: 95.3%
 
Minibatch loss at step 3500: 2.123134
Minibatch accuracy: 96.1%
 
Minibatch loss at step 4000: 1.974600
Minibatch accuracy: 95.3%
 
Test accuracy: 95.3%
 
Minibatch loss at step 4500: 2.001168
Minibatch accuracy: 96.1%
 
Minibatch loss at step 5000: 1.856049
Minibatch accuracy: 98.4%
 
Minibatch loss at step 5500: 1.983018
Minibatch accuracy: 99.2%
 
Minibatch loss at step 6000: 1.896112
Minibatch accuracy: 98.4%
 
Test accuracy: 95.

So for the fully connected neural network, we get an accuracy of ca. 96.6% on the test set. This is similar to the performance of the random forest classifier (ca. 96.4%), and given that the neural network takes significantly longer to train, it does not seem to offer an advantage over the random forest.

I also tried to tune the network parameters, e.g., learning rate, learning rate decay rate, number of nodes, number of layers, and batch size, but I did not get a better performance than ca. 96.6%. However, I did not spend much time on it, so one could possibly obtain a better performance than this, e.g., by using a different optimizer, training for a longer time, or finding a different learning rate decay schedule.
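
As a starting point for experimenting with the decay schedule, the following sketch reproduces the formula behind the schedule used above in plain Python: the base rate is multiplied by 0.9 for every 1000 steps, applied continuously. The function name `decayed_learning_rate` is hypothetical:

```python
def decayed_learning_rate(base_rate, step, decay_steps=1000, decay_rate=0.9):
    # Continuous (non-staircase) exponential decay:
    # base_rate * decay_rate ** (step / decay_steps)
    return base_rate * decay_rate ** (float(step) / decay_steps)

# Learning rate at a few points of the training run described above
for step in (0, 1000, 10000, 59062):
    print(step, decayed_learning_rate(0.02, step))
```

With these settings the learning rate at the final step is several hundred times smaller than at the start, so the late epochs change the weights very little. A slower decay (for example a larger `decay_steps`) would be one thing to try.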