#  Digit Recognizer

# Model Engineering

## 1 - Introduction

> The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is.  As the competition progresses, we will release tutorials which explain different machine learning algorithms and help you to get started.

> The data for this competition were taken from the MNIST dataset. The MNIST ("Modified National Institute of Standards and Technology") dataset is a classic within the Machine Learning community that has been extensively studied.  More detail about the dataset, including Machine Learning algorithms that have been tried on it and their levels of success, can be found at http://yann.lecun.com/exdb/mnist/index.html.


## 2 - Setup

In [1]:
import json
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import tensorflow as tf
%matplotlib inline

In [2]:
project_dir = os.path.join(os.path.dirname('__file__'), os.pardir)
settings = json.loads(open(os.path.join(project_dir, 'SETTINGS.json')).read())
train_path = os.path.join(project_dir, settings['TRAIN_DATA_PATH'])
test_path = os.path.join(project_dir, settings['TEST_DATA_PATH'])

train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

train_imgs = train_data.drop('label', axis=1)
one_hot_target = pd.get_dummies(train_data['label'], prefix='dig')
train_data = pd.concat([train_imgs, one_hot_target], axis=1, join='inner')

train_val_ratio = 0.8
train_data_size = len(train_data)
train_set = train_data[:int(train_data_size*train_val_ratio)]
val_set = train_data[int(train_data_size*train_val_ratio)+1:]

print(train_set.shape)
print(val_set.shape)
print(test_data.shape)

(33600, 794)
(8399, 794)
(28000, 784)


As discribed on Kaggle's website:

> Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

> The training data set, (train.csv), has 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.

> Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).


The image bellow shows a sample of data from train dataset.

In [None]:
f, axarr = plt.subplots(10, 10)
for row in range(10):
    for column in range(10):
        entry = train_data[train_data['label']==column].iloc[row].drop('label').as_matrix()
        axarr[row, column].imshow(entry.reshape([28, 28]))
        axarr[row, column].get_xaxis().set_visible(False)
        axarr[row, column].get_yaxis().set_visible(False)

## 3 - Logistic Regression

The first model, that will serve as a benchmark, will be [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression). We use [tensorflow](https://www.tensorflow.org) in order to get a well suited model.


### 3.1 - Model

First we will define the placeholder for input x. Then we define the weights and biases of our model as Variables. Finally we will define our Logist Regression model.

In [None]:
# input tensor
x = tf.placeholder(tf.float32, [None, 784])

# weights (w) and biases (s) 
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# model output (y)
y = tf.nn.softmax(tf.matmul(x, W) + b)

### 3.2 - Training

First we will define the placeholder for the targets of our train data. Then the cross entropy can be calculated as follows:

In [None]:
# target
y_ = tf.placeholder(tf.float32, [None, 10])
# cross entropy
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

We will describe bellow the train step and the accuracy mesure of the model.

In [None]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

In order to train teh model we first need to initialize all variables. Second we will iterate over many epochs and evaluate the model using a train and test set.

In [None]:
init = tf.initialize_all_variables()
saver = tf.train.Saver()
sess = tf.InteractiveSession()
sess.run(init)

train_eval_list = []
val_eval_list = []
for i in range(1000):
    batch = train_set.sample(frac=0.1)
    batch_xs = batch.drop('label', axis=1).as_matrix()/255.0
    batch_ys = pd.get_dummies(batch['label']).as_matrix()
    val_xs = val_set.drop('label', axis=1).as_matrix()/255.0
    val_ys = pd.get_dummies(val_set['label']).as_matrix()

    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
 
    train_eval = sess.run(accuracy, feed_dict={x: batch_xs, y_: batch_ys})
    val_eval = sess.run(accuracy, feed_dict={x: val_xs, y_: val_ys})
    
    train_eval_list.append(train_eval)
    val_eval_list.append(val_eval)
sess.close()

### 3.3 - Evaluation

As we can see the training did not overfit. So it is ready to be aplied to the test data.

In [None]:
plt.plot(train_eval_list, label='Train set')
plt.plot(val_eval_list, label='Validation set')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc=4)

The following cell runs the model on the test data and generates a CSV file for Kaggle's submission. This submission scores 0.91714. That is not a particularly good result for MNIST, where the state of art can score over 0.99. We will work on other models to try to get closer to this score.

In [None]:
predict = sess.run(y, feed_dict={x: test_data.as_matrix() / 255.0})
pred = [[i + 1, np.argmax(one_hot_list)] for i, one_hot_list in enumerate(predict)]
submission = pd.DataFrame(pred, columns=['ImageId', 'Label'])
submission_path = os.path.join(project_dir, settings['SUBMISSION_PATH'], 'logistic_regression.csv')
submission.to_csv(submission_path, index=False)

To keep this trained model for further use, we can just save the session.

In [None]:
model_path = os.path.join(project_dir, settings['MODEL_PATH'], "logistic_regression.ckpt")
saver.save(sess, model_path)
sess.close()

## 4 - Deep Convolutional Neural Network

To get closer to stat of art accuracy, let's try to apply a Convolutional Neural Network.

### 4.1 - Model

Let's define functions to create weight and bias variables in a way to avoid 0 gradients and "dead neurons".

In [3]:
def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

And define our convolution and pooling operations, as follows.

In [4]:
def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

Now we can define our first convolutional layer.

In [5]:
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x = tf.placeholder(tf.float32, shape=[None,784])
x_image = tf.reshape(x, [-1,28,28,1])

We get the layer output by convolving the image with the weight, adding bias and max pooling.

In [6]:
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

We do pretty much the same for the second layer.

In [7]:
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

Then we define a fully connected layer as follows.

In [8]:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

Finally we define dropout to avoid overfiting and a softmax layer.

In [9]:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

### 4.2 - Training

In [18]:
sess = tf.InteractiveSession()
saver = tf.train.Saver()
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
sess.run(tf.initialize_all_variables())

for i in range(1000):
    batch = train_set.sample(50)
    batch_xs = batch.filter(like='pixel',axis=1).as_matrix()/255.0
    batch_ys = batch.filter(like='dig',axis=1).as_matrix()
    if i%100 == 0:
        val_batch = train_set.sample(500)
        val_xs = val_batch.filter(like='pixel',axis=1).as_matrix()/255.0
        val_ys = val_batch.filter(like='dig',axis=1).as_matrix()
        train_eval = accuracy.eval(feed_dict={x: batch_xs, y_: batch_ys, keep_prob: 0.5})
        val_eval = accuracy.eval(feed_dict={x: val_xs, y_: val_ys, keep_prob: 0.5})
        print('---')
        print("step %d, training accuracy %g"%(i, train_eval))
        print("step %d, validation accuracy %g"%(i, val_eval))
    train_step.run(feed_dict={x: batch_xs, y_: batch_ys, keep_prob: 0.5})

Exception AssertionError: AssertionError("Nesting violated for default stack of <type 'weakref'> objects",) in <bound method InteractiveSession.__del__ of <tensorflow.python.client.session.InteractiveSession object at 0x7f8a80f2f8d0>> ignored


---
step 0, training accuracy 0.04
step 0, validation accuracy 0.1
---
step 100, training accuracy 0.56
step 100, validation accuracy 0.67
---
step 200, training accuracy 0.78
step 200, validation accuracy 0.79
---
step 300, training accuracy 0.86
step 300, validation accuracy 0.87
---
step 400, training accuracy 0.76
step 400, validation accuracy 0.87
---
step 500, training accuracy 0.88
step 500, validation accuracy 0.92
---
step 600, training accuracy 0.92
step 600, validation accuracy 0.94
---
step 700, training accuracy 0.94
step 700, validation accuracy 0.89
---
step 800, training accuracy 0.9
step 800, validation accuracy 0.92
---
step 900, training accuracy 0.94
step 900, validation accuracy 0.91


In [17]:
model_path = os.path.join(project_dir, settings['MODEL_PATH'], "conv_nn.ckpt")
saver.save(sess, model_path)
sess.close()

NameError: name 'saver' is not defined