# Kaggle Digits Recognizer

This notebook walks through building a digit recognizer for the Kaggle Digit Recognizer competition.

First, we import the classes and functions we need.

In [None]:
import numpy
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.convolutional import Convolution2D
from keras.layers.convolutional import MaxPooling2D
from keras.utils import np_utils
from keras import backend as K
import pandas as pd
import random
K.set_image_dim_ordering('th')

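Next we set the random number seeds, to ensure reproducibility of our results.

In [None]:
# fix random seeds for reproducibility
seed = 7
numpy.random.seed(seed)
# the train/validation split below uses Python's random module, which is seeded separately
random.seed(seed)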
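
The split further down draws its validation indices with `random.sample`, which uses Python's own generator rather than NumPy's. As a minimal sketch (the helper name `draw` is hypothetical), the following checks that seeding both generators makes both kinds of draws repeatable:

```python
import random
import numpy

seed = 7

def draw(seed):
    # random and numpy.random keep separate internal state,
    # so each one has to be seeded on its own.
    random.seed(seed)
    numpy.random.seed(seed)
    python_draw = random.sample(range(100), 5)
    numpy_draw = numpy.random.permutation(100)[:5].tolist()
    return python_draw, numpy_draw

first = draw(seed)
second = draw(seed)
print(first == second)
```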

Next we load the required data from file.

The training data is in "train.csv" and the test data is in "test.csv". The training data includes labels; the test data does not. The labels predicted for the test data are uploaded to Kaggle to be scored.

In [None]:
# load data
dataset = pd.read_csv("train.csv")
y = dataset.iloc[:, 0].values  # first column holds the digit label
X = dataset.iloc[:,1:].values
X_test = pd.read_csv("test.csv").values

print("The shape of the three data sets is: ")
print(X.shape)
print(y.shape)
print(X_test.shape)

print("The type of the three data sets is: ")
print(type(X))
print(type(y))
print(type(X_test))

Next we split X and y into training and validation sets.

In [None]:
# hold out 20% of the rows for validation and train on the remaining 80%
nRows = X.shape[0]
nColumns = X.shape[1]
validRatio = 0.2
nValid = int(validRatio*nRows)
nTrain = nRows - nValid

# sample the validation rows without replacement; every other row goes to training
validIndex = random.sample(range(nRows), nValid)
trainIndex = numpy.setdiff1d(numpy.arange(nRows), validIndex)
X_valid = X[validIndex]
X_train = X[trainIndex]
y_valid = y[validIndex]
y_train = y[trainIndex]
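
An alternative way to write the same 80/20 split is to shuffle the row indices once and slice them. This is only a sketch on synthetic data: the function name `split_train_valid` and the demo arrays are assumptions, not part of the notebook.

```python
import numpy

def split_train_valid(X, y, valid_ratio=0.2, seed=7):
    # Shuffle the row indices once with a dedicated, seeded generator.
    rng = numpy.random.RandomState(seed)
    order = rng.permutation(X.shape[0])
    n_valid = int(valid_ratio * X.shape[0])
    # The first n_valid shuffled rows form the validation set, the rest train.
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]

# Synthetic stand-in for the real data: 10 rows, 5 features.
X_demo = numpy.arange(50).reshape(10, 5)
y_demo = numpy.arange(10)
X_tr, y_tr, X_va, y_va = split_train_valid(X_demo, y_demo)
print(X_tr.shape, X_va.shape)
```

Because both subsets come from one permutation, they cannot overlap and together cover every row.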

Next we reshape each flat row of 784 pixels into a single-channel 28×28 image.

In [None]:
# reshape to be [samples][channels][height][width]
X_train = X_train.reshape(X_train.shape[0], 1, 28, 28).astype('float32')
X_test = X_test.reshape(X_test.shape[0], 1, 28, 28).astype('float32')
X_valid = X_valid.reshape(X_valid.shape[0], 1, 28, 28).astype('float32')
print("The shape of X_train is: " + str(X_train.shape))
print("The shape of X_test is: " + str(X_test.shape))
print("The shape of X_valid is: " + str(X_valid.shape))
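
To make the reshape concrete, here is a small sketch on synthetic data. It assumes the pixels of each row are stored row by row, so that flat index `row * 28 + col` ends up at position `[row, col]` of the image.

```python
import numpy

# Two synthetic "images", each stored as a flat row of 784 values.
flat = numpy.arange(2 * 784).reshape(2, 784)

# Same reshape as in the notebook: [samples][channels][height][width].
images = flat.reshape(flat.shape[0], 1, 28, 28).astype('float32')

row, col = 3, 5
# With row-major storage, flat index row * 28 + col maps to images[:, 0, row, col].
print(images.shape)
print(images[0, 0, row, col] == flat[0, row * 28 + col])
```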

We now scale the pixel values from the range 0-255 to the range 0-1, and we one-hot encode the labels.

In [None]:
# normalize inputs from 0-255 to 0-1
X_train = X_train / 255
X_valid = X_valid / 255
X_test = X_test / 255
# one hot encode the labels; the test set has no labels to encode
y_train = np_utils.to_categorical(y_train)
y_valid = np_utils.to_categorical(y_valid)
num_classes = y_train.shape[1]
print(y_train.shape)
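
For readers unfamiliar with one-hot encoding, the following NumPy sketch shows the idea. The function `one_hot` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy

def one_hot(labels, num_classes=None):
    labels = numpy.asarray(labels, dtype=int)
    if num_classes is None:
        # Infer the class count from the largest label seen.
        num_classes = labels.max() + 1
    encoded = numpy.zeros((labels.shape[0], num_classes), dtype='float32')
    # Set a single 1 per row, in the column given by that row's label.
    encoded[numpy.arange(labels.shape[0]), labels] = 1.0
    return encoded

demo = one_hot([0, 3, 9])
print(demo.shape)
print(demo[1])
```

Each label becomes a row of zeros with a single 1 in the column of its class, which is the target format that a softmax output trained with categorical cross-entropy expects.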

We define a larger CNN architecture with two convolutional layers, two max pooling layers, and two fully connected hidden layers. The network topology can be summarized as follows.

1. Convolutional layer with 30 feature maps of size 5×5.
2. Pooling layer taking the max over 2×2 patches.
3. Convolutional layer with 15 feature maps of size 3×3.
4. Pooling layer taking the max over 2×2 patches.
5. Dropout layer with a probability of 20%.
6. Flatten layer.
7. Fully connected layer with 128 neurons and rectifier activation.
8. Fully connected layer with 50 neurons and rectifier activation.
9. Output layer with one softmax unit per class.
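
The layer list above fixes the size of every intermediate tensor. The sketch below works through that arithmetic, assuming unpadded ("valid") convolutions with stride 1, non-overlapping 2×2 pooling, a single input channel, and 10 output classes (digits 0-9). The helper names are hypothetical.

```python
def conv_out(size, kernel):
    # Unpadded convolution with stride 1 shrinks each side by kernel - 1.
    return size - kernel + 1

def pool_out(size, pool):
    # Non-overlapping pooling divides each side by the pool size.
    return size // pool

size = 28
size = conv_out(size, 5)   # 28 -> 24
size = pool_out(size, 2)   # 24 -> 12
size = conv_out(size, 3)   # 12 -> 10
size = pool_out(size, 2)   # 10 -> 5
flattened = 15 * size * size  # 15 feature maps of 5x5

# Trainable parameters per layer: weights plus one bias per output unit.
params = {
    "conv1": 30 * (1 * 5 * 5) + 30,
    "conv2": 15 * (30 * 3 * 3) + 15,
    "dense1": flattened * 128 + 128,
    "dense2": 128 * 50 + 50,
    "output": 50 * 10 + 10,
}
total = sum(params.values())
print(flattened, total)
```

Under these assumptions most of the parameters sit in the first fully connected layer, which connects the 375 flattened values to 128 neurons.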


In [None]:
def larger_model():
    # create model
    model = Sequential()
    model.add(Convolution2D(30, 5, 5, border_mode='valid', input_shape=(1, 28, 28), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Convolution2D(15, 3, 3, activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

The model is fit over 50 epochs with a batch size of 200. Because the test set has no labels, the final evaluation uses the held-out validation set.

In [None]:
# build the model
model = larger_model()
# Fit the model
model.fit(X_train, y_train, validation_data=(X_valid, y_valid), nb_epoch=50, batch_size=200, verbose=2)
# Final evaluation of the model on the validation set (the test set has no labels)
scores = model.evaluate(X_valid, y_valid, verbose=0)
print("Validation Error: %.2f%%" % (100-scores[1]*100))

In [None]:
# predict class probabilities for the unlabeled test set, one row per image
y_test = model.predict(X_test)
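
The predictions are per-class probabilities, so they still have to be turned into digit labels before they can be scored. The sketch below takes the most probable class per row and writes `ImageId,Label` rows with ids starting at 1. That layout is an assumption about the expected submission format, and the function name `write_submission` and the demo probabilities are hypothetical.

```python
import csv
import io
import numpy

def write_submission(probabilities, handle):
    # The predicted digit is the class with the highest probability.
    labels = numpy.argmax(probabilities, axis=1)
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ImageId", "Label"])
    # Image ids are assumed to start at 1 and follow the test file's row order.
    for image_id, label in enumerate(labels, start=1):
        writer.writerow([image_id, int(label)])
    return labels

# Synthetic probabilities for two images and three classes.
demo_probs = numpy.array([[0.1, 0.7, 0.2],
                          [0.8, 0.1, 0.1]])
buffer = io.StringIO()
labels = write_submission(demo_probs, buffer)
print(buffer.getvalue())
```

In the notebook, the same function could be called with `y_test` and a file opened for writing in place of the in-memory buffer.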