# MNIST Dataset: Digit Recognizer

## Introduction

MNIST ("Modified National Institute of Standards and Technology") is the de facto “Hello World” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.

In this analysis, I identify digits from a dataset of tens of thousands of handwritten images.

## Approach

In this analysis I used Keras (with TensorFlow as the backend) as the main package to build simple neural networks that predict, as accurately as possible, digits from handwritten images. In particular, I call the Keras Functional API to create 4-layered and 5-layered neural networks.

I also experiment with two optimizers: plain vanilla Stochastic Gradient Descent (SGD) and Adam. Many other hyperparameters, such as the number of training epochs, are held fixed and are not explored here.

In addition, the choice of hidden layer units is completely arbitrary and may not be optimal; this is another hyperparameter that I do not attempt to tune. Lastly, I introduce dropout, a form of regularisation, in the neural networks to reduce overfitting.
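
To make the idea of dropout concrete, here is a minimal NumPy sketch of "inverted" dropout, the variant in which kept activations are rescaled during training so that nothing needs to change at inference time. The function name `dropout_forward`, the rate of 0.5, and the toy array are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np

def dropout_forward(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        # At inference time the activations pass through unchanged
        return activations
    keep_prob = 1.0 - rate
    # Bernoulli mask: True where a unit is kept
    mask = rng.random(activations.shape) < keep_prob
    # Rescale kept units so the expected activation matches the no-dropout case
    return activations * mask / keep_prob

rng = np.random.default_rng(1212)
hidden = np.ones((4, 6), dtype=np.float32)  # hypothetical hidden-layer output
dropped = dropout_forward(hidden, rate=0.5, rng=rng)
print(dropped)
```

With a rate of 0.5, every entry of `dropped` is either 0 (dropped) or 2 (kept and rescaled), which is what keeps the expected activation unchanged.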

## Result

Following our simulations on the cross validation dataset, it appears that a 4-layered neural network, using 'Adam' as the optimizer along with a learning rate of 0.01, performs best. We proceed to introduce dropout in the model, and use the model to predict for the test set.

The test predictions generated by this model (submitted to Kaggle) achieve an accuracy score of 97.600%.

In [1]:
# importing key libraries and reading data

In [2]:
import pandas as pd
import numpy as np

np.random.seed(1212)

import keras
from keras.models import Model
from keras.layers import *
from keras import optimizers

In [3]:
# load the MNIST dataset
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [4]:
df_train.head() # 784 features, 1 label

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Splitting the data into training and validation sets

In [5]:
df_features = df_train.iloc[:, 1:785]
df_label = df_train.iloc[:, 0]

X_test = df_test.iloc[:, 0:784]

print(X_test.shape)

(28000, 784)


In [8]:
from sklearn.model_selection import train_test_split
X_train, X_cv, y_train, y_cv = train_test_split(df_features, df_label, 
                                                test_size = 0.2,
                                                random_state = 1212)

X_train = X_train.to_numpy().reshape(33600, 784) #(33600, 784)
X_cv = X_cv.to_numpy().reshape(8400, 784) #(8400, 784)

X_test = X_test.to_numpy().reshape(28000, 784)
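
The hard-coded shapes above follow from an 80/20 split of 42,000 labelled rows (33,600 + 8,400). As a rough sketch of what a shuffled hold-out split does, the NumPy snippet below reproduces those sizes. It is not scikit-learn's implementation and will not select the same rows as `train_test_split`; the helper name `shuffle_split` is an assumption made for illustration.

```python
import numpy as np

def shuffle_split(n_rows, test_size, seed):
    """Return (train_indices, holdout_indices) for a shuffled hold-out split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_rows)            # random ordering of row indices
    n_holdout = int(round(test_size * n_rows))   # size of the validation set
    return indices[n_holdout:], indices[:n_holdout]

train_idx, cv_idx = shuffle_split(n_rows=42000, test_size=0.2, seed=1212)
print(len(train_idx), len(cv_idx))
```

Reading the sizes from `X_train.shape` instead of typing them in would keep the reshape calls valid if the split ratio ever changes.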

In [9]:
print((min(X_train[1]), max(X_train[1])))

(0, 255)


Next, I normalize the pixel values and convert the labels from a class vector to a one-hot encoded binary matrix.

In [10]:
# Feature Normalization
X_train = X_train.astype('float32')
X_cv = X_cv.astype('float32')
X_test = X_test.astype('float32')

# Scale pixel values from [0, 255] to [0, 1]
X_train /= 255
X_cv /= 255
X_test /= 255

# Convert labels to One Hot Encoded
num_digits = 10
y_train = keras.utils.to_categorical(y_train, num_digits)
y_cv = keras.utils.to_categorical(y_cv, num_digits)

In [11]:
# Printing 2 examples of labels after conversion
print(y_train[0]) # 2
print(y_train[3]) # 7

[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
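
For readers who want to see what these two preprocessing steps do without Keras, the following is a small NumPy sketch. The helpers `normalize_pixels` and `one_hot` are hypothetical names, and the toy inputs are made up; the labels 2 and 7 are chosen to mirror the two examples printed above.

```python
import numpy as np

def normalize_pixels(pixels):
    """Scale 8-bit pixel intensities from [0, 255] to [0, 1]."""
    return pixels.astype("float32") / 255.0

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    # Set a single 1 per row, in the column given by the label
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

pixels = np.array([[0, 128, 255]], dtype=np.uint8)  # toy pixel values
labels = np.array([2, 7])                           # toy labels
print(normalize_pixels(pixels))
print(one_hot(labels, num_classes=10))
```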


## Model Fitting

Next, I fit several simple neural network models using Keras (with TensorFlow as the backend) and record their accuracy.

In [12]:
# Input Parameters
n_input = 784 # number of features
n_hidden_1 = 300
n_hidden_2 = 100
n_hidden_3 = 100
n_hidden_4 = 200
num_digits = 10

In [13]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

In [14]:
# Our model has 6 layers: 1 input layer, 4 hidden layers and 1 output layer
model = Model(Inp, output)
model.summary() # We have 297,910 parameters to estimate
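
The parameter count reported in the comment above can be checked by hand: each dense layer contributes one weight per input-output pair plus one bias per output unit. The short sketch below does this arithmetic for the layer sizes used in this notebook; the helper name `dense_param_count` is an assumption.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

# Input, four hidden layers, output (sizes taken from the cells above)
layer_sizes = [784, 300, 100, 100, 200, 10]
print(dense_param_count(layer_sizes))  # 297910
```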

In [16]:
# Set hyperparameters
learning_rate = 0.1
training_epochs = 20
batch_size = 100
sgd = optimizers.SGD(learning_rate=learning_rate)

In [17]:
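# We rely on the plain vanilla Stochastic Gradient Descent as our optimizing methodology.
# Note: the string 'sgd' makes Keras build an SGD optimizer with its default
# learning rate (0.01), so the sgd object defined above is not used here.
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])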

In [18]:
history1 = model.fit(X_train, y_train,
                     batch_size = batch_size,
                     epochs = training_epochs,
                     verbose = 2,
                     validation_data=(X_cv, y_cv))

Epoch 1/20
336/336 - 4s - 13ms/step - accuracy: 0.4794 - loss: 1.8723 - val_accuracy: 0.7814 - val_loss: 1.0048
Epoch 2/20
336/336 - 2s - 6ms/step - accuracy: 0.8343 - loss: 0.6358 - val_accuracy: 0.8769 - val_loss: 0.4555
Epoch 3/20
336/336 - 3s - 8ms/step - accuracy: 0.8840 - loss: 0.4084 - val_accuracy: 0.8963 - val_loss: 0.3610
Epoch 4/20
336/336 - 2s - 7ms/step - accuracy: 0.9019 - loss: 0.3380 - val_accuracy: 0.9062 - val_loss: 0.3180
Epoch 5/20
336/336 - 2s - 7ms/step - accuracy: 0.9141 - loss: 0.2976 - val_accuracy: 0.9171 - val_loss: 0.2857
Epoch 6/20
336/336 - 2s - 6ms/step - accuracy: 0.9220 - loss: 0.2680 - val_accuracy: 0.9235 - val_loss: 0.2648
Epoch 7/20
336/336 - 2s - 6ms/step - accuracy: 0.9289 - loss: 0.2446 - val_accuracy: 0.9261 - val_loss: 0.2470
Epoch 8/20
336/336 - 2s - 6ms/step - accuracy: 0.9350 - loss: 0.2252 - val_accuracy: 0.9321 - val_loss: 0.2315
Epoch 9/20
336/336 - 2s - 6ms/step - accuracy: 0.9397 - loss: 0.2093 - val_accuracy: 0.9367 - val_loss: 0.2182
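
The "336/336" in the log is the number of mini-batches per epoch: 33,600 training rows divided by a batch size of 100. To illustrate what happens in each of those steps, here is a minimal NumPy sketch of mini-batch SGD on a toy least-squares problem. The data, the helper names, and the linear model are assumptions made purely for illustration; the notebook's network is trained by Keras, not by this code.

```python
import numpy as np

def sgd_step(weights, grad, learning_rate):
    """Plain SGD update: move against the gradient."""
    return weights - learning_rate * grad

def iterate_minibatches(n_rows, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once."""
    order = rng.permutation(n_rows)
    for start in range(0, n_rows, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(1212)
X = rng.normal(size=(1000, 3))          # toy features
true_w = np.array([1.0, -2.0, 0.5])     # weights we hope to recover
y = X @ true_w                          # noiseless toy targets

w = np.zeros(3)
for epoch in range(20):
    for batch in iterate_minibatches(len(X), batch_size=100, rng=rng):
        residual = X[batch] @ w - y[batch]
        # Gradient of the mean squared error on this mini-batch
        grad = 2.0 * X[batch].T @ residual / len(batch)
        w = sgd_step(w, grad, learning_rate=0.1)

steps_per_epoch = int(np.ceil(33600 / 100))  # matches the 336 in the log
print(w, steps_per_epoch)
```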


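## Using a neural network with 4 hidden layers

The first model was configured with:

- 20 training epochs
- A training batch size of 100
- Hidden layers set as (300, 100, 100, 200)
- A learning rate of 0.1 on the SGD optimizer object (although compiling with the string 'sgd' applies Keras's default SGD learning rate of 0.01 instead)

This model achieved a training score of around 96-98% and a validation score of around 95-97%.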

Can we do better by changing the optimizer? For the second model, I use the Adam optimizer while keeping all other hyperparameter values the same.

In [20]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

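# We rely on Adam as our optimizing methodology
adam = keras.optimizers.Adam(learning_rate=0.1)
model2 = Model(Inp, output)

# Note: the string 'adam' makes Keras build an Adam optimizer with its default
# learning rate (0.001), so the adam object defined above is not used here.
model2.compile(loss='categorical_crossentropy',
              optimizer='adam', # Optimizer now Adam
              metrics=['accuracy'])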

In [21]:
history2 = model2.fit(X_train, y_train,
                      batch_size = batch_size,
                      epochs = training_epochs,
                      verbose = 2,
                      validation_data=(X_cv, y_cv))

Epoch 1/20
336/336 - 7s - 22ms/step - accuracy: 0.9003 - loss: 0.3361 - val_accuracy: 0.9506 - val_loss: 0.1665
Epoch 2/20
336/336 - 3s - 8ms/step - accuracy: 0.9620 - loss: 0.1228 - val_accuracy: 0.9598 - val_loss: 0.1303
Epoch 3/20
336/336 - 3s - 8ms/step - accuracy: 0.9740 - loss: 0.0809 - val_accuracy: 0.9674 - val_loss: 0.1164
Epoch 4/20
336/336 - 3s - 8ms/step - accuracy: 0.9808 - loss: 0.0602 - val_accuracy: 0.9670 - val_loss: 0.1073
Epoch 5/20
336/336 - 3s - 9ms/step - accuracy: 0.9860 - loss: 0.0435 - val_accuracy: 0.9724 - val_loss: 0.0969
Epoch 6/20
336/336 - 3s - 9ms/step - accuracy: 0.9873 - loss: 0.0368 - val_accuracy: 0.9743 - val_loss: 0.0978
Epoch 7/20
336/336 - 3s - 9ms/step - accuracy: 0.9910 - loss: 0.0287 - val_accuracy: 0.9718 - val_loss: 0.1073
Epoch 8/20
336/336 - 6s - 17ms/step - accuracy: 0.9922 - loss: 0.0237 - val_accuracy: 0.9740 - val_loss: 0.1087
Epoch 9/20
336/336 - 5s - 14ms/step - accuracy: 0.9932 - loss: 0.0210 - val_accuracy: 0.9706 - val_loss: 0.117

The optimizer plays a crucial part in the validation score. In particular, the model that relies on 'Adam' as its optimizer tends to perform 1.5-2.5% better on average. Going forward, we will use 'Adam' as our optimizer of choice.
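
For context on why Adam behaves differently from plain SGD, the sketch below writes out the standard Adam update rule in NumPy: it keeps running averages of the gradient and of its square, corrects their start-up bias, and scales each step by the ratio of the two. The function name `adam_step` and the toy objective f(w) = w^2 are assumptions for illustration; the default arguments mirror Keras's documented Adam defaults.

```python
import numpy as np

def adam_step(w, grad, m, v, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update; t is the 1-based step counter."""
    m = beta_1 * m + (1.0 - beta_1) * grad        # running mean of gradients
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)               # bias correction for m
    v_hat = v / (1.0 - beta_2 ** t)               # bias correction for v
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimise f(w) = w^2, whose gradient is 2w
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2.0 * w
    w, m, v = adam_step(w, grad, m, v, t, learning_rate=0.1)
print(w)
```

Because each step is normalised by the recent gradient magnitude, the early updates have a size close to the learning rate regardless of how large the raw gradient is.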

In [23]:
test_pred = pd.DataFrame(model2.predict(X_test, batch_size=200))
test_pred = pd.DataFrame(test_pred.idxmax(axis = 1))
test_pred.index.name = 'ImageId'
test_pred = test_pred.rename(columns = {0: 'Label'}).reset_index()
test_pred['ImageId'] = test_pred['ImageId'] + 1

test_pred.head()

[1m140/140[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step


Unnamed: 0,ImageId,Label
0,1,2
1,2,0
2,3,9
3,4,9
4,5,3
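
The cell above stops at the DataFrame preview. As a hedged sketch of the remaining steps (picking the most probable class and writing a two-column CSV), the snippet below uses NumPy and the standard `csv` module on made-up probabilities. The column names come from the notebook; the helper name `build_submission_rows` and the in-memory buffer are assumptions, and a real submission would write to a file instead.

```python
import csv
import io
import numpy as np

def build_submission_rows(probabilities):
    """Map each row of class probabilities to (ImageId, Label), with ids starting at 1."""
    labels = np.argmax(probabilities, axis=1)  # most probable class per image
    return [(image_id, int(label))
            for image_id, label in enumerate(labels, start=1)]

# Toy predictions for two images over three classes (not real model output)
probabilities = np.array([[0.1, 0.7, 0.2],
                          [0.8, 0.1, 0.1]])
rows = build_submission_rows(probabilities)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ImageId", "Label"])
writer.writerows(rows)
print(buffer.getvalue())
```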
