<a href="https://colab.research.google.com/github/anantshinde143/Digit_recognizer/blob/main/Digit_recognization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction**

MNIST ("Modified National Institute of Standards and Technology") is the de facto “Hello World” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.


**Approach**

For this project, we use Keras (with TensorFlow as the backend) to build a simple neural network that predicts digits from handwritten images as accurately as we can. In particular, we use the Keras Functional API to build neural networks with 4 and 5 hidden layers.

We also experiment with two optimizers: plain vanilla Stochastic Gradient Descent (SGD) and Adam. However, there are many other parameters, such as the number of training epochs, which we will not be experimenting with.

In addition, the choice of hidden layer units is completely arbitrary and may not be optimal. This is yet another parameter that we will not attempt to tune. Lastly, we introduce dropout, a form of regularisation, in our neural networks to reduce overfitting.

**Result**

Following our experiments on the validation set, it appears that a neural network with 4 hidden layers, using 'Adam' as the optimizer along with a learning rate of 0.01, performs best. We then add dropout to the model and use it to predict labels for the test set.

The test predictions generated by our model (submitted to Kaggle) achieve an accuracy score of 97.600%, which places us in the top 55% of the competition.

**Importing key libraries, and reading data**

In [1]:
import pandas as pd
import numpy as np

np.random.seed(1212)

import keras
from keras.models import Model
from keras.layers import Input, Dense, Dropout
from keras import optimizers

Using TensorFlow backend

In [2]:
from google.colab import files


uploaded1= files.upload()
uploaded2= files.upload()

Saving train.csv to train.csv


Saving test.csv to test.csv


In [4]:
import io
df_train = pd.read_csv(io.BytesIO(uploaded1['train.csv']))
df_test = pd.read_csv(io.BytesIO(uploaded2['test.csv']))


In [5]:
df_train.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Splitting into training and validation dataset**

In [6]:
df_features = df_train.iloc[:, 1:785]
df_label = df_train.iloc[:, 0]

X_test = df_test.iloc[:, 0:784]

print(X_test.shape)

(28000, 784)


In [9]:
from sklearn.model_selection import train_test_split
X_train, X_cv, y_train, y_cv = train_test_split(df_features, df_label,
                                                test_size = 0.2,
                                                random_state = 1212)

X_train = X_train.to_numpy().reshape(33600, 784) #(33600, 784)
X_cv = X_cv.to_numpy().reshape(8400, 784) #(8400, 784)

X_test = X_test.to_numpy().reshape(28000, 784)

**Data cleaning, normalization and selection**

In [10]:
print((min(X_train[1]), max(X_train[1])))

(0, 255)


As the pixel intensities currently range from 0 to 255, we normalize the features to the range 0 to 1, using broadcasting. In addition, we convert our labels from a class vector to a one-hot encoded binary matrix.

In [11]:
# Feature Normalization
X_train = X_train.astype('float32'); X_cv= X_cv.astype('float32'); X_test = X_test.astype('float32')
X_train /= 255; X_cv /= 255; X_test /= 255

# Convert labels to One Hot Encoded
num_digits = 10
y_train = keras.utils.to_categorical(y_train, num_digits)
y_cv = keras.utils.to_categorical(y_cv, num_digits)

In [12]:
# Printing 2 examples of labels after conversion
print(y_train[0]) # 2
print(y_train[3]) # 7

[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
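
As a rough illustration of what the normalization and one-hot conversion above do, here is a minimal plain-Python sketch. The helper names `normalize` and `one_hot` are hypothetical and are not part of Keras; in the notebook itself the label conversion is done by `keras.utils.to_categorical`.

```python
def normalize(pixels, max_value=255.0):
    """Scale raw pixel intensities from [0, max_value] to [0.0, 1.0]."""
    return [p / max_value for p in pixels]


def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside [0, {num_classes})")
    encoded = [0.0] * num_classes
    encoded[label] = 1.0
    return encoded


print(normalize([0, 51, 255]))  # [0.0, 0.2, 1.0]
print(one_hot(2))               # 1.0 at position 2, like y_train[0] above
print(one_hot(7))               # 1.0 at position 7, like y_train[3] above
```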


**Model Fitting**

We proceed by fitting several simple neural network models using Keras (with TensorFlow as our backend) and collect their accuracy. The model that performs the best on the validation set will be used as the model of choice for the competition.

**Model 1: Simple Neural Network with 4 layers (300, 100, 100, 200)**

In our first model, we use Keras to train a neural network with ReLU as the activation function of the hidden layers. To determine which class to output, we rely on the softmax function in the output layer.
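
For readers unfamiliar with these two functions, the following is a minimal plain-Python sketch of what they compute. It is an illustration under simple assumptions (small lists of floats, hypothetical input values), not the Keras implementation.

```python
import math


def relu(values):
    """ReLU keeps positive values and replaces negative ones with 0."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


hidden = relu([-1.5, 0.0, 2.0])
probs = softmax([1.0, 2.0, 3.0])
predicted_class = probs.index(max(probs))  # class with the highest probability

print(hidden)           # [0.0, 0.0, 2.0]
print(predicted_class)  # 2
```

The predicted digit is simply the index of the largest softmax output.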

In [13]:
# Input Parameters
n_input = 784 # number of features
n_hidden_1 = 300
n_hidden_2 = 100
n_hidden_3 = 100
n_hidden_4 = 200
num_digits = 10

In [14]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

In [15]:
# Our model has 6 layers: an input layer, 4 hidden layers and 1 output layer
model = Model(Inp, output)
model.summary() # We have 297,910 parameters to estimate

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 784)]             0         
                                                                 
 Hidden_Layer_1 (Dense)      (None, 300)               235500    
                                                                 
 Hidden_Layer_2 (Dense)      (None, 100)               30100     
                                                                 
 Hidden_Layer_3 (Dense)      (None, 100)               10100     
                                                                 
 Hidden_Layer_4 (Dense)      (None, 200)               20200     
                                                                 
 Output_Layer (Dense)        (None, 10)                2010      
                                                                 
Total params: 297,910
Trainable params: 297,910
Non-trainable params: 0
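
The parameter count reported by `model.summary()` can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper names below are hypothetical and only serve to verify the arithmetic.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


def count_params(layer_sizes):
    """Sum the Dense parameters over consecutive pairs of layer sizes."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


# Input (784), hidden layers (300, 100, 100, 200), output (10)
layer_sizes = [784, 300, 100, 100, 200, 10]

print(dense_params(784, 300))     # 235500, matches Hidden_Layer_1
print(count_params(layer_sizes))  # 297910, matches "Total params"
```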

In [16]:
# Insert Hyperparameters
learning_rate = 0.1
training_epochs = 20
batch_size = 100
sgd = optimizers.SGD(learning_rate=learning_rate)


In [17]:
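# We rely on the plain vanilla Stochastic Gradient Descent as our optimizing methodology.
# Note: the string 'sgd' makes Keras build an SGD optimizer with its default settings;
# pass the `sgd` object created above (optimizer=sgd) to apply learning_rate = 0.1.
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])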

In [18]:
history1 = model.fit(X_train, y_train,
                     batch_size = batch_size,
                     epochs = training_epochs,
                     verbose = 2,
                     validation_data=(X_cv, y_cv))

Epoch 1/20
336/336 - 4s - loss: 1.7811 - accuracy: 0.5478 - val_loss: 0.9230 - val_accuracy: 0.7886 - 4s/epoch - 12ms/step
Epoch 2/20
336/336 - 3s - loss: 0.6142 - accuracy: 0.8368 - val_loss: 0.4490 - val_accuracy: 0.8758 - 3s/epoch - 9ms/step
Epoch 3/20
336/336 - 2s - loss: 0.4028 - accuracy: 0.8831 - val_loss: 0.3591 - val_accuracy: 0.8954 - 2s/epoch - 7ms/step
Epoch 4/20
336/336 - 2s - loss: 0.3350 - accuracy: 0.9023 - val_loss: 0.3198 - val_accuracy: 0.9068 - 2s/epoch - 7ms/step
Epoch 5/20
336/336 - 2s - loss: 0.2948 - accuracy: 0.9140 - val_loss: 0.2861 - val_accuracy: 0.9156 - 2s/epoch - 7ms/step
Epoch 6/20
336/336 - 3s - loss: 0.2674 - accuracy: 0.9226 - val_loss: 0.2761 - val_accuracy: 0.9207 - 3s/epoch - 9ms/step
Epoch 7/20
336/336 - 3s - loss: 0.2461 - accuracy: 0.9287 - val_loss: 0.2450 - val_accuracy: 0.9300 - 3s/epoch - 8ms/step
Epoch 8/20
336/336 - 2s - loss: 0.2271 - accuracy: 0.9337 - val_loss: 0.2312 - val_accuracy: 0.9344 - 2s/epoch - 7ms/step
Epoch 9/20
336/336 - 2s

This model uses 4 hidden layers with:

1.  20 training epochs
2.  A training batch size of 100
3.  Hidden layers set as (300, 100, 100, 200)
4.  Learning rate of 0.1

It achieved a training accuracy of around 96-98% and a validation accuracy of around 95-97%.

Can we do better if we were to change the optimizer? To find out, we use the Adam optimizer for our second model, while maintaining the same parameter values for all other parameters.
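
To give some intuition for how the two optimizers differ, the sketch below applies textbook versions of the SGD and Adam update rules to a toy one-dimensional problem, f(w) = (w - 3)^2. This is an illustration only: the function names, the toy objective and the hyperparameter values are assumptions, and the code is not the Keras implementation.

```python
import math


def sgd_step(w, grad, lr):
    """Plain SGD: move against the gradient, scaled by the learning rate."""
    return w - lr * grad


def adam_step(w, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-7):
    """Adam: SGD with running averages of the gradient and its square."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def grad(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_sgd, w_adam = 0.0, 0.0
adam_state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(200):
    w_sgd = sgd_step(w_sgd, grad(w_sgd), lr=0.1)
    w_adam = adam_step(w_adam, grad(w_adam), adam_state, lr=0.1)

print(round(w_sgd, 3), round(w_adam, 3))  # both should end close to 3
```

Adam rescales each step by a running estimate of the gradient magnitude, which is why it often needs less learning-rate tuning than plain SGD.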

In [19]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

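# We rely on ADAM as our optimizing methodology.
# Note: compile() below receives the string 'adam', so Keras uses a default-configured
# Adam optimizer; pass the `adam` object (optimizer=adam) to apply learning_rate.
adam = keras.optimizers.Adam(learning_rate=learning_rate)
model2 = Model(Inp, output)

model2.compile(loss='categorical_crossentropy',
               optimizer='adam',
               metrics=['accuracy'])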


In [20]:
history2 = model2.fit(X_train, y_train,
                      batch_size = batch_size,
                      epochs = training_epochs,
                      verbose = 2,
                      validation_data=(X_cv, y_cv))

Epoch 1/20
336/336 - 4s - loss: 0.3402 - accuracy: 0.8966 - val_loss: 0.1539 - val_accuracy: 0.9518 - 4s/epoch - 13ms/step
Epoch 2/20
336/336 - 4s - loss: 0.1232 - accuracy: 0.9620 - val_loss: 0.1270 - val_accuracy: 0.9617 - 4s/epoch - 11ms/step
Epoch 3/20
336/336 - 3s - loss: 0.0792 - accuracy: 0.9748 - val_loss: 0.0988 - val_accuracy: 0.9695 - 3s/epoch - 9ms/step
Epoch 4/20
336/336 - 3s - loss: 0.0569 - accuracy: 0.9817 - val_loss: 0.1112 - val_accuracy: 0.9640 - 3s/epoch - 8ms/step
Epoch 5/20
336/336 - 3s - loss: 0.0421 - accuracy: 0.9863 - val_loss: 0.0939 - val_accuracy: 0.9727 - 3s/epoch - 9ms/step
Epoch 6/20
336/336 - 4s - loss: 0.0323 - accuracy: 0.9893 - val_loss: 0.0940 - val_accuracy: 0.9739 - 4s/epoch - 12ms/step
Epoch 7/20
336/336 - 3s - loss: 0.0311 - accuracy: 0.9903 - val_loss: 0.1039 - val_accuracy: 0.9727 - 3s/epoch - 9ms/step
Epoch 8/20
336/336 - 3s - loss: 0.0253 - accuracy: 0.9916 - val_loss: 0.1005 - val_accuracy: 0.9749 - 3s/epoch - 9ms/step
Epoch 9/20
336/336 - 

As it turns out, the optimizer does appear to play a crucial part in the validation score. In particular, the model that relies on 'Adam' as its optimizer tends to perform 1.5 - 2.5% better on average. Going forward, we will use 'Adam' as our optimizer of choice.

What if we changed the learning rate from 0.1 to 0.01, or 0.5? Will it have any impact on the accuracy?
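
Before running the experiment, it may help to see why the learning rate can matter at all. The sketch below runs plain gradient descent on a hypothetical toy objective, f(w) = 2.5 * w^2, with the same three learning rates. The objective and function name are assumptions chosen for illustration; behaviour on the actual network, and with Adam, can differ.

```python
def run_gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = 2.5 * w**2 (gradient 5 * w) and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - lr * 5.0 * w
    return w


final = {lr: run_gradient_descent(lr) for lr in (0.01, 0.1, 0.5)}
for lr, w in final.items():
    print(f"lr={lr}: |w| after 20 steps = {abs(w):.4g}")
```

In this toy setting, 0.01 approaches the minimum slowly, 0.1 approaches it quickly, and 0.5 overshoots on every step so that |w| grows instead of shrinking.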

**Model 2A**

In [21]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

learning_rate = 0.01
adam = keras.optimizers.Adam(learning_rate=learning_rate)
model2a = Model(Inp, output)

model2a.compile(loss='categorical_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])

In [22]:
history2a = model2a.fit(X_train, y_train,
                        batch_size = batch_size,
                        epochs = training_epochs,
                        verbose = 2,
                        validation_data=(X_cv, y_cv))


Epoch 1/20
336/336 - 6s - loss: 0.3575 - accuracy: 0.8934 - val_loss: 0.1659 - val_accuracy: 0.9492 - 6s/epoch - 18ms/step
Epoch 2/20
336/336 - 3s - loss: 0.1237 - accuracy: 0.9618 - val_loss: 0.1518 - val_accuracy: 0.9542 - 3s/epoch - 9ms/step
Epoch 3/20
336/336 - 3s - loss: 0.0810 - accuracy: 0.9750 - val_loss: 0.1112 - val_accuracy: 0.9651 - 3s/epoch - 9ms/step
Epoch 4/20
336/336 - 3s - loss: 0.0598 - accuracy: 0.9814 - val_loss: 0.0983 - val_accuracy: 0.9690 - 3s/epoch - 10ms/step
Epoch 5/20
336/336 - 4s - loss: 0.0461 - accuracy: 0.9843 - val_loss: 0.1013 - val_accuracy: 0.9706 - 4s/epoch - 11ms/step
Epoch 6/20
336/336 - 3s - loss: 0.0393 - accuracy: 0.9869 - val_loss: 0.0976 - val_accuracy: 0.9718 - 3s/epoch - 9ms/step
Epoch 7/20
336/336 - 3s - loss: 0.0287 - accuracy: 0.9913 - val_loss: 0.1070 - val_accuracy: 0.9712 - 3s/epoch - 9ms/step
Epoch 8/20
336/336 - 4s - loss: 0.0267 - accuracy: 0.9918 - val_loss: 0.1045 - val_accuracy: 0.9710 - 4s/epoch - 11ms/step
Epoch 9/20
336/336 -

**Model 2B**

In [23]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

learning_rate = 0.5
adam = keras.optimizers.Adam(learning_rate=learning_rate)
model2b = Model(Inp, output)

model2b.compile(loss='categorical_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])

In [24]:
history2b = model2b.fit(X_train, y_train,
                        batch_size = batch_size,
                        epochs = training_epochs,
                        validation_data=(X_cv, y_cv))

Epoch 1/20
...
Epoch 20/20


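The validation accuracies for the three learning rates 0.01, 0.1 and 0.5 are around 98%, 97% and 98% respectively. As changing the learning rate brings no considerable gain, we stick with a learning rate of 0.01. Note, however, that each model above is compiled with the string 'adam', which makes Keras use Adam with its default learning rate (0.001) rather than the configured `adam` object, so the near-identical scores are to be expected.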

We proceed to fit a neural network with 5 hidden layers, with the number of units in the hidden layers set as (300, 100, 100, 100, 200) respectively. To keep the two models comparable, we again use 20 training epochs and a training batch size of 100.

In [25]:
# Input Parameters
n_input = 784 # number of features
n_hidden_1 = 300
n_hidden_2 = 100
n_hidden_3 = 100
n_hidden_4 = 100
n_hidden_5 = 200
num_digits = 10

In [26]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
x = Dense(n_hidden_5, activation='relu', name = "Hidden_Layer_5")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

In [27]:
# Our model has 7 layers: an input layer, 5 hidden layers and 1 output layer
model3 = Model(Inp, output)
model3.summary() # We have 308,010 parameters to estimate

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 784)]             0         
                                                                 
 Hidden_Layer_1 (Dense)      (None, 300)               235500    
                                                                 
 Hidden_Layer_2 (Dense)      (None, 100)               30100     
                                                                 
 Hidden_Layer_3 (Dense)      (None, 100)               10100     
                                                                 
 Hidden_Layer_4 (Dense)      (None, 100)               10100     
                                                                 
 Hidden_Layer_5 (Dense)      (None, 200)               20200     
                                                                 
 Output_Layer (Dense)        (None, 10)                2010

In [28]:
# We rely on 'Adam' as our optimizing methodology
adam = keras.optimizers.Adam(learning_rate=0.01)

model3.compile(loss='categorical_crossentropy',
               optimizer='adam',
               metrics=['accuracy'])

In [29]:
history3 = model3.fit(X_train, y_train,
                      batch_size = batch_size,
                      epochs = training_epochs,
                      validation_data=(X_cv, y_cv))

Epoch 1/20
...
Epoch 20/20


Compared to our 4-layer model, adding a hidden layer did not significantly improve the accuracy. However, the extra layer adds computational cost: the parameter count rises from 297,910 to 308,010. Given that the benefit of an additional layer is low while it adds cost, we will stick with the 4 layer neural network.

We now proceed to include dropout (dropout rate of 0.3) in our second model to prevent overfitting.
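
To make the idea concrete, here is a minimal plain-Python sketch of "inverted" dropout, the variant in which surviving activations are scaled up during training so that no rescaling is needed at prediction time. The function name and the test values are hypothetical; in the notebook the work is done by the Keras `Dropout` layer.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate`; scale the rest by 1/(1-rate)."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is inactive at prediction time
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(1212)  # fixed seed so the run is repeatable
hidden = [1.0] * 10000
dropped = dropout(hidden, rate=0.3, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(round(zero_fraction, 2), round(mean_activation, 2))
```

Roughly 30% of the activations are zeroed, while the average activation stays close to its original value because of the rescaling.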

In [30]:
# Input Parameters
n_input = 784 # number of features
n_hidden_1 = 300
n_hidden_2 = 100
n_hidden_3 = 100
n_hidden_4 = 200
num_digits = 10

In [31]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dropout(0.3)(x)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dropout(0.3)(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dropout(0.3)(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

In [32]:
# Our model has an input layer, 4 hidden layers (dropout after the first 3) and 1 output layer
model4 = Model(Inp, output)
model4.summary() # Dropout adds no parameters, so we still have 297,910 to estimate

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, 784)]             0         
                                                                 
 Hidden_Layer_1 (Dense)      (None, 300)               235500    
                                                                 
 dropout (Dropout)           (None, 300)               0         
                                                                 
 Hidden_Layer_2 (Dense)      (None, 100)               30100     
                                                                 
 dropout_1 (Dropout)         (None, 100)               0         
                                                                 
 Hidden_Layer_3 (Dense)      (None, 100)               10100     
                                                                 
 dropout_2 (Dropout)         (None, 100)               0   

In [33]:
model4.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [34]:
history = model4.fit(X_train, y_train,
                    batch_size = batch_size,
                    epochs = training_epochs,
                    validation_data=(X_cv, y_cv))

Epoch 1/20
...
Epoch 20/20


With a validation score of close to 98%, we proceed to use this model to predict for the test set.

In [35]:
test_pred = pd.DataFrame(model4.predict(X_test, batch_size=200))
test_pred = pd.DataFrame(test_pred.idxmax(axis = 1))
test_pred.index.name = 'ImageId'
test_pred = test_pred.rename(columns = {0: 'Label'}).reset_index()
test_pred['ImageId'] = test_pred['ImageId'] + 1

test_pred.head()



Unnamed: 0,ImageId,Label
0,1,2
1,2,0
2,3,9
3,4,9
4,5,3


test_pred.to_csv('mnist_submission.csv', index = False)
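
The prediction step above boils down to taking, for each image, the index of the largest predicted probability and pairing it with a 1-based `ImageId`. A minimal plain-Python sketch of that logic follows; the function name and the example probabilities are hypothetical.

```python
def to_submission_rows(probabilities):
    """Map each row of class probabilities to an (ImageId, Label) pair."""
    rows = []
    for image_id, probs in enumerate(probabilities, start=1):  # ImageId starts at 1
        label = max(range(len(probs)), key=probs.__getitem__)  # argmax
        rows.append((image_id, label))
    return rows


example_probs = [
    [0.05, 0.05, 0.80, 0.10],  # most likely class: 2
    [0.70, 0.10, 0.10, 0.10],  # most likely class: 0
]
print(to_submission_rows(example_probs))  # [(1, 2), (2, 0)]
```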

Using this model, we are able to achieve a score of 0.976, which places us in the top 55% of the competition!