**MNIST Notebook by Alan Choon Zhen Yu**

In [12]:
import pandas as pd
import numpy as np
import math
import scipy
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(1)

In [13]:
train = pd.read_csv("../input/train.csv")

In [14]:
Y_train = train["label"]
X_train = train.drop(labels='label',axis=1)

In [5]:
X_train.head()

In [6]:
Y_train.head()

In [15]:
# Reshape the label column into a 2-D array of shape (n_samples, 1), as OneHotEncoder expects
y_r = pd.DataFrame(Y_train.values.reshape(Y_train.shape[0], 1))

**One Hot Encoding:**

In [16]:
from sklearn.preprocessing import OneHotEncoder
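# Keep the default sparse output and densify it afterwards; the `sparse=False` keyword was renamed in newer scikit-learn releases
enc = OneHotEncoder()
# y_r is already a 2-D column, so it can be encoded directly (one output column per digit)
y_r = enc.fit_transform(y_r).toarray()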

In [17]:
pd.DataFrame(y_r).head()
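
As a rough illustration of what the encoder produces, the same mapping can be sketched with NumPy alone. The `labels` array here is a made-up sample, not the notebook's data, and the sketch assumes all ten digit classes exist (unlike OneHotEncoder, which only creates columns for the classes it sees).

```python
import numpy as np

# Hypothetical sample of digit labels; the notebook uses the full `label` column instead.
labels = np.array([1, 0, 4, 9])

# Row i of the 10x10 identity matrix is the one-hot vector for class i.
one_hot = np.eye(10)[labels]

print(one_hot.shape)           # (4, 10)
print(one_hot.argmax(axis=1))  # [1 0 4 9], recovering the original labels
```

Taking the argmax along each row undoes the encoding, which is also how class probabilities from a softmax layer are turned back into digits.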

**Split Dataset into training and development sets:**

In [18]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_train, y_r, test_size=0.20, random_state=101)

**Training a Decision Tree Classifier:**

In [11]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(x_train,y_train)

In [12]:
from sklearn.metrics import accuracy_score, classification_report
dectreepred = dtree.predict(x_test)
print(accuracy_score(y_test,dectreepred))

In [13]:
print(classification_report(y_test,dectreepred))

Hurray for the decision tree classifier! Performance on the development set is already not bad, with an accuracy of ~85%.
More advanced classifiers may improve accuracy further!

**Training a Random Forest Model:**

In [14]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train,y_train)

In [15]:
rfprediction = rf.predict(x_test)
print(accuracy_score(y_test,rfprediction))

In [16]:
print(classification_report(y_test,rfprediction))

Surprisingly, the random forest classifier does worse than the decision tree in terms of accuracy, although it does have a higher F1 score and precision.
Accuracy might be boosted further with neural networks.
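
One possible contributor to this gap, offered as a hypothesis rather than a diagnosis: both tree models are fit on the one-hot matrix, which scikit-learn treats as a multilabel target. For such targets `accuracy_score` counts a row as correct only when all ten columns match, and a model that decides each column separately can return a row with no class switched on. The sketch below reproduces that exact-match scoring with NumPy on invented toy arrays.

```python
import numpy as np

# Made-up one-hot targets for three samples and three classes.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])

# Made-up multilabel predictions: the last row has no class switched on.
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0]])

# Exact-match (subset) accuracy: every column of a row must agree.
subset_acc = np.mean(np.all(y_true == y_pred, axis=1))

# Per-cell agreement is much more forgiving of the same mistake.
cellwise_acc = np.mean(y_true == y_pred)

print(subset_acc)    # 2 of 3 rows match exactly
print(cellwise_acc)  # 8 of 9 cells match
```

Fitting the classifiers on the integer `Y_train` labels instead of the one-hot matrix would make them ordinary multiclass models, so accuracy would be the usual fraction of correctly classified digits.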

In [27]:
x_train_nn = x_train.values
x_test_nn = x_test.values

In [28]:
x_train_nn.shape

In [29]:
x_train_nn = np.reshape(x_train_nn, (-1,28,28,1))
x_train_nn.shape

In [30]:
x_test_nn.shape

In [31]:
x_test_nn = np.reshape(x_test_nn, (-1,28,28,1))
x_test_nn.shape
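
The reshape calls above can be checked on a tiny stand-in array. This is a sketch with fake pixel values (`flat` is hypothetical), showing how each 784-value row becomes a 28x28 image with one grayscale channel.

```python
import numpy as np

# Two fake flattened images of 784 pixels each, standing in for rows of train.csv.
flat = np.arange(2 * 784).reshape(2, 784)

# -1 lets NumPy infer the number of images; the trailing 1 is the single grayscale channel.
images = flat.reshape(-1, 28, 28, 1)

print(images.shape)  # (2, 28, 28, 1)

# Row-major order: pixel k of a flat row lands at row k // 28, column k % 28.
k = 100
assert images[0, k // 28, k % 28, 0] == flat[0, k]
```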

In [22]:
y_train.shape

In [23]:
y_test.shape

**Convolutional Neural Network:**

* The model below takes inspiration from the Convolutional Neural Networks course taught by deeplearning.ai, as well as from https://www.tensorflow.org/tutorials/layers and http://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/
* Keras is used to build it.

In [32]:
x_train_nn = x_train_nn/255
x_test_nn = x_test_nn/255

In [3]:
import tensorflow as tf
from tensorflow.python.framework import ops
import keras
from keras.layers import Dense, Flatten, Dropout
from keras.layers import Conv2D, MaxPooling2D
from keras.models import Sequential

**Model:**

In [53]:
model = Sequential()

Models in Keras can be built in two ways: with the Sequential class or with the Functional API. A Sequential model suits networks that are a plain stack of layers, which covers most simple deep learning architectures.

In a Sequential model, layers (including recurrent layers) are stacked in order from input to output.

In [54]:
input_shape = (28,28,1)
model.add(Conv2D(32,kernel_size=(5,5),strides=(1,1),activation='relu',input_shape = input_shape))
model.add(MaxPooling2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(64,kernel_size=(5,5),strides=(1,1),activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2),strides=(2,2)))
model.add(Flatten())
model.add(Dense(1024,activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(10,activation='softmax'))
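
To see how the feature maps shrink through the stack above, the arithmetic can be traced by hand. This sketch assumes the Keras defaults of unpadded ('valid') convolutions and pooling; `conv_out` is a helper introduced here for illustration, not part of Keras.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height of an unpadded ('valid') convolution or pooling window."""
    return (size - kernel) // stride + 1

size = 28
size = conv_out(size, kernel=5)            # Conv2D 5x5, stride 1  -> 24
size = conv_out(size, kernel=2, stride=2)  # MaxPooling2D 2x2      -> 12
size = conv_out(size, kernel=5)            # Conv2D 5x5, stride 1  -> 8
size = conv_out(size, kernel=2, stride=2)  # MaxPooling2D 2x2      -> 4

flat_features = size * size * 64           # Flatten over 64 channels
print(size, flat_features)                 # 4 1024

# Trainable parameters: weights plus one bias per filter or output unit.
conv1 = 5 * 5 * 1 * 32 + 32                # 832
conv2 = 5 * 5 * 32 * 64 + 64               # 51264
dense1 = flat_features * 1024 + 1024       # 1049600
dense2 = 1024 * 10 + 10                    # 10250
print(conv1 + conv2 + dense1 + dense2)     # 1111946
```

Almost all of the parameters sit in the first Dense layer, which is why Dropout is applied right after it.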

In [55]:
model.compile(loss=keras.losses.categorical_crossentropy,optimizer=keras.optimizers.Adam(),metrics=['accuracy'])
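
For intuition about the loss chosen above, categorical cross-entropy can be written out in a few lines of NumPy. This is a sketch of the formula only, not Keras's implementation; the function name, the `eps` clipping value, and the sample arrays are assumptions made for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

# Made-up example: two samples, three classes.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

loss = categorical_crossentropy(y_true, y_prob)
print(loss)  # mean of -log(0.7) and -log(0.8)
```

Because the targets are one-hot, only the predicted probability of the true class contributes to each sample's loss.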

**Logging Metrics in Keras:**

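Callbacks hook into training at defined points. They are commonly used to create checkpoints that save the model at different stages of training; here, one is used to record the accuracy after each epoch.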

In [57]:
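class AccuracyHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        # Start each training run with an empty accuracy list
        self.acc = []

    def on_epoch_end(self, epoch, logs=None):
        # Keras passes the epoch index here; older releases log 'acc', newer ones 'accuracy'
        logs = logs or {}
        self.acc.append(logs.get('acc', logs.get('accuracy')))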

The Callback superclass that AccuracyHistory inherits from has several methods that can be overridden, such as on_train_begin, on_epoch_end, on_batch_begin and on_batch_end.

At the beginning of training, a list self.acc = [] is initialized to store the accuracy results.

In the on_epoch_end() method, the wanted value can be extracted from logs, a dictionary that holds the loss and any compiled metrics (here, accuracy) for the epoch.
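
The hook sequence can be traced without training anything. The sketch below is a stand-in, not Keras code: `AccuracyRecorder` and `fake_logs` are hypothetical, the accuracy values are invented, and the loop plays the role that `model.fit` would play.

```python
class AccuracyRecorder:
    """Keras-free stand-in that uses the same hook names as a Keras callback."""

    def on_train_begin(self, logs=None):
        self.acc = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.acc.append(logs.get('acc'))

# Hypothetical per-epoch logs; the numbers are invented for illustration only.
fake_logs = [{'loss': 0.9, 'acc': 0.70}, {'loss': 0.5, 'acc': 0.85}]

recorder = AccuracyRecorder()
recorder.on_train_begin()
for epoch, logs in enumerate(fake_logs):
    recorder.on_epoch_end(epoch, logs)

print(recorder.acc)  # [0.7, 0.85]
```

After a real training run, the recorded list could be plotted against the epoch number to inspect how accuracy evolves.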

In [58]:
history = AccuracyHistory()

In [65]:
batch_size = 1024
epochs = 30
model.fit(x_train_nn,y_train,batch_size=batch_size,epochs=epochs,verbose=1,validation_data=(x_test_nn,y_test),callbacks=[history])

Train on 33600 samples, validate on 8400 samples
Epoch 1/30
...
Epoch 30/30

<keras.callbacks.History at 0x7fb823e2db00>

With the CNN, the development set accuracy has increased to a staggering ~99.9%!

In [66]:
test = pd.read_csv("../input/test.csv")
X_test = np.reshape(test.values,(-1,28,28,1))
X_test = X_test/255
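# predict() returns class probabilities; argmax picks the most likely digit (predict_classes is absent from recent Keras releases)
preds = np.argmax(model.predict(X_test), axis=1)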

In [61]:
submit = pd.DataFrame(preds,columns=['Label'])

In [62]:
# ImageId is 1-based, one row per test image
submit['ImageId'] = range(1, len(submit) + 1)

In [63]:
submit = submit[['ImageId','Label']]
submit.to_csv("submission2.csv",index=False)