The first chunk of code imports all the necessary libraries and dependencies.

In [1]:
import os
from tqdm import tqdm
import pandas as pd 
import matplotlib.image as mpimg 
import matplotlib.pyplot as plt 
import numpy as np
from scipy import ndimage

Next, we will read all the training images and denoise each one with a median filter. Afterwards, the flattened pixel values will be collected in a dataframe, which will then be saved to a CSV file named "train_denoised.csv".

In [None]:
columns = ['label'] + list(range(784))  # one label column plus 784 pixel columns
rows = []

for i in tqdm(range(10)):
    # Each digit class i has its own sub-directory: training/<i>/
    directory = os.path.join("training", str(i))
    for filename in tqdm(os.listdir(directory)):
        img = mpimg.imread(os.path.join(directory, filename))
        # A 3x3 median filter suppresses isolated noisy pixels
        img = ndimage.median_filter(img, size=3)
        # Flatten the image and prepend its label
        rows.append([i] + list(img.ravel()))

# Build the dataframe once instead of concatenating inside the loop
df = pd.DataFrame(data=rows, columns=columns)
df.to_csv('train_denoised.csv', encoding='utf-8', index=False)
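
To illustrate what the median filter does, here is a minimal sketch on a hypothetical 5x5 array (not one of the training images) that contains a single bright noise pixel. With `size=3`, every pixel is replaced by the median of its 3x3 neighborhood, so an isolated outlier disappears.

```python
import numpy as np
from scipy import ndimage

# Hypothetical toy image: black background with one isolated noise pixel.
toy_img = np.zeros((5, 5), dtype=np.uint8)
toy_img[2, 2] = 255

# Each output pixel is the median of its 3x3 neighborhood.
toy_denoised = ndimage.median_filter(toy_img, size=3)

print(toy_img.max(), toy_denoised.max())  # 255 0
```

The same filter also removes strokes that are only one pixel wide, so it is worth inspecting a few denoised digits before committing to a filter size.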

After this, we will only work with the training data in the dataframe format.

In [2]:

train = pd.read_csv('train_denoised.csv')

# Separate the labels from the pixel columns
y = train['label']
X = train.drop(columns=['label'])


Next, we will import all the necessary sklearn dependencies.

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix  
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

For dimensionality reduction, we will use PCA and LDA. PCA performed better in our experiments, so it will be used for reporting the final results. Now, we will obtain the transformed data using both PCA and LDA.

In [10]:
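pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)
# LDA yields at most n_classes - 1 components, i.e. 9 for the 10 digit classes;
# recent scikit-learn versions raise an error for larger values
lda = LinearDiscriminantAnalysis(n_components=9)
X_lda = lda.fit_transform(X, y)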
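
As a rough illustration of how the two reductions differ, the sketch below uses hypothetical synthetic data (300 samples, 20 features, 3 classes) rather than the digit images. It prints the cumulative explained variance of the PCA components, which is one common way to judge how many components to keep, and shows that LDA produces at most `n_classes - 1` components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical stand-in data: 3 classes with random centers in 20 dimensions.
rng = np.random.default_rng(0)
y_demo = np.repeat(np.arange(3), 100)
centers = rng.normal(scale=3.0, size=(3, 20))
X_demo = centers[y_demo] + rng.normal(size=(300, 20))

# PCA: cumulative explained variance shows how much variance is retained.
pca_demo = PCA(n_components=5).fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative)

# LDA: with 3 classes, at most 3 - 1 = 2 discriminant components exist.
X_demo_lda = LinearDiscriminantAnalysis().fit_transform(X_demo, y_demo)
print(X_demo_lda.shape)  # (300, 2)
```

Applied to the notebook's data, the same check on `pca.explained_variance_ratio_` would indicate how much variance the 50 retained components capture.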



Next, we will use a hold-out validation set to estimate the test error and select the best model. We will randomly split the data and hold out 20 percent of it for model validation.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size = 0.20) 
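
The split above is unseeded, so the reported numbers change from run to run. A minimal sketch of a reproducible, class-balanced split follows, assuming that is desired here. `random_state` and `stratify` are standard `train_test_split` parameters that the original call does not use, and the arrays are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 10 classes with 10 samples each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.repeat(np.arange(10), 10)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo,
    y_demo,
    test_size=0.20,
    random_state=0,   # fixed seed makes the split repeatable
    stratify=y_demo,  # keep class proportions equal in both parts
)

print(np.bincount(y_va))  # [2 2 2 2 2 2 2 2 2 2]
```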

First, we will use the obtained training data to fit a k-nearest neighbors (KNN) model. Through experimentation, we have concluded that 7 is a suitable number of neighbors.

In [12]:
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
           weights='uniform')
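
The experimentation that led to 7 neighbors is not shown. One way such a search could be run is a cross-validated grid search, sketched below under assumptions: the data are scikit-learn's bundled 8x8 digits rather than this notebook's images, and the candidate values of k are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's bundled 8x8 digits, not the notebook's images.
X_demo, y_demo = load_digits(return_X_y=True)

candidate_k = [1, 3, 5, 7, 9]  # hypothetical search range
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": candidate_k},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Best k and its mean cross-validated accuracy
print(search.best_params_, round(search.best_score_, 3))
```

The best value found on the stand-in data need not match the value chosen for the notebook's data.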

Next, we will use the obtained model to estimate the test error. We will use the validation set to do so.

In [13]:
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[1174    3    0    0    0    1    6    1    0    1]
 [   0 1358    4    0    0    0    0    4    0    1]
 [  10    2 1160    0    1    2    2   12    3    2]
 [   1    1    5 1194    0   13    1   10    6    5]
 [   1    5    1    0 1103    0    5    1    0   18]
 [   3    7    3   12    2 1016   14    2    4    4]
 [   6    3    0    0    0    3 1160    0    0    0]
 [   0    8    5    0    6    1    0 1246    0   13]
 [   2   11    4   14    4   12    8    3 1096   12]
 [   2    0    0   15   15    2    0   16    5 1144]]
             precision    recall  f1-score   support

          0       0.98      0.99      0.98      1186
          1       0.97      0.99      0.98      1367
          2       0.98      0.97      0.98      1194
          3       0.97      0.97      0.97      1236
          4       0.98      0.97      0.97      1134
          5       0.97      0.95      0.96      1067
          6       0.97      0.99      0.98      1172
          7       0.96      0.97      0.97  

We have obtained a precision of about 97 percent, which is really good. Next, we will plot the ROC curve for this classifier.
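
The plotting code for the ROC curve is not included in this notebook. A minimal sketch of one common approach for a multi-class problem is given below: one one-vs-rest curve per digit, built from `predict_proba`. It is an assumption about how the curves could be drawn, and it uses scikit-learn's bundled digits as stand-in data.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import label_binarize

# Stand-in data: scikit-learn's bundled 8x8 digits, not the notebook's images.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

knn_demo = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
scores = knn_demo.predict_proba(X_va)  # one probability column per class

# One-vs-rest: turn the labels into one binary indicator column per class.
y_va_bin = label_binarize(y_va, classes=knn_demo.classes_)

roc_auc = {}
fig, ax = plt.subplots()
for idx, digit in enumerate(knn_demo.classes_):
    fpr, tpr, _ = roc_curve(y_va_bin[:, idx], scores[:, idx])
    roc_auc[int(digit)] = auc(fpr, tpr)
    ax.plot(fpr, tpr, label=f"digit {digit} (AUC = {roc_auc[int(digit)]:.3f})")

ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance level
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(fontsize="small")
```

In a notebook the figure renders at the end of the cell; in a script, a call to `plt.show()` would be needed.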

Next, we will train a random forest with 50 estimators. The split criterion will be "entropy" instead of "gini", and the classifier will be trained on all available CPU cores (n_jobs=-1) to maximize training speed.

In [20]:
rfc = RandomForestClassifier(criterion = "entropy" , n_jobs=-1, n_estimators=50)
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Next, we will validate the model using validation data.

In [21]:
y_pred = rfc.predict(X_test)
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[1149    1    2    4    0    5   14    2    7    2]
 [   0 1344    6    3    1    0    3    4    3    3]
 [   6    1 1130   17    7    2    4   12   14    1]
 [   3    1   22 1142    0   17    3   14   22   12]
 [   1    7    5    0 1076    1   14    2    4   24]
 [   6    1    5   19    7 1003   17    4    3    2]
 [  14    1   10    0    3    4 1139    1    0    0]
 [   0    7   14    1    9    3    0 1221    4   20]
 [   2   11   13   29    6   12   14    7 1059   13]
 [   0    2    7   33   30   10    1   31    6 1079]]
             precision    recall  f1-score   support

          0       0.97      0.97      0.97      1186
          1       0.98      0.98      0.98      1367
          2       0.93      0.95      0.94      1194
          3       0.92      0.92      0.92      1236
          4       0.94      0.95      0.95      1134
          5       0.95      0.94      0.94      1067
          6       0.94      0.97      0.96      1172
          7       0.94      0.95      0.95  

We have obtained a precision of about 95 percent, which is very good. Next, we will plot the ROC curve for this classifier.

Next, we will train a Support Vector Classifier with the Gaussian (RBF) kernel on the PCA-transformed training data.

In [16]:
svclassifier = SVC(kernel='rbf' )
svclassifier.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Next, we will use the trained model and the validation data to estimate the test error of this classifier.

In [17]:
y_pred = svclassifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[1173    1    1    1    1    4    1    0    2    2]
 [   0 1352    6    1    1    0    0    5    1    1]
 [   5    1 1170    3    4    1    1    6    3    0]
 [   1    0    7 1199    0   12    0    9    7    1]
 [   3    5    2    0 1096    1    6    2    1   18]
 [   1    1    4   13    4 1030    8    0    3    3]
 [   8    0    2    0    0    6 1156    0    0    0]
 [   0    4    6    0    8    1    0 1253    0    7]
 [   1    5    4    8    2    6    5    2 1129    4]
 [   2    2    0    7   20    3    0   17    5 1143]]
             precision    recall  f1-score   support

          0       0.98      0.99      0.99      1186
          1       0.99      0.99      0.99      1367
          2       0.97      0.98      0.98      1194
          3       0.97      0.97      0.97      1236
          4       0.96      0.97      0.97      1134
          5       0.97      0.97      0.97      1067
          6       0.98      0.99      0.98      1172
          7       0.97      0.98      0.97  

We have obtained a precision of about 98 percent, which is the best among the three models. Afterwards, we will plot the ROC curve for this classifier as well.
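
Because the SVC above is created with `probability=False`, it has no `predict_proba`. One possible way to obtain ROC scores is `decision_function`, sketched here on scikit-learn's bundled digits as stand-in data. This is an assumption about how the curves could be computed, not the notebook's own code.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

# Stand-in data: scikit-learn's bundled 8x8 digits, not the notebook's images.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

svc_demo = SVC(kernel="rbf", decision_function_shape="ovr").fit(X_tr, y_tr)

# With decision_function_shape="ovr" there is one score column per class.
svc_scores = svc_demo.decision_function(X_va)

# One-vs-rest AUC for each digit, computed from the raw decision scores.
y_va_bin = label_binarize(y_va, classes=svc_demo.classes_)
per_class_auc = [
    roc_auc_score(y_va_bin[:, k], svc_scores[:, k])
    for k in range(len(svc_demo.classes_))
]
print([round(value, 3) for value in per_class_auc])
```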

Next, we will use the pickle library to save the models.

In [22]:
import pickle

# Context managers ensure each file is closed once its model has been written
models = [(knn, "KNN_model"), (rfc, "Random_forest_model"), (svclassifier, "SVC_RBF_model")]
for model, path in models:
    with open(path, 'wb') as f:
        pickle.dump(model, f)
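
For completeness, here is a minimal sketch of the round trip: saving a model and loading it back with `pickle.load`. It trains a small stand-in KNN model on scikit-learn's bundled digits and writes to a temporary directory; the file name `KNN_model_demo` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

# Stand-in model trained on scikit-learn's bundled digits.
X_demo, y_demo = load_digits(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=7).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "KNN_model_demo")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_demo[:100]), restored.predict(X_demo[:100]))
print(same)  # True
```

Because the classifiers in this notebook were trained on PCA-transformed features, the fitted `pca` object would need to be saved in the same way before the models could classify new images. Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that created them.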