# NF264- Project 2: Digit recognizer

We  are  working  for  a  small  company  that  provides  machinelearning  solutions  for  its  customers.   The  postal  office  needs an  AI system  to  automatically  deliver  mail.   As  a  part  of  the  system,  they  need a  computer  program  that  recognises  handwritten  digits. We are providing  this  program  and  as machine  learning  experts, we write  the code that produces a classifier and this report that describes what we have done.

In [88]:
import pandas as pd
import numpy as np
import plotly.express as px
import tensorflow as tf 
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## The Dataset
The MNIST dataset consist of 70000 images of handwritten digits. Each image consist of a 28x28 pixel images with a grayscale value between 0-255. They are given as a list of 70000 with each list having length 28x28 = 784. Which is confirmed bu the dhape printed below.

In [89]:
X = pd.read_csv('handwritten_digits_images.csv', header=None).to_numpy()
y = pd.read_csv('handwritten_digits_labels.csv', header=None).to_numpy()
print(X.shape)

(70000, 784)


## Preprocessing steps
The labels of each images is represented as a digit between 0 and 9. We can make this label categorical, meaning they are all represented the same way as a bit array with 10 elements, where for example 4 is a 1 at the 5th index.

We also want to normalize the grayscale values from 0-255 to 0-1

In [90]:
from keras.utils import to_categorical

y = to_categorical(y)

# Normalize to range 0-1
X = X.astype('float32')
X = X / 255.0

Here is an example of a image and its corresponding label.

In [91]:
print(y[15000])

px.imshow(X[15000].reshape(28,28), color_continuous_scale=["white", "black"])

[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]


## Splitting data
We split the data in 80% training data, 10% validation data used for evaluating and tuning hyperparameters, and 10% unsees test data which is used to choose the best model.

In [93]:
from sklearn.model_selection import train_test_split

X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)

print('Train: X=%s, y=%s' % (X_train.shape, y_train.shape))
print('Test: X=%s, y=%s' % (X_test.shape, y_test.shape))
print('Validate: X=%s, y=%s' % (X_val.shape, y_val.shape))

Train: X=(56000, 784), y=(56000, 10)
Test: X=(7000, 784), y=(7000, 10)
Validate: X=(7000, 784), y=(7000, 10)


## Candidate  algorithms  and  choice  of  candidate  hyperparameters  (and why were the others left out)

We want to chose a classifier algorithm since a classifier utilizes some training data to understand how given input variables relate to a class. In this case, pictures of integers 0-9 are used as the training data. When the classifier is trained accurately, it can be used to detect integers for the Postal office. There are many candidates in this space. 

Firstly, we choose K Nearest Neighbors Classifier since this is kind of a baseline model, and we have implemnted this model from scratch is previous courses so we are well aware of the algorithm.

Secondly, we want to test a descision tree classifier since we know that this is an effective Classifier from our previous Project, where we classified a dataset with 10 features. When we think about a writtin digit, there are probably some descisions thart could be made in a descision tree, such as if it has a single line in vertical direction it is a 1 or 7, or if it contains two circles it is a 8. From the pixel data we ecpect there will be some kind of denominator that could classify the image into a category of digits.

Lastly, we  want to explore a Sequential Convolutional Neural Network Classifier since we are not as familier with this tool, and want to learn more about implemtning this Classifier. Neural Nets can be very powerfull if trained accuratly, so we want to explore if this could be a feasible solution for recognizing digits.

## Chosen performance measure
When chosing perfomance measure there are several that could be used, i.e MSE and RMSE, but we want to use the accuracy in percentage (0-100%) on the test data for model selection, and the accuracy on validation data for model evaluation.

## K Nearest Neighbors Classifier
The K-nearest neighbors (KNN) algorithm is a data classification method for estimating the probability that a data point will become a member of one or another group based on which group the data points are closest to it. A classification problem has a discrete value as its output. It is a type of supervised machine learning algorithm used to solve classification (and regression) problems. The algorithm is also called a lazy learning and non-parametic algorithm. This is because it is lazy and dosen’t preform any training when you supply the training data. It just stores the data during the training time and does not perform any calculations. The KNN algorithm does not build a model until a query is performed on the data set. It is also considered a non-parametric methods because it does not make any assumptions about the underlying data distribution. KNN tries to determine what group a data point belongs to by looking at the data points around it. It also involves classifying a data point by looking at the nearest annotated data point.

KNN is a supervised classification algorithm that classifies new data points based on the nearest data points. A advantage of using it, is that the training phase of K-nearest neighbor classification is much faster compared to other classification algorithms. There is no need to train a model for generalization, that is why KNN is known as the simple and instance-based learning algorithm. One disadvantage of using KNN is that the testing phase of K-nearest neighbor classification is slower and costlier in terms of time and memory. It requires large memory for storing the entire training dataset for prediction.

### Hyperparameters
We tune the number of nearest neighbors k to asses for choosing the label of the each image.

In [6]:
from sklearn.neighbors import KNeighborsClassifier

kVals = [1, 2, 3, 4, 5, 6, 7, 8 , 9, 10, 15, 30]
accuracies = []

for k in kVals: # Testing many k hyperparametyers to optimize performance
	model = KNeighborsClassifier(algorithm='auto', n_neighbors=k)
	model.fit(X_train, y_train)

	score = model.score(X_val, y_val)
	print("k=%d, validation accuracy=%.2f%%" % (k, score * 100))
	accuracies.append([k, score * 100])

k=1, validation accuracy=96.73%
k=2, validation accuracy=94.48%
k=3, validation accuracy=96.65%
k=4, validation accuracy=95.26%
k=5, validation accuracy=96.28%
k=6, validation accuracy=95.30%
k=7, validation accuracy=96.04%
k=8, validation accuracy=95.34%
k=9, validation accuracy=95.82%
k=10, validation accuracy=94.96%
k=15, validation accuracy=95.21%
k=30, validation accuracy=93.80%


In [7]:
#Plotting data
df = pd.DataFrame(accuracies, columns = ['k', 'Accuracy'])
px.line(df, x="k", y = 'Accuracy', title="kNN Model accuracy on validation data")

In [94]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("Evaluation of test data")
print(classification_report(y_test, predictions))
print("sklearn KNeighborsClassifier Test data accuracy: {:3.2f}%".format(accuracy_score(y_test, predictions)*100))

Evaluation of test data
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       685
           1       0.97      0.99      0.98       778
           2       0.98      0.97      0.98       671
           3       0.97      0.96      0.96       690
           4       0.98      0.97      0.98       733
           5       0.96      0.96      0.96       644
           6       0.98      0.98      0.98       729
           7       0.96      0.97      0.97       694
           8       0.99      0.94      0.96       670
           9       0.96      0.97      0.96       706

   micro avg       0.97      0.97      0.97      7000
   macro avg       0.97      0.97      0.97      7000
weighted avg       0.97      0.97      0.97      7000
 samples avg       0.97      0.97      0.97      7000

sklearn KNeighborsClassifier Test data accuracy: 97.14%


### kNN Findings
We found best results with k=1 which gives us a accuracy on test of 97,14%

## Decision Tree Classifier
Decision Tree is a Supervised Machine Learning Algorithm that uses a set of rules to make decisions. A decision tree has a flowchart-like tree structure where an internal node represents feature, the branch represents a decision rule, and each leaf node represents the outcome. 

The most important feature is the capability of capturing descriptive decisionmaking knowledge from the supplied data. Decision tree can be generated from training sets. A decision tree classifier generates the actual prediction at the leaf nodes, more information can be stored at the leaf nodes. The decision tree is a distribution-free or non-parametric method, which does not depend upon probability distribution assumptions. It can handle high dimensional data with good accuracy. 


### Hyperparameters
As implemented in project 1 there are mainly 2 parameters to tweak. the impurity measure gini or entropy, and the max depth of the tree.

In [9]:
from sklearn.tree import DecisionTreeClassifier

criterion = ['gini','entropy']
max_depth = list(range(3, 31))

df = []
for d in max_depth:
    for c in criterion:
        dtc = DecisionTreeClassifier(criterion=c, max_depth=d)
        dtc.fit(X_train, y_train)
        acc = accuracy_score(y_val, dtc.predict(X_val))*100
        df.append([d,c,acc])
        print("Descicion tree with max depth: "+str(d) + ", impurity measure: "+c+", Accuracy: {:3.2f}%".format(acc))

Descicion tree with max depth: 3, impurity measure: gini, Accuracy: 24.28%
Descicion tree with max depth: 3, impurity measure: entropy, Accuracy: 30.12%
Descicion tree with max depth: 4, impurity measure: gini, Accuracy: 39.70%
Descicion tree with max depth: 4, impurity measure: entropy, Accuracy: 46.63%
Descicion tree with max depth: 5, impurity measure: gini, Accuracy: 57.43%
Descicion tree with max depth: 5, impurity measure: entropy, Accuracy: 58.47%
Descicion tree with max depth: 6, impurity measure: gini, Accuracy: 65.04%
Descicion tree with max depth: 6, impurity measure: entropy, Accuracy: 68.11%
Descicion tree with max depth: 7, impurity measure: gini, Accuracy: 72.07%
Descicion tree with max depth: 7, impurity measure: entropy, Accuracy: 74.55%
Descicion tree with max depth: 8, impurity measure: gini, Accuracy: 75.40%
Descicion tree with max depth: 8, impurity measure: entropy, Accuracy: 78.67%
Descicion tree with max depth: 9, impurity measure: gini, Accuracy: 81.65%
Descici

In [10]:
#Plotting data
df = pd.DataFrame(df, columns = ['Depth', 'Impurity', 'Accuracy'])
px.line(df, x="Depth", y = 'Accuracy', color='Impurity', title="DecisionTreeClassifier accuracy on validation data")

We chose the best hyperparameters based on the graph above

In [11]:
DecisionTree = DecisionTreeClassifier()
DecisionTree.fit(X_train, y_train)
predictions = DecisionTree.predict(X_test)

print("Evaluation of test data")
print(classification_report(y_test, predictions))

print("sklearn DecisionTreeClassifier Test data accuracy: {:3.2f}%".format(accuracy_score(y_test, predictions)*100))

Evaluation on Test data
              precision    recall  f1-score   support

           0       0.91      0.93      0.92       972
           1       0.94      0.94      0.94      1177
           2       0.87      0.87      0.87      1008
           3       0.86      0.84      0.85      1031
           4       0.86      0.87      0.87      1071
           5       0.84      0.83      0.83      1026
           6       0.89      0.89      0.89      1063
           7       0.91      0.93      0.92      1085
           8       0.81      0.80      0.80      1038
           9       0.84      0.82      0.83      1029

   micro avg       0.87      0.87      0.87     10500
   macro avg       0.87      0.87      0.87     10500
weighted avg       0.87      0.87      0.87     10500
 samples avg       0.87      0.87      0.87     10500

sklearn DecisionTreeClassifier Test data accuracy: 87.34%


We found best results with impurity measure entropy and depth of 30 which gives us a accuracy on test of 87.34%

## Sequential Convolutional Neural Network Classifier
TODO: What is Sequential Convolutional Neural Network Classifier?

https://en.wikipedia.org/wiki/Convolutional_neural_network

### Hyperparameters

Layers
Activation function
keras.optimizers import SGD

In [81]:
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dense
from keras.layers import Flatten
from keras.optimizers import SGD
from keras.optimizers import Adam
from sklearn.model_selection import KFold
from keras.layers import BatchNormalization

print(tf.__version__)

if tf.test.gpu_device_name(): 
    print('GPU Device:{}'.format(tf.test.gpu_device_name()))

1.10.0
GPU Device:/device:GPU:0


We need to reshape the image data for the nural network to accept it as a single color channel.

In [95]:
X = X.reshape(X.shape[0], 28, 28, 1)
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)

Define Model

Next, we need to define a baseline convolutional neural network model for the problem.

The model has two main aspects: the feature extraction front end comprised of convolutional and pooling layers, and the classifier backend that will make a prediction.

For the convolutional front-end, we can start with a single convolutional layer with a small filter size (3,3) and a modest number of filters (32) followed by a max pooling layer. The filter maps can then be flattened to provide features to the classifier.

Given that the problem is a multi-class classification task, we know that we will require an output layer with 10 nodes in order to predict the probability distribution of an image belonging to each of the 10 classes. This will also require the use of a softmax activation function. Between the feature extractor and the output layer, we can add a dense layer to interpret the features, in this case with 100 nodes.

All layers will use the ReLU activation function and the He weight initialization scheme, both best practices.

We will use a conservative configuration for the stochastic gradient descent optimizer with a learning rate of 0.01 and a momentum of 0.9. The categorical cross-entropy loss function will be optimized, suitable for multi-class classification, and we will monitor the classification accuracy metric, which is appropriate given we have the same number of examples in each of the 10 classes.

In [96]:
def neural_net_1(X_train, y_train, X_val, y_val, X_test, y_test, epochs = 5):
	model = Sequential()
	model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
	model.add(MaxPooling2D((2, 2)))
	model.add(Flatten())
	model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
	model.add(Dense(10, activation='softmax')) # Output layer of 10 integers
	opt = SGD(lr=0.01, momentum=0.9)
	model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

	model.fit(X_train, y_train, epochs=epochs, batch_size=64, validation_data=(X_val, y_val), verbose=1)
	_, acc = model.evaluate(X_test, y_test, verbose=1)
	print('Model accuracy on test data:  %.3f' % (acc * 100.0))
	
	return model
	
neural_net = neural_net_1(X_train, y_train, X_val, y_val, X_test, y_test, epochs=15)

Train on 56000 samples, validate on 7000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Model accuracy on test data:  98.757


Increase in Model Depth

There are many ways to change the model configuration in order to explore improvements over the baseline model.

Two common approaches involve changing the capacity of the feature extraction part of the model or changing the capacity or function of the classifier part of the model. Perhaps the point of biggest influence is a change to the feature extractor.

We can increase the depth of the feature extractor part of the model, following a VGG-like pattern of adding more convolutional and pooling layers with the same sized filter, while increasing the number of filters. In this case, we will add a double convolutional layer with 64 filters each, followed by another max pooling layer.

The updated version of the define_model() function with this change is listed below.

In [84]:
def neural_net_2(X_train, y_train, X_val, y_val, X_test, y_test, epochs = 5):
	model = Sequential()
	model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
	model.add(MaxPooling2D((2, 2)))
	model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform'))
	model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform'))
	model.add(MaxPooling2D((2, 2)))
	model.add(Flatten())
	model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
	model.add(Dense(10, activation='softmax'))
	opt = SGD(lr=0.01, momentum=0.9)
	model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

	model.fit(X_train, y_train, epochs=epochs, batch_size=64, validation_data=(X_val, y_val), verbose=1)
	_, acc = model.evaluate(X_test, y_test, verbose=1)
	print('Model accuracy on test data:  %.3f' % (acc * 100.0))

	return model

neural_net = neural_net_2(X_train, y_train, X_val, y_val, X_test, y_test, epochs=15)

Train on 63000 samples, validate on 3500 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Model accuracy on test data:  99.114


In [85]:
def neural_net(X, y, epochs = 5, folds=3):
    scores = []
    hist = []
    kfold = KFold(folds, shuffle=True, random_state=1)
    
    for train_ix, val_ix in kfold.split(X):
        model = Sequential()
        model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
        model.add(MaxPooling2D((2, 2)))
        model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform'))
        model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform'))
        model.add(MaxPooling2D((2, 2)))
        model.add(Flatten())
        model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(10, activation='softmax'))
        opt = SGD(lr=0.01, momentum=0.9)
        model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

        trainX, trainY, valtX, valY = X[train_ix], y[train_ix], X[val_ix], y[val_ix]
        model.fit(trainX, trainY, epochs=epochs, batch_size=32, validation_data=(valX, valY), verbose=1)
        
        _, acc = model.evaluate(valX, valY, verbose=1)
        print('> %.3f' % (acc * 100.0))
        
        scores.append(acc)
        hist.append(model)
        
    return scores, hist
    

final = neural_net(X,y, epochs=20, folds=5)

Train on 56000 samples, validate on 14000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
> 99.293
Train on 56000 samples, validate on 14000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
> 99.150
Train on 56000 samples, validate on 14000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
> 99.236
Train on 56000 samples, validate on 14000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epo

In [86]:
print(final[0])
tests = list(range(1,len(final[0])+1))

fig = px.scatter(x = tests, y = final[0], labels=dict(x="K-folds test number:", y="Accuracy on validation data"))

fig.update_layout(xaxis={'tickformat': ',d'})
fig.update_layout(yaxis_range=[0,1])

[0.9929285714285714, 0.9915, 0.9923571428571428, 0.9919285714285714, 0.9912857142857143]


By using kfolds and plotting the graph above we see that the data is not overfitted to the specific data-split and we can assume it is generilzed for unseen data. We now choose the best performing model and test it on unseen data.

In [87]:
best_model = final[1][final[0].index(max(final[0]))]

pred = best_model.predict_classes(X_test)
y_test_digits = [np.argmax(dig) for dig in y_test]
print(classification_report(y_test_digits, pred))

print("keras Sequential Test data accuracy: {:3.2f}%".format(accuracy_score(y_test_digits, pred)*100))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       365
           1       1.00      1.00      1.00       403
           2       1.00      1.00      1.00       339
           3       1.00      1.00      1.00       321
           4       1.00      1.00      1.00       344
           5       1.00      1.00      1.00       341
           6       1.00      1.00      1.00       349
           7       1.00      1.00      1.00       357
           8       1.00      1.00      1.00       326
           9       1.00      1.00      1.00       355

    accuracy                           1.00      3500
   macro avg       1.00      1.00      1.00      3500
weighted avg       1.00      1.00      1.00      3500

keras Sequential Test data accuracy: 99.89%


## What is your final classifier and how does it work.

We choose the Sequential neural net model as our final classifier with accuracy of 99.89%. Its details are described above
Neural net
## How  well  it  is  expected  to  perform  in  production  (on  unseen  data).Justify your estimate
We expect it to perform good by classifying almost all (99%) of unsees digits

## Measures taken to avoid overfitting
We tested the final model with folding data dataset to ensure that the split does not affect model performance. We also clearly splitted the set into train, validation and test which were used for each their part of assesing and improving the model.

## Given more resources (time or computing resources),  how would you improve your solution
More Epochs and K-folds if i had a stronger GPU. Also i had to use older versions of CUDA, CUDNN, Tenserflow and Keras to work with my GPU. Current versions of the library may be better optimized and produce better results. 