# MNIST handwritten digits classification with MLPs

In this notebook, we'll train a multi-layer perceptron model to classify MNIST digits using **Keras** (version $\ge$ 2 required). 

First, the needed imports.

In [None]:
%matplotlib inline
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Flatten, InputLayer
#from tensorflow.keras.utils import np_utils
from tensorflow.keras import utils
from tensorflow.keras import backend as K

from distutils.version import LooseVersion as LV
from tensorflow.keras import __version__

from IPython.display import SVG, Image
# model_to_dot is exposed through the public tf.keras.utils API,
# so there is no need to import it from the private tensorflow.python package.
from tensorflow.keras.utils import model_to_dot
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

print('Using Keras version:', __version__, 'backend:', K.backend())
assert(LV(__version__) >= LV("2.0.0"))

In [None]:
!nvidia-smi  # -L

In [None]:
!cat /proc/cpuinfo

## MNIST data set

Next we'll load the MNIST handwritten digits data set. The first time we run this, the data may have to be downloaded, which can take a while.


In [None]:
from tensorflow.keras.datasets import mnist, fashion_mnist

## MNIST:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
## Fashion-MNIST:
#(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

nb_classes = 10

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# one-hot encoding:
Y_train = utils.to_categorical(y_train, nb_classes)
Y_test = utils.to_categorical(y_test, nb_classes)

print()
print('MNIST data loaded: train:',len(X_train),'test:',len(X_test))
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('Y_train:', Y_train.shape)

The training data (`X_train`) is a 3rd-order tensor of size (60000, 28, 28), i.e. it consists of 60000 images of size 28x28 pixels. `y_train` is a 60000-dimensional vector containing the correct classes ("0", "1", ..., "9") for each training sample, and `Y_train` is a [one-hot](https://en.wikipedia.org/wiki/One-hot) encoding of `y_train`.
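
To make the relation between `y_train` and `Y_train` concrete, here is a minimal NumPy sketch of one-hot encoding. The helper `one_hot()` and the three example labels are illustrative assumptions, not part of the notebook; in the notebook itself, `utils.to_categorical()` does this job.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Set the column given by each label to 1 in the corresponding row.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example_labels = np.array([5, 0, 4])          # illustrative labels only
example_one_hot = one_hot(example_labels, 10)
print(example_one_hot)
```

Taking `np.argmax(example_one_hot, axis=1)` recovers the original integer labels, which is how class predictions are decoded later in the notebook.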

Let's take a closer look. Here are the first 10 training digits (or fashion items for Fashion-MNIST):

In [None]:
pltsize=1
plt.figure(figsize=(10*pltsize, pltsize))

for i in range(10):
    plt.subplot(1,10,i+1)
    plt.axis('off')
    plt.imshow(X_train[i,:,:], cmap="gray")
    plt.title('Class: '+str(y_train[i]))
    print('Training sample',i,': class:',y_train[i], ', one-hot encoded:', Y_train[i])

## Linear model

### Initialization

Let's begin with a simple linear model.  We first initialize the model with `Sequential()`. The first layer is an `InputLayer` and then we use a `Flatten` layer to convert image data into vectors. 

Then we add a `Dense` layer that has 28*28=784 input nodes (one for each pixel in the input image) and 10 output nodes. The `Dense` layer connects each input to each output with some weight parameter. 

Finally, we select *categorical crossentropy* as the loss function, select [*stochastic gradient descent*](https://keras.io/optimizers/#sgd) as the optimizer, add *accuracy* to the list of metrics to be evaluated, and `compile()` the model. Note there are [several different options](https://keras.io/optimizers/) for the optimizer in Keras that we could use instead of *sgd*.
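
As a rough illustration of what the flatten, dense, softmax and cross-entropy steps compute, the following NumPy sketch runs one forward pass by hand. The random images, labels and weights are stand-ins invented for this example (not real MNIST data or trained parameters), and the code only mimics the arithmetic; Keras's own implementation differs in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 4 random "images" and labels, not real MNIST data.
images = rng.random((4, 28, 28))
labels = np.array([3, 1, 4, 1])
targets = np.eye(10)[labels]                  # one-hot targets

# Flatten: (4, 28, 28) -> (4, 784)
x = images.reshape(len(images), -1)

# Dense layer: one weight per (input, output) pair plus one bias per output.
W = rng.normal(scale=0.01, size=(784, 10))
b = np.zeros(10)
logits = x @ W + b

# Softmax turns the 10 scores into probabilities that sum to 1
# (the maximum is subtracted first for numerical stability).
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Categorical cross-entropy: negative log-probability of the correct class,
# averaged over the samples.
loss = -np.mean(np.sum(targets * np.log(probs), axis=1))
print(probs.shape, loss)
```

With small random weights the predicted probabilities are close to uniform, so the loss starts near $\ln 10 \approx 2.3$; training lowers it by adjusting `W` and `b`.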

In [None]:
linmodel = Sequential()

linmodel.add(InputLayer(input_shape=(28, 28)))
linmodel.add(Flatten())

linmodel.add(Dense(units=10, activation='softmax'))

linmodel.compile(loss='categorical_crossentropy', 
                 optimizer='sgd', 
                 metrics=['accuracy'])
print(linmodel.summary())

The summary shows that there are 7850 parameters in our model: a 784x10 weight matrix plus 10 bias terms, one per output. Equivalently, this is a 785x10 matrix if the bias is treated as one additional input.
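
The parameter count can be checked by hand. The helper `dense_params()` below is a hypothetical name introduced only for this check; it counts the weights and biases of a single fully connected layer.

```python
def dense_params(n_inputs, n_outputs):
    """Number of trainable parameters in a fully connected layer."""
    n_weights = n_inputs * n_outputs   # one weight per (input, output) pair
    n_biases = n_outputs               # one bias per output node
    return n_weights + n_biases

linear_model_params = dense_params(28 * 28, 10)
print(linear_model_params)  # 7850
```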

We can also draw a fancier graph of our model.

In [None]:
# Image(model_to_dot(linmodel, show_shapes=True).create(prog='dot', format='png'))
import pydot
SVG(model_to_dot(linmodel, show_shapes=True, dpi=72).create(prog='dot', format='svg'))

### Learning

Now we are ready to train our first model.  An *epoch* means one pass through the whole training data. 

You can run the code below multiple times and it will continue the training process from where it left off.  If you want to start from scratch, re-initialize the model by re-running the initialization cell above.
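
To get a feel for how much work one epoch involves, this small sketch counts the weight updates, assuming the batch size of 32 and the 10 epochs used in the training cell, and that each mini-batch triggers one update. The variable names are chosen for this example only.

```python
import math

n_train = 60000      # number of MNIST training images
batch_size = 32      # value passed to fit() in this notebook
n_epochs = 10

# One epoch processes every training sample once, one mini-batch at a time.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * n_epochs
print(steps_per_epoch, total_updates)  # 1875 18750
```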

In [None]:
%%time
epochs = 10 # one epoch takes about 3 seconds on Google Colab GPU (NVIDIA T4) as of Aug 2020

linhistory = linmodel.fit(X_train, Y_train, 
                          epochs=epochs, 
                          batch_size=32,
                          verbose=2)

Let's now see how the training progressed. 

* *Loss* is a function of the difference of the network output and the target values.  We are minimizing the loss function during training so it should decrease over time.
* *Accuracy* is the classification accuracy for the training data.  It gives some indication of the real accuracy of the model but cannot be fully trusted, as the model may have overfitted and simply memorized the training data.
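
As a small worked example of the accuracy metric, the sketch below compares predicted and true classes for four made-up samples. The probabilities and labels are hypothetical and use only 3 classes to keep the arrays readable.

```python
import numpy as np

# Hypothetical predicted probabilities for 4 samples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6]])
true_labels = np.array([0, 1, 2, 2])

# The predicted class is the one with the highest probability.
predicted_labels = np.argmax(probs, axis=1)

# Accuracy is the fraction of samples whose predicted class is correct.
accuracy = np.mean(predicted_labels == true_labels)
print(predicted_labels, accuracy)  # [0 1 1 2] 0.75
```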

In [None]:
plt.figure(figsize=(5,3))
plt.plot(linhistory.epoch,linhistory.history['loss'])
plt.title('loss')

plt.figure(figsize=(5,3))
plt.plot(linhistory.epoch,linhistory.history['accuracy'])
plt.title('accuracy');

### Inference

For a better measure of the quality of the model, let's see the model accuracy for the test data. 

In [None]:
linscores = linmodel.evaluate(X_test, Y_test, verbose=2)
print("%s: %.2f%%" % (linmodel.metrics_names[1], linscores[1]*100))

We can now take a closer look at the results.

Let's define a helper function to show the failure cases of our classifier. 

In [None]:
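def show_failures(predictions, trueclass=None, predictedclass=None, maxtoshow=10):
    """Plot up to `maxtoshow` misclassified test digits, optionally
    restricted to a given true class and/or predicted class."""
    rounded = np.argmax(predictions, axis=1)  # predicted class for each test sample
    errors = rounded!=y_test                  # True where the prediction is wrong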
    print('Showing max', maxtoshow, 'first failures. '
          'The predicted class is shown first and the correct class in parenthesis.')
    ii = 0
    plt.figure(figsize=(maxtoshow, 1))
    for i in range(X_test.shape[0]):
        if ii>=maxtoshow:
            break
        if errors[i]:
            if trueclass is not None and y_test[i] != trueclass:
                continue
            if predictedclass is not None and rounded[i] != predictedclass:
                continue
            plt.subplot(1, maxtoshow, ii+1)
            plt.axis('off')
            plt.imshow(X_test[i,:,:], cmap="gray")
            plt.title("%d (%d)" % (rounded[i], y_test[i]))
            ii = ii + 1

Here are the first 10 test digits that the linear model assigned to the wrong class:

In [None]:
linpredictions = linmodel.predict(X_test)

show_failures(linpredictions)

## Multi-layer perceptron (MLP) network

### Initialization

Let's now create a more complex MLP model that has multiple layers, non-linear activation functions, and dropout layers.  `Dropout()` randomly sets a fraction of inputs to zero during training, which is one approach to regularization and can sometimes help to prevent overfitting.
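
The effect of dropout can be sketched in a few lines of NumPy. This is an illustrative "inverted dropout" implementation under the assumption that kept values are scaled by `1 / (1 - rate)` during training so that the expected activation stays the same; the function `dropout()` and its arguments are names made up for this example, not the Keras layer itself.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero a random fraction `rate` of the entries of x during training."""
    if not training or rate == 0.0:
        return x                              # dropout is inactive at inference time
    keep_mask = rng.random(x.shape) >= rate   # True for the entries that are kept
    return x * keep_mask / (1.0 - rate)       # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
activations = np.ones((1000, 50))
dropped = dropout(activations, rate=0.2, rng=rng)
print((dropped == 0).mean(), dropped.mean())
```

About 20% of the entries come out as zero and the rest are scaled up to 1.25, so the mean stays close to 1. With `training=False` the input is returned unchanged.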

There are two options below: a simple model and a slightly more complex one.  Select either one by commenting or uncommenting the corresponding lines.

The output of the last layer needs to be a softmaxed 10-dimensional vector to match the groundtruth (`Y_train`). 

Finally, we again `compile()` the model, this time using [*RMSProp*](https://keras.io/optimizers/#rmsprop) as the optimizer.
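
For intuition, the forward pass of the simple variant (one hidden layer of 20 ReLU units followed by the 10-way softmax layer) can be sketched in NumPy. The input batch and the randomly initialized weights are placeholders invented for this example; the sketch only mirrors the layer shapes, not the trained Keras model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a batch of 5 flattened 28x28 images (not real MNIST data).
x = rng.random((5, 784))

# Randomly initialized parameters with the shapes of the simple model:
# Dense(20) followed by ReLU, then Dense(10) followed by softmax.
W1, b1 = rng.normal(scale=0.05, size=(784, 20)), np.zeros(20)
W2, b2 = rng.normal(scale=0.05, size=(20, 10)), np.zeros(10)

hidden = np.maximum(0.0, x @ W1 + b1)          # ReLU: negative values become zero
logits = hidden @ W2 + b2
logits -= logits.max(axis=1, keepdims=True)    # shift for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, n_params)  # (5, 10) 15910
```

Without the ReLU between the two layers, the two matrix multiplications would collapse into a single linear map, so the non-linearity is what lets the MLP represent more than the linear model.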

In [None]:
# Model initialization:
model = Sequential()
model.add(InputLayer(input_shape=(28, 28)))
model.add(Flatten())

# A simple model:
model.add(Dense(units=20))
model.add(Activation('relu'))

# A bit more complex model:
#model.add(Dense(units=50))
#model.add(Activation('relu'))
#model.add(Dropout(0.2))

#model.add(Dense(units=50))
#model.add(Activation('relu'))
#model.add(Dropout(0.2))

# The last layer needs to be like this:
model.add(Dense(units=10, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='rmsprop', 
              metrics=['accuracy'])
print(model.summary())

In [None]:
#Image(model_to_dot(model, show_shapes=True, dpi=72).create(prog='dot', format='png'))
SVG(model_to_dot(model, show_shapes=True, dpi=72).create(prog='dot', format='svg'))

### Learning

In [None]:
%%time
epochs = 10 # one epoch takes about 4 seconds on Google Colab GPU (NVIDIA T4) as of Aug 2020

history = model.fit(X_train, Y_train, 
                    epochs=epochs, 
                    batch_size=32,
                    verbose=2)

In [None]:
plt.figure(figsize=(5,3))
plt.plot(history.epoch,history.history['loss'])
plt.title('loss')

plt.figure(figsize=(5,3))
plt.plot(history.epoch,history.history['accuracy'])
plt.title('accuracy');

### Inference

Next, we compute the accuracy for the test data.  The MLP should be somewhat better than the linear model.

In [None]:
%%time
scores = model.evaluate(X_test, Y_test, verbose=2)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

We can again take a closer look at the results, using the `show_failures()` function defined earlier.

Here are the first 10 test digits that the MLP assigned to the wrong class:

In [None]:
predictions = model.predict(X_test)

show_failures(predictions)

We can use `show_failures()` to inspect failures in more detail. For example, here are failures in which the true class was "6":

In [None]:
show_failures(predictions, trueclass=6)

We can also compute the confusion matrix to see which digits get mixed the most, and look at classification accuracies separately for each class:

In [None]:
from sklearn.metrics import confusion_matrix

print('Confusion matrix (rows: true classes; columns: predicted classes):'); print()
cm=confusion_matrix(y_test, np.argmax(predictions, axis=1), labels=list(range(10)))
print(cm); print()

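print('Classification accuracy for each class:'); print()
# Per-class accuracy: correct predictions (diagonal) divided by the number of true samples (row sums).
for i, acc in enumerate(cm.diagonal()/cm.sum(axis=1)):
    print("%d: %.4f" % (i, acc))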