# Classifying MNIST Digits - Reloaded
In this notebook, we will dive deeper into the problem of classifying the digits of the MNIST dataset.

## Preparations
We start with the usual preparations. These are very similar to what you have already seen in the previous notebook.

### Load libraries

In [None]:
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

In [None]:
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Reshape, Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import regularizers
from tensorflow.keras.regularizers import L1, L2

In [None]:
import pickle

In [None]:
tf.random.set_seed(123)
np.random.seed(123)

### Prepare data
Next, we prepare the data. Since all images have pixel values between 0 and 255, dividing them by 255 scales the input to the range from 0 to 1. We also use the usual `train_test_split` function to obtain separate datasets for training and validation.

In [None]:
# Load data:
mnist = tf.keras.datasets.mnist
(train_val_images, train_val_labels), (test_images, test_labels) = mnist.load_data()

# Scale image data:
train_val_images = train_val_images / 255.0
test_images = test_images / 255.0

# Split into training / validation
train_images_all, val_images, train_labels_all, val_labels = train_test_split(train_val_images, train_val_labels,
                                                                              test_size=0.20, random_state=42)

We look at the distribution of labels in the full training/validation data (before the split) and in the validation set:

In [None]:
train_val_label_df = pd.DataFrame(train_val_labels)
train_val_label_df.columns = ['label']
train_val_label_df['label'].value_counts()

In [None]:
val_label_df = pd.DataFrame(val_labels)
val_label_df.columns = ['label']
val_label_df['label'].value_counts(sort=False, ascending=True)

We see that both datasets have a fairly even distribution of the labels.

Next, we randomly choose 1000 samples that we will use for training. While this might look artificial here, it helps us illustrate the problems related to overfitting while keeping training times reasonably short.

In [None]:
# randomly choose a given number of data points for training
n_train = 1000

n_train_all = train_images_all.shape[0]
train_indices = np.random.choice(range(n_train_all), n_train)

train_images = train_images_all[train_indices]
train_labels = train_labels_all[train_indices]
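
One detail worth knowing when drawing random indices: `np.random.choice` samples with replacement unless `replace=False` is passed, so the same index can be drawn more than once. The following sketch illustrates the difference; the sizes and the use of NumPy's `default_rng` generator are assumptions made for this illustration only.

```python
import numpy as np

# Hypothetical sizes for illustration: 48,000 candidates, 1,000 draws.
n_candidates = 48_000
n_draws = 1_000

rng = np.random.default_rng(123)

# Default behaviour: indices are drawn with replacement, so repeats can occur.
idx_with = rng.choice(n_candidates, size=n_draws)

# replace=False guarantees that every index appears at most once.
idx_without = rng.choice(n_candidates, size=n_draws, replace=False)

n_unique_with = np.unique(idx_with).size
n_unique_without = np.unique(idx_without).size
print("unique indices with replacement:   ", n_unique_with)
print("unique indices without replacement:", n_unique_without)
```

With 1000 draws out of tens of thousands of candidates, only a handful of duplicates are expected, so the effect on the training set is small.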

In [None]:
# convert to one-hot vector
train_labels_OH = to_categorical(train_labels)
val_labels_OH = to_categorical(val_labels)
test_labels_OH = to_categorical(test_labels)
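
To make the one-hot encoding concrete, here is a minimal NumPy sketch of the same idea. The helper `one_hot` is a hypothetical name used for illustration only; it is not part of Keras.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an array of shape (len(labels), num_classes) with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=float)
    # Set the column given by each label to 1.0, one row per label.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example_labels = np.array([3, 0, 9])
example_OH = one_hot(example_labels, num_classes=10)
print(example_OH)
```

Each row contains exactly one entry equal to 1, and applying `np.argmax` along axis 1 recovers the original labels.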

## Defining the Network
We start with a definition of the network that is similar to the one in the previous notebook:

In [None]:
mnist_classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape = (28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(300, activation='relu'),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

Now we compile it and use the `summary` method to get an overview of the model:

In [None]:
mnist_classifier.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
mnist_classifier.summary()

## Varying the Training Time

### Short Training
We start with an initial training over 5 epochs:

In [None]:
nEpochs = 5
mnist_classifier.fit(train_images, train_labels_OH, epochs = nEpochs, verbose = True)

We apply the trained model to the training images. Note that for every image we get 10 numbers, representing the probability (as estimated by our model) that the image shows the corresponding digit:

In [None]:
train_labels_OH_est = mnist_classifier.predict(train_images)

In [None]:
train_labels_OH_est[:5]

To compare these with the true labels, which contain a single digit per image, we pick the most probable label for each image:

In [None]:
train_labels_est = np.argmax(train_labels_OH_est, 1)

In [None]:
train_labels_est[:5]

Let's calculate the accuracy of these predictions:

In [None]:
accuracy_score(train_labels, train_labels_est)
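
The two steps above (picking the most probable class and computing the accuracy) can be sketched on a tiny example. The probabilities and labels below are made up for illustration and do not come from the model.

```python
import numpy as np

# Made-up class probabilities for 4 images and 3 classes (each row sums to 1).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
true_labels = np.array([0, 1, 2, 1])

# Most probable class per row.
pred_labels = np.argmax(probs, axis=1)

# Accuracy is the fraction of predictions that match the true label.
accuracy = np.mean(pred_labels == true_labels)
print(pred_labels, accuracy)
```

Here three of the four predictions are correct, so the accuracy is 0.75.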

We do the same for the validation and test images, and we also check the confusion matrix on the test images:

In [None]:
val_labels_OH_est = mnist_classifier.predict(val_images)
val_labels_est = np.argmax(val_labels_OH_est, 1)

In [None]:
accuracy_score(val_labels, val_labels_est)

In [None]:
test_labels_OH_est = mnist_classifier.predict(test_images)
test_labels_est = np.argmax(test_labels_OH_est, 1)
accuracy_score(test_labels, test_labels_est)

In [None]:
ConfusionMatrixDisplay.from_predictions(test_labels, test_labels_est, normalize='true', values_format='.2f')
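
With `normalize='true'`, each row of the confusion matrix is divided by the number of samples of that true class, so every row sums to 1 and the diagonal shows the per-class recall. A minimal NumPy sketch of this normalization, with made-up labels and a hypothetical helper name:

```python
import numpy as np

def normalized_confusion_matrix(y_true, y_pred, num_classes):
    """Row-normalized confusion matrix: entry [t, p] is the share of class t predicted as p."""
    counts = np.zeros((num_classes, num_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of classes that never occur in y_true stay at zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
cm = normalized_confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
```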

We store the accuracies in a dataframe that we will expand with the performance of other models as we try them out:

In [None]:
accuracies_df = pd.DataFrame({'Method': 'Short Training', 
                              'Training': accuracy_score(train_labels, train_labels_est), 
                              'Validation': accuracy_score(val_labels, val_labels_est), 
                              'Test': accuracy_score(test_labels, test_labels_est)}, 
                             index=['Short Training'])

In [None]:
accuracies_df

### More Training Epochs
Choosing 5 epochs for training was somewhat arbitrary - and looking at the progress of the loss and the accuracy, we see that both were actually still improving. So let's increase the number of epochs!

Note that the training will thus take longer. If you don't want to wait for it to finish, you can just set `train_from_scratch` to `False`, and the trained parameters will be used instead. Note that this will only work if you choose the exact same model as we did. Also, if you are running this notebook locally (on your own computer), you have to download the folder `pretrained` (which contains the pretrained weights) and put it into the same directory as this notebook.

In [None]:
train_from_scratch = True

In [None]:
mnist_classifier_longer = tf.keras.models.clone_model(mnist_classifier)
mnist_classifier_longer.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
nEpochs = 200

Furthermore, we would like to keep track of the progress of the fitting. While we can obviously look at the output, the `fit` function also returns a *history*, which we will save as `history_longer`:

In [None]:
# define paths:
classi_weights_path_longer = './pretrained/mnist_small_classi_longer.weights.h5'
classi_history_path_longer = './pretrained/mnist_small_classifier_longer.history.h5'

if train_from_scratch:
    history_longer = mnist_classifier_longer.fit(train_images, train_labels_OH,
                                                 epochs = nEpochs, verbose = True)
    # Save the weights:
    mnist_classifier_longer.save_weights(classi_weights_path_longer)

    # Save training history:
    with open(classi_history_path_longer, 'wb') as f:
        pickle.dump(history_longer, f)
else:
    # load previously computed weights
    mnist_classifier_longer.load_weights(classi_weights_path_longer)

    # load history:
    with open(classi_history_path_longer, 'rb') as f:
        history_longer = pickle.load(f)
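
The cell above pickles the whole object returned by `fit`. If only the per-epoch metrics are needed, one possible alternative is to store just the plain dictionary found in its `history` attribute. The sketch below shows such a round trip with a made-up dictionary and a hypothetical file name; it is not a drop-in replacement for the pretrained files used in this notebook, which contain the full object.

```python
import os
import pickle
import tempfile

# Stand-in for `history.history`: a dict mapping metric names to per-epoch values.
# The numbers are made up for illustration.
history_dict = {"loss": [1.2, 0.6, 0.4], "accuracy": [0.61, 0.83, 0.90]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.history.pkl")  # hypothetical file name

    # Save the dictionary ...
    with open(path, "wb") as f:
        pickle.dump(history_dict, f)

    # ... and load it again.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == history_dict)
```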

Below we define a function to plot the history (you don't need to understand how this is done):

In [None]:
def plot_history(history, logy=False):
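    """
    Plot the training history (loss, accuracy, and error rate per epoch).

    Args:
    - history: dict of per-epoch metrics, i.e. the `history` attribute
      of the object returned by `fit`.
    - logy: if True, use a logarithmic y-axis.

    Returns:
    None
    """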
    plt.subplot(311)
    plt.plot(history['loss'], label='Training')
    if 'val_loss' in history.keys():
        plt.plot(history['val_loss'], label='Validation')
    plt.legend()
    plt.ylabel('Loss')
    if logy:
        plt.yscale('log')
    plt.grid()

    plt.subplot(312)
    plt.plot(history['accuracy'], label='Training')
    if 'val_accuracy' in history.keys():
        plt.plot(history['val_accuracy'], label='Validation')
    plt.legend()
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    if logy:
        plt.yscale('log')
    plt.grid()    

    plt.subplot(313)
    plt.plot([1-acc for acc in history['accuracy']], label='Training')
    if 'val_accuracy' in history.keys():
        plt.plot([1-acc for acc in history['val_accuracy']], label='Validation')
    plt.legend()
    plt.xlabel('Epoch')
    plt.ylabel('Error Rate')
    if logy:
        plt.yscale('log')
    plt.grid()

    plt.show()

Now, let's plot the history of our long training:

In [None]:
plot_history(history_longer.history)

In [None]:
plot_history(history_longer.history, logy=True)

It looks like the loss on the training data is continuously decreasing, and the model yields perfectly accurate results. Let's double-check:

In [None]:
train_labels_OH_est_longer = mnist_classifier_longer.predict(train_images)
train_labels_est_longer = np.argmax(train_labels_OH_est_longer, 1)
accuracy_score(train_labels, train_labels_est_longer)

Wow, 100% accuracy. But you will of course want to double-check on a new data set - let's see how the model is classifying the validation data:

In [None]:
val_labels_OH_est_longer = mnist_classifier_longer.predict(val_images)
val_labels_est_longer = np.argmax(val_labels_OH_est_longer, 1)
accuracy_score(val_labels, val_labels_est_longer)

A textbook example of overfitting!

Since this is a common behavior, TensorFlow provides some tools to handle this. In particular,
* the `compile` function takes an argument `metrics`: a list of metrics that will be evaluated and printed after every epoch. For example, we can use
`mnist_classifier_longer.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])`
* the `fit` function takes an argument `validation_data`, e.g., we can use `validation_data = (val_images, val_labels_OH)`. The loss and all metrics will be evaluated on the validation data after every epoch.

**EXERCISE**

Adapt the cells above to include `metrics` and `validation_data`. Re-train the model and plot the history. Describe your observations.

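**Hint**: To train a model from scratch (i.e., after you have added these arguments), you have to create the model again, e.g., by re-running the cell with `clone_model`, and then compile it. Compiling alone does not reset the weights, so the training would otherwise continue from the current parameters of the model.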

Again, we store the accuracies in a dataframe:

In [None]:
train_labels_OH_est_longer = mnist_classifier_longer.predict(train_images)
train_labels_est_longer = np.argmax(train_labels_OH_est_longer, 1)

val_labels_OH_est_longer = mnist_classifier_longer.predict(val_images)
val_labels_est_longer = np.argmax(val_labels_OH_est_longer, 1)

test_labels_OH_est_longer = mnist_classifier_longer.predict(test_images)
test_labels_est_longer = np.argmax(test_labels_OH_est_longer, 1)

accuracies_df = pd.concat([accuracies_df,
                       pd.DataFrame({'Method': 'Long Training', 
                                     'Training': accuracy_score(train_labels, train_labels_est_longer), 
                                     'Validation': accuracy_score(val_labels, val_labels_est_longer), 
                                     'Test': accuracy_score(test_labels, test_labels_est_longer)}, 
                                    index=['Long Training'])
                      ], axis=0)
accuracies_df

### Early Stopping
Since we are monitoring the validation performance, we might as well stop the training once it no longer improves over a given number of epochs. This is exactly what `EarlyStopping` does: we monitor a quality criterion specified as `monitor`, and stop the training if we have not seen any improvement in the last `patience` epochs. If `restore_best_weights` is `True`, the weights of the best-performing model are restored once the training has stopped (if `restore_best_weights` is `False`, the latest model weights are used, which is usually not recommended).
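
The patience logic can be sketched in a few lines of plain Python. This is a simplified illustration for a score that should be maximised, with made-up validation accuracies and a hypothetical function name; it is not the Keras implementation.

```python
def early_stopping_sketch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a score to be maximised."""
    best_score = float("-inf")
    best_epoch = 0
    wait = 0  # number of epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: stop here.
                return epoch, best_epoch
    # Training ran through all epochs without triggering the stop.
    return len(val_scores), best_epoch

# Made-up validation accuracies with a peak in epoch 4.
scores = [0.70, 0.80, 0.85, 0.90, 0.89, 0.90, 0.88]
stop_epoch, best_epoch = early_stopping_sketch(scores, patience=3)
print(stop_epoch, best_epoch)
```

In this example the best score is reached in epoch 4 and training stops after epoch 7. Restoring the best weights then corresponds to going back to the model as it was after `best_epoch`.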

In [None]:
mnist_classifier_es = tf.keras.models.clone_model(mnist_classifier)
mnist_classifier_es.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
# define paths:
classi_weights_path_es = './pretrained/mnist_small_classifier_es.weights.h5'
classi_history_path_es = './pretrained/mnist_small_classifier_es.history.h5'

if train_from_scratch:
    history_es = mnist_classifier_es.fit(train_images, train_labels_OH, validation_data = (val_images, val_labels_OH),
                                         epochs = nEpochs, verbose = True, 
                                         callbacks = [ EarlyStopping(monitor='val_accuracy', patience=10,
                                                                     verbose=False, restore_best_weights=True)])

    # Save the weights:
    mnist_classifier_es.save_weights(classi_weights_path_es)

    # Save training history:
    with open(classi_history_path_es, 'wb') as f:
        pickle.dump(history_es, f)
else:
    # load previously computed weights
    mnist_classifier_es.load_weights(classi_weights_path_es)

    # load history:
    with open(classi_history_path_es, 'rb') as f:
        history_es = pickle.load(f)

We see that `EarlyStopping` has ended the training after 36 epochs. Looking at the earlier performances, we see that the maximal `val_accuracy` was obtained in epoch 26 with a value of 0.9100.

In [None]:
plot_history(history_es.history)

Looking at the history of the training with early stopping, we see that the training stopped as soon as the model started to overfit: the validation loss starts to increase after around 25 epochs. Hence, `EarlyStopping` has prevented the model from entering a regime where it overadapts to the training data.

### Function for Training and Analysis
In the following, we will look at many more models. To simplify this analysis, we define the following function, which trains and evaluates any given model. The function uses early stopping for all trainings:

In [None]:
def train_analyse_model(model, model_name, train_from_scratch, classi_weights_path, classi_history_path, 
                        train_images, train_labels_OH, val_images, val_labels_OH, test_images, test_labels,
                        nEpochs = 100, nPatience = 10):

    # Train model or load pretrained weights:
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.summary()
    
    if train_from_scratch:
        print('Training quietly')
        history = model.fit(train_images, train_labels_OH, validation_data = (val_images, val_labels_OH),
                                          epochs = nEpochs, verbose = False, 
                                          callbacks = [ EarlyStopping(monitor='val_accuracy', patience=nPatience,
                                                                      verbose=False, restore_best_weights=True)])
        # Save the weights:
        model.save_weights(classi_weights_path)
    
        # Save training history:
        with open(classi_history_path, 'wb') as f:
            pickle.dump(history, f)
    else:
        # load previously computed weights
        model.load_weights(classi_weights_path)
    
        # load history:
        with open(classi_history_path, 'rb') as f:
            history = pickle.load(f)

    plt.figure(0)
    plot_history(history.history)
    
    # Recover the integer labels from the one-hot vectors
    # (so that the function does not rely on global variables)
    train_labels = np.argmax(train_labels_OH, 1)
    val_labels = np.argmax(val_labels_OH, 1)

    # Evaluate accuracy on training data
    train_labels_OH_est = model.predict(train_images)
    train_labels_est = np.argmax(train_labels_OH_est, 1)
    print('\nAccuracy on training data:', accuracy_score(train_labels, train_labels_est))

    # Evaluate accuracy on validation data
    val_labels_OH_est = model.predict(val_images)
    val_labels_est = np.argmax(val_labels_OH_est, 1)
    print('Accuracy on validation data:', accuracy_score(val_labels, val_labels_est))

    # Evaluate accuracy on test data
    test_labels_OH_est = model.predict(test_images)
    test_labels_est = np.argmax(test_labels_OH_est, 1)
    print('Accuracy on test data:', accuracy_score(test_labels, test_labels_est))

    # plot confusion matrix (from_predictions creates its own figure)
    ConfusionMatrixDisplay.from_predictions(test_labels, test_labels_est, normalize='true', values_format='.2f')
    plt.title('Confusion matrix on test data')
    plt.show()

    # generate dataframe with accuracy on validation and test data
    return pd.DataFrame({'Method': model_name, 
                         'Training': accuracy_score(train_labels, train_labels_est), 
                         'Validation': accuracy_score(val_labels, val_labels_est), 
                         'Test': accuracy_score(test_labels, test_labels_est)}, 
                        index=[model_name])

We apply this function to evaluate the model with early stopping, and add the performance summary to the overall `accuracies_df`:

In [None]:
accuracies_es =  train_analyse_model(mnist_classifier_es, 'Early Stopping', train_from_scratch, 
                                       classi_weights_path_es, classi_history_path_es, 
                                       train_images, train_labels_OH, val_images, val_labels_OH, test_images, test_labels,
                                       nEpochs = 100, nPatience = 10)

accuracies_df = pd.concat([accuracies_df, accuracies_es], axis=0)

## Avoiding Overfitting
Early stopping halts the training before the model overadapts to the training data. In this section, we look at ways to keep a model from overfitting in the first place.

### Drop-Out
Drop-out layers are a common addition to neural networks to prevent them from overfitting. 
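
During training, a drop-out layer sets a randomly chosen fraction (the drop-out rate) of its inputs to zero and scales the remaining ones by `1 / (1 - rate)`, so that the expected sum stays the same. At prediction time the layer passes its input through unchanged. A minimal NumPy sketch of the training-time behavior, with a hypothetical function name and made-up activations:

```python
import numpy as np

def dropout_sketch(activations, rate, rng):
    """Zero out roughly a fraction `rate` of the entries and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 300))  # made-up activations, all equal to 1
dropped = dropout_sketch(activations, rate=0.5, rng=rng)

fraction_zero = np.mean(dropped == 0.0)
mean_activation = dropped.mean()
print(fraction_zero, mean_activation)
```

About half of the entries are zero and the others are doubled, so the mean activation stays close to 1.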

#### Drop-Out Rate 50%

In [None]:
mnist_classifier_do50 = tf.keras.Sequential([
    tf.keras.layers.Input(shape = (28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(300, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

In [None]:
classi_weights_path_do50 = './pretrained/mnist_small_classifier_do50.weights.h5'
classi_history_path_do50 = './pretrained/mnist_small_classifier_do50.history.h5'

accuracies_do50 =  train_analyse_model(mnist_classifier_do50, 'Drop Out 50%', train_from_scratch, 
                                       classi_weights_path_do50, classi_history_path_do50, 
                                       train_images, train_labels_OH, val_images, val_labels_OH, test_images, test_labels,
                                       nEpochs = 100, nPatience = 10)

accuracies_df = pd.concat([accuracies_df, accuracies_do50], axis=0)
accuracies_do50

**EXERCISE**: Change the drop-out rate in the model definition and investigate the effect on the model performance.

### Weight Regularization

Weight regularization puts a penalty on the magnitude of the inferred weights. This keeps the model from inferring unnecessarily large weights.
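
Concretely, the penalty is added to the loss that the optimizer minimizes: an L1 penalty is proportional to the sum of the absolute values of the weights, an L2 penalty to the sum of their squares. The following NumPy sketch uses a made-up weight matrix and a made-up data loss; the function names are hypothetical.

```python
import numpy as np

def l1_penalty(weights, strength):
    """L1 penalty: strength times the sum of absolute weight values."""
    return strength * np.sum(np.abs(weights))

def l2_penalty(weights, strength):
    """L2 penalty: strength times the sum of squared weight values."""
    return strength * np.sum(np.square(weights))

weights = np.array([[0.5, -1.0],
                    [0.0, 2.0]])   # made-up weight matrix
strength = 0.00025                 # regularization parameter
data_loss = 0.30                   # made-up value of the unregularized loss

total_loss_l1 = data_loss + l1_penalty(weights, strength)
total_loss_l2 = data_loss + l2_penalty(weights, strength)
print(total_loss_l1, total_loss_l2)
```

Larger weights increase the total loss, so the optimizer is pushed towards smaller weights unless they clearly reduce the data loss.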

#### L1
Note that we have imported the regularizers with the line `from tensorflow.keras.regularizers import L1, L2`, so now we can directly use `L1`:

In [None]:
mnist_classifier_kr1 = tf.keras.Sequential([
    tf.keras.layers.Input(shape = (28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(300, activation='relu', kernel_regularizer=L1(0.00025)),
    tf.keras.layers.Dense(100, activation='relu', kernel_regularizer=L1(0.00025)),
    tf.keras.layers.Dense(10, activation='softmax')
])

In [None]:
classi_weights_path_kr1 = './pretrained/mnist_small_classifier_kr1.weights.h5'
classi_history_path_kr1 = './pretrained/mnist_small_classifier_kr1.history.h5'

accuracies_kr1 =  train_analyse_model(mnist_classifier_kr1, 'Kernel Reg. L1',
                                      train_from_scratch, classi_weights_path_kr1, classi_history_path_kr1,
                                      train_images, train_labels_OH, val_images, val_labels_OH, test_images, test_labels,
                                      nEpochs = 100, nPatience = 10)

accuracies_df = pd.concat([accuracies_df, accuracies_kr1], axis=0)
accuracies_kr1

**EXERCISE**: 
* Play around with different regularization parameters (currently 0.00025)
* Adapt the model definition above to an `L2` kernel regularizer

### Activity Regularization

Activity regularization puts a penalty on the size of the activations, i.e., the outputs of a layer. In TensorFlow it can be implemented via a separate layer `ActivityRegularization`, which takes parameters `l1` and `l2` for the weights of the corresponding penalties.
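
In contrast to weight regularization, the penalty here depends on the layer outputs and therefore on the input data. The NumPy sketch below computes an L1 activity penalty for a small made-up batch. Averaging the penalty over the batch is an assumption of this sketch; the exact normalization used by the library may differ.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 5))    # made-up batch: 4 samples, 5 features
weights = rng.normal(size=(5, 3))   # made-up weights of a layer with 3 units

# Outputs of the layer for this batch.
activations = relu(inputs @ weights)

l1 = 0.00025
# L1 activity penalty: sum of absolute activations per sample,
# averaged over the batch (the averaging is an assumption of this sketch).
activity_penalty = l1 * np.sum(np.abs(activations)) / activations.shape[0]
print(activity_penalty)
```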

#### L1

In [None]:
mnist_classifier_ar1 = tf.keras.Sequential([
    tf.keras.layers.Input(shape = (28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(300, activation='relu'),
    tf.keras.layers.ActivityRegularization(l1=0.00025),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.ActivityRegularization(l1=0.00025),
    tf.keras.layers.Dense(10, activation='softmax')
])

In [None]:
classi_weights_path_ar1 = './pretrained/mnist_small_classifier_ar1.weights.h5'
classi_history_path_ar1 = './pretrained/mnist_small_classifier_ar1.history.h5'

accuracies_ar1 =  train_analyse_model(mnist_classifier_ar1, 'Activity Reg. L1', 
                                      train_from_scratch, classi_weights_path_ar1, classi_history_path_ar1, 
                                      train_images, train_labels_OH, val_images, val_labels_OH, test_images, test_labels,
                                      nEpochs = 100, nPatience = 10)

accuracies_df = pd.concat([accuracies_df, accuracies_ar1], axis=0)

**EXERCISE**: 
* Play around with different regularization parameters (currently 0.00025)
* Adapt the model definition above to use an `l2` activity regularization

## Performance Comparison
Below we compare the performance of different models. Note that **you can do this comparison even if you did not do all the exercises above** - it will simply work with all the models you have evaluated.

In [None]:
accuracies_df_long = accuracies_df.melt(id_vars = 'Method')
accuracies_df_long

In [None]:
accuracies_df_long.rename(columns={'variable': 'Dataset', 'value': 'Accuracy' }, inplace=True)
accuracies_df_long['Accuracy'] = 100*accuracies_df_long['Accuracy']

In [None]:
sns.barplot(data=accuracies_df_long, x='Accuracy', y='Method', hue='Dataset')
plt.legend(loc='lower left')
plt.xlabel('Accuracy [%]')
plt.ylabel('Network Type')
plt.grid()
plt.show()

In [None]:
accuracies_df_long['Error Rate'] = 100-accuracies_df_long['Accuracy'] 

In [None]:
sns.barplot(data=accuracies_df_long, x='Error Rate', y='Method', hue='Dataset')
plt.legend(loc='lower center')
plt.xlabel('Error Rate [%]')
plt.ylabel('Network Type')
plt.grid()
plt.show()

**EXERCISE**: Comment on these results. Which method do you think works best?