# Experiments with Model Architecture using the MNIST Digits Data Set


In this project we build a number of different models to classify the handwritten digits in the MNIST data set. We compare the models on both test-set accuracy and run time. We also compare some of the models using successively smaller training samples to see how accuracy changes with less data.



## Exploratory Data Analysis

## Model Architectures

We compare three types of Keras sequential models for this classification problem:

1. Baseline Model -- Simple dense network with no hidden layers

1. Deeper Dense models -- comparing fully connected dense networks with 1, 2, 3, or 4 hidden layers and also varying the number of hidden units per layer.

1. Four variations on a convolutional neural network or CNN.

## Baseline Dense Model

This is about the simplest possible Keras sequential model. It has just an input layer, which takes a flattened vector of 784 pixel values, and an output layer of 10 units with softmax activation, since this is a 10-class classification problem. Initially we will use SGD as the optimizer. The model is shown here for reference.

In [None]:

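import tensorflow as tf
from tensorflow import keras


def create_baseline_model():
    """Create a single-layer softmax classifier for flattened MNIST images."""
    DIGIT_CLASSES = 10
    RESHAPED = 28 * 28  # 784 pixels per image
    # Build the model: one dense output layer, no hidden layers
    model = tf.keras.models.Sequential()
    model.add(keras.layers.Dense(DIGIT_CLASSES,
                                 input_shape=(RESHAPED,),
                                 name='dense_layer',
                                 activation='softmax'))
    return model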

In [5]:
%run create_baseline_model.py
model = create_baseline_model()
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_layer (Dense)         (None, 10)                7850      
                                                                 
Total params: 7,850
Trainable params: 7,850
Non-trainable params: 0
_________________________________________________________________
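To make the summary above concrete, here is a minimal pure-Python sketch of what a single `Dense(10, activation='softmax')` layer computes, along with the parameter arithmetic behind the 7,850 figure. The function names and the random weights are illustrative assumptions, not part of the project code. A trained model would have learned weights instead.

```python
import math
import random

NUM_PIXELS = 28 * 28   # flattened input size, as in the model above
NUM_CLASSES = 10


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_softmax_forward(pixels, weights, biases):
    """Compute softmax(W x + b) for one image (hypothetical helper)."""
    logits = [
        sum(p * w for p, w in zip(pixels, class_weights)) + b
        for class_weights, b in zip(weights, biases)
    ]
    return softmax(logits)


# Illustrative random values only; these are not trained weights.
rng = random.Random(0)
weights = [[rng.uniform(-0.05, 0.05) for _ in range(NUM_PIXELS)]
           for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES
pixels = [rng.random() for _ in range(NUM_PIXELS)]

probs = dense_softmax_forward(pixels, weights, biases)
predicted_digit = probs.index(max(probs))

# 784 weights per class plus one bias per class
num_params = NUM_PIXELS * NUM_CLASSES + NUM_CLASSES
print(num_params)  # 7850
```

The predicted digit is simply the index of the largest probability. With no hidden layers, the model is effectively a multinomial logistic regression on raw pixels.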


Details of training this model are in a separate notebook here: ...
    
The results of 5 training sessions, each using 60,000 samples for training and the 10,000-sample test set for evaluation, are shown below.
The test accuracy was about 0.922 and was very repeatable across the 5 sessions (std = 0.000249). The average training time was about 237 seconds. These runs used a GPU.

In [4]:
import pandas as pd
df = pd.read_csv('data/BASELINE_DATA_2022-09-30_18_38.csv')
display(df)
df.describe()

| | training_time_sec | test_accuracy |
|---|---|---|
| 0 | 233.948682 | 0.9223 |
| 1 | 232.309703 | 0.9221 |
| 2 | 240.762712 | 0.9222 |
| 3 | 231.556649 | 0.9221 |
| 4 | 245.934716 | 0.9227 |


| | training_time_sec | test_accuracy |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 236.902493 | 0.92228 |
| std | 6.222522 | 0.000249 |
| min | 231.556649 | 0.9221 |
| 25% | 232.309703 | 0.9221 |
| 50% | 233.948682 | 0.9222 |
| 75% | 240.762712 | 0.9223 |
| max | 245.934716 | 0.9227 |


## Deeper Dense Models

Below is the model creation function. We used this to configure from 1 to 4 hidden layers and tried 64, 128, and 192 as the number of units per hidden layer.

In [None]:
import tensorflow as tf
from tensorflow import keras


def create_dense_model(hidden_layers=1, hidden_units_per_layer=128):
    """
    Create a Keras sequential model with the specified
    number of hidden layers and number of units per layer.
    """
    DIGIT_CLASSES = 10
    RESHAPED = 28 * 28  # 784 pixels per image
    model = tf.keras.models.Sequential()
    count = 0  # running layer index, used to name the layers

    # First hidden layer; it also defines the input shape
    count += 1
    model.add(keras.layers.Dense(units=hidden_units_per_layer,
        input_shape=(RESHAPED,),
        name=f'dense_layer_{count}', activation='relu'))

    # Remaining hidden layers
    for _ in range(1, hidden_layers):
        count += 1
        model.add(keras.layers.Dense(units=hidden_units_per_layer,
           name=f'dense_layer_{count}', activation='relu'))

    # Output layer: one unit per digit class
    count += 1
    model.add(keras.layers.Dense(DIGIT_CLASSES,
       name=f'dense_layer_{count}', activation='softmax'))

    # Print a summary of the model
    model.summary()
    return model

The model summaries for two of these configurations are shown here:

In [1]:
%run create_dense_model.py
create_dense_model(hidden_layers=1, hidden_units_per_layer=64)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_layer_1 (Dense)       (None, 64)                50240     
                                                                 
 dense_layer_2 (Dense)       (None, 10)                650       
                                                                 
Total params: 50,890
Trainable params: 50,890
Non-trainable params: 0
_________________________________________________________________


<keras.engine.sequential.Sequential at 0x1b5c8d7c910>

In [2]:
create_dense_model(hidden_layers=4, hidden_units_per_layer=192)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_layer_1 (Dense)       (None, 192)               150720    
                                                                 
 dense_layer_2 (Dense)       (None, 192)               37056     
                                                                 
 dense_layer_3 (Dense)       (None, 192)               37056     
                                                                 
 dense_layer_4 (Dense)       (None, 192)               37056     
                                                                 
 dense_layer_5 (Dense)       (None, 10)                1930      
                                                                 
Total params: 263,818
Trainable params: 263,818
Non-trainable params: 0
_________________________________________________________________


<keras.engine.sequential.Sequential at 0x1b5c8d7cbb0>
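The parameter counts in the two summaries above follow directly from the layer sizes: each dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces the numbers. The helper name `dense_param_counts` is hypothetical and is not part of the project code.

```python
def dense_param_counts(input_size, layer_units):
    """Return per-layer parameter counts for a stack of Dense layers.

    Each Dense layer has (inputs * units) weights plus one bias per unit.
    """
    counts = []
    fan_in = input_size
    for units in layer_units:
        counts.append(fan_in * units + units)
        fan_in = units  # this layer's outputs feed the next layer
    return counts


# 1 hidden layer of 64 units, then the 10-unit output layer
small = dense_param_counts(784, [64, 10])
# 4 hidden layers of 192 units, then the 10-unit output layer
large = dense_param_counts(784, [192, 192, 192, 192, 10])

print(small, sum(small))  # [50240, 650] 50890
print(large, sum(large))  # [150720, 37056, 37056, 37056, 1930] 263818
```

Note that the first layer dominates the total in both cases, because it connects all 784 input pixels to the first set of hidden units.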

## Deeper Dense Model Comparisons

We trained the above model with all 12 combinations of the number of hidden layers {1, 2, 3, 4} and the number of units per hidden layer {64, 128, 192}. For each of these combinations we tried 3 variations:

* Optimizer=SGD
* Optimizer=Adam
* Optimizer=Adam with a Dropout layer added after each hidden layer
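
The grid of configurations described above can be enumerated with `itertools.product`. This is only a sketch of how such a grid might be built. The dictionary keys mirror the arguments of `create_dense_model`, while the variation labels and the `variation` key are illustrative assumptions.

```python
from itertools import product

HIDDEN_LAYERS = [1, 2, 3, 4]
HIDDEN_UNITS = [64, 128, 192]
# Labels for the three variations; the exact strings are illustrative.
VARIATIONS = ["SGD", "Adam", "Adam + Dropout"]

# One dictionary per model configuration to train
configs = [
    {
        "hidden_layers": layers,
        "hidden_units_per_layer": units,
        "variation": variation,
    }
    for variation, layers, units in product(VARIATIONS, HIDDEN_LAYERS, HIDDEN_UNITS)
]

print(len(configs))  # 36
```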

This gives 36 model variations in total. Each was run with a training set of 50,000 images, a validation set of 10,000 images, and a held-out test set of 10,000 images. For each run we used early stopping based on validation loss to avoid overfitting (and wasted training time). All of these runs were done on a laptop with a GPU.
For each run we captured the elapsed training time and the accuracy on the test images.
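
The early-stopping rule mentioned above can be illustrated independently of Keras. In Keras this is typically handled by the `tf.keras.callbacks.EarlyStopping` callback with `monitor='val_loss'`. The sketch below shows the same patience-based logic in plain Python. The patience value and the loss values are made up for illustration, since the settings used in these runs are not stated here.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. If that never happens, the index of
    the last epoch is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation losses for illustration (not measured results).
losses = [0.40, 0.31, 0.27, 0.26, 0.27, 0.28, 0.29, 0.25]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)  # 6
```

In this example the loss stops improving after epoch 3, so training halts at epoch 6 and the later epochs are never run. That is where the saving in training time comes from.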

All of these models were more accurate than the baseline model described above (test accuracy = 0.922). The test accuracies of all the models were surprisingly close, ranging from 0.966 to 0.980. However, there was a huge difference in training times, from 10.6 seconds to 312 seconds.

The fastest and slowest models by training time, and the most and least accurate models on the test set, are shown below:

In [2]:
## Which model configurations had the shortest and longest training times?
import pandas as pd
df = pd.read_csv('data/ALL_DENSE_MODELS.csv')

min_row = df.iloc[df['elapsed_time'].argmin(),:]
display(pd.DataFrame(min_row).T)
max_row = df.iloc[df['elapsed_time'].argmax(),:]
display(pd.DataFrame(max_row).T)

| | hidden_layers | hidden_units_per_layer | elapsed_time | test_accuracy | model_type |
|---|---|---|---|---|---|
| 17 | 2 | 192 | 10.560391 | 0.9767 | Adam Optimizer |


| | hidden_layers | hidden_units_per_layer | elapsed_time | test_accuracy | model_type |
|---|---|---|---|---|---|
| 2 | 1 | 192 | 311.766825 | 0.9762 | SGD Optimizer |


In [4]:
## Which model configurations had the best and worst test accuracy?
max_row = df.iloc[df['test_accuracy'].argmax(),:]
display(pd.DataFrame(max_row).T)
min_row = df.iloc[df['test_accuracy'].argmin(),:]
display(pd.DataFrame(min_row).T)

| | hidden_layers | hidden_units_per_layer | elapsed_time | test_accuracy | model_type |
|---|---|---|---|---|---|
| 14 | 1 | 192 | 19.091164 | 0.9803 | Adam Optimizer |


| | hidden_layers | hidden_units_per_layer | elapsed_time | test_accuracy | model_type |
|---|---|---|---|---|---|
| 9 | 4 | 64 | 79.026243 | 0.9658 | SGD Optimizer |


## CNN Models