# Experiments with Model Architecture using the MNIST Digits Data Set

Project Repository: https://github.com/albert-kepner/Week6_DL_FinalProject


For this project we built several models to classify the handwritten digits in the MNIST data set. This is a supervised, multi-class classification problem. We compare the models on both test accuracy and training run time. We also compared some of the models using successively smaller training samples to see how accuracy changes with less data.



## Exploratory Data Analysis

This is a well-known data set of monochrome 28 by 28 pixel images of single handwritten digits 0-9. The data set is included in Keras. It is organized as 60,000 training images and 10,000 test images. No data cleaning is needed other than scaling the pixel values to floats in the range [0, 1).

The functions we used to load the data set, along with more details, are in a separate notebook here:

https://github.com/albert-kepner/Week6_DL_FinalProject/blob/master/Exploratory_Data_Analysis.ipynb

## Model Architectures

We compare three types of Keras sequential models for this classification problem:

1. Baseline Model -- Simple dense network with no hidden layers

1. Deeper Dense models -- comparing fully connected dense networks with 1, 2, 3, or 4 hidden layers and also varying the number of hidden units per layer.

1. Four variations on a convolutional neural network or CNN.

## Baseline Dense Model (no hidden layers)

This model is about the simplest possible Keras sequential model. It has just an input layer, which takes a flattened vector of 784 pixel values, and an output layer of 10 units with softmax activation, since this is a 10-class classification problem. Initially we use SGD as the optimizer. The model is shown here for reference.

In [1]:

import tensorflow as tf
from tensorflow import keras

def create_baseline_model():
    DIGIT_CLASSES = 10
    RESHAPED = 28 * 28  # 784 pixels per image
    # Build the model: a single softmax output layer fed by the flattened pixels
    model = tf.keras.models.Sequential()
    model.add(keras.layers.Dense(DIGIT_CLASSES,
                                 input_shape=(RESHAPED,),
                                 name='dense_layer',
                                 activation='softmax'))
    return model

In [2]:
%run create_baseline_model.py
model = create_baseline_model()
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_layer (Dense)         (None, 10)                7850      
                                                                 
Total params: 7,850
Trainable params: 7,850
Non-trainable params: 0
_________________________________________________________________
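As a reminder of what the softmax output layer of this model computes, here is a minimal pure-Python sketch. It is not the Keras implementation; the function name `softmax` and the example logits are illustrative assumptions. In the real model the 10 logits come from the dense layer's weights and biases.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the resulting probabilities.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the 10 digit classes (made-up values).
example_logits = [0.1, 2.0, -1.0, 0.5, 0.0, 3.0, -0.5, 1.0, 0.2, -2.0]
probs = softmax(example_logits)

# The predicted digit is the class with the highest probability.
predicted_digit = probs.index(max(probs))
print(predicted_digit, sum(probs))
```

The predicted class is simply the index of the largest probability, which is why a 10-class problem needs exactly 10 output units.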


Details of training this model are in a separate notebook here:

https://github.com/albert-kepner/Week6_DL_FinalProject/blob/master/Baseline_Dense_Model.ipynb
    
The results of 5 training sessions, each using 60,000 samples for training and evaluated on the 10,000-sample test set, are shown below.
The test accuracy was about 0.922 and was very repeatable across the 5 sessions (std = 0.000249). The average training time was about 237 seconds. These runs used a GPU.

In [3]:
import pandas as pd
df = pd.read_csv('data/BASELINE_DATA_2022-09-30_18_38.csv')
display(df)
df.describe()

| Unnamed: 0 | training_time_sec | test_accuracy |
| --- | --- | --- |
| 0 | 233.948682 | 0.9223 |
| 1 | 232.309703 | 0.9221 |
| 2 | 240.762712 | 0.9222 |
| 3 | 231.556649 | 0.9221 |
| 4 | 245.934716 | 0.9227 |


| Unnamed: 0 | training_time_sec | test_accuracy |
| --- | --- | --- |
| count | 5.0 | 5.0 |
| mean | 236.902493 | 0.92228 |
| std | 6.222522 | 0.000249 |
| min | 231.556649 | 0.9221 |
| 25% | 232.309703 | 0.9221 |
| 50% | 233.948682 | 0.9222 |
| 75% | 240.762712 | 0.9223 |
| max | 245.934716 | 0.9227 |
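The mean and std rows can be reproduced by hand from the five runs. The sketch below uses only the Python standard library and assumes the reported std is the sample standard deviation (dividing by n - 1), which is the pandas default.

```python
import statistics

# Values copied from the five baseline runs in the table above.
test_accuracy = [0.9223, 0.9221, 0.9222, 0.9221, 0.9227]
training_time_sec = [233.948682, 232.309703, 240.762712,
                     231.556649, 245.934716]

mean_accuracy = statistics.mean(test_accuracy)
std_accuracy = statistics.stdev(test_accuracy)   # sample std (n - 1)
mean_time = statistics.mean(training_time_sec)
std_time = statistics.stdev(training_time_sec)   # sample std (n - 1)

print(f"accuracy: mean={mean_accuracy:.5f} std={std_accuracy:.6f}")
print(f"time (s): mean={mean_time:.6f} std={std_time:.6f}")
```

With only five runs the std is a rough estimate, but it is small enough relative to the mean to support the claim that the accuracy is repeatable.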


## Deeper Dense Models

Below is the model creation function. We used this to configure from 1 to 4 hidden layers and tried 64, 128, and 192 as the number of units per hidden layer.

In [4]:
import tensorflow as tf
from tensorflow import keras

def create_dense_model(hidden_layers=1, hidden_units_per_layer=128):
    """
    Create a Keras sequential model with the specified
    number of hidden_layers and number of units per layer.
    """
    DIGIT_CLASSES = 10
    RESHAPED = 28 * 28 ## 784 pixels per image
    model = tf.keras.models.Sequential()
    count = 0
    
    
    count += 1
    model.add(keras.layers.Dense(units=hidden_units_per_layer,
        input_shape=(RESHAPED,),
        name=f'dense_layer_{count}', activation='relu'))
    
    # The first hidden layer was added above, so add the remaining hidden_layers - 1
    for _ in range(1, hidden_layers):
        count += 1
        model.add(keras.layers.Dense(units=hidden_units_per_layer,
           name=f'dense_layer_{count}', activation='relu'))
        
    count += 1        
    model.add(keras.layers.Dense(DIGIT_CLASSES,
       name=f'dense_layer_{count}', activation='softmax'))

    # summary of the model
    model.summary()
    return model

The model summaries for two of these configurations are shown here:

In [5]:
%run create_dense_model.py
create_dense_model(hidden_layers=1, hidden_units_per_layer=64)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_layer_1 (Dense)       (None, 64)                50240     
                                                                 
 dense_layer_2 (Dense)       (None, 10)                650       
                                                                 
Total params: 50,890
Trainable params: 50,890
Non-trainable params: 0
_________________________________________________________________


<keras.engine.sequential.Sequential at 0x287b2e6c130>

In [6]:
create_dense_model(hidden_layers=4, hidden_units_per_layer=192)

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_layer_1 (Dense)       (None, 192)               150720    
                                                                 
 dense_layer_2 (Dense)       (None, 192)               37056     
                                                                 
 dense_layer_3 (Dense)       (None, 192)               37056     
                                                                 
 dense_layer_4 (Dense)       (None, 192)               37056     
                                                                 
 dense_layer_5 (Dense)       (None, 10)                1930      
                                                                 
Total params: 263,818
Trainable params: 263,818
Non-trainable params: 0
_________________________________________________________________


<keras.engine.sequential.Sequential at 0x287c43a4310>
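The parameter totals in these summaries follow directly from the layer sizes: each dense layer has one weight per input-output pair plus one bias per output unit. The helper below is a small sketch for checking that arithmetic; `dense_param_count` is a hypothetical name and is not part of the project code.

```python
def dense_param_count(hidden_layers, hidden_units_per_layer,
                      inputs=28 * 28, classes=10):
    """Count weights and biases in a fully connected network."""
    # Layer widths from input to output, e.g. [784, 64, 10].
    sizes = [inputs] + [hidden_units_per_layer] * hidden_layers + [classes]
    # Each Dense layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

print(dense_param_count(0, 0))     # baseline model: no hidden layers
print(dense_param_count(1, 64))    # smallest configuration
print(dense_param_count(4, 192))   # largest configuration
```

Most of the parameters sit in the first layer, because it connects all 784 pixels to every hidden unit.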

## Deeper Dense Model Comparisons

We trained the above model with all 12 combinations of number of hidden layers {1,2,3,4} and number of units per hidden layer {64,128,192}. For each of these combinations we tried 3 variations:

* Optimizer=SGD
* Optimizer=Adam
* Optimizer=Adam with a Dropout layer added after each hidden layer

This gives 36 model variations in total. Each of these was run with a training set of 50,000 images, a validation set of 10,000 images, and a held-out test set of 10,000 images. For each run we used early stopping based on validation loss, to avoid overfitting (and to avoid wasted training time). All of these runs were done on a laptop using a GPU.
For each run we captured the elapsed training time and the accuracy score on the test images.
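The early stopping rule can be illustrated without any framework. The sketch below is a simplified pure-Python version of the idea, not the Keras `EarlyStopping` callback itself; the function name, the `patience` value, and the validation losses are all made-up assumptions, since the settings used in these runs are not stated here.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember it and reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop: no improvement for `patience` epochs
    return len(val_losses) - 1  # ran through every epoch without stopping

# Hypothetical validation losses: improvement stalls after epoch 2.
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)
```

Because training ends as soon as validation loss stops improving, the number of epochs varies from run to run, which is one reason the elapsed times differ so much between configurations.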

All of these models were more accurate than the baseline model described above (test accuracy = 0.922). The test accuracy of all the models was surprisingly close, in a range from 0.966 to 0.980. However, there was a huge difference in training times, from 10.6 seconds to 312 seconds.

The best and worst models by training time and test accuracy are shown below:

In [7]:
## Which model configurations had the shortest and longest training times?
import pandas as pd
df = pd.read_csv('data/ALL_DENSE_MODELS.csv')

min_row = df.iloc[df['elapsed_time'].argmin(),:]
display(pd.DataFrame(min_row).T)
max_row = df.iloc[df['elapsed_time'].argmax(),:]
display(pd.DataFrame(max_row).T)

| Unnamed: 0 | hidden_layers | hidden_units_per_layer | elapsed_time | test_accuracy | model_type |
| --- | --- | --- | --- | --- | --- |
| 17 | 2 | 192 | 10.560391 | 0.9767 | Adam Optimizer |


| Unnamed: 0 | hidden_layers | hidden_units_per_layer | elapsed_time | test_accuracy | model_type |
| --- | --- | --- | --- | --- | --- |
| 2 | 1 | 192 | 311.766825 | 0.9762 | SGD Optimizer |


In [8]:
## Which model configurations had the best and worst test accuracy?
max_row = df.iloc[df['test_accuracy'].argmax(),:]
display(pd.DataFrame(max_row).T)
min_row = df.iloc[df['test_accuracy'].argmin(),:]
display(pd.DataFrame(min_row).T)

| Unnamed: 0 | hidden_layers | hidden_units_per_layer | elapsed_time | test_accuracy | model_type |
| --- | --- | --- | --- | --- | --- |
| 14 | 1 | 192 | 19.091164 | 0.9803 | Adam Optimizer |


| Unnamed: 0 | hidden_layers | hidden_units_per_layer | elapsed_time | test_accuracy | model_type |
| --- | --- | --- | --- | --- | --- |
| 9 | 4 | 64 | 79.026243 | 0.9658 | SGD Optimizer |


More details summarizing model performance of the 36 configurations are here:
    
https://github.com/albert-kepner/Week6_DL_FinalProject/blob/master/Deeper_Dense_Models_Comparison.ipynb    

Notebooks used to train each type of dense model are here:
    
* https://github.com/albert-kepner/Week6_DL_FinalProject/blob/master/Deeper_Dense_Models_SGD.ipynb
* https://github.com/albert-kepner/Week6_DL_FinalProject/blob/master/Deeper_Dense_Models_Adam.ipynb
* https://github.com/albert-kepner/Week6_DL_FinalProject/blob/master/Deeper_Dense_Models_Adam_Dropout.ipynb

## CNN Models

We have compared training time and test accuracy for four variations of a CNN model on this data set.

* Model 1 -- We started with a version of a LeNet CNN from <ins>Deep Learning with TensorFlow 2 and Keras, 2nd Edition</ins> (Reference 3). This model has an input layer which takes 28 by 28 by 1 channel images. It has two Convolution2D layers with (5, 5) kernels, interleaved with two MaxPooling2D layers with pool size (2, 2) and strides (2, 2). The Convolution2D layers use valid padding, which means that each of these layers reduces the output image size by 4 in each dimension. The MaxPooling2D layers reduce the output image size by a factor of 2 in each dimension. Therefore successive layers have dimension sizes as follows:
    - (28 by 28 by 1) input layer
    - (24 by 24 by 20) from Convolution2D with 20 units
    - (12 by 12 by 20) from MaxPooling2D
    - (8 by 8 by 50) from Convolution2D with 50 units
    - (4 by 4 by 50) from MaxPooling2D
    - Flatten layer 800 inputs to next layer
    - Dense layer 500 units
    - Softmax layer 10 units
* Model 1 has 431,080 trainable parameters according to the Keras summary below.
    
* Model 2 -- Starting from Model 1, we changed the Convolution2D layers to use (3, 3) kernels and same padding, and increased the first Convolution2D layer from 20 to 40 filters. The layers work out as follows:
    - (28 by 28 by 1) input layer
    - (28 by 28 by 40) from Convolution2D
    - (14 by 14 by 40) from MaxPooling2D
    - (14 by 14 by 50) from Convolution2D
    - (7 by 7 by 50) from MaxPooling2D
    - Flatten layer 2450 inputs to next layer
    - Dense layer 500 units
    - Softmax layer 10 units
* Model 2 has 1,248,960 trainable parameters, which is more than Model 1 because the successive layers do not reduce the dimensions as much.

* Model 3 -- Starting from Model 2, we added an additional Convolution2D layer with a (3, 3) kernel and valid padding after the last max pooling layer. This layer reduces the vertical and horizontal dimensions by 2, from 7 by 7 to 5 by 5. The layers work out as follows:
    - (28 by 28 by 1) input layer
    - (28 by 28 by 40) from Convolution2D
    - (14 by 14 by 40) from MaxPooling2D
    - (14 by 14 by 50) from Convolution2D
    - (7 by 7 by 50) from MaxPooling2D
    - (5 by 5 by 50) from Convolution2D
    - Flatten layer 1250 inputs to next layer
    - Dense layer 500 units
    - Softmax layer 10 units
* Model 3 has 671,510 trainable parameters, fewer than Model 2 because a smaller feature map is passed to the first dense layer.

* Model 4 -- This is a variation on Model 1, which had 2 Convolution2D layers with 5 by 5 kernels and valid padding. In this model, each of the original Convolution2D layers is replaced by two Convolution2D layers with 3 by 3 kernels and valid padding. Two stacked Convolution2D layers with 3 by 3 kernels have the same effect on the horizontal and vertical dimensions as a single 5 by 5 kernel layer, because each 3 by 3 layer reduces the vertical and horizontal dimensions by 2. Therefore the layers work out as follows:
    - (28 by 28 by 1) input layer
    - (26 by 26 by 20) from Convolution2D with 20 units
    - (24 by 24 by 50) from Convolution2D with 50 units
    - (12 by 12 by 50) from MaxPooling2D
    - (10 by 10 by 50) from Convolution2D with 50 units
    - (8 by 8 by 50) from Convolution2D with 50 units
    - (4 by 4 by 50) from MaxPooling2D
    - Flatten layer 800 inputs to next layer
    - Dense layer 500 units
    - Softmax layer 10 units
* Model 4 has 459,860 trainable parameters, which is similar to Model 1 with 431,080.
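
The flattened sizes and parameter totals quoted for the four models can be checked with a little arithmetic: a valid-padded convolution with a k by k kernel shrinks each spatial dimension by k - 1, same padding leaves it unchanged, and 2 by 2 pooling with stride 2 halves it. The sketch below is a hypothetical helper (`trace_cnn` and the tuple-based layer spec are assumptions for illustration, not project code) that applies those rules and counts weights plus biases.

```python
def trace_cnn(conv_layers, dense_units=(500, 10), size=28, channels=1):
    """Return (flattened size, trainable parameters) for a square-input CNN."""
    params = 0
    for layer in conv_layers:
        if layer[0] == "conv":
            _, filters, kernel, padding = layer
            # Each filter has kernel*kernel*channels weights plus one bias.
            params += (kernel * kernel * channels + 1) * filters
            if padding == "valid":
                size = size - kernel + 1  # valid padding shrinks by k - 1
            channels = filters
        else:
            # "pool": 2x2 max pooling with stride 2 halves each dimension.
            size //= 2
    flat = size * size * channels
    fan_in = flat
    for units in dense_units:
        params += (fan_in + 1) * units  # weights plus biases
        fan_in = units
    return flat, params

POOL = ("pool",)
MODELS = {
    "Model 1": [("conv", 20, 5, "valid"), POOL, ("conv", 50, 5, "valid"), POOL],
    "Model 2": [("conv", 40, 3, "same"), POOL, ("conv", 50, 3, "same"), POOL],
    "Model 3": [("conv", 40, 3, "same"), POOL, ("conv", 50, 3, "same"), POOL,
                ("conv", 50, 3, "valid")],
    "Model 4": [("conv", 20, 3, "valid"), ("conv", 50, 3, "valid"), POOL,
                ("conv", 50, 3, "valid"), ("conv", 50, 3, "valid"), POOL],
}

for name, spec in MODELS.items():
    flat, params = trace_cnn(spec)
    print(f"{name}: flatten={flat}, params={params:,}")
```

In every model the first dense layer dominates the total, so the size of the flattened feature map explains most of the difference in parameter counts.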


Functions to define the 4 models and a Keras summary() of each model follow:

In [9]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import datasets, layers, models, optimizers

In [10]:
def create_CNN_Model1():
    DIGIT_CLASSES = 10
    IMG_ROWS, IMG_COLS = 28, 28 # input image dimensions
    INPUT_SHAPE = (IMG_ROWS, IMG_COLS, 1)
    model = models.Sequential()
    # CONV => RELU => POOL
    model.add(layers.Convolution2D(20, (5, 5), activation='relu',
        input_shape=INPUT_SHAPE))
    model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    # CONV => RELU => POOL
    model.add(layers.Convolution2D(50, (5, 5), activation='relu'))
    model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    # Flatten => RELU layers
    model.add(layers.Flatten())
    model.add(layers.Dense(500, activation='relu'))
    # a softmax classifier
    model.add(layers.Dense(DIGIT_CLASSES, activation="softmax"))
    return model
  
model1 = create_CNN_Model1()
model1.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 24, 24, 20)        520       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 12, 12, 20)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 8, 8, 50)          25050     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 4, 4, 50)         0         
 2D)                                                             
                                                                 
 flatten (Flatten)           (None, 800)               0         
                                                                 
 dense (Dense)               (None, 500)              

In [11]:
def create_CNN_Model2():
    DIGIT_CLASSES = 10
    IMG_ROWS, IMG_COLS = 28, 28 # input image dimensions
    INPUT_SHAPE = (IMG_ROWS, IMG_COLS, 1)
    model = models.Sequential()
    # CONV => RELU => POOL
    model.add(layers.Convolution2D(40, (3, 3), activation='relu', padding='same',
        input_shape=INPUT_SHAPE))
    model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    # CONV => RELU => POOL
    model.add(layers.Convolution2D(50, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    # Flatten => RELU layers
    model.add(layers.Flatten())
    model.add(layers.Dense(500, activation='relu'))
    # a softmax classifier
    model.add(layers.Dense(DIGIT_CLASSES, activation="softmax"))
    return model

model2 = create_CNN_Model2()
model2.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_2 (Conv2D)           (None, 28, 28, 40)        400       
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 14, 14, 40)       0         
 2D)                                                             
                                                                 
 conv2d_3 (Conv2D)           (None, 14, 14, 50)        18050     
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 7, 7, 50)         0         
 2D)                                                             
                                                                 
 flatten_1 (Flatten)         (None, 2450)              0         
                                                                 
 dense_2 (Dense)             (None, 500)              

In [12]:
def create_CNN_Model3():
    DIGIT_CLASSES = 10
    IMG_ROWS, IMG_COLS = 28, 28 # input image dimensions
    INPUT_SHAPE = (IMG_ROWS, IMG_COLS, 1)
    model = models.Sequential()
    # CONV => RELU => POOL
    model.add(layers.Convolution2D(40, (3, 3), activation='relu', padding='same',
        input_shape=INPUT_SHAPE))
    model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    # CONV => RELU => POOL
    model.add(layers.Convolution2D(50, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    # CONV => RELU
    model.add(layers.Convolution2D(50, (3, 3), activation='relu', padding='valid'))
    # Flatten => RELU layers
    model.add(layers.Flatten())
    model.add(layers.Dense(500, activation='relu'))
    # a softmax classifier
    model.add(layers.Dense(DIGIT_CLASSES, activation="softmax"))
    return model

model3 = create_CNN_Model3()
model3.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_4 (Conv2D)           (None, 28, 28, 40)        400       
                                                                 
 max_pooling2d_4 (MaxPooling  (None, 14, 14, 40)       0         
 2D)                                                             
                                                                 
 conv2d_5 (Conv2D)           (None, 14, 14, 50)        18050     
                                                                 
 max_pooling2d_5 (MaxPooling  (None, 7, 7, 50)         0         
 2D)                                                             
                                                                 
 conv2d_6 (Conv2D)           (None, 5, 5, 50)          22550     
                                                                 
 flatten_2 (Flatten)         (None, 1250)             

In [13]:
def create_CNN_Model4():
    DIGIT_CLASSES = 10
    IMG_ROWS, IMG_COLS = 28, 28 # input image dimensions
    INPUT_SHAPE = (IMG_ROWS, IMG_COLS, 1)
    model = models.Sequential()
    # CONV => RELU => CONV => RELU => POOL
    model.add(layers.Convolution2D(20, (3,3), activation='relu',
        input_shape=INPUT_SHAPE))
    model.add(layers.Convolution2D(50, (3,3), activation='relu'))
    model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    # CONV => RELU => CONV => RELU => POOL
    model.add(layers.Convolution2D(50, (3,3), activation='relu'))
    model.add(layers.Convolution2D(50, (3,3), activation='relu'))
    model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    # Flatten => RELU layers
    model.add(layers.Flatten())
    model.add(layers.Dense(500, activation='relu'))
    # a softmax classifier
    model.add(layers.Dense(DIGIT_CLASSES, activation="softmax"))
    return model
  
model4 = create_CNN_Model4()
model4.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_7 (Conv2D)           (None, 26, 26, 20)        200       
                                                                 
 conv2d_8 (Conv2D)           (None, 24, 24, 50)        9050      
                                                                 
 max_pooling2d_6 (MaxPooling  (None, 12, 12, 50)       0         
 2D)                                                             
                                                                 
 conv2d_9 (Conv2D)           (None, 10, 10, 50)        22550     
                                                                 
 conv2d_10 (Conv2D)          (None, 8, 8, 50)          22550     
                                                                 
 max_pooling2d_7 (MaxPooling  (None, 4, 4, 50)         0         
 2D)                                                  