Build and train a CNN+MLP deep learning model with Keras with two followings for MNIST dataset:
```
1. Conv2D(32, kernel_size=(3, 3), activation='relu')
2. Conv2D(64, kernel_size=(3, 3), activation='relu')
3. MaxPooling2D(pool_size=(2, 2))
4. Dense(128, activation='relu')
5. Dense(num_classes, activation='softmax')
```
Also build another model with BatchNormalization and Dropout. 
Compare these two models performance for test data

## Import Packages

In [29]:
import keras
from keras import backend as K
# CNN and MLP architecture
from keras.models import Sequential
from keras.layers import (
    Dense,
    Conv2D,
    MaxPooling2D,
    UpSampling2D,
    Dropout,
    Flatten,
    BatchNormalization
)
from keras.models import Model
from keras.optimizers import SGD
from keras.initializers import RandomNormal
# MNIST
from keras.datasets import mnist
# Data normalization
from sklearn.preprocessing import StandardScaler
# Mathematical Computation and Timing
import numpy as np
import time

## 1 - Data Preparation

In [2]:
# Image Dimensions
img_rows, img_cols = 28, 28

# Splitting Data between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# store the number of labels 
num_classes = len(np.unique(y_train))

# Reshaping Data
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# Displaying Resulting Dimensions
print(f'Shape of X_train: {x_train.shape}')
print(f'Shape of X_test: {x_test.shape}')

Shape of X_train: (60000, 28, 28, 1)
Shape of X_test: (10000, 28, 28, 1)


## 2 - Data Normalization

In [3]:
x_train = x_train/np.max(x_train)
x_test = x_test/np.max(x_train)

## 3 - One Hot Encoding

In [4]:
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

## 4 - Define Model (No Batch Normalization/Dropout)

In [6]:
# Instanitate a model using the Sequential API
fully_connected = Sequential()

# Convolutional Layers
fully_connected.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                           input_shape=(28, 28, 1)))
fully_connected.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
fully_connected.add(MaxPooling2D(pool_size=(2, 2)))  # no learning params
fully_connected.add(Flatten())

# MLP Layers
fully_connected.add(Dense(128, activation='relu'))
fully_connected.add(Dense(num_classes, activation='softmax'))

# Compile Model
fully_connected.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

# Print Summary
fully_connected.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 24, 24, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 12, 64)        0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 9216)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               1179776   
_________________________________________________________________
dense_4 (Dense)              (None, 10)                1290      
Total params: 1,199,882
Trainable params: 1,199,882
Non-trainable params: 0
_________________________________________________________________


## Train the model
(No Batch Normalization or Dropout here)

In [31]:
def time_fitting(fitter):
    """Record and display the time taken to train the model.
    
       Parameters:
       fitter(function): executed to train the model
       
       Returns: None
    
    """
    start = time.time()
    fitter()
    end = time.time()
    print(f'Fitting Time: {end-start} miliseconds')
    return None

In [32]:
time_fitting(lambda:
                fully_connected.fit(x_train, y_train,
                    epochs=3, batch_size=100,
                    validation_data=(x_test, y_test),
                    verbose=0)
            )

Fitting Time: 179.19434523582458 miliseconds


## Define Model with Batch Normalization & Dropout

In [12]:
# Define rate of dropout
drop_rate = 0.25

# Instanitate a model using the Sequential API
partially_connected = Sequential()

# Convolutional Layer 1
partially_connected.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                           input_shape=(28, 28, 1)))
partially_connected.add(BatchNormalization())
partially_connected.add(Dropout(rate=drop_rate))

# Convolutional Layer 2
partially_connected.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
partially_connected.add(BatchNormalization())
partially_connected.add(MaxPooling2D(pool_size=(2, 2)))
partially_connected.add(Dropout(rate=drop_rate))

# Flatten the samples to go into MLP
partially_connected.add(Flatten())

# MLP Layer 1
partially_connected.add(Dense(128, activation='relu'))
partially_connected.add(BatchNormalization())
partially_connected.add(Dropout(rate=drop_rate))

# MLP Output layer
partially_connected.add(Dense(num_classes, activation='softmax'))

# Compile Model
partially_connected.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

# Print Summary
partially_connected.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_7 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
batch_normalization_4 (Batch (None, 26, 26, 32)        128       
_________________________________________________________________
dropout_4 (Dropout)          (None, 26, 26, 32)        0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 24, 24, 64)        18496     
_________________________________________________________________
batch_normalization_5 (Batch (None, 24, 24, 64)        256       
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 12, 12, 64)        0         
_________________________________________________________________
dropout_5 (Dropout)          (None, 12, 12, 64)        0         
__________

## Train the Updated Model
(meaning we have Batch Normalization and Dropout)

In [33]:
time_fitting(lambda:
                 partially_connected.fit(x_train, y_train,
                    epochs=3, batch_size=100,
                    validation_data=(x_test, y_test),
                    verbose=0)
            )

Fitting Time: 395.78104400634766 miliseconds


## Compare Models

In [26]:
def evaluate_model(model, x_test, y_test, signifier):
    """
    Display the results of evaluating the Convolutional Neural Network.
    
    Parameters:
    model(Sequential): the CNN + MLP neural network 
    x_test(np.array): the testing inputs of the MNIST dataset
    y_test(np.array): one hot vector of testing outputs from MNIST dataset
    signifier(str): clarifies which model is being tested: 'fully-connected'
                    or 'partially-connected'
    
    Returns: None 

    """
    # compute loss and accuracy
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    # convert accuracy to percentage
    accuracy = round(accuracy*100, 2)
    # print the loss and accuracy
    print(f'Loss of model with {signifier} connections: {loss}')
    print(f'Accuracy of model with {signifier} connections: {accuracy}%')
    return None

### 1: Testing the Model without Batch Normalization or Dropout

In [34]:
evaluate_model(fully_connected,
               x_test, y_test,
               'fully-connected')

Loss of model with fully-connected connections: 0.15895133051872254
Accuracy of model with fully-connected connections: 99.0%


### 2: Testing the Model with Batch Normalization or Dropout

In [35]:
evaluate_model(partially_connected,
               x_test, y_test,
               'partially-connected')

Loss of model with partially-connected connections: 1.954911483001709
Accuracy of model with partially-connected connections: 87.76%


## Final Conclusion

**Which Model Performs Better?**

As you can see above, the CNN model trained with full connections and no batch normalization achieved a lower loss and higher accuracy than the one that was.

However even though the CNN that included dropout and batch normalization was not as accurate, it does carry some positive trade-offs. For instance, it still has an accuracy of 80%, which is respectable enough in the context of classifying hand-written digits. Additionally, this model carries less chance of being overfitted, which may make it more trustworthy to use in production. Secondly, because the model used batch normalization it was able to converge faster than the one that did not. This may suggest it uses computing resources more efficiently than the other model.