# Image Classification with Convolutional Neural Networks (CNN) Using MNIST Dataset
#### Ayra Qutub
#### 1708104
#### ECE 449 Lab D31


---


### **Introduction**
The objective of this lab was to build and train a Convolutional Neural Network (CNN) to perform image classification using the MNIST dataset, which contains 28x28 grayscale images of handwritten digits (0-9).

CNNs are a powerful tool for image classification due to their ability to learn spatial hierarchies of features. The CNN architecture used in this report has two convolution layers, each followed by a pooling layer, and then a final fully connected layer.

The key tasks were to explore different hyperparameters, namely the number of filters and the learning rate, and to find the best-performing model through hyperparameter exploration. Once the best hyperparameters were identified, the model was trained on the entire training set and evaluated on the test set.

The lab is structured as an automated pipeline. It includes the expected steps of data ingestion, data preprocessing, model training, and model evaluation. Before the final model is trained, hyperparameter exploration is conducted to find the combination of hyperparameters that results in the greatest accuracy. Each of these steps is implemented as its own function for clarity and reusability.

In [None]:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import layers, models
from sklearn.model_selection import StratifiedKFold
import tensorflow as tf
import numpy as np

### **Data Ingestion**

We are working with the Modified National Institute of Standards and Technology (MNIST) database in this lab. This is a large database of handwritten digits. It contains 60,000 training images and 10,000 testing images, and it is distributed already split into training and test sets. The function `data_ingestion` preserves this split and loads the data for use further in the pipeline.

In [None]:
def data_ingestion():
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    return X_train, y_train, X_test, y_test

### **Data Preprocessing**
Here, the class labels are one-hot encoded. This converts each label into a vector like `[0,0,1,0,...,0]` for multi-class classification. Additionally, the images are reshaped into a 4D array of shape `(samples, 28, 28, 1)` to be compatible with the CNN layers, where the final 1 represents the single grayscale channel.

In [None]:
def data_preprocessing(X_train, y_train, X_test, y_test):

  # convert class labels into one-hot encoding
  y_train_encoded = to_categorical(y_train)
  y_test_encoded = to_categorical(y_test)

  # reshape the data to be in the form (samples, 28, 28, 1) for grayscale images
  X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
  X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)

  return X_train, y_train_encoded, X_test, y_test_encoded
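
To make the two preprocessing steps concrete, the following is a minimal NumPy sketch of what they do to the data. The `one_hot` helper and the `X_demo`/`y_demo` arrays are hypothetical stand-ins introduced for illustration; the lab itself uses `to_categorical` on the real MNIST arrays.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (n_samples, num_classes) array with a 1 at each label index."""
    labels = np.asarray(labels)
    return np.eye(num_classes, dtype="float32")[labels]

# Three blank 28x28 "images" standing in for MNIST samples (hypothetical data)
X_demo = np.zeros((3, 28, 28), dtype="uint8")
y_demo = np.array([2, 0, 9])

y_demo_encoded = one_hot(y_demo)
X_demo_4d = X_demo.reshape(X_demo.shape[0], 28, 28, 1)

print(y_demo_encoded[0])   # 1.0 at index 2, zeros elsewhere
print(X_demo_4d.shape)     # (3, 28, 28, 1)
```

The reshape does not change any pixel values; it only adds the trailing channel axis that `Conv2D` expects.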

### **Build Model**
The model is a CNN whose architecture is based on the one outlined in [1], with alternating convolution and pooling layers.

The convolution layers use a ReLU activation function, `f(x) = max(0, x)`: positive inputs pass through unchanged and all other inputs become 0. These layers extract features from the input images with filters. The number of filters is passed to `build_model` as an argument, which makes it easy to test different values automatically and choose the best one for our model.

The pooling layers downsample the feature maps, reducing their dimensionality while retaining important features. These use max pooling, meaning that only the greatest value within each 2x2 region is preserved.
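
As an illustration of the two operations just described, here is a small NumPy sketch of ReLU followed by 2x2 max pooling on a single feature map. The helper names and the 4x4 input are hypothetical; Keras performs the equivalent steps inside `Conv2D` and `MaxPooling2D`.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: negative values become 0, positive values pass through."""
    return np.maximum(x, 0)

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.array([
    [-1.0,  2.0,  0.5, -3.0],
    [ 4.0, -2.0,  1.0,  0.0],
    [ 0.0,  1.0, -1.0, -2.0],
    [ 3.0, -4.0,  2.0,  6.0],
])

activated = relu(fmap)
pooled = max_pool_2x2(activated)
print(pooled)  # [[4. 1.] [3. 6.]]
```

Each output value is the maximum of one non-overlapping 2x2 block, so the 4x4 map shrinks to 2x2.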

The alternating convolution and pooling layers are followed by a flattening step and a fully connected layer. This maps the final feature maps to the output classes (digits 0-9) using a single dense layer with a softmax activation.

In [None]:
def build_model(filters, learning_rate):
    model = models.Sequential()
    # Alternating convolution and pooling layers

    # First convolution layer
    model.add(layers.Conv2D(filters=filters, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
    # First pooling layer
    model.add(layers.MaxPooling2D(pool_size=(2, 2)))

    # Second convolution layer
    model.add(layers.Conv2D(filters=filters, kernel_size=(3, 3), activation='relu'))
    # Second pooling layer
    model.add(layers.MaxPooling2D(pool_size=(2, 2)))

    # Flattening and fully connected layer
    model.add(layers.Flatten())
    model.add(layers.Dense(10, activation='softmax'))

    # Compile the model
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    return model
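
The layer sizes in `build_model` can be checked with a little arithmetic. The sketch below assumes the Keras defaults used above (no padding, convolution stride 1, pooling stride equal to the pool size, biases enabled); the helper functions are hypothetical and exist only for this calculation.

```python
def conv_out(size, kernel=3):
    """Output width/height of an unpadded convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output width/height of non-overlapping pooling (floor division)."""
    return size // pool

def trace_shapes(filters, size=28):
    """Return the spatial size after each layer and the flattened length."""
    c1 = conv_out(size)      # 28 -> 26
    p1 = pool_out(c1)        # 26 -> 13
    c2 = conv_out(p1)        # 13 -> 11
    p2 = pool_out(c2)        # 11 -> 5
    flat = p2 * p2 * filters
    return [c1, p1, c2, p2], flat

def count_params(filters):
    """Trainable parameters: weights plus biases for each layer."""
    conv1 = (3 * 3 * 1 + 1) * filters
    conv2 = (3 * 3 * filters + 1) * filters
    _, flat = trace_shapes(filters)
    dense = (flat + 1) * 10
    return conv1 + conv2 + dense

for f in (16, 32):
    sizes, flat = trace_shapes(f)
    print(f, sizes, flat, count_params(f))
```

Under these assumptions the 16-filter model has 6,490 trainable parameters and the 32-filter model has 17,578, which can be compared against `model.summary()`.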

### **Hyperparameter Exploration**
The hyperparameters in this CNN are the number of filters (filters, also called kernels, are designed to detect specific patterns or features in the input data) and the learning rate (which controls how much the model weights change in response to the estimated error). Unlike the weights, hyperparameters are not learned during training; they are set beforehand and directly influence the model's structure and performance. Choosing them well improves the accuracy of the model. Here they are selected through hyperparameter exploration.

To conduct this, the model is trained and validated for each combination of hyperparameters. The average validation accuracy of each combination is recorded, and the combination with the highest average accuracy is saved. This will be used for the final model.

The model is trained and validated using Stratified K-Fold cross-validation with 5 folds. The training set is split into K = 5 folds such that each fold contains approximately the same proportion of each digit class as the full training set. This way, the folds reflect the dataset's original class distribution. During validation, one fold serves as the validation set while the other four are used for training. This is repeated so that each fold is used for validation exactly once.
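
The idea of stratification can be sketched in plain Python. The function below is a simplified illustration that deals each class's samples round-robin into the folds; it is not the algorithm used internally by scikit-learn's `StratifiedKFold`, and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds, class by class, round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        for position, idx in enumerate(by_class[label]):
            folds[position % k].append(idx)
    return folds

# Hypothetical imbalanced toy labels: 50 zeros, 30 ones, 20 twos
labels = [0] * 50 + [1] * 30 + [2] * 20
folds = stratified_folds(labels, k=5)

for fold_no, val_idx in enumerate(folds, start=1):
    counts = Counter(labels[i] for i in val_idx)
    print(f"Fold {fold_no}: {dict(counts)}")
```

Every fold ends up with 10, 6, and 4 samples of the three classes, the same 50/30/20 ratio as the full label list.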

In [None]:
def hyperperameter_exploration(X_train, y_train_encoded, filter_options, learning_rate_options):
  # Initialize Stratified KFold
  skf = StratifiedKFold(n_splits=5)

  # To track the best combination
  results = {}
  best_accuracy = 0
  best_combination = None

  for filters in filter_options:
      for lr in learning_rate_options:
          fold_no = 1
          fold_accuracies = []
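          # Stratify on integer labels recovered from the one-hot encoding,
          # so the function does not depend on a global y_train
          for train_index, val_index in skf.split(X_train, np.argmax(y_train_encoded, axis=1)):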
              print(f"Training fold {fold_no} with {filters} filters and {lr} learning rate")

              X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
              y_train_fold, y_val_fold = y_train_encoded[train_index], y_train_encoded[val_index]

              # Build and train the model with the current fold
              model = build_model(filters, lr)
              model.fit(X_train_fold, y_train_fold, validation_data=(X_val_fold, y_val_fold), epochs=5)

              # Evaluate the model
              val_loss, val_acc = model.evaluate(X_val_fold, y_val_fold)
              print(f"Fold {fold_no} - Validation accuracy: {val_acc}")
              fold_accuracies.append(val_acc)
              fold_no += 1
          avg_acc = np.mean(fold_accuracies)
          print(f"Average validation accuracy for {filters} filters and {lr} learning rate: {avg_acc}")
          print("-" * 50)
          # Store the result in a dictionary
          results[(filters, lr)] = avg_acc

          # Track the best combination
          if avg_acc > best_accuracy:
              best_accuracy = avg_acc
              best_combination = (filters, lr)

  # After all combinations are tested
  return best_combination, best_accuracy
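
It is worth estimating how much training this search involves. The sketch below uses the grid values from this lab and assumes the Keras default batch size of 32, since `fit` is called without a `batch_size` argument; the variable names are introduced only for this calculation.

```python
import math
from itertools import product

filter_options = [16, 32]
learning_rate_options = [0.001, 0.01]
n_splits, epochs = 5, 5
n_train = 60_000
batch_size = 32  # assumed: Keras default when fit() is called without batch_size

combinations = list(product(filter_options, learning_rate_options))
val_size = n_train // n_splits
train_size = n_train - val_size

steps_per_epoch = math.ceil(train_size / batch_size)
val_steps = math.ceil(val_size / batch_size)
total_fits = len(combinations) * n_splits
total_epochs = total_fits * epochs

print(combinations)
print(steps_per_epoch, val_steps)   # 1500 375
print(total_fits, total_epochs)     # 20 100
```

The 1500 training steps and 375 validation steps agree with the step counts shown in the training output, and the full search amounts to 20 model fits, or 100 epochs in total.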

### **Model Training**
After completing hyperparameter exploration, the final model is built with the hyperparameters that gave the highest average validation accuracy and trained on the entire training set for 10 epochs.

In [None]:
def model_training(best_combination, X_train, y_train_encoded):
  best_model = build_model(*best_combination)
  best_model.fit(X_train, y_train_encoded, epochs=10)
  return best_model

### **Model Evaluation**
The final model is evaluated on the held-out test set to measure how well it generalizes to unseen data. The evaluation returns the final loss and accuracy values.

In [None]:
def model_evaluation(best_model, X_test, y_test_encoded):
  loss, accuracy = best_model.evaluate(X_test, y_test_encoded)
  return loss, accuracy
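
For reference, the two numbers returned by `evaluate` can be reproduced by hand. The following NumPy sketch computes accuracy and mean categorical cross-entropy on a hypothetical 3-class example; the function names and data are illustrative, and Keras may differ in small numerical details such as the clipping constant.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

def accuracy(y_true, y_prob):
    """Fraction of samples whose highest-probability class matches the target."""
    return float(np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1)))

# Hypothetical 3-class example with four samples
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=float)
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],   # misclassified: true class is 1, predicted class is 0
])

print(accuracy(y_true, y_prob))                  # 0.75
print(categorical_crossentropy(y_true, y_prob))  # about 0.675
```

Note that the third sample is counted as correct even though the model assigns it only 0.4 probability; the loss penalizes that low confidence while the accuracy does not.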

### **Pipeline**
As a final step, all of the above is combined into a pipeline, which in this case serves as the main function. It performs every step we set out to do: it takes in the data, preprocesses it, conducts hyperparameter exploration, trains a model with the selected hyperparameters, and then evaluates the model.

In [None]:
if __name__ == "__main__":
  X_train, y_train, X_test, y_test = data_ingestion()
  X_train, y_train_encoded, X_test, y_test_encoded = data_preprocessing(X_train, y_train, X_test, y_test)
  filter_options = [16, 32]
  learning_rate_options = [0.001, 0.01]
  best_combination, best_accuracy = hyperperameter_exploration(X_train, y_train_encoded, filter_options, learning_rate_options)
  print(f"\nBest combination: {best_combination} with accuracy: {best_accuracy}")
  best_model = model_training(best_combination, X_train, y_train_encoded)
  loss, accuracy = model_evaluation(best_model, X_test, y_test_encoded)
  print(f"Test Loss: {loss}, Test Accuracy: {accuracy}")

Training fold 1 with 16 filters and 0.001 learning rate

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)

Epoch 1/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 38s 24ms/step - accuracy: 0.7728 - loss: 2.8144 - val_accuracy: 0.9400 - val_loss: 0.2134
Epoch 2/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 25s 17ms/step - accuracy: 0.9638 - loss: 0.1275 - val_accuracy: 0.9671 - val_loss: 0.1192
Epoch 3/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 26s 17ms/step - accuracy: 0.9746 - loss: 0.0826 - val_accuracy: 0.9758 - val_loss: 0.0839
Epoch 4/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 40s 16ms/step - accuracy: 0.9804 - loss: 0.0603 - val_accuracy: 0.9740 - val_loss: 0.0965
Epoch 5/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 23s 16ms/step - accuracy: 0.9834 - loss: 0.0512 - val_accuracy: 0.9750 - val_loss: 0.0890
375/375 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.9746 - loss: 0.0853
Fold 1 - Validation accuracy: 0.9750000238418579
Training fold 2 w

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)

1500/1500 ━━━━━━━━━━━━━━━━━━━━ 26s 16ms/step - accuracy: 0.7163 - loss: 2.5858 - val_accuracy: 0.8814 - val_loss: 0.4277
Epoch 2/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 43s 17ms/step - accuracy: 0.9198 - loss: 0.2762 - val_accuracy: 0.9247 - val_loss: 0.2628
Epoch 3/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 40s 17ms/step - accuracy: 0.9258 - loss: 0.2542 - val_accuracy: 0.9405 - val_loss: 0.2041
Epoch 4/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 41s 16ms/step - accuracy: 0.9288 - loss: 0.2520 - val_accuracy: 0.9335 - val_loss: 0.2211
Epoch 5/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 24s 16ms/step - accuracy: 0.9279 - loss: 0.2511 - val_accuracy: 0.9279 - val_loss: 0.2428
375/375 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9333 - loss: 0.2336
Fold 1 - Validation accuracy: 0.9279166460037231
Training fold 2 with 16 fil

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)

1500/1500 ━━━━━━━━━━━━━━━━━━━━ 43s 27ms/step - accuracy: 0.8373 - loss: 1.5079 - val_accuracy: 0.9649 - val_loss: 0.1291
Epoch 2/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 77s 24ms/step - accuracy: 0.9743 - loss: 0.0906 - val_accuracy: 0.9770 - val_loss: 0.0813
Epoch 3/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 38s 25ms/step - accuracy: 0.9800 - loss: 0.0654 - val_accuracy: 0.9758 - val_loss: 0.0968
Epoch 4/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 41s 25ms/step - accuracy: 0.9842 - loss: 0.0523 - val_accuracy: 0.9797 - val_loss: 0.0738
Epoch 5/5
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 41s 25ms/step - accuracy: 0.9845 - loss: 0.0462 - val_accuracy: 0.9786 - val_loss: 0.0771
375/375 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9797 - loss: 0.0763
Fold 1 - Validation accuracy: 0.9785833358764648
Training fold 2 with 32 fil

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)

1500/1500 ━━━━━━━━━━━━━━━━━━━━ 39s 25ms/step - accuracy: 0.7983 - loss: 2.7288 - val_accuracy: 0.9324 - val_loss: 0.2306
Epoch 2/5
10/1500 ━━━━━━━━━━━━━━━━━━━━ 30s 20ms/step - accuracy: 0.9469 - loss: 0.2171

### **Results**
After performing stratified cross-validation, the best hyperparameter combination was determined to be **32** filters with a learning rate of **0.001**, which achieved the highest average validation accuracy across the folds.

After final training, the model was evaluated on the test set. The accuracy of the model, rounded to the nearest thousandth, is **0.985**.


### **Conclusion**
In this lab, we successfully built a CNN for image classification on the MNIST dataset. By experimenting with different hyperparameters using stratified cross-validation, we identified the best-performing model, which was able to generalize well to the test data, achieving about 98.5% accuracy. This lab demonstrated the effectiveness of CNNs in image classification tasks and provided practical experience in hyperparameter tuning and model evaluation.

---

The MNIST dataset is a well-known benchmark in the field of machine learning and deep learning, with many models achieving high accuracy on the task of digit classification. The results obtained from the CNN built here can be compared with those reported in recent papers and current models for image classification.

One such model is the Ensemble Network, as demonstrated in [2]. This paper proposes a novel architecture called EnsNet, designed to enhance the performance of Convolutional Neural Networks (CNNs) in image classification tasks. EnsNet combines a base CNN with multiple Fully Connected SubNetworks (FCSNs). The key idea is to split the feature maps generated by the last convolutional layer of the base CNN into disjoint subsets, which are then assigned to the FCSNs. Each FCSN is trained independently, and the final prediction is determined through majority voting between the base CNN and the FCSNs. This approach uses ensemble learning, introducing diversity among the learners by training FCSNs on different subsets of the feature maps.
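
The voting step can be illustrated with a few lines of Python. This is only a sketch of majority voting in general, not the EnsNet authors' implementation; the predictions are hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption made for this example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-model predictions for three images.
# Row = one image; first entry = base CNN, remaining entries = subnetworks.
votes_per_image = [
    [7, 7, 1, 7],
    [3, 5, 5, 5],
    [9, 4, 9, 4],   # tie between 9 and 4: resolved in favour of 9 (seen first)
]

final = [majority_vote(v) for v in votes_per_image]
print(final)  # [7, 5, 9]
```

In the second row the subnetworks overrule the base CNN, which shows how the ensemble can correct an individual learner's mistake.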

The EnsNet CNN achieved a state-of-the-art error rate of 0.16% (99.84% accuracy) on the MNIST dataset using this complex ensemble method.
The lab CNN achieved around 98.5% accuracy, which is lower, as expected from a simpler architecture without ensemble learning or extensive regularization.
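
To put the two accuracies in perspective, they can be converted into approximate counts of misclassified images on the 10,000-image test set. The helper below is hypothetical, and because the lab accuracy of 0.985 is itself a rounded value, the resulting count is only an estimate.

```python
def errors_on_test_set(accuracy, n_test=10_000):
    """Convert an accuracy in [0, 1] to an approximate count of misclassified images."""
    return round((1.0 - accuracy) * n_test)

lab_errors = errors_on_test_set(0.985)      # lab CNN
ensnet_errors = errors_on_test_set(0.9984)  # EnsNet, as reported in [2]

print(lab_errors, ensnet_errors)             # 150 16
print(round(lab_errors / ensnet_errors, 1))  # 9.4
```

Seen this way, the lab CNN makes roughly nine times as many errors as EnsNet, even though the accuracies differ by little more than one percentage point.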

Another recent paper, [3], explores techniques to enhance the performance of basic CNNs. This is done through data augmentation, dropout, and early stopping. The paper demonstrates that even plain CNNs, when combined with effective regularization and optimization techniques, can compete with more advanced architectures like residual networks.

The study applies these regularization methods to several datasets, including MNIST, on which it reports state-of-the-art performance of 99.83% accuracy. In contrast, the CNN implemented in the lab is simpler, with only 2 convolutional layers and no regularization, achieving 98.5% accuracy. While the lab CNN performs well for its simplicity, the enhanced model shows that introducing regularization and optimization techniques significantly boosts performance and generalization, especially on larger datasets.
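
Of the techniques mentioned, early stopping is the simplest to sketch. The function below is a generic patience-based rule written for illustration; it is neither part of the lab pipeline nor the method from [3]. The patience value of 2 is an arbitrary assumption, and the losses are the validation losses from the first logged fold above.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation losses from the first logged fold (16 filters, learning rate 0.001)
fold_val_losses = [0.2134, 0.1192, 0.0839, 0.0965, 0.0890]
print(early_stopping_epoch(fold_val_losses, patience=2))  # (5, 3)
```

With these losses the best epoch is the third, and the rule would stop training after the fifth epoch because the loss did not improve in epochs 4 and 5.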

While current state-of-the-art techniques outperform our model, these approaches require more complex architectures and computational resources. Given the simplicity of our model and the absence of advanced techniques like data augmentation or ensemble learning, achieving 98.5% is a strong result that demonstrates the effectiveness of CNNs even with basic configurations. This result is particularly impressive considering that our model was developed through basic stratified cross-validation and hyperparameter tuning, without the use of any advanced optimization techniques.

### **Resources**
[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[2] D. Hirata and N. Takahashi, "Ensemble learning in CNN augmented with fully connected subnetworks," arXiv preprint arXiv:2003.08562, Mar. 2020. [Online]. Available: https://arxiv.org/abs/2003.08562

[3] Y. S. Assiri, "Stochastic Optimization of Plain Convolutional Neural Networks with Simple Methods," arXiv preprint arXiv:2001.08856, Jan. 2020. [Online]. Available: https://arxiv.org/abs/2001.08856