# Handwritten Digit Classification on the MNIST Dataset


## Introduction

This project applies the universal workflow of machine learning from *Deep Learning with Python* (Chapter 4.5, 1st edition) to the MNIST handwritten digit dataset. The goal is to build, analyze, and improve a neural network that can recognize digits from images.

The task is to develop a machine learning model that can accurately classify images of handwritten digits (0–9) from the MNIST dataset. This is a supervised learning problem, where the model is trained on labeled examples of images and their corresponding digit classes. Each input is a 28×28 grayscale image, and the model outputs a label indicating which digit (0–9) it predicts for that image. The MNIST dataset is a widely used benchmark in machine learning and computer vision, consisting of 70,000 images in total: 60,000 for training and 10,000 for testing.

MNIST is an ideal choice for this project because it is relatively small, easy to load and preprocess, and has been extensively studied, which makes it well suited for learning and comparing different model configurations.

In this project, the models are restricted to Keras/TensorFlow Sequential architectures built only from Dense and Dropout layers, without using more complex layers such as convolutional layers. This constraint reflects the course requirements and encourages a focus on understanding the core ideas of fully connected neural networks and regularization, rather than relying on more advanced architectures.

The main metric for success in this project is classification accuracy on a held-out test set, measuring the proportion of correctly classified digit images. The goal is to build a model that achieves high accuracy while following the universal workflow from *Deep Learning with Python* (DLWP), rather than necessarily reaching state-of-the-art performance. Additional metrics such as precision, recall, and F1-score will be considered during evaluation to provide a more detailed picture of model performance, particularly if class imbalance or specific error types become relevant.

The project follows the universal workflow of machine learning from *Deep Learning with Python* step by step. The problem and dataset are defined first, along with the success metric and the evaluation protocol. The data is then loaded, preprocessed, and split into training, validation, and test sets. A simple baseline model is built and evaluated, followed by a larger model that is intentionally allowed to overfit in order to explore model capacity. Finally, regularization techniques and hyperparameter tuning are applied, guided by validation performance, to improve generalization. The best model is then evaluated on the test set and discussed in terms of its strengths, limitations, and possible extensions.

## 1. Problem definition and dataset

The problem addressed in this project is the classification of handwritten digits from the MNIST dataset. The MNIST dataset consists of 70,000 grayscale images of handwritten digits (0–9), each of size 28×28 pixels. It is split into a training set of 60,000 images and a test set of 10,000 images. This is a supervised multiclass classification problem, where the goal is to train a model that takes an image as input and outputs the correct digit label.

MNIST is a good choice for this project because it is widely used in the machine learning community, making it easy to compare results with existing work. It is also relatively small and straightforward to load and preprocess, which allows us to focus on the modeling aspects rather than on complex data handling. Additionally, the task of digit classification is a well‑defined and intuitive problem that serves as a gentle introduction to image classification and neural networks.

## 2. Measure of Success

Our main metric for success in this project is classification accuracy on a held-out test set, which measures the proportion of correctly classified digit images.

$$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
$$

Because MNIST is a relatively simple and well‑studied benchmark, many models achieve very high performance on this dataset. For this reason, this project considers a test accuracy above 99% as the target for a successful model, rather than treating lower accuracies (e.g. 90–95%) as sufficient. Additional metrics such as precision, recall, and F1‑score will be considered during evaluation to provide a more detailed picture of model performance, particularly if class imbalance or specific error types become relevant.
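
The metrics named above can be computed from a confusion matrix. The following is a minimal NumPy sketch rather than the evaluation code used in this project: the function name `classification_report_arrays` is hypothetical, and the labels are a made-up three-class toy example, not MNIST results.

```python
import numpy as np

def classification_report_arrays(y_true, y_pred, num_classes=10):
    """Return accuracy plus per-class precision, recall and F1 (NumPy only)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)

    tp = np.diag(cm).astype(np.float64)
    predicted = cm.sum(axis=0)  # how often each class was predicted
    actual = cm.sum(axis=1)     # how often each class truly occurs

    # Guard against division by zero for classes that never appear.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1


# Toy example with three classes (made-up labels, not MNIST results).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc, prec, rec, f1 = classification_report_arrays(y_true, y_pred, num_classes=3)
print(f"accuracy={acc:.3f}")
print("precision:", np.round(prec, 3))
print("recall:   ", np.round(rec, 3))
print("f1:       ", np.round(f1, 3))
```

For a trained classifier, `y_pred` would typically be the `argmax` over the predicted class probabilities and `y_true` the integer labels.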

## 3. Evaluation Protocol

The original MNIST dataset provides 60,000 training images and 10,000 test images. In this project, the 60,000 training images are further split into 50,000 training samples and 10,000 validation samples. The 50,000 training samples are used to fit the parameters of the neural network, while the 10,000 validation samples are used to monitor performance during development and to guide choices such as model architecture, number of epochs, and regularization strength.

The 10,000 test images are kept completely separate from the training and validation process and are only used once, at the end of the project, to obtain an unbiased estimate of the final model’s generalization performance.
This train/validation/test setup helps prevent overfitting to the test set and ensures that any improvements observed on the validation set reflect genuine improvements in the model, rather than adaptation to a single held‑out benchmark.
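
The hold-out protocol described above can be sketched as a small NumPy helper. This is an illustration under assumptions: the function name `holdout_split`, the seeded shuffle, and the stand-in arrays are not part of this project, which takes the validation samples as a plain slice of the training set.

```python
import numpy as np

def holdout_split(x, y, num_val, seed=0):
    """Split (x, y) into training and validation parts after a seeded shuffle."""
    if not 0 < num_val < len(x):
        raise ValueError("num_val must be between 1 and len(x) - 1")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    val_idx, train_idx = order[:num_val], order[num_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])


# Small stand-in arrays with the same layout as flattened MNIST (not real data).
x_demo = np.arange(60 * 784, dtype=np.float32).reshape(60, 784)
y_demo = np.arange(60) % 10
(x_tr, y_tr), (x_va, y_va) = holdout_split(x_demo, y_demo, num_val=10)
print(x_tr.shape, x_va.shape)  # (50, 784) (10, 784)
```

Fixing the seed keeps the split identical across runs, so validation scores from different model configurations remain comparable.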


## 4. Data Loading and Preprocessing

The MNIST dataset ships with Keras, which makes it easy to load and preprocess. The images are grayscale, with integer pixel values in the range [0, 255]. To make training better behaved, the pixel values are normalized to the range [0, 1] by dividing by 255. Because the model is a fully connected network, each 28×28 image is also flattened into a 784-dimensional vector.

In [20]:
import numpy as np
from tensorflow import keras

# 1) Load MNIST
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(f"x_train shape: {x_train.shape}, y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape}, y_test shape: {y_test.shape}")

x_train shape: (60000, 28, 28), y_train shape: (60000,)
x_test shape: (10000, 28, 28), y_test shape: (10000,)


To provide the model with the correct format for training, we will need to "flatten" the 28×28 images into 784‑dimensional vectors. This can be done using the `reshape` method in NumPy or by using the `Flatten` layer in Keras. This process transforms each 2D image into a 1D vector, which is the expected input format for a fully connected neural network.

In [21]:
# 2) Flatten images: (num_samples, 28, 28) -> (num_samples, 784)
num_train = x_train.shape[0]
num_test = x_test.shape[0]

x_train = x_train.reshape(num_train, 28 * 28)
x_test = x_test.reshape(num_test, 28 * 28)

print(f"x_train shape: {x_train.shape}, x_test shape: {x_test.shape}")

x_train shape: (60000, 784), x_test shape: (10000, 784)


Normalization of pixel values is crucial for training neural networks effectively, as it helps to ensure that the input data is on a similar scale, which can improve convergence during training. By dividing the pixel values by 255, we scale them to the range [0, 1], which is more suitable for the activation functions used in neural networks.

In [22]:
# 3) Convert to float and scale to [0, 1]
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
print(f"Pixel value range after normalization: [{x_train.min()}, {x_train.max()}]")

Pixel value range after normalization: [0.0, 1.0]


Next, we will convert the class labels (0–9) into one‑hot encoded vectors using `keras.utils.to_categorical`. This function takes a vector of class indices and returns a matrix of one‑hot encoded label vectors, which is the required format for training a neural network with categorical crossentropy loss.

In [23]:
# 4) One-hot encode labels: 0 -> [1,0,0,0,0,0,0,0,0,0], etc.
num_classes = 10

y_train_categorical = keras.utils.to_categorical(y_train, num_classes)
y_test_categorical = keras.utils.to_categorical(y_test, num_classes)

print(y_train.shape, y_train_categorical.shape)
print(y_test.shape, y_test_categorical.shape)


(60000,) (60000, 10)
(10000,) (10000, 10)
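
To make the encoding concrete, the sketch below reproduces the idea of one-hot encoding with plain NumPy. The helper name `one_hot` is hypothetical and is not a Keras function; it is only meant to show what the resulting matrix looks like.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows of dtype float32."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]


labels = np.array([0, 3, 9])
encoded = one_hot(labels, num_classes=10)
print(encoded.shape)           # (3, 10)
print(encoded.argmax(axis=1))  # [0 3 9], argmax recovers the original labels
```

Taking the `argmax` along the class axis inverts the encoding, which is also how predicted probability vectors are turned back into digit labels.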


The validation set is created by splitting the original training set of 60,000 images into 50,000 training samples and 10,000 validation samples. The validation set is used to select model architectures and hyperparameters, while the separate 10,000-image test set is used only once at the end to obtain an unbiased estimate of the final model's performance.

In [24]:
# 5) Create training / validation split from the original training set
x_train_final = x_train[:50000]
y_train_final = y_train_categorical[:50000]

x_val = x_train[50000:]
y_val = y_train_categorical[50000:]

print(x_train_final.shape, y_train_final.shape)
print(x_val.shape, y_val.shape)
print(x_test.shape, y_test_categorical.shape)

(50000, 784) (50000, 10)
(10000, 784) (10000, 10)
(10000, 784) (10000, 10)


## 5. Baseline model: small dense network

The baseline model is a simple fully connected neural network with one hidden layer of 64 units and ReLU activation, followed by an output layer of 10 units with softmax activation for multiclass classification. This architecture is chosen for its simplicity and ability to learn non‑linear relationships in the data, while still being small enough to train quickly and serve as a useful baseline for comparison with more complex models.

We start by importing `layers` from Keras, which provides the building blocks for constructing the network. The `Dense` layer creates fully connected layers: the first has 64 units with ReLU activation, and the second has 10 units with softmax activation to output class probabilities. The `Sequential` model stacks these layers linearly.

Finally, we compile the model with the `rmsprop` optimizer, categorical crossentropy loss (suitable for multiclass classification with one-hot labels), and accuracy as the metric to monitor during training.

In [25]:
from tensorflow.keras import layers  # same Keras namespace as the `keras` import above

model_baseline = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(28 * 28,)),
    layers.Dense(10, activation="softmax")
])

model_baseline.compile(
    optimizer="rmsprop",
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)

model_baseline.summary()
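
The softmax output and the categorical crossentropy loss used above can be written out in a few lines of NumPy. This is a sketch for intuition, not the Keras implementation: the function names, the clipping constant `eps`, and the input scores are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true_one_hot, probs, eps=1e-7):
    """Mean over samples of -sum_k y_k * log(p_k)."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true_one_hot * np.log(probs), axis=1)))


# Two made-up samples with 10 raw scores each (illustrative values only).
logits = np.zeros((2, 10))
logits[0, 3] = 5.0            # sample 0: confident and correct for class 3
y_true = np.eye(10)[[3, 7]]   # true classes: 3 and 7

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1))  # each row sums to 1
print(round(loss, 4))
```

The second sample has uniform scores, so its predicted probability for the true class is 0.1 and it contributes a loss of ln(10); the confident, correct first sample contributes much less.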

Training the baseline and storing the history of training and validation accuracy allows us to analyze the model's learning process and identify potential issues such as underfitting or overfitting. By comparing the training and validation accuracy, we can gain insights into how well the model is generalizing to unseen data and whether it is learning meaningful patterns from the training set.

In [26]:
history_baseline = model_baseline.fit(
    x_train_final,
    y_train_final,
    epochs=20,          # increased from an initial 5 epochs
    batch_size=128,
    validation_data=(x_val, y_val)
)



Epoch 1/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.8849 - loss: 0.4231 - val_accuracy: 0.9279 - val_loss: 0.2474
Epoch 2/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.9358 - loss: 0.2267 - val_accuracy: 0.9447 - val_loss: 0.1985
Epoch 3/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.9503 - loss: 0.1746 - val_accuracy: 0.9567 - val_loss: 0.1540
Epoch 4/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9590 - loss: 0.1419 - val_accuracy: 0.9616 - val_loss: 0.1388
Epoch 5/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9653 - loss: 0.1197 - val_accuracy: 0.9644 - val_loss: 0.1230
Epoch 6/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9702 - loss: 0.1032 - val_accuracy: 0.9667 - val_loss: 0.1122
Epoch 7/20
391/391

Storing the training history also enables us to visualize the learning curves, which can help in diagnosing problems with the model and guiding further improvements. For example, if the training accuracy is high but the validation accuracy is low, it may indicate that the model is overfitting to the training data, suggesting that we may need to apply regularization techniques or gather more data. Conversely, if both training and validation accuracy are low, it may indicate underfitting, suggesting that we may need a more complex model or longer training time.

In [27]:
train_acc = history_baseline.history["accuracy"]
val_acc = history_baseline.history["val_accuracy"]

print("Training accuracy per epoch:", train_acc)
print("Validation accuracy per epoch:", val_acc)


Training accuracy per epoch: [0.884880006313324, 0.9358000159263611, 0.9503200054168701, 0.9590200185775757, 0.9652799963951111, 0.9702399969100952, 0.9733999967575073, 0.9765599966049194, 0.9791399836540222, 0.9816799759864807, 0.983020007610321, 0.9846600294113159, 0.9862599968910217, 0.9872000217437744, 0.9886599779129028, 0.9894999861717224, 0.9896799921989441, 0.9909999966621399, 0.9916599988937378, 0.9925000071525574]
Validation accuracy per epoch: [0.9279000163078308, 0.9447000026702881, 0.9567000269889832, 0.9616000056266785, 0.9643999934196472, 0.96670001745224, 0.9682999849319458, 0.9693999886512756, 0.9706000089645386, 0.9707000255584717, 0.9707000255584717, 0.9732999801635742, 0.9718999862670898, 0.9736999869346619, 0.9718999862670898, 0.9736999869346619, 0.9731000065803528, 0.973800003528595, 0.9733999967575073, 0.9760000109672546]
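
The per-epoch lists can also be summarized numerically before plotting anything. The sketch below uses the baseline accuracies printed above, rounded to four decimals; the helper name `summarize_history` and the choice of summary fields are assumptions made for illustration.

```python
def summarize_history(train_acc, val_acc):
    """Return the best validation epoch (1-based) and the final train/val gap."""
    if len(train_acc) != len(val_acc) or not train_acc:
        raise ValueError("expected two non-empty lists of equal length")
    best_index = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return {
        "best_epoch": best_index + 1,
        "best_val_acc": val_acc[best_index],
        "final_gap": train_acc[-1] - val_acc[-1],
    }


# Accuracies from the baseline run above, rounded to four decimals.
train_acc_rounded = [
    0.8849, 0.9358, 0.9503, 0.9590, 0.9653, 0.9702, 0.9734, 0.9766, 0.9791, 0.9817,
    0.9830, 0.9847, 0.9863, 0.9872, 0.9887, 0.9895, 0.9897, 0.9910, 0.9917, 0.9925,
]
val_acc_rounded = [
    0.9279, 0.9447, 0.9567, 0.9616, 0.9644, 0.9667, 0.9683, 0.9694, 0.9706, 0.9707,
    0.9707, 0.9733, 0.9719, 0.9737, 0.9719, 0.9737, 0.9731, 0.9738, 0.9734, 0.9760,
]

summary = summarize_history(train_acc_rounded, val_acc_rounded)
print(summary["best_epoch"])           # 20
print(round(summary["final_gap"], 4))  # 0.0165
```

A gap that keeps widening while validation accuracy plateaus is the pattern usually read as overfitting; here training accuracy continues to climb while validation accuracy moves only slightly after roughly epoch 12.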


### Baseline performance compared to a trivial baseline

Since MNIST is a 10-class classification problem, a trivial model that predicts digit labels uniformly at random would be expected to achieve an accuracy of around 10%. In contrast, the baseline dense neural network trained in this project reaches approximately 99.3% training accuracy and 97.6% validation accuracy after 20 epochs. This shows that the model is learning meaningful structure in the data and performing far better than chance.

However, because MNIST is a relatively simple benchmark where well-designed models can exceed 99% test accuracy, this baseline is still treated as a starting point rather than a satisfactory final model. Later sections will increase model capacity and apply regularization with the aim of closing the gap between the current 97.6% validation accuracy and the 99% target.

| Model                          | Accuracy (approx.) |
|--------------------------------|--------------------|
| Random guess (10 classes)      | ~10%               |
| Baseline dense network (val)   | ~97.6%             |
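
The ~10% figure for random guessing can be checked with a short simulation. This sketch uses synthetic, uniformly drawn labels as a stand-in for the real MNIST labels (which are only roughly balanced), and the seed and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

num_samples, num_classes = 10_000, 10
# Stand-in labels drawn uniformly; real MNIST labels are only roughly balanced.
y_true = rng.integers(0, num_classes, size=num_samples)
y_guess = rng.integers(0, num_classes, size=num_samples)

random_accuracy = float(np.mean(y_true == y_guess))
print(f"Random-guess accuracy: {random_accuracy:.3f}")  # close to 0.10
```

Any trained model should comfortably beat this value before its architecture is worth tuning further.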


In [28]:
train_acc = history_baseline.history["accuracy"]
val_acc = history_baseline.history["val_accuracy"]
print(f"Training Accuracy: {train_acc[-1]}, Validation Accuracy: {val_acc[-1]}")

Training Accuracy: 0.9925000071525574, Validation Accuracy: 0.9760000109672546


## 6. Overfitting model: larger dense network

In [29]:
from tensorflow import keras
from tensorflow.keras import layers

model_big = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(28 * 28,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(10, activation="softmax")
])

model_big.compile(
    optimizer="rmsprop",
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)

model_big.summary()


In [30]:
history_big = model_big.fit(
    x_train_final,
    y_train_final,
    epochs=20,          # same number of epochs as the baseline run
    batch_size=128,
    validation_data=(x_val, y_val)
)


Epoch 1/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 5s 11ms/step - accuracy: 0.9165 - loss: 0.2808 - val_accuracy: 0.9633 - val_loss: 0.1314
Epoch 2/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 5s 12ms/step - accuracy: 0.9663 - loss: 0.1101 - val_accuracy: 0.9688 - val_loss: 0.1014
Epoch 3/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 4s 11ms/step - accuracy: 0.9777 - loss: 0.0706 - val_accuracy: 0.9624 - val_loss: 0.1254
Epoch 4/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 4s 11ms/step - accuracy: 0.9838 - loss: 0.0510 - val_accuracy: 0.9714 - val_loss: 0.0989
Epoch 5/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 4s 11ms/step - accuracy: 0.9884 - loss: 0.0371 - val_accuracy: 0.9769 - val_loss: 0.0792
Epoch 6/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 4s 11ms/step - accuracy: 0.9907 - loss: 0.0288 - val_accuracy: 0.9813 - val_loss: 0.0708
Epoch 7/20
391/391

In [31]:
train_acc_big = history_big.history["accuracy"]
val_acc_big = history_big.history["val_accuracy"]

print("Final training accuracy (big model):", train_acc_big[-1])
print("Final validation accuracy (big model):", val_acc_big[-1])


Final training accuracy (big model): 0.9995999932289124
Final validation accuracy (big model): 0.9829000234603882


To explore the effect of model capacity, a larger dense network with two hidden layers of 256 ReLU units each was trained for 20 epochs. This model achieved approximately 99.96% training accuracy and 98.29% validation accuracy, improving on the baseline validation accuracy of 97.60%. The near-perfect training accuracy indicates that the larger network has enough capacity to fit the training data almost completely. The remaining gap of roughly 1.7 percentage points between training and validation accuracy suggests that some overfitting is occurring, although the model still generalizes reasonably well to unseen validation examples.
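
One way to quantify the difference in capacity is to count trainable parameters. The sketch below does this by hand for stacks of Dense layers, using the layer sizes of the two models in this project; the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases for a stack of Dense layers.

    layer_sizes lists the input dimension followed by the units of each layer.
    """
    return sum(
        n_in * n_out + n_out  # weight matrix plus one bias per unit
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


baseline_params = dense_param_count([784, 64, 10])
big_params = dense_param_count([784, 256, 256, 10])
print(baseline_params)                         # 50890
print(big_params)                              # 269322
print(round(big_params / baseline_params, 1))  # 5.3
```

The larger network therefore has roughly five times as many parameters as the baseline, which is consistent with its ability to fit the training set almost perfectly.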

## 7. Regularization and hyperparameter tuning

lorem ipsum

## 8. Final model evaluation on the test set

lorem ipsum

## 9. Discussion and conclusions

lorem ipsum

## 10. References and code credits

lorem ipsum