## MNIST Classification Project

### Goal
The objective of this project is to build, train, evaluate, and persist a neural network model for handwritten digit classification using the MNIST dataset. This includes:
- Designing a customizable Keras model with regularization
- Training the model with early stopping and best model checkpointing
- Evaluating performance on the test set
- Saving and loading the model


### Dataset: MNIST
MNIST (Modified National Institute of Standards and Technology) is a classic dataset of handwritten digits (0–9). It is commonly used for benchmarking classification algorithms.

#### Size:
- 60,000 training images
- 10,000 test images
- Each image: 28 × 28 pixels
- Grayscale (1 channel)
- Flattened to a 784-dimensional vector for fully connected networks


In [2]:
import numpy as np
import tensorflow.keras as K
from tensorflow.keras.datasets import mnist
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.utils import normalize

In [3]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

### Checking Dataset Shapes
Before training, it's important to understand the shape of the data.

In [5]:
print("x_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)
print("x_test shape:", x_test.shape)
print("y_test shape:", y_test.shape)

x_train shape: (60000, 28, 28)
y_train shape: (60000,)
x_test shape: (10000, 28, 28)
y_test shape: (10000,)


### Normalizing the Dataset
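Normalization is a preprocessing step that brings pixel values into the range 0 to 1, which helps the model train faster and more stably. Note that `normalize(..., axis=1)` applies L2 normalization along axis 1 of the image array rather than dividing by 255: each 28-pixel slice along that axis is scaled to unit length, so the non-negative pixel values still end up between 0 and 1.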

In [7]:
x_train = normalize(x_train.astype('float32'), axis=1)
x_test = normalize(x_test.astype('float32'), axis=1)
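To make the difference between L2 normalization and plain rescaling concrete, here is a minimal NumPy sketch. The helper `l2_normalize` and the tiny 3 × 3 "image" are hypothetical illustrations, not part of the notebook; the helper is only meant to mimic what `normalize` with `axis=1` computes.

```python
import numpy as np


def l2_normalize(x, axis=1):
    """Hypothetical helper: scale slices along `axis` to unit L2 norm."""
    norms = np.linalg.norm(x, ord=2, axis=axis, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero slices untouched
    return x / norms


# Hypothetical batch containing one 3 x 3 "image" with values in 0..255
images = np.array([[[3, 0, 0],
                    [4, 0, 0],
                    [0, 0, 255]]], dtype="float32")

unit_norm = l2_normalize(images, axis=1)  # L2 normalization along axis 1
scaled = images / 255.0                   # plain rescaling to [0, 1]

print(unit_norm[0, :, 0])  # slice [3, 4, 0] has norm 5, so it becomes [0.6, 0.8, 0.0]
print(scaled[0, :, 0])     # the same slice simply divided by 255
```

Both variants keep values between 0 and 1, but L2 normalization changes the relative brightness between slices, whereas dividing by 255 preserves it.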

### Reshaping the Dataset
Before feeding the data into a neural network, it's essential to flatten the 2D images (28x28 pixels) into 1D vectors (784 values per image). This step ensures compatibility with fully connected layers.

In [9]:
x_train = x_train.reshape(x_train.shape[0], -1)
x_test = x_test.reshape(x_test.shape[0], -1)
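As a quick illustration of what the reshape does, the sketch below flattens a small stand-in batch. The array `batch` is a hypothetical placeholder for a few MNIST images, not data from the notebook.

```python
import numpy as np

# Hypothetical stand-in for a small batch of images: 4 images of 28 x 28
batch = np.arange(4 * 28 * 28, dtype="float32").reshape(4, 28, 28)

flat = batch.reshape(batch.shape[0], -1)  # -1 lets NumPy infer 28 * 28 = 784
restored = flat.reshape(-1, 28, 28)       # flattening is lossless and reversible

print(flat.shape)  # (4, 784)
```

Rows are laid out one after another, so `flat[0, 28]` holds the first pixel of the second row of the first image.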

### One-Hot Encoding
One-hot encoding is a technique used to convert categorical labels into a binary matrix. In the context of classification problems like MNIST, each label (which represents a digit between 0 and 9) is converted into a vector where only one element is "1", and all other elements are "0".

#### Why is it needed?
In machine learning, especially for classification tasks with neural networks, the output labels must be in a format that can be used for training the model. Neural networks typically use softmax activation in the output layer for multi-class classification, which produces a probability distribution across the classes (i.e., a vector of probabilities that sum to 1).

For this to work correctly, we need to represent the labels in a format that matches the output. One-hot encoding is a natural fit for this because it ensures each label corresponds to a specific class (position) in the output vector.
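The pairing of a softmax output with a one-hot label can be sketched in NumPy as follows. The function names, the logits, and the 3-class setup are assumptions made for illustration; Keras computes the equivalent quantities internally.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def categorical_cross_entropy(one_hot_labels, probs, eps=1e-7):
    """Per-sample cross-entropy between one-hot labels and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.sum(one_hot_labels * np.log(probs), axis=-1)


logits = np.array([[2.0, 1.0, 0.1]])  # hypothetical raw scores for 3 classes
probs = softmax(logits)
label = np.array([[1.0, 0.0, 0.0]])   # one-hot label for class 0

loss = categorical_cross_entropy(label, probs)
print(probs.sum(axis=-1))  # probabilities sum to 1
print(loss)                # equals -log(probability assigned to class 0)
```

Because the label vector is zero everywhere except at the true class, the loss reduces to the negative log of the probability the model assigns to that class.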

In [11]:
import tensorflow.keras as K


def one_hot(labels, classes=None):
    """
        function that converts a label vector into a one-hot matrix

        :param labels: labels
        :param classes: nbr of classes

        :return: one-hot matrix, shape(labels,classes)
    """
    return K.utils.to_categorical(labels, classes)

In [12]:
y_train_one_hot = one_hot(y_train, 10)
y_test_one_hot = one_hot(y_test, 10)
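For intuition, the same encoding can be sketched without Keras by indexing an identity matrix. The helper `one_hot_numpy` and the example labels are hypothetical and only illustrate the idea behind `to_categorical`.

```python
import numpy as np


def one_hot_numpy(labels, classes):
    """Hypothetical helper: row i of the identity matrix is the one-hot vector for class i."""
    return np.eye(classes, dtype="float32")[labels]


labels = np.array([5, 0, 4])  # example digit labels
encoded = one_hot_numpy(labels, 10)

print(encoded.shape)  # (3, 10)
print(encoded[0])     # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```

Taking `argmax` along axis 1 recovers the original integer labels, which is how predictions are later mapped back to digits.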

### Model Overview & Architecture

This model is a fully connected neural network designed for classification tasks, like the MNIST digit classification problem.

#### **High-Level Architecture**:

1. **Input Layer**: 
   - The model accepts input data with `nx` features. For MNIST, each input image is flattened into a 784-dimensional vector (28x28 pixels).
   
2. **Hidden Layers**:
   - The network consists of multiple **Dense layers**, each with a specified number of neurons (e.g., 128, 64, 32). Each hidden layer uses the **ReLU activation function**, which introduces non-linearity and helps the model learn complex patterns.
   - **L2 Regularization** is applied to the weights to prevent overfitting by penalizing large weights.
   - **Dropout** is applied after every layer except the last one (the output layer). Each unit is kept with probability `keep_prob`, so the rate passed to the Keras `Dropout` layer is `1 - keep_prob`. Randomly deactivating neurons during training reduces overfitting.

3. **Output Layer**:
   - A **Dense layer** with 10 neurons (for the 10 MNIST classes) and a **Softmax activation function** is used to output a probability distribution over the classes.

#### **Model Flow**:
- The input is passed through multiple hidden layers with ReLU activations, regularization, and dropout.
- Finally, the output layer predicts the class probabilities, which are used to determine the most likely class (digit).
  
This architecture is a common baseline for classification on flattened images; the L2 penalty and dropout are intended to help it generalize to unseen test data.
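The relationship between `keep_prob` and dropout can be sketched in NumPy. This is a simplified "inverted dropout" illustration under the assumption that kept activations are rescaled by `1 / keep_prob`; the function name and the test array are hypothetical.

```python
import numpy as np


def inverted_dropout(activations, keep_prob, rng):
    """Keep each unit with probability keep_prob and rescale the survivors."""
    mask = rng.random(activations.shape) < keep_prob  # True = unit is kept
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
a = np.ones((1000, 128))  # hypothetical activations of a 128-unit layer
dropped = inverted_dropout(a, keep_prob=0.8, rng=rng)

print(np.unique(dropped))  # only zeros and the rescaled value 1 / 0.8
print(dropped.mean())      # close to 1, so the expected activation is preserved
```

Rescaling during training means no adjustment is needed at inference time, when dropout is switched off.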

In [14]:
def build_model(nx, layers, activations, lambtha, keep_prob):
    """
        function that builds a neural network with the Keras library
    """
    inputs = K.Input(shape=(nx,))

    x = inputs
    for i in range(len(layers)):
        # add Dense layer
        x = K.layers.Dense(layers[i],
                           activation=activations[i],
                           kernel_regularizer=K.regularizers.L2(lambtha))(x)

        # apply Dropout except last layer
        if i != len(layers) - 1 and keep_prob is not None:
            x = K.layers.Dropout(1 - keep_prob)(x)

    # create model
    model = K.Model(inputs, x)

    return model


In [15]:
# Model parameters
nx = x_train.shape[1]  # input dimension
layers = [128, 64, 10]  # two hidden layers (128, 64) and the 10-unit output layer
activations = ['relu', 'relu', 'softmax']  # activation for each layer, output included
lambtha = 0.01  # L2 regularization strength
keep_prob = 0.8  # probability of keeping a unit (dropout rate = 1 - keep_prob = 0.2)

model = build_model(nx, layers, activations, lambtha, keep_prob)
model.summary()
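The parameter counts reported by `model.summary()` can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per unit, and Dropout layers add no parameters. The helper below is a hypothetical sketch of that arithmetic for the layer sizes used here.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Return the number of trainable parameters of each Dense layer in a stack."""
    counts = []
    fan_in = n_inputs
    for units in layer_sizes:
        counts.append(fan_in * units + units)  # weights + biases
        fan_in = units
    return counts


counts = dense_param_count(784, [128, 64, 10])
print(counts)       # [100480, 8256, 650]
print(sum(counts))  # 109386
```

Almost all parameters sit in the first layer, because it connects the 784 input pixels to 128 units.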

### Model Optimization Overview

The `optimize_model()` function configures the model with the **Adam optimizer** for efficient training:

1. **Adam Optimizer**:
   - **Alpha** (`learning_rate`): Controls the step size for weight updates.
   - **Beta1** (`beta_1`): Momentum term, typically set to 0.9.
   - **Beta2** (`beta_2`): Controls the moving average of squared gradients, typically set to 0.999.

2. **Compilation**:
   - **Loss**: **Categorical cross-entropy** is used for multi-class classification.
   - **Metrics**: Tracks **accuracy** during training.

Compiling only configures the optimizer, loss, and metrics; the weights themselves are updated later, when the model is trained with `fit()`.
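To show how `alpha`, `beta1`, and `beta2` interact, here is a minimal NumPy sketch of one textbook Adam update. It is an illustration under stated assumptions, not the Keras implementation: the function name, the `eps` value, and the gradient are hypothetical, and Keras may arrange the bias correction differently internally.

```python
import numpy as np


def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad       # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


w = np.zeros(2)
grad = np.array([2.0, -0.5])  # hypothetical gradient
w, m, v = adam_step(w, grad, m=np.zeros(2), v=np.zeros(2), t=1)
print(w)  # each weight moves by roughly alpha, against the sign of its gradient
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the step size is about `alpha` regardless of the gradient's magnitude.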

In [17]:
#!/usr/bin/env python3
"""
    Optimize
"""

def optimize_model(network, alpha, beta1, beta2):
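    """
        function that sets up Adam optimization for a Keras model with
        categorical cross-entropy loss and accuracy metrics

        :param network: model to compile
        :param alpha: learning rate
        :param beta1: first Adam parameter (momentum)
        :param beta2: second Adam parameter (squared-gradient average)
    """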
    Adam_optimizer = K.optimizers.Adam(learning_rate=alpha,
                                       beta_1=beta1,
                                       beta_2=beta2)

    network.compile(optimizer=Adam_optimizer,
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])

In [18]:
# Compile model
alpha = 0.001  # learning rate
beta1 = 0.9  # Adam's beta1
beta2 = 0.999  # Adam's beta2

optimize_model(model, alpha, beta1, beta2)

### Model Training Overview

The `train_model()` function trains a neural network using **mini-batch gradient descent** with the following features:

1. **Early Stopping**:
   - Monitors validation loss to stop training if performance does not improve, avoiding overfitting.
   - **Patience** defines how many epochs to wait before stopping.

2. **Learning Rate Decay**:
   - Reduces the learning rate over epochs using inverse-time decay:

     $$
     \text{lr} = \frac{\alpha}{1 + \text{decay\_rate} \times \text{epoch}}
     $$

   - Helps the model converge more smoothly as training progresses.

3. **Save Best Model**:
   - **ModelCheckpoint** saves the model with the best validation loss during training to avoid losing the best-performing model.

4. **Training**:
   - The model is trained with the provided data, labels, and training settings like batch size, epochs, and shuffle.

Each of these features is optional and controlled by a keyword argument (`early_stopping`, `learning_rate_decay`, `save_best`). Early stopping and learning rate decay are only activated when `validation_data` is provided.
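The inverse-time decay schedule is easy to tabulate by hand. The sketch below uses `alpha=0.1` and `decay_rate=1`, which are the default argument values of `train_model()`; the standalone function name is a hypothetical stand-in for the scheduler defined inside it.

```python
def inverse_time_decay(alpha, decay_rate, epoch):
    """Learning rate for a given 0-based epoch under inverse-time decay."""
    return alpha / (1 + decay_rate * epoch)


schedule = [inverse_time_decay(0.1, 1, epoch) for epoch in range(5)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.4f}")
```

With these values the learning rate halves after the first epoch (0.1 to 0.05) and then shrinks more slowly, reaching 0.02 at epoch 4.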

In [20]:
def train_model(network, data, labels, batch_size,
                epochs, validation_data=None, early_stopping=False,
                patience=0, learning_rate_decay=False, alpha=0.1,
                decay_rate=1, save_best=False, filepath=None,
                verbose=True, shuffle=False):
    """
        Function that trains a model using mini-batch gradient descent
    """
    callback = []
    if early_stopping is True and validation_data is not None:
        early_stop = K.callbacks.EarlyStopping(monitor='val_loss',
                                               patience=patience)

        # add to callback list
        callback.append(early_stop)

    if learning_rate_decay and validation_data:
        # compute the learning rate for the given epoch (inverse-time decay)
        def scheduler(epoch):
            lr = alpha / (1 + decay_rate * epoch)
            return lr

        inv_time_decay = K.callbacks.LearningRateScheduler(
            scheduler,
            verbose=1)

        # add to callback list
        callback.append(inv_time_decay)

    # save best model
    if save_best:
        save_best_model = K.callbacks.ModelCheckpoint(
            filepath=filepath,
            monitor='val_loss',
            save_best_only=True
        )

        callback.append(save_best_model)

    history = network.fit(x=data,
                          y=labels,
                          epochs=epochs,
                          batch_size=batch_size,
                          validation_data=validation_data,
                          callbacks=callback,
                          verbose=verbose,
                          shuffle=shuffle)

    return history

In [21]:
batch_size = 32
epochs = 5


history = train_model(model, 
                      x_train, y_train_one_hot, 
                      batch_size, 
                      epochs, 
                      validation_data=(x_test, y_test_one_hot), 
                      early_stopping=True, patience=3, save_best=True, 
                      filepath='best_model.keras')

Epoch 1/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 8ms/step - accuracy: 0.7605 - loss: 1.7812 - val_accuracy: 0.8460 - val_loss: 1.1485
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 8ms/step - accuracy: 0.8428 - loss: 1.1678 - val_accuracy: 0.8573 - val_loss: 1.1014
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 16s 9ms/step - accuracy: 0.8458 - loss: 1.1510 - val_accuracy: 0.8570 - val_loss: 1.0985
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 15s 8ms/step - accuracy: 0.8463 - loss: 1.1438 - val_accuracy: 0.8626 - val_loss: 1.0853
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 8ms/step - accuracy: 0.8481 - loss: 1.1382 - val_accuracy: 0.8632 - val_loss: 1.0808


### Model Testing Overview

The `test_model()` function evaluates the performance of a trained neural network on test data:

- **Purpose**: It computes the loss and accuracy of the model on the given test data (`data`) and corresponding labels (`labels`).
- **Parameters**:
  - `network`: The trained Keras model to be tested.
  - `data`: The input data used for testing.
  - `labels`: The true labels corresponding to the test data.
  - `verbose`: Controls the display of progress (default is `True`).
  
The function uses Keras' `evaluate()` method, which returns the loss and accuracy metrics based on the model's performance on the test set.
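The accuracy metric itself is simple to reproduce: a prediction counts as correct when the most probable class matches the position of the 1 in the one-hot label. The sketch below uses hypothetical probabilities and labels for a 3-class problem.

```python
import numpy as np


def accuracy_from_probs(one_hot_labels, probs):
    """Fraction of samples whose most probable class matches the one-hot label."""
    predicted = np.argmax(probs, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted == actual))


# Hypothetical predictions for 4 samples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.4, 0.5, 0.1],
                  [0.2, 0.2, 0.6]])
labels = np.eye(3)[[0, 2, 0, 2]]  # true classes: 0, 2, 0, 2

print(accuracy_from_probs(labels, probs))  # 0.75
```

The third sample is predicted as class 1 but labeled class 0, so three of the four predictions are correct.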

In [23]:

#!/usr/bin/env python3
"""
    Test neural network
"""

import tensorflow.keras as K


def test_model(network, data, labels, verbose=True):
    """
        function that tests a neural network
    """
    return network.evaluate(x=data,
                            y=labels,
                            verbose=verbose)


In [24]:

# Evaluate the model
loss, accuracy = test_model(model, x_test, y_test_one_hot)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")


[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.8388 - loss: 1.1441
Test Loss: 1.0807803869247437
Test Accuracy: 0.8632000088691711


### Model Saving and Loading

- **`save_model()`**: Saves the entire model (architecture, weights, optimizer state) to a file.
  - **Parameters**: `network` (model), `filename` (file path).
  
- **`load_model()`**: Loads a previously saved model from a file.
  - **Parameters**: `filename` (file path).

In [26]:
def save_model(network, filename):
    """
        function that saves an entire model
    """
    network.save(filename)


def load_model(filename):
    """
        function that loads an entire model
    """
    return K.models.load_model(filename)


In [27]:
save_model(model, 'mnist_model.keras')
loaded_model = load_model('mnist_model.keras')

In [28]:
# Test the loaded model
loss, accuracy = test_model(loaded_model, x_test, y_test_one_hot)
print(f"Test Loss after loading model: {loss}")
print(f"Test Accuracy after loading model: {accuracy}")

[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.8388 - loss: 1.1441
Test Loss after loading model: 1.0807803869247437
Test Accuracy after loading model: 0.8632000088691711
