# Task
Implement an Artificial Neural Network (ANN) from scratch using NumPy to classify the MNIST dataset. The implementation should include data loading and preprocessing (normalization, flattening, one-hot encoding), initialization of network parameters (weights and biases), forward and backward propagation with ReLU and Softmax activation functions, and a training loop using gradient descent. Finally, evaluate the model's performance on the test set and visualize the training loss and accuracy over epochs. Summarize the ANN implementation, architecture, training process, and final performance on the MNIST dataset.

## Load and Preprocess MNIST Data

### Subtask:
Load the MNIST dataset, normalize pixel values, flatten images, and one-hot encode labels for both training and testing sets. This step will prepare the data for the neural network.


**Reasoning**:
First, I'll import the necessary libraries, then load the MNIST dataset, normalize pixel values, flatten the images, and one-hot encode the labels for both training and testing sets as per the instructions. This prepares the data for the neural network.



In [1]:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
import numpy as np

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize pixel values to the range [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Flatten each 28x28 image into a 784-element vector
x_train = x_train.reshape(x_train.shape[0], -1)
x_test = x_test.reshape(x_test.shape[0], -1)

# One-hot encode labels
num_classes = 10
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

print("MNIST data loaded and preprocessed successfully.")
print(f"x_train shape: {x_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape}")
print(f"y_test shape: {y_test.shape}")

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
[1m11490434/11490434[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step
MNIST data loaded and preprocessed successfully.
x_train shape: (60000, 784)
y_train shape: (60000, 10)
x_test shape: (10000, 784)
y_test shape: (10000, 10)
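
Because the task asks for a NumPy-only network, the one-hot step can also be written without Keras. The following is a minimal sketch of that idea; `one_hot` is a hypothetical helper name that does not appear in the notebook, and the three labels are made-up demo values.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 one-hot matrix."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Set exactly one entry per row: row i, column labels[i].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

demo = one_hot([3, 0, 9], num_classes=10)
print(demo.shape)           # (3, 10)
print(demo.argmax(axis=1))  # [3 0 9]
```

Taking `argmax` along the class axis recovers the original integer labels, which is the same trick the training loop later uses to compute accuracy.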


## Initialize Neural Network Parameters

### Subtask:
Initialize the weights and biases for all layers of the neural network using NumPy. This will involve defining the architecture (e.g., number of layers, neurons per layer) and randomly initializing parameters.


### Implement `initialize_parameters` function

This function will set up the neural network's architecture by defining the number of layers and neurons, and then initializing the weights with small random values and biases with zeros for each layer.

**Reasoning**:
Now I will implement the `initialize_parameters` function as described, define the network architecture, and then call this function to initialize the weights and biases for the neural network. This fulfills the subtask requirements.



In [2]:
def initialize_parameters(layer_dims):
    """
    Initializes parameters for the neural network.

    Arguments:
    layer_dims -- python list containing the dimensions of each layer in our network

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    np.random.seed(3) # for reproducibility
    parameters = {}
    L = len(layer_dims) # number of entries in layer_dims (input layer included)

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))

        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

    return parameters

# Define the neural network architecture
input_size = x_train.shape[1] # 784 features for MNIST images
hidden_layer_size = 128 # Example hidden layer size
output_size = num_classes # 10 classes for MNIST

# Define the architecture as a list of layer dimensions
# (input_layer_size, hidden_layer_1_size, ..., output_layer_size)
layer_dims = [input_size, hidden_layer_size, output_size]

# Call the function to get the initial parameters
parameters = initialize_parameters(layer_dims)

print("Parameters initialized successfully.")
print(f"Shape of W1: {parameters['W1'].shape}")
print(f"Shape of b1: {parameters['b1'].shape}")
print(f"Shape of W2: {parameters['W2'].shape}")
print(f"Shape of b2: {parameters['b2'].shape}")

Parameters initialized successfully.
Shape of W1: (128, 784)
Shape of b1: (128, 1)
Shape of W2: (10, 128)
Shape of b2: (10, 1)
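
As a quick check on the shapes printed above, the number of trainable parameters follows directly from `layer_dims`. This is a small arithmetic sketch; `per_layer` and `total_params` are illustrative names, not variables from the notebook.

```python
layer_dims = [784, 128, 10]  # same architecture as in the cell above

# Layer l holds layer_dims[l] * layer_dims[l-1] weights plus layer_dims[l] biases.
per_layer = [
    layer_dims[l] * layer_dims[l - 1] + layer_dims[l]
    for l in range(1, len(layer_dims))
]
total_params = sum(per_layer)

print(per_layer)     # [100480, 1290]
print(total_params)  # 101770
```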


## Implement Activation Functions

### Subtask:
Define essential activation functions such as ReLU for hidden layers and Softmax for the output layer, along with their respective derivatives, which are crucial for backpropagation.


**Reasoning**:
I will define the `relu`, `relu_backward`, and `softmax` functions as specified in the instructions to implement the required activation functions and their derivatives.



In [3]:
def relu(Z):
    """
    Implement the ReLU activation function.

    Arguments:
    Z -- Output of the linear layer, of any shape

    Returns:
    A -- Post-activation parameter, of the same shape as Z
    cache -- the input Z itself; stored for computing the backward pass efficiently
    """
    A = np.maximum(0, Z)
    return A, Z

def relu_backward(dA, cache):
    """
    Implement the backward propagation for a single ReLU unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    dZ = np.array(dA, copy=True)  # initializing dZ as a copy of dA
    dZ[Z <= 0] = 0
    return dZ

def softmax(Z):
    """
    Implement the Softmax activation function.

    Arguments:
    Z -- Output of the linear layer, of any shape

    Returns:
    A -- Post-activation parameter, of the same shape as Z
    cache -- the input Z itself; stored for computing the backward pass efficiently
    """
    # Subtract max for numerical stability
    exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    A = exp_Z / np.sum(exp_Z, axis=0, keepdims=True)
    return A, Z

print("Activation functions (ReLU and Softmax) and their derivatives defined successfully.")

Activation functions (ReLU and Softmax) and their derivatives defined successfully.
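
A short sanity check can confirm the two properties these functions rely on: the max-subtraction keeps softmax finite for large inputs, and the ReLU gradient is zeroed wherever `Z <= 0`. The sketch below re-implements both pieces so it runs on its own; `softmax_cols` and the test arrays are assumptions made for illustration.

```python
import numpy as np

def softmax_cols(Z):
    # Column-wise softmax, mirroring the notebook's (classes, examples) layout.
    exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

# The second column would overflow np.exp without the max-subtraction.
Z = np.array([[1.0, 1000.0],
              [2.0, 1000.0],
              [3.0, 1000.0]])
A = softmax_cols(Z)
col_sums = A.sum(axis=0)
print(col_sums)  # each column sums to 1, up to floating-point rounding

# ReLU backward: the upstream gradient passes through only where Z > 0.
Z_relu = np.array([[-1.0, 0.0, 2.0]])
dA = np.ones_like(Z_relu)
dZ = np.where(Z_relu > 0, dA, 0.0)
print(dZ)  # [[0. 0. 1.]]
```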


## Implement Forward Propagation

### Subtask:
Write the function for the forward pass, which takes input data and calculates the outputs of each layer sequentially, leading to the final prediction.


**Reasoning**:
I will define the `linear_forward`, `linear_activation_forward`, and `forward_propagation` functions as specified in the instructions to implement the forward pass for the neural network. This will involve computing the linear transformation and applying the appropriate activation function for each layer.



In [4]:
def linear_forward(A_prev, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: (size of current layer, size of previous layer)
    b -- bias vector: (size of current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter
    cache -- a tuple (A_prev, W, b); stored for computing the backward pass efficiently
    """
    Z = np.dot(W, A_prev) + b

    assert(Z.shape == (W.shape[0], A_prev.shape[1]))
    cache = (A_prev, W, b)

    return Z, cache

def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: (size of current layer, size of previous layer)
    b -- bias vector: (size of current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "relu" or "softmax"

    Returns:
    A -- the output of the activation function, also called the post-activation value
    cache -- a tuple (linear_cache, activation_cache);
             stored for computing the backward pass efficiently
    """

    # The linear step is identical for both activations.
    Z, linear_cache = linear_forward(A_prev, W, b)

    if activation == "relu":
        A, activation_cache = relu(Z)
    elif activation == "softmax":
        A, activation_cache = softmax(Z)
    else:
        raise ValueError(f"Unsupported activation: {activation!r}")

    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache

def forward_propagation(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SOFTMAX computation

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters()

    Returns:
    AL -- last post-activation value
    caches -- list of caches, one per layer, each returned by linear_activation_forward():
                caches[l-1] is the cache of ReLU layer l, for l = 1, ..., L-1
                caches[L-1] is the cache of the final softmax layer
    """

    caches = []
    A = X
    L = len(parameters) // 2  # number of layers in the neural network

    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)], activation = "relu")
        caches.append(cache)

    # Implement LINEAR -> SOFTMAX. Add "cache" to the "caches" list.
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], activation = "softmax")
    caches.append(cache)

    assert(AL.shape == (parameters['W' + str(L)].shape[0], X.shape[1]))

    return AL, caches

print("Forward propagation functions (linear_forward, linear_activation_forward, forward_propagation) defined successfully.")

Forward propagation functions (linear_forward, linear_activation_forward, forward_propagation) defined successfully.
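
The forward pass expects examples as columns, so inputs have shape `(784, m)` rather than `(m, 784)`. The following sketch traces the shapes through one hidden layer on random data. It inlines the computations instead of calling the notebook's functions, and all variable names and the batch size `m = 5` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5  # hypothetical batch of 5 examples

X = rng.random((784, m))                       # one column per example
W1 = rng.standard_normal((128, 784)) * 0.01
b1 = np.zeros((128, 1))
W2 = rng.standard_normal((10, 128)) * 0.01
b2 = np.zeros((10, 1))

Z1 = W1 @ X + b1                               # b1 broadcasts across the m columns
A1 = np.maximum(0, Z1)                         # ReLU
Z2 = W2 @ A1 + b2
exp_Z2 = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = exp_Z2 / exp_Z2.sum(axis=0, keepdims=True)  # softmax over classes

print(Z1.shape, A1.shape, Z2.shape, A2.shape)  # (128, 5) (128, 5) (10, 5) (10, 5)
```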


## Implement the Cost Function

### Subtask:
Implement a function to compute the categorical cross-entropy cost, which quantifies the difference between the predicted probabilities and the true labels. This cost function will be used to evaluate the model's performance during training.

**Reasoning**:
I will define the `compute_cost` function to calculate the categorical cross-entropy cost, as this is essential for evaluating the model's performance and guiding the optimization process during training.



In [5]:
def compute_cost(AL, Y):
    """
    Implement the cost function using categorical cross-entropy.

    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (number of classes, number of examples)
    Y -- true "label" vector (one-hot encoded), shape (number of classes, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]

    # Compute loss from AL and Y
    # To avoid log(0) which is undefined, we clip AL values to a small epsilon
    AL = np.clip(AL, 1e-10, 1 - 1e-10)
    cost = -np.sum(Y * np.log(AL)) / m

    cost = np.squeeze(cost)      # To make sure cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())

    return cost

print("Cost function (compute_cost) defined successfully.")

Cost function (compute_cost) defined successfully.
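
A useful reference point for this cost: a model that assigns equal probability to all 10 classes has a cross-entropy of ln(10) ≈ 2.3026, which is close to what a freshly initialized network with tiny weights produces. The sketch below checks this by hand; the array names and the choice of four examples are assumptions for illustration.

```python
import numpy as np

num_classes, m = 10, 4

# Uniform predictions: every class gets probability 1/10 for every example.
AL_uniform = np.full((num_classes, m), 1.0 / num_classes)

# One-hot labels for classes 0..3, stored as columns.
Y = np.eye(num_classes)[:, :m]

cost = -np.sum(Y * np.log(AL_uniform)) / m
print(cost, np.log(num_classes))  # both are approximately 2.3026
```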


## Implement Backward Propagation

### Subtask:
Implement the backward pass functions to compute the gradients of the cost with respect to the weights and biases for each layer. This will involve defining functions for `linear_backward`, `relu_backward`, `softmax_backward` (implicitly handled by cross-entropy derivative), and `backward_propagation`.

**Reasoning**:
I will implement the `linear_backward` and `backward_propagation` functions as specified, which are crucial for calculating gradients during the backward pass of the neural network. The `relu_backward` function is already defined.



In [6]:
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = dZ.shape[1]

    dW = 1./m * np.dot(dZ, A_prev.T)
    db = 1./m * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)

    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)

    return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.

    Arguments:
    dA -- post-activation gradient for current layer l
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "relu" or "softmax"

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache

    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
    elif activation == "softmax":
        # For softmax combined with categorical cross-entropy, the gradient
        # with respect to Z simplifies to AL - Y. The caller passes that
        # quantity in as `dA`, so it is used directly as dZ.
        dZ = dA
    else:
        raise ValueError(f"Unsupported activation: {activation!r}")

    dA_prev, dW, db = linear_backward(dZ, linear_cache)

    return dA_prev, dW, db

def backward_propagation(AL, Y, caches):
    """
    Implement the backward propagation for the entire network.

    Arguments:
    AL -- probability vector, output of forward_propagation()
    Y -- true "label" vector (one-hot encoded), same shape as AL
    caches -- list of caches, one per layer, each returned by linear_activation_forward():
                caches[l-1] is the cache of ReLU layer l, for l = 1, ..., L-1
                caches[L-1] is the cache of the final softmax layer

    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ...
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # ensure Y is the same shape as AL

    # Initializing the backpropagation
    # For softmax with categorical cross-entropy, the gradient with respect to
    # the output layer's pre-activation Z is AL - Y; it is stored in dAL here.
    dAL = AL - Y

    # Lth layer (SOFTMAX -> LINEAR) gradients
    current_cache = caches[L-1]
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, activation = "softmax")

    # Loop from l=L-2 to l=0
    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 1)], current_cache, activation = "relu")
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads

print("Backward propagation functions (linear_backward, linear_activation_backward, backward_propagation) defined successfully.")

Backward propagation functions (linear_backward, linear_activation_backward, backward_propagation) defined successfully.
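
The shortcut used above, treating `AL - Y` as the gradient with respect to the output layer's pre-activation, can be checked numerically. The sketch below compares it against central finite differences on a small random example. It uses the summed (not averaged) loss, since the notebook applies the `1/m` factor later inside `linear_backward`. The function names, sizes, and `eps` are assumptions chosen for illustration.

```python
import numpy as np

def softmax_cols(Z):
    exp_Z = np.exp(Z - Z.max(axis=0, keepdims=True))
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)

def summed_cross_entropy(Z, Y):
    # Sum over examples, so dLoss/dZ should equal softmax(Z) - Y.
    return -np.sum(Y * np.log(softmax_cols(Z)))

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 3))   # 4 classes, 3 examples
Y = np.eye(4)[:, [2, 0, 1]]       # one-hot columns for labels 2, 0, 1

analytic = softmax_cols(Z) - Y

# Central finite differences, one entry of Z at a time.
numeric = np.zeros_like(Z)
eps = 1e-6
for idx in np.ndindex(*Z.shape):
    Z_plus, Z_minus = Z.copy(), Z.copy()
    Z_plus[idx] += eps
    Z_minus[idx] -= eps
    numeric[idx] = (summed_cross_entropy(Z_plus, Y)
                    - summed_cross_entropy(Z_minus, Y)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)  # expected to be tiny if the analytic gradient is right
```

A check like this is slow and only practical on toy sizes, but it is a common way to catch sign or transpose mistakes in hand-written backpropagation.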


## Implement the Training Loop

### Subtask:
Implement the training loop using gradient descent. This involves repeatedly performing forward propagation, computing the cost, executing backward propagation to get gradients, and updating the parameters (weights and biases) for a specified number of epochs.

**Reasoning**:
I will define the `update_parameters` function, which is a crucial component of the training loop for adjusting the network's weights and biases based on the calculated gradients.



In [7]:
def update_parameters(parameters, grads, learning_rate):
    """
    Updates parameters using gradient descent update rule.

    Arguments:
    parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients, output of backward_propagation()
    learning_rate -- the learning rate, a scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    """

    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter
    for l in range(1, L + 1):
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return parameters

# Define a helper function that returns predicted class labels (used to compute accuracy)
def predict(X, parameters):
    """
    This function is used to predict the results of an L-layer neural network.

    Arguments:
    X -- data set of examples you would like to label
    parameters -- parameters of the trained model

    Returns:
    predictions -- predictions for the given dataset X
    """

    # Forward propagation (the caches are not needed for prediction)
    probas, _ = forward_propagation(X, parameters)

    # Convert probabilities to actual predictions
    predictions = np.argmax(probas, axis=0)

    return predictions

print("Parameter update function (update_parameters) and prediction function (predict) defined successfully.")

Parameter update function (update_parameters) and prediction function (predict) defined successfully.
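
Because `predict` returns class indices rather than probabilities, accuracy is computed by comparing those indices with the `argmax` of the one-hot labels. The sketch below works through this on a hand-made 3-class example; the probability and label values are invented for illustration.

```python
import numpy as np

# Columns are examples, rows are classes (same layout as the network output).
probas = np.array([[0.7, 0.1, 0.2],
                   [0.2, 0.8, 0.3],
                   [0.1, 0.1, 0.5]])
Y = np.array([[1, 0, 0],
              [0, 1, 1],
              [0, 0, 0]])

predictions = np.argmax(probas, axis=0)   # [0 1 2]
true_labels = np.argmax(Y, axis=0)        # [0 1 1]
accuracy = np.mean(predictions == true_labels) * 100

print(predictions, true_labels)
print(f"{accuracy:.2f}%")  # 66.67%
```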


**Reasoning**:
I will define the main `L_layer_model` function that orchestrates the entire training process. This function will integrate forward propagation, cost computation, backward propagation, and parameter updates over a specified number of epochs, and also track the loss and accuracy.



In [8]:
def L_layer_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost = False, print_accuracy = False):
    """
    Implements a L-layer neural network: [LINEAR->RELU]*(L-1)->LINEAR->SOFTMAX.

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    Y -- true "label" vector (one-hot encoded), of shape (number of classes, number of examples)
    layers_dims -- list containing the input size and each layer's size, of length (number of layers + 1).
    learning_rate -- learning rate of the gradient descent update rule
    num_iterations -- number of iterations to run the optimization loop
    print_cost -- if True, it prints the cost every 100 steps
    print_accuracy -- if True, it prints the accuracy every 100 steps

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    costs -- list of costs over iterations
    train_accuracies -- list of training accuracies over iterations
    """

    np.random.seed(1) # for reproducibility
    costs = []                         # keep track of cost
    train_accuracies = []              # keep track of training accuracy

    # Parameters initialization.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SOFTMAX.
        AL, caches = forward_propagation(X, parameters)

        # Compute cost.
        cost = compute_cost(AL, Y)

        # Backward propagation.
        grads = backward_propagation(AL, Y, caches)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Record the cost and training accuracy every 100 iterations and on
        # the final iteration; the print flags only control console output.
        is_checkpoint = i % 100 == 0 or i == num_iterations - 1
        if is_checkpoint:
            costs.append(cost)
            if print_cost:
                print(f"Cost after iteration {i}: {cost}")
            predictions_train = predict(X, parameters)
            true_labels_train = np.argmax(Y, axis=0)
            train_accuracy = np.mean(predictions_train == true_labels_train) * 100
            train_accuracies.append(train_accuracy)
            if print_accuracy:
                print(f"Train accuracy after iteration {i}: {train_accuracy:.2f}%")

    return parameters, costs, train_accuracies

print("L_layer_model (training loop) function defined successfully.")


L_layer_model (training loop) function defined successfully.


## Evaluate Model Performance

### Subtask:
After training, evaluate the performance of the neural network on the test dataset. Calculate metrics such as accuracy to assess how well the model generalizes to unseen data.


**Reasoning**:
I will call the `L_layer_model` function to train the neural network using the preprocessed training data and the defined architecture. This step will execute the forward and backward propagation cycles, update parameters, and track cost and accuracy over epochs, which is essential before evaluating on the test set.



In [10]:
print(f"Shape of x_train before transpose: {x_train.shape}")
print(f"Shape of y_train before transpose: {y_train.shape}")

# Transpose x_train and y_train for the model
X_train_T = x_train.T
Y_train_T = y_train.T

print(f"Shape of X_train_T after transpose: {X_train_T.shape}")
print(f"Shape of Y_train_T after transpose: {Y_train_T.shape}")

# Call the L_layer_model function
# layer_dims was defined in a previous step as [input_size, hidden_layer_size, output_size]
parameters, costs, train_accuracies = L_layer_model(X_train_T, Y_train_T, layer_dims, learning_rate = 0.01, num_iterations = 1500, print_cost = True, print_accuracy = True)

print("\nTraining complete. Model parameters, costs, and training accuracies obtained.")

Shape of x_train before transpose: (60000, 784)
Shape of y_train before transpose: (60000, 10)
Shape of X_train_T after transpose: (784, 60000)
Shape of Y_train_T after transpose: (10, 60000)
Cost after iteration 0: 2.302442957095392
Train accuracy after iteration 0: 8.95%
Cost after iteration 100: 2.289347268205952
Train accuracy after iteration 100: 42.59%
Cost after iteration 200: 2.26486110878442
Train accuracy after iteration 200: 47.60%
Cost after iteration 300: 2.213510124411725
Train accuracy after iteration 300: 49.38%
Cost after iteration 400: 2.1127549827421968
Train accuracy after iteration 400: 55.84%
Cost after iteration 500: 1.9407679408976544
Train accuracy after iteration 500: 64.93%
Cost after iteration 600: 1.6982791686007352
Train accuracy after iteration 600: 69.38%
Cost after iteration 700: 1.4362668526053433
Train accuracy after iteration 700: 71.08%
Cost after iteration 800: 1.215117278670913
Train accuracy after iteration 800: 73.61%
Cost after iteration 900: 1