In [9]:
import numpy as np

The following function initializes parameters for a **2-layer Neural Network**.

In [7]:
def initialize_parameters(nx, nh, ny):
    """
    Inputs:
    nx --- size of the input layer
    nh --- size of the hidden layer
    ny --- size of the output layer
    By "size" we mean the number of units in a particular layer
    
    Outputs:
    parameters --- python dictionary containing initialized parameters:
        W1 - weight matrix of shape (nh, nx)
        b1 - bias vector of shape (nh, 1)
        W2 - weight matrix of shape (ny, nh)
        b2 - bias vector of shape (ny, 1)
    """
    
    np.random.seed(1) # to make the results of the function call reproducible
    
    # initializing weight parameters with random values
    # and bias parameters with zeros
    W1 = np.random.randn(nh, nx) * 0.01
    b1 = np.zeros((nh, 1))
    W2 = np.random.randn(ny, nh) * 0.01
    b2 = np.zeros((ny, 1))
    # we multiply the weight matrices by 0.01 to avoid "neuron saturation":
    # if weights are too large, sigmoid-like neurons can become saturated,
    # which means they become less sensitive to changes in the input,
    # slowing down learning during training
    
    # storing the initialized parameters in a dictionary
    # to facilitate convenient retrieval at a later time
    parameters = {
        "W1": W1,
        "b1": b1,
        "W2": W2,
        "b2": b2
    }
    
    return parameters
    

In [8]:
parameters = initialize_parameters(3,2,1)

print("W1 = " + str(parameters["W1"]))

W1 = [[ 0.01624345 -0.00611756 -0.00528172]
 [-0.01072969  0.00865408 -0.02301539]]
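
The saturation argument from the comments in `initialize_parameters` can be illustrated with the derivative of the sigmoid function. The sketch below is only an illustration: the helper names `sigmoid` and `sigmoid_grad` and the two sample pre-activation values are assumptions, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    # logistic function 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

small_z = 0.05   # typical magnitude when weights are scaled by 0.01
large_z = 10.0   # possible magnitude when weights are left unscaled

grad_small = sigmoid_grad(small_z)
grad_large = sigmoid_grad(large_z)

print("gradient at small z:", grad_small)  # close to the maximum of 0.25
print("gradient at large z:", grad_large)  # several orders of magnitude smaller
```

A near-zero derivative means that the weights feeding such a neuron receive almost no update, which is the slowdown the comments describe.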


The following function initializes parameters for an **L-layer Neural Network**.

In [15]:
def initialize_parameters_deep(layers):
    """
    Inputs:
    layers - list containing the number of units for each layer in the network
    
    Outputs:
    parameters - dictionary containing parameters for every layer in the network
        Wl - weight matrix of shape (layers[l], layers[l-1])
        bl - bias column vector of shape (layers[l], 1)
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layers)
    
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers[l], layers[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layers[l], 1))
        
        assert(parameters['W' + str(l)].shape == (layers[l], layers[l-1]))
        assert(parameters['b' + str(l)].shape == (layers[l], 1))
    
    return parameters

In [16]:
parameters = initialize_parameters_deep([5,4,3])

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

W1 = [[ 0.01788628  0.0043651   0.00096497 -0.01863493 -0.00277388]
 [-0.00354759 -0.00082741 -0.00627001 -0.00043818 -0.00477218]
 [-0.01313865  0.00884622  0.00881318  0.01709573  0.00050034]
 [-0.00404677 -0.0054536  -0.01546477  0.00982367 -0.01101068]]
b1 = [[0.]
 [0.]
 [0.]
 [0.]]
W2 = [[-0.01185047 -0.0020565   0.01486148  0.00236716]
 [-0.01023785 -0.00712993  0.00625245 -0.00160513]
 [-0.00768836 -0.00230031  0.00745056  0.01976111]]
b2 = [[0.]
 [0.]
 [0.]]
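
The shapes printed above follow directly from the `layers` list. As a small sketch (the dictionary name `expected_shapes` is hypothetical), the expected shape of every parameter can be derived without creating any arrays:

```python
layers = [5, 4, 3]  # same layer sizes as in the call above

expected_shapes = {}
for l in range(1, len(layers)):
    # weights map layer l-1 (columns) to layer l (rows)
    expected_shapes["W" + str(l)] = (layers[l], layers[l - 1])
    # one bias per unit of layer l, stored as a column vector
    expected_shapes["b" + str(l)] = (layers[l], 1)

print(expected_shapes)
# {'W1': (4, 5), 'b1': (4, 1), 'W2': (3, 4), 'b2': (3, 1)}
```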


The following function performs the **linear part of the forward propagation step**: it computes the linear combination of the input features (the activations of the previous layer) and the corresponding weights, then adds the bias.

In [17]:
def linear_forward(A, W, b):
    """
    Inputs:
        A - matrix of activations from the previous layer
            (size of the previous layer, number of training examples)
        W - weights matrix
            (size of the current layer, size of previous layer)
        b - bias vector
            (size of the current layer, 1)
            
    Outputs:
        Z - pre-activation parameter (linear combination of weights and input features)
            (size of the current layer, number of training examples)
        cache - tuple containing "A", "W" and "b", stored for efficient computation of the backward pass
    """
    Z = np.dot(W, A) + b
    
    cache = (A, W, b)
    
    return Z, cache
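
To make the shapes concrete, here is a minimal sketch of the same computation on hand-picked values. The sizes (3 units in the previous layer, 2 units in the current layer, 4 training examples) and the numbers are assumptions chosen only so the result is easy to verify by hand.

```python
import numpy as np

# activations of the previous layer: 3 units, 4 training examples
A = np.arange(12, dtype=float).reshape(3, 4)
# weights of the current layer: 2 units, each connected to 3 inputs
W = np.ones((2, 3))
# one bias per unit of the current layer
b = np.array([[1.0], [-1.0]])

# b has shape (2, 1) and is broadcast across the 4 example columns
Z = np.dot(W, A) + b

print(Z.shape)  # (2, 4)
print(Z)
```

Because every weight is 1, each entry of `np.dot(W, A)` is a column sum of `A` (12, 15, 18, 21), and the two rows of `Z` differ only by their bias.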

The following function combines the linear step with the activation step. It takes the activations of the previous layer and the parameters of the current layer, computes the linear part of the forward propagation step, and then applies the activation function of the current layer to obtain that layer's activations. The helpers `sigmoid()` and `relu()` are not defined in this section; the code expects each of them to return the activation together with a cache. The intermediate values are stored in a cache so that the backward pass can reuse them.

In [20]:
def linear_activation_forward(A_prev, W, b, activation):
    """
    Inputs:
        A_prev - activations from the previous layer
            (size of the previous layer, number of training examples)
        W - weights matrix
            (size of the current layer, size of the previous layer)
        b - bias vector 
            (size of the current layer, 1)
        activation - the name of the activation function of the current layer
        
    Outputs:
        A - post-activation value
            (the size of the current layer, number of training examples)
        cache - tuple containing linear_cache and activation_cache
    """
    
    # the linear step is the same for every activation function
    Z, linear_cache = linear_forward(A_prev, W, b)
    
    if activation == "sigmoid":
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        A, activation_cache = relu(Z)
    else:
        raise ValueError("activation must be 'sigmoid' or 'relu', got " + repr(activation))
        
    cache = (linear_cache, activation_cache)
    
    return A, cache
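
`linear_activation_forward()` relies on `sigmoid()` and `relu()`, which do not appear in this section. A minimal sketch of what they might look like is given below, assuming that each returns the activation together with `Z` as its cache (this matches how the return values are unpacked above, but the actual helpers may differ).

```python
import numpy as np

def sigmoid(Z):
    # element-wise logistic function; Z is cached for the backward pass
    A = 1.0 / (1.0 + np.exp(-Z))
    cache = Z
    return A, cache

def relu(Z):
    # element-wise max(0, z); Z is cached for the backward pass
    A = np.maximum(0, Z)
    cache = Z
    return A, cache

Z = np.array([[-2.0, 0.0, 3.0]])
A_sig, cache_sig = sigmoid(Z)
A_relu, cache_relu = relu(Z)

print(A_sig)   # values strictly between 0 and 1, with 0.5 at z = 0
print(A_relu)  # [[0. 0. 3.]]
```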

The following function performs the forward propagation step from the input layer to the output layer of the network. It loops over the layers and, at each iteration, calls `linear_activation_forward()` with the activation function of the current layer to compute the post-activation values and the cache. The first `L-1` layers use ReLU, while the last layer uses the sigmoid function.

In [21]:
def L_model_forward(X, parameters):
    """
    Inputs:
    X - input features (activations of the layer zero)
        (size of the layer zero, number of training examples)
    parameters - output of initialize_parameters_deep(), contains 
                 parameter values for each layer 
    
    Outputs:
    AL - post-activation value from the output layer
    caches - list of caches from linear_activation_forward()
    """
    caches = []
    A = X
    L = len(parameters) // 2  # number of layers in NN
    # the dictionary "parameters" holds 2 entries (weights and bias) for each layer,
    # so dividing its length by 2 gives the number of layers in the NN
    
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters["W" + str(l)], parameters["b" + str(l)], "relu")
        caches.append(cache)
        
    # the output layer uses its own weights and bias, i.e. index L for both
    AL, cache = linear_activation_forward(A, parameters["W" + str(L)], parameters["b" + str(L)], "sigmoid")
    caches.append(cache)
    
    return AL, caches
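
The structure of the forward pass can be sketched in a compact, self-contained form. The function `forward_sketch`, the layer sizes `[5, 4, 3, 1]`, and the random inputs are assumptions for illustration only; unlike `L_model_forward()`, this sketch does not collect caches.

```python
import numpy as np

def forward_sketch(X, parameters):
    # two entries (W and b) per layer
    L = len(parameters) // 2
    A = X
    # hidden layers 1 .. L-1: LINEAR -> RELU
    for l in range(1, L):
        Z = parameters["W" + str(l)] @ A + parameters["b" + str(l)]
        A = np.maximum(0, Z)
    # output layer L: LINEAR -> SIGMOID
    ZL = parameters["W" + str(L)] @ A + parameters["b" + str(L)]
    return 1.0 / (1.0 + np.exp(-ZL))

rng = np.random.default_rng(0)
layers = [5, 4, 3, 1]  # hypothetical network: 5 inputs, 1 output unit
parameters = {}
for l in range(1, len(layers)):
    parameters["W" + str(l)] = rng.standard_normal((layers[l], layers[l - 1])) * 0.01
    parameters["b" + str(l)] = np.zeros((layers[l], 1))

X = rng.standard_normal((5, 7))  # 7 training examples
AL = forward_sketch(X, parameters)
print(AL.shape)  # (1, 7)
```

Each column of `AL` is the predicted probability for one training example, so the output has one row and as many columns as `X`.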

This function computes the cost to quantify the error between true labels and predicted labels. By calculating the derivatives of the cost function with respect to the parameters, we can determine the magnitude and direction of the required parameter changes. These changes are then used to update the model, enabling it to generate more accurate results.

In [22]:
def compute_cost(AL, Y):
    """
    Inputs:
    AL - the post-activation value of the output layer
        (1, number of training examples)
    Y - the vector containing true labels for the training examples
        (1, number of training examples)
        
    Outputs:
    cost - cross-entropy cost
    """
    m = Y.shape[1]
    
    cost = -(np.dot(Y, np.log(AL).T) + np.dot(1-Y, np.log(1-AL).T)) / m
    
    cost = np.squeeze(cost)
    
    return cost
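
As a quick sanity check, the vectorized expression for the cost can be compared with the same sum written out term by term. The predictions and labels below are made-up values for three training examples.

```python
import numpy as np

AL = np.array([[0.8, 0.9, 0.4]])  # hypothetical predicted probabilities
Y = np.array([[1, 1, 0]])         # hypothetical true labels
m = Y.shape[1]

# same expression as in compute_cost()
cost_vectorized = np.squeeze(
    -(np.dot(Y, np.log(AL).T) + np.dot(1 - Y, np.log(1 - AL).T)) / m
)

# for y = 1 the term is log(a), for y = 0 it is log(1 - a)
cost_by_hand = -(np.log(0.8) + np.log(0.9) + np.log(1 - 0.4)) / 3

print(cost_vectorized, cost_by_hand)  # both are approximately 0.28
```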

The next function implements the linear part of backward propagation for a single layer.

In [24]:
def linear_backward(dZ, cache):
    """
    Inputs:
    dZ - gradient of the cost with respect to the current layer's linear output
        (size of the current layer, number of training examples)
    cache - cached values (A_prev, W, b) from the forward pass
    
    Returns:
    dA_prev - gradient of the cost with respect to the activation of the previous layer
        (size of the previous layer, number of training examples)
    dW - gradient of the cost with respect to weights of the current layer
        (size of the current layer, size of the previous layer)
    db - gradient of the cost with respect to the bias of the current layer
        (size of the current layer, 1)
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]
    
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis = 1, keepdims = True) / m
    dA_prev = np.dot(W.T, dZ)
    
    return  dA_prev, dW, db

$$dW_{i,j} = \frac{1}{m} \sum_{k=1}^{m} dZ[i][k] \cdot A_{\text{prev}}[j][k]$$
Here $k$ indexes the training examples, so $dZ[i][k]$ and $A_{\text{prev}}[j][k]$ always belong to the same training example: the first is the gradient at unit $i$ of the current layer, the second is the activation of unit $j$ of the previous layer. Summing the products over all $m$ examples and dividing by $m$ makes the derivative of the cost with respect to each weight an average over the whole training set. This is exactly what `np.dot(dZ, A_prev.T) / m` computes.
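
The summation for `dW` can be checked numerically against the vectorized expression used in `linear_backward()`. In the sketch below, `k` denotes the index of a training example, and the sizes and random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr, m = 3, 2, 5  # hypothetical layer sizes and number of examples
A_prev = rng.standard_normal((n_prev, m))
dZ = rng.standard_normal((n_curr, m))

# vectorized form, as in linear_backward()
dW_vectorized = np.dot(dZ, A_prev.T) / m

# explicit form: for each weight (i, j), average over the examples k
dW_loops = np.zeros((n_curr, n_prev))
for i in range(n_curr):
    for j in range(n_prev):
        for k in range(m):
            dW_loops[i, j] += dZ[i, k] * A_prev[j, k]
dW_loops /= m

print(np.allclose(dW_vectorized, dW_loops))  # True
```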

This function combines the operations of `linear_backward()` (dZ → dA_prev, dW, db) with the backpropagation steps of the activation functions `sigmoid_backward()` and `relu_backward()` (dA → dZ).

In [25]:
def linear_activation_backward(dA, cache, activation):
    """
    Inputs:
    dA - post-activation gradient
        (size of the current layer, m training examples)
    cache - tuple of values (linear_cache, activation_cache)
    activation - the name of activation function used in this layer
    
    Outputs:
    dA_prev - gradient of the cost with respect to the activations of the previous layer
    dW, db - gradients of the cost with respect to parameters
    """
    linear_cache, activation_cache = cache
    
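    # step 1: activation backward (dA -> dZ)
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
    else:
        raise ValueError("activation must be 'sigmoid' or 'relu', got " + repr(activation))
        
    # step 2: linear backward (dZ -> dA_prev, dW, db)
    dA_prev, dW, db = linear_backward(dZ, linear_cache)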
        
    return dA_prev, dW, db
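
The helpers `sigmoid_backward()` and `relu_backward()` are used above but not shown in this section. One possible sketch follows, assuming that the activation cache is simply the pre-activation `Z`; the real helpers may be implemented differently.

```python
import numpy as np

def sigmoid_backward(dA, cache):
    # dZ = dA * s'(Z), where s'(Z) = s(Z) * (1 - s(Z))
    Z = cache
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1.0 - s)

def relu_backward(dA, cache):
    # the gradient passes through where Z > 0 and is blocked elsewhere
    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ

Z = np.array([[-1.0, 0.0, 2.0]])
dA = np.ones_like(Z)

dZ_relu = relu_backward(dA, Z)
dZ_sig = sigmoid_backward(dA, Z)

print(dZ_relu)  # [[0. 0. 1.]]
print(dZ_sig)   # largest at z = 0, where the sigmoid derivative is 0.25
```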

The following function implements backpropagation for the [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID model.

In [26]:
def L_model_backward(AL, Y, caches):
    """
    Inputs:
    AL - probability vector, output of the forward propagation (L_model_forward())
    Y - true "label" vector (containing 0 if non-cat, 1 if cat)
    caches - list of caches
                
    Outputs:
    grads - A dictionary with the gradients
    """
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) 
    
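    # derivative of the cross-entropy cost with respect to AL
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    
    # output layer (SIGMOID -> LINEAR); caches[L-1] belongs to layer L
    current_cache = caches[L-1]
    dA_prev_temp, dW_temp, db_temp = linear_activation_backward(dAL, current_cache, "sigmoid")
    grads["dA" + str(L-1)] = dA_prev_temp
    grads["dW" + str(L)] = dW_temp
    grads["db" + str(L)] = db_temp
    
    # hidden layers (RELU -> LINEAR), from layer L-1 down to layer 1;
    # caches[l] belongs to layer l + 1
    for l in reversed(range(L-1)):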
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(dA_prev_temp, current_cache, "relu")
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads
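
For the output layer, chaining `dAL` with the derivative of the sigmoid collapses to the simple expression `AL - Y`. The sketch below checks this identity numerically on made-up predictions and labels; it is meant as a sanity check, not as a replacement for `L_model_backward()`.

```python
import numpy as np

AL = np.array([[0.8, 0.9, 0.4]])  # hypothetical sigmoid outputs
Y = np.array([[1, 1, 0]])         # hypothetical true labels

# derivative of the cross-entropy cost with respect to AL
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

# sigmoid derivative expressed through its output: AL * (1 - AL)
dZ = dAL * AL * (1 - AL)

print(np.allclose(dZ, AL - Y))  # True
```

The identity holds because multiplying `-(Y/AL - (1-Y)/(1-AL))` by `AL * (1 - AL)` gives `-(Y * (1 - AL) - (1 - Y) * AL)`, which simplifies to `AL - Y`.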

The last function updates the parameters of each layer with one gradient descent step: every parameter is moved against its gradient, scaled by the learning rate.

In [27]:

def update_parameters(params, grads, learning_rate):
    """
    Inputs:
    params - dictionary containing parameters
    grads - dictionary containing gradients, output of L_model_backward
    learning_rate - step size of the gradient descent update
    
    Outputs:
    parameters - dictionary containing updated parameters
    """
    parameters = params.copy()
    L = len(parameters) // 2 


    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate*grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate*grads["db" + str(l+1)]
    return parameters
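
A single update step can be traced by hand on a tiny example. The one-layer parameters, the gradients, and the learning rate below are made-up values, and the loop mirrors the update rule of `update_parameters()` rather than calling it.

```python
import numpy as np

# hypothetical one-layer network with two weights and one bias
parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.1, -0.2]]), "db1": np.array([[1.0]])}
learning_rate = 0.1

updated = {}
L = len(parameters) // 2
for l in range(1, L + 1):
    # gradient descent: parameter = parameter - learning_rate * gradient
    updated["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
    updated["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]

print(updated["W1"])  # approximately [[0.99 2.02]]
print(updated["b1"])  # approximately [[0.4]]
```

A positive gradient decreases the parameter and a negative gradient increases it, so each step moves the parameters in the direction that lowers the cost.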