# Neural Network Basics
In this lab, you will learn the basics of a neural network. You will build several basic modules of a neural network and, finally, your own deep neural network!

**Outline**
- [1 - Initialization parameters](#1)
- [2 - Forward Propagation Module](#2)
    - [2.1 - Linear Forward](#2-1)
    - [2.2 - Activation Function](#2-2)
    - [2.3 - L-layer linear_activation_forward](#2-3)
- [3 - Cost Function](#3)
- [4 - Backward Propagation Module](#4)
    - [4.1 - Linear Backward](#4-1)
    - [4.2 - Linear-Activation Backward](#4-2)
    - [4.3 - L-layer Backward](#4-3)
    - [4.4 - Update Parameters](#4-4)
- [5 - Build up your own network](#5)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score

<a name='1'></a>
## 1 - Initialization parameters
Weight initialization is an important design choice when developing deep learning neural network models.

Neural network models are fit using an optimization algorithm called stochastic gradient descent, which incrementally changes the network weights to minimize a loss function, ideally resulting in a set of weights for the model that is capable of making useful predictions. This optimization algorithm requires a starting point in the space of possible weight values from which to begin the optimization process. Weight initialization is the procedure that sets the weights of a neural network to small random values, defining the starting point for the optimization (learning or training) of the model.
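
As a small illustration of the "small random values" idea, the sketch below draws weights with NumPy's legacy seeded generator and scales them by 0.01. The array names (`W_a`, `W_b`, `W_c`) and the shape `(2, 3)` are assumptions chosen for illustration only. It shows that the same seed reproduces the same starting point, while a different seed gives a different one.

```python
import numpy as np

# Same seed -> identical starting weights (useful for reproducible experiments)
np.random.seed(1)
W_a = np.random.randn(2, 3) * 0.01
np.random.seed(1)
W_b = np.random.randn(2, 3) * 0.01

# Different seed -> a different starting point in weight space
np.random.seed(2)
W_c = np.random.randn(2, 3) * 0.01

print(np.array_equal(W_a, W_b))  # True
print(np.array_equal(W_a, W_c))  # False
print(np.abs(W_a).max())         # small, because of the 0.01 scaling
```

Scaling by 0.01 keeps the initial pre-activations close to zero, where the sigmoid is not saturated.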

In [None]:
def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer

    Returns:
    parameters -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """

    np.random.seed(1) # please check the usage of the random seed and try different seeds

    W1 = np.random.randn(n_h,n_x)*0.01
    b1 = np.zeros((n_h,1)) # here we set bias value as zero, please try different values
    W2 = np.random.randn(n_y,n_h)*0.01
    b2 = np.zeros((n_y,1))

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters

In [None]:
parameters = initialize_parameters(3,2,1)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

<a name='2'></a>
## 2 - Forward Propagation Module
Forward propagation is where input data is fed through a network, in a forward direction, to generate an output. The data is accepted by hidden layers and processed, as per the activation function, and moves to the successive layer.

<a name='2-1'></a>
### 2.1 - Linear Forward Propagation
The linear forward module (vectorized over all the examples) computes the following equations:

$$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}\tag{4}$$

where $A^{[0]} = X$.
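
To make the shapes in this equation concrete, here is a minimal sketch with made-up layer sizes (`n_prev`, `n_cur` and `m` are illustrative assumptions). It shows that the bias column broadcasts across all examples and that each column of $Z$ is the per-example computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_cur, m = 3, 2, 4          # illustrative sizes: previous layer, current layer, examples

A_prev = rng.standard_normal((n_prev, m))   # one column per example
W = rng.standard_normal((n_cur, n_prev))
b = rng.standard_normal((n_cur, 1))

Z = W @ A_prev + b                  # b of shape (n_cur, 1) broadcasts across the m columns
print(Z.shape)                      # (2, 4)

# Column i of Z equals W a_i + b computed for example i alone
i = 1
z_i = W @ A_prev[:, [i]] + b
print(np.allclose(Z[:, [i]], z_i))  # True
```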

In [None]:
def linear_forward(A, W, b):
    """
    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter
    cache -- a python tuple containing "A", "W" and "b" ; stored for computing the backward pass efficiently
    """


    Z = np.dot(W,A) + b
    cache = (A, W, b)

    return Z, cache

In [None]:
np.random.seed(1)
t_A = np.random.randn(5,1)
t_W = np.random.randn(5,5)
t_b = np.random.randn(5,1)
t_Z, t_linear_cache = linear_forward(t_A, t_W, t_b)
print("Z = " + str(t_Z))

<a name='2-2'></a>
### 2.2 - Activation Function
The activation function is applied to the weighted sum of a neuron's inputs plus its bias and determines how strongly the neuron is activated. Its purpose is to introduce non-linearity into the output of a neuron.

In this notebook, you will use two activation functions:

- **Sigmoid**: $\sigma(Z) = \sigma(W A + b) = \frac{1}{ 1 + e^{-(W A + b)}}$.


- **ReLU**: The mathematical formula for ReLU is $A = \mathrm{ReLU}(Z) = \max(0, Z)$.
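
As a quick numeric check of the two formulas, the following sketch evaluates both activations on a small hand-picked array (the values in `Z` are arbitrary and chosen only for illustration).

```python
import numpy as np

Z = np.array([[-2.0, 0.0, 3.0]])

sig = 1 / (1 + np.exp(-Z))   # sigmoid squashes every value into (0, 1)
rel = np.maximum(0, Z)       # ReLU zeroes out negative values and keeps positive ones

print(sig)  # approximately [[0.119 0.5 0.953]]
print(rel)  # [[0. 0. 3.]]
```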


In [None]:
def sigmoid(Z):
    Z_activation = 1 / (1 + np.exp(-Z))
    cache = (Z)
    return Z_activation , cache

def relu(Z):
    Z_activation = np.maximum(Z , 0)
    cache = (Z)
    return Z_activation , cache

In [None]:
def activation_forward(A_prev, W, b, activation):
    """
    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value
    cache -- a python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """

    if activation == "sigmoid":
        Z, linear_cache = linear_forward(A_prev, W, b)

        A, activation_cache = sigmoid(Z)

    elif activation == "relu":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)

    cache = (linear_cache, activation_cache)

    return A, cache

In [None]:
t_A = np.random.randn(5,1)
t_W = np.random.randn(5,5)
t_b = np.random.randn(5,1)

# activation_forward returns the post-activation value A and the cache
t_A_out, t_cache = activation_forward(t_A , t_W , t_b, activation = "sigmoid")
print("With sigmoid: A = " + str(t_A_out))

t_A_out, t_cache = activation_forward(t_A , t_W , t_b, activation = "relu")
print("With relu: A = " + str(t_A_out))

<a name='2-3'></a>
### 2.3 - L-layer linear_activation_forward
For even more convenience when implementing the $L$-layer neural network, you will need a function that repeats the previous one (`activation_forward` with **ReLU**) $L-1$ times, then follows that with **one** `activation_forward` with **sigmoid**.
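
The initializer defined earlier in this notebook builds exactly two layers. To see the $L-1$ ReLU layers followed by one sigmoid layer with a deeper network, here is a hedged sketch. The helpers `init_deep_params` and `forward_sketch` are hypothetical names introduced only for this example and are not part of the notebook's own functions.

```python
import numpy as np

def init_deep_params(layer_dims, seed=1):
    """Hypothetical helper: layer_dims = [n_x, n_h1, ..., n_y]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # W_l has shape (current layer size, previous layer size)
        params["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

def forward_sketch(X, params):
    """[LINEAR -> RELU] * (L-1) -> LINEAR -> SIGMOID, without caching."""
    L = len(params) // 2                      # one W and one b per layer
    A = X
    for l in range(1, L):                     # layers 1 .. L-1 use ReLU
        A = np.maximum(0, params["W" + str(l)] @ A + params["b" + str(l)])
    Z_last = params["W" + str(L)] @ A + params["b" + str(L)]
    return 1 / (1 + np.exp(-Z_last))          # layer L uses sigmoid

params = init_deep_params([4, 5, 3, 1])       # 3 layers: two ReLU, one sigmoid
X = np.random.default_rng(0).standard_normal((4, 6))
AL = forward_sketch(X, params)
print(len(params) // 2)  # 3
print(AL.shape)          # (1, 6)
```

The sketch omits the caches that the backward pass needs, so it only illustrates the layer ordering and the shapes.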

In [None]:
def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters()

    Returns:
    AL -- activation value from the output (last) layer
    caches -- list of caches containing:
                every cache of activation_forward() (there are L of them, indexed from 0 to L-1)
    """

    caches = []
    A = X
    L = len(parameters) // 2   # number of layers in the neural network

    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    # The for loop starts at 1 because layer 0 is the input
    for l in range(1, L):
        A_prev = A
        A, cache = activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)], activation = "relu")
        caches.append(cache)

    AL, cache = activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], activation = "sigmoid")
    caches.append(cache)

    return AL, caches

In [None]:
t_X = np.random.randn(3,1)
t_parameters = initialize_parameters(3,2,3)
t_AL, t_caches = L_model_forward(t_X, t_parameters)
print(t_AL.shape)
print("AL = " + str(t_AL))

<a name='3'></a>
## 3 - Cost function
A cost function is a measure of error between what value your model predicts and what the value actually is.

Compute the cross-entropy cost $J$, using the following formula: $$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right))$$
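
As a worked instance of this formula, the sketch below evaluates the cost term by term for three hand-picked predictions. The clipping constant `eps` is an assumption added here to avoid `log(0)`; it is not part of the formula above.

```python
import numpy as np

Y = np.array([[1, 0, 1]])              # true labels, shape (1, m)
AL = np.array([[0.9, 0.18, 0.89]])     # predicted probabilities, shape (1, m)
m = Y.shape[1]

eps = 1e-12                            # assumption: guard against log(0)
AL_safe = np.clip(AL, eps, 1 - eps)

# Per-example loss: -log(a) when y = 1, and -log(1 - a) when y = 0
per_example = -(Y * np.log(AL_safe) + (1 - Y) * np.log(1 - AL_safe))
J = per_example.sum() / m

print(per_example)         # one loss value per example
print(round(float(J), 4))  # 0.1401
```

Confident, correct predictions (0.9 for a positive label, 0.18 for a negative one) each contribute a small loss, so the average stays low.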

In [None]:
def compute_cost(AL, Y):

    """
    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """

    m = Y.size  # number of examples; works for Y of shape (m,) or (1, m)

    cost = (-np.dot(Y,np.log(AL.T))-np.dot((1-Y),np.log(1-AL.T)))/m
    cost = np.squeeze(cost)

    return cost

In [None]:
t_Y = np.array([1 , 0 , 1])
t_AL = np.array([0.9 , 0.18 , 0.89])
t_cost = compute_cost(t_AL, t_Y)

print("Cost: " + str(t_cost))

<a name='4'></a>
## 4 - Backward Propagation Module
Backpropagation is a process involved in training a neural network. It involves taking the error rate of a forward propagation and feeding this loss backward through the neural network layers to fine-tune the weights.

Just as you did for the forward propagation, you'll implement functions for backpropagation. Remember that backpropagation is used to calculate the gradient of the loss function with respect to the parameters.

Similarly to forward propagation, you're going to build the backward propagation in three steps:
1. LINEAR backward
2. LINEAR -> **ACTIVATION** backward where ACTIVATION computes the derivative of either the ReLU or sigmoid activation
3. [LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID backward (whole model)

<a name='4-1'></a>
### 4.1 -  Linear Backward
For layer $l$, the linear part is: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (followed by an activation).

Suppose you have already calculated the derivative $dZ^{[l]} = \frac{\partial \mathcal{L} }{\partial Z^{[l]}}$. You want to get $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$.

The three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ are computed using the input $dZ^{[l]}$.

Here are the formulas you need:
$$ dW^{[l]} = \frac{\partial \mathcal{J} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T}$$
$$ db^{[l]} = \frac{\partial \mathcal{J} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} $$


$A^{[l-1] T}$ is the transpose of $A^{[l-1]}$.
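
One way to gain confidence in these formulas is a numerical gradient check. The sketch below assumes a toy per-example loss $\mathcal{L} = \frac{1}{2}\lVert z \rVert^2$ (chosen only because then $dZ = Z$) and compares the analytic $dW$ and $db$ entries against a central finite difference. All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_cur, m = 3, 2, 4
A_prev = rng.standard_normal((n_prev, m))
W = rng.standard_normal((n_cur, n_prev))
b = rng.standard_normal((n_cur, 1))

def cost(W, b):
    # Toy cost: J = (1/m) * sum over examples of 0.5 * ||z_i||^2
    Z = W @ A_prev + b
    return np.sum(Z ** 2) / (2 * m)

# Analytic gradients from the three formulas above
Z = W @ A_prev + b
dZ = Z                                        # dL/dZ for the toy loss
dW = dZ @ A_prev.T / m
db = np.sum(dZ, axis=1, keepdims=True) / m
dA_prev = W.T @ dZ

# Central finite differences for one entry of W and one entry of b
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric_dW = (cost(W_plus, b) - cost(W_minus, b)) / (2 * eps)

b_plus, b_minus = b.copy(), b.copy()
b_plus[1, 0] += eps
b_minus[1, 0] -= eps
numeric_db = (cost(W, b_plus) - cost(W, b_minus)) / (2 * eps)

print(np.isclose(numeric_dW, dW[0, 1]))  # True
print(np.isclose(numeric_db, db[1, 0]))  # True
print(dA_prev.shape == A_prev.shape)     # True
```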

In [None]:
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]


    dW = np.dot(dZ, A_prev.T)/m
    db = np.sum(dZ, axis=1, keepdims=True)/m
    dA_prev = np.dot(W.T, dZ)


    return dA_prev, dW, db

In [None]:
t_dZ = np.random.randn(5,1)
t_linear_cache = (np.random.randn(5,1),np.random.randn(5,5),np.random.randn(5,1))
# t_linear_cache = (A_prev, W, b)
t_dA_prev, t_dW, t_db = linear_backward(t_dZ, t_linear_cache)

print("dA_prev: " + str(t_dA_prev))
print("dW: " + str(t_dW))
print("db: " + str(t_db))

<a name='4-2'></a>
### 4.2 -  Linear-Activation Backward
Next, you will create a function, **`linear_activation_backward`**, that merges two helper steps: **`linear_backward`** and the backward step for the activation (**`sigmoid_backward`** or **`relu_backward`**).

- **`sigmoid_backward`**: Implements the backward propagation for SIGMOID unit. You can call it as follows:

```python
dZ = sigmoid_backward(dA, activation_cache)
```

- **`relu_backward`**: Implements the backward propagation for RELU unit. You can call it as follows:

```python
dZ = relu_backward(dA, activation_cache)
```

If $g(.)$ is the activation function,
`sigmoid_backward` and `relu_backward` compute $$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}). \tag{11}$$
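
To see what $g'(Z^{[l]})$ looks like for the two activations, here is a small sketch that uses an upstream gradient of ones so that the result equals the derivative itself. The input values are arbitrary. Taking the ReLU derivative at exactly zero to be 0 is a convention, assumed here.

```python
import numpy as np

Z = np.array([[-1.0, 0.0, 2.0]])
dA = np.ones_like(Z)               # upstream gradient of ones exposes g'(Z) directly

s = 1 / (1 + np.exp(-Z))
dZ_sigmoid = dA * s * (1 - s)      # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dZ_relu = dA * (Z > 0)             # ReLU'(z) = 1 for z > 0, else 0

print(dZ_sigmoid)  # peaks at 0.25 where z = 0
print(dZ_relu)     # [[0. 0. 1.]]
```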


In [None]:
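def sigmoid_backward(dA, Z): # Z is the activation_cache
    sig, _ = sigmoid(Z)  # only the activation value is needed here
    return dA * sig * (1 - sig)

def relu_backward(dA, Z):
    dZ = np.array(dA, copy = True)
    dZ[Z <= 0] = 0  # the gradient is zero wherever the ReLU input was non-positive
    return dZ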

In [None]:
def linear_activation_backward(dA, cache, activation):
    """

    Arguments:
    dA -- post-activation gradient for current layer l
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache

    if activation == "relu":

        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)

    elif activation == "sigmoid":

        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)

    return dA_prev, dW, db

In [None]:
t_dAL = np.random.randn(5,1)
t_linear_cache = (np.random.randn(5,1),np.random.randn(5,5),np.random.randn(5,1))
t_activation_cache =(np.random.randn(5,1))

t_dA_prev, t_dW, t_db = linear_activation_backward(t_dAL, (t_linear_cache,t_activation_cache), activation = "sigmoid")
print("With sigmoid: dA_prev = " + str(t_dA_prev))
print("With sigmoid: dW = " + str(t_dW))
print("With sigmoid: db = " + str(t_db))

t_dA_prev, t_dW, t_db = linear_activation_backward(t_dAL, (t_linear_cache,t_activation_cache), activation = "relu")
print("With relu: dA_prev = " + str(t_dA_prev))
print("With relu: dW = " + str(t_dW))
print("With relu: db = " + str(t_db))

<a name='4-3'></a>
### 4.3 -  L-layer Backward
Recall that when you implemented the `L_model_forward` function, at each iteration you stored a cache containing (A_prev, W, b, and Z). In the backpropagation module, you'll use those variables to compute the gradients. Therefore, in the `L_model_backward` function, you'll iterate through all the layers backward, starting from layer $L$. At each step, you will use the cached values for layer $l$ to backpropagate through layer $l$.

**Initializing backpropagation**:

To backpropagate through this network, you know that the output is:
$A^{[L]} = \sigma(Z^{[L]})$. Your code thus needs to compute `dAL` $= \frac{\partial \mathcal{L}}{\partial A^{[L]}}$.
To do so, use this formula (derived using calculus, which you don't need in-depth knowledge of):
```python
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # derivative of cost with respect to AL
```
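
For readers who want the intermediate step, the expression follows from differentiating the per-example cross-entropy loss. In the short derivation below, $a = a^{[L](i)}$ and $y = y^{(i)}$ are shorthand introduced only for this note:

$$\mathcal{L}(a, y) = -\left(y\log a + (1-y)\log(1-a)\right)$$

$$\frac{\partial \mathcal{L}}{\partial a} = -\left(\frac{y}{a} - \frac{1-y}{1-a}\right)$$

Combining this with the sigmoid derivative $\sigma'(z) = a(1-a)$ gives a compact result for the output layer:

$$dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial a}\, a(1-a) = -\left(y(1-a) - (1-y)a\right) = a - y$$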

After that, you will have to use a `for` loop to iterate through all the other layers using the LINEAR->RELU backward function. You should store each dA, dW, and db in the grads dictionary. To do so, use this formula :

$$grads["dW" + str(l)] = dW^{[l]}\tag{15} $$

For example, for $l=3$ this would store $dW^{[l]}$ in `grads["dW3"]`.

In [None]:
def L_model_backward(AL, Y, caches):
    """
    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector
    caches -- list of caches containing:
                every cache of activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of activation_forward() with "sigmoid" (it's caches[L-1])

    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ...
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL

    # Initializing the backpropagation
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "dAL, current_cache". Outputs: "grads["dAL-1"], grads["dWL"], grads["dbL"]
    current_cache = caches[L-1]
    dA_prev_temp, dW_temp, db_temp = linear_activation_backward(dAL, current_cache, activation = "sigmoid")
    grads["dA" + str(L-1)] = dA_prev_temp
    grads["dW" + str(L)] = dW_temp
    grads["db" + str(L)] = db_temp

    # Loop from l=L-2 to l=0
    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 1)], current_cache". Outputs: "grads["dA" + str(l)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)]
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(dA_prev_temp, current_cache, activation = "relu")
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l+1)] = dW_temp
        grads["db" + str(l+1)] = db_temp

    return grads

In [None]:
t_X = np.random.randn(3,1)
t_Y_assess = np.random.randn(3,1)
t_parameters = initialize_parameters(3,2,3)
t_AL, t_caches = L_model_forward(t_X, t_parameters)

grads = L_model_backward(t_AL, t_Y_assess, t_caches)

print("dA0 = " + str(grads['dA0']))
print("dA1 = " + str(grads['dA1']))
print("dW1 = " + str(grads['dW1']))
print("dW2 = " + str(grads['dW2']))
print("db1 = " + str(grads['db1']))
print("db2 = " + str(grads['db2']))


<a name='4-4'></a>
### 4.4 - Update Parameters

In [None]:
def update_parameters(params, grads, learning_rate):
    """
    Update parameters using gradient descent

    Arguments:
    params -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients, output of L_model_backward

    Returns:
    parameters -- python dictionary containing your updated parameters
                  parameters["W" + str(l)] = ...
                  parameters["b" + str(l)] = ...
    """
    parameters = params.copy()
    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.

    for l in range(L):
        # parameters["W" + str(l+1)] = ...
        # parameters["b" + str(l+1)] = ...
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
    return parameters

In [None]:
t_X = np.random.randn(3,1)
t_Y_assess = np.random.randn(3,1)
t_parameters = initialize_parameters(3,2,3) # initialize before the forward pass that uses these parameters
t_AL, t_caches = L_model_forward(t_X, t_parameters)

grads = L_model_backward(t_AL, t_Y_assess, t_caches)

t_parameters = update_parameters(t_parameters, grads, 0.1)
print ("W1 = "+ str(t_parameters["W1"]))
print ("b1 = "+ str(t_parameters["b1"]))
print ("W2 = "+ str(t_parameters["W2"]))
print ("b2 = "+ str(t_parameters["b2"]))

<a name='5'></a>
## 5 - Build up your own network

### Now that you have implemented all the functions required for building a deep neural network, it's time to use these blocks to build your own neural network!

In [None]:
# putting these things together
def train(X, Y, nn_hidden, epochs, learning_rate):
    params_values = initialize_parameters(X.shape[0] , nn_hidden, 1)
    cost_history = []
    accuracy_history = []

    for i in range(epochs):
        # forward pass and cost
        Y_hat, caches = L_model_forward(X, params_values)
        cost = compute_cost(Y_hat.flatten(), Y)
        cost_history.append(cost)

        # one simplified example to display the accuracy score of classification:
        # threshold the predicted probabilities at 0.5 to obtain class labels
        Y_hat_bin = np.where(Y_hat>0.5,1,0)
        accuracy = precision_score(Y , Y_hat_bin.flatten(),average='micro')
        accuracy_history.append(accuracy)

        # backward pass and gradient descent step
        grads_values = L_model_backward(Y_hat, Y, caches)
        params_values = update_parameters(params_values, grads_values, learning_rate)
        if (i+1) %100==0:
            print(f"Epoch #{i+1}: train loss: {cost}; precision score: {accuracy}")

    return params_values, cost_history, accuracy_history

## Apply our neural network to sample data

### Data generation

In [None]:
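# Assuming each row below is one example: transpose so that X has shape
# (features, examples), which is the layout L_model_forward expects
X = np.array([[21.04,5,0.5,90], [14.16,3,1,80], [8.52,2,0.5,70], [7.52,2.3,1,80]]).T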
y = np.array([0, 0, 1 , 1])

In [None]:
params_values, cost_history, accuracy_history = train(X, y, 5, 5000 , 0.001)

### You can calculate the training accuracy by using the methods below:
[Classification metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)
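
As a minimal sketch of one such metric, the example below computes plain accuracy with NumPy by thresholding predicted probabilities at 0.5. The probabilities in `y_prob` are made up for illustration and are not the output of the network trained above. scikit-learn's `accuracy_score(y_true, y_pred)` returns the same fraction of correct predictions.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([[0.2, 0.6, 0.7, 0.9]])   # made-up probabilities, shape (1, m)

# Threshold at 0.5 to turn probabilities into class labels
y_pred = (y_prob.flatten() > 0.5).astype(int)

# Accuracy is the fraction of predictions that match the true labels
accuracy = np.mean(y_pred == y_true)
print(y_pred)    # [0 1 1 1]
print(accuracy)  # 0.75
```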