## 4.1 Pipeline of an L-layer neural network
- **Parameter initialization** for all $L$ layers.
- **Forward propagation**. [LINEAR->RELU] for the first $L-1$ layers and [LINEAR->SIGMOID] at layer $L$. **Cache the intermediate values** ($A^{[l-1]}$, $W^{[l]}$, $b^{[l]}$, $Z^{[l]}$) for the backward pass.
- **Compute the loss**.
- **Backward propagation**. [LINEAR->SIGMOID] backward at layer $L$, then [LINEAR->RELU] backward for the remaining $L-1$ layers. **Use the cached values**.
- **Update the parameters**.

### 4.1.1 Parameter initialization for all $L$-layers

In [None]:
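import numpy as np

def initialize_parameters_deep(layer_dims):
    """
    layer_dims: python array (list) containing the dimensions of each layer,
                starting with the input layer.
    Returns: dict mapping 'W1', 'b1', 'W2', 'b2', ... to each layer's weights and biases.
    """
    
    parameters = {}
    L = len(layer_dims) # number of entries in layer_dims (input layer included)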

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        
    return parameters
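
As a quick sanity check, the sketch below re-implements the initializer compactly and prints the resulting shapes. The network `layer_dims = [5, 4, 3, 1]`, the helper name `init_params`, and the fixed seed are illustrative assumptions, not part of the original code.

```python
import numpy as np

def init_params(layer_dims, seed=0):
    """Compact stand-in for initialize_parameters_deep (name and seed are assumptions)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # W[l] has one row per unit in layer l and one column per unit in layer l-1.
        params["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

layer_dims = [5, 4, 3, 1]  # 5 input features, two hidden layers, 1 output unit
params = init_params(layer_dims)
shapes = {name: value.shape for name, value in params.items()}
print(shapes)
```

Note that `len(params) // 2` gives 3 here, the number of layers with parameters, which is how `L` is recovered later in the forward pass.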

### 4.1.2 Forward propagation with cached values

The vectorized linear forward computes the following equations:

$$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}$$
$$A^{[l]} = g(Z^{[l]}) = g(W^{[l]} A^{[l-1]} + b^{[l]})$$

where $A^{[0]} = X$. 

By convention, each **[LINEAR->ACTIVATION]** pair is counted as **a single layer** of the neural network.
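
To make the two equations concrete, here is a small worked example of one LINEAR->RELU layer. All numbers are made up for illustration: 2 input features, 3 samples, and 2 units in the layer.

```python
import numpy as np

# Hypothetical toy values: columns of A_prev are samples, rows are features.
A_prev = np.array([[1.0, -2.0, 0.5],
                   [3.0,  1.0, -1.0]])
W = np.array([[1.0, -1.0],
              [0.5,  0.5]])
b = np.array([[0.0],
              [1.0]])

Z = W @ A_prev + b      # linear step: Z = W A_prev + b (b broadcasts over the columns)
A = np.maximum(0, Z)    # activation step with g = ReLU

print(Z)  # [[-2.  -3.   1.5 ], [ 3.   0.5  0.75]]
print(A)  # [[ 0.   0.   1.5 ], [ 3.   0.5  0.75]]
```

The negative entries of `Z` in the first row are zeroed by ReLU, while the second row passes through unchanged.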

In [None]:
def sigmoid(Z):
    """
    sigmoid activation.
    """
    
    A = 1/(1+np.exp(-Z))
    cache = Z
    
    return A, cache

def relu(Z):
    """
    ReLU activation.
    """
    
    A = np.maximum(0,Z)
    cache = Z
    
    return A, cache

def linear_forward(A, W, b):
    """
    Linear part of a layer's forward propagation.
    """
    Z = np.dot(W, A) + b
    cache = (A, W, b)    
    return Z, cache

def linear_activation_forward(A_prev, W, b, activation):
    """
    Forward propagation for the LINEAR->ACTIVATION layer.
    """
    
    # The linear step is the same for both activations.
    Z, linear_cache = linear_forward(A_prev, W, b)

    if activation == "sigmoid":
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        A, activation_cache = relu(Z)
    else:
        raise ValueError("activation must be 'sigmoid' or 'relu'")
    
    cache = (linear_cache, activation_cache)

    return A, cache

def L_model_forward(X, parameters):
    """
    Forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID
    """

    caches = []
    A = X
    L = len(parameters) // 2 # number of layers in the neural network
    
    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)], "relu")
        caches.append(cache)
    
    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], "sigmoid")
    caches.append(cache)
    
    return AL, caches

### 4.1.3 Compute the loss
The cross-entropy cost $J$ is:

$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left(y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)\right)$$

In [None]:
def compute_cost(AL, Y):
    """
    Compute the cross-entropy cost.
    """
    
    m = Y.shape[1]

    # Compute loss from aL and y.
    cost = -1.0 / m * np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1-Y, np.log(1-AL)))        
    cost = np.squeeze(cost)
    
    return cost
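
One practical caveat: if any entry of `AL` is exactly 0 or 1, `np.log` returns `-inf` and the cost becomes `nan` or `inf`. A common workaround is to clip the predictions away from 0 and 1 first. The sketch below shows one way to do this; the function name `compute_cost_clipped` and the value of `eps` are assumptions, not part of the original code.

```python
import numpy as np

def compute_cost_clipped(AL, Y, eps=1e-12):
    """Cross-entropy cost with AL clipped to [eps, 1 - eps] (eps is an assumed value)."""
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1 - eps)  # keeps both logarithms finite
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(np.squeeze(cost))

AL = np.array([[1.0, 0.0, 0.8]])  # saturated predictions in the first two columns
Y = np.array([[1, 0, 1]])
cost = compute_cost_clipped(AL, Y)
print(cost)
```

The two saturated but correct predictions contribute almost nothing, so the result is close to $-\log(0.8)/3$.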

### 4.1.4 Backward propagation

**Goal**: input $ dA^{[l]} $, output $ dA^{[l-1]}$.


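**Step 1**: compute $dA^{[L]}$, the derivative of the per-example loss with respect to $A^{[L]}$ (`AL` in the code). The $\frac{1}{m}$ averaging is applied later, in $dW$ and $db$.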
```
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
```


**Step 2**: input $ dA^{[l]} $, output $dZ^{[l]}$.

If $g(.)$ is the activation function, 
`sigmoid_backward` and `relu_backward` compute $$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]})$$
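
For the output layer, Steps 1 and 2 collapse into a simpler expression: with a sigmoid activation and the cross-entropy loss, $dZ^{[L]} = A^{[L]} - Y$. The sketch below checks this numerically; the values of `Z` and `Y` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((1, 5))   # hypothetical pre-activations of the output layer
Y = np.array([[1, 0, 1, 1, 0]])   # hypothetical labels

AL = 1 / (1 + np.exp(-Z))                               # sigmoid output
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))    # Step 1
dZ = dAL * AL * (1 - AL)                                # Step 2, with g'(Z) = s * (1 - s)

# The two-step result should match the closed form up to floating-point error.
print(np.allclose(dZ, AL - Y))
```

Using `AL - Y` directly also avoids the division by `AL` and `1 - AL`, which is fragile when the predictions saturate.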


**Step 3**: input $dZ^{[l]}$, output $ dW^{[l]}$, $db^{[l]}$ and $ dA^{[l-1]}$.

For layer $l$, the linear part is: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (followed by an activation).

Suppose that the derivative $dZ^{[l]} = \frac{\partial \mathcal{L} }{\partial Z^{[l]}}$ has been computed. Then:
$$ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} $$
$$ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]}$$
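
These three formulas can be checked against finite differences. The sketch below uses a toy cost $J = \frac{1}{m}\sum_i \frac{1}{2}\lVert z^{(i)} \rVert^2$, chosen only because it makes $dZ = Z$; the layer sizes, the seed, and the helper `numeric_grad` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, n, m = 3, 2, 4                  # hypothetical layer sizes and batch size
A_prev = rng.standard_normal((n_prev, m))
W = rng.standard_normal((n, n_prev))
b = rng.standard_normal((n, 1))

def cost(W, b, A_prev):
    """Toy cost J = (1/m) * sum_i 0.5 * ||z_i||^2, chosen so that dZ = Z."""
    Z = W @ A_prev + b
    return 0.5 * np.sum(Z ** 2) / A_prev.shape[1]

# Analytic gradients from the three formulas above.
dZ = W @ A_prev + b
dW = dZ @ A_prev.T / m
db = np.sum(dZ, axis=1, keepdims=True) / m
dA_prev = W.T @ dZ

def numeric_grad(f, x, h=1e-6):
    """Central-difference gradient of the scalar f() with respect to array x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f()
        x[idx] = old - h
        f_minus = f()
        x[idx] = old                    # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

dW_num = numeric_grad(lambda: cost(W, b, A_prev), W)
db_num = numeric_grad(lambda: cost(W, b, A_prev), b)
# dA_prev holds per-sample derivatives, so it equals m times the gradient of the averaged cost.
dA_num = m * numeric_grad(lambda: cost(W, b, A_prev), A_prev)

print(np.allclose(dW, dW_num, atol=1e-6),
      np.allclose(db, db_num, atol=1e-6),
      np.allclose(dA_prev, dA_num, atol=1e-6))
```

The factor of `m` in the last check reflects the convention used here: $dZ$ and $dA$ are per-example derivatives, and the averaging over the batch enters only through $dW$ and $db$.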


In [None]:
def sigmoid_backward(dA, cache):
    """
    Backward propagation for sigmoid.
    cache: 'Z'
    """
    
    Z = cache
    
    s = 1/(1+np.exp(-Z))
    dZ = dA * s * (1-s)
    
    return dZ

def relu_backward(dA, cache):
    """
    Backward propagation for ReLU.
    cache: 'Z'
    """
    
    Z = cache
    dZ = np.array(dA, copy=True) # copy dA so the caller's array is not modified in place
    
    # ReLU has zero slope where Z <= 0, so no gradient flows through those entries.
    dZ[Z <= 0] = 0
    
    return dZ

def linear_backward(dZ, cache):
    """
    Linear portion of backward propagation for a single layer (layer l).
    """
    
    A_prev, W, b = cache
    m = A_prev.shape[1]
    
    dW = 1.0 / m * np.dot(dZ, A_prev.T)
    db = 1.0 / m * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    
    return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
    """
    Backward propagation for the LINEAR->ACTIVATION layer.
    """
    
    linear_cache, activation_cache = cache
    
    # Undo the activation first, then the linear step.
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
    else:
        raise ValueError("activation must be 'sigmoid' or 'relu'")
    
    dA_prev, dW, db = linear_backward(dZ, linear_cache)
       
    return dA_prev, dW, db

def L_model_backward(AL, Y, caches):
    """
    Backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group.
    """
    
    grads = {}
    L = len(caches) # the number of layers
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL

    # Initializing the backpropagation
    dAL = -(np.divide(Y, AL) - np.divide(1-Y, 1-AL))
    
    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: dAL, caches[L-1]. Outputs: grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)].
    # Note: grads["dA" + str(L)] holds dA_prev of layer L, i.e. the gradient with respect to A^[L-1].
    current_cache = caches[L-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, 'sigmoid')
    
    for l in reversed(range(L - 1)):
        # Layer l+1: (RELU -> LINEAR) gradients.
        # Inputs: grads["dA" + str(l + 2)] (gradient with respect to this layer's output) and caches[l].
        # Outputs: grads["dA" + str(l + 1)], grads["dW" + str(l + 1)], grads["db" + str(l + 1)].
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 2)], current_cache, 'relu')
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp        

    return grads

### 4.1.5 Update the parameters

Gradient descent for updating parameters: 
$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]}$$

In [None]:
def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent.
    """
    
    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
      
    return parameters
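
To show how the five steps of section 4.1 fit together, here is a compact end-to-end sketch for a two-layer network (LINEAR->RELU->LINEAR->SIGMOID). It inlines the math rather than calling the functions above so that it runs on its own. The toy dataset, the layer sizes `[2, 4, 1]`, the learning rate, and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 features, 200 samples, label 1 when x1 + x2 > 0.
X = rng.standard_normal((2, 200))
Y = (X[0:1] + X[1:2] > 0).astype(float)
m = X.shape[1]

# Step 1: initialization for layer sizes [2, 4, 1] (small random weights, zero biases).
W1, b1 = rng.standard_normal((4, 2)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))

learning_rate = 0.1  # assumed value
costs = []
for i in range(1000):
    # Step 2: forward propagation, LINEAR->RELU then LINEAR->SIGMOID.
    Z1 = W1 @ X + b1
    A1 = np.maximum(0, Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1 / (1 + np.exp(-Z2))

    # Step 3: cross-entropy cost.
    costs.append(float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))))

    # Step 4: backward propagation, reusing the cached Z1, A1 and A2.
    dZ2 = A2 - Y                      # sigmoid and cross-entropy combined
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dA1 = W2.T @ dZ2
    dZ1 = dA1 * (Z1 > 0)              # ReLU derivative
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Step 5: gradient descent update.
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print(costs[0], costs[-1])
```

The first cost is close to $\log 2 \approx 0.693$ because the small initial weights make every prediction close to 0.5. The cost then falls as the weights grow.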

## 4.2 Bias vector `b` is broadcast

**Each row of $W$ represents a neuron**, and each neuron has its own bias, one entry of $b$.

When computing $WX + b$, the column vector $b$ is **broadcast across all $m$ samples** (the columns of $X$). For example, if:

$$ W = \begin{bmatrix}
    j  & k  & l\\
    m  & n & o \\
    p  & q & r 
\end{bmatrix}\;\;\; X = \begin{bmatrix}
    a  & b  & c\\
    d  & e & f \\
    g  & h & i 
\end{bmatrix} \;\;\; b =\begin{bmatrix}
    s  \\
    t  \\
    u
\end{bmatrix}\tag{2}$$

Then $WX + b$ will be:

$$ WX + b = \begin{bmatrix}
    (ja + kd + lg) + s  & (jb + ke + lh) + s  & (jc + kf + li)+ s\\
    (ma + nd + og) + t & (mb + ne + oh) + t & (mc + nf + oi) + t\\
    (pa + qd + rg) + u & (pb + qe + rh) + u & (pc + qf + ri)+ u
\end{bmatrix}\tag{3}  $$
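
The same behaviour can be seen directly in NumPy. In the sketch below the numeric values are made up; `X` is the identity so that `W @ X` equals `W` and the added bias is easy to spot.

```python
import numpy as np

W = np.arange(1, 10).reshape(3, 3)   # hypothetical 3x3 weight matrix [[1,2,3],[4,5,6],[7,8,9]]
X = np.eye(3, dtype=int)             # identity, so W @ X equals W
b = np.array([[10], [20], [30]])     # one bias per neuron, shape (3, 1)

out = W @ X + b                      # b is broadcast across the 3 columns
print(out)
# [[11 12 13]
#  [24 25 26]
#  [37 38 39]]

# Broadcasting gives the same result as explicitly repeating b once per sample.
explicit = W @ X + np.tile(b, (1, 3))
print(np.array_equal(out, explicit))
```

Each row receives its own bias, and that bias is the same in every column, matching equation (3).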

## 4.3 ReLU and its derivative

In [None]:
import numpy as np

def relu(z):
    """
    Implement the RELU function.
    """
    
    a = np.maximum(0, z)
    return a


def relu_backward(dA, cache):
    """
    Backward propagation for ReLU.
    cache: 'Z'
    """
    
    Z = cache
    dZ = np.array(dA, copy=True) # copy dA so the caller's array is not modified in place
    
    # ReLU has zero slope where Z <= 0, so no gradient flows through those entries.
    dZ[Z <= 0] = 0
    
    return dZ
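
A short numeric example, using made-up values, shows both directions at once. It writes the backward step as a mask multiplication, which is equivalent to the copy-and-zero approach in `relu_backward`.

```python
import numpy as np

Z = np.array([[-1.5, 0.0, 2.0]])    # hypothetical pre-activations
dA = np.array([[0.3, 0.7, -0.4]])   # hypothetical upstream gradient

A = np.maximum(0, Z)                # ReLU forward
dZ = dA * (Z > 0)                   # ReLU backward: gradient passes only where Z > 0

print(A)   # [[0. 0. 2.]]
print(dZ)  # [[ 0.   0.  -0.4]]
```

At exactly $Z = 0$ the derivative of ReLU is undefined; the code here follows the convention above and treats it as 0.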