# Build an Optimization Algorithm

## Table of Contents

* [1. Build a neural network model with Minibatch](#chapter1)
    * [1.1 Initialize parameters](#section_1_1)
    * [1.2 Forward propagation](#section_1_2)
    * [1.3 Compute cost](#section_1_3)
    * [1.4 Backward propagation](#section_1_4)
    * [1.5 Minibatch](#section_1_5)
* [2. Optimization Algorithms](#chapter2)
    * [2.1 Without optimization algorithm](#section_2_1)
    * [2.2 Momentum](#section_2_2)
    * [2.3 Adam](#section_2_3)
* [3. Model](#chapter3)
    * [3.1 Model built](#section_3_1)
    * [3.2 Test on dataset](#section_3_2)

In [1]:
# Packages
import copy
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss


%matplotlib inline

# 1. Build a neural network model with Minibatch <a class="anchor" id="chapter1"></a>

## 1.1 Initialize parameters <a class="anchor" id="section_1_1"></a>

In [10]:
def initialize_parameters(layers_dims,n_input,n_output):
    """
    Initialize the parameters of the neural network.

    Weights use He initialization (standard normal scaled by sqrt(2 / layer_prev));
    biases are initialized to zeros.

    - Arguments:
    layers_dims: list containing the dimensions of the hidden layers
    n_input: number of features in the input layer
    n_output: number of nodes in the output layer

    - Return:
    parameters: dictionary containing the parameters "W1", "b1", ..., "WL", "bL"
    """
    np.random.seed(1)

    # init
    parameters = {}

    # append the output layer on a copy so the caller's list is not mutated
    layers_dims = list(layers_dims) + [n_output]

    # number of layers
    L = len(layers_dims)

    for i in range(L):

        # the first layer takes the n_input features as input
        if i == 0:
            layer_prev = n_input
        else:
            layer_prev = layers_dims[i-1]

        # He initialization for the weights, zeros for the biases
        parameters["W" + str(i+1)] = np.random.randn(layers_dims[i], layer_prev) * np.sqrt(2/layer_prev)
        parameters["b" + str(i+1)] = np.zeros((layers_dims[i], 1))

    return parameters

## 1.2 Forward propagation <a class="anchor" id="section_1_2"></a>

In [4]:
def activation_function(Z,activation_name):

    if activation_name.lower() == "sigmoid":

        A = 1/(1+np.exp(-Z))
    
    elif activation_name.lower() == "relu":

        A = np.maximum(0,Z)

    elif activation_name.lower() == "tanh":

        A = np.tanh(Z)
    
    else:
        # By default relu
        A = np.maximum(0,Z)

    return A

In [5]:
def forward_propagation(X,parameters,activation_name="relu"):

    """
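    Compute the forward propagation through all the layers

    Arguments:
    X -- input data of shape (n_input, number of examples)
    parameters -- dictionary containing the parameters "W1", "b1", ..., "WL", "bL"
    activation_name -- name of the activation function used in the hidden layers

    Returns:
    A -- activation of the last layer (sigmoid output)
    caches -- list of per-layer caches with keys "W", "b", "A", "Z" and "A_prev"
    """
    # init caches
    caches = []

    # number of layers
    L = len(parameters)//2

    # setting A_prev to X
    A_prev = X

    for i in range(1,L+1):

        # getting parameters
        W = parameters["W" + str(i)]
        b = parameters["b" + str(i)]

        # linear result
        Z = np.dot(W,A_prev) + b

        if i==L:
            # last layer - sigmoid
            A = activation_function(Z,"sigmoid")
        else:
            # hidden layers - activation chosen by the caller (relu by default)
            A = activation_function(Z,activation_name)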

        # adding to the cache
        cache = {"W" : W, "b":b,"A":A,"Z":Z,"A_prev": A_prev}

        # adding layer cache
        caches.append(cache)

        # setting A_prev
        A_prev = A

    return A, caches

## 1.3 Compute cost <a class="anchor" id="section_1_3"></a>

In [6]:
def cost_function(AL,y):

    """
    Compute the log loss L(y_pred,y)

    -- Arguments:
    y : true labels of the dataset
    AL : result of the forward propagation 

    -- Returns:
    cost : Log loss cost

    """
    # m examples
    m = y.shape[1]

    epsilon = 1e-15

    cost = (np.multiply(y,np.log(AL+epsilon)) + np.multiply(1-y,np.log(1-AL+epsilon)))
    cost = -(1/m) *np.nansum(cost)
    
    return cost

## 1.4 Backward propagation  <a class="anchor" id="section_1_4"></a>

In [7]:
def backward_activation(dA,Z,function_name="relu"):

    """
    Compute dZ for the backward propagation

    -- Arguments:
    dA : derivative of A
    Z : linear activation
    function_name: name of the activation_function

    -- Returns:
    dZ: derivative of Z
    """

    if function_name.lower() == "sigmoid":
        # sigmoid
        s = 1/(1+np.exp(-Z))

        # derivative sigmoid
        dG = s*(1-s)

        # dZ
        dZ = dA * dG

    elif function_name.lower() == "relu":

        # relu
        r = np.maximum(0,Z)

        # derivative relu
        dG = np.int64(r>0)

        # dZ
        dZ = np.multiply(dA,dG)

    elif function_name.lower() == "tanh":

        # tanh
        th = np.tanh(Z)

        # derivative tanh
        dG = 1-np.power(th,2)

        # dZ
        dZ = dA * dG

    else:
        # by default relu
        r = np.maximum(0,Z)

        # derivative relu
        dG = np.int64(r>0)

        # dZ
        dZ = np.multiply(dA,dG)

    return dZ

In [8]:
def backward_propagation(AL,y,caches,function_name="relu"):

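    """
    Compute the gradients of the cost with respect to the parameters

    -- Arguments:
    AL : result of the forward propagation (sigmoid output)
    y : true labels of the dataset
    caches : list of per-layer caches returned by forward_propagation
    function_name : name of the activation function used in the hidden layers

    -- Returns:
    gradients : dictionary containing "dW1", "db1", ..., "dWL", "dbL"
    """
    # gradients
    gradients = {}

    # number of layers
    L = len(caches)

    # number of examples
    m = y.shape[1]

    # keep AL strictly inside (0, 1) to avoid a division by zero
    epsilon = 1e-15
    AL = np.clip(AL, epsilon, 1 - epsilon)

    # dAL
    dAL = - (np.divide(y, AL) - np.divide(1 - y, 1 - AL))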
    
    # getting cache layer L
    current_cache = caches[-1]
    WL = current_cache['W']
    ZL = current_cache['Z']
    A_prev = current_cache['A_prev']

    dZL = backward_activation(dAL,ZL,"sigmoid")

    dW_temp = (1/m)*np.dot(dZL,A_prev.T)
    db_temp = (1/m) * np.sum(dZL,axis=1, keepdims=True)
    dA_prev_temp = np.dot(WL.T,dZL)
    
    # compute the gradient
    gradients["dW" + str(L)] = dW_temp
    gradients["db" + str(L)] = db_temp


    for i in reversed(range(L-1)):

        # getting the cache of layer i+1
        current_cache = caches[i]
        W = current_cache['W']
        Z = current_cache['Z']
        A_prev = current_cache['A_prev']

        dZ = backward_activation(dA_prev_temp,Z,function_name)

        dW_temp = (1/m)*np.dot(dZ,A_prev.T)
        db_temp = (1/m) * np.sum(dZ,axis=1, keepdims=True)
        dA_prev_temp = np.dot(W.T,dZ)
        
        # compute the gradient
        gradients["dW" + str(i+1)] = dW_temp
        gradients["db" + str(i+1)] = db_temp


    return gradients


## 1.5 Minibatch <a class="anchor" id="section_1_5"></a>

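- Shuffle the training set and split it into random mini-batches. The parameters are then updated once per mini-batch rather than once per pass over the whole dataset, which usually accelerates convergence.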
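
The mini-batch split can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the helper name `random_mini_batches`, its `mini_batch_size` and `seed` arguments, and the demo arrays are all assumptions. It assumes, as in the rest of the notebook, that examples are stored column-wise, so `X` has shape `(n_input, m)` and `y` has shape `(1, m)`.

```python
import numpy as np


def random_mini_batches(X, y, mini_batch_size=64, seed=0):
    """Hypothetical helper: shuffle the examples and split them into mini-batches.

    X has shape (n_input, m) and y has shape (1, m); examples are columns.
    Returns a list of (mini_batch_X, mini_batch_y) tuples.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]

    # apply the same random permutation to X and y so the pairs stay aligned
    permutation = rng.permutation(m)
    X_shuffled = X[:, permutation]
    y_shuffled = y[:, permutation]

    mini_batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size
        # the last slice is simply shorter when m is not a multiple of the batch size
        mini_batches.append((X_shuffled[:, start:end], y_shuffled[:, start:end]))

    return mini_batches


# demo data (made up): 10 examples with 2 features each
X_demo = np.arange(20, dtype=float).reshape(2, 10)
y_demo = (np.arange(10) % 2).reshape(1, 10)

batches = random_mini_batches(X_demo, y_demo, mini_batch_size=4, seed=1)
# three batches: two of 4 examples and a final one of 2
print([batch_X.shape for batch_X, _ in batches])
```

Reshuffling with a different seed at every epoch is a common choice, so that the mini-batches differ from one pass to the next.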

# 2. Optimization Algorithms <a class="anchor" id="chapter2"></a>

## 2.1 Without optimization algorithm  <a class="anchor" id="section_2_1"></a>

In [9]:
def update_parameters(parameters,gradients,learning_rate):

    # copy 
    params = copy.deepcopy(parameters)
    
    # nb layer
    L = len(parameters)//2

    for i in range(L):
        params["W" + str(i+1)] = params["W" + str(i+1)] - learning_rate * gradients["dW" + str(i+1)]
        params["b" + str(i+1)] = params["b" + str(i+1)] - learning_rate * gradients["db" + str(i+1)]

    return params

## 2.2 Momentum  <a class="anchor" id="section_2_2"></a>

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. Using momentum can reduce these oscillations. 

Momentum takes into account the past gradients to smooth out the update. The 'direction' of the previous gradients is stored in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of $v$ as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill. 


$$ \begin{cases}
v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\
W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}}
\end{cases}\tag{3}$$

$$\begin{cases}
v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]} \\
b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}} 
\end{cases}\tag{4}$$

where $l = 1, ..., L$ indexes the layers, $L$ is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate. The weights and biases are stored in the `parameters` dictionary.
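
The momentum update above could be written as in the sketch below. The function names `initialize_velocity` and `update_parameters_with_momentum`, the zero initialization of the velocities, and the default values `beta=0.9` and `learning_rate=0.01` are assumptions made for illustration. The dictionaries follow the `"W1"`/`"b1"` and `"dW1"`/`"db1"` key conventions used in this notebook.

```python
import numpy as np


def initialize_velocity(parameters):
    """Hypothetical helper: one zero-filled velocity per parameter."""
    v = {}
    L = len(parameters) // 2
    for l in range(1, L + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
    return v


def update_parameters_with_momentum(parameters, gradients, v, beta=0.9, learning_rate=0.01):
    """Hypothetical helper: apply one momentum step to every layer."""
    params = {key: value.copy() for key, value in parameters.items()}
    v_new = {}
    L = len(parameters) // 2

    for l in range(1, L + 1):
        for name in ("W", "b"):
            key = "d" + name + str(l)
            # exponentially weighted average of the past gradients
            v_new[key] = beta * v[key] + (1 - beta) * gradients[key]
            # step in the direction of the smoothed gradient
            params[name + str(l)] = params[name + str(l)] - learning_rate * v_new[key]

    return params, v_new


# demo values (made up): a single layer with constant gradients
parameters_demo = {"W1": np.ones((2, 3)), "b1": np.zeros((2, 1))}
gradients_demo = {"dW1": np.full((2, 3), 0.5), "db1": np.ones((2, 1))}

v_demo = initialize_velocity(parameters_demo)
parameters_demo, v_demo = update_parameters_with_momentum(
    parameters_demo, gradients_demo, v_demo, beta=0.9, learning_rate=0.1
)
print(parameters_demo["W1"][0, 0], v_demo["dW1"][0, 0])
```

With $\beta = 0$ the velocity equals the current gradient and the step reduces to plain gradient descent. Larger values of $\beta$ average over more past gradients and smooth the update further.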

## 2.3 Adam  <a class="anchor" id="section_2_3"></a>

***Recap :***

Adam is a widely used optimization algorithm for training neural networks. It combines ideas from RMSProp and Momentum.

**How does Adam work?**
1. It calculates an exponentially weighted average of past gradients, and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction). 
2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction). 
3. It updates parameters in a direction based on combining information from "1" and "2".

The update rule is, for $l = 1, ..., L$: 

$$\begin{cases}
v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\
v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\
s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2 \\
s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\
W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}
\end{cases}$$
where:
- $t$ counts the number of Adam steps taken so far
- $L$ is the number of layers
- $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages
- $\alpha$ is the learning rate
- $\varepsilon$ is a very small number to avoid dividing by zero

The biases $b^{[l]}$ are updated in the same way, using the gradient with respect to $b^{[l]}$.
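
The Adam update rule can be sketched as below. The function names `initialize_adam` and `update_parameters_with_adam`, the zero initialization of `v` and `s`, and the default hyperparameters (`beta1=0.9`, `beta2=0.999`, `epsilon=1e-8`, `learning_rate=0.001`) are assumptions for illustration; they are commonly used values, not settings taken from this notebook. The step counter `t` is assumed to start at 1.

```python
import numpy as np


def initialize_adam(parameters):
    """Hypothetical helper: zero-filled first and second moment estimates."""
    v = {"d" + key: np.zeros_like(value) for key, value in parameters.items()}
    s = {"d" + key: np.zeros_like(value) for key, value in parameters.items()}
    return v, s


def update_parameters_with_adam(parameters, gradients, v, s, t,
                                learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Hypothetical helper: apply one Adam step (t starts at 1)."""
    params, v_new, s_new = {}, {}, {}

    for key, value in parameters.items():
        grad_key = "d" + key
        g = gradients[grad_key]

        # exponentially weighted averages of the gradients and of their squares
        v_new[grad_key] = beta1 * v[grad_key] + (1 - beta1) * g
        s_new[grad_key] = beta2 * s[grad_key] + (1 - beta2) * np.square(g)

        # bias correction, which matters most during the first steps
        v_corrected = v_new[grad_key] / (1 - beta1 ** t)
        s_corrected = s_new[grad_key] / (1 - beta2 ** t)

        # parameter update
        params[key] = value - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)

    return params, v_new, s_new


# demo values (made up): a single layer with constant gradients
parameters_demo = {"W1": np.ones((2, 3)), "b1": np.zeros((2, 1))}
gradients_demo = {"dW1": np.full((2, 3), 0.5), "db1": np.full((2, 1), -2.0)}

v_demo, s_demo = initialize_adam(parameters_demo)
parameters_demo, v_demo, s_demo = update_parameters_with_adam(
    parameters_demo, gradients_demo, v_demo, s_demo, t=1, learning_rate=0.01
)
print(parameters_demo["W1"][0, 0], parameters_demo["b1"][0, 0])
```

On the first step the bias-corrected averages equal the gradient and its square, so each parameter moves by roughly `learning_rate` in the direction opposite to the sign of its gradient, whatever the gradient's magnitude.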

# 3. Model  <a class="anchor" id="chapter3"></a>