# Initialization

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
from init_utils import sigmoid, relu, compute_loss, forward_propagation, backward_propagation
from init_utils import update_parameters, predict, load_dataset, plot_decision_boundary, predict_dec
train_X, test_X, train_y, test_y = load_dataset()

# Three main types of initialization

Zeros initialization -- setting initialization = "zeros" in the input argument.
Random initialization -- setting initialization = "random" in the input argument. This initializes the weights to large random values.
He initialization -- setting initialization = "he" in the input argument. This initializes the weights to random values scaled according to a paper by He et al., 2015.

In [2]:
def model(X, y, learning_rate=0.01, num_iterations=1500, print_cost=True, initialization='he'):
    grads = {}
    costs = []  # cost recorded every 1000 iterations
    m = X.shape[1]  # number of examples
    layer_dims = [X.shape[0], 10, 5, 1]

    # Choose the initialization scheme
    if initialization == 'zeros':
        parameters = initialize_parameters_zeros(layer_dims)
    elif initialization == 'random':
        parameters = initialize_parameters_random(layer_dims)
    elif initialization == 'he':
        parameters = initialize_parameters_he(layer_dims)
    else:
        raise ValueError("initialization must be 'zeros', 'random' or 'he'")

    # Gradient descent
    for i in range(num_iterations):
        a3, cache = forward_propagation(X, parameters)
        cost = compute_loss(a3, y)
        grads = backward_propagation(X, y, cache)
        parameters = update_parameters(parameters, grads, learning_rate)
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            costs.append(cost)

    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (per thousands)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

Zero initialization

In [None]:
def initialize_parameters_zeros(layer_dims):
    parameters = {}
    L = len(layer_dims)
    for l in range(1, L):
        # np.zeros takes the shape as a single tuple argument
        parameters['W'+str(l)] = np.zeros((layer_dims[l], layer_dims[l-1]))
        parameters['b'+str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

In general, initializing all the weights to zero results in the network failing to break symmetry. Every neuron in a given layer then computes the same output and receives the same gradient, so all of them learn the same thing. You might as well be training a neural network with $n^{[l]}=1$ for every layer, and the network is no more powerful than a linear classifier such as logistic regression.
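The symmetry problem can be checked on a toy example. The sketch below is an assumption-laden illustration, not the notebook's `init_utils` code: it builds a hypothetical 2 -> 4 -> 1 network (ReLU hidden layer, sigmoid output) on made-up data, starts from all-zero parameters, and runs plain gradient descent. With ReLU the hidden activations stay at zero, so the weight gradients are zero as well and only the output bias moves.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy data: 2 features, 50 examples, binary labels
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50))
y = (X[0:1] * X[1:2] > 0).astype(float)   # shape (1, 50)
m = X.shape[1]

# All-zero parameters for a hypothetical 2 -> 4 -> 1 network
W1, b1 = np.zeros((4, 2)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

learning_rate = 0.1
for _ in range(100):
    # Forward pass
    z1 = W1 @ X + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)

    # Backward pass for the mean cross-entropy loss
    dz2 = (a2 - y) / m
    dW2 = dz2 @ a1.T
    db2 = dz2.sum(axis=1, keepdims=True)
    da1 = W2.T @ dz2
    dz1 = da1 * (z1 > 0)
    dW1 = dz1 @ X.T
    db1 = dz1.sum(axis=1, keepdims=True)

    # Gradient descent update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print("W1 still all zeros:", not W1.any())
print("W2 still all zeros:", not W2.any())
print("spread of predictions across examples:", np.ptp(a2))
```

Under these assumptions the network outputs the same probability for every example, whatever the input, which is the behaviour the paragraph above describes.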

Random initialization

In [None]:
def initialize_parameters_random(layer_dims):
    
    # Scale the standard-normal weights by 10 to make them deliberately large
    
    np.random.seed(2)
    parameters ={}
    L = len(layer_dims)
    for l in range(1, L):
        parameters['W'+str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])*10
        parameters['b'+str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

Observations:

The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets such an example wrong it incurs a very high loss for that example. Indeed, when $\log(a^{[3]}) = \log(0)$, the loss goes to infinity.
Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.
In summary:

Initializing weights to very large random values does not work well.
Initializing with small random values should do better.
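The effect of a saturated sigmoid on the loss can be seen on single values. The following is a minimal sketch with hand-picked pre-activations `z` (the values and helper names are illustrative assumptions, not part of `init_utils`): as `z` grows while the true label is 0, the per-example cross-entropy loss grows roughly linearly in `z`, and the local sigmoid gradient `a * (1 - a)` shrinks towards zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, y):
    # Per-example binary cross-entropy loss
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

y_true = 0.0  # the example's true label
for z in [0.5, 5.0, 20.0]:
    a = sigmoid(z)              # predicted probability of class 1
    loss = cross_entropy(a, y_true)
    local_grad = a * (1 - a)    # derivative of sigmoid at z
    print(f"z = {z:5.1f}   a = {a:.8f}   loss = {loss:.3f}   sigmoid'(z) = {local_grad:.2e}")
```

Large weights make large `|z|` values likely, so confidently wrong predictions (high loss) and near-zero sigmoid gradients both become more common.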

He initialization
He initialization is named for the first author of He et al., 2015. (If you have heard of "Xavier initialization", this is similar, except that Xavier initialization uses a scaling factor for the weights of sqrt(1./layer_dims[l-1]) where He initialization uses sqrt(2./layer_dims[l-1]).)

In [1]:
def initialize_parameters_he(layer_dims):
    parameters = {}
    np.random.seed(4)
    L = len(layer_dims)
    for l in range(1, L):
        # Scale standard-normal weights by sqrt(2 / n_prev), as in He et al., 2015
        parameters['W'+str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])*np.sqrt(2./layer_dims[l-1])
        # Biases start at zero and need no scaling
        parameters['b'+str(l)] = np.zeros((layer_dims[l], 1))
    return parameters
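To make the difference between the Xavier and He scaling factors concrete, here is a small sketch that draws one weight matrix with each factor and compares the empirical standard deviations. The layer sizes `n_prev` and `n_curr` are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr = 500, 400   # hypothetical sizes of the previous and current layer

xavier_scale = np.sqrt(1.0 / n_prev)
he_scale = np.sqrt(2.0 / n_prev)

# Standard-normal weights multiplied by each scaling factor
W_xavier = rng.standard_normal((n_curr, n_prev)) * xavier_scale
W_he = rng.standard_normal((n_curr, n_prev)) * he_scale

print("Xavier: target std", xavier_scale, "empirical std", W_xavier.std())
print("He:     target std", he_scale, "empirical std", W_he.std())
print("ratio of He to Xavier scale:", he_scale / xavier_scale)
```

The two schemes differ only by a constant factor of sqrt(2) in the standard deviation of the weights.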

Different initializations lead to different results.
Random initialization is used to break symmetry and make sure different hidden units can learn different things.
Don't initialize to values that are too large.
He initialization works well for networks with ReLU activations.
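One way to see why the weight scale matters for ReLU networks is to push random inputs through a stack of randomly initialized layers and watch the size of the activations. The sketch below is an illustrative experiment under assumed settings (width 256, depth 10, no biases, the helper name `final_activation_rms` is made up here); it is not the model trained in this notebook.

```python
import numpy as np

def final_activation_rms(scale_fn, width=256, depth=10, n_examples=200, seed=0):
    """Root-mean-square activation after `depth` random ReLU layers.

    scale_fn maps the fan-in (number of inputs per unit) to the factor
    that multiplies standard-normal weights.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((width, n_examples))   # inputs with RMS close to 1
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width)
        a = np.maximum(0, W @ a)                   # ReLU layer without bias
    return np.sqrt(np.mean(a ** 2))

rms_large = final_activation_rms(lambda n: 10.0)             # weights scaled by 10
rms_small = final_activation_rms(lambda n: 0.01)             # weights scaled by 0.01
rms_he = final_activation_rms(lambda n: np.sqrt(2.0 / n))    # He scaling

print("scale 10:     ", rms_large)
print("scale 0.01:   ", rms_small)
print("He sqrt(2/n): ", rms_he)
```

In this setup the activations blow up with the large scale, shrink towards zero with the small fixed scale, and stay near the input magnitude with He scaling, because the factor sqrt(2/n) compensates for ReLU zeroing out roughly half of each layer's pre-activations.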