# Problem 5 - Weight Initialization, Dead Neurons, Leaky ReLU

Read the two blogs on weight initialization, one by Andre Perunicic and the other by Daniel Godoy. You will reuse the code in the GitHub repo linked in the blog to explain vanishing and exploding gradients. You can use the same 5-layer neural network model as in the repo and the same dataset.

• Andre Perunicic. Understand neural network weight initialization. Available at https://intoli.com/blog/neural-network-initialization/
• Daniel Godoy. Hyper-parameters in Action Part II — Weight Initializers.
• Initializers - Keras documentation. https://keras.io/initializers/.
• Lu Lu et al. Dying ReLU and Initialization: Theory and Numerical Examples.

# 1. 
Explain the vanishing gradients phenomenon using normal initialization with different values of standard deviation, as given in the reference. Train the model with tanh and sigmoid activation functions. Next, show how Xavier (aka Glorot normal) initialization of weights helps in dealing with this problem. You should plot the gradients at each of the 5 layers for all 4 experiments to answer this question.

**Explain the vanishing gradients phenomenon using normal initialization with different values of standard deviation, as given in the reference.**

The vanishing gradient problem arises while training a neural network. Backpropagation computes each layer's gradient as a chain of derivative multiplications through the n hidden layers. When those factors are smaller than one, the product shrinks layer by layer, and the gradients that reach the early layers approach zero.
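The chain of multiplications can be written out explicitly. The notation below is an assumption, since the write-up does not fix one: $W^{(l)}$ and $b^{(l)}$ are the weights and biases of layer $l$, $\sigma$ is the activation, $a^{(l)}$ is the layer output, and $\mathcal{L}$ is the loss.

```latex
\[
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma\!\left(z^{(l)}\right),
\]
\[
\frac{\partial \mathcal{L}}{\partial a^{(l-1)}}
  = \left(W^{(l)}\right)^{\top}
    \left( \sigma'\!\left(z^{(l)}\right) \odot \frac{\partial \mathcal{L}}{\partial a^{(l)}} \right).
\]
```

Applying this step once per layer gives one factor of $W^{(l)}$ and one factor of $\sigma'$ for every layer between the loss and the layer of interest. For the sigmoid, $\sigma'(z) \le 1/4$, and for tanh, $\sigma'(z) \le 1$ with values near 0 once the unit saturates, so small weights or saturated units make the product decay roughly geometrically with depth.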

When a neuron's gradients reach zero, its weights stop updating: no learning takes place, and a network in which this happens throughout is considered dead.
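The effect can be illustrated without the repo code. The sketch below is a NumPy stand-in, not the model from the blog: it builds a toy 5-layer dense network, backpropagates a random upstream gradient, and reports the norm of the backpropagated signal (the gradient with respect to each layer's pre-activation) from the first layer to the last. The function name `backprop_signal_norms`, the width of 100, and the standard deviation 0.01 are illustrative assumptions.

```python
import numpy as np


def backprop_signal_norms(activation="tanh", init_std=0.01, n_layers=5,
                          width=100, n_samples=256, seed=0):
    """Norm of dL/dz at each layer (first to last) of a toy dense network.

    init_std is a float (plain normal initialization) or the string "glorot"
    (Glorot normal: std = sqrt(2 / (fan_in + fan_out))).
    """
    rng = np.random.default_rng(seed)
    if activation == "tanh":
        act = np.tanh
        dact = lambda a: 1.0 - a ** 2        # tanh'(z) written via a = tanh(z)
    elif activation == "sigmoid":
        act = lambda z: 1.0 / (1.0 + np.exp(-z))
        dact = lambda a: a * (1.0 - a)       # sigmoid'(z) written via a = sigmoid(z)
    else:
        raise ValueError("activation must be 'tanh' or 'sigmoid'")

    if isinstance(init_std, str):
        std = np.sqrt(2.0 / (width + width))  # Glorot normal
    else:
        std = float(init_std)
    weights = [rng.normal(0.0, std, size=(width, width)) for _ in range(n_layers)]

    # Forward pass, keeping every layer's output.
    a = rng.normal(size=(n_samples, width))
    outputs = []
    for W in weights:
        a = act(a @ W)
        outputs.append(a)

    # Backward pass from a random upstream gradient (stand-in for dL/da at the output).
    grad_a = rng.normal(size=(n_samples, width))
    norms = []
    for W, a_out in zip(reversed(weights), reversed(outputs)):
        grad_z = grad_a * dact(a_out)         # multiply by the activation derivative
        norms.append(np.linalg.norm(grad_z))
        grad_a = grad_z @ W.T                 # multiply by the layer weights
    return norms[::-1]


for act_name in ("tanh", "sigmoid"):
    for init in (0.01, "glorot"):
        norms = backprop_signal_norms(act_name, init)
        print(act_name, init, ["%.2e" % n for n in norms])
```

With the small standard deviation, the printed norms should fall by orders of magnitude from the last layer to the first, while Glorot normal keeps them much closer together. The actual assignment still needs the per-layer gradient plots from the Keras model.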

# 2.
The dying ReLU is a kind of vanishing gradient: ReLU neurons become
inactive and output only 0 for any input. In the worst case, the ReLU neurons at a certain
layer are all dead, i.e., the entire network dies; Lu et al. (listed in the references above) call this
a dying ReLU neural network. A dying ReLU neural network collapses to a constant function. Show this
phenomenon using any one of the three 1-dimensional functions on page 13 of Lu et al. Use a ReLU
network with 10 hidden layers, each of width 2 (hidden units per layer). Use a minibatch size of 64 and draw
training data uniformly from [-sqrt(7), sqrt(7)]. Perform 1000 independent training simulations, each with
3,000 training points. Out of these 1000 simulations, what fraction resulted in neural network collapse? Is your answer close to the over 90% reported in Lu et al.?

### Answer: 
Yes. Around 93% of the simulations collapsed to a constant function, which is very similar to the paper's result.
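Much of this collapse is already present before any training step. The NumPy sketch below first checks that |x| = ReLU(x) + ReLU(-x), and then estimates how often a 10-layer, width-2 ReLU network is constant at initialization. It assumes Glorot uniform kernels and zero biases (the Keras `Dense` defaults). The helper names `glorot_uniform` and `is_constant_at_init` are hypothetical, and the sketch does not train anything, so it is not a replacement for the experiment.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


# |x| = ReLU(x) + ReLU(-x): hidden weights (1, -1), output weights (1, 1).
x = np.linspace(-np.sqrt(7), np.sqrt(7), 201).reshape(-1, 1)
hidden = relu(x @ np.array([[1.0, -1.0]]))
abs_from_relu = hidden @ np.array([[1.0], [1.0]])
assert np.allclose(abs_from_relu, np.abs(x))


def glorot_uniform(rng, fan_in, fan_out):
    # Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def is_constant_at_init(rng, inputs, depth=10, width=2):
    """True if a freshly initialized ReLU network maps every input to the same value."""
    a = inputs
    fan_in = inputs.shape[1]
    for _ in range(depth):
        a = relu(a @ glorot_uniform(rng, fan_in, width))  # biases are zero
        fan_in = width
    out = a @ glorot_uniform(rng, width, 1)               # linear output unit
    return bool(np.all(out == out[0]))


rng = np.random.default_rng(0)
n_trials = 1000
born_dead = sum(is_constant_at_init(rng, x) for _ in range(n_trials))
fraction_born_dead = born_dead / n_trials
print(fraction_born_dead)
```

A network that is constant at initialization has zero gradients in its hidden layers, so SGD can only move the output bias and the prediction stays constant after training.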

In [1]:
import numpy as np
import tensorflow as tf

# Target function: f(x) = |x|. Since |x| = ReLU(x) + ReLU(-x), a 2-layer ReLU
# network of width 2 can represent it exactly. A deep, narrow ReLU network,
# however, frequently collapses to a constant function.
np.random.seed(1999)
num_simulations = 1000
mini_batch_size = 64

def function_fx(x):
    return np.abs(x)

def sample_xy(n_points):
    # Draw x uniformly from [-sqrt(7), sqrt(7)] and evaluate the target on it.
    x = np.random.uniform(-np.sqrt(7), np.sqrt(7), size=(n_points, 1))
    return x, function_fx(x)

# A fresh training set (3,000 points) and test set (500 points) is drawn for every simulation.
def generating_data():
    train_x, train_y = sample_xy(3000)
    test_x, test_y = sample_xy(500)
    return train_x, train_y, test_x, test_y

increment = 0  # number of simulations that collapsed to a constant function
for sim in range(num_simulations):  # sim is the zero-based simulation index
    train_x, train_y, test_x, test_y = generating_data()

    # 10 hidden ReLU layers of width 2, followed by a linear output unit.
    model = tf.keras.Sequential()
    for _ in range(10):
        model.add(tf.keras.layers.Dense(units=2, activation='relu', bias_initializer='zeros'))
    model.add(tf.keras.layers.Dense(units=1))
    model.compile(optimizer='sgd', loss='mse')
    # The first call to fit also builds the model.
    model.fit(train_x, train_y, batch_size=mini_batch_size, epochs=2)
    y_predict = model.predict(test_x)
    # The network has collapsed if every test prediction is identical.
    result = np.all(y_predict == y_predict[0])
    if result:
        increment += 1

    print("-------------------", increment, '/', sim)

# Fraction of simulations that collapsed.
print(increment / num_simulations)
    


Epoch 1/2
Epoch 2/2
------------------- 1 / 0
Epoch 1/2
Epoch 2/2
------------------- 2 / 1
Epoch 1/2
Epoch 2/2
------------------- 3 / 2
Epoch 1/2
Epoch 2/2
------------------- 4 / 3
Epoch 1/2
Epoch 2/2
------------------- 5 / 4
Epoch 1/2
Epoch 2/2
------------------- 6 / 5
Epoch 1/2
Epoch 2/2
------------------- 7 / 6
Epoch 1/2
Epoch 2/2
------------------- 8 / 7
Epoch 1/2
Epoch 2/2
------------------- 9 / 8
Epoch 1/2
Epoch 2/2
------------------- 10 / 9
Epoch 1/2
Epoch 2/2
------------------- 11 / 10
Epoch 1/2
Epoch 2/2
------------------- 12 / 11
Epoch 1/2
Epoch 2/2
------------------- 13 / 12
Epoch 1/2
Epoch 2/2
------------------- 14 / 13
Epoch 1/2
Epoch 2/2
------------------- 15 / 14
Epoch 1/2
Epoch 2/2
------------------- 16 / 15
Epoch 1/2
Epoch 2/2
------------------- 17 / 16
Epoch 1/2
Epoch 2/2
------------------- 17 / 17
Epoch 1/2
Epoch 2/2
------------------- 18 / 18
Epoch 1/2
Epoch 2/2
------------------- 19 / 19
Epoch 1/2
Epoch 2/2
------------------- 20 / 20

We got a collapse rate of 94.7%, which matches the results in the paper.
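The collapse test in the cell above uses exact floating-point equality. A slightly more forgiving check, together with a binomial standard error for the reported fraction, could look like the sketch below. The helper names `is_collapsed` and `collapse_fraction` and the tolerance `1e-6` are assumptions for illustration, not part of the original experiment.

```python
import numpy as np


def is_collapsed(predictions, tol=1e-6):
    """True when all predictions lie within `tol` of each other."""
    predictions = np.asarray(predictions, dtype=float)
    return bool(predictions.max() - predictions.min() <= tol)


def collapse_fraction(flags):
    """Fraction of collapsed runs and its binomial standard error."""
    flags = np.asarray(flags, dtype=bool)
    p = float(flags.mean())
    std_err = float(np.sqrt(p * (1.0 - p) / flags.size))
    return p, std_err


# Toy usage: a constant prediction vector versus one that follows |x|.
print(is_collapsed(np.full((500, 1), 0.37)))
print(is_collapsed(np.abs(np.linspace(-1.0, 1.0, 500))))
print(collapse_fraction([True, True, False, True]))
```

The standard error gives a rough sense of how much two runs of 1000 simulations can differ by chance alone.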

# 3.
Instead of ReLU, consider the Leaky ReLU activation defined below:

$$\varphi(z) = \begin{cases} z & \text{if } z > 0 \\ 0.01z & \text{if } z \le 0 \end{cases}$$

Run the 1000 training simulations in part 2 with Leaky ReLU activation, keeping everything else the same. Again calculate the fraction of simulations that resulted in neural network collapse. Did Leaky ReLU help in preventing dying neurons?

### Answer: 
Yes. With Leaky ReLU the fraction of collapsed networks is almost 0, which is a great improvement.
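The reason is visible in the activation itself. The NumPy sketch below implements the Leaky ReLU from the question with slope 0.01 and its derivative. The function names are illustrative, and the derivative at z = 0 is taken to be the negative-side slope, following the z ≤ 0 branch of the definition.

```python
import numpy as np


def leaky_relu(z, negative_slope=0.01):
    # z for z > 0, negative_slope * z otherwise.
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, z, negative_slope * z)


def leaky_relu_grad(z, negative_slope=0.01):
    # Derivative of leaky_relu: 1 for z > 0, negative_slope otherwise.
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, 1.0, negative_slope)


z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(z))
print(leaky_relu_grad(z))
```

Because the derivative is never exactly zero, a unit with a negative pre-activation still passes a small gradient backward, so a layer cannot switch off completely the way a ReLU layer can.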

In [18]:

def leaky_relu(z):
    # Leaky ReLU with slope 0.01 for z <= 0, as defined in the question.
    # The 'leaky_relu' activation string may use the library's default slope instead.
    return tf.nn.leaky_relu(z, alpha=0.01)

increment = 0  # number of simulations that collapsed to a constant function
for sim in range(num_simulations):  # sim is the zero-based simulation index
    train_x, train_y, test_x, test_y = generating_data()
    model = tf.keras.Sequential()
    # The first hidden layer uses a small random-normal kernel initializer, unlike part 2.
    model.add(tf.keras.layers.Dense(units=2, activation=leaky_relu,
                                    kernel_initializer=tf.initializers.RandomNormal(stddev=0.01)))
    # Remaining 9 hidden layers of width 2, followed by a linear output unit.
    for _ in range(9):
        model.add(tf.keras.layers.Dense(units=2, activation=leaky_relu))
    model.add(tf.keras.layers.Dense(units=1))
    model.compile(optimizer='sgd', loss='mse')
    # The first call to fit also builds the model.
    model.fit(train_x, train_y, batch_size=mini_batch_size, epochs=2)
    y_predict = model.predict(test_x)

    # The network has collapsed if every test prediction is identical.
    result = np.all(y_predict == y_predict[0])
    if result:
        increment += 1

    print("-------------------", increment, '/', sim)

# Fraction of simulations that collapsed.
print(increment / num_simulations)




Epoch 1/2
Epoch 2/2
------------------- 0 / 0
Epoch 1/2
Epoch 2/2
------------------- 0 / 1
Epoch 1/2
Epoch 2/2
------------------- 0 / 2
Epoch 1/2
Epoch 2/2
------------------- 0 / 3
Epoch 1/2
Epoch 2/2
------------------- 0 / 4
Epoch 1/2
Epoch 2/2
------------------- 0 / 5
Epoch 1/2
Epoch 2/2
------------------- 0 / 6
Epoch 1/2
Epoch 2/2
------------------- 0 / 7
Epoch 1/2
Epoch 2/2
------------------- 0 / 8
Epoch 1/2
Epoch 2/2
------------------- 0 / 9
Epoch 1/2
Epoch 2/2
------------------- 0 / 10
Epoch 1/2
Epoch 2/2
------------------- 0 / 11
Epoch 1/2
Epoch 2/2
------------------- 0 / 12
Epoch 1/2
Epoch 2/2
------------------- 0 / 13
Epoch 1/2
Epoch 2/2
------------------- 0 / 14
Epoch 1/2
Epoch 2/2
------------------- 0 / 15
Epoch 1/2
Epoch 2/2
------------------- 0 / 16
Epoch 1/2
Epoch 2/2
------------------- 0 / 17
Epoch 1/2
Epoch 2/2
------------------- 0 / 18
Epoch 1/2
Epoch 2/2
------------------- 0 / 19
Epoch 1/2
Epoch 2/2
------------------- 0 / 20