# Problem 5 - Weight Initialization, Dead Neurons, Leaky ReLU

Read the two blog posts on weight initialization, one by Andre Perunicic and the other by Daniel Godoy. You will reuse the code from the GitHub repo linked in the blog to explain vanishing and exploding gradients. You can use the same 5-layer neural network model as in the repo and the same dataset.

• Andre Perunicic. Understand neural network weight initialization. Available at https://intoli.com/blog/neural-network-initialization/
• Daniel Godoy. Hyper-parameters in Action Part II — Weight Initializers.
• Initializers - Keras documentation. https://keras.io/initializers/.
• Lu Lu et al. Dying ReLU and Initialization: Theory and Numerical Examples.

# 1. 
Explain the vanishing gradients phenomenon using normal initialization with different values of the standard deviation, as given in the reference. Train the model with tanh and sigmoid activation functions. Next, show how Xavier (aka Glorot normal) initialization of the weights helps in dealing with this problem. You should plot the gradients at each of the 5 layers for all 4 experiments to answer this question.
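As a rough illustration of what the four experiments compare, the sketch below runs one forward and one backward pass through a 5-layer fully connected network in plain NumPy and reports the mean absolute weight gradient per layer. It is not the code from either blog: the function name `layer_gradient_scales`, the random input data, and the stand-in loss (the mean over samples of the summed last-layer activations) are assumptions made only for this example. Glorot normal is taken here as $\sigma = \sqrt{2 / (\text{fan\_in} + \text{fan\_out})}$.

```python
import numpy as np

def layer_gradient_scales(activation, stddev=None, n_layers=5, units=100,
                          input_dim=10, n_samples=256, seed=13):
    """Mean |dL/dW| per layer for one forward/backward pass.

    stddev=None selects Glorot normal: sigma = sqrt(2 / (fan_in + fan_out)).
    The loss is a stand-in (sum of last-layer activations / n_samples).
    """
    rng = np.random.default_rng(seed)
    if activation == "tanh":
        act = np.tanh
        act_grad = lambda a: 1.0 - a ** 2        # derivative in terms of the output a
    else:  # sigmoid
        act = lambda z: 1.0 / (1.0 + np.exp(-z))
        act_grad = lambda a: a * (1.0 - a)

    # Draw the weights layer by layer
    dims = [input_dim] + [units] * n_layers
    weights = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        sigma = np.sqrt(2.0 / (fan_in + fan_out)) if stddev is None else stddev
        weights.append(rng.normal(0.0, sigma, size=(fan_in, fan_out)))

    # Forward pass, keeping every layer's activations
    activations = [rng.normal(size=(n_samples, input_dim))]
    for w in weights:
        activations.append(act(activations[-1] @ w))

    # Backward pass, starting from the gradient of the stand-in loss
    delta = np.ones_like(activations[-1]) / n_samples
    scales = []
    for i in reversed(range(n_layers)):
        delta = delta * act_grad(activations[i + 1])   # back through the activation
        grad_w = activations[i].T @ delta              # gradient w.r.t. layer i weights
        scales.append(float(np.abs(grad_w).mean()))
        delta = delta @ weights[i].T                   # back to the previous layer
    return scales[::-1]                                # ordered from layer 1 to layer 5

for activation in ("sigmoid", "tanh"):
    for label, stddev in (("normal, stddev=0.01", 0.01), ("Glorot normal", None)):
        scales = layer_gradient_scales(activation, stddev)
        print(activation, label, ["%.1e" % s for s in scales])
```

The actual assignment still needs the trained Keras model and the per-layer gradient plots; this only shows the scale effect of the initializer on a single pass.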


In [20]:
! pip install keras
! pip install tensorflow
! pip install deepreplay
! pip install utils


## Godoy's tutorial

### The model

In [21]:
from keras.models import Sequential
from keras.layers import Dense

def build_model(n_layers, input_dim, units, activation, initializer):
    if isinstance(units, list):
        assert len(units) == n_layers
    else:
        units = [units] * n_layers
        
    model = Sequential()
    # Adds first hidden layer with input_dim parameter
    model.add(Dense(units=units[0], 
                    input_dim=input_dim, 
                    activation=activation,
                    kernel_initializer=initializer, 
                    name='h1'))
    
    # Adds remaining hidden layers
    for i in range(2, n_layers + 1):
        model.add(Dense(units=units[i-1], 
                        activation=activation, 
                        kernel_initializer=initializer, 
                        name='h{}'.format(i)))
    
    # Adds output layer
    model.add(Dense(units=1, activation='sigmoid', kernel_initializer=initializer, name='o'))
    # Compiles the model
    model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])
    return model

In [17]:

from deepreplay.datasets.ball import load_data

X, y = load_data(n_dims=10)

### Normal initializer and sigmoid


In [18]:
from deepreplay.callbacks import ReplayData
from deepreplay.replay import Replay
from deepreplay.plot import compose_plots
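from keras.initializers import RandomNormal
from matplotlib import pyplot as plt

filename = 'part2_weight_initializers.h5'
# Each group_name can be created only once per file; rerunning this cell with
# the same name raises "Unable to create group (name already exists)".
group_name = 'sigmoid_stdev_0.01'

# Uses normal initializer
initializer = RandomNormal(mean=0, stddev=0.01, seed=13)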

# Builds BLOCK model
model = build_model(n_layers=5, input_dim=10, units=100, 
                    activation='sigmoid', initializer=initializer)

# Since we only need initial weights, we don't even need to train the model! 
# We still use the ReplayData callback, but we can pass the model as argument instead
replaydata = ReplayData(X, y, filename=filename, group_name=group_name, model=model)

# Now we feed the data to the actual Replay object
# so we can build the visualizations
replay = Replay(replay_filename=filename, group_name=group_name)

# Using subplot2grid to assemble a complex figure...
fig = plt.figure(figsize=(12, 6))
ax_zvalues = plt.subplot2grid((2, 2), (0, 0))
ax_weights = plt.subplot2grid((2, 2), (0, 1))
ax_activations = plt.subplot2grid((2, 2), (1, 0))
ax_gradients = plt.subplot2grid((2, 2), (1, 1))

wv = replay.build_weights(ax_weights)
gv = replay.build_gradients(ax_gradients)
# Z-values
zv = replay.build_outputs(ax_zvalues, before_activation=True, 
                          exclude_outputs=True, include_inputs=False)
# Activations
av = replay.build_outputs(ax_activations, exclude_outputs=True, include_inputs=False)

# Finally, we use compose_plots to update all
# visualizations at once
fig = compose_plots([zv, wv, av, gv], 
                    epoch=0, 
                    title=r'Activation: sigmoid - Initializer: Normal $\sigma = 0.01$')

ValueError: Unable to create group (name already exists)

### Normal initializer and tanh

### Glorot and sigmoid 

### Glorot and tanh

## Perunicic's tutorial

In [31]:
import keras
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from keras import initializers
from keras.datasets import mnist
import functions as u
# from utils import (
#     compile_model,
#     create_mlp_model,
#     get_activations,
#     grid_axes_it,
# )


seed = 10

# Number of points to plot
n_train = 1000
n_test = 100
n_classes = 10

# Network params
n_hidden_layers = 5
dim_layer = 100
batch_size = n_train
epochs = 1

# Load and prepare MNIST dataset.
n_train = 60000
n_test = 10000

(x_train, y_train), (x_test, y_test) = mnist.load_data()
num_classes = len(np.unique(y_test))
data_dim = 28 * 28

x_train = x_train.reshape(60000, 784).astype('float32')[:n_train]
x_test = x_test.reshape(10000, 784).astype('float32')[:n_test]
x_train /= 255
x_test /= 255

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# Run the data through a few MLP models and save the activations from
# each layer into a Pandas DataFrame.
rows = []
sigmas = [0.10, 0.14, 0.28]
for stddev in sigmas:
    init = initializers.RandomNormal(mean=0.0, stddev=stddev, seed=seed)
    activation = 'relu'

    model = u.create_mlp_model(
        n_hidden_layers,
        dim_layer,
        (data_dim,),
        n_classes,
        init,
        'zeros',
        activation
    )
    u.compile_model(model)
    output_elts = u.get_activations(model, x_test)
    n_layers = len(model.layers)
    i_output_layer = n_layers - 1

    for i, out in enumerate(output_elts[:-1]):
        if i > 0 and i != i_output_layer:
            for out_i in out.ravel()[::20]:
                rows.append([i, stddev, out_i])

df = pd.DataFrame(rows, columns=['Hidden Layer', 'Standard Deviation', 'Output'])

# Plot previously saved activations from the 5 hidden layers
# using different initialization schemes.
fig = plt.figure(figsize=(12, 6))
axes = u.grid_axes_it(len(sigmas), 1, fig=fig)
for sig in sigmas:
    ax = next(axes)
    ddf = df[df['Standard Deviation'] == sig]
    sns.violinplot(x='Hidden Layer', y='Output', data=ddf, ax=ax, scale='count', inner=None)

    ax.set_xlabel('')
    ax.set_ylabel('')

    ax.set_title(r'Weights Drawn from $N(\mu = 0, \sigma = {%.2f})$' % sig, fontsize=13)

    if sig == sigmas[1]:
        ax.set_ylabel("ReLu Neuron Outputs")
    if sig != sigmas[-1]:
        ax.set_xticklabels(())
    else:
        ax.set_xlabel("Hidden Layer")

plt.tight_layout()
plt.show()

ValueError: Layer "model" expects 1 input(s), but it received 2 input tensors. Inputs received: [<tf.Tensor: shape=(10000, 784), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>]

# 2.
The dying ReLU is a kind of vanishing gradient problem in which ReLU neurons become inactive and output 0 for any input. In the worst case, the ReLU neurons at a certain layer are all dead, i.e., the entire network dies; this is referred to as a dying ReLU neural network in Lu et al. (see the references). A dying ReLU neural network collapses to a constant function. Show this phenomenon using any one of the three 1-dimensional functions on page 13 of Lu et al. Use a ReLU network with 10 hidden layers, each of width 2 (hidden units per layer). Use a minibatch size of 64 and draw training data uniformly from $[-\sqrt{7}, \sqrt{7}]$. Perform 1000 independent training simulations, each with 3,000 training points. Out of these 1000 simulations, what fraction resulted in neural network collapse? Is your answer close to the over 90% reported in Lu et al.?
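One way to decide whether a run has collapsed is to check whether the network output is numerically constant over the training inputs. The sketch below shows such a check in plain NumPy, applied to freshly initialized networks only; it does not train anything, so the fraction it prints is not the answer to this question. He-normal weights, zero biases, the tolerance `tol`, and the helper names `init_network`, `forward` and `is_collapsed` are all assumptions made for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_network(rng, depth=10, width=2, in_dim=1, out_dim=1):
    """He-normal weights and zero biases (an assumed initialization)."""
    dims = [in_dim] + [width] * depth + [out_dim]
    return [(rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)),
             np.zeros(fan_out))
            for fan_in, fan_out in zip(dims[:-1], dims[1:])]

def forward(params, x):
    h = x
    for w, b in params[:-1]:
        h = relu(h @ w + b)        # hidden layers use ReLU
    w, b = params[-1]
    return h @ w + b               # linear output layer

def is_collapsed(params, x, tol=1e-6):
    """Treat the network as collapsed if its output is constant on x (up to tol)."""
    y = forward(params, x)
    return float(y.max() - y.min()) < tol

rng = np.random.default_rng(0)

# 3,000 training points drawn uniformly from [-sqrt(7), sqrt(7)], target f(x) = |x|
x_train = rng.uniform(-np.sqrt(7), np.sqrt(7), size=(3000, 1))
y_train = np.abs(x_train)

n_runs = 1000
dead = sum(is_collapsed(init_network(rng), x_train) for _ in range(n_runs))
print(f"collapsed at initialization: {dead / n_runs:.1%}")
```

For the full experiment, the same `is_collapsed`-style test would be applied to each network after training with minibatches of 64.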

In [None]:
"""Suppose f (x) = |x| is a target function we want to approximate using a ReLU network. 
Since |x|=ReLU(x)+ReLU(−x), a 2-layer ReLU network of width 2 can exactly represent |x|. 
However, when we train a deep ReLU network, we frequently observe"""
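The identity quoted above can be checked directly. The snippet below is a small sanity check, not part of the assignment code; `abs_via_relu` is a hypothetical name for the one-hidden-layer, width-2 network with hidden weights (+1, -1), zero biases, and output weights (1, 1).

```python
def relu(z):
    return max(z, 0.0)

def abs_via_relu(x):
    # Hidden layer: two ReLU units with weights +1 and -1 and zero biases.
    # Output layer: sum of the two hidden units (weights 1 and 1).
    return relu(1.0 * x) + relu(-1.0 * x)

for x in (-2.5, -1.0, 0.0, 0.5, 3.0):
    assert abs_via_relu(x) == abs(x)
```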

# 3.
Instead of ReLU, consider the Leaky ReLU activation defined below:

$$\phi(z) = \begin{cases} z & \text{if } z > 0 \\ 0.01z & \text{if } z \le 0 \end{cases}$$

Run the 1000 training simulations from part 2 with the Leaky ReLU activation, keeping everything else the same. Again calculate the fraction of simulations that resulted in neural network collapse. Did Leaky ReLU help in preventing dying neurons?
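To make the difference from ReLU concrete, here is a minimal sketch of the activation and its derivative in plain Python, using the slope 0.01 for non-positive inputs. The chain at the end is a deliberately simplified worst case that ignores the weights and only multiplies the activation derivatives across 10 layers; the function names are chosen for this example.

```python
import math

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

# Simplified worst case for a 10-layer chain: every pre-activation is negative.
pre_activations = [-1.0] * 10
relu_chain = math.prod(relu_grad(z) for z in pre_activations)
leaky_chain = math.prod(leaky_relu_grad(z) for z in pre_activations)
print(relu_chain, leaky_chain)
```

With ReLU the product is exactly zero, so no gradient reaches the earlier layers. With Leaky ReLU it is nonzero but can still be very small, which is worth keeping in mind when interpreting the collapse fraction.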