# Weight Initialization
First, let's have a look at some properties of variance:
## Variance Properties
The variance can also be thought of as the covariance of a random variable with itself:

1) $\operatorname {Var} (X)=\operatorname {Cov} (X,X)$.
 

2) ${\displaystyle {\begin{aligned}\operatorname {Var} (X)&=\operatorname {E} \left[(X-\operatorname {E} [X])^{2}\right]\\[4pt]&=\operatorname {E} \left[X^{2}-2X\operatorname {E} [X]+\operatorname {E} [X]^{2}\right]\\[4pt]&=\operatorname {E} \left[X^{2}\right]-2\operatorname {E} [X]\operatorname {E} [X]+\operatorname {E} [X]^{2}\\[4pt]&=\operatorname {E} \left[X^{2}\right]-\operatorname {E} [X]^{2}\end{aligned}}}$

Variance is invariant with respect to changes in a location parameter:

3) $\operatorname {Var} (X+a)=\operatorname {Var} (X).$

If all values are scaled by a constant, the variance is scaled by the square of that constant:

4) $\operatorname {Var} (aX)=a^{2}\operatorname {Var} (X).$


The variance of a sum of two random variables is given by:

5) $\operatorname {Var} (aX+bY)=a^{2}\operatorname {Var} (X)+b^{2}\operatorname {Var} (Y)+2ab\,\operatorname {Cov} (X,Y)$





6) In general, for the sum of ${\displaystyle N}$ random variables $\{X_{1},\dots ,X_{N}\}$, the variance becomes:
$\operatorname {Var} \left(\sum _{i=1}^{N}X_{i}\right)=\sum _{i,j=1}^{N}\operatorname {Cov} (X_{i},X_{j})=\sum _{i=1}^{N}\operatorname {Var} (X_{i})+\sum _{i\neq j}\operatorname {Cov} (X_{i},X_{j})$


7) These results lead to the variance of a linear combination as:


 ${\begin{aligned}\operatorname {Var} \left(\sum _{i=1}^{N}a_{i}X_{i}\right)&=\sum _{i,j=1}^{N}a_{i}a_{j}\operatorname {Cov} (X_{i},X_{j})\\&=\sum _{i=1}^{N}a_{i}^{2}\operatorname {Var} (X_{i})+\sum _{i\not =j}a_{i}a_{j}\operatorname {Cov} (X_{i},X_{j})\\&=\sum _{i=1}^{N}a_{i}^{2}\operatorname {Var} (X_{i})+2\sum _{1\leq i<j\leq N}a_{i}a_{j}\operatorname {Cov} (X_{i},X_{j}).\end{aligned}}$
 
 
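8) Sum of uncorrelated variables (Bienaymé formula). If $X$ and $Y$ are uncorrelated, then $\operatorname {E} [XY]=\operatorname {E} [X]\operatorname {E} [Y]$, the cross terms cancel, and the variances simply add: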

${\displaystyle {\begin{aligned}\operatorname {Var} (X+Y)&=\operatorname {E} \left[X^{2}\right]+2\operatorname {E} [XY]+\operatorname {E} \left[Y^{2}\right]-\left(\operatorname {E} [X]^{2}+2\operatorname {E} [X]\operatorname {E} [Y]+\operatorname {E} [Y]^{2}\right)\\[5pt]&=\operatorname {E} \left[X^{2}\right]+\operatorname {E} \left[Y^{2}\right]-\operatorname {E} [X]^{2}-\operatorname {E} [Y]^{2}\\[5pt]&=\operatorname {Var} (X)+\operatorname {Var} (Y).\end{aligned}}}$


The same holds for any number of pairwise uncorrelated variables:

${\displaystyle \operatorname {Var} \left(\sum _{i=1}^{n}X_{i}\right)=\sum _{i=1}^{n}\operatorname {Var} (X_{i}).}$

Refs: [1](https://en.wikipedia.org/wiki/Variance#Sum_of_uncorrelated_variables_(Bienaym%C3%A9_formula))
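The identities above can be checked numerically. The following is a minimal sketch, not part of the original notebook: the seed, the sample size, the mixing matrix and the coefficients `a`, `b` and `coeffs` are arbitrary illustrative choices. It uses the $1/n$ normalisation throughout (`np.var` and `np.cov(..., ddof=0)`), under which properties 1 to 5 and 7 hold for the sample moments up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed, for reproducibility
n = 100_000

# Three correlated variables: independent normals mixed by an arbitrary matrix.
mixing = rng.normal(size=(3, 3))
samples = rng.normal(size=(n, 3)) @ mixing  # one column per X_i
x, y = samples[:, 0], samples[:, 1]
a, b = 3.0, -2.0                            # illustrative constants
coeffs = np.array([1.0, -2.0, 0.5])         # illustrative coefficients a_i


def cov(u, v):
    """Sample covariance with the 1/n normalisation used by np.var."""
    return np.mean((u - u.mean()) * (v - v.mean()))


# Covariance matrix C[i, j] = Cov(X_i, X_j), also normalised by 1/n.
C = np.cov(samples, rowvar=False, ddof=0)

# Each entry maps a property to (left-hand side, right-hand side).
checks = {
    "1) Var(X) = Cov(X, X)": (np.var(x), cov(x, x)),
    "2) Var(X) = E[X^2] - E[X]^2": (np.var(x), np.mean(x**2) - np.mean(x) ** 2),
    "3) Var(X + a) = Var(X)": (np.var(x + a), np.var(x)),
    "4) Var(aX) = a^2 Var(X)": (np.var(a * x), a**2 * np.var(x)),
    "5) Var(aX + bY)": (
        np.var(a * x + b * y),
        a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * cov(x, y),
    ),
    "7) Var(sum_i a_i X_i)": (np.var(samples @ coeffs), coeffs @ C @ coeffs),
}

for name, (lhs, rhs) in checks.items():
    print(f"{name}: lhs={lhs:.6f}  rhs={rhs:.6f}")
```

Property 7 is evaluated as the quadratic form `coeffs @ C @ coeffs`, which is the double sum $\sum_{i,j} a_i a_j \operatorname{Cov}(X_i, X_j)$ written in matrix notation.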

In the following, we create 50 normal distributions (`number_of_distributions=50`), each containing 1000 samples drawn from a standard normal distribution. We add them together element-wise, one at a time, and plot a histogram of the running sum after every tenth addition. The sum is again normally distributed with $\mu=0$, but its spread grows: the samples are independent, so by properties 6 and 8 the variance of a sum of $k$ of them is $k$ and the standard deviation is $\sqrt{k}$. Note that the code reports the standard deviation (`np.std`), not the variance.

In [9]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt

x = []      # all distributions drawn so far
stds = []   # standard deviation of each individual distribution
means = []  # mean of each individual distribution

num_cols = 5
number_of_distributions = 50
step = number_of_distributions // num_cols  # plot after every `step` additions

fig, axes = plt.subplots(nrows=1, ncols=num_cols)
for i in range(number_of_distributions):
    rnd = np.random.randn(1000)
    x.append(rnd)
    means.append(np.mean(rnd))
    stds.append(np.std(rnd))
    if i % step == 0:
        # Element-wise sum of the i + 1 distributions drawn so far
        X = np.sum(x, axis=0)
        print("Adding distribution number: ", i)
        # np.std returns the standard deviation, not the variance
        print("Standard deviation is :", np.std(X))
        print("Mean is :", np.mean(X), '\n')
        axes[i // step].hist(X, 100, range=(-20, 20))

plt.show()

Adding distribution number:  0
Standard deviation is : 0.9559805129326452
Mean is : 0.017142907353271376

Adding distribution number:  10
Standard deviation is : 3.3253761215594286
Mean is : 0.13351047692513096

Adding distribution number:  20
Standard deviation is : 4.582714514935462
Mean is : 0.30497126243477995

Adding distribution number:  30
Standard deviation is : 5.659061832853905
Mean is : 0.2976153812722262

Adding distribution number:  40
Standard deviation is : 6.514475221913477
Mean is : 0.4774812504447719
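
The spread values printed above come from `np.std`, so by the Bienaymé formula they should be close to $\sqrt{k}$, where $k = i + 1$ is the number of distributions summed at loop index $i$. The short sketch below compares the reported numbers with that prediction; the dictionary simply copies the values from the output above, and the names `reported` and `rel_errors` are introduced here for illustration only.

```python
import math

# Loop index i -> spread reported in the output above (computed with np.std).
reported = {
    0: 0.9559805129326452,
    10: 3.3253761215594286,
    20: 4.582714514935462,
    30: 5.659061832853905,
    40: 6.514475221913477,
}

rel_errors = []
for i, observed in reported.items():
    k = i + 1                     # number of distributions in the sum
    predicted = math.sqrt(k)      # std of a sum of k independent N(0, 1) samples
    rel_errors.append(abs(observed - predicted) / predicted)
    print(f"k={k:2d}  sqrt(k)={predicted:.3f}  reported={observed:.3f}")
```

With only 1000 samples per distribution, a deviation of a few percent from $\sqrt{k}$ is expected sampling noise.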



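If we draw the weights $w_j$ of a neuron from a normal distribution ($\mu=0$, $\sigma^2=1$), the output of the neuron is the weighted sum of its inputs: $z = \sum_j w_j x_j+b$.
By property 7, for fixed inputs $x_j$ and independent weights, $\operatorname{Var}(z)=\sum_j x_j^{2}$ (plus the variance of $b$ if the bias is also random), so the standard deviation of $z$ grows with the number of inputs, just as it does in the histograms above. The output $z$ is therefore more and more likely to lie far from $\mu=0$.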


The histograms show that $|z|$ is quite likely to be large, in which case the output $\sigma(z)$ of the hidden neuron (here $\sigma$ denotes the sigmoid function, not the standard deviation) will be very close to either 1 or 0. The hidden neuron has then saturated: the derivative of the sigmoid function is almost 0, so gradient-based updates to its weights are tiny.
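
The saturation effect can be reproduced with a small simulation. This is a sketch under stated assumptions rather than the notebook's own experiment: a single sigmoid neuron with a hypothetical `n_in = 500` inputs that are all equal to 1, with weights and bias drawn from a standard normal distribution, repeated over many independent draws.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)  # arbitrary fixed seed
n_in = 500                      # assumed number of inputs
n_trials = 5_000                # independent initialisations of the neuron
x = np.ones(n_in)               # assumption: every input equals 1

w = rng.normal(0.0, 1.0, size=(n_trials, n_in))  # weights ~ N(0, 1)
b = rng.normal(0.0, 1.0, size=n_trials)          # bias ~ N(0, 1)
z = w @ x + b                                    # one pre-activation per trial

activation = sigmoid(z)
derivative = activation * (1.0 - activation)     # sigmoid'(z)

z_std = z.std()
saturated_fraction = np.mean(derivative < 0.01)
print("std of z:", z_std, "(theory: sqrt(n_in + 1) =", np.sqrt(n_in + 1), ")")
print("fraction of trials with sigmoid'(z) < 0.01:", saturated_fraction)
```

The threshold `0.01` is an arbitrary cut-off for "almost 0"; for comparison, the largest value the sigmoid derivative can take is 0.25, at $z=0$.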

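Suppose we have a neuron with $n_{\rm in}$ input weights. We then initialize those weights as Gaussian random variables with $\mu=0$ and $\sigma= 1/\sqrt{n_{\rm in}}$. By property 4 this scales the variance of each weight by $1/n_{\rm in}$, so for inputs of order 1 the variance of the weighted sum stays close to 1 regardless of $n_{\rm in}$.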
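
A quick check of this scaling, as a hedged sketch: the helper `preactivation_std` is a hypothetical name introduced here, the inputs are assumed to be all ones, and the bias is set to zero to isolate the effect of the weights.

```python
import numpy as np


def preactivation_std(n_in, weight_std, n_trials=5_000, seed=0):
    """Empirical std of z = sum_j w_j x_j for all-ones inputs and zero bias."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, weight_std, size=(n_trials, n_in))
    z = w @ np.ones(n_in)
    return z.std()


results = {}
for n_in in (10, 100, 1000):
    unscaled = preactivation_std(n_in, weight_std=1.0)
    scaled = preactivation_std(n_in, weight_std=1.0 / np.sqrt(n_in))
    results[n_in] = (unscaled, scaled)
    print(f"n_in={n_in:4d}  std(z), sigma=1: {unscaled:6.2f}   "
          f"std(z), sigma=1/sqrt(n_in): {scaled:5.2f}")
```

With $\sigma=1$ the spread of $z$ grows like $\sqrt{n_{\rm in}}$, while with $\sigma=1/\sqrt{n_{\rm in}}$ it stays near 1 for every layer width tried.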

Activation function: $\tanh$

coefficient=0.01


<img src='images/tanh_coefficient_0.01.svg' />

Activation function: $\tanh$

coefficient=1.0

<img src='images/tanh_coefficient_1.0.svg' />

## Xavier initialization


Activation function: $\tanh$

coefficient=$1/\sqrt{n_{\rm in}}$ (the weights of each layer are divided by the square root of that layer's input size)

<img src='images/tanh_sqrt(input_size).svg' />
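
The three $\tanh$ experiments can be approximated with a small forward-pass simulation. This is a sketch under assumptions, not the code that produced the figures: the depth (10 layers), width (500 units), batch size and the helper name `tanh_activation_stds` are all illustrative choices, and the network has no biases.

```python
import numpy as np


def tanh_activation_stds(coefficient, n_layers=10, width=500, batch=1000, seed=0):
    """Std of the activations after each layer of a bias-free tanh network."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))          # standard normal input batch
    stds = []
    for _ in range(n_layers):
        w = rng.normal(size=(width, width)) * coefficient
        h = np.tanh(h @ w)
        stds.append(h.std())
    return stds


width = 500
results = {
    "coefficient=0.01": tanh_activation_stds(0.01, width=width),
    "coefficient=1.0": tanh_activation_stds(1.0, width=width),
    "coefficient=1/sqrt(n_in)": tanh_activation_stds(1.0 / np.sqrt(width), width=width),
}

for name, stds in results.items():
    print(name, "->", " ".join(f"{s:.4f}" for s in stds))
```

In this setup the small coefficient drives the activations towards 0 layer by layer, the coefficient of 1.0 pushes them to the saturated values near $\pm 1$, and the $1/\sqrt{n_{\rm in}}$ scaling keeps them in an intermediate range.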

## He-Kaiming

Refs: [1](https://arxiv.org/abs/1502.01852)


Activation function: ReLU

coefficient=$\sqrt{2/n_{\rm in}}$ (the weights of each layer are divided by $\sqrt{n_{\rm in}/2}$, the square root of half that layer's input size)

<img src='images/relu_sqrt(input_size_divide_by2).svg' />
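
He et al. scale the weights by $\sqrt{2/n_{\rm in}}$ because a ReLU zeroes roughly half of its inputs, which halves the second moment of the signal at every layer. The sketch below compares that scaling with the $1/\sqrt{n_{\rm in}}$ scaling in a bias-free ReLU network; the depth, width, batch size and the helper name `relu_activation_rms` are assumptions made for illustration.

```python
import numpy as np


def relu_activation_rms(coefficient, n_layers=10, width=500, batch=1000, seed=0):
    """Root-mean-square activation after each layer of a bias-free ReLU network."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))          # standard normal input batch
    rms = []
    for _ in range(n_layers):
        w = rng.normal(size=(width, width)) * coefficient
        h = np.maximum(h @ w, 0.0)               # ReLU
        rms.append(np.sqrt(np.mean(h**2)))
    return rms


width = 500
he_rms = relu_activation_rms(np.sqrt(2.0 / width), width=width)
xavier_rms = relu_activation_rms(1.0 / np.sqrt(width), width=width)

print("sqrt(2/n_in):", " ".join(f"{r:.3f}" for r in he_rms))
print("1/sqrt(n_in):", " ".join(f"{r:.3f}" for r in xavier_rms))
```

With the $\sqrt{2/n_{\rm in}}$ coefficient the signal magnitude stays roughly constant across layers, whereas with $1/\sqrt{n_{\rm in}}$ it shrinks by about a factor of $\sqrt{2}$ per layer.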

Refs: [1](http://neuralnetworksanddeeplearning.com/chap3.html#weight_initialization), [2](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi), [3](https://arxiv.org/abs/1312.6120), [4](https://arxiv.org/abs/1412.6558), [5](https://arxiv.org/abs/1502.01852), [6](https://arxiv.org/abs/1511.06856), [7](https://arxiv.org/abs/1511.06422)