## Weight Initialization

### Why not identical initialization?

Suppose every node within layer $l$ is initialized with the same parameters:

$ \forall g, h: $

$ \textbf{w}^{[l]}_{g, :} = \textbf{w}^{[l]}_{h, :} $

$ b^{[l]}_g = b^{[l]}_h $

This implies that each node within layer $l$ activates identically:

$ \forall g, h: $

$ z^{[l]}_{s, g} = \textbf{a}^{[l-1]}_{s, :} \cdot \textbf{w}^{[l]T}_{g, :} + b_g^{[l]} $

$ z^{[l]}_{s, h} = \textbf{a}^{[l-1]}_{s, :} \cdot \textbf{w}^{[l]T}_{h, :} + b_h^{[l]} $

$ z^{[l]}_{s, g} = z^{[l]}_{s, h} $

$ a^{[l]}_{s, g} = f(z^{[l]}_{s, g}) = f(z^{[l]}_{s, h}) = a^{[l]}_{s, h} $
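
The equalities above can be checked numerically. The following is a minimal sketch in plain Python, assuming a `tanh` activation and made-up input values; the helper name `forward` is hypothetical and only serves this illustration.

```python
import math

def forward(a_prev, W, b, f=math.tanh):
    """Activations of one dense layer for a single sample.

    a_prev: activations of layer l-1
    W:      weight rows, W[g][n] connects input n to node g
    b:      biases, one per node
    """
    z = [sum(a_n * w_gn for a_n, w_gn in zip(a_prev, row)) + b_g
         for row, b_g in zip(W, b)]
    return [f(z_g) for z_g in z]

a_prev = [0.5, -1.0, 2.0]                  # illustrative input activations
shared_row = [0.1, 0.2, -0.3]              # every node gets this same weight row
W = [list(shared_row) for _ in range(4)]   # 4 identically initialized nodes
b = [0.05] * 4                             # identical biases

a = forward(a_prev, W, b)
print(a)  # all four entries are the same number
```

Every node computes the same weighted sum, so the layer outputs four copies of one value regardless of the input.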

It follows that the weight gradients of all nodes within layer $l$ are identical:

$$ 
\begin{align}
    \forall g, h: & \\
    \frac {\partial C_s} {\partial w^{[l]}_{g, n}} 
    & = \frac {\partial C_s} {\partial a^{[l]}_{s, g}} 
    \cdot f'(z^{[l]}_{s, g})
    \cdot a^{[l-1]}_{s, n} 
    \\
    \frac {\partial C_s} {\partial w^{[l]}_{h, n}} 
    & = \frac {\partial C_s} {\partial a^{[l]}_{s, h}} 
    \cdot f'(z^{[l]}_{s, h})
    \cdot a^{[l-1]}_{s, n} 
    \\
    \frac {\partial C_s} {\partial w^{[l]}_{g, n}} 
    & = \frac {\partial C_s} {\partial w^{[l]}_{h, n}}  
\end{align}
$$

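Note that the last equality also requires the upstream gradients $\partial C_s / \partial a^{[l]}_{s, g}$ and $\partial C_s / \partial a^{[l]}_{s, h}$ to be identical.
This is true if the upstream layer $l+1$ is also identically initialized such that nodes $g$ and $h$ feed every node of layer $l+1$ through equal weights, for example when all weights of that layer share one constant.
Because identical gradients produce identical updates, the nodes remain identical after every training step, so the layer never learns more than a single feature.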
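
The role of the upstream gradient can be illustrated with a small numeric sketch. It assumes a network with one `tanh` hidden layer, a single linear output node and a squared-error cost; none of these choices come from the text, and the helper `hidden_weight_grads` is hypothetical.

```python
import math

def hidden_weight_grads(x, y, W, b, v, c):
    """Gradients dC/dW[g][n] of the hidden layer for one sample.

    Hidden layer: a_g = tanh(W[g] . x + b[g])
    Output:       y_hat = v . a + c
    Cost:         C = 0.5 * (y_hat - y)^2
    """
    z = [sum(w * x_n for w, x_n in zip(row, x)) + b_g for row, b_g in zip(W, b)]
    a = [math.tanh(z_g) for z_g in z]
    y_hat = sum(v_g * a_g for v_g, a_g in zip(v, a)) + c
    dC_dyhat = y_hat - y
    grads = []
    for g in range(len(W)):
        dC_da = dC_dyhat * v[g]     # upstream gradient for node g
        df = 1.0 - a[g] ** 2        # tanh'(z_g)
        grads.append([dC_da * df * x_n for x_n in x])
    return grads

x, y = [0.5, -1.0, 2.0], 1.0                  # illustrative sample
W = [[0.1, 0.2, -0.3] for _ in range(3)]      # identical hidden rows
b = [0.0] * 3

# Output weights all equal: every hidden node receives the same gradient.
grads_same = hidden_weight_grads(x, y, W, b, v=[0.4, 0.4, 0.4], c=0.0)
# Output weights differ: the upstream gradients differ, and so do the rows.
grads_diff = hidden_weight_grads(x, y, W, b, v=[0.4, -0.2, 0.7], c=0.0)

print(grads_same[0] == grads_same[1] == grads_same[2])
print(grads_diff[0] == grads_diff[1])
```

With equal output weights the three gradient rows coincide, so a gradient step leaves the hidden nodes identical. Making the output weights differ is enough to break the symmetry in this sketch, even though the hidden rows start out equal.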

### Glorot Initialization
Source: Understanding the difficulty of training deep feedforward neural networks (Glorot, X., and Bengio, Y., 2010).
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

$$ W_i \sim U\left( -\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}} \right) $$
$$ n_{in} (n_{out}) = \text{count of input (output) connections} $$
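
As a rough illustration of the formula, the sketch below samples a weight matrix with Python's standard `random` module. The helper name `glorot_uniform_sample`, the layer sizes and the seed are assumptions made for this example; it is not the TensorFlow implementation. A uniform distribution on $[-a, a]$ has variance $a^2 / 3$, so the sampled weights should have variance close to $2 / (n_{in} + n_{out})$.

```python
import math
import random

def glorot_uniform_sample(n_in, n_out, rng):
    """Sample an n_out x n_in weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]

rng = random.Random(0)          # fixed seed, chosen arbitrarily
n_in, n_out = 300, 200          # illustrative layer sizes
W = glorot_uniform_sample(n_in, n_out, rng)

limit = math.sqrt(6.0 / (n_in + n_out))
flat = [w for row in W for w in row]
sample_var = sum(w * w for w in flat) / len(flat)   # mean is ~0, so this estimates the variance
expected_var = 2.0 / (n_in + n_out)                 # limit**2 / 3

print(limit, sample_var, expected_var)
```

Each row of `W` holds the incoming weights of one node, matching the $\textbf{w}^{[l]}_{g, :}$ convention used earlier.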

In [1]:
from tensorflow.python.ops.init_ops import GlorotUniform
?GlorotUniform

Init signature: GlorotUniform(seed=None, dtype=tf.float32)
Docstring:
The Glorot uniform initializer, also called Xavier uniform initializer.

It draws samples from a uniform distribution within [-limit, limit]
where `limit` is `sqrt(6 / (fan_in + fan_out))`
where `fan_in` is the number of input units in the weight tensor
and `fan_out` is the number of output units in the weight tensor.

Args:
  seed: A Python integer. Used to create random seeds. See
    `tf.compat.v1.set_random_seed` for behavior.
  dtype: Default data type, used if no `dtype` argument is provided when
    calling the initializer. Only floating point types are supported.
References:
    [Glorot et al., 2010](http://proceedings.mlr.press/v9/glorot10a.html)
    ([pdf](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf))

### He Initialization
Source: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (He, K., et al., 2015).
https://arxiv.org/abs/1502.01852

$$ W_i \sim N\left( 0, \sigma^2 \right), \quad \sigma = \sqrt{\frac{2}{n_{in}}} $$
$$ n_{in} = \text{count of input connections} $$
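
The following sketch samples weights with standard deviation $\sqrt{2 / n_{in}}$ using only the Python standard library. The helper name `he_normal_sample`, the layer sizes and the seed are assumptions for illustration. Note that it draws from a plain normal distribution, whereas TensorFlow's `he_normal` draws from a truncated normal according to its docstring, so this is not a reimplementation of that function.

```python
import math
import random

def he_normal_sample(n_in, n_out, rng):
    """Sample an n_out x n_in weight matrix from N(0, 2 / n_in)."""
    stddev = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, stddev) for _ in range(n_in)]
            for _ in range(n_out)]

rng = random.Random(0)          # fixed seed, chosen arbitrarily
n_in, n_out = 300, 200          # illustrative layer sizes
W = he_normal_sample(n_in, n_out, rng)

flat = [w for row in W for w in row]
sample_mean = sum(flat) / len(flat)
sample_std = math.sqrt(sum((w - sample_mean) ** 2 for w in flat) / len(flat))
expected_std = math.sqrt(2.0 / n_in)

print(sample_mean, sample_std, expected_std)
```

Only $n_{in}$ enters the scale here, in contrast to the Glorot formula, which also depends on $n_{out}$.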

In [2]:
from tensorflow.python.ops.init_ops import he_normal
?he_normal

Signature: he_normal(seed=None)
Docstring:
He normal initializer.

It draws samples from a truncated normal distribution centered on 0
with standard deviation (after truncation) given by
`stddev = sqrt(2 / fan_in)` where `fan_in` is the number of
input units in the weight tensor.

Arguments:
    seed: A Python integer. Used to seed the random generator.

Returns:
    An initializer.

References:
    [He et al., 2015]
    (https://www.cv-foundation.org/openaccess/content_iccv_2015/html/He_Delving_Deep_into_ICCV_2015_paper.html)
    # pylint: disable=line-too-long
    ([pdf](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf))
File:      /opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/init_ops.py
Type:      function
