<h1>Initialization</h1>

# 0. Xavier Initialization

Reference:

[1] Understanding the difficulty of training deep feedforward neural networks. Xavier Glorot, Yoshua Bengio.

## 0.0 Mean and Variance of Uniform distribution

Assume $X \sim U[a, b]$, with density 

> $p(x) = \frac{1}{b-a}, \quad a \le x \le b$

> $\mathbb{E}[x] = \int_{a}^b \frac{1}{b-a} x dx$

> $= \frac{1}{b-a} \int_{a}^b x dx$

> $= \frac{1}{b-a} \frac{1}{2} x^2 \lvert_{a}^{b}$

> $= \frac{1}{b-a} \frac{1}{2} (b^2 - a^2)$

> $= \frac{b+a}{2}$

> $\mathbb{E}[x^2] = \int_{a}^{b} \frac{1}{b-a} x^2 dx$

> $= \frac{1}{b-a} \frac{1}{3} x^3\lvert_{a}^{b} $

> $= \frac{1}{b-a} \frac{1}{3} (b^3 - a^3)$

> $= \frac{b^2 + ab + a^2}{3}$

> $Var[x] = \mathbb{E}[x^2] - \mathbb{E}^2[x]$

> $= \frac{b^2 + ab + a^2}{3} - (\frac{a+b}{2})^2$

> $= \frac{(b-a)^2}{12}$
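The closed forms above can be sanity-checked by sampling. This is only a minimal sketch: the bounds, the seed, and the helper names `uniform_moments` and `sample_moments` are illustrative choices, not part of the derivation.

```python
import random

def uniform_moments(a, b):
    """Closed-form mean and variance of U[a, b]."""
    return (a + b) / 2, (b - a) ** 2 / 12

def sample_moments(samples):
    """Empirical mean and (population) variance of a list of numbers."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

rng = random.Random(0)   # fixed seed so the run is repeatable
a, b = -2.0, 5.0         # arbitrary illustrative bounds
samples = [rng.uniform(a, b) for _ in range(100_000)]

exact_mean, exact_var = uniform_moments(a, b)
est_mean, est_var = sample_moments(samples)
print(f"mean: exact {exact_mean:.4f}, sampled {est_mean:.4f}")
print(f"var:  exact {exact_var:.4f}, sampled {est_var:.4f}")
```

The sampled values should agree with $\frac{a+b}{2}$ and $\frac{(b-a)^2}{12}$ up to sampling noise.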

## 0.1 Uniform Initialization
Initialize the biases to 0 and draw the weights $W_{ij}$ of each layer from a uniform distribution. 

Assume the $w_i$ are drawn from a zero-mean uniform distribution, the $x_i$ have zero mean, and $x_i$ and $w_i$ are independent, 

> $s = \sum_{i=1}^n w_i x_i$

> $Var[s] = Var [\sum_{i=1}^n w_i x_i]$

> $= \sum_{i=1}^n Var[w_i x_i]$

> $= \sum_{i=1}^n Var[w_i] Var[x_i]$

If we want $Var[s] = Var[x_i]$ (with every $x_i$ sharing one variance and every $w_i$ sharing one variance), we need $\sum_{i=1}^n Var[w_i] = 1$, i.e. $Var[w_i] = \frac{1}{n}$.

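The commonly used heuristic discussed in [1] instead draws 

> $W_{ij} \sim U[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]$.

By the variance formula from 0.0, 

> $Var[W] = \frac{(\frac{2}{\sqrt{n}})^2}{12} = \frac{1}{3n}$

so $n Var[W] = \frac{1}{3}$, which is smaller than the target value of 1.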
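A small simulation can illustrate the gap between the two choices of $Var[w_i]$. It is a sketch under stated assumptions: inputs $x_i \sim N(0, 1)$, $n = 100$, and a hypothetical helper `preactivation_variance`. The half-width $\sqrt{3/n}$ is not from the text; it follows from the formula in 0.0 as the uniform range whose variance is exactly $\frac{1}{n}$.

```python
import math
import random

def preactivation_variance(n, half_width, trials, rng):
    """Empirical variance of s = sum_i w_i x_i with
    w_i ~ U[-half_width, half_width] and x_i ~ N(0, 1), all independent."""
    values = []
    for _ in range(trials):
        s = sum(rng.uniform(-half_width, half_width) * rng.gauss(0.0, 1.0)
                for _ in range(n))
        values.append(s)
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / trials

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 100                  # illustrative fan-in

# Heuristic range: Var[w] = 1/(3n), so Var[s] should be about Var[x] / 3.
heuristic = preactivation_variance(n, 1.0 / math.sqrt(n), 5000, rng)
# Range with Var[w] = 1/n, so Var[s] should be about Var[x].
matched = preactivation_variance(n, math.sqrt(3.0 / n), 5000, rng)

print(f"Var[s] with U[-1/sqrt(n), 1/sqrt(n)]:   {heuristic:.3f}")
print(f"Var[s] with U[-sqrt(3/n), sqrt(3/n)]: {matched:.3f}")
```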


## 0.2 Xavier Initialization
Consider a dense ANN using a symmetric activation function $f$ with unit derivative at 0 
($f^{'}(0)=1$).

Let $s^{i}$ be the input of layer $i$ and $z^{i}$ be the output of layer $i$, with $z^{0} = X$.

$W^{i}$ and $b^{i}$ are the weights and biases connecting the output of layer $i-1$ to the input of layer $i$, 

> $s^{i} = z^{i-1}W^{i} + b^{i}$

> $z^{i} = f(s^{i})$

Initialize $b^{i}=0$, 

> $z^{i} = f(z^{i-1}W^i)$

> $= f(f(z^{i-2}W^{i-1})W^i)$

> $= f(f(\cdots f(XW^1) \cdots W^{i-1})W^i)$

Assume that we are in the linear regime, 

> $f^{'}(s_{k}^{i}) \approx 1$

> $Var[z^{i}] = Var[X]\prod_{j=1}^{i}n^j Var[W^j]$

$n^j$ is the size of layer $j-1$ (the fan-in of $W^j$), where layer $0$ is the input layer.

From a forward-propagation point of view, to keep information flowing, 

let the variance of the output in each layer be consistent,

> $\forall (i, j), Var[z^i] = Var[z^j]$

> $\Rightarrow$

> + $n^{i}Var[W^i] = 1$
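The forward condition can be illustrated with a toy network. This is a sketch, assuming an identity activation as a stand-in for the linear regime, equal layer widths, $N(0, 1)$ inputs, and hypothetical helpers (`uniform_weights`, `matmul`, `variance`, `output_variance`); the sizes are arbitrary.

```python
import math
import random

def uniform_weights(fan_in, fan_out, limit, rng):
    """fan_in x fan_out matrix with entries drawn from U[-limit, limit]."""
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def matmul(X, W):
    """Multiply a batch of row vectors X (batch x fan_in) by W (fan_in x fan_out)."""
    columns = list(zip(*W))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in X]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def output_variance(limit, width, depth, batch, rng):
    """Variance of the last layer output of a linear network (f(s) = s,
    zero biases) fed with N(0, 1) inputs."""
    z = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    for _ in range(depth):
        z = matmul(z, uniform_weights(width, width, limit, rng))
    return variance([v for row in z for v in row])

rng = random.Random(0)             # fixed seed so the run is repeatable
width, depth, batch = 100, 4, 32   # illustrative sizes, not from the text

# n Var[W] = 1: the variance should stay close to Var[X] = 1.
kept = output_variance(math.sqrt(3.0 / width), width, depth, batch, rng)
# n Var[W] = 1/3: the variance should shrink by about a factor 3 per layer.
shrunk = output_variance(1.0 / math.sqrt(width), width, depth, batch, rng)

print(f"n Var[W] = 1:   Var[z^{depth}] = {kept:.4f}")
print(f"n Var[W] = 1/3: Var[z^{depth}] = {shrunk:.4f}")
```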

From a back-propagation view, to keep error flowing, 

let the variance of the gradient in each layer be consistent, 

> $\frac{\partial{Cost}}{\partial{s^{i}}} = 
\frac{\partial{Cost}}{\partial{s^{i+1}}} 
\frac{\partial{s^{i+1}}}{\partial{z^{i}}} 
\frac{\partial{z^{i}}}{\partial{s^{i}}} $

> $= \frac{\partial{Cost}}{\partial{s^{i+1}}} (W^{i+1})^{\top} f^{'}(s^i) $

> $\approx \frac{\partial{Cost}}{\partial{s^{i+1}}} (W^{i+1})^{\top}$

> $= (\frac{\partial{Cost}}{\partial{s^{i+2}}} (W^{i+2})^{\top}) (W^{i+1})^{\top}$

> $= \frac{\partial{Cost}}{\partial{s^{L}}} (W^{L})^{\top} (W^{L-1})^{\top} \cdots (W^{i+1})^{\top}$

> $Var[\frac{\partial{Cost}}{\partial{s_k^{i}}}] = 
Var[\frac{\partial{Cost}}{\partial{s_k^{L}}}]
\prod_{j=i+1}^{L} n^{j+1} Var[W^j]
$

Here $n^{j+1}$ is the size of layer $j$ (the fan-out of $W^j$), because each gradient component sums over that many terms.

> $\forall(i, j), Var[\frac{\partial{Cost}}{\partial{s_k^{i}}}] = Var[\frac{\partial{Cost}}{\partial{s_k^{j}}}]$

> $\Rightarrow$

> + $n^{i+1}Var[W^i] = 1$

Hence we want both 

> $n^{i}Var[W^i] = 1$

> $n^{i+1}Var[W^i] = 1$

These hold together only when $n^i = n^{i+1}$, so as a compromise between the two, let 

> $Var[W^i] = \frac{2}{n^i + n^{i+1}}$
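To see what this choice of $Var[W^i]$ does to the two conditions, the per-layer factors $n^{i}Var[W^i]$ and $n^{i+1}Var[W^i]$ can be computed directly. A minimal sketch: the function name `variance_gains` and the layer sizes are illustrative only.

```python
def variance_gains(fan_in, fan_out):
    """Forward factor fan_in * Var[W] and backward factor fan_out * Var[W]
    when Var[W] = 2 / (fan_in + fan_out)."""
    var_w = 2.0 / (fan_in + fan_out)
    return fan_in * var_w, fan_out * var_w

for fan_in, fan_out in [(256, 256), (300, 200), (100, 400)]:
    forward, backward = variance_gains(fan_in, fan_out)
    print(fan_in, fan_out, round(forward, 3), round(backward, 3))
```

Both factors equal 1 only for equal layer sizes; otherwise one is above 1 and the other below, and their average is always 1.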

Recall from 0.1 that if $W \sim U[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]$, 

> $Var[W] = \frac{1}{3n}$

To obtain the target variance from a uniform distribution, equate the two expressions and solve for $n$, 

> $\frac{2}{n^i + n^{i+1}} = \frac{1}{3n}$

> $\Rightarrow$

> $n = \frac{n^i + n^{i+1}}{6}$

+ Conclusion, 
initialize the weights of each layer as 

> $W^{i} \sim U[- \frac{1}{\sqrt{ \frac{n^i + n^{i+1}}{6} }}, 
\frac{1}{\sqrt{ \frac{n^i + n^{i+1}}{6} }}]$

which simplifies to 

> $W^{i} \sim U[- \frac{\sqrt{6}}{\sqrt{n^i + n^{i+1}}}, 
\frac{\sqrt{6}}{\sqrt{n^i + n^{i+1}}}]$
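The resulting rule could be sketched in plain Python as follows. The names `xavier_limit` and `xavier_uniform` and the layer sizes are hypothetical, chosen for illustration; `fan_in` and `fan_out` play the roles of $n^i$ and $n^{i+1}$.

```python
import math
import random

def xavier_limit(fan_in, fan_out):
    """Half-width sqrt(6) / sqrt(fan_in + fan_out) of the uniform range."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def xavier_uniform(fan_in, fan_out, rng):
    """fan_in x fan_out weight matrix drawn from U[-limit, limit]."""
    limit = xavier_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)        # fixed seed so the run is repeatable
fan_in, fan_out = 300, 200    # illustrative layer sizes
W = xavier_uniform(fan_in, fan_out, rng)

flat = [w for row in W for w in row]
mean_w = sum(flat) / len(flat)
empirical_var = sum((w - mean_w) ** 2 for w in flat) / len(flat)
target_var = 2.0 / (fan_in + fan_out)
limit = xavier_limit(fan_in, fan_out)

print(f"limit = {limit:.4f}")
print(f"Var[W]: target {target_var:.5f}, sampled {empirical_var:.5f}")
```

The sampled variance should be close to $\frac{2}{n^i + n^{i+1}}$, and every weight lies inside the range.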

# 1. He Initialization

Reference:

[1] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Kaiming He et al.

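Using the notation of 0.2, assume that the elements of $W^i$ are mutually independent and share the same zero-mean distribution, 
that the elements of $z^{i-1}$ are also mutually independent and share the same distribution, 
and that $z^{i-1}$ and $W^i$ are independent of each other.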

> $z^i = f(s^i)$

> $s^i = z^{i-1}W^i + b^i$

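> $Var[s^i] = n^i Var[z^{i-1} w^i + b^i]$

Initialize $b^i$ to zero. Since $w^i$ has zero mean and is independent of $z^{i-1}$,

> $Var[s^i] = n^i Var[w^i] \mathbb{E}[(z^{i-1})^2]$

The second moment $\mathbb{E}[(z^{i-1})^2]$ appears here rather than $Var[z^{i-1}]$, because $z^{i-1}$ need not have zero mean.

> $\mathbb{E}[(z^{i-1})^2] = \mathbb{E}[f(s^{i-1})^2]$

If the activation function is ReLU, $f(x) = max(0, x)$, and $s^{i-1}$ is distributed symmetrically around zero, only the positive half of $s^{i-1}$ passes through, so 
$\mathbb{E}[f(s^{i-1})^2] = \frac{1}{2}Var[s^{i-1}]$,

> $Var[s^i] = n^i (\frac{1}{2} Var[s^{i-1}]) Var[w^i] 
= Var[s^{i-1}] (\frac{1}{2} n^i Var[w^i])$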
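The factor $\frac{1}{2}$ for ReLU can be checked numerically. A sketch, assuming $s \sim N(0, \sigma^2)$ with an arbitrary $\sigma$: the quantity that is halved is the second moment $\mathbb{E}[f(s)^2]$, while the centered variance of $f(s)$ is smaller because $f(s)$ has a positive mean.

```python
import random

def relu(x):
    return max(0.0, x)

rng = random.Random(0)   # fixed seed so the run is repeatable
sigma = 1.5              # arbitrary illustrative standard deviation
s = [rng.gauss(0.0, sigma) for _ in range(200_000)]

mean_s = sum(s) / len(s)
var_s = sum((v - mean_s) ** 2 for v in s) / len(s)

z = [relu(v) for v in s]
mean_z = sum(z) / len(z)
second_moment_z = sum(v * v for v in z) / len(z)   # E[f(s)^2]
var_z = second_moment_z - mean_z ** 2              # Var[f(s)]

print(f"E[f(s)^2] / Var[s] = {second_moment_z / var_s:.3f}")
print(f"Var[f(s)] / Var[s] = {var_z / var_s:.3f}")
```

For a Gaussian $s$ the first ratio should be close to $\frac{1}{2}$ and the second close to $\frac{1}{2} - \frac{1}{2\pi}$.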

> $\Rightarrow$

> $Var[s^L] = Var[s^1] \prod_{i=2}^{L} (\frac{1}{2} n^i Var[w^i])$

To keep $Var[s^L] = Var[s^1]$, 

> $\frac{1}{2} n^i Var[w^i] = 1$

> $Var[w^i] = \frac{2}{n^i}$

If the $w^i$ are drawn from a zero-mean Gaussian distribution, 
the variance for layer $i$ is $\frac{2}{n^i}$, i.e. the standard deviation is $\sqrt{\frac{2}{n^i}}$.
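A toy experiment can illustrate the effect of the factor 2. This is a sketch under stated assumptions: equal layer widths, $N(0, 1)$ inputs, zero biases, and hypothetical helpers (`gaussian_weights`, `he_normal`, `matmul`, `variance`, `preactivation_variances`). It compares $Var[w^i] = \frac{2}{n^i}$ against $Var[w^i] = \frac{1}{n^i}$ in a small ReLU network.

```python
import math
import random

def gaussian_weights(fan_in, fan_out, var, rng):
    """fan_in x fan_out matrix with zero-mean Gaussian entries of variance var."""
    std = math.sqrt(var)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def he_normal(fan_in, fan_out, rng):
    """Weights with Var[w] = 2 / fan_in."""
    return gaussian_weights(fan_in, fan_out, 2.0 / fan_in, rng)

def matmul(X, W):
    """Multiply a batch of row vectors X (batch x fan_in) by W (fan_in x fan_out)."""
    columns = list(zip(*W))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in X]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def preactivation_variances(make_weights, width, depth, batch, rng):
    """Return [Var[s^1], ..., Var[s^depth]] for a ReLU network with zero biases."""
    z = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    result = []
    for _ in range(depth):
        s = matmul(z, make_weights(width, width, rng))
        result.append(variance([v for row in s for v in row]))
        z = [[max(0.0, v) for v in row] for row in s]   # ReLU
    return result

rng = random.Random(0)             # fixed seed so the run is repeatable
width, depth, batch = 128, 5, 32   # illustrative sizes, not from the text

he_vars = preactivation_variances(he_normal, width, depth, batch, rng)
naive_vars = preactivation_variances(
    lambda fi, fo, r: gaussian_weights(fi, fo, 1.0 / fi, r),
    width, depth, batch, rng)

he_ratio = he_vars[-1] / he_vars[0]            # Var[s^L] / Var[s^1]
naive_ratio = naive_vars[-1] / naive_vars[0]
print(f"Var[w] = 2/n: Var[s^L] / Var[s^1] = {he_ratio:.3f}")
print(f"Var[w] = 1/n: Var[s^L] / Var[s^1] = {naive_ratio:.3f}")
```

With $Var[w^i] = \frac{2}{n^i}$ the ratio should stay of order 1, while with $Var[w^i] = \frac{1}{n^i}$ each layer roughly halves the variance.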