# Lab-09-2 Weight Initialization

- Why good initialization?
- RBM / DBN
- Xavier / He initialization
- Code: mnist_nn_xavier
- Code: mnist_nn_deep

## Need to set the initial weight values wisely

- Not all 0's
- Challenging issue
- Hinton et al. (2006), "A Fast Learning Algorithm for Deep Belief Nets": Restricted Boltzmann Machine (RBM)
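
To see why "not all 0's" matters, here is a minimal sketch assuming a tiny two-layer ReLU network. The layer sizes, input batch, and targets are made up for illustration. With every weight and bias set to zero, the hidden activations are zero, so neither weight matrix receives a gradient and only the output bias can move.

```python
import torch

torch.manual_seed(0)

# Small two-layer network; sizes are arbitrary and chosen only for illustration.
linear1 = torch.nn.Linear(4, 3)
linear2 = torch.nn.Linear(3, 2)
relu = torch.nn.ReLU()

# Set every weight and bias to zero.
for layer in (linear1, linear2):
    torch.nn.init.zeros_(layer.weight)
    torch.nn.init.zeros_(layer.bias)

x = torch.randn(8, 4)
target = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1])  # made-up labels

logits = linear2(relu(linear1(x)))
loss = torch.nn.functional.cross_entropy(logits, target)
loss.backward()

# Hidden activations are all zero, so no gradient reaches either weight matrix.
print(linear1.weight.grad.abs().sum().item())
print(linear2.weight.grad.abs().sum().item())
# Only the output bias still gets a gradient.
print(linear2.bias.grad)
```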

## Restricted Boltzmann Machine

- Restricted: no connections within a layer (nodes inside the same layer are not connected to each other; an RBM is fully connected only to the nodes of the other layer.)
- KL divergence: compare the actual input to its reconstruction

With such a machine, there is a forward pass that produces y from an input x, and a backward pass that reconstructs x from y, so it can work like encoding/decoding.
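
The forward/backward idea can be sketched as follows. This is only an illustration, assuming binary units with sigmoid activations and one weight matrix shared by both directions. The names `forward`, `backward`, and the sizes are hypothetical, and binary cross-entropy is used as a simple stand-in for the comparison between input and reconstruction.

```python
import torch

torch.manual_seed(0)

n_visible, n_hidden = 6, 3                    # hypothetical sizes
W = torch.randn(n_visible, n_hidden) * 0.1    # one weight matrix, used in both directions
b_hidden = torch.zeros(n_hidden)
b_visible = torch.zeros(n_visible)

def forward(x):
    """Encode: visible x -> hidden activation probabilities y."""
    return torch.sigmoid(x @ W + b_hidden)

def backward(y):
    """Decode: hidden y -> reconstruction of the visible units."""
    return torch.sigmoid(y @ W.t() + b_visible)

x = torch.bernoulli(torch.full((4, n_visible), 0.5))  # batch of binary inputs
y = forward(x)
x_recon = backward(y)

# Compare the actual input with its reconstruction (assumed error measure).
recon_error = torch.nn.functional.binary_cross_entropy(x_recon, x)
print(y.shape, x_recon.shape, recon_error.item())
```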

## How can we use RBM to initialize weights?

- Apply the RBM idea on two adjacent layers as a <u>pre-training</u> step (Professor Hinton introduced this method, called pre-training)
- Repeat this process for every pair of adjacent layers
- This will set weights
- Example: Deep Belief Network
    - Weight initialized by RBM

## Deep Belief Network

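- Pre-training (x - h^1 - h^2 - ...)
    1. Train w with an RBM between layer 1 and layer 2 (RBM for x)
    2. Train w with an RBM between layer 2 and layer 3 (RBM for h^1)
    3. Train w with an RBM between layer 3 and layer 4 (RBM for h^2)
    4. ... continue until the last layer, and pre-training is done

- Fine-tuning
    - Once pre-training is finished, the weights of every layer have been learned with RBMs.
    - Starting from all of those weights, train the neural network: compute y, take its difference from the target G to get the loss, and backpropagate to update the network once more. This is called the fine-tuning step.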
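
The two phases above can be sketched roughly as below. This is a toy illustration, not the original DBN implementation: it assumes one-step contrastive divergence (CD-1) for each RBM, random binary data in place of real inputs, and made-up labels. `train_rbm` and `layer_sizes` are hypothetical names.

```python
import torch

torch.manual_seed(0)

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one RBM with CD-1 and return (W, hidden bias)."""
    n_visible = data.shape[1]
    W = torch.randn(n_visible, n_hidden) * 0.01
    b_h = torch.zeros(n_hidden)
    b_v = torch.zeros(n_visible)
    n = data.shape[0]
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = torch.sigmoid(data @ W + b_h)
        h_sample = torch.bernoulli(h_prob)
        # Negative phase: reconstruct the visible units, then re-encode.
        v_recon = torch.sigmoid(h_sample @ W.t() + b_v)
        h_recon = torch.sigmoid(v_recon @ W + b_h)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (data.t() @ h_prob - v_recon.t() @ h_recon) / n
        b_h += lr * (h_prob - h_recon).mean(dim=0)
        b_v += lr * (data - v_recon).mean(dim=0)
    return W, b_h

# Toy binary data standing in for x (random, for illustration only).
x = torch.bernoulli(torch.full((64, 20), 0.5))
layer_sizes = [20, 16, 8]  # x - h^1 - h^2

# Pre-training: one RBM per adjacent pair of layers.
layers = []
inputs = x
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W, b_h = train_rbm(inputs, n_out)
    linear = torch.nn.Linear(n_in, n_out)
    with torch.no_grad():
        linear.weight.copy_(W.t())   # nn.Linear stores weight as (out, in)
        linear.bias.copy_(b_h)
    layers += [linear, torch.nn.Sigmoid()]
    inputs = torch.sigmoid(inputs @ W + b_h)  # input of the next RBM

# Fine-tuning: add an output layer and train the whole stack with backprop.
model = torch.nn.Sequential(*layers, torch.nn.Linear(layer_sizes[-1], 2))
labels = torch.randint(0, 2, (64,))  # made-up targets
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), labels)
    loss.backward()
    optimizer.step()
print(loss.item())
```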

## Xavier / He initialization

- RBMs are rarely used these days

- No need to use complicated RBM for weight initializations
- Simple methods are OK
    - **Xavier initialization**: X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International conference on artificial intelligence and statistics, 2010
    - **He initialization**: K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," 2015
    
Initialization should be chosen separately according to the characteristics of each layer.

Xavier: less complicated than RBM. No pre-training/fine-tuning is needed. Here $n_{in}$ and $n_{out}$ are the numbers of inputs and outputs of the layer.
- Xavier Normal initialization
    - $W \sim N(0, Var(W))$
    - $Var(W) = \frac{2}{n_{in} + n_{out}}$
- Xavier Uniform initialization
    - $W \sim U(-\sqrt{\frac{6}{n_{in} + n_{out}}}, +\sqrt{\frac{6}{n_{in} + n_{out}}})$
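
The normal and uniform versions describe the same spread. A uniform distribution $U(-a, a)$ has variance $a^2/3$, so asking for a variance of $\frac{2}{n_{in} + n_{out}}$ gives the uniform bound:

$$\frac{a^2}{3} = \frac{2}{n_{in} + n_{out}} \quad\Rightarrow\quad a = \sqrt{\frac{6}{n_{in} + n_{out}}}$$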
    
He: a variant of Xavier. The formulas look very similar.
- He Normal initialization
    - $W \sim N(0, Var(W))$
    - $Var(W) = \frac{2}{n_{in}}$
- He Uniform initialization
    - $W \sim U(-\sqrt{\frac{6}{n_{in}}}, +\sqrt{\frac{6}{n_{in}}})$
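
As a quick check of these formulas, the sketch below computes the standard deviations and uniform bounds by hand for a 784 -> 256 layer and prints them next to statistics of tensors filled by `torch.nn.init`. It assumes He initialization corresponds to PyTorch's `kaiming_*` functions with `mode="fan_in"` and `nonlinearity="relu"`.

```python
import math
import torch

torch.manual_seed(0)

n_in, n_out = 784, 256   # example layer shape

xavier_std = math.sqrt(2.0 / (n_in + n_out))
xavier_bound = math.sqrt(6.0 / (n_in + n_out))
he_std = math.sqrt(2.0 / n_in)
he_bound = math.sqrt(6.0 / n_in)

# nn.Linear stores its weight as (n_out, n_in).
w = torch.empty(n_out, n_in)

torch.nn.init.xavier_uniform_(w)
print("Xavier uniform:", xavier_bound, w.abs().max().item())

torch.nn.init.xavier_normal_(w)
print("Xavier normal: ", xavier_std, w.std().item())

torch.nn.init.kaiming_uniform_(w, mode="fan_in", nonlinearity="relu")
print("He uniform:    ", he_bound, w.abs().max().item())

torch.nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
print("He normal:     ", he_std, w.std().item())
```

In each printed pair, the sampled maximum should stay within the hand-computed bound, and the sampled standard deviation should be close to the hand-computed one.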

## Code: mnist_nn_xavier

In [None]:
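import math

import torch
from torch.nn.init import _calculate_fan_in_and_fan_out  # private helper from PyTorch's init module


def xavier_uniform_(tensor, gain=1):
    r"""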
    .. math::
        a = \text{gain} \times \sqrt{\frac{6}{\text{fan_in} + \text{fan_out}}}
    
    Also known as Glorot initialization.
    
    Args:
        tensor: an n-dimensional `torch.Tensor`
        gain: an optional scaling factor
    
    Examples:
        >>> w = torch.empty(3, 5)
        >>> nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))
    """
    fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    a = math.sqrt(3.0) * std   # Calculate uniform bounds from standard deviation
    with torch.no_grad():
        return tensor.uniform_(-a, a)

In [None]:
...

# nn layers
linear1 = torch.nn.Linear(784, 256, bias=True)
linear2 = torch.nn.Linear(256, 256, bias=True)
linear3 = torch.nn.Linear(256, 10, bias=True)
relu = torch.nn.ReLU()

# xavier initialization
torch.nn.init.xavier_uniform_(linear1.weight)
torch.nn.init.xavier_uniform_(linear2.weight)
torch.nn.init.xavier_uniform_(linear3.weight)

Changing only the weight initialization already makes training noticeably faster.
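
When a network has many layers, calling the initializer on each one by hand gets repetitive. One possible alternative, sketched here under the assumption that the layers are wrapped in `torch.nn.Sequential`, is to use `Module.apply` with a small helper. `init_weights` is a hypothetical name, and the random batch only stands in for flattened MNIST images.

```python
import torch

torch.manual_seed(777)

def init_weights(module):
    """Apply Xavier uniform initialization to the weight of every Linear layer."""
    if isinstance(module, torch.nn.Linear):
        torch.nn.init.xavier_uniform_(module.weight)

model = torch.nn.Sequential(
    torch.nn.Linear(784, 256, bias=True),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 256, bias=True),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10, bias=True),
)
model.apply(init_weights)   # visits every submodule, including each Linear

# Random batch with the flattened MNIST input size, used only to check shapes.
x = torch.randn(32, 784)
logits = model(x)
print(logits.shape)
```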