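Why Normalization  
Internal Covariate Shift (ICS): the scale and distribution of each layer's inputs drift during training, which makes training difficult  
Common normalization methods  
1. Batch Normalization (BN)  
2. Layer Normalization (LN)  
3. Instance Normalization (IN)  
4. Group Normalization (GN)  

They differ in how the mean and variance are computed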
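
To make the difference concrete, the sketch below computes only the mean for each of the four methods on an `(N, C, H, W)` tensor. The tensor sizes and the group count `G = 2` are assumptions chosen for illustration, and the LN line reduces over `(C, H, W)`, which is one common choice for image tensors rather than the only one.

```python
import torch

# Hypothetical input: batch N=4, channels C=6, spatial size H=W=5.
N, C, H, W = 4, 6, 5, 5
G = 2  # number of groups for GN (assumed; must divide C)
x = torch.randn(N, C, H, W)

# BN: one statistic per channel, shared across the batch and spatial dims.
bn_mean = x.mean(dim=(0, 2, 3))                            # shape (C,)
# LN: one statistic per sample, shared across all features of that sample.
ln_mean = x.mean(dim=(1, 2, 3))                            # shape (N,)
# IN: one statistic per sample and per channel.
in_mean = x.mean(dim=(2, 3))                               # shape (N, C)
# GN: one statistic per sample and per group of C // G channels.
gn_mean = x.view(N, G, C // G, H, W).mean(dim=(2, 3, 4))   # shape (N, G)

print(bn_mean.shape, ln_mean.shape, in_mean.shape, gn_mean.shape)
```

Only BN reduces over the batch dimension, which is why it is the only one of the four whose statistics depend on the batch size.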

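1.Layer Normalization  
Motivation: BN does not suit networks with variable-length inputs, such as RNNs  
Idea: compute the mean and variance per layer, i.e. over the features of each individual sample  
Notes:   

1. No running_mean or running_var  
2. gamma and beta are element-wise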
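
The two notes above can be checked with a small sketch that reproduces `nn.LayerNorm` by hand. The input shape `(2, 5, 3, 4)` and the random seed are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 3, 4)  # assumed shape; the last two dims are normalized
ln = nn.LayerNorm([3, 4])

# Statistics come from the last two dims of each slice; the variance is the biased estimate.
mean = x.mean(dim=(-2, -1), keepdim=True)
var = x.var(dim=(-2, -1), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

print(torch.allclose(manual, ln(x), atol=1e-5))  # manual result matches the module
print(ln.weight.shape, ln.bias.shape)            # gamma and beta: one value per normalized element
print(hasattr(ln, "running_mean"))               # LayerNorm keeps no running statistics
```

Because nothing is accumulated across batches, the layer behaves identically in training and evaluation mode.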

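nn.LayerNorm  
Main parameters:  
* normalized_shape: shape of this layer's features (the trailing dimensions of the input to normalize over)  
* eps: term added to the denominator for numerical stability  
* elementwise_affine: whether to apply a learnable affine transform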
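
A short sketch of how these parameters behave in practice. The input shape `(8, 10, 16)`, read here as (batch, sequence length, embedding size), is an assumption for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 10, 16)  # assumed shape: (batch, sequence length, embedding size)

# normalized_shape may be an int: normalize over the last dimension only.
ln_last = nn.LayerNorm(16)
# Or a list covering several trailing dimensions; it must equal x.shape[-2:] here.
ln_last_two = nn.LayerNorm([10, 16])
# elementwise_affine=False drops the learnable gamma and beta.
ln_plain = nn.LayerNorm(16, elementwise_affine=False)

print(ln_last(x).shape, ln_last_two(x).shape)       # the output shape equals the input shape
print(ln_last.weight.shape, ln_last_two.weight.shape)
print(ln_plain.weight, ln_plain.bias)               # None when the affine transform is disabled
print(ln_last.eps)                                  # default value of eps
```

The shape of `weight` always equals `normalized_shape`, which is what "element-wise" gamma and beta means.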

In [1]:
import torch
import torch.nn as nn
import numpy as np

In [8]:
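batch_size = 8
num_features = 6
features_shape = (3, 4)
# One constant (3, 4) map per feature: feature i is filled with the value i + 1.
feature_map = torch.ones(features_shape)
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)
# Repeat the same sample batch_size times -> shape (8, 6, 3, 4).
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)
# Normalize over the last two dims; each (3, 4) map is constant, so the output is all zeros.
ln = nn.LayerNorm([3, 4])
output = ln(feature_maps_bs)
print(ln.weight.shape, feature_maps_bs[0, ...])
print(output[0, ...])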

torch.Size([3, 4]) tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[2., 2., 2., 2.],
         [2., 2., 2., 2.],
         [2., 2., 2., 2.]],

        [[3., 3., 3., 3.],
         [3., 3., 3., 3.],
         [3., 3., 3., 3.]],

        [[4., 4., 4., 4.],
         [4., 4., 4., 4.],
         [4., 4., 4., 4.]],

        [[5., 5., 5., 5.],
         [5., 5., 5., 5.],
         [5., 5., 5., 5.]],

        [[6., 6., 6., 6.],
         [6., 6., 6., 6.],
         [6., 6., 6., 6.]]])
tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]], grad_fn=<SelectBackward>)

2.Instance Normalization  
Motivation: BN is not suitable for image generation  
Idea: compute the mean and variance per instance, i.e. for each channel of each sample
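
As a minimal sketch of this idea, the statistics can be computed by hand over the spatial dimensions only and compared with `nn.InstanceNorm2d`. The input shape `(3, 3, 2, 2)` and the seed are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(3, 3, 2, 2)  # assumed shape: (N, C, H, W)
instance_n = nn.InstanceNorm2d(num_features=3)

# One mean and one (biased) variance per sample and per channel.
mean = x.mean(dim=(2, 3), keepdim=True)                  # shape (N, C, 1, 1)
var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + instance_n.eps)

print(mean.shape)
print(torch.allclose(manual, instance_n(x), atol=1e-5))
```

No other sample in the batch influences the result, so the output for one image is the same whatever else is in the batch.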

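nn.InstanceNorm (nn.InstanceNorm1d / nn.InstanceNorm2d / nn.InstanceNorm3d)  
Main parameters:  
* num_features: number of features (channels) in one sample (the most important)  
* eps: term added to the denominator for numerical stability  
* momentum: factor for the exponentially weighted running estimates of mean/var (only used when track_running_stats=True)  
* affine: whether to apply a learnable affine transform (default False)  
* track_running_stats: whether to track running_mean/running_var and use them in evaluation mode; with the default (False), the statistics of the input are used in both training and evaluation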
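
The effect of `affine`, `momentum`, and `track_running_stats` can be seen in a small sketch. The constant input and the value `momentum=0.3` are assumptions for illustration.

```python
import torch
import torch.nn as nn

x = torch.ones(3, 3, 2, 2)  # constant input: every channel has mean 1 and variance 0

# Defaults: affine=False and track_running_stats=False, so there is nothing to learn or track.
default_in = nn.InstanceNorm2d(3)
print(default_in.weight, default_in.running_mean)  # both are None

# With tracking enabled, a forward pass in training mode updates the running estimates:
# running = (1 - momentum) * running + momentum * current
tracked_in = nn.InstanceNorm2d(3, momentum=0.3, affine=True, track_running_stats=True)
tracked_in(x)
print(tracked_in.running_mean, tracked_in.running_var)

# In evaluation mode the tracked statistics replace the statistics of the input.
# gamma and beta are still at their initial values (1 and 0), so they are omitted below.
tracked_in.eval()
rm = tracked_in.running_mean.view(1, 3, 1, 1)
rv = tracked_in.running_var.view(1, 3, 1, 1)
expected = (x - rm) / torch.sqrt(rv + tracked_in.eps)
print(torch.allclose(tracked_in(x), expected, atol=1e-5))
```

With the defaults, passing `momentum` changes nothing, because there are no running estimates to update.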

In [5]:
batch_size = 3
num_features = 3
momentum = 0.3  # has no effect here: running statistics are not tracked by default
features_shape = (2, 2)
# Channel i of every sample is a constant (2, 2) map filled with i + 1.
feature_map = torch.ones(features_shape)
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)
print(feature_maps_bs, feature_maps_bs.shape)
instance_n = nn.InstanceNorm2d(num_features=num_features, momentum=momentum)
for i in range(1):
    # Each channel of each sample is normalized on its own; constant channels map to zeros.
    outputs = instance_n(feature_maps_bs)
    print(outputs)

tensor([[[[1., 1.],
          [1., 1.]],

         [[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]]],


        [[[1., 1.],
          [1., 1.]],

         [[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]]],


        [[[1., 1.],
          [1., 1.]],

         [[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]]]]) torch.Size([3, 3, 2, 2])
tensor([[[[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]]],


        [[[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]]],


        [[[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]]]])


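3.Group Normalization  
Motivation: with small batches, the statistics estimated by BN are inaccurate  
Idea: when there is not enough batch data, make up for it with channels, computing the statistics over groups of channels within each sample  
Notes: 
1. No running_mean or running_var  
2. gamma and beta are per-channel  

Use case: large-model tasks (small batch size)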
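
A minimal sketch of the grouping, written by hand and compared with `nn.GroupNorm`. The sizes `N, C, H, W = 2, 4, 2, 2`, the group count `G = 2`, and the seed are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, C, H, W = 2, 4, 2, 2
G = 2  # assumed number of groups; C must be divisible by G
x = torch.randn(N, C, H, W)
gn = nn.GroupNorm(G, C)

# Flatten each group of C // G channels, then take one mean/variance per sample and group.
grouped = x.view(N, G, -1)
mean = grouped.mean(dim=-1, keepdim=True)
var = grouped.var(dim=-1, unbiased=False, keepdim=True)
normed = ((grouped - mean) / torch.sqrt(var + gn.eps)).view(N, C, H, W)

# gamma and beta are per channel, so they broadcast over the spatial dims.
manual = normed * gn.weight.view(1, C, 1, 1) + gn.bias.view(1, C, 1, 1)

print(torch.allclose(manual, gn(x), atol=1e-5))
print(gn.weight.shape)              # one gamma per channel
print(hasattr(gn, "running_mean"))  # GroupNorm keeps no running statistics
```

The statistics never touch the batch dimension, so the result is independent of the batch size.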

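nn.GroupNorm  
Main parameters:  
* num_groups: number of groups  
* num_channels: number of channels (features); must be divisible by num_groups  
* eps: term added to the denominator for numerical stability  
* affine: whether to apply a learnable affine transform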
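
The choice of `num_groups` interpolates between two of the other methods, as the sketch below illustrates. The input shape `(2, 4, 3, 3)` and the seed are assumptions; `affine=False` is used so that only the statistics are compared.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4, 3, 3)  # assumed shape: (N, C, H, W)

# num_groups == num_channels: every channel is its own group, as in Instance Normalization.
gn_as_in = nn.GroupNorm(4, 4, affine=False)
same_as_in = torch.allclose(gn_as_in(x), nn.InstanceNorm2d(4)(x), atol=1e-5)

# num_groups == 1: one group spans all channels, the statistics LayerNorm uses over (C, H, W).
gn_as_ln = nn.GroupNorm(1, 4, affine=False)
ln = nn.LayerNorm([4, 3, 3], elementwise_affine=False)
same_as_ln = torch.allclose(gn_as_ln(x), ln(x), atol=1e-5)

print(same_as_in, same_as_ln)

# num_channels must be divisible by num_groups; 4 channels cannot form 3 equal groups.
rejected = False
try:
    nn.GroupNorm(3, 4)(x)
except (ValueError, RuntimeError) as err:
    rejected = True
    print("rejected:", err)
```

With the affine transform enabled the equivalence to LayerNorm no longer holds exactly, since GroupNorm learns per-channel gamma and beta while LayerNorm learns them per element.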

In [7]:
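batch_size = 2
num_features = 4
num_groups = 2
features_shape = (2, 2)
feature_map = torch.ones(features_shape)
# Channel i is a constant (2, 2) map filled with i + 1.
feature_maps = torch.stack([feature_map * (i+1) for i in range(num_features)], dim=0)
# Sample j is the base sample scaled by j + 1 -> shape (2, 4, 2, 2).
feature_maps_bs = torch.stack([feature_maps * (i + 1) for i in range(batch_size)], dim=0)
# Two groups of two channels each; in the first sample these hold the values (1, 2) and (3, 4).
gn = nn.GroupNorm(num_groups, num_features)
outputs = gn(feature_maps_bs)
print(gn.weight.shape, outputs[0])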

torch.Size([4]) tensor([[[-1.0000, -1.0000],
         [-1.0000, -1.0000]],

        [[ 1.0000,  1.0000],
         [ 1.0000,  1.0000]],

        [[-1.0000, -1.0000],
         [-1.0000, -1.0000]],

        [[ 1.0000,  1.0000],
         [ 1.0000,  1.0000]]], grad_fn=<SelectBackward>)


Summary: BN, LN, IN, and GN were all designed to overcome ICS