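Batch Normalization (BN)  
Batch: a batch of data, usually a mini-batch  
Normalization: rescale to zero mean and unit variance  
Advantages: allows a larger learning rate, which speeds up convergence  
Careful weight initialization is no longer required  
Dropout can be omitted or reduced  
L2 regularization can be omitted, or a smaller weight decay used  
LRN (local response normalization) is no longer required  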

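The original motivation was to mitigate internal covariate shift (ICS), which in turn helps prevent vanishing and exploding gradients.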
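To make "zero mean, unit variance" concrete, here is a minimal sketch that normalizes a mini-batch by hand and compares the result with a freshly constructed `nn.BatchNorm1d` in training mode. The tensor sizes, the shift/scale applied to the input, and the variable names (`x_hat`, `batch_mean`, `batch_var`) are assumptions chosen for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A mini-batch of 16 samples with 4 features, deliberately shifted and scaled.
x = torch.randn(16, 4) * 3.0 + 5.0

eps = 1e-5  # default eps of nn.BatchNorm1d
batch_mean = x.mean(dim=0)                # per-feature mean over the batch
batch_var = x.var(dim=0, unbiased=False)  # biased variance over the batch
x_hat = (x - batch_mean) / torch.sqrt(batch_var + eps)

# Reference: a new BatchNorm1d starts with gamma = 1 and beta = 0.
bn = nn.BatchNorm1d(4)
bn.train()
y = bn(x)

print(x_hat.mean(dim=0))                 # per-feature mean, close to 0
print(x_hat.var(dim=0, unbiased=False))  # per-feature variance, close to 1
print(torch.allclose(x_hat, y, atol=1e-5))
```

The manual version omits the learnable affine transform, which is why it only matches the module while gamma and beta still have their initial values.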

In [6]:
import torch
import torch.nn as nn
import numpy as np

In [11]:
class MLP(nn.Module):
    def __init__(self, neural_num, layers=100):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(neural_num) for i in range(layers)])
        self.neural_num = neural_num
    
    
    def forward(self, x):
        # Each block applies Linear -> BatchNorm -> ReLU.
        for i, (linear, bn) in enumerate(zip(self.linears, self.bns)):
            x = linear(x)
            x = bn(x)
            x = torch.relu(x)
            # Stop early if the activations have degenerated to NaN.
            if torch.isnan(x.std()):
                print('output is nan in layer {}'.format(i))
                break
            print('layers:{}, std:{}'.format(i, x.std().item()))
        return x
    
    
    def initialize(self):
        # Kaiming (He) normal initialization for every linear layer.
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight.data)


In [12]:
neural_num = 256
layer_nums = 100
batch_size = 16
net = MLP(neural_num, layer_nums)
# net.initialize()  # deliberately skipped: with BN, the default initialization is enough
inputs = torch.randn((batch_size, neural_num))
output = net(inputs)
print(output)

layers:0, std:0.5789448022842407
layers:1, std:0.5849304795265198
layers:2, std:0.5713231563568115
layers:3, std:0.5765722990036011
layers:4, std:0.57011479139328
layers:5, std:0.5769959092140198
layers:6, std:0.576199471950531
layers:7, std:0.5761939883232117
layers:8, std:0.586577832698822
layers:9, std:0.5730356574058533
layers:10, std:0.5820686221122742
layers:11, std:0.58272385597229
layers:12, std:0.5818347334861755
layers:13, std:0.577144980430603
layers:14, std:0.5806982517242432
layers:15, std:0.5770207047462463
layers:16, std:0.5772503018379211
layers:17, std:0.5789951682090759
layers:18, std:0.5796297192573547
layers:19, std:0.5861504673957825
layers:20, std:0.5751968622207642
layers:21, std:0.5832950472831726
layers:22, std:0.5741185545921326
layers:23, std:0.5780283808708191
layers:24, std:0.5823552012443542
layers:25, std:0.5857341885566711
layers:26, std:0.5807106494903564
layers:27, std:0.5796788334846497
layers:28, std:0.5729328393936157
layers:29, std:0.58140885829925

layers:84, std:0.574193000793457
layers:85, std:0.5727437734603882
layers:86, std:0.5731297135353088
layers:87, std:0.5699641108512878
layers:88, std:0.5749478340148926
layers:89, std:0.5808009505271912
layers:90, std:0.5788522362709045
layers:91, std:0.5733445286750793
layers:92, std:0.5718269944190979
layers:93, std:0.5795019865036011
layers:94, std:0.5751311779022217
layers:95, std:0.5843010544776917
layers:96, std:0.5802408456802368
layers:97, std:0.5779411792755127
layers:98, std:0.5809749960899353
layers:99, std:0.58396977186203
tensor([[0.0000, 0.4899, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.1303, 0.6495, 0.7949,  ..., 0.0777, 0.8714, 0.0000],
        [0.2747, 0.0000, 0.0000,  ..., 0.2426, 1.1777, 0.0909],
        ...,
        [0.0000, 1.4464, 1.2353,  ..., 0.0000, 1.1637, 0.2478],
        [0.0000, 2.0306, 0.1914,  ..., 0.0000, 0.0000, 2.5218],
        [0.3302, 0.0000, 0.0000,  ..., 0.1338, 0.0000, 1.1510]],
       grad_fn=<ReluBackward0>)
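For contrast, the following sketch runs the same kind of 100-layer stack once with and once without BN, both using PyTorch's default `nn.Linear` initialization. The helper `final_std` is hypothetical and not part of the notebook above; it only reports the standard deviation after the last layer. Without BN the activations should shrink towards zero, whereas with BN the standard deviation should stay near the values printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def final_std(use_bn, neural_num=256, layers=100, batch_size=16):
    """Hypothetical helper: std of the activations after the last layer."""
    x = torch.randn(batch_size, neural_num)
    with torch.no_grad():
        for _ in range(layers):
            # Default initialization, no call to kaiming_normal_.
            x = nn.Linear(neural_num, neural_num, bias=False)(x)
            if use_bn:
                # New modules start in training mode, so batch statistics are used.
                x = nn.BatchNorm1d(neural_num)(x)
            x = torch.relu(x)
    return x.std().item()


std_with_bn = final_std(use_bn=True)
std_without_bn = final_std(use_bn=False)
print('with BN:    {}'.format(std_with_bn))
print('without BN: {}'.format(std_without_bn))
```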


_BatchNorm is the base class of:  
* nn.BatchNorm1d  
* nn.BatchNorm2d  
* nn.BatchNorm3d  

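Parameters:  
* num_features: number of features (channels) per sample (the most important one)  
* eps: small constant added to the denominator for numerical stability  
* momentum: coefficient of the exponential moving average used to update running_mean/running_var  
* affine: whether to apply the learnable affine transform  
* track_running_stats: whether to keep running estimates of mean/var; the training/evaluation state itself is switched with train()/eval()  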
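The constructor parameters listed above can be explored with a short sketch. The input shapes and variable names (`bn1d`, `bn2d`, `bn_no_affine`, `bn_no_stats`) are assumptions for illustration; the behavior shown is what the PyTorch modules are expected to do with these arguments.

```python
import torch
import torch.nn as nn

# num_features must match the feature (channel) dimension of the input.
bn1d = nn.BatchNorm1d(num_features=8)  # expects input of shape (N, 8)
bn2d = nn.BatchNorm2d(num_features=3)  # expects input of shape (N, 3, H, W)

out1d = bn1d(torch.randn(16, 8))
out2d = bn2d(torch.randn(16, 3, 5, 5))
print(out1d.shape, out2d.shape)

# Defaults of eps and momentum.
print(bn1d.eps, bn1d.momentum)

# affine=False: no learnable gamma/beta are created.
bn_no_affine = nn.BatchNorm1d(8, affine=False)
print(bn_no_affine.weight, bn_no_affine.bias)  # None None

# track_running_stats=False: no running buffers are kept, so batch
# statistics are used in both training and evaluation mode.
bn_no_stats = nn.BatchNorm1d(8, track_running_stats=False)
print(bn_no_stats.running_mean, bn_no_stats.running_var)  # None None
```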

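Main attributes:  
* running_mean: running estimate of the mean  
* running_var: running estimate of the variance  
* weight: gamma of the affine transform  
* bias: beta of the affine transform  

Training: each batch is normalized with its own statistics, while running_mean and running_var are updated by an exponential moving average  
Evaluation: the stored running_mean and running_var are used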
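The difference between training and evaluation can be checked with a small sketch. It assumes a single training step starting from the initial buffers (running_mean = 0, running_var = 1); the input data and the names `expected_mean`, `expected_var` and `manual` are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

momentum = 0.1
bn = nn.BatchNorm1d(4, momentum=momentum)
x = torch.randn(16, 4) * 2.0 + 3.0

# Training mode: normalize with batch statistics and update the running buffers.
bn.train()
_ = bn(x)

# Exponential moving average: new = (1 - momentum) * old + momentum * batch_stat.
# The running variance is updated with the unbiased batch variance.
expected_mean = (1 - momentum) * torch.zeros(4) + momentum * x.mean(dim=0)
expected_var = (1 - momentum) * torch.ones(4) + momentum * x.var(dim=0, unbiased=True)
print(torch.allclose(bn.running_mean, expected_mean, atol=1e-5))
print(torch.allclose(bn.running_var, expected_var, atol=1e-5))

# Evaluation mode: the stored running statistics are used, not the batch ones.
bn.eval()
with torch.no_grad():
    y_eval = bn(x)
    manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)
    manual = manual * bn.weight + bn.bias  # gamma and beta
print(torch.allclose(y_eval, manual, atol=1e-5))
```

Because only one update has been applied, the running estimates are still far from the true batch statistics, so the evaluation output here is not zero-mean; in practice the estimates converge over many training steps.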