### Analysis of Output Explosion
1. If the random variables $X, Y$ are independent, then

    $ E(XY) = E(X)E(Y) $

    $ D(X+Y) = D(X) + D(Y)$

    In addition, for any random variable $X$,

    $D(X) = E(X^2) - \left[E(X)\right]^2$

2. It follows that

    \begin{align}
    D(XY) &= E(X^2Y^2) - [E(X)]^2[E(Y)]^2 \\
    &= \left(D(X) + [E(X)]^2\right)\left(D(Y) + [E(Y)]^2\right) - [E(X)]^2[E(Y)]^2 \\
    &=  D(X)D(Y) + D(X)[E(Y)]^2 + D(Y)[E(X)]^2
    \end{align}

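3. If $X, Y$ follow a normal distribution with mean 0 and standard deviation 1, then $E(X) = E(Y) = 0$, the last two terms vanish, and

    $D(XY)=D(X)D(Y)$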

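4. For a fully connected layer with $n$ neurons, where the inputs and the weights are independent and each has mean 0 and variance 1, we have

    $ H^1 = \sum_{i=1}^n X_i w^T_{1i} $

    $D(H^1) = \sum_{i=1}^n D(X_i)D(w^T_{1i}) = n*(1*1) = n $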

5. Similarly,

    $ H^2 = \sum_{i=1}^n H^1_i w^T_{2i} $

    $D(H^2) = n*(n*1) = n^2 $

    $D(H^3) = n^{3}$

    $\vdots$

    $D(H^m) = n^{m}$
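The identity for $D(XY)$ in steps 2 and 3 can be checked numerically. The following is a minimal sketch, not part of the original notebook: it uses only the standard library, and the helper names `variance` and `product_variance_check`, the sample count, and the chosen means and standard deviations are all illustrative assumptions.

```python
import random


def variance(values):
    """Population variance, computed as the mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def product_variance_check(mean_x, std_x, mean_y, std_y, samples=200_000, seed=0):
    """Compare the sampled D(XY) with D(X)D(Y) + D(X)[E(Y)]^2 + D(Y)[E(X)]^2.

    X and Y are drawn independently from normal distributions (hypothetical setup).
    """
    rng = random.Random(seed)
    xs = [rng.gauss(mean_x, std_x) for _ in range(samples)]
    ys = [rng.gauss(mean_y, std_y) for _ in range(samples)]
    empirical = variance([x * y for x, y in zip(xs, ys)])
    d_x, d_y = std_x ** 2, std_y ** 2
    predicted = d_x * d_y + d_x * mean_y ** 2 + d_y * mean_x ** 2
    return empirical, predicted


# General case with nonzero means: all three terms of the formula contribute.
emp, pred = product_variance_check(1.0, 2.0, -0.5, 1.5)
print("nonzero means: empirical={:.3f}, predicted={:.3f}".format(emp, pred))

# Zero-mean, unit-variance case: the formula reduces to D(X)D(Y) = 1.
emp0, pred0 = product_variance_check(0.0, 1.0, 0.0, 1.0)
print("standard normal: empirical={:.3f}, predicted={:.3f}".format(emp0, pred0))
```

The empirical values should agree with the predictions only up to sampling error, so the match gets tighter as `samples` grows.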

In [14]:
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for _ in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)

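            # NOTE: the label says "std", but the printed value is x.var(), i.e. the variance,
            # which grows exponentially with depth
            print("layer:{}, std:{}".format(i, x.var()))
            if torch.isnan(x.var()):  # stop as soon as the variance of a layer becomes nan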
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight)  # draw every weight from the standard normal distribution N(0, 1)


layer_nums = 100
neural_nums = 256
batch_size = 16
inputs = torch.randn((batch_size, neural_nums))

net = MLP(neural_nums, layer_nums)
net.initialize()

output = net(inputs)
print(output)

layer:0, std:253.48338317871094
layer:1, std:67583.1796875
layer:2, std:17795620.0
layer:3, std:4671546368.0
layer:4, std:1204450099200.0
layer:5, std:310462382080000.0
layer:6, std:8.013987208547533e+16
layer:7, std:2.0335619288321753e+19
layer:8, std:5.019494845985134e+21
layer:9, std:1.1908801481093347e+24
layer:10, std:3.038827151994154e+26
layer:11, std:7.847845850090988e+28
layer:12, std:2.0367696902373843e+31
layer:13, std:5.15286537343349e+33
layer:14, std:1.3578871990112007e+36
layer:15, std:inf
layer:16, std:inf
layer:17, std:inf
layer:18, std:inf
layer:19, std:inf
layer:20, std:inf
layer:21, std:inf
layer:22, std:inf
layer:23, std:inf
layer:24, std:inf
layer:25, std:inf
layer:26, std:inf
layer:27, std:inf
layer:28, std:inf
layer:29, std:inf
layer:30, std:inf
layer:31, std:nan
output is nan in 31 layers
tensor([[       -inf,  1.5891e+38,        -inf,  ..., -5.3846e+37,
                 inf,         inf],
        [-3.7783e+37, -2.6330e+38, -1.6608e+38,  ...,         inf,
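The layer indices at which `inf` and `nan` first appear in the log above are consistent with float32 overflow under the idealized model $D(H^m) = n^m$. The sketch below is an illustration under that assumption, not part of the original notebook. The helper `first_overflow_depth` is a hypothetical name. It finds the first depth at which the variance $n^m$, and then the typical activation magnitude $n^{m/2}$, exceeds the largest finite float32 value.

```python
# Largest finite float32 value, about 3.4028e38.
FLOAT32_MAX = (2 - 2 ** -23) * 2 ** 127


def first_overflow_depth(n, exponent_per_layer=1.0):
    """Smallest depth m such that n ** (m * exponent_per_layer) exceeds FLOAT32_MAX."""
    m = 1
    while float(n) ** (m * exponent_per_layer) <= FLOAT32_MAX:
        m += 1
    return m


n = 256
var_depth = first_overflow_depth(n)         # variance grows like n^m
value_depth = first_overflow_depth(n, 0.5)  # standard deviation grows like n^(m/2)

# The log counts layers from 0, so depth m corresponds to printed index m - 1.
print("variance overflows at layer index", var_depth - 1)
print("typical activations overflow at layer index", value_depth - 1)
```

With $n = 256$ this gives index 15 for the variance and index 31 for the activations themselves, matching the first `inf` and the first `nan` in the log. Once activations contain both `inf` and `-inf`, their mean is undefined, which is a plausible source of the `nan`. Actual runs depend on the sampled weights, so the exact indices may shift slightly.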
     

Solution: initialize the weight matrix $W$ with variance $\frac{1}{n}$ (that is, standard deviation $\sqrt{1/n}$), which gives

$ D(H^1) = n*(\frac{1}{n}*1) = 1 $

$ D(H^2) = n*(\frac{1}{n}*1) = 1 $

$ \vdots $

$ D(H^m) = n*(\frac{1}{n}*1) = 1 $
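As a worked instance of the arithmetic above, the per-layer factor $n \cdot D(w)$ can be evaluated directly. This is a minimal sketch of the idealized model only (independent, zero-mean inputs and weights, linear layers without bias). The function name `predicted_variance` and its parameters are assumptions introduced for illustration.

```python
def predicted_variance(n, weight_var, depth, input_var=1.0):
    """Variance after `depth` linear layers under the model D(H^m) = (n * D(w))^m * D(X)."""
    return input_var * (n * weight_var) ** depth


n = 256

# Standard normal weights: the variance is multiplied by n at every layer.
print(predicted_variance(n, 1.0, 3))        # 256^3 = 16777216.0

# Weights with variance 1/n: the per-layer factor is exactly 1, so the variance stays at 1.
print(predicted_variance(n, 1.0 / n, 100))  # 1.0
```

The first value, $256^3 \approx 1.68 \times 10^7$, is close to the sampled variance printed for layer index 2 in the earlier run, which is what the model predicts up to sampling noise.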

In [15]:
class MLP_1(MLP):
    def initialize(self):  # override initialize so that each weight has variance 1/n
        for m in self.modules():
            if isinstance(m, nn.Linear):
                std = (1 / self.neural_num) ** 0.5  # standard deviation = sqrt(1/n)
                nn.init.normal_(m.weight, std=std)


layer_nums = 100
neural_nums = 256
batch_size = 16
inputs = torch.randn((batch_size, neural_nums))

net = MLP_1(neural_nums, layer_nums)
net.initialize()

output = net(inputs)
print(output)

layer:0, std:0.9735792279243469
layer:1, std:0.99771648645401
layer:2, std:0.9953316450119019
layer:3, std:1.0132945775985718
layer:4, std:0.9622108936309814
layer:5, std:0.9523491859436035
layer:6, std:0.9915835857391357
layer:7, std:0.9701160788536072
layer:8, std:0.9977512955665588
layer:9, std:0.9572206139564514
layer:10, std:0.9705260992050171
layer:11, std:0.9550902843475342
layer:12, std:0.9415923357009888
layer:13, std:0.9503766894340515
layer:14, std:0.9189853668212891
layer:15, std:0.9159332513809204
layer:16, std:0.8976818919181824
layer:17, std:0.8763459324836731
layer:18, std:0.8999841809272766
layer:19, std:0.90842205286026
layer:20, std:0.8796555995941162
layer:21, std:0.8616349101066589
layer:22, std:0.9095585942268372
layer:23, std:0.9559931755065918
layer:24, std:0.9570003151893616
layer:25, std:1.005898118019104
layer:26, std:1.002387523651123
layer:27, std:1.0296342372894287
layer:28, std:1.0083959102630615
layer:29, std:0.9582086205482483
layer:30, std:0.9601848125