# Batch Normalization

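A `batch normalization layer` makes deep neural networks easier to train.

In general, `standardizing the input data as a preprocessing step is effective enough for shallow models`. As training proceeds and the parameters of each layer are updated, **the outputs near the output layer of a shallow model are unlikely to change drastically**.

For **deep neural networks**, however, even when the input data has been standardized, **parameter updates during training can still easily cause drastic changes in the outputs near the output layer**. This numerical instability usually makes it hard to train an effective deep model. Batch normalization addresses this by using the mean and standard deviation of each mini-batch to keep adjusting the intermediate outputs of the network, so that they stay numerically stable in every layer.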

## Batch Normalization Layers
### Batch Normalization for Fully Connected Layers
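This subsection has no body in these notes. As a hedged sketch of the standard formulation (it is also what the `batch_norm` function later in this notebook computes for 2-D inputs), consider a mini-batch of $m$ outputs $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(m)}$ of the affine transformation, with batch normalization placed between the affine transformation and the activation function. The symbols are notation chosen here for illustration. The square, square root and division are element-wise, $\epsilon > 0$ is a small constant for numerical stability, and $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are the learnable scale and shift parameters.

```latex
\begin{aligned}
\boldsymbol{\mu}_\mathcal{B} &= \frac{1}{m}\sum_{i=1}^{m} \boldsymbol{x}^{(i)}, \\
\boldsymbol{\sigma}_\mathcal{B}^2 &= \frac{1}{m}\sum_{i=1}^{m} \left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B}\right)^2, \\
\hat{\boldsymbol{x}}^{(i)} &= \frac{\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B}}{\sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}}, \\
\boldsymbol{y}^{(i)} &= \boldsymbol{\gamma} \odot \hat{\boldsymbol{x}}^{(i)} + \boldsymbol{\beta}.
\end{aligned}
```

Because $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are learned, the layer can in principle undo the normalization if that turns out to be useful.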

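### Batch Normalization for Convolutional Layers
For a convolutional layer, batch normalization is applied `after the convolution and before the activation function`. If the convolution outputs multiple channels, `each channel is batch-normalized separately`, and `each channel has its own scale and shift parameters, both of which are scalars`. With a mini-batch of m samples and a p×q output per channel, the mean and variance of a channel are computed over all m×p×q elements of that channel.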
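The per-channel computation can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the tensor sizes and the `gamma` and `beta` values are made up, and the result is cross-checked against `torch.nn.functional.batch_norm` in training mode.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(8, 3, 5, 5)  # (batch, channels, height, width), illustrative sizes

# Per-channel statistics: reduce over batch, height and width.
mean = X.mean(dim=(0, 2, 3), keepdim=True)               # shape (1, 3, 1, 1)
var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)  # biased variance

# One scalar scale and one scalar shift per channel (values chosen arbitrarily).
gamma = torch.tensor([1.0, 2.0, 0.5]).reshape(1, 3, 1, 1)
beta = torch.tensor([0.0, 1.0, -1.0]).reshape(1, 3, 1, 1)
eps = 1e-5

Y = gamma * (X - mean) / torch.sqrt(var + eps) + beta

# Reference result from PyTorch; no running statistics are tracked here.
Y_ref = F.batch_norm(X, None, None, weight=gamma.flatten(),
                     bias=beta.flatten(), training=True, eps=eps)
```

After normalization, the mean of each channel of `Y` equals that channel's `beta`, since the normalized values have zero mean before the shift is applied.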


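### Batch Normalization at Prediction Time
`When training with batch normalization, we can use a somewhat larger batch size so that the mean and variance computed within each batch are more accurate`. When the trained model is used for prediction, we want it to produce a deterministic output for any input, so `the output for a single sample should not depend on the mean and variance of a random mini-batch`. A common approach is to `estimate the mean and variance of the whole training set with a moving average` and to use these estimates at prediction time. As with a dropout layer, `a batch normalization layer therefore computes different results in training mode and in prediction mode`.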
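The moving-average estimate can be illustrated with plain Python. The sketch below assumes the update rule used by the notebook's `batch_norm` function (`momentum * old + (1 - momentum) * new`). The helper name `update_moving_average` and the list of per-batch means are hypothetical.

```python
def update_moving_average(moving, batch_value, momentum=0.9):
    """One exponential-moving-average step for a running statistic."""
    return momentum * moving + (1.0 - momentum) * batch_value

# Hypothetical per-batch means, scattered around a true mean of 2.0.
batch_means = [2.1, 1.9, 2.05, 1.95, 2.0] * 20

moving_mean = 0.0  # starts at zero, like the moving statistics in the notebook
for batch_mean in batch_means:
    moving_mean = update_moving_average(moving_mean, batch_mean)

print(moving_mean)  # close to 2.0 once the zero start has decayed away
```

With `momentum=0.9`, the influence of the zero initial value shrinks by a factor of 0.9 per step, so the estimate is only reliable after a reasonable number of training batches.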

In [1]:
import time
import torch
from torch import nn, optim
import torch.nn.functional as F

import sys
sys.path.append("..") 
import d2lzh_pytorch as d2l
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')



In [2]:


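def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    if not is_training:
        # Prediction mode: normalize with the moving-average statistics.
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: per-feature mean and variance over the batch.
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # Convolutional layer: per-channel mean and variance over batch, height
            # and width; keepdim=True keeps shape (1, C, 1, 1) for broadcasting.
            mean = X.mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
            var = ((X - mean) ** 2).mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
        # Training mode: normalize with the statistics of the current mini-batch.
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages; detach so that they do not keep
        # references to the autograd graph of earlier iterations.
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean.detach()
        moving_var = momentum * moving_var + (1.0 - momentum) * var.detach()
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean, moving_var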


        
        

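Next, we define a custom `BatchNorm` layer. It holds the scale parameter `gamma` and the shift parameter `beta`, both of which take part in gradient computation and are updated during training. It also maintains the moving-average mean and variance so that they can be used at prediction time. The `num_features` argument is the `number of outputs` for a fully connected layer and the `number of output channels` for a convolutional layer. The `num_dims` argument is `2` for a fully connected layer and `4` for a convolutional layer.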

In [3]:
class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        super(BatchNorm, self).__init__()
        if num_dims == 2:
            shape = (1, num_features)        # fully connected layer
        else:
            shape = (1, num_features, 1, 1)  # convolutional layer
        # Scale and shift parameters, updated by gradients; initialized to 1 and 0.
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Moving statistics are plain tensors, not parameters; both start at 0.
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.zeros(shape)

    def forward(self, X):
        # Plain tensors do not follow the module across devices, so move them by hand.
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Store the updated moving_mean and moving_var. A Module's `training`
        # attribute defaults to True and is set to False by calling .eval().
        Y, self.moving_mean, self.moving_var = batch_norm(self.training,
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
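
The layer above keeps `moving_mean` and `moving_var` as plain tensor attributes, which is why `forward` has to move them to the input's device by hand; plain attributes are also left out of `state_dict()`. One possible alternative, sketched below under stated assumptions, is `nn.Module.register_buffer`. The class name `RunningStats` and its `update` method are hypothetical. The variance buffer starts at 1 here, unlike the zeros used in the notebook.

```python
import torch
from torch import nn

class RunningStats(nn.Module):  # hypothetical helper, for illustration only
    def __init__(self, num_features):
        super().__init__()
        # Buffers are not trained, but they follow .to(device) and are
        # saved in state_dict() together with the parameters.
        self.register_buffer("moving_mean", torch.zeros(1, num_features))
        self.register_buffer("moving_var", torch.ones(1, num_features))

    @torch.no_grad()
    def update(self, X, momentum=0.9):
        mean = X.mean(dim=0, keepdim=True)
        var = ((X - mean) ** 2).mean(dim=0, keepdim=True)
        # In-place updates keep the registered buffer objects intact.
        self.moving_mean.mul_(momentum).add_((1.0 - momentum) * mean)
        self.moving_var.mul_(momentum).add_((1.0 - momentum) * var)

stats = RunningStats(4)
stats.update(torch.randn(16, 4))
state_keys = sorted(stats.state_dict().keys())
num_params = len(list(stats.parameters()))
```

Here `state_keys` lists both buffers while `num_params` is zero, so an optimizer built from `parameters()` never touches the running statistics.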
        

In [4]:
net = nn.Sequential(
            nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
            BatchNorm(6, num_dims=4),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2), # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            BatchNorm(16, num_dims=4),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),
            d2l.FlattenLayer(),
            nn.Linear(16*4*4, 120),
            BatchNorm(120, num_dims=2),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            BatchNorm(84, num_dims=2),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

training on  cuda
epoch 1, loss 0.9939, train acc 0.785, test acc 0.831, time 13.9 sec
epoch 2, loss 0.4610, train acc 0.863, test acc 0.850, time 9.8 sec
epoch 3, loss 0.3673, train acc 0.879, test acc 0.861, time 9.9 sec
epoch 4, loss 0.3315, train acc 0.887, test acc 0.833, time 9.7 sec
epoch 5, loss 0.3077, train acc 0.894, test acc 0.882, time 9.7 sec


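The same network can be built with PyTorch's own `nn.BatchNorm2d` and `nn.BatchNorm1d` layers, which take `num_features` but need no `num_dims` argument.

In [5]: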
net = nn.Sequential(
            nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
            nn.BatchNorm2d(6),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2), # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            nn.BatchNorm2d(16),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),
            d2l.FlattenLayer(),
            nn.Linear(16*4*4, 120),
            nn.BatchNorm1d(120),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            nn.BatchNorm1d(84),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

training on  cuda
epoch 1, loss 1.0067, train acc 0.780, test acc 0.815, time 12.2 sec
epoch 2, loss 0.4714, train acc 0.860, test acc 0.848, time 12.1 sec
epoch 3, loss 0.3752, train acc 0.875, test acc 0.834, time 12.5 sec
epoch 4, loss 0.3378, train acc 0.884, test acc 0.836, time 12.3 sec
epoch 5, loss 0.3167, train acc 0.889, test acc 0.845, time 12.5 sec
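
When comparing the built-in layers with the hand-written one, note that the `momentum` conventions differ. In `nn.BatchNorm1d` and `nn.BatchNorm2d`, `momentum` weights the new batch statistic, so the default `momentum=0.1` plays the role of `momentum=0.9` in the `batch_norm` function above. The built-in layers also start the running variance at 1 and update it with the unbiased batch variance. The small check below is a sketch with made-up data, not part of the original experiment.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)               # defaults: eps=1e-5, momentum=0.1
X = torch.randn(32, 3) * 2.0 + 5.0   # illustrative data

bn.train()
_ = bn(X)                            # one training-mode pass updates the running statistics

batch_mean = X.mean(dim=0)
batch_var_unbiased = X.var(dim=0)    # torch.var is unbiased by default

# running = (1 - momentum) * running + momentum * batch statistic
expected_mean = 0.9 * torch.zeros(3) + 0.1 * batch_mean
expected_var = 0.9 * torch.ones(3) + 0.1 * batch_var_unbiased

bn.eval()                            # prediction mode: use the running statistics only
y1 = bn(X[:1])
y2 = bn(X[:1])                       # same single sample, same output
```

In evaluation mode the layer accepts a single sample and returns the same output every time, which is the deterministic behaviour described in the prediction section.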
