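## Batch Normalization
A batch normalization layer makes deeper neural networks easier to train.  
Standardizing the input data brings the distributions of the individual features closer together, which ``often makes it easier to train an effective model``.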

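* Why are deep networks hard to train?  
    ``Numerical instability``.  
    * ``Standardizing the input data`` is usually effective enough for shallow models: as training proceeds and the parameters of each layer are updated, the outputs near the output layer rarely change drastically.  
    * In a deep neural network, even when the input data has been standardized, parameter updates during training can still cause ``drastic changes in the outputs`` near the output layer. This ``numerical instability`` usually makes it hard to train an effective deep model.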


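Batch normalization was proposed to address exactly this difficulty of training deep models.  
During training, batch normalization ``uses the mean and standard deviation of each mini-batch`` to keep adjusting the intermediate outputs of the network, so that the values of the intermediate outputs in every layer stay more stable.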

### Batch normalization layers
- Batch normalization for fully connected layers
- Batch normalization for convolutional layers

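### Batch normalization for fully connected layers
The batch normalization layer is usually placed between the affine transformation of a fully connected layer and its activation function.  

Affine transformation: $x=Wu+b$  
Output of the fully connected layer with batch normalization: $\phi(BN(x))$  
For a mini-batch: $\mathbb{B}=\{x^{(1)},...,x^{(m)}\}$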

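Each $x^{(i)}$ is first standardized with the mini-batch mean and variance (all operations are elementwise, and $\epsilon > 0$ is a small constant that keeps the denominator positive):  
$\mu_\mathbb{B} \leftarrow \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$  
$\sigma_\mathbb{B}^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu_\mathbb{B})^2$  
$\hat{x}^{(i)} \leftarrow \frac{x^{(i)}-\mu_\mathbb{B}}{\sqrt{\sigma_\mathbb{B}^2+\epsilon}}$

On top of this standardization, batch normalization introduces two learnable model parameters: the scale parameter $\gamma$ and the shift parameter $\beta$. Both have the same shape as $x^{(i)}$.  
$y^{(i)} \leftarrow \gamma \cdot  \hat{x}^{(i)}+ \beta$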
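
The standardize, scale and shift steps for a fully connected layer can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the implementation used later in these notes: the sizes `m` and `d`, the random input and `eps=1e-5` are assumptions chosen for the example, and the built-in `nn.BatchNorm1d` in training mode serves only as a reference to compare against.

```python
import torch
from torch import nn

torch.manual_seed(0)
m, d = 8, 3                         # assumed mini-batch size and number of features
x = torch.randn(m, d) * 5 + 2       # stands in for the affine outputs x = Wu + b
eps = 1e-5                          # assumed small constant

# Standardize every feature with the mini-batch statistics
mu = x.mean(dim=0)                  # per-feature mean, shape (d,)
var = ((x - mu) ** 2).mean(dim=0)   # per-feature (biased) variance, shape (d,)
x_hat = (x - mu) / torch.sqrt(var + eps)

# Scale and shift; gamma = 1 and beta = 0 leave x_hat unchanged
gamma = torch.ones(d)
beta = torch.zeros(d)
y = gamma * x_hat + beta

# Reference: the built-in layer in training mode (weight = 1, bias = 0 at initialization)
bn = nn.BatchNorm1d(d, eps=eps)
y_ref = bn(x)
max_diff = (y - y_ref).abs().max().item()
print(x_hat.mean(dim=0), x_hat.var(dim=0, unbiased=False))
print(max_diff)
```

After standardization every feature has a mean close to 0 and a variance close to 1 within the mini-batch, and the hand-written result should agree with the built-in layer up to floating-point error.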


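### Batch normalization for convolutional layers
If the convolution outputs multiple channels, batch normalization is applied to the output of each channel separately, and ``each channel has its own scale and shift parameters, both of which are scalars``.

Suppose the mini-batch contains m samples and, on a single channel, the convolution output has height p and width q. The m×p×q elements of that channel are ``batch-normalized together``: they are all standardized with the same mean and variance, namely ``the mean and variance of the m×p×q elements in that channel``.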
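
The per-channel statistics can be illustrated with a short PyTorch sketch. The tensor sizes, the random input and `eps=1e-5` are assumptions made for the example; `nn.BatchNorm2d` in training mode is used only as a reference for comparison.

```python
import torch
from torch import nn

torch.manual_seed(0)
m, c, p, q = 4, 3, 5, 5             # assumed batch size, channels, height, width
x = torch.randn(m, c, p, q) * 3 + 1
eps = 1e-5                          # assumed small constant

# One mean and one variance per channel, each computed over m * p * q elements.
# keepdim=True keeps the shape (1, c, 1, 1) so the result broadcasts against x.
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + eps)

# One scalar scale and one scalar shift per channel
gamma = torch.ones(1, c, 1, 1)
beta = torch.zeros(1, c, 1, 1)
y = gamma * x_hat + beta

# Reference: the built-in layer in training mode
bn2d = nn.BatchNorm2d(c, eps=eps)
y_ref = bn2d(x)
print(mean.shape, bn2d.weight.shape)
print((y - y_ref).abs().max().item())
```

The built-in layer stores its scale as a vector with one entry per channel, which matches the statement that each channel has scalar scale and shift parameters.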

### A batch normalization layer computes different results in ``training mode`` and ``prediction mode``



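When training with batch normalization, use a somewhat larger batch size so that the mean and variance computed within each batch are reasonably accurate.  
When the trained model is used for prediction, we want it to produce a deterministic output for any input.  
The output for a single sample should therefore not depend on the mean and variance of whatever random mini-batch batch normalization happens to see.  
A common approach is to estimate ``the sample mean and variance of the whole training set`` with a ``moving average`` and to use these estimates at prediction time to obtain deterministic outputs.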
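
The difference between the two modes can be observed with the built-in `nn.BatchNorm1d`, as in the sketch below. The sample values and batch contents are made up for the illustration. Note that PyTorch's own `momentum` argument (default 0.1) weights the new batch statistic, i.e. `running = (1 - momentum) * running + momentum * batch_stat`, which is the opposite convention to the `momentum=0.9` used in the hand-written implementation in these notes.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(2)

# The same sample placed into two different mini-batches (values are made up)
sample = torch.tensor([[1.0, 2.0]])
batch_a = torch.cat([sample, torch.randn(7, 2)])
batch_b = torch.cat([sample, torch.randn(7, 2) * 10])

with torch.no_grad():
    # Training mode: the output for `sample` depends on the rest of the batch,
    # and each forward pass updates the running mean and variance.
    bn.train()
    train_a = bn(batch_a)[0]
    train_b = bn(batch_b)[0]

    # Prediction mode: the running statistics are used instead,
    # so the output for `sample` no longer depends on the batch.
    bn.eval()
    eval_a = bn(batch_a)[0]
    eval_b = bn(batch_b)[0]
    eval_single = bn(sample)[0]

print(train_a, train_b)               # differ
print(eval_a, eval_b, eval_single)    # identical
print(bn.running_mean, bn.running_var)
```

In prediction mode the layer even accepts a batch containing a single sample, because no batch statistics need to be computed.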

### Implementing a batch normalization layer

In [1]:
import time
import torch
from torch import nn, optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    if not is_training:
        # Prediction mode: standardize with the moving-average mean and variance
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: mean and variance of each feature over the batch dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolutional layer: mean and variance per channel (dim=1),
            # taken over the batch, height and width dimensions of [b, c, h, w].
            # keepdim=True keeps the shape (1, c, 1, 1) so that broadcasting works below.
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # Training mode: standardize with the mean and variance of the current mini-batch
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages; detach() keeps them out of the autograd graph
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean.detach()
        moving_var = momentum * moving_var + (1.0 - momentum) * var.detach()
    Y = gamma * X_hat + beta # scale and shift
    return Y, moving_mean, moving_var
        

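#### A custom BatchNorm layer
The layer holds the scale parameter gamma and the shift parameter beta, which take part in gradient computation and parameter updates, and it also maintains the moving averages of the mean and variance so that they can be used at prediction time.  
The ``num_features`` argument of a BatchNorm instance is the number of outputs for a fully connected layer and the number of output channels for a convolutional layer.  
The ``num_dims`` argument is 2 for a fully connected layer and 4 for a convolutional layer.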

In [3]:
class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        super(BatchNorm, self).__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
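        # Scale and shift parameters that take part in gradient computation and updates, initialized to 1 and 0 respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that take no part in gradient computation or updates, both initialized to 0 in CPU memory
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.zeros(shape)
    
    def forward(self, X):
        # If X is not in CPU memory, copy moving_mean and moving_var to the device X lives on
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var. A Module's training attribute defaults to True and is set to False by .eval()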
        Y, self.moving_mean, self.moving_var = batch_norm(self.training,
            X, self.gamma, self.beta, self.moving_mean, self.moving_var,
            eps=1e-5, momentum=0.9)
        return Y
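
The class above keeps the moving statistics as plain tensor attributes, which is why `forward` has to move them to the right device by hand. One possible alternative, sketched here for the fully connected case only, is to register them as buffers so that `.to(device)` and `state_dict()` handle them automatically. The class name `BufferedBatchNorm1d` is hypothetical and this is a variant under those assumptions, not the implementation used in these notes; the zero initialization and `momentum=0.9` simply mirror the code above.

```python
import torch
from torch import nn

class BufferedBatchNorm1d(nn.Module):  # hypothetical name, fully connected case only
    def __init__(self, num_features, eps=1e-5, momentum=0.9):
        super().__init__()
        self.eps, self.momentum = eps, momentum
        self.gamma = nn.Parameter(torch.ones(1, num_features))
        self.beta = nn.Parameter(torch.zeros(1, num_features))
        # Buffers are not trained, but they follow .to(device) and are saved in state_dict()
        self.register_buffer("moving_mean", torch.zeros(1, num_features))
        self.register_buffer("moving_var", torch.zeros(1, num_features))

    def forward(self, X):
        if self.training:
            mean = X.mean(dim=0, keepdim=True)
            var = ((X - mean) ** 2).mean(dim=0, keepdim=True)
            # Update the moving averages in place, outside the autograd graph
            with torch.no_grad():
                self.moving_mean.mul_(self.momentum).add_((1.0 - self.momentum) * mean)
                self.moving_var.mul_(self.momentum).add_((1.0 - self.momentum) * var)
        else:
            mean, var = self.moving_mean, self.moving_var
        X_hat = (X - mean) / torch.sqrt(var + self.eps)
        return self.gamma * X_hat + self.beta

layer = BufferedBatchNorm1d(4)
out = layer(torch.randn(8, 4))          # one training-mode forward pass
keys = sorted(layer.state_dict().keys())
print(keys)
```

With buffers, the moving averages are stored alongside the learnable parameters when the model is saved, and the device check in `forward` is no longer needed.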

### LeNet with batch normalization layers

In [4]:
net = nn.Sequential(
    nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
    BatchNorm(6, num_dims=4),
    nn.Sigmoid(),
    nn.MaxPool2d(2, 2), # kernel_size, stride
    nn.Conv2d(6, 16, 5),
    BatchNorm(16, num_dims=4),
    nn.Sigmoid(),
    nn.MaxPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(16*4*4, 120),
    BatchNorm(120, num_dims=2),
    nn.Sigmoid(),
    nn.Linear(120, 84),
    BatchNorm(84, num_dims=2),
    nn.Sigmoid(),
    nn.Linear(84, 10)
)
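
To see where the `16*4*4` input size of the first linear layer comes from, the output shape of every layer can be traced with a dummy input. The sketch below assumes 28×28 single-channel images (the Fashion-MNIST size) and substitutes the built-in `nn.BatchNorm2d` and `nn.BatchNorm1d` for the custom `BatchNorm` class so that the block runs on its own; the variable name `traced_net` is made up for this example.

```python
import torch
from torch import nn

traced_net = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.BatchNorm2d(6), nn.Sigmoid(), nn.MaxPool2d(2, 2),
    nn.Conv2d(6, 16, 5), nn.BatchNorm2d(16), nn.Sigmoid(), nn.MaxPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 120), nn.BatchNorm1d(120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.BatchNorm1d(84), nn.Sigmoid(),
    nn.Linear(84, 10),
)

X = torch.randn(2, 1, 28, 28)   # dummy batch of two 28x28 grayscale images
shapes = []
for layer in traced_net:
    X = layer(X)                # batch normalization keeps the shape of its input
    shapes.append((layer.__class__.__name__, tuple(X.shape)))
    print(f"{layer.__class__.__name__:12s} {tuple(X.shape)}")
```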

In [5]:
def train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs):
    net = net.to(device)
    print('train on', device)
    loss = torch.nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, batch_count, start = 0.0, 0.0, 0, 0, time.time()
        for X, y in train_iter:
            X = X.to(device)
            y = y.to(device)
            # print("y.shape", y.shape) # [128]
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            train_l_sum += l.cpu().item() # copy the loss value to the CPU
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()
            n += y.shape[0]
            batch_count += 1

        net.eval() # evaluation mode: batch normalization uses the moving averages
        with torch.no_grad():
            test_acc_sum, n_test = 0.0, 0 # Python scalars, kept in CPU memory
            for X_test, y_test in test_iter:
                # .item() turns a one-element Tensor into a Python scalar
                test_acc_sum += (net(X_test.to(device)).argmax(dim=1) == y_test.to(device)).sum().item()
                n_test += y_test.shape[0]
            test_acc = test_acc_sum / n_test
        net.train() # back to training mode

        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
        % (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))

In [6]:
# # Concise implementation using the BatchNorm1d and BatchNorm2d classes from PyTorch's nn module
# net = nn.Sequential(
#     nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
#     nn.BatchNorm2d(6),
#     nn.Sigmoid(),
#     nn.MaxPool2d(2, 2), # kernel_size, stride
#     nn.Conv2d(6, 16, 5),
#     nn.BatchNorm2d(16),
#     nn.Sigmoid(),
#     nn.MaxPool2d(2, 2),
#     nn.Flatten(),
#     nn.Linear(16*4*4, 120),
#     nn.BatchNorm1d(120),
#     nn.Sigmoid(),
#     nn.Linear(120, 84),
#     nn.BatchNorm1d(84),
#     nn.Sigmoid(),
#     nn.Linear(84, 10)
# )

In [7]:
mnist_train = torchvision.datasets.FashionMNIST(root='~/Datasets/FashionMNIST', train=True, download=True, transform=transforms.ToTensor())
mnist_test = torchvision.datasets.FashionMNIST(root='~/Datasets/FashionMNIST', train=False, download=True, transform=transforms.ToTensor())

batch_size = 256

train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True)
test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False)


In [8]:
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

train on cuda
epoch 1, loss 1.0047, train acc 0.784, test acc 0.812, time 4.6 sec
epoch 2, loss 0.4622, train acc 0.864, test acc 0.829, time 4.4 sec
epoch 3, loss 0.3670, train acc 0.878, test acc 0.868, time 4.4 sec
epoch 4, loss 0.3268, train acc 0.888, test acc 0.871, time 4.4 sec
epoch 5, loss 0.3027, train acc 0.895, test acc 0.829, time 4.4 sec
