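# Batch Normalization
Before we formally start building and training models, we first cover data preprocessing and batch normalization. Training a model is not easy, and very complex models in particular often fail to converge well. Adding some preprocessing to the data and using batch normalization can give much better convergence, which is also one important reason why convolutional networks can be trained to great depth.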

## Data preprocessing
At present, the most common data preprocessing methods are centering and standardization. Centering corrects the center position of the data. It is very simple to implement: subtract the corresponding mean from each feature dimension, which yields zero-mean features. Standardization is just as simple. Once the data is zero-mean, to give the different feature dimensions the same scale, you can either divide by the standard deviation so that each feature has unit variance, or rescale each feature to the range -1 to 1 using its maximum and minimum values. The figure below gives a simple illustration.

![](https://ws1.sinaimg.cn/large/006tKfTcly1fmqouzer3xj30ij06n0t8.jpg)

These two methods are very common. If you remember, we used them to standardize the data in the neural network part. Other methods, such as PCA or whitening, are used much less often.
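
As a minimal sketch of the two steps described above, the snippet below centers a small made-up data matrix and then standardizes it in both ways (unit variance, and min-max rescaling to the range -1 to 1). The data values and variable names are assumptions chosen only for illustration.

```python
import torch

# Toy data: 4 samples, 2 features on very different scales (made-up values).
x = torch.tensor([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

# Centering: subtract the per-feature mean so every column has zero mean.
centered = x - x.mean(dim=0, keepdim=True)

# Standardization, option 1: divide by the per-feature standard deviation
# (population estimate, i.e. the 1/m form).
std = centered.pow(2).mean(dim=0, keepdim=True).sqrt()
standardized = centered / std

# Standardization, option 2: rescale each feature to [-1, 1] from its min and max.
x_min = x.min(dim=0, keepdim=True).values
x_max = x.max(dim=0, keepdim=True).values
scaled = 2 * (x - x_min) / (x_max - x_min) - 1

print(standardized.mean(dim=0))           # close to 0 for each feature
print(standardized.pow(2).mean(dim=0))    # close to 1 for each feature
print(scaled.min(dim=0).values, scaled.max(dim=0).values)
```

After either option, the two features are on the same scale even though the raw values differ by a factor of 100.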


## Batch Normalization
In the data preprocessing above, we try to make the input features uncorrelated and follow a standard normal distribution, which generally improves the performance of the model. For deep network structures, however, the nonlinear layers make the outputs correlated again, so they no longer follow a standard N(0, 1) distribution, and even the center of the outputs shifts. This makes training difficult, especially for deep models.

So in 2015 a paper proposed this method, batch normalization. In short, the output of each layer of the network is normalized so that it follows a standard normal distribution. The input to the next layer is then also standard normal, so training goes more smoothly and convergence speeds up.

The implementation of batch normalization is very simple. For a given batch of data $B = \{x_1, x_2, \cdots, x_m\}$, the algorithm is as follows:

$$
\mu_B = \frac{1}{m} \sum_{i=1}^m x_i
$$
$$
\sigma^2_B = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2
$$
$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}
$$
$$
y_i = \gamma \hat{x}_i + \beta
$$

The first and second formulas compute the mean and variance of the data in a batch, and the third formula normalizes each data point in the batch. $\epsilon$ is a small constant introduced for numerical stability, usually $10^{-5}$. Finally, the scale $\gamma$ and the shift $\beta$ are applied to get the final output. Below we implement the simple one-dimensional case, that is, the case used in fully connected neural networks.


In [1]:
import sys
sys.path.append('..')

import torch

In [2]:
def simple_batch_norm_1d(x, gamma, beta):
    eps = 1e-5
    x_mean = torch.mean(x, dim=0, keepdim=True) # keep the dimension for broadcasting
    x_var = torch.mean((x - x_mean) ** 2, dim=0, keepdim=True)
    x_hat = (x - x_mean) / torch.sqrt(x_var + eps)
    return gamma.view_as(x_mean) * x_hat + beta.view_as(x_mean)

Let's verify if the output is normalized for any input.


In [3]:
x = torch.arange(15).float().view(5, 3) # cast to float: newer PyTorch versions return an integer tensor from arange(15)
gamma = torch.ones(x.shape[1])
beta = torch.zeros(x.shape[1])
print('before bn: ')
print(x)
y = simple_batch_norm_1d(x, gamma, beta)
print('after bn: ')
print(y)

before bn: 

  0   1   2
  3   4   5
  6   7   8
  9  10  11
 12  13  14
[torch.FloatTensor of size 5x3]

after bn: 

-1.4142 -1.4142 -1.4142
-0.7071 -0.7071 -0.7071
 0.0000  0.0000  0.0000
 0.7071  0.7071  0.7071
 1.4142  1.4142  1.4142
[torch.FloatTensor of size 5x3]



It can be seen that there are 5 data points with three features each, and each column holds the values of one feature across the data points. After batch normalization, each column has zero mean and unit variance.
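
To check the claim numerically, here is a small self-contained sketch that repeats the normalization on the same 5x3 input and prints the per-column mean and variance of the result. As a cross-check it also compares against `torch.nn.functional.batch_norm` in training mode. The assumption is that the built-in call, given no running statistics and no affine parameters, normalizes with the batch mean and the 1/m variance, which matches the formulas above.

```python
import torch
import torch.nn.functional as F

# Same 5x3 input as above, cast to float so the mean is defined.
x = torch.arange(15).float().view(5, 3)

eps = 1e-5
mean = x.mean(dim=0, keepdim=True)
var = ((x - mean) ** 2).mean(dim=0, keepdim=True)
y = (x - mean) / torch.sqrt(var + eps)   # gamma = 1, beta = 0

# Per-column statistics of the normalized output.
col_mean = y.mean(dim=0)
col_var = ((y - col_mean) ** 2).mean(dim=0)
print(col_mean)   # close to 0
print(col_var)    # close to 1 (slightly below, because of eps)

# Cross-check against the built-in functional form, using batch statistics.
ref = F.batch_norm(x, None, None, training=True, eps=eps)
print(torch.allclose(y, ref, atol=1e-5))
```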

This raises a question: should batch normalization also be used at test time?

The answer is yes. It was used during training, so leaving it out at test time would certainly bias the results. But if the test set contains only a single sample, the mean is just that sample's value and the variance is 0, which is clearly not meaningful. So at test time we do not compute the mean and variance from the test data. Instead, we use the moving averages of the mean and variance computed during training.
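
The single-sample problem is easy to see in a short sketch. The input values below are made up; the point is only that batch statistics computed from one sample erase the sample itself.

```python
import torch

eps = 1e-5
x = torch.tensor([[0.3, -1.2, 5.0]])   # a "batch" with one sample, 3 features (made-up values)

mean = x.mean(dim=0, keepdim=True)                  # equals the sample itself
var = ((x - mean) ** 2).mean(dim=0, keepdim=True)   # all zeros
x_hat = (x - mean) / torch.sqrt(var + eps)          # all zeros, whatever the input was

print(mean)
print(var)
print(x_hat)
```

Whatever the input, the normalized output is zero, so the network would see the same activations for every test sample.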

Below we implement a batch normalization function that distinguishes between the training state and the test state.


In [4]:
def batch_norm_1d(x, gamma, beta, is_training, moving_mean, moving_var, moving_momentum=0.1):
    eps = 1e-5
    x_mean = torch.mean(x, dim=0, keepdim=True) # keep the dimension for broadcasting
    x_var = torch.mean((x - x_mean) ** 2, dim=0, keepdim=True)
    if is_training:
        x_hat = (x - x_mean) / torch.sqrt(x_var + eps)
        moving_mean[:] = moving_momentum * moving_mean + (1. - moving_momentum) * x_mean
        moving_var[:] = moving_momentum * moving_var + (1. - moving_momentum) * x_var
    else:
        x_hat = (x - moving_mean) / torch.sqrt(moving_var + eps)
    return gamma.view_as(x_mean) * x_hat + beta.view_as(x_mean)
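
To make the moving-average update concrete, the following sketch applies the same update rule as `batch_norm_1d` to a stream of synthetic batches. The data distribution (mean 3, variance 4), the batch size, and the number of steps are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)

momentum = 0.1   # weight kept from the old moving value, as in batch_norm_1d above
moving_mean = torch.zeros(1)
moving_var = torch.zeros(1)

for step in range(200):
    # Hypothetical batch: 256 samples of one feature with mean 3 and variance 4.
    batch = 3.0 + 2.0 * torch.randn(256, 1)
    batch_mean = batch.mean(dim=0)
    batch_var = ((batch - batch_mean) ** 2).mean(dim=0)
    # Same update rule as in batch_norm_1d.
    moving_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    moving_var = momentum * moving_var + (1.0 - momentum) * batch_var

print(moving_mean, moving_var)   # should end up near 3 and 4
```

Note the convention: here `momentum` is the weight on the old value, so with 0.1 the moving statistics follow the most recent batches closely. PyTorch's built-in layers define `momentum` the other way around, as the weight on the new batch statistic.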

Below we reuse the example from the previous lesson, a deep neural network that classifies the MNIST dataset, to test whether batch normalization is useful.


In [5]:
import numpy as np
from torchvision.datasets import mnist # import the MNIST dataset built into torchvision
from torch.utils.data import DataLoader
from torch import nn
from torch.autograd import Variable

In [6]:
# Download the mnist dataset using built-in functions
train_set = mnist.MNIST('./data', train=True)
test_set = mnist.MNIST('./data', train=False)

def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5)
    x = x.reshape((-1,)) # flatten
    x = torch.from_numpy(x)
    return x

train_set = mnist.MNIST('./data', train=True, transform=data_tf, download=True)
test_set = mnist.MNIST('./data', train=False, transform=data_tf, download=True)
train_data = DataLoader(train_set, batch_size=64, shuffle=True)
test_data = DataLoader(test_set, batch_size=128, shuffle=False)

In [7]:
class multi_network(nn.Module):
    def __init__(self):
        super(multi_network, self).__init__()
        self.layer1 = nn.Linear(784, 100)
        self.relu = nn.ReLU(True)
        self.layer2 = nn.Linear(100, 10)
        
        self.gamma = nn.Parameter(torch.randn(100))
        self.beta = nn.Parameter(torch.randn(100))
        
        self.moving_mean = Variable(torch.zeros(100))
        self.moving_var = Variable(torch.zeros(100))
        
    def forward(self, x, is_train=True):
        x = self.layer1(x)
        x = batch_norm_1d(x, self.gamma, self.beta, is_train, self.moving_mean, self.moving_var)
        x = self.relu(x)
        x = self.layer2(x)
        return x

In [8]:
net = multi_network()

In [9]:
# define loss function
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), 1e-1) # stochastic gradient descent, learning rate 0.1


For convenience, the training function is defined in an external utils.py. It is the same training procedure used for the previous networks, and interested readers can take a look.


In [10]:
from utils import train
train(net, train_data, test_data, 10, optimizer, criterion)

Epoch 0. Train Loss: 0.308139, Train Acc: 0.912797, Valid Loss: 0.181375, Valid Acc: 0.948279, Time 00:00:07
Epoch 1. Train Loss: 0.174049, Train Acc: 0.949910, Valid Loss: 0.143940, Valid Acc: 0.958267, Time 00:00:09
Epoch 2. Train Loss: 0.134983, Train Acc: 0.961587, Valid Loss: 0.122489, Valid Acc: 0.963904, Time 00:00:08
Epoch 3. Train Loss: 0.111758, Train Acc: 0.968317, Valid Loss: 0.106595, Valid Acc: 0.966278, Time 00:00:09
Epoch 4. Train Loss: 0.096425, Train Acc: 0.971915, Valid Loss: 0.108423, Valid Acc: 0.967563, Time 00:00:10
Epoch 5. Train Loss: 0.084424, Train Acc: 0.974464, Valid Loss: 0.107135, Valid Acc: 0.969838, Time 00:00:09
Epoch 6. Train Loss: 0.076206, Train Acc: 0.977645, Valid Loss: 0.092725, Valid Acc: 0.971420, Time 00:00:09
Epoch 7. Train Loss: 0.069438, Train Acc: 0.979661, Valid Loss: 0.091497, Valid Acc: 0.971519, Time 00:00:09
Epoch 8. Train Loss: 0.062908, Train Acc: 0.980810, Valid Loss: 0.088797, Valid Acc: 0.972903, Time 00:00:08
Epoch 9. Train Loss

Here, both $\gamma$ and $\beta$ are trained as parameters and are initialized from a random Gaussian distribution, while `moving_mean` and `moving_var` are initialized to 0 and are not trainable parameters. After 10 epochs of training, we can see how the moving mean and moving variance have changed.


In [11]:
# print the first 10 entries of moving_mean
print(net.moving_mean[:10])

Variable containing:
 0.5505
 2.0835
 0.0794
-0.1991
-0.9822
-0.5820
 0.6991
-0.1292
 2.9608
 1.0826
[torch.FloatTensor of size 10]



It can be seen that these values were modified during training. At test time we do not need to compute the mean and variance, and can directly use the moving mean and the moving variance.


For comparison, let's look at the results of not using batch normalization.


In [12]:
no_bn_net = nn.Sequential(
    nn.Linear(784, 100),
    nn.ReLU(True),
    nn.Linear(100, 10)
)

optimizer = torch.optim.SGD(no_bn_net.parameters(), 1e-1) # stochastic gradient descent, learning rate 0.1
train(no_bn_net, train_data, test_data, 10, optimizer, criterion)

Epoch 0. Train Loss: 0.402263, Train Acc: 0.873817, Valid Loss: 0.220468, Valid Acc: 0.932852, Time 00:00:07
Epoch 1. Train Loss: 0.181916, Train Acc: 0.945379, Valid Loss: 0.162440, Valid Acc: 0.953817, Time 00:00:08
Epoch 2. Train Loss: 0.136073, Train Acc: 0.958522, Valid Loss: 0.264888, Valid Acc: 0.918216, Time 00:00:08
Epoch 3. Train Loss: 0.111658, Train Acc: 0.966551, Valid Loss: 0.149704, Valid Acc: 0.950752, Time 00:00:08
Epoch 4. Train Loss: 0.096433, Train Acc: 0.970732, Valid Loss: 0.116364, Valid Acc: 0.963311, Time 00:00:07
Epoch 5. Train Loss: 0.083800, Train Acc: 0.973914, Valid Loss: 0.105775, Valid Acc: 0.968058, Time 00:00:08
Epoch 6. Train Loss: 0.074534, Train Acc: 0.977129, Valid Loss: 0.094511, Valid Acc: 0.970728, Time 00:00:08
Epoch 7. Train Loss: 0.067365, Train Acc: 0.979311, Valid Loss: 0.130495, Valid Acc: 0.960146, Time 00:00:09
Epoch 8. Train Loss: 0.061585, Train Acc: 0.980894, Valid Loss: 0.089632, Valid Acc: 0.974090, Time 00:00:08
Epoch 9. Train Loss

It can be seen that although the final results are roughly the same in both cases, the early epochs show that batch normalization converges more quickly: after epoch 0 the training loss is 0.308 with batch normalization and 0.402 without it. Because this is only a small network, it also converges without batch normalization, but for deeper networks batch normalization lets training converge quickly.


As you can see from the above, we have implemented batch normalization for 2-dimensional inputs of shape (batch, features). Batch normalization for the 4-dimensional inputs used with convolutions is similar: we only need to compute the mean and variance separately for each channel. Implementing batch normalization ourselves is tedious, though, and PyTorch of course has built-in batch normalization layers. The one-dimensional and two-dimensional versions are `torch.nn.BatchNorm1d()` and `torch.nn.BatchNorm2d()`. Unlike our implementation, where the moving statistics live outside the layer's registered state, PyTorch keeps everything inside the layer: $\gamma$ and $\beta$ are trainable parameters, and the counterparts of `moving_mean` and `moving_var` are stored as buffers (`running_mean` and `running_var`) that are updated during the forward pass in training mode rather than by gradient descent.
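
The sketch below illustrates both points for the 4-dimensional case. It lists which tensors `nn.BatchNorm2d` registers as parameters and which as buffers, and it checks a manual per-channel normalization against the built-in layer in training mode. The input shape is an assumption for illustration. The comparison relies on the layer's default initialization (scale 1, shift 0).

```python
import torch
from torch import nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(6)   # 6 channels, chosen to match a small conv layer

# gamma and beta are learnable parameters; the running statistics are buffers.
print([name for name, _ in bn.named_parameters()])   # weight (gamma), bias (beta)
print([name for name, _ in bn.named_buffers()])      # running_mean, running_var, ...

# Manual per-channel normalization of a 4-D tensor (N, C, H, W):
# statistics are taken over the batch and both spatial dimensions.
x = torch.randn(8, 6, 14, 14)   # hypothetical batch of feature maps
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)

bn.train()          # training mode: normalize with batch statistics
builtin = bn(x)     # default weight is 1 and bias is 0, so no extra scaling
print(torch.allclose(manual, builtin, atol=1e-4))
```

Calling `bn.eval()` switches the layer to the running statistics, which plays the role of `is_training=False` in our own function.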


Let's try batch normalization in a convolutional network to see the effect.


In [None]:
def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5)
    x = torch.from_numpy(x)
    x = x.unsqueeze(0)
    return x

train_set = mnist.MNIST('./data', train=True, transform=data_tf, download=True)
test_set = mnist.MNIST('./data', train=False, transform=data_tf, download=True)
train_data = DataLoader(train_set, batch_size=64, shuffle=True)
test_data = DataLoader(test_set, batch_size=128, shuffle=False)

In [78]:
# use batch normalization
class conv_bn_net(nn.Module):
    def __init__(self):
        super(conv_bn_net, self).__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(1, 6, 3, padding=1),
            nn.BatchNorm2d(6),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.BatchNorm2d(16),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2)
        )
        
        self.classfy = nn.Linear(400, 10)
    def forward(self, x):
        x = self.stage1(x)
        x = x.view(x.shape[0], -1)
        x = self.classfy(x)
        return x

net = conv_bn_net()
optimizer = torch.optim.SGD(net.parameters(), 1e-1) # stochastic gradient descent, learning rate 0.1


In [79]:
train(net, train_data, test_data, 5, optimizer, criterion)

Epoch 0. Train Loss: 0.160329, Train Acc: 0.952842, Valid Loss: 0.063328, Valid Acc: 0.978441, Time 00:00:33
Epoch 1. Train Loss: 0.067862, Train Acc: 0.979361, Valid Loss: 0.068229, Valid Acc: 0.979430, Time 00:00:37
Epoch 2. Train Loss: 0.051867, Train Acc: 0.984625, Valid Loss: 0.044616, Valid Acc: 0.985265, Time 00:00:37
Epoch 3. Train Loss: 0.044797, Train Acc: 0.986141, Valid Loss: 0.042711, Valid Acc: 0.986056, Time 00:00:38
Epoch 4. Train Loss: 0.039876, Train Acc: 0.987690, Valid Loss: 0.042499, Valid Acc: 0.985067, Time 00:00:41


In [76]:
# do not use batch normalization
class conv_no_bn_net(nn.Module):
    def __init__(self):
        super(conv_no_bn_net, self).__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(1, 6, 3, padding=1),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2)
        )
        
        self.classfy = nn.Linear(400, 10)
    def forward(self, x):
        x = self.stage1(x)
        x = x.view(x.shape[0], -1)
        x = self.classfy(x)
        return x

net = conv_no_bn_net()
optimizer = torch.optim.SGD(net.parameters(), 1e-1) # stochastic gradient descent, learning rate 0.1


In [77]:
train(net, train_data, test_data, 5, optimizer, criterion)

Epoch 0. Train Loss: 0.211075, Train Acc: 0.935934, Valid Loss: 0.062950, Valid Acc: 0.980123, Time 00:00:27
Epoch 1. Train Loss: 0.066763, Train Acc: 0.978778, Valid Loss: 0.050143, Valid Acc: 0.984375, Time 00:00:29
Epoch 2. Train Loss: 0.050870, Train Acc: 0.984292, Valid Loss: 0.039761, Valid Acc: 0.988034, Time 00:00:29
Epoch 3. Train Loss: 0.041476, Train Acc: 0.986924, Valid Loss: 0.041925, Valid Acc: 0.986155, Time 00:00:29
Epoch 4. Train Loss: 0.036118, Train Acc: 0.988523, Valid Loss: 0.042703, Valid Acc: 0.986452, Time 00:00:29


When we later introduce some well-known network architectures, the importance of batch normalization will become clearer. Adding a batch normalization layer with PyTorch is very convenient.
