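## Batch Normalization: From Scratch
In the Kaggle exercise we normalized the input data. In practice, we usually normalize each sample or each feature of the input, shifting its mean to 0 and scaling its variance to 1, to keep the values numerically stable.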

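This works well for the linear regression and logistic regression models we studied in earlier lessons: because the values fed to the input layer do not vary drastically in magnitude, neither do the outputs. For a deep learning model that may have many layers, however, the situation can be more complicated.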

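For example, as the parameters of the first and second layers keep changing during training, the inputs to the third layer's activation function can become extremely large or extremely small through a multiplicative effect, for instance ending up on a different order of magnitude from the inputs to the first layer's activation function. When this happens, training becomes unstable: for a given learning rate, the objective value may change drastically, or even increase, after a single parameter update. Mathematically, if we Taylor-expand the objective function $f$ around an update of the parameters $\mathbf{w}$ (e.g. $f(\mathbf{w} - \eta \nabla f(\mathbf{w}))$), the coefficients of the higher-order terms in the learning rate $\eta$ may be too large to ignore, typically because there are many layers. Yet the effectiveness of commonly used low-order optimization algorithms such as gradient descent in steadily decreasing the objective usually rests on one basic assumption: that all higher-order terms in the learning rate in this Taylor expansion can be neglected.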
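
To make the Taylor-expansion argument concrete, the expansion can be written out to second order. The symbol $\mathbf{H}$, the Hessian of $f$ at $\mathbf{w}$, is introduced only for this sketch and does not appear elsewhere in the text.

$f(\mathbf{w} - \eta \nabla f(\mathbf{w})) = f(\mathbf{w}) - \eta \|\nabla f(\mathbf{w})\|^2 + \frac{\eta^2}{2} \nabla f(\mathbf{w})^\top \mathbf{H} \nabla f(\mathbf{w}) + O(\eta^3)$

The first-order term $-\eta \|\nabla f(\mathbf{w})\|^2$ is never positive, so the step reliably lowers $f$ only while that term dominates. If the entries of $\mathbf{H}$ are large, the $\eta^2$ term can outweigh it even for a modest $\eta$, and the objective may rise.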

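To address this, Sergey Ioffe and Christian Szegedy proposed batch normalization in 2015. In short, given a batch of inputs during training, batch normalization normalizes the inputs to the activation function of a given layer of the model, so that across the batch they have mean 0 and standard deviation 1, as a standard normal distribution does.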

Batch normalization is usually applied to the input layer or to any intermediate layer.

### A Simplified Batch Normalization Layer
Given a batch $B = \{x_1, \ldots, x_m\}$, we need to learn a scale parameter $\gamma$ and a shift parameter $\beta$.

We define:

$\mu_B \leftarrow \frac{1}{m}\sum_{i = 1}^{m}x_i$

$\sigma_B^2 \leftarrow \frac{1}{m}\sum_{i = 1}^{m}(x_i - \mu_B)^2$

$\hat{x_i} \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

$y_i \leftarrow \gamma \hat{x_i} + \beta \equiv \mbox{BN}_{\gamma,\beta}(x_i)$

The output of the batch normalization layer is $\{y_i = \mbox{BN}_{\gamma, \beta}(x_i)\}$.
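
One way to see why $\gamma$ and $\beta$ are worth learning is that a suitable choice undoes the normalization entirely, so the layer can fall back to the identity if that is what training prefers. The NumPy sketch below applies the four definitions above to a made-up batch of four scalars; the batch values are an assumption chosen only for illustration.

```python
import numpy as np

eps = 1e-5
x = np.array([1.0, 2.0, 4.0, 9.0])       # toy batch of m = 4 scalars (illustrative)

mu = x.mean()                             # batch mean
var = ((x - mu) ** 2).mean()              # batch variance (divides by m)
x_hat = (x - mu) / np.sqrt(var + eps)     # normalized values

# With this particular choice, scaling and shifting recover the input.
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(x_hat.mean(), x_hat.std())          # approximately 0 and 1
print(np.allclose(y, x))
```
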

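We now implement a simplified batch normalization layer by hand. The implementation distinguishes between fully connected layers and 2D convolutional layers. For a fully connected layer, we normalize each feature over the batch. For a 2D convolution, we normalize each channel instead, and must keep the 4D shape so that broadcasting works correctly.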

In [1]:
import numpy as np

def pure_batch_norm(X, gamma, beta, eps=1e-5):
    assert len(X.shape) in (2, 4)
    # Fully connected: batch_size x feature
    if len(X.shape) == 2:
        # Mean and variance of each input feature over the samples
        mean = X.mean(axis=0)
        variance = ((X - mean)**2).mean(axis=0)
    # 2D convolution: batch_size x channel x height x width
    else:
        # Per-channel mean and variance; keep the 4D shape so that broadcasting works
        mean = X.mean(axis=(0,2,3), keepdims=True)
        variance = ((X - mean)**2).mean(axis=(0,2,3), keepdims=True)

    # Normalize
    X_hat = (X - mean) / np.sqrt(variance + eps)
    # Scale and shift
    return gamma.reshape(mean.shape) * X_hat + beta.reshape(mean.shape)

Let's check it. First we define the input to a fully connected layer as follows; each row is one instance in the batch.



In [2]:
A = np.arange(6).reshape((3,2))

We expect each column of the batch to be normalized. The result matches expectations.



In [3]:
pure_batch_norm(A, gamma=np.array([1,1]), beta=np.array([0,0]))

array([[-1.22474258, -1.22474258],
       [ 0.        ,  0.        ],
       [ 1.22474258,  1.22474258]])

Next we define the input to a 2D convolutional layer: a single instance with two 3 x 3 channels.

In [4]:
B = np.arange(18).reshape((1,2,3,3))
print(B)

[[[[ 0  1  2]
   [ 3  4  5]
   [ 6  7  8]]

  [[ 9 10 11]
   [12 13 14]
   [15 16 17]]]]


The result is again as expected: each channel has been normalized.



In [5]:
pure_batch_norm(B, gamma=np.array([1,1]), beta=np.array([0,0]))

array([[[[-1.54919218, -1.16189413, -0.77459609],
         [-0.38729804,  0.        ,  0.38729804],
         [ 0.77459609,  1.16189413,  1.54919218]],

        [[-1.54919218, -1.16189413, -0.77459609],
         [-0.38729804,  0.        ,  0.38729804],
         [ 0.77459609,  1.16189413,  1.54919218]]]])
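
Both checks above use $\gamma = 1$ and $\beta = 0$. The sketch below looks at the other case: after the scale and shift, each channel should have a mean close to its $\beta$ and a standard deviation close to its $\gamma$. The function `channel_batch_norm` is a compact stand-in written for this check (it handles only the `batch_size x channel x height x width` layout), and the random input and parameter values are assumptions.

```python
import numpy as np

def channel_batch_norm(X, gamma, beta, eps=1e-5):
    # Stand-in for the 4D branch only: statistics are taken per channel.
    mean = X.mean(axis=(0, 2, 3), keepdims=True)
    var = ((X - mean) ** 2).mean(axis=(0, 2, 3), keepdims=True)
    X_hat = (X - mean) / np.sqrt(var + eps)
    return gamma.reshape(mean.shape) * X_hat + beta.reshape(mean.shape)

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=3.0, size=(8, 2, 4, 4))  # 8 instances, 2 channels
gamma = np.array([1.0, 2.0])
beta = np.array([0.0, -1.0])
Y = channel_batch_norm(X, gamma, beta)

print(Y.mean(axis=(0, 2, 3)))  # close to beta
print(Y.std(axis=(0, 2, 3)))   # close to gamma
```
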

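### Batch Normalization Layer
You may wonder: since we use batch normalization during training, should we also use it at test time? At first glance this is not easy to answer, because:

- if we don't, the trained model parameters will likely be inaccurate at test time;
- if we do, we are stuck when the test data consists of just a single instance.

In fact, we do still need batch normalization at test time, with one change: the batch mean and variance used during training are replaced by the mean and variance of the entire training set. When the training set is very large, computing these is expensive, so we approximate them with moving averages (see `moving_mean` and `moving_variance` in the implementation).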
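
To illustrate how well a moving average can stand in for full-dataset statistics, here is a small NumPy sketch. The synthetic data, the batch size of 100, and the momentum of 0.9 are assumptions chosen for the demonstration; only the update rule mirrors the implementation that follows.

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.normal(loc=3.0, scale=2.0, size=10000)  # stand-in "training set"

momentum = 0.9
batch_size = 100
moving_mean, moving_variance = 0.0, 0.0

for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]
    mean = batch.mean()
    variance = ((batch - mean) ** 2).mean()
    # Exponential moving average: keep 90% of the old value, blend in 10% new.
    moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
    moving_variance = momentum * moving_variance + (1.0 - momentum) * variance

print(moving_mean, data.mean())      # moving estimate vs. full-data mean
print(moving_variance, data.var())   # moving estimate vs. full-data variance
```

The moving estimates only ever see one batch at a time, so they need no second pass over the data. They weight recent batches most heavily, which makes them noisier than the exact statistics.
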

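The code below implements the batch normalization layer. During training it normalizes with the batch statistics and updates the moving averages; at test time it normalizes with the moving averages instead.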

In [6]:
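import tensorflow as tf  # TensorFlow 1.x graph-mode API

def batch_norm(X, gamma, beta, is_training, moving_mean, moving_variance,
               eps = 1e-5, moving_momentum = 0.9):
    # moving_mean and moving_variance must be tf.Variable objects so that the
    # running statistics persist between sess.run calls.
    assert len(X.shape) in (2, 4)
    moving_mean_var, moving_variance_var = moving_mean, moving_variance
    # Fully connected: batch_size x feature
    if len(X.shape) == 2:
        # Mean and variance of each input feature over the samples
        mean = tf.reduce_mean(X, axis=0)
        variance = tf.reduce_mean((X - mean)**2, axis=0)
    # 2D convolution (NHWC): batch_size x height x width x channel
    else:
        # Per-channel mean and variance; keep the 4D shape so that broadcasting works.
        # (Newer TensorFlow 1.x releases spell this argument `keepdims`.)
        mean = tf.reduce_mean(X, axis=(0,1,2), keep_dims=True)
        variance = tf.reduce_mean((X - mean)**2, axis=(0,1,2), keep_dims=True)
        # Reshape the moving statistics so that they broadcast correctly
        moving_mean = tf.reshape(moving_mean, mean.shape)
        moving_variance = tf.reshape(moving_variance, mean.shape)

    def train_update():
        # Normalize with the statistics of the current batch
        X_hat = (X - mean) / tf.sqrt(variance + eps)
        # Update the moving (global) mean and variance
        moving_mean_new = moving_momentum * moving_mean + (1.0 - moving_momentum) * mean
        moving_variance_new = moving_momentum * moving_variance + (1.0 - moving_momentum) * variance
        # Write the new values back into the variables; without these assigns
        # the test branch would always read the initial values.
        assign_mean = tf.assign(moving_mean_var,
                                tf.reshape(moving_mean_new, moving_mean_var.shape))
        assign_variance = tf.assign(moving_variance_var,
                                    tf.reshape(moving_variance_new, moving_variance_var.shape))
        with tf.control_dependencies([assign_mean, assign_variance]):
            return (tf.identity(X_hat), tf.identity(moving_mean_new),
                    tf.identity(moving_variance_new))
    def test_update():
        # At test time, normalize with the moving (global) mean and variance
        X_hat = (X - moving_mean) / tf.sqrt(moving_variance + eps)
        return X_hat, moving_mean, moving_variance

    X_hat, moving_mean, moving_variance = tf.cond(is_training, train_update, test_update)
    # Scale and shift
    return gamma * X_hat + beta, moving_mean, moving_variance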
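
The training/test switch can also be followed without a TensorFlow graph. The class below is a hypothetical NumPy reference for fully connected inputs only; `NumpyBatchNorm` and its attributes are names invented for this sketch, and starting `moving_variance` at 1 is an assumption that differs from the zero initialization used in this notebook. It shows that a single test instance can be normalized once the moving statistics have been accumulated.

```python
import numpy as np

class NumpyBatchNorm:
    """Hypothetical reference layer for inputs of shape (batch_size, features)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.9):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.moving_mean = np.zeros(num_features)
        self.moving_variance = np.ones(num_features)  # assumption: start at 1
        self.eps = eps
        self.momentum = momentum

    def __call__(self, X, is_training):
        if is_training:
            # Use batch statistics and fold them into the moving averages.
            mean = X.mean(axis=0)
            variance = ((X - mean) ** 2).mean(axis=0)
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1.0 - m) * mean
            self.moving_variance = m * self.moving_variance + (1.0 - m) * variance
        else:
            # Use the accumulated statistics; the batch itself is not consulted.
            mean, variance = self.moving_mean, self.moving_variance
        X_hat = (X - mean) / np.sqrt(variance + self.eps)
        return self.gamma * X_hat + self.beta

rng = np.random.RandomState(0)
layer = NumpyBatchNorm(num_features=3)
for _ in range(200):
    layer(rng.normal(loc=10.0, scale=2.0, size=(64, 3)), is_training=True)

single = np.array([[10.0, 12.0, 8.0]])       # one test instance
out = layer(single, is_training=False)
print(out)                                   # roughly [[0, 1, -1]]
```

In training mode the same single instance would normalize to all zeros, because a batch of one equals its own mean. That is the difficulty the moving averages avoid.
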

### Defining the Model


In [7]:
import tensorflow as tf

weight_scale = .01

# Output channels = 20, kernel = (5,5)
# Kernel shape: height x width x input_channels x output_channels

c1 = 20
W1 = tf.Variable(tf.random_normal([5,5,1,c1], mean=0.0, stddev=weight_scale, seed=None, dtype=tf.float32))
b1 = tf.Variable(tf.constant(0.0, shape=[c1]))

# Batch normalization for layer 1
gamma1 = tf.Variable(tf.random_normal([c1], mean=0.0, stddev=weight_scale, seed=None, dtype=tf.float32))
beta1 = tf.Variable(tf.random_normal([c1], mean=0.0, stddev=weight_scale, seed=None, dtype=tf.float32))
moving_mean1 = tf.Variable(tf.constant(0.0, shape=[c1]), trainable=False)
moving_variance1 = tf.Variable(tf.constant(0.0, shape=[c1]), trainable=False)

# Output channels = 50, kernel = (3,3)
c2 = 50
W2 = tf.Variable(tf.random_normal([3,3,c1,c2], mean=0.0, stddev=weight_scale, seed=None, dtype=tf.float32))
b2 = tf.Variable(tf.constant(0.0, shape=[c2]))

# Batch normalization for layer 2
gamma2 = tf.Variable(tf.random_normal([c2], mean=0.0, stddev=weight_scale, seed=None, dtype=tf.float32))
beta2 = tf.Variable(tf.random_normal([c2], mean=0.0, stddev=weight_scale, seed=None, dtype=tf.float32))
moving_mean2 = tf.Variable(tf.constant(0.0, shape=[c2]), trainable=False)
moving_variance2 = tf.Variable(tf.constant(0.0, shape=[c2]), trainable=False)

# Output dimension = 128
o3 = 128
W3 = tf.Variable(tf.random_normal([1250, o3], mean=0.0, stddev=weight_scale, seed=None, dtype=tf.float32))
b3 = tf.Variable(tf.constant(0.0, shape=[o3]))

# Output dimension = 10
W4 = tf.Variable(tf.random_normal([o3, 10], mean=0.0, stddev=weight_scale, seed=None, dtype=tf.float32))
b4 = tf.Variable(tf.constant(0.0, shape=[10]))

# Note: the moving_* variables are not updated by the optimizer
params = [W1, b1, gamma1, beta1,
          W2, b2, gamma2, beta2,
          W3, b3, W4, b4]


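Next we define the model, adding batch normalization layers. Note where they are placed: after each convolutional layer and before the activation function.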

In [8]:
def net(X, is_training, verbose=False):
    global moving_mean1, moving_variance1, moving_mean2, moving_variance2
    # First convolutional block: conv -> batch norm -> ReLU -> max pool
    h1_conv = tf.nn.conv2d(X, W1, strides = [1,1,1,1], padding = 'VALID')
    h1_bn, moving_mean1, moving_variance1 = batch_norm(h1_conv, gamma1, beta1, is_training,
                       moving_mean1, moving_variance1)
    h1_activation = tf.nn.relu(h1_bn)
    h1 = tf.nn.max_pool(h1_activation, [1,2,2,1], [1,2,2,1], padding = 'VALID')
    # Second convolutional block
    h2_conv = tf.nn.conv2d(h1, W2, strides = [1,1,1,1], padding = 'VALID')
    h2_bn, moving_mean2, moving_variance2 = batch_norm(h2_conv, gamma2, beta2, is_training,
                       moving_mean2, moving_variance2)
    h2_activation = tf.nn.relu(h2_bn)
    h2 = tf.nn.max_pool(h2_activation, [1,2,2,1], [1,2,2,1], padding='VALID')
    h2 = tf.layers.flatten(h2)
    # First fully connected layer
    h3_linear = tf.matmul(h2, W3) + b3
    h3 = tf.nn.relu(h3_linear)
    # Second fully connected layer (produces the logits)
    h4_linear = tf.matmul(h3, W4) + b4
    if verbose:
        print('1st conv block:', h1.get_shape().as_list())
        print('2nd conv block:', h2.get_shape().as_list())
        print('1st dense:', h3.get_shape().as_list())
        print('2nd dense:', h4_linear.get_shape().as_list())
        print('output:', h4_linear)
    return h4_linear, h2_activation
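
The first dimension of `W3`, 1250, follows from the spatial sizes produced by the two convolution and pooling blocks. The sketch below reproduces that arithmetic; `valid_output_size` is a helper name introduced here for illustration, using the standard output-size formula for `'VALID'` (unpadded) windows.

```python
def valid_output_size(size, kernel, stride=1):
    """Output length of an unpadded ('VALID') convolution or pooling window."""
    return (size - kernel) // stride + 1

size = 28                                            # Fashion-MNIST images are 28 x 28
size = valid_output_size(size, kernel=5)             # 5x5 conv     -> 24
size = valid_output_size(size, kernel=2, stride=2)   # 2x2 max pool -> 12
size = valid_output_size(size, kernel=3)             # 3x3 conv     -> 10
size = valid_output_size(size, kernel=2, stride=2)   # 2x2 max pool -> 5

flattened = size * size * 50                         # 50 output channels in block 2
print(flattened)                                     # 1250
```
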


In [9]:
import sys

sys.path.append('../../utils')
import utils

data_dir = '../../data/fashion_mnist'
train_images, train_labels, test_images, test_labels = utils.load_data_fashion_mnist(data_dir, one_hot=True)
print(train_images.shape)
print(train_labels.shape)

from tensorflow.contrib.learn.python.learn.datasets.mnist import DataSet
train_dataset = DataSet(train_images, train_labels, one_hot=True)

import numpy as np

learning_rate = 1e-0
max_steps = 1000
batch_size = 256
height = width = 28
num_channels = 1
num_outputs = 10

input_placeholder = tf.placeholder(tf.float32, [None, height, width, num_channels])
is_training = tf.placeholder(tf.bool)
#input_placeholder = tf.placeholder(tf.float32, [None, height*width*num_channels])

gt_placeholder = tf.placeholder(tf.int64, [None, num_outputs])

logits, h2 = net(input_placeholder, is_training)
loss = tf.losses.softmax_cross_entropy(logits=logits,  onehot_labels=gt_placeholder)
acc = utils.accuracy(logits, gt_placeholder)
test_images_reshape = np.reshape(np.squeeze(test_images), (test_images.shape[0], height, width, num_channels))
    
#train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

init = tf.global_variables_initializer()
sess = tf.InteractiveSession()
sess.run(init)

for step in range(max_steps):
    data, label = train_dataset.next_batch(batch_size)
    data = np.reshape(data, (batch_size, height, width, num_channels))
    feed_dict = {input_placeholder: data, gt_placeholder: label, is_training: True}
    h2_, loss_, acc_, _ = sess.run([h2, loss, acc, train_op], feed_dict=feed_dict)
    if step % 10 == 0:
        print("Batch %d, Loss: %f, Train acc %f " % (step, loss_, acc_))

test_loss_, test_acc_ = sess.run([loss, acc], feed_dict={input_placeholder: test_images_reshape / 255.0, gt_placeholder: test_labels, is_training: False})
print("Test Loss: %f, Test acc %f " % (test_loss_, test_acc_))


Extracting ../data/fashion_mnist/train-images-idx3-ubyte.gz
Extracting ../data/fashion_mnist/train-labels-idx1-ubyte.gz
Extracting ../data/fashion_mnist/t10k-images-idx3-ubyte.gz
Extracting ../data/fashion_mnist/t10k-labels-idx1-ubyte.gz
(60000, 28, 28, 1)
(60000, 10)
Instructions for updating:
keep_dims is deprecated, use keepdims instead
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

Batch 0, Loss: 2.302599, Train acc 0.113281 
Batch 10, Loss: 2.302525, Train acc 0.246094 
Batch 20, Loss: 2.302248, Train acc 0.308594 
Batch 30, Loss: 2.300201, Train acc 0.453125 
Batch 40, Loss: 2.253733, Train acc 0.148438 
Batch 50, Loss: 2.114796, Train acc 0.156250 
Batch 60, Loss: 1.646562, Train acc 0.386719 
Batch 70, Loss: 1.364214, Train acc 0.445312 
Batch 80, Loss: 1.178418, Train acc 0.550781 
Batch 90, Loss: 0.971520, Train acc 0.664062 
Batch 100,
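
The first logged loss, 2.302599, is close to $\ln 10 \approx 2.302585$, which is the softmax cross-entropy of a classifier that assigns equal probability to all 10 classes. That is plausible here because the weights start at a scale of 0.01, so the initial logits are nearly zero. The NumPy sketch below checks the arithmetic; `softmax_cross_entropy` is a stand-in written for this note, not the TensorFlow function used in the training cell.

```python
import numpy as np

def softmax_cross_entropy(logits, onehot_labels):
    """Mean softmax cross-entropy over a batch (NumPy sketch)."""
    # Subtract the row maximum before exponentiating, for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(onehot_labels * log_probs).sum(axis=1).mean()

num_classes = 10
logits = np.zeros((4, num_classes))          # all-zero logits: uniform prediction
labels = np.eye(num_classes)[[0, 3, 5, 9]]   # four arbitrary one-hot labels
loss = softmax_cross_entropy(logits, labels)

print(loss, np.log(num_classes))             # both are ln(10)
```
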
