# Batch Normalization (arguably better named standardization)

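## Introduction

### Concepts

Normalization (mean normalization): $$\frac{x - \bar{x}}{x_{max} - x_{min}} \rightarrow [-1,1]$$

Rescaling (min-max normalization): $$\frac{x - x_{min}}{x_{max} - x_{min}} \rightarrow [0,1]$$

Standardization: $$\frac{x - \bar{x}}{\sigma} \rightarrow mean=0, std=1$$

Standardization transforms data so that it has mean 0 and standard deviation 1. It makes the distributions of the individual features similar, which often makes it easier to train an effective model.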
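
The three scalings above can be compared on a small array. This is a minimal NumPy sketch; the function names `mean_normalize`, `min_max_rescale`, and `standardize` are illustrative and do not come from any library.

```python
import numpy as np

def mean_normalize(x):
    # (x - mean) / (max - min): the result is centered at zero
    return (x - x.mean()) / (x.max() - x.min())

def min_max_rescale(x):
    # (x - min) / (max - min): the smallest value maps to 0, the largest to 1
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    # (x - mean) / std: the result has mean 0 and standard deviation 1
    return (x - x.mean()) / x.std()

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
print(mean_normalize(x))
print(min_max_rescale(x))
print(standardize(x))
```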

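For deep neural networks, even if the input data has already been standardized, parameter updates during training can still easily cause drastic changes in the outputs of layers deep in the network. This numerical instability usually makes it hard to train effective deep models.

Batch normalization (BN) was proposed precisely to address this challenge. During training, BN uses the mean and standard deviation of each mini-batch to continually adjust the intermediate outputs of the network, so that the values of the intermediate outputs at every layer are more stable.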

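### BN for Fully Connected Layers

Usually, we place the batch normalization layer between the affine transformation and the activation function of a fully connected layer. Let the input of the fully connected layer be $\boldsymbol{u}$, the weight and bias parameters be $\boldsymbol{W}$ and $\boldsymbol{b}$, and the activation function be $\phi$. Denote the batch normalization operator by $\text{BN}$. Then the output of a fully connected layer with batch normalization is
$$\phi(\text{BN}(\boldsymbol{x})),$$
where the batch normalization input $\boldsymbol{x}$ is obtained from the affine transformation
$$\boldsymbol{x} = \boldsymbol{W}\boldsymbol{u} + \boldsymbol{b}.$$
Consider a mini-batch of $m$ samples. The affine transformation outputs a new mini-batch $\mathcal{B} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)} \}$, which is exactly the input of the batch normalization layer. For any sample $\boldsymbol{x}^{(i)} \in \mathbb{R}^d, 1 \leq i \leq m$ in the mini-batch $\mathcal{B}$, the output of the batch normalization layer is also a $d$-dimensional vector
$$\boldsymbol{y}^{(i)} = \text{BN}(\boldsymbol{x}^{(i)}),$$
obtained by the following steps. First, compute the mean and variance over the mini-batch $\mathcal{B}$: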
$$\boldsymbol{\mu}_\mathcal{B} \leftarrow \frac{1}{m}\sum_{i = 1}^{m} \boldsymbol{x}^{(i)},$$
$$\boldsymbol{\sigma}_\mathcal{B}^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m}(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B})^2,$$
where the square is taken elementwise. Next, standardize $\boldsymbol{x}^{(i)}$ using an elementwise square root and elementwise division:
$$\hat{\boldsymbol{x}}^{(i)} \leftarrow \frac{\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B}}{\sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}},$$
Here $\epsilon > 0$ is a small constant that keeps the denominator greater than 0.

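On top of this standardization, **the batch normalization layer introduces two learnable model parameters**: the scale parameter $\boldsymbol{\gamma}$ and the shift parameter $\boldsymbol{\beta}$. Both have the same shape as $\boldsymbol{x}^{(i)}$, that is, they are $d$-dimensional vectors. They are combined with $\hat{\boldsymbol{x}}^{(i)}$ by elementwise multiplication (symbol $\odot$) and addition, respectively: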
$${\boldsymbol{y}}^{(i)} \leftarrow \boldsymbol{\gamma} \odot \hat{\boldsymbol{x}}^{(i)} + \boldsymbol{\beta}.$$
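This gives the batch normalization output $\boldsymbol{y}^{(i)}$ for $\boldsymbol{x}^{(i)}$. Note that the learnable scale and shift parameters keep open the possibility of not batch-normalizing $\boldsymbol{x}^{(i)}$ at all: it suffices to learn $\boldsymbol{\gamma} = \sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}$ and $\boldsymbol{\beta} = \boldsymbol{\mu}_\mathcal{B}$, in which case $\boldsymbol{y}^{(i)} = \boldsymbol{x}^{(i)}$. One way to read this: **if batch normalization does not help, then in theory the learned model can choose not to use it.**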
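
The steps above can be sketched in NumPy for a mini-batch of shape `(m, d)`. The helper name `bn_forward` and the random input are assumptions made for illustration. The last lines check the remark that choosing gamma as sqrt(var + eps) and beta as the batch mean recovers the input.

```python
import numpy as np

def bn_forward(X, gamma, beta, eps=1e-5):
    # Mean and variance per feature, taken over the batch axis
    mu = X.mean(axis=0)
    var = ((X - mu) ** 2).mean(axis=0)
    # Standardize, then scale and shift
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta, mu, var

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(8, 4))  # m = 8 samples, d = 4 features

# With gamma = 1 and beta = 0 the output is simply the standardized batch
Y, mu, var = bn_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(Y.mean(axis=0), Y.std(axis=0))

# With gamma = sqrt(var + eps) and beta = mu the layer acts as the identity
eps = 1e-5
Y_id, _, _ = bn_forward(X, gamma=np.sqrt(var + eps), beta=mu, eps=eps)
print(np.allclose(Y_id, X))  # True
```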

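### BN for Convolutional Layers

For a convolutional layer, batch normalization is applied after the convolution and before the activation function. If the convolution outputs multiple channels, we **batch-normalize the output of each channel separately**, and each channel has its own scale and shift parameters, both scalars. Suppose the mini-batch contains m samples and, on a single channel, the convolution output has height p and width q. We batch-normalize all m×p×q elements of that channel together, using the same mean and variance for all of them: the mean and variance of the m×p×q elements in that channel.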
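
A minimal NumPy sketch of the per-channel statistics, assuming a channels-last `(m, p, q, c)` layout as used by the Keras layers later in this document. The shapes and the random input are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, q, c = 4, 5, 5, 3
# Give every channel a different scale and offset so the effect is visible
X = rng.normal(size=(m, p, q, c)) * np.array([1.0, 5.0, 10.0]) + np.array([0.0, 2.0, -3.0])

# One mean and one variance per channel, each computed over m*p*q elements
mean = X.mean(axis=(0, 1, 2), keepdims=True)
var = X.var(axis=(0, 1, 2), keepdims=True)
X_hat = (X - mean) / np.sqrt(var + 1e-5)

# One scalar gamma and one scalar beta per channel (initial values 1 and 0)
gamma, beta = np.ones(c), np.zeros(c)
Y = gamma * X_hat + beta

print(mean.shape)  # (1, 1, 1, 3)
print(Y.mean(axis=(0, 1, 2)), Y.std(axis=(0, 1, 2)))
```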

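### BN at Prediction Time

When training with batch normalization, we can set the batch size a bit larger so that the mean and variance computed within a batch are more accurate. When the trained model is used for prediction, we want it to produce a deterministic output for any input. Therefore, **the output for a single sample should not depend on the mean and variance of the random mini-batch that batch normalization would otherwise need**. A common approach is to estimate the mean and variance of the whole training set with moving averages and to use them at prediction time to obtain deterministic outputs. So, like the dropout layer, the batch normalization layer computes different results in training mode and in prediction mode.

That is, every mini-batch that passes through the layer updates the BN layer's moving_mean and moving_var once. After all mini-batches have passed through and all updates are done, the resulting values are the mean and variance that the BN layer uses in prediction mode.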
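
The moving-average bookkeeping can be sketched as follows. This is a NumPy illustration under stated assumptions: synthetic data, a batch size of 64, momentum 0.9, and gamma = 1, beta = 0 left out for brevity. The helper `bn_predict` is a hypothetical name. It shows that in prediction mode a sample's output is the same whether it is processed alone or inside a larger batch.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(6400, 2))

momentum, eps = 0.9, 1e-5
moving_mean = np.zeros(2)
moving_var = np.ones(2)

# Training: every mini-batch updates the moving statistics once
for start in range(0, len(data), 64):
    batch = data[start:start + 64]
    moving_mean = momentum * moving_mean + (1 - momentum) * batch.mean(axis=0)
    moving_var = momentum * moving_var + (1 - momentum) * batch.var(axis=0)

def bn_predict(X):
    # Prediction: use the stored statistics, never those of the current batch
    return (X - moving_mean) / np.sqrt(moving_var + eps)

alone = bn_predict(data[:1])
in_batch = bn_predict(data[:32])[:1]
print(np.allclose(alone, in_batch))  # True
print(moving_mean, moving_var)
```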

## Implementation from Scratch

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
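def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    if not is_training:
        # Prediction mode: use the moving-average mean and variance directly
        X_hat = (X - moving_mean) / np.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: mean and variance per feature, over the batch axis
            mean = X.mean(axis=0)
            var = ((X - mean)**2).mean(axis=0)
        else:
            # Conv2D: mean and variance per channel, over the batch and spatial axes.
            # axis=(0, 2, 3) assumes a channels-first (N, C, H, W) layout;
            # keepdims=True keeps the result broadcastable against X
            mean = X.mean(axis=(0, 2, 3), keepdims=True)
            var = ((X - mean) ** 2).mean(axis=(0, 2, 3), keepdims=True)

        # Standardize with the statistics of the current mini-batch
        X_hat = (X - mean) / np.sqrt(var + eps)

        # Update the moving averages of the mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var

    # Scale and shift
    Y = gamma * X_hat + beta
    return Y, moving_mean, moving_var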

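Next, we define a custom `BatchNormalization` layer. It holds the scale parameter gamma and the shift parameter beta, which take part in gradient computation and updates, and it also maintains the moving-average mean and variance so that they can be used at prediction time. In the implementation below the number of features is not passed in explicitly: `build` reads it from the last axis of the input shape (the number of outputs for a fully connected layer, the number of output channels for a channels-last convolutional layer), so no `num_features` or `num_dims` argument is needed.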

In [3]:
class BatchNormalization(tf.keras.layers.Layer):
    def __init__(self, decay=0.9, epsilon=1e-5, **kwargs):
        self.decay = decay
        self.epsilon = epsilon
        super(BatchNormalization, self).__init__(**kwargs)

    def build(self, input_shape):
        self.gamma = self.add_weight(name='gamma',
                                     shape=[input_shape[-1], ],
                                     initializer=tf.initializers.ones,
                                     trainable=True)
        self.beta = self.add_weight(name='beta',
                                    shape=[input_shape[-1], ],
                                    initializer=tf.initializers.zeros,
                                    trainable=True)
        self.moving_mean = self.add_weight(name='moving_mean',
                                           shape=[input_shape[-1], ],
                                           initializer=tf.initializers.zeros,
                                           trainable=False)
        self.moving_variance = self.add_weight(name='moving_variance',
                                               shape=[input_shape[-1], ],
                                               initializer=tf.initializers.ones,
                                               trainable=False)
        # super(BatchNormalization, self).build(input_shape)

    def assign_moving_average(self, variable, value):
        """
        variable = variable * decay + value * (1 - decay)
        """
        delta = variable * self.decay + value * (1 - self.decay)
        return variable.assign(delta)

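    # @tf.function
    def call(self, inputs, training=None):
        if training:
            # tf.nn.moments returns the mean and variance of inputs along the given axes
            # (every axis except the last, so one statistic per feature/channel)
            batch_mean, batch_variance = tf.nn.moments(x=inputs, axes=list(range(len(inputs.shape) - 1)))
            
            # Fold this batch's statistics into the moving mean and variance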
            mean_update = self.assign_moving_average(self.moving_mean, batch_mean)
            variance_update = self.assign_moving_average(self.moving_variance, batch_variance)
            self.add_update(mean_update)
            self.add_update(variance_update)
            mean, variance = batch_mean, batch_variance
        else:
            mean, variance = self.moving_mean, self.moving_variance
        output = tf.nn.batch_normalization(inputs,
                                           mean=mean,
                                           variance=variance,
                                           offset=self.beta,
                                           scale=self.gamma,
                                           variance_epsilon=self.epsilon)
        return output

Using BN in LeNet

In [4]:
net = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters=6,kernel_size=5),
    BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
    tf.keras.layers.Conv2D(filters=16,kernel_size=5),
    BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120),
    BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(84),
    BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(10,activation='sigmoid')
])

In [5]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape((60000, 28, 28, 1)).astype('float32') / 255
x_test = x_test.reshape((10000, 28, 28, 1)).astype('float32') / 255

net.compile(loss='sparse_categorical_crossentropy',
            optimizer=tf.keras.optimizers.RMSprop(),
            metrics=['accuracy'])
history = net.fit(x_train, y_train,
                    batch_size=64,
                    epochs=5,
                    validation_split=0.2)
test_scores = net.evaluate(x_test, y_test, verbose=2)
print('Test loss:', test_scores[0])
print('Test accuracy:', test_scores[1])

Train on 48000 samples, validate on 12000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
10000/10000 - 1s - loss: 0.1107 - accuracy: 0.9640
Test loss: 0.1107352244278416
Test accuracy: 0.964


## Concise Implementation

In [6]:
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, MaxPool2D, Flatten, Dense

In [7]:
# Build the network
# BN is applied between a layer and its activation function, so Conv2D is defined without an activation and the Activation layer is added after BN

net = tf.keras.Sequential([
    Conv2D(filters=6, kernel_size=5),
    BatchNormalization(),
    Activation('sigmoid'),
    MaxPool2D(pool_size=2, strides=2),
    Conv2D(filters=16, kernel_size=5),
    BatchNormalization(),
    Activation('sigmoid'),
    MaxPool2D(pool_size=2, strides=2),
    Flatten(),
    Dense(120),
    BatchNormalization(),
    Activation('sigmoid'),
    Dense(84),
    BatchNormalization(),
    Activation('sigmoid'),
    Dense(10, activation='sigmoid')
])

In [8]:
# Train the model

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape((60000, 28, 28, 1)).astype('float32') / 255
x_test = x_test.reshape((10000, 28, 28, 1)).astype('float32') / 255

net.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.RMSprop(),
              metrics=['accuracy'])
history = net.fit(x_train, y_train,
                    batch_size=64,
                    epochs=5,
                    validation_split=0.2)
test_scores = net.evaluate(x_test, y_test, verbose=2)
print('Test loss:', test_scores[0])
print('Test accuracy:', test_scores[1])

Train on 48000 samples, validate on 12000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
10000/10000 - 1s - loss: 0.0623 - accuracy: 0.9809
Test loss: 0.062268997953645884
Test accuracy: 0.9809


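# ResNet (Residual Network)

## Introduction

In theory, a deeper NN has more adjustable parameters and so should abstract features at a higher level, but in practice this does not hold. Deep networks suffer from vanishing/exploding gradients, overfitting, and the degradation problem (for example, each Conv layer of a CNN compresses information lossily, so many stacked convolutions lose information and the training error grows). As a result, in practice the training error often rises rather than falls after too many layers are added. Residual networks were designed to solve this problem.

The left side of the figure below shows an ordinary plain (non-residual) network structure. The right side shows a residual block, which adds a cross-layer data path so that the input can propagate forward faster through the shortcut.

Note: in later versions of ResNet, the Conv+BN+ReLU structure of the residual block was changed to BN+ReLU+Conv.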

<img src="img/class8_2.1.svg" style="zoom:100%">

## Implementation

In [9]:
import tensorflow as tf
from tensorflow.keras import layers, activations

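### Residual Block

The residual block below computes Y = X -> Conv -> BN -> ReLU -> Conv -> BN; output = ReLU(Y + X)

The X in the output can optionally be passed through a 1x1 Conv first, or left as is. The 1x1 Conv is needed when the number of channels or the stride changes, so that X and Y have the same shape and can be added.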
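
The role of the shortcut can be illustrated without any convolution. In the NumPy sketch below, `f` stands in for the Conv -> BN -> ReLU -> Conv -> BN branch and `shortcut` stands in for the optional 1x1 Conv. Both are hypothetical placeholders (plain functions on feature vectors), not the layers used in this document.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, f, shortcut=None):
    # output = relu(f(x) + x), with x optionally projected first
    skip = x if shortcut is None else shortcut(x)
    return relu(f(x) + skip)

x = np.array([[0.5, 1.0, 2.0]])

# If the residual branch outputs zeros, the block passes x through unchanged
# (x is non-negative here, so the final ReLU does not alter it)
zero_branch = lambda t: np.zeros_like(t)
print(residual_block(x, zero_branch))  # [[0.5 1.  2. ]]

# When the branch changes the number of features (3 -> 2), x must be
# projected to the same shape before the addition
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
branch = lambda t: np.zeros((t.shape[0], 2))
project = lambda t: t @ W
print(residual_block(x, branch, project).shape)  # (1, 2)
```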

In [16]:
class Residual(tf.keras.Model):
    def __init__(self, num_channels, use_1x1conv=False, strides=1, **kwargs):
        super(Residual, self).__init__(**kwargs)
        self.conv1 = layers.Conv2D(num_channels,
                                   padding='same',
                                   kernel_size=3,
                                   strides=strides)
        self.conv2 = layers.Conv2D(num_channels, 
                                   kernel_size=3,
                                   padding='same')
        if use_1x1conv:
            self.conv3 = layers.Conv2D(num_channels,
                                       kernel_size=1,
                                       strides=strides)
        else:
            self.conv3 = None
        self.bn1 = layers.BatchNormalization()
        self.bn2 = layers.BatchNormalization()

    def call(self, X):
        Y = activations.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        return activations.relu(Y + X)

# Output shape of the residual block
# Both 3x3 Convs use padding='same' and strides defaults to 1, so the output shape matches the input.
blk = Residual(num_channels=3, use_1x1conv=True)
X = tf.random.uniform((4, 6, 6, 3))
blk(X).shape

TensorShape([4, 6, 6, 3])

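### Residual Network

`ResnetBlock` uses `num_residuals` to decide how many residual blocks are chained together. Except in the first `ResnetBlock` (`first_block=True`), the first residual block uses a 1x1 Conv on the shortcut with strides=2, which halves the height and width.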

In [17]:
class ResnetBlock(tf.keras.layers.Layer):
    def __init__(self,num_channels, num_residuals, first_block=False,**kwargs):
        super(ResnetBlock, self).__init__(**kwargs)
        self.listLayers=[]
        for i in range(num_residuals):
            if i == 0 and not first_block:
                self.listLayers.append(Residual(num_channels, use_1x1conv=True, strides=2))
            else:
                self.listLayers.append(Residual(num_channels))      

    def call(self, X):
        # Iterate over the list of residual blocks directly
        for layer in self.listLayers:
            X = layer(X)
        return X

In [19]:
class ResNet(tf.keras.Model):
    def __init__(self,num_blocks,**kwargs):
        super(ResNet, self).__init__(**kwargs)
        self.conv=layers.Conv2D(64, kernel_size=7, strides=2, padding='same')
        self.bn=layers.BatchNormalization()
        self.relu=layers.Activation('relu')
        self.mp=layers.MaxPool2D(pool_size=3, strides=2, padding='same')
        self.resnet_block1=ResnetBlock(64,num_blocks[0], first_block=True)
        self.resnet_block2=ResnetBlock(128,num_blocks[1])
        self.resnet_block3=ResnetBlock(256,num_blocks[2])
        self.resnet_block4=ResnetBlock(512,num_blocks[3])
        self.gap=layers.GlobalAvgPool2D()
        self.fc=layers.Dense(units=10,activation=tf.keras.activations.softmax)

    def call(self, x):
        x=self.conv(x)
        x=self.bn(x)
        x=self.relu(x)
        x=self.mp(x)
        x=self.resnet_block1(x)
        x=self.resnet_block2(x)
        x=self.resnet_block3(x)
        x=self.resnet_block4(x)
        x=self.gap(x)
        x=self.fc(x)
        return x

res_net=ResNet([2,2,2,2])

In [20]:
# Inspect the output shape of each layer in the network
res_net = ResNet([2,2,2,2])
X = tf.random.uniform(shape=(1, 224, 224, 1))
print('input shape:\t', X.shape)
for layer in res_net.layers:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)

input shape:	 (1, 224, 224, 1)
conv2d_70 output shape:	 (1, 112, 112, 64)
batch_normalization_63 output shape:	 (1, 112, 112, 64)
activation_11 output shape:	 (1, 112, 112, 64)
max_pooling2d_7 output shape:	 (1, 56, 56, 64)
resnet_block_12 output shape:	 (1, 56, 56, 64)
resnet_block_13 output shape:	 (1, 28, 28, 128)
resnet_block_14 output shape:	 (1, 14, 14, 256)
resnet_block_15 output shape:	 (1, 7, 7, 512)
global_average_pooling2d_3 output shape:	 (1, 512)
dense_9 output shape:	 (1, 10)
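
The spatial sizes printed above can be reproduced by hand. The sketch below assumes that every stride-2 layer in this network maps a size `n` to `ceil(n / 2)`, which holds for `padding='same'` and also for the 1x1 stride-2 shortcut. The helper names are illustrative.

```python
import math

def halve(size):
    # Output size of a stride-2 layer with 'same' padding
    return math.ceil(size / 2)

def resnet_shapes(size, channels=(64, 128, 256, 512)):
    shapes = []
    size = halve(size)                      # 7x7 Conv, strides=2
    shapes.append(("conv", size, 64))
    size = halve(size)                      # 3x3 MaxPool, strides=2
    shapes.append(("max_pool", size, 64))
    for i, ch in enumerate(channels):
        if i > 0:                           # every stage except the first halves H and W
            size = halve(size)
        shapes.append((f"resnet_block{i + 1}", size, ch))
    return shapes

for name, size, ch in resnet_shapes(224):
    print(name, (1, size, size, ch))
```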


In [87]:
# Load data + train model + evaluate model

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape((60000, 28, 28, 1)).astype('float32') / 255
x_test = x_test.reshape((10000, 28, 28, 1)).astype('float32') / 255

res_net.compile(loss='sparse_categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

# res_net.fit(x_train, y_train,
#             batch_size=64,
#             epochs=5,
#             validation_split=0.2)

# Training on CPU takes a long time, so the model was trained on a GPU and the weights are loaded here
res_net.load_weights('files/class8_2_weights.h5')

# Evaluate the model
test_scores = res_net.evaluate(x_test, y_test, verbose=2)

10000/10000 - 6s - loss: 0.3930 - accuracy: 0.8598


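# DenseNet (Densely Connected Network)

## Introduction

DenseNet is a variant of ResNet. The main difference is shown in the figure below, with ResNet on the left and DenseNet on the right. ResNet's cross-layer connection uses addition, whereas DenseNet's cross-layer connection uses concatenation along the channel dimension.

The main components of DenseNet are the dense block and the transition layer. The former defines how inputs and outputs are concatenated; the latter controls the number of channels so that it does not grow too large.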
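
The difference between the two kinds of cross-layer connection shows up directly in the output shapes. A minimal NumPy sketch with made-up channels-last tensors, where `fx` stands in for the output of the block's layers:

```python
import numpy as np

x = np.ones((1, 4, 4, 3))          # block input
fx = np.full((1, 4, 4, 3), 2.0)    # stand-in for the output of the block's layers

# ResNet-style connection: elementwise addition, channel count unchanged
res_out = x + fx
print(res_out.shape)    # (1, 4, 4, 3)

# DenseNet-style connection: concatenation on the channel axis, channels add up
dense_out = np.concatenate([x, fx], axis=-1)
print(dense_out.shape)  # (1, 4, 4, 6)
```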

<img src="img/class8_3.1.svg" style="zoom:100%">

## Implementation

In [2]:
import tensorflow as tf
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dropout, Activation, concatenate, AveragePooling2D

### BottleNeck

In [3]:
# `BottleNeck` implements the "BN+Activation+Conv" structure (the revised ResNet ordering noted in the ResNet introduction)

class BottleNeck(tf.keras.layers.Layer):
    def __init__(self, growth_rate, drop_rate):
        super(BottleNeck, self).__init__()
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.conv1 = tf.keras.layers.Conv2D(filters=4 * growth_rate,
                                            kernel_size=(1, 1),
                                            strides=1,
                                            padding="same")
        self.bn2 = tf.keras.layers.BatchNormalization()
        self.conv2 = tf.keras.layers.Conv2D(filters=growth_rate,
                                            kernel_size=(3, 3),
                                            strides=1,
                                            padding="same")
        self.dropout = tf.keras.layers.Dropout(rate=drop_rate)

        self.listLayers = [self.bn1,
                           tf.keras.layers.Activation("relu"),
                           self.conv1,
                           self.bn2,
                           tf.keras.layers.Activation("relu"),
                           self.conv2,
                           self.dropout]

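    def call(self, x):
        y = x
        # Iterate over the list of sub-layers directly
        for layer in self.listLayers:
            y = layer(y)
        # Concatenate the input and the new features along the channel axis
        y = tf.keras.layers.concatenate([x,y], axis=-1)
        return y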

### DenseBlock

In [4]:
# A DenseBlock is made of BottleNecks, each adding the same number of output channels (growth_rate)

class DenseBlock(tf.keras.layers.Layer):
    def __init__(self, num_layers, growth_rate, drop_rate=0.5):
        super(DenseBlock, self).__init__()
        self.num_layers = num_layers
        self.growth_rate = growth_rate
        self.drop_rate = drop_rate
        self.listLayers = []
        for _ in range(num_layers):
            self.listLayers.append(BottleNeck(growth_rate=self.growth_rate, drop_rate=self.drop_rate))

    def call(self, x):
        # Iterate over the list of BottleNecks directly
        for layer in self.listLayers:
            x = layer(x)
        return x

In [6]:
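# The example below defines a DenseBlock of 2 chained BottleNecks, each with growth_rate 10
# With a 3-channel input, the output has 3 + 2 * 10 = 23 channels. The channel count of each conv block controls how much the output channel count grows relative to the input
# It is therefore called the growth rate
# In other words, each BottleNeck outputs (its input channels + growth_rate) channels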

blk = DenseBlock(2, 10)
X = tf.random.uniform((4, 8, 8, 3))
Y = blk(X)
print(Y.shape)

(4, 8, 8, 23)
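
The channel count after each BottleNeck can be tracked with a few lines of plain Python. The function name `dense_block_channels` is illustrative.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    # Each BottleNeck concatenates growth_rate new channels onto its input
    channels = [in_channels]
    for _ in range(num_layers):
        channels.append(channels[-1] + growth_rate)
    return channels

print(dense_block_channels(3, 2, 10))  # [3, 13, 23]
```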


### Transition Layer

In [104]:
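# Each DenseBlock increases the number of channels, and using too many makes the model overly complex. Transition layers are used to control model complexity.
# A transition layer uses a 1x1 conv layer to reduce the number of channels and an average pooling layer with strides=2 to halve the height and width, further reducing model complexity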

class TransitionLayer(tf.keras.layers.Layer):
    def __init__(self, out_channels):
        super(TransitionLayer, self).__init__()
        self.bn = tf.keras.layers.BatchNormalization()
        self.conv = tf.keras.layers.Conv2D(filters=out_channels,
                                           kernel_size=(1, 1),
                                           strides=1,
                                           padding="same")
        self.pool = tf.keras.layers.AveragePooling2D(pool_size=(2, 2),
                                                     strides=2,
                                                     padding="same")

    def call(self, inputs):
        x = self.bn(inputs)
        x = tf.keras.activations.relu(x)
        x = self.conv(x)
        x = self.pool(x)
        return x


In [105]:
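# Feed the Y from the DenseBlock example above into a transition layer and check how the shape changes
# Height and width are halved (AvgPool uses strides=2); channels become 10 because out_channels is set to 10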

blk = TransitionLayer(10)
blk(Y).shape

TensorShape([4, 4, 4, 10])

### DenseNet

In [106]:
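# num_init_features: number of filters of the first Conv in the network
# growth_rate: the growth_rate of every BottleNeck in each dense block (the number of channels added by each BottleNeck)
# compression_rate: the ratio by which a TransitionLayer compresses the DenseBlock's channel count
#   e.g. with compression_rate=0.5, if the DenseBlock outputs a channels, the TransitionLayer outputs 0.5a channels
# block_layers: each element is the number of BottleNecks in one DenseBlock
# drop_rate: the drop probability of every Dropout layer in every DenseBlock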

class DenseNet(tf.keras.Model):
    def __init__(self, num_init_features, growth_rate, block_layers, compression_rate, drop_rate):
        super(DenseNet, self).__init__()
        self.conv = tf.keras.layers.Conv2D(filters=num_init_features,
                                           kernel_size=(7, 7),
                                           strides=2,
                                           padding="same")
        self.bn = tf.keras.layers.BatchNormalization()
        self.pool = tf.keras.layers.MaxPool2D(pool_size=(3, 3),
                                              strides=2,
                                              padding="same")
        self.num_channels = num_init_features
        self.dense_block_1 = DenseBlock(num_layers=block_layers[0], growth_rate=growth_rate, drop_rate=drop_rate)
        self.num_channels += growth_rate * block_layers[0]
        self.num_channels = compression_rate * self.num_channels
        self.transition_1 = TransitionLayer(out_channels=int(self.num_channels))
        self.dense_block_2 = DenseBlock(num_layers=block_layers[1], growth_rate=growth_rate, drop_rate=drop_rate)
        self.num_channels += growth_rate * block_layers[1]
        self.num_channels = compression_rate * self.num_channels
        self.transition_2 = TransitionLayer(out_channels=int(self.num_channels))
        self.dense_block_3 = DenseBlock(num_layers=block_layers[2], growth_rate=growth_rate, drop_rate=drop_rate)
        self.num_channels += growth_rate * block_layers[2]
        self.num_channels = compression_rate * self.num_channels
        self.transition_3 = TransitionLayer(out_channels=int(self.num_channels))
        self.dense_block_4 = DenseBlock(num_layers=block_layers[3], growth_rate=growth_rate, drop_rate=drop_rate)

        self.avgpool = tf.keras.layers.GlobalAveragePooling2D()
        self.fc = tf.keras.layers.Dense(units=10,
                                        activation=tf.keras.activations.softmax)

    def call(self, inputs):
        x = self.conv(inputs)
        x = self.bn(x)
        x = tf.keras.activations.relu(x)
        x = self.pool(x)

        x = self.dense_block_1(x)
        x = self.transition_1(x)
        x = self.dense_block_2(x)
        x = self.transition_2(x)
        x = self.dense_block_3(x)
        x = self.transition_3(x)
        x = self.dense_block_4(x)

        x = self.avgpool(x)
        x = self.fc(x)

        return x

dense_net = DenseNet(num_init_features=64, 
                     growth_rate=32, 
                     block_layers=[4,4,4,4], 
                     compression_rate=0.5, 
                     drop_rate=0.5)

In [107]:
# Check the output shapes
X = tf.random.uniform(shape=(1, 96, 96, 1))
for layer in dense_net.layers:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)

conv2d_333 output shape:	 (1, 48, 48, 64)
batch_normalization_323 output shape:	 (1, 48, 48, 64)
max_pooling2d_21 output shape:	 (1, 24, 24, 64)
dense_block_28 output shape:	 (1, 24, 24, 192)
transition_layer_20 output shape:	 (1, 12, 12, 96)
dense_block_29 output shape:	 (1, 12, 12, 224)
transition_layer_21 output shape:	 (1, 6, 6, 112)
dense_block_30 output shape:	 (1, 6, 6, 240)
transition_layer_22 output shape:	 (1, 3, 3, 120)
dense_block_31 output shape:	 (1, 3, 3, 248)
global_average_pooling2d_9 output shape:	 (1, 248)
dense_15 output shape:	 (1, 10)
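
The shapes printed above follow from the growth rate and the compression rate. The sketch below redoes that bookkeeping in plain Python, assuming every stride-2 layer with `padding="same"` maps a size `n` to `ceil(n / 2)`. The function name `densenet_shapes` is illustrative.

```python
import math

def densenet_shapes(size, num_init_features=64, growth_rate=32,
                    block_layers=(4, 4, 4, 4), compression_rate=0.5):
    half = lambda s: math.ceil(s / 2)
    size = half(half(size))                 # Conv strides=2, then MaxPool strides=2
    channels = num_init_features
    trace = []
    for i, n in enumerate(block_layers):
        channels += growth_rate * n         # each BottleNeck adds growth_rate channels
        trace.append((f"dense_block_{i + 1}", size, channels))
        if i < len(block_layers) - 1:       # no transition after the last block
            channels = int(compression_rate * channels)
            size = half(size)
            trace.append((f"transition_{i + 1}", size, channels))
    return trace

for name, size, ch in densenet_shapes(96):
    print(name, (1, size, size, ch))
```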


In [108]:
# Load data + train model + evaluate model

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape((60000, 28, 28, 1)).astype('float32') / 255
x_test = x_test.reshape((10000, 28, 28, 1)).astype('float32') / 255

dense_net.compile(loss='sparse_categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

# dense_net.fit(x_train, y_train,
#               batch_size=64,
#               epochs=5,
#               validation_split=0.2)

# Training on CPU takes a long time, so the model was trained on a GPU and the weights are loaded here
dense_net.load_weights('files/class8_3_weights.h5')

# Evaluate the model
test_scores = dense_net.evaluate(x_test, y_test, verbose=2)

10000/10000 - 5s - loss: 0.5964 - accuracy: 0.8231


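# Key Takeaways

- BN
    - BN makes the output of every layer more stable. (Without BN, the outputs of deep hidden layers can change drastically, which makes it hard to train a good model.)
    - During training
        - Compute the mean and var of the layer's output
        - Y_tmp = (output - mean) / sqrt(var + epsilon). This is the standardization step; epsilon is a small positive number that keeps the denominator from being 0.
        - Y = gamma * Y_tmp + beta. gamma and beta are two learnable parameters (not hyperparameters) for scale and shift; they are learned during training.
        - Then apply the activation function to Y and pass it to the next layer. This completes the BN layer's work for this batch in this epoch
    - During prediction
        - The layer's output is still standardized in the BN layer, but using the moving_mean and moving_var accumulated during training rather than the mean and var of the current output
            - When the model is initialized, the BN layer's moving_mean is initialized to 0 and moving_var to 1 (as in the custom Keras layer above). After that, each time a batch passes through the layer during training
            - moving_mean = momentum * moving_mean + (1 - momentum) * mean of the current output
            - moving_var = momentum * moving_var + (1 - momentum) * var of the current output
            - momentum is the weight kept on the previous moving value: the larger it is, the less the current batch's mean and var affect the moving mean and moving variance. The examples use momentum = 0.9
        - Y_tmp = (output - moving_mean) / sqrt(moving_var + epsilon). This is the standardization step; epsilon is a small positive number that keeps the denominator from being 0.
        - Y = gamma * Y_tmp + beta. gamma and beta are the parameters learned during training
        - Then apply the activation function to Y and pass it to the next layer
    - Conclusion
        - Because prediction uses the stored moving_mean and moving_var instead of batch statistics, each sample's output does not depend on which batch it is in or on the batch size used at prediction time. What matters is that the BN layer runs in prediction mode rather than training mode.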

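- ResNet
    - Residual networks address the degradation problem in deep NNs (shortcut connections carry the input from earlier layers across layers to the current one)
    - ResBlock
        - Y = X -> Conv -> BN -> ReLU -> Conv -> BN; output = ReLU(Y + X)
    - ResNet
        - Conv + BN + Activation + MaxPool + ResnetBlock*4 + GlobalAvgPool + Dense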

- DenseNet
    - BottleNeck
        - Y = X -> BN -> ReLU -> Conv(filters=4\*growth_rate) -> BN -> ReLU -> Conv(filters=growth_rate) -> DropOut; output = [X, Y]
    - DenseBlock
        - n * BottleNeck
    - TransitionLayer
        - BN -> ReLU -> 1x1Conv(filters=input_channels*compression_rate) -> AvgPool
    - DenseNet
        - Conv -> BN -> ReLU -> MaxPool -> (DenseBlock + TransitionLayer) * 3 -> DenseBlock -> GlobalAvgPool -> Dense
    - Pay attention to growth_rate and compression_rate in DenseNet
        - A DenseBlock contains several BottleNecks, all with the same growth_rate, which is the number of channels added by each BottleNeck. For example, with a input channels and growth_rate=b, the output of one BottleNeck has a+b channels
        - A TransitionLayer has a compression_rate, the ratio by which it compresses the number of input channels. For example, with a input channels and compression_rate=0.5, the TransitionLayer outputs 0.5a channels