## Network in Network (NiN)

* 2014

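LeNet, AlexNet, and VGG share a common design: a module built from convolutional layers first extracts spatial features, and a module built from fully connected layers then outputs the classification result.  
The improvements of AlexNet and VGG over LeNet lie mainly in making these two modules ``wider`` and ``deeper``.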

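### (1) NiN block
``Use 1x1 convolutional layers in place of fully connected layers``.  
A NiN block consists of ``one convolutional layer followed by two 1x1 convolutional layers``. The hyperparameters of the first convolutional layer can be chosen freely, while those of the second and third are generally fixed.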


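The computation of a 1x1 convolutional layer happens mainly along the channel dimension.  
It can be viewed as treating the ``channel dimension as the feature dimension`` and the elements along the ``height and width dimensions`` as ``data samples``, so it acts like a fully connected layer applied independently at every pixel position.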
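
The equivalence between a 1x1 convolution and a per-pixel fully connected layer can be checked numerically. The following is a minimal sketch, assuming small illustrative sizes (3 input channels, 2 output channels, a 5x5 feature map) that are not taken from the NiN model itself: the weights of a `nn.Conv2d` with `kernel_size=1` are copied into a `nn.Linear`, which is then applied along the channel dimension.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes only: 3 input channels, 2 output channels.
conv = nn.Conv2d(3, 2, kernel_size=1)
fc = nn.Linear(3, 2)

# Copy the 1x1 kernel (shape [2, 3, 1, 1]) into the linear layer (shape [2, 3]).
with torch.no_grad():
    fc.weight.copy_(conv.weight.view(2, 3))
    fc.bias.copy_(conv.bias)

x = torch.rand(4, 3, 5, 5)  # (batch, channels, height, width)
y_conv = conv(x)

# Move channels last so that Linear treats every pixel as one sample,
# then move the channel dimension back to position 1.
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

same = torch.allclose(y_conv, y_fc, atol=1e-5)
print(y_conv.shape, same)
```

Both paths produce a tensor of the same shape, and the values agree up to floating-point tolerance.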

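#### Local structure of the networks:  
AlexNet, VGG: convolutional layer -> convolutional layer -> fully connected layer -> fully connected layer  
NiN: convolutional layer -> 1x1 convolutional layer -> convolutional layer -> 1x1 convolutional layer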

In [1]:
import time
import torch
from torch import nn, optim
import torchvision
import torchvision.transforms as transforms
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
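def nin_block(in_channels, out_channels, kernel_size, stride, padding):
    # One configurable convolution followed by two 1x1 convolutions,
    # each followed by a ReLU activation.
    blk = nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding),
        nn.ReLU(),
        # The 1x1 convolutions keep the channel count and spatial size unchanged.
        nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU()
    )
    return blk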

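### (2) NiN model
* NiN uses convolutional layers with window shapes of 11, 5, and 3, with the same numbers of output channels as in AlexNet. Each NiN block is followed by a max pooling layer with stride 2 and window shape 3.
* NiN drops the last 3 fully connected layers of AlexNet. Instead, it uses a NiN block whose ``number of output channels equals the number of label classes``, then applies ``global average pooling`` to average all elements in each channel and ``uses the result directly for classification``.  
Global average pooling here means an average pooling layer whose window shape equals the spatial shape of its input.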
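
As a quick check of this definition, the sketch below compares three ways of computing a global average over the spatial dimensions: `F.avg_pool2d` with a window equal to the input's spatial shape, a plain `mean` over height and width, and PyTorch's built-in `nn.AdaptiveAvgPool2d(1)`. The input size (2 samples, 10 channels, 5x5) is an assumption chosen only for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(2, 10, 5, 5)  # (batch, channels, height, width)

# Average pooling whose window covers the whole spatial extent of the input.
pooled = F.avg_pool2d(x, kernel_size=x.size()[2:])

# The same quantity written as a mean over the height and width dimensions.
by_mean = x.mean(dim=(2, 3), keepdim=True)

# Built-in layer that pools any input down to a 1x1 spatial size.
adaptive = nn.AdaptiveAvgPool2d(1)(x)

print(pooled.shape)
print(torch.allclose(pooled, by_mean, atol=1e-6),
      torch.allclose(pooled, adaptive, atol=1e-6))
```

All three results have shape `(2, 10, 1, 1)`: one averaged value per channel, which a following `nn.Flatten()` turns into one score per class.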

In [3]:
import torch.nn.functional as F
class GlobalAvgPool2d(nn.Module):
    # Global average pooling: the pooling window equals the input's height and width.
    def __init__(self):
        super().__init__()
    def forward(self, x):
        return F.avg_pool2d(x, kernel_size=x.size()[2:])

net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, stride=4, padding=0),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nin_block(96, 256, kernel_size=5, stride=1, padding=2),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nin_block(256, 384, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Dropout(0.5),
    # The number of label classes is 10
    nin_block(384, 10, kernel_size=3, stride=1, padding=1),
    GlobalAvgPool2d(),
    nn.Flatten()  # (batch, 10, 1, 1) -> (batch, 10)
)

In [4]:
print(net)

Sequential(
  (0): Sequential(
    (0): Conv2d(1, 96, kernel_size=(11, 11), stride=(4, 4))
    (1): ReLU()
    (2): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1))
    (3): ReLU()
    (4): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1))
    (5): ReLU()
  )
  (1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (2): Sequential(
    (0): Conv2d(96, 256, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
    (3): ReLU()
    (4): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
    (5): ReLU()
  )
  (3): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (4): Sequential(
    (0): Conv2d(256, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
    (3): ReLU()
    (4): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
    (5): ReLU()
  )
  (5): MaxPool2d(kernel_size=3, stri

In [5]:
X = torch.rand(1, 1, 224, 224)
for name, blk in net.named_children():
    X = blk(X)
    print(name, 'output shape:', X.shape)


0 output shape: torch.Size([1, 96, 54, 54])
1 output shape: torch.Size([1, 96, 26, 26])
2 output shape: torch.Size([1, 256, 26, 26])
3 output shape: torch.Size([1, 256, 12, 12])
4 output shape: torch.Size([1, 384, 12, 12])
5 output shape: torch.Size([1, 384, 5, 5])
6 output shape: torch.Size([1, 384, 5, 5])
7 output shape: torch.Size([1, 10, 5, 5])
8 output shape: torch.Size([1, 10, 1, 1])
9 output shape: torch.Size([1, 10])
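
The spatial sizes printed above can be reproduced by hand with the usual output-size formula for convolution and pooling layers, `floor((n + 2p - k) / s) + 1`, where `n` is the input size, `k` the window size, `s` the stride, and `p` the padding. The sketch below applies it to the layer settings of `net`; the helper name `out_size` is hypothetical. The 1x1 convolutions and the dropout layer are omitted because they leave the spatial size unchanged.

```python
def out_size(n, k, s=1, p=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * p - k) // s + 1

# (kernel, stride, padding) of each size-changing layer in `net`, in order.
layers = [
    (11, 4, 0),  # first NiN block
    (3, 2, 0),   # max pooling
    (5, 1, 2),   # second NiN block
    (3, 2, 0),   # max pooling
    (3, 1, 1),   # third NiN block
    (3, 2, 0),   # max pooling
    (3, 1, 1),   # final NiN block with 10 output channels
]

n = 224
sizes = []
for k, s, p in layers:
    n = out_size(n, k, s, p)
    sizes.append(n)

print(sizes)  # [54, 26, 26, 12, 12, 5, 5]
```

The final 5x5 feature map is what the global average pooling layer reduces to a single value per channel.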


### (3) Loading the data

In [6]:
resize = 224
trans = []
trans.append(torchvision.transforms.Resize(size=resize))
trans.append(torchvision.transforms.ToTensor())
transform = torchvision.transforms.Compose(trans) # chain the two transforms together

mnist_train = torchvision.datasets.FashionMNIST(root='~/Datasets/FashionMNIST', train=True, download=True, transform=transform)
mnist_test = torchvision.datasets.FashionMNIST(root='~/Datasets/FashionMNIST', train=False, download=True, transform=transform)

batch_size = 128

train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True)
test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False)


### (4) Training the model

In [7]:
def train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs):
    net = net.to(device)
    print('train on', device)
    loss = torch.nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, batch_count, start = 0.0, 0.0, 0, 0, time.time()
        for X, y in train_iter:
            X = X.to(device)
            y = y.to(device)
            # print("y.shape", y.shape) # [128]
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            train_l_sum += l.cpu().item() # copy the loss to the CPU
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()
            n += y.shape[0]
            batch_count += 1

        net.eval() # evaluation mode (disables dropout)
        with torch.no_grad():
            test_acc_sum, n_test = 0.0, 0 # plain Python numbers, kept in CPU memory
            for X_test, y_test in test_iter:
                # calling .item() on a one-element Tensor returns a Python scalar
                test_acc_sum += (net(X_test.to(device)).argmax(dim=1) == y_test.to(device)).sum().item()
                n_test += y_test.shape[0]
            test_acc = test_acc_sum / n_test
        net.train() # back to training mode

        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
        % (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))
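
The evaluation step inside `train_ch5` could also be factored into a reusable helper. The following is a minimal sketch, not part of the original notebook: the name `evaluate_accuracy` and the toy model and data used to exercise it are assumptions for illustration only.

```python
import torch
from torch import nn

def evaluate_accuracy(data_iter, net, device):
    """Return the fraction of correctly classified samples in data_iter."""
    was_training = net.training  # remember the current mode
    net.eval()                   # disable dropout during evaluation
    correct, total = 0, 0
    with torch.no_grad():
        for X, y in data_iter:
            X, y = X.to(device), y.to(device)
            correct += (net(X).argmax(dim=1) == y).sum().item()
            total += y.shape[0]
    net.train(was_training)      # restore the previous mode
    return correct / total

# Toy check on random data (hypothetical model and inputs).
torch.manual_seed(0)
toy_net = nn.Sequential(nn.Flatten(), nn.Linear(4, 3))
toy_data = [(torch.rand(8, 1, 2, 2), torch.randint(0, 3, (8,))) for _ in range(2)]
acc = evaluate_accuracy(toy_data, toy_net, torch.device('cpu'))
print(acc)
```

Restoring the mode with `net.train(was_training)` keeps the helper safe to call both during training and on a model that is already in evaluation mode.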

### Training

In [8]:
lr, num_epochs = 0.002, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

train on cuda
epoch 1, loss 1.2046, train acc 0.579, test acc 0.767, time 64.9 sec
epoch 2, loss 0.5717, train acc 0.794, test acc 0.809, time 65.3 sec
epoch 3, loss 0.4978, train acc 0.818, test acc 0.824, time 65.6 sec
epoch 4, loss 0.4564, train acc 0.833, test acc 0.837, time 65.5 sec
epoch 5, loss 0.4292, train acc 0.842, test acc 0.832, time 65.7 sec


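* NiN ``repeatedly stacks`` NiN blocks, each made of a ``convolutional layer`` and ``1x1 convolutional layers that replace fully connected layers``, to ``build a deep network``.  
* NiN removes the fully connected output layers, which are prone to overfitting, and replaces them with a NiN block whose number of output channels equals the number of label classes, followed by a global average pooling layer.
* These design ideas influenced a whole series of later convolutional neural network designs.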
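
To give a rough sense of how much smaller the NiN output stage is, the sketch below counts parameters for the final NiN block of `net` (384 -> 10 channels) and for a hypothetical fully connected head on the same 384x5x5 feature map. The hidden width of 4096 is an assumption borrowed from AlexNet-style heads, not something this notebook trains, and the helper names are illustrative.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Final NiN block: one 3x3 convolution and two 1x1 convolutions.
nin_head = conv_params(384, 10, 3) + 2 * conv_params(10, 10, 1)

# Hypothetical fully connected head: 384*5*5 -> 4096 -> 4096 -> 10.
fc_head = (linear_params(384 * 5 * 5, 4096)
           + linear_params(4096, 4096)
           + linear_params(4096, 10))

print(nin_head)  # 34790
print(fc_head)   # 56147978
```

Under these assumptions the fully connected head has over a thousand times more parameters, which is consistent with the overfitting concern raised in the summary.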