# Residual Networks (ResNet)

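- Q: If we add new layers to a neural network model, can the fully trained model only ever reduce the training error more effectively?

    In theory, the solution space of the original model is only a subspace of the solution space of the new model. That is, if we can train the newly added layers into the identity mapping $f(x)=x$, the new model will be as effective as the original one. Because the new model may find a better solution for fitting the training dataset, adding layers would seem to make it easier to reduce the training error. In practice, however, the training error often rises rather than falls once too many layers are added. The problem persists even when the numerical stability provided by batch normalization makes deep models easier to train.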
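The nesting argument can be checked numerically. The sketch below is illustrative only: it uses a small fully connected model (the names `base`, `extra`, and `deeper` and all layer sizes are made up for this example) and sets the appended layer's weight to the identity matrix and its bias to zero by hand, so the deeper model reproduces the shallower one.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small "original" model (hypothetical sizes, for illustration only).
base = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# The newly added layer, set by hand to the identity mapping f(x) = x.
extra = nn.Linear(4, 4)
with torch.no_grad():
    extra.weight.copy_(torch.eye(4))
    extra.bias.zero_()

# The deeper model reuses the same layers and appends the identity layer.
deeper = nn.Sequential(*base, extra)

x = torch.randn(5, 8)
same = torch.allclose(deeper(x), base(x))
print(same)  # True
```

This only shows that such a solution exists in the larger model; it says nothing about whether gradient-based training will actually find it.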
    
## Residual block

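As shown in the figure, let the input be $x$. Suppose the ideal mapping we want to learn is $f(x)$, which serves as the input to the activation function at the top. The part inside the dashed box in the left figure has to fit the mapping $f(x)$ directly, whereas the part inside the dashed box in the right figure has to fit the residual mapping $f(x)-x$ relative to the identity mapping. In practice the residual mapping is often easier to optimize. Take the identity mapping mentioned at the start of this section as the ideal mapping $f(x)$ we want to learn: we only need to learn the weight and bias parameters of the upper weighted operation (such as an affine transformation) inside the dashed box of the right figure to be 0, and $f(x)$ becomes the identity mapping. In practice, when the ideal mapping $f(x)$ is very close to the identity mapping, the residual mapping also readily captures small deviations from the identity. The right figure is also the basic building block of ResNet, the residual block. In a residual block, the input can **propagate forward faster through the cross-layer data path.**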


<img style="float: center;" src="./pics/4.ResNet.png" width=500 height=500>
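The claim that zero weights and biases in the upper weighted operation give the identity mapping can be illustrated with a toy block. This is a simplified sketch, not the block used later in this notebook: `TinyResidual` is a hypothetical name, it uses fully connected layers instead of convolutions, and it omits batch normalization and the final activation so that the identity is exact.

```python
import torch
from torch import nn

class TinyResidual(nn.Module):
    """Toy residual block: output = x + branch(x)."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)  # the "upper" weighted operation

    def branch(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

    def forward(self, x):
        return x + self.branch(x)

torch.manual_seed(0)
blk = TinyResidual(4)

# Set the upper weighted operation's weight and bias to 0.
nn.init.zeros_(blk.fc2.weight)
nn.init.zeros_(blk.fc2.bias)

x = torch.randn(2, 4)
out = blk(x)               # residual form: x + 0
plain_out = blk.branch(x)  # same layers without the shortcut

print(torch.equal(out, x))               # True
print(bool((plain_out == 0).all()))      # True
```

With the shortcut, zeroed parameters give the identity mapping; without it, the same parameters map every input to zero, so a plain block has to learn the identity through its weights.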

In [1]:
import time
import torch
from torch import nn, optim
import torch.nn.functional as F

import sys
sys.path.append("..") 
import d2lzh_pytorch as d2l
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')



In [2]:
class Residual(nn.Module):
    """Residual block: two 3x3 convolutions plus a shortcut connection."""
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        # Optional 1x1 convolution so that the shortcut matches the main
        # path's channel count and spatial size.
        if use_1x1conv:
            self.conv3 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        # Add the shortcut before the final activation.
        return F.relu(Y + X)

In [3]:
blk = Residual(3, 3)
X = torch.rand((4, 3, 6, 6))
blk(X).shape # torch.Size([4, 3, 6, 6])

torch.Size([4, 3, 6, 6])

In [4]:
blk = Residual(3, 6, use_1x1conv=True, stride=2)
blk(X).shape # torch.Size([4, 6, 3, 3])

torch.Size([4, 6, 3, 3])
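The second call needs `use_1x1conv=True` because the main path changes both the channel count and the spatial size, so the raw input can no longer be added to it. The sketch below isolates that step using only `nn.Conv2d`; the variable names `main` and `shortcut` are chosen for this illustration.

```python
import torch
from torch import nn

X = torch.rand(4, 3, 6, 6)

# Main path: a 3x3 convolution with stride 2 and 6 output channels.
main = nn.Conv2d(3, 6, kernel_size=3, padding=1, stride=2)
# Shortcut: a 1x1 convolution with the same stride and output channels.
shortcut = nn.Conv2d(3, 6, kernel_size=1, stride=2)

Y = main(X)
print(Y.shape)            # torch.Size([4, 6, 3, 3])
print(shortcut(X).shape)  # torch.Size([4, 6, 3, 3])

# Without the 1x1 convolution the shapes are incompatible and the
# element-wise addition raises an error.
try:
    Y + X
    add_failed = False
except RuntimeError:
    add_failed = True
print(add_failed)  # True
```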

In [5]:
net = nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
        nn.BatchNorm2d(64), 
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [6]:
def resnet_block(in_channels, out_channels, num_residuals, first_block=False):
    if first_block:
        assert in_channels == out_channels # the first module keeps the channel count equal to that of its input
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(in_channels, out_channels, use_1x1conv=True, stride=2))
        else:
            blk.append(Residual(out_channels, out_channels))
    return nn.Sequential(*blk)

In [7]:
net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
net.add_module("resnet_block2", resnet_block(64, 128, 2))
net.add_module("resnet_block3", resnet_block(128, 256, 2))
net.add_module("resnet_block4", resnet_block(256, 512, 2))

In [8]:
net.add_module("global_avg_pool", d2l.GlobalAvgPool2d()) # GlobalAvgPool2d output: (Batch, 512, 1, 1)
net.add_module("fc", nn.Sequential(d2l.FlattenLayer(), nn.Linear(512, 10))) 
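`d2l.GlobalAvgPool2d` and `d2l.FlattenLayer` come from the `d2lzh_pytorch` helper package. If that package is not available, a head with the same output shapes can be sketched with PyTorch built-ins. This assumes the helpers do nothing more than average over the full spatial extent and flatten to `(batch, features)`, which matches the shape noted in the comment above but has not been checked against the package source.

```python
import torch
from torch import nn

# Stand-in for the "global_avg_pool" and "fc" modules (an assumption,
# not the d2lzh_pytorch implementation).
head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # (batch, 512, H, W) -> (batch, 512, 1, 1)
    nn.Flatten(),                  # (batch, 512, 1, 1) -> (batch, 512)
    nn.Linear(512, 10),
)

feat = torch.rand(2, 512, 7, 7)
pooled = head[0](feat)
logits = head(feat)
print(pooled.shape)  # torch.Size([2, 512, 1, 1])
print(logits.shape)  # torch.Size([2, 10])

# Global average pooling is the mean over the two spatial dimensions.
matches_mean = torch.allclose(pooled.flatten(1), feat.mean(dim=(2, 3)), atol=1e-6)
print(matches_mean)  # True
```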

In [9]:
X = torch.rand((1, 1, 224, 224))
for name, layer in net.named_children():
    X = layer(X)
    print(name, ' output shape:\t', X.shape)

0  output shape:	 torch.Size([1, 64, 112, 112])
1  output shape:	 torch.Size([1, 64, 112, 112])
2  output shape:	 torch.Size([1, 64, 112, 112])
3  output shape:	 torch.Size([1, 64, 56, 56])
resnet_block1  output shape:	 torch.Size([1, 64, 56, 56])
resnet_block2  output shape:	 torch.Size([1, 128, 28, 28])
resnet_block3  output shape:	 torch.Size([1, 256, 14, 14])
resnet_block4  output shape:	 torch.Size([1, 512, 7, 7])
global_avg_pool  output shape:	 torch.Size([1, 512, 1, 1])
fc  output shape:	 torch.Size([1, 10])
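The spatial sizes printed above follow from the usual output-size formula for convolution and pooling layers, $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, kernel size $k$, padding $p$, and stride $s$ (with dilation 1 and floor rounding). The helper `conv_out_size` below is a hypothetical name introduced for this check; it traces one spatial dimension through the first strided layer of each stage.

```python
def conv_out_size(n, kernel_size, stride=1, padding=0):
    """Output size along one spatial dimension of a conv or pooling layer."""
    return (n + 2 * padding - kernel_size) // stride + 1

n = 224
sizes = []

n = conv_out_size(n, kernel_size=7, stride=2, padding=3)  # 7x7 stem convolution
sizes.append(n)
n = conv_out_size(n, kernel_size=3, stride=2, padding=1)  # 3x3 max pooling
sizes.append(n)
n = conv_out_size(n, kernel_size=3, stride=1, padding=1)  # resnet_block1 keeps the size
sizes.append(n)
for _ in range(3):  # resnet_block2..4 each halve the size with a stride-2 convolution
    n = conv_out_size(n, kernel_size=3, stride=2, padding=1)
    sizes.append(n)

print(sizes)  # [112, 56, 56, 28, 14, 7]
```

The remaining 3x3 convolutions in each module use stride 1 and padding 1, so they leave the spatial size unchanged.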


In [10]:
batch_size = 256
# If an "out of memory" error occurs, reduce batch_size or resize
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)

lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

training on  cuda
epoch 1, loss 0.3995, train acc 0.853, test acc 0.849, time 50.2 sec
epoch 2, loss 0.2508, train acc 0.908, test acc 0.904, time 48.0 sec
epoch 3, loss 0.2067, train acc 0.923, test acc 0.885, time 51.3 sec
epoch 4, loss 0.1787, train acc 0.934, test acc 0.917, time 49.6 sec
epoch 5, loss 0.1576, train acc 0.942, test acc 0.914, time 48.7 sec
