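#### Theoretical basis for layer-wise learning rates
Different layers of a deep neural network play different roles in feature extraction and information processing. It is therefore reasonable to expect that applying a different learning strategy to each layer may be more effective:

* 1. Low-level feature extraction: the first few layers usually capture generic low-level features such as edges and textures, which tend to transfer well across tasks.
* 2. High-level semantic understanding: the last few layers tend to extract more abstract, task-related high-level features.
* 3. Task-specific layers: layers such as the fully connected classifier are tied directly to a specific task.

These observations suggest the following learning-rate strategy:

* For the pretrained lower layers, use a small learning rate to preserve the generic features they have already learned.
* For the middle layers, use a moderate learning rate.
* For the task-specific top layers, use a larger learning rate so they adapt quickly to the new task.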
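
The three tiers above can be expressed as optimizer parameter groups. The following is a minimal sketch, not a definitive recipe: the helper `build_param_groups`, the prefix table `PREFIX_LRS`, and the learning-rate values are all assumptions made for illustration, and the prefixes simply follow the top-level module names of a torchvision ResNet.

```python
def build_param_groups(named_params, prefix_lrs, default_lr):
    """Bucket (name, parameter) pairs by name prefix and attach one learning rate per bucket."""
    buckets = {}  # learning rate -> list of parameters
    for name, param in named_params:
        lr = default_lr
        for prefix, prefix_lr in prefix_lrs.items():
            if name.startswith(prefix):
                lr = prefix_lr
                break
        buckets.setdefault(lr, []).append(param)
    return [{"params": params, "lr": lr} for lr, params in buckets.items()]


# Hypothetical three-tier split; the values are placeholders, not tuned settings.
PREFIX_LRS = {
    "conv1": 1e-5, "bn1": 1e-5, "layer1": 1e-5, "layer2": 1e-5,  # lower layers
    "layer3": 1e-4, "layer4": 1e-4,                              # middle layers
    "fc": 1e-3,                                                  # task-specific head
}

# Strings stand in for tensors here so the sketch runs without PyTorch installed.
demo_params = [
    ("conv1.weight", "w0"),
    ("layer3.0.conv1.weight", "w1"),
    ("fc.weight", "w2"),
    ("fc.bias", "b2"),
]
groups = build_param_groups(demo_params, PREFIX_LRS, default_lr=1e-4)
for group in groups:
    print(group["lr"], group["params"])
```

With a real model, the same helper could be called as `build_param_groups(model.named_parameters(), PREFIX_LRS, default_lr=1e-4)` and the result passed directly to an optimizer such as `torch.optim.Adam`.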

#### PyTorch implementation: ResNet as an example

In [1]:
import torch
import torch.nn as nn
import torchvision.models as models

In [2]:
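# Load the pretrained ResNet18 model.
# Note: torchvision >= 0.13 deprecates `pretrained`; the equivalent call is
# models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).
model = models.resnet18(pretrained=True)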

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to C:\Users\Administrator/.cache\torch\hub\checkpoints\resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:07<00:00, 5.87MB/s]


In [9]:
# Visualize the network structure.
import torchsummary
torchsummary.summary(model)

Layer (type:depth-idx)                   Param #
├─Conv2d: 1-1                            9,408
├─BatchNorm2d: 1-2                       128
├─ReLU: 1-3                              --
├─MaxPool2d: 1-4                         --
├─Sequential: 1-5                        --
|    └─BasicBlock: 2-1                   --
|    |    └─Conv2d: 3-1                  36,864
|    |    └─BatchNorm2d: 3-2             128
|    |    └─ReLU: 3-3                    --
|    |    └─Conv2d: 3-4                  36,864
|    |    └─BatchNorm2d: 3-5             128
|    └─BasicBlock: 2-2                   --
|    |    └─Conv2d: 3-6                  36,864
|    |    └─BatchNorm2d: 3-7             128
|    |    └─ReLU: 3-8                    --
|    |    └─Conv2d: 3-9                  36,864
|    |    └─BatchNorm2d: 3-10            128
├─Sequential: 1-6                        --
|    └─BasicBlock: 2-3                   --
|    |    └─Conv2d: 3-11                 73,728
|    |    └─BatchNorm2d: 3-12            25

In [4]:
# Replace the final fully connected layer to fit the new classification task.
num_classes = 10    # Assume the new task has 10 classes.
"""
model.fc
Linear(in_features=512, out_features=1000, bias=True)
"""
model.fc = nn.Linear(model.fc.in_features, num_classes)

In [5]:
print(model)

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

In [7]:
# Group the parameters so that different layers get different learning rates.
backbone_lr = 1e-4
classifier_lr = 1e-3

# Build the parameter groups.
params = [{'params': model.conv1.parameters(), 'lr': backbone_lr},
          {'params': model.bn1.parameters(), 'lr': backbone_lr},
          {'params': model.layer1.parameters(), 'lr': backbone_lr},
          {'params': model.layer2.parameters(), 'lr': backbone_lr},
          {'params': model.layer3.parameters(), 'lr': backbone_lr},
          {'params': model.layer4.parameters(), 'lr': backbone_lr},
          {'params': model.fc.parameters(), 'lr': classifier_lr}]

In [10]:
# Set up the optimizer.
"""
PyTorch optimizers automatically pick up and apply the per-group learning rates defined in the parameter groups.
"""
optimizer = torch.optim.Adam(params)
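
One way to confirm that the per-group rates were picked up is to inspect `optimizer.param_groups`. The sketch below is illustrative only: it uses a tiny two-layer stand-in so that it runs without downloading weights, and the names `backbone` and `head` are hypothetical rather than part of the ResNet above. It also shows that a group without its own `lr` key falls back to the optimizer-level default.

```python
import torch
import torch.nn as nn

# Tiny stand-in modules (hypothetical), used instead of the pretrained ResNet.
backbone = nn.Linear(8, 4)
head = nn.Linear(4, 2)

optimizer = torch.optim.Adam(
    [
        {"params": backbone.parameters(), "lr": 1e-4},
        {"params": head.parameters()},  # no "lr" key: uses the default passed below
    ],
    lr=1e-3,
)

# Each entry in param_groups carries the learning rate that will be applied to it.
group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```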

In [None]:
# Define the loss function (criterion).
criterion = nn.CrossEntropyLoss()

# Training loop. `train_loader` is assumed to be a DataLoader defined elsewhere.
num_epochs = 500
model.train()
for epoch in range(num_epochs):
    # train.
    for inputs, labels in train_loader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

#### Learning-rate scheduling

In [None]:
"""
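Besides layer-wise learning rates, a learning-rate scheduler can be used to adjust the learning rates dynamically.
Example: StepLR.
"""
# Code example.
from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Update the learning rate inside the training loop.
# `train_loader` is assumed to be a DataLoader defined elsewhere.
num_epochs = 500
for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        # Forward pass.
        outputs = model(inputs)
        # Compute the loss.
        loss = criterion(outputs, labels)
        # Backward pass.
        loss.backward()
        # Parameter update.
        optimizer.step()
        # Zero the gradients.
        optimizer.zero_grad()
    # Advance the schedule once per epoch, after the optimizer updates.
    scheduler.step()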
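
StepLR multiplies the learning rate of every parameter group by `gamma` once every `step_size` epochs, so the ratio between the backbone and classifier rates stays fixed while both decay. The arithmetic can be sketched in plain Python; `step_lr` is a hypothetical helper written for illustration, not a PyTorch function.

```python
def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under a StepLR-style decay."""
    return base_lr * gamma ** (epoch // step_size)


backbone_lr, classifier_lr = 1e-4, 1e-3
for epoch in (0, 29, 30, 60):
    # Both groups decay by the same factor at the same epochs.
    print(epoch, step_lr(backbone_lr, epoch), step_lr(classifier_lr, epoch))
```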

#### Progressive unfreezing

In [None]:
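# Initial stage: train only the classifier.
for param in model.parameters():
    param.requires_grad = False
# requires_grad_() sets the flag on every parameter of the module.
model.fc.requires_grad_(True)

# After training for a few epochs.
model.layer4.requires_grad_(True)

# After a few more epochs.
model.layer3.requires_grad_(True)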
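
The "after a few epochs" steps can be driven by a small schedule instead of manual edits. This is a minimal sketch under stated assumptions: the epoch thresholds in `UNFREEZE_AT` and the helper `trainable_blocks` are hypothetical, and the block names follow the torchvision ResNet attributes used above.

```python
# Hypothetical schedule: the epoch at which each block becomes trainable.
UNFREEZE_AT = {"fc": 0, "layer4": 5, "layer3": 10}


def trainable_blocks(epoch, schedule=UNFREEZE_AT):
    """Return the names of the blocks that should be trainable at the given epoch."""
    return [name for name, start in schedule.items() if epoch >= start]


for epoch in (0, 5, 10):
    print(epoch, trainable_blocks(epoch))
```

At the start of each epoch in the training loop, the returned names could be applied with `getattr(model, name).requires_grad_(True)`.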

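#### Layer-adaptive learning rate

This approach uses a custom optimizer so that different layers end up with different effective learning rates. For example, a custom optimizer can scale each layer's learning rate automatically based on its gradient statistics.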

In [None]:
class LayerAdaptiveLR(torch.optim.Adam):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        # Note: weight_decay is stored in the param groups but not applied by step() below.
        super().__init__(params, lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            # The closure recomputes the loss, so it needs gradients enabled.
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]

                # Lazily initialize the moment estimates for this parameter.
                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p)
                    state['exp_avg_sq'] = torch.zeros_like(p)

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                # Update the running averages of the gradient and its square.
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group['eps'])

                # Dynamically scale the learning rate of this parameter tensor
                # by the mean ratio |exp_avg| / denom (gradient statistics).
                step_size = group['lr'] * (exp_avg.abs() / denom).mean().item()
                p.add_(exp_avg, alpha=-step_size)

        return loss

# Usage example.
optimizer = LayerAdaptiveLR(model.parameters(), lr=1e-3)
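
For reference, the update performed by `step()` above can be written out as follows. This is a transcription of the code rather than an established algorithm: $g_t$ is the gradient, $\eta$ is the group's `lr`, and the mean is taken over all elements of one parameter tensor. Unlike standard Adam, there is no bias correction and the step moves along $m_t$ itself.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
s_t &= \eta \cdot \operatorname{mean}\!\left(\frac{|m_t|}{\sqrt{v_t} + \epsilon}\right) \\
\theta_t &= \theta_{t-1} - s_t\, m_t
\end{aligned}
```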