# Dog Breed Identification (ImageNet Dogs) on Kaggle

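In this section, we will practice the dog breed identification problem on Kaggle.
(**The web address of this competition is https://www.kaggle.com/c/dog-breed-identification**).
 :numref:`fig_kaggle_dog` shows the information on the competition's webpage.
You need a Kaggle account to submit your results.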

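In this competition, we will identify 120 different breeds of dogs.
The dataset is in fact a subset of the well-known ImageNet dataset. Unlike the images in the CIFAR-10 dataset in :numref:`sec_kaggle_cifar10`,
the images in the ImageNet dataset are both taller and wider, and their dimensions vary.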

![The dog breed identification competition website. The competition dataset can be obtained by clicking the "Data" tab.](../img/kaggle-dog.jpg)
:width:`400px`
:label:`fig_kaggle_dog`



In [55]:
# !pip install d2l==0.17.6

In [1]:
import os
import torch
import torchvision
from torch import nn
from d2l import torch as d2l



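## Obtaining and Organizing the Dataset

The competition dataset is divided into a training set and a test set, which contain 10222 and 10357 JPEG images with three RGB (color) channels, respectively.
The training dataset covers 120 breeds of dogs, such as Labradors, Poodles, Dachshunds, Samoyeds, Huskies, Chihuahuas, and Yorkshire Terriers.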

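### Downloading the Dataset

After logging in to Kaggle, you can click the "Data" tab on the competition webpage shown in :numref:`fig_kaggle_dog` and download the dataset by clicking the "Download All" button. After unzipping the downloaded file in `../data`, you will find the entire dataset in the following paths: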

* ../data/dog-breed-identification/labels.csv
* ../data/dog-breed-identification/sample_submission.csv
* ../data/dog-breed-identification/train
* ../data/dog-breed-identification/test


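The structure above is similar to that of CIFAR-10 in :numref:`sec_kaggle_cifar10`: the folders `train/` and `test/` contain the training and test dog images, respectively, and `labels.csv` contains the labels for the training images.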
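
As a rough illustration of how `labels.csv` can be consumed, the sketch below parses a tiny in-memory sample into a dictionary from image id to breed. The two rows and their ids are made up for illustration, the helper name `read_labels` is hypothetical, and the file is assumed to start with an `id,breed` header followed by one row per training image.

```python
import csv
import io

# Hypothetical two-row sample in the assumed `id,breed` layout of labels.csv.
sample = "id,breed\nimg_0001,boston_bull\nimg_0002,dingo\n"


def read_labels(fileobj):
    """Return a dict mapping image id -> breed name."""
    reader = csv.DictReader(fileobj)
    return {row["id"]: row["breed"] for row in reader}


labels = read_labels(io.StringIO(sample))
print(labels)
```

With the real file, the same function could be called on `open(os.path.join(data_dir, 'labels.csv'))` instead of the in-memory sample.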

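Similarly, to make it easier to get started, [**we provide a small sample of the full dataset**]: `train_valid_test_tiny.zip`.
To use the full dataset for the Kaggle competition, the `demo` variable below must be set to `False`, as it is in this notebook.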


In [2]:
d2l.DATA_HUB['dog_tiny'] = (d2l.DATA_URL + 'kaggle_dog_tiny.zip',
                            '0cb91d09b814ecdc07b50f31f8dcad3e81d6a86d')

# Set the variable below to False to use the full Kaggle competition dataset
demo = False
if demo:
    data_dir = d2l.download_extract('dog_tiny')
else:
    data_dir = os.path.join('..', 'data', 'dog-breed-identification')

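### [**Organizing the Dataset**]

We can organize the dataset as we did in :numref:`sec_kaggle_cifar10`: split a validation set off from the original training set, and then move the images into subfolders grouped by label.

The `reorg_dog_data` function below reads the training labels, splits off the validation set, and organizes the training set.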


In [3]:
def reorg_dog_data(data_dir, valid_ratio):
    labels = d2l.read_csv_labels(os.path.join(data_dir, 'labels.csv'))
    d2l.reorg_train_valid(data_dir, labels, valid_ratio)
    d2l.reorg_test(data_dir)


batch_size = 32 if demo else 128
valid_ratio = 0.1
# reorg_dog_data(data_dir, valid_ratio)
# len(data_dir), data_dir[0]
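
To make the split concrete, here is a minimal sketch of a per-class validation split on in-memory labels. It assumes one common rule: every class contributes the same number of validation images, derived from the size of the rarest class and `valid_ratio`. The helper name `split_train_valid` and the toy labels are hypothetical, and no files are copied or moved.

```python
import collections
import math


def split_train_valid(labels, valid_ratio):
    """Split ids into train/valid so every class gets equally many valid ids."""
    # The size of the rarest class decides how many validation images per class.
    n_rarest = min(collections.Counter(labels.values()).values())
    n_valid_per_label = max(1, math.floor(n_rarest * valid_ratio))
    taken = collections.Counter()
    train_ids, valid_ids = [], []
    for image_id, label in labels.items():
        if taken[label] < n_valid_per_label:
            valid_ids.append(image_id)
            taken[label] += 1
        else:
            train_ids.append(image_id)
    return train_ids, valid_ids


# Toy labels: 20 images of class 'a' and 10 of class 'b' (hypothetical).
toy = {f"a{i}": "a" for i in range(20)}
toy.update({f"b{i}": "b" for i in range(10)})
train_ids, valid_ids = split_train_valid(toy, valid_ratio=0.1)
print(len(train_ids), len(valid_ids))
```

Under this rule, 120 classes with 6 validation images each would hold out 720 images, which would be consistent with the difference between the 10222 labeled images and the 9502 training images reported when the datasets are read later in this section.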

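## [**Image Augmentation**]

Recall that this dog breed dataset is a subset of the ImageNet dataset, whose images are larger than those of the CIFAR-10 dataset in :numref:`sec_kaggle_cifar10`.
Below we look at how to apply image augmentation to relatively large images.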


In [4]:
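transform_train = torchvision.transforms.Compose([
    # Randomly crop a region covering 0.08 to 1 of the original area with an
    # aspect ratio between 3/4 and 4/3, then scale the region to a new
    # 224 x 224 image
    torchvision.transforms.RandomResizedCrop(224, scale=(0.08, 1.0),
                                             ratio=(3.0 / 4.0, 4.0 / 3.0)),
    torchvision.transforms.RandomHorizontalFlip(),
    # Randomly rotate the image by up to 15 degrees in either direction
    torchvision.transforms.RandomRotation(degrees=15),
    # Randomly change the brightness, contrast, and saturation
    torchvision.transforms.ColorJitter(brightness=0.4,
                                       contrast=0.4,
                                       saturation=0.4),
    # Convert the PIL image to a float tensor with values in [0, 1]
    torchvision.transforms.ToTensor(),
    # Standardize each channel of the image
    torchvision.transforms.Normalize([0.485, 0.456, 0.406],
                                     [0.229, 0.224, 0.225])])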
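
The random crop is the least obvious step of this pipeline, so the following plain-Python sketch shows one way such a crop box can be sampled: pick an area fraction from `scale`, pick an aspect ratio log-uniformly from `ratio`, and retry when the box does not fit. This is a simplified illustration under those assumptions, not torchvision's exact implementation; the helper name `sample_crop`, the retry count, the whole-image fallback, and the 375 x 500 input size are all chosen here for illustration.

```python
import math
import random


def sample_crop(height, width, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3),
                rng=random):
    """Sample (top, left, h, w) of a crop box; a simplified sketch."""
    area = height * width
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(10):  # retry a few times if the box does not fit
        target_area = area * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(*log_ratio))  # width / height
        w = round(math.sqrt(target_area * aspect))
        h = round(math.sqrt(target_area / aspect))
        if 0 < w <= width and 0 < h <= height:
            top = rng.randint(0, height - h)
            left = rng.randint(0, width - w)
            return top, left, h, w
    # Fallback used in this sketch: take the whole image.
    return 0, 0, height, width


rng = random.Random(0)
boxes = [sample_crop(375, 500, rng=rng) for _ in range(1000)]
print(boxes[:3])
```

Whatever box is sampled, the cropped region is then resized to 224 x 224, so every training image reaches the network with the same shape.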

In [61]:
# transform_train_swin = torchvision.transforms.Compose([
#     # Randomly crop a region covering 0.08 to 1 of the original area with an
#     # aspect ratio between 3/4 and 4/3, then scale the region to a new
#     # 256 x 256 image
#     torchvision.transforms.RandomResizedCrop(256, scale=(0.08, 1.0),
#                                              ratio=(3.0 / 4.0, 4.0 / 3.0)),
#     torchvision.transforms.RandomHorizontalFlip(),
#     torchvision.transforms.RandomRotation(degrees=15),
#     # Randomly change the brightness, contrast, and saturation
#     torchvision.transforms.ColorJitter(brightness=0.4,
#                                        contrast=0.4,
#                                        saturation=0.4),
#     # Convert the PIL image to a float tensor with values in [0, 1]
#     torchvision.transforms.ToTensor(),
#     # Standardize each channel of the image
#     torchvision.transforms.Normalize([0.485, 0.456, 0.406],
#                                      [0.229, 0.224, 0.225])])

During testing, we only use deterministic image preprocessing operations.


In [5]:
transform_test = torchvision.transforms.Compose([
    torchvision.transforms.Resize(256),
    # Crop a 224 x 224 square area from the center of the image
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize([0.485, 0.456, 0.406],
                                     [0.229, 0.224, 0.225])])
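
To see what the deterministic test-time preprocessing does to image geometry, the sketch below computes the size after scaling the shorter side to 256 pixels and the offset of a centered 224 x 224 crop. The helper names and the 375 x 500 input size are hypothetical, and the rounding here uses simple truncation and floor division, so a library implementation may differ by a pixel.

```python
def resize_shorter_side(height, width, size):
    """New (height, width) after scaling the shorter side to `size`."""
    if height <= width:
        return size, int(size * width / height)
    return int(size * height / width), size


def center_crop_box(height, width, crop):
    """(top, left) corner of a centered crop x crop window."""
    return (height - crop) // 2, (width - crop) // 2


h, w = resize_shorter_side(375, 500, 256)
top, left = center_crop_box(h, w, 224)
print((h, w), (top, left))
```

Resizing to a slightly larger size before cropping keeps most of the image content while trimming only a thin border, and it preserves the aspect ratio instead of squashing the image.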

In [63]:
# transform_test_swin = torchvision.transforms.Compose([
#     torchvision.transforms.Resize(280),
#     # Crop a 256 x 256 square area from the center of the image
#     torchvision.transforms.CenterCrop(256),
#     torchvision.transforms.ToTensor(),
#     torchvision.transforms.Normalize([0.485, 0.456, 0.406],
#                                      [0.229, 0.224, 0.225])])

## [**Reading the Dataset**]

As in :numref:`sec_kaggle_cifar10`, we can read the organized dataset consisting of raw image files.


In [6]:
train_ds, train_valid_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=transform_train) for folder in ['train', 'train_valid']]

valid_ds, test_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=transform_test) for folder in ['valid', 'test']]

In [65]:
# train_ds_swin, train_valid_ds_swin = [torchvision.datasets.ImageFolder(
#     os.path.join(data_dir, 'train_valid_test', folder),
#     transform=transform_train_swin) for folder in ['train', 'train_valid']]

# valid_ds_swin, test_ds_swin = [torchvision.datasets.ImageFolder(
#     os.path.join(data_dir, 'train_valid_test', folder),
#     transform=transform_test_swin) for folder in ['valid', 'test']]

In [10]:
len(train_ds), len(train_valid_ds), len(test_ds), train_ds[0][0].shape

(9502, 10222, 10357, torch.Size([3, 224, 224]))

In [67]:
# len(train_ds_swin), len(train_valid_ds_swin),len(test_ds_swin)

Below we create data loader instances in the same way as in :numref:`sec_kaggle_cifar10`.


In [9]:
train_iter, train_valid_iter = [torch.utils.data.DataLoader(
    dataset, batch_size, shuffle=True, drop_last=True)
    for dataset in (train_ds, train_valid_ds)]

valid_iter = torch.utils.data.DataLoader(valid_ds, batch_size, shuffle=False,
                                         drop_last=True)

test_iter = torch.utils.data.DataLoader(test_ds, batch_size, shuffle=False,
                                        drop_last=False)
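
The effect of `drop_last` on the number of minibatches can be checked with a few lines of arithmetic. The sketch below uses the dataset sizes printed above and a batch size of 128; the helper name `num_batches` is hypothetical.

```python
import math


def num_batches(num_examples, batch_size, drop_last):
    """Number of minibatches a data loader yields per epoch."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


train_batches = num_batches(9502, 128, drop_last=True)
train_valid_batches = num_batches(10222, 128, drop_last=True)
test_batches = num_batches(10357, 128, drop_last=False)
print(train_batches, train_valid_batches, test_batches)
```

The first two counts agree with the `74/74` and `79/79` progress bars in the training logs later in this section. Note that `drop_last=True` on the validation loader also discards the final incomplete batch, so a few validation images are not included in the reported validation loss.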

In [69]:
# train_iter_swin, train_valid_iter_swin = [torch.utils.data.DataLoader(
#     dataset, batch_size, shuffle=True, drop_last=True)
#     for dataset in (train_ds_swin, train_valid_ds_swin)]

# valid_iter_swin = torch.utils.data.DataLoader(valid_ds_swin, batch_size, shuffle=False,
#                                          drop_last=True)

# test_iter_swin = torch.utils.data.DataLoader(test_ds_swin, batch_size, shuffle=False,
#                                         drop_last=False)

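## [**Fine-Tuning a Pretrained Model**]

Again, the dataset for this competition is a subset of the ImageNet dataset.
Therefore, we can use the approach discussed in :numref:`sec_fine_tuning` to select a model pretrained on the full ImageNet dataset and use it to extract image features, which are then fed into a small custom output network.
High-level APIs of deep learning frameworks provide a wide range of models pretrained on the ImageNet dataset.
In `get_net` below, we choose a pretrained ResNet-152 model and keep it frozen; its 1000-dimensional output serves as the extracted features.
These features are passed to a small trainable custom output network that stacks two fully connected layers.
The same cell also defines alternatives: pretrained ViT and Swin Transformer backbones whose classification heads are replaced, and two small Vision Transformers that are trained from scratch (`get_myvit` and `get_vit_conv`).
Unlike the experiment in :numref:`sec_fine_tuning`, the pretrained models used for feature extraction are not retrained here, which saves the time and memory needed to compute and store their gradients.

Recall that we standardized the images using the means and standard deviations of the three RGB channels computed on the full ImageNet dataset.
In fact, this is consistent with the standardization applied when the models were pretrained on ImageNet.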
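
As a small worked example of this standardization, the sketch below applies the per-channel formula `(x - mean) / std` to a single RGB pixel and then inverts it. The mean and standard deviation values are the ones used in the transforms above; the helper names and the mid-gray pixel are chosen only for illustration.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Standardize one RGB pixel whose channels lie in [0, 1]."""
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))


def denormalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Invert normalize_pixel, e.g. before displaying an image."""
    return tuple(x * s + m for x, m, s in zip(rgb, mean, std))


pixel = (0.5, 0.5, 0.5)  # mid-gray, chosen only for illustration
normalized = normalize_pixel(pixel)
restored = denormalize_pixel(normalized)
print(normalized, restored)
```

Using the same statistics at fine-tuning time matters because the frozen backbone only ever saw inputs on this scale during pretraining.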


In [11]:
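def get_net(devices):
    finetune_net = nn.Sequential()
    # IMAGENET1K_V1 selects the same weights as the deprecated `pretrained=True`
    finetune_net.features = torchvision.models.resnet152(
        weights=torchvision.models.ResNet152_Weights.IMAGENET1K_V1)
    # Define a new output network with 120 output classes
    finetune_net.output_new = nn.Sequential(nn.Linear(1000, 256),
                                            nn.ReLU(),
                                            nn.Linear(256, 120))
    # Move the model parameters to the CPU or GPU used for computation
    finetune_net = finetune_net.to(devices[0])
    # Freeze the parameters of the feature extractor
    for param in finetune_net.features.parameters():
        param.requires_grad = False
    return finetune_net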


from torchvision.models import ResNet152_Weights, ViT_B_16_Weights, ViT_L_32_Weights, Swin_V2_B_Weights


def get_vit(devices):
    finetune_net = nn.Sequential()
    finetune_net.features = torchvision.models.vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
    # Freeze the pretrained parameters
    for param in finetune_net.features.parameters():
        param.requires_grad = False
    # Replace the classification head; the new layers are trainable by default
    finetune_net.features.heads = nn.Sequential(nn.Linear(768, 256),
                                                nn.ReLU(), nn.Dropout(0.5),
                                                nn.Linear(256, 120))
    # Move all parameters, including the new head, to the CPU or GPU used
    # for computation
    finetune_net = finetune_net.to(devices[0])
    return finetune_net


def get_vitl(devices):
    finetune_net = nn.Sequential()
    finetune_net.features = torchvision.models.vit_l_32(weights=ViT_L_32_Weights.IMAGENET1K_V1)
    finetune_net.features.heads = nn.Sequential(nn.Linear(1024, 2048),
                                                nn.ReLU(), nn.Dropout(0.5),
                                                nn.Linear(2048, 1024),
                                                nn.ReLU(), nn.Dropout(0.5),
                                                nn.Linear(1024, 1024),
                                                nn.ReLU(), nn.Dropout(0.5),
                                                nn.Linear(1024, 1024),
                                                nn.ReLU(), nn.Dropout(0.5),
                                                nn.Linear(1024, 1024),
                                                nn.ReLU(), nn.Dropout(0.5),
                                                nn.Linear(1024, 1024),
                                                nn.ReLU(), nn.Dropout(0.5),
                                                nn.Linear(1024, 1024),
                                                nn.ReLU(), nn.Dropout(0.5),
                                                nn.Linear(1024, 256),
                                                nn.ReLU(), nn.Dropout(0.5),
                                                nn.Linear(256, 120))
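    # Move the model parameters to the CPU or GPU used for computation
    finetune_net = finetune_net.to(devices[0])
    # Freeze the backbone, then re-enable gradients for the new head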
    for param in finetune_net.features.parameters():
        param.requires_grad = False
    for param in finetune_net.features.heads.parameters():
        param.requires_grad = True
    return finetune_net


def get_swin_B(devices):
    finetune_net = nn.Sequential()
    finetune_net.features = torchvision.models.swin_v2_b(weights=Swin_V2_B_Weights.IMAGENET1K_V1)
    finetune_net.features.head = nn.Sequential(nn.Linear(1024, 256),
                                               nn.ReLU(), nn.Dropout(0.5),
                                               nn.Linear(256, 120))
    # finetune_net.features.new_out = nn.Sequential(nn.Linear(1024, 256),
    #                                       nn.ReLU(), nn.Dropout(0.5),
    #                                       nn.Linear(256, 120))
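    # Move the model parameters to the CPU or GPU used for computation
    finetune_net = finetune_net.to(devices[0])
    # Freeze the backbone, then re-enable gradients for the new head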
    for param in finetune_net.features.parameters():
        param.requires_grad = False
    for param in finetune_net.features.head.parameters():
        param.requires_grad = True
    return finetune_net


class VisionTransformer(nn.Module):
    def __init__(self, patch_size, channels, hidden_dim, num_heads, num_layers, num_classes):
        super(VisionTransformer, self).__init__()
        self.patch_size = patch_size
        self.embedding = nn.Linear(patch_size * patch_size * channels, hidden_dim)
        # One position per patch plus one for the class token
        # (assumes 224 x 224 inputs)
        self.positional_embedding = nn.Parameter(
            torch.randn(1, (224 // patch_size) * (224 // patch_size) + 1, hidden_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,  # must match the embedding width
            nhead=num_heads,
            dim_feedforward=1024,
            dropout=0.1,
            activation=nn.functional.relu,
            layer_norm_eps=1e-5,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.cls_token = nn.Parameter(torch.randn(1, 1, hidden_dim))
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # Extract non-overlapping patches: [B, C, H/P, W/P, P, P]
        patches = x.unfold(2, self.patch_size, self.patch_size).unfold(3, self.patch_size, self.patch_size)
        # Group the values of each patch together: [B, H/P * W/P, C * P * P]
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(
            x.shape[0], -1, self.patch_size * self.patch_size * x.shape[1])

        # Embed patches
        x = self.embedding(patches)

        # Prepend the class token, then add the positional embedding once
        cls_tokens = self.cls_token.repeat(x.shape[0], 1, 1)
        x = torch.cat((cls_tokens, x), dim=1)
        x = x + self.positional_embedding

        # Pass through transformer
        x = self.transformer(x)

        # Classifier on the CLS token
        x = self.fc(x[:, 0])
        return x


def get_myvit():
    net = VisionTransformer(16, 3, 512, 8, 4, 120)
    return net


class VisionTransformer_conv(nn.Module):
    def __init__(self, img_size, patch_size, channels, hidden_dim, num_heads, dim_feedforward, num_layers, num_classes):
        super(VisionTransformer_conv, self).__init__()
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) * (img_size // patch_size)

        # Use a convolutional layer instead of a linear layer to embed patches directly from the image
        self.patch_embed = nn.Conv2d(channels, hidden_dim, kernel_size=patch_size, stride=patch_size)

        self.positional_embedding = nn.Parameter(torch.randn(1, self.num_patches + 1, hidden_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=0.1,
            activation=nn.functional.relu,
            layer_norm_eps=1e-5,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.cls_token = nn.Parameter(torch.randn(1, 1, hidden_dim))
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # Embed patches with the convolutional layer, then flatten them
        x = self.patch_embed(x)  # [B, C, H, W] -> [B, hidden_dim, H/P, W/P]
        x = x.flatten(2)  # [B, hidden_dim, H/P * W/P]
        x = x.transpose(1, 2)  # [B, H/P * W/P, hidden_dim]

        # Add positional embedding
        cls_tokens = self.cls_token.repeat(x.shape[0], 1, 1)  # [B, 1, hidden_dim]
        x = torch.cat((cls_tokens, x), dim=1)  # [B, 1 + H/P * W/P, hidden_dim]
        x = x + self.positional_embedding  # [B, 1 + H/P * W/P, hidden_dim]

        # Pass through transformer
        x = self.transformer(x)

        # Classifier on the CLS token
        x = self.fc(x[:, 0])
        return x


def get_vit_conv():
    net = VisionTransformer_conv(224, 16, 3, 256, 4, 512, 3, 120)
    return net


import clip


def get_clip():
    model, preprocess = clip.load('ViT-B/32', device=torch.device('mps'))
    return model, preprocess
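
For the custom Vision Transformers defined above, it helps to check the sequence length and per-head width implied by the constructor arguments before training. The sketch below does this with plain arithmetic for the settings passed in `get_vit_conv`; the helper name `vit_shapes` is hypothetical.

```python
def vit_shapes(img_size, patch_size, hidden_dim, num_heads):
    """Return (num_patches, seq_len, head_dim) for a ViT-style encoder."""
    assert img_size % patch_size == 0, "image must tile evenly into patches"
    assert hidden_dim % num_heads == 0, "heads must divide the hidden size"
    num_patches = (img_size // patch_size) ** 2
    seq_len = num_patches + 1  # one extra position for the class token
    head_dim = hidden_dim // num_heads
    return num_patches, seq_len, head_dim


# Same settings as get_vit_conv: 224 x 224 input, 16 x 16 patches,
# hidden size 256, 4 attention heads.
shapes = vit_shapes(224, 16, 256, 4)
print(shapes)
```

The positional embedding must have exactly `seq_len` positions, which is why these models only accept inputs of the image size they were constructed for.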

In [19]:
model, preprocess = get_clip()
device = torch.device('mps')
from PIL import Image

image = preprocess(Image.open(
    '../data/dog-breed-identification/train_valid_test/test/unknown/0a0b97441050bba8e733506de4655ea1.jpg')).unsqueeze(
    0).to(device)
labels = ["a diagram", "a dog", 'a affenpinscher', "a cat"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", labels[probs.argmax()])
print(probs)

Label probs: a dog
[[1.114e-03 9.971e-01 4.573e-04 9.093e-04]]


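Before [**computing the loss**], we first obtain the input of the pretrained model's output layer, i.e., the extracted features.
Then we use these features as the input to our small custom output network to compute the loss.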


In [11]:
loss = nn.CrossEntropyLoss(reduction='none')


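def evaluate_loss(data_iter, net, devices):
    l_sum, n = 0.0, 0
    was_training = net.training
    net.eval()  # disable dropout while evaluating
    with torch.no_grad():  # gradients are not needed for evaluation
        for features, labels in data_iter:
            features, labels = features.to(devices[0]), labels.to(devices[0])
            outputs = net(features)
            l = loss(outputs, labels)
            l_sum += l.sum()
            n += labels.numel()
    net.train(was_training)  # restore the previous mode
    return (l_sum / n).to('cpu')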
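
A useful sanity check for this loss is its value when the model has learned nothing. The sketch below computes the cross-entropy of a single example from raw logits in plain Python and evaluates it for uniform predictions over 120 classes; the helper name `cross_entropy` and the label index are chosen only for illustration.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits (log-sum-exp form)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


num_classes = 120
uniform_logits = [0.0] * num_classes
chance_loss = cross_entropy(uniform_logits, label=7)
print(round(chance_loss, 3), round(math.log(num_classes), 3))
```

A model that guesses uniformly therefore scores about ln(120), roughly 4.79, so average losses near that value indicate the network has barely started to learn.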

In [12]:
from tqdm import tqdm

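## Defining [**the Training Function**]

We will select the model and tune hyperparameters according to the model's performance on the validation set.
The training function `train` only updates parameters that require gradients, which for the fine-tuned models is just the small custom output network.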


In [22]:
def train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
          lr_decay):
    # Only parameters with requires_grad=True are passed to the optimizer
    # net = nn.DataParallel(net, device_ids=devices).to(devices[0])
    net = net.to(devices[0])
    trainer = torch.optim.SGD((param for param in net.parameters()
                               if param.requires_grad), lr=lr,
                              momentum=0.9, weight_decay=wd)
    scheduler = torch.optim.lr_scheduler.StepLR(trainer, lr_period, lr_decay)
    num_batches, timer = len(train_iter), d2l.Timer()
    legend = ['train loss']
    if valid_iter is not None:
        legend.append('valid loss')

    for epoch in range(num_epochs):
        metric = d2l.Accumulator(2)
        for features, labels in tqdm(train_iter):
            timer.start()
            features, labels = features.to(devices[0]), labels.to(devices[0])
            trainer.zero_grad()
            output = net(features)
            l = loss(output, labels).sum()
            # l.requires_grad = True
            l.backward()
            trainer.step()
            metric.add(l, labels.shape[0])
            timer.stop()
        measures = f'{epoch}/{num_epochs} train loss {metric[0] / metric[1]:.3f}'
        if valid_iter is not None:
            valid_loss = evaluate_loss(valid_iter, net, devices)
            measures += f', valid loss {valid_loss:.3f}'
        scheduler.step()
        print(measures + f',{output.shape}')
    # Save the trained weights; adjust this path to your environment
    torch.save(net.state_dict(), '/content/drive/MyDrive/models/dog_classifier.ckpt')
    # `measures` already contains the final validation loss, if any
    print(measures + f'\n{metric[1] * num_epochs / timer.sum():.1f}'
                     f' examples/sec on {str(devices)}')

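## [**Training and Validating the Model**]

Now we can train and validate the model. All of the following hyperparameters are tunable.
For example, we can increase the number of epochs.
Since `lr_period` and `lr_decay` are set to 2 and 0.9, respectively,
the learning rate of the optimization algorithm is multiplied by 0.9 after every 2 epochs.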
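
The step decay described above can be written as a closed-form function of the epoch index. The following sketch, with the hypothetical helper `step_lr`, prints the learning rate used in the first few epochs for a base rate of 1e-4.

```python
def step_lr(base_lr, epoch, lr_period, lr_decay):
    """Learning rate used during `epoch` (0-based) under step decay."""
    return base_lr * lr_decay ** (epoch // lr_period)


schedule = [step_lr(1e-4, epoch, lr_period=2, lr_decay=0.9)
            for epoch in range(6)]
for epoch, lr_value in enumerate(schedule):
    print(epoch, lr_value)
```

With 50 epochs, the rate in the last epoch (index 49) has been multiplied by 0.9 to the power 24, which is roughly 0.08 of the initial value, so this schedule decays fairly gently.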


In [30]:
devices, num_epochs, lr, wd = [torch.device('mps')], 50, 1e-4, 1e-4
lr_period, lr_decay = 2, 0.9
# net1=get_vit(devices)
# net2 =  get_vitl(devices)
# net3 =  get_swin_B(devices)
net4 = get_vit_conv()
for name, param in net4.named_parameters():
    if param.requires_grad:
        print(f"{name} requires_grad={param.requires_grad}")

positional_embedding requires_grad=True
cls_token requires_grad=True
patch_embed.weight requires_grad=True
patch_embed.bias requires_grad=True
transformer.layers.0.self_attn.in_proj_weight requires_grad=True
transformer.layers.0.self_attn.in_proj_bias requires_grad=True
transformer.layers.0.self_attn.out_proj.weight requires_grad=True
transformer.layers.0.self_attn.out_proj.bias requires_grad=True
transformer.layers.0.linear1.weight requires_grad=True
transformer.layers.0.linear1.bias requires_grad=True
transformer.layers.0.linear2.weight requires_grad=True
transformer.layers.0.linear2.bias requires_grad=True
transformer.layers.0.norm1.weight requires_grad=True
transformer.layers.0.norm1.bias requires_grad=True
transformer.layers.0.norm2.weight requires_grad=True
transformer.layers.0.norm2.bias requires_grad=True
transformer.layers.1.self_attn.in_proj_weight requires_grad=True
transformer.layers.1.self_attn.in_proj_bias requires_grad=True
transformer.layers.1.self_attn.out_proj.weight 

In [24]:
# for name, param in net3.named_parameters():
#     print(f"{name} requires_grad={param.requires_grad}")
torch.__version__

'2.1.0'

In [25]:
train(net4, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
      lr_decay)

100%|██████████| 74/74 [00:42<00:00,  1.76it/s]


0/50 train loss 4.849, valid loss 4.847,torch.Size([128, 120])


100%|██████████| 74/74 [00:43<00:00,  1.71it/s]


1/50 train loss 4.799, valid loss 4.711,torch.Size([128, 120])


100%|██████████| 74/74 [00:44<00:00,  1.65it/s]


2/50 train loss 4.751, valid loss 4.681,torch.Size([128, 120])


100%|██████████| 74/74 [00:44<00:00,  1.65it/s]


3/50 train loss 4.721, valid loss 4.654,torch.Size([128, 120])


100%|██████████| 74/74 [00:45<00:00,  1.64it/s]


4/50 train loss 4.702, valid loss 4.629,torch.Size([128, 120])


100%|██████████| 74/74 [00:43<00:00,  1.69it/s]


5/50 train loss 4.685, valid loss 4.635,torch.Size([128, 120])


100%|██████████| 74/74 [00:44<00:00,  1.66it/s]


6/50 train loss 4.668, valid loss 4.562,torch.Size([128, 120])


100%|██████████| 74/74 [00:45<00:00,  1.61it/s]


7/50 train loss 4.653, valid loss 4.605,torch.Size([128, 120])


 74%|███████▍  | 55/74 [00:37<00:12,  1.47it/s]


KeyboardInterrupt: 

In [None]:
# train(net3, train_iter_swin, valid_iter_swin, num_epochs, lr, wd, devices, lr_period,
#       lr_decay)

In [None]:
# train(net3, train_valid_iter_swin, None, 10, lr, wd, devices, lr_period,
#       lr_decay)

In [19]:
train(net4, train_valid_iter, None, 10, lr, wd, devices, lr_period,
      lr_decay)

100%|██████████| 79/79 [01:37<00:00,  1.23s/it]


0/10 train loss 0.696,torch.Size([128, 120])


100%|██████████| 79/79 [01:35<00:00,  1.21s/it]


1/10 train loss 0.673,torch.Size([128, 120])


100%|██████████| 79/79 [01:36<00:00,  1.23s/it]


2/10 train loss 0.630,torch.Size([128, 120])


100%|██████████| 79/79 [01:37<00:00,  1.23s/it]


3/10 train loss 0.649,torch.Size([128, 120])


100%|██████████| 79/79 [01:37<00:00,  1.23s/it]


4/10 train loss 0.639,torch.Size([128, 120])


 23%|██▎       | 18/79 [00:22<01:16,  1.25s/it]


KeyboardInterrupt: 

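## [**Classifying the Testing Set**] and Submitting Results on Kaggle

Similar to the final step in :numref:`sec_kaggle_cifar10`, in the end all the labeled data (including the validation set) are used to train the model, which then classifies the test set.
We will use the trained network for classification.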


In [26]:
# net2 = get_vitl(devices)

net4.eval()  # disable dropout for inference
preds = []
with torch.no_grad():  # gradients are not needed for prediction
    for data, label in test_iter:
        output = torch.nn.functional.softmax(net4(data.to(devices[0])), dim=1)
        preds.extend(output.cpu().numpy())
ids = sorted(os.listdir(
    os.path.join(data_dir, 'train_valid_test', 'test', 'unknown')))



In [27]:
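# Create the output folder first; open() does not create missing directories
os.makedirs('./dog', exist_ok=True)
with open('./dog/submission.csv', 'w') as f:
    f.write('id,' + ','.join(train_valid_ds.classes) + '\n')
    for i, output in zip(ids, preds):
        f.write(i.split('.')[0] + ',' + ','.join(
            [str(num) for num in output]) + '\n')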

The code above generates a `submission.csv` file to be submitted to Kaggle in the way described in :numref:`sec_kaggle_house`.
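
The layout of the submission file can be illustrated on toy data: a header row with `id` followed by the class names, then one row per test image with its predicted probability for every class. The sketch below writes to an in-memory buffer; the helper name `write_submission`, the three classes, the two image ids, and the six-decimal formatting are assumptions made for illustration.

```python
import csv
import io


def write_submission(fileobj, ids, probs, classes):
    """Write one row per image: its id followed by per-class probabilities."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["id"] + list(classes))
    for image_id, row in zip(ids, probs):
        assert len(row) == len(classes)
        writer.writerow([image_id] + [f"{p:.6f}" for p in row])


# Toy data: three hypothetical classes and two hypothetical image ids.
classes = ["beagle", "boxer", "pug"]
ids = ["img_a", "img_b"]
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

buffer = io.StringIO()
write_submission(buffer, ids, probs, classes)
print(buffer.getvalue())
```

The column order has to match the class order used when the predictions were produced, which is why the notebook takes the header from `train_valid_ds.classes`.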

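## Summary

* Images in the ImageNet dataset are larger than CIFAR-10 images and vary in size, so we may need to adapt the image augmentation operations to the dataset at hand.
* To classify a subset of the ImageNet dataset, we can use models pretrained on the full ImageNet dataset to extract features and train only a small custom output network, which reduces computation time and memory use.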

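## Exercises

1. When using the full Kaggle competition dataset, what results can you achieve if you increase `batch_size` (batch size) and `num_epochs` (number of epochs), or set other hyperparameters to `lr = 0.01`, `lr_period = 10`, and `lr_decay = 0.1`?
1. Do you get better results if you use a deeper pretrained model? How do you tune the hyperparameters? Can you further improve the results?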


[Discussions](https://discuss.d2l.ai/t/2833)
