## Experiment 4: Training and Validating VGG17 with the MindSpore Framework

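This experiment runs on the ModelArts platform and uses the MindSpore deep learning framework. It builds a VGG17 neural network from MindSpore's operator library, then trains and validates it on a flower dataset (daisy, dandelion, rose, sunflower, tulip) using an Ascend 910 accelerator.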

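### 1. Objectives
* Become familiar with the MindSpore framework and ModelArts: learn how to use common MindSpore APIs and the ModelArts one-stop platform for model training and deployment.
* Build a VGG17 network with MindSpore and train it on the flower dataset (training platform: ModelArts; an Ascend 910 chip can be used). Save the model once training finishes.
* Using the Ascend 310 inference chip as the compute platform, load the trained model with MindSpore, run inference on the flower test set, and report the inference performance and test-set accuracy.
* Through MindSpore, help students get comfortable with a deep learning framework and see how conveniently it wraps basic operations.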

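### 2. Background
### 2.1 VGG Model Principles
This section briefly introduces the principles behind the VGG network.
- VGG replaces a 7x7 convolution kernel with three stacked 3x3 kernels and a 5x5 kernel with two stacked 3x3 kernels. Compared with the larger kernels in AlexNet (such as 11x11 and 5x5), this covers the same receptive field with fewer parameters, allows a deeper network, and improves performance.
- All pooling layers use the same parameters: a 2x2 window with stride=2.
- The model is built by stacking convolutional and pooling layers.

Note: when constructing the network, also include BN (Batch Normalization) layers and ReLU layers (BN improves training stability; ReLU is the nonlinear activation). Dropout layers are added to improve robustness.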
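
The kernel substitution described above can be checked with a little arithmetic. The sketch below is illustrative only: the helper names and the channel width of 64 are assumptions chosen for the comparison, and biases are ignored. It confirms that stacked stride-1 3x3 convolutions reach the receptive field of a single 5x5 or 7x7 kernel while using fewer weights.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return num_layers * (kernel_size - 1) + 1


def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count of stacked convolutions with `channels` inputs and outputs (no bias)."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # illustrative width, not taken from the experiment
for big_kernel, num_small in ((5, 2), (7, 3)):
    # The stack of 3x3 kernels sees the same input window as the large kernel.
    assert stacked_receptive_field(num_small) == big_kernel
    single = conv_weights(big_kernel, channels)
    stacked = conv_weights(3, channels, num_small)
    print(f"one {big_kernel}x{big_kernel}: {single} weights, "
          f"{num_small} stacked 3x3: {stacked} weights")
```

Per input/output channel pair, three 3x3 kernels need 27 weights against 49 for one 7x7 kernel, and each extra layer also adds another nonlinearity.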

The VGG network structure is shown in the table below:


| Operator       | Type            | Input channels | Output channels | Window size | Padding | Stride | Output height x width |
| -------------- | --------------- | -------------- | --------------- | ----------- | ------- | ------ | --------------------- |
| layer1_conv1   | Convolution     | 3              | 64              | 3           | 1       | 1      | 224x224               |
| layer1_conv2   | Convolution     | 64             | 64              | 3           | 1       | 1      | 224x224               |
| layer1_maxpool | Max pooling     | 64             | 64              | 2           | -       | 2      | 112x112               |
| layer2_conv1   | Convolution     | 64             | 128             | 3           | 1       | 1      | 112x112               |
| layer2_conv2   | Convolution     | 128            | 128             | 3           | 1       | 1      | 112x112               |
| layer2_maxpool | Max pooling     | 128            | 128             | 2           | -       | 2      | 56x56                 |
| layer3_conv1   | Convolution     | 128            | 256             | 3           | 1       | 1      | 56x56                 |
| layer3_conv2   | Convolution     | 256            | 256             | 3           | 1       | 1      | 56x56                 |
| layer3_conv3   | Convolution     | 256            | 256             | 3           | 1       | 1      | 56x56                 |
| layer3_maxpool | Max pooling     | 256            | 256             | 2           | -       | 2      | 28x28                 |
| layer4_conv1   | Convolution     | 256            | 512             | 3           | 1       | 1      | 28x28                 |
| layer4_conv2   | Convolution     | 512            | 512             | 3           | 1       | 1      | 28x28                 |
| layer4_conv3   | Convolution     | 512            | 512             | 3           | 1       | 1      | 28x28                 |
| layer4_maxpool | Max pooling     | 512            | 512             | 2           | -       | 2      | 14x14                 |
| layer5_conv1   | Convolution     | 512            | 512             | 3           | 1       | 1      | 14x14                 |
| layer5_conv2   | Convolution     | 512            | 512             | 3           | 1       | 1      | 14x14                 |
| layer5_conv3   | Convolution     | 512            | 512             | 3           | 1       | 1      | 14x14                 |
| layer5_conv4   | Convolution     | 512            | 512             | 3           | 1       | 1      | 14x14                 |
| layer5_maxpool | Max pooling     | 512            | 512             | 2           | -       | 2      | 7x7                   |
| flatten        | Flatten         | -              | -               | -           | -       | -      | -                     |
| fullyconnect1  | Fully connected | 25088          | 4096            | -           | -       | -      | -                     |
| fullyconnect2  | Fully connected | 4096           | 4096            | -           | -       | -      | -                     |
| fullyconnect3  | Fully connected | 4096           | 5               | -           | -       | -      | -                     |
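
The spatial sizes in the table follow from the usual output-size formula, `(size + 2 * padding - kernel) // stride + 1`. As a quick sanity check, the sketch below traces a 224x224 input through the five stages; the `STAGES` list and the function names are illustrative helpers, not part of the experiment code.

```python
def out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# (number of 3x3 convolutions, output channels) for each of the five stages
STAGES = [(2, 64), (2, 128), (3, 256), (3, 512), (4, 512)]


def trace_vgg17(size=224):
    """Return (final spatial size, final channels, number of conv layers)."""
    channels = 3
    conv_layers = 0
    for num_convs, out_channels in STAGES:
        for _ in range(num_convs):
            # 3x3 convolution, padding 1, stride 1: spatial size is unchanged.
            size = out_size(size, kernel=3, stride=1, padding=1)
            channels = out_channels
            conv_layers += 1
        # 2x2 max pooling with stride 2 halves the spatial size.
        size = out_size(size, kernel=2, stride=2, padding=0)
    return size, channels, conv_layers


size, channels, conv_layers = trace_vgg17()
print("final feature map:", channels, "x", size, "x", size)
print("flattened features:", channels * size * size)
print("weighted layers:", conv_layers + 3)  # convolutions plus three fully connected layers
```

The flattened feature count is the input width of fullyconnect1, and the 14 convolutions plus 3 fully connected layers give the network its name.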



<img src="structure.png" style="margin: 0 auto;">



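### 3. Environment

Environment: GPU and Ascend are supported \
Versions: MindSpore 2.0, Python 3.7 \
Before starting, make sure MindSpore is installed correctly. If it is not, install it on your machine by following the MindSpore installation page: https://www.mindspore.cn/install/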

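### 4. Data Processing
### 4.1 Data Preparation

The flower image dataset used in this example contains five classes: daisy (633 images), dandelion (898), roses (641), sunflowers (699), and tulips (799). The images are stored in five folders, 3670 in total, about 230 MB. So that the model can be tested after deployment, the dataset is split into two parts, flower_photos_train and flower_photos_test.

Download the dataset from the link below and save the downloaded data.zip in the code folder, that is, the same directory as the notebook.

Dataset link: https://openi.pcl.ac.cn/attachments/88c31019-22cc-41ed-a31c-8f7b11435b60?type=1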

In [1]:
import os
import zipfile

def download_dataset(download_file, target_path):
    # Skip extraction if the data directory is already in place.
    if os.path.exists(os.path.join(target_path, "data")):
        print("already exists")
        return

    # Extract the archive into target_path.
    if download_file.endswith(".zip"):
        with zipfile.ZipFile(download_file, "r") as z:
            z.extractall(path=target_path)

download_dataset("data.zip", "../")

already exists


The data folder sits alongside the code folder and has the following structure:\
./code \
|── train.py \
./data \
|&emsp;&emsp;&emsp;|── train \
|&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;|── daisy \
|&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;|── dandelion \
|&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;|── roses \
|&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;|── sunflowers \
|&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;|── tulips \
|&emsp;&emsp;&emsp;|── test \
|&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;|── daisy \
|&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;|── dandelion \
|&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;|── roses \
|&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;|── sunflowers \
|&emsp;&emsp;&emsp;|&emsp;&emsp;&emsp;|── tulips 
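
Before training, it can be worth confirming that the extracted folders match this layout. The following is a minimal sketch using only the standard library; `count_images` and `CLASSES` are hypothetical helpers, and `../data` is assumed to be the extraction target used above.

```python
import os

CLASSES = ("daisy", "dandelion", "roses", "sunflowers", "tulips")


def count_images(data_root):
    """Return {split: {class_name: file_count}}; None marks a missing folder."""
    counts = {}
    for split in ("train", "test"):
        counts[split] = {}
        for name in CLASSES:
            class_dir = os.path.join(data_root, split, name)
            if not os.path.isdir(class_dir):
                counts[split][name] = None
                continue
            # Count regular files only, ignoring any nested directories.
            counts[split][name] = sum(
                os.path.isfile(os.path.join(class_dir, entry))
                for entry in os.listdir(class_dir)
            )
    return counts


if os.path.isdir("../data"):
    for split, per_class in count_images("../data").items():
        print(split, per_class)
```

A `None` entry or an unexpectedly small count usually means the archive was extracted to a different directory.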



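### 4.2 Data Loading
After obtaining the dataset, load the images with ImageFolderDataset from the mindspore.dataset module; all images in the same folder are assigned the same label. The pipeline then applies RandomCropDecodeResize (decode, random crop, and resize), a cast to float32, RandomHorizontalFlip, and HWC2CHW. The implementation is as follows: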


In [2]:
import os

import mindspore.dataset as de
import mindspore.dataset.transforms as C
import mindspore.dataset.vision as vision
from mindspore import dtype as mstype


def vgg_create_dataset(data_home, image_size, batch_size, rank_id=0, rank_size=1, training=True):
    """Data operations."""
    # Select the directory of the requested split.
    if training:
        data_dir = os.path.join(data_home, "train")
    else:
        data_dir = os.path.join(data_home, "test")
        print("data_dir",data_dir)
    data_set = de.ImageFolderDataset(data_dir,
                                     class_indexing={'daisy':0,'dandelion':1,'roses':2,'sunflowers':3,'tulips':4},
                                     shuffle=False, num_shards=rank_size, shard_id=rank_id)

    # Data augmentation operations mentioned above.
    # Decode, randomly crop, and resize to image_size (for example [224, 224]).
    transform_img = vision.RandomCropDecodeResize(image_size, scale=(0.08, 1.0),
                                                  ratio=(0.75, 1.333))

    changeswap_op = vision.HWC2CHW()
    type_cast_op = C.TypeCast(mstype.float32)
    random_horizontal_op = vision.RandomHorizontalFlip()
    #normalize_op =  vision.Normalize((0.4465, 0.4822, 0.4914), (0.2010, 0.1994, 0.2023))

    # map applies the given operation to the specified dataset column.
    # Note: the same random transforms are applied to both the train and test splits.
    data_set = data_set.map(input_columns="image", operations=transform_img)
    data_set = data_set.map(input_columns="image", operations=type_cast_op)
    data_set = data_set.map(input_columns="image", operations=random_horizontal_op)
    data_set = data_set.map(input_columns="image", operations=changeswap_op)

    # Shuffle the dataset.
    data_set = data_set.shuffle(buffer_size=data_set.get_dataset_size())

    # Combine batch_size consecutive samples into one batch.
    data_set = data_set.batch(batch_size=batch_size, drop_remainder=True)
    return data_set
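
Because the pipeline shards the data and batches with `drop_remainder=True`, the number of steps per epoch is smaller than the raw sample count suggests. The sketch below estimates it; `steps_per_epoch` is a hypothetical helper, the sample count of 1000 is made up, and the rule that each shard receives `ceil(num_samples / num_shards)` samples is an assumption about the sampler rather than something stated in this document.

```python
import math


def steps_per_epoch(num_samples, batch_size, num_shards=1):
    """Estimate the batches one device sees per epoch when partial batches are dropped."""
    # Assumed sharding rule: samples are split evenly, rounding up per shard.
    per_shard = math.ceil(num_samples / num_shards)
    # drop_remainder=True discards the final incomplete batch.
    return per_shard // batch_size


# Hypothetical sample count, single device versus eight devices.
print(steps_per_epoch(1000, batch_size=32))
print(steps_per_epoch(1000, batch_size=32, num_shards=8))
```

In the actual pipeline the authoritative value is `data_set.get_dataset_size()`, which the training script reads after the dataset is created.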



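### 5. Experiment Content

This section uses the mindspore.nn APIs to build the complete VGG17 network. A network built with MindSpore must inherit from the mindspore.nn.Cell class and override the \__init\__ and construct methods.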

In [3]:
import mindspore.nn as nn


class Vgg(nn.Cell):
    """
    VGG network definition.

    Args:
        num_classes (int): Number of classes. Default: 5.
        args: Configuration object; its has_dropout attribute enables dropout.
        phase (str): "train" or "test", selecting the training or evaluation phase.

    Returns:
        Tensor, infer output tensor.

    Example:
        self.layer1_conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3,weight_init='XavierUniform')
        self.layer1_bn1 = nn.BatchNorm2d(num_features=64)
        self.layer1_relu1 = nn.ReLU()

    """
    def __init__(self, num_classes=5, args=None, phase="train"):
        super(Vgg, self).__init__()
        dropout_ratio = 0.5
        # If args is None, getattr falls back to True instead of raising AttributeError.
        if not getattr(args, "has_dropout", True) or phase == "test":
            dropout_ratio = 1.0
        
        self.layer1_conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3,weight_init='XavierUniform')
        self.layer1_bn1 = nn.BatchNorm2d(num_features=64)
        self.layer1_relu1 = nn.ReLU()
        self.layer1_conv2 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,weight_init='XavierUniform')
        self.layer1_bn2 = nn.BatchNorm2d(num_features=64)
        self.layer1_relu2 = nn.ReLU()
        self.layer1_maxpool = nn.MaxPool2d(kernel_size=2, stride=2)

        self.layer2_conv1 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3,weight_init='XavierUniform')
        self.layer2_bn1 = nn.BatchNorm2d(num_features=128)
        self.layer2_relu1 = nn.ReLU()
        self.layer2_conv2 = nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3,weight_init='XavierUniform')
        self.layer2_bn2 = nn.BatchNorm2d(num_features=128)
        self.layer2_relu2 = nn.ReLU()
        self.layer2_maxpool = nn.MaxPool2d(kernel_size=2, stride=2)

        self.layer3_conv1 = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3,weight_init='XavierUniform')
        self.layer3_bn1 = nn.BatchNorm2d(num_features=256)
        self.layer3_relu1 = nn.ReLU()
        self.layer3_conv2 = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3,weight_init='XavierUniform')
        self.layer3_bn2 = nn.BatchNorm2d(num_features=256)
        self.layer3_relu2 = nn.ReLU()
        self.layer3_conv3 = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3,weight_init='XavierUniform')
        self.layer3_bn3 = nn.BatchNorm2d(num_features=256)
        self.layer3_relu3 = nn.ReLU()
        self.layer3_maxpool = nn.MaxPool2d(kernel_size=2, stride=2)

        self.layer4_conv1 = nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3,weight_init='XavierUniform')
        self.layer4_bn1 = nn.BatchNorm2d(num_features=512)
        self.layer4_relu1 = nn.ReLU()
        self.layer4_conv2 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3,weight_init='XavierUniform')
        self.layer4_bn2 = nn.BatchNorm2d(num_features=512)
        self.layer4_relu2 = nn.ReLU()
        self.layer4_conv3 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3,weight_init='XavierUniform')
        self.layer4_bn3 = nn.BatchNorm2d(num_features=512)
        self.layer4_relu3 = nn.ReLU()
        self.layer4_maxpool = nn.MaxPool2d(kernel_size=2, stride=2)

        self.layer5_conv1 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3,weight_init='XavierUniform')
        self.layer5_bn1 = nn.BatchNorm2d(num_features=512)
        self.layer5_relu1 = nn.ReLU()
        self.layer5_conv2 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3,weight_init='XavierUniform')
        self.layer5_bn2 = nn.BatchNorm2d(num_features=512)
        self.layer5_relu2 = nn.ReLU()
        self.layer5_conv3 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3,weight_init='XavierUniform')
        self.layer5_bn3 = nn.BatchNorm2d(num_features=512)
        self.layer5_relu3 = nn.ReLU()
        self.layer5_conv4 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3,weight_init='XavierUniform')
        self.layer5_bn4 = nn.BatchNorm2d(num_features=512)
        self.layer5_relu4 = nn.ReLU()
        self.layer5_maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()

        self.fullyconnect1 = nn.Dense(512 * 7 * 7, 4096)
        self.relu_1 = nn.ReLU()
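        # In MindSpore 2.0 the first positional argument of nn.Dropout is keep_prob,
        # so dropout_ratio=1.0 disables dropout.
        self.dropout_1 = nn.Dropout(dropout_ratio)

        self.fullyconnect2 = nn.Dense(4096, 4096)
        self.relu_2 = nn.ReLU()
        # nn.Dropout holds no parameters; construct() applies dropout_1 after both layers.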

        self.fullyconnect3 = nn.Dense(4096, num_classes)


    def construct(self, x):
        x  =  self.layer1_conv1(x) 
        x  =  self.layer1_bn1(x)
        x  =  self.layer1_relu1(x) 
        x  =  self.layer1_conv2(x)
        x  =  self.layer1_bn2(x)
        x  =  self.layer1_relu2(x) 
        x  =  self.layer1_maxpool(x)

        x  =  self.layer2_conv1(x)
        x  =  self.layer2_bn1(x) 
        x  =  self.layer2_relu1(x) 
        x  =  self.layer2_conv2(x)
        x  =  self.layer2_bn2(x) 
        x  =  self.layer2_relu2(x) 
        x  =  self.layer2_maxpool(x)

        x  =  self.layer3_conv1(x)
        x  =  self.layer3_bn1(x) 
        x  =  self.layer3_relu1(x) 
        x  =  self.layer3_conv2(x)
        x  =  self.layer3_bn2(x) 
        x  =  self.layer3_relu2(x) 
        x  =  self.layer3_conv3(x)
        x  =  self.layer3_bn3(x) 
        x  =  self.layer3_relu3(x) 
        x  =  self.layer3_maxpool(x)

        x  =  self.layer4_conv1(x)
        x  =  self.layer4_bn1(x) 
        x  =  self.layer4_relu1(x) 
        x  =  self.layer4_conv2(x)
        x  =  self.layer4_bn2(x) 
        x  =  self.layer4_relu2(x) 
        x  =  self.layer4_conv3(x)
        x  =  self.layer4_bn3(x)
        x  =  self.layer4_relu3(x) 
        x  =  self.layer4_maxpool(x)
        
        x  =  self.layer5_conv1(x)
        x  =  self.layer5_bn1(x) 
        x  =  self.layer5_relu1(x) 
        x  =  self.layer5_conv2(x)
        x  =  self.layer5_bn2(x) 
        x  =  self.layer5_relu2(x) 
        x  =  self.layer5_conv3(x)
        x  =  self.layer5_bn3(x) 
        x  =  self.layer5_relu3(x) 
        x  =  self.layer5_conv4(x)
        x  =  self.layer5_bn4(x) 
        x  =  self.layer5_relu4(x) 
        x  =  self.layer5_maxpool(x)

        x = self.flatten(x) 
        x = self.fullyconnect1(x) 
        x = self.relu_1(x)
        x = self.dropout_1(x) 
        x = self.fullyconnect2(x)
        x = self.relu_2(x) 
        x = self.dropout_1(x) 
        x = self.fullyconnect3(x) 

        return x
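
To get a feel for where the weights of this network sit, the parameter count can be estimated without MindSpore. This is a rough sketch under stated assumptions: convolutions carry no bias (the `nn.Conv2d` default), each BatchNorm layer contributes a trainable gamma and beta per channel, and `nn.Dense` layers include a bias. The helper names are hypothetical.

```python
# (number of 3x3 convolutions, output channels) for each of the five stages
STAGES = [(2, 64), (2, 128), (3, 256), (3, 512), (4, 512)]


def conv_params(c_in, c_out, kernel=3):
    """Weights of a bias-free convolution (assumed default)."""
    return c_in * c_out * kernel * kernel


def bn_params(channels):
    """Trainable gamma and beta of a BatchNorm layer."""
    return 2 * channels


def dense_params(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def count_vgg17_params(num_classes=5):
    """Return the trainable parameter count grouped by layer type."""
    totals = {"conv": 0, "bn": 0, "dense": 0}
    c_in = 3
    for num_convs, c_out in STAGES:
        for _ in range(num_convs):
            totals["conv"] += conv_params(c_in, c_out)
            totals["bn"] += bn_params(c_out)
            c_in = c_out
    for n_in, n_out in ((512 * 7 * 7, 4096), (4096, 4096), (4096, num_classes)):
        totals["dense"] += dense_params(n_in, n_out)
    return totals


totals = count_vgg17_params()
for kind, count in totals.items():
    print(f"{kind}: {count:,}")
print(f"total: {sum(totals.values()):,}")
```

Under these assumptions the three fully connected layers hold far more parameters than all fourteen convolutions combined, which is why dropout is applied there.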


### 6. Model Construction

This section implements the run_train() function used for model training.

In [4]:
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
#################train vgg17 example on flowerphotos########################
"""
import datetime
import os
import time
import random
import numpy
import mindspore
import mindspore.nn as nn
from mindspore import Tensor
from mindspore import context
from mindspore.communication.management import init, get_rank, get_group_size
from mindspore.nn import Momentum
from mindspore.train import Accuracy
from mindspore.train import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor,Callback
from mindspore.train import Model
from mindspore import ParallelMode
from mindspore import load_param_into_net, load_checkpoint
from mindspore.amp import FixedLossScaleManager
from mindspore import set_seed
from src.dataset import vgg_create_dataset
from src.dataset import classification_dataset

from src.crossentropy import CrossEntropy
from src.warmup_step_lr import warmup_step_lr
from src.warmup_cosine_annealing_lr import warmup_cosine_annealing_lr
from src.warmup_step_lr import lr_steps
from src.utils.logging import get_logger
from src.utils.util import get_param_groups
from src.vgg import vgg17

from model_utils.moxing_adapter import config
from model_utils.moxing_adapter import moxing_wrapper
from model_utils.device_adapter import get_device_id, get_rank_id, get_device_num

import sys
f = open('train.log', 'w')
sys.stdout = f
sys.stderr = f  

# ModelArts pre-processing
def modelarts_pre_process():
    '''modelarts pre process function.'''
    def unzip(zip_file, save_dir):
        import zipfile
        s_time = time.time()
        if not os.path.exists(os.path.join(save_dir, config.modelarts_dataset_unzip_name)):
            zip_isexist = zipfile.is_zipfile(zip_file)
            if zip_isexist:
                fz = zipfile.ZipFile(zip_file, 'r')
                data_num = len(fz.namelist())
                print("Extract Start...")
                print("unzip file num: {}".format(data_num))
                data_print = int(data_num / 100) if data_num > 100 else 1
                i = 0
                for file in fz.namelist():
                    if i % data_print == 0:
                        print("unzip percent: {}%".format(int(i * 100 / data_num)), flush=True)
                    i += 1
                    fz.extract(file, save_dir)
                print("cost time: {}min:{}s.".format(int((time.time() - s_time) / 60),
                                                     int(int(time.time() - s_time) % 60)))
                print("Extract Done.")
            else:
                print("This is not zip.")
        else:
            print("Zip has been extracted.")

    if config.need_modelarts_dataset_unzip:
        zip_file_1 = os.path.join(config.data_path, config.modelarts_dataset_unzip_name + ".zip")
        save_dir_1 = os.path.join(config.data_path)

        sync_lock = "/tmp/unzip_sync.lock"

        # Each server contains 8 devices at most.
        if config.device_target == "GPU":
            init()
            device_id = get_rank()
            device_num = get_group_size()
        elif config.device_target == "Ascend":
            device_id = get_device_id()
            device_num = get_device_num()
        else:
            raise ValueError("Not support device_target.")

        if device_id % min(device_num, 8) == 0 and not os.path.exists(sync_lock):
            print("Zip file path: ", zip_file_1)
            print("Unzip file save dir: ", save_dir_1)
            unzip(zip_file_1, save_dir_1)
            print("===Finish extract data synchronization===")
            try:
                os.mknod(sync_lock)
            except IOError:
                pass

        while True:
            if os.path.exists(sync_lock):
                break
            time.sleep(1)

        print("Device: {}, Finish sync unzip data from {} to {}.".format(device_id, zip_file_1, save_dir_1))

    config.ckpt_path = os.path.join(config.output_path, config.ckpt_path)

# Training
@moxing_wrapper(pre_process=modelarts_pre_process)
def run_train():
    '''run train'''
    config.lr_epochs = list(map(int, config.lr_epochs.split(',')))
    config.image_size = list(map(int, config.image_size.split(',')))
    config.per_batch_size = config.batch_size

    _enable_graph_kernel = (config.device_target == "GPU")
    context.set_context(mode=context.PYNATIVE_MODE,
                        enable_graph_kernel=_enable_graph_kernel, device_target=config.device_target)
    config.rank = get_rank_id()
    config.device_id = get_device_id()
    config.group_size = get_device_num()

    if config.is_distributed:
        if config.device_target == "Ascend":
            init()
            context.set_context(device_id=config.device_id)
        elif config.device_target == "GPU":
            if not config.enable_modelarts:
                init()
            else:
                if not config.need_modelarts_dataset_unzip:
                    init()
    
        device_num = config.group_size
        context.reset_auto_parallel_context()
        context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True, all_reduce_fusion_config=[15, 18])
    
    else:
        if config.device_target == "Ascend":
            if context.get_context('device_id')!=config.device_id:
                context.set_context(device_id=config.device_id)
    
    
    # select for master rank save ckpt or all rank save, compatible for model parallel
    config.rank_save_ckpt_flag = 0
    if config.is_save_on_master:
        if config.rank == 0:
            config.rank_save_ckpt_flag = 1
    else:
        config.rank_save_ckpt_flag = 1

    # logger
    config.outputs_dir = os.path.join(config.ckpt_path,
                                      datetime.datetime.now().strftime('%Y-%m-%d_time_%H_%M_%S'))
    config.logger = get_logger(config.outputs_dir, config.rank)

    if config.dataset == "flower_photos":
        dataset = vgg_create_dataset(config.data_dir, config.image_size, config.per_batch_size,
                                     config.rank, config.group_size)
        # The evaluation dataset reads the test split.
        eval_dataset = vgg_create_dataset(config.data_dir, config.image_size, config.per_batch_size,
                                          config.rank, config.group_size, training=False)
    else:
        raise ValueError("Unsupported dataset: {}".format(config.dataset))

    batch_num = dataset.get_dataset_size()
    config.steps_per_epoch = dataset.get_dataset_size()
    config.logger.save_args(config)

    # network
    config.logger.important_info('start create network')

    # Build the network
    network = vgg17(config.num_classes, config)
    network.set_train(True)

    # Load pre-trained weights if a checkpoint file is given
    if config.pre_trained:
        load_param_into_net(network, load_checkpoint(config.pre_trained))

    # Learning rate
    if config.lr_scheduler == 'exponential':
        lr = warmup_step_lr(config.lr,
                            config.lr_epochs,
                            config.steps_per_epoch,
                            config.warmup_epochs,
                            config.max_epoch,
                            gamma=config.lr_gamma,
                            )
    elif config.lr_scheduler == 'cosine_annealing':
        lr = warmup_cosine_annealing_lr(config.lr,
                                        config.steps_per_epoch,
                                        config.warmup_epochs,
                                        config.max_epoch,
                                        config.T_max,
                                        config.eta_min)
    elif config.lr_scheduler == 'step':
        lr = lr_steps(0, lr_init=config.lr_init, lr_max=config.lr_max, warmup_epochs=config.warmup_epochs,
                      total_epochs=config.max_epoch, steps_per_epoch=batch_num)
    else:
        raise NotImplementedError(config.lr_scheduler)

    # Optimizer
    opt = Momentum(params=get_param_groups(network),
                   learning_rate=Tensor(lr),
                   momentum=config.momentum,
                   weight_decay=config.weight_decay,
                   loss_scale=config.loss_scale)

    
    loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
    model = Model(network, loss_fn=loss, optimizer=opt, metrics={"Accuracy": Accuracy()},
                    amp_level="O2", keep_batchnorm_fp32=False, loss_scale_manager=None)
   
    # Define the callbacks
    time_cb = TimeMonitor(data_size=batch_num)
    loss_cb = LossMonitor()
    #epoch_per_eval = {"epoch": [], "acc": []}
    #eval_cb = EvalCallBack(model, eval_dataset, 1, epoch_per_eval)  # evaluate after every epoch
    callbacks = [time_cb, loss_cb]
    if config.rank_save_ckpt_flag:
        ckpt_config = CheckpointConfig(save_checkpoint_steps=config.ckpt_interval * config.steps_per_epoch,
                                       keep_checkpoint_max=config.keep_checkpoint_max)
        save_ckpt_path = os.path.join(config.outputs_dir, 'ckpt_' + str(config.rank) + '/')
        print(save_ckpt_path)
        ckpt_cb = ModelCheckpoint(config=ckpt_config,
                                  directory=save_ckpt_path,
                                  prefix='{}'.format(config.rank))
        callbacks.append(ckpt_cb)
    
    # Train the model
    model.train(config.max_epoch, dataset, callbacks=callbacks)




Namespace(config_path='/home/ma-user/work/exp4/teacher/code/flowerphotos_config.yaml')
{'device_target': 'device where the code will be implemented.', 'dataset': 'flower_photos', 'data_dir': 'data dir', 'pre_trained': 'model_path, local pretrained model to load', 'lr_gamma': 'decrease lr by a factor of exponential lr_scheduler', 'eta_min': 'eta_min in cosine_annealing scheduler', 'T_max': 'T-max in cosine_annealing scheduler', 'log_interval': 'logging interval', 'ckpt_path': 'checkpoint save location', 'ckpt_interval': 'ckpt_interval', 'is_save_on_master': 'save ckpt on master or all rank', 'is_distributed': 'if multi device', 'per_batch_size': 'batch size for per npu', 'graph_ckpt': 'graph ckpt or feed ckpt', 'log_path': 'path to save log', 'result_dir': 'result files path.', 'label_dir': 'image file path.', 'dataset_name': 'flower_photos', 'result_path': 'result path', 'ckpt_file': 'vgg17 ckpt file.', 'file_name': 'vgg17 output file name.', 'file_format': "file format, choices in ['A
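
The learning-rate helpers imported by the training script (`warmup_step_lr` and others) live in `src` and are not shown in this document. As a rough illustration only, the sketch below shows one common form of such a schedule: linear warmup followed by multiplying the rate by `gamma` at each milestone epoch. The function `warmup_step_schedule` and its exact behavior are assumptions and may differ from the `src` implementation.

```python
def warmup_step_schedule(base_lr, milestones, steps_per_epoch, warmup_epochs,
                         max_epoch, gamma=0.1):
    """Per-step learning rates: linear warmup, then decay by gamma at each milestone epoch."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = max_epoch * steps_per_epoch
    lrs = []
    for step in range(total_steps):
        if step < warmup_steps:
            # Ramp linearly from base_lr / warmup_steps up to base_lr.
            lr = base_lr * (step + 1) / warmup_steps
        else:
            epoch = step // steps_per_epoch
            # Count how many milestone epochs have already been reached.
            passed = sum(1 for milestone in milestones if epoch >= milestone)
            lr = base_lr * gamma ** passed
        lrs.append(lr)
    return lrs


# Small hypothetical configuration: 6 epochs of 10 steps, decay at epochs 2 and 4.
schedule = warmup_step_schedule(0.1, [2, 4], steps_per_epoch=10,
                                warmup_epochs=1, max_epoch=6)
for epoch in range(6):
    print(epoch, schedule[epoch * 10])
```

The returned list has one entry per training step, which matches how the script wraps `lr` in a `Tensor` before passing it to the optimizer.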

### 7. Model Training and Validation

Run on Ascend or GPU by calling run_train(); the results are written to train.log.

In [5]:
config.config_path="flowerphotos_config.yaml"
config.dataset="flower_photos"
config.is_distributed=0
config.data_dir="../data"
config.device_target="Ascend" # or "GPU"
config.lr_epochs='30,60,90,120'
config.image_size="224,224"
config.pre_trained="pretrained/0-400_45.ckpt"

run_train()
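
Several options in the cell above are given as comma-separated strings and converted to integer lists at the start of run_train(). A minimal sketch of that conversion is shown here; `parse_int_list` is a hypothetical helper that mirrors the `list(map(int, value.split(',')))` expression in the training script.

```python
def parse_int_list(value):
    """Turn a comma-separated option such as "30,60,90,120" into a list of ints."""
    return [int(item) for item in value.split(",")]


lr_epochs = parse_int_list("30,60,90,120")
image_size = parse_int_list("224,224")
print(lr_epochs, image_size)
```

Because run_train() overwrites `config.lr_epochs` and `config.image_size` with the parsed lists, calling it a second time in the same session would likely fail on `.split`; resetting both options to strings first avoids this.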
    

Evaluation uses the Model.eval interface (Model is imported from mindspore.train). The code is as follows:

In [6]:
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Eval"""
import os
import time
import datetime
import random 
import mindspore
import glob
import numpy as np
import mindspore.nn as nn

from mindspore import Tensor, context
from mindspore.communication import init, get_rank, get_group_size
from mindspore.train import Model
from mindspore import load_checkpoint, load_param_into_net
import mindspore.ops as P
from mindspore import dtype as mstype

from src.utils.logging import get_logger
from src.vgg import vgg17
from src.dataset import vgg_create_dataset
from src.dataset import classification_dataset

from model_utils.moxing_adapter import config
from model_utils.moxing_adapter import moxing_wrapper
from model_utils.device_adapter import get_device_id, get_rank_id, get_device_num

import sys
f = open('eval.log', 'w')
sys.stdout = f
sys.stderr = f 

class ParameterReduce(nn.Cell):
    """ParameterReduce"""
    def __init__(self):
        super(ParameterReduce, self).__init__()
        self.cast = P.Cast()
        self.reduce = P.AllReduce()

    def construct(self, x):
        one = self.cast(P.scalar_to_tensor(1.0), mstype.float32)[0]
        out = x * one
        ret = self.reduce(out)
        return ret


def get_top5_acc(top5_arg, gt_class):
    sub_count = 0
    for top5, gt in zip(top5_arg, gt_class):
        if gt in top5:
            sub_count += 1
    return sub_count


def modelarts_pre_process():
    '''modelarts pre process function.'''
    def unzip(zip_file, save_dir):
        import zipfile
        s_time = time.time()
        if not os.path.exists(os.path.join(save_dir, config.modelarts_dataset_unzip_name)):
            zip_isexist = zipfile.is_zipfile(zip_file)
            if zip_isexist:
                fz = zipfile.ZipFile(zip_file, 'r')
                data_num = len(fz.namelist())
                print("Extract Start...")
                print("unzip file num: {}".format(data_num))
                data_print = int(data_num / 100) if data_num > 100 else 1
                i = 0
                for file in fz.namelist():
                    if i % data_print == 0:
                        print("unzip percent: {}%".format(int(i * 100 / data_num)), flush=True)
                    i += 1
                    fz.extract(file, save_dir)
                print("cost time: {}min:{}s.".format(int((time.time() - s_time) / 60),
                                                     int(int(time.time() - s_time) % 60)))
                print("Extract Done.")
            else:
                print("This is not zip.")
        else:
            print("Zip has been extracted.")

    if config.need_modelarts_dataset_unzip:
        zip_file_1 = os.path.join(config.data_path, config.modelarts_dataset_unzip_name + ".zip")
        save_dir_1 = os.path.join(config.data_path)

        sync_lock = "/tmp/unzip_sync.lock"

        # Each server contains 8 devices at most.
        if config.device_target == "GPU":
            init()
            device_id = get_rank()
            device_num = get_group_size()
        elif config.device_target == "Ascend":
            device_id = get_device_id()
            device_num = get_device_num()
        else:
            raise ValueError("Not support device_target.")

        # Only the first device on each server extracts the archive.
        if device_id % min(device_num, 8) == 0 and not os.path.exists(sync_lock):
            print("Zip file path: ", zip_file_1)
            print("Unzip file save dir: ", save_dir_1)
            unzip(zip_file_1, save_dir_1)
            print("===Finish extract data synchronization===")
            try:
                os.mknod(sync_lock)
            except IOError:
                pass

        while True:
            if os.path.exists(sync_lock):
                break
            time.sleep(1)

        print("Device: {}, Finish sync unzip data from {} to {}.".format(device_id, zip_file_1, save_dir_1))

    config.log_path = os.path.join(config.output_path, config.log_path)


@moxing_wrapper(pre_process=modelarts_pre_process)
def run_eval():
    """run eval"""
    config.per_batch_size = config.batch_size
    config.image_size = list(map(int, config.image_size.split(',')))
    config.rank = get_rank_id()
    config.group_size = get_device_num()


    _enable_graph_kernel = config.device_target == "GPU"
    context.set_context(mode=context.GRAPH_MODE, enable_graph_kernel=False,
                        device_target=config.device_target, save_graphs=False)
    if os.getenv('DEVICE_ID', "not_set").isdigit() and config.device_target == "Ascend":
        if context.get_context('device_id')!=int(os.getenv('DEVICE_ID')):
            context.set_context(device_id=int(os.getenv('DEVICE_ID')))

    config.outputs_dir = os.path.join(config.log_path,
                                      datetime.datetime.now().strftime('%Y-%m-%d_time_%H_%M_%S'))

    config.logger = get_logger(config.outputs_dir, config.rank)
    config.logger.save_args(config)

    if config.dataset == "flower_photos":
        net = vgg17(num_classes=config.num_classes, args=config,phase="test")
        loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
        model = Model(net, loss_fn=loss, metrics={'acc'})
        param_dict = load_checkpoint(config.pre_trained)
        load_param_into_net(net, param_dict)
        net.set_train(False)
        dataset = vgg_create_dataset(config.data_dir, config.image_size, config.per_batch_size, training=False)
        res = model.eval(dataset)
        print("result: ", res)
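
The `'acc'` metric reported by `model.eval` is a top-1 accuracy: the fraction of samples whose highest-scoring class matches the label. The sketch below illustrates that computation on plain Python lists; `top1_accuracy` is a hypothetical helper for illustration, not the MindSpore implementation, and the logits are made up.

```python
def top1_accuracy(logits, labels):
    """Fraction of rows whose arg-max class index equals the label."""
    correct = 0
    for row, label in zip(logits, labels):
        # Index of the largest score in this row.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Made-up scores for three samples over the five flower classes.
logits = [
    [2.0, 0.1, 0.3, 0.2, 0.1],  # predicts class 0
    [0.1, 0.2, 0.3, 3.0, 0.4],  # predicts class 3
    [0.5, 1.5, 0.2, 0.1, 0.3],  # predicts class 1
]
labels = [0, 3, 4]
print(top1_accuracy(logits, labels))
```

Here two of the three predictions match their labels.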


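Evaluation can use a weight file generated under output_flowers. Here a pre-generated weight file in the output folder is used as an example.
Run the evaluation as follows: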

In [7]:
config.pre_trained="output/output_example.ckpt"
config.dataset="flower_photos"
config.image_size="224,224"
config.device_target="Ascend" # or "GPU"
config.data_dir="../data"

run_eval()
