- Contents
- WideResNet Description
- Model Architecture
- Dataset
- Environment Requirements
- Quick Start
- Script Description
- Model Description
- Description of Random Situation
- ModelZoo Homepage
Zagoruyko and Komodakis proposed WideResNet as an extension of ResNet to address a weakness of very deep, thin networks: only a limited number of layers learn useful representations, while many layers contribute little to the final result, a problem known as diminishing feature reuse. WideResNet widens the residual blocks instead of stacking more of them, which speeds up training several times over and also noticeably improves accuracy.
Like ResNet, WideResNet is not a single fixed architecture but a family of networks built on the idea of wide residual blocks, so there is a whole group of networks called "WideResNet". Unlike ResNet, a WideResNet is identified by two numbers rather than one: the first is the number of layers, as in ResNet, and the second is the widening factor, which states how many times wider its blocks are than the corresponding ResNet blocks.
This is an example of training WideResNet-40-10 (40 layers, 10 times wider) on the CIFAR-10 dataset with MindSpore.
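As a quick illustration of the naming convention (a sketch in plain Python, not code from this repository), the depth and the widening factor determine the per-group block count and the channel widths of the WRN basic-block design:

```python
# Illustrative only: how "WideResNet-d-k" maps to the block layout.
def wrn_layout(depth, widen_factor):
    # For basic (two-conv) residual blocks the paper uses depth = 6*n + 4,
    # where n is the number of blocks in each of the three groups.
    assert (depth - 4) % 6 == 0, "depth must be of the form 6n + 4"
    n = (depth - 4) // 6
    widths = [16 * widen_factor, 32 * widen_factor, 64 * widen_factor]
    return n, widths

print(wrn_layout(40, 10))  # (6, [160, 320, 640]) for WideResNet-40-10
```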
1. [Paper] Wide Residual Networks: Sergey Zagoruyko, Nikos Komodakis.
The overall network architecture of WideResNet is described in the paper.
Dataset used: CIFAR-10
- Dataset size: 60,000 32*32 color images in 10 classes
- Train: 50,000 images
- Test: 10,000 images
- Data format: binary files
- Note: Data will be processed in dataset.py (a loading sketch follows the directory structure below)
- Download the dataset, the directory structure is as follows:
├─cifar-10-batches-bin
│
└─cifar-10-verify-bin
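For reference, the following is a minimal sketch of how such a CIFAR-10 directory can be loaded and preprocessed with MindSpore's dataset API. The transform parameters (crop padding, normalization constants) are illustrative assumptions; the actual pipeline is the one defined in src/dataset.py.

```python
# Illustrative sketch only -- the actual preprocessing lives in src/dataset.py.
# Transform parameters are assumptions, not values taken from this repository.
import mindspore.common.dtype as mstype
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as C2
import mindspore.dataset.vision.c_transforms as C

def create_cifar10_dataset(data_dir, batch_size=32, training=True):
    data_set = ds.Cifar10Dataset(data_dir, shuffle=training)

    trans = []
    if training:
        trans += [C.RandomCrop((32, 32), (4, 4, 4, 4)),
                  C.RandomHorizontalFlip(prob=0.5)]
    trans += [C.Rescale(1.0 / 255.0, 0.0),
              C.Normalize([0.4914, 0.4822, 0.4465], [0.2470, 0.2435, 0.2616]),
              C.HWC2CHW()]

    data_set = data_set.map(operations=trans, input_columns="image")
    data_set = data_set.map(operations=C2.TypeCast(mstype.int32), input_columns="label")
    return data_set.batch(batch_size, drop_remainder=True)
```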
- Hardware (Ascend)
- Prepare hardware environment with Ascend.
- Framework
- For more information, please check the resources below:
After installing MindSpore via the official website, you can start training and evaluation as follows:
- Running on Ascend
# Distributed training
usage: bash run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [CONFIG_PATH] [EXPERIMENT_LABEL]
[DATASET_PATH] is the path of the dataset.
# Standalone training
usage: bash run_standalone_train.sh [DATASET_PATH] [CONFIG_PATH] [EXPERIMENT_LABEL]
[DATASET_PATH] is the path of the dataset.
# Run evaluation example
usage: bash run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH] [CONFIG_PATH]
[DATASET_PATH] is the path of the dataset.
[CHECKPOINT_PATH] is the path of the trained checkpoint (.ckpt) file.
.
└──WideResNet
  ├── requirements.txt
  ├── README.md
  ├── config
  │   └── wideresnet_cifar10_config.yaml  # parameter configuration
  ├── scripts
  │   ├── run_distribute_train.sh         # launch Ascend distributed training (8 pcs)
  │   ├── run_standalone_train.sh         # launch Ascend standalone training (1 pc)
  │   ├── run_eval.sh                     # launch Ascend evaluation
  │   └── cache_util.sh                   # a collection of helper functions to manage the cache server
  ├── src
  │   ├── dataset.py                      # data preprocessing
  │   ├── callbacks.py                    # evaluation and checkpoint-saving callbacks
  │   ├── cross_entropy_smooth.py         # label-smoothed loss definition (sketched below)
  │   ├── generator_lr.py                 # generate the learning rate for each step
  │   ├── wide_resnet.py                  # WideResNet backbone
  │   └── model_utils
  │       └── config.py                   # parameter configuration parsing
  ├── export.py                           # export the network (Ascend 910)
  ├── eval.py                             # evaluation script
  └── train.py                            # training script
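As a rough illustration of what src/cross_entropy_smooth.py provides, here is a minimal label-smoothed cross-entropy loss in MindSpore. It is a sketch under the assumption of a standard smoothing scheme, not the repository's exact implementation.

```python
# Minimal label-smoothed cross-entropy sketch (not the repository's exact code).
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor
from mindspore.common import dtype as mstype

class CrossEntropySmooth(nn.Cell):
    def __init__(self, num_classes=10, smooth_factor=0.1):
        super(CrossEntropySmooth, self).__init__()
        self.onehot = ops.OneHot()
        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
        self.off_value = Tensor(smooth_factor / (num_classes - 1), mstype.float32)
        self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction="mean")
        self.num_classes = num_classes

    def construct(self, logits, label):
        # Spread smooth_factor over the off-target classes, then apply softmax CE.
        one_hot_label = self.onehot(label, self.num_classes, self.on_value, self.off_value)
        return self.ce(logits, one_hot_label)
```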
Parameters for both training and evaluation can be set in the config file.
- Config for WideResNet-40-10, CIFAR-10 dataset
"num_classes" : 10 , # Number of data set classes
"batch_size" : 32 , # Input tensor batch size
"epoch_size" : 300 , # Training period size
"save_checkpoint_path" : "./" , # Checkpoint relative execution path Jin’s save path
"repeat_num" : 1 , # number of repetitions of data set
"widen_factor" : 10 , # network width
"depth" : 40 , # network depth
"lr_init" : 0.1 , # initial learning rate
"weight_decay" : 5e-4 , # Weight decay
"momentum" :0.9 , # Momentum optimizer
"loss_scale" : 32 , # Loss level
"save_checkpoint" : False , # Whether to save checkpoints during training
"save_checkpoint_epochs" : 5 , # Period interval between two checkpoints; by default, the last check Points will be saved after the last cycle is completed
"use_label_smooth" : True , # label smoothing
"label_smooth_factor" : 0.1 , # label smoothing factor
"pretrain_epoch_size" : 0 , # pretrain Training period
"warmup_epochs" :5, # Warm-up cycle
# Distributed training
usage: bash run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [CONFIG_PATH] [LABEL]
[DATASET_PATH] is the path of the dataset.
# Standalone training
usage: bash run_standalone_train.sh [DATASET_PATH] [CONFIG_PATH] [LABEL]
[DATASET_PATH] is the path of the dataset.
For distributed training, a hccl configuration file with JSON format needs to be created in advance.
Please follow the instructions in the link hccn_tools.
The training results will be stored in the example path, in a folder whose name begins with "train" or "train_parallel". There you can find the checkpoint files together with results like the following in the log.
If you want to change the device_id for standalone training, you can set the environment variable export DEVICE_ID=x or set device_id=x in the context.
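A minimal sketch of the second option; reading DEVICE_ID from the environment is an illustrative pattern, and train.py may take the value from its config instead.

```python
# Sketch: set the target device in the MindSpore context for standalone training.
import os
from mindspore import context

device_id = int(os.getenv("DEVICE_ID", "0"))  # illustrative: fall back to device 0
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=device_id)
```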
# distributed training Ascend with evaluation example:
bash run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [CONFIG_PATH] [LABEL] [RUN_EVAL] [EVAL_DATASET_PATH]
# standalone training Ascend with evaluation example:
bash run_standalone_train.sh [DATASET_PATH] [CONFIG_PATH] [LABEL] [RUN_EVAL] [EVAL_DATASET_PATH]
RUN_EVAL and EVAL_DATASET_PATH are optional arguments; setting RUN_EVAL=True allows you to run evaluation while training. When RUN_EVAL is set, EVAL_DATASET_PATH must also be set.
You can also set the optional arguments save_best_ckpt, eval_start_epoch, and eval_interval for the Python script when RUN_EVAL is True (see the callback sketch below).
By default, a standalone cache server would be started to cache all eval images in tensor format in memory to improve the evaluation performance. Please make sure the dataset fits in memory (Around 30GB of memory required for ImageNet2012 eval dataset, 6GB of memory required for CIFAR-10 eval dataset).
Users can choose to shut down the cache server after training or leave it running for future use.
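For reference, here is a simplified sketch of an evaluate-while-training callback in the spirit of src/callbacks.py. The constructor arguments mirror the options above, the metric name assumes the Model was compiled with a top_1_accuracy metric, and the repository's implementation may differ in its details.

```python
# Simplified "evaluate while training" callback sketch.
from mindspore.train.callback import Callback
from mindspore.train.serialization import save_checkpoint

class EvalCallback(Callback):
    def __init__(self, model, eval_dataset, eval_start_epoch=1, eval_interval=1,
                 save_best_ckpt=True, ckpt_path="./WideResNet_best.ckpt"):
        super(EvalCallback, self).__init__()
        self.model = model
        self.eval_dataset = eval_dataset
        self.eval_start_epoch = eval_start_epoch
        self.eval_interval = eval_interval
        self.save_best_ckpt = save_best_ckpt
        self.ckpt_path = ckpt_path
        self.best_acc = 0.0

    def epoch_end(self, run_context):
        cb_params = run_context.original_args()
        cur_epoch = cb_params.cur_epoch_num
        if cur_epoch < self.eval_start_epoch or (cur_epoch - self.eval_start_epoch) % self.eval_interval != 0:
            return
        # Assumes the Model was built with a "top_1_accuracy" metric.
        acc = self.model.eval(self.eval_dataset)["top_1_accuracy"]
        if acc > self.best_acc:
            self.best_acc = acc
            if self.save_best_ckpt:
                save_checkpoint(cb_params.train_network, self.ckpt_path)
        print("epoch: {}, top_1_accuracy: {}, best: {}".format(cur_epoch, acc, self.best_acc))
```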
# distributed training
Usage: bash run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [CONFIG_PATH] [EXPERIMENT_LABEL] [PRETRAINED_CKPT_PATH]
# standalone training
Usage: bash run_standalone_train.sh [DATASET_PATH] [CONFIG_PATH] [EXPERIMENT_LABEL] [PRETRAINED_CKPT_PATH]
- Training WideResNet-40-10 with CIFAR-10 dataset
# distributed training result (8 pcs)
epoch: 1 step: 5, loss is 2.3153763
epoch: 1 step: 5, loss is 2.274118
epoch: 1 step: 5, loss is 2.2663743
epoch: 1 step: 5, loss is 2.324574
epoch: 1 step: 5, loss is 2.253627
epoch: 1 step: 5, loss is 2.2363935
epoch: 1 step: 5, loss is 2.3112013
epoch: 1 step: 5, loss is 2.252127
...
# Evaluation
Usage: bash run_eval.sh [DATASET_PATH] [CONFIG_PATH] [CHECKPOINT_PATH]
[DATASET_PATH] is the path of the dataset.
[CHECKPOINT_PATH] is the path of the trained checkpoint (.ckpt) file.
# Evaluation example
bash run_eval.sh /cifar10 ../config/wideresnet.yaml WideResNet_best.ckpt
The checkpoint can be produced during the training process.
The evaluation result will be stored in the example path, in a folder named "eval". There you can find results like the following in the log.
- Evaluating WideResNet-40-10 with CIFAR-10 dataset
result: {'top_1_accuracy': 0.961738782051282}
python export.py --ckpt_file [CKPT_PATH] --file_format [FILE_FORMAT] --device_id [0]
[CKPT_PATH] is the path of the ckpt file saved after training.
The parameter ckpt_file is required, and file_format must be chosen from ["AIR", "MINDIR"].
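For reference, the core of export.py usually amounts to something like the sketch below. The network constructor name and its arguments are assumptions made for illustration; see src/wide_resnet.py and export.py for the real interface.

```python
# Sketch of a MINDIR/AIR export. The constructor name "wideresnet" is assumed.
import numpy as np
from mindspore import Tensor, context
from mindspore.train.serialization import export, load_checkpoint, load_param_into_net
from src.wide_resnet import wideresnet  # assumed constructor, check the repository

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

net = wideresnet()  # WideResNet-40-10 by default (assumed)
load_param_into_net(net, load_checkpoint("WideResNet_best.ckpt"))
inputs = Tensor(np.zeros([1, 3, 32, 32], np.float32))  # one CIFAR-10-sized image
export(net, inputs, file_name="wideresnet", file_format="MINDIR")
```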
Before performing inference, the MINDIR file must be exported through the export.py script. The following shows an example of using the MINDIR model to perform inference.
# Ascend310 inference
bash run_infer_310.sh [MINDIR_PATH] [DATASET_PATH] [DEVICE_ID]
- MINDIR_PATH: path of the MINDIR file.
- DATASET_PATH: path of the inference dataset.
- DEVICE_ID: optional, the default value is 0.
The inference result is saved in the current path of the script execution. You can view the inference accuracy in acc.log in the current folder and the inference time in time_Result.
Parameters | Ascend 910 |
---|---|
Model Version | WideResNet-40-10 |
Resource | Ascend 910; CPU 2.60GHz, 192 cores; Memory 755 GB |
Uploaded Date | 02/25/2021 (month/day/year) |
MindSpore Version | 1.1.1 |
Dataset | CIFAR-10 |
Training Parameters | epoch=300, steps per epoch=195, batch_size = 32 |
Optimizer | Momentum |
Loss Function | Softmax Cross Entropy |
outputs | probability |
Loss | 0.545541 |
Speed | 65.2 ms/step (8 cards) |
Total time | 70 minutes |
Parameters (M) | 52.1 |
Checkpoint for Fine tuning | 426.49 MB (.ckpt file) |
Scripts | Link |
In dataset.py, we set the seed inside the create_dataset function. We also use a random seed in train.py.
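A minimal sketch of how such seeding is typically done in MindSpore (the seed value is illustrative, not necessarily the one used by this repository):

```python
# Fix the random seeds; the actual seed values are defined in the repository.
import mindspore.dataset as ds
from mindspore.common import set_seed

set_seed(1)            # fixes MindSpore's global seed (weight init, etc.)
ds.config.set_seed(1)  # fixes the dataset pipeline's shuffle/augmentation seed
```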
Please check the official homepage.
Refer to the ModelZoo FAQ for some common questions.
- Q: What should I do if memory overflow occurs when using PYNATIVE_MODE?
  A: Memory overflow usually occurs because PYNATIVE_MODE requires more memory. Set the batch size to 16 to reduce memory consumption and allow network training.