This repo contains the official code and pre-trained models for the Glance and Focus Networks (GFNet).
- (NeurIPS 2020) Glance and Focus: a Dynamic Approach to Reducing Spatial Redundancy in Image Classification
- (T-PAMI 2023) Glance and Focus Networks for Dynamic Visual Recognition
Update on 2020/12/28: released the training code.
Update on 2020/10/08: released the pre-trained models and the inference code on ImageNet.
Inspired by the fact that not all regions in an image are task-relevant, we propose a novel framework that performs efficient image classification by processing a sequence of relatively small inputs, which are strategically cropped from the original image. Experiments on ImageNet show that our method consistently improves the computational efficiency of a wide variety of deep models. For example, it further reduces the average latency of the highly efficient MobileNet-V3 on an iPhone XS Max by 20% without sacrificing accuracy.
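To make the pipeline concrete, below is a minimal, hypothetical sketch of the glance-and-focus idea (not the repository's actual implementation): a low-resolution "glance" of the whole image is classified first, then full-resolution patches proposed by a policy are processed one by one, and inference stops as soon as the prediction is confident enough. The callables `encoder`, `classifier`, and `propose_patch` are stand-ins for GFNet's encoder, classifier, and patch proposal network, and batch size 1 is assumed.

```python
import torch
import torch.nn.functional as F

def glance_and_focus(image, encoder, classifier, propose_patch,
                     patch_size=96, max_steps=5, threshold=0.9):
    # Glance step: a low-resolution view of the whole image.
    x = F.interpolate(image, size=(patch_size, patch_size),
                      mode='bilinear', align_corners=False)
    state, pred = None, None
    for t in range(max_steps):
        feat = encoder(x)                        # shared feature extractor
        logits, state = classifier(feat, state)  # recurrent over the sequence
        conf, pred = torch.softmax(logits, dim=1).max(dim=1)
        if conf.item() >= threshold:             # confident enough: exit early
            return pred, t + 1
        # Focus step: crop the next task-relevant region at full resolution.
        x1, y1 = propose_patch(feat, state)      # top-left corner of the crop
        x = image[:, :, y1:y1 + patch_size, x1:x1 + patch_size]
    return pred, max_steps
```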
@inproceedings{NeurIPS2020_7866,
  title = {Glance and Focus: a Dynamic Approach to Reducing Spatial Redundancy in Image Classification},
  author = {Wang, Yulin and Lv, Kangchen and Huang, Rui and Song, Shiji and Yang, Le and Huang, Gao},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2020},
}

@article{huang2023glance,
  title = {Glance and Focus Networks for Dynamic Visual Recognition},
  author = {Huang, Gao and Wang, Yulin and Lv, Kangchen and Jiang, Haojun and Huang, Wenhui and Qi, Pengfei and Song, Shiji},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year = {2023},
  volume = {45},
  number = {4},
  pages = {4605-4621},
  doi = {10.1109/TPAMI.2022.3196959}
}
- Top-1 accuracy on ImageNet vs. Multiply-Adds
- Top-1 accuracy on ImageNet vs. inference latency (ms) on an iPhone XS Max
- Visualization
Backbone CNNs | Patch Size | T (maximum sequence length) | Links |
---|---|---|---|
ResNet-50 | 96x96 | 5 | Tsinghua Cloud / Google Drive |
ResNet-50 | 128x128 | 5 | Tsinghua Cloud / Google Drive |
DenseNet-121 | 96x96 | 5 | Tsinghua Cloud / Google Drive |
DenseNet-169 | 96x96 | 5 | Tsinghua Cloud / Google Drive |
DenseNet-201 | 96x96 | 5 | Tsinghua Cloud / Google Drive |
RegNet-Y-600MF | 96x96 | 5 | Tsinghua Cloud / Google Drive |
RegNet-Y-800MF | 96x96 | 5 | Tsinghua Cloud / Google Drive |
RegNet-Y-1.6GF | 96x96 | 5 | Tsinghua Cloud / Google Drive |
MobileNet-V3-Large (1.00) | 96x96 | 3 | Tsinghua Cloud / Google Drive |
MobileNet-V3-Large (1.00) | 128x128 | 3 | Tsinghua Cloud / Google Drive |
MobileNet-V3-Large (1.25) | 128x128 | 3 | Tsinghua Cloud / Google Drive |
EfficientNet-B2 | 128x128 | 4 | Tsinghua Cloud / Google Drive |
EfficientNet-B3 | 128x128 | 4 | Tsinghua Cloud / Google Drive |
EfficientNet-B3 | 144x144 | 4 | Tsinghua Cloud / Google Drive |
- What is contained in the checkpoints (see the loading sketch below the tree):
**.pth.tar
├── model_name: name of the backbone CNNs (e.g., resnet50, densenet121)
├── patch_size: size of image patches (i.e., H' or W' in the paper)
├── model_prime_state_dict, model_state_dict, fc, policy: state dictionaries of the four components of GFNets
├── model_flops, policy_flops, fc_flops: Multiply-Adds of one forward pass of the encoder, the patch proposal network, and the classifier, respectively
├── flops: a list containing the Multiply-Adds corresponding to each length of the input sequence during inference
├── anytime_classification: results of anytime prediction (in Top-1 accuracy)
├── dynamic_threshold: the confidence thresholds used in budgeted batch classification
└── budgeted_batch_classification: results of budgeted batch classification (a two-item list; items [0] and [1] are the two coordinate arrays of the resulting curve)
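For reference, here is a small sketch of reading these fields directly; the checkpoint file name below is hypothetical, while the keys are the ones listed above.

```python
import torch

# Inspect a GFNet checkpoint without building any model.
ckpt = torch.load('resnet50_patch96_T5.pth.tar', map_location='cpu')

print(ckpt['model_name'], ckpt['patch_size'])  # backbone and patch size
print(ckpt['flops'])                           # Multiply-Adds per sequence length
print(ckpt['anytime_classification'])          # Top-1 accuracy at each step
print(ckpt['dynamic_threshold'])               # confidence thresholds

# The two coordinate arrays of the budgeted batch classification curve.
xs, ys = ckpt['budgeted_batch_classification']
```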
- Python 3.7.7
- PyTorch 1.3.1
- torchvision 0.4.2
- PyYAML 5.3.1 (for RegNets)
Read the evaluation results saved in the pre-trained models:
CUDA_VISIBLE_DEVICES=0 python inference.py --checkpoint_path PATH_TO_CHECKPOINTS --eval_mode 0
Read the confidence thresholds saved in the pre-trained models and run inference on the validation set:
CUDA_VISIBLE_DEVICES=0 python inference.py --data_url PATH_TO_DATASET --checkpoint_path PATH_TO_CHECKPOINTS --eval_mode 1
Determine the confidence thresholds on the training set and run inference on the validation set:
CUDA_VISIBLE_DEVICES=0 python inference.py --data_url PATH_TO_DATASET --checkpoint_path PATH_TO_CHECKPOINTS --eval_mode 2
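As a rough illustration of what eval_mode 1/2 do with the stored thresholds (a hypothetical sketch, not the script's actual code): the prediction of the earliest step whose confidence clears its threshold is kept, so easier images exit with shorter input sequences.

```python
import torch

def threshold_exit(logits_per_step, thresholds):
    """Confidence-threshold early exit for a single image.

    logits_per_step: list of T logit tensors, one per sequence length.
    thresholds: the `dynamic_threshold` entry of a checkpoint.
    """
    for t, logits in enumerate(logits_per_step):
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        # Exit at the first confident step, or at the last step regardless.
        if conf.item() >= thresholds[t] or t == len(logits_per_step) - 1:
            return pred, t + 1  # prediction and exit step
```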
The dataset is expected to be prepared as follows:
ImageNet
├── train
│ ├── folder 1 (class 1)
│ ├── folder 2 (class 2)
│ ├── ...
├── val
│ ├── folder 1 (class 1)
│ ├── folder 2 (class 2)
│ ├── ...
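This is the standard torchvision ImageFolder layout, so the validation split can be loaded as below; the transforms shown are an assumed standard ImageNet evaluation pipeline, not necessarily the ones used by inference.py.

```python
from torchvision import datasets, transforms

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
# Each class folder under val/ automatically becomes one label.
val_set = datasets.ImageFolder('PATH_TO_DATASET/val', transform=val_transform)
```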
- Here we take training ResNet-50 (96x96, T=5) as an example. All initialization models and stage-1/2 checkpoints can be found in Tsinghua Cloud / Google Drive. Currently, this link includes ResNet and MobileNet-V3; we will update it as soon as possible. If you need further help, feel free to contact us.
- The results in the paper are based on 2 Tesla V100 GPUs. For most experiments, up to 4 Titan Xp GPUs should be sufficient.
Training stage 1. The initializations of the global encoder (model_prime) and the local encoder (model) are required:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --data_url PATH_TO_DATASET --train_stage 1 --model_arch resnet50 --patch_size 96 --T 5 --print_freq 10 --model_prime_path PATH_TO_CHECKPOINTS --model_path PATH_TO_CHECKPOINTS
Training stage 2. A stage-1 checkpoint is required:
CUDA_VISIBLE_DEVICES=0 python train.py --data_url PATH_TO_DATASET --train_stage 2 --model_arch resnet50 --patch_size 96 --T 5 --print_freq 10 --checkpoint_path PATH_TO_CHECKPOINTS
Training stage 3. A stage-2 checkpoint is required:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --data_url PATH_TO_DATASET --train_stage 3 --model_arch resnet50 --patch_size 96 --T 5 --print_freq 10 --checkpoint_path PATH_TO_CHECKPOINTS
If you have any questions, please feel free to contact the authors: Yulin Wang, wang-yl19@mails.tsinghua.edu.cn.
Our code for MobileNet-V3 and EfficientNet is adapted from here, and our code for RegNet is from here.
- Update the code for visualization.
- Update the code for mixed precision training.