This repository packages the MobileViTv3 model for use in plain PyTorch. It builds on the CVNets library and the official MobileViTv3 repository (code).
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features [arXiv]
I recommend using Python 3.8+ and PyTorch >= 1.8.0.
```bash
# Clone the repo
git clone https://github.com/evalyev/MobileViTv3-PyTorch.git
cd MobileViTv3-PyTorch

# Install requirements
pip install -r requirements.txt
```
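To verify that the environment matches the recommended versions, a quick check (a minimal sketch; it only prints what is installed):

```python
import sys
import torch

# This repository recommends Python 3.8+ and PyTorch >= 1.8.0
print(sys.version)
print(torch.__version__)
print(torch.cuda.is_available())  # True if a CUDA device is usable
```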
Download the trained MobileViTv3 models from here, then export the model structure as a `.pt` file:
```bash
# Export the model structure as model_structure.pt
cd MobileViTv3-v1
# Pass the config path of the model you downloaded
python save_model.py --common.config-file ../models/MobileViTv3-v1/results_classification/mobilevitv3_S_e300_7930/config.yaml
```
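Conceptually, `save_model.py` builds the CVNets model from the config file and serializes the whole module. The sketch below shows the idea; the CVNets entry points (`get_model`, `get_training_arguments`) are assumptions based on the upstream library, not verified against this script:

```python
# Hypothetical sketch of what save_model.py does
import torch
from cvnets import get_model                     # assumed CVNets model builder
from options.opts import get_training_arguments  # assumed CVNets argument parser

opts = get_training_arguments()          # parses --common.config-file, etc.
model = get_model(opts)                  # instantiate MobileViTv3 from the config
torch.save(model, 'model_structure.pt')  # serialize the full nn.Module
```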
Load the model with pretrained weights in PyTorch:
```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Checkpoint with the pretrained (EMA) weights
model_path = '../models/MobileViTv3-v1/results_classification/mobilevitv3_S_e300_7930/checkpoint_ema_best.pt'

# Load the model structure produced by save_model.py
model = torch.load('model_structure.pt', map_location=device)

# Load the pretrained weights into the model
model_weights = torch.load(model_path, map_location=device)
model.load_state_dict(model_weights)
model.eval()

output = model(image)  # image: float tensor of shape (N, 3, H, W)
```
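For a complete forward pass, the input image needs preprocessing. Below is a minimal example using torchvision, reusing `model` and `device` from above; the 256-resize/224-crop pipeline and the normalization statistics are assumptions based on common ImageNet practice, so check the model's `config.yaml` for the values actually used in training:

```python
from PIL import Image
import torch
from torchvision import transforms

# Assumed ImageNet-style preprocessing; verify against config.yaml
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open('example.jpg').convert('RGB')   # hypothetical input file
image = preprocess(img).unsqueeze(0).to(device)  # (1, 3, 224, 224)

with torch.no_grad():
    logits = model(image)
print(logits.argmax(dim=1).item())  # predicted class index
```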
To be supplemented
Download the trained MobileViTv3 models from here. The `checkpoint_ema_best.pt` file inside each model folder was used to generate the reported accuracies.
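Since the checkpoint is a plain state dict (it is passed directly to `load_state_dict` in the loading example above), its contents can be inspected before loading:

```python
import torch

# Inspect a downloaded checkpoint: print the first few
# parameter names and shapes from the state dict
ckpt = torch.load('checkpoint_ema_best.pt', map_location='cpu')
for name, tensor in list(ckpt.items())[:5]:
    print(name, tuple(tensor.shape))
```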
Low-latency models are built by reducing the number of MobileViTv3 blocks in 'layer4' from 4 to 2.
Please refer to the paper for more details.
Note that for the segmentation and detection models, only the backbone architecture parameters are listed.
**Classification (ImageNet-1K):**

| Model name | Accuracy (%) | Parameters (Million) | FLOPs (Million) | Folder name |
|---|---|---|---|---|
| MobileViTv3-S | 79.3 | 5.8 | 1841 | mobilevitv3_S_e300_7930 |
| MobileViTv3-XS | 76.7 | 2.5 | 927 | mobilevitv3_XS_e300_7671 |
| MobileViTv3-XXS | 70.98 | 1.2 | 289 | mobilevitv3_XXS_e300_7098 |
| MobileViTv3-1.0 | 78.64 | 5.1 | 1876 | mobilevitv3_1_0_0 |
| MobileViTv3-0.75 | 76.55 | 3.0 | 1064 | mobilevitv3_0_7_5 |
| MobileViTv3-0.5 | 72.33 | 1.4 | 481 | mobilevitv3_0_5_0 |
**Classification (ImageNet-1K), low-latency variants:**

| Model name | Accuracy (%) | Parameters (Million) | FLOPs (Million) | Folder name |
|---|---|---|---|---|
| MobileViTv3-S-L2 | 79.06 | 5.2 | 1651 | mobilevitv3_S_L2_e300_7906 |
| MobileViTv3-XS-L2 | 76.10 | 2.3 | 853 | mobilevitv3_XS_L2_e300_7610 |
| MobileViTv3-XXS-L2 | 70.23 | 1.1 | 256 | mobilevitv3_XXS_L2_e300_7023 |
**Segmentation (PASCAL VOC 2012):**

| Model name | mIoU | Parameters (Million) | Folder name |
|---|---|---|---|
| MobileViTv3-S | 79.59 | 7.2 | mobilevitv3_S_voc_e50_7959 |
| MobileViTv3-XS | 78.77 | 3.3 | mobilevitv3_XS_voc_e50_7877 |
| MobileViTv3-XXS | 74.04 | 2.0 | mobilevitv3_XXS_voc_e50_7404 |
| MobileViTv3-1.0 | 80.04 | 13.6 | mobilevitv3_voc_1_0_0 |
| MobileViTv3-0.5 | 76.48 | 6.3 | mobilevitv3_voc_0_5_0 |
**Segmentation (ADE20K):**

| Model name | mIoU | Parameters (Million) | Folder name |
|---|---|---|---|
| MobileViTv3-1.0 | 39.13 | 13.6 | mobilevitv3_ade20k_1_0_0 |
| MobileViTv3-0.75 | 36.43 | 9.7 | mobilevitv3_ade20k_0_7_5 |
| MobileViTv3-0.5 | 33.57 | 6.4 | mobilevitv3_ade20k_0_5_0 |
**Detection (MS-COCO):**

| Model name | mAP | Parameters (Million) | Folder name |
|---|---|---|---|
| MobileViTv3-S | 27.3 | 5.5 | mobilevitv3_S_coco_e200_2730 |
| MobileViTv3-XS | 25.6 | 2.7 | mobilevitv3_XS_coco_e200_2560 |
| MobileViTv3-XXS | 19.3 | 1.5 | mobilevitv3_XXS_coco_e200_1930 |
| MobileViTv3-1.0 | 27.0 | 5.8 | mobilevitv3_coco_1_0_0 |
| MobileViTv3-0.75 | 25.0 | 3.7 | mobilevitv3_coco_0_7_5 |
| MobileViTv3-0.5 | 21.8 | 2.0 | mobilevitv3_coco_0_5_0 |
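To sanity-check the parameter counts in the tables against a loaded model (a minimal sketch, reusing `model` from the loading example above):

```python
# Report the trainable parameter count in millions
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'{num_params / 1e6:.1f}M parameters')
```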
If you find this repository useful, please consider giving it a star ⭐ and a citation 📣:
```bibtex
@inproceedings{wadekar2022mobilevitv3,
  title  = {MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features},
  author = {Wadekar, Shakti N. and Chaurasia, Abhishek},
  doi    = {10.48550/ARXIV.2209.15159},
  year   = {2022}
}
```