
Salient Mask-Guided Vision Transformer for Fine-Grained Classification

Official repository for the paper "Salient Mask-Guided Vision Transformer for Fine-Grained Classification",
accepted as a Full Paper to VISAPP '23 (part of VISIGRAPP '23).

Dmitry Demidov, Muhammad Hamza Sharif, Aliakbar Abdurahimov, Hisham Cholakkal, Fahad Shahbaz Khan

Approach

Figure: Main architecture and attention guiding (see Eq. 3); the blue, green, and red bars denote attention to salient patches.

In this work, we introduce a simple yet effective approach to improve the performance of the standard Vision Transformer architecture at FGVC. Our method, named Salient Mask-Guided Vision Transformer (SM-ViT), utilises a salient object detection module comprising an off-the-shelf saliency detector to produce a salient mask that is likely to focus on the potentially discriminative foreground object regions in an image. The saliency mask is then utilised within our ViT-like Salient Mask-Guided Encoder (SMGE) to boost the discriminability of the standard self-attention mechanism, thereby focusing on more distinguishable tokens.

Abstract: Fine-grained visual classification (FGVC) is a challenging computer vision problem, where the task is to automatically recognise objects from subordinate categories. One of its main difficulties is capturing the most discriminative inter-class variances among visually similar classes. Recently, methods with Vision Transformer (ViT) have demonstrated noticeable achievements in FGVC, generally by employing the self-attention mechanism with additional resource-consuming techniques to distinguish potentially discriminative regions while disregarding the rest. However, such approaches may struggle to effectively focus on truly discriminative regions due to only relying on the inherent self-attention mechanism, resulting in the classification token likely aggregating global information from less-important background patches. Moreover, due to the immense lack of datapoints, classifiers may fail to find the most helpful inter-class distinguishing features, since other unrelated but distinctive background regions may be falsely recognised as being valuable. To this end, we introduce a simple yet effective Salient Mask-Guided Vision Transformer (SM-ViT), where the discriminability of the standard ViT's attention maps is boosted through salient masking of potentially discriminative foreground regions. Extensive experiments demonstrate that with the standard training procedure our SM-ViT achieves state-of-the-art performance on popular FGVC benchmarks among existing ViT-based approaches while requiring fewer resources and lower input image resolution.
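As a rough illustration of the mask-guided attention idea (not the repository's exact implementation of Eq. 3), the sketch below up-weights the attention that the classification token pays to patches covered by the salient mask and then re-normalises the corresponding row; the function name `guide_attention` and the `boost` parameter are hypothetical.

```python
import torch


def guide_attention(attn, salient_mask, boost=1.0):
    """Illustrative sketch of salient mask-guided attention boosting.

    attn:         (B, heads, N, N) softmax-ed self-attention weights,
                  with token 0 being the classification (CLS) token.
    salient_mask: (B, N-1) binary float mask, 1 for patches overlapping
                  the salient foreground, 0 for background patches.
    boost:        assumed scalar strength of the guidance.
    """
    guided = attn.clone()
    # Increase the attention the CLS token pays to salient patches ...
    guided[:, :, 0, 1:] = guided[:, :, 0, 1:] + boost * salient_mask[:, None, :]
    # ... and re-normalise the CLS row so it remains a distribution.
    guided[:, :, 0, :] = guided[:, :, 0, :] / guided[:, :, 0, :].sum(dim=-1, keepdim=True)
    return guided
```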

Main Contributions

  1. We introduce a simple yet effective approach to improve the performance of the standard Vision Transformer architecture at FGVC.
  2. To the best of our knowledge, we are the first to explore the effective utilisation of saliency masks in order to extract more distinguishable information within the ViT encoder layers by boosting the discriminability of self-attention features for the FGVC task.
  3. Our extensive experiments on three popular FGVC datasets (Stanford Dogs, CUB, and NABirds) demonstrate that with the standard training procedure the proposed SM-ViT achieves state-of-the-art performance.
  4. An important advantage of our solution is its integrability: it can be fine-tuned on top of a ViT-based backbone or integrated into a Transformer-like architecture that leverages the standard self-attention mechanism (see the sketch after this list).
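As a sketch of what such an integration could look like (illustrative only; the repository's SMGE differs in its details), the guiding step from the snippet above can be dropped into an otherwise standard multi-head self-attention layer:

```python
import torch.nn as nn


class MaskGuidedSelfAttention(nn.Module):
    """Hypothetical drop-in self-attention layer with salient-mask guiding."""

    def __init__(self, dim=768, num_heads=12, boost=1.0):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.boost = boost

    def forward(self, x, salient_mask):
        # x: (B, N, dim) with token 0 as CLS; salient_mask: (B, N-1) in {0, 1}.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = guide_attention(attn, salient_mask, self.boost)  # sketch from above
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Replacing the attention module of each encoder layer with such a block, and feeding the (patch-resized) saliency mask alongside the tokens, is one way to realise the integrability mentioned in point 4.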

🐘 Model Zoo

All models in our experiments are first initialised with the publicly available pre-trained ViT-B/16 weights and then fine-tuned on the corresponding datasets.
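The repository's own weight preparation follows INSTALL.md; purely as an unofficial illustration (the library, model name, and checkpoint variant here are assumptions, not the repository's code), a pre-trained ViT-B/16 backbone with a dataset-sized classification head can be created with timm:

```python
import timm

# Unofficial illustration: create a ViT-B/16 backbone with publicly available
# pre-trained weights and a classification head sized for CUB-200 (200 classes).
# The exact pre-training checkpoint used in the paper may differ.
model = timm.create_model(
    "vit_base_patch16_224",  # standard ViT-B/16
    pretrained=True,
    num_classes=200,
)
```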

Main Models

| Model | Baseline | Input Size | St. Dogs acc. (%) | Weights | CUB-200 acc. (%) | Weights | NABirds acc. (%) | Weights |
|---|---|---|---|---|---|---|---|---|
| Vanilla ViT | ViT-B/16 | 448x448 | 91.4 | - | 90.6 | - | 89.6 | - |
| SM-ViT (ours) | ViT-B/16 | 400x400 | 92.3 | link | 91.6 | link | 90.2 | link |
| SM-ViT (ours) | ViT-B/16 | 448x448 | - | - | - | - | 90.5 | link |
| SM-ViT (ours) | ViT-B/16 | 560x560 | - | - | - | - | 90.7 | link |

Experimental Models (not included in the paper)

| Model | Input Size | St. Dogs acc. (%) | Weights | CUB-200 acc. (%) | Weights | NABirds acc. (%) | Weights |
|---|---|---|---|---|---|---|---|
| SM-ViT + Advanced guiding | 400x400 | - | - | 91.7 | link | 90.7 | link |

🧋 How to start

Installation

For environment setup and pre-trained model preparation, please follow the instructions in INSTALL.md.

Data preparation

For dataset preparation, please follow the instructions in DATASET.md.

Training and Evaluation

For training and evaluation, please follow the instructions in RUN.md.


🆕 News

  • (Dec 20, 2022)

    • Repo description added (README.md).
  • (Dec 30, 2022)

    • Pretrained models are released.
    • Code instructions added (INSTALL.md, DATASET.md, RUN.md).
  • (Jan 09, 2023)

    • Training and evaluation code is released.
  • (Soon)

    • Optimisation

🖋️ Credits

Citation

If you would like to use or refer to our approach (source code, trained models, or results) in your research, please consider citing:

@conference{demidov2022smvit,
    author={Dmitry Demidov and Muhammad Hamza Sharif and Aliakbar Abdurahimov and Hisham Cholakkal and Fahad Shahbaz Khan},
    title={Salient Mask-Guided Vision Transformer for Fine-Grained Classification},
    booktitle={Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP},
    year={2023},
    pages={27-38},
    publisher={SciTePress},
    organization={INSTICC},
    doi={10.5220/0011611100003417},
    isbn={978-989-758-634-7},
    issn={2184-4321},
}

Contacts

If you have a question or suggestion, please create an issue or contact us at dmitry.demidov@mbzuai.ac.ae.

Acknowledgements

Our code is partially based on the ViT-pytorch, U2-Net, and FFVT repositories, and we thank the corresponding authors for releasing their code. If you use our derived code, please consider giving credit to these works as well.
