
Salient Mask-Guided Vision Transformer for Fine-Grained Classification

Official repository for the paper "Salient Mask-Guided Vision Transformer for Fine-Grained Classification",
accepted as a Full Paper to VISAPP '23 (part of VISIGRAPP '23).

Dmitry Demidov, Muhammad Hamza Sharif, Aliakbar Abdurahimov, Hisham Cholakkal, Fahad Shahbaz Khan

Approach

Figure: Main architecture and attention guiding (see Eq. 3); the blue, green, and red bars denote attention to salient patches.

In this work, we introduce a simple yet effective approach to improve the performance of the standard Vision Transformer architecture at FGVC. Our method, named Salient Mask-Guided Vision Transformer (SM-ViT), utilises a salient object detection module comprising an off-the-shelf saliency detector to produce a salient mask that is likely to focus on the potentially discriminative foreground object regions in an image. The saliency mask is then utilised within our ViT-like Salient Mask-Guided Encoder (SMGE) to boost the discriminability of the standard self-attention mechanism, thereby focusing on more distinguishable tokens.

Abstract: Fine-grained visual classification (FGVC) is a challenging computer vision problem, where the task is to automatically recognise objects from subordinate categories. One of its main difficulties is capturing the most discriminative inter-class variances among visually similar classes. Recently, methods with Vision Transformer (ViT) have demonstrated noticeable achievements in FGVC, generally by employing the self-attention mechanism with additional resource-consuming techniques to distinguish potentially discriminative regions while disregarding the rest. However, such approaches may struggle to effectively focus on truly discriminative regions due to only relying on the inherent self-attention mechanism, resulting in the classification token likely aggregating global information from less-important background patches. Moreover, due to the immense lack of datapoints, classifiers may fail to find the most helpful inter-class distinguishing features, since other unrelated but distinctive background regions may be falsely recognised as being valuable. To this end, we introduce a simple yet effective Salient Mask-Guided Vision Transformer (SM-ViT), where the discriminability of the standard ViT's attention maps is boosted through salient masking of potentially discriminative foreground regions. Extensive experiments demonstrate that with the standard training procedure our SM-ViT achieves state-of-the-art performance on popular FGVC benchmarks among existing ViT-based approaches while requiring fewer resources and lower input image resolution.
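As a rough illustration of the mask-guided attention idea (not the repository's exact implementation of Eq. 3), the sketch below up-weights the attention that the classification token pays to patches covered by the salient mask and then re-normalises the corresponding row; the function name `guide_attention` and the `boost` parameter are hypothetical.

```python
import torch


def guide_attention(attn, salient_mask, boost=1.0):
    """Illustrative sketch of salient mask-guided attention boosting.

    attn:         (B, heads, N, N) softmax-ed self-attention weights,
                  with token 0 being the classification (CLS) token.
    salient_mask: (B, N-1) binary float mask, 1 for patches overlapping
                  the salient foreground, 0 for background patches.
    boost:        assumed scalar strength of the guidance.
    """
    guided = attn.clone()
    # Increase the attention the CLS token pays to salient patches ...
    guided[:, :, 0, 1:] = guided[:, :, 0, 1:] + boost * salient_mask[:, None, :]
    # ... and re-normalise the CLS row so it remains a distribution.
    guided[:, :, 0, :] = guided[:, :, 0, :] / guided[:, :, 0, :].sum(dim=-1, keepdim=True)
    return guided
```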

Main Contributions

  1. We introduce a simple yet effective approach to improve the performance of the standard Vision Transformer architecture at FGVC.
  2. To the best of our knowledge, we are the first to explore the effective utilisation of saliency masks in order to extract more distinguishable information within the ViT encoder layers by boosting the discriminability of self-attention features for the FGVC task.
  3. Our extensive experiments on three popular FGVC datasets (Stanford Dogs, CUB, and NABirds) demonstrate that with the standard training procedure the proposed SM-ViT achieves state-of-the-art performance.
  4. An important advantage of our solution is its integrability: it can be fine-tuned on top of a ViT-based backbone or integrated into a Transformer-like architecture that leverages the standard self-attention mechanism (see the sketch after this list).
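As a sketch of what such an integration could look like (illustrative only; the repository's SMGE differs in its details), the guiding step from the snippet above can be dropped into an otherwise standard multi-head self-attention layer:

```python
import torch.nn as nn


class MaskGuidedSelfAttention(nn.Module):
    """Hypothetical drop-in self-attention layer with salient-mask guiding."""

    def __init__(self, dim=768, num_heads=12, boost=1.0):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.boost = boost

    def forward(self, x, salient_mask):
        # x: (B, N, dim) with token 0 as CLS; salient_mask: (B, N-1) in {0, 1}.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = guide_attention(attn, salient_mask, self.boost)  # sketch from above
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Replacing the attention module of each encoder layer with such a block, and feeding the (patch-resized) saliency mask alongside the tokens, is one way to realise the integrability mentioned in point 4.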

🐘 Model Zoo

All models in our experiments are first initialised with the publicly available pre-trained ViT-B/16 weights and then fine-tuned on the corresponding datasets.
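The repository's own weight preparation follows INSTALL.md; purely as an unofficial illustration (the library, model name, and checkpoint variant here are assumptions, not the repository's code), a pre-trained ViT-B/16 backbone with a dataset-sized classification head can be created with timm:

```python
import timm

# Unofficial illustration: create a ViT-B/16 backbone with publicly available
# pre-trained weights and a classification head sized for CUB-200 (200 classes).
# The exact pre-training checkpoint used in the paper may differ.
model = timm.create_model(
    "vit_base_patch16_224",  # standard ViT-B/16
    pretrained=True,
    num_classes=200,
)
```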

Main Models

| Model | Baseline | Input Size | St. Dogs acc. (%) | Weights | CUB-200 acc. (%) | Weights | NABirds acc. (%) | Weights |
|---|---|---|---|---|---|---|---|---|
| Vanilla ViT | ViT-B/16 | 448x448 | 91.4 | - | 90.6 | - | 89.6 | - |
| SM-ViT (ours) | ViT-B/16 | 400x400 | 92.3 | link | 91.6 | link | 90.2 | link |
| SM-ViT (ours) | ViT-B/16 | 448x448 | - | - | - | - | 90.5 | link |
| SM-ViT (ours) | ViT-B/16 | 560x560 | - | - | - | - | 90.7 | link |

Experimental Models (not included in the paper)

| Model | Input Size | St. Dogs acc. (%) | Weights | CUB-200 acc. (%) | Weights | NABirds acc. (%) | Weights |
|---|---|---|---|---|---|---|---|
| SM-ViT + Advanced guiding | 400x400 | - | - | 91.7 | link | 90.7 | link |

🧋 How to start

Installation

For environment setup and pre-trained model preparation, please follow the instructions in INSTALL.md.

Data preparation

For dataset preparation, please follow the instructions in DATASET.md.

Training and Evaluation

For training and evaluation, please follow the instructions in RUN.md.


🆕 News

  • (Dec 20, 2022)

    • Repo description added (README.md).
  • (Dec 30, 2022)

    • Pretrained models are released.
    • Code instructions added (INSTALL.md, DATASET.md, RUN.md).
  • (Jan 09, 2023)

    • Training and evaluation code is released.
  • (Soon)

    • Optimisation

🖋️ Credits

Citation

If you would like to use or refer to our approach (source code, trained models, or results) in your research, please consider citing:

@conference{demidov2022smvit,
    author={Dmitry Demidov and Muhammad Hamza Sharif and Aliakbar Abdurahimov and Hisham Cholakkal and Fahad Shahbaz Khan},
    title={Salient Mask-Guided Vision Transformer for Fine-Grained Classification},
    booktitle={Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP},
    year={2023},
    pages={27-38},
    publisher={SciTePress},
    organization={INSTICC},
    doi={10.5220/0011611100003417},
    isbn={978-989-758-634-7},
    issn={2184-4321},
}

Contacts

If you have a question or suggestion, please create an issue or contact us at dmitry.demidov@mbzuai.ac.ae.

Acknowledgements

Our code is partially based on the ViT-pytorch, U2-Net, and FFVT repositories, and we thank the corresponding authors for releasing their code. If you use our derived code, please consider giving credit to these works as well.
