
cmcamdy/Train_ViT_Translation_Perceptibility

Train_ViT_Translation_Perceptibility

Contents

  1. Abstract
  2. Motivation
  3. Contribution
  4. Train with translation perceptibility
  5. Main Experiments
  6. References

Abstract

The Vision Transformer (ViT) has become a popular vision model in recent years, replacing traditional Convolutional Neural Network (CNN) models in many settings. However, ViTs tend to require more training data because they lack some of the properties inherent in the CNN architecture. To address this problem, researchers have proposed various methods to improve ViT performance on small datasets. In this paper, we propose a self-supervised auxiliary task that guides ViTs to learn translation perceptibility. This lets the models acquire inductive bias more efficiently from small datasets, without pre-training on large datasets and without modifying the network architecture. We demonstrate the effectiveness of the approach, and of its extension to scale perceptibility, on multiple small datasets, and show that combining it with current state-of-the-art methods further improves performance. Our code is available at: https://github.com/cmcamdy/Train_ViT_Translation_Perceptibility.

Motivation

  • We observe that convolutional layers and fully connected layers both reduce to matrix operations at the underlying level. The difference lies in the special parameter distribution of convolutional layers (local, shared weights), which gives them translation equivariance by construction. Based on this observation, we speculate that fully connected layers can, to some extent, reproduce certain characteristics of convolutional layers, such as translation equivariance, while preserving their own intrinsic features.
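
The observation above can be made concrete with a small sketch. It is illustrative only and is not taken from this repository: it writes a 1-D circular convolution as a matrix product and checks that the shared-weight (circulant) matrix is translation equivariant, while an arbitrary dense matrix of the same size is not. The helper names `circulant`, `matvec`, `shift`, and `close` are hypothetical, and circular (wrap-around) shifts are assumed for simplicity.

```python
# Illustrative sketch (not repository code): convolution as a structured matrix product.

def circulant(kernel, n):
    """Build an n x n matrix whose row i applies `kernel` starting at index i (wrapping around)."""
    rows = []
    for i in range(n):
        row = [0.0] * n
        for k, w in enumerate(kernel):
            row[(i + k) % n] = w
        rows.append(row)
    return rows

def matvec(matrix, x):
    """Plain matrix-vector product, as a fully connected layer without bias would compute."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def shift(x, s):
    """Circularly translate x by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]

def close(a, b, tol=1e-9):
    """Element-wise comparison with a small tolerance."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
n = len(signal)

# Convolutional weights: one small kernel shared across all positions.
conv_weights = circulant([0.25, 0.5, 0.25], n)
conv_equivariant = close(matvec(conv_weights, shift(signal, 2)),
                         shift(matvec(conv_weights, signal), 2))

# Generic dense weights: no sharing, so shifting the input does not simply shift the output.
dense_weights = [[float((3 * i + 5 * j) % 7) for j in range(n)] for i in range(n)]
dense_equivariant = close(matvec(dense_weights, shift(signal, 2)),
                          shift(matvec(dense_weights, signal), 2))

print("conv equivariant:", conv_equivariant)
print("dense equivariant:", dense_equivariant)
```

Both layers run the same `matvec`; only the arrangement of the weights differs. A dense layer could in principle take on the circulant pattern, which is the sense in which equivariance might be learned rather than built in.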

Contribution

  • We propose a simple self-supervised auxiliary task to guide ViTs in learning translation perceptibility without changing the model structure, which helps the model learn inductive bias more efficiently from small datasets, aiding ViTs in achieving better performance.
  • We demonstrate the effectiveness of our approach on multiple small datasets, and further show that it is scalable and can be combined with current state-of-the-art methods to improve their performance.
  • By extending the proposed auxiliary task to guide ViTs in learning scale sensitivity, we present empirical evidence that this method enhances ViTs' perception of hierarchical characteristics.
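
As a rough illustration of what a self-supervised translation task can look like, the sketch below shifts an image by a random offset and uses that offset as a free label for an auxiliary loss added to the classification loss. This is a minimal sketch under stated assumptions, not the implementation in this repository: the circular shift, the L1 offset loss, and every function name are hypothetical. The weight `alpha` is assumed to play a role similar to the `--alpha_trans` flag in the training scripts, which the README does not document.

```python
import random

# Hypothetical sketch of a translation-prediction auxiliary task (not repository code).

def translate(image, dx, dy):
    """Circularly shift a 2-D image (list of rows) by dx columns and dy rows."""
    h, w = len(image), len(image[0])
    return [[image[(r - dy) % h][(c - dx) % w] for c in range(w)] for r in range(h)]

def make_aux_sample(image, max_shift, rng):
    """Return a translated copy and the (dx, dy) offset, which acts as a label obtained for free."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return translate(image, dx, dy), (dx, dy)

def l1_offset_loss(predicted, target):
    """Mean absolute error between predicted and true offsets (assumed loss form)."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def total_loss(cls_loss, aux_loss, alpha):
    """Classification loss plus a weighted auxiliary term (assumed combination)."""
    return cls_loss + alpha * aux_loss

rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]
shifted, offset = make_aux_sample(image, max_shift=2, rng=rng)

# Stand-in for a model's offset prediction; a real model would regress this from `shifted`.
predicted_offset = (0.0, 0.0)
loss = total_loss(cls_loss=1.0,
                  aux_loss=l1_offset_loss(predicted_offset, offset),
                  alpha=0.4)
print("offset label:", offset, "combined loss:", loss)
```

The network architecture is untouched in this setup: the only additions are the label-free offset target and one extra loss term, consistent with the claim that no structural changes are required.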

Train with translation perceptibility

cd vits-for-small-scale-datasets
root_path=/YOUR/PROJECT/ROOT/PATH
datapath=${root_path}/datasets/tiny-imagenet-200
output_dir=${root_path}/outputs/ssloff/vit_base_imagenet-200
python finetune_affine.py --arch vit  \
        --dataset Tiny-Imagenet \
        --datapath ${datapath} \
        --batch_size 128 \
        --epochs 300 \
        --output_dir ${output_dir} \
        --tag vfsd_with_trans \
        --ls --lr 0.001 \
        --alpha_trans 0.4 \
        --init_weight 1.2 \
        --with_trans

Integration with other approaches

SPT_LSA_ViT(SL-ViT)

  • script path: Train_ViT_Translation_Perceptibility/SPT_LSA_ViT/script/
root_path=/YOUR/PROJECT/ROOT/PATH
data_path=${root_path}/datasets/tiny-imagenet-200
dataset=T-IMNET
batch_size=128
save_path=${root_path}/outputs/spt_lsa
tag=vit-spt-lsa-open-img100-affine_trans-vit
python main_affine.py --with_trans --model vit --data_path ${data_path} --dataset ${dataset} --batch_size ${batch_size} --save_path ${save_path} --tag ${tag} --is_LSA --is_SPT

vits-for-small-scale-datasets(vfsd)

  • script path: Train_ViT_Translation_Perceptibility/vits-for-small-scale-datasets/script/train_ssl_imagenet-200_vit_vfsd.sh
root_path=/YOUR/PROJECT/ROOT/PATH
datapath=${root_path}/datasets/tiny-imagenet-200
pretrained_weights=${root_path}/outputs/ssloff/vit_base_imagenet-200/checkpoint.pth
output_dir=${root_path}/outputs/ssloff/vit_base_imagenet-200
python finetune_affine.py --arch vit  \
        --dataset Tiny-Imagenet \
        --datapath ${datapath} \
        --batch_size 128 \
        --epochs 300 \
        --output_dir ${output_dir} \
        --pretrained_weights ${pretrained_weights} \
        --tag vfsd_with_trans \
        --ls --lr 0.001 \
        --alpha_trans 0.4 \
        --init_weight 1.2 \
        --with_trans

VTs-Drloc(Drloc)

  • script path: Train_ViT_Translation_Perceptibility/VTs-Drloc/main_affine.sh
DATA_DIR=/YOUR/DATA/DIR
IMG_SIZE=32 # 224, 384
MODE=vitcs # swintiny, cvt13, t2t, resnet50, vit
CONFIG=vitcs_base_16 # swin_tiny_patch4_window7, cvt_13, t2tvit_14, resnet_50, vit_base_16
LAMBDA_DRLOC=0.5 # swin: 0.5, t2t: 0.1, cvt: 0.1
DRLOC_MODE=l1 # l1, ce, cbr
BATCH_SIZE=128
# DATASET=imagenet-100 # imagenet-100, imagenet, cifar-10, cifar-100, svhn, places365, flowers102, clipart, infograph, painting, quickdraw, real, sketch
DATASET=cifar-100
NUM_CLASSES=100

DISK_DATA=${DATA_DIR}/datasets/${DATASET}
TARGET_FOLDER=${DATASET}-${MODE}-sz${IMG_SIZE}-drloc${LAMBDA_DRLOC}-bs${BATCH_SIZE}-choose_trans
SAVE_DIR=${DATA_DIR}/outputs/SPTv3/visiontransformer-expr/${TARGET_FOLDER}

python3 -m torch.distributed.launch  --master_port 65001 \
    main_affine.py \
    --cfg ./configs/${CONFIG}_${IMG_SIZE}.yaml \
    --dataset ${DATASET} \
    --num_classes ${NUM_CLASSES} \
    --data-path ${DISK_DATA} \
    --batch-size ${BATCH_SIZE} \
    --output ${SAVE_DIR} \
    --lambda_drloc ${LAMBDA_DRLOC} --drloc_mode ${DRLOC_MODE} --use_drloc  --use_abs --with_trans

Main Experiments

References
