
Fine-tuning details #45

Closed
nakashima-kodai opened this issue Jan 14, 2021 · 14 comments
Labels
question Further information is requested

Comments

nakashima-kodai commented Jan 14, 2021

Hi,

I am trying to replicate the results of the paper that have been fine-tuned to datasets such as CIFAR-10 and Stanford Cars. Could you give details about hyper-parameters used (like batch size, learning rate etc.)

Thanks.

@haoweiz23

Following this issue. I have the same question.

@Christine620

I have the same question too

@TouvronHugo
Contributor

Hi @nakashima-kodai, @Zhu-haow and @Christine620,

Thanks for your question. For CIFAR-10 and Cars we use:

  • Image size: 224 or 384 (to keep things simple, we don't change the patch size.)
  • Batch size: 768
  • lr: 0.01
  • optimizer: SGD
  • weight-decay: 1e-4
  • epochs: 1000

We remove random erasing and stochastic depth. All other elements are the same as for training on ImageNet.
You can also use AdamW for fine-tuning; in that case just take a smaller lr and keep the weight decay used for ImageNet training.
Do not hesitate if you have other questions.

Best,

Hugo
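
For readers trying to reproduce this recipe, here is a minimal PyTorch sketch of the setup described above (SGD, lr 0.01, weight decay 1e-4, 224 input, CIFAR images upsampled). It is not the authors' script: the timm model name, the momentum value, and the augmentation choices are assumptions, and the learning-rate schedule and 1000-epoch outer loop are omitted.

```python
# Minimal sketch of the recipe quoted above; not the authors' exact code.
# Assumptions: timm and torchvision are installed, momentum=0.9, bicubic resize,
# and only a basic flip augmentation (the full DeiT augmentation is not reproduced).
import timm
import torch
from torchvision import datasets, transforms

# ImageNet-pretrained DeiT with the head replaced for the 10 CIFAR classes.
model = timm.create_model("deit_base_patch16_224", pretrained=True, num_classes=10).cuda()

# CIFAR images are upsampled from 32x32 to the fine-tuning resolution (224 here).
train_tf = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=768, shuffle=True, num_workers=8)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# One pass over the data; the real recipe repeats this for 1000 epochs with a schedule.
model.train()
for images, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(images.cuda()), targets.cuda())
    loss.backward()
    optimizer.step()
```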

@TouvronHugo added the question label on Jan 22, 2021
@nicolas-dufour

Hi @TouvronHugo
I was wondering if any particular procedure is needed to fine-tune the distilled model?
Thank you for your help!

@TouvronHugo
Contributor

Hi @nicolas-dufour,
It is better to keep the distillation signal during fine-tuning; otherwise there is nothing special to do.
Do not hesitate if you have other questions.
Best,
Hugo
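
For context, here is a minimal sketch of what keeping the distillation signal can look like in the DeiT-style hard-distillation setup, where the model has a class head and a distillation head. The function name and signature are illustrative; the repository's losses.py is the reference implementation.

```python
# Sketch of DeiT-style hard distillation during fine-tuning: the class-token head
# is trained on the true labels and the distillation-token head on the teacher's
# hard predictions. Illustrative only.
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits: torch.Tensor,
                           dist_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           targets: torch.Tensor) -> torch.Tensor:
    teacher_labels = teacher_logits.argmax(dim=1)              # hard teacher labels
    loss_cls = F.cross_entropy(cls_logits, targets)            # class token vs. ground truth
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)   # distillation token vs. teacher
    return 0.5 * loss_cls + 0.5 * loss_dist
```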

@claverru

Hello @TouvronHugo, first of all congratulations on your great work, and thanks for replying here. I have two questions regarding this topic:

  • How do we keep the distillation signal when fine-tuning on a different dataset with different classes? Am I supposed to have another conv teacher model trained on the same task beforehand? If that is not needed, what signal is supposed to be OK, the same as in the standard head? Maybe something softer?
  • Is there any way to fine-tune a pretrained model at a higher image size? E.g., I take the 384 base model and fine-tune with images of size 512. I've read in the main.py script that you interpolate the position embedding, but even after reading your paper it isn't clear to me what you do with that. Can I interpolate the position embedding to fine-tune at a higher size for a different task?

@forjiuzhou

> Image size: 224 or 384 (to keep things simple, we don't change the patch size.)

Hi, a question about fine-tuning on CIFAR: how can one train on CIFAR with a 224 or 384 image size? What does a 224 or 384 image size mean here?

@TouvronHugo
Contributor

Hi @claverru,
Thanks for your questions,

  1. For transfer learning, I think that fine-tuning with a teacher that has itself been fine-tuned on the target dataset is best. Nevertheless, not using the distillation signal and doing a classic fine-tuning also works (we did that in DeiT).

  2. You can fine-tune the model at any resolution simply by interpolating the position embedding and then fine-tuning the network.

I hope I have answered your questions,
Best,
Hugo
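
For point 2, the interpolation Hugo mentions is the resize of the patch position embeddings that main.py applies when a checkpoint is loaded at a new resolution. Below is a self-contained sketch of the idea; the helper name and signature are illustrative, not the repository's exact code.

```python
# Sketch of ViT/DeiT position-embedding interpolation for a new input resolution.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor,
                          new_grid_size: int,
                          num_extra_tokens: int = 1) -> torch.Tensor:
    """Resize a position embedding to a new patch grid.

    pos_embed:        (1, num_extra_tokens + old_grid**2, dim) tensor from the checkpoint.
    new_grid_size:    patches per side at the new resolution, e.g. 512 // 16 = 32.
    num_extra_tokens: 1 for the class token, 2 if a distillation token is present.
    """
    dim = pos_embed.shape[-1]
    extra = pos_embed[:, :num_extra_tokens]      # class (and distillation) tokens, kept as-is
    grid = pos_embed[:, num_extra_tokens:]       # patch position embeddings
    old_grid_size = int(grid.shape[1] ** 0.5)
    grid = grid.reshape(1, old_grid_size, old_grid_size, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid_size, new_grid_size),
                         mode="bicubic", align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid_size * new_grid_size, dim)
    return torch.cat([extra, grid], dim=1)
```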

@TouvronHugo
Contributor

Hi @forjiuzhou,

A 224 or 384 image size means using images with a resolution of 224x224 or 384x384 pixels. On CIFAR it is necessary to interpolate (upsample) the original images, which are 32x32.

Best,
Hugo
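
Concretely, this just means each 32x32 CIFAR image is upsampled before entering the network. An illustrative torchvision snippet follows; this is roughly what the repository's data pipeline does via the --input-size argument, not its exact code.

```python
# A 32x32 CIFAR-sized image upsampled (bicubic) to the 224x224 fine-tuning resolution.
from PIL import Image
from torchvision import transforms

resize_to_finetune_res = transforms.Resize(
    224, interpolation=transforms.InterpolationMode.BICUBIC)

img = Image.new("RGB", (32, 32))              # stand-in for a CIFAR-10 sample
print(resize_to_finetune_res(img).size)       # -> (224, 224)
```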

@zizhaozhang

@TouvronHugo Hi

I tried this script, could you help me verify it?

I found that a batch size of 768 will OOM even with deit_small; I had to decrease it to a batch size of 256.

```bash
MODEL=deit_small_distilled_patch16_224
FT='https://dl.fbaipublicfiles.com/deit/deit_small_distilled_patch16_224-649709d9.pth'
IS=224

LR=0.01
WD=1e-4
EPO=1000

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --master_port=10000 --nproc_per_node=4 --use_env main.py \
    --model ${MODEL} --batch-size 768 \
    --data-path ${SAVEPATH} --output_dir ${CKPTPATH} --finetune ${FT} \
    --data-set 'CIFAR' --lr ${LR} --weight-decay ${WD} --epochs ${EPO} \
    --opt 'sgd' --input-size=${IS} --num_workers=4
```

@haoweiz23

@TouvronHugo Hi,
I wonder whether a smaller batch size needs a smaller lr and min-lr?
Is your baseline batch size 768?

Will fine-tuning for more epochs (more than 30) give higher accuracy?

Does a bigger model (deit-large) need more fine-tuning epochs or a bigger learning rate?

@TouvronHugo
Contributor

As there is no more activity on this issue, I will close it, but feel free to reopen it if needed.

@cashincashout

@TouvronHugo Hi, I'm also wondering about the training recipe for the CIFAR-100 and Flowers datasets. Thanks for your help!

@TouvronHugo
Contributor

Hi @jizxny,
For the CIFAR-100 and Flowers datasets you can use the same hparams as for CIFAR-10 and Cars.
Best,
Hugo
