
Mismatch Results of DNA_c #10

Closed
hongyuanyu opened this issue Mar 22, 2020 · 3 comments

@hongyuanyu

Hi,

Thanks for sharing the training code.
I tried to retrain DNA_c with this config:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 ~/imagenet --model DNA_c --epochs 500 --warmup-epochs 5 --batch-size 128 --lr 0.064 --opt rmsproptf --opt-eps 0.001 --sched step --decay-epochs 3 --decay-rate 0.963 --color-jitter 0.06 --drop 0.2 -j 8 --num-classes 1000 --model-ema
After 500 epochs of training, the best top-1 accuracy is 77.2%, which is 0.6% lower than reported in the paper.
*** Best metric: 77.19799990478515 (epoch 458)

@jiefengpeng
Collaborator

Hi, hongyuanyu.
We trained with 32x RTX 2080 Ti GPUs, a batch size of 64 per GPU, and an optimizer step every 2 iterations, which guarantees a total batch size of 4096 and an initial learning rate of 0.256, as suggested for EfficientNets. A smaller batch size and initial learning rate might reduce the final performance. You can try an optimizer step every 4 iterations with a batch size of 128 per GPU and lr 0.256 to keep the effective batch size large.
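For reference, the "optimizer step every N iterations" trick is plain gradient accumulation. Below is a minimal sketch of it, assuming a standard PyTorch training loop; the names (model, loader, criterion, optimizer) and the accumulation factor are illustrative, not code from this repo:

```python
# Minimal sketch of gradient accumulation (illustrative, not the repo's code).
# With 8 GPUs x 128 images/GPU and accum_steps = 4, the effective batch size
# is 8 * 128 * 4 = 4096, matching the lr 0.256 setting described above.
def train_one_epoch(model, loader, criterion, optimizer, accum_steps=4):
    model.train()
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(loader):
        loss = criterion(model(images), targets) / accum_steps  # scale so the
        loss.backward()          # accumulated gradient matches one large batch
        if (step + 1) % accum_steps == 0:
            optimizer.step()     # update weights every accum_steps iterations
            optimizer.zero_grad()
```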

@changlin31
Owner


Hi,

As for ImageNet retraining of the searched models, we used a protocol similar to EfficientNet [30], i.e., a batch size of 4,096, an RMSprop optimizer with momentum 0.9, and an initial learning rate of 0.256 which decays by 0.97 every 2.4 epochs.
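As a quick sanity check of that schedule (my own arithmetic, not code from the repo), the step decay can be written as lr(epoch) = 0.256 * 0.97 ** (epoch // 2.4):

```python
# Rough check of the EfficientNet-style step decay (arithmetic only, not repo code).
def step_lr(epoch, base_lr=0.256, decay_rate=0.97, decay_epochs=2.4):
    return base_lr * decay_rate ** (epoch // decay_epochs)

for e in (0, 100, 300, 500):
    print(e, round(step_lr(e), 5))

# The --decay-epochs 3 --decay-rate 0.963 flags in the command below give an
# almost identical per-epoch decay, since 0.963 ** (1/3) ≈ 0.97 ** (1/2.4).
```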

Our training config is:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 ~/imagenet --model DNA_c --epochs 500 --warmup-epochs 5 --batch-size 64 --lr 0.256 --opt rmsproptf --opt-eps 0.001 --sched step --decay-epochs 3 --decay-rate 0.963 --color-jitter 0.06 --drop 0.2 -j 8 --num-classes 1000 --model-ema
run on 4 nodes, i.e., 32 GPUs, and we step the optimizer every 2 training iterations to simulate a large training batch.
We achieved the highest top-1 accuracy of 77.77% at epoch 351.

The difference is the total batch size: 32 x 2 x 64 = 4096 vs. 8 x 128 = 1024. In the suggested setting we decrease the learning rate using the linear scaling rule: lr = 0.256 x 1024/4096 = 0.064. This smaller total batch size was intended to make reproduction easier, but we cannot guarantee the same performance with it.
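In code form, the linear scaling rule amounts to the following (a sketch; the constants are just the ones quoted in this thread):

```python
# Linear learning-rate scaling (sketch; constants taken from this thread).
base_lr, base_batch = 0.256, 4096   # EfficientNet-style reference setting
total_batch = 8 * 128               # 8 GPUs x 128 images/GPU = 1024
scaled_lr = base_lr * total_batch / base_batch
print(scaled_lr)                    # 0.064, the --lr used in the 8-GPU run
```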

You can try enlarging your total batch size or stepping your optimizer less frequently, as suggested by @jiefengpeng.

@hongyuanyu
Author

Thanks!
