For all experiments we use the following settings:
Stochastic gradient descent (SGD) with momentum, with an initial learning rate (LR) of 0.001 and momentum of 0.9. We also apply LR decay, reducing the current LR by 1/3 every 20 epochs when training for 50 epochs, and every 50 epochs when training for 200 epochs.
In general we also found we could get better and "faster" results by using a small batch size: with batch size 16, we surpassed 80% top-1 classification accuracy in under 10 epochs on DAF:re.
The results on page 4 of the paper (https://arxiv.org/pdf/2101.08674.pdf) were obtained using the 50-epoch setting, saving the model with the best validation accuracy; that "best" validation accuracy is typically reached somewhere around the 20th epoch.
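The step-decay schedule described above can be sketched as a small function. Note an ambiguity: "reduce the current LR by 1/3" is read here as multiplying the LR by 1/3 at each step; if the intended meaning is subtracting a third (i.e. multiplying by 2/3), change `gamma` accordingly.

```python
def scheduled_lr(epoch, base_lr=0.001, step=20, gamma=1/3):
    """Learning rate after step decay.

    The LR is multiplied by `gamma` once every `step` epochs,
    matching the 50-epoch setting (step=20); for the 200-epoch
    setting, use step=50.
    """
    return base_lr * gamma ** (epoch // step)

# LR over the first 50 epochs of the 50-epoch setting:
# epochs 0-19 -> 0.001, epochs 20-39 -> ~0.000333, epochs 40-49 -> ~0.000111
```

In PyTorch this corresponds to wrapping the SGD optimizer with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=1/3)` and calling `scheduler.step()` once per epoch (again assuming the multiplicative reading of "by 1/3").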
Thank you very much for sharing your interesting project!
I'm planning to train a classifier with your classification code on another character face dataset.
May I ask for the configuration of the other hyperparameters of the best ViT L-16 model (e.g. learning rate, number of epochs, LR decay)?