The batch-sizes of single machine commands are not adjusted #33
Comments
Hi Vasilis, thanks for your question! We use update_freq to perform gradient accumulation: each GPU accumulates gradients over update_freq iterations before taking an optimizer step, so the effective batch size is batch_size × update_freq × number of GPUs.
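For readers unfamiliar with the flag, here is a minimal sketch of how an update_freq-style gradient-accumulation loop typically works (illustrative only, not the actual ConvNeXt training code):

```python
# Illustrative sketch of gradient accumulation via an update_freq-style
# flag (not the actual ConvNeXt training loop). Gradients from
# update_freq mini-batches are accumulated before each optimizer step,
# so the effective batch size is batch_size * update_freq * num_gpus.
def train_one_epoch(model, criterion, data_loader, optimizer, update_freq=4):
    model.train()
    optimizer.zero_grad()
    for step, (samples, targets) in enumerate(data_loader):
        loss = criterion(model(samples), targets)
        # Divide by update_freq so the accumulated gradient matches the
        # average gradient of one large batch.
        (loss / update_freq).backward()
        if (step + 1) % update_freq == 0:
            optimizer.step()
            optimizer.zero_grad()
```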
Oh, that makes sense. The TorchVision scripts don't support that specific parameter, and I didn't go through the code to see that it handles it. Thanks for the clarification!
Glad I could help. Do you have a workaround for training if the TorchVision scripts don't support the specific parameter (update_freq)? Could you train with multiple nodes?
Yes, we train with multiple nodes like you do, using submitit. If we run into a situation where we can't fit everything in a single node, we provide in the documentation the single-machine equivalent of the command, but indicate that in practice it was trained with X machines and Y GPUs. Part of the reason I didn't notice the extra parameter is that our training script works similarly, so I assumed that you did the same and missed the conversion. Your approach of using update_freq for gradient accumulation would be a nice addition; would you consider contributing it?
We inherited this gradient accumulation feature from the BEiT codebase. In the coming week we will be busy with some other paper-related work, so I'm not sure I can contribute this on short notice. If after one week it is still relevant or needed, I'm happy to contribute; for me the main thing to work out would be the process, as the code part should be simple enough. BTW, the conversion for single machine you mentioned (using --batch_size 512) looks correct to me.
Hi, I am trying to reproduce the convnext-tiny result on ImageNet-1k. I have a single machine with 8 GPUs, so I set nproc_per_node=8. But I didn't notice the setting you mentioned for "the effective batch size 4096" in the training doc. I just set nproc_per_node=8, batch_size=96, update_freq=4, so my effective batch size is only 8 × 96 × 4 = 3072. Will it affect my reproduction results?
Hi @anonymoussss, it may affect your reproduction results, since each batch size has a different optimal learning rate. It is common practice to scale the learning rate in proportion to the batch size, meaning you may use 3e-3 (instead of 4e-3 for 4096) as the learning rate if your effective batch size is 3072.
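In code, this linear scaling rule is just a ratio (illustrative helper, using the numbers from this thread):

```python
# Linear LR scaling: scale the reference learning rate by the ratio of
# the actual effective batch size to the reference batch size.
def scaled_lr(base_lr, base_batch_size, effective_batch_size):
    return base_lr * effective_batch_size / base_batch_size

# 4e-3 was tuned for an effective batch size of 4096; with 3072:
print(scaled_lr(4e-3, 4096, 3072))  # ~0.003, i.e. 3e-3
```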
In the training doc, I believe we need to adjust the batch size (or the LR) in the single-machine commands to keep the total batch size the same.
For example, the ConvNeXt-S doc currently reports, for multi-node training:
--nodes 4 --ngpus 8 --batch_size 128 --lr 4e-3
and for a single machine:
--nproc_per_node=8 --batch_size 128 --lr 4e-3 <- I believe here it should be --batch_size 512
Same applies for the other variants.
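To make the arithmetic explicit, here is a small illustrative check (the helper below is hypothetical; the parameter names just mirror the flags in the commands above):

```python
# Effective batch size = nodes * GPUs per node * per-GPU batch size
# (* update_freq when gradient accumulation is used).
def effective_batch_size(nodes, gpus_per_node, batch_size, update_freq=1):
    return nodes * gpus_per_node * batch_size * update_freq

print(effective_batch_size(4, 8, 128))  # multi-node command: 4096
print(effective_batch_size(1, 8, 128))  # single-machine doc as written: 1024
print(effective_batch_size(1, 8, 512))  # with --batch_size 512: 4096
```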