A very large batch size requires 64 GPUs #10
Comments
Unfortunately, we only experimented with batch_size=4096 and thus have no empirical results for smaller batch sizes. For low-resource regimes, the published code provides "gradient accumulation" options.
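For readers landing here: gradient accumulation trades wall-clock time for memory by summing gradients over several micro-batches before each optimizer step, so the effective batch can match the large one from the paper. Below is a minimal PyTorch sketch with a toy model and loader standing in for the repo's actual training loop, not a reproduction of it:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for illustration only; the repo's real training loop is not reproduced here.
data = TensorDataset(torch.randn(64, 768), torch.randint(0, 2, (64,)))
loader = DataLoader(data, batch_size=8)
model = torch.nn.Linear(768, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
accum_steps = 8  # e.g. a 512-sample micro-batch x 8 accumulations ~= 4096 effective batch

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so summed gradients match one large batch
    loss.backward()                            # gradients accumulate in param.grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # one optimizer update per accum_steps micro-batches
        optimizer.zero_grad()
```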
If I use a smaller setup such as num_gpus=8, num_nodes=1 (batch size 4096 with accum_steps=8), should I modify the other configurations, such as max_steps?
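As a rough sanity check (my own arithmetic, not taken from the repo): the effective batch size is the product of the per-GPU batch, the world size, and the accumulation steps, so if it still comes out to 4096 the number of optimizer updates per sample is unchanged and max_steps should not need to change:

```python
# Hypothetical values for illustration; per_gpu_batch is whatever fits in GPU memory.
per_gpu_batch = 64
num_gpus, num_nodes, accum_steps = 8, 1, 8
effective_batch = per_gpu_batch * num_gpus * num_nodes * accum_steps
print(effective_batch)  # 4096 -> same effective batch as the 64-GPU run, so max_steps can stay as-is
```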
@Jxu-Thu
Many thanks for your kind reply!
@Jxu-Thu
Thanks for your reminder.
I found the training speed to be very slow because of the huge number of training iterations per epoch, and I tried to inspect why there are so many iterations with a small batch size. Why does adding gcc+sbu (only ~4M samples) increase the iterations from ~160k to ~2.39M?
@Jxu-Thu Could you share the config for each run using sacred's print_config command? (https://sacred.readthedocs.io/en/stable/command_line.html#print-config)
For vg+mscoco+gcc+sbu: INFO - ViLT - Running command 'print_config' (full config dump omitted)
For coco+vg: INFO - ViLT - Running command 'print_config' (full config dump omitted)
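For anyone else trying to compare setups: sacred ships print_config as a built-in command, so the fully resolved config can be printed without launching training. A minimal self-contained sketch (the experiment name and config keys below are illustrative, not necessarily the repo's exact ones):

```python
from sacred import Experiment

# Running `python exp.py print_config with num_gpus=8 per_gpu_batchsize=64`
# prints the resolved config (added/modified values highlighted) and exits
# without executing the main function.
ex = Experiment("ViLT")

@ex.config
def config():
    num_gpus = 64          # illustrative defaults, not the repo's real ones
    num_nodes = 1
    per_gpu_batchsize = 64
    batch_size = 4096

@ex.automain
def main(num_gpus, num_nodes, per_gpu_batchsize, batch_size):
    pass
```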
@Jxu-Thu Thank you.
@Jxu-Thu I ran your settings.
Since it works fine with my datasets, I suspect you have some duplicated or corrupted arrow files for the SBU or GCC dataset.
Thanks! I made a mistake in the data processing. After fixing it, I get a similar number of iterations to yours.
Hi, the number of steps is still nearly 169158, while I believe it should have been reduced to 169k/(4*8). I also observe that the time taken per epoch with just 1 GPU is less than with 32 GPUs. Has anyone faced these issues before?
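A back-of-the-envelope check of that expectation (my own arithmetic; it assumes the displayed count should reflect optimizer updates, and the 4 and 8 simply mirror the division quoted above):

```python
# Hypothetical numbers taken from the comment above.
iterations_seen = 169158          # step count still reported per epoch
num_gpus, accum_steps = 4, 8      # the 169k / (4 * 8) expectation
expected_updates = iterations_seen / (num_gpus * accum_steps)
print(round(expected_updates))    # ~5286; a count that stays near 169k suggests the progress bar
                                  # is counting per-GPU micro-batches rather than optimizer updates
```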
What is the total batch size for this run?
Thanks for your great code!
In your paper, running the pre-training experiments needs 64 V100 GPUs.
For research purposes, that is too heavy.
If we use a smaller batch size, would the performance drop? By how much? Can you provide any empirical results?