
A very large batchsize requires 64 GPUs #10

Open
Jxu-Thu opened this issue Jun 8, 2021 · 14 comments

@Jxu-Thu

Jxu-Thu commented Jun 8, 2021

Thanks for your great code!
In your paper, the pre-training experiments require 64 V100 GPUs, which is too heavy for research purposes.

If a smaller batch size is used, would the performance drop? By how much? Can you provide any empirical results?

@dandelin
Owner

dandelin commented Jun 9, 2021

Unfortunately, we only experimented with batch_size=4096, so we have no empirical results for other settings.
That said, I believe the performance will be preserved for lower batch sizes such as 2048 or 1024.

For low-resource regimes, the published code provides a "gradient accumulation" option.
It automatically computes the number of gradient-accumulation steps from the given per_gpu_batchsize and the number of GPUs (see https://github.com/dandelin/ViLT/blob/master/run.py#L42-L44).
Theoretically, gradient accumulation yields the same result as the non-accumulated version. (However, we did not use gradient accumulation in our experiments, so it is not guaranteed.)
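
A rough paraphrase of what those lines compute (a sketch; the variable names follow the sacred config, and the exact expression is my reading of the linked code, not a verbatim copy):

# Sketch: derive the gradient-accumulation steps from the desired global
# batch size and the per-step batch that actually fits on the GPUs.
# batch_size, per_gpu_batchsize, num_gpus, num_nodes come from the config.
grad_steps = max(batch_size // (per_gpu_batchsize * num_gpus * num_nodes), 1)
# e.g. 4096 // (32 * 8 * 1) = 16 accumulation steps for 8 GPUs at 32 per GPU.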

@Jxu-Thu
Author

Jxu-Thu commented Jun 13, 2021

If I use fewer resources, e.g. num_gpus=8 and num_nodes=1 (batch size 4096 with accum_steps=8), should I modify any other configuration, such as max_steps?

@dandelin
Owner

@Jxu-Thu
As far as I know, PyTorch Lightning increments the LightningModule's internal step only after the accumulation is complete.
(https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L813)
So you do not need to change any other configuration to use the gradient accumulation feature.
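
In other words, Lightning's global step counts optimizer updates rather than forward passes, so max_steps=100k still means 100k weight updates with or without accumulation. A minimal sketch (the Trainer arguments here are illustrative, not copied from run.py):

from pytorch_lightning import Trainer

# grad_steps is the accumulation factor computed above; max_steps counts
# optimizer updates, so it stays at 100k regardless of accumulation.
trainer = Trainer(
    max_steps=100_000,
    accumulate_grad_batches=grad_steps,
    gpus=8,
    num_nodes=1,
    precision=16,
)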

@Jxu-Thu
Author

Jxu-Thu commented Jun 13, 2021

Many thanks for your kind reply!
I am trying to reproduce the results with 24 V100 GPUs, 3 accumulation steps, and a batch size over 4k, without modifying any other configuration.

@dandelin
Owner

@Jxu-Thu
Also, please pull the latest commit (#12 (comment))

@Jxu-Thu
Author

Jxu-Thu commented Jun 13, 2021

Thanks for the reminder.

@Jxu-Thu
Author

Jxu-Thu commented Jun 16, 2021

I am seeing very slow training because of the huge number of iterations per epoch, and I am trying to work out why there are so many iterations with a small batch size.
Given vg+mscoco+gcc+sbu (about 9M samples) with batch size 32, I get
Epoch 0: 0%| | 0/2392933 [00:00<?, ?it/s]
Given vg+mscoco (about 5M samples) with batch size 32, I get
Epoch 0: 0%| | 0/169000 [00:00<?, ?it/s]

Why does adding gcc+sbu (only about 4M samples) increase the iterations from 169k to 2.39M?
For vg+mscoco, 32 × 169k ≈ 5.4M samples, which roughly matches the dataset size.
However, for vg+mscoco+gcc+sbu, 32 × 2.39M ≈ 76.6M, and I cannot understand why there are so many iterations.
I carefully checked the code but did not find any clues. Could you help me?
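
For reference, the arithmetic behind the mismatch (a quick check in Python, assuming each progress-bar iteration is one batch of 32 samples):

# Quick sanity check on the reported iteration counts.
coco_vg_iters = 169_000
all_four_iters = 2_392_933

print(coco_vg_iters * 32)   # 5,408,000  -> close to the ~5M samples in coco+vg
print(all_four_iters * 32)  # 76,573,856 -> far more than the ~9M samples expected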

@dandelin
Owner

@Jxu-Thu could you share the config for each run using sacred's print_config command? (https://sacred.readthedocs.io/en/stable/command_line.html#print-config)

@Jxu-Thu
Author

Jxu-Thu commented Jun 16, 2021

vg+mscoco+gcc+sbu

INFO - ViLT - Running command 'print_config'
INFO - ViLT - Started
Configuration (modified, added, typechanged, doc):
batch_size = 4096 # this is a desired batch size; pl trainer will accumulate gradients when per step batch is smaller.
data_root = 'data/VilT_dataset' # below params varies with the environment
datasets = ['coco', 'vg', 'sbu', 'gcc']
decay_power = 1
draw_false_image = 1
draw_false_text = 0
drop_rate = 0.1
end_lr = 0
exp_name = 'debug_pretrain'
fast_dev_run = False
get_recall_metric = False # Downstream Setting
hidden_size = 768
image_only = False
image_size = 384
learning_rate = 0.0001
load_path = ''
log_dir = 'checkpoint_vilt/pre_train'
lr_mult = 1 # multiply lr for downstream heads
max_epoch = 100
max_image_len = 200
max_steps = 100000
max_text_len = 40
mlm_prob = 0.15
mlp_ratio = 4
num_gpus = 1
num_heads = 12
num_layers = 12
num_nodes = 1
num_workers = 0 # debug 8
optim_type = 'adamw' # Optimizer Setting
patch_size = 32
per_gpu_batchsize = 32 # you should define this manually with per_gpu_batch_size=#
precision = 16
resume_from = None # PL Trainer Setting
seed = 0 # the random seed for this experiment
test_only = False
tokenizer = 'bert-base-uncased'
train_transform_keys = ['pixelbert'] # Image setting
val_check_interval = 1.0
val_transform_keys = ['pixelbert']
vit = 'vit_base_patch32_384' # Transformer Setting
vocab_size = 30522
vqav2_label_size = 3129 # Text Setting
warmup_steps = 2500
weight_decay = 0.01
whole_word_masking = True
loss_names:
  irtr = 0
  itm = 1
  mlm = 1
  mpp = 0
  nlvr2 = 0
  vqa = 0
INFO - ViLT - Completed after 0:00:00


coco+vg

INFO - ViLT - Running command 'print_config'
INFO - ViLT - Started
Configuration (modified, added, typechanged, doc):
batch_size = 4096 # this is a desired batch size; pl trainer will accumulate gradients when per step batch is smaller.
data_root = 'data/VilT_dataset' # below params varies with the environment
datasets = ['coco', 'vg']
decay_power = 1
draw_false_image = 1
draw_false_text = 0
drop_rate = 0.1
end_lr = 0
exp_name = 'debug_pretrain'
fast_dev_run = False
get_recall_metric = False # Downstream Setting
hidden_size = 768
image_only = False
image_size = 384
learning_rate = 0.0001
load_path = ''
log_dir = 'checkpoint_vilt/pre_train'
lr_mult = 1 # multiply lr for downstream heads
max_epoch = 100
max_image_len = 200
max_steps = 100000
max_text_len = 40
mlm_prob = 0.15
mlp_ratio = 4
num_gpus = 1
num_heads = 12
num_layers = 12
num_nodes = 1
num_workers = 0 # debug 8
optim_type = 'adamw' # Optimizer Setting
patch_size = 32
per_gpu_batchsize = 32 # you should define this manually with per_gpu_batch_size=#
precision = 16
resume_from = None # PL Trainer Setting
seed = 0 # the random seed for this experiment
test_only = False
tokenizer = 'bert-base-uncased'
train_transform_keys = ['pixelbert'] # Image setting
val_check_interval = 1.0
val_transform_keys = ['pixelbert']
vit = 'vit_base_patch32_384' # Transformer Setting
vocab_size = 30522
vqav2_label_size = 3129 # Text Setting
warmup_steps = 2500
weight_decay = 0.01
whole_word_masking = True
loss_names:
  irtr = 0
  itm = 1
  mlm = 1
  mpp = 0
  nlvr2 = 0
  vqa = 0
INFO - ViLT - Completed after 0:00:00

@dandelin
Owner

@Jxu-Thu Thank you.
I'll investigate this issue soon.

@dandelin
Owner

@Jxu-Thu I ran your settings.

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32
=> Epoch 0: 0%| | 130/290436 [03:56<146:58:18, 1.82s/it, loss=11.2, v_num=1]

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'
=> Epoch 0: 0%|▏ | 137/169158 [04:00<82:27:29, 1.76s/it, loss=11.2, v_num=3

Since it works fine with my datasets, I guess you have some duplicated or corrupted arrow files for the SBU or GCC dataset.
Please double-check your arrow files' sanity.
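
If it helps, a rough way to spot-check the files (a sketch only; the *.arrow layout under data_root and the expectation that row counts should roughly match the caption counts are my assumptions):

import glob
import pyarrow as pa

# Print the row count of every arrow file under data_root so that a
# duplicated or truncated SBU/GCC shard stands out.
data_root = "data/VilT_dataset"  # adjust to your own path
for path in sorted(glob.glob(f"{data_root}/*.arrow")):
    table = pa.ipc.open_file(pa.memory_map(path, "r")).read_all()
    print(path, table.num_rows)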

@Jxu-Thu
Author

Jxu-Thu commented Jun 16, 2021

Thanks! I made a mistake in the data processing. After fixing it, I get a similar number of iterations to yours.

@HarmanDotpy

Hi,
I am facing an issue where increasing the number of GPUs and nodes does not change the number of steps. For example, if I run
python run.py with data_root=/mnt/nfs/dandelin num_gpus=4 num_nodes=8 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'

the number of steps is still nearly 169158, while I believe it should be reduced to 169k/(4*8). I also observe that the time taken per epoch with just 1 GPU is less than with 32 GPUs.

Has anyone faced these issues before?

@HarmanDotpy

@Jxu-Thu I ran your settings.

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 => Epoch 0: 0%| | 130/290436 [03:56<146:58:18, 1.82s/it, loss=11.2, v_num=1]

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]' => Epoch 0: 0%|▏ | 137/169158 [04:00<82:27:29, 1.76s/it, loss=11.2, v_num=3

Since it works fine with my datasets, I guess you have some duplicated or corrupted arrow files for the SBU or GCC dataset. Please double-check your arrow files' sanity.

What is the total batch size for this run?
