
A very large batchsize requires 64 GPUs #10

Open
Jxu-Thu opened this issue Jun 8, 2021 · 14 comments

@Jxu-Thu

Jxu-Thu commented Jun 8, 2021

Thanks for your great code!
In your paper, the pre-training experiments require 64 V100 GPUs, which is too heavy for research purposes.

If a smaller batch size is used, would the performance drop? By how much? Can you provide any empirical results?

@dandelin
Owner

dandelin commented Jun 9, 2021

Unfortunately, we only experimented with batch_size=4096, so we have no empirical results for other settings.
That said, I believe the performance will be preserved for lower batch sizes such as 2048 or 1024.

For low-resource regimes, the published code provides a "gradient accumulation" option.
It automatically computes the number of gradient-accumulation steps from the given per_gpu_batchsize and the number of GPUs (see https://github.com/dandelin/ViLT/blob/master/run.py#L42-L44).
Theoretically, gradient accumulation yields the same result as the non-accumulated version. (However, we did not use gradient accumulation in our experiments, so it is not guaranteed.)
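
A rough paraphrase of what those lines compute (a sketch; the variable names follow the sacred config, and the exact expression is my reading of the linked code, not a verbatim copy):

# Sketch: derive the gradient-accumulation steps from the desired global
# batch size and the per-step batch that actually fits on the GPUs.
# batch_size, per_gpu_batchsize, num_gpus, num_nodes come from the config.
grad_steps = max(batch_size // (per_gpu_batchsize * num_gpus * num_nodes), 1)
# e.g. 4096 // (32 * 8 * 1) = 16 accumulation steps for 8 GPUs at 32 per GPU.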

@Jxu-Thu
Author

Jxu-Thu commented Jun 13, 2021

If I use fewer resources, e.g. num_gpus=8 and num_nodes=1 (batch size 4096 with accum_steps=8), should I modify any other configuration, such as max_steps?

@dandelin
Owner

@Jxu-Thu
As far as I know, PyTorch Lightning increments the LightningModule's internal step only after the accumulation is complete.
(https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L813)
So you do not need to change any other configuration to use the gradient accumulation feature.
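
In other words, Lightning's global step counts optimizer updates rather than forward passes, so max_steps=100k still means 100k weight updates with or without accumulation. A minimal sketch (the Trainer arguments here are illustrative, not copied from run.py):

from pytorch_lightning import Trainer

# grad_steps is the accumulation factor computed above; max_steps counts
# optimizer updates, so it stays at 100k regardless of accumulation.
trainer = Trainer(
    max_steps=100_000,
    accumulate_grad_batches=grad_steps,
    gpus=8,
    num_nodes=1,
    precision=16,
)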

@Jxu-Thu
Author

Jxu-Thu commented Jun 13, 2021

Many thanks for your kind reply!
I am trying to reproduce the results with 24 V100 GPUs, 3 accumulation steps, and a batch size over 4k, without modifying any other configuration.

@dandelin
Owner

@Jxu-Thu
Also, please pull the latest commit (#12 (comment))

@Jxu-Thu
Author

Jxu-Thu commented Jun 13, 2021

Thanks for the reminder.

@Jxu-Thu
Author

Jxu-Thu commented Jun 16, 2021

I am seeing very slow training because of the huge number of iterations per epoch, and I am trying to work out why there are so many iterations with a small batch size.
Given vg+mscoco+gcc+sbu (about 9M samples) with batch size 32, I get
Epoch 0: 0%| | 0/2392933 [00:00<?, ?it/s]
Given vg+mscoco (about 5M samples) with batch size 32, I get
Epoch 0: 0%| | 0/169000 [00:00<?, ?it/s]

Why does adding gcc+sbu (only about 4M samples) increase the iterations from 169k to 2.39M?
For vg+mscoco, 32 × 169k ≈ 5.4M samples, which roughly matches the dataset size.
However, for vg+mscoco+gcc+sbu, 32 × 2.39M ≈ 76.6M, and I cannot understand why there are so many iterations.
I carefully checked the code but did not find any clues. Could you help me?
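
For reference, the arithmetic behind the mismatch (a quick check in Python, assuming each progress-bar iteration is one batch of 32 samples):

# Quick sanity check on the reported iteration counts.
coco_vg_iters = 169_000
all_four_iters = 2_392_933

print(coco_vg_iters * 32)   # 5,408,000  -> close to the ~5M samples in coco+vg
print(all_four_iters * 32)  # 76,573,856 -> far more than the ~9M samples expected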

@dandelin
Owner

@Jxu-Thu could you share the config for each run using sacred's print_config command? (https://sacred.readthedocs.io/en/stable/command_line.html#print-config)

@Jxu-Thu
Author

Jxu-Thu commented Jun 16, 2021

vg+mscoco+gcc+sbu

INFO - ViLT - Running command 'print_config'
INFO - ViLT - Started
Configuration (modified, added, typechanged, doc):
batch_size = 4096 # this is a desired batch size; pl trainer will accumulate gradients when per step batch is smaller.
data_root = 'data/VilT_dataset' # below params varies with the environment
datasets = ['coco', 'vg', 'sbu', 'gcc']
decay_power = 1
draw_false_image = 1
draw_false_text = 0
drop_rate = 0.1
end_lr = 0
exp_name = 'debug_pretrain'
fast_dev_run = False
get_recall_metric = False # Downstream Setting
hidden_size = 768
image_only = False
image_size = 384
learning_rate = 0.0001
load_path = ''
log_dir = 'checkpoint_vilt/pre_train'
lr_mult = 1 # multiply lr for downstream heads
max_epoch = 100
max_image_len = 200
max_steps = 100000
max_text_len = 40
mlm_prob = 0.15
mlp_ratio = 4
num_gpus = 1
num_heads = 12
num_layers = 12
num_nodes = 1
num_workers = 0 # debug 8
optim_type = 'adamw' # Optimizer Setting
patch_size = 32
per_gpu_batchsize = 32 # you should define this manually with per_gpu_batch_size=#
precision = 16
resume_from = None # PL Trainer Setting
seed = 0 # the random seed for this experiment
test_only = False
tokenizer = 'bert-base-uncased'
train_transform_keys = ['pixelbert'] # Image setting
val_check_interval = 1.0
val_transform_keys = ['pixelbert']
vit = 'vit_base_patch32_384' # Transformer Setting
vocab_size = 30522
vqav2_label_size = 3129 # Text Setting
warmup_steps = 2500
weight_decay = 0.01
whole_word_masking = True
loss_names:
  irtr = 0
  itm = 1
  mlm = 1
  mpp = 0
  nlvr2 = 0
  vqa = 0
INFO - ViLT - Completed after 0:00:00


coco+vg

INFO - ViLT - Running command 'print_config'
INFO - ViLT - Started
Configuration (modified, added, typechanged, doc):
batch_size = 4096 # this is a desired batch size; pl trainer will accumulate gradients when per step batch is smaller.
data_root = 'data/VilT_dataset' # below params varies with the environment
datasets = ['coco', 'vg']
decay_power = 1
draw_false_image = 1
draw_false_text = 0
drop_rate = 0.1
end_lr = 0
exp_name = 'debug_pretrain'
fast_dev_run = False
get_recall_metric = False # Downstream Setting
hidden_size = 768
image_only = False
image_size = 384
learning_rate = 0.0001
load_path = ''
log_dir = 'checkpoint_vilt/pre_train'
lr_mult = 1 # multiply lr for downstream heads
max_epoch = 100
max_image_len = 200
max_steps = 100000
max_text_len = 40
mlm_prob = 0.15
mlp_ratio = 4
num_gpus = 1
num_heads = 12
num_layers = 12
num_nodes = 1
num_workers = 0 # debug 8
optim_type = 'adamw' # Optimizer Setting
patch_size = 32
per_gpu_batchsize = 32 # you should define this manually with per_gpu_batch_size=#
precision = 16
resume_from = None # PL Trainer Setting
seed = 0 # the random seed for this experiment
test_only = False
tokenizer = 'bert-base-uncased'
train_transform_keys = ['pixelbert'] # Image setting
val_check_interval = 1.0
val_transform_keys = ['pixelbert']
vit = 'vit_base_patch32_384' # Transformer Setting
vocab_size = 30522
vqav2_label_size = 3129 # Text Setting
warmup_steps = 2500
weight_decay = 0.01
whole_word_masking = True
loss_names:
  irtr = 0
  itm = 1
  mlm = 1
  mpp = 0
  nlvr2 = 0
  vqa = 0
INFO - ViLT - Completed after 0:00:00

@dandelin
Owner

@Jxu-Thu Thank you.
I'll investigate this issue soon.

@dandelin
Owner

@Jxu-Thu I ran your settings.

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32
=> Epoch 0: 0%| | 130/290436 [03:56<146:58:18, 1.82s/it, loss=11.2, v_num=1]

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'
=> Epoch 0: 0%|▏ | 137/169158 [04:00<82:27:29, 1.76s/it, loss=11.2, v_num=3

Since it works fine with my datasets, I guess you have some duplicated or corrupted arrow files for the SBU or GCC dataset.
Please double-check your arrow files' sanity.
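
If it helps, a rough way to spot-check the files (a sketch only; the *.arrow layout under data_root and the expectation that row counts should roughly match the caption counts are my assumptions):

import glob
import pyarrow as pa

# Print the row count of every arrow file under data_root so that a
# duplicated or truncated SBU/GCC shard stands out.
data_root = "data/VilT_dataset"  # adjust to your own path
for path in sorted(glob.glob(f"{data_root}/*.arrow")):
    table = pa.ipc.open_file(pa.memory_map(path, "r")).read_all()
    print(path, table.num_rows)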

@Jxu-Thu
Author

Jxu-Thu commented Jun 16, 2021

Thanks! I made a mistake in the data processing. After fixing it, I get a similar number of iterations to yours.

@HarmanDotpy

Hi,
I am facing an issue where increasing the number of GPUs and nodes does not change the number of steps. For example, if I run
python run.py with data_root=/mnt/nfs/dandelin num_gpus=4 num_nodes=8 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'

the number of steps is still nearly 169158, while I believe it should be reduced to 169k/(4*8). I also observe that the time taken per epoch with just 1 GPU is less than with 32 GPUs.

Has anyone faced these issues before?

@HarmanDotpy

@Jxu-Thu I ran your settings.

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 => Epoch 0: 0%| | 130/290436 [03:56<146:58:18, 1.82s/it, loss=11.2, v_num=1]

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]' => Epoch 0: 0%|▏ | 137/169158 [04:00<82:27:29, 1.76s/it, loss=11.2, v_num=3

Since it works fine with my datasets, I guess you have some duplicated or corrupted arrow files for the SBU or GCC dataset. Please double-check your arrow files' sanity.

What is the total batch size for this run?
