
Inquiry about GPU distributed training #3

Closed

yizhilll opened this issue May 15, 2020 · 4 comments

@yizhilll
Contributor

Hi, I've started running the model training on GPUs with less RAM (4 x 11 GB).

The batch_size is set to 4 to avoid out-of-memory errors. This runs well.

nohup python -m torch.distributed.launch --nproc_per_node=4 --master_port=19234 train_dense_encoder.py \
  --max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_model_cfg bert-base-uncased \
  --seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 4 --do_lower_case \
  --train_file ./data/retriever/nq-train.json --dev_file ./data/retriever/nq-dev.json \
  --output_dir ./checkpoints --learning_rate 2e-05 --num_train_epochs 40 \
  --dev_batch_size 4 --eval_per_epoch 2 \
  > log/nohup.retriver_encoder.log6.ser23_3.4xGPU 2>&1 &

The code also runs fine if I make only one GPU visible, in which case batch_size can be 8.

nohup python train_dense_encoder.py \
  --max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_model_cfg bert-base-uncased \
  --seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 8 --do_lower_case \
  --train_file ./data/retriever/nq-train.json --dev_file ./data/retriever/nq-dev.json \
  --output_dir ./checkpoints --learning_rate 2e-05 --num_train_epochs 40 \
  --dev_batch_size 8 --eval_per_epoch 2 \
  > log/nohup.retriver_encoder.log6.ser23_3.4xGPU 2>&1 &

However, when more than one GPU is visible (e.g. 4) and the script is launched without the launcher, the error below occurs. It seems that multi-GPU training can only be started with python -m torch.distributed.launch. I think this could be clarified in the README.

  File "train_dense_encoder.py", line 564, in <module>
    main()
  File "train_dense_encoder.py", line 554, in main
    trainer.run_train()
  File "train_dense_encoder.py", line 129, in run_train
    self._train_epoch(scheduler, epoch, eval_step, train_iterator)
  File "train_dense_encoder.py", line 324, in _train_epoch
    loss, correct_cnt = _do_biencoder_fwd_pass(self.biencoder, biencoder_batch, self.tensorizer, args)
  File "train_dense_encoder.py", line 472, in _do_biencoder_fwd_pass
    input.ctx_segments, ctx_attn_mask)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../DPR/dpr/models/biencoder.py", line 85, in forward
    question_attn_mask, self.fix_q_encoder)
  File ".../DPR/dpr/models/biencoder.py", line 77, in get_representation
    sequence_output, pooled_output, hidden_states = sub_model(ids, segments, attn_mask)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1_mnt/private/liyizhi/DPR/dpr/models/hf_models.py", line 125, in forward
    attention_mask=attention_mask)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/transformers/modeling_bert.py", line 707, in forward
    attention_mask, input_shape, self.device
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/transformers/modeling_utils.py", line 113, in device
    return next(self.parameters()).device
StopIteration
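
For reference, here is a minimal sketch of the per-process setup that python -m torch.distributed.launch expects: each spawned process receives a --local_rank argument and drives exactly one GPU through DistributedDataParallel. This is an illustrative toy example, not DPR's actual training code.

# Illustrative only: the per-process pattern used by torch.distributed.launch (PyTorch 1.x).
# The model below is a stand-in, not DPR's bi-encoder.
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

dist.init_process_group(backend="nccl")  # one process per GPU, rendezvous via env vars
torch.cuda.set_device(args.local_rank)

model = torch.nn.Linear(768, 768).cuda(args.local_rank)
model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)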

PS: I wonder if you could provide checkpoint files for the fine-tuned model parameters? I've looked through the README several times and couldn't find such instructions. It would be nice to run a simple inference to check the results before reproducing from scratch; I'm worried that the smaller batch size imposed by a lower-spec device might not fully reproduce your work. Thanks.

@yizhilll
Contributor Author


Update on my PS above: I've found the pretrained model.

To use it, download the checkpoint:

# download models
python data/download_data.py --resource checkpoint

And run:

cd ../DPR
# for one device
python generate_dense_embeddings.py \
  --model_file ./checkpoint/retriever/multiset/bert-base-encoder.cp \
  --ctx_file ./data/wikipedia_split/psgs_w100.tsv \
  --out_file ./embeddings/wikipedia_split/wiki_emb

# for multiple GPUs (four here), one shard per GPU
for i in {0..3}; do
  export CUDA_VISIBLE_DEVICES=${i}
  nohup python generate_dense_embeddings.py \
    --model_file ./checkpoint/retriever/multiset/bert-base-encoder.cp \
    --ctx_file ./data/wikipedia_split/psgs_w100.tsv \
    --shard_id ${i} --num_shards 4 \
    --out_file ./embeddings/wikipedia_split/wiki_emb \
    > ./log/nohup.generate_wiki_emb.ser23_3.${i} 2>&1 &
done
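
For intuition about the --shard_id/--num_shards flags above, here is a rough sketch of how such flags typically partition the passage list so each GPU encodes a disjoint slice. This is an assumption for illustration only, not DPR's actual implementation; the function name is hypothetical.

# Hypothetical illustration of shard_id/num_shards semantics (names and logic assumed).
def take_shard(passages, shard_id, num_shards):
    shard_size = len(passages) // num_shards
    start = shard_id * shard_size
    # the last shard picks up any remainder
    end = start + shard_size if shard_id < num_shards - 1 else len(passages)
    return passages[start:end]

print(take_shard(list(range(10)), shard_id=3, num_shards=4))  # [6, 7, 8, 9]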

@vlad-karpukhin
Contributor

vlad-karpukhin commented May 15, 2020

Hi EDLMM,
If you launch retriever or reader model training without "python -m torch.distributed.launch", it will just use all visible GPUs in DataParallel mode.
Since the performance of our models depends on the effective batch size, launching with a batch size of less than 16 per GPU will produce inferior models. You need an effective batch size of 128 to be able to reproduce our retriever results.
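
As a quick back-of-the-envelope check of that target (illustrative only; gradient accumulation is a generic workaround and only helps here if the training script supports it):

# Effective batch size = per-GPU batch size x number of GPUs (x gradient-accumulation steps, if any).
n_gpus, per_gpu_batch = 4, 4
print(n_gpus * per_gpu_batch)      # 16  -> well below the recommended 128
print(n_gpus * per_gpu_batch * 8)  # 128 -> reachable with 8 accumulation steps, if supported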

Looks like you already found out how to download our trained checkpoints.
Generating all document vectors may take quite a lot of time if you only have 3 shards/GPUs.
I can probably help by providing our generated Wikipedia vectors as well, by putting them on S3. Let me know if the generate_dense_embeddings process takes too much time.

@yizhilll
Contributor Author

@vlad-karpukhin, thanks. If you could provide the vectors on S3, that would be great.
I split the embeddings into 16 shards to do this, and there were some minor bugs that slowed the process down. If possible, I would like to make a PR with the code changes.
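
For anyone following along, one way a 16-shard run could be scheduled across 4 GPUs, 4 jobs at a time, is sketched below. This is hypothetical scheduling code reusing the flags from the command earlier in this thread, not what was actually run.

# Hypothetical scheduler: 16 shards on 4 GPUs, launching 4 jobs at a time and waiting for each batch.
import os
import subprocess

NUM_SHARDS, NUM_GPUS = 16, 4
base_cmd = [
    "python", "generate_dense_embeddings.py",
    "--model_file", "./checkpoint/retriever/multiset/bert-base-encoder.cp",
    "--ctx_file", "./data/wikipedia_split/psgs_w100.tsv",
    "--out_file", "./embeddings/wikipedia_split/wiki_emb",
]
for start in range(0, NUM_SHARDS, NUM_GPUS):
    procs = []
    for shard in range(start, start + NUM_GPUS):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(shard % NUM_GPUS))  # one job per GPU
        procs.append(subprocess.Popen(
            base_cmd + ["--shard_id", str(shard), "--num_shards", str(NUM_SHARDS)], env=env))
    for p in procs:  # wait for this batch of 4 before starting the next
        p.wait()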

@vlad-karpukhin
Contributor

vlad-karpukhin commented May 16, 2020

We only have the version for the NQ-trained retriever (which is a bit better than the one reported in the paper), not the mixed (multiset) one used in your example above.
It is split into 50 files:
https://dl.fbaipublicfiles.com/dpr/data/wiki_encoded/single/nq/wiki_passages_{0 to 49}
Hope this helps.
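
For convenience, a small sketch for fetching all 50 shards. The file names simply follow the pattern quoted above; the local output directory is an arbitrary assumption.

# Download the 50 encoded-passage shards listed above (output directory is arbitrary).
import os
import urllib.request

base_url = "https://dl.fbaipublicfiles.com/dpr/data/wiki_encoded/single/nq/wiki_passages_{}"
out_dir = "./embeddings/wikipedia_split/nq"
os.makedirs(out_dir, exist_ok=True)
for i in range(50):
    dest = os.path.join(out_dir, "wiki_passages_{}".format(i))
    if not os.path.exists(dest):  # skip shards that are already present
        urllib.request.urlretrieve(base_url.format(i), dest)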
