
Inquiry about GPU distributed training #3

Closed

yizhilll opened this issue May 15, 2020 · 4 comments

@yizhilll
Contributor

Hi, I've started running the model training on GPUs with less RAM (4 x 11 GB).

The batch_size is set to 4 to avoid out-of-memory errors. This runs well.

nohup python -m torch.distributed.launch --nproc_per_node=4 --master_port=19234 train_dense_encoder.py \
  --max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_model_cfg bert-base-uncased \
  --seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 4 --do_lower_case \
  --train_file ./data/retriever/nq-train.json --dev_file ./data/retriever/nq-dev.json \
  --output_dir ./checkpoints --learning_rate 2e-05 --num_train_epochs 40 \
  --dev_batch_size 4 --eval_per_epoch 2 \
  > log/nohup.retriver_encoder.log6.ser23_3.4xGPU 2>&1 &

The code also runs fine if I make only one GPU visible, in which case batch_size can be 8.

nohup python train_dense_encoder.py \
  --max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_model_cfg bert-base-uncased \
  --seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 8 --do_lower_case \
  --train_file ./data/retriever/nq-train.json --dev_file ./data/retriever/nq-dev.json \
  --output_dir ./checkpoints --learning_rate 2e-05 --num_train_epochs 40 \
  --dev_batch_size 8 --eval_per_epoch 2 \
  > log/nohup.retriver_encoder.log6.ser23_3.4xGPU 2>&1 &

However, when more than one GPU is visible (e.g. 4) and the script is launched without the launcher, the error below occurs. It seems that multi-GPU training can only be started with python -m torch.distributed.launch. I think this could be clarified in the README.

  File "train_dense_encoder.py", line 564, in <module>
    main()
  File "train_dense_encoder.py", line 554, in main
    trainer.run_train()
  File "train_dense_encoder.py", line 129, in run_train
    self._train_epoch(scheduler, epoch, eval_step, train_iterator)
  File "train_dense_encoder.py", line 324, in _train_epoch
    loss, correct_cnt = _do_biencoder_fwd_pass(self.biencoder, biencoder_batch, self.tensorizer, args)
  File "train_dense_encoder.py", line 472, in _do_biencoder_fwd_pass
    input.ctx_segments, ctx_attn_mask)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../DPR/dpr/models/biencoder.py", line 85, in forward
    question_attn_mask, self.fix_q_encoder)
  File ".../DPR/dpr/models/biencoder.py", line 77, in get_representation
    sequence_output, pooled_output, hidden_states = sub_model(ids, segments, attn_mask)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1_mnt/private/liyizhi/DPR/dpr/models/hf_models.py", line 125, in forward
    attention_mask=attention_mask)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/transformers/modeling_bert.py", line 707, in forward
    attention_mask, input_shape, self.device
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/transformers/modeling_utils.py", line 113, in device
    return next(self.parameters()).device
StopIteration
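
For reference, here is a minimal sketch of the per-process setup that python -m torch.distributed.launch expects: each spawned process receives a --local_rank argument and drives exactly one GPU through DistributedDataParallel. This is an illustrative toy example, not DPR's actual training code.

# Illustrative only: the per-process pattern used by torch.distributed.launch (PyTorch 1.x).
# The model below is a stand-in, not DPR's bi-encoder.
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

dist.init_process_group(backend="nccl")  # one process per GPU, rendezvous via env vars
torch.cuda.set_device(args.local_rank)

model = torch.nn.Linear(768, 768).cuda(args.local_rank)
model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)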

PS: I wonder if you could provide checkpoint files for the fine-tuned model parameters? I've looked through the README several times and couldn't find such instructions. It would be nice to run a simple inference to check the results before reproducing from scratch; I'm worried that the smaller batch size imposed by a lower-spec device might not fully reproduce your work. Thanks.

@yizhilll
Contributor Author


Update on my PS above: I've found the pretrained model.

To use it, download the checkpoint:

# download models
python data/download_data.py --resource checkpoint

And run:

cd ../DPR
# for one device
python generate_dense_embeddings.py \
  --model_file ./checkpoint/retriever/multiset/bert-base-encoder.cp \
  --ctx_file ./data/wikipedia_split/psgs_w100.tsv \
  --out_file ./embeddings/wikipedia_split/wiki_emb

# for multiple GPUs (four here), one shard per GPU
for i in {0..3}; do
  export CUDA_VISIBLE_DEVICES=${i}
  nohup python generate_dense_embeddings.py \
    --model_file ./checkpoint/retriever/multiset/bert-base-encoder.cp \
    --ctx_file ./data/wikipedia_split/psgs_w100.tsv \
    --shard_id ${i} --num_shards 4 \
    --out_file ./embeddings/wikipedia_split/wiki_emb \
    > ./log/nohup.generate_wiki_emb.ser23_3.${i} 2>&1 &
done
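
For intuition about the --shard_id/--num_shards flags above, here is a rough sketch of how such flags typically partition the passage list so each GPU encodes a disjoint slice. This is an assumption for illustration only, not DPR's actual implementation; the function name is hypothetical.

# Hypothetical illustration of shard_id/num_shards semantics (names and logic assumed).
def take_shard(passages, shard_id, num_shards):
    shard_size = len(passages) // num_shards
    start = shard_id * shard_size
    # the last shard picks up any remainder
    end = start + shard_size if shard_id < num_shards - 1 else len(passages)
    return passages[start:end]

print(take_shard(list(range(10)), shard_id=3, num_shards=4))  # [6, 7, 8, 9]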

@vlad-karpukhin
Contributor

vlad-karpukhin commented May 15, 2020

Hi EDLMM,
If you launch retriever or reader model training without "python -m torch.distributed.launch", it will just use all visible GPUs in DataParallel mode.
Since the performance of our models depends on the effective batch size, launching with a batch size of less than 16 per GPU will produce inferior models. You need an effective batch size of 128 to be able to reproduce our retriever results.
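
As a quick back-of-the-envelope check of that target (illustrative only; gradient accumulation is a generic workaround and only helps here if the training script supports it):

# Effective batch size = per-GPU batch size x number of GPUs (x gradient-accumulation steps, if any).
n_gpus, per_gpu_batch = 4, 4
print(n_gpus * per_gpu_batch)      # 16  -> well below the recommended 128
print(n_gpus * per_gpu_batch * 8)  # 128 -> reachable with 8 accumulation steps, if supported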

Looks like you already found out how to download our trained checkpoints.
Generating all document vectors may take quite a lot of time if you only have 3 shards/GPUs.
I can probably help by providing our generated Wikipedia vectors as well, by putting them on S3. Let me know if the generate_dense_embeddings process takes too much time.

@yizhilll
Contributor Author

@vlad-karpukhin, thanks. If you could provide the vectors on S3, that would be great.
I split the embeddings into 16 shards to do this, and there were some minor bugs that slowed the process down. If possible, I would like to make a PR with the code changes.
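
For anyone following along, one way a 16-shard run could be scheduled across 4 GPUs, 4 jobs at a time, is sketched below. This is hypothetical scheduling code reusing the flags from the command earlier in this thread, not what was actually run.

# Hypothetical scheduler: 16 shards on 4 GPUs, launching 4 jobs at a time and waiting for each batch.
import os
import subprocess

NUM_SHARDS, NUM_GPUS = 16, 4
base_cmd = [
    "python", "generate_dense_embeddings.py",
    "--model_file", "./checkpoint/retriever/multiset/bert-base-encoder.cp",
    "--ctx_file", "./data/wikipedia_split/psgs_w100.tsv",
    "--out_file", "./embeddings/wikipedia_split/wiki_emb",
]
for start in range(0, NUM_SHARDS, NUM_GPUS):
    procs = []
    for shard in range(start, start + NUM_GPUS):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(shard % NUM_GPUS))  # one job per GPU
        procs.append(subprocess.Popen(
            base_cmd + ["--shard_id", str(shard), "--num_shards", str(NUM_SHARDS)], env=env))
    for p in procs:  # wait for this batch of 4 before starting the next
        p.wait()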

@vlad-karpukhin
Contributor

vlad-karpukhin commented May 16, 2020

We only have the version for the NQ-trained retriever (which is a bit better than the one reported in the paper), not the mixed (multiset) one used in your example above.
It is split into 50 files:
https://dl.fbaipublicfiles.com/dpr/data/wiki_encoded/single/nq/wiki_passages_{0 to 49}
Hope this helps.
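
For convenience, a small sketch for fetching all 50 shards. The file names simply follow the pattern quoted above; the local output directory is an arbitrary assumption.

# Download the 50 encoded-passage shards listed above (output directory is arbitrary).
import os
import urllib.request

base_url = "https://dl.fbaipublicfiles.com/dpr/data/wiki_encoded/single/nq/wiki_passages_{}"
out_dir = "./embeddings/wikipedia_split/nq"
os.makedirs(out_dir, exist_ok=True)
for i in range(50):
    dest = os.path.join(out_dir, "wiki_passages_{}".format(i))
    if not os.path.exists(dest):  # skip shards that are already present
        urllib.request.urlretrieve(base_url.format(i), dest)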
