Inquiry about GPU distributed training #3
I've found the pretrained model. To use it, download:

```bash
# download models
python data/download_data.py --resource checkpoint
```

And run:

```bash
cd ../DPR

# for one device
python generate_dense_embeddings.py \
    --model_file ./checkpoint/retriever/multiset/bert-base-encoder.cp \
    --ctx_file ./data/wikipedia_split/psgs_w100.tsv \
    --out_file ./embeddings/wikipedia_split/wiki_emb

# for multiple GPUs (four here), one shard per device
for i in {0..3}; do
    export CUDA_VISIBLE_DEVICES=${i}
    nohup python generate_dense_embeddings.py \
        --model_file ./checkpoint/retriever/multiset/bert-base-encoder.cp \
        --ctx_file ./data/wikipedia_split/psgs_w100.tsv \
        --shard_id ${i} --num_shards 4 \
        --out_file ./embeddings/wikipedia_split/wiki_emb \
        > ./log/nohup.generate_wiki_emb.ser23_3.${i} 2>&1 &
done
```
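Once the four background shards finish, the per-shard embedding files can be fed to the retriever together. A minimal sketch, assuming the repo's `dense_retriever.py` entry point accepts a glob for `--encoded_ctx_file`, and using an illustrative `--qa_file` path:

```bash
# Sketch: retrieval over all sharded embedding files at once.
# The glob matches the per-shard outputs written by the loop above;
# the qa_file path below is a placeholder, not a verified location.
python dense_retriever.py \
    --model_file ./checkpoint/retriever/multiset/bert-base-encoder.cp \
    --ctx_file ./data/wikipedia_split/psgs_w100.tsv \
    --qa_file ./data/retriever/qas/nq-test.csv \
    --encoded_ctx_file "./embeddings/wikipedia_split/wiki_emb_*" \
    --out_file ./retriever_results.json
```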
Hi EDLMM, looks like you already found out how to download our trained checkpoints.
@vlad-karpukhin, thanks. If you could provide extra checkpoints on S3, that would be great.
We only have the version for the NQ-trained retriever (which is a bit better than the one reported in the paper), not the mixed one in your example above.
Hi, I've started running the model training with GPUs that have less RAM (4 x 11 GB). The `batch_size` is set to 4 in order to avoid memory errors, and this runs well. The code is also fine if I make only one GPU visible, in which case the `batch_size` can be 8. However, when the number of visible GPUs is set to 4 (or anything more than 1), this error occurs. It seems that distributed training can only be launched with `python -m torch.distributed.launch`; I think this should be clarified in the README.

PS: I wonder if you could provide the checkpoint files for the fine-tuned model parameters? I've looked through the README several times and couldn't find such instructions. It would be nice to have them for a simple inference test of the results before I reproduce everything from scratch, since I'm worried that the smaller batch size forced by the lower-profile devices might not completely reproduce your work. Thanks.
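For reference, the launch pattern mentioned above looks roughly like this. A minimal sketch, assuming the repo's `train_dense_encoder.py` entry point; the data paths and hyperparameters are illustrative placeholders:

```bash
# Sketch: multi-GPU training via torch.distributed.launch, one process
# per GPU. File paths and flag values below are placeholders, not
# verified against the repo's exact CLI.
python -m torch.distributed.launch --nproc_per_node=4 \
    train_dense_encoder.py \
    --encoder_model_type hf_bert \
    --pretrained_model_cfg bert-base-uncased \
    --train_file ./data/retriever/nq-train.json \
    --dev_file ./data/retriever/nq-dev.json \
    --batch_size 4 \
    --output_dir ./checkpoints
```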