
Cannot do BERT inference in a multi-instance way #864

Open
TaoLv opened this issue Aug 6, 2019 · 1 comment

Comments

@TaoLv
Member

commented Aug 6, 2019

Description

Cannot do BERT inference in a multi-instance way.

Error Message

terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable

To Reproduce


Steps to reproduce


  1. Install the latest MXNet and GluonNLP
  2. Fine-tune on the MRPC task to obtain model_bert_MRPC_2.params
  3. Execute the script below on a 24-core CPU machine (e.g. c5.12xlarge), with 4 cores per instance:
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=4

numactl --physcpubind=0-3   --membind=0 python finetune_classifier.py --task_name MRPC --epochs 1 --lr 2e-5 --only_inference --model_parameters ./output_dir/model_bert_MRPC_2.params --dev_batch_size 1 --pad &
numactl --physcpubind=4-7   --membind=0 python finetune_classifier.py --task_name MRPC --epochs 1 --lr 2e-5 --only_inference --model_parameters ./output_dir/model_bert_MRPC_2.params --dev_batch_size 1 --pad &
numactl --physcpubind=8-11  --membind=0 python finetune_classifier.py --task_name MRPC --epochs 1 --lr 2e-5 --only_inference --model_parameters ./output_dir/model_bert_MRPC_2.params --dev_batch_size 1 --pad &
numactl --physcpubind=12-15 --membind=0 python finetune_classifier.py --task_name MRPC --epochs 1 --lr 2e-5 --only_inference --model_parameters ./output_dir/model_bert_MRPC_2.params --dev_batch_size 1 --pad &
numactl --physcpubind=16-19 --membind=0 python finetune_classifier.py --task_name MRPC --epochs 1 --lr 2e-5 --only_inference --model_parameters ./output_dir/model_bert_MRPC_2.params --dev_batch_size 1 --pad &
numactl --physcpubind=20-23 --membind=0 python finetune_classifier.py --task_name MRPC --epochs 1 --lr 2e-5 --only_inference --model_parameters ./output_dir/model_bert_MRPC_2.params --dev_batch_size 1 --pad &

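As a side note (an assumption on my part, not from the original report): a `std::system_error` with "Resource temporarily unavailable" usually corresponds to `EAGAIN` from the OS refusing to create another thread or process. Before launching the six instances, the relevant limits could be inspected with something like:

```shell
# Hedged diagnostic sketch: check whether per-user or system-wide
# thread/process limits could explain EAGAIN on thread creation.
ulimit -u                            # max user processes/threads for this shell
cat /proc/sys/kernel/threads-max     # system-wide thread limit
cat /proc/sys/vm/max_map_count       # memory-map limit (each thread stack consumes maps)
```

If `ulimit -u` is low relative to six instances each spawning OpenMP and dataloader threads, raising it (or reducing per-instance threads) would be the first thing to try.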
What have you tried to solve it?

  1. Duplicated the parameter file into 6 copies and gave each instance its own copy. The issue persists.

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here

@TaoLv TaoLv added the bug label Aug 6, 2019

@eric-haibin-lin

Member

commented Aug 12, 2019

@TaoLv is there still a segfault if the dataloader's multi-processing is turned off?
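For context, a minimal stdlib-only illustration of what turning off dataloader multi-processing means (the real script uses `mxnet.gluon.data.DataLoader`, whose `num_workers=0` keeps all loading in the main process; the names below are hypothetical):

```python
# Hypothetical sketch: with num_workers=0 batches are loaded in the
# calling process; with num_workers>0 they are loaded by forked worker
# processes, each of which consumes the threads/processes that EAGAIN
# ("Resource temporarily unavailable") says are exhausted.
from multiprocessing.pool import Pool

def load_batch(i):
    # Pretend to read 4 samples for batch i.
    return [i * 4 + j for j in range(4)]

def batches(n, num_workers=0):
    if num_workers == 0:
        return [load_batch(i) for i in range(n)]   # main-process loading
    with Pool(num_workers) as pool:
        return pool.map(load_batch, range(n))      # worker-process loading

# Both paths produce identical data; only the process usage differs.
assert batches(3, num_workers=0) == batches(3, num_workers=2)
```

Disabling the workers would isolate whether the crash comes from dataloader process creation rather than from the OpenMP threads pinned by `numactl`.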
