Description
I followed the getting-started directions here: https://www.deepspeed.ai/tutorials/bert-finetuning/
I pulled the docker image and started a container.
I ran the following commands in a Jupyter notebook (server running in the container):
%set_env CUDA_VISIBLE_DEVICES=0,1,2
%cd /home/santosh/Projects/MsZeroTS
!git clone https://github.com/microsoft/DeepSpeed
!mkdir tfModel
!mkdir hfModel
# Save an HF version of the model
!pip install transformers
from transformers import BertModel
model = BertModel.from_pretrained('bert-base-cased')
model.save_pretrained('/home/santosh/Projects/MsZeroTS/hfModel')
#Save a tf version of the model
%cd /home/santosh/Projects/MsZeroTS/tfModel
!wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
!unzip -q cased_L-12_H-768_A-12.zip
%cd ..
%cd DeepSpeed
!git submodule update --init --recursive
%cd DeepSpeedExamples/BingBertSquad
#Download data
!mkdir Data
%cd Data
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
%cd ..
#try tf version
!bash run_squad_baseline.sh 3 /home/santosh/Projects/MsZeroTS/TestModel/cased_L-12_H-768_A-12 /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data /home/santosh/Projects/MsZeroTS/output1
#try hf version
!bash run_squad_baseline.sh 3 /home/santosh/Projects/MsZeroTS/hfModel /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data /home/santosh/Projects/MsZeroTS/output2
Neither the tf nor the hf version of the model works. Here is a sample output from the baseline runs:
lr is 0.00003
seed is 12345
master port is 29500
dropout is 0.1
deepspeed --num_nodes 1 --num_gpus 3 --master_port=29500 --hostfile /dev/null nvidia_run_squad_deepspeed.py --bert_model bert-large-uncased --do_train --do_lower_case --predict_batch_size 3 --do_predict --train_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 0.00003 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MicrosoftZero/DeepSpeed/output --job_name deepspeed_3GPUs_24batch_size --gradient_accumulation_steps 2 --fp16 --deepspeed --deepspeed_config onebit_deepspeed_bsz24_config.json --dropout 0.1 --model_file /home/santosh/Projects/MicrosoftZero/TestModel/cased_L-12_H-768_A-12 --seed 12345 --preln
[2020-10-16 04:30:19,447] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2020-10-16 04:30:19,465] [WARNING] [engine.py:48:<module>] Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2020-10-16 04:30:19,469] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-10-16 04:30:19,533] [INFO] [runner.py:355:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 nvidia_run_squad_deepspeed.py --bert_model bert-large-uncased --do_train --do_lower_case --predict_batch_size 3 --do_predict --train_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 0.00003 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MicrosoftZero/DeepSpeed/output --job_name deepspeed_3GPUs_24batch_size --gradient_accumulation_steps 2 --fp16 --deepspeed --deepspeed_config onebit_deepspeed_bsz24_config.json --dropout 0.1 --model_file /home/santosh/Projects/MicrosoftZero/TestModel/cased_L-12_H-768_A-12 --seed 12345 --preln
[2020-10-16 04:30:20,172] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,190] [WARNING] [engine.py:48:<module>] Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,193] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.6.4
[2020-10-16 04:30:20,193] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2020-10-16 04:30:20,194] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=3, node_rank=0
[2020-10-16 04:30:20,194] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2020-10-16 04:30:20,194] [INFO] [launch.py:100:main] dist_world_size=3
[2020-10-16 04:30:20,194] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2020-10-16 04:30:20,899] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,917] [WARNING] [engine.py:48:<module>] Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,927] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,945] [WARNING] [engine.py:48:<module>] Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,985] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
[2020-10-16 04:30:21,003] [WARNING] [engine.py:48:<module>] Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
10/16/2020 04:30:21 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 04:30:21 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 04:30:21 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
Traceback (most recent call last):
File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
main()
File "nvidia_run_squad_deepspeed.py", line 816, in main
model = BertForQuestionAnsweringPreLN(bert_config, args)
File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
self.bert = BertModel(config, args)
File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
self.embeddings = BertEmbeddings(config)
File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
File "<frozen importlib._bootstrap>", line 571, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 922, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
main()
File "nvidia_run_squad_deepspeed.py", line 816, in main
model = BertForQuestionAnsweringPreLN(bert_config, args)
File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
self.bert = BertModel(config, args)
File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
self.embeddings = BertEmbeddings(config)
File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
File "<frozen importlib._bootstrap>", line 571, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 922, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
main()
File "nvidia_run_squad_deepspeed.py", line 816, in main
model = BertForQuestionAnsweringPreLN(bert_config, args)
File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
self.bert = BertModel(config, args)
File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
self.embeddings = BertEmbeddings(config)
File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
File "<frozen importlib._bootstrap>", line 571, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 922, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
I tried both the hf and tf versions of the model because the error appears to be related to model initialization.
This may be helpful: in the same notebook, I ran a different PyTorch training script without any errors.
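The `undefined symbol: _ZN6caffe2...` ImportError usually indicates that the compiled apex extension was built against a different PyTorch ABI than the one currently installed. A quick diagnostic sketch (not part of the tutorial) to check whether the compiled extension loads at all:

```python
import importlib

def check_extension(name):
    """Try to import a compiled extension module; report link/ABI failures."""
    try:
        importlib.import_module(name)
        return "ok"
    except ImportError as exc:  # covers both undefined-symbol and missing-module cases
        return f"broken: {exc}"

print(check_extension("fused_layer_norm_cuda"))
```

If this prints a `broken:` line with an undefined-symbol message, rebuilding apex against the installed torch (or reinstalling it with the `--cpp_ext`/`--cuda_ext` build options) should resolve the ABI mismatch; that is an assumption based on the error, not something verified in this container.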
I also tried running run_squad_baseline.sh outside the Jupyter notebook, directly in a terminal. For both the hf and tf versions I get a different error: the script cannot load the model from the directory. Here is a sample output:
/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad$ bash run_squad_baseline.sh 3 /home/santosh/Projects/MsZeroTS/hfModel/ /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data /home/santosh/Projects/MsZeroTS/output12
deepspeed --num_nodes 1 --num_gpus 3 nvidia_run_squad_baseline.py --bert_model bert-large-uncased --do_train --do_lower_case --do_predict --train_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MsZeroTS/output12 --job_name baseline_3GPUs_24batch_size --gradient_accumulation_steps 1 --fp16 --model_file /home/santosh/Projects/MsZeroTS/hfModel/
[2020-10-16 05:23:17,877] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-10-16 05:23:17,910] [INFO] [runner.py:355:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 nvidia_run_squad_baseline.py --bert_model bert-large-uncased --do_train --do_lower_case --do_predict --train_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MsZeroTS/output12 --job_name baseline_3GPUs_24batch_size --gradient_accumulation_steps 1 --fp16 --model_file /home/santosh/Projects/MsZeroTS/hfModel/
[2020-10-16 05:23:18,468] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.6.4
[2020-10-16 05:23:18,469] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2020-10-16 05:23:18,469] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=3, node_rank=0
[2020-10-16 05:23:18,469] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2020-10-16 05:23:18,469] [INFO] [launch.py:100:main] dist_world_size=3
[2020-10-16 05:23:18,469] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
10/16/2020 05:23:19 - INFO - __main__ - device: cuda:1 n_gpu: 1, distributed training: True, 16-bits training: True
10/16/2020 05:23:19 - INFO - __main__ - device: cuda:2 n_gpu: 1, distributed training: True, 16-bits training: True
10/16/2020 05:23:19 - INFO - __main__ - device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: True
10/16/2020 05:23:19 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/Projects/MicrosoftZero/DeepSpeed/cache/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 05:23:19 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/Projects/MicrosoftZero/DeepSpeed/cache/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 05:23:19 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/Projects/MicrosoftZero/DeepSpeed/cache/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
VOCAB SIZE: 30528
10/16/2020 05:23:32 - INFO - __main__ - Loading Pretrained Bert Encoder from: /home/santosh/Projects/MsZeroTS/hfModel/
Traceback (most recent call last):
File "nvidia_run_squad_baseline.py", line 1158, in <module>
main()
File "nvidia_run_squad_baseline.py", line 872, in main
map_location=torch.device("cpu"))
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 381, in load
f = open(f, 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/home/santosh/Projects/MsZeroTS/hfModel/'
VOCAB SIZE: 30528
10/16/2020 05:23:32 - INFO - __main__ - Loading Pretrained Bert Encoder from: /home/santosh/Projects/MsZeroTS/hfModel/
Traceback (most recent call last):
File "nvidia_run_squad_baseline.py", line 1158, in <module>
main()
File "nvidia_run_squad_baseline.py", line 872, in main
map_location=torch.device("cpu"))
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 381, in load
f = open(f, 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/home/santosh/Projects/MsZeroTS/hfModel/'
VOCAB SIZE: 30528
10/16/2020 05:23:32 - INFO - __main__ - Loading Pretrained Bert Encoder from: /home/santosh/Projects/MsZeroTS/hfModel/
Traceback (most recent call last):
File "nvidia_run_squad_baseline.py", line 1158, in <module>
main()
File "nvidia_run_squad_baseline.py", line 872, in main
map_location=torch.device("cpu"))
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 381, in load
f = open(f, 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/home/santosh/Projects/MsZeroTS/hfModel/'
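The IsADirectoryError suggests the baseline script passes `--model_file` straight to `torch.load`, which opens its argument as a single file, so a directory written by HF `save_pretrained` fails. A possible workaround is to point `--model_file` at the weights file inside the directory; this is only a sketch, and the `pytorch_model.bin` filename is an assumption based on how `save_pretrained` names its weights file:

```python
import os
import tempfile

def resolve_checkpoint(path):
    """torch.load opens its argument as a single file, so passing a directory
    raises IsADirectoryError. If given a directory, return the HF weights file
    inside it (assumed to be named pytorch_model.bin); otherwise pass through."""
    if os.path.isdir(path):
        return os.path.join(path, "pytorch_model.bin")
    return path

# Demo with a throwaway directory standing in for hfModel/
model_dir = tempfile.mkdtemp()
print(os.path.basename(resolve_checkpoint(model_dir)))  # pytorch_model.bin
```

Alternatively, does the script expect a single checkpoint file rather than an HF model directory? If so, the docs could make that clearer.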