Following the BERT fine-tuning tutorial results in ImportError or IsADirectoryError for run_squad_baseline.sh #474

@Santosh-Gupta

I followed the getting started directions here

https://www.deepspeed.ai/tutorials/bert-finetuning/

I pulled the docker image and started a container.

I ran the following commands in a Jupyter notebook (server running in the container)

%set_env CUDA_VISIBLE_DEVICES=0,1,2
%cd /home/santosh/Projects/MsZeroTS
!git clone https://github.com/microsoft/DeepSpeed
!mkdir tfModel
!mkdir hfModel

#Save a HF version of the model
!pip install transformers
from transformers import BertModel
model = BertModel.from_pretrained('bert-base-cased')
model.save_pretrained('/home/santosh/Projects/MsZeroTS/hfModel')

#Save a tf version of the model 
%cd /home/santosh/Projects/MsZeroTS/tfModel
!wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
!unzip -q cased_L-12_H-768_A-12.zip
%cd ..

%cd DeepSpeed
!git submodule update --init --recursive
%cd DeepSpeedExamples/BingBertSquad

#Download data 
!mkdir Data
%cd Data
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
%cd ..

#try tf version
!bash run_squad_baseline.sh 3 /home/santosh/Projects/MsZeroTS/TestModel/cased_L-12_H-768_A-12 /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data /home/santosh/Projects/MsZeroTS/output1

#try hf version 
!bash run_squad_baseline.sh 3 /home/santosh/Projects/MsZeroTS/hfModel /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data /home/santosh/Projects/MsZeroTS/output2
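
A quick sanity check on the data directory before launching, as a minimal sketch. The two file names are taken from the --train_file and --predict_file arguments that appear in the launcher output in this report; the helper name missing_squad_files is hypothetical and not part of DeepSpeed or the example scripts.

```python
from pathlib import Path

# File names as they appear in the --train_file / --predict_file arguments.
REQUIRED_FILES = ("train-v1.1.json", "dev-v1.1.json")


def missing_squad_files(data_dir):
    """Return the required SQuAD file names that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Path copied from the commands above; adjust to the local checkout.
    data_dir = "/home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data"
    missing = missing_squad_files(data_dir)
    print("missing files:", missing if missing else "none")
```

An empty list means both files the scripts are pointed at exist; it does not validate their contents.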

Neither the tf nor the hf version of the model works. This is a sample output from the baselines:

lr is 0.00003
seed is 12345
master port is 29500
dropout is 0.1
deepspeed --num_nodes 1 --num_gpus 3 --master_port=29500 --hostfile /dev/null nvidia_run_squad_deepspeed.py --bert_model bert-large-uncased --do_train --do_lower_case --predict_batch_size 3 --do_predict --train_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 0.00003 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MicrosoftZero/DeepSpeed/output --job_name deepspeed_3GPUs_24batch_size --gradient_accumulation_steps 2 --fp16 --deepspeed --deepspeed_config onebit_deepspeed_bsz24_config.json --dropout 0.1 --model_file /home/santosh/Projects/MicrosoftZero/TestModel/cased_L-12_H-768_A-12 --seed 12345 --preln
[2020-10-16 04:30:19,447] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:19,465] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:19,469] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-10-16 04:30:19,533] [INFO] [runner.py:355:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 nvidia_run_squad_deepspeed.py --bert_model bert-large-uncased --do_train --do_lower_case --predict_batch_size 3 --do_predict --train_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 0.00003 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MicrosoftZero/DeepSpeed/output --job_name deepspeed_3GPUs_24batch_size --gradient_accumulation_steps 2 --fp16 --deepspeed --deepspeed_config onebit_deepspeed_bsz24_config.json --dropout 0.1 --model_file /home/santosh/Projects/MicrosoftZero/TestModel/cased_L-12_H-768_A-12 --seed 12345 --preln
[2020-10-16 04:30:20,172] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,190] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,193] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.6.4
[2020-10-16 04:30:20,193] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2020-10-16 04:30:20,194] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=3, node_rank=0
[2020-10-16 04:30:20,194] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2020-10-16 04:30:20,194] [INFO] [launch.py:100:main] dist_world_size=3
[2020-10-16 04:30:20,194] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2020-10-16 04:30:20,899] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,917] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,927] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,945] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:20,985] [WARNING] [stage2.py:32:<module>] apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
[2020-10-16 04:30:21,003] [WARNING] [engine.py:48:<module>] Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
10/16/2020 04:30:21 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 04:30:21 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 04:30:21 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1143, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 816, in main
    model = BertForQuestionAnsweringPreLN(bert_config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 1500, in __init__
    self.bert = BertModel(config, args)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 927, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad/turing/nvidia_modelingpreln.py", line 354, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

I tried both the hf and tf versions of the model because it looked like the error was related to the model initialization.
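
To separate the import failure from model initialization, the compiled extension can be imported on its own in the same kernel. This is a minimal sketch: the module names come from the traceback above, the helper names try_import and report are hypothetical, and the interpretation is an assumption (an "undefined symbol" error from a compiled extension usually means it was built against a different PyTorch build than the one currently installed, which I have not confirmed for this container).

```python
import importlib


def try_import(module_name):
    """Import a module by name; return (ok, error_message)."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        return False, str(exc)
    return True, ""


def report(module_names=("torch", "apex", "fused_layer_norm_cuda")):
    """Check each module in order (torch first, so its shared libraries
    are already loaded when the compiled extension is imported)."""
    lines = []
    for name in module_names:
        ok, detail = try_import(name)
        lines.append(f"{name}: {'ok' if ok else 'FAILED ' + detail}")
    return lines


if __name__ == "__main__":
    for line in report():
        print(line)
```

If torch and apex import but fused_layer_norm_cuda fails with the same undefined symbol, the problem is in the environment rather than in the model files passed to the script.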

This info might be helpful; in the same notebook I ran another pytorch training script without any errors.

I also tried running run_squad_baseline.sh outside the Jupyter notebook, directly in a terminal. For both the hf and tf versions I get a different error: it looks like the script cannot load the model from the directory. Here is a sample output.

/home/santosh/Projects/MicrosoftZero/DeepSpeed/DeepSpeedExamples/BingBertSquad$ bash run_squad_baseline.sh 3 /home/santosh/Projects/MsZeroTS/hfModel/ /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data /home/santosh/Projects/MsZeroTS/output12
deepspeed --num_nodes 1 --num_gpus 3 nvidia_run_squad_baseline.py --bert_model bert-large-uncased --do_train --do_lower_case --do_predict --train_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MsZeroTS/output12 --job_name baseline_3GPUs_24batch_size --gradient_accumulation_steps 1 --fp16 --model_file /home/santosh/Projects/MsZeroTS/hfModel/
[2020-10-16 05:23:17,877] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-10-16 05:23:17,910] [INFO] [runner.py:355:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 nvidia_run_squad_baseline.py --bert_model bert-large-uncased --do_train --do_lower_case --do_predict --train_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/train-v1.1.json --predict_file /home/santosh/Projects/MsZeroTS/DeepSpeed/DeepSpeedExamples/BingBertSquad/Data/dev-v1.1.json --train_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir /home/santosh/Projects/MsZeroTS/output12 --job_name baseline_3GPUs_24batch_size --gradient_accumulation_steps 1 --fp16 --model_file /home/santosh/Projects/MsZeroTS/hfModel/
[2020-10-16 05:23:18,468] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.6.4
[2020-10-16 05:23:18,469] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2020-10-16 05:23:18,469] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=3, node_rank=0
[2020-10-16 05:23:18,469] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2020-10-16 05:23:18,469] [INFO] [launch.py:100:main] dist_world_size=3
[2020-10-16 05:23:18,469] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
10/16/2020 05:23:19 - INFO - __main__ -   device: cuda:1 n_gpu: 1, distributed training: True, 16-bits training: True
10/16/2020 05:23:19 - INFO - __main__ -   device: cuda:2 n_gpu: 1, distributed training: True, 16-bits training: True
10/16/2020 05:23:19 - INFO - __main__ -   device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: True
10/16/2020 05:23:19 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/Projects/MicrosoftZero/DeepSpeed/cache/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 05:23:19 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/Projects/MicrosoftZero/DeepSpeed/cache/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
10/16/2020 05:23:19 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /home/santosh/Projects/MicrosoftZero/DeepSpeed/cache/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
VOCAB SIZE: 30528
10/16/2020 05:23:32 - INFO - __main__ -   Loading Pretrained Bert Encoder from: /home/santosh/Projects/MsZeroTS/hfModel/
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 872, in main
    map_location=torch.device("cpu"))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 381, in load
    f = open(f, 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/home/santosh/Projects/MsZeroTS/hfModel/'
VOCAB SIZE: 30528
10/16/2020 05:23:32 - INFO - __main__ -   Loading Pretrained Bert Encoder from: /home/santosh/Projects/MsZeroTS/hfModel/
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 872, in main
    map_location=torch.device("cpu"))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 381, in load
    f = open(f, 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/home/santosh/Projects/MsZeroTS/hfModel/'
VOCAB SIZE: 30528
10/16/2020 05:23:32 - INFO - __main__ -   Loading Pretrained Bert Encoder from: /home/santosh/Projects/MsZeroTS/hfModel/
Traceback (most recent call last):
  File "nvidia_run_squad_baseline.py", line 1158, in <module>
    main()
  File "nvidia_run_squad_baseline.py", line 872, in main
    map_location=torch.device("cpu"))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 381, in load
    f = open(f, 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/home/santosh/Projects/MsZeroTS/hfModel/'
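
The traceback shows the --model_file value being opened as a file by torch.load, so a directory path fails. A minimal sketch of resolving a directory to a checkpoint file first is below. The helper name resolve_checkpoint is hypothetical, and the default file name pytorch_model.bin is an assumption about what save_pretrained wrote into hfModel (other transformers versions may use a different name). Whether the script accepts the weights in that file once it can open it is not verified here.

```python
from pathlib import Path


def resolve_checkpoint(path, filename="pytorch_model.bin"):
    """Return a path to a checkpoint file.

    If `path` is a directory, look for `filename` inside it (assumed name);
    if it is already a file, return it unchanged.
    """
    p = Path(path)
    if p.is_dir():
        candidate = p / filename
        if not candidate.is_file():
            raise FileNotFoundError(f"no {filename} found in directory {p}")
        return candidate
    if not p.is_file():
        raise FileNotFoundError(f"checkpoint path does not exist: {p}")
    return p


if __name__ == "__main__":
    # Directory copied from the command above; adjust to the local setup.
    print(resolve_checkpoint("/home/santosh/Projects/MsZeroTS/hfModel/"))
```

The printed path is what would be passed as the fourth argument to run_squad_baseline.sh instead of the directory.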
