RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms #4531

Open
himanshucodz55 opened this issue Jul 24, 2022 · 22 comments
Labels: Question

himanshucodz55 commented Jul 24, 2022

Describe the bug
Hi @espnet team, thanks for the amazing work. I am running the librispeech recipe in distributed mode using Slurm on ESPnet2. I am running on two Oracle instances, each with a single GPU (Tesla V100). When I ran stage 11, it created jobs on both machines and GPU memory was utilized, but it failed after some time.

Basic environments:

  • OS information: Ubuntu 18.04 x86_64
  • python version: python 3.9 [GCC 7.3.0]
  • espnet version: latest
  • pytorch version 1.12.0
  • cuda 10.2

Task information:

  • Task: ASR
  • Recipe: librispeech
  • ESPnet2

To Reproduce
When I ran stage 11 with Slurm, it failed with an error after some time.

slurm.conf
#Default configuration
command sbatch --export=PATH
option name=* --job-name $0
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0 --ntasks-per-node=1
option num_threads=1 --cpus-per-task 12 --ntasks-per-node=1
option num_nodes=* --nodes $0
option gpu=1 -p tgpu
option gpu=* -p tgpu --gres=gpu:$0 -c $0 # Recommend allocating more CPU than, or equal to the number of GPU
#note: the --max-jobs-run option is supported as a special case
#by slurm.pl and you don't have to handle it in the config file.
#default cpu=1
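
As a rough illustration of how a Kaldi-style `slurm.conf` like the one above might be turned into `sbatch` flags, here is a minimal Python sketch. It assumes that an exact `name=value` entry takes precedence over the `name=*` wildcard and that `$0` is replaced by the requested value. `parse_config` and `resolve` are hypothetical helpers written for this example; they are not part of `slurm.pl` or ESPnet.

```python
# Sketch only: mimics, under the stated assumptions, how option lines in a
# Kaldi-style slurm.conf could be resolved into sbatch flags.
CONFIG = """\
command sbatch --export=PATH
option name=* --job-name $0
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0 --ntasks-per-node=1
option num_threads=1 --cpus-per-task 12 --ntasks-per-node=1
option num_nodes=* --nodes $0
option gpu=1 -p tgpu
option gpu=* -p tgpu --gres=gpu:$0 -c $0 # Recommend allocating more CPU than, or equal to the number of GPU
"""


def parse_config(text):
    """Return (base_command, {"name=value" or "name=*": flag string})."""
    command, options = "", {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        kind, _, rest = line.partition(" ")
        if kind == "command":
            command = rest.strip()
        elif kind == "option":
            key, _, flags = rest.strip().partition(" ")
            options[key] = flags.strip()
    return command, options


def resolve(text, requested):
    """Build the command line for a dict of requested options."""
    command, options = parse_config(text)
    parts = [command]
    for name, value in requested.items():
        exact, wildcard = f"{name}={value}", f"{name}=*"
        if exact in options:          # assumption: exact match wins
            flags = options[exact]
        elif wildcard in options:     # assumption: $0 is the requested value
            flags = options[wildcard].replace("$0", str(value))
        else:
            raise KeyError(f"no option line for {name!r}")
        if flags:
            parts.append(flags)
    return " ".join(parts)


print(resolve(CONFIG, {"num_nodes": 2, "gpu": 1}))
# -> sbatch --export=PATH --nodes 2 -p tgpu
```

Under those assumptions, requesting `gpu=1` resolves to `-p tgpu` alone, without `--gres=gpu:1`, which may be worth comparing against what the job actually receives on each node.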

$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
tgpu* up infinite 2 idle hp-[1-2]

$ scontrol show nodes
NodeName=hp-1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.34
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=hp-1 NodeHostName=hp-1 Version=17.11
OS=Linux 5.4.0-1079-oracle #87~18.04.1-Ubuntu SMP Mon Jul 11 03:41:03 UTC 2022
RealMemory=1 AllocMem=0 FreeMem=86991 Sockets=12 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=tgpu
BootTime=2022-07-24T06:57:55 SlurmdStartTime=2022-07-24T10:10:49
CfgTRES=cpu=12,mem=1M,billing=12
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=hp-2 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.09
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=hp-2 NodeHostName=hp-2 Version=17.11
OS=Linux 5.4.0-1079-oracle #87~18.04.1-Ubuntu SMP Mon Jul 11 03:41:03 UTC 2022
RealMemory=1 AllocMem=0 FreeMem=86953 Sockets=12 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=tgpu
BootTime=2022-07-24T07:00:18 SlurmdStartTime=2022-07-24T10:15:26
CfgTRES=cpu=12,mem=1M,billing=12
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

GPU utilization
[Screenshot: GPU utilization, 2022-07-23 10:50:34 PM]

Error logs
#Running on hp-1
#Started at Sat Jul 23 17:17:24 UTC 2022
#SLURMD_NODENAME=hp-1
#SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
#SLURM_CLUSTER_NAME=cluster
#SLURM_CPUS_ON_NODE=12
#SLURM_CPUS_PER_TASK=12
#SLURM_EXPORT_ENV=PATH
#SLURM_GET_USER_ENV=1
#SLURM_GTIDS=0
#SLURM_JOBID=70
#SLURM_JOB_CPUS_PER_NODE='12(x2)'
#SLURM_JOB_GID=1001
#SLURM_JOB_ID=70
#SLURM_JOB_NAME=test
#SLURM_JOB_NODELIST='hp-[1-2]'
#SLURM_JOB_NUM_NODES=2
#SLURM_JOB_PARTITION=tgpu
#SLURM_JOB_UID=1001
#SLURM_JOB_USER=ubuntu
#SLURM_LOCALID=0
#SLURM_NNODES=2
#SLURM_NODEID=0
#SLURM_NODELIST='hp-[1-2]'
#SLURM_NODE_ALIASES='(null)'
#SLURM_NPROCS=2
#SLURM_NTASKS=2
#SLURM_NTASKS_PER_NODE=1
#SLURM_OPEN_MODE=a
#SLURM_PRIO_PROCESS=0
#SLURM_PROCID=0
#SLURM_SUBMIT_DIR=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1
#SLURM_SUBMIT_HOST=hp-1
#SLURM_TASKS_PER_NODE='1(x2)'
#SLURM_TASK_PID=28524
#SLURM_TOPOLOGY_ADDR=hp-1
#SLURM_TOPOLOGY_ADDR_PATTERN=node
#SLURM_WORKING_CLUSTER=cluster:155.248.167.102:6817:8192
#srun --export=ALL srun -N2 python3 -m espnet2.bin.asr_train --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_+SJraOwsjSi9F2aB --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_a71d8596-0515-49d8-8cff-e85faece2c90
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_+SJraOwsjSi9F2aB --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_a71d8596-0515-49d8-8cff-e85faece2c90
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_+SJraOwsjSi9F2aB --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_a71d8596-0515-49d8-8cff-e85faece2c90
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_+SJraOwsjSi9F2aB --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_a71d8596-0515-49d8-8cff-e85faece2c90
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_+SJraOwsjSi9F2aB --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_a71d8596-0515-49d8-8cff-e85faece2c90
WARNING:root:Using legacy_rel_pos and it will be deprecated in the future.
WARNING:root:Using legacy_rel_pos and it will be deprecated in the future.
WARNING:root:Using legacy_rel_pos and it will be deprecated in the future.
WARNING:root:Using legacy_rel_pos and it will be deprecated in the future.
WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future.
WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future.
WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future.
WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future.
hp-1:28603:28603 [0] NCCL INFO Bootstrap : Using ens3:10.0.0.27<0>
hp-1:28603:28603 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

hp-1:28603:28603 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
hp-1:28603:28603 [0] NCCL INFO NET/Socket : Using [0]ens3:10.0.0.27<0>
hp-1:28603:28603 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
hp-1:28608:28608 [0] NCCL INFO Bootstrap : Using ens3:10.0.0.27<0>
hp-1:28608:28608 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

hp-1:28608:28608 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
hp-1:28608:28608 [0] NCCL INFO NET/Socket : Using [0]ens3:10.0.0.27<0>
hp-1:28608:28608 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 23, in <module>
main()
File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 19, in main
ASRTask.main(cmd=cmd)
File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1013, in main
cls.main_worker(args)
File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1309, in main_worker
cls.trainer.run(
File "/home/ubuntu/users/himanshu/espnet/espnet2/train/trainer.py", line 220, in run
dp_model = torch.nn.parallel.DistributedDataParallel(
File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms
Exception raised from get at ../torch/csrc/distributed/c10d/FileStore.cpp:362 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f03a5bba612 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f03a5bb6cab in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10d::FileStore::get(std::string const&) + 0xb09 (0x7f03da1ce739 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f03da1d13c2 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f03da1d13c2 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xb1 (0x7f03a6ffa301 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x204 (0x7f03a6ffe794 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) + 0x34b (0x7f03a700c7db in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<std::weak_ptr<c10d::Logger> > const&) + 0x3f5 (0x7f03da21b825 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #9: + 0x87cebc (0x7f03ef97debc in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x21ebc5 (0x7f03ef31fbc5 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x1828f4 (0x55f7867078f4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #12: _PyObject_MakeTpCall + 0x2df (0x55f7866c147f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #13: _PyEval_EvalFrameDefault + 0x49a9 (0x55f78675f2e9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #14: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #15: _PyFunction_Vectorcall + 0x1d4 (0x55f78671ccb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #16: + 0xfe088 (0x55f786683088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #17: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #18: _PyFunction_Vectorcall + 0x244 (0x55f78671cd24 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #19: _PyObject_FastCallDictTstate + 0xee (0x55f786707a2e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #20: + 0x18c429 (0x55f786711429 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #21: _PyObject_MakeTpCall + 0x38f (0x55f7866c152f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x1350 (0x55f78675bc90 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #23: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #24: + 0x198709 (0x55f78671d709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #25: + 0xfe73d (0x55f78668373d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #26: + 0x198559 (0x55f78671d559 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #27: + 0xff300 (0x55f786684300 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #28: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #29: + 0x198709 (0x55f78671d709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #30: + 0xfe73d (0x55f78668373d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #31: + 0x231418 (0x55f7867b6418 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #32: + 0xfe088 (0x55f786683088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #33: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #34: PyEval_EvalCodeEx + 0x4c (0x55f7867c8a7c in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #35: PyEval_EvalCode + 0x1b (0x55f78671cdbb in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #36: + 0x27a33e (0x55f7867ff33e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #37: + 0x1a1571 (0x55f786726571 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #38: + 0xfe088 (0x55f786683088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #39: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #40: _PyFunction_Vectorcall + 0x1d4 (0x55f78671ccb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #41: + 0xfe088 (0x55f786683088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #42: + 0x196fe3 (0x55f78671bfe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #43: _PyFunction_Vectorcall + 0x1d4 (0x55f78671ccb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #44: _PyObject_Call + 0x1da (0x55f7866cb30a in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #45: + 0x274eaa (0x55f7867f9eaa in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #46: Py_RunMain + 0x18f (0x55f7867fec0f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #47: Py_BytesMain + 0x39 (0x55f7867feff9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #48: __libc_start_main + 0xe7 (0x7f0416b22c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #49: + 0x2016a0 (0x55f7867866a0 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)

Traceback (most recent call last):
File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 23, in <module>
main()
File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 19, in main
ASRTask.main(cmd=cmd)
File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1013, in main
cls.main_worker(args)
File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1309, in main_worker
cls.trainer.run(
File "/home/ubuntu/users/himanshu/espnet/espnet2/train/trainer.py", line 220, in run
dp_model = torch.nn.parallel.DistributedDataParallel(
File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms
Exception raised from get at ../torch/csrc/distributed/c10d/FileStore.cpp:362 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa47e37a612 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fa47e376cab in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10d::FileStore::get(std::string const&) + 0xb09 (0x7fa4b298e739 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fa4b29913c2 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fa4b29913c2 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xb1 (0x7fa47f7ba301 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x204 (0x7fa47f7be794 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) + 0x34b (0x7fa47f7cc7db in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<std::weak_ptr<c10d::Logger> > const&) + 0x3f5 (0x7fa4b29db825 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #9: + 0x87cebc (0x7fa4c813debc in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x21ebc5 (0x7fa4c7adfbc5 in /home/ubuntu/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x1828f4 (0x559e091508f4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #12: _PyObject_MakeTpCall + 0x2df (0x559e0910a47f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #13: _PyEval_EvalFrameDefault + 0x49a9 (0x559e091a82e9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #14: + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #15: _PyFunction_Vectorcall + 0x1d4 (0x559e09165cb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #16: + 0xfe088 (0x559e090cc088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #17: + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #18: _PyFunction_Vectorcall + 0x244 (0x559e09165d24 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #19: _PyObject_FastCallDictTstate + 0xee (0x559e09150a2e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #20: + 0x18c429 (0x559e0915a429 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #21: _PyObject_MakeTpCall + 0x38f (0x559e0910a52f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x1350 (0x559e091a4c90 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #23: <unknown function> + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #24: <unknown function> + 0x198709 (0x559e09166709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #25: <unknown function> + 0xfe73d (0x559e090cc73d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #26: <unknown function> + 0x198559 (0x559e09166559 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #27: <unknown function> + 0xff300 (0x559e090cd300 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #28: <unknown function> + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #29: <unknown function> + 0x198709 (0x559e09166709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #30: <unknown function> + 0xfe73d (0x559e090cc73d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #31: <unknown function> + 0x231418 (0x559e091ff418 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #32: <unknown function> + 0xfe088 (0x559e090cc088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #33: <unknown function> + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #34: PyEval_EvalCodeEx + 0x4c (0x559e09211a7c in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #35: PyEval_EvalCode + 0x1b (0x559e09165dbb in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #36: <unknown function> + 0x27a33e (0x559e0924833e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #37: <unknown function> + 0x1a1571 (0x559e0916f571 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #38: <unknown function> + 0xfe088 (0x559e090cc088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #39: <unknown function> + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #40: _PyFunction_Vectorcall + 0x1d4 (0x559e09165cb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #41: <unknown function> + 0xfe088 (0x559e090cc088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #42: <unknown function> + 0x196fe3 (0x559e09164fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #43: _PyFunction_Vectorcall + 0x1d4 (0x559e09165cb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #44: _PyObject_Call + 0x1da (0x559e0911430a in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #45: <unknown function> + 0x274eaa (0x559e09242eaa in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #46: Py_RunMain + 0x18f (0x559e09247c0f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #47: Py_BytesMain + 0x39 (0x559e09247ff9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #48: __libc_start_main + 0xe7 (0x7fa4ef2e2c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #49: <unknown function> + 0x2016a0 (0x559e091cf6a0 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)

srun: error: hp-2: task 1: Exited with exit code 1
srun: error: hp-2: task 1: Exited with exit code 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: got SIGCONT
srun: forcing job termination
srun: got SIGCONT
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: got SIGCONT
slurmstepd-hp-1: error: *** STEP 70.2 ON hp-1 CANCELLED AT 2022-07-23T18:02:02 ***
srun: forcing job termination
slurmstepd-hp-1: error: *** STEP 70.1 ON hp-1 CANCELLED AT 2022-07-23T18:02:02 ***
slurmstepd-hp-1: error: *** STEP 70.0 ON hp-1 CANCELLED AT 2022-07-23T18:02:02 ***
srun: forcing job termination

@himanshucodz55 himanshucodz55 added the Bug bug should be fixed label Jul 24, 2022
@kamo-naoyuki
Collaborator

RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms

I don't know the details of this error. I think the PyTorch developers can answer it.

NCCL can fail for different reasons depending on your environment, so I can't tell you the cause of your error directly. Please refer to the NCCL troubleshooting guide:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html

First, I recommend checking your network settings.

If you use a TCP/IP connection, please set NCCL_SOCKET_IFNAME correctly. We provide a default value for it, but it is not suitable in many cases.

# You need to change or unset NCCL_SOCKET_IFNAME according to your network environment
# https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html#nccl-socket-ifname
export NCCL_SOCKET_IFNAME="^lo,docker,virbr,vmnet,vboxnet"
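
As a rough illustration of what a value like the one above selects, here is a minimal Python sketch of the prefix filtering described in the NCCL documentation: a leading `^` excludes every interface whose name starts with one of the listed prefixes. The function name `select_interfaces` and the interface name `docker0` are made up for this example (`ens3` is the interface NCCL reports in the logs), and the `=` exact-match form is not handled. This is not NCCL's actual implementation.

```python
def select_interfaces(names, spec):
    """Apply an NCCL_SOCKET_IFNAME-style prefix filter to interface names.

    A leading "^" means "exclude interfaces starting with any listed prefix";
    without it, only interfaces starting with a listed prefix are kept.
    The "=" exact-match form is not handled in this sketch.
    """
    exclude = spec.startswith("^")
    prefixes = tuple(p for p in spec.lstrip("^").split(",") if p)
    matches = [n for n in names if n.startswith(prefixes)]
    if exclude:
        return [n for n in names if n not in matches]
    return matches


# "docker0" is a hypothetical interface name used only for illustration.
print(select_interfaces(["lo", "ens3", "docker0"], "^lo,docker,virbr,vmnet,vboxnet"))
# prints ['ens3']
```

To try it against a real machine, the list of local interface names can be obtained with `[name for _, name in socket.if_nameindex()]`.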

If you have InfiniBand in your environment and want to use it, you need to configure some environment variables:

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

@himanshucodz55
Author

Thanks @kamo-naoyuki for your quick response. We ran the same experiments without Slurm using NCCL, and they ran fine: we set NCCL_SOCKET_IFNAME correctly and training ran successfully. When we run stage 11 with Slurm, it gives the error...

@kamo-naoyuki
Collaborator

kamo-naoyuki commented Jul 25, 2022

Do you mean you could run the distributed training of espnet2 (with multiple machines) without slurm?

@himanshucodz55
Author

Yes, I am able to run distributed training without Slurm using 2 hosts that have 2 GPUs each. But when I launch it through Slurm (after Slurm installation and setup), it gives me the error.

@kamo-naoyuki
Collaborator

kamo-naoyuki commented Jul 25, 2022

Then this is probably a bug in ESPnet.

When running on Slurm (i.e. with --dist_launcher slurm), we automatically set the rank, local rank, and world size from the environment variables set by Slurm, e.g. SLURM_PROCID: https://github.com/espnet/espnet/blob/master/espnet2/train/distributed_utils.py

Could you check these values in each job, e.g. with the print function? I think just before the following call is a good place to print them.

torch.distributed.init_process_group(
    backend=self.dist_backend,
    init_method=self.dist_init_method,
    world_size=self.dist_world_size,
    rank=self.dist_rank,
)
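
A minimal sketch of such a check is shown below. It assumes the usual mapping of SLURM_PROCID to rank, SLURM_LOCALID to local rank, and SLURM_NTASKS to world size; the actual logic is in distributed_utils.py and may differ. The helper name `describe_slurm_task` is hypothetical and not part of ESPnet.

```python
import os
import socket


def describe_slurm_task(env=None):
    """Collect the rank-related values Slurm exported for this task.

    Missing variables are reported as -1 so the call also works outside Slurm.
    """
    env = os.environ if env is None else env
    return {
        "host": socket.gethostname(),
        "rank": int(env.get("SLURM_PROCID", -1)),
        "local_rank": int(env.get("SLURM_LOCALID", -1)),
        "world_size": int(env.get("SLURM_NTASKS", -1)),
    }


# flush=True so the line reaches the log even if the job hangs afterwards.
print(describe_slurm_task(), flush=True)
```

If two processes print the same rank, or the number of printed lines is larger than the world size, the launch configuration is the first thing to look at.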

@kamo-naoyuki
Collaborator

kamo-naoyuki commented Jul 25, 2022

By the way,

WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future.

There are no logging messages from espnet2. This may also be an ESPnet bug when using pytorch=1.12 (or it may be due to your environment; I'm not sure).
Indeed, we should use a local logger instead of the global logging module...
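
For illustration, a local (named) logger in Python looks roughly like the sketch below. The logger name `espnet2.example` and the function `warn_legacy` are placeholders, not existing ESPnet code. The point is only that messages go through a named logger with its own handler, so they do not depend on how the root logger was configured elsewhere.

```python
import logging

# Named logger: records carry this name instead of "root".
logger = logging.getLogger("espnet2.example")  # hypothetical name


def warn_legacy(option):
    # Emitted through the named logger rather than logging.warning(),
    # which would use the global root logger.
    logger.warning("Using %s and it will be deprecated in the future.", option)


# Attach a handler to this logger only, leaving the root logger untouched.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output via the root logger

warn_legacy("legacy_rel_selfattn")
```

With this setup the warning is prefixed with `WARNING:espnet2.example:` instead of `WARNING:root:`.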

@kamo-naoyuki
Collaborator

Hmm, I found something odd in your logs.

There are 4 Python processes (= 2 processes x 2 nodes). The nvidia-smi logs show two processes on each node; there should be only one per node.

srun --export=ALL srun -N2 python3 -m espnet2.bin.asr_train

I think this is the reason. You modified our launch.py, right? -N2 shouldn't be inserted here.

@himanshucodz55
Author

@kamo-naoyuki There are 4 processes in total, 2 running on each machine, because task_per_node=2. By the way, I ran the experiment without -N2 and it shows the same error.
Error log:

# Running on hp-1
# Started at Mon Jul 25 15:31:27 UTC 2022
# SLURMD_NODENAME=hp-1
# SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
# SLURM_CLUSTER_NAME=cluster
# SLURM_CPUS_ON_NODE=12
# SLURM_CPUS_PER_TASK=6
# SLURM_EXPORT_ENV=PATH
# SLURM_GET_USER_ENV=1
# SLURM_GTIDS=0
# SLURM_JOBID=81
# SLURM_JOB_CPUS_PER_NODE='12(x2)'
# SLURM_JOB_GID=0
# SLURM_JOB_ID=81
# SLURM_JOB_NAME=test
# SLURM_JOB_NODELIST='hp-[1-2]'
# SLURM_JOB_NUM_NODES=2
# SLURM_JOB_PARTITION=tgpu
# SLURM_JOB_UID=0
# SLURM_JOB_USER=root
# SLURM_LOCALID=0
# SLURM_NNODES=2
# SLURM_NODEID=0
# SLURM_NODELIST='hp-[1-2]'
# SLURM_NODE_ALIASES='(null)'
# SLURM_NPROCS=2
# SLURM_NTASKS=2
# SLURM_NTASKS_PER_NODE=1
# SLURM_OPEN_MODE=a
# SLURM_PRIO_PROCESS=0
# SLURM_PROCID=0
# SLURM_SUBMIT_DIR=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1
# SLURM_SUBMIT_HOST=hp-1
# SLURM_TASKS_PER_NODE='1(x2)'
# SLURM_TASK_PID=3613
# SLURM_TOPOLOGY_ADDR=hp-1
# SLURM_TOPOLOGY_ADDR_PATTERN=node
# SLURM_WORKING_CLUSTER=cluster:150.230.203.45:6817:8192
# srun --export=ALL srun python3 -m espnet2.bin.asr_train --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_aTnmdUXrhDrHtUci --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_c31a4dcb-0874-4525-8989-fb8e3a0f238d 
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_aTnmdUXrhDrHtUci --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_c31a4dcb-0874-4525-8989-fb8e3a0f238d
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_aTnmdUXrhDrHtUci --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_c31a4dcb-0874-4525-8989-fb8e3a0f238d
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_aTnmdUXrhDrHtUci --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_c31a4dcb-0874-4525-8989-fb8e3a0f238d
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_aTnmdUXrhDrHtUci --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_c31a4dcb-0874-4525-8989-fb8e3a0f238d
WARNING:root:Using legacy_rel_pos and it will be deprecated in the future.
WARNING:root:Using legacy_rel_pos and it will be deprecated in the future.
WARNING:root:Using legacy_rel_pos and it will be deprecated in the future.
WARNING:root:Using legacy_rel_pos and it will be deprecated in the future.
WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future.
WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future.
WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future.
WARNING:root:Using legacy_rel_selfattn and it will be deprecated in the future.
hp-1:3679:3679 [0] NCCL INFO Bootstrap : Using ens3:10.0.0.130<0>
hp-1:3679:3679 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

hp-1:3679:3679 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
hp-1:3679:3679 [0] NCCL INFO NET/Socket : Using [0]ens3:10.0.0.130<0>
hp-1:3679:3679 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
hp-1:3685:3685 [0] NCCL INFO Bootstrap : Using ens3:10.0.0.130<0>
hp-1:3685:3685 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

hp-1:3685:3685 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
hp-1:3685:3685 [0] NCCL INFO NET/Socket : Using [0]ens3:10.0.0.130<0>
hp-1:3685:3685 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1013, in main
    cls.main_worker(args)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1309, in main_worker
    cls.trainer.run(
  File "/home/ubuntu/users/himanshu/espnet/espnet2/train/trainer.py", line 220, in run
    dp_model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms
Exception raised from get at /opt/conda/conda-bld/pytorch_1646755849709/work/torch/csrc/distributed/c10d/FileStore.cpp:362 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7ff0f5e8a1bd in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x68 (0x7ff0f5e86838 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10d::FileStore::get(std::string const&) + 0xb09 (0x7ff13495feb9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7ff134962b42 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7ff134962b42 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, std::string const&, int) + 0xe4 (0x7ff0f71eb834 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x1d9 (0x7ff0f71ef8c9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x341 (0x7ff0f71fac21 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<std::weak_ptr<c10d::Logger> > const&) + 0x3b9 (0x7ff1349aa289 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x7dd73c (0x7ff13c98e73c in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1e5fa4 (0x7ff13c396fa4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x1828f4 (0x559323cfe8f4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #12: _PyObject_MakeTpCall + 0x2df (0x559323cb847f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #13: _PyEval_EvalFrameDefault + 0x49a9 (0x559323d562e9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #14: <unknown function> + 0x196fe3 (0x559323d12fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #15: _PyFunction_Vectorcall + 0x244 (0x559323d13d24 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #16: _PyObject_FastCallDictTstate + 0xee (0x559323cfea2e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #17: <unknown function> + 0x18c429 (0x559323d08429 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #18: _PyObject_MakeTpCall + 0x38f (0x559323cb852f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1350 (0x559323d52c90 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1013, in main
    cls.main_worker(args)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1309, in main_worker
    cls.trainer.run(
  File "/home/ubuntu/users/himanshu/espnet/espnet2/train/trainer.py", line 220, in run
    dp_model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms
Exception raised from get at /opt/conda/conda-bld/pytorch_1646755849709/work/torch/csrc/distributed/c10d/FileStore.cpp:362 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f4978a531bd in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x68 (0x7f4978a4f838 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10d::FileStore::get(std::string const&) + 0xb09 (0x7f49b7528eb9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f49b752bb42 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7f49b752bb42 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, std::string const&, int) + 0xe4 (0x7f4979db4834 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x1d9 (0x7f4979db88c9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x341 (0x7f4979dc3c21 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<std::weak_ptr<c10d::Logger> > const&) + 0x3b9 (0x7f49b7573289 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x7dd73c (0x7f49bf55773c in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1e5fa4 (0x7f49bef5ffa4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x1828f4 (0x563ad65428f4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #12: _PyObject_MakeTpCall + 0x2df (0x563ad64fc47f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #13: _PyEval_EvalFrameDefault + 0x49a9 (0x563ad659a2e9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #14: <unknown function> + 0x196fe3 (0x563ad6556fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #15: _PyFunction_Vectorcall + 0x244 (0x563ad6557d24 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #16: _PyObject_FastCallDictTstate + 0xee (0x563ad6542a2e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #17: <unknown function> + 0x18c429 (0x563ad654c429 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #18: _PyObject_MakeTpCall + 0x38f (0x563ad64fc52f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1350 (0x563ad6596c90 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #20: <unknown function> + 0x196fe3 (0x563ad6556fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #21: <unknown function> + 0x198709 (0x563ad6558709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #22: <unknown function> + 0xfe73d (0x563ad64be73d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #23: <unknown function> + 0x198559 (0x563ad6558559 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #24: <unknown function> + 0xff300 (0x563ad64bf300 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #25: <unknown function> + 0x196fe3 (0x563ad6556fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #26: <unknown function> + 0x198709 (0x563ad6558709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #27: <unknown function> + 0xfe73d (0x563ad64be73d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #28: <unknown function> + 0x231418 (0x563ad65f1418 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #29: <unknown function> + 0xfe088 (0x563ad64be088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #30: <unknown function> + 0x196fe3 (0x563ad6556fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #31: PyEval_EvalCodeEx + 0x4c (0x563ad6603a7c in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #32: PyEval_EvalCode + 0x1b (0x563ad6557dbb in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #33: <unknown function> + 0x27a33e (0x563ad663a33e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #34: <unknown function> + 0x1a1571 (0x563ad6561571 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #35: <unknown function> + 0xfe088 (0x563ad64be088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #36: <unknown function> + 0x196fe3 (0x563ad6556fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #37: _PyFunction_Vectorcall + 0x1d4 (0x563ad6557cb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #38: <unknown function> + 0xfe088 (0x563ad64be088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #39: <unknown function> + 0x196fe3 (0x563ad6556fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #40: _PyFunction_Vectorcall + 0x1d4 (0x563ad6557cb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #41: _PyObject_Call + 0x1da (0x563ad650630a in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #42: <unknown function> + 0x274eaa (0x563ad6634eaa in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #43: Py_RunMain + 0x18f (0x563ad6639c0f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #44: Py_BytesMain + 0x39 (0x563ad6639ff9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #45: __libc_start_main + 0xe7 (0x7f49e5221c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #46: <unknown function> + 0x2016a0 (0x563ad65c16a0 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)

frame #20: <unknown function> + 0x196fe3 (0x559323d12fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #21: <unknown function> + 0x198709 (0x559323d14709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #22: <unknown function> + 0xfe73d (0x559323c7a73d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #23: <unknown function> + 0x198559 (0x559323d14559 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #24: <unknown function> + 0xff300 (0x559323c7b300 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #25: <unknown function> + 0x196fe3 (0x559323d12fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #26: <unknown function> + 0x198709 (0x559323d14709 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #27: <unknown function> + 0xfe73d (0x559323c7a73d in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #28: <unknown function> + 0x231418 (0x559323dad418 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #29: <unknown function> + 0xfe088 (0x559323c7a088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #30: <unknown function> + 0x196fe3 (0x559323d12fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #31: PyEval_EvalCodeEx + 0x4c (0x559323dbfa7c in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #32: PyEval_EvalCode + 0x1b (0x559323d13dbb in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #33: <unknown function> + 0x27a33e (0x559323df633e in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #34: <unknown function> + 0x1a1571 (0x559323d1d571 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #35: <unknown function> + 0xfe088 (0x559323c7a088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #36: <unknown function> + 0x196fe3 (0x559323d12fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #37: _PyFunction_Vectorcall + 0x1d4 (0x559323d13cb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #38: <unknown function> + 0xfe088 (0x559323c7a088 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #39: <unknown function> + 0x196fe3 (0x559323d12fe3 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #40: _PyFunction_Vectorcall + 0x1d4 (0x559323d13cb4 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #41: _PyObject_Call + 0x1da (0x559323cc230a in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #42: <unknown function> + 0x274eaa (0x559323df0eaa in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #43: Py_RunMain + 0x18f (0x559323df5c0f in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #44: Py_BytesMain + 0x39 (0x559323df5ff9 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)
frame #45: __libc_start_main + 0xe7 (0x7ff162658c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #46: <unknown function> + 0x2016a0 (0x559323d7d6a0 in /home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3)

srun: error: hp-2: task 1: Exited with exit code 1
srun: error: hp-2: task 1: Exited with exit code 1

@kamo-naoyuki
Collaborator

kamo-naoyuki commented Jul 25, 2022

Please let me confirm: --ntasks-per-node must be 1 because each node has a single GPU in this case.

In your logs, there are still 4 processes, so this caused an error in espnet, at least.

SLURM_CPUS_ON_NODE=12
SLURM_CPUS_PER_TASK=6

Do you know why SLURM_CPUS_PER_TASK is 6? In your slurm.conf, it was set to 12.

option num_threads=1 --cpus-per-task 12 --ntasks-per-node=1

@himanshucodz55
Author

Thanks for your response @kamo-naoyuki. Yes, ntasks-per-node is 1. I was just checking what would happen with SLURM_CPUS_PER_TASK=6, but the error is the same. I have also run with SLURM_CPUS_PER_TASK=12, as you can see in the log above, and the error is still the same.

@kamo-naoyuki
Collaborator

Sorry, actually, I don't know the full precedence order of Slurm options, so I'm not sure why there are 4 processes.

Do you use the select/linear plugin? What happens with --cpus-per-task 1?

However, I think you can solve this issue just by reading the Slurm documentation: https://slurm.schedmd.com/sbatch.html

Please let me know if you find a solution.

@himanshucodz55
Author

If I run with --cpus-per-task 1, the error is still the same. The official ESPnet documentation mentions "2 hosts and 2 GPUs for each node using Slurm with multiprocessing distributed" (https://espnet.github.io/espnet/espnet2_distributed.html). Could you please share the log file if possible? It will help to solve the problem. Or, if you can provide all the config files that you used for this experiment, please share them.

@kamo-naoyuki
Collaborator

option gpu=1 -p tgpu

Why did you change from gpu=0 to gpu=1?

Could you also check the SelectType value in the slurm.conf of your Slurm server (e.g. /etc/slurm/slurm.conf)?
Maybe espnet assumes SelectType=select/cons_res or SelectType=select/cons_tres.
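
As a minimal sketch of that check: the helper below scans the text of a slurm.conf for its SelectType line. The function name `find_select_type` and the sample text are hypothetical, and the parser only assumes the usual `Key=Value` layout with `#` comments. It is not part of ESPnet or Slurm.

```python
from typing import Optional


def find_select_type(conf_text: str) -> Optional[str]:
    """Return the SelectType value found in slurm.conf text, or None if absent."""
    value = None
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, val = line.split("=", 1)
        # Match the key case-insensitively; the last occurrence wins.
        if key.strip().lower() == "selecttype":
            value = val.strip()
    return value


# Hypothetical sample standing in for /etc/slurm/slurm.conf.
sample = """
# scheduling
SchedulerType=sched/backfill
SelectType=select/linear
"""
print(find_select_type(sample))
```

On a real server, the same function could be applied to the contents of the file, for example `find_select_type(open("/etc/slurm/slurm.conf").read())`, assuming the file is at that path and readable.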

@kamo-naoyuki
Collaborator

kamo-naoyuki commented Jul 25, 2022

Oh, I found that you ran srun inside srun:

srun --export=ALL srun python3

Please don't touch launch.py.

(srun -> (x2) -> srun -> (x2), so there are 4 processes.)
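
The arithmetic behind that count can be sketched as follows. This is a simplified illustration that ignores scheduling details: it assumes each srun level starts one copy of its command per task, and the function name `total_processes` is made up for the example.

```python
def total_processes(tasks_per_level):
    """Multiply the task counts of nested launchers to get the process count."""
    total = 1
    for n in tasks_per_level:
        total *= n
    return total


# A single srun over 2 tasks (one per node).
print(total_processes([2]))
# srun nested inside srun: each of the 2 outer tasks starts 2 more.
print(total_processes([2, 2]))
```

With one level the job gets the intended 2 processes; with two nested levels it would get 4.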

@himanshucodz55
Author

In my slurm.conf, SelectType=select/linear.
I am using a single GPU in each machine; that's why I changed this parameter. I have also checked with gpu=0, as you can see in this log:

# Running on hp-1
# Started at Tue Jul 26 09:04:42 UTC 2022
# SLURMD_NODENAME=hp-1
# SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
# SLURM_CLUSTER_NAME=cluster
# SLURM_CPUS_ON_NODE=12
# SLURM_CPUS_PER_TASK=1
# SLURM_EXPORT_ENV=PATH
# SLURM_GET_USER_ENV=1
# SLURM_GTIDS=0
# SLURM_JOBID=73
# SLURM_JOB_CPUS_PER_NODE='12(x2)'
# SLURM_JOB_GID=0
# SLURM_JOB_GPUS=0
# SLURM_JOB_ID=73
# SLURM_JOB_NAME=test
# SLURM_JOB_NODELIST='hp-[1-2]'
# SLURM_JOB_NUM_NODES=2
# SLURM_JOB_PARTITION=tgpu
# SLURM_JOB_UID=0
# SLURM_JOB_USER=root
# SLURM_LOCALID=0
# SLURM_NNODES=2
# SLURM_NODEID=0
# SLURM_NODELIST='hp-[1-2]'
# SLURM_NODE_ALIASES='(null)'
# SLURM_NPROCS=2
# SLURM_NTASKS=2
# SLURM_NTASKS_PER_NODE=1
# SLURM_OPEN_MODE=a
# SLURM_PRIO_PROCESS=0
# SLURM_PROCID=0
# SLURM_SUBMIT_DIR=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1
# SLURM_SUBMIT_HOST=hp-1
# SLURM_TASKS_PER_NODE='1(x2)'
# SLURM_TASK_PID=20577
# SLURM_TOPOLOGY_ADDR=hp-1
# SLURM_TOPOLOGY_ADDR_PATTERN=node
# SLURM_WORKING_CLUSTER=cluster:168.138.215.228:6817:8192
# srun --export=ALL srun python3 -m espnet2.bin.asr_train --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_CpaKmSCsJiSEonKe --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_9115cd10-f4e1-4ad4-8c56-6ac03a88f6bb 
srun: Job 73 step creation temporarily disabled, retrying
srun: Job 73 step creation temporarily disabled, retrying

@himanshucodz55
Author

himanshucodz55 commented Jul 26, 2022

My stage 11 command is:

jobname=test
    ${python} -m espnet2.bin.launch \
        --cmd "${cuda_cmd} --name ${jobname}" \
        --log "${asr_exp}"/train.log \
        --ngpu "${ngpu}" \
        --num_nodes "${num_nodes}" \
        --init_file_prefix "${asr_exp}"/.dist_init_ \
        --multiprocessing_distributed true -- \
        srun ${python} -m espnet2.bin.asr_train \
            --ngpu 1 --multiprocessing_distributed true \
            --dist_launcher slurm \
            --dist_init_method "file://$(pwd)/.dist_init_$(openssl rand -base64 12)" \
            --use_preprocessor true \
            --bpemodel "${bpemodel}" \
            --token_type "${token_type}" \
            --token_list "${token_list}" \
            --non_linguistic_symbols "${nlsyms_txt}" \
            --cleaner "${cleaner}" \
            --g2p "${g2p}" \
            --valid_data_path_and_name_and_type "${_asr_valid_dir}/${_scp},speech,${_type}" \
            --valid_data_path_and_name_and_type "${_asr_valid_dir}/text,text,text" \
            --valid_shape_file "${asr_stats_dir}/valid/speech_shape" \
            --valid_shape_file "${asr_stats_dir}/valid/text_shape.${token_type}" \
            --resume false \
            --init_param ${pretrained_model} \
            --ignore_init_mismatch ${ignore_init_mismatch} \
            --fold_length "${_fold_length}" \
            --fold_length "${asr_text_fold_length}" \
            --output_dir "${asr_exp}" \
            ${_opts} ${asr_args}

@himanshucodz55
Author

I ran stage 11 without the srun that I mentioned above.
See the error:

# Running on hp-1
# Started at Tue Jul 26 09:26:47 UTC 2022
# SLURMD_NODENAME=hp-1
# SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
# SLURM_CLUSTER_NAME=cluster
# SLURM_CPUS_ON_NODE=12
# SLURM_CPUS_PER_TASK=12
# SLURM_EXPORT_ENV=PATH
# SLURM_GET_USER_ENV=1
# SLURM_GTIDS=0
# SLURM_JOBID=74
# SLURM_JOB_CPUS_PER_NODE='12(x2)'
# SLURM_JOB_GID=0
# SLURM_JOB_GPUS=0
# SLURM_JOB_ID=74
# SLURM_JOB_NAME=test
# SLURM_JOB_NODELIST='hp-[1-2]'
# SLURM_JOB_NUM_NODES=2
# SLURM_JOB_PARTITION=tgpu
# SLURM_JOB_UID=0
# SLURM_JOB_USER=root
# SLURM_LOCALID=0
# SLURM_NNODES=2
# SLURM_NODEID=0
# SLURM_NODELIST='hp-[1-2]'
# SLURM_NODE_ALIASES='(null)'
# SLURM_NPROCS=2
# SLURM_NTASKS=2
# SLURM_NTASKS_PER_NODE=1
# SLURM_OPEN_MODE=a
# SLURM_PRIO_PROCESS=0
# SLURM_PROCID=0
# SLURM_SUBMIT_DIR=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1
# SLURM_SUBMIT_HOST=hp-1
# SLURM_TASKS_PER_NODE='1(x2)'
# SLURM_TASK_PID=4546
# SLURM_TOPOLOGY_ADDR=hp-1
# SLURM_TOPOLOGY_ADDR_PATTERN=node
# SLURM_WORKING_CLUSTER=cluster:168.138.215.228:6817:8192
# srun --export=ALL python3 -m espnet2.bin.asr_train --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_dCTULovzTbf2VsIV --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_0390f457-df76-44a4-ab71-21960a8c88d9 
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_dCTULovzTbf2VsIV --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_0390f457-df76-44a4-ab71-21960a8c88d9
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/bin/python3 /home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/.dist_init_dCTULovzTbf2VsIV --use_preprocessor true --bpemodel /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/bpe.model --token_type bpe --token_list /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/data/en_token_list/bpe_unigram5000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/wav.scp,speech,sound --valid_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/dev/text,text,text --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/speech_shape --valid_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//valid/text_shape.bpe --resume false --init_param --ignore_init_mismatch false --fold_length 80000 --fold_length 150 --output_dir exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm --config /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/conf/cf2.yaml --frontend_conf fs=8k --normalize=global_mvn --normalize_conf stats_file=/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/feats_stats.npz --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/wav.scp,speech,sound --train_data_path_and_name_and_type /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/dump/raw/train_100/text,text,text --train_shape_file /home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/speech_shape --train_shape_file 
/home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_stats_raw_en_bpe5000//train/text_shape.bpe --ngpu 1 --multiprocessing_distributed true --dist_launcher slurm --dist_init_method file:///home/ubuntu/users/himanshu/espnet/egs2/librispeech/asr1/exp/asr_conformer_lr2e-3_8k_nospec_warmup25k_amp_nondeterministic_slurm/.dist_init_0390f457-df76-44a4-ab71-21960a8c88d9
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1013, in main
    cls.main_worker(args)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1103, in main_worker
    distributed_option.init_torch_distributed()
  File "/home/ubuntu/users/himanshu/espnet/espnet2/train/distributed_utils.py", line 96, in init_torch_distributed
    torch.distributed.init_process_group(
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 255, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:05:00)
srun: error: hp-2: task 1: Exited with exit code 1
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/ubuntu/users/himanshu/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1013, in main
    cls.main_worker(args)
  File "/home/ubuntu/users/himanshu/espnet/espnet2/tasks/abs_task.py", line 1103, in main_worker
    distributed_option.init_torch_distributed()
  File "/home/ubuntu/users/himanshu/espnet/espnet2/train/distributed_utils.py", line 96, in init_torch_distributed
    torch.distributed.init_process_group(
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/ec2-user/SageMaker/espnet_env1/envs/espnet_env1/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 255, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:05:00)
srun: error: hp-1: task 0: Exited with exit code 1
# Accounting: begin_time=1658827607
# Accounting: end_time=1658827915
# Accounting: time=308 threads=1
# Finished at Tue Jul 26 09:31:55 UTC 2022 with status 1

@kamo-naoyuki
Collaborator

kamo-naoyuki commented Jul 26, 2022

srun: Job 73 step creation temporarily disabled, retrying

SelectType=select/linear allocates whole nodes, that is, all of their resources, to the job, so the specified options were incompatible.
Please read the Slurm documentation.
Basically, I recommend SelectType=select/cons_res or SelectType=select/cons_tres if you are going to use espnet with it.

ran without srun in stage 11 that i have mentioned above...

Please keep it as it is, even if the error happens. If you change things, how can we help you?

Now, at last, I understand the problem. The storage, /home/ubuntu/users/, is not shared between the machines (e.g. via NFS), right?

ESPnet assumes the base directory is shared.
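
One way to test that assumption, as a rough sketch: have every node write a marker file into the directory, then list which markers each node can see. The function names and the `.shared_check_` prefix are invented for this example, and a network filesystem may take a moment to show new files on other clients.

```python
import os
import socket
import tempfile


def write_marker(shared_dir: str, host: str = "") -> str:
    """Create a marker file named after this host and return its path."""
    host = host or socket.gethostname()
    path = os.path.join(shared_dir, f".shared_check_{host}")
    with open(path, "w") as f:
        f.write(host + "\n")
    return path


def visible_hosts(shared_dir: str) -> list:
    """List the hosts whose marker files are visible from this machine."""
    prefix = ".shared_check_"
    return sorted(
        name[len(prefix):]
        for name in os.listdir(shared_dir)
        if name.startswith(prefix)
    )


# Local demonstration: a temporary directory stands in for the shared path.
# On a cluster, run write_marker on every node, then visible_hosts on each.
with tempfile.TemporaryDirectory() as d:
    write_marker(d, "hp-1")
    write_marker(d, "hp-2")
    print(visible_hosts(d))
```

If the directory is really shared, every node should list all hosts; if each node only sees its own marker, the directory is local to that machine.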

@kamo-naoyuki
Collaborator

kamo-naoyuki commented Jul 26, 2022

SLURM_JOB_USER=root

This job was executed by root; normally it should run under your username. If root doesn't have write permission for /home/ubuntu/users/himanshu, it might fail.
Please recheck the Slurm settings, the user permissions, etc.
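
A small sketch for checking this from inside a job: it tries to create a temporary file in a directory and reports whether that worked for the current user. The function name `can_write` is hypothetical. One reason root can lack write access is an NFS export with root squashing, which maps root on the client to an unprivileged user; whether that applies here is an assumption.

```python
import os
import tempfile


def can_write(directory: str) -> bool:
    """Return True if the current user can create a file in `directory`."""
    try:
        # The temporary file is deleted again when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


# Print the numeric user ID (0 means root) and the result for the system
# temporary directory, which stands in for the experiment directory here.
print(os.getuid(), can_write(tempfile.gettempdir()))
```

Running it under srun with the experiment directory as the argument shows which user the job actually runs as and whether that user can write there.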

@kamo-naoyuki added the Question label and removed the Bug label on Jul 26, 2022
@himanshucodz55
Author

Hi @kamo-naoyuki, I have reinstalled Slurm with proper user permissions but got the same error. I have also tried the exact procedure described in the official ESPnet distributed training documentation, and I have checked all the changes that you mentioned in the comments above, but the error is still the same.
If it works for you, please share all the config files or the exact setup process that you followed.
Thanks in advance.

@kamo-naoyuki
Collaborator

kamo-naoyuki commented Aug 14, 2022

Your error suggests that the call to torch.distributed.init_process_group itself is failing. It's better to test it with simpler code:

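# node1
import torch.distributed

# The init file must be on storage that both nodes can see.
torch.distributed.init_process_group(
    backend="nccl",
    init_method="file:///somewhere",
    world_size=2,
    rank=0,
)
# node2
import torch.distributed

# Same init file as on node1; only the rank differs.
torch.distributed.init_process_group(
    backend="nccl",
    init_method="file:///somewhere",
    world_size=2,
    rank=1,
)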

Please note that this issue is now outside the scope of our support because there is no problem on the ESPnet side. Neither the ESPnet configuration nor the Slurm configuration is related to it.
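
To run the same test script on both nodes without editing the rank by hand, the arguments could be derived from the environment that srun sets for each task. The following is only a sketch: the function name `init_kwargs_from_slurm` and the init file path are placeholders, and reading SLURM_PROCID and SLURM_NTASKS this way is an assumption for the example, not a description of what ESPnet does internally.

```python
import os


def init_kwargs_from_slurm(env, init_file: str, backend: str = "nccl") -> dict:
    """Build keyword arguments for torch.distributed.init_process_group.

    Rank and world size are read from SLURM_PROCID and SLURM_NTASKS.
    """
    return {
        "backend": backend,
        # The file must live on storage that every node can see.
        "init_method": "file://" + os.path.abspath(init_file),
        "world_size": int(env["SLURM_NTASKS"]),
        "rank": int(env["SLURM_PROCID"]),
    }


# Values as they appear in the job log for hp-1; the path is a placeholder.
example_env = {"SLURM_NTASKS": "2", "SLURM_PROCID": "0"}
kwargs = init_kwargs_from_slurm(example_env, "/somewhere/.dist_init")
print(kwargs)
# In a real test script, with PyTorch installed:
#   import torch.distributed
#   torch.distributed.init_process_group(**kwargs)
```

Inside a job, `os.environ` would be passed in place of `example_env`. If this minimal call also times out, the problem lies in the cluster setup (shared storage, networking, NCCL) rather than in the recipe.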

@himanshucodz55
Author

Thank you @kamo-naoyuki for your support and quick response.
