
"RuntimeError: torch.distributed is not yet initialized but process group is requested" when trying to run API #23

Open
jminjie opened this issue May 3, 2022 · 23 comments · Fixed by #24
Labels
question Further information is requested

Comments

@jminjie

jminjie commented May 3, 2022

❓ Questions and Help

After following the setup steps, I ran metaseq-api-local and got this output:

$ metaseq-api-local
Traceback (most recent call last):
  File "/home/jliu/openpretrainedtransformer/metaseq/metaseq/service/constants.py", line 17, in <module>
    from metaseq_internal.constants import LOCAL_SSD, MODEL_SHARED_FOLDER
ModuleNotFoundError: No module named 'metaseq_internal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jliu/miniconda3/envs/conda_env_opt/bin/metaseq-api-local", line 33, in <module>
    sys.exit(load_entry_point('metaseq', 'console_scripts', 'metaseq-api-local')())
  File "/home/jliu/miniconda3/envs/conda_env_opt/bin/metaseq-api-local", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/importlib/metadata.py", line 86, in load
    module = import_module(match.group('module'))
  File "/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/jliu/openpretrainedtransformer/metaseq/metaseq_cli/interactive_hosted.py", line 31, in <module>
    from metaseq.service.constants import (
  File "/home/jliu/openpretrainedtransformer/metaseq/metaseq/service/constants.py", line 40, in <module>
    raise RuntimeError(
RuntimeError: You must set the variables in metaseq.service.constants to launch the API.

Am I missing a step? I tried manually setting LOCAL_SSD and MODEL_SHARED_FOLDER to a new folder I created, but then other things failed.
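
For reference, here is a minimal sketch of the kind of manual edit I tried in metaseq/service/constants.py; the paths below are placeholders, not my actual values:

# Hypothetical values in metaseq/service/constants.py -- placeholder paths only.
MODEL_SHARED_FOLDER = "/path/to/downloaded/opt/checkpoints"  # where the raw checkpoint files live
LOCAL_SSD = "/path/to/fast/local/storage"                    # local copy used for faster loading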

  • fairseq Version (e.g., 1.0 or master): followed setup.md
  • PyTorch Version (e.g., 1.0): followed setup.md
  • OS (e.g., Linux): Ubuntu
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): followed setup.md
  • Python version: 3.9.12
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: Quadro RTX 5000
  • Any other relevant information:
@jminjie added the "question" (Further information is requested) label on May 3, 2022
stephenroller added a commit that referenced this issue May 3, 2022
@jminjie
Author

jminjie commented May 3, 2022

1546035 doesn't seem to fix this issue (I tried it locally).

@stephenroller
Contributor

Can you share with me how you have set things up?

@jminjie
Author

jminjie commented May 3, 2022

Sure. I have multiple CUDA versions installed, so I'm using a conda virtual environment set to use CUDA 11.3 on an Ubuntu machine with an Nvidia Quadro RTX 5000. I installed PyTorch, Apex, Megatron, fairscale, and metaseq using the instructions on setup.md. Now I'm trying to run metaseq-api-local and seeing errors.

(I'm also not sure if the intention is that I can run the API right away, or if I need to download weights or something somewhere first)

suchenzang pushed a commit that referenced this issue May 3, 2022
@jminjie
Author

jminjie commented May 3, 2022

(This issue is not resolved btw)

I'll try to reproduce again. Let me know if there's any specifics about my setup that you need.

@suchenzang reopened this on May 3, 2022
@hunterlang

hunterlang commented May 3, 2022

I had the same problems and got a few steps farther by directly modifying MODEL_SHARED_FOLDER and LOCAL_SSD inside constants.py to point to where I had downloaded the model files.

Then I used the dict.txt file that Stephen linked in #19.

Then I think you'll also need to copy the files from https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/assets to your MODEL_SHARED_FOLDER.
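
To make that concrete, a rough sanity check along these lines should show whether the files are in place -- the gpt2-merges.txt / gpt2-vocab.json names are my assumption based on the assets folder:

import os

model_shared_folder = "/path/to/model/files"  # same value as MODEL_SHARED_FOLDER
for name in ["dict.txt", "gpt2-merges.txt", "gpt2-vocab.json"]:
    path = os.path.join(model_shared_folder, name)
    print(path, "OK" if os.path.exists(path) else "MISSING")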

@jminjie
Author

jminjie commented May 3, 2022

Thanks Hunter! The sense I'm getting is that:

  1. I didn't download the weights and need to do that (https://github.com/facebookresearch/metaseq/tree/main/projects/OPT).
  2. I should try the dict.txt and /assets changes, then check back in.

Hopefully this will resolve the issues.

@jminjie
Author

jminjie commented May 3, 2022

After downloading the weights (for a smaller model) and the dict, I'm seeing:

(conda_env_opt) [~/fbopt/storage]$ metaseq-api-local   
2022-05-03 16:40:45 | INFO | metaseq_cli.interactive | Local checkpoint copy already exists, skipping copy
2022-05-03 16:40:45 | INFO | metaseq.tasks.language_modeling | dictionary: 50272 types
2022-05-03 16:40:45 | INFO | metaseq.hub_utils | loading model(s) from /home/jliu/fbopt/storage/175B/reshard_no_os/reshard.pt
2022-05-03 16:40:46 | INFO | metaseq.checkpoint_utils | Done reading from disk
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF scaled_upper_triang_masked_softmax.o.d -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/scaled_upper_triang_masked_softmax.cpp -o scaled_upper_triang_masked_softmax.o
[2/3] /usr/local/cuda-11.3/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -std=c++14 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
[3/3] c++ scaled_upper_triang_masked_softmax.o scaled_upper_triang_masked_softmax_cuda.cuda.o -shared -L/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.3/lib64 -lcudart -o scaled_upper_triang_masked_softmax_cuda.so
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF scaled_masked_softmax.o.d -DTORCH_EXTENSION_NAME=scaled_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/scaled_masked_softmax.cpp -o scaled_masked_softmax.o
[2/3] /usr/local/cuda-11.3/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -std=c++14 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/scaled_masked_softmax_cuda.cu -o scaled_masked_softmax_cuda.cuda.o
[3/3] c++ scaled_masked_softmax.o scaled_masked_softmax_cuda.cuda.o -shared -L/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.3/lib64 -lcudart -o scaled_masked_softmax_cuda.so
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags                                                                          
Emitting ninja build file /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...                          
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF layer_norm_cuda.o.d -DTORCH_EXTENSION_NAME=fused_mix_prec_layer_norm_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/layer_norm_cuda.cpp -o layer_norm_cuda.o
[2/3] /usr/local/cuda-11.3/bin/nvcc -DTORCH_EXTENSION_NAME=fused_mix_prec_layer_norm_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -maxrregcount=50 -gencode arch=compute_80,code=sm_80 -std=c++14 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/layer_norm_cuda_kernel.cu -o layer_norm_cuda_kernel.cuda.o
[3/3] c++ layer_norm_cuda.o layer_norm_cuda_kernel.cuda.o -shared -L/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.3/lib64 -lcudart -o fused_mix_prec_layer_norm_cuda.so
Loading extension module fused_mix_prec_layer_norm_cuda...
Traceback (most recent call last):
  File "/home/jliu/miniconda3/envs/conda_env_opt/bin/metaseq-api-local", line 33, in <module>
    sys.exit(load_entry_point('metaseq', 'console_scripts', 'metaseq-api-local')())
  File "/home/jliu/fbopt/metaseq/metaseq_cli/interactive_hosted.py", line 300, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/jliu/fbopt/metaseq/metaseq/distributed/utils.py", line 226, in call_main
    main(cfg, **kwargs)                                
  File "/home/jliu/fbopt/metaseq/metaseq_cli/interactive_hosted.py", line 156, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/home/jliu/fbopt/metaseq/metaseq/hub_utils.py", line 485, in load_model
    models, _model_args, _task = checkpoint_utils.load_model_ensemble_and_task(
  File "/home/jliu/fbopt/metaseq/metaseq/checkpoint_utils.py", line 505, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/home/jliu/fbopt/metaseq/metaseq/hub_utils.py", line 476, in _build_model
    return fsdp_wrap(model)
  File "/home/jliu/fbopt/metaseq/metaseq/distributed/fully_sharded_data_parallel.py", line 145, in fsdp_wrap
    return wrap(module, **kwargs)
  File "/home/jliu/fbopt/fairscale/fairscale/nn/wrap/auto_wrap.py", line 170, in wrap
    return ConfigAutoWrap.wrapper_cls(module, **wrap_overrides)
  File "/home/jliu/fbopt/metaseq/metaseq/distributed/fully_sharded_data_parallel.py", line 48, in __init__
    super().__init__(*args, **kwargs)
  File "/home/jliu/fbopt/fairscale/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 281, in __init__
    self.process_group = process_group or get_process_group_cached()
  File "/home/jliu/fbopt/fairscale/fairscale/utils/parallel.py", line 92, in get_process_group_cached
    raise RuntimeError("torch.distributed is not yet initialized but process group is requested.")
RuntimeError: torch.distributed is not yet initialized but process group is requested.

Any guidance here?

@stephenroller
Contributor

Can you report your fairscale version?

@jminjie
Author

jminjie commented May 4, 2022

I installed fairscale from source with

git clone https://github.com/facebookresearch/fairscale.git
cd fairscale
git checkout prefetch_fsdp_params_simple
pip3 install -e .

as described in setup.md. I'm not sure how to check the version number, but based on fairscale/CHANGELOG.md it seems 0.4.1 is the most recent version on this commit.

@DGideas

DGideas commented May 8, 2022

I installed fairscale from source with

git clone https://github.com/facebookresearch/fairscale.git
cd fairscale
git checkout prefetch_fsdp_params_simple
pip3 install -e .

as described in setup.md. I'm not sure how to check the version number, but based on fairscale/CHANGELOG.md it seems 0.4.1 is the most recent version on this commit.

I got the error message "torch.distributed is not yet initialized but process group is requested" too. BTW, you can go to fairscale/fairscale/__init__.py and you should see __version__ = "0.4.1" on line 7.
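
Or, without opening the file, something like this should print the same version (assuming fairscale is importable in your environment):

import fairscale
print(fairscale.__version__)  # should print 0.4.1 for the pinned branch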

@seelam

seelam commented May 10, 2022

I still see the issue. Any resolution?

@stephenroller
Contributor

Which model are you using? How exactly have you set all the parameters?

This looks like distributed is not being initialized, which is most strange.

@jminjie
Author

jminjie commented May 11, 2022

I really just followed your setup document, so the model and parameters are somewhat opaque to me. Can you recommend what files to check to give you the information you need?

@samuelstevens

I am experiencing the same issue on commit 809e49c. I have changed the MODEL_SHARED_FOLDER, LOCAL_SSD, CHECKPOINT_FOLDER, and CHECKPOINT_LOCAL variables in metaseq/service/constants.py.

I'm running it with CUDA_HOME=/usr/local/cuda-11.6 metaseq-api-local.

I get RuntimeError: torch.distributed is not yet initialized but process group is requested. (same stack trace, fairscale version 0.4.1).

Maybe it is a CUDA 11.6 issue, since metaseq uses 11.3?

@jminjie
Author

jminjie commented May 11, 2022

Maybe it is a CUDA 11.6 issue, since metaseq uses 11.3?

I'm using CUDA 11.3 and seeing the same error.

@suchenzang changed the title from "Error after setup when trying to run API" to ""RuntimeError: torch.distributed is not yet initialized but process group is requested" when trying to run API" on May 12, 2022
@RohitNagraj

RohitNagraj commented May 12, 2022

Solution

So from some research I did about torch.distributed, I found a way to get past this issue by making the following change:
In metaseq/distributed/utils.py, between lines 296 and 297, add the following code:

if not torch.distributed.is_initialized():
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=0, world_size=1)

This initializes torch.distributed. However, I also ran into occasional CUDA errors that crashed the server; I did not look into them closely, since restarting the server fixed it.

I am not very familiar with torch.distributed and found this solution only by Googling. So there might be better ways to fix this.
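
For clarity, here is a self-contained sketch of what the workaround boils down to (single process, single GPU; the address and port values are arbitrary placeholders):

import os
import torch.distributed as dist

if not dist.is_initialized():
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # one-process "cluster": rank 0 in a world of size 1
    dist.init_process_group(backend="nccl", rank=0, world_size=1)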

@ForgetThatNight

Have you resolved it? I got the same error too.

@DGideas

DGideas commented May 16, 2022

Have you resolved it? I got the same error too.

Do you want to fine-tune this model or just run it? If you just want to run it, you could use OPT through Hugging Face (transformers); that bypasses these issues.
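
For example, something along these lines should work for the smaller checkpoints (the facebook/opt-350m model id follows the naming on the Hub; treat this as a rough sketch rather than a tested recipe):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))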

@stephenroller
Contributor

This is so strange. Can anyone provide the command they are running?

@stevenkwong

stevenkwong commented May 17, 2022

This is so strange. Can anyone provide the command they are running?

I met the same problem of "RuntimeError: torch.distributed is not yet initialized but process group is requested".
I just followed the official setup instructions, except that I installed Apex last.
After finishing all the instructions, I ran “metaseq-api-local” and hit this error.

I am wondering whether the install order of the requirements could cause this error?

@xhluca
Contributor

xhluca commented Jun 1, 2022

Also running into this error. I'm using only a single GPU:

Wed Jun  1 17:24:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:E2:00.0 Off |                    0 |
| N/A   30C    P0    50W / 350W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I'm using the 350M weights:

    MODEL_SHARED_FOLDER = "/home/toolkit/opt/"
    MODEL_SIZE = "350M"

which I downloaded here as a single file

Update 1:

Just to confirm, this issue also happens when I'm using 4x A100 80GB. Also, a different issue arises with 2.7B:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/toolkit/opt/metaseq/metaseq_cli/interactive_hosted.py", line 298, in <module>
    cli_main()
  File "/home/toolkit/opt/metaseq/metaseq_cli/interactive_hosted.py", line 294, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/toolkit/opt/metaseq/metaseq/distributed/utils.py", line 263, in call_main
    return main(cfg, **kwargs)
  File "/home/toolkit/opt/metaseq/metaseq_cli/interactive_hosted.py", line 156, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/home/toolkit/opt/metaseq/metaseq/hub_utils.py", line 492, in load_model
    build_model_hook=_build_model,
  File "/home/toolkit/opt/metaseq/metaseq/checkpoint_utils.py", line 473, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/home/toolkit/opt/metaseq/metaseq/checkpoint_utils.py", line 408, in load_checkpoint_to_cpu
    paths_to_load = get_paths_to_load(local_path, suffix="shard")
  File "/home/toolkit/opt/metaseq/metaseq/checkpoint_utils.py", line 348, in get_paths_to_load
    if not _is_checkpoint_sharded(checkpoint_files):
  File "/home/toolkit/opt/metaseq/metaseq/checkpoint_utils.py", line 339, in _is_checkpoint_sharded
    size_ratio = max(sizes) / min(sizes)
ZeroDivisionError: division by zero

I think this might be because there's no reshard.pt file for 2.7B; instead, the checkpoint is in reshard-model_part-{i}.pt format. Note that I had to modify my constants.py:

MODEL_SIZE = "2.7B"
# where to find the raw files on nfs
CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, MODEL_SIZE)
# where to store them on SSD for faster loading
CHECKPOINT_LOCAL = os.path.join(LOCAL_SSD, MODEL_SIZE, "reshard.pt")

since there's no reshard_no_os folder (or at least I didn't use that naming convention).
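
In case it helps with debugging, a quick check along these lines shows which checkpoint files are actually present and their sizes (the path is a placeholder for my CHECKPOINT_LOCAL value; the glob just mirrors the reshard.pt vs reshard-model_part-*.pt naming discussed above):

import glob
import os

checkpoint_local = "/path/to/local_ssd/2.7B/reshard.pt"  # placeholder for CHECKPOINT_LOCAL
pattern = checkpoint_local.replace(".pt", "*.pt")
files = sorted(glob.glob(pattern))
print(files)
print([os.path.getsize(f) for f in files])  # an empty list or a zero-byte file here would explain the crash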

@yuvalkirstain

@jminjie
Regarding the exception:

RuntimeError: torch.distributed is not yet initialized but process group is requested.

What distributed world size did you set? If it is 1, then this behavior is expected.
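
(For context, my understanding is that the relevant knobs live in metaseq/service/constants.py; the names below are from around the OPT release and are meant only as an illustration:)

MODEL_PARALLEL = 2      # number of model-parallel parts the checkpoint is sharded into
TOTAL_WORLD_SIZE = 2    # total number of workers/GPUs the launcher starts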

@jalalirs

Same here… I followed the exact instructions in the README and got the same error. I am using the 350M model and a world size of 8.

BTW, the reason I used the 350M model is that every model with multiple shards fails to load its checkpoints. I keep getting

size_ratio = max(sizes) / min(sizes)
ValueError: max() arg is an empty sequence

Another bug is that the _utils.is_primitive_type function doesn't exist anymore in omegaconf.
