
"RuntimeError: torch.distributed is not yet initialized but process group is requested" when trying to run API #23

Open
jminjie opened this issue May 3, 2022 · 23 comments · Fixed by #24
Labels
question Further information is requested

Comments

@jminjie

jminjie commented May 3, 2022

❓ Questions and Help

After following the setup steps, I ran metaseq-api-local and got this output:

$ metaseq-api-local
Traceback (most recent call last):
  File "/home/jliu/openpretrainedtransformer/metaseq/metaseq/service/constants.py", line 17, in <module>
    from metaseq_internal.constants import LOCAL_SSD, MODEL_SHARED_FOLDER
ModuleNotFoundError: No module named 'metaseq_internal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jliu/miniconda3/envs/conda_env_opt/bin/metaseq-api-local", line 33, in <module>
    sys.exit(load_entry_point('metaseq', 'console_scripts', 'metaseq-api-local')())
  File "/home/jliu/miniconda3/envs/conda_env_opt/bin/metaseq-api-local", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/importlib/metadata.py", line 86, in load
    module = import_module(match.group('module'))
  File "/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/jliu/openpretrainedtransformer/metaseq/metaseq_cli/interactive_hosted.py", line 31, in <module>
    from metaseq.service.constants import (
  File "/home/jliu/openpretrainedtransformer/metaseq/metaseq/service/constants.py", line 40, in <module>
    raise RuntimeError(
RuntimeError: You must set the variables in metaseq.service.constants to launch the API.

Am I missing a step? I tried manually setting LOCAL_SSD and MODEL_SHARED_FOLDER to a new folder I created, but then other things failed.
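
For reference, here is a minimal sketch of the kind of manual edit I tried in metaseq/service/constants.py; the paths below are placeholders, not my actual values:

# Hypothetical values in metaseq/service/constants.py -- placeholder paths only.
MODEL_SHARED_FOLDER = "/path/to/downloaded/opt/checkpoints"  # where the raw checkpoint files live
LOCAL_SSD = "/path/to/fast/local/storage"                    # local copy used for faster loading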

  • fairseq Version (e.g., 1.0 or master): followed setup.md
  • PyTorch Version (e.g., 1.0): followed setup.md
  • OS (e.g., Linux): Ubuntu
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): followed setup.md
  • Python version: 3.9.12
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: Quadro RTX 5000
  • Any other relevant information:
@jminjie added the "question" (Further information is requested) label on May 3, 2022
stephenroller added a commit that referenced this issue May 3, 2022
@jminjie
Author

jminjie commented May 3, 2022

1546035 doesn't seem to fix this issue (I tried it locally).

@stephenroller
Contributor

Can you share with me how you have set things up?

@jminjie
Author

jminjie commented May 3, 2022

Sure. I have multiple CUDA versions installed, so I'm using a conda virtual environment set to use CUDA 11.3 on an Ubuntu machine with an Nvidia Quadro RTX 5000. I installed PyTorch, Apex, Megatron, fairscale, and metaseq using the instructions on setup.md. Now I'm trying to run metaseq-api-local and seeing errors.

(I'm also not sure if the intention is that I can run the API right away, or if I need to download weights or something somewhere first)

suchenzang pushed a commit that referenced this issue May 3, 2022
@jminjie
Author

jminjie commented May 3, 2022

(This issue is not resolved btw)

I'll try to reproduce again. Let me know if there's any specifics about my setup that you need.

@suchenzang reopened this on May 3, 2022
@hunterlang

hunterlang commented May 3, 2022

I had the same problems and got a few steps farther by directly modifying MODEL_SHARED_FOLDER and LOCAL_SSD inside constants.py to point to where I had downloaded the model files.

Then I used the dict.txt file that Stephen linked in #19.

Then I think you'll also need to copy the files from https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/assets to your MODEL_SHARED_FOLDER.
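
To make that concrete, a rough sanity check along these lines should show whether the files are in place -- the gpt2-merges.txt / gpt2-vocab.json names are my assumption based on the assets folder:

import os

model_shared_folder = "/path/to/model/files"  # same value as MODEL_SHARED_FOLDER
for name in ["dict.txt", "gpt2-merges.txt", "gpt2-vocab.json"]:
    path = os.path.join(model_shared_folder, name)
    print(path, "OK" if os.path.exists(path) else "MISSING")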

@jminjie
Author

jminjie commented May 3, 2022

Thanks Hunter! The sense I'm getting is that:

  1. I didn't download the weights and need to do that (https://github.com/facebookresearch/metaseq/tree/main/projects/OPT).
  2. I should try the dict.txt and /assets changes, then check back in.

Hopefully this will resolve the issues.

@jminjie
Author

jminjie commented May 3, 2022

After downloading the weights (for a smaller model) and the dict, I'm seeing:

(conda_env_opt) [~/fbopt/storage]$ metaseq-api-local   
2022-05-03 16:40:45 | INFO | metaseq_cli.interactive | Local checkpoint copy already exists, skipping copy
2022-05-03 16:40:45 | INFO | metaseq.tasks.language_modeling | dictionary: 50272 types
2022-05-03 16:40:45 | INFO | metaseq.hub_utils | loading model(s) from /home/jliu/fbopt/storage/175B/reshard_no_os/reshard.pt
2022-05-03 16:40:46 | INFO | metaseq.checkpoint_utils | Done reading from disk
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF scaled_upper_triang_masked_softmax.o.d -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/scaled_upper_triang_masked_softmax.cpp -o scaled_upper_triang_masked_softmax.o
[2/3] /usr/local/cuda-11.3/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -std=c++14 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
[3/3] c++ scaled_upper_triang_masked_softmax.o scaled_upper_triang_masked_softmax_cuda.cuda.o -shared -L/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.3/lib64 -lcudart -o scaled_upper_triang_masked_softmax_cuda.so
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF scaled_masked_softmax.o.d -DTORCH_EXTENSION_NAME=scaled_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/scaled_masked_softmax.cpp -o scaled_masked_softmax.o
[2/3] /usr/local/cuda-11.3/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -std=c++14 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/scaled_masked_softmax_cuda.cu -o scaled_masked_softmax_cuda.cuda.o
[3/3] c++ scaled_masked_softmax.o scaled_masked_softmax_cuda.cuda.o -shared -L/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.3/lib64 -lcudart -o scaled_masked_softmax_cuda.so
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags                                                                          
Emitting ninja build file /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...                          
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF layer_norm_cuda.o.d -DTORCH_EXTENSION_NAME=fused_mix_prec_layer_norm_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/layer_norm_cuda.cpp -o layer_norm_cuda.o
[2/3] /usr/local/cuda-11.3/bin/nvcc -DTORCH_EXTENSION_NAME=fused_mix_prec_layer_norm_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/TH -isystem /home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda-11.3/include -isystem /home/jliu/miniconda3/envs/conda_env_opt/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -maxrregcount=50 -gencode arch=compute_80,code=sm_80 -std=c++14 -c /home/jliu/fbopt/Megatron-LM/megatron/fused_kernels/layer_norm_cuda_kernel.cu -o layer_norm_cuda_kernel.cuda.o
[3/3] c++ layer_norm_cuda.o layer_norm_cuda_kernel.cuda.o -shared -L/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.3/lib64 -lcudart -o fused_mix_prec_layer_norm_cuda.so
Loading extension module fused_mix_prec_layer_norm_cuda...
Traceback (most recent call last):
  File "/home/jliu/miniconda3/envs/conda_env_opt/bin/metaseq-api-local", line 33, in <module>
    sys.exit(load_entry_point('metaseq', 'console_scripts', 'metaseq-api-local')())
  File "/home/jliu/fbopt/metaseq/metaseq_cli/interactive_hosted.py", line 300, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/jliu/fbopt/metaseq/metaseq/distributed/utils.py", line 226, in call_main
    main(cfg, **kwargs)                                
  File "/home/jliu/fbopt/metaseq/metaseq_cli/interactive_hosted.py", line 156, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/home/jliu/fbopt/metaseq/metaseq/hub_utils.py", line 485, in load_model
    models, _model_args, _task = checkpoint_utils.load_model_ensemble_and_task(
  File "/home/jliu/fbopt/metaseq/metaseq/checkpoint_utils.py", line 505, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/home/jliu/fbopt/metaseq/metaseq/hub_utils.py", line 476, in _build_model
    return fsdp_wrap(model)
  File "/home/jliu/fbopt/metaseq/metaseq/distributed/fully_sharded_data_parallel.py", line 145, in fsdp_wrap
    return wrap(module, **kwargs)
  File "/home/jliu/fbopt/fairscale/fairscale/nn/wrap/auto_wrap.py", line 170, in wrap
    return ConfigAutoWrap.wrapper_cls(module, **wrap_overrides)
  File "/home/jliu/fbopt/metaseq/metaseq/distributed/fully_sharded_data_parallel.py", line 48, in __init__
    super().__init__(*args, **kwargs)
  File "/home/jliu/fbopt/fairscale/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 281, in __init__
    self.process_group = process_group or get_process_group_cached()
  File "/home/jliu/fbopt/fairscale/fairscale/utils/parallel.py", line 92, in get_process_group_cached
    raise RuntimeError("torch.distributed is not yet initialized but process group is requested.")
RuntimeError: torch.distributed is not yet initialized but process group is requested.

Any guidance here?

@stephenroller
Contributor

Can you report your fairscale version?

@jminjie
Author

jminjie commented May 4, 2022

I installed fairscale from source with

git clone https://github.com/facebookresearch/fairscale.git
cd fairscale
git checkout prefetch_fsdp_params_simple
pip3 install -e .

as described in setup.md. I'm not sure how to check the version number, but based on fairscale/CHANGELOG.md it seems 0.4.1 is the most recent version on this commit.

@DGideas

DGideas commented May 8, 2022

I installed fairscale from source with

git clone https://github.com/facebookresearch/fairscale.git
cd fairscale
git checkout prefetch_fsdp_params_simple
pip3 install -e .

as described in setup.md. I'm not sure how to check the version number, but based on fairscale/CHANGELOG.md it seems 0.4.1 is the most recent version on this commit.

I got the error message "torch.distributed is not yet initialized but process group is requested" too. BTW, you can go to fairscale/fairscale/__init__.py and you should see __version__ = "0.4.1" on line 7.
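
Or, without opening the file, something like this should print the same version (assuming fairscale is importable in your environment):

import fairscale
print(fairscale.__version__)  # should print 0.4.1 for the pinned branch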

@seelam

seelam commented May 10, 2022

I still see the issue. Any resolution?

@stephenroller
Contributor

Which model are you using? How exactly have you set all the parameters?

This looks like distributed is not being initialized, which is most strange.

@jminjie
Author

jminjie commented May 11, 2022

I really just followed your setup document, so the model and parameters are somewhat opaque to me. Can you recommend what files to check to give you the information you need?

@samuelstevens

I am experiencing the same issue on commit 809e49c. I have changed the MODEL_SHARED_FOLDER, LOCAL_SSD, CHECKPOINT_FOLDER, and CHECKPOINT_LOCAL variables in metaseq/service/constants.py.

I'm running it with CUDA_HOME=/usr/local/cuda-11.6 metaseq-api-local.

I get RuntimeError: torch.distributed is not yet initialized but process group is requested. (same stack trace, fairscale version 0.4.1).

Maybe it is a CUDA 11.6 issue, since metaseq uses 11.3?

@jminjie
Author

jminjie commented May 11, 2022

Maybe it is a CUDA 11.6 issue, since metaseq uses 11.3?

I'm using CUDA 11.3 and seeing the same error.

@suchenzang changed the title from "Error after setup when trying to run API" to ""RuntimeError: torch.distributed is not yet initialized but process group is requested" when trying to run API" on May 12, 2022
@RohitNagraj

RohitNagraj commented May 12, 2022

Solution

So from some research I did about torch.distributed, I found a way to get past this issue by making the following change:
In metaseq/distributed/utils.py, between lines 296 and 297, add the following code:

if not torch.distributed.is_initialized():
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=0, world_size=1)

This initializes torch.distributed. However, I also ran into occasional CUDA errors that crashed the server; I did not look into them closely, since restarting the server fixed it.

I am not very familiar with torch.distributed and found this solution only by Googling. So there might be better ways to fix this.
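
For clarity, here is a self-contained sketch of what the workaround boils down to (single process, single GPU; the address and port values are arbitrary placeholders):

import os
import torch.distributed as dist

if not dist.is_initialized():
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # one-process "cluster": rank 0 in a world of size 1
    dist.init_process_group(backend="nccl", rank=0, world_size=1)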

@ForgetThatNight

Have you resolved it? I got the same error too.

@DGideas

DGideas commented May 16, 2022

Have you resolved it? I got the same error too.

Do you want to fine-tune this model or just run it? If you just want to run it, you could use OPT through Hugging Face (transformers); that bypasses these issues.
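
For example, something along these lines should work for the smaller checkpoints (the facebook/opt-350m model id follows the naming on the Hub; treat this as a rough sketch rather than a tested recipe):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))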

@stephenroller
Contributor

This is so strange. Can anyone provide the command they are running?

@stevenkwong

stevenkwong commented May 17, 2022

This is so strange. Can anyone provide the command they are running?

I met the same problem of "RuntimeError: torch.distributed is not yet initialized but process group is requested".
I just followed the official setup instructions, except that I installed Apex last.
After finishing all the instructions, I ran “metaseq-api-local” and hit this error.

I am wondering whether the install order of the requirements could cause this error?

@xhluca
Contributor

xhluca commented Jun 1, 2022

Also running into this error. I'm using only a single GPU:

Wed Jun  1 17:24:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:E2:00.0 Off |                    0 |
| N/A   30C    P0    50W / 350W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I'm using the 350M weights:

    MODEL_SHARED_FOLDER = "/home/toolkit/opt/"
    MODEL_SIZE = "350M"

which I downloaded here as a single file

Update 1:

Just to confirm, this issue also happens when I'm using 4x A100 80GB. Also, a different issue arises with 2.7B:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/toolkit/opt/metaseq/metaseq_cli/interactive_hosted.py", line 298, in <module>
    cli_main()
  File "/home/toolkit/opt/metaseq/metaseq_cli/interactive_hosted.py", line 294, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/toolkit/opt/metaseq/metaseq/distributed/utils.py", line 263, in call_main
    return main(cfg, **kwargs)
  File "/home/toolkit/opt/metaseq/metaseq_cli/interactive_hosted.py", line 156, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/home/toolkit/opt/metaseq/metaseq/hub_utils.py", line 492, in load_model
    build_model_hook=_build_model,
  File "/home/toolkit/opt/metaseq/metaseq/checkpoint_utils.py", line 473, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/home/toolkit/opt/metaseq/metaseq/checkpoint_utils.py", line 408, in load_checkpoint_to_cpu
    paths_to_load = get_paths_to_load(local_path, suffix="shard")
  File "/home/toolkit/opt/metaseq/metaseq/checkpoint_utils.py", line 348, in get_paths_to_load
    if not _is_checkpoint_sharded(checkpoint_files):
  File "/home/toolkit/opt/metaseq/metaseq/checkpoint_utils.py", line 339, in _is_checkpoint_sharded
    size_ratio = max(sizes) / min(sizes)
ZeroDivisionError: division by zero

I think this might be because there's no reshard.pt file for 2.7B; instead, the checkpoint is in reshard-model_part-{i}.pt format. Note that I had to modify my constants.py:

MODEL_SIZE = "2.7B"
# where to find the raw files on nfs
CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, MODEL_SIZE)
# where to store them on SSD for faster loading
CHECKPOINT_LOCAL = os.path.join(LOCAL_SSD, MODEL_SIZE, "reshard.pt")

since there's no reshard_no_os folder (or at least I didn't use that naming convention).
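
In case it helps with debugging, a quick check along these lines shows which checkpoint files are actually present and their sizes (the path is a placeholder for my CHECKPOINT_LOCAL value; the glob just mirrors the reshard.pt vs reshard-model_part-*.pt naming discussed above):

import glob
import os

checkpoint_local = "/path/to/local_ssd/2.7B/reshard.pt"  # placeholder for CHECKPOINT_LOCAL
pattern = checkpoint_local.replace(".pt", "*.pt")
files = sorted(glob.glob(pattern))
print(files)
print([os.path.getsize(f) for f in files])  # an empty list or a zero-byte file here would explain the crash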

@yuvalkirstain

@jminjie
Regarding the exception:

RuntimeError: torch.distributed is not yet initialized but process group is requested.

What distributed world size did you set? If it is 1, then this behavior is expected.
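
(For context, my understanding is that the relevant knobs live in metaseq/service/constants.py; the names below are from around the OPT release and are meant only as an illustration:)

MODEL_PARALLEL = 2      # number of model-parallel parts the checkpoint is sharded into
TOTAL_WORLD_SIZE = 2    # total number of workers/GPUs the launcher starts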

@jalalirs

Same here… I followed the exact instructions in the README and got the same error. I am using the 350M model and a world size of 8.

BTW, the reason I used the 350M model is that every model with multiple shards fails to load its checkpoints. I keep getting

size_ratio = max(sizes) / min(sizes)
ValueError: max() arg is an empty sequence

Another bug is that the _utils.is_primitive_type function doesn't exist anymore in omegaconf.
