"RuntimeError: torch.distributed is not yet initialized but process group is requested" when trying to run API #23
Comments
1546035 doesn't seem to fix this issue (I tried it locally).
Can you share how you have set things up?
Sure. I have multiple CUDA versions installed, so I'm using a conda virtual environment set to CUDA 11.3 on an Ubuntu machine with an Nvidia Quadro RTX 5000. I installed PyTorch, Apex, Megatron, fairscale, and metaseq following the instructions in setup.md. Now I'm trying to run metaseq-api-local and seeing errors. (I'm also not sure whether the intention is that I can run the API right away, or whether I need to download weights or something first.)
(This issue is not resolved, btw.) I'll try to reproduce again. Let me know if there are any specifics about my setup that you need.
I had the same problems and got a few steps farther by directly modifying […]. Then I used the […]. Then I think you'll also need to copy the files from https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/assets to your […].
Thanks Hunter! The sense I'm getting is that […]. Hopefully this will resolve the issues.
After downloading the weights (for a smaller model) and the dict, I'm seeing […]. Any guidance here?
Can you report your fairscale version?
I installed fairscale from source with […] as described in setup.md. I'm not sure how to check the version number. Based on fairscale/CHANGELOG.md, it seems 0.4.1 is the most recent version upgrade on this commit.
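One way to check the version of a source install (a suggestion here, not something confirmed in the thread) is to query the installed package metadata from Python; the `installed_version` helper below is hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(pkg: str):
    """Return the installed version string for pkg, or None if it isn't installed."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None


# Prints e.g. "0.4.1" if fairscale is installed, otherwise None.
print(installed_version("fairscale"))
```

`pip show fairscale` on the command line reports the same information.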
I got the error message "torch.distributed is not yet initialized but process group is requested" too. BTW, you could go to […].
I still see the issue. Any resolution?
Which model are you using? How exactly have you set all the parameters? This looks like distributed is not being initialized, which is very strange.
I really just followed your setup document, so the model and parameters are somewhat opaque to me. Can you recommend what files to check to give you the information you need?
I am experiencing the same issue on commit 809e49c. I have changed the MODEL_SHARED_FOLDER, LOCAL_SSD, CHECKPOINT_FOLDER, and CHECKPOINT_LOCAL variables in metaseq/service/constants.py. I'm running it with […] and I get […]. Maybe it is a CUDA 11.6 issue, since metaseq targets 11.3?
I'm using CUDA 11.3 and seeing the same error.
Solution: from some research I did on torch.distributed, I found a way to get past this issue by making the following changes: […]
This initializes torch.distributed. However, I also hit cases where the code crashed due to some CUDA errors, which I didn't look into closely since restarting the server fixed them. I'm not very familiar with torch.distributed and found this solution only by Googling, so there may be better ways to fix this.
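The commenter's exact changes are not shown in the thread, but a minimal sketch of initializing torch.distributed for a single process looks like the following; the gloo backend, address, and port are assumptions, not values from the thread:

```python
import os

import torch.distributed as dist

# The default env:// init method reads these environment variables.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

if not dist.is_initialized():
    # gloo works even on CPU-only machines; nccl is the usual choice
    # when every rank owns a GPU.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

print(dist.get_world_size())  # -> 1
```

After this, calls that request the default process group no longer raise "torch.distributed is not yet initialized".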
Have you resolved it? I got the same error too.
Do you want to fine-tune this model or just run it? If you just want to run it, you could use OPT on Hugging Face (transformers); that route bypasses these issues.
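For anyone taking that route, a minimal sketch using the transformers library (assuming it is installed and that facebook/opt-350m is the checkpoint you want; this is not the metaseq serving path):

```python
from transformers import pipeline

# No torch.distributed setup needed: transformers pulls the checkpoint
# from the Hugging Face Hub and runs it in a single process.
generator = pipeline("text-generation", model="facebook/opt-350m")
result = generator("Hello, I am", max_new_tokens=10)
print(result[0]["generated_text"])
```

The first call downloads the model weights, so it needs network access and some disk space.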
This is so strange. Can anyone provide the command they are running?
I hit the same "RuntimeError: torch.distributed is not yet initialized but process group is requested" problem. I am wondering whether the order in which the requirements are installed could cause this error?
Also running into this error. I'm using only a single GPU: […]
I'm using the 350M weights: […]
which I downloaded here as a single file. Update 1: just to confirm, this issue also happens when I'm using 4x A100 80GB. Also, a different issue arises with 2.7B: […]
I think this might be because of how the paths are derived in metaseq/service/constants.py:

```python
MODEL_SIZE = "2.7B"

# where to find the raw files on nfs
CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, MODEL_SIZE)
# where to store them on SSD for faster loading
CHECKPOINT_LOCAL = os.path.join(LOCAL_SSD, MODEL_SIZE, "reshard.pt")
```

since there's no […]
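To see what those constants resolve to, here is a standalone sketch; the root folders below are placeholders, not the thread's actual paths:

```python
import os

# Placeholder roots; in metaseq these live in metaseq/service/constants.py.
MODEL_SHARED_FOLDER = "/data/models"
LOCAL_SSD = "/mnt/ssd"
MODEL_SIZE = "2.7B"

# where to find the raw files on nfs
CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, MODEL_SIZE)
# where to store them on SSD for faster loading
CHECKPOINT_LOCAL = os.path.join(LOCAL_SSD, MODEL_SIZE, "reshard.pt")

print(CHECKPOINT_FOLDER)  # -> /data/models/2.7B
print(CHECKPOINT_LOCAL)   # -> /mnt/ssd/2.7B/reshard.pt
```

If no reshard.pt exists at the resolved CHECKPOINT_LOCAL path for the chosen MODEL_SIZE, loading will fail.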
@jminjie Regarding "RuntimeError: torch.distributed is not yet initialized but process group is requested": what distributed world size have you set? If it is 1, then this behavior is expected.
Same here… I followed the exact instructions in the readme and got the same error. I'm using the 350m model and a world size of 8. BTW, the reason I used the 350m model is that all models with multiple shards fail to load their checkpoints; I keep getting […]. Another bug: the _utils.is_primitive_type function doesn't exist anymore in omegaconf.
❓ Questions and Help
After following setup steps I ran
metaseq-api-local
and got this output: […]
Am I missing a step? I tried manually setting LOCAL_SSD and MODEL_SHARED_FOLDER to a new folder I created, but then other things failed.
How you installed metaseq (pip, source): source