Torchrun distributed running does not work #201
Running in a distributed manner either returns an error or, with the simplest example, produces obviously incorrect output.

The following is the result of running the 13B model across two nodes.

Node A:

python -m torch.distributed.run --nproc_per_node 1 --nnodes=2 --node_rank=0 --master_addr="gpu3.lan" --master_port=1234 example.py --ckpt_dir $MODELS/65B --tokenizer_path $MODELS/tokenizer.model

Node B:

python -m torch.distributed.run --nproc_per_node 1 --nnodes=2 --node_rank=1 --master_addr="gpu3.lan" --master_port=1234 example.py --ckpt_dir $MODELS/65B --tokenizer_path $MODELS/tokenizer.model

It does complete without error, but the results are messed up.

Comments
I have the same issue. Single-node runs are fine, while multi-node runs output gibberish.
When using multiple nodes, it should use the global rank rather than the local rank when picking the checkpoint shard. My fix is: modify setup_model_parallel in example.py to read the rank from the RANK environment variable instead of LOCAL_RANK, and return that. It should produce meaningful output now.
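For reference, a minimal sketch of a patch along those lines, assuming the stock setup_model_parallel helper in this repo's example.py; the exact code from the original comment is not preserved in this thread, so treat this as an approximation:

```python
import os
from typing import Tuple

import torch
from fairscale.nn.model_parallel.initialize import initialize_model_parallel


def setup_model_parallel() -> Tuple[int, int]:
    # LOCAL_RANK only identifies the process within one node; with
    # --nnodes=2 --nproc_per_node=1 both nodes see local_rank == 0,
    # so both would load the same checkpoint shard.
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    # RANK is the global rank across all nodes (0 and 1 here).
    global_rank = int(os.environ.get("RANK", -1))
    world_size = int(os.environ.get("WORLD_SIZE", -1))

    torch.distributed.init_process_group("nccl")
    initialize_model_parallel(world_size)
    # Device selection still uses the local rank (GPU index on this node).
    torch.cuda.set_device(local_rank)

    # The seed must be identical in all processes so sampling stays in sync.
    torch.manual_seed(1)
    # Return the global rank so the caller indexes the correct shard,
    # e.g. ckpt_path = checkpoints[global_rank] instead of [local_rank].
    return global_rank, world_size
```

If the checkpoint loader then indexes the shard list with the returned rank, the process launched with node_rank=1 loads the second shard instead of re-reading the first, which is what turns the gibberish back into meaningful output.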
@YuzhongHuangCS Thanks so much, this works like a charm!

Looks fixed, nice.