
Torchrun distributed running does not work #201

Closed
catid opened this issue Mar 15, 2023 · 4 comments

catid commented Mar 15, 2023

Running in a distributed manner either returns an error or, with the simplest example, produces obviously incorrect output.

The following is the result of running the 13B model across two nodes. Node A:

python -m torch.distributed.run --nproc_per_node 1 --nnodes=2 --node_rank=0 --master_addr="gpu3.lan" --master_port=1234 example.py --ckpt_dir $MODELS/65B --tokenizer_path $MODELS/tokenizer.model

Node B:

python -m torch.distributed.run --nproc_per_node 1 --nnodes=2 --node_rank=1 --master_addr="gpu3.lan" --master_port=1234 example.py --ckpt_dir $MODELS/65B --tokenizer_path $MODELS/tokenizer.model

It does complete without error, but the results are messed up:

(screenshot: the generated text is garbled)

@LucWeber

I have the same issue. Single-node runs are fine, while multi-node runs produce gibberish.


YuzhongHuangCS commented Mar 29, 2023

When using multiple nodes, the code should use RANK instead of LOCAL_RANK to determine which weight shard to load.
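
For example, with the flags from the original report (--nnodes=2 --nproc_per_node 1), each node launches a single worker and torch.distributed.run sets the environment variables roughly as follows (exact behavior can vary by PyTorch version):

    # node A (--node_rank=0): RANK=0, LOCAL_RANK=0, WORLD_SIZE=2
    # node B (--node_rank=1): RANK=1, LOCAL_RANK=0, WORLD_SIZE=2

LOCAL_RANK is 0 on both nodes, so indexing the checkpoints by LOCAL_RANK makes both processes load the same shard, which is why the output comes out garbled.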

My fix is to modify setup_model_parallel in example.py:

def setup_model_parallel() -> Tuple[int, int]:
    # RANK is the global rank across all nodes; LOCAL_RANK is the rank within this node.
    rank = int(os.environ.get("RANK", -1))
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    world_size = int(os.environ.get("WORLD_SIZE", -1))
    print(rank, local_rank, world_size)

    torch.distributed.init_process_group("nccl")
    initialize_model_parallel(world_size)
    # The GPU index is per node, so LOCAL_RANK is still the right value here.
    torch.cuda.set_device(local_rank)

    # seed must be the same in all processes
    torch.manual_seed(1)
    # return the global rank (was: return local_rank, world_size) so the caller
    # loads the checkpoint shard that matches this process
    return rank, world_size

It should produce meaningful output now.
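
For context, this is roughly how the returned value gets used by load() in example.py (a sketch; the exact code may differ slightly between versions):

    # sketch of the shard selection in load() (Path comes from pathlib, already imported in example.py)
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))  # one .pth shard per model-parallel rank
    assert world_size == len(checkpoints), "need exactly one process per checkpoint shard"
    ckpt_path = checkpoints[rank]  # with the fix, the global RANK picks the shard:
                                   # node 0 loads shard 0, node 1 loads shard 1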

@LucWeber

@YuzhongHuangCS Thanks so much, this works like a charm!


catid commented May 20, 2023

Looks fixed, nice.

catid closed this as completed May 20, 2023