
Torchrun distributed running does not work #201

Closed
catid opened this issue Mar 15, 2023 · 4 comments

catid commented Mar 15, 2023

Running in a distributed manner either returns an error or, with the simplest example, produces obviously incorrect output.

The following is the result of running the 13B model across two nodes. Node A:

python -m torch.distributed.run --nproc_per_node 1 --nnodes=2 --node_rank=0 --master_addr="gpu3.lan" --master_port=1234 example.py --ckpt_dir $MODELS/65B --tokenizer_path $MODELS/tokenizer.model

Node B:

python -m torch.distributed.run --nproc_per_node 1 --nnodes=2 --node_rank=1 --master_addr="gpu3.lan" --master_port=1234 example.py --ckpt_dir $MODELS/65B --tokenizer_path $MODELS/tokenizer.model

It does complete without error, but the results are messed up:

(screenshot: the generated text is garbled)

@LucWeber

I have the same issue. Single-node runs are fine, while multi-node runs produce gibberish.


YuzhongHuangCS commented Mar 29, 2023

When using multiple nodes, the code should use RANK instead of LOCAL_RANK to determine which weight shard to load.
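
For example, with the flags from the original report (--nnodes=2 --nproc_per_node 1), each node launches a single worker and torch.distributed.run sets the environment variables roughly as follows (exact behavior can vary by PyTorch version):

    # node A (--node_rank=0): RANK=0, LOCAL_RANK=0, WORLD_SIZE=2
    # node B (--node_rank=1): RANK=1, LOCAL_RANK=0, WORLD_SIZE=2

LOCAL_RANK is 0 on both nodes, so indexing the checkpoints by LOCAL_RANK makes both processes load the same shard, which is why the output comes out garbled.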

My fix is to modify setup_model_parallel in example.py:

def setup_model_parallel() -> Tuple[int, int]:
    # RANK is the global rank across all nodes; LOCAL_RANK is the rank within this node.
    rank = int(os.environ.get("RANK", -1))
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    world_size = int(os.environ.get("WORLD_SIZE", -1))
    print(rank, local_rank, world_size)

    torch.distributed.init_process_group("nccl")
    initialize_model_parallel(world_size)
    # The GPU index is per node, so LOCAL_RANK is still the right value here.
    torch.cuda.set_device(local_rank)

    # seed must be the same in all processes
    torch.manual_seed(1)
    # return the global rank (was: return local_rank, world_size) so the caller
    # loads the checkpoint shard that matches this process
    return rank, world_size

It should produce meaningful output now.
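
For context, this is roughly how the returned value gets used by load() in example.py (a sketch; the exact code may differ slightly between versions):

    # sketch of the shard selection in load() (Path comes from pathlib, already imported in example.py)
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))  # one .pth shard per model-parallel rank
    assert world_size == len(checkpoints), "need exactly one process per checkpoint shard"
    ckpt_path = checkpoints[rank]  # with the fix, the global RANK picks the shard:
                                   # node 0 loads shard 0, node 1 loads shard 1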

@LucWeber

@YuzhongHuangCS Thanks so much, this works like a charm!


catid commented May 20, 2023

Looks fixed, nice.

catid closed this as completed May 20, 2023