
Running Megatron-3d with openmpi launcher fails #795

@sdtblck

Description


I'm running this script https://github.com/EleutherAI/megatron-3d/blob/main/examples/tst_16.sh - the codebase is very similar to the Megatron-3d example in DeepSpeedExamples.

I'm trying to get onebitadam working with megatron-3d, and following advice from @awan-10 tried using the openmpi launcher to run the job.

Unfortunately, it seems to launch many separate processes, each with world_size = 1, which then error out like this:

Traceback (most recent call last):
  File "pretrain_gpt2.py", line 156, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/home/mchorse/megatron-3d/megatron/training.py", line 72, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/home/mchorse/megatron-3d/megatron/initialize.py", line 77, in initialize_megatron
    finish_mpu_init()
  File "/home/mchorse/megatron-3d/megatron/initialize.py", line 59, in finish_mpu_init
    _initialize_distributed()
  File "/home/mchorse/megatron-3d/megatron/initialize.py", line 151, in _initialize_distributed
    torch.distributed.init_process_group(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/rendezvous.py", line 133, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

I made sure it isn't simply stale processes getting in the way by pkill-ing all processes on each node before testing.
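For what it's worth, a minimal standalone illustration of the failure mode (an assumption about the mechanism, not the actual Megatron/DeepSpeed code): if every process believes it is rank 0 of a world_size = 1 job, each one tries to start the rank-0 TCPStore listener on the same host:port; only the first bind succeeds, and the rest die with exactly the "Address already in use" seen in the traceback.

```python
import errno
import socket

# First "rank 0" grabs an ephemeral port and starts listening,
# standing in for the TCPStore daemon that torch.distributed's
# tcp:// rendezvous creates on rank 0.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))
port = holder.getsockname()[1]
holder.listen()

# Second "rank 0" tries to bind the same port and fails with
# EADDRINUSE -- the OS-level error behind the RuntimeError above.
late = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    late.bind(("127.0.0.1", port))
    bind_error = None
except OSError as e:
    bind_error = e
finally:
    late.close()
    holder.close()
```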

I suspect it's because Megatron's mpu initializes torch.distributed for model parallelism separately from DeepSpeed, and the two initializations are interfering with each other?
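One thing I've been sketching as a possible workaround (hedged: `mpi_env_to_torch_env` is a hypothetical helper of mine, not anything in Megatron or DeepSpeed, and the master address/port are placeholders): OpenMPI exports each process's rank and world size via `OMPI_COMM_WORLD_*` environment variables, so mapping those onto the `RANK`/`WORLD_SIZE`/`MASTER_*` variables that torch.distributed's env:// rendezvous reads should make all launched processes join a single group instead of sixteen independent world_size = 1 groups.

```python
import os

def mpi_env_to_torch_env(master_addr="localhost", master_port="29500"):
    """Map OpenMPI's OMPI_COMM_WORLD_* variables onto the variables
    torch.distributed's env:// rendezvous expects. master_addr should
    be the host running rank 0 (placeholder default here)."""
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", master_port)
    return rank, world_size

# After this, torch.distributed.init_process_group("nccl") should see
# a consistent rank/world_size across all mpirun-launched processes,
# rather than each process rendezvousing alone.
```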

Any advice would be much appreciated.

Sharing the mpirun command below in case it's helpful:

cmd = mpirun -n 16 -hostfile /job/hostfile \
  --mca btl ^openib --mca btl_tcp_if_include eth0 \
  -x UCX_TLS=tcp -x PYTHONPATH=/home/mchorse/megatron-3d \
  /usr/bin/python -u pretrain_gpt2.py \
  --model-parallel-size 1 --pipe-parallel-size 2 \
  --num-layers 12 --hidden-size 2048 --num-attention-heads 16 \
  --seq-length 1024 --max-position-embeddings 1024 \
  --batch-size 4 --gas 16 \
  --train-iters 320000 --lr-decay-iters 320000 \
  --save checkpoints/gpt2_XL_ds --load checkpoints/gpt2_XL_ds \
  --data-path /mnt/ssd-cluster/data/enron/enron_text_document \
  --vocab-file /mnt/ssd-cluster/data/gpt2-vocab.json \
  --merge-file /mnt/ssd-cluster/data/gpt2-merges.txt \
  --data-impl mmap --split 949,50,1 --distributed-backend nccl \
  --lr 2.0e-4 --lr-decay-style cosine --min-lr 2.0e-5 \
  --weight-decay 0 --attention-dropout 0 --hidden-dropout 0 \
  --clip-grad 1.0 --warmup 0.01 \
  --checkpoint-activations --log-interval 1 \
  --save-interval 500 --eval-interval 100 --eval-iters 10 \
  --fp16 --tensorboard-dir tensorboard_data/l_h_2n_8g_2pp_1mp_4b_ds4 \
  --sparsity interspersed \
  --deepspeed --deepspeed_config configs/deepspeed_configs/ds_zero_stage_1_config.json \
  --zero-stage 1 --zero-reduce-bucket-size 50000000 \
  --zero-allgather-bucket-size 5000000000 \
  --zero-contigious-gradients --zero-reduce-scatter \
  --checkpoint-activations --checkpoint-num-layers 1 \
  --partition-activations --checkpoint-in-cpu \
  --synchronize-each-layer --contigious-checkpointing
