
Running Megatron-3d with openmpi launcher fails #795

@sdtblck

Description


I'm running this script https://github.com/EleutherAI/megatron-3d/blob/main/examples/tst_16.sh - the codebase is very similar to the Megatron-3d example in DeepSpeedExamples.

I'm trying to get onebitadam working with megatron-3d, and following advice from @awan-10 tried using the openmpi launcher to run the job.

Unfortunately, it seems to launch many separate processes, each with world_size = 1, which then error out like this:

Traceback (most recent call last):
  File "pretrain_gpt2.py", line 156, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/home/mchorse/megatron-3d/megatron/training.py", line 72, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/home/mchorse/megatron-3d/megatron/initialize.py", line 77, in initialize_megatron
    finish_mpu_init()
  File "/home/mchorse/megatron-3d/megatron/initialize.py", line 59, in finish_mpu_init
    _initialize_distributed()
  File "/home/mchorse/megatron-3d/megatron/initialize.py", line 151, in _initialize_distributed
    torch.distributed.init_process_group(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/rendezvous.py", line 133, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

I made sure it isn't simply stale processes getting in the way by pkill-ing all processes on each node before testing.
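For what it's worth, a minimal standalone illustration of the failure mode (an assumption about the mechanism, not the actual Megatron/DeepSpeed code): if every process believes it is rank 0 of a world_size = 1 job, each one tries to start the rank-0 TCPStore listener on the same host:port; only the first bind succeeds, and the rest die with exactly the "Address already in use" seen in the traceback.

```python
import errno
import socket

# First "rank 0" grabs an ephemeral port and starts listening,
# standing in for the TCPStore daemon that torch.distributed's
# tcp:// rendezvous creates on rank 0.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))
port = holder.getsockname()[1]
holder.listen()

# Second "rank 0" tries to bind the same port and fails with
# EADDRINUSE -- the OS-level error behind the RuntimeError above.
late = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    late.bind(("127.0.0.1", port))
    bind_error = None
except OSError as e:
    bind_error = e
finally:
    late.close()
    holder.close()
```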

I suspect it's because Megatron's mpu initializes torch.distributed for model parallelism separately from DeepSpeed, and the two initializations are interfering with each other?
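One thing I've been sketching as a possible workaround (hedged: `mpi_env_to_torch_env` is a hypothetical helper of mine, not anything in Megatron or DeepSpeed, and the master address/port are placeholders): OpenMPI exports each process's rank and world size via `OMPI_COMM_WORLD_*` environment variables, so mapping those onto the `RANK`/`WORLD_SIZE`/`MASTER_*` variables that torch.distributed's env:// rendezvous reads should make all launched processes join a single group instead of sixteen independent world_size = 1 groups.

```python
import os

def mpi_env_to_torch_env(master_addr="localhost", master_port="29500"):
    """Map OpenMPI's OMPI_COMM_WORLD_* variables onto the variables
    torch.distributed's env:// rendezvous expects. master_addr should
    be the host running rank 0 (placeholder default here)."""
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", master_port)
    return rank, world_size

# After this, torch.distributed.init_process_group("nccl") should see
# a consistent rank/world_size across all mpirun-launched processes,
# rather than each process rendezvousing alone.
```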

Any advice would be much appreciated.

Sharing the mpirun command below in case it's helpful:

cmd = mpirun -n 16 -hostfile /job/hostfile \
  --mca btl ^openib --mca btl_tcp_if_include eth0 \
  -x UCX_TLS=tcp -x PYTHONPATH=/home/mchorse/megatron-3d \
  /usr/bin/python -u pretrain_gpt2.py \
  --model-parallel-size 1 --pipe-parallel-size 2 \
  --num-layers 12 --hidden-size 2048 --num-attention-heads 16 \
  --seq-length 1024 --max-position-embeddings 1024 \
  --batch-size 4 --gas 16 \
  --train-iters 320000 --lr-decay-iters 320000 \
  --save checkpoints/gpt2_XL_ds --load checkpoints/gpt2_XL_ds \
  --data-path /mnt/ssd-cluster/data/enron/enron_text_document \
  --vocab-file /mnt/ssd-cluster/data/gpt2-vocab.json \
  --merge-file /mnt/ssd-cluster/data/gpt2-merges.txt \
  --data-impl mmap --split 949,50,1 --distributed-backend nccl \
  --lr 2.0e-4 --lr-decay-style cosine --min-lr 2.0e-5 \
  --weight-decay 0 --attention-dropout 0 --hidden-dropout 0 \
  --clip-grad 1.0 --warmup 0.01 \
  --checkpoint-activations --log-interval 1 \
  --save-interval 500 --eval-interval 100 --eval-iters 10 \
  --fp16 --tensorboard-dir tensorboard_data/l_h_2n_8g_2pp_1mp_4b_ds4 \
  --sparsity interspersed \
  --deepspeed --deepspeed_config configs/deepspeed_configs/ds_zero_stage_1_config.json \
  --zero-stage 1 --zero-reduce-bucket-size 50000000 \
  --zero-allgather-bucket-size 5000000000 \
  --zero-contigious-gradients --zero-reduce-scatter \
  --checkpoint-activations --checkpoint-num-layers 1 \
  --partition-activations --checkpoint-in-cpu \
  --synchronize-each-layer --contigious-checkpointing
