Description
I'm running this script: https://github.com/EleutherAI/megatron-3d/blob/main/examples/tst_16.sh — the codebase is very similar to the Megatron example in DeepSpeedExamples.
I'm trying to get OneBitAdam working with megatron-3d and, following advice from @awan-10, tried using the OpenMPI launcher to run the job.
Unfortunately, it seems to launch many separate processes, each with world_size = 1, which then error out like this:
```
Traceback (most recent call last):
  File "pretrain_gpt2.py", line 156, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/home/mchorse/megatron-3d/megatron/training.py", line 72, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/home/mchorse/megatron-3d/megatron/initialize.py", line 77, in initialize_megatron
    finish_mpu_init()
  File "/home/mchorse/megatron-3d/megatron/initialize.py", line 59, in finish_mpu_init
    _initialize_distributed()
  File "/home/mchorse/megatron-3d/megatron/initialize.py", line 151, in _initialize_distributed
    torch.distributed.init_process_group(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/rendezvous.py", line 133, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
```
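For what it's worth, "Address already in use" is what surfaces when several processes each believe they are rank 0 and all try to open the rendezvous server on the same port. A minimal sketch with plain sockets (no torch involved; the OS-assigned port just stands in for MASTER_PORT) reproduces the underlying condition:

```python
import socket

def start_rendezvous_server(port):
    """Bind a listening socket the way a rank-0 TCPStore server would.
    Returns (socket, None) on success, or (None, error_message)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        s.listen(1)
        return s, None
    except OSError as e:
        s.close()
        return None, str(e)

# The first process that believes it is rank 0 grabs a port
# (0 = let the OS pick a free one, standing in for MASTER_PORT)...
first, err = start_rendezvous_server(0)
port = first.getsockname()[1]

# ...and every other process that also believes it is rank 0 then
# fails on that same port -- the condition TCPStore reports as
# "RuntimeError: Address already in use".
second, err = start_rendezvous_server(port)
print(err)
first.close()
```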
I made sure it isn't simply old processes getting in the way by pkill-ing all Python processes on each node before testing.
I suspect it's because Megatron's mpu initializes torch.distributed for model parallelism separately from DeepSpeed, and the two initializations are interfering with each other?
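To make the hypothesis concrete: under mpirun, rank and world size only exist in OpenMPI's environment variables, so something has to translate them into what torch.distributed's env:// rendezvous reads before init_process_group runs. A rough sketch of that translation (the helper name is mine; DeepSpeed does its own MPI discovery internally, this is just the shape of it):

```python
import os

def translate_ompi_env(default_master_port=29500):
    """Hypothetical helper: map the variables mpirun sets
    (OMPI_COMM_WORLD_*) onto the ones torch.distributed's env://
    rendezvous reads (RANK, WORLD_SIZE, LOCAL_RANK)."""
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))

    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ.setdefault("MASTER_PORT", str(default_master_port))
    return rank, world_size

# With mpirun's variables present, ranks come out right...
os.environ["OMPI_COMM_WORLD_RANK"] = "3"
os.environ["OMPI_COMM_WORLD_SIZE"] = "16"
print(translate_ompi_env())  # (3, 16)

# ...but if nothing consumes these variables before
# init_process_group, the defaults above apply: every process is
# rank 0 of a world of size 1, and all of them race to start a
# rendezvous server on the same MASTER_PORT.
```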
Any advice would be much appreciated.
Sharing the mpirun command below in case it's helpful:
```
mpirun -n 16 -hostfile /job/hostfile \
  --mca btl ^openib --mca btl_tcp_if_include eth0 \
  -x UCX_TLS=tcp -x PYTHONPATH=/home/mchorse/megatron-3d \
  /usr/bin/python -u pretrain_gpt2.py \
    --model-parallel-size 1 --pipe-parallel-size 2 \
    --num-layers 12 --hidden-size 2048 --num-attention-heads 16 \
    --seq-length 1024 --max-position-embeddings 1024 \
    --batch-size 4 --gas 16 \
    --train-iters 320000 --lr-decay-iters 320000 \
    --save checkpoints/gpt2_XL_ds --load checkpoints/gpt2_XL_ds \
    --data-path /mnt/ssd-cluster/data/enron/enron_text_document \
    --vocab-file /mnt/ssd-cluster/data/gpt2-vocab.json \
    --merge-file /mnt/ssd-cluster/data/gpt2-merges.txt \
    --data-impl mmap --split 949,50,1 \
    --distributed-backend nccl \
    --lr 2.0e-4 --lr-decay-style cosine --min-lr 2.0e-5 \
    --weight-decay 0 --attention-dropout 0 --hidden-dropout 0 \
    --clip-grad 1.0 --warmup 0.01 \
    --checkpoint-activations --log-interval 1 \
    --save-interval 500 --eval-interval 100 --eval-iters 10 \
    --fp16 --tensorboard-dir tensorboard_data/l_h_2n_8g_2pp_1mp_4b_ds4 \
    --sparsity interspersed \
    --deepspeed --deepspeed_config configs/deepspeed_configs/ds_zero_stage_1_config.json \
    --zero-stage 1 --zero-reduce-bucket-size 50000000 \
    --zero-allgather-bucket-size 5000000000 \
    --zero-contigious-gradients --zero-reduce-scatter \
    --checkpoint-activations --checkpoint-num-layers 1 \
    --partition-activations --checkpoint-in-cpu \
    --synchronize-each-layer --contigious-checkpointing
```