OOM for all gather comms tests #84

Open
roywei opened this issue Aug 17, 2023 · 0 comments
roywei commented Aug 17, 2023

Hi,

I'm trying to benchmark multi-node all_gather performance using the param comms tests with buffers up to 2G, but the test OOMs at a buffer size of around 1G, while the same config works with nccl-tests. The all_reduce (AR) and reduce_scatter (RS) tests are fine, and their results are very close to nccl-tests. You can reproduce this on A100-40G / H100 clusters (p4d or p5 on AWS). Any ideas or insight would be helpful. Thank you!

PyTorch nightly with CUDA 12.1 or PyTorch 2.0.1 with CUDA 11.8

For param, I'm launching the following way:

mpirun -np $(($NUM_NODES*8)) -N 8 --hostfile $HOST_FILE \
      --tag-output \
      --oversubscribe --allow-run-as-root \
      $MPI_OPTIONS /fsx/lawei/param/train/comms/pt/comms.py \
      --master-ip ip-172-31-49-213 \
      --b 32M \
      --e 2048M \
      --n 100 \
      --z 0 \
      --backend nccl \
      --device cuda \
      --collective all_gather
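
For reference, here is a minimal standalone PyTorch sketch (my own, not taken from param's comms.py; the torchrun-style launch, NCCL backend, float32 buffers, and 2 GiB per-rank input are assumptions) showing the buffers one all_gather at the top of the sweep implies per rank, since the gathered output has to hold world_size copies of the input:

# Minimal standalone sketch (not from param): per-rank allocations for one
# all_gather at the largest size in the sweep.
# Assumptions: launched with torchrun (RANK/WORLD_SIZE/MASTER_ADDR set by the
# launcher), NCCL backend, 2 GiB float32 input per rank.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    input_bytes = 2 * 1024**3  # 2 GiB per rank, mirroring --e 2048M
    inp = torch.empty(input_bytes // 4, dtype=torch.float32, device="cuda")
    # all_gather semantics: the receive buffer holds one input-sized chunk per
    # rank, so it alone is world_size * 2 GiB on every GPU.
    out = torch.empty(world_size * inp.numel(), dtype=torch.float32, device="cuda")

    dist.all_gather_into_tensor(out, inp)
    torch.cuda.synchronize()
    if rank == 0:
        gib = out.element_size() * out.numel() / 1024**3
        print(f"per-rank all_gather receive buffer: {gib:.1f} GiB")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()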

For nccl-tests, I'm using NCCL 2.18.3 + CUDA 12.1, but older versions also work.

mpirun -np $(($NUM_NODES*8)) -N 8 --hostfile $HOST_FILE \
      --tag-output \
      --oversubscribe --allow-run-as-root \
      bash run_nccl_test.sh

and in the bash file (run_nccl_test.sh):

export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:$LD_LIBRARY_PATH
export NCCL_DEBUG=INFO
export FI_EFA_USE_DEVICE_RDMA=1
/usr/local/cuda-12.1/efa/test-cuda-12.1/all_gather_perf -b 32M -e 2048M  -n 100  -z 0 -f 2 -g 1
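
As a sanity check on what each command is asking for, this small arithmetic sketch (illustration only) prints the per-rank receive-buffer footprint for the 32M-2048M sweep under the two common conventions for an all_gather size flag, i.e. size as the per-rank input vs. size as the total gathered output; I'm not asserting which convention param and nccl-tests each use, and the 16-rank world size (2 nodes x 8 GPUs) is an assumption.

# Illustration only: per-rank all_gather receive-buffer footprint for the
# 32M..2048M sweep (factor 2) under two possible meanings of the size flag.
# Assumption: world_size = 16 (2 nodes x 8 GPUs); adjust for your cluster.
MiB, GiB = 1024 ** 2, 1024 ** 3
world_size = 16
size = 32 * MiB
while size <= 2048 * MiB:
    as_input = world_size * size  # if size means the per-rank send buffer
    as_output = size              # if size means the full gathered output
    print(f"size {size // MiB:>5} MiB -> recv buffer "
          f"{as_input / GiB:5.1f} GiB (size=input) vs {as_output / GiB:4.2f} GiB (size=output)")
    size *= 2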