Multi-GPU training hangs #217
Comments
I'm having the same problem when running on multiple nodes. I've so far figured out that it freezes, more precisely in broadcast_naive() in _communication_utility.py while trying to do mpi_comm.Bcast(buf), and my Open MPI says it has CUDA support.
I'm currently using Open MPI 3.0.0; check_cuda_aware.c returns an OK status as well.
Thanks for the reports. First, Alex, could you check if your issue is the same one? @andremoeller, as you are using 2.1.2, that's weird.
Keisuke, it does indeed very much look like that. I made a small example of Bcast from GPU memory through mpi4py and cffi, and it freezes as the message size goes over around 1K. Now, about the version: we have 2.1.2 on Tsubame 3 and it was working fine, but it turned out not to support multi-threading, which I need for some I/O stuff. So I've compiled the same version of Open MPI in userspace, and I have the same problem with it.
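For anyone trying to reproduce this, here is a minimal sketch of that kind of test. It assumes a recent mpi4py (3.1+, which accepts CuPy arrays directly via the CUDA array interface) instead of going through cffi, so it is not the exact script used above; with a broken CUDA-aware Bcast it would stall at the comm.Bcast call.

```python
# bcast_gpu_test.py -- hypothetical reproducer, not the exact script from this thread.
# Run with e.g.: mpirun -np 2 python bcast_gpu_test.py
# Requires CUDA-aware Open MPI, CuPy, and mpi4py >= 3.1 (CuPy arrays passed to Bcast).
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Pick a GPU per rank so ranks on the same node do not all share device 0.
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

# Sweep message sizes around the ~1K threshold where the hang was reported.
for n in (256, 1024, 4096, 1 << 20):
    if rank == 0:
        buf = cp.arange(n, dtype=cp.float32)
    else:
        buf = cp.empty(n, dtype=cp.float32)
    cp.cuda.Device().synchronize()   # make sure the buffer is fully materialized
    comm.Bcast(buf, root=0)          # a broken CUDA-aware Bcast hangs here
    if rank == 0:
        print("Bcast of %d floats finished" % n)
    elif int(buf[-1]) != n - 1:
        print("rank %d: wrong data for n=%d" % (rank, n))
```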
Hi Keisuke, I'm using
Thanks.
@andremoeller,
@undertherain
Is that correct? Then, hmmm. 🤔 I use 2.1.2 daily on our cluster with InfiniBand and we see no problem.
What interconnect do you use?
I'm closing the issue, but don't hesitate to re-open it if you guys still have a problem. Thanks.
I am having the same issue. It works fine on a single node but hangs on 2 (multiple) nodes on ABCI.
Hi @ankahira, can you please provide some more details, such as your Chainer/CuPy & MPI versions? It's been a while since this issue was closed.
@keisuke-umezawa I figured out the issue. Unlike Slurm, the cluster manager on ABCI doesn't specify the number of tasks to launch on each node, so it was starting all the tasks on the same node. I forced mpirun to spread the ranks across nodes using "mpirun -n 16 --map-by node --oversubscribe --hostfile".
Great, I guess you can also use the '-N' option of Open MPI, or specify the number of processes per node in the hostfile, like the sketch below.
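For reference, a minimal sketch of what such a hostfile could look like; the hostnames and slot counts are hypothetical and would have to match the actual allocation:

```
# hostfile (hypothetical hostnames): "slots" is the number of ranks allowed per node
node001 slots=4
node002 slots=4
node003 slots=4
node004 slots=4
```

You would then launch with "mpirun -n 16 --hostfile hostfile python train_mnist.py", or equivalently "mpirun -N 4 --hostfile hostfile python train_mnist.py" to force 4 processes per node.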
BTW, I'm @keisukefukuda, not keisuke-umezawa.
Hi,
I'm trying to run train_mnist.py with multiple GPUs, but training hangs indefinitely at this point:
mpirun -np 4 python train_mnist.py
I'm using CUDA 9, NCCL 2, CUDA-aware Open MPI 2.1.2, and these:
strace on the mpirun says it's just polling. Any clues as to what's going wrong, or how I can figure out more about what might be going wrong?
Thanks.
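One generic way to see where each Python rank is actually stuck (strace on mpirun itself mostly shows it polling on its child processes) is to dump Python tracebacks from inside the training script; the traceback then typically ends at the blocking MPI call, e.g. the Bcast mentioned above. A minimal sketch using only the standard library, not something from ChainerMN itself:

```python
# Hypothetical debugging snippet (not from this thread): add near the top of
# the training script. Sending SIGUSR1 to a hanging rank (kill -USR1 <pid>)
# prints the Python traceback of every thread to stderr, which shows the exact
# line the rank is blocked on. POSIX only.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or dump tracebacks automatically if the process is still running after
# 10 minutes, and keep dumping every 10 minutes after that.
faulthandler.dump_traceback_later(600, repeat=True)
```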