
Multi-GPU training hangs #217

Closed
andremoeller opened this issue Mar 3, 2018 · 14 comments

@andremoeller

andremoeller commented Mar 3, 2018

Hi,

I'm trying to run train_mnist.py with multiple GPUs, but training hangs indefinitely at this point:

mpirun -np 4 python train_mnist.py

Num process (COMM_WORLD): 4
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time

I'm using CUDA 9, NCCL 2, CUDA-aware Open MPI 2.1.2, and these packages:

cupy-cuda90==4.0.0b4
chainer==4.0.0b4
chainercv==0.8.0
chainermn==1.2.0

strace on the mpirun says it's just polling:

write(1, "epoch main/loss validati"..., 100epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
) = 100
clock_gettime(CLOCK_MONOTONIC, {340982, 485569071}) = 0
gettimeofday({1520035802, 195603}, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=35, events=POLLIN}, {fd=30, events=POLLIN}, {fd=32, events=POLLIN}, {fd=27, events=POLLIN}, {fd=34, events=POLLIN}, {fd=36, events=POLLIN}, {fd=0, events=POLLIN}, {fd=31, events=POLLIN}, {fd=26, events=POLLIN}], 13, -1

Any clues as to what's going wrong, or how I can figure out more about what might be going wrong?

Thanks.

@andremoeller andremoeller changed the title Multi-GPU training hangs with Multi-GPU training hangs Mar 3, 2018
@undertherain

I'm having the same problem when running on multiple nodes.
Multiple GPUs on one node work fine, but across two nodes it freezes in the training loop.
On one system I'm switching between different MPIs, and only one of them works (I made sure to reinstall all the related Python packages without the pip cache after switching to a new MPI implementation). On another system, after a software update, I can't find any MPI that works ><

So far I've figured out that it freezes in self.communicator.broadcast_data(target) of _MultiNodeOptimizer after processing the first batch.

@undertherain

undertherain commented Mar 3, 2018

More precisely, it freezes in broadcast_naive() in _communication_utility.py while trying to do mpi_comm.Bcast(buf), buf being a tuple of a cffi backend buffer and an MPI type:
https://github.com/chainer/chainermn/blob/master/chainermn/communicators/_communication_utility.py#L81

@undertherain

undertherain commented Mar 3, 2018

And my Open MPI says it has CUDA support:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true

Currently using Open MPI 3.0.0.

check_cuda_aware.c returns an OK status as well.

@keisukefukuda keisukefukuda self-assigned this Mar 4, 2018
@keisukefukuda
Member

Thanks for the reports.

First, Alex, could you check whether your issue is
open-mpi/ompi#3972
That issue has been open for more than half a year, and I have only just started investigating it myself.
Hence we recommend using Open MPI 2.1.2.

@andremoeller, since you are already using 2.1.2, that's weird.
Which version of ChainerMN are you using? The 1.2 release, or master?

@undertherain

undertherain commented Mar 4, 2018

Keisuke, it does indeed look very much like that. I made a small example of Bcast from GPU memory through mpi4py and cffi, and it freezes once the message size goes over around 1K.
I will check their sample a bit later to rule out mpi4py's influence, but I'm 99% sure it's that Open MPI issue.
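
Roughly along these lines, from memory (not the exact script; here the device buffer comes from CuPy rather than raw cffi, and it assumes an mpi4py new enough to accept CuPy arrays directly):

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
cp.cuda.Device(comm.rank % cp.cuda.runtime.getDeviceCount()).use()

# sweep message sizes; it stops printing somewhere around the 1K mark for me
for n in (256, 1024, 4096, 16384, 65536):
    buf = cp.full(n, comm.rank, dtype=cp.float32)
    comm.Bcast(buf, root=0)          # broadcast rank 0's device buffer to everyone
    cp.cuda.Stream.null.synchronize()
    if comm.rank != 0:
        assert float(buf[0]) == 0.0  # everyone should now hold rank 0's data
    if comm.rank == 0:
        print('Bcast of', n, 'floats OK')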

Now about the version: we have 2.1.2 on Tsubame 3 and it was working fine, but it turned out not to support multi-threading, which I need for some I/O work. So I compiled the same version of Open MPI in userspace, and with that build I get the same hang.

@andremoeller
Author

Hi Keisuke, I'm using

cupy-cuda90==4.0.0b4
chainer==4.0.0b4
chainercv==0.8.0
chainermn==1.2.0

Thanks.

@keisukefukuda
Member

@andremoeller ,
Oh, sorry, I missed it in your first comment. Thanks for the info.

@keisukefukuda
Member

keisukefukuda commented Mar 5, 2018

@undertherain
I understand that

  • The system's default Open MPI 2.1.2 is not compiled to support multithreading
  • You compiled 2.1.2 yourself and it has the same problem

Is that correct?

Then, hmmm. 🤔 I use 2.1.2 daily on our cluster with InfiniBand and we see no problem.
Can you confirm that your program hangs on Allreduce? Does it reproduce with a very simple test program?
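
For instance, something like this (an untested sketch; it assumes CuPy arrays and an mpi4py that accepts them directly, so adapt it to however you allocate your GPU buffers):

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
cp.cuda.Device(comm.rank % cp.cuda.runtime.getDeviceCount()).use()

sendbuf = cp.ones(1 << 20, dtype=cp.float32)
recvbuf = cp.empty_like(sendbuf)
comm.Allreduce(sendbuf, recvbuf)     # if this alone hangs across nodes, the problem is below ChainerMN
print(comm.rank, float(recvbuf[0]))  # every rank should print the total number of ranks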

@keisukefukuda
Member

@andremoeller,

What interconnect do you use?
I guess it's InfiniBand, because you use NCCL.
If so, could you try the pure_nccl communicator?
It should solve the problem if MPI_Allreduce is what's hanging.
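
If your copy of train_mnist.py doesn't already expose a communicator option, creating it directly looks roughly like this (a sketch, not tested against your setup; the Linear model is just a stand-in for the example's MLP):

import chainer
import chainer.links as L
import chainermn

comm = chainermn.create_communicator('pure_nccl')    # gradient allreduce runs over NCCL, bypassing MPI_Allreduce
device = comm.intra_rank                             # one GPU per local rank

model = L.Classifier(L.Linear(784, 10))              # stand-in model; use the MLP from train_mnist.py
chainer.cuda.get_device_from_id(device).use()
model.to_gpu()

optimizer = chainermn.create_multi_node_optimizer(chainer.optimizers.Adam(), comm)
optimizer.setup(model)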

@keisukefukuda
Member

I'm closing the issue, but don't hesitate to re-open it if you still have a problem. Thanks.

@ankahira

ankahira commented Sep 3, 2019

I am having the same issue. It works fine on a single node but hangs on two (or more) nodes on ABCI.

@keisukefukuda
Member

Hi @ankahira, can you please provide some more details, such as your Chainer/CuPy and MPI versions? It's been a while since this issue was closed.
Thanks!

@ankahira

ankahira commented Sep 4, 2019

@keisuke-umezawa I figured out the issue. Unlike Slurm, the cluster manager on ABCI doesn't specify the number of tasks to launch on each node, so it was starting all the tasks on the same node. I forced mpirun to start the ranks on different nodes using "mpirun -n 16 --map-by node --oversubscribe --hostfile"

@keisukefukuda
Member

Great. I guess you can also use the '-N' option of Open MPI, or specify the number of processes per node in the hostfile, like:

hostA slots=8
hostB slots=8
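
With a hostfile like that (assuming it is saved as "hosts" and the slots match your GPUs per node), the launch is then simply:

mpirun -np 16 --hostfile hosts python train_mnist.py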

BTW, I'm @keisukefukuda, not keisuke-umezawa.
