
Multi-GPU training hangs #217

Closed
andremoeller opened this issue Mar 3, 2018 · 14 comments

@andremoeller

andremoeller commented Mar 3, 2018

Hi,

I'm trying to run train_mnist.py with multiple GPUs, but training hangs indefinitely at this point:

mpirun -np 4 python train_mnist.py

Num process (COMM_WORLD): 4
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time

I'm using CUDA 9, NCCL 2, CUDA-aware Open MPI 2.1.2, and these packages:

cupy-cuda90==4.0.0b4
chainer==4.0.0b4
chainercv==0.8.0
chainermn==1.2.0

strace on the mpirun says it's just polling:

write(1, "epoch main/loss validati"..., 100epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
) = 100
clock_gettime(CLOCK_MONOTONIC, {340982, 485569071}) = 0
gettimeofday({1520035802, 195603}, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=35, events=POLLIN}, {fd=30, events=POLLIN}, {fd=32, events=POLLIN}, {fd=27, events=POLLIN}, {fd=34, events=POLLIN}, {fd=36, events=POLLIN}, {fd=0, events=POLLIN}, {fd=31, events=POLLIN}, {fd=26, events=POLLIN}], 13, -1

Any clues as to what's going wrong, or how I can figure out more about what might be going wrong?

Thanks.

@andremoeller andremoeller changed the title Multi-GPU training hangs with Multi-GPU training hangs Mar 3, 2018
@undertherain

I'm having the same problem when running on multiple nodes.
Multiple GPUs on one node work fine, but across two nodes it freezes in the training loop.
On one system I'm switching between different MPIs, and only one of them works (I made sure to reinstall all the related Python packages without the pip cache after switching to a new MPI implementation). On another system, after a software update, I can't find any MPI that works ><

So far I've figured out that it freezes in self.communicator.broadcast_data(target) of _MultiNodeOptimizer after processing the first batch.

@undertherain

undertherain commented Mar 3, 2018

More precisely, it freezes in broadcast_naive() in _communication_utility.py while trying to do mpi_comm.Bcast(buf), buf being a tuple of a cffi backend buffer and an MPI type:
https://github.com/chainer/chainermn/blob/master/chainermn/communicators/_communication_utility.py#L81

@undertherain

undertherain commented Mar 3, 2018

And my Open MPI says it has CUDA support:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true

Currently using Open MPI 3.0.0.

check_cuda_aware.c returns an OK status as well.

@keisukefukuda keisukefukuda self-assigned this Mar 4, 2018
@keisukefukuda
Member

Thanks for the reports.

First, Alex, could you check whether your issue is
open-mpi/ompi#3972
That issue has been open for more than half a year, and I have only just started investigating it myself.
Hence we recommend using Open MPI 2.1.2.

@andremoeller, since you are already using 2.1.2, that's weird.
Which version of ChainerMN are you using? The 1.2 release, or master?

@undertherain

undertherain commented Mar 4, 2018

Keisuke, it does indeed look very much like that. I made a small example of Bcast from GPU memory through mpi4py and cffi, and it freezes once the message size goes over around 1K.
I will check their sample a bit later to rule out mpi4py's influence, but I'm 99% sure it's that Open MPI issue.
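
Roughly along these lines, from memory (not the exact script; here the device buffer comes from CuPy rather than raw cffi, and it assumes an mpi4py new enough to accept CuPy arrays directly):

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
cp.cuda.Device(comm.rank % cp.cuda.runtime.getDeviceCount()).use()

# sweep message sizes; it stops printing somewhere around the 1K mark for me
for n in (256, 1024, 4096, 16384, 65536):
    buf = cp.full(n, comm.rank, dtype=cp.float32)
    comm.Bcast(buf, root=0)          # broadcast rank 0's device buffer to everyone
    cp.cuda.Stream.null.synchronize()
    if comm.rank != 0:
        assert float(buf[0]) == 0.0  # everyone should now hold rank 0's data
    if comm.rank == 0:
        print('Bcast of', n, 'floats OK')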

Now about the version: we have 2.1.2 on Tsubame 3 and it was working fine, but it turned out not to support multi-threading, which I need for some I/O work. So I compiled the same version of Open MPI in userspace, and with that build I get the same hang.

@andremoeller
Author

Hi Keisuke, I'm using

cupy-cuda90==4.0.0b4
chainer==4.0.0b4
chainercv==0.8.0
chainermn==1.2.0

Thanks.

@keisukefukuda
Member

@andremoeller ,
Oh, sorry, I missed it in your first comment. Thanks for the info.

@keisukefukuda
Member

keisukefukuda commented Mar 5, 2018

@undertherain
I understand that

  • The system's default Open MPI 2.1.2 is not compiled to support multithreading
  • You compiled 2.1.2 yourself and it has the same problem

Is that correct?

Then, hmmm. 🤔 I use 2.1.2 daily on our cluster with InfiniBand and we see no problem.
Can you confirm that your program hangs on Allreduce? Does it reproduce with a very simple test program?
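
For instance, something like this (an untested sketch; it assumes CuPy arrays and an mpi4py that accepts them directly, so adapt it to however you allocate your GPU buffers):

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
cp.cuda.Device(comm.rank % cp.cuda.runtime.getDeviceCount()).use()

sendbuf = cp.ones(1 << 20, dtype=cp.float32)
recvbuf = cp.empty_like(sendbuf)
comm.Allreduce(sendbuf, recvbuf)     # if this alone hangs across nodes, the problem is below ChainerMN
print(comm.rank, float(recvbuf[0]))  # every rank should print the total number of ranks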

@keisukefukuda
Member

@andremoeller,

What interconnect do you use?
I guess it's InfiniBand, because you use NCCL.
If so, could you try the pure_nccl communicator?
It should solve the problem if MPI_Allreduce is what's hanging.
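
If your copy of train_mnist.py doesn't already expose a communicator option, creating it directly looks roughly like this (a sketch, not tested against your setup; the Linear model is just a stand-in for the example's MLP):

import chainer
import chainer.links as L
import chainermn

comm = chainermn.create_communicator('pure_nccl')    # gradient allreduce runs over NCCL, bypassing MPI_Allreduce
device = comm.intra_rank                             # one GPU per local rank

model = L.Classifier(L.Linear(784, 10))              # stand-in model; use the MLP from train_mnist.py
chainer.cuda.get_device_from_id(device).use()
model.to_gpu()

optimizer = chainermn.create_multi_node_optimizer(chainer.optimizers.Adam(), comm)
optimizer.setup(model)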

@keisukefukuda
Member

I'm closing the issue, but don't hesitate to re-open it if you still have a problem. Thanks.

@ankahira

ankahira commented Sep 3, 2019

I am having the same issue. It works fine on a single node but hangs on two (or more) nodes on ABCI.

@keisukefukuda
Member

Hi @ankahira, can you please provide some more details, such as your Chainer/CuPy and MPI versions? It's been a while since this issue was closed.
Thanks!

@ankahira

ankahira commented Sep 4, 2019

@keisuke-umezawa I figured out the issue. Unlike Slurm, the cluster manager on ABCI doesn't specify the number of tasks to launch on each node, so it was starting all the tasks on the same node. I forced mpirun to start the ranks on different nodes using "mpirun -n 16 --map-by node --oversubscribe --hostfile"

@keisukefukuda
Member

Great. I guess you can also use the '-N' option of Open MPI, or specify the number of processes per node in the hostfile, like:

hostA slots=8
hostB slots=8
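
With a hostfile like that (assuming it is saved as "hosts" and the slots match your GPUs per node), the launch is then simply:

mpirun -np 16 --hostfile hosts python train_mnist.py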

BTW, I'm @keisukefukuda, not keisuke-umezawa.
