Distributed training over Infiniband #1623
Can you check two things:
https://github.com/dmlc/ps-lite/blob/ca2a28e27a6d3b305d14222f5aa44d419a1a8c14/src/van.cc#L52
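As a quick sanity check (assuming the iproute2 tools are available on the nodes), you can print the IPv4 address that should be resolved for ib0:

```bash
# print the IPv4 address assigned to ib0; this is the address ps-lite
# should pick up when DMLC_INTERFACE=ib0 is set
ip -4 -o addr show dev ib0 | awk '{print $4}'
```

If this prints a 10.202.x.x address while van.cc reports a 10.3.x.x one, the interface lookup is not actually picking up ib0.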
Thanks for the answer. Please see the output below.

ifconfig on the participating nodes:

```
1st node:
[valeriuc@gcn8 ~]$ ifconfig
ib0   Link encap:InfiniBand  HWaddr A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00

2nd node:
[valeriuc@gcn9 ~]$ ifconfig
ib0   Link encap:InfiniBand  HWaddr A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
```

ping between the nodes:

```
[valeriuc@gcn9 ~]$ ping 10.202.203.89
```

print from van.cc:

```
[00:02:19] src/van.cc:75: my_node_.hostname = 10.3.200.82 my_node_.port = 9573
```

I have printed my_node_.hostname here: https://github.com/dmlc/ps-lite/blob/ca2a28e27a6d3b305d14222f5aa44d419a1a8c14/src/van.cc#L74 and, as you can see, it is different from the one at L52. Do you know why this is? Also, I have printed my_node_.hostname and node.hostname in the loop here: https://github.com/dmlc/ps-lite/blob/ca2a28e27a6d3b305d14222f5aa44d419a1a8c14/src/van.cc#L314

Any ideas?
It looks normal to me, so I think all the nodes are connected. Next, can you try to print at the end of
Here it is:

```
[00:50:50] src/van.cc:75: my_node_.hostname = 10.3.200.82 my_node_.port = 9465
```

And with DMLC_INTERFACE=eth0 up to the same point (after hitting Barrier):

```
[00:54:08] src/van.cc:75: my_node_.hostname = 10.3.200.82 my_node_.port = 9890
[00:54:09] [00:54:09] src/van.cc:327: node.hostname = 10.3.200.82 node.port = src/van.cc:325: my_node.hostname = 327: node.hostname = 10.3.200.105 node.port = 44416 i= 1
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.106: my_node.hostname = 10.3.200.106 my_node.port = 47398 i= 2 node.port = 48295 i= 3
[00:54:09] [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 47398 src/van.cc:327: i= 3
```

(The output is interleaved because several processes write to it concurrently.) The run with eth0 works successfully. Any ideas where I should look further? Thanks!
I'll try to add a debug option there this weekend, so you will see all connection activities.
Can you try with PS_VERBOSE? You need to update ps-lite to the newest version first (cd ps-lite; git pull) and then rebuild mxnet (make clean; make). See the document: http://ps-lite.readthedocs.org/en/latest/how_to.html#debug-ps-lite
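Concretely, the sequence would look something like this (run from the mxnet root; the checkout layout is assumed):

```bash
# update ps-lite so the new logging is included, then rebuild mxnet
cd ps-lite
git pull
cd ..
make clean
make

# enable verbose ps-lite logging before relaunching the job
export PS_VERBOSE=1
```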
I'm running now with PS_VERBOSE=2. The run on InfiniBand:

```
[22:17:30] src/van.cc:76: Node Info: role=scheduler id=1, ip=10.3.200.83, port=9177
```

The run on ethernet up to the same point:

```
[22:18:04] src/van.cc:76: Node Info: role=scheduler id=1, ip=10.3.200.83, port=9561
```

Thanks for the help!
It seems that sending data over ib0 is failing. You can double-check this by pulling the recent ps-lite, where receiving is also logged. I also added a debug option. If you start your job on gcn8 with ib0, can you add that option?
Yes, it seems that this is the case. I started the job directly from a compute node (before, I was starting it from a different "scheduler" node) and it gets a bit further. But it only proceeds for the server/worker placed on the node that launches the job (10.3.200.92 in this case).

```
[00:27:41] src/van.cc:76: Bind to role=scheduler id=1, ip=10.3.200.92, port=9876
```

It seems strange, though... we have many MPI programs running over ib0. Any ideas?
The problem seems to be that the scheduler, which uses eth0, failed to send. Can you try the ib0 address for the scheduler?
|
But in the last test the scheduler is on the compute node, so on ib0, and it says:

```
[00:27:42] src/van.cc:366: W[11] is connected to others
```

So I suppose it misses S[8] and W[9], which should sit on the other IP. But why? IB bandwidth/latency tests between the nodes work without any problems.
The scheduler still uses eth0, whose IP is the one being bound here:

```
[00:27:41] src/van.cc:76: Bind to role=scheduler id=1, ip=10.3.200.92,
```
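One thing to try, assuming the scheduler address is set through the usual ps-lite environment variables, is to force the root URI to the ib0 address from your logs:

```bash
# point the scheduler at the IPoIB address instead of the eth0 one;
# 10.202.203.92 is the ib0 address of the launching node taken from the
# logs in this thread, and the port is just an example of a free port
export DMLC_PS_ROOT_URI=10.202.203.92
export DMLC_PS_ROOT_PORT=9091
```

All workers and servers should then connect to the scheduler over the IPoIB network.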
|
Yes, now when it uses ib0 it freezes earlier:

```
[01:13:56] src/van.cc:76: Bind to role=scheduler id=1, ip=10.202.203.92, port=9984
```
That means ZMQ failed to send data over InfiniBand with the TCP protocol. Can you try
I don't have InfiniBand at hand, so I'm not sure if the above solution works...
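In the meantime, a minimal way to isolate whether plain TCP works over the IPoIB addresses at all, independent of zmq (assuming nc is installed; flags vary slightly between netcat variants):

```bash
# on gcn9 (receiver): listen on an arbitrary free port
nc -l 9999

# on gcn8 (sender): send a line to gcn9's ib0 address from the logs above
echo hello | nc 10.202.203.89 9999
```

If "hello" arrives, raw TCP over IPoIB is fine and the problem is specific to how zmq/ps-lite uses it.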
Hi, I met the same issue. Did you succeed in training over IB? @vcodreanu
I am experiencing some issues when using distributed training in combination with DMLC_INTERFACE="ib0". I have successfully trained many models using the default DMLC_INTERFACE (eth0), but I've hit bandwidth limits on some large models and therefore tried the InfiniBand option.
I am using the dmlc_mpi launcher (an example invocation is sketched below) and the output is as follows:
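For reference, the invocation looks roughly like this; the tracker path, process counts, and training script are illustrative placeholders, not the exact job from this report:

```bash
# hypothetical launch with the dmlc mpi tracker; ib0 is selected for the
# ps-lite data plane via DMLC_INTERFACE
DMLC_INTERFACE=ib0 \
python tracker/dmlc_mpi.py -n 4 -s 4 \
    python train.py --kv-store dist_sync
```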
With DMLC_INTERFACE not set mxnet starts successfully:
With DMLC_INTERFACE="ib0" mxnet freezes:
And on the computing nodes I see processes like:
When I Ctrl-C the process I get:
Our system has multiple NICs (ib0 and ib1) and I get the same behavior when setting either of them. It also happens with both kv-stores: dist_sync and dist_async.
Could you advise me on what to try? Or what can I do to get more verbose debug information from mxnet?