Distributed training over Infiniband #1623

Closed
vcodreanu opened this issue Mar 11, 2016 · 17 comments

@vcodreanu

I am experiencing some issues when using distributed training in combination with DMLC_INTERFACE="ib0". I have successfully trained many models using the default DMLC_INTERFACE (eth0), but I've hit some bandwidth limits on some large models and thus tried the Infiniband option.

I am using the dmlc_mpi launcher and the output is as follows:

With DMLC_INTERFACE not set, MXNet starts successfully:

Currently Loaded Modulefiles:
 1) cudnn/7.0-v4-prod            4) cuda/7.5.18                 
 2) python/2.7.9                 5) opencv/gnu/2.4.10           
 3) mxnet/2016.02.23             6) mpi/mvapich2-gdr/2.1-cuda75 
mpirun -n 4  -env DMLC_ROLE server -env DMLC_PS_ROOT_PORT 9894 -env DMLC_PS_ROOT_URI 10.3.200.83 -env DMLC_NUM_SERVER 4 -env DMLC_NUM_WORKER 4 --hostfile /home/valeriuc/work/mxnet/example/image-classification/hosts python /home/valeriuc/work/mxnet/example/image-classification/train_imagenet_full.py --batch-size 96 --lr 0.05 --lr-factor .94 --gpus 0,1 --kv-store dist_sync --data-dir /projects/2/managed_datasets/imagenet-full --network inception-v3-full --model-prefix model/ilsvrc21k-8n
mpirun -n 4  -env DMLC_ROLE worker -env DMLC_PS_ROOT_PORT 9894 -env DMLC_PS_ROOT_URI 10.3.200.83 -env DMLC_NUM_SERVER 4 -env DMLC_NUM_WORKER 4 --hostfile /home/valeriuc/work/mxnet/example/image-classification/hosts python /home/valeriuc/work/mxnet/example/image-classification/train_imagenet_full.py --batch-size 96 --lr 0.05 --lr-factor .94 --gpus 0,1 --kv-store dist_sync --data-dir /projects/2/managed_datasets/imagenet-full --network inception-v3-full --model-prefix model/ilsvrc21k-8n
2016-03-11 10:48:16,220 Node[1] start with arguments Namespace(batch_size=96, clip_gradient=5.0, data_dir='/projects/2/managed_datasets/imagenet-full', data_shape=299, gpus='0,1', kv_store='dist_sync', load_epoch=None, log_dir='/tmp/', log_file=None, lr=0.05, lr_factor=0.94, lr_factor_epoch=1, model_prefix='model/ilsvrc21k-8n', network='inception-v3-full', num_classes=21841, num_epochs=20, num_examples=14192019, train_dataset='train.rec', val_dataset='val.rec')
2016-03-11 10:48:16,221 Node[3] start with arguments Namespace(batch_size=96, clip_gradient=5.0, data_dir='/projects/2/managed_datasets/imagenet-full', data_shape=299, gpus='0,1', kv_store='dist_sync', load_epoch=None, log_dir='/tmp/', log_file=None, lr=0.05, lr_factor=0.94, lr_factor_epoch=1, model_prefix='model/ilsvrc21k-8n', network='inception-v3-full', num_classes=21841, num_epochs=20, num_examples=14192019, train_dataset='train.rec', val_dataset='val.rec')
2016-03-11 10:48:16,220 Node[2] start with arguments Namespace(batch_size=96, clip_gradient=5.0, data_dir='/projects/2/managed_datasets/imagenet-full', data_shape=299, gpus='0,1', kv_store='dist_sync', load_epoch=None, log_dir='/tmp/', log_file=None, lr=0.05, lr_factor=0.94, lr_factor_epoch=1, model_prefix='model/ilsvrc21k-8n', network='inception-v3-full', num_classes=21841, num_epochs=20, num_examples=14192019, train_dataset='train.rec', val_dataset='val.rec')
2016-03-11 10:48:16,221 Node[0] start with arguments Namespace(batch_size=96, clip_gradient=5.0, data_dir='/projects/2/managed_datasets/imagenet-full', data_shape=299, gpus='0,1', kv_store='dist_sync', load_epoch=None, log_dir='/tmp/', log_file=None, lr=0.05, lr_factor=0.94, lr_factor_epoch=1, model_prefix='model/ilsvrc21k-8n', network='inception-v3-full', num_classes=21841, num_epochs=20, num_examples=14192019, train_dataset='train.rec', val_dataset='val.rec')
[10:48:16] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/train.rec, use 1 threads for decoding..
[10:48:16] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/train.rec, use 1 threads for decoding..
[10:48:16] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/train.rec, use 1 threads for decoding..
[10:48:16] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/train.rec, use 1 threads for decoding..
[10:48:17] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/val.rec, use 1 threads for decoding..
[10:48:17] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/val.rec, use 1 threads for decoding..
[10:48:17] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/val.rec, use 1 threads for decoding..
[10:48:17] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/val.rec, use 1 threads for decoding..
2016-03-11 10:48:18,321 Node[1] Start training with [gpu(0), gpu(1)]
2016-03-11 10:48:18,351 Node[2] Start training with [gpu(0), gpu(1)]
2016-03-11 10:48:18,359 Node[3] Start training with [gpu(0), gpu(1)]
2016-03-11 10:48:18,396 Node[0] Start training with [gpu(0), gpu(1)]

With DMLC_INTERFACE="ib0", MXNet freezes:

Currently Loaded Modulefiles:
 1) cudnn/7.0-v4-prod            4) cuda/7.5.18                 
 2) python/2.7.9                 5) opencv/gnu/2.4.10           
 3) mxnet/2016.02.23             6) mpi/mvapich2-gdr/2.1-cuda75 
mpirun -n 4  -env DMLC_ROLE server -env DMLC_PS_ROOT_PORT 9986 -env DMLC_PS_ROOT_URI 10.3.200.83 -env DMLC_NUM_SERVER 4 -env DMLC_NUM_WORKER 4 --hostfile /home/valeriuc/work/mxnet/example/image-classification/hosts python /home/valeriuc/work/mxnet/example/image-classification/train_imagenet_full.py --batch-size 96 --lr 0.05 --lr-factor .94 --gpus 0,1 --kv-store dist_sync --data-dir /projects/2/managed_datasets/imagenet-full --network inception-v3-full --model-prefix model/ilsvrc21k-8n
mpirun -n 4  -env DMLC_ROLE worker -env DMLC_PS_ROOT_PORT 9986 -env DMLC_PS_ROOT_URI 10.3.200.83 -env DMLC_NUM_SERVER 4 -env DMLC_NUM_WORKER 4 --hostfile /home/valeriuc/work/mxnet/example/image-classification/hosts python /home/valeriuc/work/mxnet/example/image-classification/train_imagenet_full.py --batch-size 96 --lr 0.05 --lr-factor .94 --gpus 0,1 --kv-store dist_sync --data-dir /projects/2/managed_datasets/imagenet-full --network inception-v3-full --model-prefix model/ilsvrc21k-8n

On the compute nodes I see processes like:

/hpc/sw/mvapich2-gdr-2.1-cuda-7.5-intel/bin/hydra_pmi_proxy --control-port gcn2:51087 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
valeriuc  68642  68543  1 10:45 ?        00:00:00           python /home/valeriuc/work/mxnet/example/image-classification/train_imagenet_full.py --batch-size 96 --lr 0.05 --lr-factor .94 --gpus 0,1 --kv-store dist_sync --data-dir /projects/2/managed_datasets/imagenet-full --network inception-v3-full --model-prefix model/ilsvrc21k-8n

When I Ctrl-C the process I get:

Press Ctrl-C again to force abort
Traceback (most recent call last):
  File "dmlc_slurm.py", line 92, in <module>
    pscmd=(' '.join(args.command) + ' ' + ' '.join(unknown)))
  File "/hpc/sw/mxnet-2016.02.23/tracker/tracker.py", line 424, in submit
    pserver.join()
  File "/hpc/sw/mxnet-2016.02.23/tracker/tracker.py", line 358, in join
    self.thread.join(100)
  File "/hpc/sw/python-2.7.9/lib/python2.7/threading.py", line 960, in join
    self.__block.wait(delay)
  File "/hpc/sw/python-2.7.9/lib/python2.7/threading.py", line 359, in wait
    _sleep(delay)
KeyboardInterrupt

Our system has multiple IB NICs (ib0 and ib1), and I get the same behavior with either of them. It also happens with both kv-stores: dist_sync and dist_async.

Could you advise me on what to try, or on how to get more verbose debug information from MXNet?

@piiswrong
Contributor

@mli

@mli
Contributor

mli commented Mar 16, 2016

can you check two things:

  1. check that ib0 is working, i.e. get the IP of ib0 on one machine, then ping that IP from another machine.
  2. check whether ps-lite picks up ib0's IP correctly (a standalone lookup sketch follows below); you can print my_node_.hostname and my_node_.port at

https://github.com/dmlc/ps-lite/blob/ca2a28e27a6d3b305d14222f5aa44d419a1a8c14/src/van.cc#L52
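
As a quick standalone cross-check of item 2, the IPv4 address that DMLC_INTERFACE=ib0 should resolve to can be printed with a small program like the sketch below (illustrative only, not ps-lite code; getifaddrs is just one way to do the lookup):

#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <cstdio>
#include <cstring>

// Print the IPv4 address bound to a named interface (default: ib0). Comparing
// this with the hostname printed from van.cc shows whether ps-lite picked up
// the InfiniBand address or fell back to eth0.
int main(int argc, char** argv) {
  const char* iface = argc > 1 ? argv[1] : "ib0";
  ifaddrs* ifas = nullptr;
  if (getifaddrs(&ifas) != 0) { perror("getifaddrs"); return 1; }
  for (ifaddrs* ifa = ifas; ifa != nullptr; ifa = ifa->ifa_next) {
    if (ifa->ifa_addr == nullptr || ifa->ifa_addr->sa_family != AF_INET) continue;
    if (std::strcmp(ifa->ifa_name, iface) != 0) continue;
    char ip[INET_ADDRSTRLEN];
    const auto* sin = reinterpret_cast<const sockaddr_in*>(ifa->ifa_addr);
    inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof(ip));
    std::printf("%s -> %s\n", iface, ip);
  }
  freeifaddrs(ifas);
  return 0;
}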

@vcodreanu
Author

thanks for the answer.

please see the output below:

ifconfig on the participating nodes:

1st node

[valeriuc@gcn8 ~]$ ifconfig
eth0 Link encap:Ethernet HWaddr 08:00:38:3A:7F:D4
inet addr:10.3.200.89 Bcast:10.3.207.255 Mask:255.255.248.0
inet6 addr: fe80::a00:38ff:fe3a:7fd4/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:317638866 errors:0 dropped:0 overruns:184 frame:0
TX packets:640769988 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:388633126378 (361.9 GiB) TX bytes:910276194456 (847.7 GiB)
Memory:92180000-921fffff

ib0 Link encap:InfiniBand HWaddr A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.202.203.89 Bcast:10.202.255.255 Mask:255.255.0.0
inet6 addr: fe80::a00:3800:13a:7fd7/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:632300 errors:0 dropped:0 overruns:0 frame:0
TX packets:631017 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:37926660 (36.1 MiB) TX bytes:27893020 (26.6 MiB)

2nd node

[valeriuc@gcn9 ~]$ ifconfig
eth0 Link encap:Ethernet HWaddr 08:00:38:3A:7F:F8
inet addr:10.3.200.90 Bcast:10.3.207.255 Mask:255.255.248.0
inet6 addr: fe80::a00:38ff:fe3a:7ff8/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:73720790 errors:0 dropped:0 overruns:3 frame:0
TX packets:573607916 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:46065774320 (42.9 GiB) TX bytes:854911119840 (796.1 GiB)
Memory:92180000-921fffff

ib0 Link encap:InfiniBand HWaddr A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.202.203.90 Bcast:10.202.255.255 Mask:255.255.0.0
inet6 addr: fe80::a00:3800:13c:ed44/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:22670 errors:0 dropped:0 overruns:0 frame:0
TX packets:21725 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:1352460 (1.2 MiB) TX bytes:1032140 (1007.9 KiB)

ping between the nodes:

[valeriuc@gcn9 ~]$ ping 10.202.203.89
PING 10.202.203.89 (10.202.203.89) 56(84) bytes of data.
64 bytes from 10.202.203.89: icmp_seq=1 ttl=64 time=5.67 ms
64 bytes from 10.202.203.89: icmp_seq=2 ttl=64 time=0.063 ms
64 bytes from 10.202.203.89: icmp_seq=3 ttl=64 time=0.048 ms
64 bytes from 10.202.203.89: icmp_seq=4 ttl=64 time=0.065 ms
64 bytes from 10.202.203.89: icmp_seq=5 ttl=64 time=0.048 ms

print from van.cc:

[00:02:19] src/van.cc:75: my_node_.hostname = 10.3.200.82 my_node_.port = 9573
[00:02:20] src/van.cc:52: my_node_.hostname = 10.202.203.89 my_node_.port = 55631
[00:02:20] src/van.cc:75: my_node_.hostname = 10.202.203.89 my_node_.port = 55631
[00:02:20] src/van.cc:319: my_node.hostname = 10.3.200.82 my_node.port = 9573 i= 0
[00:02:20] src/van.cc:321: node.hostname = 10.202.203.89 node.port = 55631 i= 0
[00:02:20] src/van.cc:52: my_node_.hostname = 10.202.203.89 my_node_.port = 38399
[00:02:20] src/van.cc:75: my_node_.hostname = 10.202.203.89 my_node_.port = 38399
[00:02:20] src/van.cc:319: my_node.hostname = 10.3.200.82 my_node.port = 9573 i= 0
[00:02:20] src/van.cc:321: node.hostname = 10.202.203.89 node.port = 38399 i= 0
[00:02:21] src/van.cc:52: my_node_.hostname = 10.202.203.90 my_node_.port = 38179
[00:02:21] src/van.cc:75: my_node_.hostname = 10.202.203.90 my_node_.port = 38179
[00:02:21] src/van.cc:319: my_node.hostname = 10.3.200.82 my_node.port = 9573 i= 0
[00:02:21] src/van.cc:321: node.hostname = 10.202.203.90 node.port = 38179 i= 0
[00:02:21] src/van.cc:52: my_node_.hostname = 10.202.203.90 my_node_.port = 40717
[00:02:21] src/van.cc:75: my_node_.hostname = 10.202.203.90 my_node_.port = 40717
[00:02:21] src/van.cc:319: my_node.hostname = 10.3.200.82 my_node.port = 9573 i= 0
[00:02:21] src/van.cc:321: node.hostname = 10.202.203.90 node.port = 40717 i= 0

I have printed my_node_.hostname here: https://github.com/dmlc/ps-lite/blob/ca2a28e27a6d3b305d14222f5aa44d419a1a8c14/src/van.cc#L74 and, as you can see, it's different from the one at L52. Do you know why that is?

Also, I printed my_node_.hostname and node.hostname in the loop here: https://github.com/dmlc/ps-lite/blob/ca2a28e27a6d3b305d14222f5aa44d419a1a8c14/src/van.cc#L314

Any ideas?

@mli
Contributor

mli commented Mar 16, 2016

it looks normal to me, so i think all the nodes are connected.

next, can you try printing the number of bytes sent and received at the end of Send_() and Recv()? we need to check whether a node can send data from its ib0 to another node's ib0.
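
A minimal sketch of that kind of logging, using plain ZeroMQ send/receive on an inproc endpoint rather than anything ps-lite-specific (socket types and endpoint here are assumptions for illustration):

#include <zmq.h>
#include <cstdio>
#include <cstring>

int main() {
  void* ctx = zmq_ctx_new();
  void* pull = zmq_socket(ctx, ZMQ_PULL);
  void* push = zmq_socket(ctx, ZMQ_PUSH);
  zmq_bind(pull, "inproc://demo");       // bind before connect for inproc
  zmq_connect(push, "inproc://demo");

  const char* payload = "hello from one node to another";
  int send_bytes = zmq_send(push, payload, std::strlen(payload), 0);
  std::printf("send_bytes = %d\n", send_bytes);   // what Send_() would log

  char buf[128];
  int recv_bytes = zmq_recv(pull, buf, sizeof(buf), 0);
  std::printf("recv_bytes = %d\n", recv_bytes);   // what Recv() would log

  zmq_close(push);
  zmq_close(pull);
  zmq_ctx_term(ctx);
  return 0;
}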

@vcodreanu
Author

here it is:

[00:50:50] src/van.cc:75: my_node_.hostname = 10.3.200.82 my_node_.port = 9465
[00:50:52] src/van.cc:52: my_node_.hostname = 10.202.203.105 my_node_.port = 44039
[00:50:52] src/van.cc:75: my_node_.hostname = 10.202.203.105 my_node_.port = 44039
[00:50:52] src/van.cc:226: send_bytes = 52
[00:50:52] src/van.cc:290: recv_bytes = 57
[00:50:52] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9465 i= 0
[00:50:52] src/van.cc:327: node.hostname = 10.202.203.105 node.port = 44039 i= 0
[00:50:52] src/van.cc:52: my_node_.hostname = 10.202.203.105 my_node_.port = 34601
[00:50:52] src/van.cc:75: my_node_.hostname = 10.202.203.105 my_node_.port = 34601
[00:50:52] src/van.cc:226: send_bytes = 57
[00:50:52] src/van.cc:290: recv_bytes = 62
[00:50:52] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9465 i= 0
[00:50:52] src/van.cc:327: node.hostname = 10.202.203.105 node.port = 34601 i= 0
[00:50:58] src/van.cc:52: my_node_.hostname = 10.202.203.106 my_node_.port = 46964
[00:50:58] src/van.cc:75: my_node_.hostname = 10.202.203.106 my_node_.port = 46964
[00:50:58] src/van.cc:226: send_bytes = 52
[00:50:58] src/van.cc:290: recv_bytes = 57
[00:50:58] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9465 i= 0
[00:50:58] src/van.cc:327: node.hostname = 10.202.203.106 node.port = 46964 i= 0
[00:50:58] src/van.cc:52: my_node_.hostname = 10.202.203.106 my_node_.port = 47818
[00:50:58] src/van.cc:75: my_node_.hostname = 10.202.203.106 my_node_.port = 47818
[00:50:58] src/van.cc:226: send_bytes = 57
[00:50:58] src/van.cc:290: recv_bytes = 62
[00:50:58] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9465 i= 0
[00:50:58] src/van.cc:327: node.hostname = 10.202.203.106 node.port = 47818 i= 0
[00:50:58] src/van.cc:226: send_bytes = 145
[00:50:58] src/van.cc:226: send_bytes = 145
[00:50:58] src/van.cc:226: send_bytes = 145
[00:50:58] src/van.cc:226: send_bytes = 145
[00:50:58] src/postoffice.cc:59: Between van start and barrier
[00:50:58] src/postoffice.cc:105: In Postoffice:: Barrier
[00:50:58] src/van.cc:226: send_bytes = 18
[00:50:58] src/van.cc:290: recv_bytes = 21

and with DMLC_INTERFACE=eth0, up to the same point (after hitting the barrier):

[00:54:08] src/van.cc:75: my_node_.hostname = 10.3.200.82 my_node_.port = 9890
[00:54:09] src/van.cc:52: my_node_.hostname = 10.3.200.105 my_node_.port = 49647
[00:54:09] src/van.cc:75: my_node_.hostname = 10.3.200.105 my_node_.port = 49647
[00:54:09] src/van.cc:226: send_bytes = 50
[00:54:09] src/van.cc:290: recv_bytes = 55
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9890 i= 0
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.105 node.port = 49647 i= 0
[00:54:09] src/van.cc:52: my_node_.hostname = 10.3.200.105 my_node_.port = 44416
[00:54:09] src/van.cc:75: my_node_.hostname = 10.3.200.105 my_node_.port = 44416
[00:54:09] src/van.cc:226: send_bytes = 55
[00:54:09] src/van.cc:290: recv_bytes = 60
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9890 i= 0
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.105 node.port = 44416 i= 0
[00:54:09] src/van.cc:52: my_node_.hostname = 10.3.200.106 my_node_.port = 47398
[00:54:09] src/van.cc:75: my_node_.hostname = 10.3.200.106 my_node_.port = 47398
[00:54:09] src/van.cc:226: send_bytes = 50
[00:54:09] src/van.cc:290: recv_bytes = 55
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9890 i= 0
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.106 node.port = 47398 i= 0
[00:54:09] src/van.cc:52: my_node_.hostname = 10.3.200.106 my_node_.port = 48295
[00:54:09] src/van.cc:75: my_node_.hostname = 10.3.200.106 my_node_.port = 48295
[00:54:09] src/van.cc:226: send_bytes = 55
[00:54:09] src/van.cc:290: recv_bytes = 60
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9890 i= 0
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.106 node.port = 48295 i= 0
[00:54:09] src/van.cc:226: send_bytes = 136
[00:54:09] src/van.cc:226: send_bytes = 136
[00:54:09] src/van.cc:226: send_bytes = 136
[00:54:09] src/van.cc:226: send_bytes = 136
[00:54:09[00:54:09] src/van.cc:290: recv_bytes = 139
] src/van.cc:290: recv_bytes = 139
[00:54:09] src/van.cc:[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 49647 i= 0
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.105 node.port = 49647 i= 0
[00:54:09] src/van.cc325: my_node.hostname = 10.3.200.106 my_node.port = 47398 i= 0
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 49647 i= 1
[00:54:09] src/van.cc:327: node.hostname = :327: node.hostname = 10.3.200.105 node.port = 49647 i= 0
[00:54:09] src/van.cc10.3.200.105 node.port = 44416 i= 1
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = :325: my_node.hostname = 10.3.200.106 my_node.port = 47398 i= 1
[00:54:09] src/van.cc:49647 i= 2
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.106 node.port = 47398[00:54:09] src/van.cc:290: recv_bytes = 139
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 44416 i= 0
[00:54:09] src/van.cc:327: node.hostname = i= 2
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 49647 i= 310.3.200.105 node.port = 49647 i= 0
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 44416 i= 1
[00:54:09] src/van.cc:327: node.hostname =
[00:54:09] src/van.cc:10.3.200.105 node.port = 44416 i= 1
[00:54:09327: node.hostname = 10.3.200.106 node.port = ] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 44416 i= 2
[00:54:0948295 i= 3
] src/van.cc:327: node.hostname = [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 4964710.3.200.106 node.port = 47398 i= 2 i= 4

[00:54:09] [00:54:09] src/van.cc:327: node.hostname = 10.3.200.82 node.port = src/van.cc:325: my_node.hostname = 327: node.hostname = 10.3.200.105 node.port = 44416 i= 1
[00:54:09] src/van.cc:32510.3.200.105 my_node.port = 44416 i= 39890 i= 4

[00:54:09] src/van.cc:327: node.hostname = 10.3.200.106: my_node.hostname = 10.3.200.106 my_node.port = 47398 i= 2 node.port = 48295 i= 3
[00:54:09]
[00:54:09] src/van.cc:327: node.hostname = src/van.cc:325: my_node.hostname = 10.3.200.106 node.port = 47398 i= 210.3.200.105 my_node.port = 44416 i= 4

[00:54:09] [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 47398src/van.cc:327: i= 3
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.82 node.port = 9890 i= 4
node.hostname = 10.3.200.106 node.port = 48295 i= 3
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 47398 i= 4
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.82 node.port = 9890 i= 4
[00:54:09] src/postoffice.cc:59: Between van start and barrier
[00:54:09] src/postoffice.cc:105: In Postoffice:: Barrier
[00:54:09] src/van.cc:226: send_bytes = 18
[00:54:09] src/postoffice.cc:59: Between van start and barrier
[00:54:09] src/postoffice.cc:105: In Postoffice:: Barrier
[00:54:09] src/van.cc:290: recv_bytes = 139
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 48295 i= 0
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.105 node.port = 49647 i= 0
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 48295 i= 1
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.105 node.port = 44416 i= 1
[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 48295 i= 2[00:54:09]
[00:54:09] src/van.cc:327: node.hostname = 10.3.200.106 node.port = 47398 i= 2
[src/van.cc:226: send_bytes = 18
.......

The run with eth0 works successfully.

@vcodreanu
Author

Any ideas where I should look further?

Thanks!

@mli
Contributor

mli commented Mar 20, 2016

i'll try to add a debug option there this weekend, so you will see all connection activity.

@mli
Contributor

mli commented Mar 22, 2016

can you try with PS_VERBOSE=1?

you need to update ps-lite to the newest version first

cd ps-lite; git pull; 

and then rebuild mxnet

make clean; make;

documentation: http://ps-lite.readthedocs.org/en/latest/how_to.html#debug-ps-lite
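
PS_VERBOSE is read from the environment at runtime, so nothing beyond the rebuild is needed. A rough sketch of the pattern it enables (illustrative only, not the ps-lite implementation; the level meanings, 1 for connection events and 2 for message metadata, follow the linked page):

#include <cstdlib>
#include <iostream>
#include <string>

// Return the verbosity level requested via PS_VERBOSE (0 if unset).
static int VerboseLevel() {
  const char* v = std::getenv("PS_VERBOSE");
  return v ? std::atoi(v) : 0;
}

// Print msg only when the requested verbosity is at least `level`.
static void VLog(int level, const std::string& msg) {
  if (VerboseLevel() >= level) std::cerr << msg << std::endl;
}

int main() {
  VLog(1, "Node Info: role=worker, ip=10.202.203.97, port=51576");  // level 1: connections
  VLog(2, "W => 1: Meta: control={ cmd=ADD_NODE ... }");            // level 2: message metadata
  return 0;
}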

@vcodreanu
Author

I'm running now with PS_VERBOSE=2.

The run on Infiniband:

[22:17:30] src/van.cc:76: Node Info: role=schedulerid=1, ip=10.3.200.83, port=9177
[22:17:30] src/van.cc:76: Node Info: role=server, ip=10.202.203.97, port=38842
[22:17:30] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.97, port=38842 } }
[22:17:30] src/van.cc:76: Node Info: role=worker, ip=10.202.203.97, port=51576
[22:17:30] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.97, port=51576 } }
[22:17:30] src/van.cc:76: Node Info: role=server, ip=10.202.203.98, port=56751
[22:17:30] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.98, port=56751 } }
[22:17:30] src/van.cc:76: Node Info: role=worker, ip=10.202.203.98, port=47467
[22:17:30] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.98, port=47467 } }
[22:17:30] src/van.cc:344: assign rank=9 to node role=worker, ip=10.202.203.98, port=47467
[22:17:30] src/van.cc:344: assign rank=8 to node role=server, ip=10.202.203.98, port=56751
[22:17:30] src/van.cc:344: assign rank=10 to node role=server, ip=10.202.203.97, port=38842
[22:17:30] src/van.cc:344: assign rank=11 to node role=worker, ip=10.202.203.97, port=51576
[22:17:30] src/van.cc:226: H[1] => 9: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=workerid=9, ip=10.202.203.98, port=47467 role=serverid=8, ip=10.202.203.98, port=56751 role=serverid=10, ip=10.202.203.97, port=38842 role=workerid=11, ip=10.202.203.97, port=51576 role=schedulerid=1, ip=10.3.200.83, port=9177 } }
[22:17:30] src/van.cc:226: H[1] => 11: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=workerid=9, ip=10.202.203.98, port=47467 role=serverid=8, ip=10.202.203.98, port=56751 role=serverid=10, ip=10.202.203.97, port=38842 role=workerid=11, ip=10.202.203.97, port=51576 role=schedulerid=1, ip=10.3.200.83, port=9177 } }
[22:17:30] src/van.cc:226: H[1] => 8: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=workerid=9, ip=10.202.203.98, port=47467 role=serverid=8, ip=10.202.203.98, port=56751 role=serverid=10, ip=10.202.203.97, port=38842 role=workerid=11, ip=10.202.203.97, port=51576 role=schedulerid=1, ip=10.3.200.83, port=9177 } }
[22:17:30] src/van.cc:226: H[1] => 10: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=workerid=9, ip=10.202.203.98, port=47467 role=serverid=8, ip=10.202.203.98, port=56751 role=serverid=10, ip=10.202.203.97, port=38842 role=workerid=11, ip=10.202.203.97, port=51576 role=schedulerid=1, ip=10.3.200.83, port=9177 } }
[22:17:30] src/van.cc:355: the scheduler is connected to 2 workers and 2 servers
[22:17:30] src/van.cc:226: H[1] => 1: Meta: request=1, push=42, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }

The run on ethernet up to the same point:

[22:18:04] src/van.cc:76: Node Info: role=schedulerid=1, ip=10.3.200.83, port=9561
[22:18:05] src/van.cc:76: Node Info: role=server, ip=10.3.200.97, port=44366
[22:18:05] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.3.200.97, port=44366 } }
[22:18:05] src/van.cc:76: Node Info: role=worker, ip=10.3.200.97, port=47885
[22:18:05] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.3.200.97, port=47885 } }
[22:18:05] src/van.cc:76: Node Info: role=server, ip=10.3.200.98, port=49326
[22:18:05] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.3.200.98, port=49326 } }
[22:18:05] src/van.cc:76: Node Info: role=worker, ip=10.3.200.98, port=50623
[22:18:05] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.3.200.98, port=50623 } }
[22:18:05] src/van.cc:344: assign rank=8 to node role=server, ip=10.3.200.98, port=49326
[22:18:05] src/van.cc:344: assign rank=9 to node role=worker, ip=10.3.200.98, port=50623
[22:18:05] src/van.cc:344: assign rank=10 to node role=server, ip=10.3.200.97, port=44366
[22:18:05] src/van.cc:344: assign rank=11 to node role=worker, ip=10.3.200.97, port=47885
[22:18:05] src/van.cc:226: H[1] => 9: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.3.200.98, port=49326 role=workerid=9, ip=10.3.200.98, port=50623 role=serverid=10, ip=10.3.200.97, port=44366 role=workerid=11, ip=10.3.200.97, port=47885 role=schedulerid=1, ip=10.3.200.83, port=9561 } }
[22:18:05] src/van.cc:226: H[1] => 11: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.3.200.98, port=49326 role=workerid=9, ip=10.3.200.98, port=50623 role=serverid=10, ip=10.3.200.97, port=44366 role=workerid=11, ip=10.3.200.97, port=47885 role=schedulerid=1, ip=10.3.200.83, port=9561 } }
[22:18:05] src/van.cc:226: H[1] => 8: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.3.200.98, port=49326 role=workerid=9, ip=10.3.200.98, port=50623 role=serverid=10, ip=10.3.200.97, port=44366 role=workerid=11, ip=10.3.200.97, port=47885 role=schedulerid=1, ip=10.3.200.83, port=9561 } }
[22:18:05] src/van.cc:226: H[1] => 10: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.3.200.98, port=49326 role=workerid=9, ip=10.3.200.98, port=50623 role=serverid=10, ip=10.3.200.97, port=44366 role=workerid=11, ip=10.3.200.97, port=47885 role=schedulerid=1, ip=10.3.200.83, port=9561 } }
[22:18:05] src/van.cc:355: the scheduler is connected to 2 workers and 2 servers
[22:18:05] src/van.cc:362: W[9] is connected to others
[22:18:05] src/van.cc:362: S[10] is connected to others
[22:18:05] src/van.cc:362: W[11] is connected to others
[22:18:05] src/van.cc:362: S[8] is connected to others
[22:18:05] src/van.cc:226: H[1] => 1: Meta: request=1, push=42, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }
[22:18:05] src/van.cc:226: W[9] => 1: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }
[22:18:05] src/van.cc:226: S[8] => 1: Meta: request=1, push=43, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }
[22:18:05] src/van.cc:226: W[11] => 1: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }
[22:18:05] src/van.cc:226: S[10] => 1: Meta: request=1, push=43, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }
[22:18:05] src/van.cc:226: H[1] => 9: Meta: request=0, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=0 }
[22:18:05] src/van.cc:226: H[1] => 11: Meta: request=0, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=0 }
[22:18:05] src/van.cc:226: H[1] => 8: Meta: request=0, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=0 }
[22:18:05] src/van.cc:226: H[1] => 10: Meta: request=0, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=0 }
[22:18:05] src/van.cc:226: H[1] => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=0 }
[22:18:05] src/van.cc:226: W[9] => 8: Meta: request=1, push=0, simple_app=1, customer_id=0, timestamp=0, head=-2
[22:18:05] src/van.cc:226: H[1] => 1: Meta: request=1, push=42, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }

thanks for the help!

@mli
Contributor

mli commented Mar 22, 2016

it seems that sending data over ib0 fails. you can double-check by pulling the latest ps-lite, where receiving is also logged.

i also added a --host-ip option in the mpi tracker, see dmlc/ps-lite@f2ab107

if you start your job on gcn8 with ib0, can you add the option --host-ip 10.202.203.89 and try again?

@vcodreanu
Author

yes, it seems that this is the case. I started the job directly from a compute node (before, I was starting it from a separate "scheduler" node) and it gets a bit further. But it only progresses for the server/worker placed on the node that launches the job (10.3.200.92 in this case).

[00:27:41] src/van.cc:76: Bind to role=schedulerid=1, ip=10.3.200.92, port=9876
[00:27:41] src/van.cc:76: Bind to role=server, ip=10.202.203.92, port=40666
[00:27:41] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.92, port=40666 } }
[00:27:41] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.92, port=40666 } }
[00:27:41] src/van.cc:76: Bind to role=worker, ip=10.202.203.92, port=38127
[00:27:41] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.92, port=38127 } }
[00:27:41] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.92, port=38127 } }
[00:27:42] src/van.cc:76: Bind to role=server, ip=10.202.203.93, port=49706
[00:27:42] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.93, port=49706 } }
[00:27:42] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.93, port=49706 } }
[00:27:42] src/van.cc:76: Bind to role=worker, ip=10.202.203.93, port=58320
[00:27:42] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.93, port=58320 } }
[00:27:42] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.93, port=58320 } }
[00:27:42] src/van.cc:348: assign rank=8 to node role=server, ip=10.202.203.93, port=49706
[00:27:42] src/van.cc:348: assign rank=9 to node role=worker, ip=10.202.203.93, port=58320
[00:27:42] src/van.cc:348: assign rank=11 to node role=worker, ip=10.202.203.92, port=38127
[00:27:42] src/van.cc:348: assign rank=10 to node role=server, ip=10.202.203.92, port=40666
[00:27:42] src/van.cc:226: H[1] => 9: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } }
[00:27:42] src/van.cc:226: H[1] => 11: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } }
[00:27:42] src/van.cc:226: H[1] => 8: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } }
[00:27:42] src/van.cc:226: H[1] => 10: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } }
[00:27:42] src/van.cc:359: the scheduler is connected to 2 workers and 2 servers
[00:27:42] src/van.cc:291: W <= 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } }
[00:27:42] src/van.cc:291: S <= 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } }
[00:27:42] src/van.cc:366: W[11] is connected to others
[00:27:42] src/van.cc:366: S[10] is connected to others
[00:27:42] src/van.cc:226: S[10] => 1: Meta: request=1, push=42, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }
[00:27:42] src/van.cc:226: W[11] => 1: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }
[00:27:42] src/van.cc:226: H[1] => 1: Meta: request=1, push=42, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }
[00:27:42] src/van.cc:291: H[1] <= 1: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }
[00:27:42] src/van.cc:291: H[1] <= 10: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }
[00:27:42] src/van.cc:291: H[1] <= 11: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }

but it seems strange...we have many mpi programs running over ib0. any ideas?

@mli
Contributor

mli commented Mar 22, 2016

the problem seems to be that the scheduler, which uses eth0, fails to send data to the other machine's ib0 interface.

can you try the --host-ip way? namely, let the scheduler also use the ib0 interface.

@vcodreanu
Author

but in the last test the scheduler is on a compute node, so on ib0, and it says:

[00:27:42] src/van.cc:366: W[11] is connected to others
[00:27:42] src/van.cc:366: S[10] is connected to others

so I suppose it's missing S[8] and W[9], which should sit on the other IP. but why?

ib bandwidth/latency tests between nodes work without any problems.

@mli
Contributor

mli commented Mar 22, 2016

the scheduler still uses eth0; its ip is obtained by tracker/tracker.py, which ignores the DMLC_INTERFACE option... namely

[00:27:41] src/van.cc:76: Bind to role=scheduler id=1, ip=10.3.200.92, port=9876
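
For reference, the usual "what is my IP" trick looks like the sketch below (an assumption about the approach, not a copy of tracker/tracker.py): connecting a UDP socket to any routable address and reading back the local address the kernel chose follows the default route, which is eth0 on these nodes, so the scheduler announces 10.3.200.x regardless of DMLC_INTERFACE.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  sockaddr_in remote{};
  remote.sin_family = AF_INET;
  remote.sin_port = htons(53);
  inet_pton(AF_INET, "8.8.8.8", &remote.sin_addr);  // any routable address; no packets are sent
  connect(fd, reinterpret_cast<sockaddr*>(&remote), sizeof(remote));

  sockaddr_in local{};
  socklen_t len = sizeof(local);
  getsockname(fd, reinterpret_cast<sockaddr*>(&local), &len);
  char ip[INET_ADDRSTRLEN];
  inet_ntop(AF_INET, &local.sin_addr, ip, sizeof(ip));
  std::printf("default-route IP: %s\n", ip);  // the eth0 address on these nodes
  close(fd);
  return 0;
}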

@vcodreanu
Author

yes, now when it uses ib0 it freezes earlier:

[01:13:56] src/van.cc:76: Bind to role=schedulerid=1, ip=10.202.203.92, port=9984
[01:13:56] src/van.cc:76: Bind to role=server, ip=10.202.203.92, port=35198
[01:13:56] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.92, port=35198 } }
[01:13:56] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.92, port=35198 } }
[01:13:56] src/van.cc:76: Bind to role=worker, ip=10.202.203.92, port=53948
[01:13:56] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.92, port=53948 } }
[01:13:56] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.92, port=53948 } }
[01:13:57] src/van.cc:76: Bind to role=server, ip=10.202.203.93, port=37327
[01:13:57] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.93, port=37327 } }
[01:13:57] src/van.cc:76: Bind to role=worker, ip=10.202.203.93, port=50676
[01:13:57] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.93, port=50676 } }

@mli
Contributor

mli commented Mar 23, 2016

that means zmq fails to send data over infiniband with the tcp transport.

can you try sdp instead? to use it, you need to hack ps-lite a little bit (a sketch of the endpoint change follows below):

  1. go to ps-lite/src/van.cc, then replace the two tcp: prefixes with sdp:.
  2. run make clean; make -j8 in mxnet's root; it should recompile ps-lite.

i don't have infiniband at hand, so i'm not sure whether the above works...
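
To make step 1 concrete, a sketch of what the endpoint change amounts to (illustrative only, not ps-lite's actual van.cc; MakeEndpoint is a hypothetical helper, and whether an sdp endpoint is accepted depends on how libzmq was built):

#include <zmq.h>
#include <cstdio>
#include <string>

// Hypothetical helper: build a ZeroMQ endpoint string. ps-lite binds/connects
// with "tcp://<ip>:<port>"; the suggestion above swaps the transport prefix.
std::string MakeEndpoint(const std::string& ip, int port, bool use_sdp) {
  return std::string(use_sdp ? "sdp://" : "tcp://") + ip + ":" + std::to_string(port);
}

int main() {
  void* ctx = zmq_ctx_new();
  void* sock = zmq_socket(ctx, ZMQ_ROUTER);
  // In van.cc the node's DMLC_INTERFACE address and port would be used here
  // (e.g. 10.202.203.92:40666 from the logs above); 127.0.0.1 keeps the
  // sketch runnable anywhere.
  std::string ep = MakeEndpoint("127.0.0.1", 40666, /*use_sdp=*/false);
  int rc = zmq_bind(sock, ep.c_str());
  std::printf("bind %s -> %d\n", ep.c_str(), rc);
  zmq_close(sock);
  zmq_ctx_term(ctx);
  return 0;
}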

@szha closed this as completed Sep 28, 2017
@chongyang-xu

Hi, I ran into the same issue. Did you succeed in training over IB? @vcodreanu
I replaced the two tcp: with sdp: in ps-lite/src/van.cc, but it didn't work in my environment.
