
Run the distributed training on Kubernetes #61

Closed
compete369 opened this issue Jul 16, 2019 · 19 comments
Labels
distributed Distributed deployment (ps-lite, MXNet server)

Comments

@compete369

compete369 commented Jul 16, 2019

After the successful single run on Kubernetes with the workaround, I tried to run distributed training with 2 workers on Kubernetes. However, only one worker is running, and the other one always hangs. I assigned just 1 device (with 0 as the device tag), but the running worker reports that it is benchmarking 2 GPUs. The running worker has 2 GPUs, and the hanging worker has only 1 GPU.

  1. How did you benchmark? Bare-metal or Kubernetes?
  2. Does it work if the worker has just 1 GPU? And is there any requirement on the GPU model?
  3. Is there any Kubernetes operator to set up BytePS?
@compete369
Author

compete369 commented Jul 16, 2019

Is it mandatory for the scheduler, server, and worker to be located on different physical machines? In k8s, Pod-level isolation is the same as physical-machine isolation, so could we put the scheduler and server in different Pods on the same machine? Is it possible to use the MXNet operator to drive BytePS? I see the runtime structures are the same.

@ymjiang
Member

ymjiang commented Jul 16, 2019

After the successful single run on Kubernetes with the workaround, I tried to run distributed training with 2 workers on Kubernetes. However, only one worker is running, and the other one always hangs. I assigned just 1 device (with 0 as the device tag), but the running worker reports that it is benchmarking 2 GPUs. The running worker has 2 GPUs, and the hanging worker has only 1 GPU.

For now BytePS assumes a homogeneous setup, i.e., each worker must have the same number of GPUs. Otherwise you will run into trouble.

  1. How did you benchmark? Bare-metal or Kubernetes?

We usually use Docker containers. Running on bare metal should be fine, but you need to be careful with the environment (gcc, CUDA driver, etc.). We have never actually tested with Kubernetes.

  1. Does it work if the worker has just 1 GPU? And is there any requirement on the GPU model?

Yes, it is OK if each worker only has 1 GPU. Regarding the GPU model, we have tested with 1080Ti and V100.

  1. Is there any Kubernetes operator to set up BytePS?

Unfortunately we don't have experience with Kubernetes.

@ymjiang
Member

ymjiang commented Jul 16, 2019

Is it mandatory for the scheduler, server, and worker to be located on different physical machines? In k8s, Pod-level isolation is the same as physical-machine isolation, so could we put the scheduler and server in different Pods on the same machine?

It is not mandatory. You can put them on the same physical machine.

Is it possible to use the MXNet operator to drive BytePS? I see the runtime structures are the same.

I am not sure what "operator" means here. Can you please clarify?

@ymjiang
Member

ymjiang commented Jul 16, 2019

Please give each worker the same number of GPUs and then try again. If you still run into trouble, feel free to let us know.
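For example (a minimal sketch; the GPU index is a placeholder for whichever device you pick), the two workers should only differ in their worker id, not in how many GPUs they expose:

# On worker 0:
export NVIDIA_VISIBLE_DEVICES=0    # expose exactly one GPU
export DMLC_WORKER_ID=0

# On worker 1 (its own machine):
export NVIDIA_VISIBLE_DEVICES=0    # also exactly one GPU
export DMLC_WORKER_ID=1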

@ymjiang ymjiang added the distributed Distributed deployment (ps-lite, MXNet server) label Jul 16, 2019
@bobzhuyb
Member

We require that each worker has the same number of GPUs only because we need a way to correctly calculate the total number of workers and global rank.
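As a rough illustration (not the exact code), the bookkeeping is just integer arithmetic over a single per-worker GPU count:

    global_rank = worker_id * gpus_per_worker + local_gpu_index
    total_ranks = num_workers * gpus_per_worker

For example, with 2 workers x 2 GPUs each: worker 0 holds ranks 0-1, worker 1 holds ranks 2-3, 4 ranks in total. With 2 GPUs on one worker and 1 on the other, there is no single gpus_per_worker, so the counts come out wrong, which is consistent with the hang you saw.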

@compete369
Author

Is it mandatory for the scheduler, server, and worker to be located on different physical machines? In k8s, Pod-level isolation is the same as physical-machine isolation, so could we put the scheduler and server in different Pods on the same machine?

It is not mandatory. You can put them on the same physical machine.

Is it possible to use the MXNet operator to drive BytePS? I see the runtime structures are the same.

I am not sure what "operator" means here. Can you please clarify?

"Operator" is a concept of Kubernetes. Please refer to https://coreos.com/operators/. I'd like to verify the performance firstly, and then have an "operator" to let it run on k8s much easier.

@ymjiang
Member

ymjiang commented Jul 17, 2019

"Operator" is a concept of Kubernetes. Please refer to https://coreos.com/operators/. I'd like to verify the performance firstly, and then have an "operator" to let it run on k8s much easier.

Thanks for the notes on k8s. Besides, if you want high performance, you need to put the workers & servers on different physical machines. We have addressed this in the README.
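In case it helps, this is roughly the environment layout we would expect for 1 scheduler + 2 servers + 2 workers on separate machines (a sketch only; the IP, port, and launcher path are placeholders, please follow the README for your exact version):

# Common to every node (point everyone at the scheduler):
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.1     # scheduler IP (placeholder)
export DMLC_PS_ROOT_PORT=1234        # scheduler port (placeholder)

# Scheduler machine:
export DMLC_ROLE=scheduler

# Each server machine:
export DMLC_ROLE=server

# Each worker machine (only the id differs):
export DMLC_ROLE=worker
export DMLC_WORKER_ID=0              # 1 on the second worker
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Then start the BytePS launcher on every node, e.g.
# python /usr/local/byteps/launcher/launch.py [training command on workers]
# (the exact path/command depends on how BytePS is installed -- see the README).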

@compete369
Author

compete369 commented Jul 17, 2019

How did you benchmark with 16 GPUs? How many servers did you run?
I had 2 servers with NVLink enabled (8 V100 GPUs each), connected by a 25 Gbps TCP/IP network. The test result is not as good as yours, just 70-100 img/sec per GPU (resnet50, batch size 64). At the same time, I noticed that GPU utilization is very low compared with CPU, and the network traffic is around 100 MB in and out, roughly equal.

[screenshot: CPU/GPU/network utilization]

@bobzhuyb
Member

@compete369 Did the single-machine case work as expected for you?

Your setup is pretty similar to ours. However, there are a few details I want to understand:

  1. Are you using overlay networks? The performance of overlay networks is usually very poor. If possible, would you start with the host network?
  2. How many parameter servers do you run? For two workers, we expect two PS.
  3. How many CPUs are allocated to the parameter servers, if you impose any restriction on the number of cores? For 25 Gbps TCP/IP, I would expect 4-8 CPU cores per parameter server.

It's very possible to drive BytePS using the MXNet K8s operator with minor modifications -- all the DMLC_* environment variables are the same.
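For points 1 and 3, a hedged example of what I mean when running a parameter server in a container (the image name and launch command are placeholders; the host networking and CPU budget are the point):

# --network=host avoids overlay-network overhead;
# --cpus=8 gives the PS the 4-8 cores suggested above.
docker run --rm --network=host --cpus=8 \
    -e DMLC_ROLE=server \
    -e DMLC_NUM_WORKER=2 -e DMLC_NUM_SERVER=2 \
    -e DMLC_PS_ROOT_URI=10.0.0.1 -e DMLC_PS_ROOT_PORT=1234 \
    your-byteps-server-image \
    python /usr/local/byteps/launcher/launch.py   # placeholder launch command

In Kubernetes terms, hostNetwork: true plus a CPU limit on the PS pod would play the same role.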

@compete369
Author

@compete369 Did the single-machine case work as expected for you?

Your setup is pretty similar to ours. However, there are a few details I want to understand:

  1. Are you using overlay networks? The performance of overlay networks is usually very poor. If possible, would you start with the host network?
  2. How many parameter servers do you run? For two workers, we expect two PS.
  3. How many CPUs are allocated to the parameter servers, if you impose any restriction on the number of cores? For 25 Gbps TCP/IP, I would expect 4-8 CPU cores per parameter server.

It's very possible to drive BytePS using the MXNet K8s operator with minor modifications -- all the DMLC_* environment variables are the same.

The single-machine run is very strange: NVLink has no traffic. Could you point me in the right direction?
[screenshot: NVLink traffic]

  1. The network is good. I tried iperf inside the Docker container and reached 22 Gbps.
  2. I will try 2 PS with 4-8 CPU cores later, and will let you know the result.

@bobzhuyb
Member

@compete369 The single-machine case is very strange. On a single machine, BytePS uses NCCL, which should use NVLink. Can you set NCCL_DEBUG=INFO when you run the test and check the output of NCCL? What is the training speed on a single machine?
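Concretely, something like this in the worker's environment before the same benchmark command (the second variable is optional, just for comparison):

export NCCL_DEBUG=INFO        # verbose NCCL setup logs
# export NCCL_P2P_DISABLE=1   # optional counter-test: force the shared-memory path

In the output, "via P2P/IPC" in the ring lines means GPU peer-to-peer (NVLink or PCIe P2P) is being used; "via direct shared memory" means it is not.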

@compete369
Author

compete369 commented Jul 18, 2019

worker-pytorch-single1:25:25 [0] NCCL INFO Using internal Network Socket
NCCL version 2.3.7+cuda9.0

worker-pytorch-single1:25:25 [0] NCCL INFO Ring 00 : 0 1 2 3
worker-pytorch-single1:25:25 [0] NCCL INFO Ring 01 : 0 2 1 3
worker-pytorch-single1:25:25 [0] NCCL INFO Ring 02 : 0 3 1 2
worker-pytorch-single1:25:25 [0] NCCL INFO Ring 03 : 0 3 2 1
worker-pytorch-single1:25:25 [0] NCCL INFO Ring 04 : 0 1 2 3
worker-pytorch-single1:25:25 [0] NCCL INFO Ring 05 : 0 2 1 3
worker-pytorch-single1:25:25 [0] NCCL INFO Ring 06 : 0 3 1 2
worker-pytorch-single1:25:25 [0] NCCL INFO Ring 07 : 0 3 2 1
worker-pytorch-single1:38:38 [4] NCCL INFO Ring 01 : 0[4] -> 2[6] via P2P/IPC
worker-pytorch-single1:44:44 [5] NCCL INFO Ring 01 : 1[5] -> 3[7] via P2P/IPC
worker-pytorch-single1:46:46 [6] NCCL INFO Ring 01 : 2[6] -> 1[5] via P2P/IPC
worker-pytorch-single1:45:45 [7] NCCL INFO Ring 01 : 3[7] -> 0[4] via P2P/IPC

Iter #0: 141.0 img/sec per GPU
Iter #1: 143.4 img/sec per GPU

When I set NCCL_P2P_DISABLE=1:
INFO NCCL_P2P_DISABLE set by environment to 1.
worker-pytorch-single1:22:22 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 1.
worker-pytorch-single1:31:31 [1] NCCL INFO NCCL_P2P_DISABLE set by environment to 1.
worker-pytorch-single1:36:36 [2] NCCL INFO NCCL_P2P_DISABLE set by environment to 1.
worker-pytorch-single1:22:22 [0] NCCL INFO Using 256 threads
worker-pytorch-single1:22:22 [0] NCCL INFO Min Comp Cap 7
worker-pytorch-single1:22:22 [0] NCCL INFO Ring 00 : 0 1 2 3
worker-pytorch-single1:35:35 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via direct shared memory
worker-pytorch-single1:22:22 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
worker-pytorch-single1:36:36 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via direct shared memory
worker-pytorch-single1:31:31 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory

Iter #0: 107.7 img/sec per GPU
Iter #1: 106.1 img/sec per GPU
Iter #2: 138.4 img/sec per GPU
Iter #3: 116.3 img/sec per GPU
Iter #4: 119.3 img/sec per GPU

@bobzhuyb
Member

bobzhuyb commented Jul 18, 2019

@compete369 Thanks for the log. "P2P/IPC" is the correct status. BTW, did you try Horovod in the single machine case? Is the performance similar? As long as it's similar, you are good with the single machine case. You may also try MXNet's own NCCL implementation. These should all be very similar.

Based on your NCCL_P2P_DISABLE=1 results, I think you should expect BytePS to get similar results in the multi-worker distributed case (100~140 img/sec per GPU). I believe you can get that with two separate parameter servers. Yes, it's much slower than our results. But I think it's due to your machine setup...

In the end, BytePS can't beat the performance of a local NCCL using direct shared memory. You can compare the data path in the two cases:

Local NCCL with direct shared memory: GPU -> CPU memory -> GPU
BytePS: GPU -> CPU memory -> push to PS -> pull from PS -> CPU memory -> GPU

The best you can expect from BytePS is that it helps you completely hide the latency of push and pull, and get the same performance as local NCCL.

The same goes for all the alternative distributed options you have. For example, Horovod can't get you any better results, either.

@compete369
Author

Thanks very much. I finally got the expected speed on a single machine. However, I failed when moving on to 2-machine training. I had 1 scheduler, 2 servers, and 2 workers, each on a different machine.

[10:03:24] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[10:03:24] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[10:03:24] src/van.cc:357: Bind to role=scheduler, id=1, ip=11.140.196.93, port=1234, is_recovery=0
[10:03:24] src/./zmq_van.h:285: Start ZMQ recv thread
[10:05:43] src/van.cc:471: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=11.140.109.97, port=51865, is_recovery=0 } }. THIS IS NOT DATA MSG!
[10:05:47] src/van.cc:471: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=11.140.109.98, port=60669, is_recovery=0 } }. THIS IS NOT DATA MSG!
[10:06:20] src/van.cc:471: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=11.138.195.221, port=51882, is_recovery=0 } }. THIS IS NOT DATA MSG!
[10:06:20] src/van.cc:108: assign rank=8 to node role=server, ip=11.140.109.98, port=60669, is_recovery=0
[10:06:20] src/van.cc:108: assign rank=10 to node role=server, ip=11.140.109.97, port=51865, is_recovery=0
[10:06:20] src/van.cc:108: assign rank=9 to node role=worker, ip=11.138.195.221, port=51882, is_recovery=0
[10:06:20] src/van.cc:446: ? => 9. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, id=8, ip=11.140.109.98, port=60669, is_recovery=0 role=server, id=10, ip=11.140.109.97, port=51865, is_recovery=0 role=worker, id=9, ip=11.138.195.221, port=51882, is_recovery=0 role=scheduler, id=1, ip=11.140.196.93, port=1234, is_recovery=0 } }. THIS IS NOT DATA MSG!
[10:06:20] src/./zmq_van.h:234: there is no socket to node 11
terminate called after throwing an instance of 'dmlc::Error'
what(): [10:06:20] src/van.cc:442: Check failed: (send_bytes) != (-1)

Stack trace returned 9 entries:
[bt] (0) /root/incubator-mxnet/lib/libmxnet.so(dmlc::StackTraceabi:cxx11+0x1bc) [0x7fdffee8ce5c]
[bt] (1) /root/incubator-mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fdffee8e1d8]
[bt] (2) /root/incubator-mxnet/lib/libmxnet.so(ps::Van::Send(ps::Message&)+0x235) [0x7fe002374ee5]
[bt] (3) /root/incubator-mxnet/lib/libmxnet.so(ps::Van::ProcessAddNodeCommandAtScheduler(ps::Message*, ps::Meta*, ps::Meta*)+0xf5a) [0x7fe0023792da]
[bt] (4) /root/incubator-mxnet/lib/libmxnet.so(ps::Van::ProcessAddNodeCommand(ps::Message*, ps::Meta*, ps::Meta*)+0x556) [0x7fe002379a16]
[bt] (5) /root/incubator-mxnet/lib/libmxnet.so(ps::Van::Receiving()+0xdad) [0x7fe00237a88d]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fe099ca2c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fe0a60266ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fe0a5d5c41d]

Could you help me with this? Thanks!

@ymjiang
Member

ymjiang commented Jul 18, 2019

src/van.cc:471: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=11.140.109.97, port=51865, is_recovery=0 } }. THIS IS NOT DATA MSG!
[10:05:47] src/van.cc:471: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=11.140.109.98, port=60669, is_recovery=0 } }. THIS IS NOT DATA MSG!
[10:06:20] src/van.cc:471: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=11.138.195.221, port=51882, is_recovery=0 } }. THIS IS NOT DATA MSG!

This is strange. It looks like node 11 (a worker) never sends a message to the scheduler, yet the scheduler still behaves as if all workers & servers are ready.

Can you please show the complete scripts you use to launch each scheduler / server / worker? Thanks.

Another guess: did you set DMLC_NUM_WORKER=1 for the scheduler?


@compete369
Author

I am sorry, I had the wrong setting DMLC_NUM_WORKER=1 for the scheduler; it should be DMLC_NUM_WORKER=2.
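In other words, the scheduler side should look roughly like this (a sketch based on the values in the log above; only the worker count changed):

export DMLC_ROLE=scheduler
export DMLC_NUM_WORKER=2                 # was mistakenly 1
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=11.140.196.93    # scheduler IP from the log above
export DMLC_PS_ROOT_PORT=1234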

@compete369 compete369 reopened this Jul 18, 2019
@compete369
Author

I had 1 scheduler, 2 servers (8 vCPUs, 16 GB), and 2 workers (64 vCPUs, 256 GB, 8 V100 GPUs each). It runs successfully, but the training speed is only around 95 img/sec per GPU. On a single machine it can reach 250 img/sec per GPU. Any clue on this?

BytePS launching worker
running benchmark...
running benchmark...
running benchmark...
running benchmark...
running benchmark...
running benchmark...
running benchmark...
running benchmark...
[11:18:57] src/customer.cc:363: Do not use thread pool for receiving.
[11:18:57] src/./zmq_van.h:285: Start ZMQ recv thread
[11:19:00] src/./zmq_van.h:285: Start ZMQ recv thread
[11:19:00] src/./zmq_van.h:285: Start ZMQ recv thread
[11:19:00] src/./zmq_van.h:285: Start ZMQ recv thread

worker-pytorch-0:33:33 [3] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:33:33 [3] NCCL INFO Using internal Network Socket
worker-pytorch-0:33:33 [3] NCCL INFO NET : Using interface eth0:11.138.195.228<0>
worker-pytorch-0:33:33 [3] NCCL INFO NET/Socket : 1 interfaces found
worker-pytorch-0:33:33 [3] NCCL INFO rank 3 nranks 4

worker-pytorch-0:24:24 [1] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:24:24 [1] NCCL INFO Using internal Network Socket
worker-pytorch-0:24:24 [1] NCCL INFO rank 1 nranks 4

worker-pytorch-0:31:31 [2] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:31:31 [2] NCCL INFO Using internal Network Socket
worker-pytorch-0:31:31 [2] NCCL INFO rank 2 nranks 4

worker-pytorch-0:27:27 [0] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:27:27 [0] NCCL INFO Using internal Network Socket
NCCL version 2.3.7+cuda9.0
worker-pytorch-0:27:27 [0] NCCL INFO rank 0 nranks 4
worker-pytorch-0:24:24 [1] NCCL INFO comm 0x9249b110 rank 1 nranks 4
worker-pytorch-0:24:24 [1] NCCL INFO NET : Using interface eth0:11.138.195.228<0>
worker-pytorch-0:24:24 [1] NCCL INFO NET/Socket : 1 interfaces found
worker-pytorch-0:27:27 [0] NCCL INFO comm 0x91859a10 rank 0 nranks 4
worker-pytorch-0:27:27 [0] NCCL INFO NET : Using interface eth0:11.138.195.228<0>
worker-pytorch-0:27:27 [0] NCCL INFO NET/Socket : 1 interfaces found
worker-pytorch-0:31:31 [2] NCCL INFO comm 0x92cc7f90 rank 2 nranks 4
worker-pytorch-0:31:31 [2] NCCL INFO NET : Using interface eth0:11.138.195.228<0>
worker-pytorch-0:31:31 [2] NCCL INFO NET/Socket : 1 interfaces found
worker-pytorch-0:33:33 [3] NCCL INFO comm 0x92f3d4f0 rank 3 nranks 4
worker-pytorch-0:24:24 [1] NCCL INFO Could not find real path of /sys/class/net/eth0/device
worker-pytorch-0:24:24 [1] NCCL INFO CUDA Dev 1, IP Interfaces : eth0(SOC)
worker-pytorch-0:31:31 [2] NCCL INFO Could not find real path of /sys/class/net/eth0/device
worker-pytorch-0:31:31 [2] NCCL INFO CUDA Dev 2, IP Interfaces : eth0(SOC)
worker-pytorch-0:27:27 [0] NCCL INFO Could not find real path of /sys/class/net/eth0/device
worker-pytorch-0:27:27 [0] NCCL INFO CUDA Dev 0, IP Interfaces : eth0(SOC)
worker-pytorch-0:33:33 [3] NCCL INFO Could not find real path of /sys/class/net/eth0/device
worker-pytorch-0:33:33 [3] NCCL INFO CUDA Dev 3, IP Interfaces : eth0(SOC)
worker-pytorch-0:33:33 [3] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
worker-pytorch-0:27:27 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
worker-pytorch-0:24:24 [1] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
worker-pytorch-0:31:31 [2] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
worker-pytorch-0:27:27 [0] NCCL INFO Using 256 threads
worker-pytorch-0:27:27 [0] NCCL INFO Min Comp Cap 7
worker-pytorch-0:27:27 [0] NCCL INFO Ring 00 : 0 1 2 3
worker-pytorch-0:27:27 [0] NCCL INFO Ring 01 : 0 2 1 3
worker-pytorch-0:27:27 [0] NCCL INFO Ring 02 : 0 3 1 2
worker-pytorch-0:27:27 [0] NCCL INFO Ring 03 : 0 3 2 1
worker-pytorch-0:27:27 [0] NCCL INFO Ring 04 : 0 1 2 3
worker-pytorch-0:27:27 [0] NCCL INFO Ring 05 : 0 2 1 3
worker-pytorch-0:27:27 [0] NCCL INFO Ring 06 : 0 3 1 2
worker-pytorch-0:27:27 [0] NCCL INFO Ring 07 : 0 3 2 1
worker-pytorch-0:31:31 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
worker-pytorch-0:33:33 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via P2P/IPC
worker-pytorch-0:24:24 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
worker-pytorch-0:27:27 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
worker-pytorch-0:24:24 [1] NCCL INFO Ring 01 : 1[1] -> 3[3] via P2P/IPC
worker-pytorch-0:31:31 [2] NCCL INFO Ring 01 : 2[2] -> 1[1] via P2P/IPC
worker-pytorch-0:27:27 [0] NCCL INFO Ring 01 : 0[0] -> 2[2] via P2P/IPC
worker-pytorch-0:33:33 [3] NCCL INFO Ring 01 : 3[3] -> 0[0] via P2P/IPC
worker-pytorch-0:27:27 [0] NCCL INFO Ring 02 : 0[0] -> 3[3] via P2P/IPC
worker-pytorch-0:33:33 [3] NCCL INFO Ring 02 : 3[3] -> 1[1] via P2P/IPC
worker-pytorch-0:24:24 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
worker-pytorch-0:31:31 [2] NCCL INFO Ring 02 : 2[2] -> 0[0] via P2P/IPC
worker-pytorch-0:31:31 [2] NCCL INFO Ring 03 : 2[2] -> 1[1] via P2P/IPC
worker-pytorch-0:27:27 [0] NCCL INFO Ring 03 : 0[0] -> 3[3] via P2P/IPC
worker-pytorch-0:33:33 [3] NCCL INFO Ring 03 : 3[3] -> 2[2] via P2P/IPC
worker-pytorch-0:24:24 [1] NCCL INFO Ring 03 : 1[1] -> 0[0] via P2P/IPC
worker-pytorch-0:24:24 [1] NCCL INFO Ring 04 : 1[1] -> 2[2] via P2P/IPC
worker-pytorch-0:33:33 [3] NCCL INFO Ring 04 : 3[3] -> 0[0] via P2P/IPC
worker-pytorch-0:31:31 [2] NCCL INFO Ring 04 : 2[2] -> 3[3] via P2P/IPC
worker-pytorch-0:27:27 [0] NCCL INFO Ring 04 : 0[0] -> 1[1] via P2P/IPC
worker-pytorch-0:27:27 [0] NCCL INFO Ring 05 : 0[0] -> 2[2] via P2P/IPC
worker-pytorch-0:24:24 [1] NCCL INFO Ring 05 : 1[1] -> 3[3] via P2P/IPC
worker-pytorch-0:31:31 [2] NCCL INFO Ring 05 : 2[2] -> 1[1] via P2P/IPC
worker-pytorch-0:33:33 [3] NCCL INFO Ring 05 : 3[3] -> 0[0] via P2P/IPC
worker-pytorch-0:27:27 [0] NCCL INFO Ring 06 : 0[0] -> 3[3] via P2P/IPC
worker-pytorch-0:33:33 [3] NCCL INFO Ring 06 : 3[3] -> 1[1] via P2P/IPC
worker-pytorch-0:24:24 [1] NCCL INFO Ring 06 : 1[1] -> 2[2] via P2P/IPC
worker-pytorch-0:31:31 [2] NCCL INFO Ring 06 : 2[2] -> 0[0] via P2P/IPC
worker-pytorch-0:31:31 [2] NCCL INFO Ring 07 : 2[2] -> 1[1] via P2P/IPC
worker-pytorch-0:33:33 [3] NCCL INFO Ring 07 : 3[3] -> 2[2] via P2P/IPC
worker-pytorch-0:24:24 [1] NCCL INFO Ring 07 : 1[1] -> 0[0] via P2P/IPC
worker-pytorch-0:27:27 [0] NCCL INFO Ring 07 : 0[0] -> 3[3] via P2P/IPC
worker-pytorch-0:33:33 [3] NCCL INFO comm 0x92f3d4f0 rank 3 nranks 4 - COMPLETE
worker-pytorch-0:27:27 [0] NCCL INFO comm 0x91859a10 rank 0 nranks 4 - COMPLETE
worker-pytorch-0:24:24 [1] NCCL INFO comm 0x9249b110 rank 1 nranks 4 - COMPLETE
worker-pytorch-0:31:31 [2] NCCL INFO comm 0x92cc7f90 rank 2 nranks 4 - COMPLETE
worker-pytorch-0:27:623 [0] NCCL INFO Launch mode Parallel

worker-pytorch-0:46:46 [7] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:46:46 [7] NCCL INFO Using internal Network Socket
worker-pytorch-0:46:46 [7] NCCL INFO NET : Using interface eth0:11.138.195.228<0>
worker-pytorch-0:46:46 [7] NCCL INFO NET/Socket : 1 interfaces found
worker-pytorch-0:46:46 [7] NCCL INFO rank 3 nranks 4

worker-pytorch-0:40:40 [4] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:40:40 [4] NCCL INFO Using internal Network Socket
NCCL version 2.3.7+cuda9.0
worker-pytorch-0:40:40 [4] NCCL INFO rank 0 nranks 4

worker-pytorch-0:43:43 [5] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:43:43 [5] NCCL INFO Using internal Network Socket
worker-pytorch-0:43:43 [5] NCCL INFO rank 1 nranks 4

worker-pytorch-0:45:45 [6] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:45:45 [6] NCCL INFO Using internal Network Socket
worker-pytorch-0:45:45 [6] NCCL INFO rank 2 nranks 4
worker-pytorch-0:46:46 [7] NCCL INFO comm 0x920ec650 rank 3 nranks 4
worker-pytorch-0:43:43 [5] NCCL INFO comm 0x93b8fe20 rank 1 nranks 4
worker-pytorch-0:40:40 [4] NCCL INFO comm 0x91c0ed20 rank 0 nranks 4
worker-pytorch-0:45:45 [6] NCCL INFO comm 0x8ed6c220 rank 2 nranks 4
worker-pytorch-0:45:45 [6] NCCL INFO NET : Using interface eth0:11.138.195.228<0>
worker-pytorch-0:40:40 [4] NCCL INFO NET : Using interface eth0:11.138.195.228<0>
worker-pytorch-0:43:43 [5] NCCL INFO NET : Using interface eth0:11.138.195.228<0>
worker-pytorch-0:45:45 [6] NCCL INFO NET/Socket : 1 interfaces found
worker-pytorch-0:40:40 [4] NCCL INFO NET/Socket : 1 interfaces found
worker-pytorch-0:43:43 [5] NCCL INFO NET/Socket : 1 interfaces found
worker-pytorch-0:43:43 [5] NCCL INFO Could not find real path of /sys/class/net/eth0/device
worker-pytorch-0:43:43 [5] NCCL INFO CUDA Dev 5, IP Interfaces : eth0(SOC)
worker-pytorch-0:45:45 [6] NCCL INFO Could not find real path of /sys/class/net/eth0/device
worker-pytorch-0:45:45 [6] NCCL INFO CUDA Dev 6, IP Interfaces : eth0(SOC)
worker-pytorch-0:40:40 [4] NCCL INFO Could not find real path of /sys/class/net/eth0/device
worker-pytorch-0:40:40 [4] NCCL INFO CUDA Dev 4, IP Interfaces : eth0(SOC)
worker-pytorch-0:46:46 [7] NCCL INFO Could not find real path of /sys/class/net/eth0/device
worker-pytorch-0:46:46 [7] NCCL INFO CUDA Dev 7, IP Interfaces : eth0(SOC)
worker-pytorch-0:46:46 [7] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
worker-pytorch-0:40:40 [4] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
worker-pytorch-0:43:43 [5] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
worker-pytorch-0:45:45 [6] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
worker-pytorch-0:40:40 [4] NCCL INFO Using 256 threads
worker-pytorch-0:40:40 [4] NCCL INFO Min Comp Cap 7
worker-pytorch-0:40:40 [4] NCCL INFO Ring 00 : 0 1 2 3
worker-pytorch-0:40:40 [4] NCCL INFO Ring 01 : 0 2 1 3
worker-pytorch-0:40:40 [4] NCCL INFO Ring 02 : 0 3 1 2
worker-pytorch-0:40:40 [4] NCCL INFO Ring 03 : 0 3 2 1
worker-pytorch-0:40:40 [4] NCCL INFO Ring 04 : 0 1 2 3
worker-pytorch-0:40:40 [4] NCCL INFO Ring 05 : 0 2 1 3
worker-pytorch-0:40:40 [4] NCCL INFO Ring 06 : 0 3 1 2
worker-pytorch-0:40:40 [4] NCCL INFO Ring 07 : 0 3 2 1
worker-pytorch-0:46:46 [7] NCCL INFO Ring 00 : 3[7] -> 0[4] via P2P/IPC
worker-pytorch-0:45:45 [6] NCCL INFO Ring 00 : 2[6] -> 3[7] via P2P/IPC
worker-pytorch-0:43:43 [5] NCCL INFO Ring 00 : 1[5] -> 2[6] via P2P/IPC
worker-pytorch-0:40:40 [4] NCCL INFO Ring 00 : 0[4] -> 1[5] via P2P/IPC
worker-pytorch-0:40:40 [4] NCCL INFO Ring 01 : 0[4] -> 2[6] via P2P/IPC
worker-pytorch-0:43:43 [5] NCCL INFO Ring 01 : 1[5] -> 3[7] via P2P/IPC
worker-pytorch-0:45:45 [6] NCCL INFO Ring 01 : 2[6] -> 1[5] via P2P/IPC
worker-pytorch-0:46:46 [7] NCCL INFO Ring 01 : 3[7] -> 0[4] via P2P/IPC
worker-pytorch-0:40:40 [4] NCCL INFO Ring 02 : 0[4] -> 3[7] via P2P/IPC
worker-pytorch-0:46:46 [7] NCCL INFO Ring 02 : 3[7] -> 1[5] via P2P/IPC
worker-pytorch-0:43:43 [5] NCCL INFO Ring 02 : 1[5] -> 2[6] via P2P/IPC
worker-pytorch-0:45:45 [6] NCCL INFO Ring 02 : 2[6] -> 0[4] via P2P/IPC
worker-pytorch-0:45:45 [6] NCCL INFO Ring 03 : 2[6] -> 1[5] via P2P/IPC
worker-pytorch-0:46:46 [7] NCCL INFO Ring 03 : 3[7] -> 2[6] via P2P/IPC
worker-pytorch-0:40:40 [4] NCCL INFO Ring 03 : 0[4] -> 3[7] via P2P/IPC
worker-pytorch-0:43:43 [5] NCCL INFO Ring 03 : 1[5] -> 0[4] via P2P/IPC
worker-pytorch-0:43:43 [5] NCCL INFO Ring 04 : 1[5] -> 2[6] via P2P/IPC
worker-pytorch-0:46:46 [7] NCCL INFO Ring 04 : 3[7] -> 0[4] via P2P/IPC
worker-pytorch-0:45:45 [6] NCCL INFO Ring 04 : 2[6] -> 3[7] via P2P/IPC
worker-pytorch-0:40:40 [4] NCCL INFO Ring 04 : 0[4] -> 1[5] via P2P/IPC
worker-pytorch-0:40:40 [4] NCCL INFO Ring 05 : 0[4] -> 2[6] via P2P/IPC
worker-pytorch-0:43:43 [5] NCCL INFO Ring 05 : 1[5] -> 3[7] via P2P/IPC
worker-pytorch-0:45:45 [6] NCCL INFO Ring 05 : 2[6] -> 1[5] via P2P/IPC
worker-pytorch-0:46:46 [7] NCCL INFO Ring 05 : 3[7] -> 0[4] via P2P/IPC
worker-pytorch-0:46:46 [7] NCCL INFO Ring 06 : 3[7] -> 1[5] via P2P/IPC
worker-pytorch-0:43:43 [5] NCCL INFO Ring 06 : 1[5] -> 2[6] via P2P/IPC
worker-pytorch-0:40:40 [4] NCCL INFO Ring 06 : 0[4] -> 3[7] via P2P/IPC
worker-pytorch-0:45:45 [6] NCCL INFO Ring 06 : 2[6] -> 0[4] via P2P/IPC
worker-pytorch-0:45:45 [6] NCCL INFO Ring 07 : 2[6] -> 1[5] via P2P/IPC
worker-pytorch-0:46:46 [7] NCCL INFO Ring 07 : 3[7] -> 2[6] via P2P/IPC
worker-pytorch-0:43:43 [5] NCCL INFO Ring 07 : 1[5] -> 0[4] via P2P/IPC
worker-pytorch-0:40:40 [4] NCCL INFO Ring 07 : 0[4] -> 3[7] via P2P/IPC
worker-pytorch-0:46:46 [7] NCCL INFO comm 0x920ec650 rank 3 nranks 4 - COMPLETE
worker-pytorch-0:43:43 [5] NCCL INFO comm 0x93b8fe20 rank 1 nranks 4 - COMPLETE
worker-pytorch-0:45:45 [6] NCCL INFO comm 0x8ed6c220 rank 2 nranks 4 - COMPLETE
worker-pytorch-0:40:40 [4] NCCL INFO comm 0x91c0ed20 rank 0 nranks 4 - COMPLETE
worker-pytorch-0:40:660 [4] NCCL INFO Launch mode Parallel
Model: resnet50
Batch size: 64
Number of GPUs: 16
Running warmup...
Running benchmark...
Iter #0: 88.9 img/sec per GPU
Iter #1: 82.0 img/sec per GPU
Iter #2: 85.4 img/sec per GPU
Iter #3: 82.9 img/sec per GPU
Iter #4: 86.7 img/sec per GPU
Iter #5: 85.2 img/sec per GPU

And GPU utilization is very low.

@bobzhuyb
Member

Let's continue the discussion in your performance issue, #68.

pleasantrabbit pushed a commit that referenced this issue Aug 13, 2020
* 1bit: not need to do wd mom for uncompressed gradients

* 1bit: fix typo

* 1bit: normal weight decay

* 1bit: update

* 1bit: update

* misc: fix typo