
When doing distributed training, the bandwidth becomes the bottleneck #10017

Closed
GoodJoey opened this issue Mar 7, 2018 · 4 comments

Comments

@GoodJoey

GoodJoey commented Mar 7, 2018

Is there any experience or magic solution for doing distributed training with MXNet?
Say, is there a way to force each worker to communicate only with the parameter server on its own node to save bandwidth, or something like ring-allreduce (Horovod)?
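For context, this is roughly how the distributed kvstore is used today (a simplified sketch; the key and shapes are placeholders, and it only runs when the processes are started through tools/launch.py, which sets up the scheduler, server, and worker roles):

```python
import mxnet as mx

# Parameter-server kvstore: keys (parameters) are sharded across the server
# processes, so every worker exchanges gradients with every server over the network.
kv = mx.kv.create('dist_sync')

shape = (2, 3)
kv.init(3, mx.nd.ones(shape))              # initialize key 3 once across the cluster
kv.push(3, mx.nd.ones(shape) * kv.rank)    # each worker pushes its local gradient
out = mx.nd.zeros(shape)
kv.pull(3, out=out)                        # pull the aggregated result back
print(out.asnumpy())
```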

@huyangc

huyangc commented Mar 7, 2018

You could try gradient compression first. It is easier.
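Something like the following should be enough to try it (a rough sketch; the threshold value is just an illustrative default):

```python
import mxnet as mx

# Enable 2-bit gradient compression on the distributed kvstore:
# values above the threshold are sent as +threshold, values below -threshold
# are sent as -threshold, and the remainder is kept locally as a residual
# that is added to the next gradient before quantization.
kv = mx.kv.create('dist_sync')
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})

# Then pass this kvstore to Module.fit or gluon.Trainer as usual, e.g.:
# model.fit(train_iter, kvstore=kv, ...)
```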

@sandeep-krishnamurthy
Contributor

@rahul003

@rahul003
Member

rahul003 commented Mar 9, 2018

What kind of network are you training?

@GoodJoey
Author

Networks such as ResNet-101.
When you have a lot of data to process, you need data-parallel distributed training to save time.
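Roughly, each worker runs something like the sketch below (a simplified illustration; the model, learning rate, and data loop are placeholders, and the worker processes are started with tools/launch.py so the kvstore can connect to the servers):

```python
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.model_zoo import vision

# Each worker trains the same model on its own shard of the data (data parallelism);
# gradients are aggregated through the distributed kvstore.
store = mx.kv.create('dist_sync')
net = vision.resnet101_v1(classes=1000)
net.initialize(mx.init.Xavier(), ctx=mx.gpu())
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1 * store.num_workers},
                        kvstore=store)

# for batch in train_data:            # train_data: this worker's shard
#     with autograd.record():
#         loss = loss_fn(net(batch.data), batch.label)
#     loss.backward()
#     trainer.step(batch_size)
```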
