
When doing distributed training, the bandwidth becomes the bottleneck #10017

Closed
GoodJoey opened this issue Mar 7, 2018 · 4 comments

Comments

@GoodJoey

GoodJoey commented Mar 7, 2018

Is there any experience or magic solution for doing distributed training with MXNet?
Say, is there a way to force each worker to communicate only with the parameter server on its own node to save bandwidth, or something like ring-allreduce (Horovod)?
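For context, this is roughly how the distributed kvstore is used today (a simplified sketch; the key and shapes are placeholders, and it only runs when the processes are started through tools/launch.py, which sets up the scheduler, server, and worker roles):

```python
import mxnet as mx

# Parameter-server kvstore: keys (parameters) are sharded across the server
# processes, so every worker exchanges gradients with every server over the network.
kv = mx.kv.create('dist_sync')

shape = (2, 3)
kv.init(3, mx.nd.ones(shape))              # initialize key 3 once across the cluster
kv.push(3, mx.nd.ones(shape) * kv.rank)    # each worker pushes its local gradient
out = mx.nd.zeros(shape)
kv.pull(3, out=out)                        # pull the aggregated result back
print(out.asnumpy())
```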

@huyangc

huyangc commented Mar 7, 2018

You could try gradient compression first. It is easier.
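Something like the following should be enough to try it (a rough sketch; the threshold value is just an illustrative default):

```python
import mxnet as mx

# Enable 2-bit gradient compression on the distributed kvstore:
# values above the threshold are sent as +threshold, values below -threshold
# are sent as -threshold, and the remainder is kept locally as a residual
# that is added to the next gradient before quantization.
kv = mx.kv.create('dist_sync')
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})

# Then pass this kvstore to Module.fit or gluon.Trainer as usual, e.g.:
# model.fit(train_iter, kvstore=kv, ...)
```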

@sandeep-krishnamurthy
Contributor

@rahul003

@rahul003
Member

rahul003 commented Mar 9, 2018

What kind of network are you training?

@GoodJoey
Author

Networks such as ResNet-101.
When you have a lot of data to process, you need data-parallel distributed training to save time.
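Roughly, each worker runs something like the sketch below (a simplified illustration; the model, learning rate, and data loop are placeholders, and the worker processes are started with tools/launch.py so the kvstore can connect to the servers):

```python
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.model_zoo import vision

# Each worker trains the same model on its own shard of the data (data parallelism);
# gradients are aggregated through the distributed kvstore.
store = mx.kv.create('dist_sync')
net = vision.resnet101_v1(classes=1000)
net.initialize(mx.init.Xavier(), ctx=mx.gpu())
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1 * store.num_workers},
                        kvstore=store)

# for batch in train_data:            # train_data: this worker's shard
#     with autograd.record():
#         loss = loss_fn(net(batch.data), batch.label)
#     loss.backward()
#     trainer.step(batch_size)
```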
