
[Asynchronous] Why is asynchronous slower than synchronous? #271

Open
idoh opened this issue Jul 18, 2020 · 4 comments

idoh commented Jul 18, 2020

After running the BytePS benchmark, I found that asynchronous training was slower than synchronous training:
https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md

Asynchronous training ran at around 144 images/sec, while synchronous training ran at around 176 images/sec. In both cases, I followed the instructions in the distributed section.

Setup
I used a server with 8 x RTX 2080 Ti GPUs and 64 CPU threads, running the latest BytePS-PyTorch Docker images. I increased the thread count of the parameter server to 32 and of the scheduler to 16 to check whether they were the bottleneck.

Expected behavior
I expected asynchronous training to be at least as fast as, if not faster than, synchronous training.

idoh changed the title from "[Asynchronous] Why is asynchronous training slower than synchronous?" to "[Asynchronous] Why is asynchronous slower than synchronous?" on Jul 18, 2020
ymjiang commented Jul 18, 2020

Can you share your detailed setup? For example, how many workers & servers do you use?

One reason I can think of is that the asynchronous design of BytePS involves an extra memory copy before sending out the tensors, while the synchronous implementation does not have such a copy. If all workers run at similar speeds and the network is fast, the copy overhead may dominate, so you won't see any benefit from asynchronous training in such cases.
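
A simplified sketch of the difference (this is an illustration only, not the actual BytePS code; `comm_op` stands in for the push-pull communication call):

```python
# Simplified illustration only -- not the actual BytePS implementation.
# comm_op stands in for the push-pull communication call.
import torch

def sync_path(grad: torch.Tensor, comm_op):
    # Synchronous mode can hand the gradient buffer to the communication
    # layer directly: the worker waits for the reduced result before it
    # touches the tensor again.
    comm_op(grad)
    return grad

def async_path(grad: torch.Tensor, comm_op):
    # Asynchronous mode keeps computing while the send is in flight, so the
    # gradient is first staged into a separate buffer -- this is the extra
    # memory copy mentioned above.
    staging = grad.clone()
    comm_op(staging)
    return staging
```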

idoh commented Jul 18, 2020

Thanks for the explanation and fast response.
I'm using a single machine running 2 workers, 1 parameter server, and 1 scheduler. I know that plain local training would be better, but I wanted to try the distributed setup before launching on real servers. I'm using the BytePS ResNet-50 benchmark script byteps/example/pytorch/benchmark_byteps.py, and I trained in both synchronous and asynchronous mode in this setup.

I don't understand what additional copy overhead asynchronous training has if both runs are distributed.
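
For reference, this is roughly how I'm launching everything on the single machine (a sketch with a placeholder address/port; the DMLC_* variables follow the step-by-step tutorial, and BYTEPS_ENABLE_ASYNC is what I set for the asynchronous runs, if I'm reading the docs correctly):

```python
# Rough sketch of my single-machine launch -- placeholder IP/port, and the
# per-worker GPU assignment (NVIDIA_VISIBLE_DEVICES) is omitted here.
import os
import subprocess

common = {
    "DMLC_NUM_WORKER": "2",
    "DMLC_NUM_SERVER": "1",
    "DMLC_PS_ROOT_URI": "127.0.0.1",   # placeholder scheduler address
    "DMLC_PS_ROOT_PORT": "1234",       # placeholder scheduler port
    # "BYTEPS_ENABLE_ASYNC": "1",      # set this for the asynchronous runs
}

def launch(role, worker_id=None, cmd=()):
    env = {**os.environ, **common, "DMLC_ROLE": role}
    if worker_id is not None:
        env["DMLC_WORKER_ID"] = str(worker_id)
    return subprocess.Popen(["bpslaunch", *cmd], env=env)

benchmark = ["python3", "byteps/example/pytorch/benchmark_byteps.py"]
procs = [
    launch("scheduler"),
    launch("server"),
    launch("worker", worker_id=0, cmd=benchmark),
    launch("worker", worker_id=1, cmd=benchmark),
]
for p in procs:
    p.wait()
```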

ymjiang commented Jul 18, 2020

See the memory copy operation in PyTorch: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L191. (We also have a similar implementation for TF and MXNet.)

As I said, you can expect to see the advantages of asynchronous training in a real distributed setup, where the workers have different training speeds.

idoh commented Nov 25, 2020

After a bit of digging, I believe the problem is that in asynchronous training the communication does not overlap with the backward computation: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L132

After changing the asynchronous training to send gradients during the backward pass, it trains slightly faster than the synchronous version.
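
The change is roughly along these lines (a sketch of the idea rather than the exact patch; `push_async` stands in for a non-blocking push-pull call):

```python
# Sketch of the idea only -- not the exact patch. push_async stands in for a
# non-blocking communication call; handles collects the in-flight operations.
import torch

def attach_overlap_hooks(model: torch.nn.Module, push_async, handles: dict):
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        def make_hook(param_name):
            def hook(grad):
                # Start the push as soon as this gradient is produced, so the
                # transfer overlaps with the rest of the backward pass instead
                # of everything being sent at optimizer.step().
                handles[param_name] = push_async(grad, name=param_name)
                return grad
            return hook

        param.register_hook(make_hook(name))

# At step time the optimizer then only waits for (or, in fully asynchronous
# mode, simply consumes) the outstanding handles instead of starting the
# communication from scratch.
```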
