[Asynchronous] Why is asynchronous slower than synchronous? #271
After running the BytePS benchmark, I found that asynchronous training was slower than synchronous training:
https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md
Asynchronous training ran at around 144 images/sec, while synchronous training ran at around 176 images/sec. In both cases I followed the instructions in the distributed section.
Setup
I used a server with 8 x RTX 2080 Ti GPUs and 64 CPU threads, running the latest BytePS-PyTorch Docker images. I increased the thread count of the parameter server to 32 and of the scheduler to 16 to check whether they were the bottleneck.
Expected behavior
I expected asynchronous training to be as fast as, if not faster than, synchronous training.
Comments
Can you share your detailed setup? For example, how many workers and servers do you use? One reason I can think of is that the asynchronous design of BytePS involves an extra memory copy before sending out the tensors, while the synchronous implementation has no such copy. If the speeds of all workers are similar and the network is fast, the copy overhead might dominate, so you won't see the benefits of asynchrony in such cases.
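As a rough illustration, here is a toy cost model; every timing number in it is hypothetical, chosen only to show when the copy can dominate:

```python
# Toy cost model with hypothetical per-iteration timings (seconds).
t_fwd_bwd = 100e-3   # forward + backward compute (assumed)
t_comm = 10e-3       # gradient communication on a fast network (assumed)
t_copy = 15e-3       # extra memory copy before the asynchronous send (assumed)
t_straggler = 0.0    # homogeneous workers: no straggler time for async to hide

# Synchronous: communicate once per iteration, plus any wait for the
# slowest worker. Asynchronous: no waiting, but the copy sits on the
# critical path of every iteration.
t_sync = t_fwd_bwd + t_comm + t_straggler
t_async = t_fwd_bwd + t_copy

print(f"sync:  {t_sync * 1e3:.0f} ms/iter")
print(f"async: {t_async * 1e3:.0f} ms/iter")  # slower once copy > comm + straggler
```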
Thanks for the explanation and the fast response. I don't understand what additional copy overhead asynchronous training has if both modes are trained distributed.
See the memory copy operation in PyTorch: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L191 (we have similar implementations for TF and MXNet). As I said, you can expect to see the advantages of asynchrony in a real distributed setup, where the workers have distinct training speeds.
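A simplified sketch of that pattern (`byteps_push_pull_async` below is a hypothetical stand-in, not the actual BytePS op):

```python
import torch

def byteps_push_pull_async(tensor: torch.Tensor):
    """Stand-in for the real non-blocking BytePS push-pull op (hypothetical)."""
    return None  # the real op returns a handle to synchronize on later

def _async_push(param: torch.Tensor):
    # The tensor is staged into a fresh buffer before the send, so the
    # optimizer can keep updating `param` while communication is in
    # flight. This clone is the extra memory copy discussed above.
    staged = param.detach().clone()
    handle = byteps_push_pull_async(staged)
    return handle, staged
```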
After a bit of digging, I believe the problem is that for asynchronous training the communication doesn't overlap with the backward computation: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L132. After changing the asynchronous training to send gradients during the backward pass, it is slightly faster than synchronous training.
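A minimal sketch of that change, assuming a non-blocking `send_async` that stands in for the real BytePS push (not the actual patch):

```python
import torch

def send_async(grad: torch.Tensor) -> None:
    """Stand-in for a non-blocking BytePS push of one gradient (hypothetical)."""
    pass

def register_overlap_hooks(model: torch.nn.Module) -> None:
    for p in model.parameters():
        if p.requires_grad:
            # The hook fires during backward() as soon as p's gradient is
            # computed, so sending this layer's gradient overlaps with
            # computing the gradients of the layers before it.
            p.register_hook(send_async)
```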