[Asynchronous] Why is asynchronous slower than synchronous? #271
After running the BytePS benchmark, I found that asynchronous training was slower than synchronous training:
https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md
Asynchronous training ran at around 144 images/sec, while synchronous training ran at around 176 images/sec. In both cases I followed the instructions in the distributed section.
Setup
I used a server with 8 x RTX 2080 Ti GPUs and 64 CPU threads, running the latest BytePS-PyTorch Docker images. I increased the thread count of the parameter server to 32 and of the scheduler to 16 to check whether they were the bottleneck.
Expected behavior
I expected asynchronous training to be as fast as, if not faster than, synchronous training.
Comments
Can you share your detailed setup? For example, how many workers and servers do you use? One reason I can think of is that the asynchronous design of BytePS involves an extra memory copy before sending out the tensors, while the synchronous implementation has no such copy. If the speeds of all workers are similar and the network is fast, the copy overhead might dominate, so you won't see the benefits of asynchrony in such cases.
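As a rough illustration, here is a toy cost model; every timing number in it is hypothetical, chosen only to show when the copy can dominate:

```python
# Toy cost model with hypothetical per-iteration timings (seconds).
t_fwd_bwd = 100e-3   # forward + backward compute (assumed)
t_comm = 10e-3       # gradient communication on a fast network (assumed)
t_copy = 15e-3       # extra memory copy before the asynchronous send (assumed)
t_straggler = 0.0    # homogeneous workers: no straggler time for async to hide

# Synchronous: communicate once per iteration, plus any wait for the
# slowest worker. Asynchronous: no waiting, but the copy sits on the
# critical path of every iteration.
t_sync = t_fwd_bwd + t_comm + t_straggler
t_async = t_fwd_bwd + t_copy

print(f"sync:  {t_sync * 1e3:.0f} ms/iter")
print(f"async: {t_async * 1e3:.0f} ms/iter")  # slower once copy > comm + straggler
```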
Thanks for the explanation and the fast response. I don't understand what additional copy overhead asynchronous training has if both modes are trained distributed.
See the memory copy operation in PyTorch: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L191 (we have similar implementations for TF and MXNet). As I said, you can expect to see the advantages of asynchrony in a real distributed setup, where the workers have distinct training speeds.
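A simplified sketch of that pattern (`byteps_push_pull_async` below is a hypothetical stand-in, not the actual BytePS op):

```python
import torch

def byteps_push_pull_async(tensor: torch.Tensor):
    """Stand-in for the real non-blocking BytePS push-pull op (hypothetical)."""
    return None  # the real op returns a handle to synchronize on later

def _async_push(param: torch.Tensor):
    # The tensor is staged into a fresh buffer before the send, so the
    # optimizer can keep updating `param` while communication is in
    # flight. This clone is the extra memory copy discussed above.
    staged = param.detach().clone()
    handle = byteps_push_pull_async(staged)
    return handle, staged
```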
After a bit of digging, I believe the problem is that for asynchronous training the communication doesn't overlap with the backward computation: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L132. After changing the asynchronous training to send gradients during the backward pass, it is slightly faster than synchronous training.
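A minimal sketch of that change, assuming a non-blocking `send_async` that stands in for the real BytePS push (not the actual patch):

```python
import torch

def send_async(grad: torch.Tensor) -> None:
    """Stand-in for a non-blocking BytePS push of one gradient (hypothetical)."""
    pass

def register_overlap_hooks(model: torch.nn.Module) -> None:
    for p in model.parameters():
        if p.requires_grad:
            # The hook fires during backward() as soon as p's gradient is
            # computed, so sending this layer's gradient overlaps with
            # computing the gradients of the layers before it.
            p.register_hook(send_async)
```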