
SINGA-487 Fusing gradients to reduce network latency #560

Merged
merged 2 commits into apache:master on Nov 13, 2019

Conversation

Contributor

@chrishkchris commented Nov 12, 2019

This PR reduces network latency by fusing gradients into the same memory buffer before sending them out with NCCL.
This removes much of the TCP/IP latency by reducing the number of NCCL API calls.
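
For illustration, here is a minimal Python sketch of the fusion idea (not the actual SINGA communicator code): per-tensor gradients are packed into one contiguous buffer, a single collective call is issued on that buffer, and the averaged results are copied back into the original shapes. The `all_reduce_sum` function is a hypothetical stand-in for the underlying NCCL all-reduce call.

```python
import numpy as np

def all_reduce_sum(buffer):
    # Hypothetical stand-in for the NCCL all-reduce (e.g. ncclAllReduce);
    # in a real multi-GPU run this would sum `buffer` across all ranks.
    return buffer

def fused_all_reduce(grads, world_size):
    # Pack all gradient tensors into one contiguous buffer so that only a
    # single collective call is needed, instead of one call per tensor.
    sizes = [g.size for g in grads]
    fused = np.concatenate([g.ravel() for g in grads])

    # One collective call over the fused buffer amortizes the per-call latency.
    fused = all_reduce_sum(fused) / world_size

    # Unpack the averaged gradients back into their original shapes.
    out, offset = [], 0
    for g, n in zip(grads, sizes):
        out.append(fused[offset:offset + n].reshape(g.shape))
        offset += n
    return out
```

The key point is that the cost of issuing a collective call is paid once per fused buffer rather than once per parameter tensor, which is where the latency saving comes from.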

Together with the changes from PR #555, here is a simple test to confirm that training remains correct:

ubuntu@ip-172-31-26-214:~/singa/examples/autograd$ python3 mnist_multiprocess.py
Starting Epoch 0:
Training loss = 831.072205, training accuracy = 0.700454
Evaluation accuracy = 0.927015, Elapsed Time = 0.676089s
Starting Epoch 1:
Training loss = 248.684601, training accuracy = 0.916183
Evaluation accuracy = 0.958265, Elapsed Time = 0.545179s
Starting Epoch 2:
Training loss = 172.330597, training accuracy = 0.943042
Evaluation accuracy = 0.967928, Elapsed Time = 0.543617s
Starting Epoch 3:
Training loss = 139.254807, training accuracy = 0.953425
Evaluation accuracy = 0.973067, Elapsed Time = 0.530805s
Starting Epoch 4:
Training loss = 115.329491, training accuracy = 0.960737
Evaluation accuracy = 0.976049, Elapsed Time = 0.530590s
Starting Epoch 5:
Training loss = 101.911728, training accuracy = 0.966179
Evaluation accuracy = 0.974095, Elapsed Time = 0.529574s
Starting Epoch 6:
Training loss = 90.820244, training accuracy = 0.969969
Evaluation accuracy = 0.980983, Elapsed Time = 0.530502s
Starting Epoch 7:
Training loss = 86.718071, training accuracy = 0.971037
Evaluation accuracy = 0.977590, Elapsed Time = 0.531085s
Starting Epoch 8:
Training loss = 79.507553, training accuracy = 0.973675
Evaluation accuracy = 0.976562, Elapsed Time = 0.529935s
Starting Epoch 9:
Training loss = 78.784409, training accuracy = 0.974025
Evaluation accuracy = 0.980469, Elapsed Time = 0.530919s

@chrishkchris changed the title from "SINGA-487 Accumulate gradients to reduce network latency" to "SINGA-487 Fusing gradients to reduce network latency" on Nov 12, 2019
@nudles merged commit 58e346e into apache:master on Nov 13, 2019