
Fix gradient not averaged when parallel training. #1104

Merged · 2 commits merged into deepmodeling:devel on Sep 6, 2021

Conversation

@shishaochen (Collaborator) commented Sep 6, 2021

The gradient-averaging ops are injected in horovod.tensorflow.DistributedOptimizer.compute_gradients.

Before this pull request, each worker trained on its own data without any gradient synchronization; the only communication between workers was the broadcast of variables at the beginning of training. A sketch of the intended pattern is shown below.

This bug may explain why parallel training did not accelerate loss convergence.
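For context, here is a minimal sketch of the Horovod pattern the fix relies on. This is not the DeePMD-kit trainer code; the toy variable, loss, and optimizer settings are illustrative only.

```python
# Minimal sketch (illustrative, not the DeePMD-kit trainer code) of the Horovod
# pattern: gradient averaging only happens inside compute_gradients of the
# *wrapped* DistributedOptimizer.
import tensorflow as tf
import horovod.tensorflow as hvd

tf.compat.v1.disable_eager_execution()
hvd.init()

# Toy variable and loss, only so there is something to differentiate.
w = tf.compat.v1.get_variable(
    "w", shape=[1], initializer=tf.compat.v1.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))

opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3)
opt = hvd.DistributedOptimizer(opt)  # injects the allreduce (averaging) ops

# The cross-worker averaging is performed here, inside the wrapped optimizer's
# compute_gradients; calling compute_gradients on the inner AdamOptimizer
# instead would skip the averaging and let each worker train independently.
grads_and_vars = opt.compute_gradients(loss)
train_op = opt.apply_gradients(grads_and_vars)

# Broadcasting the initial variables from rank 0 was, before this fix, the only
# synchronization between workers.
bcast_op = hvd.broadcast_global_variables(0)
```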

@codecov-commenter commented Sep 6, 2021

Codecov Report

Merging #1104 (639ec28) into devel (a5bdd14) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##            devel    #1104      +/-   ##
==========================================
- Coverage   75.72%   75.71%   -0.01%     
==========================================
  Files          88       88              
  Lines        6998     6997       -1     
==========================================
- Hits         5299     5298       -1     
  Misses       1699     1699              
Impacted Files              Coverage Δ
deepmd/train/trainer.py     72.83% <100.00%> (-0.07%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5bdd14...639ec28.

@amcadmus merged commit 32ccbb5 into deepmodeling:devel on Sep 6, 2021