
Fix gradient not averaged when parallel training. #1104

Merged · 2 commits merged into deepmodeling:devel on Sep 6, 2021

Conversation

@shishaochen (Collaborator) commented Sep 6, 2021

The gradient-averaging ops are injected in horovod.tensorflow.DistributedOptimizer.compute_gradients.

Before this pull request, each worker trained on its own data without any gradient synchronization; the only communication between workers was the broadcast of variables at the beginning of training. A sketch of the intended pattern is shown below.

This bug may explain why parallel training did not accelerate loss convergence.
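For context, here is a minimal sketch of the Horovod pattern the fix relies on. This is not the DeePMD-kit trainer code; the toy variable, loss, and optimizer settings are illustrative only.

```python
# Minimal sketch (illustrative, not the DeePMD-kit trainer code) of the Horovod
# pattern: gradient averaging only happens inside compute_gradients of the
# *wrapped* DistributedOptimizer.
import tensorflow as tf
import horovod.tensorflow as hvd

tf.compat.v1.disable_eager_execution()
hvd.init()

# Toy variable and loss, only so there is something to differentiate.
w = tf.compat.v1.get_variable(
    "w", shape=[1], initializer=tf.compat.v1.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))

opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3)
opt = hvd.DistributedOptimizer(opt)  # injects the allreduce (averaging) ops

# The cross-worker averaging is performed here, inside the wrapped optimizer's
# compute_gradients; calling compute_gradients on the inner AdamOptimizer
# instead would skip the averaging and let each worker train independently.
grads_and_vars = opt.compute_gradients(loss)
train_op = opt.apply_gradients(grads_and_vars)

# Broadcasting the initial variables from rank 0 was, before this fix, the only
# synchronization between workers.
bcast_op = hvd.broadcast_global_variables(0)
```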

@codecov-commenter commented Sep 6, 2021

Codecov Report

Merging #1104 (639ec28) into devel (a5bdd14) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##            devel    #1104      +/-   ##
==========================================
- Coverage   75.72%   75.71%   -0.01%     
==========================================
  Files          88       88              
  Lines        6998     6997       -1     
==========================================
- Hits         5299     5298       -1     
  Misses       1699     1699              
Impacted Files              Coverage Δ
deepmd/train/trainer.py     72.83% <100.00%> (-0.07%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5bdd14...639ec28.

@amcadmus merged commit 32ccbb5 into deepmodeling:devel on Sep 6, 2021