
do some small optimization to ops #943

Merged: 5 commits merged into deepmodeling:devel from the optimize-ops branch on Feb 11, 2022

Conversation

@njzjz (Member) commented Aug 10, 2021

1. Avoid concat or add in loops. Instead, append tensors to a list and concat or accumulate_n them after the loop (see the sketch below).
2. Remove a duplicated reshape.
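
For illustration, a minimal sketch of the list-then-reduce pattern, using toy tensors rather than the actual deepmd-kit code (ntypes and branches below are made-up stand-ins):

import tensorflow as tf

# Toy stand-ins for the per-type tensors produced inside the loop.
ntypes = 4
branches = [tf.fill([2, 3], float(i)) for i in range(ntypes)]

# Before (schematic): ret = ret + xyz, or ret = tf.concat([ret, xyz], ...),
# inside the loop adds one graph op per iteration.

# After: append to a Python list in the loop and reduce once afterwards.
acc = []
for type_i in range(ntypes):
    acc.append(branches[type_i])
ret_sum = tf.math.accumulate_n(acc)   # one op instead of ntypes - 1 adds
ret_cat = tf.concat(acc, axis=1)      # one op instead of repeated concats
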
@codecov-commenter commented Aug 10, 2021

Codecov Report

Merging #943 (30f8e7c) into devel (0d8fe0a) will increase coverage by 0.01%.
The diff coverage is 81.25%.

@@            Coverage Diff             @@
##            devel     #943      +/-   ##
==========================================
+ Coverage   75.67%   75.68%   +0.01%     
==========================================
  Files          92       92              
  Lines        7671     7671              
==========================================
+ Hits         5805     5806       +1     
+ Misses       1866     1865       -1     
Impacted Files              Coverage Δ
deepmd/fit/polar.py         49.75% <50.00%> (+0.48%) ⬆️
deepmd/descriptor/se_a.py   94.17% <100.00%> (ø)
deepmd/fit/dipole.py        93.24% <100.00%> (ø)
deepmd/fit/ener.py          90.90% <100.00%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d8fe0a...30f8e7c.

@amcadmus (Member) commented

Have you benchmarked these optimizations? Do they help improve efficiency?

@njzjz marked this pull request as draft August 10, 2021 02:40
@njzjz (Member, Author) commented Aug 10, 2021

Have you benchmarked these optimizations? Do they help improve efficiency?

I just benchmarked it. The answer is no😂

@njzjz closed this Aug 10, 2021
@njzjz reopened this Jan 13, 2022
@njzjz removed the request for review from denghuilu January 13, 2022 13:50
@njzjz (Member, Author) commented Jan 13, 2022

I think these optimizations may matter more for CPUs than for GPUs. I will recheck this PR.

@njzjz (Member, Author) commented Jan 14, 2022

Did some profiling here:

(1) + vs accumulate_n
Applying + one by one produces more ops in the graph than a single accumulate_n.

[profiling screenshot: chained +]

[profiling screenshot: accumulate_n]

(2) concat

[profiling screenshot: concat]

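To make the op-count difference concrete, here is a small experiment one could run with plain TensorFlow (not deepmd-kit code); the chained + graph ends up with one add node per extra operand, while accumulate_n collapses the sum into a single node:

import tensorflow as tf

# Graph built with chained +: one add op per extra operand.
g_plus = tf.Graph()
with g_plus.as_default():
    xs = [tf.constant([float(i)]) for i in range(8)]
    total = xs[0]
    for x in xs[1:]:
        total = total + x
print("chained +   :", len(g_plus.get_operations()), "ops")

# Graph built with a single accumulate_n over the same operands.
g_acc = tf.Graph()
with g_acc.as_default():
    xs = [tf.constant([float(i)]) for i in range(8)]
    total = tf.math.accumulate_n(xs)
print("accumulate_n:", len(g_acc.get_operations()), "ops")
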
@njzjz marked this pull request as ready for review January 14, 2022 07:36
@@ -797,12 +798,12 @@ def _filter(
bavg = bavg,
trainable = trainable,
suffix = "_"+str(type_i))
if type_i == 0:
A Collaborator commented on this diff:

Did we have a bug here? If type_i == 0 and (type_input, type_i) in self.exclude_types, ret was still accumulated.
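
A hedged sketch of the accumulation pattern this question is about; build_branch, type_input, and exclude_types below are illustrative stand-ins, not the actual _filter() code:

import tensorflow as tf

ntypes = 3
type_input = 0
exclude_types = {(0, 0)}          # hypothetical excluded type pair

def build_branch(type_i):
    # stand-in for the per-type network output inside _filter()
    return tf.fill([2, 4], float(type_i + 1))

# List-based accumulation (as in this PR): an excluded pair is skipped
# before anything is appended, so it never contributes to ret.
ret_list = []
for type_i in range(ntypes):
    if (type_input, type_i) in exclude_types:
        continue
    ret_list.append(build_branch(type_i))
ret = tf.math.accumulate_n(ret_list)

# The question above concerns the previous branch-based code, where the
# `if type_i == 0:` branch initialized ret unconditionally, so an excluded
# (type_input, 0) pair could still be accumulated into ret.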

@wanghan-iapcm (Collaborator) commented Jan 15, 2022

@denghuilu Would the revised code be faster on GPUs?

@njzjz (Member, Author) commented Jan 16, 2022

I don't think one can see any difference with only one or two atom types. A system with at least 10 atom types should be tested.

@denghuilu (Member) commented

There is a slight performance penalty on V100 GPU with the water benchmark system:

optimize-ops branch


DEEPMD INFO    batch     100 training time 3.36 s, testing time 2.34 s
DEEPMD INFO    batch     200 training time 1.73 s, testing time 2.32 s
DEEPMD INFO    batch     300 training time 1.75 s, testing time 2.32 s
DEEPMD INFO    batch     400 training time 1.73 s, testing time 2.41 s
DEEPMD INFO    batch     500 training time 1.72 s, testing time 2.37 s
DEEPMD INFO    batch     600 training time 1.74 s, testing time 2.36 s
DEEPMD INFO    batch     700 training time 1.76 s, testing time 2.43 s
DEEPMD INFO    batch     800 training time 1.77 s, testing time 2.48 s
DEEPMD INFO    batch     900 training time 1.75 s, testing time 2.47 s
DEEPMD INFO    batch    1000 training time 1.72 s, testing time 2.41 s

devel branch

DEEPMD INFO    batch     100 training time 3.03 s, testing time 0.02 s
DEEPMD INFO    batch     200 training time 1.60 s, testing time 0.02 s
DEEPMD INFO    batch     300 training time 1.63 s, testing time 0.02 s
DEEPMD INFO    batch     400 training time 1.59 s, testing time 0.02 s
DEEPMD INFO    batch     500 training time 1.58 s, testing time 0.02 s
DEEPMD INFO    batch     600 training time 1.62 s, testing time 0.02 s
DEEPMD INFO    batch     700 training time 1.59 s, testing time 0.02 s
DEEPMD INFO    batch     800 training time 1.58 s, testing time 0.02 s
DEEPMD INFO    batch     900 training time 1.60 s, testing time 0.02 s

Maybe the GPU implementation does not use stream parallelization.

@wanghan-iapcm (Collaborator) commented

There is a slight performance penalty on V100 GPU with the water benchmark system: […]

Why is the testing time of optimize-ops so long?

@njzjz (Member, Author) commented Feb 10, 2022

Why is the testing time of optimize-ops so long?

It was fixed by #1419 -- this branch is behind devel.

@denghuilu (Member) commented

It did have some benefits:

DEEPMD INFO    batch     200 training time 1.59 s, testing time 0.02 s
DEEPMD INFO    batch     300 training time 1.56 s, testing time 0.02 s
DEEPMD INFO    batch     400 training time 1.57 s, testing time 0.02 s
DEEPMD INFO    batch     500 training time 1.59 s, testing time 0.02 s
DEEPMD INFO    batch     600 training time 1.59 s, testing time 0.02 s
DEEPMD INFO    batch     700 training time 1.60 s, testing time 0.02 s
DEEPMD INFO    batch     800 training time 1.60 s, testing time 0.02 s
DEEPMD INFO    batch     900 training time 1.60 s, testing time 0.02 s
DEEPMD INFO    batch    1000 training time 1.57 s, testing time 0.02 s

@wanghan-iapcm merged commit 82c787d into deepmodeling:devel on Feb 11, 2022