[BUG] Performance regression in v2.1.0 #1621
Related PR: #1406
Interesting, so it's related to the gelu activation function?
If the performance regression I see with v2.1.0 comes from the change in the gelu activation function, then I would say it affects performance a lot. I also noticed that in PR #1406 there was a concern about running long MD simulations using op_module.gelu.
The original algorithm of our op_module.gelu is indeed the same as the approximated tf.nn.gelu function. The major difference is that TensorFlow's tf.nn.gelu was implemented via the Python API, which is inefficient in a CUDA environment. Another motivation for #1406 was to align more closely with TensorFlow's framework, reducing the maintenance burden of our own code. Given such a large performance gap, perhaps we should return to our own implementation. The error seen when running long MD simulations using op_module.gelu is still unclear.
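For reference, the tanh-based approximation discussed above (what the approximated tf.nn.gelu computes) can be compared against the exact erf-based GELU with a small stdlib-only Python sketch. The function names here are illustrative, not DeePMD-kit's actual API:

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation, as used by the approximated tf.nn.gelu.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two variants agree to roughly 1e-3 for typical inputs, so exposing both (as the thread proposes with gelu and gelu_tf) lets users choose the faster custom op without a meaningful accuracy change.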
@denghuilu we can provide both options, maybe called gelu and gelu_tf. |
Sounds good
Bug summary
Hi,
I noticed that potentials trained with deepmd-kit v2.1.0 are much slower than those trained with v2.0.3.
Using the same input, a potential trained with v2.1.0 achieves only around 90 timesteps/s in my MD simulations (LAMMPS),
while for the same simulations a potential trained with deepmd-kit v2.0.3 achieves around 160 timesteps/s.
(Note this is with the same version of LAMMPS, the one distributed with deepmd-kit v2.1.0.)
I also noticed that the number of ops in the final graph is different.
trained with v2.1.0: 979 ops in the final graph.
trained with v2.0.3: 614 ops in the final graph.
Again, this is using the same input files for training.
deepmd-kit was installed using conda:
conda create -n deepmd deepmd-kit=*=gpu libdeepmd==*gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org
/Daniel
DeePMD-kit Version
2.1.0
TensorFlow Version
2.7.0
How did you download the software?
conda
Input Files, Running Commands, Error Log, etc.
Due to the sensitive nature of the project I can only include a redacted input file.
input.zip
Steps to Reproduce
Train a potential using deepmd-kit v2.1.0/v2.0.3 and compare their performance in a LAMMPS simulation.
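A minimal reproduction workflow might look like the following sketch. The environment names and graph filenames are placeholders; `dp train` and `dp freeze` are the standard DeePMD-kit training/freezing commands, and the `pair_style deepmd` lines belong in the LAMMPS input script:

```shell
# Train with each version in its own conda environment (illustrative env names).
conda activate deepmd-2.1.0
dp train input.json          # train the potential from the shared input file
dp freeze -o graph_2.1.0.pb  # freeze the trained model into a graph

conda activate deepmd-2.0.3
dp train input.json
dp freeze -o graph_2.0.3.pb

# In the LAMMPS input script, point pair_style at each frozen graph in turn:
#   pair_style deepmd graph_2.1.0.pb
#   pair_coeff * *
# Then run the same MD simulation with both graphs and compare the
# timesteps/s reported in the LAMMPS performance summary.
```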
Further Information, Files, and Links
No response