
[BUG] Performance regression in v2.1.0 #1621

Closed
Dankomaister opened this issue Apr 5, 2022 · 8 comments · Fixed by #1795

@Dankomaister

Bug summary

Hi,

I noticed that potentials trained with deepmd-kit v2.1.0 are much slower than those trained with v2.0.3. Using the same input, a potential trained with v2.1.0 only achieves around 90 timesteps/s in my MD simulations (LAMMPS), while a potential trained with v2.0.3 achieves around 160 timesteps/s in the same simulations.
(Note that this is using the same version of LAMMPS, the one distributed with deepmd-kit v2.1.0.)

I also noticed that the number of ops in the final graph is different:
trained with v2.1.0: 979 ops in the final graph
trained with v2.0.3: 614 ops in the final graph
Again, this is using the same input files for training.
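
In case it is useful, a minimal sketch of how such op counts can be obtained from a frozen graph (the file name is illustrative, assuming the default output of dp freeze):

```python
import tensorflow as tf

# Load the frozen GraphDef and count its nodes ("ops in the final graph").
graph_def = tf.compat.v1.GraphDef()
with open("frozen_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())
print(len(graph_def.node), "ops in the final graph")
```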

deepmd-kit was installed using conda:
conda create -n deepmd deepmd-kit=*=gpu libdeepmd==*gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org

/Daniel

DeePMD-kit Version

2.1.0

TensorFlow Version

2.7.0

How did you download the software?

conda

Input Files, Running Commands, Error Log, etc.

Due to the sensitive nature of the project, I can only include a redacted input file.
input.zip

Steps to Reproduce

Train potentials using deepmd-kit v2.1.0 and v2.0.3, and compare their performance in a LAMMPS simulation.

Further Information, Files, and Links

No response

@njzjz
Member

njzjz commented Apr 5, 2022

related PR: #1406

@Dankomaister
Author

Interesting, so it's related to the gelu activation function?
Which version (tf.nn.gelu or op_module.gelu) is used in 2.0.3 vs 2.1.0?

@njzjz
Member

njzjz commented Apr 5, 2022

Interesting, so it's related to the gelu activation function?

Which version (tf.nn.gelu or op_module.gelu) is used in 2.0.3 vs 2.1.0?

tf.nn.gelu is used in v2.1.0. Maybe we should switch back and compare the performance.

@Dankomaister
Author

Dankomaister commented Apr 6, 2022

If the performance regression I see with v2.1.0 comes from the change in the gelu activation function, then I would say it affects performance a lot. Also, I noticed that in PR #1406 there was a concern about running long MD simulations using op_module.gelu. I have run many very long MD simulations (over 200 million timesteps each) using v2.0.3 with the gelu activation function without any problems.

@denghuilu
Member

The original algorithm of our op_module.gelu is indeed the same as the approximate tf.nn.gelu function:
op_module.gelu: https://github.com/leeleolay/deepmd-kit/blob/b4603e3cacc121eab6fa77ecf15bee8b20b72369/source/lib/src/cuda/gelu.cu#L4-L15
tf.nn.gelu: https://github.com/tensorflow/tensorflow/blob/cbeb0a4c6ddf02fddf6635b92f5e5dcb3a2a04be/tensorflow/python/ops/nn_ops.py#L3665-L3713

The major difference is that TensorFlow's tf.nn.gelu is implemented through the Python API, which is inefficient in a CUDA environment.

Another motivation for #1406 was to move closer to TensorFlow's own framework and reduce the maintenance burden of our custom code. Given such a large performance gap, maybe we should return to our implementation. The error seen when running long MD simulations with op_module.gelu is still unclear.
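
For reference, both versions compute the same tanh approximation of GeLU; a minimal sketch in plain TensorFlow (illustrative only, not the fused kernel):

```python
import math
import tensorflow as tf

def gelu_tanh_approx(x):
    # gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    # tf.nn.gelu(x, approximate=True) builds this from many small Python-API
    # ops, while op_module.gelu fuses the whole expression into one CUDA kernel.
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + tf.tanh(c * (x + 0.044715 * tf.pow(x, 3))))
```

The extra ops from the Python-API expansion would also be consistent with the 614 vs. 979 op-count difference reported above.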

@njzjz
Member

njzjz commented Apr 6, 2022

@denghuilu we can provide both options, maybe called gelu and gelu_tf.
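
Roughly like this (just a sketch; the op_module import path and the final wiring are assumptions, not the actual implementation):

```python
import tensorflow as tf
from deepmd.env import op_module  # custom fused kernels (assumed import path)

def get_activation_fn(name):
    # "gelu": the fused CUDA kernel; "gelu_tf": TensorFlow's Python-API version.
    if name == "gelu":
        return op_module.gelu
    if name == "gelu_tf":
        return lambda x: tf.nn.gelu(x, approximate=True)
    raise ValueError(f"unknown activation function: {name}")
```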

@denghuilu
Member

@denghuilu we can provide both options, maybe called gelu and gelu_tf.

Sounds good

@denghuilu
Member

#1795

@njzjz njzjz linked a pull request Jun 28, 2022 that will close this issue
@njzjz njzjz closed this as completed Jun 29, 2022