
[BUG] Performance regression in v2.1.0 #1621

Closed
Dankomaister opened this issue Apr 5, 2022 · 8 comments · Fixed by #1795

@Dankomaister

Bug summary

Hi,

I noticed that potentials trained with deepmd-kit v2.1.0 are much slower than those trained with v2.0.3. Using the same input, a potential trained with v2.1.0 only achieves around 90 timesteps/s in my MD simulations (LAMMPS), while a potential trained with v2.0.3 achieves around 160 timesteps/s in the same simulations.
(Note that this is using the same version of LAMMPS, the one distributed with deepmd-kit v2.1.0.)

I also noticed that the number of ops in the final graph is different:
trained with v2.1.0: 979 ops in the final graph
trained with v2.0.3: 614 ops in the final graph
Again, this is using the same input files for training.
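
In case it is useful, a minimal sketch of how such op counts can be obtained from a frozen graph (the file name is illustrative, assuming the default output of dp freeze):

```python
import tensorflow as tf

# Load the frozen GraphDef and count its nodes ("ops in the final graph").
graph_def = tf.compat.v1.GraphDef()
with open("frozen_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())
print(len(graph_def.node), "ops in the final graph")
```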

deepmd-kit was installed using conda:
conda create -n deepmd deepmd-kit=*=gpu libdeepmd==*gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org

/Daniel

DeePMD-kit Version

2.1.0

TensorFlow Version

2.7.0

How did you download the software?

conda

Input Files, Running Commands, Error Log, etc.

Due to the sensitive nature of the project, I can only include a redacted input file.
input.zip

Steps to Reproduce

Train potentials using deepmd-kit v2.1.0 and v2.0.3, and compare their performance in a LAMMPS simulation.

Further Information, Files, and Links

No response

@njzjz
Member

njzjz commented Apr 5, 2022

related PR: #1406

@Dankomaister
Author

Interesting, so it's related to the gelu activation function?
Which version (tf.nn.gelu or op_module.gelu) is used in 2.0.3 vs 2.1.0?

@njzjz
Member

njzjz commented Apr 5, 2022

Interesting, so it's related to the gelu activation function?

Which version (tf.nn.gelu or op_module.gelu) is used in 2.0.3 vs 2.1.0?

tf.nn.gelu is used in v2.1.0. Maybe we should switch back and compare the performance.

@Dankomaister
Author

Dankomaister commented Apr 6, 2022

If the performance regression I see with v2.1.0 comes from the change in the gelu activation function, then I would say it affects performance a lot. Also, I noticed that in PR #1406 there was a concern about running long MD simulations using op_module.gelu. I have run many very long MD simulations (over 200 million timesteps each) using v2.0.3 with the gelu activation function without any problems.

@denghuilu
Member

The original algorithm of our op_module.gelu is indeed the same as the approximate tf.nn.gelu function:
op_module.gelu: https://github.com/leeleolay/deepmd-kit/blob/b4603e3cacc121eab6fa77ecf15bee8b20b72369/source/lib/src/cuda/gelu.cu#L4-L15
tf.nn.gelu: https://github.com/tensorflow/tensorflow/blob/cbeb0a4c6ddf02fddf6635b92f5e5dcb3a2a04be/tensorflow/python/ops/nn_ops.py#L3665-L3713

The major difference is that TensorFlow's tf.nn.gelu is implemented through the Python API, which is inefficient in a CUDA environment.

Another motivation for #1406 was to move closer to TensorFlow's own framework and reduce the maintenance burden of our custom code. Given such a large performance gap, maybe we should return to our implementation. The error seen when running long MD simulations with op_module.gelu is still unclear.
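
For reference, both versions compute the same tanh approximation of GeLU; a minimal sketch in plain TensorFlow (illustrative only, not the fused kernel):

```python
import math
import tensorflow as tf

def gelu_tanh_approx(x):
    # gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    # tf.nn.gelu(x, approximate=True) builds this from many small Python-API
    # ops, while op_module.gelu fuses the whole expression into one CUDA kernel.
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + tf.tanh(c * (x + 0.044715 * tf.pow(x, 3))))
```

The extra ops from the Python-API expansion would also be consistent with the 614 vs. 979 op-count difference reported above.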

@njzjz
Member

njzjz commented Apr 6, 2022

@denghuilu we can provide both options, maybe called gelu and gelu_tf.
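
Roughly like this (just a sketch; the op_module import path and the final wiring are assumptions, not the actual implementation):

```python
import tensorflow as tf
from deepmd.env import op_module  # custom fused kernels (assumed import path)

def get_activation_fn(name):
    # "gelu": the fused CUDA kernel; "gelu_tf": TensorFlow's Python-API version.
    if name == "gelu":
        return op_module.gelu
    if name == "gelu_tf":
        return lambda x: tf.nn.gelu(x, approximate=True)
    raise ValueError(f"unknown activation function: {name}")
```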

@denghuilu
Member

@denghuilu we can provide both options, maybe called gelu and gelu_tf.

Sounds good

@denghuilu
Member

#1795

@njzjz njzjz linked a pull request Jun 28, 2022 that will close this issue
@njzjz njzjz closed this as completed Jun 29, 2022