Enable mixed precision support for deepmd-kit #1285
Conversation
@wanghan-iapcm an import error is caught in the latest dpdata
Codecov Report

@@           Coverage Diff           @@
##            devel   #1285   +/-   ##
=======================================
  Coverage   64.28%   64.28%
=======================================
  Files           5        5
  Lines          14       14
=======================================
  Hits            9        9
  Misses          5        5

Continue to review full report at Codecov.
deepmd/train/trainer.py
Outdated
@@ -345,6 +348,9 @@ def _build_training(self):
            optimizer = self.run_opt._HVD.DistributedOptimizer(optimizer)
        else:
            optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
        if DP_ENABLE_MIXED_PRECISION:
            # enable dynamic loss scale of the gradients
            optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
This function has been moved to `tf.mixed_precision.enable_mixed_precision_graph_rewrite`; see https://www.tensorflow.org/api_docs/python/tf/compat/v1/mixed_precision/enable_mixed_precision_graph_rewrite. Which TF version do you use? Do you know if it is supported in all TF versions?
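A minimal sketch (not the PR's final code) of how one might select whichever graph-rewrite entry point the installed TF exposes; the `wrap_with_mixed_precision` helper is a hypothetical name:

```python
import tensorflow.compat.v1 as tf

def wrap_with_mixed_precision(optimizer):
    """Hypothetical helper: pick the graph-rewrite API available in this TF."""
    mp = getattr(tf, "mixed_precision", None)
    if mp is not None and hasattr(mp, "enable_mixed_precision_graph_rewrite"):
        # newer name, i.e. tf.compat.v1.mixed_precision.enable_mixed_precision_graph_rewrite
        return mp.enable_mixed_precision_graph_rewrite(optimizer)
    # older, deprecated name
    return tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
```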
This function comes from NVIDIA's official documentation. I have tested it in TF-1.14.0 and TF-2.6.0 environments. Since it is a deprecated function, I will switch to the new `tf.mixed_precision.enable_mixed_precision_graph_rewrite` function.
The method has been available since v1.12 (tensorflow/tensorflow@02730dc) and was renamed in v2.4 (tensorflow/tensorflow@0112286). We may need to raise an error for TF<1.12.
Sure
pymatgen... could you please help fix it? thanks!
There are some problems with mixed precision training for the se_r and se_t descriptor types, which are under investigation.
deepmd/train/trainer.py
Outdated
@@ -345,6 +358,12 @@ def _build_training(self):
            optimizer = self.run_opt._HVD.DistributedOptimizer(optimizer)
        else:
            optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
        if self.mixed_prec is not None:
            # check the TF_VERSION, when TF < 1.12, mixed precision is not allowed
            if TF_VERSION < "1.12":
>>> "1.8" < "1.12"
False
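For reference, a tiny sketch of why the check misfires and the minimal numeric fix (the `version_tuple` helper name is mine, not from the PR):

```python
def version_tuple(v: str) -> tuple:
    """Compare versions numerically instead of lexicographically."""
    return tuple(int(x) for x in v.split("."))

assert ("1.8" < "1.12") is False                      # string comparison: wrong answer
assert version_tuple("1.8") < version_tuple("1.12")   # numeric comparison: as intended
```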
Can you also support hybrid?
As we said, there are still some errors when using the …
It will be useful to …
deepmd/utils/network.py
Outdated
-                        trainable = True,
+                        trainable = False,
Why introduce this change?
A typo left over from debugging; I'll fix it.
deepmd/utils/network.py
Outdated
@@ -44,6 +49,12 @@ def one_layer(inputs,
                                b_initializer,
                                trainable = trainable)
            variable_summaries(b, 'bias')

            if mixed_prec is not None and outputs_size != 1:
I do not like this idea. For dipole and polar models, the size of the output layer is not 1, yet they would use fp16, which is not what we want.
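One possible alternative (a sketch only, not necessarily the PR's final design; `final_layer`, `maybe_cast`, and `PRECISION_MAP` are hypothetical names): let the caller say explicitly which layer is the output layer instead of inferring it from its size.

```python
import tensorflow.compat.v1 as tf

# Hypothetical stand-in for deepmd's get_precision, for illustration only.
PRECISION_MAP = {"float16": tf.float16, "float32": tf.float32, "float64": tf.float64}

def maybe_cast(x, mixed_prec, final_layer=False):
    # Cast hidden layers to the compute precision; an explicit final_layer
    # flag keeps dipole/polar output layers in full precision regardless of
    # their output size.
    if mixed_prec is not None and not final_layer:
        return tf.cast(x, PRECISION_MAP[mixed_prec["compute_prec"]])
    return x
```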
deepmd/utils/network.py
Outdated
            if mixed_prec is not None and outputs_size != 1:
                idt = tf.cast(idt, get_precision(mixed_prec['compute_prec']))
Again, `outputs_size != 1` may not be a good idea.
        if self.mixed_prec is not None:
            inputs = tf.cast(inputs, get_precision(self.mixed_prec['compute_prec']))
Do we need this line? The inputs are cast to `compute_prec` anyway in `networks.one_layer` or `networks.embedding_net`.
- There is matrix multiplication outside the embedding net, so we need to cast the inputs to match the dtype of the embedding net output.
- Half-precision slicing will be more efficient.
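A small demo of the first point (a sketch assuming TF 1.x-style graph mode): `tf.matmul` requires matching dtypes, so without the cast a full-precision input against a half-precision embedding output would fail.

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

a = tf.constant([[1.0]], dtype=tf.float64)   # e.g. full-precision inputs
b = tf.constant([[1.0]], dtype=tf.float16)   # e.g. embedding-net output
# tf.matmul(a, b)                            # raises an error: dtype mismatch
c = tf.matmul(tf.cast(a, tf.float16), b)     # works once the inputs are cast
```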
deepmd/descriptor/descriptor.py
Outdated
    def enable_mixed_precision(self, mixed_prec : dict = None) -> None:
        """
        Receive the mixed precision setting.

        Parameters
        ----------
        mixed_prec
            The mixed precision setting used in the embedding net

        Notes
        -----
        This method is called by others when the descriptor supports mixed precision.
        """
        raise NotImplementedError(
            "Descriptor %s doesn't support mixed precision training!" % type(self).__name__)
lint errors appear here
deepmd/train/trainer.py
Outdated
@@ -345,6 +358,15 @@ def _build_training(self):
            optimizer = self.run_opt._HVD.DistributedOptimizer(optimizer)
        else:
            optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
        if self.mixed_prec is not None:
            TF_VERSION_LIST = [int(item) for item in TF_VERSION.split('.')]
`int(item)` will cause an error if the version is a pre-release, e.g. v2.6.0-rc1. See https://github.com/tensorflow/tensorflow/blob/ff68385595088304cf772086b9a259a65b007622/tensorflow/core/public/version.h#L35-L37
I suggest using a third-party class such as Specifiers.
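A sketch of that suggestion using the `packaging` library (whether deepmd-kit adopted it is not shown here); `Version` handles pre-release strings that `int()` on split parts rejects:

```python
from packaging.version import Version

assert Version("2.6.0-rc1") >= Version("1.12")   # pre-releases parse fine
assert Version("1.8.0") < Version("1.12")        # numeric, not lexicographic
```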
Argument("output_prec", str, optional=True, default="float32", doc=doc_output_prec), | ||
Argument("compute_prec", str, optional=False, default="float16", doc=doc_compute_prec), |
Is the default behavior to enable mixed precision?
The `mixed_precision` section is optional within the training section (see line 617), so it is disabled by default. However, once the `mixed_precision` section is set, one must provide the `compute_prec` key.
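For illustration, a hedged sketch of what that optional section might look like, with keys taken from the `Argument` definitions above (the exact placement in the real input script may differ):

```python
# Inside the "training" section of the input script:
training_config = {
    "mixed_precision": {
        "output_prec": "float32",   # optional; defaults to float32
        "compute_prec": "float16",  # required once the section is present
    },
}
```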
This PR enables mixed precision training as well as the mixed precision inference process for deepmd-kit. Without any change to the input script, one can easily enable mixed precision training by simply setting the environment variable `DP_ENABLE_MIXED_PREC` to `fp16`.

Main changes:

- Added the `DP_ENABLE_MIXED_PREC` environment variable to control mixed precision training (see the sketch after this list). Note that currently only `tf.float16` precision is enabled with the mixed precision setting.
- Adjusted `argcheck.py` according to the environment variable `DP_INTERFACE_PREC`.
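A hedged sketch of the environment-variable flow described in the first bullet (the dtype mapping is mine; per the description, `fp16` is the only supported value):

```python
import os
import tensorflow.compat.v1 as tf

# Read the switch once, as the description suggests.
DP_ENABLE_MIXED_PREC = os.environ.get("DP_ENABLE_MIXED_PREC", "")
# Only tf.float16 is currently enabled; any other value disables mixed precision.
MIXED_PREC_DTYPE = {"fp16": tf.float16}.get(DP_ENABLE_MIXED_PREC)
```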
On our example water benchmark system, in a `TF-2.6.0`, `CUDA-11.0`, and NVIDIA V100 GPU environment, the speed of the dp training process decreased slightly, while the inference process with 12288 atoms gained a speedup by a factor of 3. It is strongly recommended to enable the mixed precision settings with CUDA-11.0 or a later CUDA toolkit.