
Fix compress training bug within the dp train --init-frz-model interface #1233

Merged
merged 4 commits into from
Oct 26, 2021

Conversation

denghuilu
Member

@denghuilu denghuilu commented Oct 22, 2021

The compress training code uses the tf.import_graph_def function to load the tf.Tensor and tf.Operation objects from the old graph def into the current default graph.

However, this can lead to a variable-name conflict during the model freezing process, which is the cause of issue #1194. According to the TensorFlow documentation:

This function provides a way to import a serialized TensorFlow GraphDef protocol buffer, and extract individual objects in the GraphDef as tf.Tensor and tf.Operation objects. Once extracted, these objects are placed into the current default Graph. See tf.Graph.as_graph_def for a way to create a GraphDef proto.

In this PR, the following changes are adopted to address #1194:

  • Set the frozen fitting net nodes with the trainable fitting net variables when using the compress training interface.
  • Put the patterns, including the EMBEDDING_NET_PATTERN, FITTING_NET_PATTERN as well as the TRANSFER_PATTERN, to the deepmd.env module.
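Centralizing those patterns can be sketched as follows. This is a hypothetical illustration: the constant names mirror deepmd.env, but the regex bodies and node names below are made up for the example and are not the actual deepmd-kit patterns.

```python
import re

# Hypothetical stand-ins for the centralized patterns in deepmd.env;
# the real patterns match deepmd-kit's actual node-name layout.
EMBEDDING_NET_PATTERN = r"filter_type_\d+/(matrix|bias|idt)_\d+_\d+"
FITTING_NET_PATTERN = (
    r"layer_\d+(_type_\d+)?/(matrix|bias|idt)"
    r"|final_layer(_type_\d+)?/(matrix|bias)"
)

def select_nodes(node_names, pattern):
    """Return the node names that fully match the given pattern."""
    return [n for n in node_names if re.fullmatch(pattern, n)]
```

With the patterns in one module, both the freeze and transfer entry points can select fitting-net or embedding-net nodes with the same helper instead of duplicating the regexes.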

Note that this PR does not use the prefix parameter of the tf.import_graph_def function to solve #1194: although that would be easier, it would change the node names permanently. Instead, this PR leaves the graph structure and the node names within the graph untouched, which is important for model maintenance.
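The naming trade-off can be illustrated without TensorFlow at all. In this plain-Python sketch of the behavior (an illustration, not TF code), a prefix renames every imported node permanently, while the default collision handling only appends a numeric suffix to names that already exist:

```python
def import_nodes(graph, node_names, prefix=""):
    """Illustrate tf.import_graph_def naming: with a prefix, every node is
    renamed 'prefix/name'; without one, only colliding names get a numeric
    suffix (TF's uniquification), and non-colliding names are kept as-is."""
    for name in node_names:
        if prefix:
            graph.add(f"{prefix}/{name}")   # permanent rename of every node
        elif name in graph:
            graph.add(f"{name}_1")          # automatic suffix only on conflict
        else:
            graph.add(name)                 # no conflict: name unchanged
    return graph
```

This is why the PR prefers the suffix route: the original node names survive, and only the temporarily duplicated fitting-net variables pick up the `_1` suffix.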

@denghuilu denghuilu requested review from njzjz, amcadmus and wanghan-iapcm and removed request for njzjz and amcadmus October 22, 2021 17:30
@codecov-commenter

codecov-commenter commented Oct 22, 2021

Codecov Report

Merging #1233 (992f168) into devel (6c41aa3) will decrease coverage by 0.03%.
The diff coverage is 78.37%.


@@            Coverage Diff             @@
##            devel    #1233      +/-   ##
==========================================
- Coverage   76.02%   75.99%   -0.04%     
==========================================
  Files          91       91              
  Lines        7367     7389      +22     
==========================================
+ Hits         5601     5615      +14     
- Misses       1766     1774       +8     
Impacted Files Coverage Δ
deepmd/entrypoints/freeze.py 71.23% <61.11%> (-4.21%) ⬇️
deepmd/utils/graph.py 73.56% <92.30%> (-0.56%) ⬇️
deepmd/entrypoints/transfer.py 72.54% <100.00%> (ø)
deepmd/env.py 75.53% <100.00%> (+1.08%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6c41aa3...992f168. Read the comment docs.

@denghuilu denghuilu requested a review from njzjz October 23, 2021 04:11
raw_graph_def, # The graph_def is used to retrieve the nodes
[n + '_1' for n in old_graph_nodes], # The output node names are used to select the useful nodes
)
except Exception:
Member


Is there any specific exception?

Member Author

@denghuilu denghuilu Oct 24, 2021


All fitting-net variables receive the _1 suffix; we can check this with the tf.trainable_variables() function. I think this is TensorFlow's default node-naming behavior: when a variable name is already taken in the graph (due to the use of tf.import_graph_def), TF automatically appends a numeric suffix to that name. Each fitting_net node name is unique within the original graph (carrying a matrix, bias, or idt suffix), so it is safe to do this.

Member

@njzjz njzjz Oct 24, 2021


I mean, could you catch a specific exception (such as RuntimeError) instead of the general Exception?

Member Author


Sure. It's the AssertionError.
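The narrowed handler can be sketched as follows. The helper names here are hypothetical; in the PR, the try block wraps the graph-freezing call, which the author reports raising AssertionError when the suffixed node names are absent from the graph:

```python
def freeze_fitting_nodes(node_names, graph_nodes):
    """Hypothetical stand-in for the freezing call: asserts that every
    requested output node actually exists in the graph."""
    for n in node_names:
        assert n in graph_nodes, f"{n} is not in graph"
    return list(node_names)

def try_freeze(node_names, graph_nodes):
    try:
        # first try the '_1'-suffixed names produced by tf.import_graph_def
        return freeze_fitting_nodes([n + "_1" for n in node_names], graph_nodes)
    except AssertionError:
        # no name conflict occurred, so fall back to the original names
        return freeze_fitting_nodes(node_names, graph_nodes)
```

Catching AssertionError rather than a bare Exception keeps unrelated failures (e.g. I/O errors while reading the graph) from being silently swallowed by the fallback path.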

@@ -21,6 +21,36 @@

log = logging.getLogger(__name__)

def _transfer_graph_def(sess, old_graph_def, raw_graph_def):
Collaborator


_transfer_graph_def is not a good name for this function. It should specify which variables are transferred.

@wanghan-iapcm wanghan-iapcm merged commit 1a8fd73 into deepmodeling:devel Oct 26, 2021
njzjz added a commit to njzjz/deepmd-kit that referenced this pull request Sep 21, 2023