
Fix compress training bug within the dp train --init-frz-model interface #1233

Merged
merged 4 commits into from
Oct 26, 2021

Conversation

denghuilu
Member

@denghuilu denghuilu commented Oct 22, 2021

The compress training code uses the tf.import_graph_def function to load the tf.Tensor and tf.Operation objects from the old graph def into the current default graph.

However, this can lead to a variable-name conflict during the model freezing process, which is the cause of issue #1194. According to the TensorFlow documentation:

This function provides a way to import a serialized TensorFlow GraphDef protocol buffer, and extract individual objects in the GraphDef as tf.Tensor and tf.Operation objects. Once extracted, these objects are placed into the current default Graph. See tf.Graph.as_graph_def for a way to create a GraphDef proto.

In this PR, the following changes are adopted to address #1194:

  • Set the frozen fitting net nodes with the trainable fitting net variables when using the compress training interface.
  • Put the patterns, including the EMBEDDING_NET_PATTERN, FITTING_NET_PATTERN as well as the TRANSFER_PATTERN, to the deepmd.env module.
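Centralizing those patterns can be sketched as follows. This is a hypothetical illustration: the constant names mirror deepmd.env, but the regex bodies and node names below are made up for the example and are not the actual deepmd-kit patterns.

```python
import re

# Hypothetical stand-ins for the centralized patterns in deepmd.env;
# the real patterns match deepmd-kit's actual node-name layout.
EMBEDDING_NET_PATTERN = r"filter_type_\d+/(matrix|bias|idt)_\d+_\d+"
FITTING_NET_PATTERN = (
    r"layer_\d+(_type_\d+)?/(matrix|bias|idt)"
    r"|final_layer(_type_\d+)?/(matrix|bias)"
)

def select_nodes(node_names, pattern):
    """Return the node names that fully match the given pattern."""
    return [n for n in node_names if re.fullmatch(pattern, n)]
```

With the patterns in one module, both the freeze and transfer entry points can select fitting-net or embedding-net nodes with the same helper instead of duplicating the regexes.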

Note that this PR does not use the prefix parameter of the tf.import_graph_def function to solve #1194: although that would be easier, it would change the node names permanently. Instead, this PR leaves the graph structure and the node names within the graph untouched, which is important for model maintenance.
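The naming trade-off can be illustrated without TensorFlow at all. In this plain-Python sketch of the behavior (an illustration, not TF code), a prefix renames every imported node permanently, while the default collision handling only appends a numeric suffix to names that already exist:

```python
def import_nodes(graph, node_names, prefix=""):
    """Illustrate tf.import_graph_def naming: with a prefix, every node is
    renamed 'prefix/name'; without one, only colliding names get a numeric
    suffix (TF's uniquification), and non-colliding names are kept as-is."""
    for name in node_names:
        if prefix:
            graph.add(f"{prefix}/{name}")   # permanent rename of every node
        elif name in graph:
            graph.add(f"{name}_1")          # automatic suffix only on conflict
        else:
            graph.add(name)                 # no conflict: name unchanged
    return graph
```

This is why the PR prefers the suffix route: the original node names survive, and only the temporarily duplicated fitting-net variables pick up the `_1` suffix.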

@denghuilu denghuilu requested review from njzjz, amcadmus and wanghan-iapcm and removed request for njzjz and amcadmus October 22, 2021 17:30
@codecov-commenter

codecov-commenter commented Oct 22, 2021

Codecov Report

Merging #1233 (992f168) into devel (6c41aa3) will decrease coverage by 0.03%.
The diff coverage is 78.37%.


@@            Coverage Diff             @@
##            devel    #1233      +/-   ##
==========================================
- Coverage   76.02%   75.99%   -0.04%     
==========================================
  Files          91       91              
  Lines        7367     7389      +22     
==========================================
+ Hits         5601     5615      +14     
- Misses       1766     1774       +8     
Impacted Files Coverage Δ
deepmd/entrypoints/freeze.py 71.23% <61.11%> (-4.21%) ⬇️
deepmd/utils/graph.py 73.56% <92.30%> (-0.56%) ⬇️
deepmd/entrypoints/transfer.py 72.54% <100.00%> (ø)
deepmd/env.py 75.53% <100.00%> (+1.08%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6c41aa3...992f168. Read the comment docs.

@denghuilu denghuilu requested a review from njzjz October 23, 2021 04:11
raw_graph_def, # The graph_def is used to retrieve the nodes
[n + '_1' for n in old_graph_nodes], # The output node names are used to select the useful nodes
)
except Exception:
Member


Is there any specific exception?

Member Author

@denghuilu denghuilu Oct 24, 2021


All fitting-net variables receive the _1 suffix; we can check this with the tf.trainable_variables() function. I think this is TensorFlow's default node-naming behavior: when a variable name is already taken in the graph (due to the use of tf.import_graph_def), TF automatically appends a numeric suffix to that name. Each fitting_net node name is unique within the original graph (carrying a matrix, bias, or idt suffix), so it is safe to do this.

Member

@njzjz njzjz Oct 24, 2021


I mean, could you catch a specific exception (such as RuntimeError) instead of the general Exception?

Member Author


Sure. It's the AssertionError.
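The narrowed handler can be sketched as follows. The helper names here are hypothetical; in the PR, the try block wraps the graph-freezing call, which the author reports raising AssertionError when the suffixed node names are absent from the graph:

```python
def freeze_fitting_nodes(node_names, graph_nodes):
    """Hypothetical stand-in for the freezing call: asserts that every
    requested output node actually exists in the graph."""
    for n in node_names:
        assert n in graph_nodes, f"{n} is not in graph"
    return list(node_names)

def try_freeze(node_names, graph_nodes):
    try:
        # first try the '_1'-suffixed names produced by tf.import_graph_def
        return freeze_fitting_nodes([n + "_1" for n in node_names], graph_nodes)
    except AssertionError:
        # no name conflict occurred, so fall back to the original names
        return freeze_fitting_nodes(node_names, graph_nodes)
```

Catching AssertionError rather than a bare Exception keeps unrelated failures (e.g. I/O errors while reading the graph) from being silently swallowed by the fallback path.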

@@ -21,6 +21,36 @@

log = logging.getLogger(__name__)

def _transfer_graph_def(sess, old_graph_def, raw_graph_def):
Collaborator


_transfer_graph_def is not a good name for this function. It should specify which variables are transferred.

@wanghan-iapcm wanghan-iapcm merged commit 1a8fd73 into deepmodeling:devel Oct 26, 2021
njzjz added a commit to njzjz/deepmd-kit that referenced this pull request Sep 21, 2023