
Issue of Saving Checkpoint #167

Closed
Lihengwannafly opened this issue Apr 20, 2022 · 1 comment
Lihengwannafly commented Apr 20, 2022

First, we train a baseline model; then we restore the baseline model's parameters and continue training. The code we use to restore the parameters is as follows.

    vars_to_warm_start = ['^((?!Adam)(?!pos_dense).)*$']
    variables = self.restore_variables()
    restorer = tf.compat.v1.train.Saver(var_list=variables, max_to_keep=1)
    restorer.restore(session, base_checkpoint_path)
    saver = tf.compat.v1.train.Saver(max_to_keep=1)

    def restore_variables(self):
        list_of_vars = None
        if 'vars_to_warm_start' in _Hyperparams:
            vars_to_warm_start = _Hyperparams['vars_to_warm_start']
            if isinstance(vars_to_warm_start, str) or vars_to_warm_start is None:
                # Both vars_to_warm_start = '.*' and vars_to_warm_start = None
                # will match everything (in GLOBAL_VARIABLES) here.
                self.logger.info("Warm-starting variables only in GLOBAL_VARIABLES.")
                list_of_vars = ops.get_collection(
                    ops.GraphKeys.GLOBAL_VARIABLES, scope=vars_to_warm_start)
                self.logger.info('Loading base model variables: {}'.format(list_of_vars))
                saveable_objects = tf.get_collection(
                    tf.GraphKeys.SAVEABLE_OBJECTS, scope=vars_to_warm_start)
                self.logger.info('Loading saveable variables: {}'.format(saveable_objects))
                list_of_vars += saveable_objects
            elif isinstance(vars_to_warm_start, list):
                if all(isinstance(v, str) for v in vars_to_warm_start):
                    self.logger.info("Warm-starting partial variables in GLOBAL_VARIABLES.")
                    list_of_vars = []
                    saveable_objects = []
                    for v in vars_to_warm_start:
                        list_of_vars += ops.get_collection(
                            ops.GraphKeys.GLOBAL_VARIABLES, scope=v)
                        saveable_objects += tf.get_collection(
                            tf.GraphKeys.SAVEABLE_OBJECTS, scope=v)
                    self.logger.info('Loading base model variables: {}'.format(list_of_vars))
                    self.logger.info('Loading saveable variables: {}'.format(saveable_objects))
                    list_of_vars += saveable_objects
        return list_of_vars
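For context, the `vars_to_warm_start` pattern above is a negative-lookahead regex: a variable name matches only if it contains neither `Adam` nor `pos_dense` at any position, which is how the optimizer slot variables are excluded from the warm-start scope. A minimal, TensorFlow-free sketch of how that scope filter behaves (the variable names below are made up for illustration):

```python
import re

# Same pattern as vars_to_warm_start above: match names that contain
# neither "Adam" (optimizer slots) nor "pos_dense" anywhere.
WARM_START_RE = re.compile(r'^((?!Adam)(?!pos_dense).)*$')

names = [
    "feature_processing/imei_embedding/embedding_weights",
    "feature_processing/imei_embedding/embedding_weights/Adam",
    "pos_dense/kernel",
]
# Only the plain embedding weights survive the filter.
matched = [n for n in names if WARM_START_RE.match(n)]
```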

We enable GlobalStepEvict for the imei feature at both stages.
If GlobalStepEvict is enabled when restoring the baseline model, saving a checkpoint via the saver fails. The core dump info is:

tensorflow::SaveV2::Compute (this=0x7f8fd20bdec0, context=<optimized out>) at 
tensorflow/core/kernels/save_restore_v2_ops.cc:177
tensor_name = "feature_processing/imei_embedding/embedding_weights/Adam"

It seems there is a problem when saving the Adam parameters.
If we only restore tf.trainable_variables(), the checkpoint saves successfully. It fails when we restore tf.global_variables(), which includes the Adam parameters.

If we disable GlobalStepEvict when restoring the baseline model, training runs normally, but the loss and AUC are poor.
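One possible workaround (an assumption on my part, not verified against DeepRec's GlobalStepEvict) would be to restore only the non-slot variables by filtering Adam slot names out of the restorer's var_list, so the embedding weights are warm-started while the optimizer slots are freshly initialized. A name-level sketch, where the slot-suffix list is a hypothetical choice:

```python
import re

def drop_optimizer_slots(var_names, slot_suffixes=("Adam", "Adam_1")):
    # Drop variables whose name ends with an optimizer slot suffix,
    # e.g. ".../embedding_weights/Adam".
    slot_re = re.compile(r'/(?:{})$'.format('|'.join(map(re.escape, slot_suffixes))))
    return [n for n in var_names if not slot_re.search(n)]

names = [
    "feature_processing/imei_embedding/embedding_weights",
    "feature_processing/imei_embedding/embedding_weights/Adam",
    "feature_processing/imei_embedding/embedding_weights/Adam_1",
]
kept = drop_optimizer_slots(names)
```

In the real graph one would apply the same filter to the names of `tf.compat.v1.global_variables()` before building the restore-time `Saver`; whether that also avoids the SaveV2 crash under GlobalStepEvict is something the maintainers would need to confirm.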

@candyzone
Collaborator

@Lihengwannafly could you share the full stack trace from the gdb command `backtrace`?

marvin-Yu pushed a commit to changqi1/DeepRec that referenced this issue Apr 9, 2023
…5/fix_threadpool_lock

[TF 1.15] Port #46562 (fix threadpool bug) from master