## Checkpoints

This document examines how to save and restore TensorFlow models built with Estimators. TensorFlow provides two model formats:

> **checkpoints**, which is a format dependent on the code that created the model.

> **SavedModel**, which is a format independent of the code that created the model.

**This document focuses on checkpoints**. For details on SavedModel, see the Saving and Restoring chapter of the TensorFlow Programmer's Guide.

### Getting the iris training data and creating feature columns

In [8]:
import tensorflow as tf
import pandas as pd

TRAIN_URL = "http://download.tensorflow.org/data/iris_training.csv"
# Create a local copy of the training set.
train_path = tf.keras.utils.get_file(fname=TRAIN_URL.split('/')[-1],
                                     origin=TRAIN_URL)
print("Train path :", train_path)

CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth',
                    'PetalLength', 'PetalWidth', 'Species']
# Parse the local CSV file.
train = pd.read_csv(filepath_or_buffer=train_path,
                    names=CSV_COLUMN_NAMES,  # list of column names
                    header=0  # ignore the first row of the CSV file.
                    )

train_features, train_label = train, train.pop('Species')

my_feature_columns = []
for key in train_features.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))

Train path : /home/sankaran/.keras/datasets/iris_training.csv


### Default checkpoint config

Estimators automatically write the following to disk:

> **checkpoints**, which are versions of the model created during training.

> **event files**, which contain information that TensorBoard uses to create visualizations.

To specify the top-level directory in which the Estimator stores its information, assign a value to the optional **model_dir** argument of any Estimator's constructor. If you don't specify model_dir, the Estimator writes checkpoint files to a temporary directory chosen by Python's **tempfile.mkdtemp** function. For example:

```python
print(classifier.model_dir)

# Out: '/tmp/tmpc626k_qy'
```
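
To see what a temporary model_dir implies, here is a minimal sketch that calls `tempfile.mkdtemp` twice. It assumes, as stated above, that an Estimator created without model_dir obtains its directory this way; no Estimator is created here, and the variable names are only illustrative.

```python
import os
import shutil
import tempfile

# Each call creates a brand-new, uniquely named directory.
first_dir = tempfile.mkdtemp()
second_dir = tempfile.mkdtemp()

print(first_dir)
print(second_dir)
print("Same directory?", first_dir == second_dir)

# Both directories start out empty, so there is no checkpoint to restore from.
print(os.listdir(first_dir), os.listdir(second_dir))

# Remove the directories created for this demonstration.
shutil.rmtree(first_dir)
shutil.rmtree(second_dir)
```

Because every call returns a different path, two Estimators built without model_dir would not share checkpoints.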

**By default**, the Estimator saves checkpoints in the model_dir according to the following schedule:

    Writes a checkpoint every 10 minutes (600 seconds).
    Writes a checkpoint when the train method starts (first iteration) and completes (final iteration).
    Retains only the 5 most recent checkpoints in the directory.
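
The retention rule can be pictured with a small simulation. This is only a sketch of the "keep the 5 most recent" behaviour described above, not TensorFlow's code: `simulate_retention` is a hypothetical helper, the step numbers are made up, and the `keep_checkpoint_every_n_hours` setting is ignored.

```python
from collections import deque


def simulate_retention(saved_steps, keep_checkpoint_max=5):
    """Return the checkpoint steps still on disk after saving `saved_steps` in order."""
    retained = deque(maxlen=keep_checkpoint_max)
    for step in saved_steps:
        # A full deque silently drops its oldest entry on append.
        retained.append(step)
    return list(retained)


# Seven saves in total; only the five newest survive.
saved = [1, 1000, 2000, 3000, 4000, 5000, 6000]
print(simulate_retention(saved))  # [2000, 3000, 4000, 5000, 6000]
```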

In [9]:
classifier = tf.estimator.DNNClassifier(
        feature_columns=my_feature_columns,
        # Two hidden layers of 10 nodes each.
        hidden_units=[10, 10],
        # The model must choose between 3 classes.
        n_classes=3,
        model_dir='models/iris')

import code.iris_data as iris_data
# Train the Model.
classifier.train(
    input_fn=lambda:iris_data.train_input_fn(train_features, train_label, 10),
    steps=1000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_id': 0, '_tf_random_seed': None, '_session_config': None, '_keep_checkpoint_max': 5, '_save_summary_steps': 100, '_service': None, '_save_checkpoints_steps': None, '_master': '', '_model_dir': 'models/iris', '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_task_type': 'worker', '_save_checkpoints_secs': 600, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fef66346cf8>, '_log_step_count_steps': 100, '_is_chief': True, '_evaluation_master': ''}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/iris/model.ckpt-1003
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1004 into models/iris/mode

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x7fef66346be0>

In [10]:
import os
os.listdir("models/iris/")

['model.ckpt-2003.data-00000-of-00001',
 'events.out.tfevents.1520235075.sankaran-System-Product-Name',
 'model.ckpt-1001.meta',
 'events.out.tfevents.1520237587.sankaran-System-Product-Name',
 'model.ckpt-1003.index',
 'model.ckpt-2003.index',
 'events.out.tfevents.1520238141.sankaran-System-Product-Name',
 'model.ckpt-1002.data-00000-of-00001',
 'events.out.tfevents.1520237708.sankaran-System-Product-Name',
 'model.ckpt-1002.index',
 'events.out.tfevents.1520238164.sankaran-System-Product-Name',
 'eval',
 'events.out.tfevents.1520237617.sankaran-System-Product-Name',
 'model.ckpt-1002.meta',
 'model.ckpt-1001.index',
 'checkpoint',
 'model.ckpt-1004.index',
 'model.ckpt-2003.meta',
 'model.ckpt-1003.data-00000-of-00001',
 'graph.pbtxt',
 'model.ckpt-1004.meta',
 'model.ckpt-1004.data-00000-of-00001',
 'model.ckpt-1003.meta',
 'model.ckpt-1001.data-00000-of-00001']
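
The listing is easier to read once the files are grouped by global step. The sketch below does that with a regular expression over a hand-copied subset of the names above; `group_checkpoint_files` is a hypothetical helper, and the pattern assumes the `model.ckpt-<step>.<kind>` naming seen in the listing. In practice, `tf.train.latest_checkpoint(model_dir)` is the usual way to ask TensorFlow for the newest checkpoint prefix.

```python
import re
from collections import defaultdict

# A subset of the file names from the listing above.
file_names = [
    'model.ckpt-2003.data-00000-of-00001',
    'model.ckpt-2003.index',
    'model.ckpt-2003.meta',
    'model.ckpt-1004.index',
    'model.ckpt-1004.meta',
    'model.ckpt-1004.data-00000-of-00001',
    'checkpoint',
    'graph.pbtxt',
    'eval',
]

CKPT_PATTERN = re.compile(r'^model\.ckpt-(\d+)\.(index|meta|data)')


def group_checkpoint_files(names):
    """Map each global step to the set of file kinds found for it."""
    kinds_by_step = defaultdict(set)
    for name in names:
        match = CKPT_PATTERN.match(name)
        if match:  # skips 'checkpoint', 'graph.pbtxt', 'eval', event files
            kinds_by_step[int(match.group(1))].add(match.group(2))
    return dict(kinds_by_step)


groups = group_checkpoint_files(file_names)
for step in sorted(groups):
    print(step, sorted(groups[step]))
print("Latest step:", max(groups))  # Latest step: 2003
```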

### Altering the default checkpoint config

You may alter the default schedule by taking the following steps:

    1) Create a RunConfig object that defines the desired schedule.
    2) When instantiating the Estimator, pass that RunConfig object to the Estimator's config argument.
    
**tf.estimator.RunConfig**

    __init__(
        model_dir=None,
        tf_random_seed=None,
        save_summary_steps=100,
        save_checkpoints_steps=_USE_DEFAULT,
        save_checkpoints_secs=_USE_DEFAULT,
        session_config=None,
        keep_checkpoint_max=5,
        keep_checkpoint_every_n_hours=10000,
        log_step_count_steps=100
    )



In [11]:
my_checkpointing_config = tf.estimator.RunConfig(
    save_checkpoints_secs = 20*60,  # Save checkpoints every 20 minutes.
    keep_checkpoint_max = 10,       # Retain the 10 most recent checkpoints.
)

classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[10, 10],
    n_classes=3,
    model_dir='models/iris',
    config=my_checkpointing_config)


INFO:tensorflow:Using config: {'_task_id': 0, '_tf_random_seed': None, '_session_config': None, '_keep_checkpoint_max': 10, '_save_summary_steps': 100, '_service': None, '_save_checkpoints_steps': None, '_master': '', '_model_dir': 'models/iris', '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_task_type': 'worker', '_save_checkpoints_secs': 1200, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fef66346940>, '_log_step_count_steps': 100, '_is_chief': True, '_evaluation_master': ''}
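
RunConfig exposes both a time-based trigger (save_checkpoints_secs) and a step-based one (save_checkpoints_steps); to my understanding the two are alternatives and are not meant to be set together. The following sketch simulates how either trigger spaces out checkpoints. `CheckpointTimer` and `simulate` are hypothetical stand-ins written for this illustration, not TensorFlow classes, and the 0.5 seconds per step is an invented figure.

```python
class CheckpointTimer:
    """Illustrative save schedule; a simplification, not TensorFlow's implementation."""

    def __init__(self, every_secs=None, every_steps=None):
        if (every_secs is None) == (every_steps is None):
            raise ValueError("Specify exactly one of every_secs or every_steps.")
        self.every_secs = every_secs
        self.every_steps = every_steps
        self.last_time = None
        self.last_step = None

    def should_save(self, step, now):
        if self.last_step is None:
            return True  # the first iteration always writes a checkpoint
        if self.every_secs is not None:
            return now - self.last_time >= self.every_secs
        return step - self.last_step >= self.every_steps

    def mark_saved(self, step, now):
        self.last_step, self.last_time = step, now


def simulate(timer, total_steps, secs_per_step):
    """Return the steps at which a checkpoint would be written."""
    saved = []
    for step in range(1, total_steps + 1):
        now = step * secs_per_step
        if timer.should_save(step, now):
            timer.mark_saved(step, now)
            saved.append(step)
    if saved[-1] != total_steps:
        saved.append(total_steps)  # the final iteration is saved as well
    return saved


# Time-based: every 20 minutes, 6000 steps at 0.5 s each (50 minutes).
print(simulate(CheckpointTimer(every_secs=20 * 60), 6000, 0.5))
# [1, 2401, 4801, 6000]

# Step-based: every 2500 steps.
print(simulate(CheckpointTimer(every_steps=2500), 6000, 0.5))
# [1, 2501, 5001, 6000]
```

With a time-based schedule, the number of checkpoints depends on how fast the machine trains; a step-based schedule gives the same checkpoint steps on any hardware.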


### Restoring training

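The Estimator automatically restores the latest checkpoint found in **model_dir** before it continues training. Thus, if you run this code 10 times with 1000 steps each, the model ends up trained for roughly 10*1000 steps in total.

If **model_dir** is **not specified**, automatic restoration does not happen across separately created Estimators, because each one writes to its own fresh temporary directory.

**IMPORTANT**: Whenever you change the network architecture (or want to restart training from scratch), you have to change (or clear) model_dir. Otherwise, the Estimator restores the old checkpoint: training resumes from the old weights, or fails with a mismatch if the architecture has changed.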
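
One way to clear model_dir is to delete and recreate it before building the new Estimator. The sketch below uses only the standard library; `reset_model_dir` is a hypothetical helper, and the demonstration runs on a throwaway directory rather than the real `models/iris`. Note that this also removes the TensorBoard event files stored alongside the checkpoints.

```python
import os
import shutil
import tempfile


def reset_model_dir(model_dir):
    """Delete model_dir (if present) and recreate it empty."""
    if os.path.isdir(model_dir):
        shutil.rmtree(model_dir)
    os.makedirs(model_dir)
    return model_dir


# Build a throwaway directory containing one fake checkpoint file.
demo_dir = os.path.join(tempfile.mkdtemp(), 'iris')
os.makedirs(demo_dir)
open(os.path.join(demo_dir, 'model.ckpt-1003.index'), 'w').close()
print(os.listdir(demo_dir))  # ['model.ckpt-1003.index']

reset_model_dir(demo_dir)
print(os.listdir(demo_dir))  # []
```

If the old checkpoints are still needed, pointing the new Estimator at a different model_dir is the safer choice.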

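#### Example

I changed the network architecture (hidden_units from [10, 10] to [10, 11]) without changing model_dir and got the following error. As the log shows, the **Estimator** first calls **model_fn** to build the graph and then tries to restore the latest checkpoint into it. The restore fails because the checkpoint stores dnn/hiddenlayer_1/bias with shape [10], while the new graph expects shape [11].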

In [7]:
my_checkpointing_config = tf.estimator.RunConfig(
    save_checkpoints_secs = 20*60,  # Save checkpoints every 20 minutes.
    keep_checkpoint_max = 10,       # Retain the 10 most recent checkpoints.
)

classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[10, 11],
    n_classes=3,
    model_dir='models/iris',
    config=my_checkpointing_config)

classifier.train(
    input_fn=lambda:iris_data.train_input_fn(train_features, train_label, 10),
    steps=1000)

INFO:tensorflow:Using config: {'_task_id': 0, '_tf_random_seed': None, '_session_config': None, '_keep_checkpoint_max': 10, '_save_summary_steps': 100, '_service': None, '_save_checkpoints_steps': None, '_master': '', '_model_dir': 'models/iris', '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_task_type': 'worker', '_save_checkpoints_secs': 1200, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fef658ad7b8>, '_log_step_count_steps': 100, '_is_chief': True, '_evaluation_master': ''}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/iris/model.ckpt-1003


InvalidArgumentError: tensor_name = dnn/hiddenlayer_1/bias; shape in shape_and_slice spec [11] does not match the shape stored in checkpoint: [10]
	 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
  File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.4/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.4/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python3.4/dist-packages/ipykernel/kernelapp.py", line 478, in start
    self.io_loop.start()
  File "/usr/lib/python3/dist-packages/zmq/eventloop/ioloop.py", line 160, in start
    super(ZMQIOLoop, self).start()
  File "/usr/local/lib/python3.4/dist-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python3.4/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 433, in _handle_events
    self._handle_recv()
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 465, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 407, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.4/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.4/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.4/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.4/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 2728, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 2856, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-b88bc9d75048>", line 15, in <module>
    steps=1000)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/estimator/estimator.py", line 899, in _train_model
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/monitored_session.py", line 384, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/monitored_session.py", line 795, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/monitored_session.py", line 518, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/monitored_session.py", line 981, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/monitored_session.py", line 986, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/monitored_session.py", line 675, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/monitored_session.py", line 437, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/monitored_session.py", line 214, in finalize
    self._saver.build()
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 1302, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 1339, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 790, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 502, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 449, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 847, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1469, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 3259, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): tensor_name = dnn/hiddenlayer_1/bias; shape in shape_and_slice spec [11] does not match the shape stored in checkpoint: [10]
	 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
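
A mismatch like this can be spotted before training by comparing the shapes the new model expects with the shapes stored in the checkpoint. The sketch below works on plain dictionaries written by hand: `find_shape_mismatches` is a hypothetical helper, and the `dnn/hiddenlayer_0/bias` entry is an assumed variable name added for contrast. In TensorFlow versions that provide it, `tf.train.list_variables(model_dir)` returns `(name, shape)` pairs that could supply the checkpoint side.

```python
def find_shape_mismatches(model_shapes, checkpoint_shapes):
    """Return {name: (model_shape, checkpoint_shape)} for variables whose shapes differ."""
    mismatches = {}
    for name, model_shape in model_shapes.items():
        stored = checkpoint_shapes.get(name)
        if stored is not None and list(stored) != list(model_shape):
            mismatches[name] = (list(model_shape), list(stored))
    return mismatches


# Shapes the new [10, 11] network would expect (subset, written by hand).
model_shapes = {
    'dnn/hiddenlayer_0/bias': [10],  # assumed name, unchanged layer
    'dnn/hiddenlayer_1/bias': [11],
}
# Shapes stored by the earlier [10, 10] network (subset, written by hand).
checkpoint_shapes = {
    'dnn/hiddenlayer_0/bias': [10],
    'dnn/hiddenlayer_1/bias': [10],
}

print(find_shape_mismatches(model_shapes, checkpoint_shapes))
# {'dnn/hiddenlayer_1/bias': ([11], [10])}
```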
