## 1. Getting Started

### 1.5. Saving and Restoring Models

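Model progress can be saved **during** and **after** training. This means:
1. A model can resume training where it left off, which avoids long training times.
2. You can share your model.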

When publishing a model, most practitioners share:
1. The code used to create the model.
2. The trained weights, or parameters, of the model.

**Options**

There are several ways to save TensorFlow models, depending on the API you use. This tutorial uses [tf.keras][], a high-level API for building and training models in TensorFlow. For other approaches, see the TensorFlow [Save and Restore][] guide or [Saving in eager][] (eager execution is another TensorFlow API).

[tf.keras]: https://www.tensorflow.org/guide/keras?hl=zh-cn
[Save and Restore]: https://www.tensorflow.org/guide/saved_model?hl=zh-cn
[Saving in eager]: https://www.tensorflow.org/guide/eager?hl=zh-cn#object_based_saving

**Setup**

---

**Install and import**

Install and import TensorFlow and its dependencies:

`!pip install -q h5py pyyaml`

**Get an example dataset**

We will train the model on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) to demonstrate saving weights. To speed up the demonstration, use only the first 1,000 examples:


In [3]:
from __future__ import absolute_import, division, print_function

import os
import ssl

import tensorflow as tf
from tensorflow import keras

ssl._create_default_https_context = ssl._create_unverified_context  # Workaround: disable SSL certificate verification, otherwise downloading the MNIST dataset raises an error

print(tf.__version__)

1.12.0


In [9]:
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()

train_labels = train_labels[:1000]
test_labels = test_labels[:1000]

# print('train_images.reshape(-1, 28*28)', train_images[:1000].reshape(-1, 28*28))

# Preprocessing: the model has no Flatten layer, so reshape each image into a
# 784-element 1-D array by hand, then scale pixel values to the range [0, 1].
train_images = train_images[:1000].reshape(-1, 28*28) / 255.0
test_images = test_images[:1000].reshape(-1, 28*28) / 255.0
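
The preprocessing step can be checked on a small stand-in array. The sketch below uses random `uint8` data in place of MNIST (an assumption, so that it runs without downloading anything; the names `fake_images` and `flat` are illustrative only). It shows that `reshape(-1, 28*28) / 255.0` produces floating-point rows of length 784 with values in [0, 1].

```python
import numpy as np

# Stand-in for MNIST: 5 fake grayscale images of 28x28 pixels (values 0..255).
rng = np.random.RandomState(0)
fake_images = rng.randint(0, 256, size=(5, 28, 28)).astype(np.uint8)

# Flatten each 28x28 image into a 784-element row and scale pixels to [0, 1].
flat = fake_images.reshape(-1, 28 * 28) / 255.0

print(flat.shape)  # (5, 784)
print(flat.dtype)  # float64
```

Dividing by 255 is min-max scaling only because the pixel range is known to be 0 to 255; data with a different range would need its own minimum and maximum.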

**Define a model**

Let's build a simple model to demonstrate saving and loading weights.


In [12]:
# Returns a short sequential model
def create_model():
    model = keras.models.Sequential([
        keras.layers.Dense(512, activation=tf.nn.relu, input_shape=(784,)),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(10, activation=tf.nn.softmax)
    ])
    
    model.compile(optimizer=tf.keras.optimizers.Adam(),  # Note: differs from the basic classification example
                  loss=tf.keras.losses.sparse_categorical_crossentropy,  # Note: differs from the basic classification example
                  metrics=['accuracy'])
    
    return model


# Create a basic model instance
model = create_model()
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 512)               401920    
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                5130      
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________
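
The parameter counts in the summary can be reproduced by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit, and `Dropout` has no parameters. A minimal sketch of that arithmetic follows (`dense_params` is a hypothetical helper, not part of Keras):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


hidden = dense_params(784, 512)  # first Dense layer: 784 inputs, 512 units
output = dense_params(512, 10)   # softmax output layer: 512 inputs, 10 units
dropout = 0                      # Dropout has no trainable parameters

total = hidden + dropout + output
print(hidden, output, total)  # 401920 5130 407050
```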


**Save checkpoints during training**

---

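`tf.keras.callbacks.ModelCheckpoint` is a callback that saves checkpoints automatically. It takes a few arguments that configure checkpointing. With `tf.keras.callbacks.ModelCheckpoint` you can:
+ use a trained model without having to retrain it;
+ pick up training where you left off, in case the training process was interrupted.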

**Checkpoint callback usage**

Train the model and pass it the `ModelCheckpoint` callback:


In [13]:
checkpoint_path = 'training_1/cp.ckpt'
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create checkpoint callback
cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)
model = create_model()

model.fit(train_images, train_labels, epochs = 10,
          validation_data = (test_images, test_labels),
          callbacks = [cp_callback])  # pass callback to training

Train on 1000 samples, validate on 1000 samples
Epoch 1/10
Epoch 00001: saving model to training_1/cp.ckpt

Consider using a TensorFlow optimizer from `tf.train`.
Epoch 2/10
Epoch 00002: saving model to training_1/cp.ckpt

Consider using a TensorFlow optimizer from `tf.train`.
Epoch 3/10
Epoch 00003: saving model to training_1/cp.ckpt

Consider using a TensorFlow optimizer from `tf.train`.
Epoch 4/10
Epoch 00004: saving model to training_1/cp.ckpt

Consider using a TensorFlow optimizer from `tf.train`.
Epoch 5/10
Epoch 00005: saving model to training_1/cp.ckpt

Consider using a TensorFlow optimizer from `tf.train`.
Epoch 6/10
Epoch 00006: saving model to training_1/cp.ckpt

Consider using a TensorFlow optimizer from `tf.train`.
Epoch 7/10
Epoch 00007: saving model to training_1/cp.ckpt

Consider using a TensorFlow optimizer from `tf.train`.
Epoch 8/10
Epoch 00008: saving model to training_1/cp.ckpt

Consider using a TensorFlow optimizer from `tf.train`.
Epoch 9/10
Epoch 00009: saving m

<tensorflow.python.keras.callbacks.History at 0x12cecf588>

The code above creates a collection of TensorFlow checkpoint files that are updated at the end of each epoch:

In [15]:
!ls {checkpoint_dir}

checkpoint                  cp.ckpt.index
cp.ckpt.data-00000-of-00001


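When restoring a model from weights only, you need a model with the same architecture as the original. Because the architecture is identical, the weights can be shared even though the two are different model instances.

Now rebuild a fresh, untrained model and evaluate it on the test set. An untrained model performs at chance level (about 10% accuracy, since there are 10 classes):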

In [17]:
model = create_model()

loss, acc = model.evaluate(test_images, test_labels)
print("Untrained model, accuracy: {:5.2f}%".format(100*acc))


Untrained model, accuracy:  8.50%


Then load the weights from the checkpoint and re-evaluate:

In [18]:
model.load_weights(checkpoint_path)
loss, acc = model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))

Restored model, accuracy: 87.60%


**Checkpoint callback options**

The checkpoint callback accepts a number of arguments.
Notes:
1. tensorflow.keras.callbacks.ModelCheckpoint():
```
tensorflow.keras.callbacks.ModelCheckpoint = class ModelCheckpoint(Callback)
 |  Save the model after every epoch.
 |
 |  `filepath` can contain named formatting options,
 |  which will be filled the value of `epoch` and
 |  keys in `logs` (passed in `on_epoch_end`).
 |
 |  For example: if `filepath` is `weights.{epoch:02d}-{val_loss:.2f}.hdf5`,
 |  then the model checkpoints will be saved with the epoch number and
 |  the validation loss in the filename.
 |
 |  Arguments:
 |      filepath: string, path to save the model file.
 |      monitor: quantity to monitor.
 |      verbose: verbosity mode, 0 or 1.
 |      save_best_only: if `save_best_only=True`,
 |          the latest best model according to
 |          the quantity monitored will not be overwritten.
 |      mode: one of {auto, min, max}.
 |          If `save_best_only=True`, the decision
 |          to overwrite the current save file is made
 |          based on either the maximization or the
 |          minimization of the monitored quantity. For `val_acc`,
 |          this should be `max`, for `val_loss` this should
 |          be `min`, etc. In `auto` mode, the direction is
 |          automatically inferred from the name of the monitored quantity.
 |      save_weights_only: if True, then only the model's weights will be
 |          saved (`model.save_weights(filepath)`), else the full model
 |          is saved (`model.save(filepath)`).
 |      period: Interval (number of epochs) between checkpoints.
```
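
The `filepath` formatting described in the docstring is ordinary `str.format` substitution. The sketch below illustrates it with hypothetical values for `epoch` and `val_loss` (in a real run Keras supplies them at the end of each epoch), and lists which epochs would trigger a save, assuming a 50-epoch run with `period=5`.

```python
# Template from the docstring above: epoch number plus validation loss.
template = "weights.{epoch:02d}-{val_loss:.2f}.hdf5"

# Hypothetical values standing in for what the callback receives.
logs = {"val_loss": 0.4321}
name = template.format(epoch=3, **logs)
print(name)  # weights.03-0.43.hdf5

# Assumed schedule: with period=5, a checkpoint is written every 5th epoch.
saved_epochs = [epoch for epoch in range(1, 51) if epoch % 5 == 0]
print(saved_epochs)  # [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
```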

In [19]:
# Include the epoch in the file name (uses `str.format`); see the docstring above.
checkpoint_path = 'training_2/cp-{epoch:04d}.ckpt'
checkpoint_dir = os.path.dirname(checkpoint_path)
print(os.path.dirname(checkpoint_path))

cp_callback = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path, verbose=1, save_weights_only=True,
    # Save weights, every 5-epochs.
    period=5)

model = create_model()
model.fit(train_images, train_labels,
          epochs=50, callbacks=[cp_callback],
          validation_data=(test_images, test_labels),
          verbose=0)


training_2

Epoch 00005: saving model to training_2/cp-0005.ckpt

Consider using a TensorFlow optimizer from `tf.train`.

Epoch 00010: saving model to training_2/cp-0010.ckpt

Consider using a TensorFlow optimizer from `tf.train`.

Epoch 00015: saving model to training_2/cp-0015.ckpt

Consider using a TensorFlow optimizer from `tf.train`.

Epoch 00020: saving model to training_2/cp-0020.ckpt

Consider using a TensorFlow optimizer from `tf.train`.

Epoch 00025: saving model to training_2/cp-0025.ckpt

Consider using a TensorFlow optimizer from `tf.train`.

Epoch 00030: saving model to training_2/cp-0030.ckpt

Consider using a TensorFlow optimizer from `tf.train`.

Epoch 00035: saving model to training_2/cp-0035.ckpt

Consider using a TensorFlow optimizer from `tf.train`.

Epoch 00040: saving model to training_2/cp-0040.ckpt

Consider using a TensorFlow optimizer from `tf.train`.

Epoch 00045: saving model to training_2/cp-0045.ckpt

Consider using a TensorFlow optimizer from `tf.train`.

<tensorflow.python.keras.callbacks.History at 0x1282d5a90>

Now look at the resulting checkpoints and choose the latest one:

In [27]:
# import pathlib

# # Sort the checkpoints by modification time.
# checkpoints = pathlib.Path(checkpoint_dir).glob("*.index")
# print(pathlib.Path(checkpoint_dir).glob("*.index"))
# checkpoints = sorted(checkpoints, key=lambda cp:cp.stat().st_mtime)
# checkpoints = [cp.with_suffix('') for cp in checkpoints]
# latest = str(checkpoints[-1])
# print(checkpoints)

!ls {checkpoint_dir}

print('*****' * 15)

latest = tf.train.latest_checkpoint(checkpoint_dir)
latest

checkpoint                       cp-0030.ckpt.data-00000-of-00001
cp-0005.ckpt.data-00000-of-00001 cp-0030.ckpt.index
cp-0005.ckpt.index               cp-0035.ckpt.data-00000-of-00001
cp-0010.ckpt.data-00000-of-00001 cp-0035.ckpt.index
cp-0010.ckpt.index               cp-0040.ckpt.data-00000-of-00001
cp-0015.ckpt.data-00000-of-00001 cp-0040.ckpt.index
cp-0015.ckpt.index               cp-0045.ckpt.data-00000-of-00001
cp-0020.ckpt.data-00000-of-00001 cp-0045.ckpt.index
cp-0020.ckpt.index               cp-0050.ckpt.data-00000-of-00001
cp-0025.ckpt.data-00000-of-00001 cp-0050.ckpt.index
cp-0025.ckpt.index
***************************************************************************


'training_2/cp-0050.ckpt'
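
As an alternative to the modification-time sort in the commented-out code, the latest checkpoint can also be chosen by the epoch number embedded in the file name. This is only a sketch: `latest_by_epoch` is a hypothetical helper that assumes the `cp-NNNN.ckpt` naming used here, and it is not how `tf.train.latest_checkpoint` works internally. The demo runs on empty placeholder files rather than real checkpoints.

```python
import os
import re
import tempfile


def latest_by_epoch(checkpoint_dir):
    """Return the checkpoint prefix with the highest epoch number, or None.

    Hypothetical helper: assumes index files named like 'cp-0005.ckpt.index'.
    """
    pattern = re.compile(r"^(cp-(\d+)\.ckpt)\.index$")
    best = None  # (epoch, prefix) of the newest checkpoint seen so far
    for fname in os.listdir(checkpoint_dir):
        match = pattern.match(fname)
        if match and (best is None or int(match.group(2)) > best[0]):
            best = (int(match.group(2)), match.group(1))
    return os.path.join(checkpoint_dir, best[1]) if best else None


# Demo on empty placeholder files that mimic the listing above.
demo_dir = tempfile.mkdtemp()
for epoch in (5, 10, 50):
    for suffix in (".index", ".data-00000-of-00001"):
        path = os.path.join(demo_dir, "cp-%04d.ckpt%s" % (epoch, suffix))
        open(path, "w").close()

latest_demo = latest_by_epoch(demo_dir)
print(os.path.basename(latest_demo))  # cp-0050.ckpt
```

Parsing the epoch from the name avoids depending on file timestamps, which can change when a directory is copied.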

To test, reset the model and load the latest checkpoint:

In [40]:
model = create_model()
model.load_weights(latest)
loss, acc = model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))

Restored model, accuracy: 87.60%


**What are these files?**

---

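The code above stores the weights in a collection of checkpoint-formatted files that contain only the trained weights in a binary format. A checkpoint contains:
+ one or more shards that hold the model's weights;
+ an index file that indicates which weights are stored in which shard.

If you train a model on a single machine, you will have one shard with the suffix `.data-00000-of-00001`.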
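
The naming scheme can be made concrete by grouping the files of a checkpoint directory by prefix. The sketch below works on the file names from the `training_1` listing; the regular expression and the `summary` layout are assumptions made for illustration, not a TensorFlow API.

```python
import re

# File names copied from the directory listing of training_1.
files = ["checkpoint", "cp.ckpt.index", "cp.ckpt.data-00000-of-00001"]

# Shard files look like '<prefix>.data-<shard number>-of-<shard count>'.
shard_re = re.compile(r"^(?P<prefix>.+)\.data-(?P<shard>\d{5})-of-(?P<total>\d{5})$")


def new_entry():
    return {"shards": 0, "total": 0, "index": False}


summary = {}
for fname in files:
    match = shard_re.match(fname)
    if match:
        entry = summary.setdefault(match.group("prefix"), new_entry())
        entry["shards"] += 1                      # shard files found on disk
        entry["total"] = int(match.group("total"))  # shard count named in the suffix
    elif fname.endswith(".index"):
        prefix = fname[: -len(".index")]
        summary.setdefault(prefix, new_entry())["index"] = True

print(summary)
# {'cp.ckpt': {'shards': 1, 'total': 1, 'index': True}}
```

The `checkpoint` file matches neither pattern; it is the bookkeeping file that `tf.train.latest_checkpoint` consults.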

**Manually save weights**

---

Use the `Model.save_weights` method.
Notes:
1. tensorflow.keras.Model.save_weights:
```
tensorflow.keras.Model.save_weights = save_weights(self, filepath, overwrite=True, save_format=None)
    Saves all layer weights.

    Either saves in HDF5 or in TensorFlow format based on the `save_format`
    argument.

    When saving in HDF5 format, the weight file has:
      - `layer_names` (attribute), a list of strings
          (ordered names of model layers).
      - For every layer, a `group` named `layer.name`
          - For every such layer group, a group attribute `weight_names`,
              a list of strings
              (ordered names of weights tensor of the layer).
          - For every weight in the layer, a dataset
              storing the weight value, named after the weight tensor.

    When saving in TensorFlow format, all objects referenced by the network are
    saved in the same format as `tf.train.Checkpoint`, including any `Layer`
    instances or `Optimizer` instances assigned to object attributes. For
    networks constructed from inputs and outputs using `tf.keras.Model(inputs,
    outputs)`, `Layer` instances used by the network are tracked/saved
    automatically. For user-defined classes which inherit from `tf.keras.Model`,
    `Layer` instances must be assigned to object attributes, typically in the
    constructor. See the documentation of `tf.train.Checkpoint` and
    `tf.keras.Model` for details.

    Arguments:
        filepath: String, path to the file to save the weights to. When saving
            in TensorFlow format, this is the prefix used for checkpoint files
            (multiple files are generated). Note that the '.h5' suffix causes
            weights to be saved in HDF5 format.
        overwrite: Whether to silently overwrite any existing file at the
            target location, or provide the user with a manual prompt.
        save_format: Either 'tf' or 'h5'. A `filepath` ending in '.h5' or
            '.keras' will default to HDF5 if `save_format` is `None`. Otherwise
            `None` defaults to 'tf'.

    Raises:
        ImportError: If h5py is not available when attempting to save in HDF5
            format.
        ValueError: For invalid/unknown format arguments.
```
2. tensorflow.keras.Model.load_weights():
```
tensorflow.keras.Model.load_weights = load_weights(self, filepath, by_name=False)
    Loads all layer weights, either from a TensorFlow or an HDF5 weight file.

    If `by_name` is False weights are loaded based on the network's
    topology. This means the architecture should be the same as when the weights
    were saved.  Note that layers that don't have weights are not taken into
    account in the topological ordering, so adding or removing layers is fine as
    long as they don't have weights.

    If `by_name` is True, weights are loaded into layers only if they share the
    same name. This is useful for fine-tuning or transfer-learning models where
    some of the layers have changed.

    Only topological loading (`by_name=False`) is supported when loading weights
    from the TensorFlow format. Note that topological loading differs slightly
    between TensorFlow and HDF5 formats for user-defined classes inheriting from
    `tf.keras.Model`: HDF5 loads based on a flattened list of weights, while the
    TensorFlow format loads based on the object-local names of attributes to
    which layers are assigned in the `Model`'s constructor.

    Arguments:
        filepath: String, path to the weights file to load. For weight files in
            TensorFlow format, this is the file prefix (the same as was passed
            to `save_weights`).
        by_name: Boolean, whether to load weights by name or by topological
            order. Only topological loading is supported for weight files in
            TensorFlow format.

    Returns:
        When loading a weight file in TensorFlow format, returns the same status
        object as `tf.train.Checkpoint.restore`. When graph building, restore
        ops are run automatically as soon as the network is built (on first call
        for user-defined classes inheriting from `Model`, immediately if it is
        already built).

        When loading weights in HDF5 format, returns `None`.

    Raises:
        ImportError: If h5py is not available and the weight file is in HDF5
            format.
```
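
The `save_format` rule from the `save_weights` docstring can be restated as a few lines of Python. `infer_save_format` is a hypothetical helper written only to mirror the documented rule; it is a simplification and not the Keras implementation.

```python
def infer_save_format(filepath, save_format=None):
    """Hypothetical helper mirroring the rule in the `save_weights` docstring."""
    if save_format is None:
        # A '.h5' or '.keras' suffix selects HDF5; anything else selects 'tf'.
        if filepath.endswith((".h5", ".keras")):
            return "h5"
        return "tf"
    if save_format not in ("tf", "h5"):
        raise ValueError("Unknown save_format: %r" % (save_format,))
    return save_format


print(infer_save_format("./checkpoints/my_checkpoint"))  # tf
print(infer_save_format("my_weights.h5"))                # h5
```

Under this rule, the path `./checkpoints/my_checkpoint` used in the next cell has no `.h5` suffix, so the weights are written in the TensorFlow checkpoint format.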

In [41]:
# Save the weights
model.save_weights('./checkpoints/my_checkpoint')

# Restore the weights
model = create_model()
model.load_weights('./checkpoints/my_checkpoint')

loss, acc = model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))



Consider using a TensorFlow optimizer from `tf.train`.
Restored model, accuracy: 87.60%


**Save the entire model**

---

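The entire model can be saved to a single file that contains **the weight values, the model's configuration, and even the optimizer's configuration**. This lets you checkpoint a model and resume training later from exactly the same state, without access to the original code.

Saving a fully functional model in Keras is very useful: you can load it in [TensorFlow.js](https://js.tensorflow.org/tutorials/import-keras.html) and then train and run it in a web browser.

Keras provides a basic save format using the [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) standard. For our purposes, the saved model can be treated as a single binary blob.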


In [48]:
model = create_model()

model.fit(train_images, train_labels, epochs=5)

# Save entire model to a HDF5 file
model.save('my_model.h5')

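# Save again without the '.h5' suffix. Assumption: in TF 1.12 `model.save` still writes HDF5 (the listing below shows a single file, not a directory).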
model.save('my_model_test')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Now recreate the model from that file:

In [49]:
# Recreate the exact same model, including weights and optimizer.
new_model = keras.models.load_model('my_model.h5')
new_model.summary()

# List the working directory to inspect the saved model files.
!ls

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_42 (Dense)             (None, 512)               401920    
_________________________________________________________________
dropout_21 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_43 (Dense)             (None, 10)                5130      
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________
1_tf_basic_classification.ipynb     my_model_test
5_tf_save_restore_keras_model.ipynb training_1
checkpoints                         training_2
my_model.h5


In [47]:
# Check its accuracy:
loss, acc = new_model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))


Restored model, accuracy: 86.40%


This technique saves everything:

+ the weight values;
+ the model's configuration (architecture);
+ the optimizer configuration.

Keras saves models by inspecting the architecture. Currently, it is not able to save TensorFlow optimizers (from `tf.train`). When using those, you will need to re-compile the model after loading, and you will lose the state of the optimizer.