# Save and Resume a TensorFlow MNIST ConvNet Model

This Jupyter notebook shows how to save and resume a TensorFlow model. The example uses the "hello world" of deep learning: the MNIST classification task.
Note: to run a code cell, press **`Shift + Enter`**.

### Import Packages
First, we gather all the dependencies in a single place:

In [36]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf
import shutil, os

tf.logging.set_verbosity(tf.logging.INFO)

### Hyper Parameters and Variables

It is also good practice to define hyper-parameters and variables in a single place; this improves code readability and speeds up experiment iteration.

In [37]:
# resumepath: checkpoint of a previous run (if mounted)
# filepath: where to save checkpoints (in the /output folder)
resumepath = "/model/mnist_convnet_model"
filepath = "/output/mnist_convnet_model"

# Hyper-parameters
batch_size = 128
num_classes = 10
num_epochs = 12
learning_rate = 1e-3

### Resuming from Previous Run

If a previous run has been mounted, copy its checkpoint to the `/output` folder so that the model continues from it and saves everything there.

In [38]:
# If a checkpoint from a previous run exists, copy it into the /output folder
if os.path.exists(resumepath):
    shutil.copytree(resumepath, filepath)
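
A slightly more defensive variant of the copy step can be sketched as follows. `shutil.copytree` raises an error when the destination directory already exists (unless `dirs_exist_ok=True` is passed on Python 3.8+), so a guard on the destination is useful when the cell may be re-run. The helper name `restore_checkpoint_dir` is hypothetical and not part of the original notebook.

```python
import os
import shutil


def restore_checkpoint_dir(src, dst):
    """Copy a previous run's checkpoint folder to dst, at most once.

    Returns True if a copy was made, False otherwise.
    (Hypothetical helper, shown only as an illustration.)
    """
    if not os.path.isdir(src):
        # No previous run mounted: training starts from scratch.
        return False
    if os.path.exists(dst):
        # dst already holds checkpoints (e.g. the cell was re-run): keep them.
        return False
    shutil.copytree(src, dst)
    return True
```

In this notebook it would be called as `restore_checkpoint_dir(resumepath, filepath)`.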

### Data Processing and Transformation
Next, we load the dataset as NumPy arrays, ready to be fed into the model.

In [3]:
# Load training and eval data
mnist = tf.contrib.learn.datasets.load_dataset("mnist")
train_data = mnist.train.images  # Returns np.array
train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
eval_data = mnist.test.images  # Returns np.array
eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)

print(train_data.shape)
print(eval_data.shape)

Extracting MNIST-data/train-images-idx3-ubyte.gz
Extracting MNIST-data/train-labels-idx1-ubyte.gz
Extracting MNIST-data/t10k-images-idx3-ubyte.gz
Extracting MNIST-data/t10k-labels-idx1-ubyte.gz
(55000, 784)
(10000, 784)


### Define the Model
We use a ConvNet, the standard architecture for image classification tasks.

In [4]:
def cnn_model_fn(features, labels, mode):
    """Model function for CNN."""
    # Input Layer
    # Reshape X to 4-D tensor: [batch_size, height, width, channels]
    # MNIST images are 28x28 pixels, and have one color channel
    input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])

    # Convolutional Layer #1
    # Computes 32 features using a 3x3 filter with ReLU activation.
    # Input Tensor Shape: [batch_size, 28, 28, 1]
    # Output Tensor Shape: [batch_size, 26, 26, 32]
    conv1 = tf.layers.conv2d(
      inputs=input_layer,
      filters=32,
      kernel_size=[3, 3],
      activation=tf.nn.relu)

    # Convolutional Layer #2
    # Computes 64 features using a 3x3 filter.
    # Input Tensor Shape: [batch_size, 26, 26, 32]
    # Output Tensor Shape: [batch_size, 24, 24, 64]
    conv2 = tf.layers.conv2d(
      inputs=conv1,
      filters=64,
      kernel_size=[3, 3],
      activation=tf.nn.relu)

    # Pooling Layer
    # Max pooling layer with a 2x2 filter and stride of 2
    # Input Tensor Shape: [batch_size, 24, 24, 64]
    # Output Tensor Shape: [batch_size, 12, 12, 64]
    pool = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)

    # Dropout # 1
    # Add dropout operation; each element is dropped with probability 0.25
    dropout = tf.layers.dropout(
      inputs=pool, rate=0.25, training=mode == tf.estimator.ModeKeys.TRAIN)

    # Flatten tensor into a batch of vectors
    # Input Tensor Shape: [batch_size, 12, 12, 64]
    # Output Tensor Shape: [batch_size, 12 * 12 * 64]
    flat = tf.reshape(dropout, [-1, 12 * 12 * 64])  # 9216

    
    # Dense Layer # 1
    # Densely connected layer with 128 neurons
    # Input Tensor Shape: [batch_size, 12 * 12 * 64] (batch_size, 9216)
    # Output Tensor Shape: [batch_size, 128]
    dense1 = tf.layers.dense(inputs=flat, units=128, activation=tf.nn.relu)
    
    # Dropout # 2
    # Add dropout operation; each element is dropped with probability 0.5
    dropout2 = tf.layers.dropout(
      inputs=dense1, rate=0.5, training=mode == tf.estimator.ModeKeys.TRAIN)

    # Logits layer
    # Input Tensor Shape: [batch_size, 128]
    # Output Tensor Shape: [batch_size, 10]
    logits = tf.layers.dense(inputs=dropout2, units=num_classes)

    predictions = {
        # Generate predictions (for PREDICT and EVAL mode)
        "classes": tf.argmax(input=logits, axis=1),
        # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
        # `logging_hook`.
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }
    # Inference (for PREDICT mode)
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Calculate Loss (for both TRAIN and EVAL modes)
    onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=num_classes)
    # Cross Entropy
    loss = tf.losses.softmax_cross_entropy(
      onehot_labels=onehot_labels, logits=logits)

    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        # AdamOptimizer
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())
        eval_metric_ops = {
          "accuracy": tf.metrics.accuracy(
              labels=labels, predictions=predictions["classes"])}
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op, eval_metric_ops=eval_metric_ops)

    # Add evaluation metrics (for EVAL mode)
    eval_metric_ops = {
      "accuracy": tf.metrics.accuracy(
          labels=labels, predictions=predictions["classes"])}
    return tf.estimator.EstimatorSpec(
      mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
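
The tensor shapes quoted in the comments of `cnn_model_fn` can be checked by hand. The sketch below assumes the default `'valid'` padding (no padding) of `tf.layers.conv2d` and `tf.layers.max_pooling2d`; the helper names `conv_valid` and `pool_valid` are hypothetical and used only for this illustration.

```python
def conv_valid(size, kernel, stride=1):
    """Output length of a 'valid' (unpadded) convolution along one axis."""
    return (size - kernel) // stride + 1


def pool_valid(size, window, stride):
    """Output length of a 'valid' pooling window along one axis."""
    return (size - window) // stride + 1


side = 28                      # MNIST images are 28x28
side = conv_valid(side, 3)     # conv1, 3x3 kernel: 28 -> 26
side = conv_valid(side, 3)     # conv2, 3x3 kernel: 26 -> 24
side = pool_valid(side, 2, 2)  # 2x2 max pooling, stride 2: 24 -> 12
flat_units = side * side * 64  # 64 feature maps come out of conv2
print(side, flat_units)        # prints: 12 9216
```

This matches the `12 * 12 * 64` used when flattening the pooled feature maps.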

### Checkpoint Strategy

The strategy adopted for this example is the following:

- Keep only one checkpoint
- Save a checkpoint at the end of every epoch (each `train` call runs for one epoch)

In [25]:
# Checkpoint Strategy configuration
run_config = tf.contrib.learn.RunConfig(
    model_dir=filepath,
    keep_checkpoint_max=1)
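
The effect of `keep_checkpoint_max=1` can be illustrated without TensorFlow: every time a new checkpoint is written, older ones beyond the limit are discarded. This is a toy simulation, not the Estimator's actual implementation; `simulate_retention` is a hypothetical name, and the step numbers are taken from the training log in this notebook.

```python
from collections import deque


def simulate_retention(saved_steps, keep_max):
    """Return the checkpoint names that survive after saving at each step."""
    kept = deque(maxlen=keep_max)  # the oldest entry is evicted automatically
    for step in saved_steps:
        kept.append("model.ckpt-{}".format(step))
    return list(kept)


print(simulate_retention([22792, 23221], keep_max=1))
# prints: ['model.ckpt-23221']
```

With a larger `keep_max`, earlier checkpoints would remain available, at the cost of more disk space.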

In [26]:
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
      model_fn=cnn_model_fn, config=run_config)

INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f50ed5cf828>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 1, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/output/mnist_convnet_model'}


### Training
Let's train the model and see our checkpoint strategy in action.

In [34]:
# Keep track of the best accuracy
best_acc = 0

# Training for num_epochs
for i in range(num_epochs):
    print("Begin Training - Epoch {}/{}".format(i+1, num_epochs))
    # Train the model for 1 epoch
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": train_data},
        y=train_labels,
        batch_size=batch_size,
        num_epochs=1,
        shuffle=True)

    mnist_classifier.train(
        input_fn=train_input_fn)

    # Evaluate the model and print results
    eval_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": eval_data},
        y=eval_labels,
        num_epochs=1,
        shuffle=False)
    
    eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
    
    accuracy = eval_results["accuracy"] * 100
    # Set the best acc if we have a new best or if it is the first step 
    if accuracy > best_acc or i == 0:
        best_acc = accuracy
        print("=> New Best Accuracy {}".format(accuracy))
    else:
        print("=> Validation Accuracy did not improve")

Begin Training - Epoch 1/12
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /output/mnist_convnet_model/model.ckpt-22791
INFO:tensorflow:Saving checkpoints for 22792 into /output/mnist_convnet_model/model.ckpt.
INFO:tensorflow:loss = 0.000462755, step = 22792
INFO:tensorflow:global_step/sec: 54.9188
INFO:tensorflow:loss = 0.00165718, step = 22892 (1.822 sec)
INFO:tensorflow:global_step/sec: 55.3138
INFO:tensorflow:loss = 0.0959034, step = 22992 (1.808 sec)
INFO:tensorflow:global_step/sec: 55.2494
INFO:tensorflow:loss = 0.0139612, step = 23092 (1.810 sec)
INFO:tensorflow:global_step/sec: 55.3391
INFO:tensorflow:loss = 0.013573, step = 23192 (1.807 sec)
INFO:tensorflow:Saving checkpoints for 23221 into /output/mnist_convnet_model/model.ckpt.
INFO:tensorflow:Loss for final step: 0.000644423.
INFO:tensorflow:Starting evaluation at 2017-11-17-16:09:55
INFO:tensorflow:Restoring parameters from /output/mnist_convnet_model/model.ckpt-23221
INFO:tensorflow:

INFO:tensorflow:Saving dict for global step 25801: accuracy = 0.9941, global_step = 25801, loss = 0.0420431
=> Validation Accuracy did not improve
Begin Training - Epoch 8/12
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /output/mnist_convnet_model/model.ckpt-25801
INFO:tensorflow:Saving checkpoints for 25802 into /output/mnist_convnet_model/model.ckpt.
INFO:tensorflow:loss = 0.000456069, step = 25802
INFO:tensorflow:global_step/sec: 54.252
INFO:tensorflow:loss = 0.00102052, step = 25902 (1.845 sec)
INFO:tensorflow:global_step/sec: 54.7519
INFO:tensorflow:loss = 0.00514327, step = 26002 (1.826 sec)
INFO:tensorflow:global_step/sec: 54.7565
INFO:tensorflow:loss = 0.00298428, step = 26102 (1.826 sec)
INFO:tensorflow:global_step/sec: 54.5546
INFO:tensorflow:loss = 0.00353552, step = 26202 (1.833 sec)
INFO:tensorflow:Saving checkpoints for 26231 into /output/mnist_convnet_model/model.ckpt.
INFO:tensorflow:Loss for final step: 2.79366e-05.
INFO:tensorf
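
The step numbers in the log are consistent with one epoch per `train` call. A quick check, assuming the final partial batch is also consumed (which is what the log suggests):

```python
import math

num_train_examples = 55000  # first dimension of train_data printed earlier
batch_size = 128

steps_per_epoch = math.ceil(num_train_examples / batch_size)
last_batch = num_train_examples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)  # prints: 430
print(last_batch)       # prints: 88
```

The first epoch in the log restores step 22791 and saves its final checkpoint at step 23221, a difference of exactly 430 steps.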

### Resume the checkpoint after the training
Let's take a look at the checkpoint just created (you should see the `mnist_convnet_model` folder).

In [39]:
% ls

MNIST-data/     command.sh                     mnist_convnet_model/
README.md       keras_mnist_cnn.py             pytorch_mnist_cnn.py
Untitled.ipynb  keras_mnist_cnn_jupyter.ipynb  pytorch_mnist_cnn_jupyter.ipynb


The Jupyter notebook runs in the `/output` folder, so the checkpoint folder is listed here. To resume from it, simply re-run the **Training** cell; the Estimator takes care of restoring the latest checkpoint.
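
The global step is encoded in the checkpoint file names, as in the `model.ckpt-22791` path shown in the training log, so the point from which training would resume can be read back with plain Python. The sketch below assumes that each checkpoint has a `model.ckpt-<step>.index` file in the model directory; the helper `latest_step` is hypothetical. Inside TensorFlow 1.x, `tf.train.latest_checkpoint(model_dir)` serves a similar purpose.

```python
import os
import re

# Assumed naming scheme: model.ckpt-<global_step>.index
_STEP_PATTERN = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_step(model_dir):
    """Return the highest global step among checkpoint index files, or None."""
    steps = []
    for name in os.listdir(model_dir):
        match = _STEP_PATTERN.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None
```

In this notebook it would be called as `latest_step(filepath)`; a result of `None` means no checkpoint was found and training would start from step 0.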