### Using DVC checkpoints to train an MNIST classifier

This notebook introduces DVC checkpoints. The example uses the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset of handwritten digits to build a classification model that predicts the digit (0-9) in each image.

### Model script


The model is built in [Keras](https://keras.io/) as a convolutional neural network with an architecture simple enough to train quickly with few resources.

Let's look at the model training script:

In [1]:
%%bash
cat train.py

import os
import yaml
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.random import set_seed

from dvc.api import make_checkpoint

num_classes = 10
input_shape = (28, 28, 1)
epochs = 10
batch_size = 128


def get_data():
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

    x_train = x_train.astype("float32") / 255
    x_test = x_test.astype("float32") / 255
    x_train = np.expand_dims(x_train, -1)
    x_test = np.expand_dims(x_test, -1)

    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_test = keras.utils.to_categorical(y_test, num_classes)

    return (x_train, y_train), (x_test, y_test)

def get_model():
    set_seed(0)
    model= keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(4, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(8, kernel_size=(3, 3), activation="relu"),
        layers.Max

The script does the following:
1. Loads the existing model, or creates a new one if none exists, so that training can continue iteratively from a warm start.
2. Loads MNIST data, including transformations and train/test splits.
3. For each epoch, trains, evaluates, and saves.

To enable DVC checkpoints, the `make_checkpoint()` function is called after each epoch.
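The per-epoch cycle described above can be sketched in a framework-agnostic way. This is a minimal illustration, not the notebook's `train.py`: the function `train_with_checkpoints` and the `fake_*` stand-ins are hypothetical names, and placing the checkpoint call directly after the save step is an assumption based on the description above.

```python
from typing import Callable, Dict, List


def train_with_checkpoints(
    epochs: int,
    train_one_epoch: Callable[[], None],
    evaluate: Callable[[], Dict[str, float]],
    save: Callable[[Dict[str, float]], None],
    checkpoint: Callable[[], None],
) -> List[Dict[str, float]]:
    """Train, evaluate, and save once per epoch, then signal a checkpoint."""
    history: List[Dict[str, float]] = []
    for _ in range(epochs):
        train_one_epoch()
        metrics = evaluate()
        save(metrics)   # write the model and metrics files first
        checkpoint()    # assumed position of make_checkpoint() in train.py
        history.append(metrics)
    return history


# Hypothetical stand-ins so the sketch runs without TensorFlow or DVC.
counters = {"trained": 0, "saved": 0, "checkpoints": 0}


def fake_train() -> None:
    counters["trained"] += 1


def fake_evaluate() -> Dict[str, float]:
    return {"epoch": float(counters["trained"])}


def fake_save(metrics: Dict[str, float]) -> None:
    counters["saved"] += 1


def fake_checkpoint() -> None:
    # A checkpoint is only signalled after the matching save has happened.
    assert counters["saved"] == counters["checkpoints"] + 1
    counters["checkpoints"] += 1


history = train_with_checkpoints(
    3, fake_train, fake_evaluate, fake_save, fake_checkpoint
)
print(len(history), counters["checkpoints"])
```

Saving before signalling the checkpoint means each recorded iteration can capture the model and metrics from the epoch that just finished.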

### DVC Pipeline

The last step in setting up checkpoints is a `dvc.yaml` file that describes the dependencies and outputs of the model training stage. Review the contents of `dvc.yaml`:

In [2]:
%%bash
cat dvc.yaml

stages:
  # Arbitrary name for model training stage
  train:
    # Command to execute
    cmd: python train.py
    # Dependencies
    deps:
    - train.py
    # Outputs
    outs:
    - model.tf:
        # Required for checkpoints
        checkpoint: true
    # Metrics
    metrics:
    - metrics.yaml:
        # Track with git instead of dvc
        cache: false


The `dvc.yaml` includes a single stage arbitrarily named `train` that executes the command `python train.py`. The `train.py` script is its only dependency. The model output is saved in the `model.tf` directory and metrics are saved to `metrics.yaml`.

For users familiar with DVC, nothing about this stage is unusual except for `checkpoint: true`, which tells DVC to treat `model.tf` differently from a typical output, since `train.py` also loads the previously trained model from there.
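The warm-start behavior that makes `checkpoint: true` necessary can be illustrated with a small stand-in. This is a sketch under stated assumptions: a JSON file plays the role of the `model.tf` directory, and `load_or_create` and `train_and_save` are hypothetical helpers rather than functions from `train.py` or DVC.

```python
import json
import os
import tempfile


def load_or_create(path: str) -> dict:
    """Return the saved state if `path` exists, otherwise a fresh state."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epochs_trained": 0}


def train_and_save(path: str, epochs: int) -> dict:
    """Resume from the saved state and overwrite it after every epoch."""
    state = load_or_create(path)
    for _ in range(epochs):
        state["epochs_trained"] += 1  # stand-in for one epoch of training
        with open(path, "w") as f:
            json.dump(state, f)
    return state


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.json")  # stand-in for model.tf
    first = train_and_save(model_path, epochs=10)
    second = train_and_save(model_path, epochs=10)
    print(first["epochs_trained"], second["epochs_trained"])
```

Because the output path is read at the start of each run and written during it, the second run continues from where the first one stopped instead of starting over.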

### Train the model

Run an experiment to start training the model. Even with a simple architecture, training may take a few minutes because the model trains for 10 epochs.

**NOTE:** Make sure the repo is up to date in git (commit all changes) before running experiments.

In [3]:
%%bash
dvc exp run

Running stage 'train':
> python train.py
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '39bcf5f'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'c8c9a9e'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '3c5df70'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '0d13eab'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'b62bbf3'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'e956ef1'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '0429ca1'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '87166e5'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'ddcfcd6'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'd1e41ba'.

To track the changes with git, run:

	git add dvc.lock metrics.yaml .gitignore dvc.yaml train.py

Reprod

2021-02-03 20:18:51.582807: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-02-03 20:18:51.582837: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-02-03 20:18:54.006770: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-03 20:18:54.007953: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-02-03 20:18:54.007980: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-02-03 20:18:54.008004: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running 

Review the output of the run, including the identifying hash and metrics of each checkpoint:

In [4]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃    acc ┃     loss ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━┩
│ workspace     │ -        │ 0.9753 │ 0.088837 │
│ keras         │ 08:03 PM │      - │          │
│ │ ╓ exp-b1860 │ 08:20 PM │ 0.9753 │ 0.088837 │
│ │ ╟ ddcfcd6   │ 08:20 PM │ 0.9748 │ 0.092141 │
│ │ ╟ 87166e5   │ 08:20 PM │ 0.9745 │  0.09462 │
│ │ ╟ 0429ca1   │ 08:20 PM │  0.974 │  0.10015 │
│ │ ╟ e956ef1   │ 08:19 PM │ 0.9713 │   0.1044 │
│ │ ╟ b62bbf3   │ 08:19 PM │ 0.9689 │  0.11313 │
│ │ ╟ 0d13eab   │ 08:19 PM │  0.967 │   0.1253 │
│ │ ╟ 3c5df70   │ 08:19 PM │ 0.9617 │  0.14378 │
│ │ ╟ c8c9a9e   │ 08:19 PM │  0.954 │  0.18039 │
│ ├─╨ 39bcf5f   │ 08:19 PM │ 0.9324 │  0.28014 │
└───────────────┴──────────┴────────┴──────────┘


`exp-b1860` was created in the `keras` branch, and a checkpoint was generated for each of the 10 epochs when `make_checkpoint()` was called.

The results from the final epoch are present in the workspace, which can be confirmed either manually or using `dvc metrics show`:

In [5]:
%%bash
cat metrics.yaml

acc: 0.9753000140190125
loss: 0.08883664757013321


In [6]:
%%bash
dvc metrics show

Path          acc     loss
metrics.yaml  0.9753  0.08884
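The same check can also be scripted. The sketch below parses the flat `key: value` contents of `metrics.yaml` shown above and rounds them for comparison with the `dvc metrics show` table. The helper `parse_metrics` is hypothetical and handles only this flat layout (a YAML library would be the safer choice for real files), and rounding to five decimal places is an assumption that happens to match the values displayed here.

```python
def parse_metrics(text: str) -> dict:
    """Parse flat `key: value` lines into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition(":")
        metrics[key.strip()] = float(value)
    return metrics


# Contents of metrics.yaml as printed above.
raw = "acc: 0.9753000140190125\nloss: 0.08883664757013321\n"

metrics = parse_metrics(raw)
rounded = {name: round(value, 5) for name, value in metrics.items()}
print(rounded)
```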


### Adding checkpoints

Run the experiment again to continue training:

In [7]:
%%bash
dvc exp run

Existing checkpoint experiment 'exp-b1860' will be resumed
Running stage 'train':
> python train.py
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'faec3e2'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '6bec53c'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '360db5d'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'ac5d7d7'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '0f0d857'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'b40a747'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '4e1ca13'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'a4a59d1'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '5c19076'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'c08f528'.

To track the changes with git, run:

	git add dvc.yaml dvc.lock metrics.yaml train.py

Reproduced experiment(s): exp-b1860
Experiment results have been applied to your w

2021-02-03 20:20:38.612842: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-02-03 20:20:38.612877: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-02-03 20:20:41.080750: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-03 20:20:41.081931: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-02-03 20:20:41.081951: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-02-03 20:20:41.081978: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running 

In [8]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃    acc ┃     loss ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━┩
│ workspace     │ -        │ 0.9793 │ 0.073416 │
│ keras         │ 08:03 PM │      - │          │
│ │ ╓ exp-b1860 │ 08:22 PM │ 0.9793 │ 0.073416 │
│ │ ╟ 5c19076   │ 08:22 PM │ 0.9795 │ 0.073426 │
│ │ ╟ a4a59d1   │ 08:21 PM │ 0.9797 │ 0.073562 │
│ │ ╟ 4e1ca13   │ 08:21 PM │ 0.9786 │ 0.075069 │
│ │ ╟ b40a747   │ 08:21 PM │ 0.9776 │ 0.076026 │
│ │ ╟ 0f0d857   │ 08:21 PM │ 0.9773 │ 0.079366 │
│ │ ╟ ac5d7d7   │ 08:21 PM │ 0.9769 │ 0.078135 │
│ │ ╟ 360db5d   │ 08:21 PM │ 0.9772 │ 0.079842 │
│ │ ╟ 6bec53c   │ 08:21 PM │ 0.9776 │ 0.081476 │
│ │ ╟ faec3e2   │ 08:20 PM │ 0.9773 │ 0.082802 │
│ │ ╟ d1e41ba   │ 08:20 PM │ 0.9753 │ 0.088837 │
│ │ ╟ ddcfcd6   │ 08:20 PM │ 0.9748 │ 0.092141 │
│ │ ╟ 87166e5   │ 08:20 PM │ 0.9745 │  0.09462 │
│ │ ╟ 0429ca1   │ 08:20 PM │  0.974 │  0.10015 │
│ │ ╟ e956ef1   │ 08:19 PM │ 0.9713 │   0.1044 │
│ │ ╟ b62bbf3   │ 08