This notebook shows how to use [dvc](https://dvc.org/) [experiments](https://github.com/iterative/dvc/wiki/Experiments) in model development. The example uses the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset of handwritten digits and builds a classification model to predict the digit (0-9) in each image. The model is a convolutional neural network built in [PyTorch](https://pytorch.org/) with a simplified architecture that should run quickly on most computers.

### Get started

To get started, clone this repository and navigate to it.

The only other prerequisite is [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/). Once conda is installed, create a virtual environment from the existing `environment.yml` file and activate it:

```bash
conda env create -f environment.yml
conda activate dvc
```

If you want to run this notebook directly, do so after activating the conda environment.

Finally, enable the experiments feature:

In [1]:
%%bash
dvc config --global core.experiments true

### Establish the pipeline DAG

Before running experiments, you need a dvc pipeline (see the [dvc docs](https://dvc.org/doc) if you are new to dvc). Review the contents of `dvc.yaml` below to see the pipeline.

In [2]:
%%bash
cat dvc.yaml

stages:
  download:
    cmd: python download.py
    deps:
    - download.py
    outs:
    - data/MNIST
  train:
    cmd: python train.py
    deps:
    - data/MNIST
    - train.py
    params:
    - lr
    - weight_decay
    outs:
    - model.pt:
        checkpoint: true
    metrics:
    - metrics.yaml


The `download` stage gets the data using the `download.py` script. The `train` stage trains and evaluates the model on the downloaded data using the `train.py` script. It depends on the `lr` and `weight_decay` parameters defined in `params.yaml`. The model output is saved to `model.pt`, and the metrics are saved to `metrics.yaml`.
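To make that data flow concrete, here is a minimal sketch of how a script like `train.py` might read the two parameters and write the metrics file. It is not the repository's actual `train.py`. The helper names `read_flat_yaml` and `write_flat_yaml` are hypothetical, the tiny parser only handles flat `key: value` files (a real script would more likely use a YAML library), the parameter defaults are taken from the `dvc exp show` output in this notebook, and the metric values are placeholders.

```python
import tempfile
from pathlib import Path


def read_flat_yaml(path):
    """Parse a flat `key: value` file into a dict of floats (illustrative only)."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, raw = line.partition(":")
        values[key.strip()] = float(raw)
    return values


def write_flat_yaml(path, values):
    """Write a dict as flat `key: value` lines."""
    Path(path).write_text("".join(f"{k}: {v}\n" for k, v in values.items()))


# Work in a temporary directory so no real project file is touched.
workdir = Path(tempfile.mkdtemp())

# Assumed contents of params.yaml, using the defaults listed by `dvc exp show`.
(workdir / "params.yaml").write_text("lr: 0.001\nweight_decay: 0\n")

params = read_flat_yaml(workdir / "params.yaml")
lr, weight_decay = params["lr"], params["weight_decay"]

# ... training and evaluation would happen here ...
# Placeholder metrics; a real run would compute these from the test set.
metrics = {"acc": 0.0, "loss": 0.0}
write_flat_yaml(workdir / "metrics.yaml", metrics)
```

Because `lr` and `weight_decay` are listed under `params` in `dvc.yaml`, changing either value in `params.yaml` marks the `train` stage as changed.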

Execute the download stage to get the data.

In [3]:
%%bash
dvc repro download

Running stage 'download' with command:
	python download.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw
Processing...
Done!
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.



**IMPORTANT:** Be sure to run the `git add` command above and also `git commit` before running experiments. Any time you modify the pipeline, run `dvc repro` and commit the changes with git before running experiments.

In [4]:
%%bash
git add dvc.lock data/.gitignore
git commit -m "download data"

[queue 567b12c] download data
 1 file changed, 1 insertion(+), 1 deletion(-)


### Run an experiment

Run an experiment with the default parameters defined in `params.yaml`.

In [5]:
%%bash
dvc exp run

Stage 'download' didn't change, skipping
Running stage 'train' with command:
	python train.py
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '1c8cc5f'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '92c86fa'.
Reproduced experiment '92c86fa'.


Review the output of the run, including identifying hashes, metrics, and parameters:

In [6]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment  ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace   │ -        │ 0.1178 │ 2.2949 │ 0.001 │ 0            │
│ queue       │ 06:18 PM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 92c86fa │ 06:18 PM │ 0.1178 │ 2.2949 │ 0.001 │ 0            │
│ ├─╨ 1c8cc5f │ 06:18 PM │ 0.0958 │ 2.3017 │ 0.001 │ 0            │
└─────────────┴──────────┴────────┴────────┴───────┴──────────────┘


Note that the table lists two entries for the run. These are checkpoints of a single experiment run, not separate experiments. Checkpoints are optional, but they can be helpful for models that train over many epochs. See below for more information about how checkpoints work.
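As a rough illustration of the idea, the sketch below shows a training loop that saves its state after every epoch and resumes from the saved state when it exists. It is a stand-in built on assumptions, not the repository's `train.py`: the state is a JSON file rather than a PyTorch `model.pt`, `train_one_epoch` is a hypothetical placeholder, and the point where a real script would tell dvc about a new checkpoint is only marked with a comment.

```python
import json
import tempfile
from pathlib import Path


def train_one_epoch(state):
    """Hypothetical stand-in for one epoch of training: it only counts epochs."""
    return {"epochs_done": state["epochs_done"] + 1}


def run(checkpoint_path, epochs=2):
    """Train for `epochs` more epochs, resuming from the checkpoint if present."""
    path = Path(checkpoint_path)
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"epochs_done": 0}
    for _ in range(epochs):
        state = train_one_epoch(state)
        # Persist after every epoch so a later run can pick up from here.
        path.write_text(json.dumps(state))
        # A real script would signal the new checkpoint to dvc at this point.
    return state


checkpoint = Path(tempfile.mkdtemp()) / "model.json"
first = run(checkpoint)   # fresh run: starts from zero epochs
second = run(checkpoint)  # resumed run: continues from the saved state
```

With two epochs per run, the first call ends with two completed epochs and the second call continues to four, which mirrors how a resumed experiment appends new checkpoints to the existing ones.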

### Experiment with different parameters

Experiments can be run and compared with different parameters.

In [7]:
%%bash
dvc exp run --params weight_decay=0.1

Stage 'download' didn't change, skipping
Running stage 'train' with command:
	python train.py
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'fe6c4ad'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '9806686'.
Reproduced experiment '9806686'.


In [8]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment  ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace   │ -        │ 0.0911 │ 2.3028 │ 0.001 │ 0.1          │
│ queue       │ 06:18 PM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 9806686 │ 06:19 PM │ 0.0911 │ 2.3028 │ 0.001 │ 0.1          │
│ ├─╨ fe6c4ad │ 06:18 PM │ 0.1009 │ 2.3035 │ 0.001 │ 0.1          │
│ │ ╓ 92c86fa │ 06:18 PM │ 0.1178 │ 2.2949 │ 0.001 │ 0            │
│ ├─╨ 1c8cc5f │ 06:18 PM │ 0.0958 │ 2.3017 │ 0.001 │ 0            │
└─────────────┴──────────┴────────┴────────┴───────┴──────────────┘


Increasing `weight_decay` didn't help, so revert to the original parameters:

In [9]:
%%bash
git checkout params.yaml

Experiments can also be queued in bulk and executed on demand (see the `-j` flag for parallel execution).

In [10]:
%%bash
dvc exp run --params lr=0.01 --queue
dvc exp run --params lr=0.1 --queue

Queued experiment '4e935b0' for future execution.
Queued experiment '772d57a' for future execution.
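
For larger sweeps, typing each queue command by hand gets tedious. The following sketch builds the same `dvc exp run --params ... --queue` commands for a list of values. The helper name `queue_commands` is hypothetical, and the sketch only prints the commands. Actually executing them (for example with `subprocess.run`) assumes dvc is installed and the experiments feature is enabled.

```python
import shlex


def queue_commands(param, values):
    """Build one `dvc exp run ... --queue` argument list per parameter value."""
    return [
        ["dvc", "exp", "run", "--params", f"{param}={value}", "--queue"]
        for value in values
    ]


commands = queue_commands("lr", [0.01, 0.1])
lines = [shlex.join(cmd) for cmd in commands]

# Print the commands for review instead of running them.
for line in lines:
    print(line)
```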


In [11]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment   ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace    │ -        │ 0.0911 │ 2.3028 │ 0.001 │ 0            │
│ queue        │ 06:18 PM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 9806686  │ 06:19 PM │ 0.0911 │ 2.3028 │ 0.001 │ 0.1          │
│ ├─╨ fe6c4ad  │ 06:18 PM │ 0.1009 │ 2.3035 │ 0.001 │ 0.1          │
│ │ ╓ 92c86fa  │ 06:18 PM │ 0.1178 │ 2.2949 │ 0.001 │ 0            │
│ ├─╨ 1c8cc5f  │ 06:18 PM │ 0.0958 │ 2.3017 │ 0.001 │ 0            │
│ ├── *772d57a │ 06:19 PM │      - │      - │ 0.1   │ 0            │
│ └── *4e935b0 │ 06:19 PM │      - │      - │ 0.01  │ 0            │
└──────────────┴──────────┴────────┴────────┴───────┴──────────────┘


In [13]:
%%bash
dvc exp run --run-all

Stage '../../../../tmp/tmpwk9v4vcq/dvc.yaml:download' didn't change, skipping
Running stage '../../../../tmp/tmpwk9v4vcq/dvc.yaml:train' with command:
	python train.py


ERROR: Failed to reproduce experiment '772d57a' - Stage: '../../../../tmp/tmpwk9v4vcq/dvc.yaml:download'
ERROR: Error generating checkpoint, stage: '../../../../tmp/tmpwk9v4vcq/dvc.yaml:train' will be aborted - file path '/home/dave/Code/dvc-exp-mnist' is outside of DVC repo
ERROR: Failed to reproduce experiment '4e935b0' - [Errno 2] No such file or directory: '/tmp/tmpb4dtl678'


### Iteratively train using checkpoints

Use checkpoints to periodically save the model during training (as shown above) and to resume training from a previously saved state. Resume training from the experiment with the best accuracy.

In [15]:
%%bash
dvc exp show --sort-by acc

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment   ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace    │ -        │ 0.0911 │ 2.3028 │ 0.001 │ 0            │
│ queue        │ 06:18 PM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 9806686  │ 06:19 PM │ 0.0911 │ 2.3028 │ 0.001 │ 0.1          │
│ ├─╨ fe6c4ad  │ 06:18 PM │ 0.1009 │ 2.3035 │ 0.001 │ 0.1          │
│ │ ╓ 92c86fa  │ 06:18 PM │ 0.1178 │ 2.2949 │ 0.001 │ 0            │
│ ├─╨ 1c8cc5f  │ 06:18 PM │ 0.0958 │ 2.3017 │ 0.001 │ 0            │
│ ├── *772d57a │ 06:19 PM │      - │      - │ 0.1   │ 0            │
│ └── *4e935b0 │ 06:19 PM │      - │      - │ 0.01  │ 0            │
└──────────────┴──────────┴────────┴────────┴───────┴──────────────┘
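
To make the choice of checkpoint explicit, the short sketch below picks the revision with the highest accuracy. The revisions and accuracies are copied by hand from the table above. The variable names are illustrative, and nothing here calls dvc.

```python
# (revision, accuracy) pairs copied from the `dvc exp show` table above.
checkpoints = [
    ("9806686", 0.0911),
    ("fe6c4ad", 0.1009),
    ("92c86fa", 0.1178),
    ("1c8cc5f", 0.0958),
]

# Pick the revision with the highest accuracy.
best_rev, best_acc = max(checkpoints, key=lambda item: item[1])
print(best_rev, best_acc)  # prints: 92c86fa 0.1178
```

The queued experiments (`772d57a` and `4e935b0`) are left out because they have no metrics yet.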


In [17]:
%%bash
dvc exp res -r 92c86fa

Stage 'download' didn't change, skipping
Running stage 'train' with command:
	python train.py
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '076688e'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '2645236'.
Reproduced experiment '2645236'.


In [18]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment   ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace    │ -        │ 0.1135 │ 2.2774 │ 0.001 │ 0            │
│ queue        │ 06:18 PM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 2645236  │ 06:27 PM │ 0.1135 │ 2.2774 │ 0.001 │ 0            │
│ │ ╟ 076688e  │ 06:26 PM │ 0.1135 │ 2.2871 │ 0.001 │ 0            │
│ │ ╟ 92c86fa  │ 06:18 PM │ 0.1178 │ 2.2949 │ 0.001 │ 0            │
│ ├─╨ 1c8cc5f  │ 06:18 PM │ 0.0958 │ 2.3017 │ 0.001 │ 0            │
│ │ ╓ 9806686  │ 06:19 PM │ 0.0911 │ 2.3028 │ 0.001 │ 0.1          │
│ ├─╨ fe6c4ad  │ 06:18 PM │ 0.1009 │ 2.3035 │ 0.001 │ 0.1          │
│ ├── *772d57a │ 06:19 PM │      - │      - │ 0.1   │ 0            │
│ └── *4e935b0 │ 06:19 PM │      - │      - │ 0.01  │ 0            │
└──────────────┴──────────┴────────┴────────┴───────┴──────────────┘


### Persist models

Additional epochs didn't improve accuracy, so commit the model iteration with peak accuracy. Check out the experiment revision in dvc and then commit it to git.

In [20]:
%%bash
dvc exp checkout 92c86fa
cat metrics.yaml

acc: 0.1135
loss: 2.2773613929748535


ERROR: Experiment derived from '1c8cc5f', expected '567b12c'.
