This notebook shows how to use [dvc](https://dvc.org/) [experiments](https://github.com/iterative/dvc/wiki/Experiments) in model development. The example uses the [MNIST](http://yann.lecun.com/exdb/mnist/) data of handwritten digits and builds a classification model to predict the digit (0-9) in each image. The model is built in [pytorch](https://pytorch.org/) as a convolutional neural network with a simplified architecture, so it should run quickly on most computers.

### Getting started

To get started, clone this repository and navigate to it.

The only other prerequisite is [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/). Once conda is installed, create a virtual environment from the existing `environment.yml` file and activate it:

```bash
conda env create -f environment.yml
conda activate dvc
```

If you want to run this notebook directly, do so after activating the conda environment.

### Establishing the pipeline DAG

Before experimenting, a dvc pipeline must be established (see the docs if you are new to dvc). Review the contents of `dvc.yaml` below to see the pipeline.

In [1]:
%%bash
cat dvc.yaml

stages:
  download:
    cmd: python download.py
    deps:
    - download.py
    outs:
    - data/MNIST
  train:
    cmd: python train.py
    deps:
    - data/MNIST
    - train.py
    params:
    - lr
    - weight_decay
    outs:
    - model.pt:
        checkpoint: true
    metrics:
    - metrics.yaml


The download stage gets the data using the `download.py` script. The train stage performs model training and evaluation on the downloaded data using the `train.py` script. The train stage uses the `lr` and `weight_decay` parameters defined in `params.yaml`. The model output is saved to `model.pt`, and the metrics are saved to `metrics.yaml`.
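
To make the dependency structure concrete, here is a minimal Python sketch (not part of the repository) that copies the `deps` and `outs` of the two stages into a plain dictionary and derives an order in which they could run. The names `STAGES` and `execution_order` are hypothetical and exist only for illustration; dvc builds its own graph internally and may do so differently. The sketch assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hand-copied from dvc.yaml; params and metrics are omitted because
# only deps and outs determine the ordering in this sketch.
STAGES = {
    "download": {"deps": ["download.py"], "outs": ["data/MNIST"]},
    "train": {"deps": ["data/MNIST", "train.py"], "outs": ["model.pt"]},
}


def execution_order(stages):
    """Return stage names ordered so that producers run before consumers."""
    # Map each output path to the stage that produces it.
    producers = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on every stage that produces one of its deps.
    # Deps with no producer (plain source files) add no edges.
    graph = {
        name: {producers[dep] for dep in spec["deps"] if dep in producers}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))  # ['download', 'train']
```

Because `train` lists `data/MNIST` as a dependency and `download` lists it as an output, `download` has to run first.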

Execute the download stage to get the data.

In [2]:
%%bash
dvc repro download

Running stage 'download' with command:
	python download.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw
Processing...
Done!
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.



**IMPORTANT:** Be sure to run the `git add` command above, followed by `git commit`, before running experiments. Any time you modify the pipeline, run `dvc repro` and commit the changes with git before running experiments.

In [3]:
%%bash
git add dvc.lock data/.gitignore
git commit -m "download data"

[queue ca7b766] download data
 1 file changed, 1 insertion(+), 1 deletion(-)


### Run an experiment

Run an experiment with the default parameters defined in `params.yaml`.

In [4]:
%%bash
dvc exp run

Stage 'download' didn't change, skipping
Running stage 'train' with command:
	python train.py
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'b5b98cd'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'e2ffa86'.
Reproduced experiment 'e2ffa86'.


Review the output of the run, including identifying hashes, metrics, and parameters:

In [5]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment  ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace   │ -        │ 0.4525 │ 2.1344 │ 0.001 │ 0            │
│ queue       │ 05:50 PM │      - │      - │ 0.001 │ 0            │
│ │ ╓ e2ffa86 │ 05:50 PM │ 0.4525 │ 2.1344 │ 0.001 │ 0            │
│ ├─╨ b5b98cd │ 05:50 PM │ 0.3122 │ 2.2351 │ 0.001 │ 0            │
└─────────────┴──────────┴────────┴────────┴───────┴──────────────┘


Note that two rows appear for the run. These are checkpoints of a single run rather than two separate experiments. Checkpoints are optional, but they can be helpful for models that train for many epochs. See the Checkpoints section below for more information about how they work.

### Experimenting with different parameters

Experiments can be run and compared with different parameters.

In [6]:
%%bash
dvc exp run --params weight_decay=0.1

Stage 'download' didn't change, skipping
Running stage 'train' with command:
	python train.py
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '19ebc1a'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '0344357'.
Reproduced experiment '0344357'.


In [5]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment  ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace   │ -        │  0.098 │ 2.3037 │ 0.001 │ 0.1          │
│ exp         │ 04:05 PM │      - │      - │ 0.001 │ 0            │
│ │ ╓ dcae795 │ 04:06 PM │  0.098 │ 2.3037 │ 0.001 │ 0.1          │
│ ├─╨ bbe9fde │ 04:06 PM │  0.098 │ 2.3042 │ 0.001 │ 0.1          │
│ │ ╓ b8d3628 │ 04:05 PM │ 0.1998 │ 2.2452 │ 0.001 │ 0            │
│ ├─╨ 656cb11 │ 04:05 PM │ 0.1216 │ 2.2836 │ 0.001 │ 0            │
└─────────────┴──────────┴────────┴────────┴───────┴──────────────┘


Increasing `weight_decay` didn't help, so revert to the original parameters:

In [7]:
%%bash
git checkout params.yaml

Experiments can also be added in bulk to the queue and executed on demand (see the `-j` flag for parallel execution!).

In [8]:
%%bash
dvc exp run --params lr=0.01 --queue
dvc exp run --params lr=0.1 --queue

Queued experiment '408a8a1' for future execution.
Queued experiment '9d44f85' for future execution.
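
When there are many values to try, typing each command by hand gets tedious. The following Python sketch builds one queue command per value, using only the `--params` and `--queue` flags shown above. The helper name `queue_commands` is hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex


def queue_commands(param, values):
    """Build one `dvc exp run --params <param>=<value> --queue` command per value."""
    return [
        ["dvc", "exp", "run", "--params", f"{param}={value}", "--queue"]
        for value in values
    ]


# The same two learning rates that were queued by hand above.
for cmd in queue_commands("lr", [0.01, 0.1]):
    # Print only; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments avoids shell quoting problems if the commands are later executed with `subprocess`.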


In [9]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment   ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace    │ -        │ 0.1734 │ 2.2749 │ 0.001 │ 0            │
│ queue        │ 05:50 PM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 0344357  │ 05:51 PM │ 0.1734 │ 2.2749 │ 0.001 │ 0.1          │
│ ├─╨ 19ebc1a  │ 05:51 PM │ 0.1458 │ 2.2951 │ 0.001 │ 0.1          │
│ │ ╓ e2ffa86  │ 05:50 PM │ 0.4525 │ 2.1344 │ 0.001 │ 0            │
│ ├─╨ b5b98cd  │ 05:50 PM │ 0.3122 │ 2.2351 │ 0.001 │ 0            │
│ ├── *9d44f85 │ 05:52 PM │      - │      - │ 0.1   │ 0            │
│ └── *408a8a1 │ 05:52 PM │      - │      - │ 0.01  │ 0            │
└──────────────┴──────────┴────────┴────────┴───────┴──────────────┘


In [10]:
%%bash
dvc exp run --run-all

ERROR: Failed to reproduce experiment '9d44f85' - Stage: '../../../../tmp/tmpg52_n3qc/dvc.yaml:download'
--- Logging error ---
Traceback (most recent call last):
  File "/home/dave/.conda/envs/dvc/lib/python3.8/site-packages/dvc/logger.py", line 134, in emit
    msg = self.format(record)
  File "/home/dave/.conda/envs/dvc/lib/python3.8/logging/__init__.py", line 925, in format
    return fmt.format(record)
  File "/home/dave/.conda/envs/dvc/lib/python3.8/site-packages/dvc/logger.py", line 94, in format
    cause = ": ".join(_iter_causes(record.exc_info[1]))
  File "/home/dave/.conda/envs/dvc/lib/python3.8/site-packages/dvc/logger.py", line 155, in _iter_causes
    yield str(exc)
  File "/home/dave/.conda/envs/dvc/lib/python3.8/site-packages/dvc/stage/__init__.py", line 166, in __repr__
    return f"Stage: '{self.addressing}'"
  File "/home/dave/.conda/envs/dvc/lib/python3.8/site-packages/dvc/stage/__init__.py", line 635, in addressing
    if self.path and self.relpath == PIPELINE_FILE:

CalledProcessError: Command 'b'dvc exp run --run-all\n'' returned non-zero exit status 1.

### Checkpoints

Use checkpoints to periodically save the model during training (as shown above) and to resume training from a previously saved state.

**NOTE:** Using `dvc exp checkout` does not cause a resumed run to continue from that experiment's checkpoints. Instead, the latest experiment seems to be used.
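
The save-and-resume idea itself can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the contents of `train.py`: the state is a small JSON file instead of a pytorch model, the "training" is a stand-in increment, and the names `train` and `model.json` are hypothetical. How dvc is told that a checkpoint was written is not shown here.

```python
import json
import os
import tempfile


def train(checkpoint_path, epochs):
    """Run `epochs` more epochs, resuming from `checkpoint_path` if it exists."""
    # Resume from saved state when a checkpoint is present, else start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"epoch": 0, "weights": 0.0}

    for _ in range(epochs):
        state["weights"] += 1.0  # stand-in for one epoch of real training
        state["epoch"] += 1
        # Save after every epoch so an interrupted run loses at most one epoch.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")  # hypothetical stand-in for model.pt
    first = train(path, epochs=2)
    second = train(path, epochs=2)  # picks up where the first call stopped
    print(first["epoch"], second["epoch"])  # 2 4
```

The second call continues from epoch 2 instead of starting over, which is the behavior a resumed checkpoint run is meant to provide.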

In [11]:
%%bash
dvc exp checkout e2ffa86

Changes for experiment 'e2ffa86' have been applied to your current workspace.


In [13]:
%%bash
dvc exp res

Stage 'download' didn't change, skipping
Running stage 'train' with command:
	python train.py
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '6f0e255'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'c365431'.
Reproduced experiment 'c365431'.


In [15]:
%%bash
dvc exp show --sort-by acc

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment   ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace    │ -        │ 0.2381 │ 2.2053 │ 0.001 │ 0.1          │
│ queue        │ 05:50 PM │      - │      - │ 0.001 │ 0            │
│ │ ╓ c365431  │ 05:53 PM │ 0.2381 │ 2.2053 │ 0.001 │ 0.1          │
│ │ ╟ 6f0e255  │ 05:53 PM │ 0.1737 │ 2.2443 │ 0.001 │ 0.1          │
│ │ ╟ 0344357  │ 05:51 PM │ 0.1734 │ 2.2749 │ 0.001 │ 0.1          │
│ ├─╨ 19ebc1a  │ 05:51 PM │ 0.1458 │ 2.2951 │ 0.001 │ 0.1          │
│ │ ╓ e2ffa86  │ 05:50 PM │ 0.4525 │ 2.1344 │ 0.001 │ 0            │
│ ├─╨ b5b98cd  │ 05:50 PM │ 0.3122 │ 2.2351 │ 0.001 │ 0            │
│ ├── *9d44f85 │ 05:52 PM │      - │      - │ 0.1   │ 0            │
│ └── *408a8a1 │ 05:52 PM │      - │      - │ 0.01  │ 0            │
└──────────────┴──────────┴────────┴────────┴───────┴──────────────┘
