This notebook shows how to use [dvc](https://dvc.org/) [experiments](https://github.com/iterative/dvc/wiki/Experiments) in model development. This example uses the [MNIST](http://yann.lecun.com/exdb/mnist/) data of handwritten digits and builds a classification model to predict the digit (0-9) in each image. The model is built in [pytorch](https://pytorch.org/) as a convolutional neural network with a simplified architecture, so it should run quickly on most computers.

### Getting started

To get started, clone this repository and navigate to it.

The only other prerequisite is [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/). Once conda is installed, create a virtual environment from the existing `environment.yml` file and activate it:

```bash
conda env create -f environment.yml
conda activate dvc
```

If you want to run this notebook directly, do so after activating the conda environment.

Finally, initialize dvc and enable the experiments feature:

In [1]:
%%bash
dvc init
dvc config --global core.experiments true


You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


### Establishing the pipeline DAG

Before experimenting, a dvc pipeline must be established (see the docs if you are new to dvc). Review the contents of `dvc.yaml` below to see the pipeline.

In [2]:
%%bash
cat dvc.yaml

stages:
  download:
    cmd: python download.py
    deps:
    - download.py
    outs:
    - data/MNIST
  train:
    cmd: python train.py --model_path=model.pt --metrics_path=metrics.yaml
    deps:
    - data/MNIST
    - train.py
    params:
    - lr
    - weight_decay
    outs:
    - model.pt
    metrics:
    - metrics.yaml
  train_checkpoint:
    cmd: python train.py --model_path=model_checkpoint.pt --metrics_path=metrics_checkpoint.yaml --checkpoint=5
    deps:
    - data/MNIST
    - train.py
    params:
    - lr
    - weight_decay
    outs:
    - model_checkpoint.pt:
        checkpoint: true
    metrics:
    - metrics_checkpoint.yaml


The download stage gets the data using the `download.py` script. The train stage performs model training and evaluation on the downloaded data using the `train.py` script. The train stage uses the `lr` and `weight_decay` parameters defined in `params.yaml`. The model output is saved to `model.pt`, and the metrics are saved to `metrics.yaml`. The train_checkpoint stage is similar, but it marks its model output as a checkpoint so that the model is saved periodically during training.
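
The order of the stages follows from the `deps` and `outs` entries: `train` lists `data/MNIST` as a dependency and `download` produces it, so `download` has to run first. The following is a minimal sketch of that idea in plain Python, not dvc's actual implementation. The `STAGES` dictionary and the `stage_order` helper are hypothetical names introduced for illustration, only two stages are copied by hand from `dvc.yaml`, and Python 3.9+ is assumed for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hand-copied subset of dvc.yaml (hypothetical in-memory form, not read from disk).
STAGES = {
    "download": {"deps": ["download.py"], "outs": ["data/MNIST"]},
    "train": {"deps": ["data/MNIST", "train.py"], "outs": ["model.pt"]},
}


def stage_order(stages):
    """Return stage names so that each stage follows the stages producing its deps."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    sorter = TopologicalSorter()
    for name, spec in stages.items():
        # A stage is downstream of another only if one of its deps is that stage's output.
        upstream = {producer[dep] for dep in spec["deps"] if dep in producer}
        sorter.add(name, *upstream)
    return list(sorter.static_order())


print(stage_order(STAGES))
```

With the two stages above, the sketch prints `['download', 'train']`, which matches the order in which the stages are run when the train stage is reproduced.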

Execute the pipeline to reproduce the train stage:

In [3]:
%%bash
dvc repro train

Running stage 'download' with command:
	python download.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw
Processing...
Done!
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

Running stage 'train' with command:
	python train.py --model_path=model.pt --metrics_path=metrics.yaml
Updating 


Stage the new `dvc.lock` file with `git add` and commit it with `git commit` before running experiments. Any time you modify the pipeline, be sure to run `dvc repro` and track the changes with git before running experiments.

In [4]:
%%bash
git add dvc.lock
git commit -m "run train stage"

[1.10.2 c031ef3] run train stage
 1 file changed, 32 insertions(+)
 create mode 100644 dvc.lock


### Run an experiment

Run an experiment with the default parameters defined in `params.yaml`.

In [5]:
%%bash
dvc exp run train

Stage 'download' didn't change, skipping
Stage 'train' didn't change, skipping
Reproduced experiment 'c031ef3'.


Since the pipeline was already reproduced with these parameters, this experiment didn't actually execute. Review the output of the run, including identifying hashes, metrics, and parameters:

In [6]:
%%bash
dvc exp show

┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace  │ -        │ 0.1901 │ 2.1917 │ 0.001 │ 0            │
│ 1.10.2     │ 03:14 PM │ 0.1901 │ 2.1917 │ 0.001 │ 0            │
└────────────┴──────────┴────────┴────────┴───────┴──────────────┘
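
One way to picture why the stage was skipped: if neither the dependencies nor the parameters have changed since the last recorded run, there is nothing new to compute. The sketch below fingerprints a stage's inputs and compares the result with a recorded value. It is a simplified illustration and makes no claim about how dvc computes or stores its hashes; `stage_fingerprint` and the placeholder file contents are hypothetical.

```python
import hashlib
import json


def stage_fingerprint(dep_contents, params):
    """Hash dependency contents and parameter values into one hex digest."""
    digest = hashlib.sha256()
    # Sort by path so the result does not depend on dictionary order.
    for path in sorted(dep_contents):
        digest.update(path.encode())
        digest.update(dep_contents[path])
    # Serialize the parameters deterministically before hashing.
    digest.update(json.dumps(params, sort_keys=True).encode())
    return digest.hexdigest()


# Placeholder contents; a real check would read the files from disk.
deps = {"train.py": b"print('training')", "data/MNIST": b"<dataset bytes>"}

recorded = stage_fingerprint(deps, {"lr": 0.001, "weight_decay": 0})
unchanged = stage_fingerprint(deps, {"lr": 0.001, "weight_decay": 0})
modified = stage_fingerprint(deps, {"lr": 0.001, "weight_decay": 0.1})

print("skip" if unchanged == recorded else "run")  # same inputs as the recorded run
print("skip" if modified == recorded else "run")   # weight_decay changed
```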


### Experimenting with different parameters

Experiments can be run and compared with different parameters.

In [7]:
%%bash
dvc exp run train --params weight_decay=0.1

Stage 'download' didn't change, skipping
Running stage 'train' with command:
	python train.py --model_path=model.pt --metrics_path=metrics.yaml
Updating lock file 'dvc.lock'
Reproduced experiment 'd920437'.




In [8]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment  ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace   │ -        │ 0.0982 │ 2.2982 │ 0.001 │ 0.1          │
│ 1.10.2      │ 03:14 PM │ 0.1901 │ 2.1917 │ 0.001 │ 0            │
│ └── d920437 │ 03:15 PM │ 0.0982 │ 2.2982 │ 0.001 │ 0.1          │
└─────────────┴──────────┴────────┴────────┴───────┴──────────────┘
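
The two rows can also be compared numerically. The snippet below is a small sketch that copies the metric values from the table above by hand and prints the change in each metric; `metric_deltas` is a hypothetical helper, not part of dvc.

```python
# Metric values copied from the `dvc exp show` table above.
baseline = {"acc": 0.1901, "loss": 2.1917}    # 1.10.2, weight_decay=0
experiment = {"acc": 0.0982, "loss": 2.2982}  # d920437, weight_decay=0.1


def metric_deltas(old, new, ndigits=4):
    """Return new - old for every metric present in both dictionaries."""
    return {name: round(new[name] - old[name], ndigits) for name in old if name in new}


deltas = metric_deltas(baseline, experiment)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

Accuracy drops by 0.0919 and loss rises by 0.1065, so the experiment is worse than the baseline on both metrics.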


Increasing `weight_decay` didn't help, so revert to the original parameters:

In [9]:
%%bash
git checkout params.yaml

Experiments can also be added to the queue in bulk and executed on demand (see the `-j` flag for parallel execution).

In [10]:
%%bash
dvc exp run train --params lr=0.01 --queue
dvc exp run train --params lr=0.1 --queue

Queued experiment 'b87e095' for future execution.
Queued experiment 'bf88987' for future execution.
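
When sweeping more than a couple of values, the queue commands can be generated rather than typed by hand. The sketch below builds the argument lists for the same `dvc exp run train --params ... --queue` invocation used above and only prints them. `queue_commands` is a hypothetical helper; to actually queue the experiments, each list could be passed to `subprocess.run`.

```python
import shlex


def queue_commands(stage, param, values):
    """Build one `dvc exp run ... --queue` argument list per parameter value."""
    return [
        ["dvc", "exp", "run", stage, "--params", f"{param}={value}", "--queue"]
        for value in values
    ]


commands = queue_commands("train", "lr", [0.01, 0.1])
for argv in commands:
    # shlex.join renders the list as a copy-pasteable shell command.
    print(shlex.join(argv))
```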


In [11]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment   ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace    │ -        │ 0.0982 │ 2.2982 │ 0.001 │ 0            │
│ 1.10.2       │ 03:14 PM │ 0.1901 │ 2.1917 │ 0.001 │ 0            │
│ ├── d920437  │ 03:15 PM │ 0.0982 │ 2.2982 │ 0.001 │ 0.1          │
│ ├── *bf88987 │ 03:15 PM │      - │      - │ 0.1   │ 0            │
│ └── *b87e095 │ 03:15 PM │      - │      - │ 0.01  │ 0            │
└──────────────┴──────────┴────────┴────────┴───────┴──────────────┘


In [12]:
%%bash
dvc exp run train --run-all

Stage '../../../../tmp/tmp8rp8mrkt/dvc.yaml:download' didn't change, skipping
Running stage '../../../../tmp/tmp8rp8mrkt/dvc.yaml:train' with command:
	python train.py --model_path=model.pt --metrics_path=metrics.yaml


ERROR: Failed to reproduce experiment 'b87e095' - Stage: '../../../../tmp/tmp8rp8mrkt/dvc.yaml:train'
ERROR: Failed to reproduce experiment 'bf88987' - failed to reproduce '../../../../tmp/tmp8rp8mrkt/dvc.yaml': file path '/home/dave/Code/dvc-exp-mnist' is outside of DVC repo


### Persist models

Find the training experiment with the best accuracy and commit it.

In [14]:
%%bash
dvc exp show --sort-by acc

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment   ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace    │ -        │ 0.0982 │ 2.2982 │ 0.001 │ 0            │
│ 1.10.2       │ 03:14 PM │ 0.1901 │ 2.1917 │ 0.001 │ 0            │
│ ├── d920437  │ 03:15 PM │ 0.0982 │ 2.2982 │ 0.001 │ 0.1          │
│ ├── *bf88987 │ 03:15 PM │      - │      - │ 0.1   │ 0            │
│ └── *b87e095 │ 03:15 PM │      - │      - │ 0.01  │ 0            │
└──────────────┴──────────┴────────┴────────┴───────┴──────────────┘


In [16]:
%%bash
dvc exp checkout d920437

Changes for experiment 'd920437' have been applied to your current workspace.




In [17]:
%%bash
git add dvc.lock params.yaml
git commit -m "hyperparameter tuning"

[1.10.2 b432fac] hyperparameter tuning
 2 files changed, 5 insertions(+), 5 deletions(-)


Other experiments are now hidden by default, but they can still be shown and retrieved as needed. See the documentation or help commands for more info.

In [18]:
%%bash
dvc exp show

┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace  │ -        │ 0.0982 │ 2.2982 │ 0.001 │ 0.1          │
│ 1.10.2     │ 03:19 PM │ 0.0982 │ 2.2982 │ 0.001 │ 0.1          │
└────────────┴──────────┴────────┴────────┴───────┴──────────────┘


### Iteratively train using checkpoints

Use checkpoints to periodically save the model during training, and to resume training from a previously saved state.

In [19]:
%%bash
dvc exp run train_checkpoint

Stage 'download' didn't change, skipping
Running stage 'train_checkpoint' with command:
	python train.py --model_path=model_checkpoint.pt --metrics_path=metrics_checkpoint.yaml --checkpoint=5
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '6c7c3a3'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '28d417a'.
Reproduced experiment '28d417a'.
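
The general idea of saving periodically and resuming from a saved state can be sketched in plain Python, with no pytorch or dvc calls. The training state here is a small dictionary stored as JSON. The `train` function, the `checkpoint_every` argument, and the file name are hypothetical and do not reflect how `train.py` or its `--checkpoint` flag actually work.

```python
import json
import os
import tempfile


def train(state_path, epochs, checkpoint_every):
    """Run `epochs` more epochs, resuming from `state_path` if it exists."""
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)  # resume from the previously saved state
    else:
        state = {"epoch": 0, "weights": 0.0}  # fresh start

    target = state["epoch"] + epochs
    while state["epoch"] < target:
        state["weights"] += 0.5  # stand-in for one epoch of real training
        state["epoch"] += 1
        # Save periodically, and always after the final epoch of this call.
        if state["epoch"] % checkpoint_every == 0 or state["epoch"] == target:
            with open(state_path, "w") as f:
                json.dump(state, f)
    return state


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_state.json")
    first = train(path, epochs=2, checkpoint_every=1)
    resumed = train(path, epochs=2, checkpoint_every=1)  # continues at epoch 2
    print(first["epoch"], resumed["epoch"])
```

The second call picks up at epoch 2 rather than starting over, which is the behavior the checkpoint stage relies on when training is continued later.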


In [20]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━┓
┃ Experiment  ┃ Created  ┃    acc ┃   loss ┃ lr     ┃ weight_decay ┃       ┃     ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━┩
│ workspace   │ -        │ 0.0982 │ 2.2982 │ 0.1147 │ 2.3029       │ 0.001 │ 0.1 │
│ 1.10.2      │ 03:19 PM │ 0.0982 │ 2.2982 │ 0.001  │ 0.1          │       │     │
│ │ ╓ 28d417a │ 03:22 PM │ 0.0982 │ 2.2982 │ 0.1147 │ 2.3029       │ 0.001 │ 0.1 │
│ ├─╨ 6c7c3a3 │ 03:22 PM │ 0.0982 │ 2.2982 │ 0.1216 │ 2.3033       │ 0.001 │ 0.1 │
└─────────────┴──────────┴────────┴────────┴────────┴──────────────┴───────┴─────┘


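Checkpoints are grouped together when showing the experiments. Rows produced by the train_checkpoint stage appear to include the metrics from both `metrics.yaml` and `metrics_checkpoint.yaml`, which is why the table has two unlabeled columns and its headers do not line up with every row. Run a couple more epochs to see if accuracy increases.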

In [21]:
%%bash
dvc exp res train_checkpoint

Stage 'download' didn't change, skipping
Running stage 'train_checkpoint' with command:
	python train.py --model_path=model_checkpoint.pt --metrics_path=metrics_checkpoint.yaml --checkpoint=5
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '0e72239'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '1b221c6'.
Reproduced experiment '1b221c6'.


In [22]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━┓
┃ Experiment  ┃ Created  ┃    acc ┃   loss ┃ lr     ┃ weight_decay ┃       ┃     ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━┩
│ workspace   │ -        │ 0.1135 │ 2.3024 │ 0.0982 │ 2.2982       │ 0.001 │ 0.1 │
│ 1.10.2      │ 03:19 PM │ 0.0982 │ 2.2982 │ 0.001  │ 0.1          │       │     │
│ │ ╓ 1b221c6 │ 03:23 PM │ 0.1135 │ 2.3024 │ 0.0982 │ 2.2982       │ 0.001 │ 0.1 │
│ │ ╟ 0e72239 │ 03:23 PM │ 0.0982 │ 2.2982 │ 0.1135 │ 2.3026       │ 0.001 │ 0.1 │
│ │ ╟ 28d417a │ 03:22 PM │ 0.1147 │ 2.3029 │ 0.0982 │ 2.2982       │ 0.001 │ 0.1 │
│ ├─╨ 6c7c3a3 │ 03:22 PM │ 0.0982 │ 2.2982 │ 0.1216 │ 2.3033       │ 0.001 │ 0.1 │
└─────────────┴──────────┴────────┴────────┴────────┴──────────────┴───────┴─────┘
