This notebook shows how to use [dvc](https://dvc.org/) [experiments](https://github.com/iterative/dvc/wiki/Experiments) in model development. This example uses the [MNIST](http://yann.lecun.com/exdb/mnist/) data of handwritten digits and builds a classification model to predict the digit (0-9) in each image. The model is built in [pytorch](https://pytorch.org/) as a convolutional neural network with a simplified architecture, which should be able to quickly run on most computers.

### Get started

To get started, clone this repository and navigate to it.

The only other prerequisite is [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/). Once conda is installed, create a virtual environment from the existing `environment.yml` file and activate it:

```bash
conda env create -f environment.yml
conda activate dvc
```

If you want to run this notebook directly, do so after activating the conda environment.

Finally, initialize dvc and enable the experiments feature:

In [1]:
%%bash
dvc init
dvc config --global core.experiments true


You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


### Establish the pipeline DAG

Before experimenting, a dvc pipeline must be established (see the [docs](https://dvc.org/doc) if you are new to dvc). Review the contents of `dvc.yaml` below to see the pipeline.

In [2]:
%%bash
cat dvc.yaml

stages:
  download:
    cmd: python download.py
    deps:
    - download.py
    outs:
    - data/MNIST
  train:
    cmd: python train.py
    deps:
    - data/MNIST
    - train.py
    params:
    - lr
    - weight_decay
    outs:
    - model.pt:
        checkpoint: true
    metrics:
    - metrics.yaml


The download stage gets the data using the `download.py` script. The train stage performs model training and evaluation on the downloaded data using the `train.py` script. The train stage uses the `lr` and `weight_decay` parameters defined in `params.yaml`. The model output is saved to `model.pt`, and the metrics are saved to `metrics.yaml`.
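The stage ordering follows from the dependency graph: `train` depends on `data/MNIST`, which `download` produces. The sketch below illustrates that idea with the Python standard library. It is not how dvc resolves the DAG internally; the `stages` dictionary is a hand-copied subset of `dvc.yaml`, the `stage_order` helper is hypothetical, and metrics are treated as ordinary outputs for simplicity.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hand-copied subset of dvc.yaml (deps and outputs only); an assumption
# made for illustration, not something dvc generates.
stages = {
    "download": {"deps": ["download.py"], "outs": ["data/MNIST"]},
    "train": {
        "deps": ["data/MNIST", "train.py"],
        "outs": ["model.pt", "metrics.yaml"],
    },
}


def stage_order(stages):
    """Return stage names so that producers come before their consumers."""
    # Map each output path to the stage that produces it.
    producer = {
        out: name for name, spec in stages.items() for out in spec["outs"]
    }
    # A stage depends on every stage that produces one of its deps.
    # Plain source files such as train.py have no producer and are skipped.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(stages))  # ['download', 'train']
```

Because `download` has no upstream stage, it can be reproduced on its own, which is what the next step does.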

Execute the download stage to get the data.

In [3]:
%%bash
dvc repro download

Running stage 'download' with command:
	python download.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw
Processing...
Done!
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/.gitignore dvc.lock
Use `dvc push` to send your updates to re

**IMPORTANT:** Run the `git add` command above, followed by `git commit`, before running experiments. Whenever you modify the pipeline, run `dvc repro` and commit the changes to git before running experiments.

In [4]:
%%bash
git add dvc.lock data/.gitignore
git commit -m "download data"

[dev f82a506] download data
 1 file changed, 1 insertion(+), 1 deletion(-)


### Run an experiment

Run an experiment with the default parameters defined in `params.yaml`.

In [5]:
%%bash
dvc exp run

Stage 'download' didn't change, skipping
Running stage 'train' with command:
	python train.py
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'b884c0c'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '56555fb'.
Reproduced experiment '56555fb'.


Review the output of the run, including identifying hashes, metrics, and parameters:

In [6]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment  ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace   │ -        │ 0.3842 │ 2.1689 │ 0.001 │ 0            │
│ dev         │ 10:46 AM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 56555fb │ 10:47 AM │ 0.3842 │ 2.1689 │ 0.001 │ 0            │
│ ├─╨ b884c0c │ 10:46 AM │ 0.2664 │ 2.2524 │ 0.001 │ 0            │
└─────────────┴──────────┴────────┴────────┴───────┴──────────────┘


Note that the run produced two entries. These are checkpoints of a single experiment, not separate experiments. Checkpoints are optional, but they are helpful for models that train over many epochs. See the section on iterative training below for more information about how checkpoints work.
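The general checkpoint pattern is: load the saved state if it exists, train for a few epochs, and save the state after each epoch. The sketch below shows only that pattern. It is not the `train.py` from this repository: the `train` function, the JSON state file, and the epoch counter are stand-ins for a real model and for `model.pt`.

```python
import json
import os
import tempfile


def train(state_path, epochs):
    """Run `epochs` more epochs, resuming from state_path if it exists."""
    # Resume from the last checkpoint when one is present.
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"epoch": 0}

    for _ in range(epochs):
        state["epoch"] += 1  # placeholder for one real epoch of training
        # Write a checkpoint after every epoch so a later run can resume.
        with open(state_path, "w") as f:
            json.dump(state, f)
    return state


# Hypothetical state file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "state.json")
first = train(path, epochs=2)   # fresh run: starts at epoch 0
second = train(path, epochs=2)  # resumed run: continues from epoch 2
print(first["epoch"], second["epoch"])  # 2 4
```

Each saved state plays the role of one checkpoint row in the table above, and the second call mirrors resuming an experiment from its last checkpoint.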

### Experiment with different parameters

Experiments can be run and compared with different parameters.

In [7]:
%%bash
dvc exp run --params weight_decay=0.1

Stage 'download' didn't change, skipping
Running stage 'train' with command:
	python train.py
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '5152381'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '31d5005'.
Reproduced experiment '31d5005'.


In [8]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment  ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace   │ -        │ 0.1585 │ 2.2906 │ 0.001 │ 0.1          │
│ dev         │ 10:46 AM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 31d5005 │ 10:47 AM │ 0.1585 │ 2.2906 │ 0.001 │ 0.1          │
│ ├─╨ 5152381 │ 10:47 AM │ 0.1316 │ 2.3009 │ 0.001 │ 0.1          │
│ │ ╓ 56555fb │ 10:47 AM │ 0.3842 │ 2.1689 │ 0.001 │ 0            │
│ ├─╨ b884c0c │ 10:46 AM │ 0.2664 │ 2.2524 │ 0.001 │ 0            │
└─────────────┴──────────┴────────┴────────┴───────┴──────────────┘


Increasing `weight_decay` didn't help, so reset the parameters.

In [11]:
%%bash
git checkout params.yaml
cat params.yaml

lr: 0.001
weight_decay: 0


Next, try different `lr` parameters. Experiments can be added in bulk to the queue and executed on demand (see the `-j` flag for parallel execution!).

In [12]:
%%bash
dvc exp run --params lr=0.01 --queue
dvc exp run --params lr=0.1 --queue

Queued experiment '05c0223' for future execution.
Queued experiment 'ea532e1' for future execution.
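
For more than a handful of values, the queue commands can be generated rather than typed. The sketch below builds the same `dvc exp run --params ... --queue` command lines used above. The `queue_commands` helper is hypothetical, and the block only prints the commands; the commented-out `subprocess.run` line is one way to execute them.

```python
import shlex


def queue_commands(param, values):
    """Build one `dvc exp run --params ... --queue` command per value."""
    return [
        ["dvc", "exp", "run", "--params", f"{param}={value}", "--queue"]
        for value in values
    ]


commands = queue_commands("lr", [0.01, 0.1])
for cmd in commands:
    print(shlex.join(cmd))
    # To actually queue the experiment, one could run:
    # subprocess.run(cmd, check=True)
```

Keeping each command as a list of arguments avoids shell quoting problems if a parameter value ever contains spaces or special characters.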


In [13]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment   ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace    │ -        │ 0.1585 │ 2.2906 │ 0.001 │ 0            │
│ dev          │ 10:46 AM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 31d5005  │ 10:47 AM │ 0.1585 │ 2.2906 │ 0.001 │ 0.1          │
│ ├─╨ 5152381  │ 10:47 AM │ 0.1316 │ 2.3009 │ 0.001 │ 0.1          │
│ │ ╓ 56555fb  │ 10:47 AM │ 0.3842 │ 2.1689 │ 0.001 │ 0            │
│ ├─╨ b884c0c  │ 10:46 AM │ 0.2664 │ 2.2524 │ 0.001 │ 0            │
│ ├── *ea532e1 │ 10:55 AM │      - │      - │ 0.1   │ 0            │
│ └── *05c0223 │ 10:55 AM │      - │      - │ 0.01  │ 0            │
└──────────────┴──────────┴────────┴────────┴───────┴──────────────┘


In [14]:
%%bash
dvc exp run --run-all

Stage '../../../../tmp/tmpra373gup/dvc.yaml:download' didn't change, skipping
Running stage '../../../../tmp/tmpra373gup/dvc.yaml:train' with command:
	python train.py


ERROR: Failed to reproduce experiment 'ea532e1' - Stage: '../../../../tmp/tmpra373gup/dvc.yaml:download'
ERROR: Error generating checkpoint, stage: '../../../../tmp/tmpra373gup/dvc.yaml:train' will be aborted - file path '/home/dave/Code/dvc-exp-mnist' is outside of DVC repo
ERROR: Failed to reproduce experiment '05c0223' - [Errno 2] No such file or directory: '/tmp/tmpi7bvuzuz'


### Iteratively train using checkpoints

Use checkpoints to periodically save the model during training (as shown above) and to resume training from a previously saved state. Resume training the experiment with the best accuracy.

In [15]:
%%bash
dvc exp show --sort-by acc

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment   ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace    │ -        │ 0.1585 │ 2.2906 │ 0.001 │ 0            │
│ dev          │ 10:46 AM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 31d5005  │ 10:47 AM │ 0.1585 │ 2.2906 │ 0.001 │ 0.1          │
│ ├─╨ 5152381  │ 10:47 AM │ 0.1316 │ 2.3009 │ 0.001 │ 0.1          │
│ │ ╓ 56555fb  │ 10:47 AM │ 0.3842 │ 2.1689 │ 0.001 │ 0            │
│ ├─╨ b884c0c  │ 10:46 AM │ 0.2664 │ 2.2524 │ 0.001 │ 0            │
│ ├── *ea532e1 │ 10:55 AM │      - │      - │ 0.1   │ 0            │
│ └── *05c0223 │ 10:55 AM │      - │      - │ 0.01  │ 0            │
└──────────────┴──────────┴────────┴────────┴───────┴──────────────┘
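
Picking the experiment to resume amounts to taking the row with the highest `acc`. The sketch below does this by hand with the values copied from the table above. The `rows` list and the `best` variable are illustrative only; dvc does not produce this structure.

```python
# (checkpoint hash, accuracy) pairs copied from the table above.
# Queued experiments have no metrics yet, so they are left out.
rows = [
    ("31d5005", 0.1585),
    ("5152381", 0.1316),
    ("56555fb", 0.3842),
    ("b884c0c", 0.2664),
]

# Select the row with the highest accuracy.
best = max(rows, key=lambda row: row[1])
print(best)  # ('56555fb', 0.3842)
```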


In [16]:
%%bash
dvc exp res -r 56555fb

Stage 'download' didn't change, skipping
Running stage 'train' with command:
	python train.py
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '150e65b'.
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '153c96b'.
Reproduced experiment '153c96b'.


In [17]:
%%bash
dvc exp show

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Experiment   ┃ Created  ┃    acc ┃   loss ┃ lr    ┃ weight_decay ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│ workspace    │ -        │ 0.6524 │ 1.9192 │ 0.001 │ 0            │
│ dev          │ 10:46 AM │      - │      - │ 0.001 │ 0            │
│ │ ╓ 153c96b  │ 10:57 AM │ 0.6524 │ 1.9192 │ 0.001 │ 0            │
│ │ ╟ 150e65b  │ 10:56 AM │ 0.6104 │ 2.0579 │ 0.001 │ 0            │
│ │ ╟ 56555fb  │ 10:47 AM │ 0.3842 │ 2.1689 │ 0.001 │ 0            │
│ ├─╨ b884c0c  │ 10:46 AM │ 0.2664 │ 2.2524 │ 0.001 │ 0            │
│ │ ╓ 31d5005  │ 10:47 AM │ 0.1585 │ 2.2906 │ 0.001 │ 0.1          │
│ ├─╨ 5152381  │ 10:47 AM │ 0.1316 │ 2.3009 │ 0.001 │ 0.1          │
│ ├── *ea532e1 │ 10:55 AM │      - │      - │ 0.1   │ 0            │
│ └── *05c0223 │ 10:55 AM │      - │      - │ 0.01  │ 0            │
└──────────────┴──────────┴────────┴────────┴───────┴──────────────┘


### Persist models

Commit the model iteration with peak accuracy. Check out the experiment rev in dvc and then commit to git.

In [18]:
%%bash
dvc exp checkout 153c96b
cat metrics.yaml

Changes for experiment '153c96b' have been applied to your current workspace.
acc: 0.6524
loss: 1.9191707372665405
