# Lightweight Development Pipelines with DVC

This notebook highlights the most important elements of DVC (Data Version Control). Extensive documentation is available on the [DVC website](https://dvc.org).

As a showcase we will implement a simple regression pipeline.

### Some Preparations
We create a new directory, copy the required files into it, and change the current working directory.

In [34]:
%%bash
rm -rf /workshop/workspace/dvc_intro
mkdir -p /workshop/workspace/dvc_intro
cp /workshop/notebooks/dvc/{dvc_exercise.py,deployment_location,dvc_introduction.py,params.yaml} /workshop/workspace/dvc_intro
cp -r /workshop/notebooks/dvc/data /workshop/workspace/dvc_intro

In [35]:
import os
os.chdir("/workshop/workspace/dvc_intro")

### Initialize Git

DVC works on top of Git, so we initialize a Git repository first.

In [36]:
!git init

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint:
hint: 	git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: 	git branch -m <name>
Initialized empty Git repository in /workshop/workspace/dvc_intro/.git/


Git needs a user name and email address before it can create commits, so set them if you have not done so already.

In [37]:
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

### Initialize DVC

In [38]:
!dvc init -f

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

Files can be put under DVC version control in two ways: explicitly with `dvc add`, or implicitly as the outputs of a pipeline stage.
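
As a rough mental model, `dvc add` hashes the file, stores a copy in the cache under a path derived from that hash, and writes a small `.dvc` file containing the hash for Git to track. The sketch below illustrates only the content-addressing idea. The helper names `md5_of_file` and `cache_path` are made up for this example, and the two-character subdirectory layout is an assumption, since the exact cache layout varies between DVC versions.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_root: Path, file_hash: str) -> Path:
    """Assumed layout: the first two hex characters form a subdirectory."""
    return cache_root / file_hash[:2] / file_hash[2:]


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for data/image.jpg; the content is arbitrary.
    data_file = Path(tmp) / "image.jpg"
    data_file.write_bytes(b"not really an image")
    file_hash = md5_of_file(data_file)

print(file_hash)
print(cache_path(Path(".dvc/cache"), file_hash))
```

Because the cache path depends only on the content, two identical files are stored once, and any change to a file yields a new cache entry.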

In [39]:
!dvc add data/image.jpg

To track the changes with git, run:

	git add data/.gitignore data/image.jpg.dvc

Optional: we add a remote storage. It could be S3, GCS, SSH, ...; here we use a local directory.
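
The remote definition is stored in `.dvc/config`, an INI-style file. The snippet below parses what that file is assumed to look like after the `dvc remote add -d` call in this notebook. The config text is written out by hand for illustration, not copied from the repository, so treat its exact shape as an assumption.

```python
import configparser

# Assumed contents of .dvc/config after adding a default remote.
ASSUMED_CONFIG = """\
[core]
    remote = local_storage
['remote "local_storage"']
    url = /tmp/dvc_introduction
"""

parser = configparser.ConfigParser()
parser.read_string(ASSUMED_CONFIG)

# The default remote is named in [core]; its settings live in a section
# whose name embeds the remote name, quotes included.
default_remote = parser["core"]["remote"]
remote_section = f"'remote \"{default_remote}\"'"
remote_url = parser[remote_section]["url"]

print(default_remote, "->", remote_url)
```

Since `.dvc/config` is committed to Git, everyone who clones the repository gets the same remote definition.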

In [40]:
!dvc remote add -d -f local_storage /tmp/dvc_introduction

Setting 'local_storage' as a default remote.

In [41]:
!git status

On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   .dvc/.gitignore
	new file:   .dvc/config
	new file:   .dvc/plots/confusion.json
	new file:   .dvc/plots/confusion_normalized.json
	new file:   .dvc/plots/default.json
	new file:   .dvc/plots/linear.json
	new file:   .dvc/plots/scatter.json
	new file:   .dvc/plots/smooth.json
	new file:   .dvcignore

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   .dvc/config

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/
	deployment_location
	dvc_exercise.py
	dvc_introduction.py
	params.yaml



In [42]:
!git add .

In [43]:
!git commit -m "initial commit"

[master (root-commit) 305ac0f] initial commit
 15 files changed, 635 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvc/plots/confusion.json
 create mode 100644 .dvc/plots/confusion_normalized.json
 create mode 100644 .dvc/plots/default.json
 create mode 100644 .dvc/plots/linear.json
 create mode 100644 .dvc/plots/scatter.json
 create mode 100644 .dvc/plots/smooth.json
 create mode 100644 .dvcignore
 create mode 100644 data/.gitignore
 create mode 100644 data/image.jpg.dvc
 create mode 100644 deployment_location
 create mode 100644 dvc_exercise.py
 create mode 100644 dvc_introduction.py
 create mode 100644 params.yaml


Let's check our current status. Note that DVC has no Git-like staging area. Instead, it keeps a local cache directory that is synchronized with the remote via `dvc push` and `dvc pull`. The `-c` flag compares the cache against the remote.
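
Conceptually, comparing cache and remote boils down to comparing two sets of stored objects. The toy model below uses two plain directories and hypothetical helpers (`stored_objects`, `cloud_status`). The labels `new` and `deleted` mirror the wording DVC uses in its status output, but the logic is a simplification, not DVC's implementation.

```python
import tempfile
from pathlib import Path


def stored_objects(root: Path) -> set[str]:
    """Collect the relative paths of all files below root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def cloud_status(cache: Path, remote: Path) -> dict[str, list[str]]:
    """Report objects that exist on only one side."""
    local, far = stored_objects(cache), stored_objects(remote)
    return {"new": sorted(local - far), "deleted": sorted(far - local)}


with tempfile.TemporaryDirectory() as tmp:
    cache, remote = Path(tmp, "cache"), Path(tmp, "remote")
    (cache / "ab").mkdir(parents=True)
    remote.mkdir()
    (cache / "ab" / "cdef").write_text("payload")

    before_push = cloud_status(cache, remote)

    # In this toy model a "push" just copies the missing objects.
    for name in before_push["new"]:
        target = remote / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes((cache / name).read_bytes())

    after_push = cloud_status(cache, remote)

print(before_push)
print(after_push)
```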

In [44]:
!dvc status -c

Cache and remote 'local_storage' are in sync.

In [45]:
!dvc push

Everything is up to date.

### Building a Pipeline

In [46]:
%%sh
dvc stage add -n download \
 -d dvc_introduction.py \
 -d http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv \
 -o data/winequality-red.csv \
python dvc_introduction.py download_data http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv data/winequality-red.csv

Creating 'dvc.yaml'
Adding stage 'download' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml data/.gitignore


In [47]:
%%sh 
dvc stage add -n split \
-d dvc_introduction.py \
-d data/winequality-red.csv \
-o data/x_train.csv -o data/y_train.csv -o data/x_test.csv -o data/y_test.csv \
python dvc_introduction.py split_data data/winequality-red.csv

Adding stage 'split' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml data/.gitignore


In [48]:
%%sh 
dvc stage add -n train \
-d dvc_introduction.py \
-d data/x_train.csv -d data/y_train.csv \
-o data/model \
-p alpha,l1_ratio \
python dvc_introduction.py train_model data/x_train.csv data/y_train.csv

Adding stage 'train' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml data/.gitignore


In [49]:
%%sh
dvc stage add -n evaluate \
-d dvc_introduction.py \
-d data/model -d data/x_test.csv -d data/y_test.csv \
-m data/result.json \
python dvc_introduction.py evaluate_model data/model data/x_test.csv data/y_test.csv

Adding stage 'evaluate' in 'dvc.yaml'

To track the changes with git, run:

	git add data/.gitignore dvc.yaml


In [50]:
!dvc repro

Running stage 'download':
> python dvc_introduction.py download_data http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv data/winequality-red.csv
Generating lock file 'dvc.lock'                                                 
Updating lock file 'dvc.lock'

Running stage 'split':
> python dvc_introduction.py split_data data/winequality-red.csv
Updating lock file 'dvc.lock'                                                   

Running stage 'train':
> python dvc_introduction.py train_model data/x_train.csv data/y_train.csv
Updating lock file 'dvc.lock'                                                   

Running stage 'evaluate':
> python dvc_introduction.py evaluate_model data/model data/x_test.csv data/y_test.csv
Updating lock file 'dvc.lock'                                                   

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.

In [51]:
!git add .
!git commit -m "Add pipeline"

[master 5e5c659] Add pipeline
 3 files changed, 125 insertions(+)
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml


### Inspecting and Modifying a Pipeline 

In [52]:
!dvc dag

        +----------+      
        | download |      
        +----------+      
              *           
              *           
              *           
          +-------+       
          | split |       
          +-------+       
         **        **     
       **            *    
      *               **  
+-------+               * 
| train |             **  
+-------+            *    
         **        **     
           **    **       
             *  *         
        +----------+      
        | evaluate |      
        +----------+      
+--------------------+ 
| data/image.jpg.dvc | 
+--------------------+
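
The graph drawn by `dvc dag` follows from the declared dependencies and outputs: a stage depends on another stage when it consumes one of that stage's outputs. The sketch below derives the same ordering from the stage definitions used in this notebook, with the standard-library `graphlib` module. The `STAGES` dictionary and the `upstream_stages` helper are illustrative names, not part of DVC, and the script and URL dependencies are left out for brevity.

```python
from graphlib import TopologicalSorter

# Dependencies and outputs as declared in the `dvc stage add` calls.
# The metrics file of `evaluate` is treated as an ordinary output here.
STAGES = {
    "download": {"deps": [], "outs": ["data/winequality-red.csv"]},
    "split": {
        "deps": ["data/winequality-red.csv"],
        "outs": [
            "data/x_train.csv",
            "data/y_train.csv",
            "data/x_test.csv",
            "data/y_test.csv",
        ],
    },
    "train": {
        "deps": ["data/x_train.csv", "data/y_train.csv"],
        "outs": ["data/model"],
    },
    "evaluate": {
        "deps": ["data/model", "data/x_test.csv", "data/y_test.csv"],
        "outs": ["data/result.json"],
    },
}


def upstream_stages(stages: dict) -> dict:
    """Map each stage to the stages that produce one of its dependencies."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


graph = upstream_stages(STAGES)
# static_order() yields every stage after all of its upstream stages.
order = list(TopologicalSorter(graph).static_order())

print(graph)
print(order)
```

`data/image.jpg.dvc` appears as a separate node in the `dvc dag` output because no stage consumes or produces that file.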

In [53]:
!dvc status -c

	new:                data/model                                                 
	new:                data/winequality-red.csv
	new:                data/x_test.csv
	new:                data/x_train.csv
	new:                data/y_test.csv
	new:                data/y_train.csv
	new:                data/model/MLmodel
	new:                data/model/conda.yaml
	new:                data/model/model.pkl
	new:                data/result.json

In [54]:
!dvc push


In [55]:
!dvc status -c

Cache and remote 'local_storage' are in sync.

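Let's modify a file and reproduce our pipeline. First we confirm that nothing is out of date: as long as no dependency has changed, `dvc repro` skips every stage.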
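
The skip-or-rerun decision can be pictured as a hash comparison: the hashes recorded in `dvc.lock` at the last run are compared with the hashes of the files as they are now. The sketch below is a simplified model under that assumption. The `lock` dictionary stands in for `dvc.lock`, and `file_md5` and `changed_deps` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hex MD5 digest of a (small) file."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def changed_deps(deps: list[Path], recorded: dict[str, str]) -> list[str]:
    """Return the dependencies whose current hash differs from the recorded one."""
    return [str(dep) for dep in deps if recorded.get(str(dep)) != file_md5(dep)]


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "x_train.csv"
    dep.write_text("a,b\n1,2\n")

    # Record the hash, as a pipeline run would.
    lock = {str(dep): file_md5(dep)}
    unchanged = changed_deps([dep], lock)

    # Edit the dependency: the stage that uses it is now stale.
    dep.write_text("a,b\n1,3\n")
    modified = changed_deps([dep], lock)

print(unchanged)
print(modified)
```

An empty list means the stage can be skipped. Any entry means the stage, and every stage downstream of it, has to run again.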

In [56]:
!dvc status

Data and pipelines are up to date.

In [57]:
!dvc repro

Stage 'download' didn't change, skipping
Stage 'split' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

### Compare Experiments

In [58]:
!sed -i -e "s/alpha:\s0.5/alpha: 0.6/g" params.yaml
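
The `train` stage was declared with `-p alpha,l1_ratio`, so those two values in `params.yaml` are tracked alongside its file dependencies. A minimal sketch of the comparison, assuming parameters are plain key/value pairs: `changed_params` is a hypothetical helper, and the `l1_ratio` value is invented because the notebook never shows it.

```python
# Parameters listed with -p when the train stage was defined.
TRACKED = ("alpha", "l1_ratio")


def changed_params(recorded: dict, current: dict, tracked=TRACKED) -> dict:
    """Return {name: (old, new)} for tracked parameters whose value changed."""
    return {
        name: (recorded.get(name), current.get(name))
        for name in tracked
        if recorded.get(name) != current.get(name)
    }


# alpha values are taken from the sed call; the l1_ratio value is hypothetical.
recorded = {"alpha": 0.5, "l1_ratio": 0.5}
current = {"alpha": 0.6, "l1_ratio": 0.5}

print(changed_params(recorded, current))
```

A non-empty result marks the stage as changed, which is why only `train` and the stages after it are rerun.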

In [64]:
!dvc params diff

Path         Param    Old    New
params.yaml  alpha    0.5    0.6

In [59]:
!dvc repro

Stage 'download' didn't change, skipping
Stage 'split' didn't change, skipping
Running stage 'train':
> python dvc_introduction.py train_model data/x_train.csv data/y_train.csv
Updating lock file 'dvc.lock'                                                   

Running stage 'evaluate':
> python dvc_introduction.py evaluate_model data/model data/x_test.csv data/y_test.csv
Updating lock file 'dvc.lock'                                                   

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.
Path              Metric      Old      New      Change
data/result.json  train.mae   0.63139  0.65773  0.02635
data/result.json  train.r2    0.13344  0.06825  -0.06519
data/result.json  train.rmse  0.75253  0.78032  0.02779

In [60]:
!dvc metrics show

Path              train.mae    train.r2    train.rmse
data/result.json  0.65773      0.06825     0.78032

In [61]:
!dvc metrics diff

Path              Metric      Old      New      Change
data/result.json  train.mae   0.63139  0.65773  0.02635
data/result.json  train.r2    0.13344  0.06825  -0.06519
data/result.json  train.rmse  0.75253  0.78032  0.02779

It is also possible to compare results from different branches.
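
Whichever two revisions are compared, the `Change` column is the new value minus the old one, and dotted names such as `train.mae` come from nested keys in the metrics file. The sketch below reproduces that calculation from the rounded values printed in this notebook. The nested structure of `data/result.json` is an assumption, `flatten` and `metrics_diff` are hypothetical helpers, and the last digit can differ from DVC's output because DVC works with the unrounded numbers.

```python
def flatten(tree: dict, prefix: str = "") -> dict:
    """Turn {"train": {"mae": 1.0}} into {"train.mae": 1.0}."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def metrics_diff(old: dict, new: dict) -> dict:
    """Return {metric: new - old} for metrics present on both sides."""
    flat_old, flat_new = flatten(old), flatten(new)
    shared = sorted(flat_old.keys() & flat_new.keys())
    return {name: flat_new[name] - flat_old[name] for name in shared}


# Rounded values from the tables in this notebook.
old = {"train": {"mae": 0.63139, "r2": 0.13344, "rmse": 0.75253}}
new = {"train": {"mae": 0.65773, "r2": 0.06825, "rmse": 0.78032}}

diff = metrics_diff(old, new)
for name, change in diff.items():
    print(f"{name:12s} {change:+.5f}")
```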

In [None]:
%%bash
git checkout -b experiment_1
git add .
git commit -m "changed parameter alpha"

dvc metrics diff master experiment_1

### More Features

Download a file from another (external) Git+DVC repository with `dvc get`. The downloaded file is not tracked afterwards.

In [None]:
!dvc get https://github.com/iterative/example-get-started model.pkl

In [None]:
!rm model.pkl

Download a file *including* its `.dvc` file from another (external) Git+DVC repository with `dvc import`. The `.dvc` file records where the file came from.

In [None]:
!dvc import https://github.com/iterative/example-get-started model.pkl

In [None]:
!cat model.pkl.dvc

### Experiment Tracking

New in DVC 2.0: experiment tracking built on top of Git. See https://dvc.org/doc/start/experiments for details.

### Clean-up

In [None]:
import os
os.chdir("/workshop/notebooks/dvc")

In [None]:
%%sh
rm -rf /workshop/workspace/dvc_intro
rm -rf /tmp/dvc_introduction