# Lightweight Development Pipelines with DVC

In this notebook we highlight the most important elements of DVC. Extensive documentation is available on the [DVC website](https://dvc.org).

As a showcase we will implement a simple regression pipeline.

### Some Preparations
We create a new directory, copy some files into it, and change the current working directory.

In [None]:
%%bash
rm -rf /workshop/workspace/dvc_intro
mkdir -p /workshop/workspace/dvc_intro
cp /workshop/notebooks/dvc/{dvc_exercise.py,deployment_location,dvc_introduction.py,params.yaml} /workshop/workspace/dvc_intro
cp -r /workshop/notebooks/dvc/data /workshop/workspace/dvc_intro

In [None]:
import os
os.chdir("/workshop/workspace/dvc_intro")

### Initialize Git

DVC works on top of Git.

In [None]:
!git init

If you have not configured Git yet, set your user name and email. Note that `--global` changes your user-wide configuration.

In [None]:
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

### Initialize DVC

In [None]:
!dvc init -f

We can add files to DVC's version control either explicitly with `dvc add` or implicitly as outputs of a pipeline stage.

In [None]:
!dvc add data/image.jpg
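Conceptually, `dvc add` hashes the file content, keeps a copy in the cache under a path derived from that hash, and writes a small `.dvc` pointer file that Git tracks. The following is a minimal sketch of that content-addressing idea, not DVC's implementation. The helper names `file_md5` and `cache_path` and the `<first two hex chars>/<rest>` layout are assumptions for illustration; the real cache layout differs between DVC versions.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, md5_hex):
    """Hypothetical layout: <cache_dir>/<2 hex chars>/<remaining chars>."""
    return Path(cache_dir) / md5_hex[:2] / md5_hex[2:]
```

Because the path depends only on the content, two identical files map to the same cache entry, and any change to a file yields a new entry.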

Optional: we add a remote storage. This could be S3, GCS, SSH, ...; here we use a local directory under `/tmp`.

In [None]:
!dvc remote add -d -f local_storage /tmp/dvc_introduction

In [None]:
!git status

In [None]:
!git add .

In [None]:
!git commit -m "initial commit"

Let's check our current status. Note: DVC has no Git-like staging area. Instead, it keeps a cache directory that is synchronized with the remote; `dvc status -c` compares the two.

In [None]:
!dvc status -c

In [None]:
!dvc push
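The relationship between cache and remote can be pictured as a set comparison between two object stores. The sketch below is a simplified illustration under that assumption, not what DVC does internally (DVC only considers objects referenced by the workspace). The function names and the `to_push`/`to_pull` keys are hypothetical.

```python
from pathlib import Path


def stored_objects(root):
    """Return the relative paths of all files found under a directory."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def sync_status(cache_dir, remote_dir):
    """Compare two object stores by relative path (a stand-in for the hash)."""
    cache = stored_objects(cache_dir)
    remote = stored_objects(remote_dir)
    return {
        "to_push": sorted(cache - remote),  # present in the cache only
        "to_pull": sorted(remote - cache),  # present on the remote only
    }
```

After a successful push, the `to_push` list in this model would be empty, which mirrors the cache and remote being in sync.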

### Building a Pipeline

In [None]:
%%sh
dvc stage add -n download \
 -d dvc_introduction.py \
 -d http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv \
 -o data/winequality-red.csv \
python dvc_introduction.py download_data http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv data/winequality-red.csv

In [None]:
%%sh 
dvc stage add -n split \
-d dvc_introduction.py \
-d data/winequality-red.csv \
-o data/x_train.csv -o data/y_train.csv -o data/x_test.csv -o data/y_test.csv \
python dvc_introduction.py split_data data/winequality-red.csv

In [None]:
%%sh 
dvc stage add -n train \
-d dvc_introduction.py \
-d data/x_train.csv -d data/y_train.csv \
-o data/model \
-p alpha,l1_ratio \
python dvc_introduction.py train_model data/x_train.csv data/y_train.csv

In [None]:
%%sh
dvc stage add -n evaluate \
-d dvc_introduction.py \
-d data/model -d data/x_test.csv -d data/y_test.csv \
-m data/result.json \
python dvc_introduction.py evaluate_model data/model data/x_test.csv data/y_test.csv

In [None]:
!dvc repro

In [None]:
!git add .
!git commit -m "Add pipeline"

### Inspecting and Modifying a Pipeline 

In [None]:
!dvc dag
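The graph arises because one stage's output is another stage's dependency. As a rough sketch of that idea (not DVC's code), the snippet below rebuilds the stage graph from the `-d`/`-o`/`-m` arguments used in the `dvc stage add` calls of this notebook and derives an execution order with the standard-library `graphlib` module (Python 3.9+). The `stages` dictionary and `stage_graph` helper are illustrative names; the metric file is listed as a plain output for simplicity.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"

# Dependencies and outputs as declared in the `dvc stage add` calls.
stages = {
    "download": {"deps": ["dvc_introduction.py", URL],
                 "outs": ["data/winequality-red.csv"]},
    "split": {"deps": ["dvc_introduction.py", "data/winequality-red.csv"],
              "outs": ["data/x_train.csv", "data/y_train.csv",
                       "data/x_test.csv", "data/y_test.csv"]},
    "train": {"deps": ["dvc_introduction.py", "data/x_train.csv", "data/y_train.csv"],
              "outs": ["data/model"]},
    "evaluate": {"deps": ["dvc_introduction.py", "data/model",
                          "data/x_test.csv", "data/y_test.csv"],
                 "outs": ["data/result.json"]},
}


def stage_graph(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


graph = stage_graph(stages)
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['download', 'split', 'train', 'evaluate']
```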

In [None]:
!dvc status -c

In [None]:
!dvc push

In [None]:
!dvc status -c

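Let's modify a file (for example, edit `dvc_introduction.py`) and reproduce our pipeline! `dvc status` lists the stages affected by the change, and `dvc repro` reruns only those.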

In [None]:
!dvc status

In [None]:
!dvc repro
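`dvc repro` can skip unchanged stages because the hashes of each stage's dependencies are recorded (in `dvc.lock`) and compared against the current files. A minimal sketch of that comparison, assuming plain MD5 hashes of whole files and using hypothetical helper names `snapshot` and `changed_deps`:

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to the MD5 of its current content (a tiny 'lock file')."""
    return {str(p): hashlib.md5(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_deps(locked, paths):
    """Return the paths whose current hash differs from the recorded one."""
    current = snapshot(paths)
    return sorted(p for p, digest in current.items() if locked.get(p) != digest)
```

In this model a stage needs to be rerun when `changed_deps` is non-empty for its dependencies; stages downstream of it follow because their inputs change in turn.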

### Compare Experiments

In [None]:
!sed -i -e "s/alpha:\s0.5/alpha: 0.6/g" params.yaml
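The `sed` call relies on GNU extensions (`\s` and `-i` without a suffix), so it may behave differently on other platforms. A portable alternative, sketched here under the assumption that `params.yaml` contains a plain top-level line such as `alpha: 0.5`, is a small regular-expression helper. `set_param` is a hypothetical name and performs a textual replacement rather than real YAML parsing.

```python
import re


def set_param(text, name, value):
    """Replace the value on the line `<name>: ...`, leaving other lines untouched."""
    pattern = rf"^([ \t]*{re.escape(name)}:[ \t]*).*$"
    return re.sub(pattern, lambda m: f"{m.group(1)}{value}", text, flags=re.MULTILINE)


# Hypothetical file content for illustration.
example = "alpha: 0.5\nl1_ratio: 0.5\n"
print(set_param(example, "alpha", 0.6))
```

To apply it to the real file, read `params.yaml` with `pathlib.Path.read_text`, pass the text through `set_param`, and write the result back.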

In [None]:
!dvc params diff

In [None]:
!dvc repro

In [None]:
!dvc metrics show

In [None]:
!dvc metrics diff
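A metrics diff boils down to comparing two flat dictionaries of numbers, one per revision. The sketch below illustrates that comparison; it is not DVC's implementation, and the metric names and values are hypothetical, not results from this pipeline.

```python
def metrics_diff(old, new):
    """Return {metric: (old, new, change)} for metrics whose value changed."""
    rows = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before == after:
            continue
        # A change is only defined when the metric exists on both sides.
        change = after - before if None not in (before, after) else None
        rows[name] = (before, after, change)
    return rows


# Hypothetical metric values for illustration only.
baseline = {"rmse": 0.75, "r2": 0.25}
candidate = {"rmse": 0.5, "r2": 0.25}
print(metrics_diff(baseline, candidate))  # {'rmse': (0.75, 0.5, -0.25)}
```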

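It is also possible to compare results from different branches. The cell below assumes the default branch is named `master`; adjust the name if your Git configuration uses a different one.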

In [None]:
%%bash
git checkout -b experiment_1
git add .
git commit -m "changed parameter alpha"

dvc metrics diff master experiment_1

### More Features

Get a file from another (external) git+DVC repository.

In [None]:
!dvc get https://github.com/iterative/example-get-started model.pkl

In [None]:
!rm model.pkl

Import a file from another (external) Git+DVC repository *including* a `.dvc` file that records where it came from.

In [None]:
!dvc import https://github.com/iterative/example-get-started model.pkl

In [None]:
!cat model.pkl.dvc

### Experiment Tracking

New in DVC 2: experiment tracking based on Git. See https://dvc.org/doc/start/experiments

### Clean-up

In [None]:
import os
os.chdir("/workshop/notebooks/dvc")

In [None]:
%%sh
rm -rf /workshop/workspace/dvc_intro
rm -rf /tmp/dvc_introduction