# Lightweight Development Pipelines with DVC

In this notebook we will highlight important elements of DVC. You can find extensive documentation on the [DVC website](https://dvc.org).

As a showcase we will implement a simple regression pipeline.

## Project Setup

### Some Preparations
Create a new directory `workspace/dvc_intro`, copy the required files into it, and change the current working directory.

In [None]:
%%bash
rm -rf /workshop/workspace/dvc_intro
mkdir -p /workshop/workspace/dvc_intro
cp /workshop/notebooks/dvc/{dvc_exercise.py,deployment_location,dvc_introduction.py,params.yaml} /workshop/workspace/dvc_intro
cp -r /workshop/notebooks/dvc/data /workshop/workspace/dvc_intro

In [None]:
import os
os.chdir("/workshop/workspace/dvc_intro")

### Initialize Git

First initialize Git, as DVC works on top of it.

In [None]:
!git init

Optional: Set your git configuration.

In [None]:
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

### Initialize DVC

In [None]:
!dvc init -f

We need to register a data storage location via the `dvc remote add` command.
This could be a remote storage (S3, GCS, SSH, ...) or a local directory.
For now, a local directory is sufficient.

In [1]:
!dvc remote add -d -f local_storage /tmp/dvc_introduction

Files can be added to our versioning system manually or implicitly in a pipeline.
For now, let's add them manually.
This command tells DVC to start tracking the added file.

In [None]:
!dvc add data/image.jpg
!dvc add data/text.txt

Let's check what has changed.

In [None]:
!git status

Two new files called `image.jpg.dvc` and `text.txt.dvc` were created, and DVC added the data files themselves to `data/.gitignore`. To track changes to the added files, we commit their `.dvc` files to Git.

In [None]:
!git add .

In [None]:
!git commit -m "initial commit"
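The `.dvc` files created by `dvc add` are small text files that identify the tracked data by a content hash. The sketch below illustrates that idea using only the standard library. The field names (`md5`, `size`, `path`) follow what DVC typically writes, but the exact layout and hashing details vary between DVC versions, so treat `describe_output` as a hypothetical helper and not as DVC's implementation.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_output(path):
    """Build a dict resembling one `outs` entry of a .dvc file (assumed layout)."""
    p = Path(path)
    return {"md5": file_md5(p), "size": p.stat().st_size, "path": p.name}
```

Calling `describe_output("data/text.txt")` and comparing the result with `cat data/text.txt.dvc` shows which information Git actually versions: the hash and metadata, not the data itself.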

Let's compare our current status with the state of the configured remote.
Note: DVC does not have a Git-like staging area. Instead, it keeps a local cache directory that is synced with the remote.

In [None]:
!dvc status -c
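Conceptually, `dvc status -c` compares the objects in the local cache with those in the remote. The following is a simplified model of that comparison, assuming both stores are plain directories whose file names identify the objects. The function names are hypothetical, and real DVC only considers objects referenced by the current workspace.

```python
from pathlib import Path


def list_objects(root):
    """Return the set of file paths under `root`, relative to it."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def cloud_status(cache_dir, remote_dir):
    """Compare two object stores by the names of the objects they hold."""
    cache = list_objects(cache_dir)
    remote = list_objects(remote_dir)
    return {
        "to_push": sorted(cache - remote),  # present locally, missing remotely
        "to_pull": sorted(remote - cache),  # present remotely, missing locally
    }
```

After a successful `dvc push`, the `to_push` list in this model would be empty, which corresponds to DVC reporting that cache and remote are in sync.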

### Push the Recent Changes to the DVC Remote Storage

In [None]:
!dvc push

### Optional: Simulate a Data Update

Make changes to the `data/text.txt` file.
Add the changes to DVC.
Add and commit the changes via Git.

If you need help, have a look at what you have done so far.

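When you're ready, you can inspect the history of your data files through the Git history of their `.dvc` files. To switch between versions, check out the desired revision of the `.dvc` file with Git and then run `dvc checkout`.

In [None]:
!git log --oneline -- data/text.txt.dvc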

## Building a DVC Pipeline

For the next exercise, you will build a simple DVC pipeline.

The first step of the pipeline will be the `download` step.
The pipeline should execute the function `download_data` in the `dvc_introduction.py` file.
The data to be used is stored here: `http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv`.
The output should be stored here: `data/winequality-red.csv`.

The following command will create a configuration for the data pipeline containing the download stage. 

In [None]:
%%sh
dvc stage add -n download \
 -d dvc_introduction.py \
 -d http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv \
 -o data/winequality-red.csv \
python dvc_introduction.py download_data http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv data/winequality-red.csv
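The stage command calls the script with a function name followed by its arguments. The contents of `dvc_introduction.py` are not shown in this notebook, so the following is only a sketch of how such a script could dispatch to its functions. The `COMMANDS` registry and `main` are assumptions made for illustration.

```python
import sys
import urllib.request


def download_data(url, out_path):
    """Fetch `url` and write the response body to `out_path`."""
    urllib.request.urlretrieve(url, out_path)


# Hypothetical registry: maps the first CLI argument to a function.
COMMANDS = {"download_data": download_data}


def main(argv):
    """Dispatch `<function name> <args...>` to the matching function."""
    if not argv or argv[0] not in COMMANDS:
        raise SystemExit("usage: dvc_introduction.py <function> [args...]")
    COMMANDS[argv[0]](*argv[1:])


if __name__ == "__main__" and sys.argv[1:]:
    main(sys.argv[1:])
```

With this structure, each pipeline stage can reuse the same script and select its step through the first argument, which is why every stage lists `dvc_introduction.py` as a dependency.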

Create the command to add the next stage on your own.

The next step should be called `split`.
It will execute the `split_data` function of the `dvc_introduction.py` file.
The data to be used will be accessible here: `data/winequality-red.csv`.
The function will generate four outputs, which should be stored as follows: `data/x_train.csv`, `data/y_train.csv`, `data/x_test.csv` and `data/y_test.csv`.

In [None]:
%%sh 
dvc stage add -n split \
-d dvc_introduction.py \
-d data/winequality-red.csv \
-o data/x_train.csv -o data/y_train.csv -o data/x_test.csv -o data/y_test.csv \
python dvc_introduction.py split_data data/winequality-red.csv

Create a third step for training. This step will use the two parameters defined in the `params.yaml`.
Hint: The `params.yaml` in the root folder will be searched for the named parameters by default.
You don't need to include its path in the configuration.

name: `train`
function: `train_model`
script file: `dvc_introduction.py`
data: `data/x_train.csv` and `data/y_train.csv`
output: `data/model`
parameters (-p): `alpha` and `l1_ratio`

In [None]:
%%sh 
dvc stage add -n train \
-d dvc_introduction.py \
-d data/x_train.csv -d data/y_train.csv \
-o data/model \
-p alpha,l1_ratio \
python dvc_introduction.py train_model data/x_train.csv data/y_train.csv
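Declaring `-p alpha,l1_ratio` tells DVC to watch these two keys in `params.yaml`, and the training code has to read the same file. As a minimal sketch, assuming `params.yaml` contains only flat `key: value` lines with numeric values, the parameters could be read as shown below. The parser is a standard-library stand-in for a real YAML library, and the `l1_ratio` value in the example is hypothetical.

```python
def read_flat_params(text):
    """Parse flat `key: value` lines into a dict of floats.

    Stand-in for a YAML parser; handles only the flat numeric case.
    """
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition(":")
        params[key.strip()] = float(value)
    return params


# Example content; the l1_ratio value is made up for illustration.
EXAMPLE_PARAMS = "alpha: 0.5\nl1_ratio: 0.1\n"
params = read_flat_params(EXAMPLE_PARAMS)
```

In a real project a YAML library would be the more robust choice, since `params.yaml` may contain nested sections, strings, or lists.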

Create a fourth step for evaluation. Here you will generate a metric file `data/result.json`.

name: `evaluate`
function: `evaluate_model`
script file: `dvc_introduction.py`
data: `data/model`, `data/x_test.csv` and `data/y_test.csv`
metric (-m): `data/result.json`

In [None]:
%%sh
dvc stage add -n evaluate \
-d dvc_introduction.py \
-d data/model -d data/x_test.csv -d data/y_test.csv \
-m data/result.json \
python dvc_introduction.py evaluate_model data/model data/x_test.csv data/y_test.csv

Start the pipeline.

In [None]:
!dvc repro

With the execution of the pipeline, a new file called `dvc.lock` was created.
It stores information about the last run of the pipeline, including data and script file hashes.

Commit the file to Git.

In [None]:
!git add .
!git commit -m "Add pipeline"

Optional: Try re-executing the pipeline.
You will see that DVC checks whether the pipeline steps or the underlying data have changed.
If nothing has changed, the pipeline steps will not be executed again.
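The decision to skip a stage rests on comparing the hashes recorded in `dvc.lock` with the current state of the files. The sketch below models only the dependency part of that check with the standard library. DVC additionally compares the stage command, the tracked parameters, and the outputs, and the function names here are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to the MD5 of its current content."""
    return {str(p): hashlib.md5(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_deps(recorded, paths):
    """Return the dependencies whose content no longer matches the recorded hash."""
    current = snapshot(paths)
    return sorted(p for p, digest in current.items() if recorded.get(p) != digest)
```

A stage would be re-run in this model whenever `changed_deps` returns a non-empty list. Because the outputs of one stage are the dependencies of the next, a change propagates downstream through the pipeline.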

### Additional: Inspecting and Modifying a Pipeline 

In [None]:
!dvc dag
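The graph printed by `dvc dag` follows from the stage definitions: a stage depends on another stage when one of its dependencies is that stage's output. As a rough illustration, assuming Python 3.9+ for `graphlib`, the execution order can be derived from the dependencies and outputs declared in the four `dvc stage add` commands. The `STAGES` dict and `execution_order` are illustrative names, not part of DVC.

```python
from graphlib import TopologicalSorter

URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"

# Dependencies and outputs as declared in the stage definitions of this notebook.
STAGES = {
    "download": {
        "deps": ["dvc_introduction.py", URL],
        "outs": ["data/winequality-red.csv"],
    },
    "split": {
        "deps": ["dvc_introduction.py", "data/winequality-red.csv"],
        "outs": ["data/x_train.csv", "data/y_train.csv",
                 "data/x_test.csv", "data/y_test.csv"],
    },
    "train": {
        "deps": ["dvc_introduction.py", "data/x_train.csv", "data/y_train.csv"],
        "outs": ["data/model"],
    },
    "evaluate": {
        "deps": ["dvc_introduction.py", "data/model",
                 "data/x_test.csv", "data/y_test.csv"],
        "outs": ["data/result.json"],
    },
}


def execution_order(stages):
    """Order stages so that every producer runs before its consumers."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(STAGES)
```

Files such as `dvc_introduction.py` or the download URL are not produced by any stage, so they add no edges to the graph.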

In [None]:
!dvc status -c

In [None]:
!dvc push

In [None]:
!dvc status -c

Let's modify a file and reproduce our pipeline!

In [None]:
!dvc status

In [None]:
!dvc repro

### Additional: Compare Experiments

In [None]:
!sed -i -e "s/alpha:\s0.5/alpha: 0.6/g" params.yaml

In [None]:
!dvc params diff

In [None]:
!dvc repro

In [None]:
!dvc metrics show

In [None]:
!dvc metrics diff
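`dvc metrics diff` compares the metric file of the workspace with the version from another revision. The sketch below shows the underlying idea for two flat JSON metric files. The metric names `rmse` and `r2` are hypothetical, since the contents of `data/result.json` are not shown in this notebook.

```python
import json


def load_metrics(text):
    """Parse the JSON content of a metric file into a dict."""
    return json.loads(text)


def metrics_diff(old, new):
    """Return {name: (old, new, change)} for metrics present in both runs."""
    shared = sorted(old.keys() & new.keys())
    return {name: (old[name], new[name], new[name] - old[name]) for name in shared}


# Made-up values standing in for two versions of data/result.json.
before = load_metrics('{"rmse": 0.75, "r2": 0.25}')
after = load_metrics('{"rmse": 0.5, "r2": 0.5}')
diff = metrics_diff(before, after)
```

Because the metric file is declared with `-m`, DVC knows which file to read at each revision, so no paths need to be passed to `dvc metrics diff`.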

It is also possible to compare results from different branches.

In [None]:
%%bash
git checkout -b experiment_1
git add .
git commit -m "changed parameter alpha"

dvc metrics diff master experiment_1

### Additional: More Features

Get a file from another (external) git+DVC repository.

In [None]:
!dvc get https://github.com/iterative/example-get-started model.pkl

In [None]:
!rm model.pkl

Get a file *including* its .dvc file from another (external) git+DVC repository.

In [None]:
!dvc import https://github.com/iterative/example-get-started model.pkl

In [None]:
!cat model.pkl.dvc

### Experiment Tracking

New in DVC 2.0: Git-based experiment tracking. See https://dvc.org/doc/start/experiments

## Clean-up

In [None]:
import os
os.chdir("/workshop/notebooks/dvc")

In [None]:
%%sh
rm -rf /workshop/workspace/dvc_intro
rm -rf /tmp/dvc_introduction