# Lightweight Development Pipelines with DVC

This notebook highlights the most important elements of DVC. You can find extensive documentation on
the [DVC website](https://dvc.org).

As a showcase we will implement a simple regression pipeline to predict the quality of red wine.

## 0 - Project Setup

We will do the exercise using a dedicated `dvc` folder in the `workspace` directory.
Therefore, we will copy all necessary files to the workspace and change our current working 
directory to the new directory.

In [None]:
%%bash
rm -rf /workshop/workspace/dvc
mkdir -p /workshop/workspace/dvc
cp /workshop/notebooks/dvc/{deployment_location,params.yaml} /workshop/workspace/dvc
cp -r /workshop/notebooks/dvc/data /workshop/workspace/dvc
cp -r /workshop/notebooks/dvc/pipeline_scripts /workshop/workspace/dvc

In [None]:
import os
os.chdir("/workshop/workspace/dvc")

## 1 - Initialize Git & DVC

### 1.1 - Initialize Git

First initialize Git, as DVC works on top of it.

In [None]:
!git init

**Optional:** Set your git configuration.

In [None]:
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

Add all copied files, except for the `data` folder, to your Git repository, so we can
see the DVC changes of the next steps. We don't add the `data` folder to Git, because we
want to track the data files inside it via DVC. A file cannot be tracked by DVC and Git
simultaneously.

In [None]:
!git status

In [None]:
!git add --all -- ':!data/'
!git commit -m 'initial commit'
!echo "----"
!git status

### 1.2 - Initialize DVC

Similar to initializing a Git repository, we have to initialize a DVC repository first.
The `-f` flag makes sure you start with a fresh DVC repo: it overwrites any existing DVC repo in the
given directory.

**Optional:** If you want to know what DVC is doing, you can add the `-v` flag to the init command.
This runs the command in `verbose` mode and shows the steps it performs.

In [None]:
!dvc init -f
!echo "----"
!dvc status -c

In [None]:
!git status

DVC itself does not track any data or pipelines yet, but Git recognized three new files, created and
staged by DVC. Similar to Git, DVC stores some meta information inside the `.dvc` folder.

*Git-tracked dvc Files*

- The `.dvc/.gitignore` file makes sure no unwanted files are added to the Git repo.
- The `.dvc/config` file stores the configuration of the DVC project (global or local). For example, if you add a remote location, it is recorded in this file.
- The `.dvcignore` file works similarly to a `.gitignore` file, but for DVC file tracking.

In [None]:
!git commit -m 'initialized dvc'

## 2 - Add data to DVC remote data storage


### 2.1 - Configure a remote storage

We want to add a remote data storage, which we can use to share and back up copies of our data.
This is done via the `dvc remote add` command.
The remote can be a cloud or network location (S3, GCS, SSH, ...) or a local directory.
For now, a local storage is sufficient.

- `-d` makes this our default remote storage
- `-f` overwrites an existing remote storage with the same name
- `local_storage` is the name of our new remote storage
- `/tmp/dvc` is the path to our new remote storage

In [None]:
!dvc remote add -d -f local_storage /tmp/dvc

Great! We have now set up our DVC project and remote storage. Let's track some files.

### 2.2 - Add data to DVC manually

Files can be added to our versioning system manually or implicitly in a pipeline.
We will implement a pipeline later. For now, add the first files manually.

In [None]:
!dvc add data/image.jpg
!dvc add data/text.txt

Let's check what has changed by adding all changed files to git, including the `data` folder.

In [None]:
!git add .
!git status

In [None]:
!cat data/image.jpg.dvc

With the `dvc add` command, DVC creates hashes of your data files and adds the files to its cache. The
hash, as well as the size and path of the original file, are stored in `.dvc` files. These are
pointers to your data and its state. As they are lightweight, they can easily be tracked using Git.
The `.dvc` files keep all the information needed to access the DVC-tracked state of the
underlying files, e.g. if you want to access them on a different device.
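
As a rough illustration of what such a pointer contains, the following Python sketch computes an MD5 hash, the size, and the file name of a small throwaway file. The helper `describe_file` and the demo file are assumptions made for this example; DVC's actual hashing details and cache layout depend on the DVC version and are not reproduced here.

```python
import hashlib
import os
import tempfile


def describe_file(path, chunk_size=1024 * 1024):
    """Return hash, size and name of a file, similar in spirit to a .dvc pointer."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


# Demo on a throwaway file (hypothetical content, not the workshop data).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "text.txt")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello dvc\n")
    pointer = describe_file(demo_path)

print(pointer)
```

Changing a single byte of the file changes the hash, which is how a small pointer file can identify one exact version of a large data file.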

The original files are added to a `.gitignore` file automatically, as they should not be tracked by
Git.

To save the current state of the files, commit the `.dvc` files to Git.

In [None]:
!git commit -m "added sample data to dvc"

### 2.3 - Push the recent changes to the dvc remote storage.

Let's check the current status of our DVC tracked files compared to the status of the defined 
remote.


In [None]:
!dvc status -c

You can see that the two new files are not stored on the remote storage yet.

Performing `dvc push` will upload the tracked files, including version information, to your remote
storage.

**Optional: Under the hood**

In contrast to Git, DVC does not have a staging area or an option to explicitly commit changes.
Instead, local changes are registered by DVC via the `dvc add` command. The resulting `.dvc` files
are then compared directly to the remote state. So, if you want to update a file on your
remote, you change the file, add it to DVC, and push the change to the remote storage.
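
Conceptually, this comparison can be pictured as a set difference between the hashes known locally and the hashes present on the remote. The sketch below is only a mental model under that assumption: the function `compare_storage` and the shortened hashes are hypothetical and do not mirror DVC's internal code.

```python
def compare_storage(local_hashes, remote_hashes):
    """Report which hashes exist only locally and which exist only on the remote."""
    local, remote = set(local_hashes), set(remote_hashes)
    return {
        "only_local": sorted(local - remote),
        "only_remote": sorted(remote - local),
    }


# Hypothetical, shortened hashes for illustration only.
tracked = {"a1f3", "9c2e"}

before_push = compare_storage(tracked, set())
after_push = compare_storage(tracked, tracked)

print(before_push)
print(after_push)
```

Before the push, both hashes show up as local-only; afterwards, both sides hold the same set and nothing is reported.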

In [None]:
!dvc push

In [None]:
!dvc status -c

Now, local and remote storage should be in sync. Good work!

### 2.4 - Optional: Simulate a data update.

- Make changes in the `workspace/dvc/data/text.txt` file.
- Add the changes to dvc.
- Push the changes to the dvc remote storage.
- Add and commit the changes via Git.

If you need help, have a look at what you have done so far.

## 3 - Building a DVC Pipeline

For the next exercise, you will build a simple DVC pipeline.

The first step of the pipeline will be the `download` step:
- The pipeline should execute the function `download_data` in the `./pipeline_scripts/download_data.py` file.
- This is the download url for the data: `http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv`.
- The output should be stored here: `data/winequality-red.csv`.

The following `dvc stage add` command will create a configuration for the data pipeline containing the
`download` stage.

- `-n` is the name of the pipeline step
- `-d` defines a dependency of the step
- `-o` defines the path to the output file
- the last argument must be the command, which the pipeline should execute, e.g. a python command
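
If you prefer to assemble such a command programmatically, a small helper can map a stage description onto the flags listed above. This is a minimal sketch: `stage_add_command` is a hypothetical helper written for this notebook, not part of DVC, and it only builds the argument list without executing anything.

```python
import shlex

# Download URL taken from the stage description above.
DATA_URL = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)


def stage_add_command(name, deps, outs, command):
    """Build the argument list for `dvc stage add` from a stage description."""
    args = ["dvc", "stage", "add", "-n", name]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    # The command that the stage should execute always comes last.
    return args + shlex.split(command)


download_args = stage_add_command(
    name="download",
    deps=["./pipeline_scripts/download_data.py", DATA_URL],
    outs=["./data/winequality-red.csv"],
    command=(
        f"python ./pipeline_scripts/download_data.py {DATA_URL} "
        "./data/winequality-red.csv"
    ),
)

# Print the command as a single, shell-quoted line.
print(shlex.join(download_args))
```

The resulting list could be passed to `subprocess.run(download_args, check=True)` from within the project directory.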

Defining dependencies and outputs is essential for DVC to track whether the input has changed. If so,
DVC will rerun the step when asked to. If nothing has changed, it will skip the step and continue
with the next one. This can make a huge difference in overall execution time.
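
The idea behind skipping unchanged steps can be sketched in a few lines of Python: record a hash per dependency after a run and rerun only if any current hash differs. The names `fingerprint` and `needs_rerun` and the in-memory "files" are assumptions made for this illustration; DVC's own change detection is more involved.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Hash the content of a dependency."""
    return hashlib.md5(content).hexdigest()


def needs_rerun(current_deps, recorded_hashes):
    """A stage must rerun if any dependency hash differs from the recorded one."""
    current_hashes = {name: fingerprint(data) for name, data in current_deps.items()}
    return current_hashes != recorded_hashes


# Hypothetical in-memory stand-ins for a script and a data file.
deps = {
    "split_data.py": b"print('split')",
    "winequality-red.csv": b"7.4;0.7\n",
}

# State recorded after the last successful run.
recorded = {name: fingerprint(data) for name, data in deps.items()}

unchanged = needs_rerun(deps, recorded)

# Simulate a data update: the stage now has to run again.
deps["winequality-red.csv"] = b"7.4;0.7\n7.8;0.88\n"
changed = needs_rerun(deps, recorded)

print(unchanged, changed)
```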

In [None]:
%%sh
dvc stage add -n download \
 -d ./pipeline_scripts/download_data.py \
 -d http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv \
 -o ./data/winequality-red.csv \
python ./pipeline_scripts/download_data.py http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv ./data/winequality-red.csv

Running `dvc stage add ...` will create a `dvc.yaml` file, which stores the definition of your pipeline. This way you can always track your pipeline changes.

### 3.1 - Define the stages

#### Define the split stage

Create the command to add the next stage on your own.

- name: `split`
- function: `split_data`
- script file: `pipeline_scripts/split_data.py`
- input data: `data/winequality-red.csv`
- output data: `data/x_train.csv`, `data/y_train.csv`, `data/x_test.csv` and `data/y_test.csv`

In [None]:
%%sh 
dvc stage add -n split \
-d pipeline_scripts/split_data.py \
-d data/winequality-red.csv \
-o data/x_train.csv -o data/y_train.csv -o data/x_test.csv -o data/y_test.csv \
python pipeline_scripts/split_data.py data/winequality-red.csv

#### Define the train stage

Create a third step for training.

- This step should be named `train` and use the `train_model` function in the `./pipeline_scripts/train_model.py` file.
- It tracks two parameters, `alpha` and `l1_ratio`.

Hint: The parameters are input parameters for the training. By default, DVC looks up their values in the
`params.yaml` file in the root folder.
So, if you want to change them or add a parameter, you need to edit that file. You don't need to
include the file path in the stage definition.

- name: `train`
- script file: `pipeline_scripts/train_model.py`
- input data: `data/x_train.csv` and `data/y_train.csv`
- output: `data/model`
- parameters (-p): `alpha` and `l1_ratio`

In [None]:
%%sh 
dvc stage add -n train \
-d ./pipeline_scripts/train_model.py \
-d data/x_train.csv -d data/y_train.csv \
-o data/model \
-p alpha,l1_ratio \
python ./pipeline_scripts/train_model.py data/x_train.csv data/y_train.csv

#### Define the evaluation stage

Create a fourth step for evaluation.

- name: `evaluate`
- function: `evaluate_model`
- script file: `./pipeline_scripts/evaluate_model.py`
- input data: `data/model`, `data/x_test.csv` and `data/y_test.csv`
- metric (-m): `data/result.json`

In [None]:
%%sh
dvc stage add -n evaluate \
-d ./pipeline_scripts/evaluate_model.py \
-d data/model -d data/x_test.csv -d data/y_test.csv \
-m data/result.json \
python ./pipeline_scripts/evaluate_model.py data/model data/x_test.csv data/y_test.csv

### 3.2 - Start the Pipeline

Let's start the pipeline!

Hint: If something went wrong, you can redefine a stage by running `dvc stage add` again with the `-f` flag.

In [None]:
!dvc repro

In [None]:
!dvc metrics show

Great, you successfully ran your first DVC Pipeline! Congrats!

### 3.3 - Track your pipeline and data in Git and DVC

In addition to the `dvc.yaml` file, a new file called `dvc.lock` was created when the pipeline was
executed.
It stores information about the latest run of the pipeline, including data and script file hashes
for versioning.

Commit both files to Git, so you don't lose your pipeline state.

In [None]:
!git add .
!git commit -m "Add pipeline"

In [None]:
!dvc status -c

In [None]:
!dvc push

**Optional:** Try re-executing the pipeline. You will see that DVC checks whether the pipeline steps or
the underlying data have changed.
If you haven't changed anything, the pipeline steps will not be executed again.

## 4 - Optional: Inspecting and Modifying a Pipeline

In this optional part of the exercise, you can explore what else DVC has to offer and how it behaves when a pipeline is changed.

In [None]:
!dvc dag

Let's modify a file, reproduce our pipeline, and see how DVC only executes the steps affected by
the change.

In [None]:
!dvc status

In [None]:
!dvc repro

### Additional: Compare Experiments

Change the `alpha` parameter and see how DVC tracks the change.

In [None]:
!sed -i -e "s/alpha:\s0.5/alpha: 0.6/g" params.yaml

In [None]:
!dvc params diff

In [None]:
!dvc repro

Have a look at the main training metrics and compare the current state (`workspace`) to the state
of `HEAD`.

In [None]:
!dvc metrics show

In [None]:
!dvc metrics diff
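
Conceptually, such a comparison boils down to pairing each metric from the two states and computing the difference. The sketch below illustrates this under loud assumptions: `metrics_diff` is a hypothetical helper, and the metric names and values are invented for the example and are not results of this pipeline.

```python
def metrics_diff(old, new):
    """Pair metrics from two states and compute the change where both exist."""
    rows = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        change = None
        if before is not None and after is not None:
            change = after - before
        rows[name] = {"old": before, "new": after, "change": change}
    return rows


# Invented example values, NOT actual results of the wine-quality pipeline.
head_metrics = {"rmse": 0.75, "r2": 0.25}
workspace_metrics = {"rmse": 0.5, "r2": 0.5}

diff = metrics_diff(head_metrics, workspace_metrics)
for name, row in diff.items():
    print(name, row)
```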

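It is also possible to compare results from different branches. The command below assumes that your default branch is named `master`; adjust the name if your Git setup uses a different one (e.g. `main`).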

In [None]:
%%bash
git checkout -b experiment_1
git add .
git commit -m "changed parameter alpha"

dvc metrics diff master experiment_1

## 5 - Optional: Download data from another DVC repository

Get a file from another (external) git+DVC repository.

In [None]:
!dvc get https://github.com/iterative/example-get-started model.pkl

In [None]:
!rm model.pkl

Get a file *including* its .dvc file from another (external) git+DVC repository.

In [None]:
!dvc import https://github.com/iterative/example-get-started model.pkl

In [None]:
!cat model.pkl.dvc

## 6 - Clean-up

In [None]:
import os
os.chdir("/workshop/notebooks/dvc")

In [None]:
%%sh
rm -rf /workshop/workspace/dvc
rm -rf /tmp/dvc