# Lightweight Development Pipelines with DVC

In this notebook we will highlight important elements of DVC. You can find extensive information on their [website](https://dvc.org).

As a showcase we will implement a simple regression pipeline to predict the quality of red wine.

## Project Setup

### Some Preparations
We will do the exercise using a dedicated 'dvc' folder in the 'workspace' directory.
Therefore, we will copy all necessary files to the workspace and change our current working directory to the new directory.

In [18]:
%%bash
rm -rf /workshop/workspace/dvc
mkdir -p /workshop/workspace/dvc
cp /workshop/notebooks/dvc/{deployment_location,params.yaml} /workshop/workspace/dvc
cp -r /workshop/notebooks/dvc/data /workshop/workspace/dvc
cp -r /workshop/notebooks/dvc/pipeline_scripts /workshop/workspace/dvc

In [19]:
import os
os.chdir("/workshop/workspace/dvc")

### Initialize Git

First initialize Git, as DVC works on top of it.

In [3]:
!git init

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint:
hint: 	git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: 	git branch -m <name>
Initialized empty Git repository in /workshop/workspace/dvc/.git/


**Optional:** Set your git configuration.

In [4]:
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

Add all copied files to your Git repository so that the changes DVC makes in the next steps are easy to see.

In [11]:
!git status

On branch master
nothing to commit, working tree clean


In [6]:
!git add .

In [10]:
!git commit -m 'initial commit'

[master (root-commit) 1df6e86] initial commit
 10 files changed, 44017 insertions(+)
 create mode 100644 data/genres_v2.csv
 create mode 100644 data/image.jpg
 create mode 100644 data/text.txt
 create mode 100644 data/winequality-red.csv
 create mode 100644 deployment_location
 create mode 100644 params.yaml
 create mode 100644 pipeline_scripts/download_data.py
 create mode 100644 pipeline_scripts/evaluate_model.py
 create mode 100644 pipeline_scripts/split_data.py
 create mode 100644 pipeline_scripts/train_model.py


In [12]:
!git status

On branch master
nothing to commit, working tree clean


### Initialize DVC

Similar to initializing a Git repository, we first have to initialize a DVC repository.
The '-f' flag makes sure you start with a fresh DVC repo: it overwrites any existing DVC repo in the given directory.

**Optional:** If you want to know what DVC is doing, add the '-v' flag. It runs the command in verbose mode and shows the steps it performs.

In [13]:
!dvc init -f

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

In [14]:
!dvc status -c

There are no data or pipelines tracked in this project yet.
See <https://dvc.org/doc/start> to get started!

In [15]:
!git status

On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   .dvc/.gitignore
	new file:   .dvc/config
	new file:   .dvcignore



DVC itself does not track any data or pipelines yet, but Git recognized three new files that DVC created and staged. Similar to Git, DVC stores some meta information inside the '.dvc' folder.

*Git-tracked dvc Files*

- The '.dvc/.gitignore' file makes sure no unwanted files are added to the git repo.
- The '.dvc/config' file can store configs of the dvc project (global or local). E.g. if you add a remote location, it will be noted in this file.
- The '.dvcignore' file works similarly to a '.gitignore' file, but for DVC file tracking.

**Add remote data storage**

We want to add a remote data storage, which we could use to share and back up copies of our data. This can be done via the `dvc remote add` command.
We could add a new remote storage (could be S3, GCS, SSH, ...) or use a local storage.
For now, a local storage is sufficient.

- '-d' makes sure this will be our default remote storage
- '-f' overwrites the existing remote storage
- 'local_storage' is the name of our new remote storage
- '/tmp/dvc/' is the path to our new remote storage

In [16]:
!dvc remote add -d -f local_storage /tmp/dvc

Setting 'local_storage' as a default remote.

Great! We now have set up our dvc project and remote storage. Let's track some files.

Files can be added to our versioning system manually or implicitly in a pipeline.
We will implement a pipeline later. For now, add the first files manually.

In [17]:
!dvc add data/image.jpg
!dvc add data/text.txt

ERROR:  output 'data/image.jpg' is already tracked by SCM (e.g. Git).
    You can remove it from Git, then add to DVC.
        To stop tracking from Git:
            git rm -r --cached 'data/image.jpg'
            git commit -m "stop tracking data/image.jpg" 
ERROR:  output 'data/text.txt' is already tracked 

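When `dvc add` succeeds, the logs show that DVC computes a hash of each file and adds the file to its cache. The errors above mean that the files are still tracked by Git from the initial commit. In that case, stop tracking them with the `git rm -r --cached` command from the error message, then run `dvc add` again.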

Let's check what has changed.

In [33]:
!cat .dvc/cache/files/md5/bf/3b8d17dc6b65d4270af15d5babb851

This is an example text file.
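
The cache path used in the `cat` command above follows a content-addressed layout: the first two hex characters of the file's MD5 hash form a directory, and the remaining characters form the file name. The following is a minimal sketch of that mapping, assuming the `files/md5` layout seen in this notebook. The helper names `md5_of_file` and `cache_path_for` are hypothetical and not part of DVC's API, and DVC's real implementation may differ in detail.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(path: Path, cache_root: Path = Path(".dvc/cache/files/md5")) -> Path:
    """Map a file to <cache_root>/<first 2 hex chars>/<remaining hex chars>."""
    file_hash = md5_of_file(path)
    return cache_root / file_hash[:2] / file_hash[2:]


# Demonstrate on a throwaway file so the sketch does not touch the workspace.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "text.txt"
    sample.write_text("This is an example text file.\n")
    print(cache_path_for(sample))
```

Because the path depends only on the content, two files with identical bytes map to the same cache entry.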


In [35]:
!ls .dvc/tmp/

btime  lock  rwlock  rwlock.lock


In [36]:
!git status

On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   .dvc/.gitignore
	new file:   .dvc/config
	new file:   .dvcignore

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/
	deployment_location
	params.yaml
	pipeline_scripts/



The `dvc add` command created two new files called `image.jpg.dvc` and `text.txt.dvc` inside the `data` folder.
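
A `.dvc` file is a small YAML pointer that Git can track in place of the data itself. The sketch below builds the kind of information such a pointer holds. It assumes a DVC 3 style layout with the fields `md5`, `size`, `hash` and `path`; the exact fields can differ between DVC versions, and the helpers `describe_output` and `render_pointer` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_output(path: Path) -> dict:
    """Collect the metadata a pointer file would record for one tracked file."""
    data = path.read_bytes()
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "size": len(data),
        "hash": "md5",
        "path": path.name,
    }


def render_pointer(entry: dict) -> str:
    """Render the entry as a one-item YAML list under the key 'outs'."""
    lines = ["outs:"]
    for index, (key, value) in enumerate(entry.items()):
        prefix = "- " if index == 0 else "  "
        lines.append(f"{prefix}{key}: {value}")
    return "\n".join(lines) + "\n"


# Demonstrate on a throwaway file instead of the real data/text.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "text.txt"
    sample.write_text("This is an example text file.\n")
    print(render_pointer(describe_output(sample)))
```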

To track the changes of the added files, we will commit all dvc files to Git.

In [None]:
!git add .

In [None]:
!git commit -m "initial commit"

Let's check our current status compared to the status of the defined remote.
Note: DVC has no Git-like staging area. Instead, it keeps a cache directory that is synced with the remote.

In [None]:
!dvc status -c

You can see that the two new files are not yet stored on the remote storage.

### Push the recent changes to the dvc remote storage.

In [None]:
!dvc push

In [None]:
!dvc status -c

Now, local and remote storage should be in sync.
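
Conceptually, `dvc status -c` compares the hashes in the local cache with those on the remote. A minimal sketch of that comparison, assuming the remote uses the same two-character directory layout as the cache; the helpers `stored_hashes` and `missing_on_remote` and the hashes in the demo are made up for illustration:

```python
import tempfile
from pathlib import Path


def stored_hashes(root: Path) -> set[str]:
    """Collect full hashes from a <first 2 chars>/<remaining chars> layout."""
    return {p.parent.name + p.name for p in root.glob("??/*") if p.is_file()}


def missing_on_remote(cache_root: Path, remote_root: Path) -> set[str]:
    """Hashes present in the local cache but absent from the remote."""
    return stored_hashes(cache_root) - stored_hashes(remote_root)


with tempfile.TemporaryDirectory() as tmp:
    cache, remote = Path(tmp) / "cache", Path(tmp) / "remote"
    # Two made-up hashes in the cache, only one of them on the remote.
    for root, names in ((cache, ["ab/cdef", "12/3456"]), (remote, ["ab/cdef"])):
        for name in names:
            target = root / name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
    print(missing_on_remote(cache, remote))  # → {'123456'}
```

After a push, the difference is empty, which corresponds to the "in sync" state.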

### Optional: Simulate a data update.

Make changes in the `data/text.txt` file.
Add the changes to dvc.
Add and commit the changes via Git.

If you need help, have a look at what you have done so far.

## Building a DVC Pipeline

For the next exercise, you will build a simple DVC pipeline.

The first step of the pipeline will be the `download` step.
The pipeline should execute the function `download_data` in the `./pipeline_scripts/download_data.py` file.
The data to be used is stored here: `http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv`.
The output should be stored here: `data/winequality-red.csv`.

The following command will create a configuration for the data pipeline containing the download stage. 

In [None]:
%%sh
dvc stage add -n download \
 -d ./pipeline_scripts/download_data.py \
 -d http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv \
 -o ./data/winequality-red.csv \
python ./pipeline_scripts/download_data.py http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv ./data/winequality-red.csv
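
The contents of `pipeline_scripts/download_data.py` are not shown in this notebook. As a rough idea of what a script behind this stage could look like, here is a minimal sketch that assumes the function takes a URL and an output path, matching the two command-line arguments above. The signature and body are assumptions, not the workshop's actual code.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_data(url: str, output_path: str) -> Path:
    """Fetch `url` and write the bytes to `output_path`, creating parent folders."""
    target = Path(output_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        target.write_bytes(response.read())
    return target


# Offline demonstration: a file:// URL stands in for the real HTTP source.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_bytes(b"a;b\n1;2\n")
    copy = download_data(source.as_uri(), str(Path(tmp) / "data" / "copy.csv"))
    print(copy.read_bytes() == source.read_bytes())  # → True
```

To be usable as a stage command, such a script would call `download_data(sys.argv[1], sys.argv[2])` when run from the command line.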

Create the command to add the next stage on your own.

The next step should be called `split`.
It will execute the `split_data` function of the `./pipeline_scripts/split_data.py` file.
The data to be used will be accessible here: `data/winequality-red.csv`.
The function will generate four outputs, which should be stored as follows: `data/x_train.csv`,
 `data/y_train.csv`, `data/x_test.csv` and `data/y_test.csv`.

In [None]:
%%sh 
dvc stage add -f -n split \
-d pipeline_scripts/split_data.py \
-d data/winequality-red.csv \
-o data/x_train.csv -o data/y_train.csv -o data/x_test.csv -o data/y_test.csv \
python pipeline_scripts/split_data.py data/winequality-red.csv

Create a third step for training. This step should be named `train` and use the `train_model`
function in the `./pipeline_scripts/train_model.py` file.
It will use the two parameters defined in `params.yaml`.
Hint: By default, DVC searches the `params.yaml` in the root folder for the named parameters,
so you don't need to include its path in the configuration.

- name: `train`
- function: `train_model`
- script file: `pipeline_scripts/train_model.py`
- data: `data/x_train.csv` and `data/y_train.csv`
- output: `data/model`
- parameters (-p): `alpha` and `l1_ratio`

In [None]:
%%sh 
dvc stage add -n train \
-d ./pipeline_scripts/train_model.py \
-d data/x_train.csv -d data/y_train.csv \
-o data/model \
-p alpha,l1_ratio \
python ./pipeline_scripts/train_model.py data/x_train.csv data/y_train.csv

Create a fourth step for evaluation. Here you will generate a metric file `data/result.json`.
It will execute the `evaluate_model` function of the `./pipeline_scripts/evaluate_model.py` file.

- name: `evaluate`
- function: `evaluate_model`
- script file: `pipeline_scripts/evaluate_model.py`
- data: `data/model`, `data/x_test.csv` and `data/y_test.csv`
- metric (-m): `data/result.json`

In [None]:
%%sh
dvc stage add -n evaluate \
-d ./pipeline_scripts/evaluate_model.py \
-d data/model -d data/x_test.csv -d data/y_test.csv \
-m data/result.json \
python ./pipeline_scripts/evaluate_model.py data/model data/x_test.csv data/y_test.csv

Start the pipeline.

In [None]:
!dvc repro

With the execution of the pipeline, a new file called `dvc.lock` was created.
It stores information about the last run of the pipeline, including data and script file hashes.

Commit the file to Git.

In [None]:
!git add .
!git commit -m "Add pipeline"

Optional: Try re-executing the pipeline.
You will see that DVC checks whether the pipeline steps or the underlying data have changed.
If you haven't changed anything, the pipeline steps will not be executed again.
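
The skip behaviour can be pictured as a hash comparison: a stage only needs to run again if one of its dependencies no longer matches the hash recorded after the last run. The sketch below illustrates that idea only. The helper `stage_is_stale` and the dictionary-based lock are hypothetical simplifications, and DVC itself also takes commands, parameters and outputs into account.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hex MD5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def stage_is_stale(deps: list[Path], locked: dict[str, str]) -> bool:
    """True if any dependency is missing from the lock or its hash changed."""
    return any(locked.get(str(dep)) != file_md5(dep) for dep in deps)


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "params.txt"
    dep.write_text("alpha: 0.5\n")
    lock = {str(dep): file_md5(dep)}    # state recorded after a run
    print(stage_is_stale([dep], lock))  # → False
    dep.write_text("alpha: 0.6\n")      # simulate an edit
    print(stage_is_stale([dep], lock))  # → True
```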

### Additional: Inspecting and Modifying a Pipeline 

In [None]:
!dvc dag
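
The graph shown by `dvc dag` follows from the stage definitions: a stage depends on every stage that produces one of its dependencies. As an illustration, the sketch below derives an execution order from the `-d` and `-o` values used in the four stages above. The `STAGES` dictionary and the helper `execution_order` are hypothetical stand-ins and do not reproduce how DVC stores or resolves its pipeline.

```python
from graphlib import TopologicalSorter

# Dependencies and outputs as passed to `dvc stage add` earlier in this notebook.
STAGES = {
    "download": {
        "deps": ["pipeline_scripts/download_data.py"],
        "outs": ["data/winequality-red.csv"],
    },
    "split": {
        "deps": ["pipeline_scripts/split_data.py", "data/winequality-red.csv"],
        "outs": ["data/x_train.csv", "data/y_train.csv",
                 "data/x_test.csv", "data/y_test.csv"],
    },
    "train": {
        "deps": ["pipeline_scripts/train_model.py",
                 "data/x_train.csv", "data/y_train.csv"],
        "outs": ["data/model"],
    },
    "evaluate": {
        "deps": ["pipeline_scripts/evaluate_model.py", "data/model",
                 "data/x_test.csv", "data/y_test.csv"],
        "outs": ["data/result.json"],
    },
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so that every producer runs before its consumers."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))  # → ['download', 'split', 'train', 'evaluate']
```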

In [None]:
!dvc status -c

In [None]:
!dvc push

In [None]:
!dvc status -c

Let's modify a file and reproduce our pipeline!

In [None]:
!dvc status

In [None]:
!dvc repro

### Additional: Compare Experiments

Change the `alpha` parameter and see how DVC tracks the change.

In [None]:
!sed -i -e "s/alpha:\s0.5/alpha: 0.6/g" params.yaml

In [None]:
!dvc params diff

In [None]:
!dvc repro

Have a look at the main training metrics and compare the current state ('workspace') to the state of 'HEAD'.

In [None]:
!dvc metrics show

In [None]:
!dvc metrics diff
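
A metrics diff boils down to loading the metric file from two states and subtracting the values. The sketch below shows that calculation on two small JSON documents. The metric names `rmse` and `r2` and all values are made up for illustration, since the contents of `data/result.json` are not shown in this notebook, and `metrics_diff` is a hypothetical helper.

```python
import json


def metrics_diff(old: dict, new: dict) -> dict:
    """Return old value, new value and change for metrics present in both states."""
    shared = sorted(old.keys() & new.keys())
    return {
        name: {"old": old[name], "new": new[name], "change": new[name] - old[name]}
        for name in shared
    }


# Made-up metric files standing in for the HEAD and workspace states.
head = json.loads('{"rmse": 1.0, "r2": 0.25}')
workspace = json.loads('{"rmse": 0.75, "r2": 0.5}')

for name, row in metrics_diff(head, workspace).items():
    print(name, row["old"], row["new"], row["change"])
```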

It is also possible to compare results from different branches.

In [None]:
%%bash
git checkout -b experiment_1
git add .
git commit -m "changed parameter alpha"

dvc metrics diff master experiment_1

### Additional: More Features

Get a file from another (external) git+DVC repository.

In [None]:
!dvc get https://github.com/iterative/example-get-started model.pkl

In [None]:
!rm model.pkl

Get a file *including* its .dvc file from another (external) git+DVC repository.

In [None]:
!dvc import https://github.com/iterative/example-get-started model.pkl

In [None]:
!cat model.pkl.dvc

### Experiment Tracking

New in dvc2: Experiment tracking, based on git: https://dvc.org/doc/start/experiments

## Clean-up

In [None]:
import os
os.chdir("/workshop/notebooks/dvc")

In [None]:
%%sh
rm -rf /workshop/workspace/dvc
rm -rf /tmp/dvc