# Lightweight Development Pipelines with DVC

In this notebook we highlight the most important elements of DVC. Extensive documentation is available on the [DVC website](https://dvc.org).

As a showcase we will implement a simple classification pipeline.

### Some Preparations

In [1]:
# --no-scm because we don't want to interfere with the workshop's Git repository
!dvc init -f --no-scm

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|              https://dvc.org/doc/user-guide/analytics               |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: https://dvc.org/doc
- Get help and share ideas: https://dvc.org/chat
- Star us on GitHub: https://github.com/iterative/dvc

Optional: we add a new remote storage. Here it is a local directory, but it could equally be S3, GCS, SSH, ...

In [2]:
!dvc remote add -d -f local_storage /tmp/dvc_introduction

Setting 'local_storage' as a default remote.

Let's check our current status. Note that DVC has no Git-like staging area. Instead it keeps a local cache directory, which is synced with the remote. The `-c` (`--cloud`) flag makes `dvc status` compare that local cache with the remote.

In [3]:
!dvc status -c

Data and pipelines are up to date.

That wasn't too surprising...
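The relationship between the local cache and the remote can be pictured with a small Python sketch. It is a simplified illustration, not DVC's implementation: the helper names (`file_md5`, `add_to_cache`, `new_in_cache`) are hypothetical, and the `<first two hex digits>/<remaining digits>` directory layout is an assumption modeled on how DVC releases of this era arrange their cache.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's content."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def object_path(store: Path, checksum: str) -> Path:
    """Assumed layout: <store>/<first 2 hex digits>/<remaining digits>."""
    return store / checksum[:2] / checksum[2:]


def add_to_cache(path: Path, cache_dir: Path) -> str:
    """Copy a file into a content-addressed cache and return its checksum."""
    checksum = file_md5(path)
    target = object_path(cache_dir, checksum)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(path, target)
    return checksum


def cached_checksums(store: Path) -> set:
    """Collect the checksums of all objects stored under a directory."""
    return {p.parent.name + p.name for p in store.glob("??/*") if p.is_file()}


def new_in_cache(local_cache: Path, remote: Path) -> set:
    """Objects present locally but missing on the remote."""
    return cached_checksums(local_cache) - cached_checksums(remote)


def demo() -> tuple:
    """Cache one file, compare with an empty remote, 'push', compare again."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        local_cache, remote = root / "cache", root / "remote"
        local_cache.mkdir()
        remote.mkdir()
        data = root / "config.pickle"
        data.write_bytes(b"example content")  # placeholder content
        checksum = add_to_cache(data, local_cache)
        before = new_in_cache(local_cache, remote)
        # A "push" in this sketch just copies the missing objects over.
        for missing in before:
            target = object_path(remote, missing)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(object_path(local_cache, missing), target)
        after = new_in_cache(local_cache, remote)
        return checksum, before, after


if __name__ == "__main__":
    checksum, before, after = demo()
    print("new before push:", before)
    print("new after push: ", after)
```

With nothing in the cache, both sets are empty, which corresponds to the "up to date" message above.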

We can put files under DVC control either explicitly with `dvc add` or implicitly, as outputs of a pipeline stage.

### Building a Pipeline

In [4]:
%%sh 
dvc run -f configure.dvc \
        -d dvc_introduction.py \
        -o output-introduction/config.pickle \
        python dvc_introduction.py configure output-introduction/config.pickle

Running command:
	python dvc_introduction.py configure output-introduction/config.pickle


2020-02-19 11:28:36.093266: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-02-19 11:28:36.093592: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-02-19 11:28:36.093648: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.


In [5]:
%%sh 
dvc run -f train.dvc \
        -d dvc_introduction.py \
        -d output-introduction/config.pickle \
        -d ../00-datasets/iris.data.csv \
        -o output-introduction/model \
        python dvc_introduction.py train_model ../00-datasets/iris.data.csv \
                                               output-introduction/config.pickle \
                                               output-introduction/model

Running command:
	python dvc_introduction.py train_model ../00-datasets/iris.data.csv output-introduction/config.pickle output-introduction/model
Train for 15 steps
Epoch 1/2
Epoch 2/2


2020-02-19 11:28:42.152258: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-02-19 11:28:42.152628: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-02-19 11:28:42.152683: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-02-19 11:28:43.952509: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-19 11:28:43.952577: E tensorflow/stream_executor/cuda

In [6]:
%%sh 
dvc run -f Dvcfile \
        -d dvc_introduction.py \
        -d output-introduction/model \
        -O ../04-models/iris/2 \
        python dvc_introduction.py export output-introduction/model ../04-models/iris/2

Running command:
	python dvc_introduction.py export output-introduction/model ../04-models/iris/2
Output '../04-models/iris/2' doesn't use cache. Skipping saving.


2020-02-19 11:28:49.098102: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-02-19 11:28:49.098264: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-02-19 11:28:49.098282: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-02-19 11:28:50.883051: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-19 11:28:50.883111: E tensorflow/stream_executor/cuda

### Inspecting and Modifying a Pipeline 

In [7]:
%%sh 
dvc pipeline show --ascii

+---------------+  
| configure.dvc |  
+---------------+  
        *          
        *          
        *          
  +-----------+    
  | train.dvc |    
  +-----------+    
        *          
        *          
        *          
   +---------+     
   | Dvcfile |     
   +---------+     
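
DVC is never told the order of the stages explicitly. It derives the graph above by matching each stage's dependencies (`-d`) against the outputs (`-o`/`-O`) of the other stages. A minimal sketch of that idea, using the paths from the three `dvc run` calls; the `STAGES` dictionary and the helper names are hypothetical, and this illustrates the principle rather than DVC's own code:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies and outputs as declared in the three `dvc run` cells.
STAGES = {
    "configure.dvc": {
        "deps": ["dvc_introduction.py"],
        "outs": ["output-introduction/config.pickle"],
    },
    "train.dvc": {
        "deps": [
            "dvc_introduction.py",
            "output-introduction/config.pickle",
            "../00-datasets/iris.data.csv",
        ],
        "outs": ["output-introduction/model"],
    },
    "Dvcfile": {
        "deps": ["dvc_introduction.py", "output-introduction/model"],
        "outs": ["../04-models/iris/2"],
    },
}


def upstream_stages(stages: dict) -> dict:
    """Map each stage to the stages that produce one of its dependencies."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


def execution_order(stages: dict) -> list:
    """Order the stages so that every producer runs before its consumers."""
    return list(TopologicalSorter(upstream_stages(stages)).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(STAGES)))
```

Files such as `dvc_introduction.py` or the CSV dataset are produced by no stage, so they add no edges; they only matter when checking whether a stage is out of date.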



In [8]:
!dvc status -c

	new:                output-introduction/model
	new:                output-introduction/model/saved_model.pb
	new:                output-introduction/model/variables/variables.data-00000-of-00001
	new:                output-introduction/model/variables/variables.index
	new:                output-introduction/config.pickle

In [9]:
!dvc push


In [10]:
!dvc repro

Data and pipelines are up to date.

Let's modify a file and reproduce our pipeline!
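
Why does editing a file make `dvc repro` do work again? Each stage file records a checksum for every dependency, and a stage is rerun when a current checksum no longer matches the recorded one. The sketch below mimics that check in plain Python. The function names and the in-memory `recorded` dictionary are hypothetical stand-ins, and DVC's real logic covers more cases (outputs, changed commands, directories).

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's content."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def record_checksums(deps: list) -> dict:
    """Snapshot dependency checksums, as a stage file does after a run."""
    return {str(dep): file_md5(Path(dep)) for dep in deps}


def changed_deps(recorded: dict) -> list:
    """Return the dependencies whose content differs from the snapshot."""
    return [
        path
        for path, checksum in recorded.items()
        if not Path(path).exists() or file_md5(Path(path)) != checksum
    ]


def demo() -> tuple:
    """Record a dependency, check it, modify it, check it again."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "dvc_introduction.py"
        script.write_text("EPOCHS = 2\n")  # hypothetical file content
        recorded = record_checksums([script])
        unchanged = changed_deps(recorded)  # nothing to reproduce yet
        script.write_text("EPOCHS = 3\n")  # modify the dependency
        modified = changed_deps(recorded)  # the stage is now out of date
        return unchanged, [Path(p).name for p in modified]


if __name__ == "__main__":
    unchanged, modified = demo()
    print("changed before edit:", unchanged)
    print("changed after edit: ", modified)
```

Because `dvc_introduction.py` is a dependency of all three stages in this notebook, editing it invalidates the whole pipeline.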

#### New Features

Get a file from another (external) Git+DVC repository. `dvc get` only downloads the file and keeps no link to its source:

In [11]:
!dvc get https://github.com/iterative/example-get-started model.pkl


In [12]:
!rm model.pkl

Get a file *including* its .dvc file from another (external) Git+DVC repository. `dvc import` additionally writes a .dvc file that records where the file came from:

In [13]:
!dvc import https://github.com/iterative/example-get-started model.pkl

Importing 'model.pkl (https://github.com/iterative/example-get-started)' -> 'model.pkl'

#### New Features in real life

#### Metrics and evaluation
Metrics can be used to track scores and evaluation results across all branches:

```bash
$ dvc metrics show --all-branches
experiment1:
    metrics.json: {"loss": 0.0012, "accuracy": 0.9765}
experiment2:
    metrics.json: {"loss": 0.0010, "accuracy": 0.9865}
working tree:
    metrics.json: {"loss": 0.0010, "accuracy": 0.9865}
```
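
For `dvc metrics show` to have something to report, a stage has to write a metrics file. A minimal sketch of how a training or evaluation step might do that: `write_metrics` and `read_metrics` are hypothetical helpers, only the `metrics.json` name and the `loss`/`accuracy` keys are taken from the output above, and the numbers in `demo` are placeholders rather than real results.

```python
import json
import tempfile
from pathlib import Path


def write_metrics(path, loss, accuracy):
    """Write scalar scores as a flat JSON object and return them."""
    metrics = {"loss": float(loss), "accuracy": float(accuracy)}
    Path(path).write_text(json.dumps(metrics) + "\n")
    return metrics


def read_metrics(path):
    """Load the scores back, e.g. to compare two runs by hand."""
    return json.loads(Path(path).read_text())


def demo():
    """Round-trip placeholder scores through a temporary metrics.json."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "metrics.json"
        write_metrics(path, loss=0.5, accuracy=0.75)  # placeholder values
        return read_metrics(path)


if __name__ == "__main__":
    print(demo())
```

In the DVC versions this notebook targets, such a file would presumably be declared as a metric when the stage is created, for example with the `-M` option of `dvc run`, so that DVC knows which output to display.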


#### Releasing and Deployment with git tags
Git tags can be used to keep track of releases:

```bash
$ git checkout master
$ git merge experiment2
$ git tag -a release/0.1 -m "0.1 release"
```

Then use `dvc get` to download the released model (e.g. in a deploy job):
```bash
$ GIT_REPO=...
$ dvc get --rev release/0.1 $GIT_REPO model.h5
```
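
If the deploy job is written in Python rather than shell, the same call can be wrapped in a small function. This is a sketch under assumptions: `dvc_get_command` and `fetch_release` are hypothetical names, the repository URL has to be supplied by the caller, and `dvc` must be installed on the machine that runs the job.

```python
import subprocess


def dvc_get_command(repo_url, path, rev, out=None):
    """Build the argument list for `dvc get` pinned to a Git revision."""
    cmd = ["dvc", "get", "--rev", rev]
    if out is not None:
        cmd += ["--out", out]  # optional target path for the download
    return cmd + [repo_url, path]


def fetch_release(repo_url, path, rev, out=None):
    """Run `dvc get`; raises CalledProcessError if the download fails."""
    subprocess.run(dvc_get_command(repo_url, path, rev, out), check=True)


if __name__ == "__main__":
    # Only prints the command; the URL below is a placeholder.
    print(" ".join(dvc_get_command("<GIT_REPO>", "model.h5", "release/0.1")))
```

Building the argument list separately from running it keeps the command easy to log and to check without network access.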

Metrics can also give an overview of the releases and their performance:

```bash
$ dvc metrics show -T
release/0.1:
    metrics.json: {"loss": 0.0112, "accuracy": 0.9865}
working tree:
    metrics.json: {"loss": 0.0112, "accuracy": 0.9865}
```

#### Debug only - pls ignore :-)

In [14]:
%%sh
rm -rf .dvc
rm -rf *.dvc
rm Dvcfile
rm -rf /tmp/dvc_introduction