# DATA AND PIPELINES USING DVC

Dror Atariah @ Sharecare Berlin

## Outline

* What are we trying to solve?
* Pillars of reproducibility
* Walk through the tutorial

# What's the problem again?

### How can we guarantee reproducibility?

![Phone in a toilet](./phone-in-toilet.png)

* Collect data
* Train a model
* Test it
* <span class="fragment">Now, do it again please...</span>

## Three Pillars of Reproducibility

* Code <span class="fragment">👍🏻</span>

* Environment <span class="fragment">👍🏻</span>

* Data... <span class="fragment">😳</span>

## Say hello to Data Version Control

https://dvc.org/

### Main use cases

* Version control models and data
* Track and reproduce experiments
* Share DS/ML work

![dvc overview](./dvc-overview.png)

## Tutorial

[Deep dive](https://dvc.org/doc/tutorials/deep)

## Setting up

Fetch the code:

```bash
mkdir classify
cd classify
git init
wget https://github.com/drorata/dvc-doc-tutorial/archive/v3.0.zip
unzip -j v3.0.zip -d code && rm -f v3.0.zip
git add code
echo code/__pycache__/ >> .gitignore && git add .gitignore
git commit -m "download code"
```

Start a virtual environment!

Initialize DVC:

```bash
dvc init
ls -a .dvc
git status -s
cat .dvc/.gitignore
git commit -am "init DVC"
```

## Fetch data

```bash
mkdir data
wget -P data https://data.dvc.org/tutorial/nlp/100K/Posts.xml.zip
```

```bash
# Track data using DVC
dvc add data/Posts.xml.zip
```

What's in there?

```bash
cat data/Posts.xml.zip.dvc

ls -l .dvc/cache/
```
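A minimal sketch of what the `.dvc` file and the cache seem to encode, assuming the DVC generation used in this tutorial: the file is identified by its MD5 checksum, and the cached copy sits under a directory named after the first two hex characters. The helper names `md5_of_file` and `cache_path` are made up for this sketch, and newer DVC releases may lay out the cache differently.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(checksum, cache_dir=".dvc/cache"):
    """Assumed layout: <cache_dir>/<first 2 hex chars>/<remaining chars>."""
    return Path(cache_dir) / checksum[:2] / checksum[2:]


target = Path("data/Posts.xml.zip")
if target.exists():
    # Compare with the md5 field printed by `cat data/Posts.xml.zip.dvc`
    print(cache_path(md5_of_file(target)))
```

If the assumption holds, the printed path should match one of the entries listed by `ls -l .dvc/cache/`.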

### Remarks

* Linking from cache to workspace ⚠️
* Import files from remote locations

## Let's do some processing

### Unzip the data

```bash
dvc run -d data/Posts.xml.zip -o data/Posts.xml \
    unzip data/Posts.xml.zip -d data/
```

**What just happened?**

```bash
cat Posts.xml.dvc
```

### More steps please

```bash
# Convert XML to TSV
dvc run -d data/Posts.xml -d code/xml_to_tsv.py -d code/conf.py \
    -o data/Posts.tsv \
    python code/xml_to_tsv.py
    
# Get the top 20k lines
dvc run -d data/Posts.tsv -d code/table_head.sh -o data/Posts_head.tsv \
    "bash code/table_head.sh data/Posts.tsv 20000 > data/Posts_head.tsv"
```

```bash
# Split to train-test sets
dvc run -d data/Posts_head.tsv -d code/split_train_test.py \
    -d code/conf.py \
    -o data/Posts-test.tsv -o data/Posts-train.tsv \
    python code/split_train_test.py 0.33 20180319
```
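Why pass `0.33` and `20180319`? Presumably the first is the test share and the second a random seed; that is an assumption about `split_train_test.py`, not something read from its source. A stdlib-only sketch of the idea, with a hypothetical `split_train_test` function:

```python
import random


def split_train_test(rows, test_ratio, seed):
    """Shuffle with a fixed seed, then cut off the test share."""
    rng = random.Random(seed)  # local generator, no global state touched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


rows = [f"post-{i}" for i in range(100)]
train_a, test_a = split_train_test(rows, 0.33, 20180319)
train_b, test_b = split_train_test(rows, 0.33, 20180319)

# Same seed, same split: this is what makes the stage reproducible
print(train_a == train_b and test_a == test_b)
```

Without a fixed seed, every `dvc repro` of this stage would produce different outputs and invalidate everything downstream.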

### Train, test...

```bash
dvc run -d code/featurization.py -d code/conf.py \
    -d data/Posts-train.tsv -d data/Posts-test.tsv \
    -o data/matrix-train.p -o data/matrix-test.p \
    python code/featurization.py
    
dvc run -d data/matrix-train.p -d code/train_model.py \
    -d code/conf.py -o data/model.p \
    python code/train_model.py 20180319
```

### and evaluate

```bash
dvc run -d data/model.p -d data/matrix-test.p \
    -d code/evaluate.py -d code/conf.py -M data/eval.txt \
    -f Dvcfile \
    python code/evaluate.py
```

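**Remarks**
* `-f Dvcfile` names the stage file; `Dvcfile` is the default target of `dvc repro`
* `-M` marks `data/eval.txt` as a metric that DVC does not cache, so it can be committed to Git:
    * `cat data/eval.txt` and
    * `dvc metrics show`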

## To DAG or not to DAG?

```bash
dvc pipeline show --ascii
```
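The graph can also be written down by hand. A rough sketch, with stage labels invented for this slide and dependencies read off the `dvc run` commands above, using the standard-library `graphlib` (Python 3.9+):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# stage -> stages it depends on; the labels are made up for this sketch
PIPELINE = {
    "unzip": set(),
    "xml_to_tsv": {"unzip"},
    "head": {"xml_to_tsv"},
    "split": {"head"},
    "featurize": {"split"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}

# A topological order runs every stage after all of its dependencies
order = list(TopologicalSorter(PIPELINE).static_order())
print(" -> ".join(order))
```

This is not how DVC stores the pipeline; it only shows that the stage files together describe a directed acyclic graph.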

## Tweaking the pipeline

Try out: `dvc repro`
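Roughly, `dvc repro` reruns a stage when one of its dependencies changed, and then everything downstream. A simplified sketch of that rule: stage labels are invented, dependencies and outputs are copied from the commands above, and, unlike DVC, it assumes a rerun stage always changes its outputs.

```python
# (label, dependencies, outputs) in pipeline order
STAGES = [
    ("unzip", ["data/Posts.xml.zip"], ["data/Posts.xml"]),
    ("xml_to_tsv",
     ["data/Posts.xml", "code/xml_to_tsv.py", "code/conf.py"],
     ["data/Posts.tsv"]),
    ("head",
     ["data/Posts.tsv", "code/table_head.sh"],
     ["data/Posts_head.tsv"]),
    ("split",
     ["data/Posts_head.tsv", "code/split_train_test.py", "code/conf.py"],
     ["data/Posts-test.tsv", "data/Posts-train.tsv"]),
    ("featurize",
     ["code/featurization.py", "code/conf.py",
      "data/Posts-train.tsv", "data/Posts-test.tsv"],
     ["data/matrix-train.p", "data/matrix-test.p"]),
    ("train",
     ["data/matrix-train.p", "code/train_model.py", "code/conf.py"],
     ["data/model.p"]),
    ("evaluate",
     ["data/model.p", "data/matrix-test.p",
      "code/evaluate.py", "code/conf.py"],
     ["data/eval.txt"]),
]


def stages_to_rerun(changed_files):
    """Return stages whose deps changed, directly or via upstream outputs."""
    dirty = set(changed_files)
    rerun = []
    for label, deps, outs in STAGES:
        if dirty.intersection(deps):
            rerun.append(label)
            dirty.update(outs)  # simplification: outputs count as changed
    return rerun


print(stages_to_rerun(["code/featurization.py"]))
print(stages_to_rerun(["code/train_model.py"]))
```

Editing `code/conf.py` would mark almost the whole pipeline as stale, since nearly every stage lists it as a dependency.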

### Tweak feature engineering

```python
# Start a new branch
# git switch -c bigrams

# Edit code/featurization.py @ L50
bag_of_words = CountVectorizer(stop_words='english',
                               max_features=6000,
                               ngram_range=(1, 2))
```
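What does `ngram_range=(1, 2)` add? Single words plus pairs of adjacent words. A plain-Python illustration of the idea; the `ngrams` helper is made up here and skips the tokenization and stop-word removal that scikit-learn performs:

```python
def ngrams(tokens, ngram_range=(1, 2)):
    """List all contiguous n-grams for every n in the inclusive range."""
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


tokens = "train a model".split()
unigrams_only = ngrams(tokens, (1, 1))
with_bigrams = ngrams(tokens, (1, 2))
print(unigrams_only)
print(with_bigrams)
```

More candidate features compete for the same `max_features` slots, which is why the metric may move in either direction.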

```
dvc repro
git commit -am "bigrams"
```

With the changes committed to the branch, compare metrics across all branches:

```
dvc metrics show -a
```

Let's try something else.

But first, switch back to `master`:

```
git switch master
```

**How do we update the data?**

```
dvc checkout
```

```
dvc repro
```

### Tweak model

```
git switch -c tuning
```


... edit:

```python
# code/train_model.py @ L27
clf = RandomForestClassifier(n_estimators=700,
                             n_jobs=6, random_state=seed)
```

... run

```
dvc repro
```

Commit and review!

### Combine changes

Let's combine the two and see what happens...

## Sharing data/models

Try:

```
dvc remote add -d upstream s3://feingoldtech-research-dror/dvc-demo
git status -s
```

* Next you can use `dvc push` and `dvc pull`
* Others can clone the repo and pull the data

![Thanks](./thank-you-french-fries.jpg)