# COMMAND LINE FUNCTIONS WORTH REMEMBERING

# Data Version Control (DVC)

## Introduction

The main aim of this exercise is to familiarize students with the awesome `dvc` tool for data and model versioning in machine learning/data mining projects.

The easiest way to install the library is inside a Python virtual environment or with Conda, although direct installation from a repository is also possible. All installation details can be found on the [project's website](https://dvc.org/doc/install/linux).

In [1]:
%%bash

pip install dvc


The first step is to create a directory and to initialize `git` inside it.

In [None]:
%%bash

mkdir dvc-tutorial

cd dvc-tutorial

git init

In [None]:
%cd dvc-tutorial

In [None]:
%%bash

dvc init

In [None]:
%%bash

git status

In [None]:
%%bash

git add .dvc/plots/*
git add .dvc/config
git add .dvc/.gitignore
git add .dvcignore

git commit -m "Initialize DVC for the project"

## Data versioning

The main goal of `dvc` is to enable versioning of large data files. Using `git` for this purpose is [quite problematic](https://docs.github.com/en/github/managing-large-files/working-with-large-files). In this laboratory we will use `dvc` to work with different versions of the same data file.

Before starting the laboratory you should download and locally store the `adult.data` and `adult.names` files from the [UCI ML Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/).

In [None]:
%%bash

mkdir data
cp /path/to/files/adult* data

In [None]:
%%bash

dvc add data/adult.data
dvc add data/adult.names

Let's take a look at the files which were automatically created as the result of adding data files to the repo.

In [None]:
%%bash

cat data/adult.data.dvc

In [None]:
%%bash

cat data/adult.names.dvc
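
Each `.dvc` file is a small text file that, among other fields, records a checksum of the tracked file's content, which is how a changed data file can be detected. As a rough illustration, the sketch below computes an MD5 digest of a file in chunks with Python's `hashlib`. The helper name `file_md5` is hypothetical, and it is an assumption (not verified here) that the digest matches the one written by your DVC version.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large data files never sit fully in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    data_file = Path("data/adult.data")
    if data_file.exists():
        print(file_md5(data_file))
```

Comparing the printed value with the checksum in `data/adult.data.dvc` is a quick way to check whether the two agree in your setup.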

To track changes in the data files we need to add the `*.dvc` and `data/.gitignore` files to the Git repository.

In [None]:
%%bash

git add data/.gitignore data/adult.data.dvc data/adult.names.dvc
git commit -m "Added ADULT dataset"

Our next step is to create a remote data repository. DVC works with many external data sources, including Amazon S3, Google Cloud Storage, remote servers accessible via `ssh`, HDFS systems, and many more. We will use a local directory to simulate an external repo.

In [None]:
%%bash

mkdir -p ~/dvcrepo
dvc remote add -d repozytorium ~/dvcrepo
git commit .dvc/config -m "Added local directory simulating remote data repository"

In [None]:
%%bash

dvc push

In [None]:
%%bash

ls -al ~/dvcrepo/

In [None]:
%%bash 

ls -al ~/dvcrepo/1a/

In [None]:
%%bash

cat ~/dvcrepo/1a/7cdb3ff7a1b709968b1c7a11def63e
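
The listings above suggest how the remote is organised: the first two characters of a file's hash name a subdirectory, and the remaining characters name the file inside it. A minimal sketch of that mapping follows. `remote_path` is a hypothetical helper, and the layout is inferred from the path shown above (other DVC versions may add further directory levels).

```python
from pathlib import PurePosixPath


def remote_path(remote_root, file_hash):
    """Map a content hash to its location: <root>/<first 2 chars>/<rest>."""
    if len(file_hash) < 3:
        raise ValueError("hash is too short to split")
    return PurePosixPath(remote_root) / file_hash[:2] / file_hash[2:]


print(remote_path("~/dvcrepo", "1a7cdb3ff7a1b709968b1c7a11def63e"))
```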

The remote repo can be used to restore the original versions of data files, for example when undoing unwanted changes or re-creating an experimental branch.

In [None]:
%%bash

rm -rf .dvc/cache/
rm data/adult.data
rm data/adult.names

ls -al data/

In [None]:
%%bash

dvc pull

ls -al data/

In the next step we will change the data file by removing all information about federal employees. Let's check how many such records we have, and then remove them.

In [None]:
%%bash

cat data/adult.data | wc -l
grep 'Federal-gov' data/adult.data | wc -l

In [None]:
%%bash

sed -i "/Federal-gov/d" data/adult.data
cat data/adult.data | wc -l
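
For readers who prefer Python, the same count-and-filter operation can be sketched as below. `drop_matching` is a hypothetical helper, and the sample rows are shortened, made-up stand-ins for lines of `data/adult.data`.

```python
def drop_matching(lines, needle):
    """Return (kept_lines, removed_count) after dropping lines containing needle."""
    kept = [line for line in lines if needle not in line]
    return kept, len(lines) - len(kept)


# Made-up sample rows standing in for data/adult.data
sample = [
    "39, State-gov, 77516, Bachelors\n",
    "50, Federal-gov, 83311, Bachelors\n",
    "38, Private, 215646, HS-grad\n",
]
kept, removed = drop_matching(sample, "Federal-gov")
print(f"removed {removed} of {len(sample)} rows")
```

To apply it to the real file, one could read the lines with `Path("data/adult.data").read_text().splitlines(keepends=True)` and write the kept lines back.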

In [None]:
%%bash 

dvc add data/adult.data
git commit data/adult.data.dvc -m "Removed federal workers from the dataset"

dvc push

If we want to roll back this change, we need to restore the correct version of the `adult.data.dvc` file and run the `dvc checkout` command to synchronize the data with it.

In [None]:
%%bash 

git log

In [None]:
%%bash 

git checkout 34685237371f63dc2fa2f997ce9f2aa514c0ffe9 data/adult.data.dvc
dvc checkout

In [None]:
%%bash

grep 'Federal-gov' data/adult.data

In [None]:
%%bash 

git commit data/adult.data.dvc -m "Reverting the deletion of federal employees"

## Access to remote data repositories

Having configured a `git` repo with `dvc`, we can easily use `dvc` to quickly download data and models, share the data, etc. The results of the previous section were stored in the [https://github.com/megaduks/dvc-tutorial](https://github.com/megaduks/dvc-tutorial) repo, and now we will see how we can use a remote repo to work with the data.

In [None]:
%%bash 

dvc list https://github.com/megaduks/dvc-tutorial data

All datasets can be downloaded using a single command, e.g. to initialize a new project.

In [None]:
%%bash

mkdir new_project
cd new_project
dvc get https://github.com/megaduks/dvc-tutorial data

In [None]:
%%bash

ls -al new_project/data/

Unfortunately, using the above command we have lost the information on the origin of the data and we can't re-connect the locally downloaded data with the remote repository. The `dvc get` command resembles `wget` in this regard. If we want to keep the connection between remote and local data, we must use `dvc import`.

In [None]:
%%bash

mkdir -p newer_project/data
dvc import https://github.com/megaduks/dvc-tutorial/ data/adult.data \
    -o newer_project/data/adult.data

In [None]:
%%bash

cat newer_project/data/adult.data.dvc

As we can see, the metadata of the `adult.data` file contain information on the remote repository from which the data originates. Precise hashes identifying a particular version of the data file are stored as well. In addition, we can easily track changes to the original data in the remote repo.

In [None]:
%%bash

dvc update newer_project/data/adult.data.dvc

DVC also offers a programmatic API to access data in remote repos.

In [None]:
import dvc.api

with dvc.api.open('data/adult.data', repo='https://github.com/megaduks/dvc-tutorial') as f:
    for _ in range(10):
        print(f.readline())

## Data flows

The most interesting functionality offered by `dvc` is the ability to manage reproducible data workflows. We will use the following flow to illustrate this concept:

- we will pre-process data by removing selected records
- we will add a new feature
- we will train a simple model
- we will evaluate the quality of the model

The code in the following examples is very simplified, but its purpose is to illustrate the concept of reproducible data flows. First, we need to install some additional dependencies.

In [None]:
%%bash

pip install pandas pyyaml scikit-learn scipy

We will create the first step of the data flow. In this step we read in the text file and split it into a training set and a test set, each stored as a CSV file.

Create a `params.yaml` file and put the following inside:

```
prepare:
  split: 0.75
  seed: 42
```

Next, create a `prepare.py` file with the following code.

In [None]:
import sys
from pathlib import Path

import pandas as pd
import yaml
from sklearn.model_selection import train_test_split

# Read the parameters of the `prepare` stage
with open('params.yaml') as f:
    params = yaml.safe_load(f)['prepare']

split = params['split']
seed = params['seed']

input_file = Path(sys.argv[1])
output_dir = Path('data') / 'prepared'
train_output = output_dir / 'train.csv'
test_output = output_dir / 'test.csv'

output_dir.mkdir(parents=True, exist_ok=True)

# adult.data has no header row, so do not treat the first record as one
df = pd.read_csv(input_file, sep=',', header=None)

# Pass the seed to scikit-learn; seeding Python's `random` module has no effect here
train_df, test_df = train_test_split(df, train_size=split, random_state=seed)

# Write plain CSV files without a header row and without the index column
train_df.to_csv(train_output, header=False, index=False)
test_df.to_csv(test_output, header=False, index=False)

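Now we create the first stage of the data flow with `dvc run` (DVC 3.0 and later replace this command with `dvc stage add`, which accepts the options used here, followed by `dvc repro`). In this stage we: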
- create a named step (`-n prepare`)
- pass parameters (`-p prepare.seed,prepare.split`)
- pass dependencies (`-d prepare.py -d data/adult.data`)
- indicate the output (`-o data/prepared/`)
- run the script and pass parameter values
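
The names passed with `-p` are dotted paths into `params.yaml`: `prepare.split` refers to the `split` key inside the `prepare` section. The sketch below illustrates that lookup on a plain dictionary mirroring the parameter file. `resolve_param` is a hypothetical helper written for this illustration, not part of DVC.

```python
# Dictionary mirroring the params.yaml file shown above
params = {"prepare": {"split": 0.75, "seed": 42}}


def resolve_param(tree, dotted_name):
    """Walk a nested dict following a dotted path such as 'prepare.split'."""
    node = tree
    for key in dotted_name.split("."):
        node = node[key]  # raises KeyError if the parameter is missing
    return node


for name in "prepare.seed,prepare.split".split(","):
    print(name, "=", resolve_param(params, name))
```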

In [None]:
%%bash

dvc run -n prepare \
    -p prepare.seed,prepare.split \
    -d prepare.py -d data/adult.data \
    -o data/prepared \
    python prepare.py data/adult.data

As a result, we get the output files and a special `dvc.yaml` file with a human-readable description of the data flow configuration.

In [None]:
%%bash

cat dvc.yaml

In [None]:
%%bash 

ls -al data/prepared/

The second step adds a data transformation to the data flow. We will re-code all categorical attributes using [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder) and we will compute feature interactions using [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures). The latter class takes a `degree` parameter. Update the parameter file to account for the second step.

```
prepare:
  split: 0.75
  seed: 42
featurize:
  degree: 2
```
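
To make the role of `degree` concrete: with `interaction_only=True`, `PolynomialFeatures` outputs a constant term, the original features, and products of distinct features up to the given degree. The pure-Python sketch below reproduces that expansion for a single row. `interaction_features` is a hypothetical helper, and the ordering of the terms is assumed to follow scikit-learn's.

```python
from itertools import combinations
from math import prod


def interaction_features(row, degree):
    """Bias term, then products of distinct features for sizes 1..degree."""
    features = [1]
    for size in range(1, degree + 1):
        for combo in combinations(row, size):
            features.append(prod(combo))
    return features


print(interaction_features([2, 3, 5], degree=2))
```

Raising `degree` to 3 would add the single three-way product to the end of the list, which shows how quickly the number of features grows with more input columns.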

Create the `featurize.py` file.

In [None]:
import pandas as pd
import numpy as np
import yaml
import sys
import pickle

from pathlib import Path
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures

params = yaml.safe_load(open('params.yaml'))['featurize']
degree = params['degree']

input_dir = sys.argv[1]
output_dir = sys.argv[2]

Path(output_dir).mkdir(exist_ok=True)

train_file = Path(input_dir) / 'train.csv'
test_file = Path(input_dir) / 'test.csv'

col_names = [
        'age',
        'workclass',
        'weight',
        'education',
        'edu-num',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'capital-gain',
        'capital-loss',
        'hours-per-week',
        'native-country',
        'class'
]

train_df = pd.read_csv(train_file, sep=',', names=col_names)
test_df = pd.read_csv(test_file, sep=',', names=col_names)

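# Fit each encoder on train and test values together so both sets share the same codes
for col in col_names:
    encoder = LabelEncoder().fit(pd.concat([train_df[col], test_df[col]]))
    train_df[col] = encoder.transform(train_df[col])
    test_df[col] = encoder.transform(test_df[col])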

poly = PolynomialFeatures(degree=degree, interaction_only=True)

train_y = train_df['class']
test_y = test_df['class']

train_df = train_df.drop('class', axis=1)
test_df = test_df.drop('class', axis=1)

train_df = np.column_stack((poly.fit_transform(train_df), train_y))
# Reuse the transformer fitted on the training data for the test set
test_df = np.column_stack((poly.transform(test_df), test_y))

train_output = Path(output_dir) / 'train.p'
test_output = Path(output_dir) / 'test.p'

with open(train_output, 'wb') as f:
    pickle.dump(train_df, f)

with open(test_output, 'wb') as f:
    pickle.dump(test_df, f)


The data flow can be executed by running the following command.

In [None]:
%%bash

dvc run -n featurize \
    -p featurize.degree \
    -d featurize.py -d data/prepared/ \
    -o data/features \
    python featurize.py data/prepared/ data/features/

In order not to lose the results of our work, we should record the data flow steps in the `git` repo.

In [None]:
%%bash

git add data/.gitignore dvc.lock dvc.yaml
git commit -m 'Added preparation and featurization steps to data pipeline'

The third step is to run model training. We will use a simple script with Random Forest, and we will use two parameters: the number of trees in the forest and the maximum depth of each tree. Change the parameter file in the following way:

```
prepare:
  split: 0.75
  seed: 42
featurize:
  degree: 2
train:
  max_depth: 2
  n_estimators: 5
```

Create the `train.py` file:

In [None]:
import sys
import yaml
import pickle

from pathlib import Path
from sklearn.ensemble import RandomForestClassifier

params = yaml.safe_load(open('params.yaml'))['train']
max_depth = params['max_depth']
n_estimators = params['n_estimators']

input_dir = sys.argv[1]
output_dir = sys.argv[2]

Path(output_dir).mkdir(exist_ok=True)

train_file = Path(input_dir) / 'train.p'
model_file = Path(output_dir) / 'model.p'

with open(train_file, 'rb') as f:
    train_df = pickle.load(f)

X = train_df[:, :-1]
y = train_df[:, -1]

clf = RandomForestClassifier(
    n_estimators=n_estimators,
    max_depth=max_depth
)
clf.fit(X, y)

with open(model_file, 'wb') as f:
    pickle.dump(clf, f)


As you can see, the script expects two parameters to be passed via the command line (the input directory with the data and the output directory to store the results of the script). To add the training step to the data flow execute the following command:

In [None]:
%%bash

dvc run -n train \
    -p train.max_depth,train.n_estimators \
    -d train.py -d data/features/ \
    -o data/models/ \
    python train.py data/features/ data/models/

As usual we record the changes in the data flow in `git`.

In [None]:
%%bash

git add data/.gitignore dvc.lock dvc.yaml
git commit -m 'Added training step to data pipeline'

Why have we created the `dvc.yaml` file? At first glance it might seem overly complicated. But this is where `dvc` truly shines: the presence of the full definition of the data flow allows for full reproducibility using a single command.

In [None]:
%%bash

dvc repro

Let's change a single parameter in the `train` section (e.g., change the number of trees in the RandomForest) and re-run the experiment. Which steps have been executed? Change another parameter in the `prepare` section (e.g. the way train/test split is performed) and re-run the experiment once again. Has something changed?
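
One way to reason about these questions is to view the pipeline as a chain in which each stage depends on the output of the previous one: a change to a stage's parameters invalidates that stage and everything downstream of it. The sketch below models this with a hypothetical `DEPENDS_ON` table mirroring the three stages defined so far. It is a simplified illustration, not DVC's actual algorithm (DVC also compares file hashes and may skip stages whose inputs turn out to be unchanged).

```python
# Hypothetical model of the pipeline: stage -> stages it depends on
DEPENDS_ON = {
    "prepare": [],
    "featurize": ["prepare"],
    "train": ["featurize"],
}


def stages_to_rerun(changed_stage):
    """Return the changed stage plus every stage downstream of it."""
    affected = {changed_stage}
    grew = True
    while grew:  # keep sweeping until no new stage is added
        grew = False
        for stage, deps in DEPENDS_ON.items():
            if stage not in affected and affected.intersection(deps):
                affected.add(stage)
                grew = True
    # Preserve pipeline order for readability
    return [stage for stage in DEPENDS_ON if stage in affected]


print(stages_to_rerun("train"))
print(stages_to_rerun("prepare"))
```

Comparing these lists with what `dvc repro` actually prints is a useful check of your understanding.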

If you want to visualize the data flow, use the `dvc dag` command.

## Experiments

The last element of the `dvc` framework that we will examine is the way experiments are executed. Before we start experimenting, we need to create an `evaluate.py` file with the code to evaluate the results of training.

In [None]:
import sys
import os
import pickle
import json

from sklearn.metrics import precision_recall_curve
import sklearn.metrics as metrics
from pathlib import Path

model_file = Path(sys.argv[1]) / 'model.p'
test_file = Path(sys.argv[2]) / 'test.p'

scores_file = sys.argv[3]
plots_file = sys.argv[4]

with open(model_file, 'rb') as f:
    model = pickle.load(f)

with open(test_file, 'rb') as f:
    test_df = pickle.load(f)

X = test_df[:,:-1]
y = test_df[:,-1]

predictions_by_class = model.predict_proba(X)
y_pred = predictions_by_class[:, 1]

precision, recall, thresholds = precision_recall_curve(y, y_pred)
auc = metrics.auc(recall, precision)

with open(scores_file, 'w') as f:
    json.dump({'auc': auc}, f)

with open(plots_file, 'w') as f:
    json.dump({'prc': [{
            'precision': p,
            'recall': r,
            'threshold': t
        } for p, r, t in zip(precision, recall, thresholds)
    ]}, f)

This time adding the evaluation step to the data flow is more complicated, because we also have to include a special file that stores the metric values and a file that stores the data for plots.

In [None]:
%%bash

dvc run -n evaluate \
    -d evaluate.py -d data/models/ -d data/features/ \
    -M scores.json \
    --plots-no-cache prc.json \
    python evaluate.py data/models/ data/features/ scores.json prc.json

Let's look at the final data flow configuration file.

In [None]:
%%bash

cat dvc.yaml

Don't forget to record all the changes in `git`.

In [None]:
%%bash

git add dvc.lock dvc.yaml
git commit -m 'Added evaluation step to data pipeline'

As a result of running the data flow, a new file `scores.json` has been created. This file contains the area under the precision-recall curve (the `auc` value computed in `evaluate.py`) for the experiment run.

In [None]:
%%bash

cat scores.json
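
The value in `scores.json` summarises a precision-recall curve, where each point is obtained by thresholding the predicted probabilities. As a minimal sketch of what one such point means, the function below computes precision and recall at a single threshold in plain Python. The labels and scores are made-up illustration values, and `precision_recall_at` is a hypothetical helper rather than the scikit-learn routine used in `evaluate.py`.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [score >= threshold for score in y_score]
    tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, y_true) if not p and y == 1)
    # Convention chosen here: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and predicted probabilities for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(precision_recall_at(y_true, y_score, threshold=0.35))
```

Sweeping the threshold over all distinct scores yields the list of points that the plots file stores.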

The `prc.json` file contains the points of the *precision-recall curve* computed on the test set. Let's add both files to the repository.

In [None]:
%%bash

git add scores.json prc.json
git commit -m 'Added evaluation metrics'

Now let's run the experiment with changed parameters and see whether the changes affect the metric. Change the `degree` parameter to 3 and the `n_estimators` parameter to 25, then re-run the experiment.

In [None]:
%%bash 

dvc repro

In [None]:
%%bash

dvc params diff

In [None]:
%%bash 

dvc metrics diff
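
Conceptually, `dvc metrics diff` compares metric values between the last committed version and the current workspace. A simplified sketch of such a comparison on two metric dictionaries follows. `metrics_delta` is a hypothetical helper and the numbers are made up, so this illustrates the idea rather than DVC's implementation.

```python
def metrics_delta(old, new):
    """Return {metric: (old, new, change)} for metrics present in both runs."""
    return {
        name: (old[name], new[name], new[name] - old[name])
        for name in old
        if name in new
    }


# Made-up metric values for illustration
committed = {"auc": 0.50}
workspace = {"auc": 0.75}
for name, (before, after, change) in metrics_delta(committed, workspace).items():
    print(f"{name}: {before} -> {after} (change {change:+.2f})")
```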

In [None]:
%%bash

dvc plots diff -x recall -y precision