# Creating reproducible data science workflows with DVC

Tutorial [link](https://medium.com/y-data-stories/creating-reproducible-data-science-workflows-with-dvc-3bf058e9797b).

Despite being implemented in software, development of data science and machine learning projects is dramatically different from general-purpose software development: for example, DS is primarily experiment-driven and has a much higher level of intrinsic unpredictability.

Instead of features to implement, in ML we have ideas to try out, with no guarantee that any of them will succeed. Before long, you may have tested a few dozen hypotheses, often varying substantially in complexity and in the implementation required. Most of them will fail, some may look promising, and ultimately some complicated combination of them may succeed.

Eventually, you or your colleague will try to reproduce one of those models, the one which seems to be the best. You may find out that you cannot easily reconstruct how exactly that model was created. It may turn out that the training parameters are buried somewhere in a notebook, overwritten by subsequent experiments and multiple Git commits in several branches. In addition, perhaps the training/cross-validation split was performed without setting a random seed. And if you’re still not worried enough, the cherry on top is that when examined, it turns out the current features look nothing like what they were when the model was created. Too bad.

## Data Science processes

Some teams use variants of [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/), others follow the approach outlined in [Guerrilla Analytics](http://guerrilla-analytics.net/the-principles/). Both address the same problems and both offer powerful tools, so we recommend trying them out.

Chances are, you or your team already use something similar. But regardless of which approach you use to write reproducible data science code, you need tooling. 

The bare minimum requirements are the following:
- a way to **version-control the data**, especially intermediate artefacts like pre-computed features and models,
- being able to **pass data files inside the team** in a controllable and trackable way,
- a tool to **reproduce any artefacts or results**, in a simple and automated way, regardless of how long ago they were originally created.

**Note:** Other terms like documents, deliverables, or work products are widely used in software development communities instead of the term **artefact**. 

More: https://www.quora.com/What-does-the-term-artefact-mean-in-software-engineering-or-programming

In this tutorial we will explore how DVC implements all of the processes we’ve outlined and makes reproducible data science easier. DVC is open-source and attempts to be a Git for machine learning, while working closely together with Git itself.

**DVC** is actually quite a large tool, but the most common operations are simple enough, in the same way that basic Git commands are easy to learn and incorporate into daily practice.  

DVC is not the only tool for the job. It works best for small to middle-sized projects and solves the problem without adding too much complexity. However, depending on your needs, project size and deployment considerations you may find Kedro or other tools more suitable.

## Versioning the data

Let’s first define what **data version control** is:
- state of any data file, whether original one or derived, must be recorded,
- there must be a tool to switch between different versions of data files.

Consider the following scenario: the training data comes from a relational database and is stored as a CSV file. Once in a while, you want to update the dataset with recent records from the database. Each time you do so, you record the state of the dataset. If you have a way to *switch to any of the previous versions and back* — congratulations, your data is version controlled.  

Git is not suitable for this, as it was not designed to serve large or binary files, while extensions like Git LFS are general-purpose and can be used for data version control only with some limitations and inconvenience. DVC offers a more flexible approach.  

To illustrate it, we will use the [Titanic dataset](https://www.kaggle.com/c/titanic) from Kaggle to build a simple model and a submission file. With this miniature data science project, we will see how **DVC helps to ensure data lineage and reproducibility**.

### Install DVC

In [None]:
!pip install dvc
# Or
!conda install -c conda-forge dvc

> *Note that you can configure DVC to use external storage to hold and exchange data; in that case you’ll also need to install additional dependencies. For example, if you plan to use Amazon Web Services S3, you need to install boto and some other packages with:*

In [None]:
!pip install 'dvc[s3]'

- Install [AWS CLI v2](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html)
- Configure [AWS](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)

### Project Layout

Our project starts with the Titanic dataset, which contains two files (one for training and one for testing), and the project structure.  

It may be tempting to have all the data files and code in the same directory for the project this small.  

However, it’s strategically wiser to stick to the same project structure for any project, regardless of its size. A disciplined approach to project structure and operations saves a lot of headache and time along the road.

Let’s create a skeleton for our project:

In [1]:
!mkdir titanic-dvc

In [2]:
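# Note: `!cd` runs in a subshell, so the notebook's working directory does not change;
# the following cells use explicit ./titanic-dvc/ paths for that reason.
!cd titanic-dvc/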

In [11]:
!mkdir './titanic-dvc/'{data,features,results,pytitanic}

mkdir: ./titanic-dvc/data: File exists
mkdir: ./titanic-dvc/features: File exists
mkdir: ./titanic-dvc/results: File exists
mkdir: ./titanic-dvc/pytitanic: File exists


In [12]:
!touch './titanic-dvc/README.md'

In [14]:
!tree './titanic-dvc/'

./titanic-dvc/
├── README.md
├── data
├── features
├── pytitanic
└── results

4 directories, 1 file


The original data goes in the `data` directory. Although we have only two original data files, for larger projects there may be tens, thousands or even millions of data files, so it’s reasonable to have a separate directory for them.

All derived features and intermediate data files go to the `features` directory. Results (for example, trained models and submission files) will live in the `results` directory.

The `pytitanic` directory will contain the Python code for the project: scripts, modules, packages, etc. Additionally, you may have a `notebooks` directory, and directories for code in other languages (for example, R or Julia).

We will keep `README.md` empty for simplicity, although we still create it for the sake of the general procedure.
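The same skeleton can also be created from Python. The sketch below is only an illustration: the function name `create_skeleton` is our own, and the directory names are the ones described above. Using `exist_ok=True` makes the call safe to repeat, which avoids the `File exists` errors shown earlier when a cell is re-run.

```python
from pathlib import Path

# Directory names taken from the layout described above.
PROJECT_DIRS = ("data", "features", "results", "pytitanic")


def create_skeleton(root):
    """Create the project directories and an empty README.md under root."""
    root = Path(root)
    for name in PROJECT_DIRS:
        # exist_ok=True makes the call safe to repeat.
        (root / name).mkdir(parents=True, exist_ok=True)
    (root / "README.md").touch()
    return root
```

Calling `create_skeleton("titanic-dvc")` from the notebook's working directory would produce the same tree as the shell commands.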

To finish the project setup, we need to add the data files and initialize the Git repository and DVC. Assuming you have already downloaded the data as a Zip archive into the `data` directory:

In [17]:
!unzip './titanic-dvc/data/titanic.zip' -d './titanic-dvc/data'

Archive:  ./titanic-dvc/data/titanic.zip
  inflating: ./titanic-dvc/data/gender_submission.csv  
  inflating: ./titanic-dvc/data/test.csv  
  inflating: ./titanic-dvc/data/train.csv  


You should now see three new files (`gender_submission.csv`, `test.csv` and `train.csv`) in the `data` directory. We will not use `gender_submission.csv`, so let’s remove it along with the Zip archive:

In [18]:
!tree

.
├── Data\ Science\ Workflows.ipynb
├── Machine\ Learning\ Pipeline.ipynb
├── README.md
├── bigmart_ml_pipeline.py
├── data
│   ├── SampleSubmission_Big_Mart_Sales_3.csv
│   ├── gender_submission.csv
│   ├── test.csv
│   ├── test_Big_Mart_Sales_3.csv
│   ├── train.csv
│   └── train_Big_Mart_Sales_3.csv
├── resources
│   └── final_pipeline.webp
└── titanic-dvc
    ├── README.md
    ├── data
    │   ├── gender_submission.csv
    │   ├── test.csv
    │   ├── titanic.zip
    │   └── train.csv
    ├── features
    ├── pytitanic
    └── results

7 directories, 16 files


In [19]:
!rm './titanic-dvc/data/gender_submission.csv' './titanic-dvc/data/titanic.zip'

In [21]:
!tree

.
├── Data\ Science\ Workflows.ipynb
├── Machine\ Learning\ Pipeline.ipynb
├── README.md
├── bigmart_ml_pipeline.py
├── data
│   ├── SampleSubmission_Big_Mart_Sales_3.csv
│   ├── test_Big_Mart_Sales_3.csv
│   └── train_Big_Mart_Sales_3.csv
├── resources
│   └── final_pipeline.webp
└── titanic-dvc
    ├── README.md
    ├── data
    │   ├── test.csv
    │   └── train.csv
    ├── features
    ├── pytitanic
    └── results

7 directories, 11 files


> Note that we do not put data files under Git control, as their versioning will be handled by DVC. From now on, data files won’t be managed by Git directly.

### Managing data with DVC

We are now ready to initialize DVC for our project. To do this, launch:

In [22]:
!dvc init


You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|              https://dvc.org/doc/user-guide/analytics               |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: https://dvc.org/doc
- Get help and share ideas: https://dvc.org/chat
- Star us on GitHub: https://github.com/iterative/dvc

Several things happen when DVC performs initialization. First, it creates the `.dvc` directory to hold its own files needed for operation. The `.dvc` directory is to DVC what `.git` is to Git.

Second, DVC instructs Git on how to handle the newly created files. If you look at the current Git status (with `git status`), you’ll see that DVC has staged its files for commit:

In [24]:
!git status

On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   .dvc/.gitignore
	new file:   .dvc/config

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   .DS_Store

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.ipynb_checkpoints/Data Science Workflows-checkpoint.ipynb
	Data Science Workflows.ipynb
	titanic-dvc/



The `.dvc/.gitignore` file instructs Git to skip some DVC internal files in `.dvc`, while `.dvc/config` contains the newly created configuration for DVC, which is empty for now.

> DVC tries to name commands in a familiar way. Most of the time, a DVC command does exactly what you would expect it to do based on your Git experience.

> Moreover, DVC is a pretty verbose tool and most of its commands output meaningful and useful messages, so that you can understand what’s going on and what to do next.

Let’s commit the changes:

In [25]:
!git commit -m "DVC was initialized for the project."

[master a454c1a] DVC was initialized for the project.
 2 files changed, 9 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config


We are now ready to track the data files with DVC. To tell DVC about `data/train.csv` and `data/test.csv` we’ll use `dvc add`:

In [26]:
!dvc add './titanic-dvc/data/train.csv' './titanic-dvc/data/test.csv'

100% Add|██████████████████████████████████|2.00/2.00 [00:02<00:00,  1.15s/file]

To track the changes with git, run:

	git add titanic-dvc/data/train.csv.dvc titanic-dvc/data/test.csv.dvc titanic-dvc/data/.gitignore

Let’s break this down. First, DVC creates yet another `.gitignore` file to exclude original data files from Git tracking.

Second, something more important happens: DVC puts data files in its cache, and creates two metafiles (`data/train.csv.dvc` and `data/test.csv.dvc`) with the information about original data files.

Metafiles are YAML files with a specific set of attributes (use `cat data/train.csv.dvc` to look into the file):

In [28]:
!tree './titanic-dvc'

./titanic-dvc
├── README.md
├── data
│   ├── test.csv
│   ├── test.csv.dvc
│   ├── train.csv
│   └── train.csv.dvc
├── features
├── pytitanic
└── results

4 directories, 5 files


In [29]:
!cat './titanic-dvc/data/train.csv.dvc'

md5: ee3ee8f3eac58359eb30beaa0e56aa02
outs:
- md5: 61fdd54abdbf6a85b778e937122e1194
  path: train.csv
  cache: true
  metric: false
  persist: false


The top-level `md5` attribute contains the MD5 checksum of the `*.dvc` file contents, while the `md5` attribute under `outs` contains the checksum of the data file itself.

> You may notice that the `md5sum` utility calculates different values for both the data file and the `*.dvc` file. That’s OK, as DVC calculates MD5 on a transformed version of a file. For text files, it changes the EOL sequence from `\r\n` (which is the case for the Titanic dataset files) to `\n`.

> For the `*.dvc` file itself it’s even more elaborate: the top-level `md5` attribute contains the checksum not of the `*.dvc` file itself (think about this for a moment), but of a properly encoded string representation of its contents, with some filtering applied.
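To see why `md5sum` and DVC can disagree on a text file, here is a minimal sketch of the line-ending normalization described in the note. It is based only on that note: the function name `normalized_md5` is our own, and the exact transformation DVC applies (and whether other versions behave the same way) is an assumption.

```python
import hashlib


def normalized_md5(data: bytes) -> str:
    """MD5 of data after converting CRLF line endings to LF."""
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


crlf = b"PassengerId,Survived\r\n1,0\r\n"
lf = b"PassengerId,Survived\n1,0\n"

# Both variants hash to the same value once line endings are normalized...
print(normalized_md5(crlf) == normalized_md5(lf))             # True
# ...while a plain MD5 of the CRLF bytes (what md5sum reports) differs.
print(hashlib.md5(crlf).hexdigest() == normalized_md5(crlf))  # False
```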

Now let’s look at the cache. It’s located by default in `.dvc/cache`:

In [30]:
!tree .dvc/cache/

.dvc/cache/
├── 02
│   └── 9c9cd22461f6dbe8d9ab01def965c6
└── 61
    └── fdd54abdbf6a85b778e937122e1194

2 directories, 2 files


As you can see, DVC stores data files in the cache according to their MD5: the first two characters form the directory name, while the remaining ones are used as the cache file name.
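The mapping from checksum to cache location can be sketched in a couple of lines. The helper name `cache_path` is ours, and the layout is simply the one visible in the `tree` output above.

```python
from pathlib import PurePosixPath


def cache_path(cache_dir: str, md5: str) -> PurePosixPath:
    """First two hex characters name the directory, the rest names the file."""
    return PurePosixPath(cache_dir) / md5[:2] / md5[2:]


print(cache_path(".dvc/cache", "61fdd54abdbf6a85b778e937122e1194"))
# .dvc/cache/61/fdd54abdbf6a85b778e937122e1194
```

The checksum used here is the one recorded under `outs` in `train.csv.dvc`, so the printed path matches the second entry in the cache listing.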

We can now commit DVC metafiles to Git:

In [31]:
cd ./titanic-dvc/ 

/Users/mac/Desktop/sideProjects/learnML/titanic-dvc


In [32]:
!git add data/.gitignore data/test.csv.dvc data/train.csv.dvc

In [33]:
!git commit -m "Original data files added"

[master c5e6acb] Original data files added
 3 files changed, 16 insertions(+)
 create mode 100644 titanic-dvc/data/.gitignore
 create mode 100644 titanic-dvc/data/test.csv.dvc
 create mode 100644 titanic-dvc/data/train.csv.dvc


**Note** that Git knows nothing about the data files themselves: all the information needed to track them is stored in DVC files, while Git serves as an upper-level tool to track DVC itself.

### Moving data around in a controllable way

As the data files are now under DVC control, we can start using it. For example, if you accidentally deleted one of the data files, you can recreate it from cache with `dvc checkout`:

In [40]:
!rm data/train.csv

In [41]:
!tree

.
├── README.md
├── data
│   ├── test.csv
│   ├── test.csv.dvc
│   └── train.csv.dvc
├── features
├── pytitanic
└── results

4 directories, 4 files


In [42]:
!dvc checkout data/train.csv.dvc

+-------------------------------------------+
|                                           |
|     Update available 0.77.3 -> 0.83.0     |
|       Run pip install dvc --upgrade       |
|                                           |
+-------------------------------------------+

In [43]:
!tree

.
├── README.md
├── data
│   ├── test.csv
│   ├── test.csv.dvc
│   ├── train.csv
│   └── train.csv.dvc
├── features
├── pytitanic
└── results

4 directories, 5 files


`dvc checkout` looks into the target file (data/train.csv.dvc in this case) and retrieves the corresponding version from the cache. This is, of course, a simple example, but it illustrates the pattern.
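Conceptually, restoring a file amounts to looking up its checksum in the cache and placing the content back in the workspace. The sketch below is a simplified model of that idea, not DVC's implementation: `restore_from_cache` is a hypothetical helper, it always copies the file (DVC may use file links instead), and it takes the checksum as an argument rather than parsing the metafile.

```python
import shutil
from pathlib import Path


def restore_from_cache(cache_dir, md5, target):
    """Copy the cached file for md5 to target (a simplified model of checkout)."""
    cached = Path(cache_dir) / md5[:2] / md5[2:]
    if not cached.is_file():
        raise FileNotFoundError(f"no cache entry for {md5}")
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cached, target)
    return target
```

For the example above, the call would be `restore_from_cache(".dvc/cache", "61fdd54abdbf6a85b778e937122e1194", "data/train.csv")`, with the cache path adjusted to wherever `.dvc` lives.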

A more elaborate example would include **remote storage**. DVC can store files outside the working directory, which makes it easy to share them using DVC tools. DVC can use a local directory, AWS S3, Azure, and other destinations as remotes.

Let’s create local remote storage for the data files:

In [47]:
pwd

'/Users/mac/Desktop/sideProjects/learnML/titanic-dvc'

In [44]:
!mkdir ../titanic-remote

In [49]:
!dvc remote add -d localremote ../titanic-remote/

Setting 'localremote' as a default remote.

We can now push the data to the newly created remote:

In [51]:
!dvc push data/train.csv.dvc data/test.csv.dvc

Remotes are structured similarly to local cache:

In [52]:
!tree ../titanic-remote/

../titanic-remote/
├── 02
│   └── 9c9cd22461f6dbe8d9ab01def965c6
└── 61
    └── fdd54abdbf6a85b778e937122e1194

2 directories, 2 files


By themselves, remotes are of limited use. However, they become crucial when you work in a team. Let’s illustrate this by creating a clone of our current repository:


In [53]:
cd ..

/Users/mac/Desktop/sideProjects/learnML


In [54]:
# Initialize titanic-dvc as a Git repository first if you have not done so already
!git clone titanic-dvc titanic-dvc-copy 

fatal: repository 'titanic-dvc' does not exist


In [56]:
cd titanic-dvc-copy

/Users/mac/Desktop/sideProjects/learnML/titanic-dvc-copy


In [57]:
!tree

.
├── README.md
└── data
    ├── test.csv.dvc
    └── train.csv.dvc

1 directory, 3 files


Original data files are not there, but DVC metafiles are, as they are tracked by Git. This allows us to easily fetch data files from existing remote storage:

In [58]:
!dvc remote add -d localremote --local ../titanic-remote/

Setting 'localremote' as a default remote.

In [59]:
!dvc pull data/train.csv.dvc data/test.csv.dvc

Everything is up to date.

In [60]:
!tree .

.
├── README.md
└── data
    ├── test.csv
    ├── test.csv.dvc
    ├── train.csv
    └── train.csv.dvc

1 directory, 5 files


Information about remotes is stored in DVC config files (.dvc/config). Let’s get back to the original repository and look at how the configuration file changed:

In [61]:
!cat ../.dvc/config

['remote "localremote"']
url = ../titanic-remote
[core]
remote = localremote
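The config file shown above is INI-like, so for a quick look from Python the standard `configparser` module happens to be enough. This is only a convenience for inspecting this particular file, and we make no claim about how DVC itself parses it. Note that the section header keeps its quotes exactly as written.

```python
import configparser

# Contents copied from the `cat` output above.
CONFIG_TEXT = """\
['remote "localremote"']
url = ../titanic-remote
[core]
remote = localremote
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

default_remote = parser["core"]["remote"]
# Rebuild the quoted section name for the default remote.
remote_section = f"'remote \"{default_remote}\"'"
print(default_remote, parser[remote_section]["url"])
# localremote ../titanic-remote
```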


We can now commit the changes to DVC config:

In [63]:
cd ..

/Users/mac/Desktop/sideProjects/learnML


In [64]:
!git add .dvc/config

In [65]:
!git commit -m "add local remote"

[master bc5ad3e] add local remote
 1 file changed, 4 insertions(+)


> **Note** that the newly created remote is local and thus its configuration may be kept outside of Git. DVC allows the user to have several types of configuration, and the one called local is excluded from Git tracking. To use the local configuration when creating remotes, just add the `--local` option to the `dvc remote add` command. We added this option for the titanic-dvc-copy repository for illustration.

> The same `--local` option should be used when creating **cloud-backed remotes**, as you’ll need to add credentials to access AWS S3, Azure or GCP remotes and it’s not recommended to keep them in Git.

### Data Versioning

So far, we used DVC to only add files to cache and remote storage. This is only a part of the story. More importantly, data files can be versioned. Of course, Titanic dataset would not change, but real datasets can change over time.

We will simulate changes in the data by simply renaming the `Name` column to `FullName` in both data files. This is enough for our purposes: DVC doesn’t care what actually changed, it just tracks the changes.
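The rename itself is not shown in the notebook, so here is one possible way to do it with the standard library. `rename_column` is a hypothetical helper of ours. It rewrites the whole file through the `csv` module, which may normalize quoting and line endings, so the bytes (and therefore the checksum) can change for reasons beyond the header edit.

```python
import csv


def rename_column(path, old, new):
    """Rewrite the CSV at path with the header `old` renamed to `new`."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header = rows[0]
    header[header.index(old)] = new  # raises ValueError if `old` is absent
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

Applied to this project, that would be `rename_column("data/train.csv", "Name", "FullName")` and the same call for `data/test.csv`.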

For convenience, let’s tag the latest Git commit so that we can easily checkout files from it without messing with hashes:

In [68]:
!git tag base-dataset

We can now add edited data files to DVC:

In [77]:
!dvc add data/train.csv data/test.csv

Stage is cached, skipping.
100% Add|██████████████████████████████████|1.00/1.00 [00:01<00:00,  1.65s/file]

To track the changes with git, run:

	git add data/test.csv.dvc

In [78]:
!git add data/test.csv.dvc data/train.csv.dvc

In [79]:
!git commit -m "rename dataset columns"

[master 69340fa] rename dataset columns
 1 file changed, 7 insertions(+)
 create mode 100644 data/test.csv.dvc


In [80]:
!git tag renamed-dataset

fatal: tag 'renamed-dataset' already exists


In [81]:
!dvc push data/train.csv.dvc data/test.csv.dvc

In [82]:
!tree ../.dvc/cache

../.dvc/cache
├── 02
│   └── 9c9cd22461f6dbe8d9ab01def965c6
├── 61
│   └── fdd54abdbf6a85b778e937122e1194
├── 8b
│   └── 686a9e84994cb693999608b9201619
└── ee
    └── f968ede3c2968e56e886c6b94d265a

4 directories, 4 files


At the moment, the data files in the working directory contain the renamed columns, DVC has added them to the cache (check this with `tree .dvc/cache`), and we have pushed them to the local remote. So far, all the changes have been propagated to all locations.

Let’s assume that you want to get the original version of the data. We intentionally tagged the corresponding commit, and can now easily check out the DVC metafiles for that version:

In [83]:
!git checkout base-dataset data/train.csv.dvc data/test.csv.dvc

Updated 2 paths from 3792fb7


With the metafiles for `base-dataset`, we can easily get the original version of the data files from cache or remote storage:

In [84]:
!dvc checkout data/train.csv.dvc data/test.csv.dvc

If you inspect the data files, you will see that the column is named `Name`, as it was in the original files. To revert the data files to the current form with `FullName` instead of `Name`, just reset the corresponding metafiles to Git `HEAD` and perform `dvc checkout` again.
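Switching versions is always the same pair of commands: first Git restores the metafiles for a revision, then DVC restores the data they describe. A small wrapper makes that explicit. The function names below are hypothetical, and the sketch assumes it is run from the repository root with `git` and `dvc` on the `PATH`.

```python
import subprocess

METAFILES = ("data/train.csv.dvc", "data/test.csv.dvc")


def switch_commands(revision, metafiles=METAFILES):
    """Return the two commands that bring the data files to `revision`."""
    return [
        ["git", "checkout", revision, "--", *metafiles],
        ["dvc", "checkout", *metafiles],
    ]


def switch_data_version(revision, metafiles=METAFILES):
    """Run both commands, stopping at the first failure."""
    for cmd in switch_commands(revision, metafiles):
        subprocess.run(cmd, check=True)
```

With this, `switch_data_version("base-dataset")` would restore the original files and `switch_data_version("HEAD")` would return to the renamed ones.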

> Note that with DVC we effectively version-control the metafiles. All the tracking of the actual data files is performed by DVC based on the information in the metafiles.

As you can see, DVC is convenient and simple enough to be used for data versioning. It makes it easy to record the state of the data and to switch between different versions (with some help from Git).

This reduces the mess and helps to keep data coherent across teammates and locations. However, DVC can do more: it can track calculations and results, so that any previous result can be recreated without much trouble.

## Managing calculations with DVC

DVC has two main concepts for reproducible calculations: **stages** and **pipelines**. 

Let’s start with the simpler one and create a DVC stage, which calculates some features. We will not go too far right now, and will just make some columns categorical (see `pytitanic/features.py`):

In [1]:
pwd

'/Users/mac/Desktop/sideProjects/learnML'

In [2]:
cd titanic-dvc

/Users/mac/Desktop/sideProjects/learnML/titanic-dvc


In [4]:
!touch pytitanic/__init__.py pytitanic/features.py

In [5]:
import pandas as pd

df_dummy = pd.read_csv("./data/train.csv")

In [6]:
df_dummy.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
df_dummy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [16]:
df_dummy["Cabin"]

0       NaN
1       C85
2       NaN
3      C123
4       NaN
       ... 
886     NaN
887     B42
888     NaN
889    C148
890     NaN
Name: Cabin, Length: 891, dtype: object

In [14]:
df_dummy["CabinId"] = df_dummy["Cabin"].str.get(0)

In [15]:
df_dummy["CabinId"]

0      NaN
1        C
2      NaN
3        C
4      NaN
      ... 
886    NaN
887      B
888    NaN
889      C
890    NaN
Name: CabinId, Length: 891, dtype: object

In [17]:
df_dummy["CabinId"].astype("category").cat.codes

0     -1
1      2
2     -1
3      2
4     -1
      ..
886   -1
887    1
888   -1
889    2
890   -1
Length: 891, dtype: int8
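To make the encoding above concrete, here is a plain-Python sketch of how such category codes are assigned: the distinct non-missing values are sorted, each value is replaced by the position of its category, and missing values become `-1`. It is an illustration of the result rather than of pandas internals, it uses `None` for missing values instead of `NaN`, and the sample list is made up.

```python
def category_codes(values):
    """Map each value to the index of its category in sorted order; missing -> -1."""
    categories = sorted({v for v in values if v is not None})
    index = {cat: i for i, cat in enumerate(categories)}
    return [index.get(v, -1) for v in values]


cabin_ids = [None, "C", None, "C", "B", "A"]
print(category_codes(cabin_ids))  # [-1, 2, -1, 2, 1, 0]
```

This matches the output above, where `C` maps to `2`, `B` to `1`, and missing cabins to `-1`.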

The code is simple and self-explanatory, so we will not go through it. In a typical environment, you would immediately launch `python -m pytitanic.features …`, but with DVC it works a bit differently.

First, let’s add the newly created Python files to Git:

In [18]:
!git add pytitanic/features.py pytitanic/__init__.py

In [20]:
!git commit -m "Basic features calculation added."

[master 5c7f5a9] Basic features calculation added.
 4 files changed, 42 insertions(+), 4 deletions(-)
 create mode 100644 pytitanic/__init__.py
 create mode 100644 pytitanic/features.py


Now, let’s create a DVC stage:

In [2]:
cd titanic-dvc

/Users/mac/Desktop/sideProjects/learnML/titanic-dvc


In [3]:
!dvc run -f features/features.dvc \
                     -d data/train.csv -d data/test.csv -d pytitanic/features.py \
                     -o features/train_features.csv \
                     -o features/test_features.csv \
                     python3 -m pytitanic.features \
                     -o features -r data/train.csv -s data/test.csv

Running command:
	python3 -m pytitanic.features -o features -r data/train.csv -s data/test.csv

To track the changes with git, run:

	git add features/features.dvc features/.gitignore

Let’s unpack this. First, we instruct DVC to run a command and create a stage (and a corresponding stage file) with `dvc run -f features/features.dvc`. A **stage in DVC is just a chunk of tracked computation**. All information about how the stage is performed is stored in the stage file (`features/features.dvc` in this case).

To tell DVC about dependencies, we use `-d` option and list all the files needed for the computation. With `-o` we provide outputs of the stage. The final part is the command itself.
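The command at the end of `dvc run` only works if the script exposes a matching command-line interface. Since `pytitanic/features.py` is not listed here, the following is a guess at what its argument parsing could look like. The short flags come from the command above; the long option names, and the reading of `-r` as the training file and `-s` as the test file, are our assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="pytitanic.features")
    # Short flags match the command above; long names are assumptions.
    parser.add_argument("-o", "--output-dir", required=True)
    parser.add_argument("-r", "--train", required=True)
    parser.add_argument("-s", "--test", required=True)
    return parser


args = build_parser().parse_args(
    ["-o", "features", "-r", "data/train.csv", "-s", "data/test.csv"]
)
print(args.output_dir, args.train, args.test)
# features data/train.csv data/test.csv
```

Keeping every input and output path on the command line is what lets DVC record the full command in the stage file.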

When we launch the command above, DVC creates the stage file, which contains all the information needed to recreate the stage results at any time in the future (you can look into it with `cat features/features.dvc`).

Note that DVC adds the output files to the cache, so you do not need to do this manually. However, we need to commit the stage file and the `.gitignore` created by DVC in the `features` directory (again, `.gitignore` was added by DVC to exclude output files from Git tracking) and create a tag for convenience:

In [4]:
!git add features/features.dvc features/.gitignore

In [5]:
!git commit -m "Basic features calculated." 

[master 9ef7023] Basic features calculated.
 2 files changed, 23 insertions(+)
 create mode 100644 features/.gitignore
 create mode 100644 features/features.dvc


In [6]:
!git tag base-features

You may ask, what’s the point here? Can’t we just put output files under DVC control manually as we did before? Yes, we can, but now we not only have them in cache (and so can push to any remote to share with others) but also can easily recreate the calculation with updated code.

## Creating pipelines

Calculations, however, may be more complex than just a single stage. In that case, we want some modularity instead of a single monolithic computation.

DVC has tools for that too. To illustrate that, we will now proceed to actual machine learning tools. For that, we will create a simple CatBoost model with the features we have:

In [7]:
!touch pytitanic/model.py

This file looks large, but it’s very straightforward. First, we create a stratified random split. We can stratify on any categorical column, but let’s use Pclass as the default (think for a moment - why are we doing this?). We then perform some missing values imputation using the training set and finally, we train the model. All the results are stored on disk.

Several things to note:
- **parameters are explicit** and captured in the command line. This allows us to have reproducible commands without any implicit defaults,
- this script creates what is called a **metrics file** alongside the model file and the submission. We will see later how useful this is in combination with DVC,
- **random state is explicit**, and can be provided as a command-line parameter which has a default value. We won’t lose track of it.
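
Because `pytitanic/model.py` is not listed here, the following is only a sketch of the idea behind a stratified split with an explicit random state, written with the standard library. The function name, the default validation fraction and the default seed are all hypothetical, and the real script may well use a library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, valid_fraction=0.2, random_state=42):
    """Split rows into (train, valid), sampling each stratum separately."""
    rng = random.Random(random_state)  # explicit, reproducible seed
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    train, valid = [], []
    for label in sorted(strata):  # fixed order keeps the result deterministic
        group = strata[label][:]
        rng.shuffle(group)
        n_valid = round(len(group) * valid_fraction)
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])
    return train, valid
```

Calling it twice with the same `random_state` returns the same split, and each value of the stratification column (for example `Pclass`) keeps roughly the same share in both parts.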
    
We are now ready to train the model using `dvc run`:

In [8]:
!git add pytitanic/model.py

In [9]:
!git commit -m "ML model training added."

[master 105fd1c] ML model training added.
 1 file changed, 85 insertions(+)
 create mode 100644 pytitanic/model.py


In [17]:
!dvc --version

0.84.0

In [None]:
!dvc run -f results/model.dvc \
                     -d features/train_features.csv \
                     -d features/test_features.csv \
                     -d pytitanic/model.py \
                     -o results/cb-model.cbm \
                     -o results/cb-submission.csv \
                     -m results/cb-metrics.json \
                     python3 -m pytitanic.model -o results \
                     -r features/train_features.csv -s features/test_features.csv

What we have actually created is a DVC *pipeline*. To generate the required outputs, DVC checks whether all the dependencies are in place, since `features/train_features.csv` and `features/test_features.csv` are themselves the results of another stage (`features/features.dvc`). Now that we have defined the stage `results/model.dvc`, we can use it with `dvc repro`. Additionally, you can inspect the pipeline:

In [19]:
!dvc pipeline show --tree results/model.dvc

results/model.dvc


When DVC tries to reproduce `results/model.dvc`, it first constructs a dependency graph and decides whether any of the dependencies must be reproduced. Right now they are all up to date and DVC does not perform any calculations:

In [20]:
!dvc repro results/model.dvc

Running command:
	 
                                                                        
To track the changes with git, run:

	git add results/model.dvc

However, if we remove one or both of the features files, DVC will recognize that and reproduce them first:

In [21]:
!rm features/*_features.csv

In [22]:
!dvc repro results/model.dvc

Running command:
	 
                                                                        
To track the changes with git, run:

	git add results/model.dvc

>You may notice the somewhat unfortunate name of the `dvc repro` command. It does not reproduce the target in the scientific sense but rather recalculates it. To reproduce one of the previous versions, you should either check it out from the DVC cache, or check out the corresponding Git commit and recalculate it.

As you can see, DVC goes back through all the stages we have defined so far and recognizes that some of the intermediate results needed for `results/model.dvc` are missing. It reproduces the `features/features.dvc` stage, but correctly determines that the reproduced files are up to date, so there’s no need to either save them to the cache or to reproduce `results/model.dvc`.
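
The behaviour described above can be imitated with a toy model. This is a sketch of the general idea only, not DVC's implementation: the stage dictionary, the function names and the placeholder checksum strings are invented for the example, and the dependency of `features/features.dvc` on `pytitanic/features.py` is an assumption.

```python
# Toy description of the two stages from this tutorial.
STAGES = {
    "features/features.dvc": {
        "deps": ["pytitanic/features.py"],  # assumed dependency
        "outs": ["features/train_features.csv", "features/test_features.csv"],
    },
    "results/model.dvc": {
        "deps": [
            "features/train_features.csv",
            "features/test_features.csv",
            "pytitanic/model.py",
        ],
        "outs": [
            "results/cb-model.cbm",
            "results/cb-submission.csv",
            "results/cb-metrics.json",
        ],
    },
}


def upstream_order(target, stages):
    """Return the target and every stage it depends on, producers first."""
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in stages[name]["deps"]:
            if dep in producer:  # this dependency is an output of another stage
                visit(producer[dep])
        order.append(name)

    visit(target)
    return order


def reproduce(target, stages, recorded, current, run_stage):
    """Walk the stages producers-first and re-run only those whose files changed.

    `recorded` maps paths to the checksums saved earlier; `current` maps paths
    to the checksums of what exists now (missing files are simply absent).
    `run_stage(name)` must return fresh checksums for that stage's outputs.
    Returns the names of the stages that were re-run.
    """
    current = dict(current)  # work on a copy
    rerun = []
    for name in upstream_order(target, stages):
        files = stages[name]["deps"] + stages[name]["outs"]
        if any(recorded.get(path) != current.get(path) for path in files):
            current.update(run_stage(name))
            rerun.append(name)
    return rerun
```

If the feature files are deleted and the re-run produces the same checksums as before, only the features stage is executed in this model; the model stage sees unchanged inputs and is skipped.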

Another ingredient is the metrics file. DVC can track metrics alongside other outputs, which lets us recall the performance of the model later:

<b><p style='color:red'> ALERT: </p></b>
I don't know whether the problem is in my environment, but unless I run the last python command on its own, no training/file output is produced.

In [None]:
!dvc pipeline show --tree results/model.dvc

In [None]:
!dvc repro results/model.dvc

In [None]:
!dvc metrics show

**Note**: It is best to run all the commands in a terminal (cmd) rather than in a Jupyter notebook.

If you run into a DVC lock problem, run: `dvc init -f -v`

Moreover, DVC can fetch specific metrics directly:

In [None]:
!dvc metrics show -x AUC

>In this case, DVC recognizes that the metrics file is in JSON format and uses the path (`-x` or `--xpath`) to find the actual field inside the file.
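
To make the idea of looking up a field by path concrete, here is a minimal sketch that follows a dot-separated path of keys through a JSON document. It is a simplification for illustration: the function name is made up, the path syntax DVC accepts may differ, and the metric value is a placeholder rather than a real result.

```python
import json


def metric_by_path(metrics_text, path):
    """Follow a dot-separated path of keys into a parsed JSON document."""
    node = json.loads(metrics_text)
    for key in path.split("."):
        node = node[key]  # raises KeyError if the field does not exist
    return node


# Placeholder content; a real metrics file would hold the model's scores.
example = '{"AUC": 0.5}'
print(metric_by_path(example, "AUC"))
```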

We can now commit the newly created stage:

In [None]:
!git add results/model.dvc results/.gitignore
!git commit -m "Basic CatBoost model created."
!dvc push

## Running experiments

With stages, pipelines and metrics files, DVC allows performing even more flexible operations. Consider this: after you’ve created the basic features and your first model, you want to add some additional features and retrain the model. You want to determine if the new features are better. How would you achieve this?

The workflow may be as follows:
- **create a new Git branch**,
- **add new code** to calculate additional features,
- **reproduce** the pipeline for `results/model.dvc`,
- **compare metrics** for the model trained on the initial features with the one trained on the new features.

Let’s try this out. First, we create a branch:

In [None]:
!git checkout -b experimental-features

Now, we add a new feature (the well-known PclassSex) in `pytitanic/features.py`.
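
One possible shape for such a feature is sketched below. It assumes each passenger is a dictionary with `Pclass` and `Sex` keys and joins the two values with an underscore; both the function name and the encoding are illustrative choices, not necessarily what `pytitanic/features.py` does.

```python
def add_pclass_sex(passenger):
    """Return a copy of the record with a combined class/sex feature added."""
    enriched = dict(passenger)  # leave the input record untouched
    enriched["PclassSex"] = f"{passenger['Pclass']}_{passenger['Sex']}"
    return enriched


print(add_pclass_sex({"Pclass": 3, "Sex": "male"}))
```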

We now need to reproduce `results/model.dvc`.

DVC will report that, since one of the dependencies has changed, the stage must be recreated, and will then launch the calculation. Both `features/features.dvc` and `results/model.dvc` will be updated, and we can commit them now:

In [None]:
!git add pytitanic/features.py features/features.dvc results/model.dvc
!git commit -m "Combination of class and sex added as a feature."
!git tag pclass-sex

We can now run `dvc metrics` and see how the newly created model compares with the previous one:

In [None]:
!dvc metrics show -a

`dvc metrics show` has an option, `-a`, that shows all available metrics across all branches. This helps us compare metrics either between different models inside a branch or between different versions of the same model. Here we can see that the new model is slightly better in terms of accuracy.
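
Once the metrics of two branches are available as dictionaries, the comparison itself is a small computation. The sketch below assumes the metrics have already been loaded; the function name is hypothetical and the numbers are placeholders, not results from this tutorial.

```python
def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: candidate[name] - baseline[name] for name in shared}


# Placeholder values keyed by the branch names used in this tutorial.
metrics = {
    "master": {"AUC": 0.5, "accuracy": 0.5},
    "experimental-features": {"AUC": 0.75, "accuracy": 0.625},
}
print(metric_deltas(metrics["master"], metrics["experimental-features"]))
```

A positive difference means the experimental branch scored higher on that metric.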

We may now decide to merge this branch back to master or submit both submission files to Kaggle to compare leaderboard metrics.

## Moving forward

DVC is a powerful tool, and we have covered only its fundamentals. DVC can be more flexible: it can be configured to use links between the working directory and the cache to save space, can use any of the three main cloud providers for remote storage, and can even install Git hooks.

However, similar to Git, DVC is easy to introduce into daily practice with a set of simple rules:
- once you have started a project, add each original data file to DVC with `dvc add`,
- create artefacts (intermediate data files, models, etc.) only using DVC stages or pipelines,
- commit the corresponding `*.dvc` metafiles to Git,
- record metrics when running calculations or training with `dvc run`,
- compare different experiments with `dvc metrics show`,
- remember that DVC does not assign any special meaning to metrics, so you can store any important information as metrics; for example, you may want to record the running time of a calculation,
- to move through history and reproduce earlier results, use a combination of `git checkout`, `dvc checkout` (with an optional `dvc pull`) and `dvc repro`,
- share your data files through convenient remote storage with `dvc push`.