<center>
<h1>The Full Machine Learning Lifecycle - How to Use Machine Learning in Production (MLOps)</h1>
<hr>
<h2>DVC Tutorial</h2>
<hr>
 </center>

# Introduction
This tutorial teaches you how to use DVC to version your data. You will learn how to set up data versioning and how to track and switch between dataset versions. To get started, let's create the project directory and navigate into it.


In [None]:
import os
import shutil
import sys

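# make the workshop plugins importable
sys.path.append('/cd4ml/plugins/')

# create the tutorial directory (no error if it already exists) and switch into it
os.makedirs('/cd4ml/dvc-tutorial', exist_ok=True)
os.chdir('/cd4ml/dvc-tutorial')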

# 1. Initialize the Git repository
DVC works hand in hand with Git. Before we can track any data, we need to initialize a Git repository.

In [None]:
! git init
! git config user.name "mlops-workshop"
! git config user.email "mlops@workshop.com"

# 2. Initialize DVC
Once we are inside a Git repository, we can initialize DVC by running `dvc init`. This creates a `.dvc` folder that DVC uses for data versioning.

In [None]:
! dvc init
! ls -a

### Exploring the contents of the `.dvc` folder

In [None]:
! ls .dvc

The `.dvc` directory contains a `config` file, a `tmp` folder that DVC uses for temporary files, and a `.gitignore`. The config file is empty for now, but it will store configuration information about the DVC setup once we have finished defining it.

In [None]:
! cat .dvc/config

DVC lists its internal files in `.dvc/.gitignore` to exclude them from Git tracking.

In [None]:
! cat .dvc/.gitignore

We are now ready to commit our DVC initialization to the Git repository.

In [None]:
! git status

In [None]:
! git commit -m "Initialize DVC repository"

# 3. Set up remote data storage for DVC
Next, we define the remote storage where DVC will store the versioned data. This can be cloud storage (e.g. Amazon S3, Azure Blob Storage, Google Drive) or a local folder on your system.

In [None]:
! dvc remote add -d remote_storage ./dvc_remote

The information about the remote storage is saved in DVC's `config` file.

In [None]:
! cat .dvc/config
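
For readers who want to inspect this configuration from Python, here is a minimal sketch using the standard `configparser` module. The sample text is an assumption about what `.dvc/config` typically looks like after the `dvc remote add -d` call above, not output copied from this notebook, and the helper name `default_remote_url` is hypothetical. DVC does not necessarily use `configparser` internally.

```python
import configparser

# Assumed contents of .dvc/config after adding a default remote
# (illustrative sample; the real file may differ between DVC versions).
SAMPLE_CONFIG = """\
[core]
    remote = remote_storage
['remote "remote_storage"']
    url = ../dvc_remote
"""


def default_remote_url(config_text):
    """Return (name, url) of the default remote described in config_text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # Remote sections are assumed to use a quoted header: 'remote "<name>"'
    section = f"'remote \"{name}\"'"
    return name, parser[section]["url"]


name, url = default_remote_url(SAMPLE_CONFIG)
print(f"default remote: {name} -> {url}")
```

If the stored `url` is relative, it is usually interpreted relative to the location of the config file, which is why a remote added as `./dvc_remote` may appear as `../dvc_remote`.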

Let's commit this change to the Git repository.

In [None]:
! git add .dvc/config
! git commit -m "Configuring remote storage"
! git log -n 2

# 4. Tracking data
With the DVC setup complete, we can start versioning the data. Let's use the ingestion script to make the data available.

In [None]:
import os
from cd4ml.data_processing import ingest_data


# paths and variables
_raw_data_dir = '/data/batch1'
_data_dir = 'data'

# ingest the data from blob storage
ingest_data(_raw_data_dir, data_files={'raw_data_file': os.path.join(_data_dir, 'data.csv')})

The folder `data` now contains the dataset `data.csv`, which we want to version with DVC. It contains 52384 rows of data.

In [None]:
! wc -l data/data.csv

To start tracking this dataset, run `dvc add <filename>`.

In [None]:
! dvc add data/data.csv

Running `dvc add` created a `<filename>.dvc` file, which we will track with Git and which DVC uses to detect changes in the data. The `.gitignore` was also updated to exclude the data itself from Git tracking (Git tracks only the `<filename>.dvc` file). The `.dvc` file contains the file hash and some file metadata.

In [None]:
! cat data/data.csv.dvc
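
To make the hash in the `.dvc` file less mysterious, the sketch below computes an MD5 digest of a file in chunks with `hashlib`. It assumes that the recorded hash is the MD5 of the file contents; some DVC versions normalize line endings in text files before hashing, so the digest may not match in every setup. The helper name `file_md5` is illustrative and not part of DVC.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a small temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "wb") as handle:
        handle.write(b"id,value\n1,10\n")
    before = file_md5(demo_path)

    # Appending a row changes the digest, which is how a change is detected.
    with open(demo_path, "ab") as handle:
        handle.write(b"2,20\n")
    after = file_md5(demo_path)

print(before != after)
```

In this notebook, `file_md5('data/data.csv')` could be compared by eye with the hash shown in `data/data.csv.dvc`.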

In [None]:
! cat data/.gitignore

Now, we can add the `data.csv.dvc` file and the modified `.gitignore` to a Git commit.

In [None]:
! git add data/data.csv.dvc data/.gitignore

In [None]:
! git commit -m "Dataset version 1"
! git tag "v1"

In [None]:
! git log -n 3

Finally, we push the data to the remote storage location (in this example a local folder in our directory) using `dvc push`.

In [None]:
! dvc push

That's it. We have now properly versioned our dataset.

# New data has arrived!
You have been informed that new data has arrived. We want to track this new version of the dataset so that we can later easily switch between dataset versions.

First, we run our ingestion script again to fetch the new "day 2" data.

In [None]:
# paths and variables
_raw_data_dir = '/data/batch2'

# ingest the data from blob storage
ingest_data(_raw_data_dir, data_files={'raw_data_file': os.path.join(_data_dir, 'data.csv')})

We can detect changes in the dataset by running `dvc status`.

In [None]:
! dvc status

Let us have a quick look at this modified dataset.

In [None]:
! wc -l data/data.csv

As you can see, our dataset has grown from 52384 to 104188 rows.
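
The same comparison can be made from Python. This sketch counts newline characters the way `wc -l` does and reports the growth between the two counts quoted above. `count_lines` is a hypothetical helper, and the two counts are taken from the text rather than recomputed.

```python
import os
import tempfile


def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters in a file, like `wc -l`."""
    total = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


# Row counts quoted in this tutorial for version 1 and version 2.
rows_v1, rows_v2 = 52384, 104188
print(f"added rows: {rows_v2 - rows_v1}")

# Small self-check on a temporary file with a header and two data rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("id,value\n1,10\n2,20\n")
    demo_count = count_lines(demo_path)
```

Note that a count obtained this way includes the header line of a CSV file, if there is one.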

To track the changes of the dataset, we run `dvc add` again and commit the change to the Git repository.

In [None]:
! dvc add data/data.csv

In [None]:
! git add data/data.csv.dvc
! git commit -m "Dataset version 2"
! git tag "v2"

Let's confirm that our changes have been committed.

In [None]:
! git log -n 4

Finally, we push our latest version of the dataset to our remote storage location.

In [None]:
! dvc push

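Inspecting the `dvc_remote` folder shows the pushed data. DVC stores each file under a path derived from its content hash (depending on the DVC version, inside a `files` directory), so the two versions of the dataset are stored as separate objects.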

In [None]:
! ls dvc_remote
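
The mapping from a hash to a storage path can be sketched as follows, assuming the common layout in which the first two hex characters of the MD5 form a directory and the remaining characters form the file name. Newer DVC releases nest this under a `files/md5` prefix. The `prefix` parameter and the helper name `remote_object_path` are illustrative, and the example hash is simply the MD5 of an empty input, not a hash from this tutorial.

```python
import posixpath


def remote_object_path(md5_hex, prefix=""):
    """Map an MD5 hex digest to a content-addressed relative path."""
    if len(md5_hex) != 32:
        raise ValueError("expected a 32-character MD5 hex digest")
    return posixpath.join(prefix, md5_hex[:2], md5_hex[2:])


# MD5 of empty input, used purely as a placeholder hash.
example_hash = "d41d8cd98f00b204e9800998ecf8427e"
print(remote_object_path(example_hash))
print(remote_object_path(example_hash, prefix="files/md5"))
```

Because the path depends only on the content, pushing an unchanged file a second time does not create a second copy.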

# Switching between dataset versions
Switching between dataset versions involves a combination of `git checkout` and `dvc checkout` (or `dvc pull`). First, `git checkout` loads the correct version of the `<filename>.dvc` file into the workspace. Then, `dvc checkout` restores the associated data from our local cache. To fetch the data from the remote instead, you would run `dvc pull`.

Let's look again at the size of our current dataset (version 2).

In [None]:
! wc -l data/data.csv

Now, we will check out version 1 of our dataset and look at the contents again.

In [None]:
! git checkout tags/v1 
! dvc checkout

In [None]:
! wc -l data/data.csv

As you can see, we have indeed switched to the previous version of our dataset.
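
The two-step switch can be wrapped in a small helper. This is a sketch, not part of DVC or of the `cd4ml` package: `switch_dataset_version` is a hypothetical name, and it simply shells out to the same `git checkout` and `dvc checkout` commands used above. The `run` parameter exists so that the command sequence can be inspected without executing anything.

```python
import subprocess


def switch_dataset_version(tag, run=subprocess.run):
    """Check out a Git tag, then sync the workspace data with DVC."""
    commands = [
        ["git", "checkout", f"tags/{tag}"],  # restores the matching .dvc file
        ["dvc", "checkout"],                 # restores the data from the cache
    ]
    for command in commands:
        # check=True makes subprocess.run raise if a command fails
        run(command, check=True)
    return commands


# Dry run: record the commands instead of executing them.
recorded = []
switch_dataset_version("v1", run=lambda cmd, check: recorded.append(cmd))
print(recorded)
```

Calling `switch_dataset_version('v2')` without the `run` argument would execute both commands in the current working directory.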

# Summary
And there we have it! This is how you can use DVC to keep track of versions of data and switch between them. We started by initializing a Git repository, then initialized DVC inside it. A combination of `dvc add` and `git commit` allowed us to track our dataset, which we pushed to remote storage with `dvc push`. We accessed different dataset versions with a combination of `git checkout` and `dvc checkout`.

In the next part of this workshop, you will learn how to incorporate DVC into an end-to-end machine learning workflow using MLflow and Apache Airflow.

In [None]:
# clean up
os.chdir('..')
shutil.rmtree('dvc-tutorial')