## 💿 Data Versioning with DVC

When working with machine learning, keeping track of data versions is critical for reproducibility and consistency. Data often evolves over time, and understanding which dataset version was used for training or evaluation is key to debugging and comparing results. 

In this notebook, we will explore using DVC (Data Version Control) to manage data versioning effectively.

### Quiz Time 🤓

In [None]:
import sys
import os
sys.path.append(os.path.abspath('../.dontlookhere/'))
from quiz5 import *

In [None]:
quiz_data()

In [None]:
quiz_versioning()

## 🐠 Install DVC

First, we need to install DVC with its S3 extras so that the `dvc` CLI is available from our notebook.

In [None]:
# Install dependencies
!pip install -q "dvc[s3]"

## 📽️ Initializing a project

The first step is to initialize our DVC project. Let's do that by running `dvc init` inside a Git project:

In [None]:
# Initialize DVC. This will create a cache, config file, and a few other things
!cd .. && dvc init

Once initialized in a project, DVC populates 🧙‍♂️ its installation directory (`.dvc/`) with the [internal directories and files](https://dvc.org/doc/user-guide/project-structure/internal-files) needed for DVC operation.

In [None]:
!ls -lhrta ../.dvc
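Later cells call `dvc` from the notebook's own directory rather than from the project root. That works because DVC looks for a `.dvc/` directory in the current directory and its parents. The lookup can be sketched in plain Python as follows; `find_dvc_root` is a hypothetical helper written for illustration, not part of DVC's API.

```python
from pathlib import Path
from typing import Optional


def find_dvc_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `.dvc/`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / ".dvc").is_dir():
            return candidate
    return None


# From the notebook's directory this is expected to print the parent
# directory, because `dvc init` was run after `cd ..`. It prints None if
# no `.dvc/` directory is found.
print(find_dvc_root(Path.cwd()))
```

If this prints `None`, the initialization step above most likely did not run in the directory you expected.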

## ⚙️ Configuring DVC Remotes

You can upload DVC-tracked data to various storage systems, either remote or local, which are collectively referred to as 'remotes.' 

In our case, we will use S3 for our remotes. We will configure two distinct remotes: 

* one for storing the actual data (s3://data)
* another for storing cached versions of the data (s3://data-cache)

Before pushing data to a remote, we need to set it up with the `dvc remote add` command. We will add the two remotes described above: one for the source data and one for the cached versions of the data, which will be our default DVC remote:



In [None]:
# Add the data versioning repository as a remote storage
# This will be our default storage
!dvc remote add --default s3-version s3://data-cache
!dvc remote modify s3-version endpointurl $AWS_S3_ENDPOINT

In [None]:
# Add the data source as a remote
!dvc remote add data-source s3://data
!dvc remote modify data-source endpointurl $AWS_S3_ENDPOINT
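Both `dvc remote modify` cells rely on the shell expanding `$AWS_S3_ENDPOINT`. If that variable is unset, the endpoint is passed as an empty value and later transfers are likely to fail in confusing ways. A minimal sanity check might look like this; `check_endpoint` is a hypothetical helper, and it assumes the endpoint is a full HTTP(S) URL including the scheme.

```python
import os
from urllib.parse import urlparse


def check_endpoint(value):
    """Return the endpoint URL if it looks usable, otherwise raise ValueError."""
    if not value:
        raise ValueError("AWS_S3_ENDPOINT is not set or is empty")
    parsed = urlparse(value)
    # Assumption: the endpoint is given as http://host[:port] or https://host[:port].
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"AWS_S3_ENDPOINT does not look like an HTTP(S) URL: {value!r}")
    return value


try:
    print("Endpoint:", check_endpoint(os.environ.get("AWS_S3_ENDPOINT")))
except ValueError as err:
    print("Warning:", err)
```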

The .dvc/config file contains detailed information about the DVC configuration. This file is intended to be tracked by Git. 

Upon inspection, you'll notice that it defines two distinct remotes, each pointing to a different data store (in this case, S3 buckets).

In [None]:
# Our config now looks like this
!cat ../.dvc/config

* `core` is the main section, holding the general config options (here, the default remote)
* The remote `s3-version` refers to the S3 remote for storing the cached versions of the data
* The remote `data-source` refers to the S3 remote for storing the data
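To make the structure described above concrete, here is a rough sketch that reads a config of this shape with Python's standard `configparser`. The sample text is an assumption about what the file looks like after the two `dvc remote add` cells, and the endpoint `http://minio.example:9000` is a made-up placeholder. DVC has its own config loader, so `configparser` is used here purely for illustration.

```python
import configparser

# Assumed shape of ../.dvc/config; the endpoint value is a placeholder.
SAMPLE_CONFIG = """\
[core]
    remote = s3-version
['remote "s3-version"']
    url = s3://data-cache
    endpointurl = http://minio.example:9000
['remote "data-source"']
    url = s3://data
    endpointurl = http://minio.example:9000
"""


def summarize_remotes(config_text):
    """Return (default_remote, {remote_name: url}) parsed from config text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    default = parser.get("core", "remote", fallback=None)
    remotes = {}
    for section in parser.sections():
        # Remote section headers are read as: 'remote "name"' (quotes included).
        cleaned = section.strip("'")
        if cleaned.startswith("remote "):
            name = cleaned[len("remote "):].strip('"')
            remotes[name] = parser.get(section, "url", fallback=None)
    return default, remotes


default, remotes = summarize_remotes(SAMPLE_CONFIG)
print("default:", default)
for name, url in remotes.items():
    print(f"{name} -> {url}")
```

To try it on the real file, pass `open("../.dvc/config").read()` instead of `SAMPLE_CONFIG`.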

## 🐾 Tracking Data

Now it’s time to track the dataset and push it to our default DVC remote, `s3-version` (the `data-cache` bucket)!

We will not store the data locally; instead, we will use an S3 remote to store our data. The `dvc import-url` command allows you to create an external data dependency without manually copying files from S3 or installing additional tools for different storage types.

By using the `--to-remote` option, you can create an import .dvc file while transferring the file or directory directly to the remote storage, ensuring efficient and streamlined data management.

In [None]:
# Track the dataset and push it to the data caching repo
!dvc import-url remote://data-source/song_properties.parquet --to-remote

You can check the contents of the `data-cache` bucket in MinIO to see what has been stored.

After running the data-tracking process, a new file named `song_properties.parquet.dvc` is created. This file contains the DVC hash, which identifies the specific version of the data that was just added. 

To understand the structure of the `.dvc` file, run the next cell to inspect its contents. 

In [None]:
!cat song_properties.parquet.dvc

Additionally, you can verify that the version recorded in your `.dvc` file matches the data stored in the `data-cache` bucket in MinIO.  
Navigate to your MinIO bucket (`data-cache`) and cross-check the hash to confirm consistency.
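When cross-checking, it helps to know how a hash maps to an object name in the bucket. The sketch below assumes the DVC 3.x layout, where an object is stored under `files/md5/<first two hex characters>/<remaining characters>`; older DVC versions use a different layout. It also assumes the recorded hash is a plain MD5 of the file bytes. Both helpers are hypothetical and not part of DVC.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def object_key(md5_hex, prefix="files/md5"):
    """Build the assumed object key: <prefix>/<first 2 hex chars>/<the rest>."""
    return f"{prefix}/{md5_hex[:2]}/{md5_hex[2:]}"


# Hypothetical local copy: with --to-remote the file is not kept in the
# workspace, so this branch only runs if you downloaded it separately.
local_copy = Path("song_properties.parquet")
if local_copy.exists():
    md5_hex = file_md5(local_copy)
    print(md5_hex, "->", object_key(md5_hex))
```

You can also paste the `md5` value from the `.dvc` file straight into `object_key(...)` to get the key to look for in MinIO.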

## 📁 Managing DVC Files and Ignoring Unnecessary Data

To maintain a clear relationship between the version of the data and the version of the code, all DVC-related files will be checked into Git. This ensures reproducibility and consistency across your project.

In [None]:
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

# Also stage the DVC config so the remote definitions are versioned with the code
!git add song_properties.parquet.dvc .gitignore ../.dvc/config
!git commit -m "Initial data tracked"

In some cases, you might want DVC to ignore certain files while working on your project. For example:

- A workspace directory with a large number of data files can slow down operations such as `dvc status`.
- Some files or folders may be irrelevant to the project.

To handle these scenarios, DVC supports the use of `.dvcignore` files, which work similarly to `.gitignore` in Git. 
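As an illustration, the sketch below adds a few patterns to a `.dvcignore` file at the project root without duplicating entries that are already there. The patterns and the `write_dvcignore` helper are hypothetical examples; choose patterns that match the files you actually want DVC to skip.

```python
from pathlib import Path

# Hypothetical patterns, written in the same syntax as .gitignore.
EXAMPLE_PATTERNS = [
    "# Scratch files that DVC does not need to scan",
    "tmp/",
    "*.log",
    ".ipynb_checkpoints/",
]


def write_dvcignore(project_root, patterns):
    """Add any patterns not yet listed to <project_root>/.dvcignore."""
    target = Path(project_root) / ".dvcignore"
    existing = target.read_text().splitlines() if target.exists() else []
    missing = [pattern for pattern in patterns if pattern not in existing]
    # Rewrite the file with the existing lines first, then the new ones.
    target.write_text("\n".join(existing + missing) + "\n")
    return target


# Example call from this notebook (the project root is the parent directory):
# write_dvcignore("..", EXAMPLE_PATTERNS)
```

Like `.dvc/config`, the resulting `.dvcignore` is a small text file that can be committed to Git alongside the code.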