# Data Version Control with Data Version Control: The Daughter of All Demos
Demo by Wenqi Cao and David Streuli

## Structure
1. Introduction
2. How to Set Up DVC
3. Remote Storage and Data Tracking
4. Setting Up a Pipeline and Tracking Changes
5. Demonstrate Change of Data
6. Conclusion
7. Take Home Message

## Introduction
![Alt Text](./dvc.png)
### What Is DVC
- Open-source tool designed for version control of datasets and ML models
- Works alongside Git: Git versions the code and small metafiles, DVC manages the large data files
- Detects changes in tracked files by comparing content hashes

![Alt Text](./dvc_diagram.png)
### Why Should We Use It
- Dynamic datasets: Datasets evolve over time
- Large datasets: DVC keeps large files out of the Git repository and stores them in a cache and on remote storage
- Change detection
- Versioning
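
The change detection listed above rests on content hashing: a hash of each tracked file is recorded, and the file counts as changed once its current hash no longer matches. The following is a minimal sketch of that idea, assuming MD5 hashes and two hypothetical helpers, `file_md5` and `has_changed`. It illustrates the principle only and is not DVC's actual implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_hash):
    """A file counts as changed when its current hash differs from the recorded one."""
    return file_md5(path) != recorded_hash


# Demo on a throwaway file, not on the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "example.txt"
    data_file.write_text("hour,count\n8,120\n")
    recorded = file_md5(data_file)  # what would be stored as metadata
    unchanged = has_changed(data_file, recorded)
    data_file.write_text("hour,count\n8,120\n9,95\n")  # the dataset evolves
    changed = has_changed(data_file, recorded)

print(unchanged, changed)
```

Because only the hash is compared, renaming a file or touching its timestamp would not count as a change in this sketch, while any edit to the content would.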


## How to Set Up DVC

First install DVC.

In [None]:
%pip install dvc

Initialise a DVC project (this must be run inside a Git repository). This creates a new directory `.dvc/`, which contains configuration files and metadata.

In [None]:
!dvc init

We will use Google Drive for remote storage and need to install an additional dependency for this.

In [None]:
%pip install "dvc[gdrive]"

We register the folder on our Google Drive as the default remote (`-d`). The string after `gdrive://` is the ID of the Drive folder.

In [None]:
!dvc remote add -d gdrive_remote gdrive://1nwS0cuebPIGgrNYEdOC8v2ykgIi2ISKQ

In [None]:
!dvc remote list

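Set the credentials, replacing the placeholders with your own client ID and client secret. Adding `--local` to these commands keeps the values out of the Git-tracked `.dvc/config`.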

In [None]:
!dvc remote modify gdrive_remote gdrive_client_id 'client-id'
!dvc remote modify gdrive_remote gdrive_client_secret 'client-secret'
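
To avoid typing secrets into a notebook cell, the credentials could be read from the environment and passed to the same `dvc remote modify` commands. The sketch below only builds the commands. The variable names `MY_GDRIVE_CLIENT_ID` and `MY_GDRIVE_CLIENT_SECRET` and the helper `credential_commands` are assumptions made for this example, not something DVC defines.

```python
import os


def credential_commands(remote_name, env=None):
    """Build `dvc remote modify --local` commands from environment variables.

    MY_GDRIVE_CLIENT_ID and MY_GDRIVE_CLIENT_SECRET are hypothetical names
    chosen for this sketch; DVC itself does not read them.
    """
    env = os.environ if env is None else env
    client_id = env["MY_GDRIVE_CLIENT_ID"]
    client_secret = env["MY_GDRIVE_CLIENT_SECRET"]
    # --local writes to .dvc/config.local, which is not committed to Git.
    base = ["dvc", "remote", "modify", "--local", remote_name]
    return [
        base + ["gdrive_client_id", client_id],
        base + ["gdrive_client_secret", client_secret],
    ]


# Example with placeholder values instead of real credentials.
commands = credential_commands(
    "gdrive_remote",
    env={
        "MY_GDRIVE_CLIENT_ID": "client-id",
        "MY_GDRIVE_CLIENT_SECRET": "client-secret",
    },
)
for command in commands:
    print(" ".join(command))
    # To apply the setting: subprocess.run(command, check=True)
```

Passing the arguments as a list, rather than as one shell string, means no quoting is needed if a secret contains special characters.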

## Remote Storage and Data Tracking

We start tracking the training and validation files with DVC and upload them to the remote.

In [None]:
!dvc add data/bikesharing/train/bikeshare_v1.0.txt
!dvc add data/bikesharing/validation/validation.txt

In [None]:
!dvc push
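
Conceptually, `dvc add` places the file content in a cache keyed by its hash and leaves a small metafile for Git to track, and `dvc push` uploads that cache to the remote. The sketch below imitates the cache part with a hypothetical `add_to_cache` function. The directory layout (first two hex characters as a subfolder) is a simplification and should not be read as DVC's exact on-disk format.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_to_cache(path, cache_dir):
    """Copy a file into a content-addressed cache and return its hash."""
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    target = Path(cache_dir) / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)
    return digest


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    cache = tmp / "cache"
    first = tmp / "v1.txt"
    copy = tmp / "v1_copy.txt"
    second = tmp / "v2.txt"
    first.write_text("a,b\n1,2\n")
    copy.write_text("a,b\n1,2\n")         # same content, different name
    second.write_text("a,b\n1,2\n3,4\n")  # new version of the data

    hashes = [add_to_cache(p, cache) for p in (first, copy, second)]
    cached_files = sum(1 for p in cache.rglob("*") if p.is_file())

print(len(set(hashes)), cached_files)
```

Three files are added, but the cache ends up holding only two entries, because the two files with identical content share one hash.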

## Setting Up a Pipeline and Tracking Changes

In [None]:
!dvc stage add -n prepare \
  -d demo/prepare.py \
  -d data/bikesharing/train/bikeshare_v1.0.txt \
  -d data/bikesharing/validation/validation.txt \
  -o data/bikesharing/train/bikeshare_prepared.txt \
  -o data/bikesharing/validation/validation_prepared.txt \
  python demo/prepare.py

In [None]:
!dvc stage add -n train \
  -d demo/train.py \
  -d data/bikesharing/validation/validation_prepared.txt \
  -d data/bikesharing/train/bikeshare_prepared.txt \
  python demo/train.py

In [None]:
!dvc repro
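
`dvc repro` runs the stages in dependency order: `train` consumes the files that `prepare` produces, so `prepare` has to run first. A minimal sketch of how such an order can be derived from the `-d` and `-o` declarations above, using the standard-library `graphlib` module (Python 3.9+). The `stages` dictionary and the `execution_order` helper are illustrative constructs, not DVC's internal data structures.

```python
from graphlib import TopologicalSorter

# Stage definitions copied from the two `dvc stage add` commands above.
stages = {
    "prepare": {
        "deps": [
            "demo/prepare.py",
            "data/bikesharing/train/bikeshare_v1.0.txt",
            "data/bikesharing/validation/validation.txt",
        ],
        "outs": [
            "data/bikesharing/train/bikeshare_prepared.txt",
            "data/bikesharing/validation/validation_prepared.txt",
        ],
    },
    "train": {
        "deps": [
            "demo/train.py",
            "data/bikesharing/validation/validation_prepared.txt",
            "data/bikesharing/train/bikeshare_prepared.txt",
        ],
        "outs": [],
    },
}


def execution_order(stages):
    """Order stages so that every producer runs before its consumers."""
    # Map each output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, collect the stages whose outputs it depends on.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```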

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

# Path to the scatter plot image
image_path = './scatter_plot.png'

img = Image.open(image_path)

# Show the image without axes
plt.figure(figsize=(10, 6))
plt.imshow(img)
plt.axis('off')
plt.show()

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

# Path to the model plot image
image_path = './model_plot.png'

img = Image.open(image_path)

# Show the image without axes
plt.figure(figsize=(10, 6))
plt.imshow(img)
plt.axis('off')
plt.show()


In [None]:
!dvc push

## Demonstrate Change of Data

After the training data has changed, we add the file again so that DVC records the new version, push it to the remote, and reproduce the pipeline.

In [None]:
!dvc add data/bikesharing/train/bikeshare_v1.0.txt
!dvc push

In [None]:
!dvc repro
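
When the training file changes, the change propagates through the pipeline: `prepare` reads the file directly, and `train` reads what `prepare` writes. The sketch below walks a reduced, hand-written copy of the pipeline to list the stages that may need to rerun. The `possibly_stale` helper is hypothetical, and the result is an upper bound: a downstream stage can be skipped if the upstream outputs turn out to be identical.

```python
from collections import deque

# Reduced view of the pipeline: each stage with its inputs and outputs.
stages = {
    "prepare": {
        "deps": {"data/bikesharing/train/bikeshare_v1.0.txt"},
        "outs": {"data/bikesharing/train/bikeshare_prepared.txt"},
    },
    "train": {
        "deps": {"data/bikesharing/train/bikeshare_prepared.txt"},
        "outs": set(),
    },
}


def possibly_stale(stages, changed_path):
    """Return the stages reachable from a changed file, in discovery order."""
    stale = []
    queue = deque([changed_path])
    seen = {changed_path}
    while queue:
        path = queue.popleft()
        for name, spec in stages.items():
            if path in spec["deps"] and name not in stale:
                stale.append(name)
                # Outputs of a stale stage may change too, so follow them.
                for out in spec["outs"] - seen:
                    seen.add(out)
                    queue.append(out)
    return stale


stale = possibly_stale(stages, "data/bikesharing/train/bikeshare_v1.0.txt")
print(stale)
```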

## Conclusion

Pros:
- Data Versioning: DVC enables **version control for datasets** and machine learning models, similar to how Git handles source code.
- Efficient Storage: DVC keeps large files in external storage and avoids duplication by **storing each unique file only once**, so unchanged files are not copied again for a new data version.
- Collaboration: Improves productivity by allowing you to **collaborate on data and models** without needing to share large files.

Cons:
- Learning Curve
- Not Ideal for Small Projects
- Overhead in Workflow (Setup)
- Setup of Remote Storage (the DVC app is currently blocked by Google, and the setup does not support environment variables)

## Take Home Message
Use data versioning for long-term ML projects that are collaborative or involve large datasets. DVC (the tool) is a great option for this.