
# Data Version Control (DVC) in MLOps

This workflow shows how to use **DVC (Data Version Control)** to version large datasets and model artifacts alongside source code, a core MLOps practice.

**Authoritative documentation**:
[https://dvc.org/doc](https://dvc.org/doc)

---

## The Necessity of Data Versioning

In a production machine learning pipeline, stages such as data ingestion, preprocessing, feature engineering, and model evaluation produce intermediate and final artifacts. These pipelines are inherently dynamic.

* A data engineering team may update an upstream data source such as an Amazon S3 bucket.
* A data scientist may change preprocessing logic (e.g., Z-score → IQR for outlier handling).
* Feature definitions or filtering logic may evolve.

Each change propagates downstream and alters model behavior.

Git on its own is a poor fit for tracking these artifacts because:

* It is optimized for small, line-oriented text files.
* It handles large binary files and datasets with millions of rows poorly, since every version is kept in the repository history.
* Repository size grows quickly, and clone and fetch times degrade with it.

DVC addresses this limitation by tracking data versions externally while remaining tightly coupled to Git.

---

## Architecture: Git and DVC Parallelism

![Image](https://mlops-guide.github.io/assets/dvc/dvc_diagram.png)

![Image](https://doc.dvc.org/static/39d86590fa8ead1cd1247c883a8cf2c0/cb690/project-versions.png)

![Image](https://miro.medium.com/v2/resize%3Afit%3A1200/1%2ATQ3_QnmqzqfA_EdoiI9QqQ.png)

The Git–DVC relationship follows a strict separation of responsibilities.

### Responsibility Split

* **Git**

  * Tracks source code.
  * Tracks lightweight DVC metadata files (`.dvc`).
* **DVC**

  * Tracks large datasets and model artifacts.
  * Stores them in external storage (S3, GCS, Azure Blob, NFS, or local).

### Metadata Linking Mechanism

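* When data is added to DVC, its content is hashed (MD5 by default).
* DVC generates a `.dvc` file containing:

  * File path
  * Content hash
  * Size (and file count for directories)
* The remote storage location is configured separately in `.dvc/config`; a `.dvc` file does not need to name it.
* Git tracks only these small metadata files.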

Deterministic mapping:

```
Git commit → .dvc file → MD5 hash → exact data version
```
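The last two links in this chain rest on content addressing: the hash of a file's bytes doubles as its storage key. A minimal sketch of that idea follows. The function names `file_md5` and `object_path` are hypothetical, and the two-character directory split is an assumption made for illustration; DVC's actual cache layout may differ between versions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def object_path(store_root: Path, md5_hex: str) -> Path:
    """Map a digest to a storage location: first two hex characters, then the rest."""
    return store_root / md5_hex[:2] / md5_hex[2:]


# Demo: hash a small CSV and show where a content-addressed store would keep it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_data.csv"
    sample.write_text("Name,Age,City\nAlice,25,NY\n")
    digest = file_md5(sample)
    print(digest)
    print(object_path(Path("store"), digest))
```

Because the location is derived from the content, two identical files map to the same object, and any change to the bytes yields a new object rather than overwriting the old one.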

**Reference**:
[https://dvc.org/doc/user-guide/project-structure/dvc-files](https://dvc.org/doc/user-guide/project-structure/dvc-files)

---

## Step-by-Step Project Workflow

![Image](https://mlops-guide.github.io/assets/dvc/dvc_diagram.png)

![Image](https://editor.analyticsvidhya.com/uploads/86351git-dvc.png)

![Image](https://mlops-guide.github.io/assets/dvc/complete_pipeline.png)

---

### 1. Initial Setup

1. Create a repository on GitHub and clone it locally.
2. Create a Python script to generate a baseline dataset.

#### Baseline Python Code (`my_code.py`)

```python
import pandas as pd
import os

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["NY", "London", "Paris"]
}

df = pd.DataFrame(data)

os.makedirs("data", exist_ok=True)
df.to_csv("data/sample_data.csv", index=False)
```

3. Run the script:

```bash
python my_code.py
```

4. Perform an initial Git commit:

```bash
git add .
git commit -m "initial commit"
git push
```

---

### 2. DVC Initialization and Remote Storage

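Initialize DVC and configure a local directory to simulate remote storage. The directory is named `s3` only to mimic a bucket; it is a plain local folder and should stay out of Git.

```bash
dvc init                        # creates .dvc/ and its config files
mkdir s3                        # plain local folder standing in for a bucket
dvc remote add -d myremote s3   # -d makes "myremote" the default remote
```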

**References**:
[https://dvc.org/doc/command-reference/init](https://dvc.org/doc/command-reference/init)
[https://dvc.org/doc/command-reference/remote/add](https://dvc.org/doc/command-reference/remote/add)

---

### 3. Transitioning Tracking to DVC

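Transfer responsibility for the dataset from Git to DVC. The initial commit placed `data/` under Git, and `dvc add` refuses to track a path that Git still tracks, so remove it from the Git index first.

```bash
git rm -r --cached data                      # untrack in Git; files stay on disk
git commit -m "Stop tracking data with Git"
dvc add data                                 # writes data.dvc and adds /data to .gitignore
```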

This ensures Git tracks only metadata while DVC manages the actual data.
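Since `data` is a directory, a single hash has to stand for every file inside it. One way to get that property is to hash a manifest of (relative path, file hash) pairs, sketched below. The helper names and the JSON serialization are assumptions made for illustration; DVC's own directory manifest format may differ in detail.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hex MD5 of one file's bytes (reads the whole file; fine for a sketch)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def directory_manifest(root: Path) -> list:
    """One entry per file under root, sorted by path so the result is stable."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return [{"md5": file_md5(p), "relpath": p.relative_to(root).as_posix()} for p in files]


def directory_md5(root: Path) -> str:
    """Hash the serialized manifest; adding, removing, renaming, or editing a file changes it."""
    payload = json.dumps(directory_manifest(root), sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


# Demo: editing one file inside the directory changes the directory-level hash.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "sample_data.csv").write_text("Name,Age,City\nAlice,25,NY\n")
    before = directory_md5(data_dir)
    (data_dir / "sample_data.csv").write_text("Name,Age,City\nAlice,26,NY\n")
    after = directory_md5(data_dir)
    print(before != after)
```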

---

### 4. Versioning Cycle (Version 1)

Persist the first synchronized version of code and data.

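```bash
dvc commit   # optional here: dvc add has already cached the data
dvc push     # upload cached data to the default remote
git add data.dvc .gitignore .dvc/config
git commit -m "First version of data"
git push
```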

* DVC stores the dataset in remote storage.
* Git stores the `.dvc` file containing the MD5 hash.
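Conceptually, a push only needs to transfer objects the remote does not already hold, because an object's name is derived from its content. The sketch below models that with two local folders. `push_objects` is a hypothetical helper written for this illustration, not part of DVC's API.

```python
import shutil
import tempfile
from pathlib import Path


def push_objects(cache_root: Path, remote_root: Path) -> list:
    """Copy objects that exist in the cache but not in the remote; return their relative paths."""
    copied = []
    for obj in sorted(p for p in cache_root.rglob("*") if p.is_file()):
        rel = obj.relative_to(cache_root)
        target = remote_root / rel
        if target.exists():
            continue  # same hash-derived name means same content, so skip it
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(obj, target)
        copied.append(rel.as_posix())
    return copied


# Demo: the first push copies the object, the second finds nothing left to do.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    remote = Path(tmp) / "s3"
    (cache / "5d").mkdir(parents=True)
    (cache / "5d" / "41402abc4b2a76b9719d911017c592").write_bytes(b"hello")
    print(push_objects(cache, remote))
    print(push_objects(cache, remote))
```

This is why pushing a new version of a mostly unchanged directory is cheap: only the files whose content changed produce new objects.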

---

### 5. Updating and Creating New Versions

When the dataset changes:

1. Modify `my_code.py` and regenerate the CSV.
2. Verify changes:

```bash
dvc status
```

3. Commit the new version:

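```bash
dvc commit                    # re-hash data/ and update data.dvc (may prompt for confirmation)
dvc push                      # upload the new objects to the remote
git add data.dvc my_code.py   # stage explicitly; "git add ." would also stage the s3/ folder
git commit -m "Second version of data"
git push
```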

Each iteration produces a new data hash linked to a Git commit.
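The check performed in step 2 comes down to comparing the hash recorded in the metadata file with the hash of what is currently on disk. A minimal sketch of that comparison follows; `data_status` and its return labels are hypothetical and do not reproduce the output of `dvc status`.

```python
import hashlib
import tempfile
from pathlib import Path


def current_md5(path: Path) -> str:
    """Hex MD5 of the file as it exists on disk right now."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def data_status(path: Path, recorded_md5: str) -> str:
    """Compare the on-disk file with the hash recorded at the last commit."""
    if not path.exists():
        return "deleted"
    return "unchanged" if current_md5(path) == recorded_md5 else "modified"


# Demo: record a hash, then regenerate the file with an extra row.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample_data.csv"
    csv_path.write_text("Name,Age\nAlice,25\n")
    recorded = current_md5(csv_path)
    print(data_status(csv_path, recorded))
    csv_path.write_text("Name,Age\nAlice,25\nBob,30\n")
    print(data_status(csv_path, recorded))
```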

---

### 6. Rolling Back to a Specific Version

![Image](https://www.dasca.org/content/images/main/steps-for-branching-workflow-with-dvc.jpg)

![Image](https://miro.medium.com/1%2A0PSGD5wapOUOgdRkof1Ymw.png)

![Image](https://i0.wp.com/dvc.org/wp-content/uploads/2025/10/dependency-graph.png?quality=80\&ssl=1\&w=800)

To restore a previous experiment state:

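```bash
git log --oneline          # find the commit to restore
git checkout <commit_id>   # restores code and data.dvc (leaves Git in detached HEAD state)
dvc pull                   # fetches the matching data and places it in data/
```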

* Git restores the corresponding `.dvc` file.
* DVC reads the MD5 hash and retrieves the exact dataset version from remote storage.
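The rollback works because old objects are never overwritten: each commit holds only a pointer, and the pointer still resolves to the original bytes. The toy model below illustrates this with in-memory dictionaries standing in for Git history and remote storage. All names are hypothetical, and nothing here calls Git or DVC.

```python
import hashlib

remote = {}    # stands in for remote storage: hash -> content
history = []   # stands in for Git history: each commit stores only a pointer


def commit_version(content: bytes, message: str) -> int:
    """Store the content under its hash and record a pointer to it; return the commit index."""
    md5_hex = hashlib.md5(content).hexdigest()
    remote[md5_hex] = content                              # roughly "dvc push"
    history.append({"message": message, "md5": md5_hex})   # roughly "git commit" of the pointer
    return len(history) - 1


def restore(commit_index: int) -> bytes:
    """Look up the pointer stored in a commit and fetch the content it names."""
    pointer = history[commit_index]   # roughly "git checkout <commit_id>"
    return remote[pointer["md5"]]     # roughly "dvc pull"


v1 = commit_version(b"Name,Age\nAlice,25\n", "First version of data")
v2 = commit_version(b"Name,Age\nAlice,25\nBob,30\n", "Second version of data")

# Rolling back to the first commit returns the original bytes, untouched by the update.
assert restore(v1) == b"Name,Age\nAlice,25\n"
print(history[v1]["message"], history[v1]["md5"])
```

The same model shows the main failure mode: if a version was committed to Git but its data was never pushed, the pointer exists but there is nothing to retrieve.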

**Reference**:
[https://dvc.org/doc/command-reference/pull](https://dvc.org/doc/command-reference/pull)

---

# Data Versioning with DVC (Data Version Control)

This phase formalizes **data versioning** in an MLOps workflow using DVC, ensuring that every model experiment is reproducible, auditable, and reversible.

**Authoritative reference**:
Official documentation — [https://dvc.org/doc](https://dvc.org/doc)

---

## 1. Why Data Versioning Is Required

Machine learning pipelines consist of multiple stages—data ingestion, preprocessing, feature engineering, and model training. Each stage produces artifacts whose outputs depend directly on upstream inputs.

### 1.1 Pipeline Dynamics

* Source data may change (e.g., updates in an Amazon S3 bucket).
* Preprocessing logic may change (e.g., Z-score → IQR outlier handling).
* Feature definitions may evolve.

Any such change propagates downstream and alters model behavior.

### 1.2 Rollback Requirement

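Without explicit data versioning:

* Past experiments cannot be reconstructed exactly, because the data they used may no longer exist.
* Performance regressions cannot be reliably traced to a code change or a data change.
* Model comparisons are unreliable when the models may have been trained on different data.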

### 1.3 Why Git Alone Is Insufficient

Git is optimized for:

* Small, line-oriented text files
* Code diffs and merges

Git performs poorly with:

* Large binary datasets
* Frequent rewrites of tabular data
* Multi-gigabyte artifacts

---

## 2. DVC Architecture: Git + DVC Separation of Concerns

DVC complements Git by separating **metadata tracking** from **artifact storage**.

### 2.1 Responsibility Split

* **Git**

  * Tracks source code
  * Tracks small `.dvc` metadata files
* **DVC**

  * Tracks large datasets and model artifacts
  * Stores them in external storage (S3, GCS, Azure Blob, NFS, local)

### 2.2 Metadata Linking Mechanism

* Data tracked by DVC is hashed (MD5 by default).
* DVC generates a `.dvc` file containing:

  * Path
  * Hash
  * Size (and file count for directories)
* The remote storage location is configured separately in `.dvc/config`.
* Git tracks only these small metadata files.

Deterministic mapping:

```
Git commit → .dvc file → MD5 hash → exact data version
```

**Authoritative reference**:
[https://dvc.org/doc/user-guide/project-structure/dvc-files](https://dvc.org/doc/user-guide/project-structure/dvc-files)

---

## 3. End-to-End Implementation Workflow

### Phase 1: Project Initialization

1. Create a repository on GitHub.
2. Clone it locally.
3. Add a Python script to generate an initial dataset.
4. Commit the baseline structure to Git.

---

### Phase 2: DVC Setup

1. Initialize DVC:

   ```bash
   dvc init
   ```
2. Configure remote storage (local simulation):

   ```bash
   mkdir s3
   dvc remote add -d myremote s3
   ```

**Authoritative reference**:
[https://dvc.org/doc/command-reference/init](https://dvc.org/doc/command-reference/init)
[https://dvc.org/doc/command-reference/remote/add](https://dvc.org/doc/command-reference/remote/add)

---

### Phase 3: Data Tracking and Versioning

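1. Remove data from the Git index if it was previously tracked, and commit the removal (`dvc add` refuses a path that Git still tracks):

   ```bash
   git rm -r --cached data/
   git commit -m "Stop tracking data with Git"
   ```
2. Track data with DVC:

   ```bash
   dvc add data/
   ```
3. Commit and push data:

   ```bash
   dvc commit   # optional here: dvc add has already cached the data
   dvc push
   ```
4. Commit metadata to Git:

   ```bash
   git add data.dvc .gitignore .dvc/config
   git commit -m "Track data with DVC"
   ```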

**Authoritative reference**:
[https://dvc.org/doc/command-reference/add](https://dvc.org/doc/command-reference/add)
[https://dvc.org/doc/command-reference/push](https://dvc.org/doc/command-reference/push)

---

### Phase 4: Updating Data

For every dataset change:

1. Modify or regenerate data.
2. Run:

   ```bash
   dvc commit
   dvc push
   ```
3. Commit the updated `.dvc` file to Git.

Each Git commit then pins exactly one data snapshot, although several commits may share the same snapshot when only the code changes.

---

### Phase 5: Rollback and Reproducibility

1. Identify the target commit:

   ```bash
   git log --oneline
   ```
2. Revert code and metadata:

   ```bash
   git checkout <commit_id>
   ```
3. Restore the matching dataset:

   ```bash
   dvc pull
   ```

DVC reads the hash stored in the checked-out `.dvc` file and retrieves the exact dataset version from remote storage.

**Authoritative reference**:
[https://dvc.org/doc/command-reference/pull](https://dvc.org/doc/command-reference/pull)

---

## 4. Example Data Generation Script

```python
import pandas as pd
import os

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "London", "Paris"]
}

df = pd.DataFrame(data)

os.makedirs("data", exist_ok=True)
df.to_csv("data/sample_data.csv", index=False)
```

---

## 5. Guarantees Provided by DVC

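* Every Git commit identifies the exact data and model files it was built with
* Code, data, and models stay aligned across branches and history
* Any committed version can be restored, provided its data was pushed to the remote
* Large datasets stay out of the Git repository, so it remains small and fast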

---

