## ❤️‍🩹 Changing and Recovering Data with DVC

In machine learning and data science projects, data is not static. It changes as new insights emerge or corrections are made. These updates, while necessary, can introduce challenges: How do you ensure previous versions are preserved? What if you need to roll back to a previous version to troubleshoot an issue or compare results?

This is where **DVC (Data Version Control)** steps in. Imagine having a safety net that allows you to track every change made to your data, document those changes, and recover any previous version easily. Whether you're modifying a file stored in your storage or experimenting with different data preprocessing steps, DVC helps you maintain control and traceability.

In this notebook, we will focus on:

1. Making changes to a Parquet file stored in S3.
2. Using DVC to track and document those changes.
3. Recovering previous versions of the data when needed.

By the end, you'll see how DVC simplifies the process of managing data changes and ensures that recovering an earlier version is never more than a few commands away.

## 🐠 Import Dependencies

First, we import the dependencies the notebook needs.

In [1]:
# Import dependencies
import boto3
import pyarrow
import pyarrow.fs  # imported explicitly so pyarrow.fs.S3FileSystem is available
import pyarrow.parquet as pq
import pandas as pd
import os

## ✏️ Modifying the Parquet File in S3

Let’s say you have a Parquet file stored in an S3 bucket. Now, you need to update it with the latest data. How can you do this while keeping track of the changes and ensuring you can restore the original if needed?

Here’s the plan:
1. Download and open the Parquet file from the S3 bucket.
2. Make the changes (like appending new data).
3. Save the updated file back to S3, ready for use and properly versioned.

These steps will help you update your data while keeping everything organized and easy to track.


In [None]:
# Connect to S3 using the credentials from the environment
fs = pyarrow.fs.S3FileSystem(
    endpoint_override=os.environ.get('AWS_S3_ENDPOINT'),
    access_key=os.environ.get('AWS_ACCESS_KEY_ID'),
    secret_key=os.environ.get('AWS_SECRET_ACCESS_KEY')
)

# Download the file (S3 paths are given as bucket/key)
with fs.open_input_file('data/song_properties.parquet') as file:
    df = pd.read_parquet(file)

# Make a change: duplicate every row as a stand-in for appending new data
df = pd.concat([df, df], ignore_index=True)

# Upload the file, overwriting the original object in S3
pq.write_table(pyarrow.table(df), 'data/song_properties.parquet', filesystem=fs)
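
Before uploading, it can help to confirm that the change did what you expected. The following is a minimal sketch, assuming pandas is available. The helper `summarize_change` and the toy columns `name` and `tempo` are hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize_change(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Compare two versions of a table by row count and column layout."""
    return {
        "rows_before": len(before),
        "rows_after": len(after),
        "rows_added": len(after) - len(before),
        "same_columns": list(before.columns) == list(after.columns),
    }


# Toy data standing in for the real Parquet contents (hypothetical columns).
before = pd.DataFrame({"name": ["song_a", "song_b"], "tempo": [120.0, 98.5]})

# Same kind of change as in the cell above: append a copy of the rows.
after = pd.concat([before, before], ignore_index=True)

summary = summarize_change(before, after)
print(summary)
```

In the notebook itself, you could keep a copy of `df` before the change and pass both frames to such a helper before writing to S3.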

## 📦 Creating a New Data Version with DVC

After modifying your data, record the new state with **DVC (Data Version Control)**. For files or directories imported via `dvc import`, `dvc import-url`, or `dvc import-db`, `dvc update` brings them in sync with the latest state of the data source. The `--to-remote` flag used below transfers the updated data directly to remote storage instead of saving it locally.

In [None]:
# Update our version with the new change
!dvc update song_properties.parquet.dvc --to-remote
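
DVC identifies each version of a tracked file by a hash of its content (MD5 by default) and records that hash in the `.dvc` file, which is why the `.dvc` file changes after an update. The sketch below only illustrates the idea with the standard library; it is not DVC's own code, and the file name `example.bin` and the helper `file_md5` are made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "example.bin"  # hypothetical stand-in for a data file

    # First version of the data.
    data_file.write_bytes(b"version 1")
    hash_v1 = file_md5(data_file)

    # Modified data produces a different content hash.
    data_file.write_bytes(b"version 1 plus new rows")
    hash_v2 = file_md5(data_file)

    # Writing the original bytes back restores the original hash.
    data_file.write_bytes(b"version 1")
    hash_restored = file_md5(data_file)

print("changed:", hash_v1 != hash_v2)
print("restored:", hash_v1 == hash_restored)
```

Because the hash depends only on the content, restoring the original bytes yields the original hash again, which is what makes a recorded version recoverable.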

## 🛠️ Tracking Changes in Git

To link the data version to your code, commit the changes in Git.

In [None]:
!git diff ../.dvc/config

In [None]:
# Track the change in git
!git add song_properties.parquet.dvc
!git commit -m "updated data"

## 🔄 Reverting to a Previous Data Version



Whenever we want to, we can check out an older version of the `.dvc` file from Git to see which data version was used with that Git commit.

In [None]:
# Revert to our old dvc file
!git checkout HEAD~1 song_properties.parquet.dvc

Pull down the original file with DVC, then write it back to the data storage ourselves (we have no way to push it there directly through DVC).

In [None]:
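# Fetch the data version that the restored .dvc file points to
!dvc pull
# Read the restored file and write it back to S3 (reuses `fs` from the earlier cell)
df = pd.read_parquet('song_properties.parquet', engine='pyarrow')
pq.write_table(pyarrow.table(df), 'data/song_properties.parquet', filesystem=fs)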
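
The checkout and pull steps above can be wrapped in a small helper. This is a sketch under stated assumptions, not part of the original notebook: the function names `revert_commands` and `revert_data` are hypothetical, and `dry_run=True` only prints the commands so you can inspect them before anything is executed.

```python
import subprocess


def revert_commands(dvc_file: str, revision: str = "HEAD~1") -> list:
    """Build the Git and DVC commands that restore an earlier data version."""
    return [
        # Restore the .dvc file as it was at the given Git revision.
        ["git", "checkout", revision, "--", dvc_file],
        # Download the data that the restored .dvc file points to.
        ["dvc", "pull", dvc_file],
    ]


def revert_data(dvc_file: str, revision: str = "HEAD~1", dry_run: bool = True) -> list:
    """Print the revert commands, or run them when dry_run is False."""
    commands = revert_commands(dvc_file, revision)
    for command in commands:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True raises an error if a command fails.
            subprocess.run(command, check=True)
    return commands


# Dry run: shows what would be executed without touching the repository.
planned = revert_data("song_properties.parquet.dvc")
```

Writing the restored file back to S3 remains a separate step, as in the cell above.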

## ✅ Restoring and Tracking Reverted Data

And we are now back at the original data and able to track the revert!

In [None]:
# Update and version dvc again with the reverted data
!dvc update song_properties.parquet.dvc --to-remote
!git add song_properties.parquet.dvc
!git commit -m "reverted data"

NOTE: After reverting to the original version, you can optionally create a new version to track the revert.

In [None]:
# Let's undo the last commits
# (spoiler alert: we do this because we'll automate these steps!)
!git reset --hard HEAD~3

## 🎯 Summary

This workflow demonstrates how DVC helps manage data versioning:

* Modify and track changes in datasets.
* Use Git to link data and code versions.
* Revert to previous dataset versions when needed.