## Setting up DVC

Data Version Control (DVC) provides a structured way to version data, a step that is often neglected. With DVC, you can track changes to your datasets, which supports reproducibility, collaboration, and easier troubleshooting. It guards against data-related problems and gives you more confidence and efficiency in data-centric work.

In this exercise, you will practice initializing a DVC project and checking how DVC is installed. Git has already been initialized for this project.

### IDE Exercise Instructions
- Initialize DVC in the workspace.
- Verify that the .dvcignore file and the .dvc folder are present.
- Check the DVC version and learn how it was installed.
- Commit the changes using Git with the commit message "initial commit".

In [None]:
#$ dvc init
#$ ls -a
#$ dvc version
#$ git add .
#$ git commit -m "initial commit"
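
The presence check from the second step can also be scripted. The following is a minimal sketch, assuming the workspace root is the directory you pass in; `dvc_initialized` is a hypothetical helper name and not part of DVC.

```python
from pathlib import Path


def dvc_initialized(root: str = ".") -> bool:
    """Return True if `root` holds both the .dvc directory and the .dvcignore file."""
    base = Path(root)
    return (base / ".dvc").is_dir() and (base / ".dvcignore").is_file()


if __name__ == "__main__":
    # Checks the current working directory by default.
    print("DVC initialized:", dvc_initialized())
```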

## .dvcignore Patterns

The .dvcignore file tells DVC which files and directories to skip when it traverses a project. Each line holds a pattern or path that DVC should ignore during its operations.

In this exercise, you will modify the contents of a .dvcignore file to set file patterns that DVC should ignore during operation. You will also use the dvc check-ignore command to verify whether specific targets are ignored by DVC according to the .dvcignore file.

### IDE Exercise Instructions
- Ignore all files in the dataset directory.
- Make an exception for dataset/myData.csv so that it is tracked by DVC.
- Ignore all JSON files in the current workspace.
- Using the dvc check-ignore -d <file_name or file_pattern> command, check whether JSON files are actually ignored.

In [1]:
#### ========> .dvcignore
# # Add patterns of files dvc should ignore, which could improve
# # the performance. Learn more at
# # https://dvc.org/doc/user-guide/dvcignore
# 
# # Ignore all files in the 'dataset' directory
# dataset/*
# 
# # But don't ignore 'dataset/myData.csv'
# !dataset/myData.csv
# 
# # Ignore all .json files
# *.json

#### ========> Command-line
#$ dvc check-ignore -d dataset/myData.csv
#$ dvc check-ignore -d *.json
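
The ordering of the three patterns matters: a later pattern overrides an earlier one, and a leading `!` re-includes a path. The sketch below illustrates that rule with Python's `fnmatch`. It is a deliberate simplification, not DVC's actual matcher, and `is_ignored` and the sample paths other than dataset/myData.csv are hypothetical.

```python
from fnmatch import fnmatchcase

# The patterns from the .dvcignore file above, in file order.
PATTERNS = ["dataset/*", "!dataset/myData.csv", "*.json"]


def is_ignored(path: str, patterns=PATTERNS) -> bool:
    """Apply patterns top to bottom; the last matching pattern decides."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(path, glob):
            ignored = not negated
    return ignored


print(is_ignored("dataset/other.csv"))   # True: matched by dataset/*
print(is_ignored("dataset/myData.csv"))  # False: re-included by the ! pattern
print(is_ignored("params.json"))         # True: matched by *.json
print(is_ignored("train.py"))            # False: no pattern matches
```

Swapping the first two patterns would leave dataset/myData.csv ignored, because dataset/* would then be the last pattern to match it.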

## Working with DVC Cache

In this exercise, you'll explore how to add data to the DVC cache and how to remove it again.

You are working on a machine learning project that uses a weather dataset to predict whether it will rain given the atmospheric conditions. As you update the dataset, you want to ensure that changes are tracked and that you can easily revert to previous versions if needed.

Git and DVC are already initialized in the workspace.

### IDE Exercise Instructions
- Add dataset.csv to DVC and examine the contents of dataset.csv.dvc by opening it in the file editor.
- Verify that the DVC cache is populated by running the find .dvc/cache -type f command in the terminal. Open the dataset.csv.dvc file and compare the output of this command to the md5 field.
- Unstage the DVC metadata file and clear the workspace cache by running the appropriate commands in the terminal.
- Verify that the DVC cache is now empty by running the find .dvc/cache -type f command in the terminal.

In [2]:
#$ dvc add dataset.csv
#$ find .dvc/cache -type f
#$ dvc remove dataset.csv.dvc
#$ dvc gc -w
#$ find .dvc/cache -type f
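
When comparing the `find` output with the md5 field, it helps to know that the cache is content-addressed: the last directory name holds the first two characters of the hash and the file name holds the rest. The sketch below rebuilds the hash from such a path. The parent directories vary between DVC versions, so the example path and the hash value in it are illustrative assumptions, and `md5_from_cache_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def md5_from_cache_path(path: str) -> str:
    """Join the two-character directory name and the file name into the full hash."""
    parts = PurePosixPath(path).parts
    return parts[-2] + parts[-1]


# Illustrative path only; your cache will contain a different hash.
example = ".dvc/cache/files/md5/5d/41402abc4b2a76b9719d911017c592"
print(md5_from_cache_path(example))  # 5d41402abc4b2a76b9719d911017c592
```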

## Setting up a DVC Remote

In this editor exercise, you'll practice the process of configuring and modifying DVC remotes for securely storing and distributing your datasets. DVC remotes allow you to store and share data with appropriate versions. We've already initialized DVC for this exercise. Your focus will be solely on setting up DVC remotes within a local filesystem environment.

The syntax for adding a DVC remote is

    dvc remote add --local <remote_name> </path/to/folder>

where --local saves the remote's settings in the Git-ignored .dvc/config.local file instead of the shared .dvc/config file. It does not mean that the remote itself lives on the local filesystem. To also make the remote the default one, add the -d flag.

To change a DVC remote's location, use the following command

    dvc remote modify --local <remote_name> url </path/to/new-location>

Notice that this modifies the url key, which is the config entry that stores the location.

### IDE Exercise Instructions
- Set up a local DVC remote named localremote pointed at /tmp/dvc. Check the context for command reference.
- List the DVC remotes using the appropriate DVC subcommand. Are the name and location printed correctly?
- Modify the existing remote to point to a new location, /tmp/dvc-new. Check the context for command reference.
- List the DVC remotes again using the appropriate DVC subcommand. Is the change in location accurately reflected?

In [None]:
#$ dvc remote add --local localremote /tmp/dvc
#$ dvc remote list
#$ dvc remote modify --local localremote url /tmp/dvc-new
#$ dvc remote list

## Versioning Data using DVC Remote

In this editor exercise, you'll practice versioning your datasets and pushing them to a DVC remote. Data versioning and storage are at the core of DVC's value proposition, and you'll learn how Git and DVC work together to achieve them. The dataset you'll be working with is a weather dataset used for rainfall prediction, given the atmospheric conditions.

We've already initialized DVC, configured a local remote at /tmp/dvc, and added a setup commit.

### IDE Exercise Instructions
- Add dataset.csv to the DVC cache.
- Commit the corresponding .dvc file to Git, with the commit message "tracking dataset.csv".
- Push the dataset to the DVC remote.
- Though you are the only one working with this DVC setup, run the dvc pull command to ensure everything is up to date.

In [3]:
#$ dvc add dataset.csv
#$ git add dataset.csv.dvc .gitignore
#$ git commit -m "tracking dataset.csv"
#$ dvc push
#$ dvc pull
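
The division of labor in these commands can be modeled in a few lines: Git keeps a small pointer (the hash), while the cache and the remote keep the data under that hash. The sketch below is a conceptual model only, not DVC's implementation. The functions `add` and `sync`, the directory layout, and the sample CSV contents are all assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add(data_file: Path, cache: Path) -> str:
    """Copy the file into the cache under its md5 and return the hash (the pointer)."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    target = cache / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, target)
    return digest


def sync(src: Path, dst: Path) -> int:
    """Copy objects that exist in src but not in dst; return how many were copied."""
    copied = 0
    for obj in src.rglob("*"):
        if obj.is_file():
            target = dst / obj.relative_to(src)
            if not target.exists():
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copyfile(obj, target)
                copied += 1
    return copied


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    remote = root / "remote"
    cache.mkdir()
    remote.mkdir()
    data = root / "dataset.csv"
    data.write_text("temp,humidity,rain\n21,0.4,0\n")  # made-up sample rows

    pointer = add(data, cache)    # analogous to `dvc add`
    pushed = sync(cache, remote)  # analogous to `dvc push`
    pulled = sync(remote, cache)  # analogous to `dvc pull`
    print(pointer, pushed, pulled)
```

In this model the final `sync` copies nothing because the cache already holds every object in the remote, which mirrors why `dvc pull` reports that everything is up to date when you are the only contributor.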

## Checking out Versioned Data

In this editor exercise, you'll practice moving between versions of your datasets by checking out the corresponding metadata versions from the Git repository. This exercise builds on the previous one: the initial state of the weather dataset was tracked, then 1000 lines were removed from it and the result was pushed to the DVC remote. Your task is to roll back the Git commit to a previous state, check out the DVC dataset at that corresponding state, and observe the changes.

We've already initialized DVC, configured a local remote at /tmp/dvc, and added a setup commit. Then, we added two more commits marking the dataset tracking and changes.

NOTE: To roll back the Git repository by N commits, you can use

<code>git reset --hard HEAD~N</code>

### IDE Exercise Instructions
- Inspect the Git commit history using the git log command. Notice the top two commit messages reflecting the updates to the dataset. Press q to get out of interactive mode.
- Inspect the md5 value in the dataset.csv.dvc file and compare it to the hash of the file itself by running md5sum dataset.csv.
- Roll back the dataset metadata file by one commit. The md5 value in it will change and will no longer match the output of md5sum dataset.csv.
- Update the dataset by checking out the version consistent with the metadata file. The md5 value in the metadata should now match the output of md5sum dataset.csv.

In [None]:
#$ git log
#$ cat dataset.csv.dvc
#$ md5sum dataset.csv
#$ git reset --hard HEAD~1
#$ cat dataset.csv.dvc
#$ dvc checkout
#$ md5sum dataset.csv
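
The manual comparison between the md5 field and `md5sum` can be automated. The following is a minimal sketch, assuming the metadata file contains a line of the form `md5: <32 hex characters>` and that the recorded value is the plain hash of the file's bytes. The helper names are hypothetical, and the regular expression stands in for a proper YAML parser.

```python
import hashlib
import re
from pathlib import Path


def recorded_md5(dvc_file: Path) -> str:
    """Extract the first 32-character md5 value from a .dvc metadata file."""
    match = re.search(r"md5:\s*([0-9a-f]{32})", dvc_file.read_text())
    if match is None:
        raise ValueError(f"no md5 field found in {dvc_file}")
    return match.group(1)


def actual_md5(data_file: Path) -> str:
    """Hash the file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.md5()
    with data_file.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_consistent(data_file: Path, dvc_file: Path) -> bool:
    """Return True when the workspace file matches the hash in its metadata file."""
    return recorded_md5(dvc_file) == actual_md5(data_file)
```

Calling `is_consistent(Path("dataset.csv"), Path("dataset.csv.dvc"))` after `git reset --hard HEAD~1` would be expected to return False, and True again once `dvc checkout` has restored the matching version.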