# Lecture 23: Data Version Control I

![](https://www.tensorflow.org/images/colab_logo_32px.png)
[Run in colab](https://colab.research.google.com/drive/1jwHTnoQx45v_19H6Vy6zD4IrMVKOheac)

In [1]:
import datetime
now = datetime.datetime.now()
print("Last executed: " + now.strftime("%Y-%m-%d %H:%M:%S"))

Last executed: 2024-01-10 00:48:29


This lecture is part of a series on [Data Version Control (DVC)](https://dvc.org), a tool for systematically keeping track of different versions of models and datasets.

This first lecture in the series will cover:
- Why using DVC is a good idea.
- How to track files and move between versions.

## What is version control?

<figure style="margin: auto; text-align: center;">
  <img src="http://phdcomics.com/comics/archive/phd101212s.gif" width="550" alt="Series of drawings of a graduate student making changes to a manuscript based on his supervisor's comments, with his frustration and file names progressively increasing." style="margin: auto; text-align: center;"/>
    <figcaption>From <a href="http://phdcomics.com/comics/archive.php?comicid=1531">PHD Comics.</a></figcaption>
</figure>

Instead of keeping multiple copies of a file or working on a single shared version, a version control system lets you:
- **track changes** in distinct stages (_commits_) as you work,
- move backwards and forwards in history,
- explore different alternatives (_branches_),
- share the entire history with others.

Different systems: **Git**, Subversion, Mercurial, ...

We start our work by committing the state of our code or data. Each commit we create is given a unique identifier:

<img alt="Diagram of a single commit, represented as a labelled circle" src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture23_Images/git-one.svg" width="12%" />

As we work, we make more commits:

<img alt="Diagram of two commits with a link from the first to the second" src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture23_Images/git-two.svg" width="22.5%" />

Sometimes we make mistakes:

<img alt="Diagram of three linked commits, where the third is highlighted as wrong" src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture23_Images/git-wrong.svg" width="33.75%" />

After realising the error, we can go back and fix it, replacing it with a new commit:

<img alt="Diagram of three commits, where the previous mistake has been replaced with a new, fixed commit" src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture23_Images/git-fixed.svg" width="33.75%" />

Often, we want to try out different approaches before we decide on what's best:

<img alt="Diagram of a commit history with two branches" src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture23_Images/git-branch.svg" width="52.5%" />

This results in a non-linear history. If we want, we can also merge the two branches:

<img alt="Diagram of a commit history where two branches split off and are then rejoined" src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture23_Images/git-merge.svg" width="75%" />
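The diagrams above can be mimicked with a few lines of Python. The following is only a toy sketch: the `Commit` class, the `history` helper, the commit messages and the identifier scheme are all hypothetical illustrations, not how Git stores its data. It shows that each commit points to its parent(s), so a branch is two commits sharing a parent and a merge is a commit with two parents.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Toy commit: a message plus links to parent commits (hypothetical model)."""
    message: str
    parents: tuple = ()

    @property
    def id(self) -> str:
        # Toy identifier derived from the message and the parents' identifiers.
        payload = self.message + "".join(p.id for p in self.parents)
        return hashlib.sha1(payload.encode()).hexdigest()[:7]


def history(commit):
    """Return the messages of all commits reachable from `commit`, oldest first."""
    seen, order = set(), []

    def visit(c):
        if c.id in seen:
            return
        seen.add(c.id)
        for parent in c.parents:
            visit(parent)
        order.append(c.message)

    visit(commit)
    return order


first = Commit("Add data loader")
second = Commit("Clean data", (first,))
branch_a = Commit("Try random forest", (second,))   # first alternative
branch_b = Commit("Try neural network", (second,))  # second alternative
merged = Commit("Merge both experiments", (branch_a, branch_b))

for message in history(merged):
    print(message)
```

Walking back from `merged` visits both branches but lists the shared commits only once, which is the non-linear history drawn in the last diagram.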

## Why data version control?

Similar principles apply to data workflows as to code:

- Mistakes happen!
- New data appears.
- We want to try variants of the model (e.g. the algorithm or its parameters) or of the data pipeline (e.g. preprocessing).

Git is not only for source code files. However, a dedicated data-focused solution is more attractive:

- Git does not handle very large files efficiently.
- Thinking in terms of data workflows (models, parameters, inputs, ...) offers new useful functionality, e.g. reproducibility, metrics.
- It integrates better with remote storage providers, e.g. Amazon Web Services S3.
- It can still use Git under the hood, keeping code and data versioned together.

## Getting started with DVC

DVC is a command-line application that runs on Linux, macOS and Windows. Follow the [installation instructions](https://dvc.org/doc/install) to get it on your computer.

To follow along, first create a new directory and make it the current working directory by running the following cell:

In [2]:
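import os
os.makedirs("dvc-get-started-example", exist_ok=True)  # create the project directory if it is missing
os.chdir("dvc-get-started-example")  # make it the current working directory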

This walkthrough is based on the [official tutorial](https://dvc.org/doc/start).

### Initializing project

First, initialize the directory as a Git repository and then as a DVC repository, so that both tools can start tracking changes:

In [3]:
%%sh
git init
dvc init

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /home/runner/work/course_mlbd/course_mlbd/Lectures/dvc-get-started-example/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

`dvc init` creates some new files, stages them in Git, and gives you a hint about what to run next: _"You can now commit the changes to git."_

In [4]:
%%sh
git commit -m "Initial setup"

[master (root-commit) 26d89b5] Initial setup
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore


### Downloading data

Download a sample data file by running

In [5]:
%%sh
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

This should create a directory called `data` in your new directory, with a file called `data.xml` inside it.

### Initializing tracking
We are not tracking any files yet. Let's tell DVC to track the dataset we downloaded:

In [6]:
%%sh
dvc add data/data.xml






To track the changes with git, run:

	git add data/.gitignore data/data.xml.dvc

To enable auto staging, run:

	dvc config core.autostage true



As before, DVC creates some internal files and tells us what to commit with Git.

Run the command it suggests, and then commit:

In [7]:
%%sh
git add data/data.xml.dvc data/.gitignore
git commit -m "Add initial version of dataset"

[master 4a6bafd] Add initial version of dataset
 2 files changed, 6 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/data.xml.dvc


Note that this is different from the usual Git workflow. Normally, we would be `add`ing the data file itself (`data.xml`).

Instead, we add a small "proxy" file (`data.xml.dvc`) that DVC uses to identify the original dataset, while `data/.gitignore` keeps the data file itself out of Git.
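To see why a proxy can stay tiny regardless of the size of the data, consider the sketch below. It assumes that the proxy only needs to record a hash of the file contents plus a little metadata; the `file_md5` helper, the dummy file contents and the `pointer` layout are hypothetical and do not reproduce the exact format of a real `.dvc` file.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks, so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the real dataset: 1.4 MB of dummy rows.
    data_path = os.path.join(tmp, "data.xml")
    with open(data_path, "wb") as f:
        f.write(b"<row/>\n" * 200_000)

    # Simplified pointer: enough to identify the content, not to store it.
    pointer = {
        "md5": file_md5(data_path),
        "size": os.path.getsize(data_path),
        "path": "data.xml",
    }
    pointer_text = "\n".join(f"{key}: {value}" for key, value in pointer.items())
    data_size = pointer["size"]

print(pointer_text)
print(f"data: {data_size} bytes, pointer: {len(pointer_text)} bytes")
```

The pointer has the same length whether the dataset is 1 MB or 100 GB, which is what makes it cheap to commit to Git.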

To verify the size difference, run

In [8]:
%%sh
ls -lh data

total 14M
-rw-r--r-- 1 runner docker 14M Jan 10 00:48 data.xml
-rw-r--r-- 1 runner docker  92 Jan 10 00:48 data.xml.dvc


The original data takes up 14 MB, while the proxy file is only 92 bytes long.

### Making changes
During the course of our work, the dataset may change, intentionally or by accident. For simplicity, we will simulate a change by appending a copy of the dataset to itself, which doubles its size:

In [9]:
%%sh
cp data/data.xml temp.xml  # create a temporary copy
cat temp.xml >> data/data.xml  # append the copy to the original
rm temp.xml  # remove the copy

We can check the size of the file with 

In [10]:
%%sh
ls -lh data

total 28M
-rw-r--r-- 1 runner docker 28M Jan 10 00:48 data.xml
-rw-r--r-- 1 runner docker  92 Jan 10 00:48 data.xml.dvc


to verify it has doubled.

To register the changes with Git and DVC, we run similar commands to before:

In [11]:
%%sh
dvc add data/data.xml
git add data/data.xml.dvc  # as suggested by dvc
git commit -m "Double size of dataset"






To track the changes with git, run:

	git add data/data.xml.dvc

To enable auto staging, run:

	dvc config core.autostage true
[master 26b5db8] Double size of dataset
 1 file changed, 2 insertions(+), 2 deletions(-)



### Switching versions
Switching to another version happens in two stages.

First, we switch with Git:

In [12]:
%%sh
git checkout HEAD~

Note: switching to 'HEAD~'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 4a6bafd Add initial version of dataset


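`HEAD~` refers to the previous commit, which in this case holds the proxy file for the original dataset. At this point only the proxy file has changed; `data.xml` itself is still the doubled version.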

Then we "synchronise" the files under DVC with

In [13]:
%%sh
dvc checkout

M       data/data.xml


This finds the version of the data that was recorded when that commit was made and restores it in the working directory.
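One way to picture this step is as a lookup in a content-addressed store: every version that was added is kept under its hash, and checking out means copying the content named by the current proxy file back into place. The sketch below is a deliberately simplified assumption about the mechanism; `ToyCache` and its methods are hypothetical and are not part of DVC.

```python
import hashlib
import os
import shutil
import tempfile


class ToyCache:
    """Hypothetical content-addressed store; not DVC's real implementation."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def add(self, path):
        """Copy the file into the cache and return its hash (the 'pointer')."""
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        shutil.copyfile(path, os.path.join(self.root, digest))
        return digest

    def checkout(self, digest, path):
        """Overwrite `path` with the cached content stored under `digest`."""
        shutil.copyfile(os.path.join(self.root, digest), path)


with tempfile.TemporaryDirectory() as tmp:
    cache = ToyCache(os.path.join(tmp, "cache"))
    data = os.path.join(tmp, "data.xml")

    with open(data, "w") as f:
        f.write("original\n")
    v1 = cache.add(data)  # pointer recorded by the first commit

    with open(data, "a") as f:
        f.write("original\n")  # simulate doubling the dataset
    v2 = cache.add(data)  # pointer recorded by the second commit

    cache.checkout(v1, data)  # restore the version named by the old pointer
    with open(data) as f:
        restored = f.read()

print(v1 != v2, repr(restored))
```

In this picture, `git checkout` selects which pointer is current and the second step uses that pointer to put the matching data back.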

Verify that the version changed with 

In [14]:
%%sh
ls -lh data

total 14M
-rw-r--r-- 1 runner docker 14M Jan 10 00:48 data.xml
-rw-r--r-- 1 runner docker  92 Jan 10 00:48 data.xml.dvc


Notice `data.xml` is back to its original size.

Go back to the newest version (doubled data) with

In [15]:
%%sh
git checkout master
dvc checkout

Previous HEAD position was 4a6bafd Add initial version of dataset
Switched to branch 'master'
M       data/data.xml


## Summary

This lecture covered the basic usage of DVC: tracking a file and switching between its versions. Building on this, in the next lecture we will see how DVC can be used to track models and entire machine learning workflows.