The first tool I want to introduce today is DVC ([https://dvc.org](https://dvc.org)). Many people treat DVC as synonymous with "Git for data", but as I outlined in my last blog post, it is only one of several tools available.

In the first blog post, I outlined why empirical researchers may need data version control. In this post, I want to offer a hands-on introduction to DVC and show how you could integrate it into your own research.

Why did I choose DVC? It is open source and storage agnostic, so you can work with multiple storage providers. This is particularly useful for research, because you can keep storing your data in your usual location.

I will also do a quick evaluation of the tool.

This tutorial assumes basic familiarity with Git. I will use Google Drive as the storage backend (this is what most respondents in my survey were using).

First, I am going to load the data that we will be versioning. I am using data from my current research project on online communities.

In [9]:
import pandas as pd
df = pd.read_csv("../../data/dvc_comments.csv")

In [10]:
df.head(2)

Unnamed: 0,comment_date_published,user_name,comment_text
0,2007-10-15,Tobi,"Hallo, guter Tip, von wann ist das Angebot ?? ..."
1,2007-10-15,Schnappi,steht links oben neben dem artikel : September...


In [11]:
len(df)

100000

In [32]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   dvc.ipynb

no changes added to commit (use "git add" and/or "git commit -a")


In [33]:
!dvc list

ERROR: the following arguments are required: url
usage: dvc list [-h] [-q | -v] [-R] [--dvc-only] [--show-json]
                [--rev [<commit>]]
                url [path]

List repository contents, including files and directories tracked by DVC and by Git.
Documentation: <https://man.dvc.org/list>

positional arguments:
  url               Location of DVC repository to list
  path              Path to directory within the repository to list outputs for

optional arguments:
  -h, --help        show this help message and exit
  -q, --quiet       Be quiet.
  -v, --verbose     Be verbose.
  -R, --recursive   Recursively list files.
  --dvc-only        Show only DVC outputs.
  --show-json       Show output in JSON format.
  --rev [<commit>]  Git revision (e.g. SHA, branch, tag)

In [29]:
!dvc status

Data and pipelines are up to date.

Next, I add the data file and track it with DVC.
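`dvc add` takes a path relative to the directory the command runs in, so it can help to check where the file actually lives before calling it. The sketch below is a hypothetical helper (`find_data_file` and the candidate paths are my own illustration, not part of DVC) that returns the first candidate path that exists.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_data_file(candidates: Iterable[str], base: Optional[Path] = None) -> Optional[Path]:
    """Return the first candidate that exists as a file under `base`, else None."""
    base = Path.cwd() if base is None else base
    for candidate in candidates:
        path = base / candidate
        if path.is_file():
            return path
    return None


# Hypothetical candidates: a data folder next to the notebook, or one two levels up.
found = find_data_file(["data/dvc_comments.csv", "../../data/dvc_comments.csv"])
print(found if found else "dvc_comments.csv not found; check the working directory")
```

If neither candidate exists, the working directory of the notebook is probably not the one you expect.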

In [19]:
!dvc add data/dvc_comments.csv

Adding...
ERROR: stage working dir '/Users/florianpethig/Documents/datavc/tools/dvc/data' does not exist

In [58]:
!git add data/.gitignore data/dvc_comments.csv.dvc

In [59]:
!git commit -m "Add raw data"

[main bc573d3] Add raw data
 2 files changed, 4 insertions(+)
 create mode 100644 data/dvc_comments.csv.dvc


In [60]:
!git push

Enumerating objects: 8, done.
Counting objects: 100% (8/8), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (5/5), 446 bytes | 446.00 KiB/s, done.
Total 5 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/florianpethig/datavc.git
   0010af1..bc573d3  main -> main


In [18]:
!dvc remote list

storage	gdrive://18YTEnKHyl1LVgA7-Wlx7E_oo7Zw4hx5Q

In [None]:
!dvc push
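The cells above follow a fixed sequence: `dvc add` the data, commit the resulting `.dvc` file and `.gitignore` with Git, push the Git history, then push the data itself with DVC. As a rough sketch, that sequence could be written down in Python. `versioning_commands` and `run_all` are hypothetical helpers of mine; the first only builds the command lists, so nothing runs unless you call `run_all` yourself.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List


def versioning_commands(data_path: str, message: str) -> List[List[str]]:
    """Build the DVC/Git commands used above for one data file (nothing is executed)."""
    data = PurePosixPath(data_path)
    return [
        ["dvc", "add", str(data)],
        ["git", "add", str(data.parent / ".gitignore"), f"{data}.dvc"],
        ["git", "commit", "-m", message],
        ["git", "push"],
        ["dvc", "push"],
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Print the planned commands instead of running them.
for command in versioning_commands("data/dvc_comments.csv", "Add raw data"):
    print(" ".join(command))
```

Keeping the steps in one place makes it harder to forget the `dvc push`, which is easy to miss because `git push` alone only uploads the small pointer file.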

In [63]:
!rm -f data/dvc_comments.csv
!rm -rf .dvc/cache
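The two commands above delete the local copy and the cache, so that `dvc pull` has to fetch the data from the remote again. To confirm that the pull brings back exactly the same bytes, one could record a content hash before deleting and compare it afterwards. This is a minimal sketch under my own assumptions: `file_md5` is a hypothetical helper, and I use MD5 because that is the hash I have seen in `.dvc` files, although DVC's own value may differ from a plain byte hash for text files.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


data_file = Path("data/dvc_comments.csv")
if data_file.exists():
    print(file_md5(data_file))
```

Run it once before the `rm` commands and once after the pull; matching values mean the restored file is identical.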

In [64]:
!dvc pull

+-----------------------------------------+
|                                         |
|     Update available 1.2.2 -> 1.9.1     |
|     Run `pip install dvc --upgrade`     |
|                                         |
+-----------------------------------------+

100% data/dvc_comments.csv|██████████████|18.7M/18.7M [00:04<00:00,    4.18MB/s]
A	data/dvc_comments.csv
1 file added and 1 file fetched

In [65]:
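# Start from the raw comment text
df['comment_tokenized'] = df['comment_text']
# Replace punctuation with spaces (regex=True is required in pandas >= 2.0)
df['comment_tokenized'] = df.comment_tokenized.str.replace(r'[^\w\s]+', ' ', regex=True)
# Lowercase every comment (str() also turns missing values into the string 'nan')
df['comment_tokenized'] = [str(token).lower() for token in df.comment_tokenized]
# Strip surrounding whitespace and split into a list of tokens
df['comment_tokenized'] = df.comment_tokenized.str.strip().str.split()
# Number of tokens per comment
df['comment_length'] = df.comment_tokenized.str.len()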
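To make the cleaning steps easier to follow, here is a plain-Python sketch that applies the same sequence (replace punctuation, lowercase, strip, split) to a single string. The function name `tokenize` is my own, and the sketch is only meant to mirror the pandas calls above, not to replace them.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Apply the same cleaning steps as above to one comment."""
    no_punctuation = re.sub(r"[^\w\s]+", " ", text)  # punctuation -> space
    return no_punctuation.lower().strip().split()    # lowercase, trim, split on whitespace


tokens = tokenize("Hallo, guter Tip, von wann ist das Angebot ??")
print(tokens)       # ['hallo', 'guter', 'tip', 'von', 'wann', 'ist', 'das', 'angebot']
print(len(tokens))  # 8
```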

In [3]:
#df.head(10)

In [67]:
df.to_csv("data/dvc_comments.csv", index=False)

In [15]:
!dvc add ../../data/dvc_comments.csv

100% Add|██████████████████████████████████████████████|1/1 [00:02,  2.56s/file]

To track the changes with git, run:

	git add ../../data/dvc_comments.csv.dvc

In [21]:
!git add ../../data/dvc_comments.csv.dvc
!git commit -m "Dataset updates"

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Changes not staged for commit:
	modified:   dvc.ipynb

no changes added to commit


In [25]:
!dvc remote list

storage	gdrive://18YTEnKHyl1LVgA7-Wlx7E_oo7Zw4hx5Q

In [27]:
!dvc push

  0% Querying cache in gdrive://18YTEnKHyl1LVgA7-Wlx7E_oo7Zw4hx5Q| |0/1 [00:00<?
Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?client_id=710796635688-iivsgbgsb6uv1fap6635dhvuei09o66c.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.appdata&access_type=offline&response_type=code&approval_prompt=force

Enter verification code: ^C
ERROR: interrupted by the user

In [31]:
!git push

Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 432 bytes | 432.00 KiB/s, done.
Total 4 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/florianpethig/datavc.git
   0418976..22864b2  main -> main


In [44]:
#!git log --oneline

In [71]:
!git checkout HEAD^1 data/dvc_comments.csv.dvc

Updated 1 path from 40571ee


In [72]:
!dvc checkout

M	data/dvc_comments.csv

In [2]:
#df = pd.read_csv("data/dvc_comments.csv")
#df.head(10)
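After checking out the previous `.dvc` file with Git and running `dvc checkout`, the CSV should be back at the earlier version. If that earlier version is the raw data, it should no longer contain the `comment_tokenized` and `comment_length` columns added above. A lightweight way to check this without loading the whole file is to read only the header row; `csv_columns` below is a hypothetical helper for that.

```python
import csv
from pathlib import Path
from typing import List


def csv_columns(path: Path) -> List[str]:
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


data_file = Path("data/dvc_comments.csv")
if data_file.exists():
    columns = csv_columns(data_file)
    # Both values should be False if the raw version was restored.
    print("comment_tokenized" in columns, "comment_length" in columns)
```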

In [35]:
!git log --oneline

22864b2 (HEAD -> main, origin/main) Dataset updates
0418976 reorganized folder
230f5ee updated blog post and readme file
61b68d7 updated blog post
d378bf3 first blog post
2ce466b updated readme
20b5627 removed ipynb checkpoints
2dfa280 updated gitignore
ee31e19 added readme and survey
4de9448 Dataset updates
bc573d3 Add raw data
0010af1 Clean repo
419d03a Clean up
478d77f Dataset updates
4c3642d Remove old dvc file
2cc1a91 Add raw data
822f86f Revert dataset updates
7f791db Dataset updates
60dcae8 Configure remote storage
a8ec3aa Add raw data
db135e3 Initialize DVC
