
The main principles we follow in storing data are summarized in chapter 5 of Code and Data for the Social Sciences. The key points are:

  • Store all data in tables with unique, non-missing keys (see the sketch after this list)
  • Keep data normalized as far into the code pipeline as possible
  • Eliminate redundancy -- a given set of data cleaning / building steps should only be executed once
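For example, here is a minimal pandas sketch of the first two points, using hypothetical county- and individual-level tables (all names and values are made up):

```python
import pandas as pd

# Hypothetical normalized tables: one row per county, one row per individual.
counties = pd.DataFrame(
    {"county_id": [1, 2, 3], "county_name": ["Adams", "Bell", "Clark"]}
)
individuals = pd.DataFrame(
    {"individual_id": [10, 11, 12], "county_id": [1, 1, 3], "spending": [5.0, 7.5, 2.0]}
)

def check_key(df, key):
    """Verify that `key` uniquely identifies rows and has no missing values."""
    assert not df[key].isna().any().any(), f"missing values in key {key}"
    assert not df.duplicated(subset=key).any(), f"duplicate rows for key {key}"

check_key(counties, ["county_id"])
check_key(individuals, ["individual_id"])
```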

Raw Data Directories

Raw data files must be stored in a raw data directory that follows specific rules. This is normally the /raw/ directory of a GitHub repository or an analogous directory on Dropbox.

Every raw directory must have a detailed readme.md file that includes the source of the data, when and how it was obtained, and any other information necessary to understand the provenance and meaning of the data.

Codebooks, data use agreements, and other documentation should be placed in a /docs/ subdirectory.

We aim to store enough documentation in readme.md and /docs/ that if we lost access to the original data source we would have everything we need to understand the data, reference it in a paper, and make sure we are adhering to the terms of our agreements.

Raw directories can contain code to perform preprocessing steps necessary to produce files ready to be used downstream (e.g., file conversions, appending files together, etc.). In this case the data in its original form should be stored in an /orig/ subdirectory and the preprocessed data should be stored in an /output/ or /data/ subdirectory.
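As an illustration, a raw directory that follows these conventions might be laid out like this (directory and file names are hypothetical):

```
raw/county_spending/
├── readme.md        # source, date obtained, acquisition method, provenance notes
├── docs/
│   ├── codebook.pdf
│   └── data_use_agreement.pdf
├── orig/            # data exactly as received from the source
│   └── spending_2020.xlsx
├── code/            # preprocessing (e.g., convert to csv, append years)
└── output/          # preprocessed files ready for downstream use
    └── spending.csv
```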

Storage

We store small- to medium-sized data files that relate to a single project in our repositories, either directly or using Git LFS.

  • Diffable files (.txt, .csv, .R, .do, etc.) that are under 5 MB can be stored directly in Git
  • Non-diffable files (binaries such as .pdf, .dta, .rds, etc.) and diffable files over 5 MB should be stored using Git LFS (see the example below)
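For instance, a `.gitattributes` file along these lines (extensions chosen for illustration) routes non-diffable formats to LFS; it contains the same configuration that `git lfs track` writes for you:

```
# Track common binary formats with Git LFS
*.pdf filter=lfs diff=lfs merge=lfs -text
*.dta filter=lfs diff=lfs merge=lfs -text
*.rds filter=lfs diff=lfs merge=lfs -text
```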

The total size of the files stored directly in Git (including version history) should not exceed 1 GB per repository. The total size of a repository, including all files in LFS and their version history, should not exceed 5 GB. If a repository uses inputs or produces outputs larger than this, the large input or output files should be stored separately.

It is important to remember that because GitHub stores every commit, once a file is committed, its impact on repository size is permanent. Think very carefully before committing large files, especially large binaries -- this is one of the few mistakes that cannot easily be undone in GitHub.

We store large data files and data that need to be shared across multiple projects on Dropbox when possible, or occasionally in other large-scale storage locations.

When a task ends, all related files should either be completed (clean, documented, and stored in shared locations ready for others to use) or abandoned (deleted), but never left indefinitely in a half-finished state. At any given time, we should be able to wipe clean the storage on our local machines, scratch spaces, etc. with little or no substantive loss.

save_data

Our gslab tools libraries include a command called save_data that is designed to enforce best practices with data files. It (i) requires that the user explicitly specify a key for each data file at the time it is saved, (ii) checks that the key is unique and non-missing, (iii) optionally sorts the data by the key, and (iv) outputs a data manifest log recording summary statistics of the variables. The data manifest makes it easy to diff data in Git and to quickly spot-check features of the data during peer review without opening the data files themselves.
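As a rough illustration of the checks just described (a sketch in pandas, not the actual gslab save_data interface; see the links below for the real commands):

```python
import pandas as pd

def save_data_sketch(df, key, path):
    """Sketch of the save_data logic: explicit key, uniqueness check, sort, manifest."""
    # (i)/(ii) the key must be specified explicitly, non-missing, and unique
    assert not df[key].isna().any().any(), "key contains missing values"
    assert not df.duplicated(subset=key).any(), "key does not uniquely identify rows"
    # (iii) sort the data by the key
    df = df.sort_values(key)
    # (iv) write a manifest of summary statistics alongside the data
    manifest = df.describe(include="all").transpose()
    manifest["n_missing"] = df.isna().sum()
    manifest.to_csv(path + ".manifest.csv")
    df.to_csv(path, index=False)
```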

We have versions of the command for R (here), Stata (here), and Python (here). Detailed documentation is provided in the code and/or help files for these commands.

Any code that saves data files should use save_data rather than the built-in save commands in R / Stata / Python. Exceptions should be rare and the rationale for them should be documented in a comment in the code.

Checking data integrity

Any time data and/or data-building code are merged back to main, the issue assignee and peer reviewer should confirm that all data files are accompanied by data file manifest(s) from save_data, and/or that any exceptions to this are intentional and documented. They should also take time to confirm that the structure and features of the data reported in the manifests match expectations (a sketch of these checks follows the list), including:

  • Total number of observations in the table
  • Number of missing values for each variable
  • Mean, min, and max of each variable
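A sketch of how a reviewer might verify these quantities directly against the data (the file name, variable names, and expected values are hypothetical):

```python
import pandas as pd

df = pd.read_csv("output/county_spending.csv")  # hypothetical downstream table

assert len(df) == 3142, "unexpected number of observations"        # total observations
assert df["total_spending"].isna().sum() == 0, "unexpected missing values"
assert df["total_spending"].min() >= 0, "spending should be non-negative"
print(df["total_spending"].agg(["mean", "min", "max"]))            # compare to the manifest
```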

For data that we expect to be used extensively downstream, the assignee or peer reviewer may suggest additional data integrity checks (see the sketch after this list), such as:

  • Relationships among table keys match expectations (e.g., the number of unique values of the variable county_id in an individual-level table is equal to or less than the number of unique values in the associated county-level table)

  • Relationships among variables match expectations (e.g., the variable total_spending in a county-level table matches the sum of spending by county in the individual-level table)

  • Values in downstream tables pass spot checks against the raw data (e.g., if we re-compute the county-level total_spending by hand for a half dozen cases in the raw data file, this matches the downstream value)
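A sketch of how such checks might be coded, assuming hypothetical individual- and county-level tables:

```python
import pandas as pd

individuals = pd.read_csv("output/individuals.csv")  # hypothetical file names
counties = pd.read_csv("output/counties.csv")

# Key relationships: every county_id at the individual level exists at the county level,
# so the individual table cannot have more unique counties than the county table.
assert individuals["county_id"].isin(counties["county_id"]).all()
assert individuals["county_id"].nunique() <= counties["county_id"].nunique()

# Variable relationships: county-level total_spending equals the sum of individual spending.
sums = individuals.groupby("county_id", as_index=False)["spending"].sum()
sums = sums.rename(columns={"spending": "spending_from_individuals"})
check = counties.merge(sums, on="county_id", how="left").fillna({"spending_from_individuals": 0})
assert (check["total_spending"] - check["spending_from_individuals"]).abs().max() < 1e-6
```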