DAta Nudged into GIT - File-based datasets that use git for version control of individual records
DANGIT is an experimental way to do version control on datasets. By storing each row/record of a dataset as a file in a github repository, it is possible to easily track changes and allow for anyone (yes, anyone... the feedback loop is open!) to submit changes using the same workflows as open source software development. The files for the individual rows/records are then built into a single dataset file. Here's the gist with a braindump for the idea.
In the future, a simple UI with Github Single Sign-on would allow non-technical users to perform the entire fork/edit/build/pull request workflow without using the command line or editing text files.
- Clone this repo
- Install Dependencies
npm install
- Clone the sample dataset nyc-pizzashops
- Edit or add data to the sample dataset by editing files in
/rows
- Use DANGIT to build the dataset with your new changes
node dangit build ../nyc-pizzashops
- Create a Pull Request to submit your changes to the source repo
A dataset is maintained in its own github repository with file structure like this:
/build - the build directory, where dangit writes the built dataset file (a geojson FeatureCollection or a CSV or a JSON array of objects) the build filename should be the same as the dataset's name, with the appropriate file extension
/rows - the rows directory, where individual rows are stored as geojson features or 2D json objects
dangit.json - the dangit configuration file, which includes name, type, uid field, etc.
Edits are made on the files in /rows
, new data are added by creating new files (for now, increment uid manually. Someday the build process should validate unique ids, data types, etc)
Run DANGIT build using node, passing in the path of the dataset you would like to build:
node dangit build ../nyc-pizzashops
DANGIT looks for a dangit.json
file in the root of the directory you pass in, and starts the build based on type
. For type geojson
, it will expect each file in /rows
to be a valid geojson feature, and will write a geojson FeatureCollection into /build
.
You can participate in our early experimentation by adding or editing (or deleting) rows to the dataset nyc-pizzashops. Fork the repo, make your changes to the rows, build the distribution file, and do a pull request back to the source repo.
Commit messages should include as much info as possible about the rows that were edited/added/removed.
Pull requests on dataset repos should include a successful build of the data. (how should we validate this)