Connect processes into powerful data pipelines with a simple git-like filesystem interface

DataKit -- Orchestrate applications using a 9P dataflow

DataKit is a tool to orchestrate applications using a 9P dataflow. It revisits the UNIX pipeline concept, with a modern twist: streams of tree-structured data instead of raw text. DataKit allows you to define complex build pipelines over version-controlled data, using shell scripts interacting with the filesystem.

DataKit is currently used as the coordination layer for HyperKit, the hypervisor component of Docker for Mac and Windows.


Quick Start

The easiest way to use DataKit is to start both the server and the client in a container.

To expose a Git repository as a 9p endpoint on port 5640 on a private network, just run:

$ docker network create datakit-net # create a private network
$ docker run -it --net datakit-net --name datakit -v <path/to/git/repo>:/data docker/datakit

Note: The --name datakit option is mandatory. It will allow the client to connect to a known name on the private network.

You can then start a DataKit client, which will mount the 9p endpoint and expose the database as a filesystem API:

# In another terminal
$ docker run -it --privileged --net datakit-net docker/datakit:client
$ ls /db
branch     remotes    snapshots  trees

Note: the --privileged option is needed because the container will have to mount the 9p endpoint into its local filesystem.

Now you can explore, edit and script /db. See the Filesystem API for more details.

Experimental GitHub API bindings

To start DataKit with the experimental GitHub bindings:

$ docker run -it --net datakit-net --name datakit -v <path/to/git/repo>:/data docker/datakit:github
$ docker run -it --privileged --net datakit-net docker/datakit:client
$ ls /db
branch      github.com  remotes     snapshots   trees

Building

The easiest way to build the DataKit project is to use Docker (which is what the start-datakit.sh script does under the hood):

$ docker build -t datakit .
$ docker run datakit

These commands will expose the database's 9p endpoint on port 5640.

If you really want to build the project from source, you will need to install OCaml and opam. Then run:

$ opam pin add datakit . -n -y
$ opam depext datakit -y
$ opam install alcotest datakit --deps-only -y
$ make && make test

Usage

$ datakit --help

Filesystem API

The /branch directory contains one subdirectory for each branch. Use mkdir to create a new branch and rm to delete one.

Each branch directory contains:

  • fast-forward will do a fast-forward merge to any commit ID written to this file.

  • head gives the commit ID of the head of the branch when read (or the empty string if the branch is empty).

  • head.live is a stream which produces a list of commit IDs, one per line, starting with the current commit and returning new commits as the branch is updated. A branch with no commits is represented by a blank line.

  • reflog is a stream which outputs a new line each time the current HEAD is updated. The line gives the commit hash (or is blank if the branch has been deleted). Unlike head.live, reflog does not start by outputting the current commit and it does not skip commits.

  • ro is a live read-only view of the current contents of the head of the branch.

  • transactions is used to update the branch.

  • watch can be used to watch specific files or directories for changes.

Note that reading from head.live will skip directly to the latest commit: even if you read continuously from it, you will not necessarily see all intermediate commits.
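
As a rough sketch of how a script might consume head.live, the helper below reads commit IDs line by line and prints one message per update. The function name dk_follow_head is hypothetical (it is not part of DataKit), and the database mount point is passed in as an argument rather than assumed.

```shell
# Hypothetical helper (not part of DataKit): follow a branch's head.live
# stream and print one line per update.
# Usage: dk_follow_head <db-mount-point> <branch>
dk_follow_head() {
  db=$1
  branch=$2
  while IFS= read -r commit; do
    if [ -z "$commit" ]; then
      # A blank line means the branch has no commits.
      echo "branch $branch is empty"
    else
      echo "branch $branch is now at $commit"
    fi
  done < "$db/branch/$branch/head.live"
}
```

On a real mount the loop should block until the branch changes, so it would normally run in the background. Since head.live may skip intermediate commits, reading reflog in the same way is the better fit when every commit matters.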

The root also contains /snapshots, which can be used to explore any commit in the repository, if you know its ID. The directory will always appear empty, but attempting to access a subdirectory named by a commit ID will work.

The /trees directory works in a similar way to /snapshots, but is indexed by directory tree or file hashes (as read from tree.live) rather than by commit hashes.

Transactions

Read/write transactions can be created by making a new directory for the transaction in transactions. The newly created directory will contain:

  • rw, a directory with the current contents of the transaction. Initially, this is a copy of the branch's ro directory. Modify this as desired.

  • msg, the commit message to use.

  • parents, the list of commit hashes of the parents, one per line. Initially, this is the single head commit at the time the transaction was created, but it can be modified to produce other effects. Simply appending another branch's 'head' here is equivalent to doing a Git merge with strategy 'ours' (which is not the same as "recursive/ours").

  • ctl, which can be used to commit the transaction (by writing commit to it) or to cancel it (by writing close).

  • merge, which can be used to start a merge (see below).

  • diff/ is a directory containing hidden files. diff/<commit-id> contains the diff between the given commit-id and the current state of the transaction.

For example, to create a file somefile:

~/db $ mkdir branch/master/transactions/foo
~/db $ echo somedata > branch/master/transactions/foo/rw/somefile
~/db $ echo commit > branch/master/transactions/foo/ctl

If the branch has been updated since the transaction was created then, when you try to commit, Irmin will try to merge the changes.

If there is a conflict (two edits to the same file) then the commit will fail. Merge errors are reported as 9p error strings. When a commit succeeds the transaction directory is automatically removed.

Each 9p connection has its own set of transactions, and the changes in a transaction cannot be seen by other clients until the transaction is committed.
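
The three steps above can be wrapped in a small function. This is only a sketch: dk_commit_file and the tx-PID transaction name are hypothetical choices, and it assumes that writing to msg before committing sets the commit message as described in the list above.

```shell
# Hypothetical helper (not part of DataKit): write one file to a branch
# in a single transaction.
# Usage: dk_commit_file <db-mount-point> <branch> <path> <data> <message>
dk_commit_file() {
  db=$1
  branch=$2
  path=$3
  data=$4
  msg=$5
  # Name the transaction after the shell's PID (an arbitrary choice).
  tr="$db/branch/$branch/transactions/tx-$$"
  mkdir "$tr" || return 1
  # Create any missing parent directories inside rw.
  mkdir -p "$(dirname "$tr/rw/$path")" || return 1
  printf '%s\n' "$data" > "$tr/rw/$path" || return 1
  printf '%s\n' "$msg" > "$tr/msg" || return 1
  # Writing "commit" to ctl commits the transaction.
  echo commit > "$tr/ctl"
}
```

For example, dk_commit_file ~/db master somefile somedata "Add somefile" would mirror the session shown above. If the final write fails, the transaction is left open and can be inspected or cancelled by writing close to its ctl file.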

Merging

Within a transaction, write a commit ID to the merge file to begin a merge. The transaction directory will change slightly:

  • ours is a read-only directory, containing whatever was previously in rw
  • theirs is the commit being merged
  • base is a common ancestor (or empty, if the commits share no history)
  • rw contains irmin9p's initial attempt at a merge
  • conflicts is a list of files in rw that need to be resolved manually
  • parents has the new commit appended to it

Note that, unlike Git, irmin9p does not attempt to merge within files. It simply replaces files with conflicting changes with a message noting the conflict.

For each file in conflicts you should resolve the problem by either deleting the file or doing your own three-way merge using ours, theirs and base. When a file has been edited, it is removed from conflicts. You cannot commit the transaction while conflicts is non-empty.

You may merge several commits in a single transaction, if desired. However, doing multiple non-trivial merges at once will make viewing the resulting merge commit difficult with most tools.
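
Because a transaction cannot be committed while conflicts is non-empty, a script may want to check that file first. The sketch below does so; dk_commit_if_resolved is a hypothetical name, and the function only assumes the conflicts and ctl files described above.

```shell
# Hypothetical helper (not part of DataKit): commit the transaction in
# directory $1 only if its conflicts file is empty.
dk_commit_if_resolved() {
  tr=$1
  conflicts=$(cat "$tr/conflicts") || return 1
  if [ -n "$conflicts" ]; then
    # Report the files that still need a manual merge and give up.
    echo "unresolved conflicts:" >&2
    printf '%s\n' "$conflicts" >&2
    return 1
  fi
  echo commit > "$tr/ctl"
}
```

Called as dk_commit_if_resolved branch/master/transactions/my-merge, it returns a non-zero status and lists the remaining files on stderr when the merge is not finished.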

Snapshots

A snapshot for a given commit can be opened by accessing the directory /snapshots/COMMIT_ID, which is created on demand.

~/db $ cd snapshots/4b6557542ec9cc578d5fe09b664110ba3b68e2c2
~/d/s/4b6557542ec9cc578d5fe09b664110ba3b68e2c2 $ ls
hash  ro/
~/d/s/4b6557542ec9cc578d5fe09b664110ba3b68e2c2 $ cat hash
4b6557542ec9cc578d5fe09b664110ba3b68e2c2
~/d/s/4b6557542ec9cc578d5fe09b664110ba3b68e2c2 $ ls ro
somefile
~/d/s/4b6557542ec9cc578d5fe09b664110ba3b68e2c2 $

The contents of a snapshot directory are:

  • ro is the read-only snapshot, which will never change.

  • hash contains the commit hash.

  • msg contains the commit message.

  • parents contains the hashes of the parent commits, one per line.

  • diff/ is a directory containing hidden files. diff/<commit-id> contains the diff between the given commit-id and this snapshot.
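
Since a snapshot's ro directory never changes, it gives a simple way to read a file as it was at a given commit. A minimal sketch, where dk_cat_at is a hypothetical helper name:

```shell
# Hypothetical helper (not part of DataKit): print a file as it was at a
# given commit.
# Usage: dk_cat_at <db-mount-point> <commit-id> <path>
dk_cat_at() {
  db=$1
  commit=$2
  path=$3
  # The snapshot directory is created on demand when it is accessed.
  cat "$db/snapshots/$commit/ro/$path"
}
```

For instance, dk_cat_at ~/db 4b6557542ec9cc578d5fe09b664110ba3b68e2c2 somefile would print the contents of somefile from the snapshot shown above.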

Watches

To watch for changes affecting a specific file or subdirectory in a branch, use the branch's watch directory.

Each directory under watch contains a tree.live file that outputs the current hash of the object that directory watches. The top watch/tree.live file tracks changes to all files and directories. To watch for changes under src/ui, read the file watch/src.node/ui.node/tree.live. That is, add .node to each path component to get a directory for that node.
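
The .node naming rule is mechanical, so it can be scripted. The sketch below converts a repository path into the matching tree.live path, relative to the branch directory. dk_watch_path is a hypothetical helper, and treating an empty path as the top-level watch is an assumption based on the description above.

```shell
# Hypothetical helper (not part of DataKit): map a repository path to the
# corresponding tree.live file, relative to the branch directory.
dk_watch_path() {
  # Strip leading and trailing slashes, then turn each remaining
  # separator into ".node/" so every component but the last gets a suffix.
  rel=$(printf '%s\n' "$1" | sed -e 's|^/*||' -e 's|/*$||' -e 's|//*|.node/|g')
  if [ -z "$rel" ]; then
    # An empty path refers to the top-level watch.
    echo "watch/tree.live"
  else
    # The last component still needs its own ".node" suffix.
    echo "watch/$rel.node/tree.live"
  fi
}
```

With this, cat "branch/master/$(dk_watch_path src/ui)" would read watch/src.node/ui.node/tree.live in the master branch.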

Reading from a tree.live file first outputs one line describing the current state of the path. This can be:

  • A blank line, if the path does not currently exist.
  • D-HASH if the path is a directory (the hash is the tree hash).
  • F-HASH if the path is a file (the hash is the hash of the blob).
  • X-HASH if the path is an executable file (the hash is the hash of the blob).
  • L-HASH if the path is a symlink (the hash is the hash of the blob containing the target string).

When the branch head changes so that the path has a different output, a new line will be produced, in the same format. As with head.live, watching for changes is triggered by reading on the open file, so if several changes occur between reads then you will only see the latest one.

Note: Listing a watch directory shows .node subdirectories for paths that currently exist. However, these are just suggestions; you can watch any path, whether it currently exists or not.
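
A script reading tree.live has to decode the line format listed above. The following sketch shows one way to do that. dk_describe_node and the words it prints are hypothetical, and the prefixes are taken directly from the list.

```shell
# Hypothetical helper (not part of DataKit): decode one tree.live line
# into "<kind> <hash>", or "missing" for a blank line.
dk_describe_node() {
  line=$1
  case $line in
    "")  echo "missing" ;;
    D-*) echo "directory ${line#D-}" ;;
    F-*) echo "file ${line#F-}" ;;
    X-*) echo "executable ${line#X-}" ;;
    L-*) echo "symlink ${line#L-}" ;;
    *)   echo "unknown $line"; return 1 ;;
  esac
}
```

It can be combined with a read loop, for example: while IFS= read -r l; do dk_describe_node "$l"; done < watch/src.node/tree.live.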

Diff

To see the difference between a given commit ID and the head of a branch, use the branch's diff directory.

Each file under the diff directory contains a line per change of the form:

  • + <path> means that the file path has been added between commit-id and HEAD;
  • - <path> means that the file path has been removed between commit-id and HEAD;
  • * <path> means that the file path has been modified between commit-id and HEAD.

For instance:

~/db $ cat branch/master/diff/6b2e00a0be59c0335568dd9415a7d93640e7099c
+ foo
* bar

This means that foo has been added and bar has been modified in the master branch since commit 6b2e00a0be59c0335568dd9415a7d93640e7099c.

Note: this also works when you are inside a transaction.
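
The diff format is line-oriented, so it is easy to filter in a script. The sketch below reads diff lines on standard input and prints only the paths with a given marker. dk_diff_paths is a hypothetical helper, and it assumes exactly one space between the marker and the path, as in the example above.

```shell
# Hypothetical helper (not part of DataKit): print the paths from a diff
# file (read on stdin) whose change marker is $1: "+", "-" or "*".
dk_diff_paths() {
  marker=$1
  while IFS= read -r line; do
    case $line in
      # The quoted part is matched literally, so "*" is not a wildcard here.
      "$marker "*) printf '%s\n' "${line#??}" ;;
    esac
  done
}
```

For example, dk_diff_paths '*' < branch/master/diff/6b2e00a0be59c0335568dd9415a7d93640e7099c would print only the modified paths.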

Fetch

To fetch from a remote repository, use the /remotes root directory. This directory is not persisted, so its contents will disappear across reboots.

Each directory /remotes/<name> holds the configuration of a remote server called <name>. Create a new directory (with mkdir) to add a new configuration. Every configuration directory contains:

  • A writable file: url, which contains the remote url.
  • A control file: fetch, which is used to fetch branches from the remote server.
  • A read-only stream file: head which contains the last known commit ID of the remote. On every fetch, a new line is added with the commit ID of the remote branch.

To fetch https://github.com/docker/datakit's master branch using the git protocol:

~/db $ cd remotes
~/db/remotes $ mkdir origin
~/db/remotes $ echo git://github.com/docker/datakit > origin/url
~/db/remotes $ echo master > origin/fetch
~/db/remotes $ cat origin/head
4b6557542ec9cc578d5fe09b664110ba3b68e2c2
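
The same steps can be collected into a function for scripting. This is a sketch only: dk_fetch is a hypothetical name, and the function does nothing beyond the documented writes to url and fetch.

```shell
# Hypothetical helper (not part of DataKit): configure a remote and fetch
# one branch from it.
# Usage: dk_fetch <db-mount-point> <remote-name> <url> <branch>
dk_fetch() {
  db=$1
  name=$2
  url=$3
  branch=$4
  remote="$db/remotes/$name"
  # Create the configuration directory unless it already exists.
  [ -d "$remote" ] || mkdir "$remote" || return 1
  echo "$url" > "$remote/url" || return 1
  # Writing a branch name to the control file triggers the fetch.
  echo "$branch" > "$remote/fetch"
}
```

After dk_fetch ~/db origin git://github.com/docker/datakit master, the fetched commit ID can be read from remotes/origin/head as shown above. In principle that ID could then be written to a local branch's fast-forward file, although this combination is not shown in the original examples.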

GitHub PRs

There is basic support for interacting with GitHub PRs.

~/db $ ls github.com/docker/datakit/pr
41  42
~/db $ cat github.com/docker/datakit/pr/41/status/default/state
pending
~/db $ echo success > github.com/docker/datakit/pr/41/status/default/state

This lists the pull requests, reads the state of the default status of pull request 41, then updates that status to success.

To create a new status and set its description, URL and state:

~/db $ PR=github.com/docker/datakit/pr/41
~/db $ mkdir $PR/status/test
~/db $ echo "My status" > $PR/status/test/descr
~/db $ echo "http://example.com" > $PR/status/test/url
~/db $ echo success > $PR/status/test/state
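
A CI script would typically repeat these writes for every build, so they can be wrapped in one function. The sketch below is an illustration: dk_set_status is a hypothetical name, and it relies only on the descr, url and state files used above.

```shell
# Hypothetical helper (not part of DataKit): create or update a status on
# a pull request directory.
# Usage: dk_set_status <pr-dir> <status-name> <state> <description> <url>
dk_set_status() {
  pr=$1
  name=$2
  state=$3
  descr=$4
  url=$5
  status="$pr/status/$name"
  # Create the status directory unless it already exists.
  [ -d "$status" ] || mkdir "$status" || return 1
  echo "$descr" > "$status/descr" || return 1
  echo "$url" > "$status/url" || return 1
  # Write the state last, once the description and URL are in place.
  echo "$state" > "$status/state"
}
```

For example: dk_set_status github.com/docker/datakit/pr/41 test success "My status" http://example.com.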

To read the last GitHub events related to a repository:

~/db $ cat github.com/docker/datakit/events

This is a non-blocking read. Each line of the output describes one event.

Prometheus metric reporting

Run with --listen-prometheus 9090 to expose metrics at http://*:9090/metrics.

Note: there is no encryption and no access control. You are expected to run the database in a container and to not export this port to the outside world. You can either collect the metrics by running a Prometheus service in a container on the same Docker network, or front the service with nginx or similar if you want to collect metrics remotely.

How do I...

Create a new branch

mkdir branch/foo

Fork an existing branch

cd branch
mkdir new-branch
cp old-branch/head new-branch/fast-forward

Rename a branch

mv branch/old-name branch/new-name

Delete a branch

rmdir branch/foo

Merge a branch

cd branch/master/transactions
mkdir my-merge
cd my-merge
cat ../../../feature/head > merge
cat conflicts
meld --auto-merge ours base theirs --output rw
echo commit > ctl

Language bindings

  • Go bindings are in the api/go directory.
  • OCaml bindings are in the api/ocaml directory. See examples/ocaml-client for an example.

Licensing

DataKit is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.