update documentation

jhpoelen committed Feb 1, 2019
1 parent 0483d00 commit 20cb27010f57ed466612df9a600150c370aa62af
@@ -318,10 +318,9 @@ $ preston ls -l tsv | grep "/.well-known/genid/" | grep "Version" | cut -f1,3 |
Keeping track of changes across a diverse consortium of data publishers is necessary for reproducible workflows and reliable results. As datasets change, Preston can help you see *exactly* what changed. For instance, the GBIF dataset registry changes as datasets are added, updated or deprecated. Below is an example of two versions of the https://api.gbif.org/v1/dataset endpoint, one from 2018-09-03 and the other from 2018-09-04. Using ```jq``` and ```diff``` in combination with ```preston get``` and ```preston history``` gives us a way to check what changed.

```console
$ preston history https://api.gbif.org/v1/dataset
$ preston ls | grep https://api.gbif.org/v1/dataset
<https://api.gbif.org/v1/dataset> <http://purl.org/pav/hasVersion> <hash://sha256/184886cc6ae4490a49a70b6fd9a3e1dfafce433fc8e3d022c89e0b75ea3cda0b> .
<hash://sha256/1846abf2b9623697cf9b2212e019bc1f6dc4a20da51b3b5629bfb964dc808c02> <http://www.w3.org/ns/prov#generatedAtTime> "2018-09-03T02:19:14.636Z" .
<hash://sha256/1846abf2b9623697cf9b2212e019bc1f6dc4a20da51b3b5629bfb964dc808c02> <http://purl.org/pav/previousVersion> <hash://sha256/184886cc6ae4490a49a70b6fd9a3e1dfafce433fc8e3d022c89e0b75ea3cda0b> .
<https://api.gbif.org/v1/dataset> <http://purl.org/pav/hasVersion> <hash://sha256/1846abf2b9623697cf9b2212e019bc1f6dc4a20da51b3b5629bfb964dc808c02> .
$ preston get hash://sha256/184886cc6ae4490a49a70b6fd9a3e1dfafce433fc8e3d022c89e0b75ea3cda0b | jq . > one.json
$ preston get hash://sha256/1846abf2b9623697cf9b2212e019bc1f6dc4a20da51b3b5629bfb964dc808c02 | jq . > two.json
$ diff one.json two.json
@@ -491,10 +490,6 @@ Please use [maven](https://maven.apache.org) version 3.3+.

## Examples

### preston update



### Maven, Gradle, SBT
Preston is made available through a [maven](https://maven.apache.org) repository.

@@ -548,7 +543,7 @@ Usage: <main class> [command] [command options]
urls are provided.
Default: [https://idigbio.org, https://gbif.org, http://biocase.org]
history show history of biodiversity resource
history show history of biodiversity dataset graph
Usage: history [options] biodiversity resource locator
Options:
-l, --log
@@ -57,14 +57,12 @@ An archiver listens to statements containing a _blank_ . On receiving such a st

## `blob store`

On successfully saving the content into the blob store, a unique identifier is returned in the form of a SHA256 hash. The unique content identifier is now used to store a relation between the resource and its unique content identifier. This identifier is now used to point to the content. Also, the content is saved in a hierarchical file structure derived from the content hash. For example, if the url https://search.idigbio.org/v2/search/publishers resolved to content with a hash of hash://sha256/3eff98d4b66368fd8d1f8fa1af6a057774d8a407a4771490beeb9e7add76f362 (see [the "official" spec of hash uri notation](https://github.com/hash-uri/hash-uri/blob/master/README.md) with examples at [hash-archive.org](https://hash-archive.org)), then a file called "data" is stored in the following structure:
On successfully saving the content into the blob store, a unique identifier is returned in the form of a SHA256 hash. The unique content identifier is now used to store a relation between the resource and its unique content identifier. This identifier is now used to point to the content. Also, the content is saved in a hierarchical file structure derived from the content hash. For example, if the url https://search.idigbio.org/v2/search/publishers resolved to content with a hash of hash://sha256/3eff98d4b66368fd8d1f8fa1af6a057774d8a407a4771490beeb9e7add76f362 (see [the "official" spec of hash uri notation](https://github.com/hash-uri/hash-uri/blob/master/README.md) with examples at [hash-archive.org](https://hash-archive.org)), then a file is stored in the following structure:

```
3e/
ff/
98/
3eff98d4b66368fd8d1f8fa1af6a057774d8a407a4771490beeb9e7add76f362/
data
3eff98d4b66368fd8d1f8fa1af6a057774d8a407a4771490beeb9e7add76f362
```

With the file path being derived from the hash of the data itself, you can now easily locate the content by its hash. For instance, on the server at https://deeplinker.bio , the nginx webserver is configured such that you can retrieve the said data file by requesting https://deeplinker.bio/3eff98d4b66368fd8d1f8fa1af6a057774d8a407a4771490beeb9e7add76f362 . Note that this content hash is "real" and you can download the copy (or version) of the content that was served by https://search.idigbio.org/v2/search/publishers at some point in the past. So, using the blob store, we now have a way to easily access content as long as we know the content hash.
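
The mapping from a content hash to a file path can be sketched in a few lines of Python. This is only an illustration, not Preston's actual implementation: it assumes the layout in which the first three byte pairs of the hex digest form nested directories and the full digest is the file name, and the names `to_hash_uri` and `blob_path` are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath

PREFIX = "hash://sha256/"


def to_hash_uri(content: bytes) -> str:
    """Return the hash URI (hash://sha256/<hex digest>) of the given bytes."""
    return PREFIX + hashlib.sha256(content).hexdigest()


def blob_path(hash_uri: str) -> PurePosixPath:
    """Derive a relative storage path from a SHA256 hash URI.

    Assumed layout (hypothetical helper, mirrors the listing above):
    <chars 0-1>/<chars 2-3>/<chars 4-5>/<full hex digest>
    """
    if not hash_uri.startswith(PREFIX):
        raise ValueError(f"not a sha256 hash URI: {hash_uri}")
    hex_digest = hash_uri[len(PREFIX):]
    if len(hex_digest) != 64 or not all(c in "0123456789abcdef" for c in hex_digest):
        raise ValueError(f"malformed sha256 hex digest: {hex_digest}")
    return PurePosixPath(hex_digest[0:2], hex_digest[2:4], hex_digest[4:6], hex_digest)


if __name__ == "__main__":
    uri = "hash://sha256/3eff98d4b66368fd8d1f8fa1af6a057774d8a407a4771490beeb9e7add76f362"
    # prints 3e/ff/98/3eff98d4b66368fd8d1f8fa1af6a057774d8a407a4771490beeb9e7add76f362
    print(blob_path(uri))
```

Because the path depends only on the content hash, any copy of the same bytes maps to the same location, which is what makes lookup by hash possible without a separate index.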
@@ -100,9 +98,7 @@ The simplified hexastore itself uses the same folder structure as the blob store
```
a2/
1d/
81/
a21d81acb039ca8daa013b4eebe52d5eda4f23d29c95d0f04888583ca5c8af4e/
data
a21d81acb039ca8daa013b4eebe52d5eda4f23d29c95d0f04888583ca5c8af4e
```
As you might have seen, deeplinker.bio resolves https://deeplinker.bio/a21d81acb039ca8daa013b4eebe52d5eda4f23d29c95d0f04888583ca5c8af4e to hash://sha256/3eff98d4b66368fd8d1f8fa1af6a057774d8a407a4771490beeb9e7add76f362 . The latter hash can now be used to resolve to the specific version of https://search.idigbio.org/v2/search/publishers . Using curl, jq, and head, the first few lines of the json content can be shown:
@@ -72,7 +72,7 @@
//add(Seeds.DATA_ONE.getIRIString());
}};

@Parameter(description = "content URLs to update. If specified, the seeds will not be used.",
@Parameter(description = "[url1] [url2] ...",
validateWith = IRIValidator.class)
private List<String> IRIs = new ArrayList<>();

@@ -15,7 +15,7 @@
import static bio.guoda.preston.RefNodeConstants.ARCHIVE;
import static bio.guoda.preston.model.RefNodeFactory.toBlank;

@Parameters(separators = "= ", commandDescription = "show history of biodiversity resource")
@Parameters(separators = "= ", commandDescription = "show history of biodiversity dataset graph")
public class CmdHistory extends LoggingPersisting implements Runnable {

private static final Log LOG = LogFactory.getLog(CmdHistory.class);
