CWLProv BagIt profile

The CWLProv folder structure complies with BagIt so that its content and completeness can be verified with any BagIt tool or library.

Overview

A rough overview of the CWLProv folder structure (the bag) is explained here using the revsort-run-1 example:

Covering the full details of the BagIt specification is out of scope for this document. Instead, it describes the constraints that apply to CWLProv bags, that is, the CWLProv BagIt profile.

BagIt files

The base directory of a bag MUST contain the marker file bagit.txt, whose BagIt-Version SHOULD be 1.0 (corresponding to draft-kunze-bagit-16). Tag-File-Character-Encoding MUST be UTF-8 in CWLProv.
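The bagit.txt constraints above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names `read_bagit_declaration` and `check_bagit_declaration` are hypothetical and not part of any CWLProv tool.

```python
from pathlib import Path


def read_bagit_declaration(bag_dir):
    """Parse bagit.txt into a dict of its 'Label: value' lines."""
    text = (Path(bag_dir) / "bagit.txt").read_text(encoding="utf-8")
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            label, value = line.split(":", 1)
            fields[label.strip()] = value.strip()
    return fields


def check_bagit_declaration(fields):
    """Return (errors, warnings): MUST violations and SHOULD violations."""
    errors, warnings = [], []
    if fields.get("Tag-File-Character-Encoding", "").upper() != "UTF-8":
        errors.append("Tag-File-Character-Encoding MUST be UTF-8")
    if fields.get("BagIt-Version") != "1.0":
        warnings.append("BagIt-Version SHOULD be 1.0")
    return errors, warnings
```

Separating errors from warnings mirrors the MUST/SHOULD distinction in the text, so a caller can reject a bag on errors while only reporting warnings.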

In CWLProv the metadata file bag-info.txt MUST be present and MUST contain an External-Identifier header. The headers Bagging-Date and Bag-Software-Agent SHOULD be present. The header BagIt-Profile-Identifier MUST be present and SHOULD have the value https://w3id.org/ro/bagit/profile to indicate conformance to the Research Object BagIt profile.
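As an illustration of the bag-info.txt rules, the sketch below parses the header text and reports which requirements are not met. The names `parse_bag_info` and `check_bag_info` are hypothetical, and the parser is simplified: it assumes each label occurs once and treats indented lines as continuations of the previous value.

```python
PROFILE = "https://w3id.org/ro/bagit/profile"


def parse_bag_info(text):
    """Parse bag-info.txt text into a dict of header label -> value."""
    headers = {}
    last = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and last is not None:
            # Indented line: continuation of the previous header value.
            headers[last] += " " + line.strip()
        elif ":" in line:
            label, value = line.split(":", 1)
            last = label.strip()
            headers[last] = value.strip()
    return headers


def check_bag_info(headers):
    """Return (errors, warnings): MUST violations and SHOULD violations."""
    errors, warnings = [], []
    for label in ("External-Identifier", "BagIt-Profile-Identifier"):
        if label not in headers:
            errors.append(label + " MUST be present")
    if headers.get("BagIt-Profile-Identifier", PROFILE) != PROFILE:
        warnings.append("BagIt-Profile-Identifier SHOULD be " + PROFILE)
    for label in ("Bagging-Date", "Bag-Software-Agent"):
        if label not in headers:
            warnings.append(label + " SHOULD be present")
    return errors, warnings
```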

CWLProv file names MUST be lower case, except for snapshot files, whose file names SHOULD be derived from their original file names.

Payload

In CWLProv the payload directory data/ SHOULD only contain data files or structures that have been used in the workflow execution (e.g. input and output files). Other files (such as provenance traces or workflow definitions) SHOULD be stored as tag files in other directories.

The checksum of every file under data/ MUST be included in every manifest file, e.g. manifest-sha1.txt. CWLProv bags SHOULD include manifests for both sha1 and sha512.
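A payload manifest can be verified by recomputing each checksum and comparing the set of listed paths with what is on disk. The following is a minimal sketch, not a replacement for a full BagIt validator: `verify_manifest` is a hypothetical name, files are read fully into memory, and the percent-encoding of unusual characters in manifest paths is ignored.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir, algorithm="sha1"):
    """Check manifest-<algorithm>.txt against the files under data/.

    Returns (mismatches, unlisted): listed paths that are missing or whose
    checksum differs, and payload files absent from the manifest.
    """
    bag = Path(bag_dir)
    mismatches, listed = [], set()
    manifest = bag / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        listed.add(rel_path)
        target = bag / rel_path
        if not target.is_file():
            mismatches.append(rel_path)
            continue
        digest = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if digest != expected.lower():
            mismatches.append(rel_path)
    on_disk = {
        p.relative_to(bag).as_posix()
        for p in (bag / "data").rglob("*")
        if p.is_file()
    }
    return mismatches, sorted(on_disk - listed)
```

Calling the function once per manifest present in the bag (for instance with `"sha1"` and `"sha512"`) covers the requirement that every manifest lists every payload file.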

In CWLProv the payload files SHOULD have file paths derived from their own hash (content-addressable). However, CWLProv consumers MUST NOT assume this, as implementations MAY use other unique file names such as UUIDs. Implementations MAY use any hash algorithm, but it is RECOMMENDED that the algorithm corresponds to a manifest file.

Reasonable subdirectory structures SHOULD be used to avoid a single directory with a large number of files. For instance data/97/ contains 97fe1b50b4582cebc7d853796ebd62e3e163aa3f which happens to have the SHA1 checksum 97fe1b50b4582cebc7d853796ebd62e3e163aa3f.
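One way to produce such content-addressable paths is sketched below. The two-character subdirectory prefix follows the data/97/ example; the function name `payload_path` is hypothetical, and implementations are free to choose a different layout.

```python
import hashlib


def payload_path(content, algorithm="sha1"):
    """Return a bag-relative path derived from the hash of the file content."""
    digest = hashlib.new(algorithm, content).hexdigest()
    # The first two hex characters become the subdirectory, e.g. data/97/97fe...
    return f"data/{digest[:2]}/{digest}"
```

Because the path is a function of the content alone, identical files map to the same location and are stored only once.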

Tag files

All files outside data/ (except bagit.txt and manifest-*.txt) SHOULD be listed in a corresponding tag manifest file, e.g. tagmanifest-sha1.txt. CWLProv bags SHOULD include tag manifests for both sha1 and sha512.
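The selection of tag files can be sketched as follows. The name `tag_manifest_lines` is hypothetical, and excluding tagmanifest-*.txt itself is an assumption made here because a file cannot contain its own checksum; the rule above only names bagit.txt and the payload manifests.

```python
import hashlib
from fnmatch import fnmatch
from pathlib import Path

# Top-level files left out of the tag manifest (the last entry is an assumption).
EXCLUDED = ("bagit.txt", "manifest-*.txt", "tagmanifest-*.txt")


def tag_manifest_lines(bag_dir, algorithm="sha1"):
    """Return '<checksum>  <path>' lines for every tag file in the bag."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted(bag.rglob("*")):
        rel = path.relative_to(bag).as_posix()
        if not path.is_file() or rel.startswith("data/"):
            continue  # directories and payload files are not tag files
        if any(fnmatch(rel, pattern) for pattern in EXCLUDED):
            continue
        digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {rel}")
    return lines
```

Writing the returned lines to `tagmanifest-<algorithm>.txt`, once per algorithm, yields the tag manifests described above.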

CWLProv bags SHOULD include a system-independent runnable version of the executed workflow under workflow/.

CWLProv bags SHOULD contain a CWL workflow in workflow/packed.cwl, which SHOULD avoid any includes or references external to the CWLProv bag folder structure. This file can be created with cwltool --pack or through other means.

CWLProv bags MAY include direct copies of arbitrarily named workflow files used at execution time under snapshot/. Consumers SHOULD NOT assume these files are CWL workflows (unless so declared in the RO manifest, see below). Consumers SHOULD NOT assume that snapshot files have valid relative paths in their internal cross-referencing. Producers SHOULD use snapshot file names that reflect their original file names, taking reasonable care to avoid duplication or overlaps.

CWLProv bags MUST include provenance traces of the workflow run under metadata/provenance/, of which the file primary.cwlprov.provn MUST be present in PROV-N format, describing the top level workflow execution according to the CWLProv PROV profile. Other provenance files and formats MAY be present, in which case they SHOULD have conformsTo declared in the RO manifest.
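A consumer can confirm the mandatory provenance file before attempting to parse anything. The sketch below only tests for presence and lists any additional provenance files; it does not validate PROV-N syntax or the RO manifest, and `check_provenance` is a hypothetical name.

```python
from pathlib import Path


def check_provenance(bag_dir):
    """Return (errors, others): MUST violations and extra provenance file names."""
    prov_dir = Path(bag_dir) / "metadata" / "provenance"
    primary = prov_dir / "primary.cwlprov.provn"
    if not prov_dir.is_dir():
        return ["metadata/provenance MUST be present"], []
    errors = []
    if not primary.is_file():
        errors.append("primary.cwlprov.provn MUST be present")
    # Other formats MAY be present; they are reported, not validated.
    others = sorted(
        p.name for p in prov_dir.iterdir() if p.is_file() and p != primary
    )
    return errors, others
```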

External Identifier

CWLProv reuses Linked Data standards like JSON-LD, W3C PROV and Research Object.

A challenge with Linked Data in distributed and desktop computing is how to make identifiers that are absolute URIs (and hence globally unique); e.g. for CWLProv a workflow may be executed by an engine that does not know where its workflow provenance will be stored, published or integrated in the end.

To this end CWLProv generators SHOULD use the proposed arcp URI scheme to map local file paths within the RO BagIt folder structure to absolute URIs for use within the RO manifest and PROV traces.

In this example the root URI for the bag is arcp://uuid,4cca2dd8-3bd5-45cb-8d34-e7f346be027e/, as declared in bag-info.txt with External-Identifier.

Consumers of CWLProv bags that do not contain an arcp-based External-Identifier SHOULD generate a temporary arcp base to safely resolve any relative URI references without climbing outside the CWLProv folder.
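The idea of a temporary arcp base can be illustrated with the standard library alone. This is a simplified sketch under stated assumptions: it handles plain relative paths only (no query strings, fragments or percent-encoding), and the function names `temporary_arcp_base` and `resolve_in_bag` are hypothetical. As in ordinary URI resolution, `..` segments that would climb above the root are dropped, so the result always stays under the bag's base URI.

```python
import posixpath
import uuid


def temporary_arcp_base():
    """Return a fresh arcp base URI built from a random UUID."""
    return f"arcp://uuid,{uuid.uuid4()}/"


def resolve_in_bag(base, relative_path):
    """Resolve a relative path against an arcp base without leaving the bag."""
    root = base.rstrip("/")
    # Normalising against "/" clamps any leading ".." segments at the bag root.
    path = posixpath.normpath("/" + relative_path.lstrip("/"))
    return root + path
```

For example, resolving `../../etc/passwd` against a base yields `<base>etc/passwd`, a location inside the bag rather than a path on the host file system.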

Implementations processing a CWLProv RO MAY convert arcp URIs to their local file:/// or http:// URIs depending on how and where the CWLProv bag was saved, for instance using the arcp.py library.
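The arcp.py library is one option for such conversions. For readers who want to see the principle without that dependency, the following standard-library sketch maps an arcp URI to a file:/// URI given the bag's base URI and the directory where the bag was unpacked. The function name `arcp_to_file_uri` and its parameters are hypothetical, and query strings and fragments are not handled.

```python
from pathlib import Path
from urllib.parse import unquote


def arcp_to_file_uri(arcp_uri, arcp_base, bag_dir):
    """Map an arcp URI inside a bag to a file:/// URI under bag_dir."""
    if not arcp_uri.startswith(arcp_base):
        raise ValueError("URI does not belong to this bag")
    relative = unquote(arcp_uri[len(arcp_base):])
    bag = Path(bag_dir).resolve()
    target = (bag / relative).resolve()
    # Refuse anything that ends up outside the unpacked bag directory.
    if target != bag and bag not in target.parents:
        raise ValueError("URI resolves outside the bag directory")
    return target.as_uri()
```

The explicit containment check is a defensive choice: even if a URI contains `..` segments or the bag contains symbolic links, the function does not return a location outside the bag.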