CWLProv BagIt profile
The CWLProv folder structure (the bag) is outlined below, using the revsort-run-1 example:
- bagit.txt - bag marker for BagIt
- bag-info.txt - minimal bag metadata (notably the
- manifest-*.txt - checksums of files under data/ (algorithms subject to change)
- tagmanifest-*.txt - checksums of the remaining files (algorithms subject to change)
- metadata/manifest.json - Research Object manifest as JSON-LD. Types and relates files within the bag.
- metadata/logs/* - raw output logs from the workflow engine
- metadata/provenance/primary.cwlprov* - provenance traces of workflow execution
- data/ - bag payload: workflow/step input/output data files (content-addressable)
- data/32/327fc7aedf4f6b69a42a7c8b808dc5a7aff61376 - a data item with checksum 327fc7aedf4f6b69a42a7c8b808dc5a7aff61376 (checksum algorithm is subject to change)
- workflow/packed.cwl - The cwltool --pack standalone version of the executed workflow
- workflow/primary-job.json - Job input JSON document for use with packed.cwl (references
- workflow/primary-output.json - Job output JSON document (references
- snapshot/ - Direct copies of original *.cwl files used for execution. Note: may have broken relative/absolute paths
Covering the full details of the BagIt specification is out of scope for this document. Instead, it describes the constraints that apply to CWLProv bags, that is, the CWLProv BagIt profile.
In CWLProv the metadata file bag-info.txt MUST be present and MUST contain an
External-Identifier header. The headers
Bag-Software-Agent SHOULD be present. The header
BagIt-Profile-Identifier MUST be present and SHOULD have the value
https://w3id.org/ro/bagit/profile to indicate conformance to the Research Object BagIt profile.
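The header requirements above could be checked along the following lines. This is a minimal sketch, not a reference implementation: the function names parse_bag_info and check_bag_info are hypothetical, labels are compared exactly as written (an assumption of this sketch), and only the headers named in this section are checked.

```python
RO_PROFILE = "https://w3id.org/ro/bagit/profile"


def parse_bag_info(text: str) -> dict[str, list[str]]:
    """Parse 'Label: value' lines; indented lines continue the previous value."""
    headers: dict[str, list[str]] = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and last is not None:
            # Continuation line: append to the most recent value.
            headers[last][-1] += " " + line.strip()
            continue
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a header line; ignored in this sketch
        last = label.strip()
        headers.setdefault(last, []).append(value.strip())
    return headers


def check_bag_info(text: str) -> list[str]:
    """Return a list of problems, each prefixed with its requirement level."""
    headers = parse_bag_info(text)
    problems = []
    for required in ("External-Identifier", "BagIt-Profile-Identifier"):
        if required not in headers:
            problems.append(f"MUST: missing {required}")
    if "Bag-Software-Agent" not in headers:
        problems.append("SHOULD: missing Bag-Software-Agent")
    profile = headers.get("BagIt-Profile-Identifier")
    if profile is not None and RO_PROFILE not in profile:
        problems.append("SHOULD: BagIt-Profile-Identifier is not the RO BagIt profile")
    return problems
```

An empty result means none of the checked constraints were violated; it does not by itself show that the bag is a valid BagIt bag.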
CWLProv file names MUST be lower case, except for snapshot files, whose file names SHOULD be derived from their original file names.
In CWLProv the payload directory data/ SHOULD only contain data files or structures that have been used in the workflow execution (e.g. input and output files). Other files (such as provenance traces or workflow definitions) SHOULD be stored as tag files in other directories.
In CWLProv the payload files SHOULD have file paths derived from their own hash (content-addressable). However, CWLProv consumers MUST NOT assume this, as implementations MAY use other unique file names such as UUIDs. Implementations MAY use any hash algorithm, but it is RECOMMENDED that the algorithm corresponds to a manifest file.
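As an illustration, a producer might derive a content-addressable payload path and the matching manifest entry as sketched below. The function name manifest_entry is hypothetical, and the layout with a two-character prefix directory is an assumption modelled on the data/32/327fc7... item in the listing above, not a requirement of the profile. The manifest line follows the BagIt convention of a checksum followed by the file path.

```python
import hashlib


def manifest_entry(content: bytes, algorithm: str = "sha1") -> tuple[str, str, str]:
    """Return (payload path, manifest file name, manifest line) for some content.

    The same hash is used for the file path and for the manifest, so the
    algorithm corresponds to a manifest file as RECOMMENDED above.
    """
    digest = hashlib.new(algorithm, content).hexdigest()
    # Assumed layout: the first two hex characters form a subdirectory.
    path = f"data/{digest[:2]}/{digest}"
    return path, f"manifest-{algorithm}.txt", f"{digest}  {path}"


path, manifest_name, line = manifest_entry(b"hello\n")
print(path)
print(manifest_name)
print(line)
```

Consumers should still look up payload files through the manifest and the Research Object manifest rather than recomputing paths, since producers MAY use other naming schemes.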
Reasonable subdirectory structures SHOULD be used to avoid a single directory with a large number of files. For instance data/97/ contains 97fe1b50b4582cebc7d853796ebd62e3e163aa3f which happens to have the SHA1 checksum
All files outside
manifest*txt) SHOULD be listed in a corresponding tag manifest file, e.g. tagmanifest-sha1.txt. CWLProv bags SHOULD include the tag manifest using algorithms
CWLProv bags SHOULD include a system-independent runnable version of the executed workflow under
CWLProv bags SHOULD contain a CWL workflow in workflow/packed.cwl, which SHOULD avoid any includes or references external to the CWLProv bag folder structure. This file can be created with cwltool --pack or through other means.
CWLProv bags MAY include direct copies of arbitrarily named workflow files used at execution time under snapshot/. Consumers SHOULD NOT assume these files are CWL workflows (unless so declared in the RO manifest, see below). Consumers SHOULD NOT assume that snapshot files have valid relative paths in their internal cross-referencing. Producers SHOULD use snapshot file names that reflect their original file names, taking reasonable care to avoid duplication or overlaps.
CWLProv bags MUST include provenance traces of the workflow run under metadata/provenance, of which the file primary.cwlprov.provn MUST be present in PROV-N format, describing the top-level workflow execution according to the CWLProv PROV profile. Other provenance files and formats MAY be present, in which case they SHOULD have conformsTo declared in the RO manifest.
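Several of the file-level constraints in this section lend themselves to a simple layout check. The sketch below is illustrative only: check_bag_layout is a hypothetical helper, it covers just the files named explicitly in this section, and it does not verify checksums, the PROV-N content or the RO manifest.

```python
from pathlib import Path

# Files this section states MUST be present.
REQUIRED = (
    "bagit.txt",
    "bag-info.txt",
    "metadata/provenance/primary.cwlprov.provn",
)
# Files this section states SHOULD be present.
RECOMMENDED = ("workflow/packed.cwl",)


def check_bag_layout(bag_root) -> list[str]:
    """Return a list of layout problems found under bag_root."""
    root = Path(bag_root)
    problems = []
    for rel in REQUIRED:
        if not (root / rel).is_file():
            problems.append(f"MUST: missing {rel}")
    for rel in RECOMMENDED:
        if not (root / rel).is_file():
            problems.append(f"SHOULD: missing {rel}")
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Snapshot files keep names derived from their original file names.
        if rel.parts[0] == "snapshot" or not path.is_file():
            continue
        if path.name != path.name.lower():
            problems.append(f"MUST: file name not lower case: {rel.as_posix()}")
    return problems
```

Calling check_bag_layout on the folder of an unpacked bag returns an empty list when none of these particular constraints are violated.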
A challenge with Linked Data in distributed and desktop computing is how to make identifiers that are absolute URIs (and hence globally unique). For CWLProv, for example, a workflow may be executed by an engine that does not know where its workflow provenance will eventually be stored, published or integrated.
In this example the root URI for the bag is
arcp://uuid,4cca2dd8-3bd5-45cb-8d34-e7f346be027e/, as declared in bag-info.txt with
Consumers of CWLProv bags that do not contain an arcp-based
External-Identifier SHOULD generate a temporary arcp base to safely resolve any relative URI references without climbing outside the CWLProv folder.
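A consumer could generate such a temporary base and resolve relative references against it roughly as follows. This is a sketch using only the Python standard library: the helper names temporary_arcp_base and resolve_in_bag are hypothetical, the uuid form of the base mirrors the example root URI above, and query strings, fragments and percent-encoding are deliberately ignored.

```python
import posixpath
import uuid


def temporary_arcp_base() -> str:
    """Make a fresh, random arcp base URI for a bag without one."""
    return f"arcp://uuid,{uuid.uuid4()}/"


def resolve_in_bag(base: str, relative: str) -> str:
    """Resolve a relative path against an arcp base, clamped to the bag root.

    urllib.parse.urljoin is avoided here because it returns the reference
    unchanged for schemes it does not recognise as hierarchical.
    """
    # Normalising against a virtual root "/" removes "." and ".." segments;
    # ".." at the root stays at the root, so the result cannot climb outside.
    # Note that normpath also drops any trailing slash.
    path = posixpath.normpath(posixpath.join("/", relative.lstrip("/")))
    return base.rstrip("/") + path


base = temporary_arcp_base()
print(resolve_in_bag(base, "metadata/manifest.json"))
print(resolve_in_bag(base, "../../outside.txt"))  # stays under the base
```

Because the base is random, identifiers made this way are only stable for the lifetime of the consumer's session and should not be published as identifiers of the bag.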
Implementations processing a CWLProv RO MAY convert arcp URIs to their local
http:// URIs depending on how and where the CWLProv bag was saved, for instance using the arcp.py library.