plugin idea: automatic metadata annotation #15
Comments
It shouldn't be that hard to record metadata: just look for where the MD5 hash is updated and stick another hook in there.
James Taylor suggests using PROV: https://twitter.com/jxtx/status/916406694674132992
I am thinking of making a start on this very soon, using PROV-O as the vocabulary. This is also used by projects like wf4ever. The basic model has 3 classes: entity, agent and activity. I think the primary agent would be biomake itself, with an acted-on-behalf-of edge to the person executing the workflow. The entity would be the file, and the activity would be the makefile recipe/rule. The primary output would be RDF/Turtle, but we could also have JSON (as well as a native Prolog representation). Having some kind of dot/Graphviz export should also be simple.
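To make the proposed mapping concrete, here is a minimal sketch in Python of what the Turtle output for a single rule execution might look like. It is not biomake code: the function name `rule_to_turtle`, the base IRI `http://example.org/biomake-run/`, and the IRI layout are all assumptions made up for illustration. Only the `prov:` terms come from the PROV-O vocabulary.

```python
from urllib.parse import quote

PROV_NS = "http://www.w3.org/ns/prov#"
# Hypothetical base IRI; a real plugin would need its own naming scheme.
BASE = "http://example.org/biomake-run/"


def iri(kind, name):
    """Build a full IRI in angle brackets, percent-encoding unsafe characters."""
    return f"<{BASE}{kind}/{quote(name, safe='/')}>"


def rule_to_turtle(target, deps, rule_name, user):
    """Describe one rule execution as Turtle, following the mapping above:
    file -> prov:Entity, rule -> prov:Activity, biomake -> prov:Agent."""
    agent = iri("agent", "biomake")
    person = iri("agent", user)
    activity = iri("activity", rule_name)
    out = iri("file", target)

    lines = [f"@prefix prov: <{PROV_NS}> .", ""]
    # biomake is the primary agent, acting on behalf of the person running it.
    lines.append(f"{agent} a prov:SoftwareAgent ; prov:actedOnBehalfOf {person} .")
    lines.append(f"{person} a prov:Person .")
    lines.append(f"{activity} a prov:Activity ; prov:wasAssociatedWith {agent} .")
    lines.append(f"{out} a prov:Entity ; prov:wasGeneratedBy {activity} .")
    for dep in deps:
        dep_iri = iri("file", dep)
        lines.append(f"{dep_iri} a prov:Entity .")
        lines.append(f"{activity} prov:used {dep_iri} .")
        lines.append(f"{out} prov:wasDerivedFrom {dep_iri} .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(rule_to_turtle("out.txt", ["in.txt"], "out.txt-rule", "alice"))
```

Writing full IRIs rather than prefixed names sidesteps Turtle's restrictions on characters such as `/` in local names, which matters because file paths are used directly as identifiers here.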
@cmungall I like this, especially how clean the mapping to PROV-O is: I think most/all of those things in the diagram are already being calculated at some point in biomake.
Default interceptor is a persistent store that logs actions as unit clauses. This could be extended to provide a complete workflow record, as specified in #15
Reproducibility and provenance are increasingly important.
Makefiles and Makefile-like solutions such as biomake help with reproducibility; if the recipe and input files are provided in a GitHub repo then in theory it is easy to re-execute the workflow and hopefully get the same answer.
However, if the final output files are submitted to a data repository, the provenance may not be immediately obvious. Initiatives such as BD2K are emphasizing the importance of metadata on all digital objects, which includes analysis results. Of course it is possible to manually annotate these artefacts, but why do that when this can be automated?
For any file derived by biomake, it should be possible to immediately see a graph of the objects used to derive it, together with complete metadata on each; this includes standard filesystem metadata (e.g. timestamp) as well as additional metadata. See also https://github.com/W3C-HCLSIG/HCLSDatasetDescriptions
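As a rough illustration of the lookup described above, the following Python sketch walks a dependency map to collect everything a target was derived from, and gathers basic filesystem metadata for a path. The `derived_from` dictionary is a hypothetical stand-in for whatever record biomake would actually keep, and the function names are invented for this example.

```python
import os
from datetime import datetime, timezone


def ancestors(target, derived_from):
    """Return every file that `target` transitively depends on.

    `derived_from` maps a file name to the list of files it was built from
    (a hypothetical structure, not something biomake currently exports).
    """
    seen = []
    stack = list(derived_from.get(target, []))
    while stack:
        name = stack.pop()
        if name not in seen:  # guard against shared or repeated dependencies
            seen.append(name)
            stack.extend(derived_from.get(name, []))
    return seen


def file_metadata(path):
    """Collect standard filesystem metadata for one path."""
    st = os.stat(path)
    modified = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
    return {
        "path": path,
        "size_bytes": st.st_size,
        "modified": modified.isoformat(),
    }


if __name__ == "__main__":
    graph = {"result.tsv": ["aligned.bam"], "aligned.bam": ["reads.fastq", "ref.fa"]}
    print(ancestors("result.tsv", graph))
```

Additional metadata (checksums, the rule that produced the file, who ran it) would be attached alongside the filesystem fields in the same record.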
This may be a heavyweight feature so may be best implemented as some kind of plugin.