Lightweight data provenance tracking. More comprehensive/involved approaches include recordr, rProvenance, and RDataTracker.
Currently this works as follows:
- Each object to be tracked has an attribute "provenance" attached to it once via
createProvenance
. This attribute essentially holds the provenance lookup name, and is a combination of the object's name and date/time. - The lookup name is used to pull out an element from a
.provenance_list
list, currently stored by default in the global environment. For example if objectx
gets assigned a "provenance" attribute of "x_12345", then.provenance_list[["x_12345"]]
holds the provenance record. - This record is a data frame with timestamp, caller, message, and digest (MD5) of
x
at that point in time.updateProvenance
only changes this, not the original objectx
, so avoiding (I hope) expensive copies. - The disadantage of this approach is that if
x
is deleted, its record just hangs around; and ify <- x
then y and x will share the same record. Ugh, hadn't thought about this last case. - Finally, note that this is opt-in provenance tracking. That is, we have to update the record for
x
manually each time a change is made. This is by design (lightweight...) but worth considering.
> library(dataprov)
> x <- 1
> x <- createProvenance(x)
Creating provenance list
> updateProvenance(x, "created x and its provenance record")
> x <- x + 1
> updateProvenance(x, "added 1")
> provenance(x)
timestamp caller message digest
1 2015-08-21 14:26:14 <NA> created x and its provenanc... 5fd7a492
2 2015-08-21 14:26:27 <NA> added 1 d934fb15
> listProvenances()
Object Creation Entries
1 x 1440181552.45085 2
Data provenance broadly refers to a description of the origins of a piece of data and the process by which it arrived in a database or analysis (see e.g. http://db.cis.upenn.edu/DL/fsttcs.pdf). This package implements a simple system for tracking operations performed on any R object. It's not automatic though--you have to call updateProvenance()
each time you want to put a new entry into the provenance.
The provenance is stored as an object attribute (currently a data frame, but don't depend on this--use the accessor functions below!). Available functions include:
Public
createProvenance(x, env = .GlobalEnv)
- attach a "provenance" attribute tox
, create the provenance_list if necessary, and create an empty provenance record. Necessary before any of the functions below can be called.updateProvenance(x, message, caller = NULL, env = .GlobalEnv)
- add a provenance entryprovenance(x, n = NULL, env = .GlobalEnv)
- gets the entire data frame, with attribute "dataprov". Ifn
is supplied, return a list corresponding to the nth entryprint.dataprov()
- summarize a data provenance - abbreviating all fields (digest, message, etc) for easy readinglistProvenances(env = .GlobalEnv)
- summary list of current provenances records
Private
- dataprov() - object creation
replaceProvenance(x, p)
- replace entire provenance ofx
withp
???
mergeProvenance(a, b)
- merge provenance of b into that of a. Ugh. Useful?- Could offer an option to save
x
each time the provenance is updated? Not a version 1.0 feature I don't think provenance(x) <- p
- replaces entire provenance ofx
withp
(must be data.frame with correct fields)