Description
The current `epi_signal` implementation contains `issue` as a column, allowing it to contain the full revision history of a signal. However, functions using `epi_signal` appear to work only with the latest issue, and expect that issue to constitute a full snapshot of the data as of that issue rather than, e.g., just the new & updated values. It makes sense to separate the concept of such a snapshot from the full revision history, giving these concepts different names, classes, functions, etc., and requiring the user to explicitly convert between them (for conceptual clarity and code readability).
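To make the distinction concrete, here is a minimal sketch using `data.table`. The table `history` and its column names (`geo_value`, `time_value`, `issue`, `value`) are assumptions made for illustration, not the actual `epi_signal` layout. It shows that filtering to the latest issue is not the same as taking a snapshot when issues record only new & updated values.

```r
library(data.table)

# Hypothetical revision history in which each issue lists only new or updated rows.
history <- data.table(
  geo_value  = c("ca", "tx", "ca", "ca"),
  time_value = as.Date(c("2021-08-31", "2021-08-31", "2021-08-31", "2021-09-01")),
  issue      = as.Date(c("2021-09-01", "2021-09-01", "2021-09-02", "2021-09-02")),
  value      = c(10, 5, 11, 12)
)

# Filtering to the latest issue keeps only the rows that issue touched.
latest_rows <- history[issue == max(issue)]

# A snapshot as of the latest issue keeps the most recent version of every
# (geo_value, time_value) pair, including rows the latest issue left unchanged.
snapshot <- unique(history[order(issue)],
                   by = c("geo_value", "time_value"), fromLast = TRUE)

print(latest_rows)
print(snapshot)
```

In this toy data the `tx` row was not revised in the latest issue, so it appears in `snapshot` but not in `latest_rows`.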
- The current `epi_signal` should add a metadata field `as-of-issue` or `as-of` or something along those lines. Maybe it should be renamed to `epi_snapshot`? The `issue` column should be removed or renamed; some users might want to characterize how mature an observation is; some might do that with `as_of_issue - time_value`, and others might want to use `latest_issue_that_updated_this_row - time_value`.
- A new class `epi_archive` (or `epi_signal_archive`?) should be added. It would have a structure similar to that of the current `epi_signal` table, but might benefit from being based on a `data.table` or an `R6` class wrapping a `data.table`, and may also need or benefit from some extra metadata.
  - The primary function (after construction) would be `epi.archive$snapshot_as_of(as.of.issue)`, which would return an `epi_signal`, or raise an error if `as.of.issue` isn't in the range covered by the archive. Last observation carried forward (LOCF) could be used to fill in the gaps between issues but not after the latest issue. For the LOCF, something like one of the below (both presumably require `diff.DT` to be keyed by `c(prefixed.nonissue.key.names, "issue")`):
    - `unique(diff.DT[issue <= as.of.issue], by=prefixed.nonissue.key.names, fromLast=TRUE)[status == "new.or.updated"]`
    - `diff.DT[setkeyv(set(unique(diff.DT, by=prefixed.nonissue.key.names)[, prefixed.nonissue.key.names, with=FALSE],,"issue",as.of.issue), c(prefixed.nonissue.key.names,"issue")), roll=TRUE][status == "new.or.updated"]`
  - The entries would represent a sort of git-like diff. The most memory-efficient and canonical version would store ONLY entries that were added, changed, or removed in each issue. The operations above are based on a design where the columns are `c(paste0("diff.", names(epi.signal)), "issue", "status")`, where `status` is either `"new.or.updated"` or `"removed"`. It could be made more complex to allow for more error checking or to provide extra info to users interested in revision modeling.
  - Extra metadata that may be useful:
    - a vector of issues for which the archive has recorded snapshots. This, or a `max.issue` value, is required when the archive covers all updates through, e.g., Sep 15 but the data did not change from Sep 2 to Sep 15: the archive should not complain when the user requests a Sep 14 snapshot, should complain when they request a Sep 16 snapshot, and should complain when the user tries to "add" a Sep 15 snapshot a second time by accident. It is also necessary to enable a "reproducible" option; this option would assume that the latest recorded issue could still be subject to revisions (e.g., because the data for an issue date was revised in the middle of that date, and it is either currently that date and before the change was made, or some later time after the change was made but before the user found out about it, due to pipeline delays and data fetching frequencies). It is also necessary if we want to allow a user to revise the archive by adding a new snapshot in the middle of the currently recorded issue range.
    - something equivalent to `unique(diff.DT, by=prefixed.nonissue.key.names)[, prefixed.nonissue.key.names, with=FALSE]`, to improve the speed of the second query approach above. Users would also benefit from a function that returns this data, so that they know which geos & times were ever included in a data set.
    - `class` & `typeof` information for the columns, to help check that any snapshots added later are compatible with the existing data.
  - Implementing construction and updating of the archive may be a little inconvenient, especially if users are allowed to insert new snapshots in the middle of the range of issues already recorded (the new diff needs to be computed and added, and the already-recorded diff for the following issue needs to be updated). The user would be softly forbidden from providing a column redundant with `as_of_issue` when inserting snapshots, as that would make the diffs very inefficient.
- Functions currently taking `epi_signal`s should continue to work on `epi_signal`s; instead of working on the latest "issue" by default, they will require the user to provide just a single snapshot of the data.
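The `snapshot_as_of` idea from the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: the toy `diff.DT`, the standalone function signature (rather than an `R6` method), the `key.names` argument, and the `min.issue`/`max.issue` metadata values are all hypothetical. It uses the first of the two LOCF approaches and mirrors the scenario where the archive covers issues through Sep 15 although the data last changed on Sep 2.

```r
library(data.table)

# Hypothetical key columns, prefixed as in the column design described above.
prefixed.nonissue.key.names <- c("diff.geo_value", "diff.time_value")

# Hypothetical diff table: each row records what one issue added, updated, or removed.
diff.DT <- data.table(
  diff.geo_value  = c("ca", "tx", "ca", "tx", "ca"),
  diff.time_value = as.Date(c("2021-08-31", "2021-08-31", "2021-08-31",
                              "2021-08-31", "2021-09-01")),
  diff.value      = c(10, 5, 11, NA, 12),
  issue           = as.Date(c("2021-09-01", "2021-09-01", "2021-09-02",
                              "2021-09-02", "2021-09-02")),
  status          = c("new.or.updated", "new.or.updated", "new.or.updated",
                      "removed", "new.or.updated")
)
# Sort by key columns, then issue, so the last row per key is its latest version.
setkeyv(diff.DT, c(prefixed.nonissue.key.names, "issue"))

# Assumed metadata: the archive covers every issue in [min.issue, max.issue],
# even though no data changed after 2021-09-02.
min.issue <- as.Date("2021-09-01")
max.issue <- as.Date("2021-09-15")

snapshot_as_of <- function(diff.DT, as.of.issue, key.names, min.issue, max.issue) {
  if (as.of.issue < min.issue || as.of.issue > max.issue) {
    stop("as.of.issue is outside the range of issues covered by the archive")
  }
  # LOCF: keep the latest diff entry per key at or before as.of.issue ...
  latest <- unique(diff.DT[issue <= as.of.issue], by = key.names, fromLast = TRUE)
  # ... then drop keys whose latest entry marks the row as removed.
  latest[status == "new.or.updated"]
}

# A Sep 14 request falls inside the covered range and is filled by LOCF.
snap <- snapshot_as_of(diff.DT, as.Date("2021-09-14"),
                       prefixed.nonissue.key.names, min.issue, max.issue)
print(snap)
```

Requesting `as.Date("2021-09-16")` with the same metadata stops with an error, since that issue lies beyond `max.issue`. A fuller version would presumably also strip the `diff.` prefix and the `issue`/`status` columns before returning the snapshot.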
Secondly, we should disambiguate "issue". E.g., at least for `epi_signal`, it should be called "as.of.issue" or "as.of"; the archive could still use just "issue", or maybe it should also have a more specific term.