Distinguish snapshot of signal vs. archive of signal, and as-of vs. issue #8

The current `epi_signal` implementation includes `issue` as a column, so it can hold the full revision history of a signal. However, functions that use `epi_signal` appear to work only with the latest issue, and expect that issue to constitute a full snapshot of the data as of that issue rather than, e.g., just the new & updated values.  It makes sense to separate the concept of such a snapshot from the full revision history, giving these concepts different names, classes, functions, etc., and requiring the user to explicitly convert between them (for conceptual clarity and code readability).
- The current `epi_signal` should add a metadata field named `as-of-issue`, `as-of`, or something along those lines.  Maybe the class should be renamed to `epi_snapshot`?  The `issue` column should be removed or renamed.  Some users might want to characterize how mature an observation is: some might do that with `as_of_issue - time_value`, and others might want to use `latest_issue_that_updated_this_row - time_value`.
- A new class `epi_archive` (or `epi_signal_archive`?) should be added.  It would have a structure similar to the structure of the current `epi_signal` table, but might benefit from being based on a `data.table` or `R6` class wrapping a `data.table`, and may also need or benefit from some extra metadata.
  - The primary function (after construction) would be `epi.archive$snapshot_as_of(as.of.issue)`, which would return an `epi_signal`, or raise an error if `as.of.issue` isn't in the range covered by the archive.  Last observation carried forward (LOCF) could be used to fill in the gaps between issues, but not after the latest issue.  For the LOCF, something like one of the below (the first assumes the rows of `diff.DT` are ordered by `issue` within each key):
    - `unique(diff.DT[issue <= as.of.issue], by=prefixed.nonissue.key.names, fromLast=TRUE)[status == "new.or.updated"]`
    - `diff.DT[set(unique(diff.DT, by=prefixed.nonissue.key.names)[, prefixed.nonissue.key.names, with=FALSE], j="issue", value=as.of.issue), on=c(prefixed.nonissue.key.names, "issue"), roll=TRUE][status == "new.or.updated"]`
  - The entries would represent a sort of git-like diff.  The most memory-efficient and canonical version would store ONLY entries that were added, changed, or removed in each issue.  The operations above are based on a design where the columns are `c(paste0("diff.", names(epi.signal)), "issue", "status")`, where `status` is either `"new.or.updated"` or `"removed"`.  It could be made more complex to allow for more error checking or to provide extra info to users interested in revision modeling.
  - Extra metadata that may be useful:
    - a vector of issues for which the archive has recorded snapshots.  This, or at least a `max.issue` value, is required to handle cases such as an archive that covers all updates through Sep 15 when the data did not change from Sep 2 to Sep 15: a request for the Sep 14 snapshot should succeed, a request for the Sep 16 snapshot should raise an error, and an accidental attempt to "add" the Sep 15 snapshot a second time should also raise an error.  It is also necessary to enable a "reproducible" option.  This option would assume that the latest recorded issue could still be subject to revisions, e.g., because the data for an issue date was revised in the middle of that date, and the snapshot was fetched either on that date before the change was made, or at any later time after the change was made but before the user found out about it (due to pipeline delays and data fetching frequencies).  Finally, it is necessary if we want to allow a user to revise the archive by adding a new snapshot in the middle of the currently recorded issue range.
    - something equivalent to `unique(diff.DT, by=prefixed.nonissue.key.names)[, prefixed.nonissue.key.names, with=FALSE]`, to speed up the second query approach above.  Users would also benefit from a function that returns this data, so that they know which geos & times were ever included in a data set.
    - class & typeof information for the columns, to help check that any snapshots added later are compatible with the existing data.
  - Implementing construction and updating of the archive may be a little inconvenient, especially if users are allowed to insert new snapshots in the middle of the range of issues already recorded: the new diff needs to be computed and added, and the already-recorded diff for the following issue needs to be updated.  The user would be discouraged (but not strictly prevented) from providing a column redundant with `as_of_issue` when inserting snapshots, as that would make the diffs very inefficient, since every row would appear to change in every issue.
- Functions currently taking `epi_signal`s should continue to work on `epi_signal`s; instead of working on the latest "issue" by default, they will require the user to provide just a single snapshot of the data.
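
To make the `snapshot_as_of` idea above concrete, here is a minimal sketch of the first LOCF approach, including the range check against archive metadata. The function signature, the `min_issue`/`max_issue` arguments, the `as_of` attribute, and the `geo_value`/`value` column names are all assumptions made for illustration, not the existing `epi_signal` API; the diff columns are also left unprefixed for brevity.

```r
library(data.table)

# Hypothetical diff table: one row per key touched in a given issue.
diff_dt <- data.table(
  geo_value  = c("ca", "ca", "tx", "ca", "tx"),
  time_value = as.Date(c("2020-09-01", "2020-09-02", "2020-09-01",
                         "2020-09-01", "2020-09-01")),
  value      = c(10, 12, 5, 11, NA),
  issue      = as.Date(c("2020-09-02", "2020-09-03", "2020-09-02",
                         "2020-09-04", "2020-09-05")),
  status     = c("new.or.updated", "new.or.updated", "new.or.updated",
                 "new.or.updated", "removed")
)
key_names <- c("geo_value", "time_value")
# Assumed archive metadata: the range of issues the archive covers. The last
# diff is from Sep 5, but the archive is known to be complete through Sep 6.
min_issue <- as.Date("2020-09-02")
max_issue <- as.Date("2020-09-06")

snapshot_as_of <- function(diff_dt, as_of_issue, key_names, min_issue, max_issue) {
  if (as_of_issue < min_issue || as_of_issue > max_issue) {
    stop("as_of_issue is outside the range of issues covered by the archive")
  }
  # Keep only diffs recorded on or before the requested issue (this subset is
  # a copy, so sorting it does not modify the archive itself).
  eligible <- diff_dt[issue <= as_of_issue]
  setorderv(eligible, c(key_names, "issue"))
  # LOCF: the last diff row per key is the state of that key as of the issue.
  latest <- unique(eligible, by = key_names, fromLast = TRUE)
  # Keys whose most recent diff is a removal are absent from the snapshot.
  snap <- latest[status == "new.or.updated"]
  snap[, c("issue", "status") := NULL]
  setattr(snap, "as_of", as_of_issue)
  snap
}

snap_sep3 <- snapshot_as_of(diff_dt, as.Date("2020-09-03"), key_names, min_issue, max_issue)
snap_sep6 <- snapshot_as_of(diff_dt, as.Date("2020-09-06"), key_names, min_issue, max_issue)
```

In this toy data, the Sep 6 request succeeds even though no diff was recorded on that date, because it falls within the covered range; a Sep 7 request would raise an error.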

Secondly, we should disambiguate "issue".  E.g., at least for `epi_signal`, it should be called "as.of.issue" or "as.of"; the archive could still use just "issue" or maybe it should also have a more specific term.
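
For the archive-update step discussed in the list above, the diff for a newly appended issue could be computed roughly as follows. This is only a sketch under assumptions: `compute_issue_diff` is a hypothetical helper, both snapshots are assumed to share the same columns in the same order with one row per key, and removed rows simply carry their last known values. Inserting a snapshot in the middle of the issue range would additionally require recomputing the diff for the following issue, which is not shown.

```r
library(data.table)

# Hypothetical helper: diff between the archive's latest state (`prev`) and a
# newly fetched snapshot (`curr`), labelled with the new issue.
compute_issue_diff <- function(prev, curr, key_names, new_issue) {
  # Keys that existed before but are absent from the new snapshot.
  removed <- prev[!curr, on = key_names]
  # Rows of `curr` with no identical row in `prev`: new keys or changed values.
  changed <- fsetdiff(curr, prev)
  # Stack both parts; the list names become the `status` column.
  out <- rbindlist(
    list(new.or.updated = changed, removed = removed),
    idcol = "status"
  )
  out[, issue := new_issue]
  out
}

prev <- data.table(
  geo_value  = c("ca", "ca", "tx"),
  time_value = as.Date(c("2020-09-01", "2020-09-02", "2020-09-01")),
  value      = c(10, 12, 5)
)
curr <- data.table(
  geo_value  = c("ca", "ca", "ny"),
  time_value = as.Date(c("2020-09-01", "2020-09-02", "2020-09-01")),
  value      = c(11, 12, 3)
)
issue_diff <- compute_issue_diff(
  prev, curr, c("geo_value", "time_value"), as.Date("2020-09-04")
)
```

Only the revised `ca` row, the new `ny` row, and the removed `tx` row end up in the diff; the unchanged `ca` row for Sep 2 is not stored, which is what keeps the archive compact.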
