Distinguish snapshot of signal vs. archive of signal, and as-of vs. issue #8

@brookslogan

The current epi_signal implementation includes issue as a column, which allows it to hold the full revision history of a signal. However, functions that use epi_signal appear to work only with the latest issue, and expect that issue to constitute a full snapshot of the data as of that issue rather than, e.g., just the new and updated values. It makes sense to separate the concept of such a snapshot from the full revision history, giving the two concepts different names, classes, functions, etc., and requiring the user to convert between them explicitly (for conceptual clarity and code readability).

  • The current epi_signal should gain a metadata field named as-of-issue, as-of, or something along those lines. Maybe it should be renamed to epi_snapshot? The issue column should be removed or renamed. Some users might want to characterize how mature an observation is; some might do that with as_of_issue - time_value, and others might want to use latest_issue_that_updated_this_row - time_value.
  • A new class epi_archive (or epi_signal_archive?) should be added. It would have a structure similar to the structure of the current epi_signal table, but might benefit from being based on a data.table or R6 class wrapping a data.table, and may also need or benefit from some extra metadata.
    • The primary function (after construction) would be epi.archive$snapshot_as_of(as.of.issue), which would return an epi_signal, or raise an error if as.of.issue isn't in the range covered by the archive. LOCF could be used to fill in the gaps between issues, but not after the latest issue. For the LOCF, something like one of the below:
      • unique(diff.DT[issue <= as.of.issue], by=prefixed.nonissue.key.names, fromLast=TRUE)[status == "new.or.updated"]
      • diff.DT[setkeyv(set(unique(diff.DT, by=prefixed.nonissue.key.names)[,prefixed.nonissue.key.names,with=FALSE],,"issue",as.of.issue), c(prefixed.nonissue.key.names,"issue")), roll=TRUE][status == "new.or.updated"]
    • The entries would represent a sort of git-like diff. The most memory-efficient and canonical version would store ONLY entries that were added, changed, or removed in each issue. The operations above are based on a design where the columns are c(paste0("diff.", names(epi.signal)), "issue", "status"), where status is either "new.or.updated" or "removed". It could be made more complex to allow for more error checking or to provide extra info to users interested in revision modeling.
    • Extra metadata that may be useful:
      • a vector of issues for which the archive has recorded snapshots. This, or at least a max.issue value, is required to handle the case where the archive covers all updates through, e.g., Sep 15, but the data did not change from Sep 2 to Sep 15: a request for the Sep 14 snapshot should succeed, a request for the Sep 16 snapshot should fail, and an accidental attempt to "add" a Sep 15 snapshot a second time should also fail. It is also needed to enable a "reproducible" option, which would assume that the latest recorded issue could still be subject to revisions (e.g., because the data for an issue date was revised partway through that date, and it is currently either that date, before the change was made, or some later time after the change was made but before the user has found out about it, due to pipeline delays and data fetching frequencies). Finally, it is needed if we want to allow a user to revise the archive by adding a new snapshot in the middle of the currently recorded issue range.
      • something equivalent to unique(diff.DT, by=prefixed.nonissue.key.names)[,prefixed.nonissue.key.names,with=FALSE], to improve the speed of the second query approach above. Users would also benefit from a function that returns this data, so that they know which geos and times were ever included in a data set.
      • class & typeof information for the columns, to help check that any snapshots added later are compatible with the existing data.
    • Implementing construction and updating of the archive may be a little inconvenient, especially if users are allowed to insert new snapshots in the middle of the range of issues already recorded: the new diff needs to be computed and added, and the already-recorded diff for the following issue needs to be updated. The user would be softly forbidden from providing a column redundant with as_of_issue when inserting snapshots, as that would make the diffs very inefficient.
  • Functions currently taking epi_signals should continue to work on epi_signals; instead of working on the latest "issue" by default, they will require the user to provide just a single snapshot of the data.
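
To make the two LOCF queries above concrete, here is a minimal sketch on a toy diff table. It assumes the layout described in the list (columns prefixed with "diff.", plus issue and status) and a recorded.issues vector as the extra metadata. The function names snapshot_as_of_locf, snapshot_as_of_roll, and check_issue_in_range, the column names diff.geo_value, diff.time_value, and diff.value, and the toy data are all hypothetical and chosen only for illustration; this is not an existing API.

```r
library(data.table)

# Toy archive: one row per (geo, time) key per issue in which that key was
# added, changed, or removed. All values here are made up.
diff.DT <- data.table(
  diff.geo_value  = c("ca", "ca", "tx", "ca", "tx"),
  diff.time_value = as.Date(c("2020-09-01", "2020-09-01", "2020-09-01",
                              "2020-09-02", "2020-09-01")),
  diff.value      = c(10, 12, 5, 20, NA),
  issue           = as.Date(c("2020-09-01", "2020-09-02", "2020-09-01",
                              "2020-09-02", "2020-09-03")),
  status          = c("new.or.updated", "new.or.updated", "new.or.updated",
                      "new.or.updated", "removed")
)
prefixed.nonissue.key.names <- c("diff.geo_value", "diff.time_value")
# Keying sorts rows by key and then by issue, which both queries rely on.
setkeyv(diff.DT, c(prefixed.nonissue.key.names, "issue"))

# Assumed metadata: issues with a recorded snapshot (Sep 1 through Sep 15),
# even though the data last changed on Sep 3.
recorded.issues <- as.Date("2020-09-01") + 0:14

check_issue_in_range <- function(recorded.issues, as.of.issue) {
  if (as.of.issue < min(recorded.issues) || as.of.issue > max(recorded.issues)) {
    stop("as.of.issue is outside the range of issues covered by the archive")
  }
}

# Approach 1: keep the last diff row per key at or before as.of.issue,
# then drop keys whose most recent change was a removal.
snapshot_as_of_locf <- function(diff.DT, key.names, recorded.issues, as.of.issue) {
  check_issue_in_range(recorded.issues, as.of.issue)
  latest <- unique(diff.DT[issue <= as.of.issue], by = key.names, fromLast = TRUE)
  latest[status == "new.or.updated"]
}

# Approach 2: rolling join. Build one query row per key with
# issue = as.of.issue and roll the last earlier diff row forward.
# Keys with no row at or before as.of.issue get NA status and are dropped.
snapshot_as_of_roll <- function(diff.DT, key.names, recorded.issues, as.of.issue) {
  check_issue_in_range(recorded.issues, as.of.issue)
  query <- unique(diff.DT, by = key.names)[, key.names, with = FALSE]
  query[, issue := as.of.issue]
  setkeyv(query, c(key.names, "issue"))
  diff.DT[query, roll = TRUE][status == "new.or.updated"]
}

# Sep 14 is inside the recorded range although nothing changed after Sep 3.
snap <- snapshot_as_of_locf(diff.DT, prefixed.nonissue.key.names,
                            recorded.issues, as.Date("2020-09-14"))
```

One difference between the two sketches: in the first, the issue column of the result holds the issue that last updated each row, while in the second it holds as.of.issue itself. That is related to the question above of how users would want to measure the maturity of an observation.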

Secondly, we should disambiguate "issue". E.g., at least for epi_signal, it should be called "as.of.issue" or "as.of"; the archive could still use just "issue" or maybe it should also have a more specific term.
