
Add option to compactify issue data in epi_archive #62

@brookslogan

Description


An epi_archive can be built from any mix of full snapshots, issue data with duplicate re-reporting, and/or minimal patch-like issues. Some space (unsure about time) can be saved by removing rows whose values match the last observation carried forward (LOCF) from previous issues. Saving space can be essential if we are attempting in-memory analysis.

Proposal: introduce a constructor argument compactify:

  • TRUE: remove rows that are redundant under LOCF, so that the archive still gives the same LOCF results. Make sure to maintain the same max_issue value as the original data.
  • FALSE: leave data as-is
  • default value (say, NULL): same as TRUE, except message the user if this actually changed the data and tell them how to silence the message
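
To make the three settings concrete, here is a rough sketch in Python rather than R, using only the standard library. The names `Row`, `drop_locf_redundant`, and `build_archive` are hypothetical and are not part of epiprocess; the sketch assumes one value column and keys of (geo_value, time_value), and it uses a warning in place of an R message.

```python
import warnings
from typing import NamedTuple, Optional


class Row(NamedTuple):
    geo_value: str
    time_value: int  # index of the observed day
    version: int     # issue in which this value was reported
    value: float


def drop_locf_redundant(rows):
    """Drop rows whose value equals the previous version's value for the same key."""
    kept = []
    last_value = {}  # (geo_value, time_value) -> value of the latest earlier version
    for row in sorted(rows, key=lambda r: (r.geo_value, r.time_value, r.version)):
        key = (row.geo_value, row.time_value)
        if key not in last_value or last_value[key] != row.value:
            kept.append(row)
            last_value[key] = row.value
    return kept


def build_archive(rows, compactify: Optional[bool] = None):
    """Hypothetical constructor; returns (rows, max_version)."""
    rows = list(rows)
    # Record the latest version before dropping anything, so it survives
    # even when every row from that version turns out to be redundant.
    max_version = max((r.version for r in rows), default=None)
    if compactify is False:
        return rows, max_version
    compact = drop_locf_redundant(rows)
    if compactify is None and len(compact) < len(rows):
        warnings.warn(
            f"Removed {len(rows) - len(compact)} LOCF-redundant rows; "
            "pass compactify=True or compactify=False to silence this warning."
        )
    return compact, max_version
```

Storing `max_version` separately from the rows is one way to meet the "maintain the same max_issue" requirement: if the latest issue only re-reported old values, all of its rows are dropped, yet the archive should still report that it is current through that issue.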

Use cases:

  • User inputs full snapshot data; compactifying prevents using space quadratic in the number of snapshots. (A further enhancement would be to directly work off of a directory of snapshot files or something similar.)
  • We are working off of a covidcast data source that historically did not use diff-based issues and/or has many full re-issues. (E.g., repeating the analysis here gives covidcast jhu-csse state-level case issue data at 79% duplicates despite the shift to routine issues being diff-based. This only reduces what object.size reports as 40MB--50MB down to ~10MB, but at the county level it might matter more.)
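
The quadratic growth in the first use case can be illustrated with a toy example, again in Python with the standard library only. This is a sketch on made-up data, assuming daily snapshots that each re-report every day so far and never revise a value; `snapshot_rows`, `compact`, and `as_of` are hypothetical helper names, and rows are plain (time_value, version, value) tuples.

```python
def snapshot_rows(n_days):
    """Full snapshots: the snapshot issued on day v re-reports days 1..v."""
    return [(t, v, 100 + t) for v in range(1, n_days + 1) for t in range(1, v + 1)]


def compact(rows):
    """Keep a row only if its value differs from the previous version's value."""
    kept, last = [], {}
    for t, v, value in sorted(rows):  # sorts by time_value, then version
        if t not in last or last[t] != value:
            kept.append((t, v, value))
            last[t] = value
    return kept


def as_of(rows, version):
    """LOCF lookup: for each time_value, the latest value issued at or before `version`."""
    latest = {}
    for t, v, value in sorted(rows, key=lambda r: r[1]):
        if v <= version:
            latest[t] = value
    return latest


full = snapshot_rows(10)
small = compact(full)
print(len(full), len(small))  # 55 rows versus 10
print(all(as_of(full, v) == as_of(small, v) for v in range(1, 11)))  # True
```

With n snapshots the uncompacted data holds n(n+1)/2 rows, while the compacted data holds n in this no-revision case, and every as-of query returns the same result from either form.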
