Closed
Description
`epi_archive`s can be formed from a conglomeration of full snapshots, issue data with duplicate re-reporting, and/or minimal patch-like issues. Some space (unsure about time) can be saved by removing rows that match the LOCF (last observation carried forward) of previous issues. Space can be essential if we are attempting in-memory analysis.
Proposal: introduce a constructor argument `compactify`:
- `TRUE`: remove unnecessary rows to give the same LOCF results. Make sure to maintain the same `max_issue` value as the original data.
- `FALSE`: leave data as-is
- default value (say, `NULL`): same as `TRUE`, except message the user if this actually changed the data, and tell them how to silence the message
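The three settings above could be sketched roughly as follows in base R. This is only an illustration of the proposed behavior, not the package's implementation: the helper name `compactify_sketch`, the key columns `geo_value` and `time_value`, and the list return value are assumptions made for the example.

```r
# Hypothetical helper illustrating the proposal; not part of any package.
# `x` is assumed to have the key columns in `key_cols`, an `issue` column,
# and one or more value columns.
compactify_sketch <- function(x, compactify = NULL,
                              key_cols = c("geo_value", "time_value")) {
  value_cols <- setdiff(names(x), c(key_cols, "issue"))
  # Sort so that, within each key, issues appear in increasing order.
  ord <- do.call(order, unname(as.list(x[c(key_cols, "issue")])))
  x <- x[ord, , drop = FALSE]
  rownames(x) <- NULL
  # Record the latest issue before dropping rows, since the rows that
  # carry it may all turn out to be redundant.
  max_issue <- max(x$issue)
  if (isFALSE(compactify)) {
    return(list(data = x, max_issue = max_issue))
  }
  n <- nrow(x)
  redundant <- rep(FALSE, n)
  if (n > 1) {
    same <- rep(TRUE, n - 1)
    for (col in c(key_cols, value_cols)) {
      a <- x[[col]][-1]
      b <- x[[col]][-n]
      # Treat two NAs as equal; NA versus non-NA counts as a change.
      same <- same & ((is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b))
    }
    # A row is redundant if it repeats the previous issue's row for its key.
    redundant[-1] <- same
  }
  if (is.null(compactify) && any(redundant)) {
    message(sum(redundant), " rows matched LOCF of earlier issues and were ",
            "removed; pass compactify = TRUE or FALSE to silence this message.")
  }
  list(data = x[!redundant, , drop = FALSE], max_issue = max_issue)
}

# Toy data: three full snapshots, the last one identical to the second.
snapshots <- data.frame(
  geo_value  = "ca",
  time_value = c(1, 1, 2, 1, 2),
  issue      = c(1, 2, 2, 3, 3),
  value      = c(10, 12, 20, 12, 20)
)
res <- compactify_sketch(snapshots, compactify = TRUE)
```

In this toy data the third snapshot adds nothing new, so the compacted rows only mention issues 1 and 2. That is why the sketch carries `max_issue` separately rather than recomputing it from the remaining rows.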
Use cases:
- User inputs full snapshot data, to prevent using space quadratic in the number of snapshots. (A further enhancement would be to directly work off of a directory of snapshot files or something similar.)
- We are working off of a covidcast data source that historically did not use diff-based issues and/or has many full re-issues. (E.g., repeating the analysis here gives covidcast jhu-csse state-level case issue data at 79% duplicates despite the shift to having routine issues being diff-based. This is still just reducing what `object.size` says is 40MB-50MB down to ~10MB, but at the county level it might matter more.)
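To get a feel for the first use case, the quadratic growth of full snapshots can be reproduced on synthetic data. This is a rough sketch under stated assumptions: one location, one snapshot per day, values that are never revised, and made-up column names. It says nothing about the real covidcast figures quoted above.

```r
n_days <- 200
# The snapshot issued on day k reports every time_value from 1 to k.
full <- do.call(rbind, lapply(seq_len(n_days), function(k) {
  data.frame(geo_value = "ca", time_value = seq_len(k), issue = k,
             value = seq_len(k) * 1.5)  # never revised in this toy data
}))
full <- full[order(full$geo_value, full$time_value, full$issue), ]
n <- nrow(full)
# A row is LOCF-redundant if it repeats the value reported by the previous
# issue for the same (geo_value, time_value).
same_key <- full$geo_value[-1] == full$geo_value[-n] &
  full$time_value[-1] == full$time_value[-n]
same_val <- full$value[-1] == full$value[-n]
redundant <- c(FALSE, same_key & same_val)
minimal <- full[!redundant, ]

nrow(full)                      # n_days * (n_days + 1) / 2 rows
nrow(minimal)                   # n_days rows
mean(redundant)                 # fraction of rows that are duplicates
as.numeric(object.size(full))   # bytes before
as.numeric(object.size(minimal))  # bytes after
```

Real data will fall somewhere between these extremes, since revisions add rows that cannot be dropped.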