Description
Actual Behavior:
The covidcast issue data includes rows where all entries besides `issue` and `lag` appear to be the same as in the preceding version of that row.
Reprex with commentary:
[2022-02-14: Note that this snippet's original code is no longer supported: as_of + issues cannot be used together. I've updated it with something that hopefully is equivalent, but gives 10566 update rows rather than 10465, perhaps due to other version data patches?]
library("tibble")
library("dplyr")
analysis.date = as.Date("2021-11-09")
geo.values = c("ak", "al")
## jhu.case.updates.original.code =
##   delphi.epidata::covidcast("jhu-csse", "confirmed_incidence_num",
##                             "day", "state",
##                             delphi.epidata::epirange(12340101,34560101), geo.values,
##                             as_of = format(analysis.date-1L, "%Y%m%d"), # try to make this very reproducible
##                             issues = delphi.epidata::epirange(12340101,34560101)) %>%
##   delphi.epidata::fetch_tbl()
jhu.case.updates =
  epidatr::covidcast("jhu-csse", "confirmed_incidence_num",
                     "state", "day",
                     geo.values, epidatr::epirange(12340101,34560101),
                     issues = epidatr::epirange(12340101,format(analysis.date-1L, "%Y%m%d"))) %>%
  epidatr::fetch_tbl()
## See if we have any issue data that repeats the same `value` for an
## observation as the preceding issue (or issues) by performing an RLE of the
## value across issues for each observation (geo_value x time_value)
jhu.case.updates %>%
  group_by(geo_value, time_value) %>%
  arrange(issue, .by_group=TRUE) %>%
  summarize(value.rle.tbl = {
    value.rle = rle(value)
    ## `rle` returns components named "lengths" and "values"
    tibble(length = value.rle[["lengths"]], value = value.rle[["values"]])
  }, .groups="drop") %>%
  mutate(
    value.run.length = value.rle.tbl[["length"]],
    value.run.value = value.rle.tbl[["value"]],
    value.rle.tbl = NULL
  ) %>%
  arrange(-value.run.length) %>%
  print()
## We see that there are updates that don't update the value, but maybe
## something else (besides `issue`&`lag`) could have been updated. Let's check
## on the other columns.
## - Examine the observation with the longest run in more detail.
jhu.case.updates %>%
  filter(geo_value=="ak", time_value==as.Date("2020-04-23")) %>%
  arrange(issue) %>%
  select(issue, lag, value, missing_value, stderr, missing_stderr, sample_size, missing_sample_size) %>%
  count(across(-c(issue, lag))) %>%
  print(n=100L)
## There are two versions of the observation factoring in these additional
## columns. However, since one of these only appears 1x and the other 56x,
## there must be at least 54 instances of re-reporting the same row (with
## different `issue`&`lag`)
## Still, to avoid the complications above regarding the other columns and the
## row ordering, let's try to directly detect re-reporting of "entire
## observations":
rereporting =
  jhu.case.updates %>%
  select(-lag) %>%
  group_by(geo_value, time_value) %>%
  arrange(issue, .by_group=TRUE) %>%
  ## "lag the row" by `lead`ing the `issue`
  mutate(issue = lead(issue)) %>% filter(!is.na(issue)) %>% ungroup() %>%
  ## find the re-reporting:
  inner_join(., jhu.case.updates %>% select(-lag), by=names(.))
rereporting %>%
  nrow()
## 8212 re-reported rows in this extract
jhu.case.updates %>%
  nrow()
## 10465 total rows in this extract
## So, at least for ak&al cases, the amount of re-reporting appears substantial.
## If duplicate rows continue to be steadily re-reported, we will expect "very"
## quadratic growth in the size of the data set; with sparser re-reporting and
## sparse revisions, we would expect "linear-ish" growth.
rereporting %>%
  count(issue) %>%
  arrange(issue) %>%
  print(n=100L)
## The last issue re-reporting an observation appears to be 2020-10-30 for these
## states, so it looks like there isn't the "very" quadratic growth, at least
## for the current signal, geo_type, and geo_values.
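As a rough illustration of the "very" quadratic versus "linear-ish" growth mentioned in the comments above, here is a minimal sketch for a single location. It assumes, purely for illustration, exactly one new `time_value` and one issue per day; the function names and the `versions.per.obs` parameter are hypothetical and not part of any package.

```r
## Hypothetical worst case: every issue re-reports every time_value seen so
## far, so issue day d contributes d rows.
rows.if.always.rereported = function(n.days) {
  sum(seq_len(n.days)) # equals n.days * (n.days + 1) / 2
}

## Hypothetical sparse case: each observation appears in only a small, fixed
## number of issues (initial report plus occasional revisions).
rows.if.rarely.revised = function(n.days, versions.per.obs = 2) {
  n.days * versions.per.obs
}

rows.if.always.rereported(365) # 66795 rows after one year
rows.if.rarely.revised(365)    # 730 rows after one year
```

Under these assumptions, steady re-reporting gives roughly n^2/2 rows after n days, while sparse revisions give a small multiple of n.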
Expected behavior
I soft-expected an issue-query to return only data with changes to entries (other than to `issue` itself and `lag`), for two reasons:
- `?covidcast::covidcast_signal` describes issue-queries as "[f]etch[ing] only data that was published or updated ("issued") on these [issues]", which might leave this impression.
- storage efficiency-wise, this would make sense; query efficiency-wise, I'm not sure but I'd guess that it'd probably help, especially if there is an index over `source,signal,geo_value,time_value,issue`.
Context
I am working on cmu-delphi/epiprocess#23 to build some utilities for working with data version history from delphi-epidata or elsewhere. A natural example was to use version history from something in delphi-epidata using an issue-query, which unearthed this surprise.
The re-reporting isn't really a problem for me right now. The utilities I am working on should be built to accept "update" or snapshot (as-of-query) data that contains duplicate data, and eventually to compact the data to remove these duplicates. Removing re-reporting in covidcast issue data wouldn't ensure that users wouldn't input such re-reported data from other data providers.
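To make the compaction step concrete, here is a minimal sketch of one way it could work on toy data. It is not the epiprocess implementation: `compact.updates`, `is.same`, and the toy tibble are hypothetical names introduced for illustration, and the sketch assumes that observations are keyed by `geo_value` and `time_value` and that `issue` (and `lag`, if present) are the only version columns.

```r
library("dplyr")

## Toy update data: one observation reported in four issues, with the value
## changing only in the third issue.
toy.updates = tibble::tibble(
  geo_value = "ak",
  time_value = as.Date("2020-04-23"),
  issue = as.Date("2020-04-24") + 0:3,
  value = c(5, 5, 7, 7)
)

## NA-aware elementwise equality: two NAs count as "the same".
is.same = function(x, y) {
  (is.na(x) & is.na(y)) | (!is.na(x) & !is.na(y) & x == y)
}

## Hypothetical helper: keep a row only if it is the first issue of its
## observation or differs from the immediately preceding issue in at least one
## non-key, non-version column.
compact.updates = function(updates, version.cols = c("issue", "lag")) {
  key.cols = c("geo_value", "time_value")
  measurement.cols = setdiff(names(updates), c(key.cols, version.cols))
  updates %>%
    group_by(across(all_of(key.cols))) %>%
    arrange(issue, .by_group = TRUE) %>%
    filter(
      row_number() == 1L |
        if_any(all_of(measurement.cols), ~ !is.same(.x, dplyr::lag(.x)))
    ) %>%
    ungroup()
}

toy.compacted = compact.updates(toy.updates)
print(toy.compacted)
```

Comparing each row only with the immediately preceding issue means that a value which changes and later reverts is still kept as a genuine update rather than being treated as a duplicate.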